Introduction
On top of the performance characteristics desired by most software, such as execution speed, any real-time audio software needs to have predictable performance. This has been covered in a previous post, and in this classic article (and lots of others). It applies to native software and software written using web technologies alike.
The increase in availability and performance of Web Assembly, and the somewhat recent return of SharedArrayBuffer, is exciting news for audio programmers. It is especially interesting for developers used to the usual programming techniques employed on native platforms, such as SIMD and lock-free concurrency.
After exposing a number of problems related to communicating to and from a real-time audio thread, the possible alternatives, and the reasons why they are not acceptable, this post presents a small JavaScript library (about 1.3kB gzipped), with no dependencies and a permissive license, that aims at solving the problem.
It allows developers to easily communicate with an AudioWorkletGlobalScope from the browser main thread or a Web Worker thread, in a way that is real-time safe, performant, and ergonomic.
Communicating with a real-time thread, in native
In native real-time audio programming, a common design is to have a single real-time audio thread. It is set to have a high priority, often using a different scheduling class. In addition, real-time audio programs run other threads, such as a main thread to render the UI, threads to perform input/output operations on disk or using the network, and other threads to offload bits of real-time processing, such as computing expensive convolutions.
All those support threads need to communicate with each other, but more importantly with the real-time audio thread. In turn, the real-time audio thread frequently wants to send data back to other threads, for example to update visuals, record audio to a file, or any other task that doesn’t have to or can’t be on the real-time audio thread.
Since we can’t have the real-time audio thread wait on any other thread, or otherwise do anything potentially unbounded in time, the following constructs are usually not recommended (this list is by no means complete):
- using operating system locks, such as pthread_mutex_t, CRITICAL_SECTION or SRWLOCK, to protect data accessed by the real-time audio thread, and other blocking concurrency mechanisms (reading on a pipe, semaphores, etc.)
- waiting on a condition variable (or even signaling one in some cases)
- performing any kind of IO, which involves quite a lot of system calls
- performing operations that can cause garbage collection pauses, even though garbage collectors are really fast these days
- dynamically allocating memory using the system allocator (this uses system calls)
- copying a large amount of memory
- using spin locks, either provided by the OS or custom (this one is not a hard rule and they are sometimes useful)
It’s no surprise that real-time audio programmers very often use atomic memory access facilities, and rely on tried and true patterns to implement complex but unavoidable communication schemes between the various threads of an application.
Communicating with a real-time thread, on the Web
In a Web Browser, things are a little different, but largely similar when squinting a bit.
The equivalent of the real-time audio thread on the Web is the AudioWorkletGlobalScope, inside which reside one or more AudioWorkletProcessors, with their methods called on a thread made real-time.
The browser main thread often handles the UI (although OffscreenCanvas exists and is an appealing solution). There are multiple ways to do network and disk IO, implemented by the browser, communicating the progress of these operations asynchronously with either Web Workers or the browser main thread.
Regular OS threads can be spawned using Web Workers. There is no way to change the priority of the underlying thread of a Worker at the moment.
| Native | Web |
| --- | --- |
| Real-time audio thread (callback) | AudioWorkletProcessor methods |
| Main application thread | Browser main thread |
| IO threads (sync/async/etc.) | Asynchronous fetch, IndexedDB, etc. |
| Regular OS thread | Web Worker |
| High or low-priority OS thread | No equivalent |
In terms of concurrency mechanisms, it’s a lot simpler than native. Almost everything is synchronized via the event loop of the various threads, workers and worklets. Communication happens using message passing, via the postMessage(...) method. It’s possible to copy the arguments passed to postMessage(...), so that they are still accessible at the call site, but it’s also possible to transfer ownership of the object, and this is essential when moving large pieces of memory around. In native lingo, this is like sending a pointer to a large buffer to another thread.
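For instance, transferring an ArrayBuffer instead of copying it only requires listing it in the transfer list. A minimal sketch, where worker.js is a hypothetical file name:

// Hypothetical worker; "worker.js" is a placeholder name.
const worker = new Worker("worker.js");
const buffer = new Float32Array(44100).buffer; // about 172 KiB of samples
// Structured clone: the buffer is copied and stays usable on this side.
worker.postMessage(buffer);
// Transfer: ownership moves to the worker, and `buffer` is detached here.
worker.postMessage(buffer, [buffer]);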
There is exactly one object that allows multiple threads to access the same piece of memory concurrently: the SharedArrayBuffer. From a distance, it looks like a regular ArrayBuffer, but the key difference is that when passed to postMessage(...), it is still available at the call site, in addition to being available at the destination. It is to be used in tandem with the Atomics object, which has various static methods to perform the usual store, load, exchange, arithmetic and compare-and-exchange operations.
Additionally, it provides a wait and a notify method, although wait is not available in the AudioWorkletGlobalScope, and notify is most probably unsafe to use from a real-time thread cross-platform, since it takes a lock on at least some platforms.
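As a point of reference, here is what these primitives look like when sharing a single 32-bit counter between the main thread and a Web Worker. This is a minimal sketch, unrelated to ringbuf.js itself, with a hypothetical worker.js:

// Main thread: create and publish one shared Int32.
const sab = new SharedArrayBuffer(4); // room for a single Int32
const shared = new Int32Array(sab);
Atomics.store(shared, 0, 42);
const worker = new Worker("worker.js");
worker.postMessage(sab); // shared, not copied: both sides see the same memory

// worker.js
onmessage = (e) => {
  const shared = new Int32Array(e.data);
  console.log(Atomics.load(shared, 0)); // 42, read without any lock
};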
Why is postMessage(...) not OK?
One could think that postMessage(...) is perfect for real-time audio: it’s message passing, after all. It could have been, but it’s not the case in practice.
First, postMessage(...) can be rather slow, depending on the implementation, but this is fixable by browser vendors. Structured cloning is applied to objects sent via postMessage(...), with a serialization and a deserialization step, which is frequently unnecessary for real-time audio. More concerning is that this structured cloning algorithm creates new JavaScript objects on the receiving end when a message is transmitted, which can create garbage to collect. Garbage collectors are extremely fast, but still not deterministic in JavaScript VMs. It’s best not to risk it for production software.
Then, looking under the hood, it becomes clear that implementations take locks and use all sorts of forbidden constructs, such as allocating memory and doing system calls, in the innards of postMessage(...) and the associated event dispatching machinery (on the receiving end, in the onmessage handler).
In any case, one cannot avoid the use of postMessage(...), for example to send large ArrayBuffers or Web Assembly modules to an AudioWorkletGlobalScope, but it’s best to not use it for continuously sending data to the real-time thread, and receiving results from it. Modern GCs deal with rare and short-lived object allocations very well in practice.
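It remains the right tool for one-off setup messages, though; for instance, compiling a WebAssembly module once on the main thread and posting it to the processor before any audio is produced. A sketch, with hypothetical file and variable names:

// Main thread, inside an async setup function (or a module with top-level
// await); `node` is assumed to be an AudioWorkletNode created earlier and
// "dsp.wasm" is a placeholder module.
const wasmModule = await WebAssembly.compileStreaming(fetch("dsp.wasm"));
node.port.postMessage(wasmModule);
// The AudioWorkletProcessor receives it in its port.onmessage handler,
// instantiates it once, and only then starts using it from process().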
The alternative, presenting ringbuf.js
The Single-Producer Single-Consumer wait-free ring buffer (often called an SPSC ring buffer) is often regarded as the bread and butter data structure for concurrency in real-time audio programming, and ringbuf.js is a version written in JavaScript, using SharedArrayBuffer.
It allows communicating between a producer thread and a consumer thread (which cannot change roles without external synchronization) without blocking or waiting.
The core data structure only supports sending integers and floating point values of varying width, but it’s easy to create adapters for more complex data transfers. Two abstractions are also provided: one for sending parameter changes (consisting of an integer for the parameter index and a float for the parameter value), and one for sending an interleaved audio stream.
It is written in a strange style of JavaScript, in a way that will not create any object, so that garbage collection won’t happen 1, but I find it quite readable nonetheless. It clocks in at 137 lines of code for 173 lines of comments in the current version, which is a sane ratio for any lock-free code.
Hopefully the API is good enough to use. It has been put to the test in a couple of projects already, and we’ve used it for a conference demo, reimplementing a toy HTMLVideoElement with Web Codecs, the Web Audio API and a <canvas>.
It’s MPL-licensed, allowing use in closed-source programs, well tested, stress tested in CI, and has extensive documentation and two practical examples to try and read to get started. Thanks to the help of a number of contributors, it’s packaged for use in modern web apps, and available on NPM 2. It also runs on Node.js.
Benchmarks
In lieu of proper serious benchmarks, I’m going to link to Jack Schaedler’s karplus-stress-tester page 3, and offer some results on the machines I have around 4, comparing postMessage(...) and something based on ringbuf.js.
This web app runs a large number of identical copies of the famous Karplus-Strong digital signal processing algorithm, either in JavaScript or in WASM, and offers a variety of configuration options. Its goal is to try to understand what sort of architecture and techniques are the best for real-time audio on the Web. It then allows loading the real-time audio thread arbitrarily by adding more strings to simulate. Like a “real” real-time audio application, bidirectional communication to and from the real-time audio thread happens periodically, respectively to strum the strings, and to visualize their amplitude and vibration characteristics on the main thread.
Here, I set 100 strings per worklet, choose a particular communication method (postMessage or SharedArrayBuffer), always use the WASM processor, and tick the “Visualize string state” checkbox to generate some main thread load.
I then increase the number of strings until I can detect any glitch once “strum all” has been clicked. I consider a workload stable when there are precisely zero glitches for a long period of time, and I scale the number of strings down as soon as any glitch is heard.
On the macOS machine, Chrome and Firefox don’t even glitch at 3000 strings, but that’s the maximum the page’s user interface allows setting. Here, we see that one can expect a 2.5x to 6x increase in load capacity when switching to a wait-free communication technique, compared to using postMessage.
In summary, it’s night and day on this benchmark. Wait-free concurrency is bound to be superior to postMessage(...) in most real-time audio use-cases. The same kind of results has been observed by other users of this library.
API primer
The API of ringbuf.js is maybe slightly non-standard for JavaScript developers, but there are good reasons for this: limiting allocations, and limiting memory copies.
Given a 1000-element ring buffer that can hold 32-bit floating point values:
var backingStorage = RingBuffer.getStorageForCapacity(1000, Float32Array);
var ringbuffer = new RingBuffer(backingStorage, Float32Array);
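The backing storage is a SharedArrayBuffer, so the typical setup is to post it once to the AudioWorkletProcessor, which builds its own RingBuffer view on top of it. A sketch of that wiring, where "processor.js", the processor name and the import specifier are placeholders that depend on your setup:

// Main thread, as a module script (top-level await); placeholder names.
import { RingBuffer } from "ringbuf.js";

const context = new AudioContext();
await context.audioWorklet.addModule("processor.js");
const storage = RingBuffer.getStorageForCapacity(4096, Float32Array);
const ringbuffer = new RingBuffer(storage, Float32Array);
const node = new AudioWorkletNode(context, "noise-processor");
// Shared memory is neither copied nor transferred: both sides now see it.
node.port.postMessage(storage);
node.connect(context.destination);
// Anything pushed into `ringbuffer` here can be popped by the processor.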
Enqueueing data into the ring buffer looks pretty straightforward:
var noise = new Float32Array(32);
for (let i = 0; i < 32; i++) {
  noise[i] = Math.random() * 2 - 1;
}
let enqueued = ringbuffer.push(noise);
console.log(`${enqueued} elements enqueued`);
The push operation is real-time safe, and is guaranteed to never block. The input array can be a view on another memory region, or alternatively the method can take an offset and an element count, so it’s possible to push only a portion of a larger buffer, for example here the last 32 elements of a 512-element buffer:
var noise = new Float32Array(512);
for (let i = 0; i < 512; i++) {
  noise[i] = Math.random() * 2 - 1;
}
let enqueued = ringbuffer.push(noise, 512 - 32, 32);
console.log(`${enqueued} elements enqueued`);
In any case, the number of elements successfully enqueued is returned, and the source array is not modified, so it’s possible to push the remaining elements at a later time.
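A minimal way to deal with a partial push is to remember how far the call got, and try again with the remainder later (a sketch, reusing the noise array from above):

let pendingOffset = 0;
const wanted = noise.length;
const enqueued = ringbuffer.push(noise);
if (enqueued < wanted) {
  // The queue filled up before everything fit: the source array is untouched,
  // so the remaining `wanted - enqueued` elements can be pushed on a later tick.
  pendingOffset = enqueued;
}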
Dequeuing from a ring buffer is less natural, but will be familiar to native developers:
var output = new Float32Array(512);
let dequeued = ringbuffer.pop(output, 32, 128);
console.log(`${dequeued} elements dequeued`);
Passing an array into the pop method allows saving an allocation, and potentially a copy. This method is also real-time safe, specifically wait-free.
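On the real-time side, this typically means calling pop directly from an AudioWorkletProcessor’s process method, into a scratch buffer allocated once in the constructor. A sketch, continuing the placeholder setup from earlier:

// processor.js, placeholder names; the import specifier depends on your setup.
import { RingBuffer } from "ringbuf.js";

class NoiseProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    // Allocate everything up front: nothing is allocated in process().
    this.scratch = new Float32Array(128);
    this.ringbuffer = null;
    this.port.onmessage = (e) => {
      // The SharedArrayBuffer posted from the main thread.
      this.ringbuffer = new RingBuffer(e.data, Float32Array);
    };
  }
  process(inputs, outputs) {
    if (this.ringbuffer) {
      // Wait-free: returns however many elements were actually available.
      const read = this.ringbuffer.pop(this.scratch);
      outputs[0][0].set(this.scratch.subarray(0, read));
    }
    return true;
  }
}
registerProcessor("noise-processor", NoiseProcessor);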
It is possible to ask whether the buffer is full or empty, and how many elements are available for reading or writing, with the methods of the same name:
console.log(ringbuffer.empty());
console.log(ringbuffer.full());
console.log(ringbuffer.available_read());
console.log(ringbuffer.available_write());
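These come in handy for deciding how much to move at once; for example, only pushing a block of samples when it fits entirely, so a block is never split across two attempts. A small sketch, where block is a Float32Array produced elsewhere:

// Only enqueue the block if all of it fits, otherwise try again later.
if (ringbuffer.available_write() >= block.length) {
  ringbuffer.push(block);
}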
Two last methods are available in the API. By passing a number of elements to write and a callback to the ring buffer, the callback is called with two buffers into which elements can be written. This can potentially help save copies or allocations, for example by having a particular processing or synthesis pass write its output directly into the ring buffer.
This method comes in two versions: one that doesn’t GC but has a slightly lower-level API (suitable for real-time threads), and one that can GC because it creates small object wrappers, but whose API is a bit more ergonomic:
function fill_noise(buf, count = buf.length, offset = 0) {
  for (var i = offset; i < offset + count; i++) {
    buf[i] = Math.random() * 2 - 1;
  }
}

function write_noise(storage, offset1, count1, offset2, count2) {
  // The two (offset, count) pairs describe the two contiguous regions of the
  // underlying storage that are available for writing.
  fill_noise(storage, count1, offset1);
  fill_noise(storage, count2, offset2);
  // Implied if there is no return value; it can be lower.
  return count1 + count2;
}
// The maximum number of elements to be appended to the queue in the following
// calls; it's possible to notify the ring buffer that fewer elements have been
// enqueued.
let element_count = 256;

// GC-free version
let wrote = ringbuffer.writeCallbackWithOffset(element_count, write_noise);
console.log(`${wrote} elements enqueued.`);
// Ergonomic version that can trigger GC
wrote = ringbuffer.writeCallback(element_count, (buffer1, buffer2) => {
  fill_noise(buffer1);
  fill_noise(buffer2);
  // Implied if there is no return value; it can be lower.
  return buffer1.length + buffer2.length;
});
console.log(`${wrote} elements enqueued.`);
Examples
Two examples are available, showing two tasks audio developers frequently have to implement (warning: they make noise, and clicking the start button will produce sound):
- Recording the output of an AudioWorkletProcessor, sending it to a Web Worker for further non-real-time or soft-real-time processing such as encoding, and then performing some IO on the encoded data. This demonstrates how to communicate directly between a Web Worker and an AudioWorkletProcessor, without touching the main thread (a minimal sketch of the Worker side is shown below).
- Generating audio on the main thread, and playing it using an AudioWorkletProcessor, e.g. to implement a push-based audio API. While not recommended in general, this is very useful, for example to implement emulators for older systems (where everything was on a single thread). This demonstrates how to use audioqueue.js and param.js, the two thin abstractions over the base class provided in the library.
They are interactive and hosted on a mini-site that has other pieces of info: https://ringbuf-js.netlify.app/.
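For reference, the Worker side of the first example boils down to receiving the shared storage once and then draining it on a timer, away from both the main thread and the audio thread. This is a simplified sketch with placeholder names, not the actual example code:

// worker.js, loaded as a module worker; placeholder names throughout.
import { RingBuffer } from "ringbuf.js";

let ringbuffer = null;
const chunk = new Float32Array(4096);

onmessage = (e) => {
  // The SharedArrayBuffer created on the other side of the queue.
  ringbuffer = new RingBuffer(e.data, Float32Array);
  setInterval(() => {
    const read = ringbuffer.pop(chunk);
    if (read > 0) {
      // Hand the samples to an encoder, write them out, etc.
      handleSamples(chunk.subarray(0, read)); // `handleSamples` is a stand-in
    }
  }, 100);
};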
Outro
Despite dramatically increasing the performance of most real-time audio workloads that use AudioWorkletProcessors, I don’t really find using a solution based on ringbuf.js particularly more complex than something using postMessage(...).
Developers should probably consider this library as a building block, and use higher-level but zero-cost abstractions in their code. Again, two very common abstractions are available in the same package:
- audioqueue.js allows sending interleaved audio frames through the queue.
- params.js allows sending parameter changes through the queue. A parameter is defined as a pair, composed of an index and a floating point value, but one could imagine sending any struct using very similar code.
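To give an idea of what “very similar code” means, here is a hand-rolled version of the index-and-value pattern built directly on RingBuffer; this is purely illustrative and is not params.js’s actual API:

// Illustrative only: a queue of (index, value) pairs packed as two floats.
const paramQueue = new RingBuffer(
  RingBuffer.getStorageForCapacity(256, Float32Array),
  Float32Array
);
const pair = new Float32Array(2);

function sendParamChange(index, value) {
  // Never enqueue half a pair: bail out if both elements don't fit.
  if (paramQueue.available_write() < 2) {
    return false;
  }
  pair[0] = index;
  pair[1] = value;
  paramQueue.push(pair);
  return true;
}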
The library is at version 0.3, but should be fairly stable in terms of API, with no breaking changes expected. The only real requirement is the availability of SharedArrayBuffer.
As always, I welcome all contributions, and make sure to let me know if you find something to improve!
Now, let’s all push our real-time audio web app further with the new performance budget, and then we’ll find the next thing to optimize!
And who knows, maybe other types of applications can benefit from it?
1. At least in Firefox; I’m not sure about others, but probably the same? ↩︎
2. Something that might be missing is binding definitions for folks using TypeScript, but I’m not sure how to do this yet, please get in touch if you want to help. ↩︎
3. Re-hosted with permission on a server that sets the appropriate headers for SharedArrayBuffer to be available. The source is at https://github.com/jackschaedler/karplus-stress-tester ↩︎
4. The macOS machine is an M1 Max. It can do a full Firefox build in about 13-14 minutes. The Linux machine is an HP workstation desktop based on an Intel i9-7940X, running Ubuntu LTS 22.04 (running stock PulseAudio config, explaining the low performance; anybody serious with real-time audio would install JACK, but only Firefox supports it natively). This machine compiles Firefox in about 12 minutes. ↩︎