Introduction
On top of the performance characteristics desired by most software, such as execution speed, any real-time audio software needs to have predictable performance. This has been covered in a previous post, and in this classic article (and lots of others). It applies to native software and software written using web technologies alike.
The increase in availability and performance of Web Assembly, and the somewhat recent return of SharedArrayBuffer, is exciting news for audio programmers. It is especially interesting for developers used to the usual programming techniques employed on native platforms, such as SIMD and lock-free concurrency.
After exposing a number of problems related to communicating to and from a real-time audio thread, the possible alternatives, and the reasons why they are not acceptable, this post presents a small JavaScript library (about 1.3kB gzipped), with no dependencies and a permissive license, that aims at solving the problem.
It allows developers to easily communicate with an AudioWorkletGlobalScope from the browser main thread or a Web Worker thread, in a way that is real-time safe, performant, and ergonomic.
Communicating with a real-time thread, in native
In native real-time audio programming, a common design is to have a single real-time audio thread. It is set to have a high priority, often using a different scheduling class. In addition, real-time audio programs run other threads, such as a main thread to render the UI, threads to perform input/output operations on disk or using the network, and other threads to offload bits of real-time processing, such as computing expensive convolutions.
All those support threads need to communicate with each other, but more importantly with the real-time audio thread. In turn, the real-time audio thread frequently wants to send data back to other threads, for example to update visuals, record audio to a file, or any other task that doesn’t have to or can’t be on the real-time audio thread.
Since we can’t have the real-time audio thread wait on any other thread, or otherwise do anything potentially unbounded in time, the following constructs are usually not recommended (this list is by no means complete):
- using operating system locks, such as pthread_mutex_t, CRITICAL_SECTION or SRWLOCK, to protect data accessed by the real-time audio thread, and other blocking concurrency mechanisms (reading on a pipe, semaphores, etc.)
- waiting on a condition variable (or even signaling one in some cases)
- performing any kind of IO, which involves quite a lot of system calls
- performing operations that can cause garbage collection pauses, even though garbage collectors are really fast these days
- dynamically allocating memory using the system allocator (this uses system calls)
- copying a large amount of memory
- using spin locks, either provided by the OS or custom (this one is not a hard rule and they are sometimes useful)
It’s no surprise that real-time audio programmers very often use atomic memory access facilities, and rely on tried and true patterns to implement complex but unavoidable communication schemes between the various threads of an application.
Communicating with a real-time thread, on the Web
In a Web Browser, things are a little different, but largely similar when squinting a bit.
The equivalent of the real-time audio thread on the Web is the AudioWorkletGlobalScope, inside which reside one or more AudioWorkletProcessors, with their methods called on a thread made real-time.
The browser main thread often handles the UI (although OffscreenCanvas exists and is an appealing solution). There are multiple ways to do network and disk IO, implemented by the browser, communicating the progress of these operations asynchronously with either Web Workers or the browser main thread.
Regular OS threads can be spawned using Web Workers. There is no way to change the priority of the underlying thread of a Worker at the moment.
| Native | Web |
| --- | --- |
| Real-time audio thread (callback) | AudioWorkletProcessor methods |
| Main application thread | Browser main thread |
| IO threads (sync/async/etc.) | Asynchronous fetch, IndexedDB, etc. |
| Regular OS thread | Web Worker |
| High or low-priority OS thread | No equivalent |
In terms of concurrency mechanisms, it’s a lot simpler than native. Almost everything is synchronized via the event loop of the various threads, workers and worklets. Communication happens using message passing, via the postMessage(...) method. It’s possible to copy the arguments passed to postMessage(...), so that they are still accessible at the call site, but it’s also possible to transfer ownership of the object, and this is essential when moving large pieces of memory around. In native lingo, this is like sending a pointer to a large buffer to another thread.
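For instance, transferring an ArrayBuffer instead of copying it only requires listing it in the transfer list. A minimal sketch, where worker.js is a hypothetical file name:

// Hypothetical worker; "worker.js" is a placeholder name.
const worker = new Worker("worker.js");
const buffer = new Float32Array(44100).buffer; // about 172 KiB of samples
// Structured clone: the buffer is copied and stays usable on this side.
worker.postMessage(buffer);
// Transfer: ownership moves to the worker, and `buffer` is detached here.
worker.postMessage(buffer, [buffer]);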
There is exactly one object that allows multiple threads to access the same piece of memory concurrently: the SharedArrayBuffer. From a distance, it looks like a regular ArrayBuffer, but the key difference is that when passed to postMessage(...), it is still available at the call site, in addition to being available at the destination. It is to be used in tandem with the Atomics object, which has various static methods to perform the usual store, load, exchange, arithmetic and compare-and-exchange operations.
Additionally, it provides a wait and a notify method, although wait is not available in the AudioWorkletGlobalScope, and notify is most probably unsafe to use from a real-time thread cross-platform, since it takes a lock on at least some platforms.
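As a point of reference, here is what these primitives look like when sharing a single 32-bit counter between the main thread and a Web Worker. This is a minimal sketch, unrelated to ringbuf.js itself, with a hypothetical worker.js:

// Main thread: create and publish one shared Int32.
const sab = new SharedArrayBuffer(4); // room for a single Int32
const shared = new Int32Array(sab);
Atomics.store(shared, 0, 42);
const worker = new Worker("worker.js");
worker.postMessage(sab); // shared, not copied: both sides see the same memory

// worker.js
onmessage = (e) => {
  const shared = new Int32Array(e.data);
  console.log(Atomics.load(shared, 0)); // 42, read without any lock
};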
Why is postMessage(...) not OK?
One could think that postMessage(...) is perfect for real-time audio: it’s message passing, after all. It could have been, but it’s not the case in practice.
First, postMessage(...) can be rather slow, depending on the implementation, but this is fixable by browser vendors. Structured cloning is applied to objects sent via postMessage(...), with a serialization and a deserialization step, which is frequently unnecessary for real-time audio. More concerning is that this structured cloning algorithm creates new JavaScript objects on the receiving end when a message is transmitted, which can create garbage to collect. Garbage collectors are extremely fast, but still not deterministic in JavaScript VMs. It’s best not to risk it for production software.
Then, looking under the hood, it becomes clear that implementations take locks and use all sorts of forbidden constructs, such as allocating memory and doing system calls, in the innards of postMessage(...) and the associated event dispatching machinery (on the receiving end, in the onmessage handler).
In any case, one cannot avoid the use of postMessage(...), for example to send large ArrayBuffers or Web Assembly modules to an AudioWorkletGlobalScope, but it’s best to not use it for continuously sending data to the real-time thread, and receiving results from it. Modern GCs deal with rare and short-lived object allocations very well in practice.
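It remains the right tool for one-off setup messages, though; for instance, compiling a WebAssembly module once on the main thread and posting it to the processor before any audio is produced. A sketch, with hypothetical file and variable names:

// Main thread, inside an async setup function (or a module with top-level
// await); `node` is assumed to be an AudioWorkletNode created earlier and
// "dsp.wasm" is a placeholder module.
const wasmModule = await WebAssembly.compileStreaming(fetch("dsp.wasm"));
node.port.postMessage(wasmModule);
// The AudioWorkletProcessor receives it in its port.onmessage handler,
// instantiates it once, and only then starts using it from process().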
The alternative, presenting ringbuf.js
The Single-Producer Single-Consumer wait-free ring buffer (often called an SPSC ring buffer) is often regarded as the bread and butter data structure for concurrency in real-time audio programming, and ringbuf.js is a version written in JavaScript, using SharedArrayBuffer.
It allows communicating between a producer thread and a consumer thread (which cannot change roles without external synchronization) without blocking or waiting.
The core data structure only supports sending integers and floating point values of varying width, but it’s easy to create adapters for more complex data transfers. Two abstractions are also provided: one for sending parameter changes (consisting of an integer for the parameter index and a float for the parameter value), and one for sending an interleaved audio stream.
It is written in a strange style of JavaScript, in a way that will not create any object, so that garbage collection won’t happen 1, but I find it quite readable nonetheless. It clocks in at 137 lines of code for 173 lines of comments in the current version, which is a sane ratio for any lock-free code.
Hopefully the API is good enough to use. It has been put to the test in a couple of projects already, and we’ve used it for a conference demo, reimplementing a toy HTMLVideoElement with Web Codecs, the Web Audio API and a <canvas>.
It’s MPL-licensed, allowing use in closed-source programs, well tested, stress tested in CI, and has extensive documentation and two practical examples to try and read to get started. Thanks to the help of a number of contributors, it’s packaged for use in modern web apps, and available on NPM 2. It also runs on Node.js.
Benchmarks
In lieu of proper serious benchmarks, I’m going to link to Jack Schaedler’s karplus-stress-tester page 3, and offer some results on the machines I have around 4, comparing postMessage(...) and something based on ringbuf.js.
This web app runs a large number of identical copies of the famous Karplus-Strong digital signal processing algorithm, either in JavaScript or in WASM, and offers a variety of configuration options. Its goal is to try to understand what sort of architecture and techniques are the best for real-time audio on the Web. It then allows loading the real-time audio thread arbitrarily by adding more strings to simulate. Like a “real” real-time audio application, bidirectional communication to and from the real-time audio thread happens periodically, respectively to strum the strings, and to visualize their amplitude and vibration characteristics on the main thread.
Here, I set 100 strings per worklet, choose a particular communication method (postMessage or SharedArrayBuffer), always use the WASM processor, and tick the “Visualize string state” checkbox to generate some main thread load.
I then increase the number of strings until I can detect any glitch once “strum all” has been clicked. I consider a workload stable when there are precisely zero glitches for a long period of time, and I scale the number of strings down as soon as any glitch is heard.
On the macOS machine, Chrome and Firefox don’t even glitch at 3000 strings, but that’s the maximum the page’s user interface allows setting. Here, we see that one can expect a 2.5x to 6x increase in load capacity when switching to a wait-free communication technique, compared to using postMessage.
In summary, it’s night and day on this benchmark. Wait-free concurrency is bound to be superior to postMessage(...) in most real-time audio use-cases. The same kind of results has been observed by other users of this library.
API primer
The API of ringbuf.js is maybe slightly non-standard for JavaScript developers, but there are good reasons for this: limiting allocations, and limiting memory copies.
Given a 1000-element ring buffer that can hold 32-bit floating point values:
var backingStorage = RingBuffer.getStorageForCapacity(1000, Float32Array);
var ringbuffer = new RingBuffer(backingStorage, Float32Array);
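The backing storage is a SharedArrayBuffer, so the typical setup is to post it once to the AudioWorkletProcessor, which builds its own RingBuffer view on top of it. A sketch of that wiring, where "processor.js", the processor name and the import specifier are placeholders that depend on your setup:

// Main thread, as a module script (top-level await); placeholder names.
import { RingBuffer } from "ringbuf.js";

const context = new AudioContext();
await context.audioWorklet.addModule("processor.js");
const storage = RingBuffer.getStorageForCapacity(4096, Float32Array);
const ringbuffer = new RingBuffer(storage, Float32Array);
const node = new AudioWorkletNode(context, "noise-processor");
// Shared memory is neither copied nor transferred: both sides now see it.
node.port.postMessage(storage);
node.connect(context.destination);
// Anything pushed into `ringbuffer` here can be popped by the processor.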
Enqueueing data into the ring buffer looks pretty straightforward:
var noise = new Float32Array(32);
for (let i = 0; i < 32; i++) {
  noise[i] = Math.random() * 2 - 1;
}
let enqueued = ringbuffer.push(noise);
console.log(`${enqueued} elements enqueued`);
The push operation is real-time safe, and is guaranteed to never block. The input array can be a view on another memory region, or alternatively the method can take an offset and an element count, so it’s possible to push only a portion of a larger buffer, for example here the last 32 elements of a 512-element buffer:
var noise = new Float32Array(512);
for (let i = 0; i < 512; i++) {
  noise[i] = Math.random() * 2 - 1;
}
let enqueued = ringbuffer.push(noise, 512 - 32, 32);
console.log(`${enqueued} elements enqueued`);
In any case, the number of elements successfully enqueued is returned, and the source array is not modified, so it’s possible to push the remaining elements at a later time.
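A minimal way to deal with a partial push is to remember how far the call got, and try again with the remainder later (a sketch, reusing the noise array from above):

let pendingOffset = 0;
const wanted = noise.length;
const enqueued = ringbuffer.push(noise);
if (enqueued < wanted) {
  // The queue filled up before everything fit: the source array is untouched,
  // so the remaining `wanted - enqueued` elements can be pushed on a later tick.
  pendingOffset = enqueued;
}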
Dequeuing from a ring buffer is less natural, but will be familiar to native developers:
var output = new Float32Array(512);
let dequeued = ringbuffer.pop(output, 32, 128);
console.log(`${dequeued} elements dequeued`);
Passing an array into the pop method allows saving an allocation, and potentially a copy. This method is also real-time safe, specifically wait-free.
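On the real-time side, this typically means calling pop directly from an AudioWorkletProcessor’s process method, into a scratch buffer allocated once in the constructor. A sketch, continuing the placeholder setup from earlier:

// processor.js, placeholder names; the import specifier depends on your setup.
import { RingBuffer } from "ringbuf.js";

class NoiseProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    // Allocate everything up front: nothing is allocated in process().
    this.scratch = new Float32Array(128);
    this.ringbuffer = null;
    this.port.onmessage = (e) => {
      // The SharedArrayBuffer posted from the main thread.
      this.ringbuffer = new RingBuffer(e.data, Float32Array);
    };
  }
  process(inputs, outputs) {
    if (this.ringbuffer) {
      // Wait-free: returns however many elements were actually available.
      const read = this.ringbuffer.pop(this.scratch);
      outputs[0][0].set(this.scratch.subarray(0, read));
    }
    return true;
  }
}
registerProcessor("noise-processor", NoiseProcessor);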
It is possible to ask whether the buffer is full or empty, and how many elements are available for reading or writing, with the methods of the same name:
console.log(ringbuffer.empty());
console.log(ringbuffer.full());
console.log(ringbuffer.available_read());
console.log(ringbuffer.available_write());
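These come in handy for deciding how much to move at once; for example, only pushing a block of samples when it fits entirely, so a block is never split across two attempts. A small sketch, where block is a Float32Array produced elsewhere:

// Only enqueue the block if all of it fits, otherwise try again later.
if (ringbuffer.available_write() >= block.length) {
  ringbuffer.push(block);
}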
Two last methods are available in the API. By passing a number of elements to write and a callback to the ring buffer, the callback is called with two buffers into which elements can be written. This can potentially help save copies or allocations, for example by having a particular processing or synthesis pass write its output directly into the ring buffer.
This method comes in two versions: one that doesn’t GC but has a slightly lower-level API (suitable for real-time threads), and one that can GC because it creates small object wrappers, but whose API is a bit more ergonomic:
function fill_noise(buf, count = buf.length, offset = 0) {
  for (var i = offset; i < offset + count; i++) {
    buf[i] = Math.random() * 2 - 1;
  }
}

function write_noise(storage, offset1, count1, offset2, count2) {
  // The two (offset, count) pairs describe the two contiguous regions of the
  // underlying storage that are available for writing.
  fill_noise(storage, count1, offset1);
  fill_noise(storage, count2, offset2);
  // Implied if there is no return value; it can be lower.
  return count1 + count2;
}
// The maximum number of elements to be appended to the queue in the following
// calls; it's possible to notify the ring buffer that fewer elements have been
// enqueued.
let element_count = 256;

// GC-free version
let wrote = ringbuffer.writeCallbackWithOffset(element_count, write_noise);
console.log(`${wrote} elements enqueued.`);
// Ergonomic version that can trigger GC
wrote = ringbuffer.writeCallback(element_count, (buffer1, buffer2) => {
  fill_noise(buffer1);
  fill_noise(buffer2);
  // Implied if there is no return value; it can be lower.
  return buffer1.length + buffer2.length;
});
console.log(`${wrote} elements enqueued.`);
Examples
Two examples are available, showing two tasks audio developers frequently have to implement (warning: they make noise, and clicking the start button will produce sound):
- Recording the output of an AudioWorkletProcessor, sending it to a Web Worker for further non-real-time or soft-real-time processing such as encoding, and then performing some IO on the encoded data. This demonstrates how to communicate directly between a Web Worker and an AudioWorkletProcessor, without touching the main thread (a minimal sketch of the Worker side is shown below).
- Generating audio on the main thread, and playing it using an AudioWorkletProcessor, e.g. to implement a push-based audio API. While not recommended in general, this is very useful, for example to implement emulators for older systems (where everything was on a single thread). This demonstrates how to use audioqueue.js and param.js, the two thin abstractions over the base class provided in the library.
They are interactive and hosted on a mini-site that has other pieces of info: https://ringbuf-js.netlify.app/.
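For reference, the Worker side of the first example boils down to receiving the shared storage once and then draining it on a timer, away from both the main thread and the audio thread. This is a simplified sketch with placeholder names, not the actual example code:

// worker.js, loaded as a module worker; placeholder names throughout.
import { RingBuffer } from "ringbuf.js";

let ringbuffer = null;
const chunk = new Float32Array(4096);

onmessage = (e) => {
  // The SharedArrayBuffer created on the other side of the queue.
  ringbuffer = new RingBuffer(e.data, Float32Array);
  setInterval(() => {
    const read = ringbuffer.pop(chunk);
    if (read > 0) {
      // Hand the samples to an encoder, write them out, etc.
      handleSamples(chunk.subarray(0, read)); // `handleSamples` is a stand-in
    }
  }, 100);
};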
Outro
Despite dramatically increasing the performance of most real-time audio workloads that use AudioWorkletProcessors, I don’t really find using a solution based on ringbuf.js particularly more complex than something using postMessage(...).
Developers should probably consider this library as a building block, and use higher-level but zero-cost abstractions in their code. Again, two very common abstractions are available in the same package:
- audioqueue.js allows sending interleaved audio frames through the queue.
- params.js allows sending parameter changes through the queue. A parameter is defined as a pair, composed of an index and a floating point value, but one could imagine sending any struct using very similar code.
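To give an idea of what “very similar code” means, here is a hand-rolled version of the index-and-value pattern built directly on RingBuffer; this is purely illustrative and is not params.js’s actual API:

// Illustrative only: a queue of (index, value) pairs packed as two floats.
const paramQueue = new RingBuffer(
  RingBuffer.getStorageForCapacity(256, Float32Array),
  Float32Array
);
const pair = new Float32Array(2);

function sendParamChange(index, value) {
  // Never enqueue half a pair: bail out if both elements don't fit.
  if (paramQueue.available_write() < 2) {
    return false;
  }
  pair[0] = index;
  pair[1] = value;
  paramQueue.push(pair);
  return true;
}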
The library is at version 0.3, but should be fairly stable in terms of API, with no breaking changes expected. The only real requirement is the availability of SharedArrayBuffer.
As always, I welcome all contributions, and make sure to let me know if you find something to improve!
Now, let’s all push our real-time audio web app further with the new performance budget, and then we’ll find the next thing to optimize!
And who knows, maybe other types of applications can benefit from it?
1. At least in Firefox; I’m not sure about others, but probably the same? ↩︎
2. Something that might be missing is binding definitions for folks using TypeScript, but I’m not sure how to do this yet, please get in touch if you want to help. ↩︎
3. Re-hosted with permission on a server that sets the appropriate headers for SharedArrayBuffer to be available. The source is at https://github.com/jackschaedler/karplus-stress-tester ↩︎
4. The macOS machine is an M1 Max. It can do a full Firefox build in about 13-14 minutes. The Linux machine is an HP workstation desktop based on an Intel i9-7940X, running Ubuntu LTS 22.04 (running stock PulseAudio config, explaining the low performance; anybody serious with real-time audio would install JACK, but only Firefox supports it natively). This machine compiles Firefox in about 12 minutes. ↩︎