Audio/Video synchronization with the Web Audio API

2019-07-23

When working with real-time audio, latency is always a concern. Today, devices (desktop computers, laptops) can often reach really low audio latency, but a variety of factors contribute to this latency.

Many factors play a role in the final end-to-end latency: the OS and its version, the hardware used (external sound card, built-in audio chip, Bluetooth headset, etc.), how carefully the program making the sound is written, which API it uses, how big the buffers chosen by the developers or the users are, and how expensive the computation performed is.

Typical round-trip ¹ latency figures on modern configurations, when using wired hardware, are as follows:

Windows: it really depends on the driver and the API used. WASAPI latency can be very low (tens of milliseconds with modern drivers on Windows 10 when the program uses the correct APIs) or high (on the order of hundreds of milliseconds). ASIO drivers (usually for professionals, and often paired with external sound cards) can have much lower latencies, but the SDK is proprietary and cannot really be used in open-source web browsers, as far as I know.
Apple devices (Macs, iPads, iPhones): usually the best choice for pro audio. Sub-10 millisecond latency can be achieved without trouble on default hardware, with very high reliability.
Linux: ALSA and PulseAudio, with proper real-time thread scheduling, can be pretty solid at low latencies. 50 milliseconds round-trip is achievable. When running Jack and/or a real-time kernel, a latency of a few frames (hundreds of microseconds) is doable, which is hard to beat.
Android: the platform uses its own audio stack, sometimes on top of ALSA. Latency figures vary widely, and the reliability of the audio threads is sometimes not the best. It really differs from device to device (see this page for real data and more explanations). It is certainly much better on average than it used to be a few years ago.

Round-trip latency is often used when characterizing latencies on a particular device, because it’s easy to measure with a high degree of confidence. Equipped with a program that opens an input/output duplex audio stream, simply copy the input frames into the output buffer. Then, in a quiet room, generate a short tone or click on the output stream. Measure the time that elapses in the program between generating the audio and recording it, and (not always but often) divide by 2 to get the audio output latency. This setup usually causes acoustic feedback (the Larsen effect), so be careful.
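The measurement above can be sketched in code. This is a minimal sketch and not a calibrated tool: the function names, the `threshold` parameter and the idea of handing the recording over as a `Float32Array` are assumptions made for illustration. It also assumes that the recorded buffer starts at the same frame as the output stream, so that `clickFrame` is a valid index in both.

```typescript
// Hypothetical helper: estimates round-trip latency from a duplex recording.
// `recorded` holds the input frames captured while the click was played, and
// `clickFrame` is the frame index at which the click was written to the output.
// Assumption: both streams share the same frame timeline and sample rate.
function estimateRoundTripSeconds(
  recorded: Float32Array,
  clickFrame: number,
  sampleRate: number,
  threshold: number = 0.1,
): number | null {
  // First frame at or after the click whose magnitude crosses the threshold.
  const onset = recorded.findIndex(
    (sample, frame) => frame >= clickFrame && Math.abs(sample) >= threshold,
  );
  if (onset < 0) {
    return null; // The click was never picked up: too quiet, or recording too short.
  }
  return (onset - clickFrame) / sampleRate;
}

// Often (but not always) a fair estimate: output latency is half the round trip.
function estimateOutputLatencySeconds(roundTripSeconds: number): number {
  return roundTripSeconds / 2;
}
```

With a click written at frame 1000 and picked up at frame 1480 of a 48kHz stream, this gives a round trip of 10 milliseconds, and an output latency estimate of 5 milliseconds. A fixed threshold only works in a quiet room, which is why the procedure asks for one.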

Variations on this procedure are possible: touch-to-sound latency can be important for virtual instruments on phones, MIDI key press to sound latency is critical for musicians, and tilt sensor to sound latency contributes a lot to immersion in VR experiences.

Audio latencies on the web

On the Web, low latency is really important when playing video games (or any other interactive experience, such as VR or AR), when doing voice calls, when playing a virtual instrument using a controller, or simply when processing a real instrument using the Web Audio API, for example running an electric guitar through an amplifier and cabinet simulation. It is generally accepted that latency is not perceptible when below about 20 milliseconds. This however varies quite a lot depending on the type of interaction, the frequency content of the sound and quite a lot of other factors, such as the experience of the person tested: an experienced drummer might be annoyed even with sub-10 millisecond output latency. It’s important to put those numbers in perspective: an electric guitar player playing 5 meters away from the amplifier cabinet experiences a latency of:

$$ \begin{aligned} l &= \frac{d}{V_s} \\ &= \frac{5\,\mathrm{m}}{343\,\mathrm{m} \cdot \mathrm{s}^{-1}} \\ &\approx 0.0146\,\mathrm{s} \\ &= 14.6\,\mathrm{ms} \end{aligned} $$

Where $V_s$ is the speed of sound (at 20°C, but from experience when playing the guitar furiously, things heat up quickly and the air gets moist, lowering the speaker to ear latency), $d$ the distance in meters, and $l$ the latency.
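The same arithmetic can be written as a small function. This is only a sketch: the function names are made up for illustration, and the linear formula for the speed of sound is a common approximation for dry air that ignores humidity, so it is an assumption of this example and not something measured for this post.

```typescript
// Approximate speed of sound in dry air, in meters per second, for a
// temperature in degrees Celsius. Assumption: a linear approximation that is
// reasonable around room temperature and ignores humidity.
function speedOfSound(celsius: number): number {
  return 331.3 + 0.606 * celsius;
}

// Time, in milliseconds, for sound to travel `distanceMeters` through air.
function acousticLatencyMs(distanceMeters: number, celsius: number = 20): number {
  return (distanceMeters / speedOfSound(celsius)) * 1000;
}
```

`acousticLatencyMs(5)` gives roughly 14.6 milliseconds, in line with the figure above, and passing a higher temperature yields a slightly smaller value.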

Low latency is often unnecessary: when playing non-real-time content (such as a song, a movie, or a Twitch stream), it’s only necessary to know the exact latency figure, to perform proper audio/video synchronization. Humans don’t notice it too much when audio is late compared to video (up to a point ²), but the opposite has a horrible effect and is really noticeable.

By delaying the video frames by an appropriate number of milliseconds, and by knowing the latency of the audio output device, perfect audio/video synchronization can be achieved, regardless of the inherent audio output latency of the setup. This really is important when watching a movie with a Bluetooth headset, which is a rather common case. Some software allows users to shift the audio or video track a few milliseconds each way, but it’s always best to have it working automatically.

Playing video on the web is very often (but not always) done using the HTMLVideoElement object (<video> in markup). In most browsers, this will do audio output latency compensation automatically. All is well.

However, when authors want to do things that are a bit more advanced, they often use the Web Audio API to do their audio processing, and render the video frames using a canvas (either 2D or 3D, depending on the application). When doing things manually like this, the browser cannot automatically shift the video frames based on the audio output latency anymore. Because, by default, Web Audio API implementations try to use low-latency audio output streams, and because on desktop audio output latencies on a lot of configurations are comparable to the display latency ³, this is fine. However, when this is not the case, synchronization is incorrect, and until now authors have had no real solution.

New members on the AudioContext interface

The AudioContext is the go-to interface for doing anything non-trivial with audio on the Web. It was originally only doing low-latency audio, and the latency was not configurable, but we later added a way to request higher-latency streams, which can be helpful for two reasons.

First, low-latency audio streams often consume much more CPU and energy than higher-latency audio streams: the audio thread has to wake up more often. Second, for various reasons, it’s often faster to perform a computation on, say, a 2048-frame buffer than it is to perform the same computation 16 times on a 128-frame buffer, each computation happening in an audio callback being called in a roughly isochronous fashion.

When instantiating an AudioContext, authors can ask for an arbitrarily low latency (expressed in seconds), and implementations will try to honor the request, down to a minimum below which they can’t go. In the Web Audio API, the hard limit is 128 frames (≈2.67ms at 48kHz), because we’ve specified all the processing in blocks of 128 frames, but in practice the minimum depends on the setup of the device.
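To make the relationship between frames and seconds concrete, here is a small sketch. The function names are hypothetical, and rounding a request up to a whole number of 128-frame blocks is an assumption used for illustration: an actual implementation is free to pick any buffer size the platform supports. In a browser, the request itself is made with the `latencyHint` option of the `AudioContext` constructor, as shown in the comment.

```typescript
// In a browser, a latency request (in seconds) would look like this:
//   const ctx = new AudioContext({ latencyHint: 0.04 });

// The Web Audio API processes audio in blocks of 128 frames.
const RENDER_QUANTUM_FRAMES = 128;

// Duration in seconds of a buffer of `frames` frames at `sampleRate` Hz.
function framesToSeconds(frames: number, sampleRate: number): number {
  return frames / sampleRate;
}

// Smallest whole number of 128-frame blocks that covers a requested latency.
// Assumption: illustrative only, real implementations choose their own sizes.
function blocksForLatency(requestedSeconds: number, sampleRate: number): number {
  const frames = requestedSeconds * sampleRate;
  return Math.max(1, Math.ceil(frames / RENDER_QUANTUM_FRAMES));
}
```

`framesToSeconds(128, 48000)` is about 0.00267 seconds, the hard limit mentioned above, and a request for 0 seconds still yields one block, since nothing smaller than a single block can be processed.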

In any case, knowing the real audio output latency is really something that was missing from the Web platform. To solve this in the Web Audio API, we added three members to the AudioContext interface, and I just landed patches in Firefox to implement them:

baseLatency is a floating point number that represents the number of seconds the AudioContext uses for its internal processing, once the audio reaches the AudioDestinationNode. In Firefox, this is always 0 because we process the audio directly in the audio callback.
outputLatency is another floating point number that represents the number of seconds of latency between the audio reaching the AudioDestinationNode (conceptually the audio sink of the audio processing graph) and the audio output device. This is the number that varies widely depending on the setup.
getOutputTimestamp is a method that returns a JavaScript dictionary with two members. contextTime is in the same unit and clock domain as AudioContext.currentTime, and is the sample-frame that is currently being output by the audio output device, converted to seconds. performanceTime is the same instant, but in the same clock domain and unit as performance.now() (in milliseconds).

The system clock and the audio clock can (and very frequently do) drift apart, and this third member makes it easy to map the two clock domains to each other.
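As a sketch of that mapping, the two functions below convert between the clock domains using one timestamp pair. The `OutputTimestamp` interface and the function names are defined here for illustration only; in a browser, the pair would come from `audioContext.getOutputTimestamp()`. The sketch assumes both clocks advance at the same rate over the short interval since the pair was taken, which is why a fresh pair should be fetched regularly.

```typescript
// Shape of the pair returned by getOutputTimestamp(), redefined here so the
// example is self-contained.
interface OutputTimestamp {
  contextTime: number; // seconds, same clock as AudioContext.currentTime
  performanceTime: number; // milliseconds, same clock as performance.now()
}

// Converts a performance.now() value (ms) to the AudioContext clock (s).
function performanceToContextTime(ts: OutputTimestamp, performanceTimeMs: number): number {
  return ts.contextTime + (performanceTimeMs - ts.performanceTime) / 1000;
}

// Converts an AudioContext time (s) to the performance.now() clock (ms).
function contextToPerformanceTime(ts: OutputTimestamp, contextTimeSeconds: number): number {
  return ts.performanceTime + (contextTimeSeconds - ts.contextTime) * 1000;
}
```

For example, with a pair of 2 seconds and 5000 milliseconds, a `performance.now()` value of 5250 maps to 2.25 seconds on the audio clock.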

Authors working on an application that needs audio/video synchronization can use outputLatency to synchronize their rendering: instead of painting video frames as soon as they are rendered, they can shift the clock of the rendering by outputLatency, and audio and video will be in sync. This mostly matters with high-latency audio output devices (often, Bluetooth headsets using the A2DP profile), which are very common these days, but it’s a good habit to always use this, so that people who are, for example, on low-end Android devices, which are known to have an audio output latency on the longer end of the spectrum, have a better experience.
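Shifting the rendering clock could look like the sketch below. The `LatencyInfo` interface, the function names and the `audioStartTime` parameter are hypothetical, and the sketch subtracts `baseLatency` as well as `outputLatency`, which is an assumption on my part (it makes no difference when `baseLatency` is 0). A real `AudioContext` has the three properties used here, so it can be passed in directly.

```typescript
// The subset of AudioContext used by this sketch.
interface LatencyInfo {
  currentTime: number; // seconds
  baseLatency: number; // seconds
  outputLatency: number; // seconds
}

// Estimate of the context time of the audio that is audible right now.
function audibleContextTime(ctx: LatencyInfo): number {
  return ctx.currentTime - (ctx.baseLatency + ctx.outputLatency);
}

// Index of the latest video frame that should be on screen, or -1 if the
// first frame is not due yet. `frameTimes` are presentation times in seconds,
// sorted in ascending order. `audioStartTime` is the context time at which
// the audio track started playing.
function frameToPaint(
  frameTimes: readonly number[],
  ctx: LatencyInfo,
  audioStartTime: number = 0,
): number {
  const mediaTime = audibleContextTime(ctx) - audioStartTime;
  const firstLater = frameTimes.findIndex((t) => t > mediaTime);
  return firstLater === -1 ? frameTimes.length - 1 : firstLater - 1;
}
```

Calling `frameToPaint` from a `requestAnimationFrame` callback means that with a 150 millisecond output latency, the frame on screen is the one matching what is being heard, and not the one matching what was just handed to the audio device. The linear search keeps the sketch short; a binary search would be a better fit for long frame lists.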

Now, the catch

While it’s easy to measure the round-trip latency, and reasonably straightforward to estimate the output latency, it’s a bit harder to find this number from within the running program, for a variety of reasons:

Sometimes the drivers are just lying when reporting the value. If the device is very common, we can hard-code a sensible value. Otherwise, it’s just too bad.
Sometimes the call to get the value does not work on certain versions of the OS. This is the case on Windows 10 where GetStreamLatency on the IAudioClient interface seems to always return 0 on my (very standard) Dell XPS15. I’m fairly certain it worked when I implemented this feature back in 2013 on Windows 7. I’ll be working around this bug in the near future, but it won’t be accurate until then.
There is time that is hard to account for in the audio input/output pipeline. A/D and D/A converters usually take a small but non-zero number of samples, often because they run some filtering on the audio data. Some platforms are also a bit better than others at including more bits of the pipeline in the latency value they report.
The setup between the device and the audio output device can be a bit more complicated: the audio output can go to a mixing desk and/or a proper sound system with more digital or analog effects, and we can’t really account for this. If the setup is somewhat under control and fixed (for example for an art installation, or a live performance where the performers control part of the setup), adding hard-coded values can help.
Some audio APIs change the buffer size dynamically depending on the load of the system, and/or the presence or absence of other streams with different characteristics (Windows 10 and WASAPI when using the IAudioClient3 interface, and PulseAudio). This is surfaced in the AudioContext API (querying the value again yields a different value), but there is no event fired when this value changes. Querying it each time a video frame is rendered might be useful, but I have yet to quantify the time it takes to do so. It depends widely on the platform: it can be as easy as reading an integer or two (sometimes atomics, which is a lot more expensive) and performing a couple of divisions, or it can be a system call, which I hope is not too expensive.
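On the application side, some of these caveats can be softened with a thin layer around the reported value. The sketch below is one possible approach, not something the API provides: `LatencyOptions`, `effectiveOutputLatency` and `latencyChanged` are hypothetical names, and treating a reported value of 0 as unusable is an assumption based on the Windows 10 behavior described above.

```typescript
interface LatencyOptions {
  // Used when the reported value is unusable (0, negative, NaN or infinite).
  fallbackSeconds: number;
  // Fixed, hand-measured latency of downstream gear (mixing desk, effects).
  externalSeconds?: number;
}

// Returns the latency to use for synchronization, given the value read from
// AudioContext.outputLatency.
function effectiveOutputLatency(reported: number, options: LatencyOptions): number {
  const usable = Number.isFinite(reported) && reported > 0;
  const base = usable ? reported : options.fallbackSeconds;
  return base + (options.externalSeconds ?? 0);
}

// No event is fired when the latency changes, so callers that poll the value
// can use this to decide whether to re-sync.
function latencyChanged(
  previous: number,
  current: number,
  toleranceSeconds: number = 0.001,
): boolean {
  return Math.abs(current - previous) > toleranceSeconds;
}
```

A rendering loop could read `outputLatency` on each frame, pass it through `effectiveOutputLatency`, and only adjust its clock when `latencyChanged` reports a difference larger than the tolerance, which avoids jitter from tiny fluctuations.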

When running Firefox Nightly (for now, this is going to be released on the 22nd of October 2019), have a look at those values on your setup, and if they don’t make sense, please open a bug in the Web Audio API Bugzilla component (simply log in with a GitHub account if you don’t have a Bugzilla account on our instance), explain your setup and the values you see, and I’ll try to get it fixed.

¹ Round-trip latency refers to the time that elapses between the moment a sound wave enters a system (say, a microphone, for a recording system) and the moment it’s output by the output device (speaker, headphones, etc.), when routing the input device directly to the output device. The system can also include a human, as in the case of key-press to ear latency, or another measurement device, or anything else, as long as it’s useful for the problem at hand and clearly defined.
² Humans are used to seeing things earlier than they hear them, because the speed of sound in air is much lower than the speed of light. If someone is speaking 10 meters away from you, it takes about 30ms for the sound waves to reach your ear. The first few chapters of the first volume of Audio Anecdotes are a good introduction to the field of auditory perception for practical purposes.
³ Input and screen latency are certainly not to be ignored when writing interactive programs. For very good reasons (double/triple buffering) and less good reasons, the time it takes from pressing a key on a keyboard to the display changing on a typical computer can be quite long. And this is the ideal scenario: it’s not uncommon for rendering pipelines to have a frame or more of additional latency. Quite a lot of tradeoffs of various complexity can be made to achieve a better overall experience.
