Audio/Video synchronization with the Web Audio API

When working with real-time audio, latency is always a concern. Today, devices (desktop computers, laptops, phones) can often reach really low audio latency, but a variety of factors contribute to this latency.

A number of factors play a role in the final end-to-end latency: the OS and its version, the hardware used (external sound card, built-in audio chip, Bluetooth headset, etc.), how carefully the program making the sound is written, which API it uses, how big the buffers chosen by the developers or the users are, and how expensive the computation performed is.

Typical round-trip 1 latency figures on modern configurations using wired hardware are as follows:

Round-trip latency is often used when characterizing latencies on a particular device, because it’s easy to measure with a high degree of confidence. Equipped with a program that opens an input/output duplex audio stream and simply copies the input frames into the output buffer, generate, in a quiet room, a short tone or click on the output stream. Measure the time it takes, in the program, between generating the audio and recording it back, and (not always, but often) divide by two to get the audio output latency. This setup usually causes feedback loops (the Larsen effect), so be careful with the volume.
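The halving step above can be sketched as a small helper, assuming the round trip was measured in frames at a known sample rate (the function name and parameters are mine, not part of any API):

```javascript
// Estimate output latency from a loopback (round-trip) measurement.
// roundTripFrames: frames elapsed between emitting the click and seeing
// it come back in the input buffer; sampleRate in Hz.
// Halving assumes input and output latency are roughly symmetric,
// which, as noted above, is often but not always the case.
function estimateOutputLatencySeconds(roundTripFrames, sampleRate) {
  const roundTripSeconds = roundTripFrames / sampleRate;
  return roundTripSeconds / 2;
}
```

For example, a 4800-frame round trip at 48kHz is 100ms, suggesting roughly 50ms of output latency.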

Variations on this procedure exist: touch-to-sound latency matters for virtual instruments on phones, MIDI-key-press-to-sound latency is critical for musicians, and in VR experiences, tilt-sensor-to-sound latency contributes a lot to immersion.

Audio latencies on the web

On the Web, low latency is really important when playing video games (or any other interactive experience, such as VR or AR), when doing voice calls, when playing a virtual instrument using a controller, or simply when processing a real instrument, such as an electric guitar running through an amplifier and cabinet simulation, using the Web Audio API. It is generally accepted that latency is not perceptible below the 20 milliseconds range. This however varies quite a lot depending on the type of interaction, the frequency content of the sound, and other factors, such as the experience of the person tested: an experienced drummer might be annoyed even with sub-10-millisecond output latency. It’s important to put those numbers in perspective: an electric guitar player standing 5 meters away from the amplifier cabinet hears a latency of:

$$ l = \frac{d}{V_s} = \frac{5\,\mathrm{m}}{343\,\mathrm{m \cdot s^{-1}}} \approx 0.0146\,\mathrm{s} = 14.6\,\mathrm{ms} $$

Where \( V_s \) is the speed of sound (at 20°C, but from experience, when playing the guitar furiously, things heat up quickly and the air gets moist, lowering the speaker-to-ear latency), \(d\) the distance in meters, and \(l\) the latency.
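The same speaker-to-ear figure can be computed for any distance; a throwaway helper, assuming the 343 m/s value used above (dry air at 20°C):

```javascript
// Speed of sound in air at 20°C; varies with temperature and humidity.
const SPEED_OF_SOUND = 343; // m/s

// Acoustic latency in seconds for a listener standing
// `distanceMeters` away from the sound source.
function acousticLatencySeconds(distanceMeters) {
  return distanceMeters / SPEED_OF_SOUND;
}
```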

Low latency is often unnecessary: when playing non-real-time content (such as a song, a movie, or a Twitch stream), it’s only necessary to know the exact latency figure, in order to perform proper audio/video synchronization. Humans don’t notice it too much when audio is late compared to video (up to a point 2), but the opposite has a horrible effect and is really noticeable.

By delaying the video frames by an appropriate number of milliseconds, determined from the latency of the audio output device, perfect audio/video synchronization can be achieved, regardless of the inherent audio output latency of the setup. This is really important when watching a movie with a Bluetooth headset, which is a rather common scenario. Some software allows users to shift the audio or video track a few milliseconds either way, but it’s always best to have this work automatically.
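One way to sketch this compensation, assuming a constant-frame-rate video and a known output latency (the function and its parameters are illustrative, not a standard API):

```javascript
// Given the current media clock position (seconds), the audio output
// latency (seconds), and the video frame rate, return the index of the
// video frame that should be on screen right now: the video is shifted
// back by the audio latency so both reach the user at the same time.
function videoFrameToPresent(mediaTimeSeconds, outputLatencySeconds, fps) {
  const compensated = Math.max(0, mediaTimeSeconds - outputLatencySeconds);
  return Math.floor(compensated * fps);
}
```

With a 500ms output latency (not unheard of over Bluetooth) and a 30fps video, at media time 2.0s the frame to paint is the one for 1.5s, i.e. frame 45.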

Playing video on the web is very often (but not always) done using the HTMLVideoElement object (<video> in markup). In most browsers, this will do audio output latency compensation automatically. All is well.

However, when authors want to do things that are a bit more advanced, they often use the Web Audio API to do their audio processing, and render the video frames using a canvas (either 2D or 3D, depending on the application). When doing things manually like this, the browser cannot automatically shift the video frames based on the audio output latency anymore. Because, by default, Web Audio API implementations try to use low-latency audio output streams, and because, on desktop, audio output latencies on a lot of configurations are comparable to the display latency 3, this is fine. However, when this is not the case, synchronization is incorrect, and authors have had no real solution.

New members on the AudioContext interface

The AudioContext is the go-to interface for doing anything non-trivial with audio on the Web. It originally only did low-latency audio, and the latency was not configurable, but we later added a way to request higher-latency streams, which can be helpful for two reasons.

First, low-latency audio streams often consume much more CPU and energy than higher-latency audio streams: the audio thread has to wake up more often. Second, for various reasons, it’s often faster to perform a computation on, say, a 2048-frame buffer than it is to perform the same computation 16 times on a 128-frame buffer, each computation happening in an audio callback called in a roughly isochronous fashion.
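The equivalence referred to above can be illustrated: processing one 2048-frame buffer in a single pass gives the same result as 16 passes over 128-frame blocks, just with less per-call overhead. The gain processing here is a stand-in for real DSP:

```javascript
// Apply a gain over a whole buffer in one pass.
function processWhole(samples, gain) {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] * gain;
  return out;
}

// Same processing, but split into 128-frame blocks, the way Web Audio
// rendering quanta work: 2048 frames means 16 callback invocations.
function processInBlocks(samples, gain, blockSize = 128) {
  const out = new Float32Array(samples.length);
  for (let start = 0; start < samples.length; start += blockSize) {
    const end = Math.min(start + blockSize, samples.length);
    for (let i = start; i < end; i++) out[i] = samples[i] * gain;
  }
  return out;
}
```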

When instantiating an AudioContext, authors can ask for an arbitrarily low latency (expressed in seconds), and implementations will try to honor the request, down to a minimum under which they can’t go. In the Web Audio API, the hard limit is 128 frames (≈2.66ms at 48kHz), because all the processing is specified in terms of 128-frame blocks, but in practice the achievable minimum depends on the device and its setup.
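The request goes through the `latencyHint` option of the AudioContext constructor (either a category string or a value in seconds), and the 128-frame floor mentioned above is just the block size divided by the sample rate. A sketch, guarded so the browser-only part is skipped elsewhere:

```javascript
// Latency of one 128-frame rendering quantum, the hard lower bound the
// Web Audio API specifies, at a given sample rate (result in ms).
function quantumLatencyMs(sampleRate) {
  return (128 / sampleRate) * 1000;
}

// In a browser, ask for roughly 10ms of output latency; the
// implementation clamps the request to what the device supports.
if (typeof AudioContext !== "undefined") {
  const ctx = new AudioContext({ latencyHint: 0.01 });
  console.log(ctx.baseLatency, quantumLatencyMs(ctx.sampleRate));
}
```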

In any case, knowing the real audio output latency is really something that was missing from the Web platform. To solve this in the Web Audio API, we added three members to the AudioContext interface, and I just landed patches in Firefox to implement them: baseLatency, the latency added by the Web Audio API rendering itself, in seconds; outputLatency, an estimate of the time, in seconds, between the implementation handing a buffer to the audio subsystem and the sound being audible; and getOutputTimestamp(), which returns a pair of correlated times, one on the audio clock (contextTime) and one on the system clock (performanceTime).

The system clock and the audio clock can (and do very frequently) drift apart, and this third member allows easily mapping the two clock domains together.
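Given an `AudioTimestamp` as returned by `AudioContext.getOutputTimestamp()` (a `contextTime` in seconds on the audio clock, and a `performanceTime` in milliseconds on the `performance.now()` clock), the mapping is a small conversion around that anchor point. A sketch:

```javascript
// Map a time on the audio clock (seconds) to the performance.now()
// clock (milliseconds), using a recent output timestamp as the anchor.
// Assumes negligible drift since the timestamp was taken: since the
// two clocks do drift apart over time, re-anchor frequently.
function audioTimeToPerformanceTime(audioTimeSeconds, timestamp) {
  const deltaSeconds = audioTimeSeconds - timestamp.contextTime;
  return timestamp.performanceTime + deltaSeconds * 1000;
}
```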

Authors working on an application that needs audio/video synchronization can use outputLatency to synchronize their rendering: instead of painting video frames as soon as they are rendered, they can shift the clock of the rendering by outputLatency, and audio and video will be in sync. This mostly matters with high-latency audio output devices (often, Bluetooth headsets using the A2DP profile), which are very common these days, but it’s a good habit to always do this, so that people on, for example, low-end Android devices, which are known to have audio output latencies at the longer end of the spectrum, get a better experience.
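In practice, the shift can live in the animation loop, along these lines (a sketch: `drawFrameAt` is a hypothetical application-provided callback, not a platform API):

```javascript
// Shift the visual clock back by the audio output latency, so that what
// is drawn corresponds to the audio currently reaching the ears.
// ctx is an AudioContext; drawFrameAt(t) paints the scene for media
// time t in seconds (hypothetical, supplied by the application).
function startSyncedRendering(ctx, drawFrameAt) {
  function frame() {
    // currentTime is audio already handed to the output device;
    // subtracting outputLatency gives the audio currently audible.
    const audibleTime = ctx.currentTime - ctx.outputLatency;
    drawFrameAt(Math.max(0, audibleTime));
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);
}
```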

Now, the catch

While it’s easy to measure the round-trip latency, and reasonably straightforward to estimate the output latency, it’s a bit harder to get at this number from within a running program, for a variety of reasons.

When running Firefox Nightly (for now; this is due to ship in release on October 22nd, 2019), have a look at those values on your setup, and if they don’t make sense, please open a bug in the Web Audio API Bugzilla component (simply log in with a GitHub account if you don’t have a Bugzilla account on our instance), explain your setup and the values you see, and I’ll try to get it fixed.

  1. Round-trip latency refers to the time between the moment a sound wave enters a system (say, via a microphone, for a recording system) and the moment it’s output by the output device (speaker, headphones, etc.), when routing the input device directly to the output device. The system can also include a human, as in the case of key-press-to-ear latency, or another measurement device, or anything else, as long as it’s useful for the problem at hand and clearly defined. ↩︎

  2. Humans are used to seeing things before they hear them, because the speed of sound in air is much slower than the speed of light. If someone is speaking 10 meters away from you, it takes about 30ms for the sound waves to reach your ears. The first few chapters of the first volume of Audio Anecdotes are a good introduction to the field of auditory perception for practical purposes. ↩︎

  3. Input and screen latency are certainly not to be ignored when writing interactive programs. For very good reasons (double/triple buffering) and less good ones, the time it takes from pressing a key on a keyboard to the display changing on a typical computer can be quite long, and that’s the ideal scenario: it’s not uncommon for rendering pipelines to have a frame or more of additional latency. Quite a lot of tradeoffs of varying complexity can be made to achieve a better overall experience. ↩︎