Media Stream Track Audio Source Node available in Firefox Nightly

Firefox Nightly now has an implementation of MediaStreamTrackAudioSourceNode. It’s very similar to MediaStreamAudioSourceNode, and might even be polyfillable on implementations that don’t have this new AudioNode. It’s not a particularly complicated feature. However, it can serve as a demonstration of how a new feature is added to the Web Platform (at least in the Audio Working Group), and as an opportunity to touch upon the problems browser engineers and standards people can run into.

In this post, I’ll detail the standard process and the Firefox implementation of this feature, then try to shed some light on why such a simple feature is needed in the first place, and why it took so long to finish and ship.

Why we really needed this, and the backstory

When the Web Audio API specification was first published in early 2011, there was nothing to connect a MediaStream to a processing graph built using the Web Audio API. This was a big problem: MediaStreams are the backbone of real-time media routing on the Web, and can be the input or output of WebRTC PeerConnections, MediaRecorder, audio device input via getUserMedia, and various other bits of the Web Platform.

It was always possible to render the output of a MediaStream using an HTMLMediaElement (either a <video> or <audio> tag), but the only processing one could do to the audio stream was perhaps to change its output volume. When richer audio processing is needed in a Web application, using an AudioContext is always the answer. At the time, the two constructs couldn’t be used together.

A year later, the 2012 Working Draft introduced a new AudioNode, MediaStreamAudioSourceNode, that allowed a Web Audio API processing graph to have a MediaStream as an input. It was described like so:

This interface represents an audio source from a MediaStream. The first AudioMediaStreamTrack from the MediaStream will be used as a source of audio.

With a constructor, in WebIDL, that looked like this:

partial interface AudioContext {
    MediaStreamAudioSourceNode createMediaStreamSource(MediaStream mediaStream);
};

This was all well and good, and this node has been absolutely instrumental to the success of the Web Audio API, along with its counterpart, MediaStreamAudioDestinationNode, which does the opposite: from a Web Audio API graph, you get a MediaStream that contains the audio generated by the graph, which you can then use for something else.

But then we received this message on the public-audio W3C mailing list [1] in October 2013, from Stefan Håkansson LK from Ericsson:

Hi,

I have a comment triggered by the new public draft being announced:

In section 4.24 it is said that “The first AudioMediaStreamTrack from the MediaStream will be used as a source of audio.”

Which audio track that is the first one is undefined (in the case when there is more than one). I think it would make much more sense to deal directly with a MediaStreamTrack (of kind audio) in this API (which should as a consequence be renamed to “ MediaStreamTrackAudioSourceNode”).

And it would probably make sense to rename “MediaStreamAudioDestinationNode” to “MediaStreamTrackAudioDestinationNode”, and have it produce a MediaStreamTrack.

Br,
Stefan

Now, this makes a lot of sense. Looking at Media Capture and Streams, the specification that defines MediaStream, we can read:

The tracks of a MediaStream are stored in a track set. The track set MUST contain the MediaStreamTrack objects that correspond to the tracks of the stream. The relative order of the tracks in the set is User Agent defined and the API will never put any requirements on the order. The proper way to find a specific MediaStreamTrack object in the set is to look it up by its id.

So the Web Audio API spec said something that was simply incorrect, and two implementations (covering three browsers, since Safari and Chromium were both based on WebKit at the time) had already implemented it [2]. Great.

It took a few GitHub issues (ported from the W3C Bugzilla instance) to sort the situation out.

In issue #132, an alternative design was proposed. Instead of picking a hypothetical “first track”, the MediaStreamAudioSourceNode would output all tracks, each one on a different output. However, it was quickly noticed that this wouldn’t solve the problem completely, as there would be no way for authors to know which input track in the MediaStream matched which output of the AudioNode. In the Web Audio API, the order of the outputs is meaningful and based on a numeric index. For example, the first output of a ChannelSplitterNode is always the first channel of its single, possibly multichannel input, so the mapping is straightforward. The tracks in a MediaStream, on the other hand, have no defined order. They are looked up by identifiers, which are most of the time UUIDs (this is recommended by the spec but not mandatory), and these do not map naturally to outputs. This design was abandoned.

It felt easier to simply let authors enumerate the tracks on their MediaStream, pick the right one for their use case, and have only this track routed to the audio graph. This lets the browser engine perform optimizations on the other tracks, such as doing absolutely nothing with them if possible (which is the best optimization one can do).

In issue #264, it was decided that the solution would be two-fold:

1. Create a new AudioNode, MediaStreamTrackAudioSourceNode, that would let authors choose which track was to be routed to the AudioContext;
2. Specify correctly which track would be picked when creating a MediaStreamAudioSourceNode.

The first bit was easy, done in PR #982. Of course it was initially incorrect when merged: I typo-ed the factory method [3], and later had to add the fact that trying to create a MediaStreamTrackAudioSourceNode should throw when passed a track that is not of kind "audio". But after those fixups, it was done and correct. Usually, those mistakes are caught during review, but not this time, it seems.
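From an author’s point of view, the result looks roughly like the sketch below. It is only an illustration: trackSourceFromStream and its parameters are hypothetical names made up for this example, while getTrackById, createMediaStreamTrackSource and the mediaStreamTrack constructor option are, to the best of my knowledge, the names used in the Media Capture and Streams and Web Audio API specifications.

```javascript
// Hypothetical helper (not part of any spec): route one specific audio
// track of a MediaStream into a Web Audio graph, looked up by its id.
function trackSourceFromStream(ctx, stream, trackId) {
  // MediaStream.getTrackById returns null when no track has this id.
  const track = stream.getTrackById(trackId);
  if (track === null) {
    throw new Error(`No track with id ${trackId} in this MediaStream`);
  }
  // The node creation itself is specified to throw for non-audio tracks;
  // checking early here only gives a more explicit error message.
  if (track.kind !== "audio") {
    throw new Error(`Track ${trackId} is of kind "${track.kind}", not "audio"`);
  }
  // Factory method form. The constructor form would be:
  //   new MediaStreamTrackAudioSourceNode(ctx, { mediaStreamTrack: track })
  return ctx.createMediaStreamTrackSource(track);
}
```

In a browser, ctx would be an AudioContext and stream would typically come from getUserMedia or a WebRTC PeerConnection. The returned node can then be connected to the rest of the graph like any other source.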

The second bit was not too hard, but a bit arbitrary, which is often annoying in software, where we like to know why things are the way they are. We wanted a way (any way, really) to pick the first track in a manner that would be stable across implementations. We ended up ordering the tracks by their id property. Now, this works, but sorting strings is a surprisingly complicated affair: you can’t just wing it and hope to be correct when you’re a programmer who mostly deals with audio. Before a W3C specification can move to Candidate Recommendation, it has to be reviewed by a few groups, for example for security or privacy issues, but also by the internationalization working group (usually referred to by its abbreviation, i18n). We don’t deal with text too much in the Web Audio API, but they did find that we were trying to sort things “alphabetically”, which is very vague and makes little sense in practice.

Quoting the review:

Sorting alphabetically seems to be underspecified. Given that MediaStreamTrack id attributes are recommended to be UUIDs and that the “alphabetic” sorting appears to be an arbitrary way to prioritize the item selection, defining the comparison as ASCII case-insensitive might be a good choice. However, since there is no actual restriction on what the values can be, I’d suggest that you replace “alphabetically” with ordering by code point (which means that it is case-sensitive, please note)

I ended up doing that in PR #1811, by specifying that the tracks were to be sorted by Unicode code point value, which is the way the sort method in ECMAScript sorts things (strictly speaking, ECMAScript compares UTF-16 code units, which gives the same order for ASCII ids such as UUIDs). When picking things arbitrarily, it’s often best to do the same thing as elsewhere.

The old and incorrect prose for MediaStreamAudioSourceNode was now correct and well defined: on construction, the tracks are sorted by their id property, which is a string, and the first element of the resulting collection is picked as the input.
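The selection rule can be sketched in a few lines of JavaScript. This is a minimal illustration and not the spec text: pickDefaultAudioTrack is a hypothetical name, returning null for a stream without audio tracks is a choice made for this sketch only, and the comparison uses the < operator on strings, which compares UTF-16 code units (the same order as code points for ASCII ids such as UUIDs).

```javascript
// Hypothetical helper mirroring the selection rule described above:
// order the audio tracks by id and keep the first one.
function pickDefaultAudioTrack(stream) {
  const tracks = stream.getAudioTracks();
  if (tracks.length === 0) {
    return null; // Sketch-only behaviour for streams without audio tracks.
  }
  // String comparison with < and > is case-sensitive and works on
  // UTF-16 code units, so "B" (0x42) sorts before "a" (0x61).
  const byId = (a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0);
  // Copy before sorting so the array returned by the stream is left untouched.
  return [...tracks].sort(byId)[0];
}
```

With tracks whose ids are "b", "B" and "a", this picks the track with id "B", which illustrates the case sensitivity pointed out in the i18n review.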

It only took five years from the time we became aware of the problem to have a correct text. Granted, if we were to tally the cumulative hours spent working on this particular feature, they would amount to no more than a dozen, spent pinpointing the issue, finding a viable solution, writing and reviewing the spec text, tweaking the prose, performing fixes, and reviewing them.

This is a prime example of why things like this take time: the W3C Audio Working Group was certainly not inactive during this time (other things, like AudioWorklet, or other more pressing matters, took precedence). There were simply too many other things on our plate to finish this task in a timely manner.

The implementation part

Most of the implementation work wasn’t done by me: I was mentoring students from the Lyon 1 University in Lyon, France, where Matthieu Moy leads a school project where students work on open source or free software projects. I’ve been doing that with him for a few years now, it’s great.

Sometimes multiple people work on a single bug in this project, sometimes it’s only a single person. What I know is that this was started in October 2017, and that the name on the patch is Léo Paquet. Apologies if other people worked on this feature!

This was worked on in Bug 1324548. I did the initial review pass, and my colleague Andreas Pehrson did most of the in-depth review work. There were tests, of course: there is no chance that a web-facing feature would be merged in Gecko without comprehensive tests. Those were written using Gecko’s Mochitest test harness. Mochitest, at the time, was the harness of choice for testing features that are exposed to the Web. These days, Web Platform Tests are a better option, but they still have a number of limitations.

Later, the school project ended, and this was abandoned somewhat.

I picked it up when my manager noticed that we seemed to be sitting on a mostly finished patch with tests. Of course, things had bit-rotted somewhat [4] and there was a MediaStreamTrack leaking at shutdown [5], so I fixed that with, again, the help of Andreas, at the back of a conference room during a big meeting in Mozlando 2018.

He wanted a few more tests before landing, so I started writing them. Specifically, more testing around the security-sensitive bit of the patch: what happens if the MediaStreamTrack that is being routed to the AudioContext processing graph has content that is not same-origin? The content of the MediaStreamTrack has to be replaced by silence, otherwise it’s a pretty severe cross-origin information leak: a website could open a URL that is not from the same origin, and that has not agreed to be read from another origin, and inspect the contents of the audio file. We also tried to determine whether it was possible to have a source that was same-origin and would then switch to being cross-origin, for example because of an HTTP redirect.

For a long time, in Firefox, this was possible. This is another good story, told in Jake Archibald’s post “I found a browser bug”. In current Firefox, since bug 1443942 landed, mid-flight redirects to cross-origin destinations are blocked when loading media resources, so this is not possible anymore. I do believe Firefox would have been safe (read: would have silenced the cross-origin media dynamically after the redirect) even if this were still possible, because the cross-origin status of a media source is dynamically propagated from the source to the MediaStreamTrackAudioSourceNode.

Another test was to check whether everything continues working correctly if all the references to the input MediaStreamTrack are dropped by the JavaScript code. Historically, this was a bug that Firefox had: if you were to use a MediaStreamAudioSourceNode with, say, a MediaStream that came from a getUserMedia call, and you dropped all references to it in the code, the MediaStream would suddenly be garbage collected, and you stopped getting input data in the MediaStreamAudioSourceNode. This is why you still see authors sticking MediaStreams from getUserMedia calls as expandos on global objects to work around this Firefox bug, which was fixed in bug 934512.

It would have been better to write Web Platform Tests, and this is something I’ll try to get done in the future. It’s possible to write involved redirect tests, but it does not seem possible yet to trigger garbage collection from a Web Platform Test (we use a privileged Gecko API to do so in our tests).

Conclusion

Nothing about MediaStreamTrackAudioSourceNode is inherently difficult or complex, but it still took a very long time to specify and implement. Looking back, we can isolate a few problems that caused this simple feature to take a very long time to be finished and shipped.

First, there are very few people that are interested in working on standards, and amongst them even fewer people have an interest in working on fixing historical problems in a specification. It’s the eternal problem of maintenance of existing work: it’s always more interesting to work on new things, but it’s more often than not more useful to work on older things that have bugs. Fortunately, everybody can comment on a W3C spec, and the process invites a number of groups, with specific skill sets, to do formal reviews of the work, when it’s considered advanced enough.

Then, we’ve seen that this is a feature that is somewhat important, but not important enough to be fixed quickly. To come back to our problem here, if a particular use case required routing a specific MediaStreamTrack to a Web Audio API graph, it was always possible to create a new MediaStream, find the track in the initial MediaStream that needs to be routed to the graph, clone it, add the clone to the new MediaStream, and use MediaStreamAudioSourceNode. It’s nicer, however, not to have to do that.
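The workaround described above can be sketched as follows. This is an illustration under a few assumptions, not a complete polyfill: sourceForTrack is a hypothetical name, the MediaStreamCtor parameter exists only so the sketch can be exercised outside a browser, and the feature check on createMediaStreamTrackSource is simply one way of preferring the new node when it is available.

```javascript
// Hypothetical fallback (sketch only): when the new node is missing, wrap a
// clone of the wanted track in a single-track MediaStream and use the older
// MediaStreamAudioSourceNode on it.
function sourceForTrack(ctx, stream, trackId, MediaStreamCtor = globalThis.MediaStream) {
  // Find the track that needs to be routed to the graph.
  const track = stream.getAudioTracks().find((t) => t.id === trackId);
  if (track === undefined) {
    throw new Error(`No audio track with id ${trackId}`);
  }
  // Prefer the dedicated node when the implementation has it.
  if (typeof ctx.createMediaStreamTrackSource === "function") {
    return ctx.createMediaStreamTrackSource(track);
  }
  // Otherwise, a stream with exactly one audio track leaves no ambiguity
  // about which track MediaStreamAudioSourceNode will pick.
  const singleTrackStream = new MediaStreamCtor();
  singleTrackStream.addTrack(track.clone());
  return ctx.createMediaStreamSource(singleTrackStream);
}
```

In a browser, the last parameter can be left out. Note that the fallback path creates a clone of the track, which has its own lifetime and would need to be stopped separately from the original.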

On the implementation part, it is very often much faster to finish things without waiting. Bit rot will cause lots of churn, sometimes rebases are not trivial. Forgetting about the code and re-learning it to finish a patch is a waste of time, but sometimes you have no choice: I’m certainly not going to finish a feature that is nice to have when I have a couple crashes with security implications to work on.

And finally, the web is always evolving: the thing that was important to test here (the behaviour of cross-origin resources during redirects) was made impossible by another platform change (for the better, this time around). I probably spent a few hours trying to consider all possible cross-origin status changes, to make sure we were not leaving a little hole that an attacker could have used to exfiltrate private data.

Things take time, it’s fine.

1. This is where most of the standards discussion used to happen, before we switched to using GitHub issues. We still use it for official announcements, such as details for meetings, working draft releases, etc. The archive is available publicly at https://lists.w3.org/Archives/Public/public-audio/
2. Gecko had an implementation at the time, and we shipped the Web Audio API in Firefox 25, which was released exactly 11 days after Stefan’s post to the list. That is to say, the code was already written.
3. In Web Audio API spec lingo, this is what we call the methods that are on the BaseAudioContext, such as createGain, as opposed to the constructors, such as new GainNode(ctx, {gain: 0.3}).
4. In Mozilla jargon, this refers to a patch or a set of patches that does not apply anymore because the underlying code has changed.
5. We use reference counting heavily in Firefox. This was a case of a cycle of references between objects that weren’t registered with the cycle collector. Here, the objects were still alive at shutdown (instead of being collected a lot earlier), and this kind of error makes the tests fail.