Firefox Nightly now has an implementation of MediaStreamTrackAudioSourceNode. It’s very similar to MediaStreamAudioSourceNode, and might even be poly-fillable on implementations that don’t have this new AudioNode. It’s not a particularly complicated feature. However, it can serve as a demonstration of how a new feature is added to the Web Platform (at least in the Audio Working Group), and as an opportunity to touch upon the problems browser engineers and standards people can run into.
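To give an idea, here is a minimal sketch of what such a polyfill could look like (it assumes the spec’s behaviour of throwing for non-audio tracks, and relies on wrapping the track in a single-track MediaStream, a workaround detailed at the end of this post):

if (!("createMediaStreamTrackSource" in AudioContext.prototype)) {
  AudioContext.prototype.createMediaStreamTrackSource = function (track) {
    // The real node rejects tracks that are not of kind "audio".
    if (track.kind !== "audio") {
      throw new DOMException('track is not of kind "audio"', "InvalidStateError");
    }
    // Wrap the single track in a fresh MediaStream, and fall back to the older
    // MediaStreamAudioSourceNode, which now has exactly one track to pick from.
    return this.createMediaStreamSource(new MediaStream([track]));
  };
}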
In this post, I’ll detail the standard process and the Firefox implementation of this feature, then try to shed some light on why such a simple feature is needed in the first place, and why it took so long to finish and ship.
Why we really needed this, and the backstory
When the Web Audio API specification first dropped in early 2011, there wasn’t anything to connect a MediaStream to a processing graph built using the Web Audio API. This was a big problem: MediaStreams are the backbone of real-time media routing on the Web, and can be the input or output of WebRTC PeerConnections, MediaRecorder, audio device input via getUserMedia, and various other bits of the Web Platform.
It was always possible to render the output of a MediaStream using an HTMLMediaElement (either a <video> or <audio> tag), but the only processing one could do to the audio stream was perhaps to change its output volume. When richer audio processing is needed in a Web application, using an AudioContext is always the answer. At the time, the two constructs couldn’t be used together.
A year later, in the 2012 Working Draft, a new AudioNode appeared, called MediaStreamAudioSourceNode, that allowed a Web Audio API processing graph to have a MediaStream as an input. It was described like so:
This interface represents an audio source from a MediaStream. The first AudioMediaStreamTrack from the MediaStream will be used as a source of audio.
With a factory method, in WebIDL, that looked like this:
partial interface AudioContext {
  MediaStreamAudioSourceNode createMediaStreamSource(MediaStream mediaStream);
};
This was all well and good, and this node has been absolutely instrumental to the success of the Web Audio API, along with its counterpart, MediaStreamAudioDestinationNode, which does the opposite: from a Web Audio API graph, you can get a MediaStream that contains the audio generated by the graph, which you can then use to do something else.
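As a rough sketch of how the two nodes fit together (assuming code running in an async function, and that the user grants microphone access):

const ctx = new AudioContext();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Route the microphone into the processing graph...
const source = ctx.createMediaStreamSource(stream);
const gain = new GainNode(ctx, { gain: 0.5 });
// ...and back out as a MediaStream, usable with e.g. MediaRecorder or a PeerConnection.
const destination = ctx.createMediaStreamDestination();
source.connect(gain).connect(destination);
const processedStream = destination.stream;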
But then we received this message on the public-audio W3C mailing list 1 in October 2013, from Stefan Håkansson LK from Ericsson:
Hi,
I have a comment triggered by the new public draft being announced:
In section 4.24 it is said that “The first AudioMediaStreamTrack from the MediaStream will be used as a source of audio.”
Which audio track that is the first one is undefined (in the case when there is more than one). I think it would make much more sense to deal directly with a MediaStreamTrack (of kind audio) in this API (which should as a consequence be renamed to " MediaStreamTrackAudioSourceNode”).
And it would probably make sense to rename “MediaStreamAudioDestinationNode” to “MediaStreamTrackAudioDestinationNode”, and have it produce a MediaStreamTrack.
Br,
Stefan
Now, this makes a lot of sense: looking at the specification that defines MediaStream, Media Capture and Streams, we can read:
The tracks of a MediaStream are stored in a track set. The track set MUST contain the MediaStreamTrack objects that correspond to the tracks of the stream. The relative order of the tracks in the set is User Agent defined and the API will never put any requirements on the order. The proper way to find a specific MediaStreamTrack object in the set is to look it up by its id.
So we had the Web Audio API spec saying something that was simply incorrect, and two implementations (and three browsers; at the time, Safari and Chromium were both based on WebKit) had implemented it 2. That was great.
It took a few GitHub issues (ported from the W3C Bugzilla instance) to sort the situation out.
In issue #132, an alternative design was proposed. Instead of picking a hypothetical “first track”, the MediaStreamAudioSourceNode would output all tracks, each one on a different output. However, it was quickly noticed that this wouldn’t solve the problem completely, as there would be no way for authors to know which input track in the MediaStream matched which output of the AudioNode. In the Web Audio API, the order of the outputs is meaningful and based on a numeric index: the first output of a ChannelSplitterNode is always the first channel of its single, possibly multichannel, input, so the mapping is straightforward. The order of the tracks in a MediaStream is not defined, and is based on identifiers, which are most of the time UUIDs (this is recommended by the spec but not mandatory), that do not map naturally to outputs. This design was abandoned.
It felt easier to simply let authors enumerate the tracks on their MediaStream, pick the right one for their use case, and have only this track routed to the audio graph: this lets the browser engine perform optimizations on the other tracks, such as doing absolutely nothing with them if possible (which is the best optimization one can do).
In issue #264, it was decided that the solution would be two-fold:
- Create a new AudioNode, MediaStreamTrackAudioSourceNode, that would let authors choose which track was to be routed to the AudioContext (see the sketch after this list);
- Specify correctly which track would be picked when creating a MediaStreamAudioSourceNode.
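In practice, using the new node looks something like this (here, stream is assumed to be a MediaStream holding several audio tracks, and the way the right track is found, by label, is purely illustrative):

const ctx = new AudioContext();
const tracks = stream.getAudioTracks();
// Pick the track that matters for this use case; the predicate is up to the author.
const track = tracks.find((t) => t.label.includes("Microphone")) || tracks[0];
// Only this track is routed to the graph; the engine can ignore the others.
const source = ctx.createMediaStreamTrackSource(track);
source.connect(ctx.destination);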
The first bit was easy, done in PR #982. Of course, it was initially incorrect when merged: I typo-ed the factory method 3, and then added the fact that trying to create a MediaStreamTrackAudioSourceNode should throw when passed a track that is not of kind "audio". But after those fixups, it was done and correct. Usually, those mistakes are caught during review, but not this time, it seems.
The second bit was not too hard, but a bit arbitrary, which is often a bit annoying in software, where we like to know why things happen or are the way they are. We wanted to have a way (any way, really) to pick the first track, in a manner that would be stable across implementations. We ended up ordering them by their id property. Now, this works, but sorting strings is a surprisingly complicated affair: you can’t just wing it and hope to be correct when you’re a programmer that mostly deals with audio. When trying to move a W3C specification to a state of Candidate Recommendation, a document has to be reviewed by a few groups, for example for security or privacy issues, but also by the internationalization working group (usually referred to by its abbreviation, i18n). We don’t deal with text too much in the Web Audio API, but they did find that we were trying to sort things “alphabetically”, which makes little sense in reality, and is very vague.
Quoting the review:
Sorting alphabetically seems to be underspecified. Given that MediaStreamTrack id attributes are recommended to be UUIDs and that the “alphabetic” sorting appears to be an arbitrary way to prioritize the item selection, defining the comparison as ASCII case-insensitive might be a good choice. However, since there is no actual restriction on what the values can be, I’d suggest that you replace “alphabetically” with ordering by code point (which means that it is case-sensitive, please note)
I ended up doing that in PR #1811, by specifying that the tracks were to be sorted by Unicode code point value, which is the way the sort method in ECMAScript sorts things. When picking things arbitrarily, it’s often best to do the same thing as elsewhere.
The old and incorrect prose for MediaStreamAudioSourceNode was now correct and well defined: on construction, a sort is performed on the tracks, based on their id property (which is a string), and the first element of the resulting collection is picked as the input.
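Observably, the track that MediaStreamAudioSourceNode now picks can be computed like this (a sketch; since ids are in practice ASCII UUIDs, plain string comparison agrees with the code point ordering the spec asks for):

// getAudioTracks() returns a fresh array, so sorting it in place is safe.
const pickedTrack = stream
  .getAudioTracks()
  .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))[0];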
It only took 5 years from the time we became aware of the problem to having correct text. Granted, if we were to tally the cumulative hours spent working on this particular feature, they would amount to no more than a dozen, spent pinpointing the issue, finding a viable solution, writing and reviewing the spec text, tweaking the prose, performing fixes, and reviewing them.
This is a prime example of why things like this take time: the W3C Audio Working Group was certainly not inactive during this time (other things, like AudioWorklet, or other more pressing matters, took precedence). There were simply too many other things on our plate to finish this task in a timely manner.
The implementation part
Most of the implementation work wasn’t done by me: I was mentoring students from the Lyon 1 University in Lyon, France, where Matthieu Moy leads a school project in which students work on open source or free software projects. I’ve been doing that with him for a few years now; it’s great.
Sometimes multiple people work on a single bug in this project; sometimes it’s only a single person. What I know is that this was started in October 2017, and that the name on the patch is Léo Paquet. Apologies if other people worked on this feature!
This was worked on in Bug 1324548. I did the initial review pass, and my colleague Andreas Pehrson did most of the in-depth review work. There were tests, of course; there is no chance that a web-facing feature would be merged into Gecko without comprehensive tests. Those were written using Gecko’s Mochitest test harness. Mochitest, at the time, was the harness of choice when testing features that are exposed to the Web. These days, Web Platform Tests are a better option, but they still have a number of limitations.
Later, the school project ended, and the patch was somewhat abandoned.
I picked it up when my manager noticed that it seemed like we were sitting on a mostly finished patch with tests. Of course, things had become somewhat bit-rotten 4 and there was a MediaStreamTrack leaking at shutdown 5, so I fixed that, again with the help of Andreas, at the back of a conference room during a big meeting in Mozlando 2018.
He wanted a few more tests before landing, so I started doing that. Specifically, more testing around the security-sensitive bit of the patch: what happens if the MediaStreamTrack that is being routed to the AudioContext processing graph has content that is not same-origin? The content of the MediaStreamTrack has to be replaced by silence; otherwise, it’s a pretty severe cross-origin information leakage: a website could open a URL that is not from the same origin, and that has not agreed to be read from another origin, and inspect the contents of the audio file. We also tried to determine whether it was possible to have a source that was same-origin, that would then switch to being cross-origin, for example because of an HTTP redirect.
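A test for this could look roughly like the following sketch (the URL is hypothetical, captureStream() is assumed to be available on the media element, and a real test would need to wait for audio to actually flow through the graph before checking):

const audio = new Audio("https://cross-origin.example/clip.mp3"); // hypothetical URL
await audio.play();
// Capture the element's output and route one of its audio tracks into the graph.
const [track] = audio.captureStream().getAudioTracks();
const ctx = new AudioContext();
const source = ctx.createMediaStreamTrackSource(track);
const analyser = new AnalyserNode(ctx);
source.connect(analyser);
// The cross-origin content must have been replaced by silence: all zeros.
const samples = new Float32Array(analyser.fftSize);
analyser.getFloatTimeDomainData(samples);
console.assert(samples.every((s) => s === 0), "cross-origin audio must be silent");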
For a long time, in Firefox, such a mid-flight switch was possible. This is another good story, outlined in Jake Archibald’s post called “I found a browser bug”. In current Firefox, since bug 1443942 landed, mid-flight redirects to cross-origin destinations are blocked when loading media resources, so this is not possible anymore. I do believe Firefox would have been safe (read: would have silenced the cross-origin media dynamically after the redirect) even if this were still possible: the cross-origin status of media sources is dynamically updated on the MediaStreamTrackAudioSourceNode, from its source.
Another test was to check whether everything continues working correctly if all the references to the input MediaStreamTrack were to be dropped by the JavaScript code. Historically, this was a bug that Firefox had: if you were to use a MediaStreamAudioSourceNode with, say, a MediaStream that came from a getUserMedia call, and you dropped all references in the code, the MediaStream would suddenly be collected, and you would stop getting input data in the MediaStreamAudioSourceNode. This is why you still see authors sticking expandos with MediaStreams from getUserMedia calls on global objects to work around this Firefox bug, which was fixed in bug 934512.
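The workaround in question looked something like this (keepAlive is an arbitrary expando name, and the snippet assumes an async context):

const ctx = new AudioContext();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = ctx.createMediaStreamSource(stream);
source.connect(ctx.destination);
// Workaround for the old Firefox bug: park a reference on a global object so
// the MediaStream cannot be garbage-collected while the node still uses it.
window.keepAlive = stream;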
It would have been better to write Web Platform Tests, and this is something I’ll try to get done in the future. It’s possible to write involved redirect tests, but it does not yet seem possible to trigger garbage collection from a Web Platform Test (we use a privileged Gecko API to do so in our tests).
Conclusion
Nothing about MediaStreamTrackAudioSourceNode is inherently difficult or complex, but it still took a very long time to specify and implement. Looking back, we can isolate a few problems that caused this simple feature to take a very long time to be finished and shipped.
First, there are very few people that are interested in working on standards, and amongst them even fewer people have an interest in working on fixing historical problems in a specification. It’s the eternal problem of maintenance of existing work: it’s always more interesting to work on new things, but it’s more often than not more useful to work on older things that have bugs. Fortunately, everybody can comment on a W3C spec, and the process invites a number of groups, with specific skill sets, to do formal reviews of the work, when it’s considered advanced enough.
Then, we’ve seen that this is a feature that is somewhat important, but not important enough to be fixed quickly. To come back to our problem here: if a particular application needed to route a particular MediaStreamTrack to a Web Audio API graph, it was always possible to create a new MediaStream, find the track in the initial MediaStream that needed to be routed to the graph, clone it, add this clone to the new MediaStream, and use MediaStreamAudioSourceNode. It’s however nicer to not have to do that.
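Concretely, that workaround looked like this (wantedId standing in for whatever criterion identifies the right track):

// Find the track of interest in the original MediaStream...
const track = stream.getAudioTracks().find((t) => t.id === wantedId);
// ...clone it into a MediaStream of its own, and route that to the graph.
const singleTrackStream = new MediaStream([track.clone()]);
const source = ctx.createMediaStreamSource(singleTrackStream);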
On the implementation part, it is very often much faster to finish things without waiting. Bit rot will cause lots of churn, and sometimes rebases are not trivial. Forgetting about the code and re-learning it to finish a patch is a waste of time, but sometimes you have no choice: I’m certainly not going to finish a nice-to-have feature when I have a couple of crashes with security implications to work on.
And finally, the web is always evolving: the thing that was important to test here (the behaviour of cross-origin resources during redirects) was made impossible by another platform change (for the better, this time around). I probably spent a few hours trying to consider all possible cross-origin status changes, to make sure we were not leaving a little hole that an attacker could have used to exfiltrate private data.
Things take time, it’s fine.
-
This is where most of the standards discussion used to happen, before we switched to using GitHub issues. We still use it for official announcements, such as details for meetings, working draft releases, etc. The archive is available publicly at https://lists.w3.org/Archives/Public/public-audio/ ↩︎
-
Gecko had an implementation at the time, and we shipped the Web Audio API in Firefox 25, which was released exactly 11 days after Stefan’s post to the list. That is to say, the code was already written ↩︎
-
In Web Audio API spec lingo, this is how we call the methods that are on the BaseAudioContext, such as createGain, vs. the constructors, such as new GainNode(ctx, {gain: 0.3}) ↩︎
-
In Mozilla jargon, this refers to a patch or a set of patches that no longer applies because the underlying code has changed ↩︎
-
We use reference counting heavily in Firefox; this was a case of a cycle of references between objects that weren’t registered with the cycle collector. Here, the objects were still alive at shutdown (instead of being collected a lot earlier), and this kind of error makes the tests fail ↩︎