
Is MediaStreamTrackProcessor for audio necessary? #29

Open
youennf opened this issue Apr 30, 2021 · 19 comments

Comments


youennf commented Apr 30, 2021

Extracting this discussion from #4, since it was not fully discussed there.
The use cases for MediaStreamTrackProcessor for audio are unclear, given that its functionality largely overlaps with what WebAudio can do, and WebAudio is already widely deployed in all major browsers.


youennf commented Apr 30, 2021

CC @padenot


guidou commented May 1, 2021

The fact that there is overlap does not mean that we should not support it. After all, for video there is overlap with existing features as well. Also, while there is overlap, the MediaStreamTrackProcessor model is quite different from the AudioWorklet model.
The question is whether the MediaStreamTrackProcessor model is a better fit in some cases. I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:

  1. access to the original timestamps of the audio source
  2. better WebCodecs integration
  3. there are use cases that do not fit naturally with the clock-based synchronous processing model of AudioWorklet (e.g., applications with high CPU requirements but without strong latency requirements). The MediaStreamTrackProcessor model might be a better match in these cases.
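For illustration, a minimal sketch of the pull-based model under discussion, assuming the MediaStreamTrackProcessor shape proposed here, where audio frames arrive as WebCodecs AudioData (the `process()` helper is a hypothetical application function):

```js
// In a worker (inside an async function): pull AudioData chunks on demand.
// `audioTrack` is assumed to have been obtained or transferred beforehand.
const processor = new MediaStreamTrackProcessor({ track: audioTrack });
const reader = processor.readable.getReader();

while (true) {
  const { value: audioData, done } = await reader.read();
  if (done) break;
  // Point 1 above: audioData.timestamp carries the source's capture timestamp.
  // Point 2 above: audioData can be handed straight to a WebCodecs AudioEncoder.
  process(audioData); // hypothetical application processing, at its own pace
  audioData.close();
}
```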


aboba commented May 1, 2021

@guidou Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio. Among game developers using WebCodecs for both audio and video, symmetry is an important aspect (e.g. using WebCodecs decode as an MSE substitute).


guidou commented May 1, 2021

I fully agree that symmetry is also an important benefit for developers.


dogben commented May 3, 2021

One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html

@alvestrand

@youennf are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to https://developer.mozilla.org/en-US/docs/Web/API/ScriptProcessorNode (available in all browsers, but deprecated)?


youennf commented May 3, 2021

are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to

I am referring to AudioWorklet, which is available in Safari.


padenot commented May 3, 2021

The PR to adjust the compatibility data has just been merged and it appears MDN is slightly out of date: https://github.com/mdn/browser-compat-data/pull/10129/files#r621975812


youennf commented May 4, 2021

One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html

This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.
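For reference, a minimal sketch of that AudioWorklet pipeline (the module path and processor name are placeholders):

```js
// Main thread (inside an async function): route the microphone into an AudioWorklet.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
await audioContext.audioWorklet.addModule('analysis-processor.js'); // placeholder path
const source = new MediaStreamAudioSourceNode(audioContext, { mediaStream: stream });
const worklet = new AudioWorkletNode(audioContext, 'analysis-processor');
source.connect(worklet);
```

```js
// analysis-processor.js — runs on the real-time audio thread in 128-frame quanta.
class AnalysisProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]; // array of Float32Array channels, 128 samples each
    // ... analyze or forward the samples here, e.g. via this.port.postMessage ...
    return true; // keep the processor alive
  }
}
registerProcessor('analysis-processor', AnalysisProcessor);
```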

Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio.

Can you clarify which WebCodecs origin trial API they are primarily using for audio? Is it MediaStreamTrackProcessor?


youennf commented May 4, 2021

I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:

Thanks @guidou, this is helpful for identifying the shortcomings of AudioWorklet.
Based on that, we should indeed either improve WebAudio support (including the API) or envision alternatives.

What was asked for in the past is a pros-and-cons comparison of AudioWorklet vs. an audio MediaStreamTrackProcessor.
So far, it seems that MediaStreamTrackProcessor could be shimmed with AudioWorklet.


youennf commented May 4, 2021

I fully agree that symmetry is an important benefit too for developers.

The WebAudio API is very different from a rendering API like Canvas/OffscreenCanvas, and for good reasons: it was designed to solve a specific problem in the best possible way.

By trying to build a single API for both audio and video, we miss the opportunity to build the best API dedicated to video.
Symmetry is not always a good friend.


youennf commented May 4, 2021

There are some known advantages of using AudioWorklet over MediaStreamTrackProcessor.
With AudioWorklet, an application is able to implement its own buffering strategy and choose the best way to present data for processing.

For instance, an application might want to start by processing 10 ms chunks and buffer 5 of them.
At some point though, to cope with networking, the application will switch to 50 ms chunks and increase buffering to 5 chunks of 50 ms.

This is not easily doable with MediaStreamTrackProcessor: maxBufferSize is fixed at construction time, and the audio frame size is not under the application's control.
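A rough sketch of the chunking half of such a strategy inside an AudioWorkletProcessor (mono, single input; the buffering of N chunks would sit on the receiving side; names and the message shape are illustrative):

```js
// chunking-processor.js — re-chunk the fixed 128-frame render quanta into
// application-chosen chunk sizes.
class ChunkingProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.setChunkSeconds(0.01); // start with 10 ms chunks
    // The application can switch at runtime, e.g. node.port.postMessage({ chunkSeconds: 0.05 }).
    this.port.onmessage = ({ data }) => this.setChunkSeconds(data.chunkSeconds);
  }
  setChunkSeconds(seconds) {
    this.chunkFrames = Math.round(sampleRate * seconds); // sampleRate is a worklet global
    this.buffer = new Float32Array(this.chunkFrames);
    this.written = 0; // any partially filled chunk is discarded on resize
  }
  process(inputs) {
    const channel = inputs[0][0];
    if (!channel) return true;
    for (const sample of channel) {
      this.buffer[this.written++] = sample;
      if (this.written === this.chunkFrames) {
        this.port.postMessage(this.buffer.slice(0)); // deliver a full chunk to the app
        this.written = 0;
      }
    }
    return true;
  }
}
registerProcessor('chunking-processor', ChunkingProcessor);
```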


dogben commented May 4, 2021

One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html

This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.

Apologies if I'm missing something obvious, but it doesn't seem possible to process both the audio and video inputs in an AudioWorklet. Nor does it seem possible for the audio data to be obtained outside of the AudioWorklet so that the audio and video can be processed together in a regular worker.


youennf commented May 4, 2021

We can share the audio data with a regular worker through SharedArrayBuffer if possible, and postMessage otherwise. I was referring to the audio part of the use case; I agree the video part deserves a better API than canvas.
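A sketch of the postMessage route, for illustration: wire the worklet's port to the regular worker through a MessageChannel so audio chunks don't have to hop through the main thread (the worker script name is hypothetical, and `workletNode` is assumed to be an existing AudioWorkletNode):

```js
// Main thread: connect the AudioWorklet to the worker that also handles the video.
const worker = new Worker('combined-processing.js'); // hypothetical worker script
const { port1, port2 } = new MessageChannel();
workletNode.port.postMessage({ audioPort: port1 }, [port1]);
worker.postMessage({ audioPort: port2 }, [port2]);

// In the AudioWorkletProcessor, keep the port received in onmessage and post
// each chunk to it:  audioPort.postMessage(chunk, [chunk.buffer]);
// In the worker, audioPort.onmessage delivers the audio to combine with the video.
```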


guidou commented Jun 18, 2021

I think the question of whether something is necessary is the wrong one to ask, since arguably, nothing is necessary.
For example, using getUserMedia+MediaStreamAudioSourceNode+AudioWorklet plus some video processing API (such as MediaStreamTrackProcessor/MediaStreamTrackGenerator) in this context would be a lot more difficult than having a symmetric API for audio and video.
For starters, SharedArrayBuffer requires cross-origin isolation. Setting up MediaStreamAudioSourceNode+AudioWorklet on one hand and video processing somewhere else, using completely different APIs with different programming models, adds even more friction.
Moreover, the unique advantages offered by AudioWorklet (e.g., real-time thread) do not apply to this specific use case.

I think this shows that there is real value in adding an audio version of the same API used for video.
Keeping the bug open to continue the discussion.

@alvestrand

It's possible to add controls for sample size and buffer size to MediaStreamTrackProcessor if that's a requested feature.
It isn't part of the minimal surface, but where to put the controls is obvious; raw audio data is easy to re-chunk.
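For illustration, re-chunking on the application side might look roughly like this, assuming audio frames arrive as WebCodecs AudioData (the chunk size and names are arbitrary):

```js
// Accumulate AudioData frames into fixed-size Float32 chunks (first channel only).
const CHUNK_FRAMES = 480; // e.g. 10 ms at 48 kHz
let pending = new Float32Array(CHUNK_FRAMES);
let filled = 0;

function rechunk(audioData, emitChunk) {
  const samples = new Float32Array(audioData.numberOfFrames);
  audioData.copyTo(samples, { planeIndex: 0, format: 'f32-planar' });
  audioData.close();
  for (const sample of samples) {
    pending[filled++] = sample;
    if (filled === CHUNK_FRAMES) {
      emitChunk(pending);
      pending = new Float32Array(CHUNK_FRAMES);
      filled = 0;
    }
  }
}
```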

@chrisguttandin

It was mentioned in this thread that a MediaStreamTrackProcessor for audio is necessary to synchronize audio and video when using WebCodecs.

But unless I missed something, it's probably still hard to accurately encode a MediaStream with WebCodecs even though there is a MediaStreamTrackProcessor for audio.

I tried to record the MediaStream coming from the user's mic and camera in Chrome v105. It was obtained in the simplest way:

```js
const mediaStream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: true
})
```

I then used a MediaStreamTrackProcessor for each MediaStreamTrack to get the AudioData and VideoFrame objects, respectively. However, the timestamp of the video seems to start at 0, whereas the timestamp of the audio starts at some arbitrary value.
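Roughly what that looks like (Chrome's MediaStreamTrackProcessor, inside an async function; error handling omitted):

```js
// Read the first frame from each track and compare their timestamps.
const [audioTrack] = mediaStream.getAudioTracks();
const [videoTrack] = mediaStream.getVideoTracks();
const audioReader = new MediaStreamTrackProcessor({ track: audioTrack }).readable.getReader();
const videoReader = new MediaStreamTrackProcessor({ track: videoTrack }).readable.getReader();

const { value: firstAudio } = await audioReader.read();
const { value: firstVideo } = await videoReader.read();

// The two timestamps come from clocks with different origins, so their
// difference alone does not say how the streams line up in real time.
console.log(firstAudio.timestamp, firstVideo.timestamp);
firstAudio.close();
firstVideo.close();
```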

I think this is all fine according to the spec, but it doesn't really help to synchronize the audio with the video. If I want to start the recording at a given point in time, which VideoFrame and which AudioData are the first ones I should pass to the encoder?

It would be nice to have a way of knowing the offset between the two timestamps. An API which says that AudioData.timestamp === 62169.819898 and VideoFrame.timestamp === 0.566633 represent the same point in time would be really helpful.

Also I guess this all becomes very tricky when the recording is long enough for the two streams to drift apart.


guidou commented Jan 30, 2024

FWIW, in Chrome, MSTP for audio is used 3X more than MSTP for video nowadays.


mehagar commented Mar 7, 2024

At Zoom we're currently using MediaStreamTrackProcessor for video, and WebAudio for audio (very similar to this pattern: https://developer.chrome.com/blog/audio-worklet-design-pattern#webaudio_powerhouse_audio_worklet_and_sharedarraybuffer).

It works, but there's a lot of complexity that comes with WebAudio and SharedArrayBuffers, and with handling the case where SharedArrayBuffer is not available. Having a MediaStreamTrackProcessor for audio would certainly simplify things.
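A sketch of the capability check that forces the two code paths mentioned above:

```js
// SharedArrayBuffer is only usable in cross-origin isolated contexts.
const canUseSharedArrayBuffer =
  typeof SharedArrayBuffer !== 'undefined' && self.crossOriginIsolated === true;

if (canUseSharedArrayBuffer) {
  // Fast path: a ring buffer shared between the AudioWorklet and a worker.
} else {
  // Fallback: copy audio out of the worklet with postMessage, at extra cost.
}
```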
