Add MediaStreamTrack voice activity detection support. #153

Closed

Conversation

@jianjunz (Member) commented Jun 20, 2024

This change adds support for voice activity detection (VAD) on audio MediaStreamTracks. It is only enabled when the voiceActivityDetection constraint is set to true.

With the voiceactivitydetected event, web applications can show a notification when the user is speaking while the audio track is muted.

Fixes #145.
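For illustration, here is a minimal sketch of how a page might use the proposed constraint and event. The constraint and event names are the ones proposed in this change; the handler shape and the helper function are assumptions.

```js
// Sketch only: 'voiceActivityDetection' and 'voiceactivitydetected' are the
// names proposed in this change; the exact event interface is an assumption.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { voiceActivityDetection: true },
});
const [track] = stream.getAudioTracks();
track.enabled = false; // application-level mute

track.addEventListener('voiceactivitydetected', () => {
  if (!track.enabled) {
    // Hypothetical app helper: tell the user they are speaking while muted.
    showSpeakingWhileMutedNotice();
  }
});
```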


@youennf (Contributor) commented Jun 21, 2024

I wonder whether this should actually be at the MediaStreamTrack level.
Maybe we do not need a constraint either.

Given that this event would fire when the track is muted, the goal would be to unmute the track, which would be done via the MediaSession API. Moving this API to MediaSession makes some sense.

Maybe all we need is a new MediaSession voiceActivity action.
Registering a handler for this action would kick in the necessary UA logic to trigger it.
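For illustration, registering such an action might look like the sketch below; the "voiceactivity" action name is only a suggestion here and is not part of the Media Session spec.

```js
// Hypothetical action name, per the suggestion above; not in the spec today.
navigator.mediaSession.setActionHandler('voiceactivity', () => {
  // The UA would invoke this while a live, muted microphone capture detects
  // voice; the page could then prompt the user to unmute.
  promptUserToUnmute(); // hypothetical app helper
});
```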

@jan-ivar, @guidou, thoughts?

@jianjunz (Member Author) commented

I'm wondering when this action should be triggered if voiceActivity is moved from MediaStreamTrack to MediaSession:

  1. voice activity detection on the default audio input device
  2. voice activity detection on any audio input device
  3. voice activity detection on any audio input device with a MediaStreamTrack created

Options 1 and 2 may have privacy issues because users may not want applications to know about their behavior before granting the "microphone" permission.

With the current AudioWorklet approach, applications can know which track has voice activity. I personally believe applications only want to detect voice activity for a microphone with a MediaStreamTrack created and muted, but I'm not sure whether any application applies VAD to other audio tracks.
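For reference, a minimal sketch of that AudioWorklet approach, using a naive energy threshold rather than a real VAD algorithm; the module name, message, and threshold are all illustrative.

```js
// vad-processor.js (AudioWorkletGlobalScope): naive energy-based detector.
class VadProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0];
    if (channel) {
      let sum = 0;
      for (let i = 0; i < channel.length; i++) sum += channel[i] * channel[i];
      const rms = Math.sqrt(sum / channel.length);
      if (rms > 0.02) this.port.postMessage('voiceactivity'); // arbitrary threshold
    }
    return true; // keep the processor alive
  }
}
registerProcessor('vad-processor', VadProcessor);

// main.js: attach the processor to a microphone track's stream.
const audioContext = new AudioContext();
await audioContext.audioWorklet.addModule('vad-processor.js');
const source = audioContext.createMediaStreamSource(stream); // from getUserMedia
const vadNode = new AudioWorkletNode(audioContext, 'vad-processor');
vadNode.port.onmessage = () => console.log('voice activity on this track');
source.connect(vadNode);
```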

@youennf (Contributor) commented Jun 26, 2024

The privacy story should be the same whatever the API shape.
I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

If we want to support multi-microphone cases, a deviceId could be exposed within MediaSessionActionDetails.
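As a sketch of that multi-microphone idea (both the "voiceactivity" action and the deviceId field on the details are hypothetical):

```js
// Hypothetical: neither 'voiceactivity' nor details.deviceId is in the
// Media Session spec; this only illustrates the suggestion above.
navigator.mediaSession.setActionHandler('voiceactivity', (details) => {
  // details.deviceId would identify which muted microphone detected voice.
  highlightUnmuteControlFor(details.deviceId); // hypothetical app helper
});
```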

I personally believe applications only want to detect voice activity for a microphone with a MediaStreamTrack created and muted

Agreed for the scope of this specific API.

@jan-ivar (Member) commented

I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

Moving this to media session makes sense to me as well.

@youennf (Contributor) commented Jun 27, 2024

@steimelchrome FYI

@youennf (Contributor) commented Jun 27, 2024

@jianjunz, would you be ok drafting a PR in the MediaSession WG?
I can take over if you prefer.

@guidou commented Jun 27, 2024

Since this is intended to help the user unmute via the unmute button in the app, which would be done via MediaSession, it makes sense for this notification to come via MediaSession.
Given that this is largely a MediaSession thing, I don't think we should require that a MediaStreamTrack be muted (although it most likely will be).

@bradisbell commented

I do not think there is any sense in moving this to MediaSession. There are far more use cases for voice activity detection beyond letting the user know that they may be muted. A couple of use cases I would implement immediately if this API were available:

  • During a long recording session, adding metadata to the recording indicating when the user may have been speaking. (Think of situations like overdubbing commentary.)
  • Triggering transcription/translation only when someone is speaking.
  • Alerting remote users that someone is speaking. I currently do this in a live production scenario based on audio peak meters, and by waving at the camera until the remote production folks see it and unmute. It would be nice to detect speaking and visually highlight that remote camera, for example by making it blink, in addition to the usual blunter cues.

These use cases and others like them rely on voice activity detection firing on the track.

Besides, even if it were moved to MediaSession, choosing the right capture track to trigger on is not possible at the user agent level. It's not uncommon to have several capture tracks. The relevant captured track might even be "remote" (think of cases where a second local device/screen/camera/mic is set up, connected via WebRTC but right there in the room). Only the application truly knows which is which.

@youennf (Contributor) commented Jun 27, 2024

There are far more use cases for voice activity detection beyond letting the user know that they may be muted

This was discussed during the WebRTC WG meeting, and we think there are two use cases that deserve two different solutions.

The first use case is allowing the user to unmute when talking while muted. This PR is about this specific issue, and moving it to MediaSession seems good.

The second use case, which you seem more interested in, is exposing whether a live unmuted track contains voice.
This needs more thought, as firing an event will always be more or less out of sync with the audio data.
It can already be implemented with an AudioWorklet (though less efficiently), where the extracted data stays in sync with the audio. This use case seems more tied to MediaStreamTrack than to MediaSession.

@jianjunz (Member Author) commented

@jianjunz, would you be ok drafting a PR in the MediaSession WG? I can take over if you prefer.

Sure, I'll create a PR in the MediaSession WG. Thanks.

@jianjunz (Member Author) commented

Closing this one as it has moved to the MediaSession spec (pr333).

@jianjunz closed this Jun 28, 2024