Add MediaStreamTrack voice activity detection support. #153

Closed

Conversation

@jianjunz (Member) commented Jun 20, 2024

This change adds support for voice activity detection (VAD) on audio MediaStreamTracks. It is only enabled when the voiceActivityDetection constraint is set to true.

With the voiceactivitydetected event, web applications can show a notification when the user is speaking while the audio track is muted.

Fixes #145.
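For illustration, here is a minimal sketch of how a page might use the proposed constraint and event. The constraint and event names are the ones proposed in this change; the handler shape and the helper function are assumptions.

```js
// Sketch only: 'voiceActivityDetection' and 'voiceactivitydetected' are the
// names proposed in this change; the exact event interface is an assumption.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { voiceActivityDetection: true },
});
const [track] = stream.getAudioTracks();
track.enabled = false; // application-level mute

track.addEventListener('voiceactivitydetected', () => {
  if (!track.enabled) {
    // Hypothetical app helper: tell the user they are speaking while muted.
    showSpeakingWhileMutedNotice();
  }
});
```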


@youennf (Contributor) commented Jun 21, 2024

I wonder whether this should actually be at the MediaStreamTrack level.
Maybe we do not need a constraint either.

Given that this event would fire when the track is muted, the goal would be to unmute the track, which would be done via the MediaSession API. Moving this API to MediaSession makes some sense.

Maybe all we need is a new MediaSession voiceActivity action.
Registering a handler for this action would kick in the necessary UA logic to trigger it.
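For illustration, registering such an action might look like the sketch below; the "voiceactivity" action name is only a suggestion here and is not part of the Media Session spec.

```js
// Hypothetical action name, per the suggestion above; not in the spec today.
navigator.mediaSession.setActionHandler('voiceactivity', () => {
  // The UA would invoke this while a live, muted microphone capture detects
  // voice; the page could then prompt the user to unmute.
  promptUserToUnmute(); // hypothetical app helper
});
```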

@jan-ivar, @guidou, thoughts?

@jianjunz (Member Author) commented

I'm wondering when this action should be triggered if voiceActivity is moved from MediaStreamTrack to MediaSession:

  1. voice activity detection on the default audio input device
  2. voice activity detection on any audio input device
  3. voice activity detection on any audio input device with a MediaStreamTrack created

Options 1 and 2 may have privacy issues because users may not want applications to know about their behavior before granting the "microphone" permission.

With the current AudioWorklet approach, applications can know which track has voice activity. I personally believe applications only want to detect voice activity for a microphone with a MediaStreamTrack created and muted, but I'm not sure whether any application applies VAD to other audio tracks.
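For reference, a minimal sketch of that AudioWorklet approach, using a naive energy threshold rather than a real VAD algorithm; the module name, message, and threshold are all illustrative.

```js
// vad-processor.js (AudioWorkletGlobalScope): naive energy-based detector.
class VadProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0];
    if (channel) {
      let sum = 0;
      for (let i = 0; i < channel.length; i++) sum += channel[i] * channel[i];
      const rms = Math.sqrt(sum / channel.length);
      if (rms > 0.02) this.port.postMessage('voiceactivity'); // arbitrary threshold
    }
    return true; // keep the processor alive
  }
}
registerProcessor('vad-processor', VadProcessor);

// main.js: attach the processor to a microphone track's stream.
const audioContext = new AudioContext();
await audioContext.audioWorklet.addModule('vad-processor.js');
const source = audioContext.createMediaStreamSource(stream); // from getUserMedia
const vadNode = new AudioWorkletNode(audioContext, 'vad-processor');
vadNode.port.onmessage = () => console.log('voice activity on this track');
source.connect(vadNode);
```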

@youennf (Contributor) commented Jun 26, 2024

The privacy story should be the same whatever the API shape.
I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

If we want to support multi-microphone cases, a deviceId could be exposed within MediaSessionActionDetails.
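As a sketch of that multi-microphone idea (both the "voiceactivity" action and the deviceId field on the details are hypothetical):

```js
// Hypothetical: neither 'voiceactivity' nor details.deviceId is in the
// Media Session spec; this only illustrates the suggestion above.
navigator.mediaSession.setActionHandler('voiceactivity', (details) => {
  // details.deviceId would identify which muted microphone detected voice.
  highlightUnmuteControlFor(details.deviceId); // hypothetical app helper
});
```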

I personally believe applications only want to detect voice activity for a microphone with a MediaStreamTrack created and muted

Agreed for the scope of this specific API.

@jan-ivar (Member) commented

I agree with having a voiceActivity MediaSession action only for contexts that have live (and muted) microphone MediaStreamTracks.

Moving this to media session makes sense to me as well.

@youennf (Contributor) commented Jun 27, 2024

@steimelchrome FYI

@youennf (Contributor) commented Jun 27, 2024

@jianjunz, would you be ok drafting a PR in the MediaSession WG?
I can take over if you prefer.

@guidou commented Jun 27, 2024

Since this is intended to help the user unmute via the unmute button in the app, which would be done via MediaSession, it makes sense for this notification to come via MediaSession.
Given that this is largely a MediaSession thing, I don't think we should require that a MediaStreamTrack be muted (although it most likely will be).

@bradisbell commented

I do not think there is any sense in moving this to MediaSession. There are far more use cases for voice activity detection beyond letting the user know that they may be muted. A couple of use cases I would implement immediately if this API were available:

  • During a long recording session, adding metadata to the recording indicating when the user may have been speaking. (Think of situations like overdubbing commentary.)
  • Triggering transcription/translation only when someone is speaking.
  • Alerting remote users that someone is speaking. I currently do this in a live production scenario based on audio peak meters, and by waving at the camera until the remote production folks see it and unmute. It would be nice to detect speaking and visually highlight that remote camera, for example by making it blink, in addition to the usual blunter cues.

These use cases and others like them rely on voice activity detection firing on the track.

Besides, even if it were moved to MediaSession, choosing the right capture track to trigger on is not possible at the user agent level. It's not uncommon to have several capture tracks. The relevant captured track might even be "remote" (think of cases where a second local device/screen/camera/mic is set up, connected via WebRTC but right there in the room). Only the application truly knows which is which.

@youennf (Contributor) commented Jun 27, 2024

There are far more use cases for voice activity detection beyond letting the user know that they may be muted

This was discussed during the WebRTC WG meeting, and we think there are two use cases that deserve two different solutions.

The first use case is allowing the user to unmute when talking while muted. This PR is about this specific issue, and moving it to MediaSession seems good.

The second use case, which you seem more interested in, is exposing whether a live unmuted track contains voice.
This needs more thought, as firing an event will always be more or less out of sync with the audio data.
It can already be implemented with an AudioWorklet (though less efficiently), where the extracted data stays in sync with the audio. This use case seems more tied to MediaStreamTrack than to MediaSession.

@jianjunz (Member Author) commented

@jianjunz, would you be ok drafting a PR in the MediaSession WG? I can take over if you prefer.

Sure, I'll create a PR in the MediaSession WG. Thanks.

@jianjunz (Member Author) commented

Closing this one as it has moved to the MediaSession spec (pr333).

@jianjunz closed this Jun 28, 2024