Add voiceactivity action. #333

Merged Jul 18, 2024 (7 commits). This view shows changes from 4 commits.
48 changes: 47 additions & 1 deletion index.bs
@@ -412,6 +412,11 @@ platform UI or media keys, thereby improving the user experience.
the action's intent is to open the media session in a
picture-in-picture window.
</li>
<li>
<dfn enum-value for=MediaSessionAction>voiceactivity</dfn>:
the action's intent is to notify the web page that a voice activity
Member

Suggested change
the action's intent is to notify the web page that a voice activity
the action's intent is to notify the web page that voice activity

Member Author

Done.

has been detected by the microphone.
</li>
</ul>
</p>

@@ -541,6 +546,30 @@ platform UI or media keys, thereby improving the user experience.
{{MediaSessionActionHandler}} before running, as different tasks, the
steps defined to [$set a track's muted state$].
</p>
<p>
A user agent MUST invoke the {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
Member

If "live" here is the same as {{MediaStreamTrackState/live}}, we could link them:

Suggested change
from a microphone with one or more live {{MediaStreamTrack}}s. A user
from a microphone with one or more {{MediaStreamTrackState/live}} {{MediaStreamTrack}}s. A user

Member Author

Done.

agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all
{{MediaStreamTrack}}s associated with the source are not
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a
minimal interval for invoking {{MediaSessionActionHandler}} for
Member (@chrisn, Jul 1, 2024)

I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked? And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations? (Not suggesting we spec these things, just clarify what we're recommending user agents to consider)

Member Author

> I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked?

It's intended to be a minimal interval between two voiceactivity actions (or events? "event" sounds more accurate here).

> And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations?

It actually depends on the voice activity detection (VAD) algorithm. Sometimes a VAD algorithm may even treat background noise as voice activity. Given the use case we want to target (unmute microphone notification), it is recommended not to invoke this action handler too frequently.

A new note section has been added with some background on this action.

Member

Suggested change
minimal interval for invoking {{MediaSessionActionHandler}} for
minimal interval between invocations of the {{MediaSessionActionHandler}} for

Member Author

Done.

{{MediaSessionAction/voiceactivity}} based on privacy and power efficiency
policies.
</p>
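
The "minimal interval between invocations" recommended above can be pictured with a small sketch. This is only an illustration of the idea, not the algorithm any user agent implements: the helper name `createIntervalLimiter`, the `minIntervalMs` parameter, the injectable `now` clock, and the 5000 ms value are all assumptions made for the example.

```javascript
// Hypothetical helper, not part of the Media Session spec: wraps a handler so
// that it runs at most once per `minIntervalMs` milliseconds. `now` is
// injectable so the behavior can be checked with a fake clock.
function createIntervalLimiter(handler, minIntervalMs, now = Date.now) {
  let lastInvocation = -Infinity;
  return function notify(details) {
    const t = now();
    if (t - lastInvocation < minIntervalMs) {
      return false; // Too soon after the previous invocation: drop it.
    }
    lastInvocation = t;
    handler(details);
    return true;
  };
}

// Usage with a fake clock and an arbitrary 5000 ms interval.
const calls = [];
let fakeTime = 0;
const notify = createIntervalLimiter(
  (details) => calls.push(details.action),
  5000,
  () => fakeTime
);
notify({ action: "voiceactivity" }); // invoked
fakeTime = 2000;
notify({ action: "voiceactivity" }); // dropped, only 2000 ms elapsed
fakeTime = 7000;
notify({ action: "voiceactivity" }); // invoked, 7000 ms since the last one
```

With a limiter like this, voice activity that stops and restarts within the interval produces a single invocation, which matches the behavior the author describes in the review thread.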

<p class=note>
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
Member

Suggested change
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
{{MediaSessionAction/voiceactivity}} only indicates the start of voice

Member Author

Done.

activity. Application may display a notification if the user is speaking
Member

Suggested change
activity. Application may display a notification if the user is speaking
activity. Applications may display a notification if the user is speaking

Member Author

Done.

while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for
audio processing. No action is defined for the end of a voice activity.
Member

Suggested change
audio processing. No action is defined for the end of a voice activity.
audio processing. No action is defined for the end of voice activity.

Member Author

Done.

Unlike other actions which are explicitely triggered by the user,
Member

Suggested change
Unlike other actions which are explicitely triggered by the user,
Unlike other actions which are explicitly triggered by the user,

Member Author

Done.

{{MediaSessionAction/voiceactivity}} also depends on the voice activity
detection algorithm of the user agent or the system. For privacy and power
efficiency concern, web page may not be notified if the second voice
activity started soon after last {{MediaSessionAction/voiceactivity}}
action.
Member

Suggested change
efficiency concern, web page may not be notified if the second voice
activity started soon after last {{MediaSessionAction/voiceactivity}}
action.
efficiency concerns, the web page may not be notified if voice
activity ends and restarts soon after the last {{MediaSessionAction/voiceactivity}}
action.

Member Author

Done.

</p>

<p class=note>
A page should only register a {{MediaSessionActionHandler}} for a <a>media
@@ -716,7 +745,8 @@ enum MediaSessionAction {
"hangup",
"previousslide",
"nextslide",
- "enterpictureinpicture"
+ "enterpictureinpicture",
+ "voiceactivity"
};

callback MediaSessionActionHandler = undefined(MediaSessionActionDetails details);
@@ -1496,6 +1526,7 @@ parameter whose dictionary type is:
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/nextslide}}.</li>
<li>{{MediaSessionActionDetails}} for
{{MediaSessionAction/enterpictureinpicture}}.</li>
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/voiceactivity}}.</li>
</ul>

The <dfn dict-member for="MediaSessionActionDetails">action</dfn>
Contributor

Maybe we could add an example that links voice activity with displaying a UI that can execute setMicrophoneActive(true).

Member Author

Done.

@@ -1807,6 +1838,21 @@ media session</a>.
</pre>
</div>

<div class="example" id="example-enterpictureinpicture">
Handling voice activity:
<pre class="lang-javascript">
// Create a MediaStream with audio enabled.
const stream = await navigator.mediaDevices.getUserMedia({audio:true});
const track = stream.getAudioTracks()[0];
navigator.mediaSession.setActionHandler("voiceactivity", function() {
  if (track.muted) {
    // Show unmute notification. If user allows to unmute, call
    // setMicrophoneActive(true) to unmute.
  }
});
</pre>
</div>
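
The reviewer's request to link voice activity to a UI that calls `setMicrophoneActive(true)` could be fleshed out roughly as follows. This is a sketch under assumptions, not text from the PR: `installVoiceActivityPrompt` and `showUnmutePrompt` are hypothetical application-level names, and the prompt is assumed to call its `onAccept` callback when the user agrees to unmute. The `mediaSession` and `track` objects are passed in as parameters so the wiring can be exercised outside a browser.

```javascript
// Hypothetical application-side wiring; not part of the spec or of this PR.
// `showUnmutePrompt(onAccept)` is assumed to display some UI and to call
// `onAccept` if the user chooses to unmute.
function installVoiceActivityPrompt(mediaSession, track, showUnmutePrompt) {
  mediaSession.setActionHandler("voiceactivity", () => {
    if (!track.muted) {
      return; // The track is not muted, so no prompt is needed.
    }
    // Only request an active microphone after the user has agreed.
    showUnmutePrompt(() => mediaSession.setMicrophoneActive(true));
  });
}

// In a page this might be wired up as (not executed here):
//   installVoiceActivityPrompt(navigator.mediaSession, track, showUnmutePrompt);
```

Keeping the `track.muted` check inside the handler mirrors the example in the diff: the page only reacts to voice activity when the user is speaking into a muted microphone.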

<h2 id="acknowledgments" class="no-num">Acknowledgments</h2>

The editors would like to thank Paul Adenot, Jake Archibald, Tab Atkins,