Capture, receive, and RTP timestamp concept definitions & normative requirements for gUM/gDM #156

Merged Oct 31, 2024 · 19 commits (diff shown from 17 commits)
index.html (108 changes: 106 additions & 2 deletions)
@@ -9,7 +9,7 @@
// See https://github.com/w3c/respec/wiki/ for how to configure ReSpec
var respecConfig = {
group: "webrtc",
xref: ["geometry-1", "html", "infra", "permissions", "dom", "image-capture", "mediacapture-streams", "webaudio", "webcodecs", "webidl"],
xref: ["geometry-1", "html", "infra", "permissions", "dom", "hr-time", "image-capture", "mediacapture-streams", "screen-capture", "webaudio", "webcodecs", "webidl"],
edDraftURI: "https://w3c.github.io/mediacapture-extensions/",
editors: [
{name: "Jan-Ivar Bruaroey", company: "Mozilla Corporation", w3cid: 79152},
@@ -58,6 +58,9 @@ <h2>Terminology</h2>
<p>The terms [=permission state=], [=request permission to use=], and
<a data-cite="permissions">prompt the user to choose</a> are defined in
[[!permissions]].</p>
<p>
{{Performance.now()}} is defined in [[!hr-time]].
</p>
</section>
<section id="conformance">
</section>
Expand Down Expand Up @@ -1151,7 +1154,108 @@ <h2>Constrainable Properties</h2>
</tbody>
</table>
</section>
- <section>
+ <section class="informative">
<h2>Video timestamp concepts</h2>
<p>
Video media flowing inside media stream tracks consists of a sequence of video frames, where
the frames are sampled from the media at instants spread out over time.
</p>
<p>
Each video frame must have a <dfn class="export">presentation timestamp</dfn>
which is relative to a source-specific origin.
A source of frames can define how this timestamp is set. A sink of frames
can define how this timestamp is used.
</p>
<p>
The timestamp is present so that sinks can define an absolute presentation timeline of the frames
relative to a clock reference, for example for playback.
</p>
<p>
Each frame may have an absolute <dfn class="export">capture timestamp</dfn> representing
the instant the frame capture process began, which is useful for example for
delay measurements and synchronization.
A source of frames can define how this timestamp is set, otherwise it is unset. A
sink of frames can define how this timestamp is used if set.
</p>
<p>
Each frame may have an absolute <dfn class="export">receive timestamp</dfn> representing
the last received timestamp of packets used to produce this video frame was received in its entirety.
aboba marked this conversation as resolved.
Show resolved Hide resolved
The timestamp is useful for example for network jitter measurements.
A source of frames can define how this timestamp is set, otherwise it is unset. A sink of
frames can define how this timestamp is used if set.
</p>
<p>
Each frame may have a <dfn class="export">RTP timestamp</dfn> representing the packet RTP
timestamp used to produce this video frame. The timestamp is useful for example for frame
identification and playback quality measurements. A source of frames can define how the
timestamp is set, otherwise it is unset. A sink of frames can define how this timestamp is
used if set.
The packet RTP timestamp concept is defined in [[?RFC3550]] Section 5.1.
</p>
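<p>
As a non-normative illustration, assuming the common 90 kHz RTP clock rate for video defined
in [[?RFC3551]], a difference between two [=RTP timestamp=] values can be converted to seconds
as sketched below (the function name is illustrative only):
</p>
<pre class="example">
// Convert the difference between two 32-bit RTP timestamps to seconds,
// assuming the 90 kHz video RTP clock rate (RFC 3551).
const kVideoRtpClockRate = 90000;
function rtpDeltaSeconds(earlierRtp, laterRtp) {
  const delta = (laterRtp - earlierRtp) >>> 0; // modulo-2^32, handles wraparound
  return delta / kVideoRtpClockRate;
}
</pre>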
<h3>Timestamp clock relations</h3>
<p>
The [=capture timestamp=] and [=receive timestamp=] use the same clock and offset.
The [=presentation timestamp=] and [=capture timestamp=] use the same clock and
have an offset which can be chosen arbitrarily by the user agent, since it isn't
directly observable by script.
</p>
<h3>{{VideoFrameMetadata}}</h3>
<pre class="idl">
partial dictionary VideoFrameMetadata {
DOMHighResTimeStamp captureTime;
DOMHighResTimeStamp receiveTime;
unsigned long rtpTimestamp;
};</pre>
<section class="notoc">
<h5>Members</h5>
<dl class="dictionary-members" data-link-for="VideoFrameMetadata" data-dfn-for="VideoFrameMetadata">
<dt><dfn><code>captureTime</code></dfn> of type <span class="idlMemberType">DOMHighResTimeStamp</span></dt>
<dd>
<p>The capture timestamp of the frame relative to {{Performance}}.{{Performance/timeOrigin}}. It corresponds to
the [=capture timestamp=] of {{MediaStreamTrack}} video frames.
</p>
</dd>
<dt><dfn><code>receiveTime</code></dfn> of type <span class="idlMemberType">DOMHighResTimeStamp</span></dt>
<dd>
<p>The receive time of the corresponding encoded frame relative to {{Performance}}.{{Performance/timeOrigin}}.
It corresponds to the [=receive timestamp=] of {{MediaStreamTrack}} video frames.</p>
</dd>
<dt><dfn><code>rtpTimestamp</code></dfn> of type <span class="idlMemberType">unsigned long</span></dt>
<dd>
<p>The RTP timestamp of the corresponding encoded frame. It corresponds to the [=RTP timestamp=] of
{{MediaStreamTrack}} video frames.</p>
</dd>
</dl>
</section>
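<p>
As a non-normative illustration, these members might be read as follows, assuming the user
agent supports <code>MediaStreamTrackProcessor</code> from mediacapture-transform and the
<code>VideoFrame</code> <code>metadata()</code> method from WebCodecs:
</p>
<pre class="example">
// Log the capture-to-receive delay of frames on a track sourced from a
// peer connection. captureTime and receiveTime share the same clock and
// offset, so their difference is the delay in milliseconds.
async function logFrameDelays(track) {
  const reader = new MediaStreamTrackProcessor({track}).readable.getReader();
  while (true) {
    const {value: frame, done} = await reader.read();
    if (done) break;
    const {captureTime, receiveTime, rtpTimestamp} = frame.metadata();
    if (captureTime !== undefined && receiveTime !== undefined) {
      console.log(`delay=${receiveTime - captureTime} ms, rtp=${rtpTimestamp}`);
    }
    frame.close();
  }
}
</pre>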
<h3>Algorithms</h3>
When the <dfn class="abstract-op">Initialize Video Frame Timestamps From Internal MediaStreamTrack Video Frame</dfn>
algorithm is invoked with |frame| and |offset| as input, run the following steps.
<ol class=algorithm>
<li>Set {{VideoFrame/timestamp}} from [=presentation timestamp=] minus |offset|.</li>
<li>Set {{VideoFrameMetadata/captureTime}} from [=capture timestamp=] if set.</li>
<li>Set {{VideoFrameMetadata/receiveTime}} from [=receive timestamp=] if set.</li>
<li>Set {{VideoFrameMetadata/rtpTimestamp}} from [=RTP timestamp=] if set.</li>
</ol>
When the <dfn class="abstract-op">Copy Video Frame Timestamps To Internal MediaStreamTrack Video Frame</dfn>
algorithm runs with |frame| as input, run the following steps.
<ol class=algorithm>
<li>Set [=presentation timestamp=] from {{VideoFrame/timestamp}}.</li>
<li>Set [=capture timestamp=] from {{VideoFrameMetadata/captureTime}} if [=map/exist|present=].</li>
<li>Set [=receive timestamp=] from {{VideoFrameMetadata/receiveTime}} if [=map/exist|present=].</li>
<li>Set [=RTP timestamp=] from {{VideoFrameMetadata/rtpTimestamp}} if [=map/exist|present=].</li>
</ol>
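<p>
As a non-normative illustration of the copy direction, a script can attach metadata when
constructing frames for a generator. This sketch assumes the <code>VideoTrackGenerator</code>
interface from mediacapture-transform and the <code>metadata</code> member of
<code>VideoFrameInit</code> from WebCodecs; <code>canvas</code> and
<code>nextPresentationTimestampUs</code> are placeholders:
</p>
<pre class="example">
// Writing the frame to the generator invokes the copy algorithm above:
// timestamp becomes the presentation timestamp and metadata.captureTime
// becomes the capture timestamp of the internal frame.
const generator = new VideoTrackGenerator();
const writer = generator.writable.getWriter();
const frame = new VideoFrame(canvas, {
  timestamp: nextPresentationTimestampUs,     // microseconds, 0-based
  metadata: {captureTime: performance.now()}, // milliseconds since timeOrigin
});
await writer.write(frame);
</pre>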
</section>
<section>
<h3>Local video capture timestamps</h3>
<p>
The user agent MUST set the [=capture timestamp=] of each video frame that is sourced from
{{MediaDevices/getUserMedia()}} and {{MediaDevices/getDisplayMedia()}} to its best estimate of the time that
the frame was captured, which MUST be in the past relative to the time the frame appears on the track.
This value MUST be monotonically increasing.
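<p>
A non-normative sketch of what this requirement implies for script, assuming the user agent
supports <code>MediaStreamTrackProcessor</code> (mediacapture-transform) and
<code>VideoFrame.metadata()</code> (WebCodecs):
</p>
<pre class="example">
// Capture timestamps of frames from getUserMedia() are in the past
// and monotonically increasing.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
const reader = new MediaStreamTrackProcessor({track}).readable.getReader();
let previousCaptureTime = -Infinity;
for (let i = 0; i < 10; i++) {
  const {value: frame} = await reader.read();
  const {captureTime} = frame.metadata();
  console.assert(captureTime <= performance.now());  // in the past
  console.assert(captureTime > previousCaptureTime); // monotonically increasing
  previousCaptureTime = captureTime;
  frame.close();
}
</pre>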
Review thread (resolved):

Contributor:

At TPAC, I think there was consensus on having camera tracks' timestamp == capture timestamp.
It does not seem to be the case here, though I would guess we keep the idea that the same clock is used.

Contributor Author:

Sorry, can you elaborate on what you mean here?

Contributor:

As per https://jsfiddle.net/4yzmwnsL/, it seems both Chrome and Safari use the same frame.timestamp for cloned tracks, even though the second track was cloned 2 seconds after the first one. This seems to be in contradiction with this PR.

Instead, implementations seem to be aligned with the idea that capture timestamp and presentation timestamp are both set by the source, not by the track.

I am also not totally clear of the difference between these two timestamps for local sources.
It seems that the time the frame is appearing on the track would be the presentation timestamp, and would be slightly greater than the capture timestamp.
If that is the case, we should refer to both capture timestamp and presentation timestamp in that section of the PR.

Contributor Author:

Yeah, that's right; I spoke to this at TPAC. Currently, Chrome emits absolute capture timestamps from gUM/gDM capture to the WebCodecs timestamp slot in MSTP. We believed we could do this since the nature of the timestamp isn't really well specified in WebCodecs. It was later found that there are web apps that depend on a 0-based timeline, and we have heuristics to support both cases in MSTG. The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

presentation timestamp is 0-based and increments by frame duration. We intend to change Chrome back to 0-based here after the PR series is landed.
capture timestamp is absolute & unobservable on MediaStreamTracks, but there are behavior requirements stated for gDM/gUM in this PR. It becomes observable first in VideoFrameMetadata creation at the MSTP.

> I am also not totally clear of the difference between these two timestamps for local sources.

For gUM/gDM sources, assume frame sequence indices 0, 1, 2, i ... N. presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i], i.e. same clock rate, different origin.

Suggestions on how to make this clearer?

Contributor:

> The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

That is fine to me.

> presentation timestamp is 0-based and increments by frame duration.

This PR assumes that 0 is per track, meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source.
For VTG, 0 would refer to the time the first video frame is enqueued.

For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

> It was later found that there are web apps that depend on a 0-based timeline

Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink, or per track's source.

> For gUM/gDM sources, assume frame sequence indices 0, 1, 2, i ... N. presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i] i.e. same clock rate, different origin.

That makes sense to me.

> Suggestions on how to make this clearer?

I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.

Commenter:

> presentation timestamp is 0-based and increments by frame duration.

> This PR assumes that 0 is per track. Meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.

> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

I think 0 per track source is what makes most sense. WDYT?

Contributor Author:

> The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

> That is fine to me.

> presentation timestamp is 0-based and increments by frame duration.

> This PR assumes that 0 is per track. Meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.

> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

> I think 0 per track source is what makes most sense. WDYT?

Done in a8fc811.

> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

> It was later found that there are web apps that depend on a 0-based timeline

> Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink or per track's source.

The usages we found would work great if MSTP exposes 0-based from creation.

> Suggestions on how to make this clearer?

> I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.

Done in a8fc811.

</p>
</section>

<section>
<h2>Exposing MediaStreamTrack source heuristic reactions support</h2>
<div>
<p>Some platforms or User Agents may provide built-in support for video effects triggered by user motion heuristics, in particular for camera video streams.