Capture, receive, and RTP timestamp concept definitions & normative requirements for gUM/gDM #156

Merged
merged 19 commits into from
Oct 31, 2024

@handellm handellm commented Sep 11, 2024

Partly addresses w3c/webcodecs#813 (review).

Defines capturetime in mediacapture-extensions.


@jan-ivar jan-ivar left a comment

Why isn't the existing videoFrame.timestamp good enough for AV sync, especially when it's defined for both audio and video, and has a higher resolution to boot?

index.html Outdated
Comment on lines 1161 to 1162
Some video sources can supply information about when a video frame was captured.
This information is useful for example for AV sync and end-to-end delay measurement.
Member

How does it help with AV sync if it's only on video frames?

Contributor Author

Absolute audio capture timestamps are available via other web APIs, right?
To try to answer the overarching question: presentation timestamps alone are insufficient if you want to do accurate AV sync on a local receiver (think MediaRecorder) or on a remote receiver, especially in the presence of delays in the capture process, or when VTGs are inserted in the capture pipeline to do video processing.
Is this answer enough, or do you want to see further changes to clarify this in the spec text, based on the latest uploaded revision?

index.html Outdated
Comment on lines 1178 to 1184
<dt><dfn><code>captureTime</code></dfn> of type <span
class="idlMemberType">DOMHighResTimeStamp, readonly</span></dt>
<dd>
<p>
The capture time of the frame, defined as {{Performance.timeOrigin}} + {{Performance.now()}}.
This is the user agent's best estimate of the instant the frame content was captured or
generated.
Member

What is the videoFrame.timestamp of these frames? Why do we need another one?

This is in milliseconds when the videoFrame.timestamp is in microseconds. That seems inconsistent.

Contributor Author

Hi, @jan-ivar - I was a little trigger happy in uploading this PR, I will upload changes soon.

Contributor Author

captureTime is absolute, which enables delay measurements in a video pipeline. Also, video capture OS APIs typically provide media sampling timestamps, which I've measured to sometimes be tens of milliseconds earlier than when the frames become available in a capture sink. I've tried to clarify in non-normative text what the timestamps are used for.

I don't know why VideoFrame.timestamp was specified in micros. DOMHighResTimeStamp with resolution around 0.1ms seems fine.
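
To make the unit mismatch discussed above concrete, here is a minimal sketch of converting between a millisecond value (as DOMHighResTimeStamp uses) and an integer microsecond value (as VideoFrame.timestamp uses). The helper names `msToMicros` and `microsToMs` and the type aliases are hypothetical and do not come from any spec.

```typescript
// Hypothetical helpers for illustration only; not part of any spec.
type Milliseconds = number; // fractional ms, DOMHighResTimeStamp-style
type Microseconds = number; // whole µs, VideoFrame.timestamp-style

// Convert fractional milliseconds to whole microseconds.
function msToMicros(ms: Milliseconds): Microseconds {
  return Math.round(ms * 1000);
}

// Convert whole microseconds back to fractional milliseconds.
function microsToMs(us: Microseconds): Milliseconds {
  return us / 1000;
}
```

A value with roughly 0.1 ms resolution survives the round trip, which is consistent with the point that either unit can carry the information.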

@jan-ivar
Member

The presentation timestamp seems a bit under-specified, but I gather it starts at 0 at the start of capture, and follows the "media timeline" which I suppose in theory can get out of sync (what happens to it with track.enabled = false?). So I can see a need here.

I just think it would be good to write down the differences and why we need a video capture timestamp.

@alvestrand alvestrand marked this pull request as draft September 19, 2024 14:40
@handellm
Contributor Author

41da924 is the uploaded state prior to addressing Youenn's request to extend the PR with timestamp-writing algorithms for local sources.

@handellm handellm changed the title from "Capture time" to "Capture, receive, and RTP timestamp concept definitions & normative requirements for gUM/gDM" Sep 20, 2024
@aboba aboba left a comment

If you're going to cover remote tracks as well as local ones, then some clarifications are needed. But it might be simplest to focus on locally captured tracks.

@handellm handellm requested a review from jan-ivar September 27, 2024 19:14
@handellm handellm marked this pull request as ready for review September 27, 2024 19:15
handellm and others added 4 commits October 7, 2024 12:12
Co-authored-by: Dominique Hazael-Massieux <[email protected]>
index.html Outdated
</p>
<p>
Each video frame has a <dfn class="export">presentation timestamp</dfn>
which is relative to the first frame appearing on the track. The timestamp
Contributor

Should we state that the first video frame of a {{MediaStreamTrack}} has a presentation timestamp of 0?
Or is it the first frame of the track's source?
Does this mean that the first video frame of a cloned track will have a timestamp of 0, or will cloned tracks have the same video frame timestamps?

Contributor Author

> Should we state that the first video frame of a {{MediaStreamTrack}} has a presentation timestamp of 0?

This - updated with clarification that the first frame's timestamp is 0.

> Does this mean that the first video frame of a cloned track will have a timestamp of 0, or will cloned tracks have the same video frame timestamps?

Having it start with 0 for cloned tracks is consistent with the definition as written here. Trying to specify 0 for the first frame from a source could be interpreted as the same timestamp sequence being replicated across all tabs - we don't want this for privacy reasons, right?

The user agent MUST set the [= capture timestamp =] of each video frame that is sourced from
{{MediaDevices/getUserMedia()}} and {{MediaDevices/getDisplayMedia()}} to its best estimate of the time that
the frame was captured, which MUST be in the past relative to the time the frame is appearing on the track.
This value MUST be monotonically increasing.
Contributor

At TPAC, I think there was consensus on having camera tracks timestamp == capture timestamp.
It does not seem to be the case here, though I would guess we keep the idea that the same clock is used.

Contributor Author

Sorry can you elaborate on what you mean here?

Contributor

As per https://jsfiddle.net/4yzmwnsL/, it seems both Chrome and Safari use the same frame.timestamp for cloned tracks, even though the second track was cloned 2 seconds after the first one. This seems to be in contradiction with this PR.

Instead, implementations seem to be aligned with the idea that capture timestamp and presentation timestamp are both set by the source, not by the track.

I am also not totally clear of the difference between these two timestamps for local sources.
It seems that the time the frame is appearing on the track would be the presentation timestamp, and would be slightly greater than the capture timestamp.
If that is the case, we should refer to both capture timestamp and presentation timestamp in that section of the PR.

Contributor Author

Yeah that's right, I spoke to this at TPAC. Currently, Chrome emits absolute capture timestamps from gUM/gDM capture to the WebCodecs timestamp slot in MSTP. We believed we could do this since the nature of the timestamp isn't really well specified in WebCodecs. It was later found that there are web apps that depend on a 0-based timeline and we have heuristics to support both cases in MSTG. The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

  • presentation timestamp is 0-based and increments by frame duration. We intend to change Chrome back to 0-based here after the PR series is landed.
  • capture timestamp is absolute & unobservable on MediaStreamTracks but there are behavior requirements stated for gDM/gUM in this PR. It becomes observable first in VideoFrameMetadata creation at the MSTP.

> I am also not totally clear of the difference between these two timestamps for local sources.

For gUM/gDM sources, assume frame sequence indices 0, 1, 2, i ... N. presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i] i.e. same clock rate, different origin.

Suggestions on how to make this clearer?
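
One way to picture the "same clock rate, different origin" relation stated above is the sketch below. It assumes timestamps are plain numbers in one shared unit; the `FrameTimes` interface and both function names are hypothetical, chosen only for illustration, and this is not how any user agent is known to implement it.

```typescript
// Illustrative record; field names are assumptions, not spec IDL.
interface FrameTimes {
  captureTimestamp: number;      // absolute time the frame was captured
  presentationTimestamp: number; // relative to the first frame from the source
}

// Derive a 0-based presentation timeline from absolute capture timestamps.
function toFrameTimes(captureTimestamps: number[]): FrameTimes[] {
  const [origin] = captureTimestamps;
  if (origin === undefined) return [];
  return captureTimestamps.map((t) => ({
    captureTimestamp: t,
    presentationTimestamp: t - origin,
  }));
}

// Check that consecutive deltas are equal on both timelines.
function sameClockRate(frames: FrameTimes[]): boolean {
  let prev: FrameTimes | undefined = undefined;
  for (const frame of frames) {
    if (prev !== undefined) {
      const dCapture = frame.captureTimestamp - prev.captureTimestamp;
      const dPresentation = frame.presentationTimestamp - prev.presentationTimestamp;
      if (dCapture !== dPresentation) return false;
    }
    prev = frame;
  }
  return true;
}
```

With capture timestamps 1000, 1033, 1066 the derived presentation timestamps are 0, 33, 66: the differences match frame by frame, and only the origin differs.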

Contributor

> The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

That is fine to me.

> presentation timestamp is 0-based and increments by frame duration.

This PR assumes that 0 is per track. Meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source.
For VTG, 0 would refer to the time the first video frame is enqueued.

For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

> It was later found that there are web apps that depend on a 0-based timeline

Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink or per track's source.

> For gUM/gDM sources, assume frame sequence indices 0, 1, 2, i ... N. presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i] i.e. same clock rate, different origin.

That makes sense to me.

> Suggestions on how to make this clearer?

I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.

> > presentation timestamp is 0-based and increments by frame duration.

> This PR assumes that 0 is per track. Meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.

> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

I think 0 per track source is what makes most sense. WDYT?

Contributor Author

> > The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

> That is fine to me.

> > presentation timestamp is 0-based and increments by frame duration.

> This PR assumes that 0 is per track. Meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.

> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

> I think 0 per track source is what makes most sense. WDYT?

Done in a8fc811.

> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

> > It was later found that there are web apps that depend on a 0-based timeline

> Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink or per track's source.

The usages we found would work great if MSTP exposes 0-based from creation.

> > Suggestions on how to make this clearer?

> I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.

Done in a8fc811.

@youennf
Contributor

youennf commented Oct 8, 2024

> I think 0 per track source is what makes most sense.

That seems indeed better compared to per track.

> The usages we found would work great if MSTP exposes 0-based from creation.

That is closer to per track's sink timestamps, but...
MSTP can be specified so that the presentation timestamp of the video frames it is exposing to web pages will be relative to the first video frame it is exposing.

If we do that change in MSTP, do we need presentation timestamp and capture timestamp to be different for local tracks?

@handellm
Contributor Author

handellm commented Oct 8, 2024

> > I think 0 per track source is what makes most sense.

> That seems indeed better compared to per track.

> > The usages we found would work great if MSTP exposes 0-based from creation.

> That is closer to per track's sink timestamps, but... MSTP can be specified so that the presentation timestamp of the video frames it is exposing to web pages will be relative to the first video frame it is exposing.

I was expressing myself sloppily, I meant what you wrote. So for this PR, we could prepare for that by adding an |offset| parameter to the "Initialize Video Frame Timestamps From Internal MediaStreamTrack Video Frame" algorithm, which is specified later from mediacapture-transform as "the [=presentation timestamp=] of the first frame". WDYT?

> If we do that change in MSTP, do we need presentation timestamp and capture timestamp to be different for local tracks?

In general yes, it's not possible to recreate any accurate capture timestamp given presentation timestamp alone since the origin is unknown.

@youennf
Contributor

youennf commented Oct 8, 2024

> we could prepare for that by adding an |offset|

We could consider that MSTP is receiving a VideoFrame object and use https://w3c.github.io/webcodecs/#videoframe-initialize-frame-from-other-frame to create a new VideoFrame with an updated timestamp.

> > If we do that change in MSTP, do we need presentation timestamp and capture timestamp to be different for local tracks?

> In general yes, it's not possible to recreate any accurate capture timestamp given presentation timestamp alone since the origin is unknown.

It seems that presentation timestamp will be exposed in a few places:

  • Via MSTP where it would be updated by MSTP before exposure to JS.
  • Via rvfc where it could also be updated in the same way (offset by the timestamp of the first video frame being rendered by the media element).

Based on this, the fact that the presentation timestamp of the first video frame of a track's source is 0 may not be observable to JavaScript, since the only presentation timestamps exposed to JS would be computed by each sink and would be relative to that sink.

The only requirement could be that presentation timestamps have the same clock as capture timestamps for local sources. presentation timestamp == capture timestamp for local sources seems then sufficient and easier to understand.

@handellm
Contributor Author

handellm commented Oct 8, 2024

>   • Via MSTP where it would be updated by MSTP before exposure to JS.
>   • Via rvfc where it could also be updated in the same way (offset by the timestamp of the first video frame being rendered by the media element).

Yes, VideoFrame.timestamp for the former and mediaTime for the latter (they're currently both described to be "presentation timestamp" or "media presentation timestamp (PTS)").

> The only requirement could be that presentation timestamps have the same clock as capture timestamps for local sources. presentation timestamp == capture timestamp for local sources seems then sufficient and easier to understand.

I buy requiring the same clock rate, but I don't see the need to require equality for the offset.

Given that sinks which expose presentation timestamp subtract an |offset| (the first used presentation timestamp) to produce a 0-based observable sequence, it doesn't seem to me the actual origin of presentation timestamp in the UA matters and hence don't need to be strictly defined?

Wdyt we just add a note explaining this and that it's allowable to have the same clock (rate and offset) for presentation timestamp and capture timestamp as a special case?
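
The sink-side rebasing described in this comment could look roughly like the sketch below. `TimestampRebaser` is a hypothetical name, not an API from mediacapture-transform or WebCodecs; it only illustrates why the internal origin stops being observable once each sink subtracts the first timestamp it sees.

```typescript
// Hypothetical sink-side helper: the first timestamp it sees becomes the offset,
// so every sink exposes a 0-based sequence regardless of the internal origin.
class TimestampRebaser {
  private offset: number | undefined = undefined;

  rebase(presentationTimestamp: number): number {
    if (this.offset === undefined) {
      this.offset = presentationTimestamp;
    }
    return presentationTimestamp - this.offset;
  }
}

// Two internal timelines with the same clock rate but different origins.
const zeroBased = [0, 33, 66];
const absolute = [1_000_000, 1_000_033, 1_000_066];

const sinkA = new TimestampRebaser();
const sinkB = new TimestampRebaser();
const exposedA = zeroBased.map((t) => sinkA.rebase(t));
const exposedB = absolute.map((t) => sinkB.rebase(t));
```

Both `exposedA` and `exposedB` come out as the same 0-based sequence, so script reading from such a sink cannot tell which origin the user agent used internally.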

@youennf
Contributor

youennf commented Oct 9, 2024

> it doesn't seem to me the actual origin of presentation timestamp in the UA matters and hence don't need to be strictly defined?

Agreed.

> Wdyt we just add a note explaining this and that it's allowable to have the same clock (rate and offset) for presentation timestamp and capture timestamp as a special case?

Right, this seems editorial.
I would tend to define presentation timestamp equal to capture timestamp for simplicity/readability, but either is fine really.

@handellm
Contributor Author

handellm commented Oct 9, 2024

@dontcallmedom I can't parse what respec complains about, ideas?

@dontcallmedom
Member

looks like it was a temporary hiccup, re-running the job fixed it

Contributor

@youennf youennf left a comment

LGTM once we say something about the [=presentation timestamp=] of local video capture, either same value as [=capture timestamp=] or with a fixed offset.
Plus the removal of a MUST we cannot really test.

Co-authored-by: youennf <[email protected]>
@youennf
Contributor

youennf commented Oct 10, 2024

> the [=presentation timestamp=] of local video capture, either same value as [=capture timestamp=]

Only for local tracks of course.

@jan-ivar
Member

Thanks for all the activity! Let me have a look. I'd also love for @padenot to look this over.

@henbos
Contributor

henbos commented Oct 10, 2024

From editor's meeting:

@padenot

padenot commented Oct 10, 2024

All good from my perspective, with comments from @youennf addressed. Please remember to put this upstream and not here, thanks.

@handellm
Contributor Author

@padenot thanks - can you just clarify what you meant by "upstream and not here"? We had one interpretation at the editors' meeting, but reading again I'm not sure that was right.

@happylinks

Just want to share my appreciation for adding capture timestamp. This is exactly what we are missing for having accurate multi-stream recording in tella.tv.
Right now we start 2 MediaRecorders and basically hope for the best in terms of sync between screen and camera; there is no way to know when the first frame was actually captured.
Seems like this will fix that problem, so thanks for working on it!
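
As a rough illustration of the multi-stream use case above, the sketch below computes how much each recording would need to be delayed to line up with the earliest one, given the capture timestamp of each stream's first frame. It assumes those first-frame capture timestamps are already available as numbers on a shared absolute clock; how they are obtained is not shown, and the interfaces and the function name `alignmentDelays` are hypothetical.

```typescript
// Hypothetical input: the absolute capture timestamp (ms) of each stream's first frame.
interface StreamStart {
  name: string;
  firstCaptureTimestamp: number;
}

interface StreamDelay {
  name: string;
  delayMs: number;
}

// Delay of each stream relative to the stream whose first frame was captured earliest.
function alignmentDelays(streams: StreamStart[]): StreamDelay[] {
  let earliest = Infinity;
  for (const s of streams) {
    earliest = Math.min(earliest, s.firstCaptureTimestamp);
  }
  return streams.map((s) => ({
    name: s.name,
    delayMs: s.firstCaptureTimestamp - earliest,
  }));
}
```

For example, if the screen's first frame was captured at 5000 ms and the camera's at 5120 ms, the camera recording would be placed 120 ms after the screen recording on a shared timeline.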

@handellm
Contributor Author

@padenot @jan-ivar - friendly ping!

Member

@jan-ivar jan-ivar left a comment

Thanks! I approve of overall direction. Happy to file followups if I find something.

@youennf youennf merged commit 464ebe8 into w3c:main Oct 31, 2024
2 checks passed
10 participants