Add face detection constraints and VideoFrameMetadata members #78
Conversation
Raised an issue on Mozilla's standard positions
Thanks! Responding having seen your post w3c/webcodecs#607, and speaking as a BBC contributor, with my Media WG chair hat removed. We need to be sure that there are no ethical issues with exposing this to the web, concerns I mentioned at the TPAC 2021 breakout meeting: https://www.w3.org/2021/10/20-webrtc-ic-minutes.html It's good that detecting facial expressions is stated as a non-goal, but I'd recommend going further and saying it "must not" rather than "does not need to". Misdetection is a concern, as mentioned in the explainer, but there are also privacy implications of exposing inferred emotions, at least without strong user consent. As such, I'd want to see this proposal go through wide review, including Privacy and TAG.
I changed the wording in the explainer as you suggested and it will be updated in the next PR. However, while I have no problem updating the wording, I don't personally see this as an issue with the proposed API. Misdetection is an issue, but by not offering detection in the Web API we just push people to run their own detection algorithms, which hardly improves the situation. I don't see any privacy issues here: the metadata is inferred from the same frame to which it is attached, so it does not give whoever gets the frame any information that the original frame alone wouldn't already provide. Privacy issues would exist only if the metadata were delivered to a user without the related video frame, but that is not done by the proposed or other Web APIs.
Force-pushed ff3c99b to 5f8b11b.
Changes in 5f8b11b:
face-detection-explainer.md
Outdated

The field `landmarks` provides a list of facial features belonging to the detected face, such as eyes or mouth. User agent may return accurate contour for the features, but early implementations are expected to deliver only a single point corresponding to the center of a feature.

```idl
dictionary HumanFaceLandmark {
  Point2D centerPoint;
```
Can we mark it `required`?
I guess they would be initially required but in the future it could contain a bounding box as well. Is that correct?
Correct. I did not use `required` here because in the future we might also have a bounding box/contour or other fields which would make `centerPoint` unnecessary. I'm happy to mark it `required` for now unless it complicates amending the spec later with other types of location information.
I think we should use `required` here. See rationale in the other comment.
After the changes needed to unblock issue 79, this no longer looks feasible.
face-detection-explainer.md
Outdated

```idl
dictionary HumanFace {
  long id;
  float probability;
  DOMRectReadOnly boundingBox;
```
Can we make `boundingBox` required, or is there a case where we would know that there is a face in the image but we do not know where?
I now see that we have a kind of union of `boundingBox` or face landmark, so we cannot make it required.
> we have a kind of union of boundingBox or face landmark

No, if by union you mean that only one or the other would be valid. The intent is to allow both the face `boundingBox` and face landmarks simultaneously. The `boundingBox` within the `HumanFace` dict would indicate the bounding box for the whole face. Landmarks are part of a human face and can be set or not set independently of the whole-face bounding box.

The reason why I did not mark `boundingBox` as `required` is the same as for `centerPoint`: to anticipate extensions to `HumanFace` where we could have something like a contour or bitmask for the face and no longer need a bounding box. If you think it is not a problem to mark `boundingBox` as `required` now and later remove the keyword when the dictionary is extended with new types of location, I am happy to add the keyword now. However, I am worried that if the member is now required, Web apps could break if the keyword is removed later and the bounding box made optional. Probably this wouldn't happen, though, as `humanFaceDetectionMode` would need to be set to something other than `bounding-box` too.

> is there a case where we would know that there is a face in the image but we do not know where

With this draft version, there wouldn't be such a case. However, it could be a future extension: add a new `ObjectDetectionMode` called `presence` which just indicates that a face exists in the frame without its location. A use case could be, for example, preventing the screensaver while the user is in front of their computer. We did have the `presence` mode in some earlier draft, but I removed it to simplify.
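(Purely illustrative of that hypothetical `presence` extension, which is not part of this draft; the wake-lock pairing below is an assumed example using the standard Screen Wake Lock API, and `onFrameMetadata` is a hypothetical hook:)

```js
// Hypothetical: a "presence" mode reporting only that a face exists,
// without location. Not part of the current draft.
await videoTrack.applyConstraints({ humanFaceDetectionMode: "presence" });

// Example use: keep the screen awake while a face is in front of the camera.
let wakeLock = null;
async function onFrameMetadata(metadata) {
  if (metadata.humanFaces?.length > 0) {
    wakeLock ??= await navigator.wakeLock.request("screen");
  } else {
    await wakeLock?.release();
    wakeLock = null;
  }
}
```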
> I am worried that if the member is now required, Web apps could break if the keyword is removed later and the bounding box made optional.

The `required` WebIDL keyword is a no-op on outputs, yet it's become a convention many specs lean on to indicate that user agents will never return a dictionary without filling in the member. It's still technically up to algorithms to specify what is actually filled in, but some specs leave the algorithms to be inferred.

Sometimes, though, like here, the invariants are more complicated, and a single `required` keyword won't suffice. The invariant seems to be:

```js
if (track.getSettings().humanFaceLandmarkDetectionMode == "center-point") {
  for (const face of frame.metadata().humanFaces) {
    const x = face.leftEye?.centerPoint.x; // i.e. no need to write centerPoint?.x
  }
}
```

Is that right?

Similarly, only tracks whose settings include `humanFaceDetectionMode: "bounding-box"` are guaranteed to have a `face.boundingBox`. These invariants seem worth documenting in prose or algorithms.

`boundingBox` cannot be marked as `required` because it would be absent if you set the former but not the latter constraint.

But I would mark `centerPoint` as required, because it's always a sub-member; e.g. `leftEye` wouldn't be there if not for `humanFaceLandmarkDetectionMode == "center-point"`. We should be able to remove that particular `required` keyword later without issue when another `humanFaceLandmarkDetectionMode` mode is added, because the new access mode would need to be requested explicitly by JS (assuming browsers cannot default to any of these modes, which I don't actually see this spec speaking to).
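(For illustration, a parallel sketch of that second invariant, assuming the same hypothetical `frame.metadata()` shape as the snippet above; this is not from the thread:)

```js
// Sketch: boundingBox would be guaranteed only when the track was
// configured with humanFaceDetectionMode: "bounding-box".
if (track.getSettings().humanFaceDetectionMode == "bounding-box") {
  for (const face of frame.metadata().humanFaces) {
    const { x, y, width, height } = face.boundingBox; // no boundingBox?. needed
  }
}
```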
Understood, and right, in the example there is no need to write `centerPoint?.x` because the metadata originates from a video track with `humanFaceLandmarkDetectionMode == "center-point"`.

I agree with the suggestion to document the invariants. I changed the PR as follows:

> bounding-box: This source generates metadata into {{VideoFrameMetadata}} related to human face detection, including the bounding box information for each detected face. As an input, this is interpreted as a command to turn the setting of human face detection to bounding-box.

> But I would mark centerPoint as required

After the changes needed to unblock issue 79, this no longer looks feasible.
partially (or in special cases even fully) outside of the visible image.
A coordinate is interpreted to represent a location in a normalized square space. The origin of coordinates (x,y) = (0.0, 0.0) represents the upper leftmost corner whereas (x,y) = (1.0, 1.0) represents the lower rightmost corner relative to the rendered frame: the x-coordinate (columns) increases rightwards and the y-coordinate (rows) increases downwards.
Does it make sense to normalise?
DOMRectReadOnly has a width/height and an origin which would not be normalised, but Point2D would be?
What does https://wicg.github.io/shape-detection-api do? What do OS APIs do?
We had a discussion with the WebCodecs people (e.g. Dan Sanders) and the issue of the coordinate system turned out to be a bit complicated. A normalized coordinate space seemed more straightforward and avoids some issues compared to natural size, which doesn't always match copyTo(). It is too bad that it doesn't seem to be consistent with existing Web APIs. The Shape Detection API uses natural size (unlike us), but after the discussions it seems that the whole definition of coordinate space in that spec is not adequate, at least not for our purposes with VideoFrame, which internally has a rotation, visibleRect, and codedRect.

Therefore, I would rather stick with the definition from the Image Capture spec's Points of interest with normalized coordinates. I think that spec is more mainstream and avoids the above issues. (Edit: actually the PoI spec doesn't consider rotation or visibleRect/codedRect either. Therefore we say in the spec/explainer "relative to the rendered frame", which should avoid the possible confusion.)

The platform APIs don't help much here: Windows uses normalized coordinates (up to INT_MAX or something) while Android/ChromeOS uses native pixel coordinates.

I think that both `Point2D` and `DOMRectReadOnly` types should use the same coordinate system for consistency. If we use normalized coordinates, `width` and `height` should also be normalized so that `right` = `x + width` for consistency. I will clarify that in the draft API once we agree on the coordinate system.
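(To make the proposed convention concrete, a minimal hedged sketch; the function name and frame dimensions are illustrative, not part of the proposal:)

```js
// Sketch: map a normalized rect (all members in [0.0, 1.0], so that
// right = x + width also holds in normalized space) to pixel
// coordinates of the rendered frame.
function toRenderedPixels(rect, renderedWidth, renderedHeight) {
  return {
    x: rect.x * renderedWidth,
    y: rect.y * renderedHeight,
    width: rect.width * renderedWidth,
    height: rect.height * renderedHeight,
  };
}

// e.g. a face bounding box covering the center quarter of a 1280x720 frame:
toRenderedPixels({ x: 0.25, y: 0.25, width: 0.5, height: 0.5 }, 1280, 720);
// → { x: 320, y: 180, width: 640, height: 360 }
```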
Is the PR ready for CfC?
Thanks @ttoivone for updating the explainer, looks OK from my point of view.
We are still waiting for review comments from the WebCodecs team (Dan Sanders/Dale Curtis).
Force-pushed 37840a4 to e2ec3d6.
Changes in e2ec3d6:
Feedback was positive from WebCodecs (Dale Curtis): "Structure looks good to me for VideoFrameMetadata. I defer to @youennf around correctness issues for what metadata should be there." @jan-ivar was removed inadvertently from the reviewer list and I still can't add reviewers myself, sorry. @youennf @jan-ivar: Please let us know if further updates are needed in the PR before CfC, thanks.
Hi all, thanks so much for all this work! Overall I think it looks good, just lots of nits and comments since it's taken me a while to dive in and review this. My apologies for that.
To summarize some of my feedback:
- document invariants better around reading metadata, so web devs don't have to `?.` every data point
- clarify, maybe even simplify `probability`?
- orient examples around detecting support using `getSettings()` not `getSupportedConstraints()`
- define what left/right mean (eye of human or beholder?)
- clarify if the two constraints are dependent or not (and why there are two)
- remove gaze correction
Thanks! Happy to take changes after CfC has concluded.
index.html
Outdated

```diff
@@ -654,5 +654,400 @@ <h2>Exposing change of MediaStreamTrack configuration</h2>
     </p>
   </div>
 </section>
+<section>
+  <h2>Human faces</h2>
+  <p>Human face metadata describes the human faces in video frames. It can
```
For a reader coming in cold, I think it would be good to clarify early that this is detection, not recognition of faces. "describes the human faces" sounds like the latter to me, which seems unfortunate. How about:

```diff
- <p>Human face metadata describes the human faces in video frames. It can
+ <p>Human face metadata describes the presence of human faces in video frames. It can
```
Changed to "Human face metadata describes the geometrical information of human faces". I think that is more precise. "presence" sounds like boolean to me, I would understand it describing whether there are faces on the frame or not.
face-detection-explainer.md
Outdated

```js
partial dictionary MediaTrackSupportedConstraints {
  boolean humanFaceDetectionMode = true;
  boolean humanFaceLandmarkDetectionMode = true;
};

partial dictionary MediaTrackCapabilities {
  sequence<DOMString> humanFaceDetectionMode;
  sequence<DOMString> humanFaceLandmarkDetectionMode;
};

partial dictionary MediaTrackConstraintSet {
  ConstrainDOMString humanFaceDetectionMode;
  ConstrainDOMString humanFaceLandmarkDetectionMode;
};

partial dictionary MediaTrackSettings {
  DOMString humanFaceDetectionMode;
  DOMString humanFaceLandmarkDetectionMode;
};

enum HumanFaceDetectionMode {
  "none",          // Face or landmark detection is not needed
  "bounding-box",  // Bounding box of the detected object is returned
```
WebIDL is for specs, not explainers, IMHO. Can this be replaced with JS snippets showing the expected syntax?
If not, then please replace `js` with `idl` on the first line to get the correct syntax highlighting.
I like having WebIDL in explainers as long as we're not yet ready to land it in specs. It's a pain to have it in both places.
I think JS snippets are better for the examples at the end of the explainer. For explaining things, I found JS to be inconvenient; how would an enum be written in JavaScript, for instance? Therefore, replaced `js` with `idl`.
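(For concreteness, a hedged sketch of the JS usage the IDL above implies; the constraint names come from the IDL, but the exact shape is illustrative and assumes an async context:)

```js
// Sketch: request face detection with bounding boxes and landmark
// center points when opening the camera.
const stream = await navigator.mediaDevices.getUserMedia({
  video: {
    humanFaceDetectionMode: "bounding-box",
    humanFaceLandmarkDetectionMode: "center-point",
  },
});
```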
face-detection-explainer.md
Outdated

The constraint `faceDetectionMode` is used by applications to describe the level of facial data that they need. At the lowest enabled level, `presence` will return the sequence of `DetectedFace`, but the `contour` and `landmarks` sequences will be empty. When `faceDetectionMode` is `contour`, arbitrary number of points around the faces will be returned but no landmarks. An user agent might return only four contour points corresponding to face bounding box. If a Web application needs only maximum of four contour points (bounding box), it can set `faceDetectionMode` to `bounding-box` which limits number of contour points to four, residing at the corners of a rectangle around the detected faces.

New members are added to capabilities, constraints, and settings for Web applications to enable and control face and face landmark detection with `getUserMedia()` and `applyConstraints()` and to query capabilities of face detection with `getCapabilities()` methods. Web applications should not ask more facial metadata than what they need to limit computation. For example, if an applications is content with just a face bounding box, it should set the constraint `humanFaceLandmarkDetectionMode` to `"none"`.
The sentence "Web applications should not ask more facial metadata than what they need to limit computation" has a typo, but also seems to be admonishing web developers to behave a certain way, which reads a bit odd to me.
Could it be rephrased to emphasize the benefits of asking for less information?
Rephrased to "Web applications can reduce the computational load on the user agent by only requesting the necessary amount of facial data, rather than asking for more than they need."
face-detection-explainer.md
Outdated

At the highest level, when `faceDetectionMode` is `landmarks`, the full precision contour which is available is returned along with landmark features.

The enumeration constraints `humanFaceDetectionMode` and `humanFaceLandmarkDetectionMode` set the level of detection needed for human faces and their landmarks, respectively. These settings can be one of the enumeration values in `HumanFaceDetectionMode` and `HumanFaceLandmarkDetectionMode`. When `humanFaceDetectionMode` is `"bounding-box"`, user agent must attempt face detection and set the metadata in video frames correspondingly. When the setting is `"none"`, face description metadata (including landmarks) is not set. Similarly, when `humanFaceLandmarkDetectionMode` is `"none"`, the landmarks (ie. members `leftEye`, `rightEye`, and `mouth` in dictionary `HumanFace`) are not set. When the setting is `"center-point"` and face detection is enabled, the user agent must attempt to detect face landmarks and set the location information in the members of type `HumanFaceLandmark` accordingly.
"enumeration constraints" is an odd term. They are DOMString constraints. Maybe just say "constraints"?
Also to nitpick on language, these constraints are "applied" not "set". And even when successfully set, it is probably wise for the application to read back videoTrack.getSettings()
to verify that the constraints actually stuck, as I believe that is the only point at which user agents MUST perform the feature (e.g. a user COULD have configured their user agent to not expose this functionality, even though the constraint name is recognized by getSupportedConstraints
).
Removed "enumeration", rephrased the paragraph. It now begins with "The constraints humanFaceDetectionMode
and humanFaceLandmarkDetectionMode
request the metadata detail level that the application wants for human faces and their landmarks, respectively."
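(A hedged sketch of the read-back pattern @jan-ivar describes; it assumes a `stream` obtained from getUserMedia as in the earlier sketch:)

```js
// Sketch: apply the constraint, then verify that it actually stuck.
const [videoTrack] = stream.getVideoTracks();
await videoTrack.applyConstraints({ humanFaceDetectionMode: "bounding-box" });
if (videoTrack.getSettings().humanFaceDetectionMode !== "bounding-box") {
  // The constraint name was recognized, but the user agent chose not to
  // (or could not) enable face detection; fall back accordingly.
}
```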
index.html
Outdated

frames originating from the same {{MediaStreamTrack}} source, {{id}} is set to the same integer value for the face in all frames.</p>
<p>User agent MUST NOT select the assigned value of {{id}} in such a way that the detected faces could be correlated to match in any way between different {{MediaStreamTrack}} objects.</p>
This means that a `const trackB = trackA.clone()` will have different ids. Was that the intent?
Yes, it is the main rule and I don't see a reason to make an exception in this case. Do you think there would be a reason to do otherwise?
I noticed that elsewhere in the PR, notes are actually made about different {{MediaStreamTrack}} sources, contradicting this sentence. Therefore, changing this sentence to: "be correlated to match between different {{MediaStreamTrack}} sources."
face-detection-explainer.md
Outdated

The field `contour` provides an arbitrary number of points which enclose the face. Sophisticated implementations could provide a large number of points but initial implementations are expected to provide four contour points located in the corners of a rectangular region which describes the face bounding box. User agent is allowed to provide a minimum of one point, in which case the point should be located at the center of the face.

```idl
dictionary HumanFace {
  long id;
  float probability;
```
This is probably better expressed as a "confidence".
It's not really a probability: the face is either there or not. This value is intended to represent how confident the API is that the face is there.
I assume that here "probability" is the estimated probability provided by the model, based on the data it was trained on.
Confidence is typically not expressed as a single number, but rather a range.
> I assume that here "probability" is the estimated probability provided by the model,

Naming aside, leaving the range of possible values implementation-defined may make it hard for apps to reason about the values. E.g. if one implementation's model rarely ever outputs above 0.7 for some technical purity reason, even in a well-lit, face-front situation, and another model reliably produces 1.0 except in corner cases, it could lead to weird web compat issues.

Are there precedents we could look at? Would it make sense to say something like `1.0` SHOULD represent the highest value the model can consistently output?
I think confidence is the right term.
> Each confidence estimate might be seen as a probability of correctness, reflecting the model's uncertainty about a detection or the pixel mask. During inference, we expect the estimated confidence to match the observed precision for a prediction. For example, given 100 predictions with 80% confidence each, we expect 80 predictions to be correctly predicted.

https://arxiv.org/pdf/2202.12785
That very paper describes how to ensure proper calibration of this confidence for object detection; I'm not sure whether we want to go to that level of specificity though.
> leaving the range of possible values implementation-defined may make it hard for apps to reason about the values.

I agree with this 100% and think that unless the value is properly defined it shouldn't be included in the spec.

The paper that @dontcallmedom quotes says "confidence estimate might be seen as a probability of correctness". To me this doesn't look like a very rigorous or widely used definition. I'd rather go with commonly used definitions, such as Wikipedia's. I believe what we would like to have here is a conditional probability, but the condition needs to be defined more precisely than it used to be. In the latest explainer/PR I changed the definition as follows:

"the estimate of the conditional probability that the segmented object actually is of the type indicated by the {{Segment/type}} member on the condition that the detection has been made." (where {{Segment/type}} describes whether it is a face or some other type of object; see issue 79.)

It is an estimate, not an exact probability. As implementor advice: the implementor could run the face detection algorithm on some test videos and, of the detections that the algorithm made, check how many are correct (missed faces would not affect the conditional probability here, although algorithm quality affects that too). Other factors could also be taken into account, such as confidence/score from the algorithm, camera image quality, etc.
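(A hedged sketch of that implementor check: bucket the reported probabilities on labeled test detections and compare each bucket against observed precision. Entirely illustrative; `detections` and its fields are hypothetical:)

```js
// Sketch: is the reported probability well calibrated?
// `detections` is labeled test data: [{ probability, isCorrect }, ...].
function calibrationReport(detections, bucketSize = 0.1) {
  const buckets = new Map();
  for (const { probability, isCorrect } of detections) {
    const key = Math.min(Math.floor(probability / bucketSize),
                         1 / bucketSize - 1);
    const b = buckets.get(key) ?? { total: 0, correct: 0 };
    b.total += 1;
    if (isCorrect) b.correct += 1;
    buckets.set(key, b);
  }
  // For a well-calibrated estimate, observed precision in each bucket
  // should be close to the bucket's midpoint probability.
  for (const [key, { total, correct }] of buckets) {
    console.log(`p≈${((key + 0.5) * bucketSize).toFixed(2)}: ` +
                `precision ${(correct / total).toFixed(2)} (n=${total})`);
  }
}
```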
Force-pushed e2ec3d6 to 67c9dba.
Updated the PR. All previous comments should now have been addressed, either by changing the PR or otherwise. Asking reviewers to check whether this version can be merged or if more changes are needed. @jan-ivar In particular, after the CfC, three objections were made: Segmentation metadata #79. As per the Feb 21 meeting, the proposal was to mark issues 84 and 85 as non-blocking. This updated PR should now address issue 79, which was a blocker. Asking @adoba to mark issues 84 and 85 as non-blockers and to check whether this PR now unblocks issue 79.
Given @ttoivone's comment, I think we should review the PR at the next editors' meeting.
Force-pushed 67c9dba to 311cd1c.
Changes in the latest update of the PR:
Editors agreed to merge with the change above.
Add specifications for human face metadata and related constraints, capabilities, and settings. Also add corresponding examples.
Force-pushed 311cd1c to a1c636a.
This PR supersedes previous PRs related to face detection (#57, #48). It adds the constraints (and related settings and capabilities) and extends the recently introduced VideoFrameMetadata to include descriptions of faces in the frames.

The feedback has been taken into consideration by simplifying the API: most of the previously proposed constraints have been removed, as has the mesh-based facial description. Only those judged essential for good performance are left. An exception is face landmarks, which are already supported by some platforms and could therefore be immediately useful. Furthermore, the term HumanFace is used instead of the more generic DetectedFace to anticipate future extensions of VideoFrameMetadata.

The PR consists of two commits. The first updates the explainer and the second updates the spec.