From 311cd1cd34005730d2da27b07eb2c9d82e7ed41c Mon Sep 17 00:00:00 2001 From: Tuukka Toivonen Date: Thu, 6 Apr 2023 13:17:33 +0300 Subject: [PATCH] Face detection: add specification Add specifications for human face metadata and related constraints, capabilities, and settings. Add also corresponding examples. --- face-detection-explainer.md | 451 +++++++++--------------------------- index.html | 324 +++++++++++++++++++++++++- 2 files changed, 438 insertions(+), 337 deletions(-) diff --git a/face-detection-explainer.md b/face-detection-explainer.md index 227b33b..6eb722b 100644 --- a/face-detection-explainer.md +++ b/face-detection-explainer.md @@ -6,72 +6,128 @@ - Tuukka Toivonen, Intel Corporation - Eero Häkkinen, Intel Corporation -## Participate -- github.com/riju/faceDetection/issues/ - ## Introduction This document describes a proposal to the WebRTC WG. At this stage, this proposal has not been accepted by the WG. -Face Detection is the process of detecting human faces in a given scene and distinguishing them from other objects. There are multiple ways to perform face detection on the Web. Libraries and ML frameworks (with WebAssembly and WebGL backend) exist, both proprietary and open source, which can perform face detection either in client within the browser or using a vendor cloud service. Computation in vendor cloud adds latencies depending on network speed and adds dependency to third party service. +Face detection is the process of detecting the presence and location of human faces in a given image or video frame and distinguishing them from other objects. Face detection does not involve the recognition of faces ie. associating the detected faces with their owners. There are multiple ways to perform face detection on the Web. Libraries and machine learning (ML) frameworks (with WebAssembly and WebGL backend) exist, both proprietary and open source, which can perform face detection either in client within the browser or using a vendor cloud service. Computation in vendor cloud adds latencies depending on network speed and adds dependency to a third party service. -[Shape Detection API has a FaceDetector](https://wicg.github.io/shape-detection-api/) which enables Web applications to use a system provided face detector, but it requires image data to be provided by the Web app itself. It surely helps that it works on images to detect faces, but from a video conference perspective, it means the app would first need to capture frames from a camera and then feed them as input to the Shape Detection API. Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. Cameras run a face detection algorithm by default to make their 3A algorithms work better. Both Microsoft and ChromeOS offer native platforms APIs to hook into those algorithms and offer performant face detection to the Web. +[Shape Detection API has a FaceDetector](https://wicg.github.io/shape-detection-api/) which enables Web applications to use a system provided face detector, but it requires image data to be provided by the Web app itself. It surely helps that it works on images to detect faces, but from a video conference perspective, it means the app would first need to capture frames from a camera and then feed them as input to the Shape Detection API. Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. Cameras run a face detection algorithm by default to make their 3A algorithms work better. 
Both Windows and ChromeOS offer native platforms APIs to hook into those algorithms and offer performant face detection to the Web. ## Goals -* Face Detection API should be anchored to [VideoFrame](https://www.w3.org/TR/webcodecs/#videoframe-interface) defined in WebCodecs instead of [MediaStreamTrack](https://www.w3.org/TR/mediacapture-streams/#dom-mediastreamtrack). +* Face detection API should be split into two parts: first to allow enabling face detection on [MediaStreamTrack](https://www.w3.org/TR/mediacapture-streams/#dom-mediastreamtrack) source, and then to define the description of detected faces in WebCodecs [VideoFrame](https://www.w3.org/TR/webcodecs/#videoframe-interface)s. -* Face Detection API should be anchored to [VideoFrameMetadata](https://wicg.github.io/video-rvfc/#dictdef-videoframemetadata) defined in [HTMLVideoElement.requestVideoFrameCallback](https://wicg.github.io/video-rvfc/). +* The description of faces in frames should extend the [VideoFrameMetadata](https://www.w3.org/TR/webcodecs/#dictdef-videoframemetadata) and the description can be supplemented or replaced by Web applications. -* Face Detection API should try to return a **contour** instead of a bounding box. The number of points describing the contour can be user defined via **faceDetectionMode** settings and implementations presently can default to a four point rectangle. +* Face detection API should return information on detected faces and landmarks as available on current platform APIs. For faces it should return bounding box and for landmarks the center point of the landmarks. -* Face Detection API should allow face tracking. +* Face descriptions should allow face tracking but not face recognizion or correlation of faces between different sources. -* Face Detection API should work with [TransformStream](https://developer.mozilla.org/en-US/docs/Web/API/TransformStream). +* Face detection API should work with [TransformStream](https://developer.mozilla.org/en-US/docs/Web/API/TransformStream). -* Face Detection API could be used as an input to various other APIs like Background Blur, Eye Gaze Correction, Face Framing, etc. Face Detection minimizes the surface area other dependent features need to work on for a faster implementation. It should be easy to use Face Detection API along with a custom Eye Gaze Correction or a Funny Hats feature from a ML framework by passing the face coordinates. +* Face descriptions could be used as an input to various algorithms like background blur, eye gaze correction, face framing, etc. Face detection minimizes the surface area that other algorithms need to process for a faster implementation. It should be easy to use face detection API along with a custom eye gaze correction or the Funny Hats feature from a ML framework by passing the face coordinates. -* In the spirit of successive enhancement, it should be possible to get results from Face Detection API and add custom enhancements before feeding the metadata to the video stream. +* Facial landmarks like *eyes* and *mouth* should be exposed if there's support in the platform and user enables it. -* Facial Landmarks like *eyes*, *nose* and *mouth* should be detected if there's support in the platform. +* The API should be as minimal as possible, while still supporting current platforms and allowing applications to ask for the minimum amount of extra computation that they need. ## Non-goals -* Face Detection API does not need to support facial expressions. 
Many platforms do support **BLINK** and **SMILE** and ML Frameworks do support a diverse set of expressions, usually **ANGER**, **DISGUST**, **FEAR**, **HAPPINESS**, **SADNESS**, **SURPRISE**; **NEUTRAL**, etc. [Many](https://www.w3.org/2021/11/24-webrtc-minutes.html#t04) felt expressions being more subjective and there’s a concern about the expression detection going wrong. +* Face detection API must not support facial expressions. Many platforms support *blink* and *smile* and ML frameworks support a diverse set of expressions, typically *anger*, *disgust*, *fear*, *happiness*, *sadness*, *surprise*, and *neutral*. [Many people](https://www.w3.org/2021/11/24-webrtc-minutes.html#t04) felt that expressions are too subjective and there's a concern of misdetecting expressions. -* Face Detection API does not need to return a mesh corresponding to the detected faces. Even though TensorFlow returns a 468 landmark FaceMesh and most DNNs can return something similar, mesh is not supported on any platforms presently, and for the sake of simplicity, it is excluded for now. However, in the long term it may be appropriate to extend the face detection API to be able to also return mesh-based face detection results. This is left for future work. +* Face detection API does not need to return a mesh or contour corresponding to the detected faces. Even though TensorFlow returns a 468-landmark FaceMesh and most DNNs can return something similar, mesh is not supported on any platforms presently, and for the sake of simplicity, it is excluded for now. However, in the long term it may be appropriate to extend the face detection API to be able to also return mesh-based face detection results. This is left for future work. -## Using the face detection API +## Face detection API + +### Metadata + +The API consists of two parts: first, metadata which describes the faces available in video frames. The metadata could be also set by user by creating a new video frame (modifying video frame metadata in existing frames is not allowed by the specification). The metadata could be used by WebCodecs encoders to improve video encoding quality, for example by allocating more bits to face areas in frames. As the encoding algorithms are not specified in standards, also the exact way how the facial metadata is used is not specified. + +```idl + +partial dictionary VideoFrameMetadata { + sequence segments; +}; + +dictionary Segment { + required SegmentType type; + required long id; + long partOf; // References the parent segment id + required float probability; // conditional probability estimate + Point2D centerPoint; + DOMRectReadOnly boundingBox; + // sequence contour; // Possible future extension +}; + +enum SegmentType { + "human-face", + "left-eye", // oculus sinister + "right-eye", // oculus dexter + "eye", // either eye + "mouth", + // More SegmentTypes may follow in future +}; +``` + +### Constraints + +The second part consists of a new constraint to `getUserMedia()` or `applyConstrains()` which is used to negotiate and enable the desired face detection mode. Often, camera drivers run already internally face detection for 3A, so enabling face detection might just make the results available to Web applications. The corresponding members are added also to media track settings and capabilities. 
+ +```idl +partial dictionary MediaTrackSupportedConstraints { + boolean humanFaceDetectionMode = true; +}; + +partial dictionary MediaTrackCapabilities { + sequence humanFaceDetectionMode; +}; + +partial dictionary MediaTrackConstraintSet { + ConstrainDOMString humanFaceDetectionMode; +}; + +partial dictionary MediaTrackSettings { + DOMString humanFaceDetectionMode; +}; + +enum HumanFaceDetectionMode { + "none", // Face or landmark detection is not needed + "bounding-box", // Bounding box of the detected object is returned + "bounding-box-with-landmark-center-point", // Also landmarks and their center points are detected +}; +``` + +## Using the face detection API -The API consists of two parts: first, the constraint `faceDetectionMode` to `getUserMedia()` or `applyConstrains()` is used to negotiate and enable the desired face detection mode. Often, camera drivers run already internally face detection for 3A, so enabling face detection might just make the results available to Web applications. The corresponding fields are added also to media track settings and capabilities. +### Metadata -The second part of the API adds a new field `detectedFaces` of type sequence of `DetectedFace` into WebCodecs `VideoFrameMetadata` which provides information of the detected faces on the frame. In `DetectedFace`, the field `id` is used to track faces between frames: the same non-zero `id` in a face between different frames indicates that it is the same face. The value of zero means that tracking is not supported. `probability` is the probability that the returned face is in fact a face and not a false detection. This should be always greater than zero but less than one, since no perfect algorithm exists. If probability is not specified, user agent sets the field to zero. +The first part of the API adds a new member `segments` of type sequence of `Segment` into WebCodecs [`VideoFrameMetadata`](https://www.w3.org/TR/webcodecs/#dictdef-videoframemetadata) which provides segmentation data for the frame. In dictionary `Segment`, the member `type` indicates the type of the enclosed object inside the segment. The member `id` is used to identify segments for two purposes. First, it is used to track segments between successive frames in a frame sequence (video): the same `id` of a segment between different frames indicates that it is the same object under tracking. However, it is specifically required that it must not be possible to correlate faces between different sources or video sequences by matching the `id` between them. If host uses face recognition to track objects, it must assign a random integer to `id` between sequences to avoid privacy issues. -The field `contour` provides an arbitrary number of points which enclose the face. Sophisticated implementations could provide a large number of points but initial implementations are expected to provide four contour points located in the corners of a rectangular region which describes the face bounding box. User agent is allowed to provide a minimum of one point, in which case the point should be located at the center of the face. +The second purpose of the `id` is to use it in conjunction of the optional `partOf` member. This member is used to reference another segmented object in the same `VideoFrame` of which the current object is part of. Thus, the integer value of the member `partOf` of a segment corresponding to a human eye would be set to the same value as the member `id` of another segment which corresponds to the human face where the eye is located. 
When the `partOf` member is undefined, the segment is described to not be part of any other segment. -The field `landmarks` provides a list of facial features belonging to the detected face, such as eyes or mouth. User agent may return accurate contour for the features, but early implementations are expected to deliver only a single point corresponding to the center of a feature. +The member `probability` represents an estimate of the conditional probability that the segmented object is of the type indicated by the `type` member, between zero and one, on the condition of the given segmentation. The user agent also has the option to omit the probability by setting the value to zero. As an implementer advise for `probability`, they could run the platform face detection algorithm on a set of test videos and of the detections that the algorithm made, check how many are correct. The implementor or user agent could take also other factors into account when estimating `probability`, such as confidence or score from the algorithm, camera image quality, and so on. -The contour of the detected faces and landmarks is described as a sequence of `Point2D` and the units -conform to the +The members `centerPoint` and `boundingBox` provide the geometrical segmentation information. The first one indicates an estimate of the center of the object and the latter the estimate of the tight enclosing bounding box of the object. The coordinates in these members are defined similarly as in the [`pointsOfInterest`](https://w3c.github.io/mediacapture-image/#points-of-interest) -field in [`MediaTrackSettings`](https://w3c.github.io/mediacapture-image/#dom-mediatracksettings-pointsofinterest) -with the exception that the points may also lie outside of the frame since a detected face could be -partially outside of the visible image. -A `Point2D` is interpreted to represent a pixel position in a normalized square space. The origin of -coordinates {x,y} = {0.0, 0.0} represents the upper leftmost corner whereas the {x,y} = -{1.0, 1.0} represents the lower rightmost corner relative to the rendered frame: the x coordinate (columns) increases rightwards and the y -coordinate (rows) increases downwards. +member in [`MediaTrackSettings`](https://w3c.github.io/mediacapture-image/#dom-mediatracksettings-pointsofinterest) +with the exception that the coordinates may also lie outside of the frame since a detected face could be +partially (or in special cases even fully) outside of the visible image. +A coordinate is interpreted to represent a location in a normalized square space. The origin of +coordinates (x,y) = (0.0, 0.0) represents the upper leftmost corner whereas the (x,y) = +(1.0, 1.0) represents the lower rightmost corner relative to the rendered frame: the x-coordinate (columns) increases rightwards and the y-coordinate (rows) increases downwards. -The constraint `faceDetectionMode` is used by applications to describe the level of facial data that they need. At the lowest enabled level, `presence` will return the sequence of `DetectedFace`, but the `contour` and `landmarks` sequences will be empty. When `faceDetectionMode` is `contour`, arbitrary number of points around the faces will be returned but no landmarks. An user agent might return only four contour points corresponding to face bounding box. 
If a Web application needs only maximum of four contour points (bounding box), it can set `faceDetectionMode` to `bounding-box` which limits number of contour points to four, residing at the corners of a rectangle around the detected faces. +### Constraints -At the highest level, when `faceDetectionMode` is `landmarks`, the full precision contour which is available is returned along with landmark features. +The new constraint `humanFaceDetectionMode` requests the metadata detail level that the application wants for human faces and their landmarks. This setting can be one of the enumeration values in `HumanFaceDetectionMode`. When the setting is `"none"`, face segmentation metadata is not set by the user agent (platform may still run face detection algorithm for its own purposes but the results are invisible to the Web app). When the setting is `"bounding-box"`, the user agent must attempt face detection and set the metadata in video frames correspondingly. Finally, when the setting is `"bounding-box-with-landmark-center-point"` the user agent must attempt detection of both face bounding boxes and the center points of face landmarks and set the segmentation metadata accordingly. +Web applications can reduce the computational load on the user agent by only requesting the necessary amount of facial metadata, rather than asking for more than they need. For example, if an application is content with just a face bounding box, it should apply the value `"bounding-box"` to the constraint. ## Platform Support -| OS | API | FaceDetection| +| OS | API | face detection| | ------------- |:-------------:| :-----:| | Windows | Media Foundation| [KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION ](https://docs.microsoft.com/en-us/windows-hardware/drivers/stream/ksproperty-cameracontrol-extended-facedetection?redirectedfrom=MSDN)| | ChromeOS/Android | Camera HAL3 | [STATISTICS_FACE_DETECT_MODE_FULL ](https://developer.android.com/reference/android/hardware/camera2/CameraMetadata#STATISTICS_FACE_DETECT_MODE_FULL)[STATISTICS_FACE_DETECT_MODE_SIMPLE ](https://developer.android.com/reference/android/hardware/camera2/CameraMetadata#STATISTICS_FACE_DETECT_MODE_SIMPLE)| @@ -86,158 +142,25 @@ The results were normalized against base case (viewfinder only, no face detectio ![Package Power Consumption](images/face-detection-ptat-fd15fps-rel.png) -Javacript test programs were created to capture frames and to detect faces at VGA resolution (640x480) at 15 fps. The tests were run on Intel Tigerlake running Windows 11. A test run length was 120 seconds (2 minutes) with 640x480 pixel frame resolution. +Javacript test programs were created to capture frames and to detect faces at VGA resolution (640x480) at 15 fps. The tests were run on Intel Tiger Lake running Windows 11. A test run length was 120 seconds (2 minutes) with 640x480 pixel frame resolution. ## User research * TEAMS : Supportive, but [not supportive of adding facial expressions](https://www.w3.org/2021/11/24-webrtc-minutes.html#t04) and doubtful on the accuracy of emotion analysis. -*Agreed, Facial expressions are not part of [Non-goals](https://github.com/riju/faceDetection/blob/main/explainer.md#non-goals) for this API* +*Agreed, facial expressions are part of [Non-goals](#non-goals) for this API* * MEET : Supportive, but many applications might not be using rectangle-bounding box, mask/contour more useful. 
-*Currently common platforms such as ChromeOS, Android, and Windows support [system APIs](https://github.com/riju/faceDetection/blob/main/explainer.md#platform-support) which return only face bounding box and landmarks, not accurate contour, and therefore initial implementations are expected to support only bounding-box (ie. a contour with maximum of four points). We would keep the API extensible, so that proper contour support can be added in future. A few [use-cases for bounding-box face detection](https://github.com/riju/faceDetection/blob/main/explainer.md#key-scenarios) are listed.* +*Currently common platforms such as ChromeOS, Android, and Windows support [system APIs](#platform-support) which return only face bounding box and landmarks, not accurate contour, and therefore initial implementations are expected to support only bounding-box (ie. a contour with maximum of four points). We would keep the API extensible, so that proper contour support can be added in future. A few [use-cases for bounding-box face detection](#key-scenarios) are listed.* * Zoom : -## Face Detection API - -### Common API - -```js -dictionary DetectedFace { - required long id; - required float probability; - FrozenArray contour; - FrozenArray landmarks; -}; - -dictionary DetectedFaceLandmark { - required FrozenArray contour; - FaceLandmark type; -}; - -enum FaceLandmark { - "eye", - "eyeLeft", - "eyeRight", - "mouth", - "nose" -}; - -partial dictionary MediaTrackSupportedConstraints { - boolean faceDetectionMode = true; -}; - -partial dictionary MediaTrackCapabilities { - sequence faceDetectionMode; -}; - -partial dictionary MediaTrackConstraintSet { - ConstrainDOMString faceDetectionMode; -}; - -partial dictionary MediaTrackSettings { - DOMString faceDetectionMode; -}; - -enum FaceDetectionMode { - "none", // Face detection is not needed - "presence", // Only the presence of face or faces is returned, not location - "bounding-box", // Return bounding box for face - "contour", // Approximate contour of the detected faces is returned - "landmarks", // Approximate contour of the detected faces is returned with facial landmarks -}; -``` - -### Extended VideoFrame API (alternative 1) - -In this section we propose an API which extends WebCodecs `VideoFrame` and -`VideoFrameMetadata` in order to provide face detection results. - -```js -partial interface VideoFrame { - readonly attribute FrozenArray? detectedFaces; -}; - -partial dictionary VideoFrameMetadata { - FrozenArray detectedFaces; -}; -``` - -Example are shown in sections [Example 1-1](#example-1-1) and -[Example 2-1](#example-2-1). - -### VideoFrame Side-Channel API (alternative 2) - -WebCodecs feedback on -the [Extended VideoFrame API (alternative 1)](#extended-videoframe-api-alternative-1) -was that - 1. They do not have a generic metadata interface for WebCodecs - `VideoFrame` and efforts to create one have not made progress - (see also the issues w3c/webcodecs#95 and w3c/webcodecs#189). - 2. They prefer to supply independent metadata in a side-channel, - typically an additional callback argument. - 3. Properly adding a new metadata field to VideoFrame requires updating - a lot of things (mostly around construct/clone/copy behavior), and - they are reluctant to build in direct support for this feature at - that level - 4. Simply adding a new field to the JS object however is something - a polyfill could do, and they don't really have a problem with that - level of integration if we don't have a way to supply it separately. 
- But such a field would not survive `clone()`. They also want to - review what happens with serialization (`postMessage()`, Streams). - -Therefore, in this section we propose an API which uses the side-channel method -to provide face detection results. - -We modify `MediaStreamTrackProcessor` by adding to `MediaStreamTrackProcessorInit` -a new `metadata` parameter which can be used to change the `MediaStreamTrackProcessor` -to process video frames and metadata chunks instead of video frame only chunks. -We also modify `HTMLVideoElement` by adding a new callback to it which provides -video frame metadata including face detection results. - -```js -partial dictionary MediaStreamTrackProcessorInit { - boolean metadata = false; -}; - -dictionary MediaStreamVideoFrameMetadata /* : VideoFrameMetadata */ { - FrozenArray detectedFaces; -}; - -dictionary MediaStreamVideoFrameAndMetadata { - required VideoFrame videoFrame; - required MediaStreamVideoFrameMetadata metadata; -}; - -callback MediaStreamVideoFrameMetadataRequestCallback = undefined(DOMHighResTimeStamp now, MediaStreamVideoFrameMetadata metadata); - -partial interface HTMLVideoElement { - unsigned long requestVideoFrameMediaStreamMetadataCallback(MediaStreamVideoFrameMetadataRequestCallback callback); - undefined cancelVideoFrameMediaStreamMetadataCallback(unsigned long handle); -}; -``` - -If a `MediaStreamTrackProcessor` is initiated with `{metadata: true}`, -the `MediaStreamTrackProcessor` processes `MediaStreamVideoFrameAndMetadata` -chunks (instead of `VideoFrame` chunks) and therefore passes them -to `Transformer.transform(chunk, controller)` and accepts them -for `controller.enqueue()`. -An example is shown in section [Example 1-2](#example-1-2). - -The new callback `MediaStreamVideoFrameMetadataRequestCallback` is called -when new face detection results are available on a video stream, -similarly to `requestVideoFrameCallback`. -An example is shown in Section [Example 2-2](#example-2-2). - -[PR](https://github.com/w3c/mediacapture-extensions/pull/48) - ## Key scenarios -Currently common platforms such as ChromeOS, Android, and Windows support system APIs which return only face bounding box and landmarks, not accurate contour, and therefore initial implementations are expected to support only bounding-box (ie. a contour with maximum of four points). These features can be used as the major building block in several scenarios such as: +Currently common platforms such as ChromeOS, Android, and Windows support system APIs which return face bounding box and landmarks. This information can be used as the major building block in several scenarios such as: * Auto-mute: a videoconferencing application can automatically mute microphone or blank camera image if user (face) presence is not detected. @@ -245,95 +168,31 @@ Currently common platforms such as ChromeOS, Android, and Windows support system * Face enhancement: application can apply various image enhancement filters to user's face. The filters may be either designed exclusively to faces, or when it is desired to save computation, background can be excluded from the filtering. -* Funny hats: application may want to render augmented reality on top of user faces by drawing features such as glasses or a hat. For accurate rendering, facial landmarks would be preferred. +* Funny Hats: application may want to render augmented reality on top of user faces by drawing features such as glasses or a hat. For accurate rendering, facial landmarks would be preferred. 
* Video encoding: many video encoders can allocate higher amount of bits to given locations in frames. Face bounding boxes can be used to increase the visual quality of faces at the cost of lower background quality. * Neural networks: these can be used to derive accurate face contours, recognize faces, or extract other facial information. However, these are typically slow and heavyweight algorithms which are too burdensome to apply to entire images. A known face bounding box allows applying an algorithm only to the relevant part of images. -## Examples - -### Example 1-1 - -This example uses -[Extended VideoFrame API (API alternative-1)](#extended-videoframe-api-alternative-1). +## Example ```js // main.js: -// Check if face detection is supported by the browser -const supports = navigator.mediaDevices.getSupportedConstraints(); -if (supports.faceDetectionMode) { - // Browser supports face detection. -} else { - throw('Face detection is not supported'); -} - // Open camera with face detection enabled const stream = await navigator.mediaDevices.getUserMedia({ - video: { faceDetectionMode: 'bounding-box' } + video: { humanFaceDetectionMode: 'bounding-box' } }); const [videoTrack] = stream.getVideoTracks(); // Use a video worker and show to user. -const videoElement = document.querySelector("video"); -const videoGenerator = new MediaStreamTrackGenerator({kind: 'video'}); -const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack}); const videoSettings = videoTrack.getSettings(); -const videoWorker = new Worker('video-worker.js'); -videoWorker.postMessage({ - videoReadable: videoProcessor.readable, - videoWritable: videoGenerator.writable -}, [videoProcessor.readable, videoGenerator.writable]); -videoElement.srcObject = new MediaStream([videoGenerator]); -videoElement.onloadedmetadata = event => videoElement.play(); - -// video-worker.js: -self.onmessage = async function(e) { - const videoTransformer = new TransformStream({ - async transform(videoFrame, controller) { - for (const face of videoFrame.detectedFaces) { - let s = ''; - for (const f of face.contour) { - s += `(${f.x}, ${f.y}),` - } - console.log(`Face @ (${s})`); - } - controller.enqueue(videoFrame); - } - }); - e.data.videoReadable - .pipeThrough(videoTransformer) - .pipeTo(e.data.videoWritable); -} -``` - -### Example 1-2 - -This example uses -[VideoFrame Side-Channel (API alternative 2)](#videoframe-side-channel-api-alternative-2). - -```js -// main.js: -// Check if face detection is supported by the browser -const supports = navigator.mediaDevices.getSupportedConstraints(); -if (supports.faceDetectionMode) { - // Browser supports face detection. -} else { +if (videoSettings.humanFaceDetectionMode != 'bounding-box') { throw('Face detection is not supported'); } - -// Open camera with face detection enabled -const stream = await navigator.mediaDevices.getUserMedia({ - video: { faceDetectionMode: 'bounding-box' } -}); -const [videoTrack] = stream.getVideoTracks(); - -// Use a video worker and show to user. 
const videoElement = document.querySelector("video"); const videoGenerator = new MediaStreamTrackGenerator({kind: 'video'}); -const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack, metadata: true}); -const videoSettings = videoTrack.getSettings(); +const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack}); const videoWorker = new Worker('video-worker.js'); videoWorker.postMessage({ videoReadable: videoProcessor.readable, @@ -345,15 +204,18 @@ videoElement.onloadedmetadata = event => videoElement.play(); // video-worker.js: self.onmessage = async function(e) { const videoTransformer = new TransformStream({ - async transform({videoFrame, metadata}, controller) { - for (const face of metadata.detectedFaces) { - let s = ''; - for (const f of face.contour) { - s += `(${f.x}, ${f.y}),` - } - console.log(`Face @ (${s})`); + async transform(videoFrame, controller) { + for (const segment of videoFrame.metadata().segments || []) { + if (segment.type === 'human-face') { + // the metadata is coming directly from the video track with + // bounding-box face detection enabled + console.log(`Face @ (${segment.boundingBox.left},` + + `${segment.boundingBox.top},` + + `${segment.boundingBox.right},` + + `${segment.boundingBox.bottom})`); + } } - controller.enqueue({videoFrame, metadata}); + controller.enqueue(videoFrame); } }); e.data.videoReadable @@ -362,112 +224,29 @@ self.onmessage = async function(e) { } ``` -### Example 2-1 - -This example uses -[Extended VideoFrame API (API alternative-1)](#extended-videoframe-api-alternative-1). - -```js -function updateCanvas(now, metadata) { - const canvasCtx.drawImage(video, 0, 0, canvas.width, canvas.height); - for (const face of (metadata.detectedFaces || [])) { - if (!face.contour || !face.contour.length) - continue; - canvasCtx.beginPath(); - canvasCtx.moveTo(face.contour[0].x, face.contour[0].y); - for (const point of face.contour.slice(1)) - canvasCtx.lineTo(point.x, point.y); - canvasCtx.closePath(); - canvasCtx.strokeStyle = 'red'; - canvasCtx.stroke(); - } - videoElement.requestVideoFrameCallback(updateCanvas); -} - -// Check if face detection is supported by the browser -const supports = navigator.mediaDevices.getSupportedConstraints(); -if (supports.faceDetectionMode) { - // Browser supports face contour detection. -} else { - throw('Face contour detection is not supported'); -} - -// Open camera with face detection enabled -const stream = await navigator.mediaDevices.getUserMedia({ - video: { faceDetectionMode: ['contour', 'bounding-box'] } -}); - -// Show to user. -const canvasElement = document.querySelector("canvas"); -const canvasCtx = canvasElement.getContext("2d"); -const videoElement = document.querySelector("video"); -videoElement.srcObject = new MediaStream([videoGenerator]); -videoElement.onloadedmetadata = event => videoElement.play(); -videoElement.requestVideoFrameCallback(updateCanvas); -``` - -### Example 2-2 - -This example uses -[VideoFrame Side-Channel (API alternative 2)](#videoframe-side-channel-api-alternative-2). 
- -```js -function updateCanvas(now, metadata) { - const canvasCtx.drawImage(video, 0, 0, canvas.width, canvas.height); - for (const face of (metadata.detectedFaces || [])) { - if (!face.contour || !face.contour.length) - continue; - canvasCtx.beginPath(); - canvasCtx.moveTo(face.contour[0].x, face.contour[0].y); - for (const point of face.contour.slice(1)) - canvasCtx.lineTo(point.x, point.y); - canvasCtx.closePath(); - canvasCtx.strokeStyle = 'red'; - canvasCtx.stroke(); - } - videoElement.requestMediaStreamVideoFrameMetadataCallback(updateCanvas); -} - -// Check if face detection is supported by the browser -const supports = navigator.mediaDevices.getSupportedConstraints(); -if (supports.faceDetectionMode) { - // Browser supports face contour detection. -} else { - throw('Face contour detection is not supported'); -} - -// Open camera with face detection enabled -const stream = await navigator.mediaDevices.getUserMedia({ - video: { faceDetectionMode: ['contour', 'bounding-box'] } -}); - -// Show to user. -const canvasElement = document.querySelector("canvas"); -const canvasCtx = canvasElement.getContext("2d"); -const videoElement = document.querySelector("video"); -videoElement.srcObject = new MediaStream([videoGenerator]); -videoElement.onloadedmetadata = event => videoElement.play(); -videoElement.requestMediaStreamVideoFrameMetadataCallback(updateCanvas); -``` - ## Stakeholder Feedback / Opposition [Implementors and other stakeholders may already have publicly stated positions on this work. If you can, list them here with links to evidence as appropriate.] -- [Firefox] : No signals +- [[Firefox](https://github.com/mozilla/standards-positions/issues/706)] : No signals - [Safari] : No signals [If appropriate, explain the reasons given by other implementors for their concerns.] -## References & acknowledgements +## Acknowledgements Many thanks for valuable feedback and advice from: - Bernard Aboba - Harald Alvestrand - Jan-Ivar Bruaroey +- Dale Curtis - Youenn Fablet - Dominique Hazael-Massieux +- Chris Needham +- Tim Panton +- Dan Sanders +- ...and anyone we missed ## Disclaimer diff --git a/index.html b/index.html index fdcd059..24f597a 100644 --- a/index.html +++ b/index.html @@ -9,7 +9,7 @@ // See https://github.com/w3c/respec/wiki/ for how to configure ReSpec var respecConfig = { group: "webrtc", - xref: ["html", "infra", "permissions", "dom", "mediacapture-streams", "webaudio", "webidl"], + xref: ["geometry-1", "html", "infra", "permissions", "dom", "image-capture", "mediacapture-streams", "webaudio", "webcodecs", "webidl"], edDraftURI: "https://w3c.github.io/mediacapture-extensions/", editors: [ {name: "Jan-Ivar Bruaroey", company: "Mozilla Corporation", w3cid: 79152}, @@ -654,5 +654,327 @@

Exposing change of MediaStreamTrack configuration

+
+

Human face segmentation

+

Human face metadata describes the geometrical information of human faces in video frames. It can be set by web applications when they construct {{VideoFrameMetadata}} for new {{VideoFrame}}s, or it can be set by a user agent when the media track constraint defined below is used to enable face detection for the {{MediaStreamTrack}} which provides the {{VideoFrame}}s.

+

The facial metadata can be used by video encoders to enhance the quality + of the faces in encoded video streams or for other suitable purposes.

+
+

{{VideoFrameMetadata}}

+
+partial dictionary VideoFrameMetadata {
+  sequence<Segment> segments;
+};
+
+

Members

+
+
segments of type sequence<{{Segment}}>
+
+

The set of known geometrical segments in the video frame.

+
+
+
+
+
+

{{Segment}}

+
+dictionary Segment {
+  required SegmentType type;
+  required long        id;
+  long                 partOf;
+  required float       probability;
+  Point2D              centerPoint;
+  DOMRectReadOnly      boundingBox;
+};
+
+enum SegmentType {
+  "human-face",
+  "left-eye",
+  "right-eye",
+  "eye",
+  "mouth",
+};
+
+

Dictionary {{Segment}} Members

+
+
type of type {{SegmentType}}
+
+

The type of object which this segment describes.

+

It must be one of the following values:

+
+
human-face
+
+

The segment describes a human face.

+
+
left-eye
+
+

The segment describes the person's left eye (oculus sinister).

+
+
right-eye
+
+

The segment describes the person's right eye (oculus dexter).

+
+
eye
+
+

The segment describes an eye, either left or right.

+
+
mouth
+
+

The segment describes a mouth.

+
+
+
+
id of type {{long}}
+
+

An identifier of the object described by the segment, unique + within a sequence. If the same object can be tracked over multiple + frames originating from the same {{MediaStreamTrack}} source or it + can be matched to correspond to the same object in + {{MediaStreamTrack}}s which are cloned from the same original + {{MediaStreamTrack}}, the user agent SHOULD use the same + {{Segment/id}} for the segments which describe the object. + {{Segment/id}} is also used in conjunction with the member + {{Segment/partOf}}.

+

The user agent MUST NOT assign values to {{Segment/id}} in such a way that detected objects could be correlated across different {{MediaStreamTrack}} sources, unless the {{MediaStreamTrack}}s are cloned from the same original {{MediaStreamTrack}}.

+
+
partOf of type {{long}}
+
+

If defined, references another segment which has the member {{Segment/id}} set to the same value. The referenced segment corresponds to an object of which the object described by this segment is a part.

+

If undefined, the object described by this segment is not known to be part of any other object + described by any segment associated with the {{MediaStreamTrack}}.

+
+
probability of type {{float}}
+
+

If nonzero, this is an estimate of the conditional probability that the segmented object actually is of the type indicated by the {{Segment/type}} member, given that the detection has been made. The value of this member MUST be at least zero and at most one. The special value of zero indicates that no probability estimate is available.

+
+
centerPoint of type {{Point2D}}
+

The coordinates of the approximate center of the object described by this {{Segment}}. + The object location in the frame can be specified even if it is obscured by other objects in + front of it or it lies partially or fully outside of the frame.

+

The {{Point2D/x}} and {{Point2D/y}} values of the point are interpreted to represent a coordinate in a normalized square space. The origin of coordinates (x, y) = (0.0, 0.0) represents the upper left corner whereas (x, y) = (1.0, 1.0) represents the lower right corner relative to the rendered frame: the x-coordinate (columns) increases rightwards and the y-coordinate (rows) increases downwards.

+
+
boundingBox of type {{DOMRectReadOnly}}
+
+

A bounding box surrounding the object described by this segment.

+

The object bounding box in the frame can be specified even if it is + obscured by other objects in front of it or it lies partially or + fully outside of the frame.

+

See the member {{Segment/centerPoint}} for the definition of the + coordinate system. +

+
+
+
+
+
+
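Besides being set by the user agent, the segmentation metadata can be supplemented by a web application itself, for example after running its own detector. The sketch below illustrates this under two assumptions: that the {{VideoFrame}} constructor accepts a metadata member in its init dictionary (as in current WebCodecs drafts), and that runMyFaceDetector() is a hypothetical application-provided function returning a bounding box in the normalized coordinates defined above.
+// A minimal sketch: attach an application-computed face segment to a frame.
+// Assumes VideoFrameInit.metadata as in current WebCodecs drafts;
+// runMyFaceDetector() is hypothetical and returns {x, y, width, height}
+// in normalized [0, 1] coordinates.
+async function addFaceMetadata(frame) {
+  const box = await runMyFaceDetector(frame);
+  const segments = [{
+    type: 'human-face',
+    id: 1,
+    probability: 0.9,
+    boundingBox: DOMRectReadOnly.fromRect(box)
+  }];
+  // Metadata of an existing frame cannot be modified, so a new frame is created.
+  const annotated = new VideoFrame(frame, {
+    timestamp: frame.timestamp,
+    metadata: {...frame.metadata(), segments}
+  });
+  frame.close();
+  return annotated;
+}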

{{MediaTrackSupportedConstraints}} dictionary extensions

+
+partial dictionary MediaTrackSupportedConstraints {
+  boolean humanFaceDetectionMode = true;
+};
+
+

Dictionary {{MediaTrackSupportedConstraints}} Members

+
+
humanFaceDetectionMode of type {{boolean}}, defaulting to true
+
See + humanFaceDetectionMode for details.
+
+
+
+
+
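For example, an application can check whether the user agent recognizes the constraint at all before requesting face detection (a minimal sketch):
+// Feature check: is the humanFaceDetectionMode constraint known to the user agent?
+const supported = navigator.mediaDevices.getSupportedConstraints();
+if (!supported.humanFaceDetectionMode) {
+  console.log('humanFaceDetectionMode constraint is not supported');
+}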

{{MediaTrackCapabilities}} dictionary extensions

+
+partial dictionary MediaTrackCapabilities {
+  sequence<DOMString> humanFaceDetectionMode;
+};
+
+

Dictionary {{MediaTrackCapabilities}} Members

+
+
humanFaceDetectionMode of type sequence<{{DOMString}}>
+
+

The sequence of supported face detection modes. + Each string MUST be one of the members of {{HumanFaceDetectionModeEnum}}. + See + humanFaceDetectionMode for additional details.

+
+
+
+
+
+

{{MediaTrackConstraintSet}} dictionary extensions

+
+partial dictionary MediaTrackConstraintSet {
+  ConstrainDOMString humanFaceDetectionMode;
+};
+
+

Dictionary {{MediaTrackConstraintSet}} Members

+
+
humanFaceDetectionMode of type {{ConstrainDOMString}}
+
See + humanFaceDetectionMode for details.
+ +
+
+
+
+

{{MediaTrackSettings}} dictionary extensions

+
+partial dictionary MediaTrackSettings {
+  DOMString humanFaceDetectionMode;
+};
+
+

Dictionary {{MediaTrackSettings}} Members

+
+
humanFaceDetectionMode of type {{DOMString}}
+
See + humanFaceDetectionMode for details.
+ +
+
+
+
+

{{HumanFaceDetectionModeEnum}}

+
+enum HumanFaceDetectionModeEnum {
+  "none",
+  "bounding-box",
+  "bounding-box-with-landmark-center-point",
+};
+
+

{{HumanFaceDetectionModeEnum}} Enumeration Description

+
+
none
+
+

This {{MediaStreamTrack}} source does not set human face or human face landmark metadata in the {{VideoFrameMetadata}} of {{VideoFrame}}s, that is, it does not add any {{Segment}} which has {{Segment/type}} set to any of the alternatives listed in the {{SegmentType}} enumeration. As an input, this is interpreted as a command to turn off the setting of human face and face landmark metadata.

+
+
bounding-box
+
+

This source sets metadata related to human faces (segment type of {{SegmentType/"human-face"}}) including bounding box information in the member {{Segment/boundingBox}} of each {{Segment}} related to a detected face. The source does not set human face landmark information. As an input, this is interpreted as a command to enable human face detection and to set the bounding box of each detected face in the metadata.

+
+
bounding-box-with-landmark-center-point
+
+

With this setting, the source sets a superset of the metadata compared to the {{HumanFaceDetectionModeEnum/"bounding-box"}} setting: it sets the same metadata and additionally metadata related to human face landmarks (all other {{SegmentType}}s except {{SegmentType/"human-face"}}), including center point information in the member {{Segment/centerPoint}} of each {{Segment}} related to a detected landmark. As an input, this is interpreted as a command to enable human face and face landmark detection, to set bounding box information in the face segment metadata, and to set the center point of each detected face landmark.

+
+
+
+
+
+
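The following sketch shows how an application could negotiate the most detailed mode that an existing track supports, falling back from landmark detection to plain bounding boxes; the preference order is the application's own choice, not mandated by this specification.
+// Pick the most detailed supported face detection mode and apply it.
+async function enableFaceDetection(videoTrack) {
+  const modes = videoTrack.getCapabilities().humanFaceDetectionMode || [];
+  // Application preference: landmark center points if available, else bounding boxes.
+  const preferred = ['bounding-box-with-landmark-center-point', 'bounding-box']
+      .find(mode => modes.includes(mode));
+  if (!preferred) {
+    return 'none';  // The track cannot provide face metadata.
+  }
+  await videoTrack.applyConstraints({humanFaceDetectionMode: preferred});
+  return videoTrack.getSettings().humanFaceDetectionMode;
+}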

Constrainable Properties

+

The constrainable properties in this section are defined below.

+ + + + + + + + + + + + + + + +
Property NameValuesNotes
+ humanFaceDetectionMode{{ConstrainDOMString}} +

This string (or each string, when a list) should be one of the + members of {{HumanFaceDetectionModeEnum}}.

+

As a {{MediaStreamTrack}} constraint, its value allows the application to express which face and face landmark detection mode, if any, is preferred.

+

As a setting, this reflects which face geometrical + properties the user agent detects and sets in the metadata of the + {{VideoFrame}}s obtained from the track.

+
+
+
+

Examples

+
+// main.js:
+// Open camera with face detection enabled
+const stream = await navigator.mediaDevices.getUserMedia({
+  video: { humanFaceDetectionMode: 'bounding-box' }
+});
+const [videoTrack] = stream.getVideoTracks();
+if (videoTrack.getSettings().humanFaceDetectionMode !== 'bounding-box') {
+  throw new Error('Face bounding box detection is not supported');
+}
+
+// Use a video worker and show to user.
+const videoElement = document.querySelector('video');
+const videoWorker = new Worker('video-worker.js');
+videoWorker.postMessage({track: videoTrack}, [videoTrack]);
+const {data} = await new Promise(r => videoWorker.onmessage = r);
+videoElement.srcObject = new MediaStream([data.videoTrack]);
+await videoElement.play();
+
+// video-worker.js:
+self.onmessage = async ({data: {track}}) => {
+  const generator = new VideoTrackGenerator();
+  self.postMessage({videoTrack: generator.track}, [generator.track]);
+  const {readable} = new MediaStreamTrackProcessor({track});
+  const transformer = new TransformStream({
+    async transform(frame, controller) {
+      for (const segment of frame.metadata().segments || []) {
+        if (segment.type === 'human-face') {
+          // the metadata is coming directly from the video track with
+          // bounding-box face detection enabled
+          console.log(
+            `Face @ (${segment.boundingBox.left}, ${segment.boundingBox.top}), ` +
+                   `(${segment.boundingBox.right}, ${segment.boundingBox.bottom})`);
+        }
+      }
+      controller.enqueue(frame);
+    }
+  });
+  await readable.pipeThrough(transformer).pipeTo(generator.writable);
+};
+      
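As a further illustration of the normalized coordinate space, the sketch below converts the face bounding boxes and landmark center points of a single frame into pixel coordinates and draws them on a canvas; the canvas and its 2D context are assumed to be set up elsewhere by the application.
+// Draw face bounding boxes and landmark center points of one frame.
+// ctx is a CanvasRenderingContext2D whose canvas covers the rendered frame.
+function drawSegments(frame, ctx) {
+  const {width, height} = ctx.canvas;
+  for (const segment of frame.metadata().segments || []) {
+    if (segment.type === 'human-face' && segment.boundingBox) {
+      const b = segment.boundingBox;
+      ctx.strokeStyle = 'red';
+      // Normalized [0, 1] coordinates are scaled to canvas pixels.
+      ctx.strokeRect(b.x * width, b.y * height, b.width * width, b.height * height);
+    } else if (segment.centerPoint) {
+      ctx.fillStyle = 'yellow';
+      // Landmark drawn as a small dot around its center point.
+      ctx.fillRect(segment.centerPoint.x * width - 2,
+                   segment.centerPoint.y * height - 2, 4, 4);
+    }
+  }
+}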
+
+