The UMP format is used by YouTube for a number of requests and responses. This document details the format for the purposes of interoperability.
The UMP format uses variable size integers in a number of places. These are implemented very similarly to variable length integers in RFC8794, with a slight variation in the case of 5-byte integers.
The first 5 bits of the first byte set the size of the integer:
- If the top bit is unset, it's a 1-byte value.
- If the top bit is set, but the next bit is not, it's a 2-byte value.
- If the top two bits are set, but the next bit is not, it's a 3-byte value.
- If the top three bits are set, but the next bit is not, it's a 4-byte value.
- If the top four bits are set, but the next bit is not, it's a 5-byte value.
- If all top five bits are set, the integer is invalid.
Getting the size of the integer from the first byte can be implemented as follows:
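#include <assert.h>
#include <stdint.h>

// Returns the encoded size in bytes (1 to 5) indicated by the first byte.
int getVarIntSize(uint8_t b)
{
    int size = 0;
    // Find the first unset bit among the top five bits.
    for (int shift = 1; shift <= 5; shift++)
    {
        if ((b & (128 >> (shift - 1))) == 0)
        {
            size = shift;
            break;
        }
    }
    // size is still 0 if all five top bits were set (invalid prefix).
    assert(size >= 1 && size <= 5);
    return size;
}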
The remaining bits in the first byte form the low-order bits of the integer, except in 5-byte integers, where those bits are ignored.
The variable integer decoding can be implemented as follows:
// Reads one byte from buf at *pos and advances *pos.
uint8_t readNextByte(const uint8_t* buf, int* pos)
{
    int ofs = *pos;
    *pos = ofs + 1;
    return buf[ofs];
}

// Decodes the variable size integer that starts at buf[ofs].
uint32_t readVarInt(const uint8_t* buf, int ofs)
{
    int pos = ofs;
    uint8_t prefix = readNextByte(buf, &pos);
    int size = getVarIntSize(prefix);

    // Read the trailing bytes as a little-endian value, one byte per
    // statement. C does not define the evaluation order of the operands
    // of |, so several reads inside one expression could be reordered.
    uint32_t rest = 0;
    for (int i = 0; i < size - 1; i++)
    {
        rest |= (uint32_t)readNextByte(buf, &pos) << (8 * i);
    }

    switch (size)
    {
        case 1:
            return prefix;
        case 2:
            // The low 6 bits come from the prefix.
            return (rest << 6) | (prefix & 0x3F);
        case 3:
            // The low 5 bits come from the prefix.
            return (rest << 5) | (prefix & 0x1F);
        case 4:
            // The low 4 bits come from the prefix.
            return (rest << 4) | (prefix & 0x0F);
        default:
            // 5-byte case: the prefix bits are ignored.
            return rest;
    }
}
Note that the 5-byte integer case behaves differently, ignoring the bottom 3 bits in the first byte entirely, and just reading a 32-bit little-endian integer from the next four bytes. In the RFC8794 standard, a 5-byte integer includes the bottom 3 bits of the first byte, producing a 35-bit integer. Presumably this deviation was made so that the result could be neatly stored in a 32-bit integer, rather than needing to promote variables to 64-bit everywhere.
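For testing a decoder it can help to have the inverse operation. The following is a minimal sketch obtained by inverting the decoding rules above; it is not taken from any YouTube implementation. The name `writeVarInt` and the choice to always emit the shortest encoding are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not from any YouTube code): writes value to out using
// the shortest encoding and returns the number of bytes written (1 to 5).
// out must have room for at least 5 bytes.
size_t writeVarInt(uint32_t value, uint8_t* out)
{
    if (value < (1u << 7))
    {
        // 0xxxxxxx
        out[0] = (uint8_t)value;
        return 1;
    }
    if (value < (1u << 14))
    {
        // 10xxxxxx + 1 byte: the low 6 bits live in the prefix
        out[0] = (uint8_t)(0x80 | (value & 0x3F));
        out[1] = (uint8_t)(value >> 6);
        return 2;
    }
    if (value < (1u << 21))
    {
        // 110xxxxx + 2 bytes (little-endian): the low 5 bits live in the prefix
        out[0] = (uint8_t)(0xC0 | (value & 0x1F));
        out[1] = (uint8_t)(value >> 5);
        out[2] = (uint8_t)(value >> 13);
        return 3;
    }
    if (value < (1u << 28))
    {
        // 1110xxxx + 3 bytes (little-endian): the low 4 bits live in the prefix
        out[0] = (uint8_t)(0xE0 | (value & 0x0F));
        out[1] = (uint8_t)(value >> 4);
        out[2] = (uint8_t)(value >> 12);
        out[3] = (uint8_t)(value >> 20);
        return 4;
    }
    // 11110xxx + 4 bytes (little-endian): the prefix payload bits are unused
    out[0] = 0xF0;
    out[1] = (uint8_t)value;
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)(value >> 16);
    out[4] = (uint8_t)(value >> 24);
    return 5;
}
```

Under these rules the value 300 encodes as the bytes AC 04, and 2,500,000 encodes as E0 5A 62 02.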
The UMP format requires that the Content-Type header in the HTTP response contains "application/vnd.yt-ump".
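A client might guard its parser with a check along these lines. This is only a sketch: `isUmpContentType` is a hypothetical name, and the case-sensitive substring match is a simplifying assumption (HTTP media types are case-insensitive, so a stricter client would normalise the header value first).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: returns true if a Content-Type header value contains
// the UMP media type. Assumes the value is already lowercase.
bool isUmpContentType(const char* contentType)
{
    return contentType != NULL
        && strstr(contentType, "application/vnd.yt-ump") != NULL;
}
```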
A UMP response is split into parts. Each part is prefixed by a pair of variable length integers, the first being the part type and the second being the part payload length. In pseudo-C, it'd look like this:
struct UmpPart
{
varInt type;
varInt size;
uint8_t data[size];
};
Note that you must treat the type field as a variable length integer. The current type numbers are all below 128, which will produce a single-byte encoding, but if you read the type ID as a single byte instead of properly decoding it as a variable length integer your implementation will break if/when new part types are added.
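To make the point about the type field concrete, here is a hedged sketch of reading a part header with both fields decoded as variable length integers. The names `UmpPartHeader`, `umpReadVarInt` and `umpReadPartHeader` are invented for this example, and the bounds checking is an addition that the decoding code above does not perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Decodes one variable size integer from buf[0..len). Returns the number of
// bytes consumed, or 0 if the prefix is invalid or the buffer is too short.
static size_t umpReadVarInt(const uint8_t* buf, size_t len, uint32_t* out)
{
    if (len == 0) return 0;
    uint8_t prefix = buf[0];
    size_t size = 1;
    // Count leading set bits among the top five bits of the prefix.
    while (size <= 5 && (prefix & (0x80 >> (size - 1))) != 0) size++;
    if (size > 5 || len < size) return 0;

    // Trailing bytes are little-endian.
    uint32_t rest = 0;
    for (size_t i = 1; i < size; i++)
        rest |= (uint32_t)buf[i] << (8 * (i - 1));

    if (size == 1) *out = prefix;
    else if (size == 5) *out = rest; // prefix bits ignored
    else *out = (rest << (8 - size)) | (prefix & (0xFFu >> size));
    return size;
}

typedef struct
{
    uint32_t type;       // part type, decoded as a varint
    uint32_t size;       // declared payload length in bytes
    size_t headerLength; // bytes occupied by the two varints
} UmpPartHeader;

// Returns true and fills *header on success; returns false if the buffer is
// too short to hold both varints or a prefix is invalid.
bool umpReadPartHeader(const uint8_t* buf, size_t len, UmpPartHeader* header)
{
    size_t typeLen = umpReadVarInt(buf, len, &header->type);
    if (typeLen == 0) return false;
    size_t sizeLen = umpReadVarInt(buf + typeLen, len - typeLen, &header->size);
    if (sizeLen == 0) return false;
    header->headerLength = typeLen + sizeLen;
    return true;
}
```

With this approach a hypothetical future part type of 300 (encoded as AC 04) is read correctly, whereas a single-byte read would misinterpret it as type 172.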
Each HTTP response starts with a media header part (type 20) followed by any number of other parts.
Parts are not guaranteed to be wholly contained within one response payload. It is quite common to find that a part's length exceeds the length of the HTTP response payload. This means that the part will continue in the next response payload.
If a response payload is self-contained, i.e. it does not end with a partial part, the last part in the buffer will end precisely at the end of the buffer. Typically the last part will be MEDIA_END (type 22) with a single null byte payload, but this is not guaranteed.
If a payload is not self-contained, i.e. its final part has a length exceeding the amount of remaining data in the buffer, its data will continue in the next response payload. In such a case, the next payload will start with a media header part (type 20) followed by a part of the same type as the partial one, whose data is a continuation of the partial part from the previous payload. Parts can be split up over an arbitrary number of response payloads.
This is a little hard to picture, so here's an example. Let's say you've got a part with a length of 2,500,000 bytes, and each response payload can be a maximum of 1 MB (this is just for example; in practice there is no such hard limit). The resulting response payloads will look something like this:
response 1:
    part 20 (media header)
        size=...
        data=...
    part 21 (media data)
        size=2500000
        data=... (len=1000000)
response 2:
    part 20 (media header)
        size=...
        data=...
    part 21 (media data)
        size=1500000
        data=... (len=1000000)
response 3:
    part 20 (media header)
        size=...
        data=...
    part 21 (media data)
        size=500000
        data=... (len=500000)
    part 22 (MEDIA_END)
        size=1
        data=00
This gets decoded as a single type 21 part of size 2,500,000 bytes, followed by a type 22 part. Note that there could be a different part type in response 3, after part 21, instead of MEDIA_END. The end of a part is determined solely by all of its data being read; the next part type is irrelevant.
Once a partial part begins, responding with a different part type (e.g. sending a partial part 22, then following up with a part 32 before sending the rest of the first part) has undefined behaviour. As far as I could tell from the implementations, error handling is variable here. Some will throw an exception, but some appear to blindly accept the data as a continuation of the data, even if the part type ID is wrong. Fun! I would recommend being rigorous in checking for the correct type when decoding partial parts.
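One way to be rigorous here is to remember the type and outstanding byte count of the incomplete part, then validate the header of the part that follows the type 20 part in the next response. The sketch below is an illustration only: all names are hypothetical, and the size check assumes that a continuation part declares exactly the number of bytes still outstanding, which is inferred from the example above rather than confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    bool active;        // true while a part is waiting for more data
    uint32_t type;      // type of the incomplete part
    uint32_t remaining; // payload bytes still to come
} UmpPendingPart;

typedef enum
{
    UMP_CONT_OK,
    UMP_CONT_UNEXPECTED, // no part was pending
    UMP_CONT_WRONG_TYPE, // part type differs from the pending part
    UMP_CONT_WRONG_SIZE  // declared size differs from the outstanding bytes
} UmpContResult;

// Call when a response ends: declaredSize is the size from the last part's
// header, received is how many payload bytes the response actually held.
void umpBeginPending(UmpPendingPart* p, uint32_t type,
                     uint32_t declaredSize, uint32_t received)
{
    p->active = received < declaredSize;
    p->type = type;
    p->remaining = p->active ? declaredSize - received : 0;
}

// Call with the header of the candidate continuation part.
UmpContResult umpCheckContinuation(const UmpPendingPart* p,
                                   uint32_t type, uint32_t size)
{
    if (!p->active) return UMP_CONT_UNEXPECTED;
    if (type != p->type) return UMP_CONT_WRONG_TYPE;
    if (size != p->remaining) return UMP_CONT_WRONG_SIZE; // assumption, see above
    return UMP_CONT_OK;
}
```

A decoder that wants to tolerate servers behaving differently could downgrade the size mismatch to a warning while still rejecting a type mismatch.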
You can read UMP parts with a state machine:
1. Read a response payload in chunks.
2. If the amount of data remaining in the buffer is zero, go back to step 1.
3. Read the UMP part type as a variable sized integer.
4. Read the UMP part size as a variable sized integer.
5. If the part size is zero, decode the part as a zero-length part (i.e. no payload), passing in an empty buffer, then go back to step 2.
6. If the part size is less than or equal to the remaining buffer size, decode the part from that slice of the buffer, then increment the position within the buffer and go back to step 2.
7. If the part size is greater than the remaining buffer size, keep a copy of the remaining buffer data, read the next buffer, stitch the two together, and continue parsing from step 5.
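The steps above can be sketched as a pull-style function that yields one complete part per call. This is an illustration, not a reference implementation: the names are hypothetical, and for simplicity it asks the caller to retry from the part header after appending more data, rather than resuming in the middle of a part as the final step describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { UMP_PART_OK, UMP_NEED_MORE_DATA, UMP_INVALID } UmpStatus;

typedef struct
{
    uint32_t type;
    uint32_t size;
    const uint8_t* data; // points into the caller's buffer, not a copy
} UmpPart;

// Decodes a variable size integer at buf[*pos], advancing *pos on success.
static UmpStatus umpVarInt(const uint8_t* buf, size_t len, size_t* pos, uint32_t* out)
{
    if (*pos >= len) return UMP_NEED_MORE_DATA;
    uint8_t prefix = buf[*pos];
    size_t size = 1;
    while (size <= 5 && (prefix & (0x80 >> (size - 1))) != 0) size++;
    if (size > 5) return UMP_INVALID;
    if (len - *pos < size) return UMP_NEED_MORE_DATA;

    uint32_t rest = 0;
    for (size_t i = 1; i < size; i++)
        rest |= (uint32_t)buf[*pos + i] << (8 * (i - 1));

    if (size == 1) *out = prefix;
    else if (size == 5) *out = rest; // prefix bits ignored
    else *out = (rest << (8 - size)) | (prefix & (0xFFu >> size));
    *pos += size;
    return UMP_PART_OK;
}

// Tries to read one complete part starting at buf[*pos].
// UMP_PART_OK: *part is filled and *pos is advanced past the part.
// UMP_NEED_MORE_DATA: *pos is unchanged; keep buf[*pos..len), append the
// next chunk, and call again.
UmpStatus umpNextPart(const uint8_t* buf, size_t len, size_t* pos, UmpPart* part)
{
    size_t p = *pos;
    uint32_t type, size;
    UmpStatus s = umpVarInt(buf, len, &p, &type);
    if (s != UMP_PART_OK) return s;
    s = umpVarInt(buf, len, &p, &size);
    if (s != UMP_PART_OK) return s;
    if (size > len - p) return UMP_NEED_MORE_DATA;

    part->type = type;
    part->size = size; // may be zero, in which case data is an empty slice
    part->data = buf + p;
    *pos = p + size;
    return UMP_PART_OK;
}
```

A caller would loop on `umpNextPart` until it returns `UMP_NEED_MORE_DATA`, then move the unread tail to the front of its buffer before reading the next chunk.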
The following part types have been observed.
OnesieHeader. Possibly older format, deprecated?
Unknown format, but probably protobufs.
If present, must be preceded by OnesieHeader (part 10). Possibly older format, deprecated?
Unknown format, but probably protobufs.
Unknown format, but probably protobufs.
Present at the start of all known UMP response payloads.
The payload is protobufs. So far I've been decoding this manually with https://protobuf-decoder.netlify.app/
- Field 1 is a varint and is always zero.
- Field 2 is the video ID as a string.
- Field 3 is a varint which matches the itag URL parameter.
- Field 4 is a varint that matches the lmt URL parameter, which seems to be a microsecond timestamp related to the last modified time of the stream (likely the time at which encoding completed for a given video stream).
- Field 6 is a varint that is probably the start of the data range.
- Field 13 is a protobuf message with two fields:
  - Field 1 is the itag.
  - Field 2 is the lmt for the itag.
- Field 14 is the content length.
Field 14 probably represents the size of the media chunk being sent. If the payload does not end with a partial part, this number has always been observed to match the size of the data in the MEDIA part, not including the null byte prefix. When there is a partial part, the number tends to be a fair bit bigger, possibly representing the size of the entire video or chapter. When the video is a livestream, this value is missing, since YouTube has no idea when the stream will end.
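As an alternative to decoding by hand, the varint fields can be pulled out with a small scanner over the standard protobuf wire format. The sketch below is an assumption-laden illustration: the function names are hypothetical, only top-level fields are examined (so the nested message in field 13 is skipped rather than descended into), and the field numbers passed in are simply those observed above, not an official schema.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Reads a protobuf base-128 varint at buf[*pos], advancing *pos.
// Returns false if the data is truncated or the varint exceeds 10 bytes.
static bool pbReadVarint(const uint8_t* buf, size_t len, size_t* pos, uint64_t* out)
{
    uint64_t value = 0;
    for (int shift = 0; shift < 64 && *pos < len; shift += 7)
    {
        uint8_t b = buf[(*pos)++];
        value |= (uint64_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
        {
            *out = value;
            return true;
        }
    }
    return false;
}

// Scans the top level of a protobuf message for the first varint field with
// the given field number. Returns false if absent or the data is malformed.
bool pbFindVarintField(const uint8_t* buf, size_t len,
                       uint32_t fieldNumber, uint64_t* out)
{
    size_t pos = 0;
    while (pos < len)
    {
        uint64_t key, value;
        if (!pbReadVarint(buf, len, &pos, &key)) return false;
        switch (key & 7) // low three bits of the key are the wire type
        {
        case 0: // varint
            if (!pbReadVarint(buf, len, &pos, &value)) return false;
            if ((key >> 3) == fieldNumber)
            {
                *out = value;
                return true;
            }
            continue;
        case 1: // fixed 64-bit: skip 8 bytes
            value = 8;
            break;
        case 2: // length-delimited: skip the stated number of bytes
            if (!pbReadVarint(buf, len, &pos, &value)) return false;
            break;
        case 5: // fixed 32-bit: skip 4 bytes
            value = 4;
            break;
        default: // groups and unknown wire types are not handled
            return false;
        }
        if (value > len - pos) return false;
        pos += (size_t)value;
    }
    return false;
}
```

For instance, `pbFindVarintField(payload, payloadLength, 14, &contentLength)` would fetch the content length when it is present, and return false for livestreams where the field is missing.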
Contains the actual media itself. Starts with a single null byte, followed by the media data.
If you pull the data out of these and into a webm file, you can play them with VLC!
Terminator part. Usually included at the end of payloads that do not end with a partial part.
Size is usually 1, with the data being a single null byte.
Also referred to as "SABR Live Metadata", and is related to "SABR Live Protocols".
Possibly streaming related?
Unknown format.
SABR Live Metadata Promise, related to "SABR Live Protocols".
Possibly streaming related?
Unknown format.
Cancellation of a SABR Live Metadata Promise (originally sent as part 33), related to "SABR Live Protocols".
Possibly streaming related?
Unknown format.
Unknown purpose and format.
Probably related to signalling the media format information for livestreams.
Unknown format.
Format selection config. Related to the user changing format preferences (e.g. force 1080p).
Unknown format.
Clearly streaming related but not sure what this is for.
Unknown format.
Not yet observed.
Not yet observed.
Not yet observed.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Likely tells the player that the page needs to be reloaded.
Unknown format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
Unknown purpose and format.
All information published in this document is hereby released in the public domain.
Thanks to the folks from #invidious on libera.chat for their invaluable assistance and insight.
Additional thanks to the maintainers of Invidious, Piped, youtube-dl, yt-dlp, uBlock Origin, and Sponsorblock for all their hard work in making YouTube a far more tolerable platform to interact with.