Large keyframes can result in consistent freezes on every keyframe #101
I was watching a Jackbox stream on Glimesh and noticed this issue reproducing. What's interesting is that beyond the initial freeze, you briefly get some weird jumbled frames in between that are really hard to notice (kudos to @danstiner for pointing this out 😁). Timer is of format
Looked deeper into this; I haven't solved the issue completely, but I'm convinced part of it is borked congestion handling in ftl-sdk.
I think this is because the SDK is trying to be clever and do congestion control in a way that seems reasonable (a leaky bucket based on peak bandwidth), but for several reasons the approach is broken. There are at least three issues:
I instrumented a bit so you can see the pattern:
A test fix for issue 3 is simple:

```diff
diff --git a/libftl/ftl_private.h b/libftl/ftl_private.h
index 0606052..4e26bb3 100644
--- a/libftl/ftl_private.h
+++ b/libftl/ftl_private.h
@@ -69,7 +69,7 @@
#define NACK_RTT_AVG_SECONDS 5
#define MAX_STATUS_MESSAGE_QUEUED 10
#define MAX_FRAME_SIZE_ELEMENTS 64 //must be a minimum of 3
-#define MAX_XMIT_LEVEL_IN_MS 100 //allows a maximum burst size of 100ms at the target bitrate
+#define MAX_XMIT_LEVEL_IN_MS 1000 //allows a maximum burst size of 1000ms at the target bitrate
#define VIDEO_RTP_TS_CLOCK_HZ 90000
#define AUDIO_SAMPLE_RATE 48000
#define AUDIO_PACKET_DURATION_MS 20
diff --git a/libftl/media.c b/libftl/media.c
index 8bfe0f1..6033d8c 100644
--- a/libftl/media.c
+++ b/libftl/media.c
@@ -1257,6 +1257,7 @@ OS_THREAD_ROUTINE video_send_thread(void *data)
if (transmit_level <= 0) {
ftl->video.media_component.stats.bw_throttling_count++;
+ FTL_LOG(ftl, FTL_LOG_INFO, "Sleeping for %d ms; transmit_level:%d", MAX_MTU / bytes_per_ms + 1, transmit_level);
sleep_ms(MAX_MTU / bytes_per_ms + 1);
_update_xmit_level(ftl, &transmit_level, &start_tv, bytes_per_ms);
}
@@ -1308,13 +1309,17 @@ static void _update_xmit_level(ftl_stream_configuration_private_t *ftl, int *tra
gettimeofday(&stop_tv, NULL);
- *transmit_level += (int)timeval_subtract_to_ms(&stop_tv, start_tv) * bytes_per_ms;
+ int64_t elapsed_ms = timeval_subtract_to_ms(&stop_tv, start_tv);
- if (*transmit_level > (MAX_XMIT_LEVEL_IN_MS * bytes_per_ms)) {
- *transmit_level = MAX_XMIT_LEVEL_IN_MS * bytes_per_ms;
- }
+ if (elapsed_ms > 0) {
+ *transmit_level += (int)elapsed_ms * bytes_per_ms;
+
+ if (*transmit_level > (MAX_XMIT_LEVEL_IN_MS * bytes_per_ms)) {
+ *transmit_level = MAX_XMIT_LEVEL_IN_MS * bytes_per_ms;
+ }
- *start_tv = stop_tv;
+ *start_tv = stop_tv;
+ }
}
static int _update_stats(ftl_stream_configuration_private_t *ftl) {
```

After applying that I no longer get sleeps, but I still see picture loss. I did some additional testing, and as the ratio of keyframe size to average bitrate increases I start seeing more and more NACKs from the ingest, mostly for keyframes. So I'm semi-hopeful the remaining issue is somehow server-side, but it's been really hard to pin down what's happening.
I'd like to get a change like the above into OBS. It may not be enough to fix all the issues, but it is a nice small change that fixes at least one of the problems in ftl-sdk at low risk. Alternatively, we could tear out congestion control entirely; it doesn't seem to actually feed back to OBS to lower its bitrate, so at least for me it's not really helping anything. (I also tested removing it entirely and got the same behavior as the fixed implementation.) In any case, we should probably fully solve the freezing and be very confident in the change before opening a PR to the OBS folks.
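For anyone following along, here is a minimal sketch of how I understand the pacing scheme in ftl-sdk's `media.c` (names and constants are simplified and illustrative, not the actual SDK code). The key point is that the burst cap is only 100 ms worth of bytes, so a keyframe several times that size immediately drains the bucket and forces repeated sleeps mid-frame:

```c
/* Illustrative sketch of the leaky-bucket pacing in ftl-sdk's video send
 * path (simplified; names and constants are not the real SDK code). */
#include <stdint.h>

#define MAX_MTU              1392  /* illustrative packet size budget       */
#define MAX_XMIT_LEVEL_IN_MS 100   /* burst cap: 100 ms worth of bytes      */

typedef struct {
    int     transmit_level;   /* bytes of "credit" available to send now    */
    int     bytes_per_ms;     /* target bitrate expressed in bytes per ms   */
    int64_t last_refill_ms;   /* monotonic timestamp of the last refill     */
} pacer_t;

/* Refill credit for the time that has elapsed, capped at the burst limit. */
static void pacer_refill(pacer_t *p, int64_t now_ms) {
    int64_t elapsed = now_ms - p->last_refill_ms;
    if (elapsed <= 0)
        return;  /* clock went backwards or no time passed */
    p->transmit_level += (int)(elapsed * p->bytes_per_ms);
    if (p->transmit_level > MAX_XMIT_LEVEL_IN_MS * p->bytes_per_ms)
        p->transmit_level = MAX_XMIT_LEVEL_IN_MS * p->bytes_per_ms;
    p->last_refill_ms = now_ms;
}

/* Returns how long the caller should sleep before sending the next packet.
 * A keyframe much larger than the burst cap drains the bucket immediately,
 * so the sender ends up sleeping between almost every packet -- the freeze. */
static int pacer_charge(pacer_t *p, int packet_bytes) {
    p->transmit_level -= packet_bytes;
    if (p->transmit_level > 0)
        return 0;
    return MAX_MTU / p->bytes_per_ms + 1;
}
```

Raising the burst cap (as in the patch above) or dropping the charge/sleep step entirely both let a large keyframe go out in one burst instead of being dribbled out over hundreds of milliseconds.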
I think the original commit was microsoft/ftl-sdk@1a9e529
Pulled logs from Chrome's WebRTC stack. I see one of two patterns of errors happening with my above patch when streaming a mostly static but complex video scene. Pattern 1:
Pattern 2:
I also at one point saw the following from the ingest node:
Pattern 2 looks like packet loss, but I'm fairly sure my home VM lab has zero packet loss. Pattern 1 looks like either packet loss or, and I'm leaning towards this, somehow incorrect packetization of the video stream by ftl-sdk or OBS. The error messages at least give something to Google. I think the next step is to test ftl-sdk outside of OBS to see if the problem repros. If it does, the step after that is to instrument ftl-sdk and OBS to save both the raw video frames and the raw packets and start looking at how they line up.
Confirmed this happens even with zero packet loss or re-ordering on the edge/ingest nodes.
https://webrtchacks.com/video_replay/ looks interesting for debugging this from the perspective of the browser.
After instrumenting everything, I found at least one repro case where very large keyframes lead to bursts of packet loss between the edge <-> browser that caused stuttering. The issue is worse with the above patch that removes rate limiting from ftl-sdk. It was resolved by increasing the kernel's tunable receive/transmit buffers (see the socket-buffer sketch just below). Looking at the remaining issues, I know of several other major causes of stuttering, but I'm not sure any of them are consistent enough to explain this one:
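On the buffer fix above: the tunables I mean are the Linux `net.core.rmem_max` / `net.core.wmem_max` style sysctls that cap UDP socket buffers (my assumption of the exact knobs). As a hedged illustration of the application-side counterpart, this sketch just requests larger buffers on a UDP socket and reads back what the kernel actually granted; the size is arbitrary:

```c
/* Illustrative only: request larger UDP socket buffers and print what the
 * kernel actually granted. On Linux the grant is capped by sysctls such as
 * net.core.rmem_max / net.core.wmem_max, which is what had to be raised. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    int requested = 4 * 1024 * 1024;   /* 4 MiB, arbitrary example size */
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested));

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    /* If this prints far less than requested, the sysctl cap is the limit. */
    printf("SO_RCVBUF granted: %d bytes\n", granted);

    close(sock);
    return 0;
}
```

If the granted size comes back far smaller than requested, the kernel cap is the limiting factor and a burst the size of a large keyframe can overflow the buffer and get dropped.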
Think I've finally found the root cause of the consistent stuttering on every keyframe: it's actually client side, in the WebRTC stack. Note this specific issue only happens in low-motion streams; other forms of intermittent stuttering or complete loss of video are usually due to packet loss in my testing, which will be at least partially addressed by #95.

tl;dr: in this scenario the WebRTC stack in Chrome/Firefox/etc. is not waiting for all the data for a video frame to arrive before trying to render it, leading to skipped/corrupted frames.

For context, every frame in an RTP video stream is "packetized", or broken down into one or more packets. The larger the frame, the more packets, and the longer it takes for them all to reach the client. To account for this and other network delays, WebRTC has a jitter buffer that holds packets for a short time until it's time to render the frame. If not all packets are available when it's time to render the frame, you'll see either a pause in the video as the frame is skipped, or the renderer will try its best with missing information and you might see visual artifacts / loss of color.

Normally the jitter buffer works fine and a reasonable playout delay is calculated that is enough for all packets of a frame to arrive before WebRTC tries to render the frame. However, in my testing there is poor behavior when I-frames (keyframes) are significantly larger than P-frames (predicted/inter frames). For example, one test saw WebRTC calculate a delay of 100ms, but the keyframes were ~250KB for a 5mbps stream; with the rate limiting in ftl-sdk, a frame that size takes far longer than 100ms to deliver.

In my previous testing I was able to work around this by effectively removing the rate limiting in ftl-sdk; this let OBS spike the send rate and get the keyframe sent significantly faster, so it fit within the buffer delay calculated by WebRTC. This works only if the streamer and viewer have network bandwidth at least 3-4x more than the stream bitrate. That's more than many connections can handle, and big spikes in UDP traffic can actually lead to high packet loss that also causes stuttering/lost frames, so just increasing the send rate like I was trying in Glimesh/ftl-sdk#3 is not a slam-dunk fix.

In my testing there were many factors that make this issue more/less prevalent by affecting the ratio of I-frame to P-frame size. Encoder and network issues often seem like black magic, but I'll try to document what I found:
However, I think the real fix here has to be on the viewer side, with WebRTC itself (changes to encoder settings would also help, but are much more difficult for us to control and tune for all situations). While this could add 100-200ms of latency to the playout delay, in my view that is better than stuttering video. Once we have this solved, we can go back and look at optimal encoder settings or at changing the rate limiting in OBS to reduce that latency by reducing the time it takes for keyframes to be delivered. So there are a couple of potential next steps:
I'd still like to come back to the other causes of stuttering I outlined above, but separately from this issue. For example, #95 covers adding back NACKs to help with packet loss on the streamer side.

Here's a sample clip that can be used to reproduce the stuttering behavior. Importantly, it's a mostly static image that is difficult to compress, so keyframes are very large, but P-frames are small because there is almost no motion. The clock helps you see when stuttering happens (there is no stuttering in this clip; it's a source clip to reproduce the issue, not a capture of what the viewer sees).
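To make the playout-delay mismatch described above concrete, here is the back-of-the-envelope arithmetic for the ~250KB keyframe / 5mbps / 100ms example (the numbers come from the comment above; the code is just a worked example):

```c
/* Back-of-the-envelope check of the numbers above (values are taken from
 * the example in this thread, not measured constants). */
#include <stdio.h>

int main(void) {
    double keyframe_bytes   = 250.0 * 1024.0;   /* ~250 KB keyframe       */
    double paced_rate_bps   = 5.0e6;            /* 5 Mbps pacing rate     */
    double playout_delay_ms = 100.0;            /* delay WebRTC chose     */

    double transfer_ms = keyframe_bytes * 8.0 / paced_rate_bps * 1000.0;
    printf("keyframe needs ~%.0f ms to arrive, jitter buffer waits ~%.0f ms -> %s\n",
           transfer_ms, playout_delay_ms,
           transfer_ms > playout_delay_ms ? "rendered incomplete (stutter/artifacts)"
                                          : "rendered fine");
    return 0;
}
```

At exactly the stream bitrate the keyframe needs roughly 400ms to arrive, which is why bursting it 3-4x faster than the nominal bitrate only just fits it inside a ~100ms playout delay.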
To get into the nitty-gritty, I added some logging to WebRTC and tested in Chromium to see the behavior live. It appears what happens in the jitter estimator is that the keyframe is correctly calculated to have arrived ~200ms later than expected, but then an upper bound is applied to the frame delay. This actually makes it look like the keyframe arrived faster than expected for its size, which I think is why the estimated jitter delay actually goes down when processing the keyframe; it then goes back up slightly because the following P-frames were delayed behind the large keyframe. There are other parts of the calculation that also try to exclude large keyframes, but these lines seem like the main issue in my testing:
This sample data shows the default behavior: notice how the large keyframes actually shrink the bound because of the capping behavior, the frames just after the keyframe grow the bound slightly because they were delayed, and then once things are back to normal the bound shrinks again as small P-frames are delivered with low jitter. Also notice there should be ~119 frames between keyframes for this 60fps stream, but a large number of them are getting dropped, because the large keyframe delays so many subsequent frames that WebRTC gets upset and starts dropping frames until one successfully decodes, logging errors like:
However, if I remove the frame delay cap the stuttering goes away! And I see very different behavior in the jitter estimator that explains this. In this graph you can clearly see the estimated delay has grown large enough to handle the large keyframes that take nearly 200ms to arrive. While the upper/lower bounds that would have been used to cap the frame delay do expand a bit, the large keyframes are still well outside the bounds. I don't think the solution will actually be to just remove this limit; I'm sure it was added for a good reason. But this does prove to me that playout delay is the core issue here and that WebRTC is where we should primarily be looking to solve it.
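To illustrate the capping effect with numbers, here is a toy model of the behavior described above. It is not the real WebRTC Kalman-filter jitter estimator, just a sketch showing how clamping the observed frame delay to a few standard deviations of the noise estimate keeps a 200ms keyframe outlier from ever growing the estimate; all constants are made up:

```c
/* Toy model (NOT the real WebRTC jitter estimator) of the capping behavior
 * described above: a large keyframe delay gets clamped to a bound derived
 * from the noise estimate, so the jitter estimate barely reacts to it. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double var_noise         = 4.0;   /* made-up noise variance, ms^2        */
    double upper_bound_sigma = 3.5;   /* "time deviation upper bound" factor */
    double jitter_estimate   = 30.0;  /* made-up current jitter estimate, ms */

    double observed_delay_ms = 200.0;                    /* late keyframe    */
    double cap = upper_bound_sigma * sqrt(var_noise);    /* = 7 ms here      */

    double clamped = observed_delay_ms;
    if (clamped > cap)  clamped = cap;
    if (clamped < -cap) clamped = -cap;

    /* Toy update: exponential moving average of |delay|. With the clamp the
     * 200 ms outlier contributes only ~7 ms, so the estimate never grows
     * large enough to cover the next keyframe. */
    double with_clamp    = 0.95 * jitter_estimate + 0.05 * fabs(clamped);
    double without_clamp = 0.95 * jitter_estimate + 0.05 * observed_delay_ms;
    printf("estimate after keyframe: clamped=%.1f ms, unclamped=%.1f ms\n",
           with_clamp, without_clamp);
    return 0;
}
```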
Bad ass analysis! We should get in touch with the folks responsible for this code. Is the Chromium bug tracker the best place to kick this off? Or maybe we can start a thread with the author.
Thanks! That looks like a very relevant commit. There is a WebRTC section on that tracker, though it may be better to start with the more casual mailing list: https://www.w3.org/groups/wg/webrtc
Do you want to shoot a message with your findings to that DL and add me? Or do you want to meet and compose something together?
I've managed to mitigate this issue on the encoder side by tuning x264 to forcibly prevent bitrate spikes from keyframes. I've had the most success with CRF encoding.

Test videos used were the British Arrows (as an "average video" longer test) and a continuous loop of this F-Zero GX tool-assisted speedrun (which will make any H.264 encoder beg for mercy, and reliably triggers the large keyframe issue when played in a loop). Optimally, this should be combined with
Of course, this doesn't much help with other encoders (OpenH264, NVENC, Quick Sync, AMD AMF) unless equivalent parameters can be set. :) NVIDIA's docs seem to imply this is possible for NVENC at least, and I suspect it's likely possible for the others as well. I haven't been able to test any of this at the moment, though - AMD hardware encoding on Linux is a mess in general, and my NVIDIA GPU is currently being repurposed in a Windows VM. Come to think of it, enabling

*In practice, with my local machine in MN on Comcast and the server running on a NYC-based VPS, I'm achieving anywhere from 330ms to 400ms latency on average and have seen it as low as 250-270ms. The majority of testing was performed with these x264 settings: 1080p30, CRF encoding with CRF=20, keyframes every 5 seconds,
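The rest of the option string was lost above, so treat this as a hypothetical reconstruction rather than the actual settings used in these tests. Expressed through libx264's C API, a CRF-plus-VBV configuration along these lines is the usual way to keep keyframe bitrate spikes flat:

```c
/* Hypothetical sketch: CRF encoding with a VBV cap via libx264's C API.
 * The specific numbers are illustrative guesses, not the settings used in
 * the tests described above. */
#include <x264.h>

static int configure_encoder(x264_param_t *param) {
    if (x264_param_default_preset(param, "veryfast", "zerolatency") < 0)
        return -1;

    param->i_width   = 1920;
    param->i_height  = 1080;
    param->i_fps_num = 30;
    param->i_fps_den = 1;
    param->i_keyint_max = 30 * 5;              /* keyframe every 5 seconds    */

    param->rc.i_rc_method       = X264_RC_CRF; /* quality-targeted encoding   */
    param->rc.f_rf_constant     = 20;          /* CRF 20, as in the tests     */
    param->rc.i_vbv_max_bitrate = 6000;        /* kbps cap on short-term rate */
    param->rc.i_vbv_buffer_size = 1500;        /* small buffer = flatter rate */

    return x264_param_apply_profile(param, "high");
}
```

The small VBV buffer relative to the max rate is what forces the encoder to spread a large keyframe's bits out over time instead of emitting one huge burst.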
Any update on this? I get stream hangs very frequently; it's quite disruptive.
@compucat Belated thank you for investigating. I had similar findings, and things like limiting the x264 buffer size can certainly help.

@StrikerMan780 As an update, we've rolled out NACKs to help with packet loss and will be rolling out the playout delay change, which should help with this issue (without needing encoder changes), in the next few weeks.

My test stream for this is similar to the above, but simplified to the bare minimum: a changing clock with a static white-noise background. This makes it easy to measure stream lag; P-frames are small because the change frame-to-frame is very small, but I-frames are huge because the background has high information content (hard to compress). For example:

This kind of video stuff is very complex, so it took a while before I had time to get back to looking at it and implement features to help. I'd still like to push for a fix in Chrome itself if we prove the playout delay extension helps significantly.
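For reference on the playout delay change mentioned above: as I understand the WebRTC playout-delay RTP header extension, the payload is three bytes holding 12-bit minimum and maximum delays in 10 ms units. The sketch below is my reading of that format, not code from our ingest, so verify it against the spec before reusing it:

```c
/* Sketch of the 3-byte playout-delay RTP header extension payload, assuming
 * 12-bit min and max delays in 10 ms units (my reading of the WebRTC spec). */
#include <stdint.h>

/* min_ms / max_ms are rounded down to 10 ms units and clamped to 12 bits. */
static void write_playout_delay(uint8_t out[3],
                                unsigned min_ms, unsigned max_ms) {
    unsigned min_units = min_ms / 10;
    unsigned max_units = max_ms / 10;
    if (min_units > 0xFFF) min_units = 0xFFF;
    if (max_units > 0xFFF) max_units = 0xFFF;

    out[0] = (uint8_t)(min_units >> 4);                           /* min[11:4]          */
    out[1] = (uint8_t)(((min_units & 0x0F) << 4) | (max_units >> 8)); /* min[3:0]|max[11:8] */
    out[2] = (uint8_t)(max_units & 0xFF);                         /* max[7:0]           */
}

/* e.g. asking the receiver to buffer at least 200 ms before rendering:
 *   uint8_t buf[3]; write_playout_delay(buf, 200, 400);          */
```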
@danstiner Hello! Your research is amazing. Have you managed to find out anything else about this problem? I came across it in my own plugin for Janus, but my case is a little different. I wanted to use Of course, this is not an encoder problem, I have an alternative protocol and a client for receiving H264, and everything works fine there. It seems that WebRTC is too smart and its jitter algorithms spoil everything in my case. Is fixing the WebRTC stack the only way? Or is there some other option? Edit: a new place in the sources: https://github.com/maitrungduc1410/webrtc/blob/master/modules/video_coding/timing/jitter_estimator.cc // Cap frame_delay based on the current time deviation noise.
TimeDelta max_time_deviation =
TimeDelta::Millis(time_deviation_upper_bound_ * sqrt(var_noise_) + 0.5);
frame_delay.Clamp(-max_time_deviation, max_time_deviation); |
@mdevaev Hi Maxim. Glad my investigations could help you. I don't work on real-time video stuff anymore, but I'll share a couple of thoughts.
@danstiner Thank you for sharing this. I will explore WebRTC further and maybe I can improve the frame processing chains. Unfortunately, I can't use VP9 due to the hardware limitations of my platform, so the only option left is to fix the browser :)
Some users have noticed that under some circumstances, the stream will freeze briefly on every keyframe.
Based on some observations, it looks like the problem is most reproducible in scenes where keyframe packets are of substantial size. Given this, and the fact that this issue reproduced on Mixer's server implementation as well as ours, which is entirely rewritten, it looks like it may be due to how keyframes are fragmented by the client-side ftl-sdk.
Relevant code is here:
https://github.com/microsoft/ftl-sdk/blob/d0c8469f66806b5ea738d607f7d2b000af8b1129/libftl/media.c#L983
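For background on what that code is doing: large NAL units such as keyframe slices are split into FU-A fragments per RFC 6184. The sketch below shows the general technique only; it is not the ftl-sdk implementation linked above, and the payload budget is an illustrative guess:

```c
/* Generic sketch of H.264 FU-A fragmentation (RFC 6184), the technique the
 * linked media.c code implements; illustrative, not ftl-sdk's actual code. */
#include <stdint.h>
#include <string.h>

#define RTP_PAYLOAD_MAX 1200   /* illustrative MTU budget for the payload */

/* Emit one fragment; in real code this would go into an RTP packet. */
typedef void (*emit_fn)(const uint8_t *payload, size_t len, int marker);

/* nal points at a single NAL unit (no start code). Large NALs, such as
 * keyframe slices, are split into FU-A fragments that repeat the NAL type
 * and carry start/end bits so the receiver can reassemble the frame. */
static void packetize_nal(const uint8_t *nal, size_t len, emit_fn emit) {
    if (len <= RTP_PAYLOAD_MAX) {            /* small NAL: single-packet case */
        emit(nal, len, 1);
        return;
    }
    uint8_t nal_header   = nal[0];
    uint8_t fu_indicator = (uint8_t)((nal_header & 0xE0) | 28); /* type 28 = FU-A */
    size_t  offset       = 1;                /* original NAL header is not resent */

    while (offset < len) {
        size_t chunk = len - offset;
        if (chunk > RTP_PAYLOAD_MAX - 2) chunk = RTP_PAYLOAD_MAX - 2;

        uint8_t buf[RTP_PAYLOAD_MAX];
        uint8_t fu_header = (uint8_t)(nal_header & 0x1F);       /* original type */
        if (offset == 1)           fu_header |= 0x80;           /* S: first frag */
        if (offset + chunk == len) fu_header |= 0x40;           /* E: last frag  */

        buf[0] = fu_indicator;
        buf[1] = fu_header;
        memcpy(buf + 2, nal + offset, chunk);
        emit(buf, chunk + 2, offset + chunk == len);            /* marker on last */
        offset += chunk;
    }
}
```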
Some ideas to help investigate the root cause: