LPMS bottleneck fix #141
Conversation
@@ -194,7 +209,7 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
 filters := fmt.Sprintf("fps=%d/%d,%s='w=if(gte(iw,ih),%d,-2):h=if(lt(iw,ih),%d,-2)'", param.Framerate, 1, scale_filter, w, h)
 if input.Accel != Software && p.Accel == Software {
   // needed for hw dec -> hw rescale -> sw enc
-  filters = filters + ":format=yuv420p,hwdownload"
+  filters = filters + ",hwdownload,format=nv12"
Why the change from converting to yuv420p before hwdownload to converting to nv12 after hwdownload?
The old way was actually broken - it only worked with the npp scaler, which we phased out in favor of CUDA before the original GPU PR was even merged. See livepeer/go-livepeer#951
This seems to work better now that we're using cuvid as a decoder, but I have to check that it actually fixes livepeer/go-livepeer#951
@@ -203,6 +218,13 @@ func Transcode3(input *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeRes
   muxOpts.name = C.CString(p.Muxer.Name)
   defer C.free(unsafe.Pointer(muxOpts.name))
 }
+// Set some default encoding options
+if len(p.VideoEncoder.Name) <= 0 && len(p.VideoEncoder.Opts) <= 0 {
+  p.VideoEncoder.Opts = map[string]string{
Out of curiosity, in what scenarios wouldn't we want to enforce IDR frames or closed GOPs?
We don't set it if the user passes in their own options. Practical scenarios for open GOP and non-intra I frames include non-segmented VOD or real time streaming where you want to take advantage of a larger range of pictures for the bitrate savings.
On checking again, nvenc doesn't actually support the cgop option, so it's useless for us now. Will remove. It was a leftover from trying to kludge x264 into doing persistent transcode sessions. We aren't doing that anymore, so this option is no longer needed.
The forced-idr option might not be needed either...
So forced-idr is in fact needed here?
Yes, the forced-idr option is required - without it, we get decoder errors at the beginning of playback because SPS/PPS extradata isn't always emitted for the first frame of new segments, which we signal via an explicit keyframe. This option forces that.
Ideally this option wouldn't be needed at all; ffmpeg itself should be smart enough to emit extradata for the first frame after a flush; I may add that in.
Also, just checked: disabling this option doesn't lead to any unit test failures - we really should capture this behavior in tests. Will make a note to correct that.
In theory, if an existing handle was initialized with a certain set of output renditions, that handle could be re-used for any stream as long as the stream uses that set of output renditions, right? Updating the transcode API so that the user can only pass in parameters that can change on a per-segment basis sounds like a good idea to me.
Do you have anything in particular in mind, or would using a struct just be for future-proofing?
Can't comment in code, so commenting here:
// we hit this case when decoder is flushing; will be no input packet
// (we don't need decoded frames since this stream is doing a copy)
if (ipkt.pts == AV_NOPTS_VALUE) continue;
from the av_read_frame documentation:
pkt->pts can be AV_NOPTS_VALUE if the video format has B-frames
So we will skip B-frames in case of stream copy?
@darkdarkdragon Damn! Have to test for this, but you're probably right. Good catch!
Opted not to do this for now - both because the actual API itself requires some more thought and discussion that I'd like to have some more time for, and because any such API would likely have knock-on effects on the go-livepeer integration that would require even more work there. Since there isn't really an appetite for prolonging this PR even further, we can keep the API as-is for now. cc @yondonfu
@j0sh
av_init_packet(pkt);
// loop until a new frame has been decoded, or EAGAIN
while (1) {
not a loop anymore? It has an unconditional break at the end?
Not in this PR, since that's an unrelated issue.
Ah yeah - this PR didn't change anything there, but stream copy changed the behavior, so the comment had become outdated. Updated the comment, but avoided code changes in e34a473
The nvenc fix needed to support flushing is now in ffmpeg master.
Do we have an issue created in GitHub for this? Otherwise, LGTM, 🚢!
Just had one clarifying question
The commits in this PR are rebased and squashed from the following WIP branch: https://github.com/livepeer/lpms/compare/ja/old-bottleneck
Background
There is a substantial overhead cost to starting new transcode sessions on a GPU. Adding more streams and GPUs exacerbates the problem [1]. Up until now, a new GPU session has been started for each segment that's received. This has led to a massive slowdown in transcoding latency, and becomes worse with each additional concurrent stream.
To mitigate this setup overhead, we make transcode sessions persistent for each stream, so segments after the first can re-use the same session.
[1] There may be a global lock somewhere within FFmpeg or the Nvidia drivers, or it may be intrinsic.
LPMS API changes:
Transcoding becomes stateful: a transcode session is persisted on the GPU in between segments. The caller becomes responsible for terminating the session. Working with the transcoder now involves allocating the transcoder, transcoding, and stopping the transcoder. The API is as follows:
The Transcoder struct is currently opaque to outside packages - nothing within it needs to be exposed at the moment.
The signature of the Transcode function matches the existing Transcode3 function signature. Existing transcoding APIs are re-defined in terms of this new API and will continue to work unmodified in older programs.
Note that this doubles down on the "one segment at a time" restriction. In order to process multiple segments for the same job in parallel, a new session needs to be created. However, this may end up being very slow. Sessions should be reused wherever possible.
The older Transcode3 API is preserved for single-shot transcoding, and is implemented as a wrapper around the three-step API.
Code Flow and Changes
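To make the lifecycle concrete, here is a minimal sketch of the allocate / transcode / stop flow using stub types. The names NewTranscoder, Transcode, and StopTranscoder and their signatures are assumptions inferred from the description above, not the actual LPMS exports; the stub does no real transcoding, it only models session reuse.

```go
package main

import "fmt"

// Stand-ins for the real LPMS option/result structs (illustrative only).
type TranscodeOptionsIn struct{ Fname string }
type TranscodeOptions struct{ Oname string }
type TranscodeResults struct{ Segments int }

// Transcoder is opaque to callers; session state (decoder, GPU context)
// would live inside and persist across segments.
type Transcoder struct {
	started  bool
	segments int
}

func NewTranscoder() *Transcoder { return &Transcoder{} }

// Transcode reuses the persistent session for each incoming segment, so
// only the first segment pays the session-setup cost.
func (t *Transcoder) Transcode(in *TranscodeOptionsIn, ps []TranscodeOptions) (*TranscodeResults, error) {
	if !t.started {
		t.started = true // expensive GPU session setup happens once here
	}
	t.segments++
	return &TranscodeResults{Segments: t.segments}, nil
}

// StopTranscoder tears down the session; the caller is responsible for this.
func (t *Transcoder) StopTranscoder() { t.started = false }

func main() {
	t := NewTranscoder()
	defer t.StopTranscoder()
	renditions := []TranscodeOptions{{Oname: "out_720p.ts"}, {Oname: "out_360p.ts"}}
	for _, seg := range []string{"seg0.ts", "seg1.ts", "seg2.ts"} {
		res, err := t.Transcode(&TranscodeOptionsIn{Fname: seg}, renditions)
		if err != nil {
			panic(err)
		}
		fmt.Println("transcoded", seg, "segments so far:", res.Segments)
	}
}
```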
Muxer, Filtergraphs and Audio: No change from what we have now. Always recreated in between sessions. Enough code may have been moved around to obscure this fact, but there should be no net effect on the muxer, filtergraphs and audio as a result of this PR.
Decoding: Reused in between sessions. The number of streams and the indices of audio / video components are also cached in between segments in order to avoid a latency-incurring probing step. If the stream ordering changes in between segments, things will likely break. (This shouldn't happen under most circumstances, but isn't unheard of, eg in case there is a reconfiguration somewhere at the source. In which case the best thing to do there would be to create a new session, rather than reuse an existing session.)
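The caching-and-reuse logic for stream layout can be sketched as below. The type and method names are invented for illustration; the point is that only the first segment is probed, and a layout change on a later segment is a signal to start a fresh session rather than reuse this one.

```go
package main

import (
	"errors"
	"fmt"
)

// cachedStreams models the per-session cache described above: stream count
// and audio/video indices are probed once and reused for later segments.
type cachedStreams struct {
	probed             bool
	nbStreams          int
	videoIdx, audioIdx int
}

// segment is a stand-in for a demuxed input segment's stream layout.
type segment struct {
	nbStreams          int
	videoIdx, audioIdx int
}

// openSegment probes only the first segment; later segments are checked
// against the cache, avoiding the latency-incurring probe. A mismatch
// means the source was reconfigured and a new session is needed.
func (c *cachedStreams) openSegment(s segment) error {
	if !c.probed {
		c.nbStreams, c.videoIdx, c.audioIdx = s.nbStreams, s.videoIdx, s.audioIdx
		c.probed = true
		return nil
	}
	if s.nbStreams != c.nbStreams || s.videoIdx != c.videoIdx || s.audioIdx != c.audioIdx {
		return errors.New("stream layout changed: start a new session")
	}
	return nil
}

func main() {
	var c cachedStreams
	fmt.Println(c.openSegment(segment{2, 0, 1})) // first segment: probe and cache
	fmt.Println(c.openSegment(segment{2, 0, 1})) // same layout: reuse cache
	fmt.Println(c.openSegment(segment{2, 1, 0})) // reordered streams: error
}
```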
Decoder buffering is a problem: sometimes we have finished feeding in all packets from a segment, but haven't received all pending frames from the decoder. In order to flush the decoder buffer, we send a dummy packet with a sentinel timestamp and keep feeding the decoder until we receive a decoded frame back with the sentinel timestamp. Any frames carrying the sentinel timestamp are discarded. The dummy packet is taken from the first packet of each segment (and reset for subsequent segments).
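The sentinel-flush idea can be modeled without FFmpeg: treat the decoder as a FIFO with fixed latency, feed it sentinel-stamped dummies at end of segment, and stop once a sentinel-stamped frame emerges. The decoder model, the sentinel value, and the function names here are all hypothetical; this is a sketch of the mechanism, not the LPMS C code.

```go
package main

import "fmt"

// Hypothetical sentinel; the real code derives the dummy packet from the
// first packet of the segment rather than using a fixed constant.
const sentinelPTS = int64(-1 << 62)

// decoder models a decoder with `delay` frames of internal buffering:
// each decoded frame comes out `delay` packets after it went in.
type decoder struct {
	delay int
	queue []int64
}

// feed sends one packet and returns a decoded frame pts, or ok=false
// while the decoder is still buffering.
func (d *decoder) feed(pts int64) (int64, bool) {
	d.queue = append(d.queue, pts)
	if len(d.queue) <= d.delay {
		return 0, false
	}
	out := d.queue[0]
	d.queue = d.queue[1:]
	return out, true
}

// flushSegment drains buffered frames at segment end by feeding dummy
// packets with the sentinel timestamp until a sentinel frame comes back;
// sentinel frames themselves are discarded.
func flushSegment(d *decoder, got *[]int64) {
	for {
		pts, ok := d.feed(sentinelPTS)
		if !ok {
			continue
		}
		if pts == sentinelPTS {
			return // sentinel surfaced: all real frames have been drained
		}
		*got = append(*got, pts)
	}
}

func main() {
	d := &decoder{delay: 2}
	var frames []int64
	for pts := int64(0); pts < 5; pts++ {
		if out, ok := d.feed(pts); ok {
			frames = append(frames, out)
		}
	}
	// Without flushing, frames 3 and 4 would be stuck in the decoder.
	flushSegment(d, &frames)
	fmt.Println(frames) // prints [0 1 2 3 4]
}
```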
For hardware encoding, we currently don't do anything to explicitly flush the encoder in between segments. This seems OK for now - we receive all the frames we're supposed to, according to our tests - but it makes me nervous that certain inputs or configurations will leave us with buffered frames (eg, possibly B-frames). (There is a solution that might work for flushing the Nvidia encoder, but it involves modifying FFmpeg in questionable ways.)
API Usability
There are some usability shortcomings of the current API that should be mentioned. (We can opt to fix these before merging, but I haven't gotten around to it. Happy to do this sooner since it'd be an API break.)
Currently, the output renditions are fixed for the duration of the transcode session, but the user can pass in whatever they want for subsequent segments [1]. We attempt to "check" this with a rudimentary length check on the number of outputs, but this doesn't actually enforce that the configuration of those outputs matches. A better API would look something like:
That would mean defining new structs for NewTranscoder and Transcoder, since we shouldn't re-use or modify the old parameter sets.
However, the in/out filenames might not be the only thing we legitimately want to pass to the transcoder on a per-segment basis, so the parameters should be a struct.
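One possible shape for that API is sketched below. All names (SessionParams, SegmentParams, NewSession, TranscodeSegment) are invented for illustration; the PR deliberately did not adopt such an API, so this only models the idea of fixing renditions at session creation and passing a per-segment struct afterwards.

```go
package main

import "fmt"

// Profile is a stand-in for an output rendition configuration.
type Profile struct{ Width, Height, Bitrate int }

// SessionParams fixes per-session state: the renditions are set once,
// at session creation, and can't be swapped mid-session.
type SessionParams struct {
	Renditions []Profile
}

// SegmentParams carries only what may legitimately change per segment;
// using a struct leaves room for future per-segment fields.
type SegmentParams struct {
	InName   string
	OutNames []string
}

type Session struct{ params SessionParams }

func NewSession(p SessionParams) *Session { return &Session{params: p} }

// TranscodeSegment can no longer receive a mismatched rendition config,
// because the renditions were bound at NewSession time; only the output
// count still needs checking against the per-segment filenames.
func (s *Session) TranscodeSegment(seg SegmentParams) error {
	if len(seg.OutNames) != len(s.params.Renditions) {
		return fmt.Errorf("expected %d outputs, got %d",
			len(s.params.Renditions), len(seg.OutNames))
	}
	return nil
}

func main() {
	s := NewSession(SessionParams{Renditions: []Profile{{1280, 720, 3000}, {640, 360, 1000}}})
	fmt.Println(s.TranscodeSegment(SegmentParams{"seg0.ts", []string{"a.ts", "b.ts"}}))
	fmt.Println(s.TranscodeSegment(SegmentParams{"seg1.ts", []string{"a.ts"}}))
}
```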
[1] Example of poor API usage:
Fixes livepeer/go-livepeer#997
Fixes livepeer/go-livepeer#951