Single-O Testing
This issue summarizes the current benchmarking progress with a single O (orchestrator) and outlines further steps.
Single-O / Single-T
We are currently seeing a low success rate (~80%) when transcoding one 4-rendition stream [1] with a single O and a single 16-core T (transcoder), i.e. 1O-1T. Two factors contribute to this low success rate:
Segments are not always a consistent length. When a segment is followed by a shorter segment, the O is likely still busy with the first segment when the shorter one arrives. This accounts for ~55% of the failures.
The T cannot keep up in real time for a number of segments. This accounts for the remaining ~45% of failures. In practice, this slowness exacerbates the first factor: an O would be better able to absorb variations in segment length if the T processed segments faster.
[1] 240p, 360p, 576p, 720p
The LPMS transcoder uses a new session for each segment. Setting up and tearing down a transcode session for every segment incurs significant overhead for both CPU and GPU encoding. For a 16 vCPU machine, the slowdown is roughly 1.75x over performing all the processing for a given stream in the same session. The slowdown is such that even a 16-core transcoder has difficulty keeping up with some incoming segments. (Related: livepeer/lpms#119, although the slowdown is closer to 75-100% rather than the 10-15% mentioned in that issue.)
The O/T itself adds less than 2% of overhead; the running time is dominated by the LPMS transcoder. (Just to demonstrate how tight the transcoding margins are: statistically, the success rate should rise to ~86% if the O/T overhead were excluded.)
With the particular bottleneck for 1O-1T identified as being the transcoder itself, we can proceed to some next steps for single-O testing.
Single-O Multi-T
Given that we know the bottleneck with the current testing configuration is the transcoder itself, we should dial down the transcoding configuration in order to benchmark other aspects of single-O. Two things remain to be tested:
Multi-T bottlenecks. Given Ts that are "fast enough", we need to ensure that O's load balancing among multiple Ts is reasonable. This is separate from the bottleneck of a single T, which we have already identified.
Single-O bottlenecks. Once we have determined that multi-T load balancing is in a good place, we can increase the capacity attached to an O in order to discover other bottlenecks in the O.
We should be comfortable with our understanding of any multi-T limitations before further benchmarking single-O, and we should incorporate real transcoders into the testing so that we stress the full transcoding pipeline.
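One simple form the O's balancing could take is picking the least-loaded attached T for each incoming segment. The sketch below is only an illustration of that idea; the actual implementation on the ja/loadfactor branch may track load differently:

```go
package main

import "fmt"

// transcoderLoad tracks how many segments each attached T currently has
// in flight. Hypothetical structure for illustration.
type transcoderLoad struct {
	id   string
	load int // segments in flight
}

// selectTranscoder returns the least-loaded T, sketching load-factor
// style balancing across the Ts attached to an O.
func selectTranscoder(ts []transcoderLoad) (string, error) {
	if len(ts) == 0 {
		return "", fmt.Errorf("no transcoders attached")
	}
	best := 0
	for i := 1; i < len(ts); i++ {
		if ts[i].load < ts[best].load {
			best = i
		}
	}
	return ts[best].id, nil
}

func main() {
	ts := []transcoderLoad{{"t1", 2}, {"t2", 0}, {"t3", 1}}
	id, _ := selectTranscoder(ts)
	fmt.Println(id) // t2, the idle transcoder
}
```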
Testing Steps
One approach to doing multi-T benchmarking is this:
Identify a transcoding configuration that a given T can complete comfortably faster than real time. One option is the set of 240p, 360p, 576p. Rough numbers show this typically runs ~45% faster than real time with a single stream on a 16 vCPU machine.
Verify single-T benchmark numbers with 240p, 360p, 576p, or an equally comfortable alternative.
Using the current goclient master branch, determine the baseline success rate for 1O-1T-1 stream at 240p, 360p, 576p. The success rate may not be 100% due to variation in segment lengths.
Increase the number of concurrent streams and concurrent transcoders, maintaining a 1-stream-1T ratio. Hypothesis: for our testing, 1O-4T-4 streams is likely to see only a roughly ~50% success rate on the master branch.
Run the same 1O-4T-4S test on the ja/loadfactor branch. As long as there are no bottlenecks for O, we should see a similar success rate as 1O-1T on master. If not, we may need another approach to multi-T load balancing.
Continue adding T/S until we hit diminishing returns for a single O (as marked by an increase in error rates).
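The scale-up step above can be sketched as a driver loop. runBenchmark here is a stub returning synthetic placeholder numbers purely to make the loop runnable; in a real run it would drive the goclient benchmark harness:

```go
package main

import "fmt"

// runBenchmark is a stand-in for a 1O-nT-nS benchmark run; the linear
// degradation model is invented for illustration and is not measured data.
func runBenchmark(numPairs int) float64 {
	rate := 1.0 - 0.02*float64(numPairs)
	if rate < 0 {
		rate = 0
	}
	return rate
}

func main() {
	const threshold = 0.90 // acceptable success rate (arbitrary choice)
	// Keep adding T/stream pairs until error rates rise past the threshold.
	for n := 1; n <= 16; n++ {
		rate := runBenchmark(n)
		fmt.Printf("1O-%dT-%dS: success %.2f\n", n, n, rate)
		if rate < threshold {
			fmt.Printf("diminishing returns at %d T/stream pairs\n", n)
			break
		}
	}
}
```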
I ran the benchmark comparing master to ja/loadfactor. Although they have the same performance for 1 stream, ja/loadfactor performed significantly better when adding more concurrent streams (while maintaining 1 stream per T). Here is the comparison.