Single-O Benchmarking #989

Closed
j0sh opened this issue Jul 17, 2019 · 2 comments
j0sh (Collaborator) commented Jul 17, 2019

Single-O Testing

This issue summarizes the current benchmarking progress with a single O and outlines further steps.

Single-O / Single-T

We are currently seeing a low success rate (~80%) when transcoding one 4-rendition stream [1] with a single O and a 16-core T (1O-1T). Two factors contribute to this low success rate:

  1. Segments are not always of a consistent length. When a segment is followed by a shorter one, the O is often still busy with the first segment when the shorter one arrives. This accounts for ~55% of the failures.

  2. The T cannot keep up in real time for a number of segments. This accounts for ~45% of remaining failures. In practice, this slowness exacerbates the first factor: an O would be able to better handle variations in segment lengths if it were faster in processing them.

[1] 240p, 360p, 576p, 720p
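The interaction between variable segment lengths and tight transcode margins can be illustrated with a toy simulation. All numbers below (segment durations, per-session overhead, processing speed) are assumptions for illustration, not measured values:

```python
def count_deadline_misses(durations, per_session_overhead, speed):
    """Toy model of back-to-back segment transcoding.

    durations: segment lengths in seconds, arriving back-to-back.
    per_session_overhead: fixed cost (s) of a fresh transcode session.
    speed: processing seconds per second of video (< 1.0 means the
    steady-state transcode itself is faster than real time).

    A segment "misses" if its transcode finishes after its real-time
    deadline (its arrival time plus its own duration).
    """
    busy_until = 0.0  # when the transcoder frees up
    arrival = 0.0     # when each segment arrives
    misses = 0
    for d in durations:
        start = max(arrival, busy_until)  # wait if still busy
        busy_until = start + per_session_overhead + d * speed
        if busy_until > arrival + d:
            misses += 1
        arrival += d
    return misses

# Uniform 4s segments just keep up, but a single shorter segment misses
# its deadline, and the backlog spills into the segment after it:
print(count_deadline_misses([4, 4, 4, 4], 0.8, 0.8))  # 0
print(count_deadline_misses([4, 4, 2, 4], 0.8, 0.8))  # 2
```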

The LPMS transcoder uses a new session for each segment. Setting up and tearing down a new transcode session frequently incurs significant overhead for both CPU and GPU encoding. For a 16 vCPU machine, the slowdown is roughly 1.75x over performing all the processing for a given stream in the same session. The slowdown is such that even a 16-core transcoder has difficulty keeping up with some incoming segments. (Related: livepeer/lpms#119 , although the slowdown is closer to 75-100% rather than the 10-15% mentioned in the issue)
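To see how a ~1.75x per-segment session penalty eats the real-time margin, consider a back-of-the-envelope check (the 2-second segment duration and the 1.3s baseline transcode time are assumed, illustrative numbers):

```python
segment_duration = 2.0       # seconds of video per segment (assumed)
reused_session_time = 1.3    # transcode time with one long-lived session (assumed)
per_segment_penalty = 1.75   # ~1.75x slowdown from a fresh session per segment

per_segment_time = reused_session_time * per_segment_penalty
print(per_segment_time)                     # 2.275
print(per_segment_time > segment_duration)  # True: falls behind real time
```

A transcode that comfortably beats real time with a reused session drops below real time once the per-segment session cost is paid.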

The O/T itself adds less than 2% of overhead; the running time is dominated by the LPMS transcoder. (Just to demonstrate how tight the transcoding margins are: statistically the success rate should go up to ~86% if the O/T overhead were excluded.)

With the particular bottleneck for 1O-1T identified as being the transcoder itself, we can proceed to some next steps for single-O testing.

Single-O Multi-T

Given that we know the bottleneck with the current testing configuration is the transcoder itself, we should dial down the transcoding configuration in order to benchmark other aspects of single-O. Two things remain to be tested:

  1. Multi-T bottlenecks. Given Ts that are "fast enough", we need to ensure that the O's load balancing among multiple Ts is reasonable. This is separate from the bottleneck of a single T, which we have already identified.

  2. Single-O bottlenecks. Once we have determined that multi-T load balancing is in a good place, we can increase the capacity attached to an O in order to discover other bottlenecks in the O.

We should be comfortable with our understanding of any multi-T limitations before benchmarking single-O further, and we should incorporate real transcoders into the testing so that we stress the transcoding pipeline in its entirety.
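As a reference point for what "reasonable" load balancing might look like, here is a minimal least-loaded dispatch sketch. This is a hypothetical policy for illustration only; it is not the actual O implementation or the ja/loadfactor algorithm:

```python
class LoadBalancer:
    """Least-loaded dispatch: each segment goes to the T with the
    fewest outstanding segments. Illustrative sketch, not O's code."""

    def __init__(self, transcoder_ids):
        # outstanding segment count per transcoder
        self.load = {t: 0 for t in transcoder_ids}

    def assign(self):
        # pick the transcoder with the lowest outstanding load
        t = min(self.load, key=self.load.get)
        self.load[t] += 1
        return t

    def complete(self, t):
        # a transcoder finished a segment; free up its slot
        self.load[t] -= 1
```

Under a strict 1-stream-1-T ratio, any sane policy degenerates to one stream per T, which is why multi-T tests at that ratio should match the 1O-1T baseline if the O itself is not a bottleneck.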

Testing Steps

One approach to doing multi-T benchmarking is this:

Identify a transcoding configuration that a given T can complete comfortably faster than real time. One option is the set of 240p, 360p, 576p: rough numbers show this typically runs about 45% faster than real time for a single stream on a 16 vCPU machine.

    • Verify single-T benchmark numbers with 240p, 360p, 576p or an equally comfortable alternative
    • Using the current goclient master branch, determine the baseline success rate for 1O-1T-1 stream at 240p, 360p, 576p. The success rate may not be 100% due to variation in segment lengths.
    • Increase the number of concurrent streams and concurrent transcoders, maintaining a 1-stream-1-T ratio. Hypothesis: for our testing, 1O-4T-4 streams is likely to see only a roughly 50% success rate on the master branch.
    • Run the same 1O-4T-4S test on the ja/loadfactor branch. As long as there are no bottlenecks for O, we should see a similar success rate as 1O-1T on master. If not, we may need another approach to multi-T load balancing.
    • Continue adding T/S until we hit diminishing returns for a single O (as marked by an increase in error rates).
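The stop condition in the last step ("diminishing returns, as marked by an increase in error rates") could be mechanized along these lines. Both `run_benchmark` and the 2-percentage-point drop threshold are hypothetical placeholders, not part of the existing test harness:

```python
def success_rate(results):
    """results: one boolean per segment (True = transcoded in time)."""
    return 100.0 * sum(results) / len(results)

def find_single_o_capacity(run_benchmark, max_ts=16, max_drop=2.0):
    """Grow a 1O-nT-nS configuration until the success rate drops by
    more than `max_drop` percentage points versus the previous run.

    run_benchmark: hypothetical callable n -> per-segment booleans for
    a 1O-nT-nS run. Returns the largest n before the drop.
    """
    prev = None
    for n in range(1, max_ts + 1):
        rate = success_rate(run_benchmark(n))
        if prev is not None and rate < prev - max_drop:
            return n - 1  # the previous configuration was the knee
        prev = rate
    return max_ts
```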
ya7ya (Contributor) commented Jul 18, 2019

I ran the benchmark to compare master against ja/loadfactor. Although the two have the same performance for a single stream, ja/loadfactor performed significantly better as concurrent streams were added (while maintaining one stream per T). Here is the comparison.

[Chart: master (blue) vs ja/loadfactor (red)]

| setup | renditions | concurrent streams | master success rate (%) | ja/loadfactor success rate (%) |
| --- | --- | --- | --- | --- |
| 1B 1O 1T | P240p30fps16x9, P360p30fps16x9, P576p30fps16x9 | 1 | 96.7061 | 96.8961 |
| 1B 1O 2T | P240p30fps16x9, P360p30fps16x9, P576p30fps16x9 | 2 | 93.7922 | 97.8885 |
| 1B 1O 3T | P240p30fps16x9, P360p30fps16x9, P576p30fps16x9 | 3 | 84.1216 | 97.3395 |
| 1B 1O 4T | P240p30fps16x9, P360p30fps16x9, P576p30fps16x9 | 4 | 74.0287 | 96.7061 |

j0sh (Collaborator, issue author) commented Sep 20, 2019

@darkdarkdragon Can we close this issue?
