Profile-Guided Optimization (PGO) benchmark report #5

zamazan4ik · 2024-09-29T11:04:44Z

Hi!

As I have done many times before, I decided to test the Profile-Guided Optimization (PGO) technique to optimize the library's performance. For reference, results for other projects are available at https://github.com/zamazan4ik/awesome-pgo . Since PGO has helped many other libraries, I decided to apply it to serde-brief to see whether a performance win (or loss) can be achieved. Here are my benchmark results. For the benchmarks, I used these benchmarks, since they were mentioned in the Reddit post.

This information can be interesting for anyone who wants to achieve more performance with the library in their use cases.

Test environment

Fedora 40
Linux kernel 6.10.11
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler: rustc 1.81.0
serde-brief version: the one used in the benchmark above (presumably the latest at the time of writing)
Turbo Boost disabled

Benchmark

For PGO optimization, I use the cargo-pgo tool. I got the Release benchmark results with the taskset -c 0 cargo bench --no-default-features --features serde-brief command. The PGO training phase is done with taskset -c 0 cargo pgo bench -- --no-default-features --features serde-brief, and the PGO optimization phase with taskset -c 0 cargo pgo optimize bench -- --no-default-features --features serde-brief.

taskset -c 0 pins the benchmark process to CPU core 0 to reduce the OS scheduler's influence on the results. All measurements are done on the same machine, with the same background "noise" (as far as I can guarantee).
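The three phases above can be wrapped in a small POSIX shell script so they are always run in the same order with the same pinning. This is a minimal sketch, not the author's actual setup: it assumes cargo-pgo is installed (cargo install cargo-pgo) along with the llvm-tools-preview rustup component, and the names pinned, pgo_benchmark and the DRY_RUN switch are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Minimal sketch of the three benchmark phases from this report.
# Assumes cargo-pgo is installed and llvm-tools-preview is added via rustup.
# pinned, pgo_benchmark and DRY_RUN are hypothetical helper names.

# Run a command pinned to CPU core 0 to reduce scheduler noise.
# With DRY_RUN=1 the command is only printed, not executed.
pinned() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "taskset -c 0 $*"
  else
    taskset -c 0 "$@"
  fi
}

# Run the baseline, training and optimized phases in order,
# stopping at the first phase that fails.
pgo_benchmark() {
  # 1. Baseline: regular release benchmarks.
  pinned cargo bench --no-default-features --features serde-brief || return
  # 2. Training: build instrumented benchmarks, run them, collect profiles.
  pinned cargo pgo bench -- --no-default-features --features serde-brief || return
  # 3. Optimization: rebuild with the collected profiles and benchmark again.
  pinned cargo pgo optimize bench -- --no-default-features --features serde-brief
}
```

After sourcing the script (for example with . ./pgo-bench.sh, a hypothetical file name), DRY_RUN=1 pgo_benchmark prints the three commands without running them, while pgo_benchmark executes them in sequence.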

Results

I got the following results:

Release: https://gist.github.com/zamazan4ik/07cc20a860970904d2bd53df98335d5a
PGO optimized compared to Release: https://gist.github.com/zamazan4ik/37e6ca037c3c92bc7f1af13f9dddc612
(just for reference) PGO instrumented compared to Release: https://gist.github.com/zamazan4ik/1e8ac46c88113b14c2d712f7d10ce5c2

According to the results, PGO measurably improves the library's performance.

Further steps

I understand that the steps above can be time-consuming and hard to implement in practice. At the very least, the library's users can find this performance report and decide to enable PGO for their applications if they care about the library's performance in their workloads. Maybe a small note somewhere in the documentation (the README file?) will be enough to raise awareness about this work.

Please don't treat the issue like an actual issue - it's just a benchmark report (since Discussions are disabled for the repo).

Thank you.


FlixCoder · 2024-09-29T11:13:45Z

Thank you! This sounds helpful, and it is a considerable speedup.

Feel free to make a PR adding documentation for it. If you don't want to, I can try to do it as well (I imagine a new Markdown file in the docs folder that can be linked in the README and included in the library's docs module :D). I don't want to take your contribution away :)

FlixCoder · 2024-09-29T11:16:08Z

Oh, and in addition: I expect more improvements are still possible by optimizing the source code. Postcard is about 3x faster at serialization right now. It won't be possible to match that, since we encode more data, but I bet it is possible to halve the time needed (I hope).

Just need to learn how to do it :D

zamazan4ik · 2024-09-29T11:46:28Z

Feel free to make a PR adding documentation for it. Or if you don't want to, I can try to do it as well.

I think it would be better if you could create such a document: that way, it will be written consistently with the rest of the project's documentation. For reference, I have collected several examples of such PGO docs in other projects (applications and libraries): https://github.com/zamazan4ik/awesome-pgo?tab=readme-ov-file#project-specific-documentation-about-pgo . I hope they will be helpful.

I don't want to take your contribution away

Oh, no worries about that at all! I am already happy that one more person is interested in PGO!

I expect more improvements are still possible by optimizing the source code. Postcard is about 3x faster at serialization right now. It won't be possible to match that, since we encode more data, but I bet it is possible to halve the time needed (I hope).

Yep, I understand. PGO can also give you insights into possible optimizations: you can compare flamegraphs before and after PGO to see the difference (or even compare the resulting assembly/LLVM IR before and after PGO). It could be time-consuming, though. A nice thing about PGO is that all that "boring optimization stuff" is done semi-automatically by the compiler. You focus on the high-level optimizations, and the compiler does the "dirty" low-level optimization work ;)

FlixCoder · 2024-10-03T11:03:44Z

Alright, it is documented in #6. I would be interested in your feedback if you have the time :)

The PR will close this, but it is linked in the docs, so users can find the information :)

zamazan4ik · 2024-10-03T14:26:00Z

I would be interested in your feedback if you have the time :)

Sure! Just did it ;)

FlixCoder mentioned this issue on Oct 3, 2024 in "Document PGO usage" #6 (merged).

FlixCoder closed this as completed in #6 on Oct 3, 2024.
