fix: switch `ld` with `lld` on linux, fixes OOM #1388

giladchase · 2024-10-14T09:01:29Z

We've recently taken to limit parallelism during builds due to Out Of Memory errors we got while running cargo test on 16-core machine, or 8-core machines running with -j16.

We hit OOM due to two reasons:

Commit fae5aed introduced llvm dependencies to the blockifier, which notably increased compilation and linking time.
Before fae5aed, we already had a quite memory-heavy linking stage, specifically ~7-8 ld instances running in parallel, each with ~2-3GB of memory, which was just enough not to hit OOM on most 16core machines.
This was primarily due to a combination of factors:
2.1 the blockifier being a large lib that needs a 2-3GB ld process
to link in isolation.
2.2 many crates have the blockifier as a dependency (transitively).
2.3 ld naively parallelizes linking tasks.
The combination of the above three issues caused a heavily parallelized environment to create many heavy linking jobs that shared no memory (blockifier linking was duplicated).

Switching lld handles parallelizing better and reduces memory stress from ~10 2-3GB ld linker instances post fae5aed, to ~2 1-2GB lld linker instances for the same tasks, which has a minimal memory impact.

Moreover, lld is already the default linker on rust nightly, and will eventually become the default on stable (see code comment for link to tracking issue).

This change is

codecov · 2024-10-20T10:56:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.83%. Comparing base (b0cfe82) to head (bb6d35e).
Report is 433 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (b0cfe82) and HEAD (bb6d35e). Click for more details.

HEAD has 2 uploads less than BASE

Flag BASE (b0cfe82) HEAD (bb6d35e)

3 1

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #1388       +/-   ##
==========================================
- Coverage   74.18%   4.83%   -69.36%     
==========================================
  Files         359      34      -325     
  Lines       36240    2109    -34131     
  Branches    36240    2109    -34131     
==========================================
- Hits        26886     102    -26784     
+ Misses       7220    1998     -5222     
+ Partials     2134       9     -2125

Flag	Coverage Δ
	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-10-20T19:02:15Z

Benchmark movements:
tree_computation_flow performance improved 😺
tree_computation_flow time: [33.810 ms 33.836 ms 33.868 ms]
change: [-5.5448% -4.1828% -3.0447%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe

full_committer_flow performance improved 😺
full_committer_flow time: [29.462 ms 29.520 ms 29.594 ms]
change: [-1.5140% -1.3021% -1.0299%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe

giladchase

Reviewable status: 0 of 3 files reviewed, all discussions resolved (waiting on @dorimedini-starkware and @elintul)

.github/workflows/committer_ci.yml line 90 at r1 (raw file):

      # Benchmark the new code, splitting the benchmarks, and prepare the results for posting a comment.
      - uses: ./.github/actions/bootstrap

This calls dependencies.sh, which was missing (probably wasn't needed?).

Code quote:

      - uses: ./.github/actions/bootstrap

dorimedini-starkware

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @elintul and @giladchase)

a discussion (no related file):
please bump blockifier and committer on py side and open PR to pass hodor

dorimedini-starkware

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @elintul and @giladchase)

elintul

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @giladchase)

We've recently taken to limit parallelism during builds due to Out Of Memory errors we got while running `cargo test` on 16-core machine, or 8-core machines running with `-j16`. We hit OOM due to two reasons: 1. Commit fae5aed introduced `llvm` dependencies to the blockifier, which notably increased compilation and linking time. 2. Before fae5aed, we already had a quite memory-heavy linking stage, specifically ~7-8 `ld` instances running in parallel, each with ~2-3GB of memory, which was just enough not to hit OOM on most 16core machines. This was primarily due to a combination of factors: 2.1 the blockifier being a large lib that needs a 2-3GB `ld` process to link in isolation. 2.2 many crates have the blockifier as a dependency (transitively). 2.3 `ld` naively parallelizes linking tasks. The combination of the above three issues caused a heavily parallelized environment to create many heavy linking jobs that shared no memory (blockifier linking was duplicated). This PR switches to the rust linker from `ld` to `lld`, thus solving the OOM issue without artificially bounding the amount of parallelization. `lld` handles parallelizing better and reduces memory stress from ~10 2-3GB `ld` linker instances post fae5aed, to ~2 1-2GB `lld` linker instances for the same tasks, which has a minimal memory impact. Moreover, `lld` is already the default linker on `rust` nightly, and will eventually become the default on stable (see code comment for link to tracking issue).

github-actions · 2024-10-21T10:13:28Z

Benchmark movements:
full_committer_flow performance improved 😺
full_committer_flow time: [29.520 ms 29.571 ms 29.641 ms]
change: [-2.0274% -1.7945% -1.5358%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

elintul

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @giladchase)

dorimedini-starkware

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @giladchase)

giladchase

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @giladchase)

a discussion (no related file):

Previously, dorimedini-starkware wrote…

please bump blockifier and committer on py side and open PR to pass hodor

Done, python is only failing on a flaky test which isn't a blocker.

We've recently taken to limit parallelism during builds due to Out Of Memory errors we got while running `cargo test` on 16-core machine, or 8-core machines running with `-j16`. We hit OOM due to two reasons: 1. Commit fae5aed introduced `llvm` dependencies to the blockifier, which notably increased compilation and linking time. 2. Before fae5aed, we already had a quite memory-heavy linking stage, specifically ~7-8 `ld` instances running in parallel, each with ~2-3GB of memory, which was just enough not to hit OOM on most 16core machines. This was primarily due to a combination of factors: 2.1 the blockifier being a large lib that needs a 2-3GB `ld` process to link in isolation. 2.2 many crates have the blockifier as a dependency (transitively). 2.3 `ld` naively parallelizes linking tasks. The combination of the above three issues caused a heavily parallelized environment to create many heavy linking jobs that shared no memory (blockifier linking was duplicated). This PR switches to the rust linker from `ld` to `lld`, thus solving the OOM issue without artificially bounding the amount of parallelization. `lld` handles parallelizing better and reduces memory stress from ~10 2-3GB `ld` linker instances post fae5aed, to ~2 1-2GB `lld` linker instances for the same tasks, which has a minimal memory impact. Moreover, `lld` is already the default linker on `rust` nightly, and will eventually become the default on stable (see code comment for link to tracking issue). Co-Authored-By: Gilad Chase <[email protected]>

giladchase force-pushed the gilad/gilad/linker branch 11 times, most recently from 6dc3bd4 to 32de843 Compare October 20, 2024 10:46

giladchase force-pushed the gilad/gilad/linker branch 3 times, most recently from bb6d35e to 19321a8 Compare October 20, 2024 18:46

giladchase requested review from elintul and dorimedini-starkware October 21, 2024 08:05

giladchase commented Oct 21, 2024

View reviewed changes

dorimedini-starkware requested changes Oct 21, 2024

View reviewed changes

dorimedini-starkware reviewed Oct 21, 2024

View reviewed changes

elintul approved these changes Oct 21, 2024

View reviewed changes

giladchase force-pushed the gilad/gilad/linker branch from 19321a8 to bf80e0c Compare October 21, 2024 09:58

elintul approved these changes Oct 21, 2024

View reviewed changes

dorimedini-starkware approved these changes Oct 21, 2024

View reviewed changes

giladchase commented Oct 21, 2024

View reviewed changes

giladchase merged commit 42a2694 into main Oct 21, 2024
22 checks passed

giladchase deleted the gilad/gilad/linker branch October 21, 2024 10:43

github-actions bot locked and limited conversation to collaborators Oct 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: switch `ld` with `lld` on linux, fixes OOM #1388

fix: switch `ld` with `lld` on linux, fixes OOM #1388

giladchase commented Oct 14, 2024 •

edited by alon-dotan-starkware

Loading

codecov bot commented Oct 20, 2024 •

edited

Loading

github-actions bot commented Oct 20, 2024

giladchase left a comment

dorimedini-starkware left a comment

dorimedini-starkware left a comment

elintul left a comment

github-actions bot commented Oct 21, 2024

elintul left a comment

dorimedini-starkware left a comment

giladchase left a comment

fix: switch ld with lld on linux, fixes OOM #1388

fix: switch ld with lld on linux, fixes OOM #1388

Conversation

giladchase commented Oct 14, 2024 • edited by alon-dotan-starkware Loading

codecov bot commented Oct 20, 2024 • edited Loading

Codecov Report

github-actions bot commented Oct 20, 2024

giladchase left a comment

Choose a reason for hiding this comment

dorimedini-starkware left a comment

Choose a reason for hiding this comment

dorimedini-starkware left a comment

Choose a reason for hiding this comment

elintul left a comment

Choose a reason for hiding this comment

github-actions bot commented Oct 21, 2024

elintul left a comment

Choose a reason for hiding this comment

dorimedini-starkware left a comment

Choose a reason for hiding this comment

giladchase left a comment

Choose a reason for hiding this comment

fix: switch `ld` with `lld` on linux, fixes OOM #1388

fix: switch `ld` with `lld` on linux, fixes OOM #1388

giladchase commented Oct 14, 2024 •

edited by alon-dotan-starkware

Loading

codecov bot commented Oct 20, 2024 •

edited

Loading