Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: switch ld with lld on linux, fixes OOM #1388

Merged
merged 1 commit into from
Oct 21, 2024
Merged

Conversation

giladchase
Copy link
Contributor

@giladchase giladchase commented Oct 14, 2024

We've recently taken to limit parallelism during builds due to Out Of Memory errors we got while running cargo test on 16-core machine, or 8-core machines running with -j16.

We hit OOM due to two reasons:

  1. Commit fae5aed introduced llvm dependencies to the blockifier, which notably increased compilation and linking time.
  2. Before fae5aed, we already had a quite memory-heavy linking stage, specifically ~7-8 ld instances running in parallel, each with ~2-3GB of memory, which was just enough not to hit OOM on most 16core machines.
    This was primarily due to a combination of factors:
    2.1 the blockifier being a large lib that needs a 2-3GB ld process
    to link in isolation.
    2.2 many crates have the blockifier as a dependency (transitively).
    2.3 ld naively parallelizes linking tasks.
    The combination of the above three issues caused a heavily parallelized environment to create many heavy linking jobs that shared no memory (blockifier linking was duplicated).

Switching lld handles parallelizing better and reduces memory stress from ~10 2-3GB ld linker instances post fae5aed, to ~2 1-2GB lld linker instances for the same tasks, which has a minimal memory impact.

Moreover, lld is already the default linker on rust nightly, and will eventually become the default on stable (see code comment for link to tracking issue).


This change is Reviewable

@giladchase giladchase force-pushed the gilad/gilad/linker branch 11 times, most recently from 6dc3bd4 to 32de843 Compare October 20, 2024 10:46
Copy link

codecov bot commented Oct 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.83%. Comparing base (b0cfe82) to head (bb6d35e).
Report is 433 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (b0cfe82) and HEAD (bb6d35e). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (b0cfe82) HEAD (bb6d35e)
3 1
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #1388       +/-   ##
==========================================
- Coverage   74.18%   4.83%   -69.36%     
==========================================
  Files         359      34      -325     
  Lines       36240    2109    -34131     
  Branches    36240    2109    -34131     
==========================================
- Hits        26886     102    -26784     
+ Misses       7220    1998     -5222     
+ Partials     2134       9     -2125     
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@giladchase giladchase force-pushed the gilad/gilad/linker branch 3 times, most recently from bb6d35e to 19321a8 Compare October 20, 2024 18:46
Copy link

Benchmark movements:
tree_computation_flow performance improved 😺
tree_computation_flow time: [33.810 ms 33.836 ms 33.868 ms]
change: [-5.5448% -4.1828% -3.0447%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe

full_committer_flow performance improved 😺
full_committer_flow time: [29.462 ms 29.520 ms 29.594 ms]
change: [-1.5140% -1.3021% -1.0299%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe

Copy link
Contributor Author

@giladchase giladchase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, all discussions resolved (waiting on @dorimedini-starkware and @elintul)


.github/workflows/committer_ci.yml line 90 at r1 (raw file):

      # Benchmark the new code, splitting the benchmarks, and prepare the results for posting a comment.
      - uses: ./.github/actions/bootstrap

This calls dependencies.sh, which was missing (probably wasn't needed?).

Code quote:

      - uses: ./.github/actions/bootstrap

Copy link
Collaborator

@dorimedini-starkware dorimedini-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @elintul and @giladchase)

a discussion (no related file):
please bump blockifier and committer on py side and open PR to pass hodor


Copy link
Collaborator

@dorimedini-starkware dorimedini-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @elintul and @giladchase)

Copy link
Collaborator

@elintul elintul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @giladchase)

We've recently taken to limit parallelism during builds due to Out Of
Memory errors we got while running `cargo test` on 16-core machine,
or 8-core machines running with `-j16`.

We hit OOM due to two reasons:
1. Commit fae5aed introduced `llvm`
    dependencies to the blockifier, which notably increased compilation
    and linking time.
2. Before fae5aed, we already had a quite
memory-heavy linking stage, specifically ~7-8 `ld` instances running in
parallel, each with ~2-3GB of memory, which was just enough not to hit
OOM on most 16core machines.
This was primarily due to a combination of factors:
    2.1 the blockifier being a large lib that needs a 2-3GB `ld` process
    to link in isolation.
    2.2 many crates have the blockifier as a dependency (transitively).
    2.3 `ld` naively parallelizes linking tasks.
The combination of the above three issues caused a heavily parallelized
environment to create many heavy linking jobs that shared no
memory (blockifier linking was duplicated).

This PR switches to the rust linker from `ld` to `lld`, thus solving the
OOM issue without artificially bounding the amount of
parallelization.

`lld` handles parallelizing better and reduces memory stress
from ~10 2-3GB `ld` linker instances post fae5aed,
to ~2 1-2GB `lld` linker instances for the same tasks, which has a
minimal memory impact.

Moreover, `lld` is already the default linker on `rust` nightly, and
will eventually become the default on stable (see code comment for link
to tracking issue).
Copy link

Benchmark movements:
full_committer_flow performance improved 😺
full_committer_flow time: [29.520 ms 29.571 ms 29.641 ms]
change: [-2.0274% -1.7945% -1.5358%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

Copy link
Collaborator

@elintul elintul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @giladchase)

Copy link
Collaborator

@dorimedini-starkware dorimedini-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @giladchase)

Copy link
Contributor Author

@giladchase giladchase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @giladchase)

a discussion (no related file):

Previously, dorimedini-starkware wrote…

please bump blockifier and committer on py side and open PR to pass hodor

Done, python is only failing on a flaky test which isn't a blocker.


@giladchase giladchase merged commit 42a2694 into main Oct 21, 2024
22 checks passed
@giladchase giladchase deleted the gilad/gilad/linker branch October 21, 2024 10:43
guy-starkware pushed a commit that referenced this pull request Oct 22, 2024
We've recently taken to limit parallelism during builds due to Out Of
Memory errors we got while running `cargo test` on 16-core machine,
or 8-core machines running with `-j16`.

We hit OOM due to two reasons:
1. Commit fae5aed introduced `llvm`
    dependencies to the blockifier, which notably increased compilation
    and linking time.
2. Before fae5aed, we already had a quite
memory-heavy linking stage, specifically ~7-8 `ld` instances running in
parallel, each with ~2-3GB of memory, which was just enough not to hit
OOM on most 16core machines.
This was primarily due to a combination of factors:
    2.1 the blockifier being a large lib that needs a 2-3GB `ld` process
    to link in isolation.
    2.2 many crates have the blockifier as a dependency (transitively).
    2.3 `ld` naively parallelizes linking tasks.
The combination of the above three issues caused a heavily parallelized
environment to create many heavy linking jobs that shared no
memory (blockifier linking was duplicated).

This PR switches to the rust linker from `ld` to `lld`, thus solving the
OOM issue without artificially bounding the amount of
parallelization.

`lld` handles parallelizing better and reduces memory stress
from ~10 2-3GB `ld` linker instances post fae5aed,
to ~2 1-2GB `lld` linker instances for the same tasks, which has a
minimal memory impact.

Moreover, `lld` is already the default linker on `rust` nightly, and
will eventually become the default on stable (see code comment for link
to tracking issue).

Co-Authored-By: Gilad Chase <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Oct 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants