
Investigate why Solana is so slow. #6

Open
Eh2406 opened this issue Aug 14, 2024 · 44 comments

Comments

@Eh2406
Owner

Eh2406 commented Aug 14, 2024

Currently the Solana crates are filtered out of our test suite here. For some reason they take an extraordinarily long time to resolve (both for the existing resolver and even more so for PubGrub). It is not clear why they are difficult. Presumably they have some shared dependency with some pattern that the resolvers do not like. What dependency, what pattern, we do not know. Perhaps if we had a smaller example that still exemplified the problematic behavior it would be easier to see the problem.

@Eh2406
Owner Author

Eh2406 commented Aug 19, 2024

Here is the generated index file for one of the particularly problematic crates that was filtered out.
[email protected].

@Eh2406
Owner Author

Eh2406 commented Aug 21, 2024

Focusing on crates that had a high ratio of runtime to `should_cancel` call count, I pulled out solana-transaction-status for investigation. After applying some minimization tools, here is a file that takes ~25s and is only 350k lines long, which may or may not be useful to look at.
[email protected]

Edit: [email protected] was fixed with 7d6a3f0

@x-hgg-x
Contributor

x-hgg-x commented Oct 7, 2024

It seems that there is a lot of backtracking in the cargo resolver for specific versions of solana-core.

After checking the dependencies, I can see why:

  • Each of the 78 versions of solana-core between 1.0.0 and 1.5.14 has a direct dependency on the 20+ other solana-X crates with a requirement of the form ^1.x.y, where 1.x.y is the version of solana-core (this also applies to all solana-X crates before 1.5.14).
  • Each of the 252 versions of solana-core between 1.5.15 and 1.18.25 changed the format of the solana-X requirements to =1.x.y (this also applies to all solana-X crates after 1.5.15).

Since the resolver tries to pick the most recent compatible version when possible, if we set solana-core = "=1.0.0" in a Cargo.toml the resolver will try all combinations of the 300+ versions of the 20+ solana crates without success before finding a solution where all solana crates are at version 1.0.0. On my machine the resolution took almost 10min with cargo generate-lockfile, and 1m30s with -Zminimal-versions.

For versions 1.5.15 and above, the resolution is very fast since there is only one possible solution for the solana crates due to the = in the version requirement.
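As a minimal sketch (not cargo's real semver code) of why `^1.x.y` explodes the candidate space while `=1.x.y` pins it to a single version:

```rust
// Illustrative only: a stripped-down model of caret vs exact requirements.

#[derive(Clone, Copy, PartialEq)]
struct V(u64, u64, u64); // major.minor.patch

// `^1.x.y` matches any version with the same major that is >= the requirement.
fn caret_matches(req: V, v: V) -> bool {
    v.0 == req.0 && (v.1, v.2) >= (req.1, req.2)
}

// `=1.x.y` matches only the exact version.
fn exact_matches(req: V, v: V) -> bool {
    v == req
}

fn main() {
    // Pretend a solana-X crate published 1.0.0..=1.0.14 and 1.5.0..=1.5.14.
    let published: Vec<V> = (0..15)
        .map(|p| V(1, 0, p))
        .chain((0..15).map(|p| V(1, 5, p)))
        .collect();

    let caret = published.iter().filter(|v| caret_matches(V(1, 0, 0), **v)).count();
    let exact = published.iter().filter(|v| exact_matches(V(1, 0, 0), **v)).count();

    // Every published version is a candidate under `^1.0.0`, so the resolver
    // has something to backtrack over; under `=1.0.0` there is exactly one.
    assert_eq!(caret, 30);
    assert_eq!(exact, 1);
}
```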

@Eh2406
Owner Author

Eh2406 commented Oct 7, 2024

I guess I'm missing a detail: for versions <=1.5.14, what about the structure requires all the version numbers to match?

The general question is: can we find a smaller example that has the same reason for being slow? The easiest to debug are samples where every crate has a regular structure, and the overall runtime is exponential (or at least cubic) in the number of crates/versions. But those are hard to create from real-world examples. I've also been able to find the problem by identifying a small change that makes a big difference. Like "removing this one feature of one dependency of one crate makes things twice as fast" helped me track down the [email protected] problem.

Looking at crates with a high ratio of -Zminimal-versions time to normal resolution time is a good idea I have not yet tried. Thanks.

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

By looking at the logs using CARGO_LOG='off,cargo::core::resolver::conflict_cache=debug', I managed to reproduce and minimize the slowdown. The main problem is that many versions of solana-crate-features have a dependency on cc = ^1.0.45 with the num_cpus feature, which is only available until cc v1.0.49 (see https://lib.rs/crates/cc/versions), so there are a lot of incompatibilities.

Here is a test showing this behavior:

```rust
#[test]
fn test() {
    let mut reg = Vec::new();

    const V: usize = 100;

    reg.push(pkg_dep_with(("cc", "1.0.0"), vec![], &[("f", &[])]));
    for v in 1..V {
        reg.push(pkg(("cc", format!("1.0.{v}"))));
    }

    reg.push(pkg_dep(
        ("solana-core", "1.0.0"),
        vec![
            dep("cc"), // resolve is instantaneous without this dependency
            dep_req("solana-1", "1.0.0"),
        ],
    ));

    const N: usize = 100;
    const C: usize = 3;

    for v in 0..N {
        let version = format!("1.0.{v}");
        for c in 1..C {
            reg.push(pkg_dep(
                (format!("solana-{c}"), &version),
                vec![dep_req(&format!("solana-{}", c + 1), &version)],
            ));
        }
        reg.push(pkg_dep(
            (format!("solana-{C}"), &version),
            vec![dep_req("cc", "1.0.0").with(&["f"])],
        ));
    }

    let deps = vec![dep("solana-core")];
    let res = resolve(deps, &reg).unwrap();
    println!("{:?}", res);
}
```

Here are the results for various values of C and N when V = 100, running the test in release mode:
test-logs.tar.gz

With N = 99 the resolve is instantaneous, but with N = 100 it takes 12s, so there seems to be a threshold somewhere.

We can see that when C >= 2, the resolve time is linear over C but cubic over N:

|      | N=100 | N=200 | N=400 |
|------|-------|-------|-------|
| C=1  | 0.25s | 0.5s  | 1.4s  |
| C=2  | 7s    | 44s   | 314s  |
| C=3  | 12s   | 77s   | 537s  |
| C=6  | 30s   | 149s  | 1107s |
| C=12 | 99s   | 447s  | 2393s |

When plugging these test cases in the SAT resolver, there is no slowdown at all: a solution is always found in less than 10ms.

Another thing: activate_deps_loop() is called in a loop in the resolver, and each time the conflict cache restarts from zero since it is a local variable, so there is some duplicate work:
https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.
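A toy model (not cargo's code) of the cost of a per-iteration conflict cache versus a shared one:

```rust
use std::collections::HashSet;

// If the cache is local to each outer iteration, every iteration rediscovers
// the same conflicts; hoisting it out makes the work cumulative.

fn resolve_once(cache: &mut HashSet<u32>, probes: &mut usize) {
    for candidate in 0..1000u32 {
        if cache.contains(&candidate) {
            continue; // already known to conflict, skip the expensive probe
        }
        *probes += 1; // stands in for an expensive conflict check
        cache.insert(candidate); // record the discovered conflict
    }
}

fn main() {
    // Cache local to each of 7 iterations: all conflicts re-probed every time.
    let mut probes_local = 0;
    for _ in 0..7 {
        let mut cache = HashSet::new();
        resolve_once(&mut cache, &mut probes_local);
    }

    // Cache shared across the 7 iterations: each conflict probed only once.
    let mut probes_shared = 0;
    let mut cache = HashSet::new();
    for _ in 0..7 {
        resolve_once(&mut cache, &mut probes_shared);
    }

    assert_eq!(probes_local, 7000);
    assert_eq!(probes_shared, 1000);
}
```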

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

When removing the cc dependency from solana-core, the resolve is instantaneous since it directly eliminates all versions of cc without the f feature.
When keeping the dependency, I think this is not possible because of feature unification.

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

(Note: I tried to run the SAT resolver using info from the [email protected] file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid.)

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

After more tests, it seems that the threshold occurs when N >= V.

So in the example above, setting V to 101 instead of 100 completely eliminates the slowdown, even if the number of candidate versions has increased.

@Eh2406
Owner Author

Eh2406 commented Oct 8, 2024

Those are very intriguing results! I wonder if similar slowdowns happen with the PubGrub-based resolver from this crate.

(Note: I tried to run the SAT resolver using info from the [email protected] file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid.)

It is really unfortunate that we have two different testing infrastructures, one here and one in Cargo's repo. I wonder how we could bring them closer to being inter-compatible.

After more tests, it seems that the threshold occurs when N >= V.

There are various heuristics that use "smallest number of candidates" to decide which requirements to resolve first. I wonder if that's why this cutoff occurs.
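As an illustrative sketch of such a heuristic (the names here are hypothetical, not cargo's internals):

```rust
// "Fewest candidates first", analogous to the most-constrained-variable rule
// in CSP solvers: resolving the requirement with the fewest candidate versions
// first keeps the search tree narrow and surfaces failures early.

fn pick_next<'a>(pending: &'a [(&'a str, usize)]) -> Option<&'a str> {
    pending.iter().min_by_key(|(_, n)| *n).map(|(name, _)| *name)
}

fn main() {
    // (crate name, number of candidate versions)
    let pending = [("cc", 100), ("solana-1", 100), ("solana-core", 1)];
    assert_eq!(pick_next(&pending), Some("solana-core"));
    // When counts tie (as when N >= V makes candidate counts equal), which
    // requirement wins can flip, which could explain a sharp threshold.
}
```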

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

After a quick perf run, 70% of the time is spent hashing in this structure:
https://github.com/rust-lang/cargo/blob/ac39e69a0fff26f0b8c5e886c42175e8b58a6403/src/cargo/core/resolver/context.rs#L18-L29
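Swapping the hasher is a small, local change: the map type just takes a different `BuildHasher`. The sketch below uses a hand-rolled FNV-1a as a dependency-free stand-in for `rustc_hash::FxHasher`:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A simple non-DoS-resistant hasher, much cheaper per byte than the SipHash
// default. (FNV-1a here is only a stand-in; rustc-hash is the real candidate.)
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

// The only change a caller sees is the extra type parameter.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut m: FastMap<&str, u32> = FastMap::default();
    m.insert("solana-core", 1);
    assert_eq!(m.get("solana-core"), Some(&1));
}
```

The same substitution works for `im_rc::HashMap`/`HashSet`, which also accept a custom `BuildHasher` parameter.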

@Eh2406
Owner Author

Eh2406 commented Oct 8, 2024

I like to collect perf results with https://github.com/mstange/samply so that the results can be shared with just a few clicks.

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

Here are two perf profiles with C = 3 and N = 100 (read with https://profiler.firefox.com):
cargo.tar.gz
For the second profile I replaced the hasher with rustc-hash for the 4 fields in the ResolverContext of type im_rc::HashMap or im_rc::HashSet, and it shows a 60% decrease in resolution time.

@Eh2406
Owner Author

Eh2406 commented Oct 8, 2024

That seems like an easy win! I look forward to your PR to Cargo.

@x-hgg-x
Contributor

x-hgg-x commented Oct 8, 2024

Those are very intriguing results! I wonder if similar slowdowns happen with the PubGrub-based resolver from this crate.

For this test it takes less than 2s to resolve using the PubGrub resolver with C = 12 and N = 400.

But I observe the same threshold when N >= V and the same trend as N increases; all resolution times are much smaller than with the current cargo resolver.

@Eh2406
Owner Author

Eh2406 commented Oct 9, 2024

I don't know if this is helpful, but I just ran this project for the Solana crates; out_solana.zip is a CSV of the output. There may be valuable insight in comparing consecutive versions of crates. For example:

  • I don't see a big jump in time/cargo_time comparing versions of solana-core just before 1.5.15 with just after. So the changes in ^ to = don't seem to have mattered.
  • solana-core 1.2.18 and 1.2.19 are the slowest for cargo (cargo_time of >250s). But 1.3.1 is half that time. And 1.2.9 is 43s.
  • solana-archiver-lib 1.1.12 is the slowest for pub (time of >96s). But 1.1.13 is 73s.

If you can push the rustc-hash branch somewhere I will happily do another data collection to see how big a difference that makes overall!

@x-hgg-x
Contributor

x-hgg-x commented Oct 9, 2024

Here is my branch: https://github.com/x-hgg-x/cargo/tree/resolver-perf
I also tried nohash instead of rustc-hash, and it further improves resolve time by another 30% (using C = 3 and N = 100):

| commit     | duration |
|------------|----------|
| master     | 16s      |
| rustc-hash | 6.7s     |
| nohash     | 4.6s     |

Firefox perf profiles: perf.tar.gz

@Eh2406
Owner Author

Eh2406 commented Oct 9, 2024

Running all crates in the index went from 315.49min to 236.12min when going from whatever cargo commit I was on to your rustc-hash commit. That is a very good improvement! (I didn't compare to the nohash because this tool runs resolution in parallel, and your branch adds PackageId::new and an ACTIVATION_KEY_CACHE which will probably make things difficult to run in parallel.) Looking at the disaggregated data, there are very few examples that got slower with this change. solana-core 1.2.18 and 1.2.19 are now in the 190s range! I look forward to the PR!

@x-hgg-x
Contributor

x-hgg-x commented Oct 9, 2024

  • I don't see a big jump in time/cargo_time comparing versions of solana-core just before 1.5.15 with just after. So the changes in ^ to = don't seem to have mattered.

Versions between 1.4.15 and 1.8.14 didn't succeed because of yanked crates.

(Note: I needed to write a small Rust program to sort the CSV by increasing name and version to analyze it more easily; ideally that sorting would be done before writing the CSV.)

@x-hgg-x
Contributor

x-hgg-x commented Oct 9, 2024

(I didn't compare to the nohash because this tool runs resolution in parallel, and your branch adds PackageId::new and an ACTIVATION_KEY_CACHE which will probably make things difficult to run in parallel.)

PackageId::new already exists and already locks a Mutex. My commit added a second lock, but since PackageIds are only constructed once when loading a manifest, there should be no visible slowdown even when running in parallel.

bors added a commit to rust-lang/cargo that referenced this issue Oct 10, 2024
Improve resolver speed

### What does this PR try to resolve?

This PR improves the resolver speed after investigations from Eh2406/pubgrub-crates-benchmark#6.

### How should we test and review this PR?

Commit 1 adds a test showing a slow case in the resolver, where resolving time is cubic over the number of versions of a crate. It can be used for benchmarking the resolver.

Comparison of the resolving time with various values of `N=LAST_CRATE_VERSION_COUNT` and `C=TRANSITIVE_CRATES_COUNT`:

|      | N=100 | N=200 | N=400 |
|------|-------|-------|-------|
| C=1  | 0.25s | 0.5s  | 1.4s  |
| C=2  | 7s    | 44s   | 314s  |
| C=3  | 12s   | 77s   | 537s  |
| C=6  | 30s   | 149s  | 1107s |
| C=12 | 99s   | 447s  | 2393s |

Commit 2 replaces the default hasher with the hasher from `rustc-hash`, decreasing resolving time by more than 50%.

Performance comparison with the test from commit 1 by setting `LAST_CRATE_VERSION_COUNT = 100`:
| commit          | duration |
|-----------------|----------|
| master          | 16s      |
| with rustc-hash | 6.7s     |

Firefox profiles, can be read with https://profiler.firefox.com:
[perf.tar.gz](https://github.com/user-attachments/files/17318243/perf.tar.gz)

r? Eh2406
@x-hgg-x
Contributor

x-hgg-x commented Oct 10, 2024

Another thing: activate_deps_loop() is called in a loop in the resolver, and each time the conflict cache restarts from zero since it is a local variable, so there is some duplicate work:
https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.

For solana-archiver-lib = "1.0.2", the conflict cache is cleared in the middle of resolving when the registry has pending index requests, so it does the same work 7 times.

This may explain the long time spent dropping the ResolverContext (rust-lang/cargo#14665 (comment)), and the inconsistent benchmarks (rust-lang/cargo#14665 (comment)).

@x-hgg-x
Contributor

x-hgg-x commented Oct 10, 2024

By moving the conflict cache out of the loop, the resolving time for solana-archiver-lib = "1.0.2" drops from 400s to 50s.

@x-hgg-x
Contributor

x-hgg-x commented Oct 10, 2024

And the resolving time for solana-core = "=1.2.18" drops to 25s (!).

@Eh2406
Owner Author

Eh2406 commented Oct 10, 2024

The loop has to do with requesting data from the network. By processing all the existing data it determines what new data is required. Then it waits for that data to be available and repeats. This should all be deterministic.
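A toy model of that loop (illustrative names only, not cargo's code) might look like this: resolve over the data already in hand, note what is missing, fetch it, and repeat; because each pass processes all available data, the sequence of passes is deterministic for a given registry.

```rust
use std::collections::{HashMap, HashSet};

fn entry(name: &str, deps: &[&str]) -> (String, Vec<String>) {
    (name.to_string(), deps.iter().map(|d| d.to_string()).collect())
}

// One pass over whatever index data is locally available; crates not yet in
// the local index are recorded as missing instead of being resolved.
fn resolve_pass(index: &HashMap<String, Vec<String>>, root: &str, missing: &mut HashSet<String>) {
    let mut stack = vec![root.to_string()];
    let mut seen: HashSet<String> = HashSet::new();
    while let Some(name) = stack.pop() {
        if !seen.insert(name.clone()) {
            continue;
        }
        match index.get(&name) {
            Some(deps) => stack.extend(deps.iter().cloned()),
            None => {
                missing.insert(name); // queue an index request for this crate
            }
        }
    }
}

fn main() {
    // Full remote index; the local cache starts with only the root crate.
    let remote: HashMap<String, Vec<String>> =
        [entry("a", &["b"]), entry("b", &["c"]), entry("c", &[])].into();
    let mut local: HashMap<String, Vec<String>> = [entry("a", &["b"])].into();

    let mut passes = 0;
    loop {
        passes += 1;
        let mut missing = HashSet::new();
        resolve_pass(&local, "a", &mut missing);
        if missing.is_empty() {
            break; // nothing pending: this was the final, authoritative pass
        }
        for name in missing {
            // "wait for the network": copy the entry from the remote index
            if let Some(deps) = remote.get(&name) {
                local.insert(name, deps.clone());
            }
        }
    }
    // One pass discovers "b" is missing, the next discovers "c", the last succeeds.
    assert_eq!(passes, 3);
}
```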

Whether the conflict cache can be shared will require more thought than I can give it at the moment. It has to do with whether new facts could ever invalidate a conflict that has already been registered. (I think you're right that it can be shared, but I'm not confident.)

I don't see how it could affect the long dropping time for ResolverContext, can you elaborate? I also don't see how it could cause nondeterminism.

And it definitely is not related to the slow times we're observing when running this tool, because this tool preloads all of the data and does not do that loop.

@x-hgg-x
Contributor

x-hgg-x commented Oct 10, 2024

I don't see how it could affect the long dropping time for ResolverContext, can you elaborate?

The ResolverContext contains a lot of data so the drop takes several seconds. But this value is multiplied by the 7 times we rebuild and drop the ResolverContext.

I also don't see how it could cause nondeterminism?

Here are two consecutive runs for solana-core = "=1.0.5" on master. I also printed the instant when the loop restarts:

Run 1 (236s):

    Updating crates.io index
110.065434ms
113.120678ms
767.306032ms [===========>                     ] 1 complete; 1 pending                                       
769.198182ms
1.479256828s [=================>               ] 32 complete; 26 pending                                     
5.338349626s [====================>            ] 207 complete; 61 pending                                    
   Resolving dependency graph...
10.957888526s[=======================>         ] 407 complete; 16 pending                                    
   Resolving dependency graph...
48.325845407s[==========================>      ] 493 complete; 5 pending                                     
   Resolving dependency graph...
86.382807965s[=============================>   ] 526 complete; 2 pending                                     
   Resolving dependency graph...
124.414130285s=============================>   ] 528 complete; 5 pending                                     
   Resolving dependency graph...
162.600263637s=============================>   ] 533 complete; 1 pending                                     
   Resolving dependency graph...
200.951554361s=============================>   ] 534 complete; 0 pending                                     
   Resolving dependency graph...
     Locking 513 packages to latest compatible versions

Run 2 (310s):

    Updating crates.io index
108.795614ms
111.606964ms
760.045514ms [===========>                     ] 1 complete; 1 pending                                                                                                                                                 
763.731469ms
1.451621731s [=================>               ] 48 complete; 10 pending                                                                                                                                               
4.560725379s [====================>            ] 237 complete; 20 pending                                                                                                                                              
   Resolving dependency graph...
9.873827135s [=======================>         ] 412 complete; 8 pending                                                                                                                                               
   Resolving dependency graph...
41.735024564s[==========================>      ] 488 complete; 9 pending                                                                                                                                                
   Resolving dependency graph...
79.806753686s[=============================>   ] 518 complete; 5 pending                                                                                                                                                
   Resolving dependency graph...
118.547341727s=============================>   ] 523 complete; 9 pending                                                                                                                                                
   Resolving dependency graph...
157.759883082s=============================>   ] 532 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
196.723956203s=============================>   ] 533 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...
235.680510433s=============================>   ] 533 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
274.420162854s=============================>   ] 534 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...
     Locking 513 packages to latest compatible versions

And it definitely is not related to the slow times we're observing when running this tool, because this tool preloads all of the data and does not do that loop.

Yes, so I should use this repo to do my benchmarks.

@Eh2406
Owner Author

Eh2406 commented Oct 11, 2024

But this value is multiplied by the 7 times we rebuild and drop the ResolverContext.

Ah, yes that makes sense. Thanks.

The ResolverContext contains a lot of data so the drop takes several seconds.

One thing we observed in this repo is that using tikv_jemallocator or mimalloc made a big difference for multithreaded performance. I think I also measured that it helped with single threaded performance. We should double check what allocator cargo is using, and what allocator rustc is using, and if they're not the same switch cargo to the approved rustc one to see if it helps. That may make a big difference to users at very little effort.

Here are two consecutive runs for solana-core = "=1.0.5" on master. I also printed the instant when the loop restarts:

There's definitely non-determinism. When looking at individual packages, I've been observing it too. I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.

Yes, so I should use this repo to do my benchmarks.

I didn't mean to go that far. On one hand, this repo is set up to make it convenient to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues we're making progress on.

@x-hgg-x
Contributor

x-hgg-x commented Oct 11, 2024

We should double check what allocator cargo is using, and what allocator rustc is using, and if they're not the same switch cargo to the approved rustc one to see if it helps.

There is an option in rustc to enable jemalloc, and I think it is enabled for the final binary:
https://github.com/rust-lang/rust/blob/484c8e78cb915bf1aca57ec62953324fe76b5d37/config.example.toml#L723-L725
https://github.com/search?q=repo%3Arust-lang%2Frust%20rust.jemalloc&type=code

I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.

There is at least a difference between a fully offline registry where everything is always ready and the online registry which needs to download from index.crates.io. For solana-core = "=1.2.18", resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry. It seems that far fewer candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending on which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.

I didn't mean to go that far. On one hand, this repo is set up to make it convenient to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues we're making progress on.

It's all good, I also think that both approaches have their own advantages.

@x-hgg-x
Contributor

x-hgg-x commented Oct 11, 2024

I have found the non-determinism bug: this code is incorrect:
https://github.com/rust-lang/cargo/blob/e8adba9d24257e14fef8793cecabb55b2fbcd0e3/src/cargo/sources/registry/http_remote.rs#L798-L811.

self.handle_completed_downloads() needs to be called after self.multi.perform(), or else we can return from the function when all downloads are finished but not yet processed.

We can see it in the above log:

235.680510433s=============================>   ] 533 complete; 1 pending                                                                                                                                                
   Resolving dependency graph...
274.420162854s=============================>   ] 534 complete; 0 pending                                                                                                                                                
   Resolving dependency graph...
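A toy model of that ordering (illustrative only, not cargo's actual code) shows why draining completed work before the transfer step leaves finished downloads unprocessed until the next call:

```rust
struct Downloads {
    in_flight: Vec<&'static str>,
    completed: Vec<&'static str>,
    processed: Vec<&'static str>,
}

impl Downloads {
    // Stand-in for `multi.perform()`: some transfers finish during this step.
    fn perform(&mut self) {
        self.completed.append(&mut self.in_flight);
    }

    // Stand-in for `handle_completed_downloads()`.
    fn handle_completed(&mut self) {
        self.processed.append(&mut self.completed);
    }

    fn wait_buggy(&mut self) {
        self.handle_completed(); // drains nothing useful yet
        self.perform(); // downloads finish here but are left unprocessed
    }

    fn wait_fixed(&mut self) {
        self.perform();
        self.handle_completed(); // now sees everything that just finished
    }
}

fn main() {
    let mut d = Downloads { in_flight: vec!["cc"], completed: vec![], processed: vec![] };
    d.wait_buggy();
    assert!(d.processed.is_empty()); // finished, but the caller never sees it

    let mut d = Downloads { in_flight: vec!["cc"], completed: vec![], processed: vec![] };
    d.wait_fixed();
    assert_eq!(d.processed, vec!["cc"]); // processed within the same call
}
```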

@Eh2406
Owner Author

Eh2406 commented Oct 11, 2024

There is an option in rustc to enable jemalloc, and I think it is enabled for the final binary:

So that's what's used for the compiler and LLVM. The next question is what's used for cargo? That appears to be the system default allocator: https://github.com/search?q=repo%3Arust-lang%2Fcargo%20global_allocator&type=code We should try jemalloc and see if it helps!

resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry.

That's very intriguing. There are several layers of caching that might explain this difference. The most direct is the reuse of the RegistryQueryer, but registries themselves also cache reading from disk. I don't know if that's enough to explain a 2x difference.

It seems that far fewer candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending on which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.

In my understanding, the final loop of resolution and the only loop when offline should be identical. Turning on logging and comparing the output between the two different scenarios should make this clear. (Or adding debug statements that print out precisely the relevant info if the logs are too verbose.) If they're not the same, I would really love to understand why!

By definition, it's the final loop because nothing returned pending. Therefore it's getting the same input as the offline mode where everything returns ready on the first request. By not sharing anything that could affect the order of operations in the resolver between calls to the resolver, we ensure that they get the exact same output.

But the more I talk about it, the more I'm convinced that you are right about being able to share the conflict cache. The code is set up so that if an incremental resolution fails, then we know that waiting for the network will not cause it to succeed, and we can report the error to the user as soon as we find it. That is incompatible with the conflict cache storing a problem that would be invalidated by future network data.

@x-hgg-x
Contributor

x-hgg-x commented Oct 11, 2024

Turning on logging and comparing the output between the two different scenarios should make this clear. (Or adding debug statements that print out precisely the relevant info if the logs are too verbose.)

Here is the conflict cache logs for solana-core = "=1.2.18":
logs.tar.gz
Files iter1.log to iter8.log are logs for a resolve using an online registry (each file representing a loop iteration), and full.log is the log using the repo for the offline registry.

We can see that iterations 4 to 8 are the same using the online registry, and that the conflict cache using the offline registry is much bigger than for iteration 8.

Here is the conflict cache log when sharing the conflict cache between iterations:
shared.log

@Eh2406
Owner Author

Eh2406 commented Oct 12, 2024

That is so interesting! I wonder why the last loop didn't match the full run. It's always so intriguing to discover one's knowledge is incomplete. Thank you! I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.

The fact that we generate the same conflicts for 4 iterations definitely backs up your measured evidence of huge performance savings from sharing it. Add that to your list of future PRs!

Due to the holiday, I will probably have limited availability to converse on the subject, despite its deep interest to me, until Monday.

@x-hgg-x
Contributor

x-hgg-x commented Oct 12, 2024

I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.

After analysis, the difference is that adding solana-core = "=1.2.18" to a Cargo.toml doesn't take dev-dependencies into account. After directly copying the Cargo.toml of the dependency, I observe exactly the same behavior between an offline registry and the last iteration for an online registry.

@x-hgg-x
Contributor

x-hgg-x commented Oct 12, 2024

When using an online registry, deps which are not yet fetched are not added to the candidate summary:
https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266.

On the other hand, if a dependency query returns Poll::Ready, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned.

This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache should be able to be shared.

Compared to an offline registry or the last iteration for an online registry, doing the resolve with a shared conflict cache and an online registry makes the order of operations differ slightly: the content of the cache is different, but the resolve result after the last iteration is the same.

Here is the sorted conflict cache logs for when sharing it: shared_sorted.log
For comparison, the same logs for an offline registry: full_sorted.log

@x-hgg-x
Contributor

x-hgg-x commented Oct 13, 2024

We should try jemalloc and see if it helps!

Here are the perf profiles for solana-core = "=1.0.5" without dev deps, where the baseline uses the nohash optimization and the fix from rust-lang/cargo#14680: profiles.tar.gz

4 profiles are compared:

  1. baseline
  2. switch to jemalloc (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-3)
  3. optimized rcvec: RcVecIter struct is optimized so that we don't need to clone the inner T when iterating (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-4)
  4. optimized rcvec + switch to jemalloc
| profile        | duration |
|----------------|----------|
| baseline       | 213s     |
| rcvec          | 202s     |
| jemalloc       | 193s     |
| rcvec+jemalloc | 189s     |
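A hedged sketch of the RcVecIter idea (illustrative, not the code in the branch): iterate an Rc<Vec<T>> by index so elements can be inspected by reference and cloned only when actually yielded:

```rust
use std::rc::Rc;

// Sharing the Vec behind an Rc and tracking an offset lets callers peek at
// elements without cloning T; a clone happens only when an item is consumed.
struct RcVecIter<T> {
    vec: Rc<Vec<T>>,
    offset: usize,
}

impl<T: Clone> RcVecIter<T> {
    fn new(vec: Rc<Vec<T>>) -> Self {
        RcVecIter { vec, offset: 0 }
    }

    // Inspect the next element without cloning it.
    fn peek(&self) -> Option<&T> {
        self.vec.get(self.offset)
    }
}

impl<T: Clone> Iterator for RcVecIter<T> {
    type Item = (usize, T);
    fn next(&mut self) -> Option<Self::Item> {
        let item = self.vec.get(self.offset).cloned()?;
        let idx = self.offset;
        self.offset += 1;
        Some((idx, item))
    }
}

fn main() {
    let shared = Rc::new(vec!["1.0.0", "1.0.1", "1.0.2"]);
    let mut it = RcVecIter::new(Rc::clone(&shared));
    assert_eq!(it.peek(), Some(&"1.0.0")); // inspected, not cloned
    assert_eq!(it.next(), Some((0, "1.0.0")));
    assert_eq!(it.count(), 2); // remaining items
}
```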

Output of perf diff between profile 1 and 4:

```
# Baseline  Delta Abs  Shared Object  Symbol
     8.74%     -6.85%  cargo          [.] cargo::core::resolver::activate_deps_loop
      3.85%     +4.24%  cargo          [.] <cargo::core::dependency::Dependency as core::hash::Hash>::hash
     11.95%     -4.21%  cargo          [.] cargo::core::resolver::RemainingCandidates::next
      2.98%     -2.98%  libc.so.6      [.] _int_malloc
      2.03%     -2.03%  libc.so.6      [.] _int_free
      2.36%     +1.93%  cargo          [.] core::hash::Hash::hash_slice
      5.78%     +1.69%  libc.so.6      [.] __memmove_avx_unaligned_erms
      6.36%     +1.39%  cargo          [.] <bitmaps::bitmap::Iter<Size> as core::iter::traits::iterator::Iterator>::next
      2.75%     +1.37%  cargo          [.] alloc::rc::Rc<T,A>::make_mut
      1.50%     +1.09%  cargo          [.] cargo::core::resolver::activate
      1.30%     +1.06%  cargo          [.] cargo::core::resolver::conflict_cache::ConflictCache::conflicting
      3.24%     +1.05%  cargo          [.] <alloc::rc::Rc<T,A> as core::ops::drop::Drop>::drop
                +0.79%  cargo          [.] _rjem_sdallocx
                +0.66%  cargo          [.] tikv_jemallocator::layout_to_flags
      3.12%     +0.51%  cargo          [.] <core::hash::sip::Hasher<S> as core::hash::Hasher>::write
```

We see that the next thing to optimize will be the hashing of `Dependency`.

@Eh2406
Copy link
Owner Author

Eh2406 commented Oct 14, 2024

This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache should be able to be shared.

I am convinced. Looking forward to seeing a PR for this.

Looks like jemalloc and rcvec are solid wins. More PRs for you! Although I would do them separately, because the review concerns of adding a new dependency are different from those of modifying an internal data structure.

bors added a commit to rust-lang/cargo that referenced this issue Oct 15, 2024
fix(resolver): share conflict cache between activation retries

### What does this PR try to resolve?

Found in Eh2406/pubgrub-crates-benchmark#6.

Currently the resolve is done in a loop, restarting if there are pending dependencies which are not yet fetched, and recreating the conflict cache at each iteration.

This means we do a lot of duplicate work, especially if the newly fetched dependencies have zero or few dependencies that don't conflict.

This also means that the maximum depth of the dependency graph has a big impact on the total resolving time, since adding a dependency not yet seen means we will restart the resolve at the next iteration.

Here is the output of the resolve for the master branch using `solana-core = "=1.2.18"` in `Cargo.toml`, printing the duration from the start after each iteration:
```
    Updating crates.io index
103.142341ms   [=====>                           ] 1 complete; 0 pending
118.486144ms   [========>                        ] 2 complete; 0 pending
785.389055ms   [===========>                     ] 62 complete; 0 pending
4.916033377s   [==============>                  ] 277 complete; 0 pending
9.13101404s    [=================>               ] 439 complete; 0 pending
50.083251549s  [====================>            ] 520 complete; 0 pending
133.401303265s [=======================>         ] 561 complete; 0 pending
214.120483345s [==========================>      ] 565 complete; 0 pending
294.180677785s [=============================>   ] 566 complete; 0 pending
    Locking 536 packages to latest compatible versions
```

To improve the situation, this PR moves the conflict cache outside the activation loop so it can be reused for all iterations.

This is possible since when using an online registry, dependencies which are not yet fetched are not added to the candidate summary:

https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266

On the other hand, if a dependency query returns `Poll::Ready`, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned. This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache can be shared.

Here is the output of the resolve with this PR:
```
    Updating crates.io index
98.239126ms   [=====>                           ] 1 complete; 0 pending
127.528982ms  [========>                        ] 2 complete; 0 pending
821.601257ms  [===========>                     ] 62 complete; 0 pending
4.67917132s   [==============>                  ] 277 complete; 0 pending
5.933230172s  [=================>               ] 431 complete; 0 pending
14.74321904s  [====================>            ] 508 complete; 0 pending
24.607428893s [=======================>         ] 546 complete; 0 pending
24.700610469s [==========================>      ] 550 complete; 0 pending
24.757651875s [=============================>   ] 551 complete; 0 pending
    Locking 536 packages to latest compatible versions
```

Besides the huge performance savings between iterations, sharing the conflict cache has the side effect of eliminating dead paths earlier in the resolve after a restart. We can see that the total duration of all iterations using this PR is less than the duration of the last iteration alone on master, and that it needs to resolve fewer dependencies overall.
bors added a commit to rust-lang/cargo that referenced this issue Oct 15, 2024
fix(resolver): avoid cloning when iterating using RcVecIter

### What does this PR try to resolve?

Follow up of Eh2406/pubgrub-crates-benchmark#6 (comment).

This PR modifies the internal `RcVecIter` struct so that we can iterate on its items without cloning them. This improves performance of the resolver because eagerly cloning `Arc<T>` is not free.

Comparison of performance using `solana-core = "=1.0.5"` in `Cargo.toml`:

| branch     | duration |
|------------|----------|
| master     | 213s     |
| PR         | 202s     |

r? Eh2406
@Eh2406
Copy link
Owner Author

Eh2406 commented Oct 15, 2024

Thank you for all the work you've done so far. It's made a dramatic improvement to the real-world performance for users of cargo.

The Jewish holy days come all together this time of year. I will be unable to respond until Monday. I promise it's almost over: just this week and next week, and then I'll be back to real availability. In addition to my lack of availability, Eric and Ed also have limited availability for a while (vacation and a conference in China, respectively).

You seem to be coming up with ways to improve things on your own; I'm excited to see what else you come up with. If you're looking for things to do, the next step on the SAT tests would be adding features to PropTest. Alternatively, you could investigate the bug you made such a clear report for. If you want to stay focused on performance, I had a thought about reducing the drop time of ResolverContext.

ResolverContext is a copy of the critical data structures, kept in case we need to jump back to that state to try a different decision. In general, making data structures that are cheap to reset is hard, which is why we rely on im-rs. But two of these structures are "append only" and keep track of when they were modified. Specifically, activations keeps track by storing a conflict age for each key, and links implicitly keeps track by whether the package is active (although it could also keep track explicitly by storing an age).

So I think for these two we don't need a stack of copies; we can "just" remove the new entries when we're backtracking. (Although I'd want to be really careful testing it. I'd want a commit where we add the new data structure without removing the old data structure, and assert that they are equal every time either one is modified.) Straightforwardly, we could use .retain to only keep the entries that are still active, in O(#active). If we switch to an order-preserving map (indexmap), we should be able to pop the most recent items, for O(distance backtracked). It should work in theory, but I haven't gotten it to work in practice yet.

Earlier in this issue we have also had some other leads, which you could follow up on if you wanted to: trying to create a minimal example of why some Solana crates take >20 sec on the PubGrub resolver, or trying to improve interoperability between the testing infrastructure here and in the Cargo repo.

Anyway just some ideas and encouragement.

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Oct 22, 2024

I experimented with the pubgrub crate to find why it is slow, and after finding the reason (a lot of allocations in the hot loop), I tried to fix it in my fork: https://github.com/x-hgg-x/pubgrub.

I have done two main improvements:

  • Use u32 for the package type, so that we don't clone it everywhere and its hash doesn't need a memory access.
  • Limit the number of versions of a package to 63, so that we can use u8 for the version type and u64 for the version set and term types, treating it as a bitset. A dependency provider will need to get all possible versions of a package, then sort them and compute each version as its sorted index. To support more than 63 versions we can use virtual packages to limit the resolver to only one true version per package (see x-hgg-x/pubgrub@ddbfa6d).

By doing all this, there are almost no remaining allocations in the resolver hot loop, so I expect the resolving time to be divided by 5 to 10.

Flamegraph before (total time 30s): flamegraph.1.svg.tar.gz

Flamegraph when using u32 as package type (total time 15s): flamegraph.2.svg.tar.gz

Flamegraph when using u64 as version set: TODO

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Oct 22, 2024

During this process I also found how to implement weak dependencies in pubgrub.

We need to lower this clause: `feature => (dep => dep_feature)`.

For package `a` with feature `weak = ["b?/c"]`, we can set these packages and dependencies:

```
a#1.x.y+feat(weak) @ 1.2.3  =>  a#1.x.y+feat(weak)+enabled @ (off | 1.2.3)
                                                              ^ "off" is preferred to other versions when choosing versions
a#1.x.y+feat(weak)+enabled @ off  =>  a#1.x.y+dep(b) @ off
                                                       ^ "off" prevents the dep "b" from being activated (common to all versions)
a#1.x.y+feat(weak)+enabled @ 1.2.3  =>  a#1.x.y+dep(b)+feat(c) @ 1.2.3
```

@Eh2406
Copy link
Owner Author

Eh2406 commented Oct 22, 2024

Thank you for spiking those out! I'm still hoping to get to an understanding of why these versions are algorithmically so much slower than similar nearby versions. In the meantime I'm happy to see any kind of speed up the comes along!

I did not have a chance to fully understand your fork, but the first suggestion sounds similar to pubgrub-rs/pubgrub#242. Similarly, at one time I had a branch that did thread-local interning in this project. Given that you're seeing a 50% improvement, I should really rebase both of those efforts.

Using a bitset as a VS is a pretty interesting idea. It does mean listing all versions of the dependency before the parent package can be processed. Fundamentally, if A depends on B@">2.0", we don't know how to express ">2.0" until we have listed the versions of B, which means we need to wait for that network request before we can return the dependencies of A. But that is a problem we can figure out later, if the performance numbers are good enough.

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Oct 23, 2024

It does mean listing all versions of the dependency before the parent package can be processed.

Yes, but processing the requirement once per package per call to resolve() in the get_dependencies() method should be better than reprocessing it at each loop iteration in the prioritize() method for all packages and in the choose_version() method for the activated package.

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Oct 24, 2024

These two lines seem incorrect:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L113
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L141

They cause the following tests to fail:

```rust
#[test]
fn test_is_disjoint() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());

    let p2 = Term::Positive(Range::<u32>::lower_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());

    assert!(p1.is_disjoint(&p2)); // true
    assert!(p1.is_disjoint(&n2)); // true
    assert!(n1.is_disjoint(&p2)); // true
    assert!(n1.is_disjoint(&n2)); // false
}
```

```rust
#[test]
fn test_subset_of() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());

    let p2 = Term::Positive(Range::<u32>::higher_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());

    assert!(p1.subset_of(&p2)); // true
    assert!(p1.subset_of(&n2)); // true
    assert!(n1.subset_of(&p2)); // false
    assert!(n1.subset_of(&n2)); // true
}
```

As the docs state, Positive(r) and Negative(r.complement()) should be equivalent:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12

But there is also special handling when the term is positive: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/internal/partial_solution.rs#L246

Can Positive(r) and Negative(r.complement()) be used interchangeably? It could be useful if we need to add custom incompatibilities with positive and negative terms (I want to try extending the API of get_dependencies() to add a OneOf variant, which could replace the Wide package variant for dependencies covering multiple major versions).

@Eh2406
Copy link
Owner Author

Eh2406 commented Oct 28, 2024

Sorry for the slow reply. The holidays are over so hopefully I can be more prompt going forward.

As the docs, Positive(r) and Negative(r.complement()) should be equivalent:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12

Everyone has been confused by this when learning the PubGrub code, myself included. The rest of the paragraph you linked to tries to explain when Positive(r) and Negative(r.complement()) are not equivalent. That documentation was added when konstin was confused by the issue here. I'm always happy to add more documentation, or restructure the code, to make the distinction clearer.

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Oct 28, 2024

I made a PR to improve documentation: pubgrub-rs/pubgrub#264.

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Nov 1, 2024

I implemented multiple optimizations in pubgrub in https://github.com/x-hgg-x/pubgrub/tree/perf (including the two above optimizations), and I applied it for cargo resolution in a new repo: https://github.com/x-hgg-x/pubgrub-bench.

Resolution time for solana-archiver-lib v1.1.12:

  • Base resolution time is around 90s using your repo.
  • When using u32 for the package type and u64 for version sets and terms, the resolution time drops to around 5.6s: perf1.tar.gz
  • By optimizing the processing of incompatibility contradictions (x-hgg-x/pubgrub@2f1bd40), the resolution time drops to around 3.3s: perf2.tar.gz
  • By using the fact that a Package is an arena index, we can replace the maps keyed by package with vecs for incompatibilities and package assignments (x-hgg-x/pubgrub@78fee1e). With this change the resolution time drops to around 2.1s: perf3.tar.gz

Sorted list of packages in the solution: solution.txt.tar.gz
Pubgrub log (backtracked 7000 times): pubgrub.log.tar.gz

@x-hgg-x
Copy link
Contributor

x-hgg-x commented Nov 7, 2024

Associated PR: pubgrub-rs/pubgrub#274
