How to read all blobs in a repository with fastest throughput? #906
-
Hi there!

tl;dr: What is the recommended way to read all blobs from a repository in parallel using gitoxide? I'm looking for the greatest throughput on multi-core machines.

Details: Several months ago, with @Byron's help, we switched Nosey Parker from git2 to gitoxide, which gave a big performance boost, particularly on multi-core systems. Lots has changed in both Nosey Parker and gitoxide since the initial Dec 2022 integration. Nosey Parker now builds on both Intel and ARM, and installation shouldn't involve more than installing …

Nosey Parker currently operates in two phases: the first phase collects a list of all blob IDs to scan in a repository, and the second phase uses Rayon to read that list of objects in parallel. This second phase is implemented here: …

I doubt that this current approach gives the best performance. The two-phase approach is not essential, and I would drop it for a single-phase approach if that were sufficiently faster. What is likely to be the fastest way to read all blobs in parallel from a repository using gitoxide?

Thanks!
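For concreteness, a minimal sketch of that two-phase shape against the high-level `gix` API plus Rayon. Method names follow recent `gix` versions and may differ between releases; `scan_bytes` is a hypothetical stand-in for Nosey Parker's actual scanner:

```rust
use rayon::prelude::*;

// Hypothetical placeholder for the real scanning logic.
fn scan_bytes(_data: &[u8]) {}

fn scan_all_blobs(path: &std::path::Path) -> anyhow::Result<()> {
    // A ThreadSafeRepository can be shared across threads; each worker
    // derives a cheap thread-local Repository from it.
    let shared = gix::open(path)?.into_sync();

    // Phase 1: enumerate every object id in the object database,
    // loose and packed alike.
    let local = shared.to_thread_local();
    let ids: Vec<gix::ObjectId> = local.objects.iter()?.filter_map(Result::ok).collect();

    // Phase 2: decode objects in parallel, keeping only blobs.
    ids.par_iter().try_for_each_init(
        || shared.to_thread_local(), // one Repository per Rayon worker
        |repo, id| -> anyhow::Result<()> {
            let object = repo.find_object(*id)?;
            if object.kind == gix::object::Kind::Blob {
                scan_bytes(&object.data);
            }
            Ok(())
        },
    )
}
```

If the two-phase split were dropped, the same iterator could presumably be fed straight into Rayon via `par_bridge()` instead of collecting the ids first.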
-
Thanks for asking! I am looking forward to seeing Nosey Parker unfold its ultimate performance potential :).

Remembering what I do, I also think that it's probably not needed, worth it, or feasible to maintain a global "seen" set of object IDs as is currently done, even though it's probably something to experiment with.

To obtain each and every blob, one has to efficiently decode the object database without missing an object. In a probably-more-complex-than-you-need-it kind of fashion, this is done in `gix_odb::Store::verify_integrity()`, which will traverse each index (and pack), each multi-index (and multiple packs), and each loose object database, across all potentially linked … User-code can use the … With multi-indices, it's the same workflow, but one index/pack pair at a time. Finally, one can traverse all loose objects in the loose object databases, and probably parallelize that oneself with a couple of threads to do the actual processing.

It's notable that for each pack, a bunch of threads will be spun up and shut down, so keeping all cores perfectly busy probably isn't possible unless one also processes at least two repos at a time for good measure. Most modern machines take kindly to overcommitting, but maybe that's another optimization I am digressing into here.

Thinking about it, taking the …
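One way this per-pack traversal surfaces in user code might be the following sketch: the object-database iterator in `gix-odb` can order packed objects by ascending pack offset, which keeps delta decoding cache-friendly. This assumes the `gix::odb` re-export of `gix-odb`; the exact `Ordering` path and variant name may differ between versions:

```rust
use gix::odb::store::iter::Ordering;

/// Collect all object ids: packed objects first, in ascending pack offset,
/// then loose objects in lexicographical order.
fn all_ids_in_pack_order(repo: &gix::Repository) -> anyhow::Result<Vec<gix::ObjectId>> {
    Ok(repo
        .objects
        .iter()?
        .with_ordering(Ordering::PackAscendingOffsetThenLooseLexicographical)
        .filter_map(Result::ok)
        .collect())
}
```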