How to read all blobs in a repository with fastest throughput? #906
-
Hi there!

tl;dr: What is the recommended way to read all blobs from a repository in parallel using gitoxide? I'm looking for the greatest throughput on multi-core machines.

Details: Several months ago, with @Byron's help, we switched Nosey Parker from git2 to gitoxide, which gave a big performance boost, particularly on multi-core systems. Lots has changed in both Nosey Parker and gitoxide since the initial Dec 2022 integration. Nosey Parker now builds on both Intel and ARM, and installation shouldn't involve more than installing …

Nosey Parker currently operates in two phases: the first phase collects a list of all blob IDs to scan in a repository, and the second phase uses Rayon to read that list of objects in parallel. This second phase is implemented here: …

I doubt that this current approach gives the best performance. The two-phase approach is not essential, and I would drop it for a single-phase approach if that were sufficiently faster. What is likely to be the fastest way to read all blobs in parallel from a repository using gitoxide?

Thanks!
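For concreteness, a minimal sketch of that two-phase shape against the high-level `gix` API plus Rayon. Method names follow recent `gix` versions and may differ between releases; `scan_bytes` is a hypothetical stand-in for Nosey Parker's actual scanner:

```rust
use rayon::prelude::*;

// Hypothetical placeholder for the real scanning logic.
fn scan_bytes(_data: &[u8]) {}

fn scan_all_blobs(path: &std::path::Path) -> anyhow::Result<()> {
    // A ThreadSafeRepository can be shared across threads; each worker
    // derives a cheap thread-local Repository from it.
    let shared = gix::open(path)?.into_sync();

    // Phase 1: enumerate every object id in the object database,
    // loose and packed alike.
    let local = shared.to_thread_local();
    let ids: Vec<gix::ObjectId> = local.objects.iter()?.filter_map(Result::ok).collect();

    // Phase 2: decode objects in parallel, keeping only blobs.
    ids.par_iter().try_for_each_init(
        || shared.to_thread_local(), // one Repository per Rayon worker
        |repo, id| -> anyhow::Result<()> {
            let object = repo.find_object(*id)?;
            if object.kind == gix::object::Kind::Blob {
                scan_bytes(&object.data);
            }
            Ok(())
        },
    )
}
```

If the two-phase split were dropped, the same iterator could presumably be fed straight into Rayon via `par_bridge()` instead of collecting the ids first.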
-
Thanks for asking! I am looking forward to seeing Nosey Parker unfold its ultimate performance potential :).

Remembering what I do, I also think that it's probably not needed, worth it, or feasible to maintain a global "seen" set of object IDs as is currently done, even though it's probably something to experiment with.

To obtain each and every blob, one has to efficiently decode the object database without missing an object. In a probably-more-complex-than-you-need-it kind of fashion, this is done in `gix_odb::Store::verify_integrity()`, which will traverse each index (and pack), each multi-index (and multiple packs), and each loose object database, across all potentially linked … User-code can use the … With multi-indices, it's the same workflow, but one index/pack pair at a time. Finally, one can traverse all loose objects in the loose object databases, and probably parallelize that oneself with a couple of threads to do the actual processing.

It's notable that for each pack, a bunch of threads will be spun up and shut down, so keeping all cores perfectly busy probably isn't possible unless one also processes at least two repos at a time for good measure. Most modern machines take kindly to overcommitting, but maybe that's another optimization I am digressing into here.

Thinking about it, taking the …
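One way this per-pack traversal surfaces in user code might be the following sketch: the object-database iterator in `gix-odb` can order packed objects by ascending pack offset, which keeps delta decoding cache-friendly. This assumes the `gix::odb` re-export of `gix-odb`; the exact `Ordering` path and variant name may differ between versions:

```rust
use gix::odb::store::iter::Ordering;

/// Collect all object ids: packed objects first, in ascending pack offset,
/// then loose objects in lexicographical order.
fn all_ids_in_pack_order(repo: &gix::Repository) -> anyhow::Result<Vec<gix::ObjectId>> {
    Ok(repo
        .objects
        .iter()?
        .with_ordering(Ordering::PackAscendingOffsetThenLooseLexicographical)
        .filter_map(Result::ok)
        .collect())
}
```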