Async backends are here #2342
Merged
Conversation
🎶 Imperial march plays 🎶
It is here.
yuja approved these changes on Oct 7, 2023
LGTM, but I don't have much experience in async Rust.
This is to avoid confusion with instances of the `Store` type.
The commit backend at Google is cloud-based (and so are the other backends); it reads and writes commits from/to a server, which stores them in a database. That makes latency much higher than for disk-based backends. To reduce the latency, we have a local daemon process that caches and prefetches objects. There are still many cases where latency is high, such as when diffing two uncached commits. We can improve that by changing some of our (jj's) algorithms to read many objects concurrently from the backend. In the case of tree-diffing, we can fetch one level (depth) of the tree at a time. There are several ways of doing that:

* Make the backend methods `async`
* Use many threads for reading from the backend
* Add backend methods for batch reading

I don't think we typically need CPU parallelism, so it's wasteful to have hundreds of threads running in order to fetch hundreds of objects in parallel (especially when using a synchronous backend like the Git backend). Batching would work well for the tree-diffing case, but it's not as composable as `async`. For example, if we wanted to fetch some commits at the same time as we were doing a diff, it's hard to see how to do that with batching. Using async seems like our best bet.

I didn't make the backend interface's write functions async because writes are already async with the daemon we have at Google. That daemon will hash the object and immediately return, and then send the object to the server in the background. I think any cloud-based solution will need a similar daemon process. However, we may need to reconsider this if/when jj gets used on a server with a custom backend that writes directly to a database (i.e. no async daemon in between).

I've tried to measure the performance impact. The largest difference I've been able to measure was on `jj diff --ignore-working-copy -s --from v5.0 --to v6.0` in the Linux repo, which increases from 749 ms to 773 ms (3.3%). In most cases I've tested, there's no measurable difference. I've tried diffing from the root commit, as well as `jj --ignore-working-copy log --no-graph -r '::v3.0 & author(torvalds)' -T 'commit_id ++ "\n"'` (to test a commit-heavy load).
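To make the tradeoff concrete, here is a minimal sketch of the level-at-a-time idea built on async read methods. All names here (`read_tree`, `TreeId`, `Tree`, `BackendError`) are placeholders for illustration, not jj's actual API:

```rust
use futures::future::try_join_all;

// Placeholder types for illustration only; jj's real types are richer.
#[derive(Clone)]
struct TreeId(String);
struct Tree {
    child_ids: Vec<TreeId>,
}
#[derive(Debug)]
struct BackendError(String);

// Stand-in for an async backend read method. A cloud backend would await a
// server round-trip here; a disk-based backend would just read a file.
async fn read_tree(id: &TreeId) -> Result<Tree, BackendError> {
    // A real backend would send `id` to the server and await the reply.
    let _ = id;
    Ok(Tree { child_ids: vec![] })
}

// Fetch one level (depth) of the tree at a time: all trees at the current
// level are requested concurrently, so the latency cost is roughly one
// round-trip per level instead of one per object, with no extra threads.
async fn read_level(ids: &[TreeId]) -> Result<Vec<Tree>, BackendError> {
    try_join_all(ids.iter().map(|id| read_tree(id))).await
}
```

Because these reads compose as futures, the same `read_tree` calls could also be awaited alongside unrelated work (for example, fetching commits while a diff is in progress), which is the composability that batch-read methods would lack.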
`futures::stream::Stream::collect()` requires a collection that implements `Default` and `Extend`, and I would like to be able to collect a stream of trees.
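For reference, this is the constraint in question: `collect()` on a stream in the `futures` crate is bounded by `C: Default + Extend<Item>`. The `TreeList` type below is a made-up stand-in used only to show the bound in action, not jj's actual type:

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

// Made-up container of trees, for illustration only.
#[derive(Default)]
struct TreeList(Vec<String>);

impl Extend<String> for TreeList {
    fn extend<I: IntoIterator<Item = String>>(&mut self, iter: I) {
        self.0.extend(iter);
    }
}

fn main() {
    let trees = stream::iter(vec!["tree-a".to_string(), "tree-b".to_string()]);
    // This compiles only because `TreeList: Default + Extend<String>`:
    // `collect()` creates the collection with `Default` and fills it via
    // `Extend` as items arrive from the stream.
    let collected: TreeList = block_on(trees.collect());
    assert_eq!(collected.0.len(), 2);
}
```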
I'm going to rewrite `TreeDiffIterator` to fetch one level (depth) of the tree at a time, with the trees at each level fetched concurrently. One step towards that is to convert the iterator to a `Stream`. I'd like to do that by making the current `Iterator` implementation call the new `Stream` implementation. However, we can't call `futures::executor::block_on()` on a future that itself calls `futures::executor::block_on()` (as `Store::read_tree()` does), so the first step is to bubble up the async-ness a bit. This patch does that by fetching both sides of the diff concurrently. That should give close to a 2x speedup on high-latency backends. (It doesn't help with our backend at Google, however, because we have a daemon process that does some speculative prefetching that usually downloads the child trees anyway.)
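A rough sketch of the "fetch both sides concurrently" step, with placeholder types and with `block_on` kept only at the outermost synchronous layer (the real `TreeDiffIterator` and `Store` code differs):

```rust
use futures::executor::block_on;
use futures::try_join;

// Placeholder types for illustration; not jj's actual definitions.
struct Tree;
#[derive(Debug)]
struct BackendError;

async fn read_tree(_id: &str) -> Result<Tree, BackendError> {
    // A cloud backend would await a server round-trip here.
    Ok(Tree)
}

// Read both sides of the diff concurrently: on a high-latency backend this
// costs roughly one round-trip instead of two.
async fn read_diff_sides(from_id: &str, to_id: &str) -> Result<(Tree, Tree), BackendError> {
    try_join!(read_tree(from_id), read_tree(to_id))
}

fn main() -> Result<(), BackendError> {
    // The async-ness is bubbled up so that `block_on` is called exactly once,
    // at the synchronous entry point, never from inside a future that is
    // itself already being driven by `block_on`.
    let (_from, _to) = block_on(read_diff_sides("from-id", "to-id"))?;
    Ok(())
}
```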
martinvonz force-pushed the push-kzszouwuuupm branch from ab0b908 to a7844f1 on October 9, 2023 06:23
This PR makes the `Backend` trait's read methods async. It's a small step towards using async throughout the code base; async code seems to spread very quickly. I was hoping to have updated more code to be async so I could do some performance measurements and see what the overall impact is, but I'm afraid that will take longer than I would like to wait, and it seems quite unavoidable to start using async anyway. Backends that download data on demand have been a goal from the start of the project. I think the performance impact I have been able to measure so far is promisingly small (3.3% in the worst case).
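As a rough illustration of the shape of the change (not jj's actual trait definition, which has more methods and different types), an `async_trait`-style sketch might look like this:

```rust
use async_trait::async_trait;

// Placeholder types; jj's real `CommitId`, `Commit`, etc. carry more data.
struct CommitId(Vec<u8>);
struct TreeId(Vec<u8>);
struct Commit;
struct Tree;
#[derive(Debug)]
struct BackendError(String);

// Sketch only: the read methods become async so callers can await many reads
// concurrently, while write methods stay synchronous because a cloud
// backend's local daemon already makes writes effectively asynchronous.
#[async_trait]
trait Backend: Send + Sync {
    async fn read_commit(&self, id: &CommitId) -> Result<Commit, BackendError>;
    async fn read_tree(&self, id: &TreeId) -> Result<Tree, BackendError>;
    fn write_commit(&self, commit: Commit) -> Result<CommitId, BackendError>;
}
```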