
Concurrent metadata fetching #22

Closed

Conversation

@aochagavia (Contributor) commented Jan 25, 2024

Closes #3

@aochagavia (Contributor, Author) commented Jan 25, 2024

Some remarks and questions:

  1. What do you think, on an API level, about the modifications to the DependencyProvider trait and the introduction of the OneShotSender struct? (See the sketch after this list.)
  2. I had to introduce RefCell everywhere to be able to borrow &self instead of &mut self inside the future, so that multiple futures can run in parallel. The result is pretty ugly and I'll be thinking of alternatives (e.g. more cleanly splitting off the piece of state that gets updated asynchronously). If you have any ideas, please let me know.
  3. The current implementation uses async, which IMO proved very useful. I picked smol as the runtime because it allows spawning futures that are non-'static (in contrast to tokio, which forces you to reference-count anything you want to share, making the RefCell issue even more painful). I'm not sure smol is mature enough, but if point 2 goes well we might get away with using tokio after all. Additionally, I think we'll want a single-threaded async runtime: there's no IO involved, so we are using async purely as syntactic sugar.
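For context, here is a rough sketch of the trait shape in question (placeholder types so the snippet stands alone; the real trait is generic over the version set and package name, which I've omitted):

// Rough shape only: placeholder types stand in for the real crate's
// generics (Pool<VS, N>, interned ids, and so on).
pub struct Pool;
pub struct NameId;
pub struct Candidates;
pub struct OneShotSender<T>(std::marker::PhantomData<T>);

pub trait DependencyProvider {
    /// Returns the pool used to intern names, versions, etc.
    fn pool(&self) -> &Pool;

    /// Instead of returning candidates directly, the provider receives a
    /// sender and fulfills it (possibly from another thread) once the
    /// candidates are available.
    fn get_candidates(&self, name: NameId, sender: OneShotSender<Option<Candidates>>);
}

The key point is that get_candidates no longer returns a value: the provider fulfills the sender whenever it has the data.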

@tdejager (Contributor)

I'll have a good look at this tomorrow!

@baszalmstra (Contributor)

I think this currently slightly misses the point.

The main goal that I would like to achieve is as follows: when solving incrementally, we know after each partial solve that there are one or more candidates that we are going to need more information from. Currently we fetch this information sequentially, but especially when network requests are involved it makes sense to request it concurrently.

Another approach could be to add a function to the dependency provider that provides the result for multiple solvables. Implementers can choose to make this concurrent or not.

Although this callback API is nice, the same could be achieved by passing in a closure that should be called. Internally the closure can do whatever.

I also wonder if an async runtime is needed; couldn't you achieve the same result with a normal blocking channel, since the solver loop is still blocking anyway?

@tdejager (Contributor)

Yeah, I agree: a blocking channel can also be used in an asynchronous context, because 'send' is non-blocking.

Because you are using futures and smol here anyway, I see no reason not to make the functions in the DependencyProvider asynchronous; you are bringing in a runtime anyway.

However, avoiding that would also be great, but then I think a channel would just work :)

@aochagavia (Contributor, Author)

I think this currently slightly misses the point.

The main goal that I would like to achieve is as follows: when solving incrementally, we know after each partial solve that there are one or more candidates that we are going to need more information from. Currently we fetch this information sequentially, but especially when network requests are involved it makes sense to request it concurrently.

For a bit of context, my proposed async / await approach is meant to keep the sequential structure of the code, while enabling concurrency. Await is used on the solver side whenever there is a chance of having to wait on the dependency provider. This, combined with spawn, gives us the ability to concurrently call the dependency provider's methods after each partial solve (I added a test to showcase this, where the dependency provider sleeps on a separate thread and notifies the solver when it's done).

At least from a functionality perspective, this seems to cover the use case you mention. Or am I missing something?
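As a standalone illustration of that test's shape (using the futures crate's one-shot channel and hypothetical helper names, not the actual test code):

use std::thread;
use std::time::Duration;

use futures::channel::oneshot;
use futures::executor::block_on;
use futures::future::join;

// Stand-in for a provider method: the work happens on a separate thread,
// and the solver side just awaits the one-shot channel.
fn fetch_on_thread(name: &'static str) -> oneshot::Receiver<String> {
    let (tx, rx) = oneshot::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulated network delay
        let _ = tx.send(format!("metadata for {name}"));
    });
    rx
}

fn main() {
    // Both "fetches" run concurrently; the total wait is ~50ms, not ~100ms.
    let (a, b) = block_on(join(fetch_on_thread("foo"), fetch_on_thread("bar")));
    println!("{:?} {:?}", a, b);
}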

I also wonder if an async runtime is needed; couldn't you achieve the same result with a normal blocking channel, since the solver loop is still blocking anyway?

A normal blocking channel was my first prototype, but my impression was that it would require significant changes to the solver code (maybe I'm wrong, though). While the solver loop is indeed blocking, in some cases it has to pause execution of Solver::add_clauses_for_solvable and resume it later (e.g. if the dependency provider does some processing in a different thread, using async or plain old blocking code). In the meantime, while it waits for that function to resume execution, it can concurrently call Solver::add_clauses_for_solvable for the next solvable in the list. As I mentioned above, using async / await allows us to keep the sequential structure of the code (i.e. add_clauses_for_solvable and friends), which seems like a huge benefit to me.

Regarding using an async runtime... It's kind of inevitable once you go down the async / await route. However, we could create our own barebones runtime with only the features we need (i.e. single-threaded and supporting async one-shot channels), instead of pulling a more heavyweight dependency. I had to create a custom runtime for my last project and found it surprisingly straightforward (once you understand that async / await is just syntactic sugar), so we should be able to pull it off if we want to go down that path.
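For what it's worth, the core of such a runtime really is small. A spawn-less block_on needs only the standard library (a sketch; a real version would also need spawning and one-shot channels):

use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// A waker that unparks the thread that is polling the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Poll the future on the current thread, parking between polls.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}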

Because you are using futures and smol here anyway, I see no reason not to make the functions in the DependencyProvider asynchronous; you are bringing in a runtime anyway.

I'm not totally sure about that. I'd say the solver shouldn't care about which runtime is used by the dependency provider. If we allow the dependency provider to return futures, that means they must be compatible with the solver's runtime. I consider the fact that the solver uses async / await under the hood to be an implementation detail and would rather avoid leaking that information into its API.

Additionally, sharing a single async runtime between the solver and the dependency provider would force the solver to use a full-blown async runtime such as tokio to work well, which sounds like overkill for one of our main use cases (Conda packages), where there is no IO / waiting during solving.

@tdejager (Contributor)

But if the solver defines the trait, it defines which futures it allows. So in that sense it can choose its runtime.

If you bring in a runtime anyway, I don't see a reason not to make the DependencyProvider async.

Granted, it would be best if the get_dependencies and get_candidates calls did not block the runtime they can potentially be running in, but that can only be alleviated by making the solve itself async, or by running the solver on a separate thread.

@baszalmstra (Contributor)

We had a long discussion offline; I hadn't properly read the code. I think this does achieve the goals.

We decided to keep the (current) callback approach instead of making the DependencyProvider async.

@aochagavia will clean up the code and bring this into a mergeable state!

@aochagavia force-pushed the concurrent-metadata-fetching branch 3 times, most recently from 125cfdc to fe79c92 on January 30, 2024 14:46
// This prevents concurrent requests to add clauses for a solvable from racing. Only the
// first request for that solvable will go through, and others will wait till it
// completes.
let mut clauses_added = mutex.lock().await;
@aochagavia (Contributor, Author) commented Jan 30, 2024

Since this await (and others below) is inside a loop, we could make concurrent metadata fetching even more aggressive by writing this loop in such a way that it processes its items concurrently... I didn't do it, just to keep things simple, but let me know if that's desirable after all!
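For illustration, the concurrent variant of such a loop could look roughly like this (a simplified stand-in using FuturesUnordered, not the actual solver loop):

use futures::executor::block_on;
use futures::stream::{FuturesUnordered, StreamExt};

// Simplified stand-in: process all items concurrently instead of
// awaiting each iteration in sequence.
async fn process_all(items: Vec<u32>) {
    let mut in_flight: FuturesUnordered<_> = items
        .into_iter()
        .map(|item| async move {
            // Stand-in for "add clauses for this solvable".
            item * 2
        })
        .collect();

    // Results arrive in completion order, not submission order.
    while let Some(result) = in_flight.next().await {
        println!("finished: {result}");
    }
}

fn main() {
    block_on(process_all(vec![1, 2, 3]));
}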

(Contributor)

I do think that would be quite nice! But maybe we can do that in another PR?

@aochagavia (Contributor, Author)

Seems good to me!

&mut output.clauses_to_watch,
dependency_name,
)
.await?;
@aochagavia (Contributor, Author)

Again, it might pay off to handle the iterations of this loop concurrently.

&mut output.clauses_to_watch,
dependency_name,
)
.await?;
@aochagavia (Contributor, Author)

Again, it might pay off to handle the iterations of this loop concurrently.

tests/solver.rs Outdated
.build()
.unwrap();

provider.runtime = Some(executor);
@aochagavia (Contributor, Author)

This code means that all tests using solve_snapshot are now also using the async provider behind the scenes. If that's undesirable, let me know!

@aochagavia (Contributor, Author)

I have pushed a new WIP commit. There are still a few open points, some in the form of comments, some written below:

  1. I feel tempted to ditch smol in favor of tokio, just to be sure we are relying on solid foundations. We ended up using locks and channels, so having a production-ready async runtime seems beneficial. The downside is that we'd need to use more Rc, but it shouldn't be that much of a pain. Opinions?
  2. We might be able to get rid of RefCell in the arena for clauses by postponing their allocation, but I'm afraid that might complicate the code more than it simplifies it, so my current stance is to keep the RefCell.

@baszalmstra (Contributor)

I feel tempted to ditch smol in favor of tokio, just to be sure we are relying on solid foundations. We ended up using locks and channels, so having a production-ready async runtime seems beneficial. The downside is that we'd need to use more Rc, but it shouldn't be that much of a pain. Opinions?

Mm, adding tokio would add a TON of dependencies. I always thought that smol is also production-ready?

We might be able to get rid of RefCell in the arena for clauses by postponing their allocation, but I'm afraid that might complicate the code more than it simplifies it, so my current stance is to keep the RefCell.

Let's do it in another PR then. :)

@aochagavia (Contributor, Author)

Mm, adding tokio would add a TON of dependencies. I always thought that smol is also production-ready?

Tokio's dependencies can be kept reasonably low by only enabling the features you need, so I don't think there'd be much of a difference between it and smol. Regarding smol being production-ready, I have the feeling that tokio has much stronger backing and is more battle-tested, but I'm not opposed to giving smol a try. A nice side benefit is that we'd get to keep the current code without Rcs.

I'll do a last polishing round directly using smol's underlying crates, to make sure we don't pull in more dependencies than strictly needed.

@baszalmstra (Contributor)

I would also be happy to use tokio if you feel that doesn't add too much weight.

@aochagavia force-pushed the concurrent-metadata-fetching branch from fe79c92 to ce12e32 on January 31, 2024 08:32
@tdejager (Contributor)

If you end up adding Tokio you can use the oneshot channel from there instead.

@aochagavia (Contributor, Author) commented Jan 31, 2024

I ended up using smol's async-executor after all, with all other concurrency primitives taken from the futures crate (including a one-shot channel). Tokio was more painful to introduce than I expected 😅
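Roughly, the resulting setup looks like this (a minimal sketch of the combination, not the crate's actual internals):

use async_executor::LocalExecutor;
use futures::channel::oneshot;
use futures::executor::block_on;

fn main() {
    // A single-threaded executor; spawned tasks don't need to be Send.
    let ex = LocalExecutor::new();
    let (tx, rx) = oneshot::channel();

    // Stand-in for the provider fulfilling a request...
    let provider = ex.spawn(async move {
        let _ = tx.send("candidates");
    });

    // ...and the solver awaiting the answer.
    let solver = ex.spawn(async move {
        println!("received: {:?}", rx.await);
    });

    // Drive both tasks to completion on the current thread.
    block_on(ex.run(async {
        provider.await;
        solver.await;
    }));
}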

@aochagavia marked this pull request as ready for review on January 31, 2024 09:13
@baszalmstra (Contributor)

I tried updating rip with the changes from this branch and ran into a couple of issues.

To integrate the changes I implemented a two-stage approach. We have an object PypiResolvoBridge that implements the DependencyProvider trait from this crate. This type holds an Arc<PypiDependencyProvider>. In the modified functions I do something like this:

fn get_candidates(&self, name: NameId, sender: OneShotSender<Option<Candidates>>) {
    // Clone the Arc so the spawned future can be 'static.
    let provider = self.inner.clone();
    tokio::spawn(async move {
        let candidates = provider.get_candidates(name).await;
        // Ignore the error case: the solver may have dropped the receiver.
        let _ = sender.send(candidates);
    });
}

Here the inner PypiDependencyProvider has an async function get_candidates.

I think this is what you intended with your design?

The provider is an Arc because, when I tokio::spawn something, the future has to be 'static. I also had to change a couple of fields in the inner dependency provider to make the object Sync; I simply used a couple of RwLocks.

However, I ran into trouble when I came to the Pool stored in the provider.

The Candidates returned contain several Ids which are allocated from the Pool. I could probably add another RwLock around the Pool, except that the DependencyProvider trait itself has this function:

fn pool(&self) -> &Pool<VS, N>;

This prevents me from turning the internal pool into an RwLock.

The only way around this that I can think of is to make the Pool Sync; however, that might come at a cost, which is a shame for the non-concurrent use case.

Another approach would be to make the DependencyProvider trait async itself, which would give me more granular control over when I use tokio::spawn, because I could opt to only "spawn" parts of my function (e.g. the heavy lifting) and leave all interactions with the pool on the same thread as the task itself.
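Roughly what I have in mind, as a standalone stand-in (hypothetical names; the real Pool and provider are more involved):

use std::cell::RefCell;

// Stand-in for the non-Sync Pool: ids are interned on the current thread.
struct Pool {
    names: RefCell<Vec<String>>,
}

impl Pool {
    fn intern(&self, name: String) -> usize {
        let mut names = self.names.borrow_mut();
        names.push(name);
        names.len() - 1
    }
}

// Hypothetical heavy lifting; it is Send + 'static, so it may be spawned.
async fn fetch_metadata(name: &'static str) -> Vec<String> {
    vec![format!("{name}-1.0"), format!("{name}-2.0")]
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let pool = Pool { names: RefCell::new(Vec::new()) };

    // Only the fetch is spawned; the results are interned on this thread,
    // so the Pool never has to be Sync.
    let raw = tokio::spawn(fetch_metadata("foo")).await.unwrap();
    let ids: Vec<usize> = raw.into_iter().map(|n| pool.intern(n)).collect();
    println!("interned ids: {ids:?}");
}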

WDYT @aochagavia ?

@aochagavia force-pushed the concurrent-metadata-fetching branch from ce12e32 to fe6a7fa on February 2, 2024 10:52
@aochagavia (Contributor, Author) commented Feb 2, 2024

We discussed this on Discord and came up with the following solution:

  • Put Pool behind Rc so it can be easily shared across async tasks.
  • Use single-threaded tokio in the solver instead of smol (see the sketch below).
  • Let users configure the async runtime before passing it to the solver.
  • Document how to use different async runtimes, if desired.
  • Make the async functions in DependencyProvider actually async, instead of relying on channels.

I just pushed the changes and will be checking if this update is more rip-friendly!
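For reference, the single-threaded runtime configuration amounts to something like this (a sketch of the setup, not the PR's exact code):

use tokio::runtime::Builder;

fn main() {
    // A current-thread runtime: everything runs on this thread, and
    // block_on does not require the future to be Send.
    let runtime = Builder::new_current_thread()
        .build()
        .expect("failed to build the tokio runtime");

    runtime.block_on(async {
        println!("the solve runs inside this single-threaded runtime");
    });
}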

@aochagavia force-pushed the concurrent-metadata-fetching branch from fe6a7fa to 6e26432 on February 2, 2024 11:04
@aochagavia force-pushed the concurrent-metadata-fetching branch from 6e26432 to 0938551 on February 2, 2024 11:11
@aochagavia force-pushed the concurrent-metadata-fetching branch from 96c8b60 to a6d2138 on February 2, 2024 16:06
@tdejager (Contributor) commented Feb 5, 2024

So we just found out that for rattler (you can check out this branch: conda/rattler#504) we make use of functions in the SolverCache that are async, but the sort_candidates function isn't, which presents a problem when trying to call them from it.

@aochagavia (Contributor, Author)

Is that something you could solve by blocking on the futures, because you know they'll always return Ready(...)? Otherwise it might make sense to make sort_candidates async.

@baszalmstra (Contributor)

If that's the case (always ready), we can use .now_or_never()
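That is, FutureExt::now_or_never from the futures crate polls the future once and returns None if it would suspend:

use futures::future::{self, FutureExt};

fn main() {
    // An immediately-ready future resolves without any executor.
    assert_eq!(async { 42 }.now_or_never(), Some(42));

    // A future that would actually suspend yields None instead of blocking.
    assert_eq!(future::pending::<i32>().now_or_never(), None);
}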

@tdejager (Contributor) commented Feb 5, 2024

Right, that would work until someone changes the API, but we could make that clear in the expect, I suppose :)

@baszalmstra (Contributor)

I'm also not entirely sure that the assumption holds. This function may be called while not all candidates for other packages have been loaded yet. But let's try to find out.

@baszalmstra (Contributor)

Looking at the code, it should be trivial to make sort_candidates async as well. I think this is the proper solution, because it allows you to freely call the methods on the SolverCache and await them without issues.

@tdejager (Contributor) commented Feb 6, 2024

Okay let me give it a go later!

@tdejager (Contributor) commented Feb 8, 2024

Closing this in favor of #29 (benchmarks show that one is slightly faster).

@tdejager closed this on Feb 8, 2024