
WIP pool changes #3582

Open · abonander wants to merge 15 commits into main

Conversation

abonander (Collaborator) commented Oct 29, 2024

  • Use a separate waiting queue for new connections.
  • Pool inheritance (used for testing) only steals connect permits, not acquire permits.
  • Spawn connection attempts as their own task so they may complete even if the acquire() call is cancelled.
  • Race opening a new connection with acquiring one from the idle queue.
  • acquire() should now be completely cancel-safe.
  • Separate timeout for connecting.
  • New PoolConnector trait superseding both before_connect (requested but not yet implemented) and after_connect callbacks (see the sketch after this list).
    • Implemented for closures returning Future, albeit with a 'static requirement for the returned Future (instead of BoxFuture).
    • May be updated to use async closures in a future release (hopefully backwards compatible but will require an MSRV bump): https://blog.rust-lang.org/inside-rust/2024/08/09/async-closures-call-for-testing.html
    • Can be used to support high availability, or implement custom backoff or connection throttling schemes (e.g. token bucket).
  • Use usize for all connection counts to get rid of weird inconsistencies.
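
Purely as illustration of the PoolConnector bullets above, here is a minimal sketch of what such a trait and a high-availability connector could look like. The trait shape, names, and error type are assumptions for the sketch, not the PR's actual definition:

```rust
use std::future::Future;
use std::sync::atomic::{AtomicUsize, Ordering};

// Placeholder types so the sketch stands alone; the real trait in SQLx is
// generic over the database driver and its connection and error types.
struct Connection;
#[derive(Debug)]
struct ConnectError;

// Assumed shape of a PoolConnector-style trait; names and signature are
// illustrative only.
trait PoolConnector: Send + Sync + 'static {
    fn connect(&self) -> impl Future<Output = Result<Connection, ConnectError>> + Send;
}

// High-availability example from the list above: rotate through several
// hosts so one dead host doesn't take the whole pool down.
struct RoundRobinConnector {
    hosts: Vec<String>,
    next: AtomicUsize,
}

impl PoolConnector for RoundRobinConnector {
    async fn connect(&self) -> Result<Connection, ConnectError> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.hosts.len();
        let _host = &self.hosts[i];
        // ... dial `_host` here; a custom backoff or token-bucket scheme
        // would also live in this method, before the dial ...
        Ok(Connection)
    }
}
```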

Breaking Changes

  • Pool::set_connect_options() and get_connect_options() have been removed. Instead, implement the new PoolConnector trait (or use a closure) over shared state such as Arc<RwLock<impl ConnectOptions>> (see the sketch after this list).
  • PoolOptions::after_connect() has been removed. Instead, implement PoolConnector (or use a closure), open the connection, and then apply whatever setup it needs before returning it.
  • PoolOptions::min_connections(), PoolOptions::max_connections(), and Pool::size() now use usize instead of u32.
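
As a rough sketch of the replacement pattern named in the first item: shared, swappable options behind Arc<RwLock<...>>, read by a connector closure on every attempt. The closure shape and type names here are assumptions, not this PR's final API:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// Stand-in for a driver's ConnectOptions type.
#[derive(Clone)]
struct ConnectOptions {
    url: String,
}

#[tokio::main]
async fn main() {
    let options = Arc::new(RwLock::new(ConnectOptions {
        url: "postgres://primary.example/db".into(),
    }));

    // The connector closure reads the *current* options on every attempt,
    // replacing the removed set_connect_options()/get_connect_options().
    let connect = {
        let options = Arc::clone(&options);
        move || {
            let options = Arc::clone(&options);
            async move {
                let opts = options.read().await.clone();
                // ... a real connector would dial using `opts` here ...
                opts.url
            }
        }
    };

    assert_eq!(connect().await, "postgres://primary.example/db");

    // Rotating credentials or failing over is just a write lock away;
    // no pool rebuild required.
    options.write().await.url = "postgres://standby.example/db".into();
    assert_eq!(connect().await, "postgres://standby.example/db");
}
```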

Fixes #3513
Fixes #3315
Fixes #3132
Fixes #3117
Fixes #2848

jplatte (Contributor) commented on the diff:

```rust
}

#[cfg(not(feature = "_rt-async-std"))]
missing_rt((duration, f))
```

When playing around with this PR locally (to see if it fixes an acquire-timeout issue, which it unfortunately doesn't), I found that this causes a compile error. I think it should be:

Suggested change:

```diff
-missing_rt((duration, f))
+missing_rt((deadline, f))
```
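
For context on why `duration` doesn't compile: the hunk sits in a runtime-dispatch helper whose first parameter is named `deadline`. A hedged reconstruction of that shape follows; the real function dispatches over SQLx's runtime features (the hunk shows _rt-async-std) and may differ in detail, while this sketch uses only tokio:

```rust
use std::future::Future;
use std::time::Instant;

// Stand-in for SQLx's helper that panics when no runtime is enabled; it
// takes the otherwise-unused bindings so the compiler sees them consumed.
fn missing_rt<T>(_unused: T) -> ! {
    panic!("no async runtime enabled")
}

// The parameter is named `deadline`, so the fallback arm has to pass
// `deadline`; `missing_rt((duration, f))` names a binding that doesn't
// exist in this scope, hence the compile error.
pub async fn timeout_at<F: Future>(deadline: Instant, f: F) -> Option<F::Output> {
    #[cfg(feature = "_rt-tokio")]
    return tokio::time::timeout_at(deadline.into(), f).await.ok();

    #[cfg(not(feature = "_rt-tokio"))]
    missing_rt((deadline, f))
}
```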

abonander (Collaborator Author):

@jplatte if you have a solid repro for acquire timeouts, I'd love to add it as a test here.

jplatte (Contributor):

I wish. It's in the proprietary version of the main work codebase, and somehow only happens w/ hyper 1.0 / axum 0.7. But if other debugging approaches don't work out, I can try the hyper upgrade on the much smaller OSS version of the codebase and reduce from there next week.

abonander (Collaborator Author):

One thing that Axum does is cancel the handler future if the client disconnects. I wonder if it's triggering a cancellation bug somewhere.

Do you have a before_acquire callback set?

abonander (Collaborator Author) commented Nov 8, 2024:

I did some digging a few weeks back and realized that connections could potentially get stuck in return_to_pool because there's no timeout: estuary/flow#1676 (comment)

That's a change I was meaning to add to this PR but hadn't gotten to yet. There's a timeout when it goes to close the connection, but no timeout for the task as a whole.
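
A minimal sketch of the missing piece as described, assuming a tokio runtime; the function and type names are illustrative, not SQLx internals:

```rust
use std::future::Future;
use std::time::Duration;

// Spawn the return-to-pool work as its own task (so it survives the caller
// being cancelled), but bound the *whole* task with a deadline rather than
// only the final close step, so a stuck rollback or ping can't pin the
// pool slot forever.
fn spawn_bounded_return<F>(return_to_pool: F, deadline: Duration)
where
    F: Future<Output = ()> + Send + 'static,
{
    tokio::spawn(async move {
        if tokio::time::timeout(deadline, return_to_pool).await.is_err() {
            // Deadline elapsed: a real pool would hard-close the connection
            // here and release its permit so waiting acquire() calls can
            // make progress instead of hanging.
        }
    });
}
```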

jplatte (Contributor):

I don't think it's a cancellation bug. It happens in a test that does a bunch of requests in parallel (50 originally; I can turn it down to 20 and still reliably reproduce the hang, but at 18 it succeeded).

abonander (Collaborator Author):

What's the max size of the test pool?

And what's the acquire timeout set at?

svix-jplatte commented Nov 11, 2024:

Hmmm, the max size of the pool is exactly 20, and once I use that amount of parallelism it breaks. Tried 19 too and that works. Acquire timeout is 20s, much longer than it takes the test to run to completion with up to 19 parallel requests.

I also tried raising the pool size to 50; exactly the same thing: once the number of parallel requests is at least as big as the pool size, it hangs (until timeout).

Further, I was using a tokio::sync::Barrier and separate reqwest::Clients in separate tokio tasks, so the requests happen as closely together as possible (this test was originally written to catch another race). If I don't make the tasks wait on the barrier before making the request, that seems to already mix things up enough for the test to succeed, even at a pool size of 20 and 50 parallel requests.
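
For reference, a sketch of that style of test; the endpoint URL and counts are placeholders:

```rust
use std::sync::Arc;
use tokio::sync::Barrier;

#[tokio::test]
async fn parallel_requests_hit_pool_simultaneously() {
    // Set this equal to the pool's max size to reproduce the hang described.
    const N: usize = 20;
    let barrier = Arc::new(Barrier::new(N));

    let tasks: Vec<_> = (0..N)
        .map(|_| {
            let barrier = Arc::clone(&barrier);
            tokio::spawn(async move {
                // A separate client per task avoids reqwest's own connection
                // pooling serializing the requests.
                let client = reqwest::Client::new();
                barrier.wait().await; // release all requests at once
                client
                    .get("http://localhost:8080/uses-db") // placeholder URL
                    .send()
                    .await
            })
        })
        .collect();

    for task in tasks {
        task.await.unwrap().expect("request failed or timed out");
    }
}
```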

svix-jplatte:

I found the bug; it has nothing to do with SQLx itself. The test was deadlocking the server in a really weird way (related to the DB pool).
