Remove delay-based backpressure in favor of explicit queue limits #1515

Merged: mkeeter merged 1 commit into main from mkeeter/oops-no-backpressure on Nov 12, 2024

Conversation

@mkeeter mkeeter commented Oct 23, 2024

Right now, we have a backpressure delay that's a function of (job count, bytes in flight).

The actual queue length is set implicitly by the shape of that function (e.g. its gain) and by the downstairs IO time. If a downstairs slows down due to load, the queue length has to grow to compensate!

(for more on this, come to my OxCon talk this afternoon!)

Having an implicitly-defined queue length is tricky, because it's not well controlled. Queue length also affects latency in subtle ways, so directly controlling the queue length would be helpful.
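
To make the implicit relationship concrete, here's a rough back-of-the-envelope illustration; the linear delay curve, the gain $k$, and the threshold $q_0$ are stand-ins for illustration, not the actual function in the code:

$$
\mathrm{delay}(q) = k\,(q - q_0) \quad (q > q_0), \qquad
k\,(q - q_0) \approx t_{\mathrm{io}} \;\Longrightarrow\; q \approx q_0 + \frac{t_{\mathrm{io}}}{k}
$$

In steady state the guest can only submit as fast as the downstairs completes work, so the injected delay settles near the per-job downstairs IO time $t_{\mathrm{io}}$: the equilibrium queue length is set by the gain and by the downstairs IO time, and it grows whenever the downstairs slows down.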

This PR removes the delay-based backpressure implementation in favor of a simple pair of semaphores: there are a certain number of job and byte permits available, and the Guest has to acquire them before sending a job to the Upstairs.

In other words, writes will be accepted as fast as possible until we run out of permits; we then shift to a one-in, one-out mode of operation, where a job has to be completed by the downstairs before a new one is submitted. The maximum queue length (in both jobs and bytes) is now explicit, set by global constants.

Architecturally, we replace the BackpressureGuard with a new IOLimitGuard, which is claimed by the Guest and stored in relevant BlockOp variants (then moved into the DownstairsIO). It still uses RAII to automatically release permits when dropped, and we still manually drop it early (once a job is complete on a particular downstairs).
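
As a rough sketch of the mechanism (simplified to a single shared pool rather than per-client pools, with illustrative field names and constants rather than the PR's literal code), the pair of semaphores and the RAII guard look something like this, using tokio::sync::Semaphore's owned permits:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

// Illustrative caps, not the PR's actual constants.
const MAX_JOBS: usize = 1000;
const MAX_BYTES: usize = 50 * 1024 * 1024;

/// Shared pools of job and byte permits.
#[derive(Clone)]
struct IOLimits {
    jobs: Arc<Semaphore>,
    bytes: Arc<Semaphore>,
}

/// RAII guard: permits go back to the pools when this is dropped.
struct IOLimitGuard {
    _job: OwnedSemaphorePermit,
    _bytes: OwnedSemaphorePermit,
}

impl IOLimits {
    fn new() -> Self {
        IOLimits {
            jobs: Arc::new(Semaphore::new(MAX_JOBS)),
            bytes: Arc::new(Semaphore::new(MAX_BYTES)),
        }
    }

    /// Wait (asynchronously) for one job permit and `nbytes` byte permits.
    async fn claim(&self, nbytes: u32) -> IOLimitGuard {
        let job = self.jobs.clone().acquire_owned().await.unwrap();
        let bytes = self.bytes.clone().acquire_many_owned(nbytes).await.unwrap();
        IOLimitGuard { _job: job, _bytes: bytes }
    }
}
```

Once either pool is empty, claim simply waits until some in-flight job drops its guard, which gives exactly the one-in, one-out behavior described above.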

The IOLimitGuard will also be used (eventually) for configured IOP and bandwidth limiting, which was removed in #1506; we can use a strategy similar to this example from the Tokio docs.
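
For reference, here's a token-bucket sketch along those lines (the cap, the refill interval, and the function names are assumptions for illustration, not anything in this PR): a background task tops the semaphore back up to a fixed cap on a timer, and each IO consumes and forgets one permit.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Illustrative cap: at most this many IOs per refill interval.
const IOP_LIMIT: usize = 1000;

/// Top the bucket back up once per second, never exceeding the cap.
fn spawn_refill_task(sem: Arc<Semaphore>) {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(1));
        loop {
            interval.tick().await;
            let missing = IOP_LIMIT.saturating_sub(sem.available_permits());
            sem.add_permits(missing);
        }
    });
}

/// Each IO waits for a token and permanently consumes it; only the refill
/// task ever puts tokens back.
async fn take_iop_token(sem: &Semaphore) {
    sem.acquire().await.unwrap().forget();
}
```

The semaphore would start out full (Semaphore::new(IOP_LIMIT)), and a second semaphore refilled by byte count would give bandwidth limiting in the same way.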

@mkeeter mkeeter force-pushed the mkeeter/oops-no-backpressure branch from e5b535f to 07fc3bf on October 28, 2024 12:37

@leftwo leftwo left a comment

Some other random thoughts

Maybe not now, but if setting tokens were part of the BlockIO trait, then we would be set up to pass them down and control them from the Volume layer?

If an IO blocks in read/write/flush because there are not enough tokens, the rest of the upstairs still makes progress, right? Like, we don't need the same task that is stuck in a read to pull something else off the queue elsewhere to free up resources, or anything like that? Or is that what the _try stuff is all about?

@@ -62,7 +62,6 @@ crucible_upstairs*:::up-status
 * Job ID delta and backpressure
 */
 json(copyinstr(arg1), "ok.next_job_id"),
-json(copyinstr(arg1), "ok.up_backpressure"),

Contributor

How hard would it be to get the remaining IOLimits into the DTrace info?

Contributor Author

It's pretty easy to add in Crucible, but we probably don't have room for all 6x values (bytes and jobs available for each client) in the DTrace UIs...

Contributor

Oh, actually displaying them will be an exercise left to the DTracer. I just wanted to know they exist.

I'll spend way, way too much time crafting another script to show them :)

(Resolved review threads: upstairs/src/downstairs.rs ×2, upstairs/src/guest.rs, upstairs/src/lib.rs)
@mkeeter mkeeter force-pushed the mkeeter/oops-no-backpressure branch 2 times, most recently from 9249897 to f9abea5 on October 29, 2024 20:20

mkeeter commented Oct 29, 2024

> Maybe not now, but if setting tokens were part of the BlockIO trait, then we would be set up to pass them down and control them from the Volume layer?

That's something that we'll have to figure out – previously, IO limits were attached to the GuestIOHandle, but in a world of multi-Guest volumes, that may need to change. We'll have to see whether we can attach limits to the Volume layer, or whether we have to add new APIs to BlockIO.

(Guest is also increasingly misnamed, but that's neither here nor there)

> If an IO blocks in read/write/flush because there are not enough tokens, the rest of the upstairs still makes progress, right? Like, we don't need the same task that is stuck in a read to pull something else off the queue elsewhere to free up resources, or anything like that? Or is that what the _try stuff is all about?

Good question! The intention is for the Guest task to claim IO permits (blocking until they're available), then pass an IOLimitGuard into the Upstairs task. It's fine for the Guest task to block on obtaining permits, as long as the Upstairs task continues running, because the Upstairs is what actually completes IO. I tweaked the API slightly in f9abea5 so that async fn claim(..) is only on the struct IOLimitView (available to the Guest), and not present on the struct IOLimits (available in the Upstairs). This should make it hard to deadlock oneself!
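
To sketch that shape (the Job type, the channel, and the function names here are illustrative stand-ins, and only the job-count pool is shown, not bytes or per-client splits):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, OwnedSemaphorePermit, Semaphore};

// Hypothetical stand-ins for the real job type and downstairs IO path.
struct Job;
async fn perform_io(_job: Job) { /* complete the job on the downstairs */ }

/// Guest-side view: the only type with an async claim.
struct IOLimitView {
    jobs: Arc<Semaphore>,
}

impl IOLimitView {
    /// Wait until a job permit is free; the permit is released on drop.
    async fn claim(&self) -> OwnedSemaphorePermit {
        self.jobs.clone().acquire_owned().await.unwrap()
    }
}

/// Guest task: may block here, which is fine, because it is not the task
/// that completes jobs.
async fn guest_submit(
    view: &IOLimitView,
    tx: &mpsc::Sender<(Job, OwnedSemaphorePermit)>,
    job: Job,
) {
    let permit = view.claim().await;
    // Ignore the error if the Upstairs task has already shut down.
    let _ = tx.send((job, permit)).await;
}

/// Upstairs task: never waits on permits; finishing a job drops its permit,
/// which is what eventually unblocks the Guest task.
async fn upstairs_run(mut rx: mpsc::Receiver<(Job, OwnedSemaphorePermit)>) {
    while let Some((job, permit)) = rx.recv().await {
        perform_io(job).await;
        drop(permit);
    }
}
```

The permit rides along with the job into the Upstairs task, which only ever drops it, so the Upstairs can never end up waiting on a resource that it alone is responsible for freeing.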

@mkeeter mkeeter force-pushed the mkeeter/oops-no-backpressure branch 2 times, most recently from d4549c6 to eb5634b on October 30, 2024 19:50
@leftwo leftwo self-requested a review October 30, 2024 22:06

@leftwo leftwo left a comment

If you don't want to add the DTrace collection points now, expect a PR from me soon that does have them :)

@mkeeter mkeeter force-pushed the mkeeter/oops-no-backpressure branch 6 times, most recently from a8db296 to b016244 on November 5, 2024 22:34

@faithanalog faithanalog left a comment

The only part of this I don't think I caught is where the disable_backpressure flag actually comes into effect. The rest of it I understand (and the way these async semaphores work is quite neat to learn about).

mkeeter commented Nov 7, 2024

@faithanalog

> The only part of this I don't think I caught is where the disable_backpressure flag actually comes into effect. The rest of it I understand (and the way these async semaphores work is quite neat to learn about).

Good question – the disable_backpressure flag in GuestIOHandle doesn't do anything itself anymore, but it's checked in up_main to disable per-client backpressure (which still exists!). It's basically an elaborate way to pass a flag into up_main 🙃

@jmpesp jmpesp left a comment

Ship it, pending some comments about handling chained semaphore acquires!

(Resolved review threads: upstairs/src/io_limits.rs ×3)
/// The IO permits are released when this handle is dropped
#[derive(Debug)]
pub struct ClientIOLimitGuard {
#[expect(unused)]

Contributor

TIL!

Contributor Author

This is relatively new (Rust 1.81)!
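
(For context: #[expect(..)] behaves like #[allow(..)], except the compiler warns if the expected lint never actually fires, so the suppression can't silently go stale. A tiny standalone example, not from this PR:)

```rust
// `#[allow]` silences a lint unconditionally; `#[expect]` silences it too,
// but emits a warning if the lint stops firing (e.g. once `placeholder` is
// actually used), prompting removal of the stale attribute.
#[expect(unused_variables)]
fn example() {
    let placeholder = 42;
}
```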

const IO_OUTSTANDING_MAX_BYTES: u64 = 1024 * 1024 * 1024; // 1 GiB
/// If we exceed this value, the upstairs will give up and mark the offline
/// downstairs as faulted.
const IO_OUTSTANDING_MAX_BYTES: u64 = 50 * 1024 * 1024; // 50 MiB

Contributor

Where did this number come from?

Contributor Author

Mostly vibes! This was previously the point at which backpressure delays started.

Contributor

To be clear, I don't think this is a bad number! Just wondering, especially since you were looking at the graphs showing the buffering cliff that real devices had.

Contributor Author

One data point: with 10x crutest instances doing 4K random writes on a single Gimlet (so 30x Downstairs), we end up buffering about 8 seconds of writes:
[screenshot: active jobs over time, this branch]

(there's 60 seconds of IO, then you can watch the number of active jobs drain)

This is a significant improvement from the ~50 seconds that we see on main:
[screenshot: active jobs over time, main]

It may still be too much, but it's easy to tune these values later.

@mkeeter mkeeter force-pushed the mkeeter/oops-no-backpressure branch from 455f730 to 542aaee on November 12, 2024 15:36
@mkeeter mkeeter merged commit a6624f1 into main Nov 12, 2024
16 checks passed
@mkeeter mkeeter deleted the mkeeter/oops-no-backpressure branch November 12, 2024 16:15