Remove delay-based backpressure in favor of explicit queue limits #1515
Conversation
(force-pushed from e5b535f to 07fc3bf)
Some other random thoughts
Maybe not now, but if setting tokens was part of the BlockIO trait, then we would be set up to pass them down and control them from the Volume layer?
If an IO blocks in read/write/flush because there are not enough tokens, the rest of the upstairs still makes progress, right? Like, we don't need the same task that is stuck in a read to pull something else off the queue elsewhere to free up resources or anything like that? Or, is that what the _try stuff is all about?
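(For readers unfamiliar with the pattern the question refers to, here is a minimal, hedged sketch of blocking vs. non-blocking permit acquisition with tokio::sync::Semaphore, using invented names; the actual Crucible types differ.)

```rust
use std::sync::Arc;
use tokio::sync::{Semaphore, TryAcquireError};

#[tokio::main]
async fn main() {
    let jobs = Arc::new(Semaphore::new(2)); // illustrative limit

    // Awaiting acquire_owned() only suspends *this* task; other tasks on
    // the runtime (e.g. the one retiring completed jobs and returning
    // permits) keep running, so a full queue does not deadlock the rest
    // of the upstairs.
    let _permit = jobs.clone().acquire_owned().await.unwrap();

    // The try_ variant never waits: it either takes a permit immediately
    // or reports that no permits are available.
    match jobs.clone().try_acquire_owned() {
        Ok(_permit) => println!("got a permit without waiting"),
        Err(TryAcquireError::NoPermits) => println!("queue is full"),
        Err(TryAcquireError::Closed) => println!("semaphore was closed"),
    }
}
```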
@@ -62,7 +62,6 @@ crucible_upstairs*:::up-status
* Job ID delta and backpressure
*/
json(copyinstr(arg1), "ok.next_job_id"),
json(copyinstr(arg1), "ok.up_backpressure"),
How hard would it be to get the remaining IOLimits into the DTrace info?
It's pretty easy to add in Crucible, but we probably don't have room for all 6x values (bytes and jobs available for each client) in the DTrace UIs...
Oh, actually displaying them will be an exercise left to the DTracer. I just wanted to know they exist.
I'll spend way, way too much time crafting another script to show them :)
(force-pushed from 9249897 to f9abea5)
That's something that we'll have to figure out – previously, IO limits were attached to the (
Good question! The intention is for the
(force-pushed from d4549c6 to eb5634b)
If you don't want to add the DTrace collection points now, expect a PR from me soon that does have them :)
(force-pushed from a8db296 to b016244)
The only part of this I don't think I caught is where the disable_backpressure flag actually comes into effect. The rest of it I understand (and the way these async semaphores work is quite neat to learn about).
Good question – the
(force-pushed from b016244 to c904b9f)
(force-pushed from c904b9f to 455f730)
Ship it, pending some comments about handling chained semaphore acquires!
/// The IO permits are released when this handle is dropped
#[derive(Debug)]
pub struct ClientIOLimitGuard {
    #[expect(unused)]
TIL!
This is relatively new (Rust 1.81)!
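For context, a minimal sketch of how #[expect] behaves (illustrative code, not the PR's; it uses the dead_code lint rather than the unused group): like #[allow], it suppresses the lint, but the compiler warns if the expected lint never actually fires, so stale suppressions get flagged.

```rust
// Requires Rust 1.81 or later.
struct Guard {
    // The field is written at construction but never read, so the
    // dead_code lint fires; #[expect] records that this is intentional.
    #[expect(dead_code, reason = "field is held, not read")]
    token: u32,
}

fn main() {
    let _guard = Guard { token: 0 };
}
```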
const IO_OUTSTANDING_MAX_BYTES: u64 = 1024 * 1024 * 1024; // 1 GiB
/// If we exceed this value, the upstairs will give up and mark the offline
/// downstairs as faulted.
const IO_OUTSTANDING_MAX_BYTES: u64 = 50 * 1024 * 1024; // 50 MiB
Where did this number come from?
Mostly vibes! This was previously the point at which backpressure delays started.
To be clear, I don't think this is a bad number! Just wondering, especially since you were looking at the graphs showing the buffering cliff that real devices had.
One data point: with 10x crutest instances doing 4K random writes on a single Gimlet (so 30x Downstairs), we end up buffering about 8 seconds of writes (there's 60 seconds of IO, then you can watch the number of active jobs drain). This is a significant improvement from the ~50 seconds that we see on main. It may still be too much, but it's easy to tune these values later.
(force-pushed from 455f730 to 542aaee)
Right now, we have a backpressure delay that's a function of (job count, bytes in flight). The actual queue length is set implicitly by the shape of that function (e.g. its gain) and the actual downstairs IO time. If a downstairs slows down due to load, then the queue length has to grow to compensate! (For more on this, come to my OxCon talk this afternoon!)

Having an implicitly-defined queue length is tricky, because it's not well controlled. Queue length also affects latency in subtle ways, so directly controlling the queue length would be helpful.
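As a rough illustration (invented gains and units, not Crucible's actual curve), the old scheme looked something like the sketch below: each submission is delayed by an amount that grows with queue depth, and at steady state the queue grows until that delay matches the downstairs completion time.

```rust
/// Hedged sketch of a delay-based backpressure curve: the returned delay
/// grows with queue depth, so the steady-state queue length is whatever
/// depth makes this delay equal the downstairs IO time. In other words,
/// the queue length is implicit in the gains and the downstairs speed.
fn backpressure_delay_us(jobs_in_flight: u64, bytes_in_flight: u64) -> u64 {
    const JOB_GAIN_US: u64 = 10; // illustrative: 10 µs per queued job
    const BYTE_GAIN_US: u64 = 1; // illustrative: 1 µs per queued KiB
    (jobs_in_flight * JOB_GAIN_US).max((bytes_in_flight / 1024) * BYTE_GAIN_US)
}

fn main() {
    // If a downstairs takes ~1 ms per job, the queue settles near the
    // depth where the delay reaches ~1000 µs: here, about 100 jobs.
    assert_eq!(backpressure_delay_us(100, 0), 1000);
}
```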
This PR removes the delay-based backpressure implementation in favor of a simple pair of semaphores: there are a certain number of job and byte permits available, and the Guest has to acquire them before sending a job to the Upstairs. In other words, writes will be accepted as fast as possible until we run out of permits; we will then shift to a one-in, one-out operation mode where jobs have to be completed by the downstairs before a new job is submitted. The maximum queue length (either in jobs or bytes) is well known and set by global constants.
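A minimal sketch of that permit-claiming flow, assuming tokio::sync::Semaphore and invented names/constants (the real types and limits live in the Crucible source):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

const MAX_JOBS: usize = 1000;              // illustrative job limit
const MAX_BYTES: usize = 50 * 1024 * 1024; // illustrative byte limit

struct IoLimits {
    jobs: Arc<Semaphore>,
    bytes: Arc<Semaphore>,
}

impl IoLimits {
    fn new() -> Self {
        Self {
            jobs: Arc::new(Semaphore::new(MAX_JOBS)),
            bytes: Arc::new(Semaphore::new(MAX_BYTES)),
        }
    }

    /// Claim one job permit plus one permit per byte. When the queue is
    /// full, this await suspends until completed jobs return permits:
    /// the one-in, one-out mode described above.
    async fn claim(&self, job_bytes: u32) -> (OwnedSemaphorePermit, OwnedSemaphorePermit) {
        let job = self.jobs.clone().acquire_owned().await.unwrap();
        let bytes = self.bytes.clone().acquire_many_owned(job_bytes).await.unwrap();
        (job, bytes)
    }
}

#[tokio::main]
async fn main() {
    let limits = IoLimits::new();
    let _permits = limits.claim(4096).await; // e.g. a 4 KiB write
    // ... submit the job; dropping the permits hands capacity back.
}
```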
Architecturally, we replace the BackpressureGuard with a new IOLimitGuard, which is claimed by the Guest and stored in relevant BlockOp variants (then moved into the DownstairsIO). It still uses RAII to automatically release permits when dropped, and we still manually drop it early (once a job is complete on a particular downstairs).

The IOLimitGuard will also be used (eventually) for configured IOP and bandwidth limiting, which was removed in #1506; we can use a strategy similar to this example from the Tokio docs.
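For that IOP/bandwidth limiting, the referenced Tokio-docs strategy amounts to replenishing permits on a timer instead of returning them on job completion. A hedged sketch of that pattern (illustrative numbers and names only, not the actual implementation):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let iops = Arc::new(Semaphore::new(0));

    // Refill the budget to 100 permits every 100 ms (~1000 IOPS), so
    // unused budget does not accumulate into unbounded bursts.
    let refill = iops.clone();
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_millis(100));
        loop {
            tick.tick().await;
            let missing = 100usize.saturating_sub(refill.available_permits());
            refill.add_permits(missing);
        }
    });

    // Each IO consumes a permit before being submitted. Unlike the queue
    // permits above, these are forgotten rather than returned, because
    // they represent a rate budget instead of a queue slot.
    let permit = iops.acquire().await.unwrap();
    permit.forget();
}
```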