Remove delay-based backpressure in favor of explicit queue limits #1515
Conversation
(force-pushed from e5b535f to 07fc3bf)
Some other random thoughts
Maybe not now, but if setting tokens was part of the BlockIO trait, then we would be set up to pass them down and control them from the Volume layer?
If an IO blocks in read/write/flush because there are not enough tokens, the rest of the upstairs still makes progress, right? Like, we don't need the same task that is stuck in a read to pull something else off the queue elsewhere to free up resources or anything like that? Or, is that what the _try stuff is all about?
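(For readers unfamiliar with the pattern the question refers to, here is a minimal, hedged sketch of blocking vs. non-blocking permit acquisition with tokio::sync::Semaphore, using invented names; the actual Crucible types differ.)

```rust
use std::sync::Arc;
use tokio::sync::{Semaphore, TryAcquireError};

#[tokio::main]
async fn main() {
    let jobs = Arc::new(Semaphore::new(2)); // illustrative limit

    // Awaiting acquire_owned() only suspends *this* task; other tasks on
    // the runtime (e.g. the one retiring completed jobs and returning
    // permits) keep running, so a full queue does not deadlock the rest
    // of the upstairs.
    let _permit = jobs.clone().acquire_owned().await.unwrap();

    // The try_ variant never waits: it either takes a permit immediately
    // or reports that no permits are available.
    match jobs.clone().try_acquire_owned() {
        Ok(_permit) => println!("got a permit without waiting"),
        Err(TryAcquireError::NoPermits) => println!("queue is full"),
        Err(TryAcquireError::Closed) => println!("semaphore was closed"),
    }
}
```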
@@ -62,7 +62,6 @@ crucible_upstairs*:::up-status
* Job ID delta and backpressure
*/
json(copyinstr(arg1), "ok.next_job_id"),
json(copyinstr(arg1), "ok.up_backpressure"),
How hard would it be to get the remaining IOLimits into the DTrace info?
It's pretty easy to add in Crucible, but we probably don't have room for all 6x values (bytes and jobs available for each client) in the DTrace UIs...
Oh, actually displaying them will be an exercise left to the DTracer. I just wanted to know they exist.
I'll spend way, way too much time crafting another script to show them :)
(force-pushed from 9249897 to f9abea5)
That's something that we'll have to figure out – previously, IO limits were attached to the (
Good question! The intention is for the
(force-pushed from d4549c6 to eb5634b)
If you don't want to add the DTrace collection points now, expect a PR from me soon that does have them :)
(force-pushed from a8db296 to b016244)
The only part of this I don't think I caught is where the disable_backpressure flag actually comes into effect. The rest of it I understand (and the way these async semaphores work is quite neat to learn about).
Good question – the
(force-pushed from b016244 to c904b9f)
(force-pushed from c904b9f to 455f730)
Ship it, pending some comments about handling chained semaphore acquires!
/// The IO permits are released when this handle is dropped
#[derive(Debug)]
pub struct ClientIOLimitGuard {
    #[expect(unused)]
TIL!
This is relatively new (Rust 1.81)!
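For context, a minimal sketch of how #[expect] behaves (illustrative code, not the PR's; it uses the dead_code lint rather than the unused group): like #[allow], it suppresses the lint, but the compiler warns if the expected lint never actually fires, so stale suppressions get flagged.

```rust
// Requires Rust 1.81 or later.
struct Guard {
    // The field is written at construction but never read, so the
    // dead_code lint fires; #[expect] records that this is intentional.
    #[expect(dead_code, reason = "field is held, not read")]
    token: u32,
}

fn main() {
    let _guard = Guard { token: 0 };
}
```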
const IO_OUTSTANDING_MAX_BYTES: u64 = 1024 * 1024 * 1024; // 1 GiB
/// If we exceed this value, the upstairs will give up and mark the offline
/// downstairs as faulted.
const IO_OUTSTANDING_MAX_BYTES: u64 = 50 * 1024 * 1024; // 50 MiB
Where did this number come from?
Mostly vibes! This was previously the point at which backpressure delays started.
To be clear, I don't think this is a bad number! Just wondering, especially since you were looking at the graphs showing the buffering cliff that real devices had.
One data point: with 10x crutest instances doing 4K random writes on a single Gimlet (so 30x Downstairs), we end up buffering about 8 seconds of writes (there's 60 seconds of IO, then you can watch the number of active jobs drain). This is a significant improvement from the ~50 seconds that we see on main. It may still be too much, but it's easy to tune these values later.
(force-pushed from 455f730 to 542aaee)
Right now, we have a backpressure delay that's a function of (job count, bytes in flight). The actual queue length is set implicitly by the shape of that function (e.g. its gain) and the actual downstairs IO time. If a downstairs slows down due to load, then the queue length has to grow to compensate! (For more on this, come to my OxCon talk this afternoon!)

Having an implicitly-defined queue length is tricky, because it's not well controlled. Queue length also affects latency in subtle ways, so directly controlling the queue length would be helpful.
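As a rough illustration (invented gains and units, not Crucible's actual curve), the old scheme looked something like the sketch below: each submission is delayed by an amount that grows with queue depth, and at steady state the queue grows until that delay matches the downstairs completion time.

```rust
/// Hedged sketch of a delay-based backpressure curve: the returned delay
/// grows with queue depth, so the steady-state queue length is whatever
/// depth makes this delay equal the downstairs IO time. In other words,
/// the queue length is implicit in the gains and the downstairs speed.
fn backpressure_delay_us(jobs_in_flight: u64, bytes_in_flight: u64) -> u64 {
    const JOB_GAIN_US: u64 = 10; // illustrative: 10 µs per queued job
    const BYTE_GAIN_US: u64 = 1; // illustrative: 1 µs per queued KiB
    (jobs_in_flight * JOB_GAIN_US).max((bytes_in_flight / 1024) * BYTE_GAIN_US)
}

fn main() {
    // If a downstairs takes ~1 ms per job, the queue settles near the
    // depth where the delay reaches ~1000 µs: here, about 100 jobs.
    assert_eq!(backpressure_delay_us(100, 0), 1000);
}
```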
This PR removes the delay-based backpressure implementation in favor of a simple pair of semaphores: there are a certain number of job and byte permits available, and the Guest has to acquire them before sending a job to the Upstairs. In other words, writes will be accepted as fast as possible until we run out of permits; we will then shift to a one-in, one-out operation mode where jobs have to be completed by the downstairs before a new job is submitted. The maximum queue length (either in jobs or bytes) is well known and set by global constants.
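A minimal sketch of that permit-claiming flow, assuming tokio::sync::Semaphore and invented names/constants (the real types and limits live in the Crucible source):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

const MAX_JOBS: usize = 1000;              // illustrative job limit
const MAX_BYTES: usize = 50 * 1024 * 1024; // illustrative byte limit

struct IoLimits {
    jobs: Arc<Semaphore>,
    bytes: Arc<Semaphore>,
}

impl IoLimits {
    fn new() -> Self {
        Self {
            jobs: Arc::new(Semaphore::new(MAX_JOBS)),
            bytes: Arc::new(Semaphore::new(MAX_BYTES)),
        }
    }

    /// Claim one job permit plus one permit per byte. When the queue is
    /// full, this await suspends until completed jobs return permits:
    /// the one-in, one-out mode described above.
    async fn claim(&self, job_bytes: u32) -> (OwnedSemaphorePermit, OwnedSemaphorePermit) {
        let job = self.jobs.clone().acquire_owned().await.unwrap();
        let bytes = self.bytes.clone().acquire_many_owned(job_bytes).await.unwrap();
        (job, bytes)
    }
}

#[tokio::main]
async fn main() {
    let limits = IoLimits::new();
    let _permits = limits.claim(4096).await; // e.g. a 4 KiB write
    // ... submit the job; dropping the permits hands capacity back.
}
```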
Architecturally, we replace the BackpressureGuard with a new IOLimitGuard, which is claimed by the Guest and stored in relevant BlockOp variants (then moved into the DownstairsIO). It still uses RAII to automatically release permits when dropped, and we still manually drop it early (once a job is complete on a particular downstairs).

The IOLimitGuard will also be used (eventually) for configured IOP and bandwidth limiting, which was removed in #1506; we can use a strategy similar to this example from the Tokio docs.
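For that IOP/bandwidth limiting, the referenced Tokio-docs strategy amounts to replenishing permits on a timer instead of returning them on job completion. A hedged sketch of that pattern (illustrative numbers and names only, not the actual implementation):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let iops = Arc::new(Semaphore::new(0));

    // Refill the budget to 100 permits every 100 ms (~1000 IOPS), so
    // unused budget does not accumulate into unbounded bursts.
    let refill = iops.clone();
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_millis(100));
        loop {
            tick.tick().await;
            let missing = 100usize.saturating_sub(refill.available_permits());
            refill.add_permits(missing);
        }
    });

    // Each IO consumes a permit before being submitted. Unlike the queue
    // permits above, these are forgotten rather than returned, because
    // they represent a rate budget instead of a queue slot.
    let permit = iops.acquire().await.unwrap();
    permit.forget();
}
```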