-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_instance_create_timeout_while_creating_zone
is flaky
#6197
Comments
I've been looking at this again, because I'm touching these tests for something unrelated. I think it's not possible to do what we want in this test. The intent behind the test is this:
The problem is that this bridges async and sync code. The If we call the mock closure at all before the test completes, it will block until whatever mechanism we use completes. That is a synchronous closure, and without an await point there is no way to yield back to the Tokio runtime to drive the rest of the tasks to completion. So what we want is a way to delay completing that closure, but that's not possible without indefinitely delaying the runtime itself. |
This seems to be pretty common in practice. Would it be worth adding retries to it at the nextest level (if the fix may take a while to develop)? |
Unless I'm missing something, I don't believe it's possible to write this test the way we want. I don't see how one can block the mock zone-boot method, without also blocking the single-threaded Tokio runtime. I.e., the test just needs to take the full time we are pretending zone boot takes (60s), or we fundamentally restructure the test or remove it. I personally feel the latter is the most expedient option, but I'd love to learn how we can keep the test. |
Hmm, could you do something like injecting a oneshot channel to wait on? (Also might be worth using a multithreaded tokio runtime for this test -- we do it for a few others) |
Hmm, a multithreaded runtime does mean we lose the ability to use tokio's pause/resume time events. I guess the only real way to do this would be to pass the source of time down as a dependency. If that's too much work, it's probably just worth removing the test. |
I think you mean something like: passing in a start time or deadline into the If that closure does not wait, well then there's really no way for us to make booting appear slow! And since that's the whole point of the test, I feel like it lowers the value of the test. @sunshowers Am I right in my assessment of the closure? Since it's sync, and we're controlling time in a single-threaded runtime, there's no way for it "appear slow" without blocking the test entirely. |
That's right in the context of a single-threaded runtime -- but using a multithreaded runtime should (I think) be able to work around that. Let's say you somehow arranged for Then, in And in the test itself, you could spin up a task which sends a message via the channel. With a multithreaded runtime, neither should block on the other. In essence, you're moving away from global time manipulation (which I agree is pretty fragile) to something more controlled than that. But there's naturally a divergence between the test and the real code, and I don't know if in this world the test is still useful. |
I'm not sure this will work. @lifning seeing as this was your test originally, how strongly do you feel it be preserved? I don't see a straightforward way to pretend zone boot takes a long time, since we safely block the single-threaded Tokio runtime and still advance time from outside the closure. We could use a multi-threaded runtime, and have the test take the full duration of whatever slow boot process we wanted to emulate, but would not be able to advance time in that scenario. |
Ah not if you call spawn_blocking or block_in_place though! Which you can in a multi threaded runtime context. |
@sunshowers Yeah, |
- Fixes #6197 - Use a multi-threaded runtime for the test so we can literally sleep in the boot closure without blocking the test itself.
(for context, i had posted some context from digging into an issue with this test last week at #6169 (comment) - is my conclusion there about channels panicking when runtimes tear down actually founded?)
the intent of adding it was for us to make sure we had something exercising our system's behavior in the timeout case, back when propolis zone creation was observed sometimes taking over a minute (and causing instance create requests to time out.) |
I see this "channel closed" error quite frequently, and agree with you that it seems due to the unordered shutdown of Tokio's tasks.
Sounds reasonable. I think given that we found a way that (I think) works around the sync-async boundary issues in the test, it's good to keep the test around at this point. We can simulate slow-boot in a reasonable amount of time, which assuages my main concern that the test would just require waiting for many 10s of seconds. |
The test
test_instance_create_timeout_while_creating_zone
attempts to verify what happens when zone-boot times out while creating an instance. Prior to b74d691, that was inadvertently relying on a DNS resolution request that would never succeed to "time out" the zone boot. In that commit, we removed that DNS resolution, revealing the problem in the test.However, the test remains flaky. It's relying on Tokio's time facilities to pause / advance time far enough that the we time out the request to create an instance, but not far enough that the zone actually boots. That kind of control of time is always finicky, since it's often the case that time needs to be advanced in small jumps to allow the runtime to schedule and drive other tasks in the meantime. In this case, we can apparently just never call the
MockZones::boot()
method, presumably because the test succeeds and that task is torn down before it's called.We should probably make the test work better by advancing time in small jumps, well beyond the expected timeout for the zone boot call itself.
Originally posted by @hawkw in #5749 (comment)
The text was updated successfully, but these errors were encountered: