Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix flaky test_demo_saga #6388

Merged
merged 3 commits into from
Aug 19, 2024
Merged

fix flaky test_demo_saga #6388

merged 3 commits into from
Aug 19, 2024

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Aug 19, 2024

Fixes #6383.

In that issue, it looks like the test successfully created the demo saga, but then when it went to complete it, it wasn't found in the CompletingDemoSagas struct. I didn't appreciate that there's a race here: a newly-created demo saga does not register itself with CompletingDemoSagas until its (first and only) saga action starts running, but that's asynchronous with respect to the internal API HTTP request that creates the demo saga. As a result, clients can get a 404 when they try to complete the demo saga, but if they retry after that first saga action starts running, then it will succeed.

The first fix I had here (9f6b11a) just has the test retry a 404 from this request, using wait_for_condition so that this won't hang nor sleep much longer than it needs to. This works fine to fix the test and if we have to fall back to this then that's okay. The only consumers of this API are probably humans (who won't be fast enough to hit this race) or other automated tests. But I wanted to try to do better because this API kind of sucks: you create a thing, then you want to operate on it, but you have to wait for some other thing to happen first. It's this weird eventual consistency pattern that we've avoid everywhere else. And we'll have to replicate the retry behavior in other consumers (mainly, future automated tests).

So instead I've reworked the CompletingDemoSagas struct so that you can do the completion and subscription steps in either order. The reason I hadn't done this before was that the obvious solution (create the entry for a demo saga id when either of these events happen) creates a different, worse problem: if the consumer provides the wrong id (which is easier than you'd think with this API, for a few reasons), then the request would silently succeed, leading to implicit failures later (because you hadn't actually completed the saga you thought you had). What I did instead here was add a preregistration step that happens when we create the demo saga. If we get a request to complete a saga that hasn't started waiting and isn't preregistered, then we produce the 404. Notably, this doesn't cover the case that the demo saga was recovered (rather than created in this Nexus's lifetime). There's no good way to do that, but I think it doesn't really matter since saga recovery is always asynchronous anyway.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to me!

Comment on lines 69 to 70
&std::time::Duration::from_millis(20),
&std::time::Duration::from_secs(10),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, take it or leave it: perhaps we should just import std::time::Duration?

@davepacheco
Copy link
Collaborator Author

Okay I re-did the testing I'd done for #6215 / #6281 and it all still works.

@davepacheco davepacheco marked this pull request as ready for review August 19, 2024 19:45
@davepacheco
Copy link
Collaborator Author

In case anybody is watching this issue, I completely rewrote the PR since the review so I want another review round.

@hawkw hawkw self-requested a review August 19, 2024 23:18
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation makes sense to me overall! I left a couple small comments.

Comment on lines +62 to +66
let sem =
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0)));
let sem_clone = sem.clone();
async move {
sem_clone
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, take it or leave it: why not just write:

Suggested change
let sem =
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0)));
let sem_clone = sem.clone();
async move {
sem_clone
let sem =
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0))).clone();
async move {
sem

pub struct CompletingDemoSagas {
ids: BTreeMap<DemoSagaUuid, oneshot::Sender<()>>,
sagas: BTreeMap<DemoSagaUuid, Arc<Semaphore>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we, perhaps, add some documentation explaining the motivation for using a Semaphore with one 0-1 permit, rather than a Notify?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think there is no good reason! I thought if you called notify.notify_one(), but nobody was waiting, then that wakeup would be lost. I just checked the docs and that's not the case. Do you think I should change it to use Notify?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of fixing the flaky test quickly, I'm going to land anyway, but we can always change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems fine to me. I agree that Notify::notify_one would also have gotten you the exact same behavior, but it also doesn't actually matter. Notify might be very marginally more efficient but this is far from a hot path...

@davepacheco davepacheco merged commit a338e8a into main Aug 19, 2024
22 checks passed
@davepacheco davepacheco deleted the dap/fix-demo-saga branch August 19, 2024 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test failed in CI: omicron-nexus::test_all integration_tests::demo_saga::test_demo_saga
2 participants