-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix flaky test_demo_saga #6388
fix flaky test_demo_saga #6388
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me!
&std::time::Duration::from_millis(20), | ||
&std::time::Duration::from_secs(10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicky, take it or leave it: perhaps we should just import std::time::Duration
?
In case anybody is watching this issue, I completely rewrote the PR since the review so I want another review round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation makes sense to me overall! I left a couple small comments.
let sem = | ||
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0))); | ||
let sem_clone = sem.clone(); | ||
async move { | ||
sem_clone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, take it or leave it: why not just write:
let sem = | |
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0))); | |
let sem_clone = sem.clone(); | |
async move { | |
sem_clone | |
let sem = | |
self.sagas.entry(id).or_insert_with(|| Arc::new(Semaphore::new(0))).clone(); | |
async move { | |
sem |
pub struct CompletingDemoSagas { | ||
ids: BTreeMap<DemoSagaUuid, oneshot::Sender<()>>, | ||
sagas: BTreeMap<DemoSagaUuid, Arc<Semaphore>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we, perhaps, add some documentation explaining the motivation for using a Semaphore
with one 0-1 permit, rather than a Notify
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think there is no good reason! I thought if you called notify.notify_one()
, but nobody was waiting, then that wakeup would be lost. I just checked the docs and that's not the case. Do you think I should change it to use Notify
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of fixing the flaky test quickly, I'm going to land anyway, but we can always change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems fine to me. I agree that Notify::notify_one
would also have gotten you the exact same behavior, but it also doesn't actually matter. Notify
might be very marginally more efficient but this is far from a hot path...
Fixes #6383.
In that issue, it looks like the test successfully created the demo saga, but then when it went to complete it, it wasn't found in the
CompletingDemoSagas
struct. I didn't appreciate that there's a race here: a newly-created demo saga does not register itself withCompletingDemoSagas
until its (first and only) saga action starts running, but that's asynchronous with respect to the internal API HTTP request that creates the demo saga. As a result, clients can get a 404 when they try to complete the demo saga, but if they retry after that first saga action starts running, then it will succeed.The first fix I had here (9f6b11a) just has the test retry a 404 from this request, using
wait_for_condition
so that this won't hang nor sleep much longer than it needs to. This works fine to fix the test and if we have to fall back to this then that's okay. The only consumers of this API are probably humans (who won't be fast enough to hit this race) or other automated tests. But I wanted to try to do better because this API kind of sucks: you create a thing, then you want to operate on it, but you have to wait for some other thing to happen first. It's this weird eventual consistency pattern that we've avoid everywhere else. And we'll have to replicate the retry behavior in other consumers (mainly, future automated tests).So instead I've reworked the
CompletingDemoSagas
struct so that you can do the completion and subscription steps in either order. The reason I hadn't done this before was that the obvious solution (create the entry for a demo saga id when either of these events happen) creates a different, worse problem: if the consumer provides the wrong id (which is easier than you'd think with this API, for a few reasons), then the request would silently succeed, leading to implicit failures later (because you hadn't actually completed the saga you thought you had). What I did instead here was add a preregistration step that happens when we create the demo saga. If we get a request to complete a saga that hasn't started waiting and isn't preregistered, then we produce the 404. Notably, this doesn't cover the case that the demo saga was recovered (rather than created in this Nexus's lifetime). There's no good way to do that, but I think it doesn't really matter since saga recovery is always asynchronous anyway.