-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not retry indefinitely if service is gone #5789
Conversation
If there's a call to an external service, saga execution cannot move forward until the result of that call is known, in the sense that Nexus received a result. If there are transient problems, Nexus must retry until a known result is returned. This is problematic when the destination service is gone - Nexus will retry indefinitely, halting the saga execution. Worse, in the case of sagas calling the volume delete subsaga, subsequent calls will also halt. With the introduction of a physical disk policy, Nexus can know when to stop retrying a call - the destination service is gone, so the known result is an error. This commit adds a `ProgenitorOperationRetry` object that takes an operation to retry plus a "gone" check, and checks each retry iteration if the destination is gone. If it is, then bail out, otherwise assume that any errors seen are transient. Further work is required to deprecate the `retry_until_known_result` function, as retrying indefinitely is a bad pattern. Fixes oxidecomputer#4331 Fixes oxidecomputer#5022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some smallish style suggestions to hopefully make the nested matches a little more comprehensible --- feel free to ignore me if you don't like them :)
Ok(dest_is_gone) => { | ||
if dest_is_gone { | ||
return Err(BackoffError::Permanent( | ||
ProgenitorOperationRetryError::Gone | ||
)); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit, take it or leave it: this could be a bit more concise as:
Ok(dest_is_gone) => { | |
if dest_is_gone { | |
return Err(BackoffError::Permanent( | |
ProgenitorOperationRetryError::Gone | |
)); | |
} | |
} | |
Ok(true) => return Err(BackoffError::Permanent( | |
ProgenitorOperationRetryError::Gone | |
)), | |
Ok(false) => {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I prefer the more explicit style, but thanks :)
Quite the opposite, thanks for them! I find myself writing nested match code like this lately, so these suggestions help :) |
Thanks Eliza! Co-authored-by: Eliza Weisman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
James, Overall I like the solution. It maps to what we discussed wrt expunged. However, I left a few suggestions for architectural fixes that probably are worth at least a look.
}); | ||
|
||
// It won't finish until the dataset is expunged. | ||
tokio::time::sleep(Duration::from_secs(3)).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you need some heuristic to ensure something is not done, but I really find sleeps like this in tests to be problematic. Not only do they sometimes cause flakiness, but they also make tests take longer. 3 seconds here, 3 seconds there and pretty soon your talking real time.
My recommendation instead of a sleep is usually to put one side of a channel in the task and have the test task communicate with it to determine if a given state has been reached. That seems somewhat difficult in this case, as you are essentially checking to see if the call is hung. I'm not sure how to fix this, but I still don't like it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this test to just wait on the task in 0a45871, that works too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it changes the semantics of the test though. You now have a much more likely chance that that the spawned task hasn't even started to run yet, making the assert!(!jh.is_finished())
somewhat meaningless. I still think on balance gettting rid of the sleep is the right, call, so I'm fine with this. But. maybe make a note about why there is no sleep and what the goal of the test is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, added a oneshot to wait until the task starts in 6917f42
nexus/src/app/crucible.rs
Outdated
backoff::retry_notify( | ||
backoff::retry_policy_internal_service_aggressive(), | ||
|| async { | ||
let region = match self.maybe_get_crucible_region( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make a note that self.maybe_get_crucible_region
will check for permanent errors and so we don't need to add a check to this loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't quite grok this comment - this section matches against Err(e)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be outdated. In either case, I wasn't even clear enough for myself to remember what I was talking about. Feel free to ignore it.
nexus/src/app/crucible.rs
Outdated
|
||
// Return Ok if the dataset's agent is gone, no | ||
// delete call is required. | ||
Err(Error::Gone) => return Ok(()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally noticed here, that we don't log when something is gone. Should we add that logging to the ProgenitorOperationRetry
code or leave it to the users. In the latter case, we should add a log here, as well as all other places Gone
is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I don't think ProgenitorOperationRetry
is a good place for this because it doesn't have any visibility into what is gone, just that the gone_check function returned true, so I've added it one layer up: see 0baa9c8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Thanks.
add missing format!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the hard work and cleanup @jmpesp!
If there's a call to an external service, saga execution cannot move forward until the result of that call is known, in the sense that Nexus received a result. If there are transient problems, Nexus must retry until a known result is returned.
This is problematic when the destination service is gone - Nexus will retry indefinitely, halting the saga execution. Worse, in the case of sagas calling the volume delete subsaga, subsequent calls that also call volume delete will also halt.
With the introduction of a physical disk policy, Nexus can know when to stop retrying a call - the destination service is gone, so the known result is an error.
This commit adds a
ProgenitorOperationRetry
object that takes an operation to retry plus a "gone" check, and checks each retry iteration if the destination is gone. If it is, then bail out, otherwise assume that any errors seen are transient.Further work is required to deprecate the
retry_until_known_result
function, as retrying indefinitely is a bad pattern.Fixes #4331
Fixes #5022