Handle region snapshot replacement volume deletes #7046
Conversation
Volumes can be deleted at any time, but the tasks and sagas that perform region snapshot replacement did not account for this. This commit adds checks in a few places for whether a volume is soft-deleted or hard-deleted, and bails out of any affected region snapshot replacement accordingly:

- if a volume that has the region snapshot being replaced is soft-deleted, then skip making a region snapshot replacement step for it
- if a region snapshot replacement step has its volume deleted after the step was created, transition it directly to the VolumeDeleted state
- if a region snapshot replacement step has its volume deleted during the saga invocation, then skip notifying any Upstairs and allow the saga to transition the request to Complete, where the associated clean up can then proceed

An interesting race condition emerged during unit testing: the read-only region allocated to replace a region snapshot would be swapped into the snapshot volume, but would be susceptible to being deleted by the user, and therefore unable to be swapped into other volumes that have that snapshot volume as a read-only parent.

This requires an additional volume that uses that read-only region in order to bump the reference count associated with that region, so that the user cannot delete it before it has been used to replace all other uses of the region snapshot it was meant to replace.

This additional volume lives as long as the region snapshot replacement, and therefore needs to be deleted when the region snapshot replacement is finished. This required a new region snapshot replacement finish saga, which in turn required a new "Completing" state to perform the same kind of state-based lock on the replacement request used by all the other sagas.

Testing also revealed scenarios where `find_deleted_volume_regions` would return volumes for hard-deletion prematurely. The function now returns a struct instead of a list of tuples, and in that struct, regions freed for deletion are distinct from volumes freed for deletion. Volumes are now only returned for hard-deletion when all associated read/write regions have been (or are going to be) deleted.

Fixes #6353
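As a rough illustration of the shape change described above for `find_deleted_volume_regions` -- the struct and field names below are assumptions for illustration, not necessarily what the PR uses -- the key point is that regions and volumes freed for deletion are now tracked separately:

```rust
use uuid::Uuid;

/// Sketch only: the name and fields are assumed, not copied from the PR.
pub struct FreedCrucibleResources {
    /// Read/write regions that are now safe to delete.
    pub regions_to_delete: Vec<Uuid>,

    /// Volumes that may be hard-deleted, returned only once all of their
    /// associated read/write regions have been (or are going to be) deleted.
    pub volumes_to_hard_delete: Vec<Uuid>,
}
```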
Do not allow a volume repair record to be created if the volume does not exist, or was hard-deleted!
serialize VolumeReplaceResult for a debugging breadcrumb
nothing should directly insert into volume_repair dsl anymore
I'm not sure I'm following this--what is the thing the user can delete that causes the region reference count(?) to reach zero? My understanding is/was that regions aren't user-addressable in the API, but the disks/volumes and snapshots that depend on them are. Is the thing being deleted the snapshot that's now having a region replaced?
Yep! The user's able to delete the snapshot (and therefore the snapshot volume) at any time. Here's the race:

- the snapshot volume references region snapshots RS0, RS1, and RS2, and RS2 is the one being replaced
- the replacement read-only region is allocated and swapped into the snapshot volume in place of RS2
- the user deletes the snapshot, soft-deleting the snapshot volume and removing the only reference to the new read-only region, which is then garbage collected
- other volumes that copied the snapshot volume as a read-only parent still reference RS2, and the region that was meant to replace it is gone
That seems... correct? Insofar as there really are no volumes referring to the region in this case. It sounds like the deal here is that the new region isn't referenced by anything yet, but it's still needed to replace the region snapshot in other volumes, so it shouldn't be garbage collected out from under the replacement. Is that the right mental model?
You're not wrong: it would be correct if the only use of RS2 was the snapshot volume, but every disk (or image) that was created using that snapshot as a block source will contain (in the read-only parent) a copy of the snapshot volume, and therefore a reference to RS2 (from the previous example). The new read-only region is meant to replace every copy of RS2 in any volume that references it, so it has to stick around until all those references are replaced.
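To make that sharing concrete, here is a deliberately simplified toy model -- not the real Crucible `VolumeConstructionRequest` type -- showing how one region snapshot ends up referenced by several volumes:

```rust
// Toy illustration only: the real volume construction types live in Crucible
// and Omicron and look nothing like this.
#[derive(Clone, Debug)]
struct ToyVolume {
    // e.g. region or region-snapshot target addresses
    targets: Vec<String>,
    read_only_parent: Option<Box<ToyVolume>>,
}

fn main() {
    // The snapshot volume references region snapshots RS0, RS1, and RS2.
    let snapshot_volume = ToyVolume {
        targets: vec!["RS0".into(), "RS1".into(), "RS2".into()],
        read_only_parent: None,
    };

    // Two disks created from that snapshot each embed a *copy* of the
    // snapshot volume as their read-only parent, so RS2 now has three
    // references that all need the replacement region swapped in.
    let disk_1 = ToyVolume {
        targets: vec!["R_a".into()],
        read_only_parent: Some(Box::new(snapshot_volume.clone())),
    };
    let disk_2 = ToyVolume {
        targets: vec!["R_b".into()],
        read_only_parent: Some(Box::new(snapshot_volume.clone())),
    };

    println!("{disk_1:?} {disk_2:?}");
}
```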
Overall I think the direction here makes sense. I have a few general clarifying questions/comments, but don't see any substantial synchronization issues.
```
@@ -133,6 +140,12 @@ pub struct RegionSnapshotReplacement {
    pub replacement_state: RegionSnapshotReplacementState,

    pub operating_saga_id: Option<Uuid>,
```
Not related to this PR, but a doc comment explaining the use of this field would be a welcome addition for slow-on-the-uptake readers like me :)
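Purely as a sketch of what such a doc comment could say -- the wording below is inferred from this conversation, not taken from the codebase:

```rust
use uuid::Uuid;

pub struct RegionSnapshotReplacement {
    // ... other fields elided ...

    /// When a saga operates on this request it records its saga ID here
    /// (alongside a state transition) as a lock, preventing other sagas and
    /// background tasks from acting on the same request concurrently.
    /// `None` means no saga currently holds the request.
    pub operating_saga_id: Option<Uuid>,
}
```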
```rust
    // Check if the volume was deleted _after_ the replacement step was
    // created.
```
Is this just an optimization? I don't see anything that'd prevent a volume from being deleted between the check on line 416 and the saga execution on line 478. I don't think that'll break anything, though, because `volume_replace_snapshot` will bail if the volume is deleted.

Assuming this is right, I'd leave a comment noting that this is just here to avoid the (not-inconsiderable!) expense of having to run a saga that we know isn't going to do anything.
You're correct, this is just an optimization - the region_snapshot_replacement_step saga's `volume_replace_snapshot` will detect if the volume is deleted and do the right thing. Added comments in 9fc46ab
```rust
///     Running        |
///                    |   set in region snapshot replacement
///        |           |   finish background task
///        |
///        |           |
///        v           |
///                    |   responsibility of region snapshot
///     Completing     |   replacement finish saga
```
nit: there's another version of this diagram (in the saga comments, I think) that has a back-edge from completing to running, though I think this is only if it unwinds, and I'm not sure if you're including those edges here
not a nit, that's important! added in 2cd7881
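A rough sketch of that back-edge (hedged: this is not the exact diagram added in 2cd7881):

```rust
///     Running  ----------->  Completing  ----------->  Complete
///        ^                       |
///        |                       |
///        +-----------------------+
///          (taken only if the region snapshot replacement finish saga
///           unwinds; a successful run ends in Complete)
```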
```rust
    // Create a volume to inflate the reference count of the newly created
    // read-only region. If this is not done it's possible that a user could
    // delete the snapshot volume _after_ the new read-only region was swapped
    // in, removing the last reference to it and causing garbage collection.
```
Just for my understanding: What prevents the region from being garbage-collected when it's just been ensured but hasn't been linked into anything yet? My guess is that completely orphaned regions don't get GC'ed and that possibly-unused regions are found by looking for them in soft-deleted volumes. Is that correct?
Kinda, yeah:

- for read-only regions, they're only returned from `soft_delete_volume` for deletion when they appear in the volume being soft-deleted.
- it's similar for read/write regions, but an additional way they're returned for GC is from `find_deleted_volume_regions`. That function joins the regions table with the volume table and then only operates on soft-deleted volumes. The relevant section is:
```rust
// only operate on soft deleted volumes
let soft_deleted = match &volume {
    Some(volume) => volume.time_deleted.is_some(),
    None => false,
};

if !soft_deleted {
    continue;
}
```
If the region's volume isn't inserted yet, then `volume` will be `None` here.
```
@@ -423,6 +424,21 @@ async fn rsrss_notify_upstairs(
    let params = sagactx.saga_params::<Params>()?;
    let log = sagactx.user_data().log();

    // If the associated volume was deleted, then skip this notification step.
    // It's likely there is no Upstairs to talk to, but continue with the saga
```
"Likely"? :)
I wonder if it'd be worthwhile to have a `#[cfg(debug_assertions)]` block here in which we assert that if the volume is deleted, then either it has no disk or the disk is detached.
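A minimal sketch of that suggestion, with hypothetical helper names (these are not real Omicron APIs):

```rust
// Sketch only: `lookup_disk_for_volume` and `disk_is_detached` are
// hypothetical helpers, named for illustration.
#[cfg(debug_assertions)]
{
    if volume_was_deleted {
        // A deleted volume should either have no disk built on it, or that
        // disk should already be detached.
        if let Some(disk) = lookup_disk_for_volume(&datastore, volume_id).await? {
            assert!(disk_is_detached(&disk));
        }
    }
}
```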
a78cabf reworks the comment.
I tried your suggestion by adding a `validate_higher_level_resource_deleted` function to the `validate_volume_invariants` check, but it returned an Err during the `disk_create` saga unwind. I'm not sure how to solve this, or even if it can be solved...?
A high-level comment: my understanding of the underlying issue is that it involves sagas, some initiated by user requests and some by background tasks, racing with each other and creating havoc. Is that correct? If it is, do the new tests in this module fail deterministically without the new fixes in place, or do we need races between the sagas they start to break the right way?
> A high-level comment: my understanding of the underlying issue is that it involves sagas, some initiated by user requests and some by background tasks, racing with each other and creating havoc. Is that correct?
Yes it is!
> do the new tests in this module fail deterministically without the new fixes in place
`test_racing_replacements_for_soft_deleted_disk_volume` does fail without the fixes in place - I wrote it before writing the fixes! :) I'm not sure which fix(es) are required for that test to pass, or even if it's exclusively fixed by this PR or needs the previous related ones, but it was the intention of that test to first reproduce it, then fix it.
> do we need races between the sagas
More extensive testing would be possible with a testing harness like `app::sagas::test_helpers::action_failure_can_unwind_idempotently`, where instead of injecting an error it would run an arbitrary function, which in this case could be a whole snapshot delete. No such functionality exists, so manual testing is required.
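To illustrate what such a harness might look like -- this is entirely hypothetical, since as noted no such helper exists -- the shape could resemble `action_failure_can_unwind_idempotently`, but take an arbitrary async action to interleave instead of an injected error:

```rust
/// Hypothetical sketch only. The idea: drive the saga node by node, and
/// between nodes run an arbitrary async action (such as a whole snapshot
/// delete), asserting afterwards that the saga either completed or unwound
/// cleanly and left the database consistent.
pub async fn saga_survives_interleaved_action<A, Fut>(_interleaved_action: A)
where
    A: Fn() -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // Body intentionally omitted; only the shape of the API is sketched.
    unimplemented!()
}
```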
This is a ton of great work.
```rust
    log,
    "{s}";
    "request id" => ?request.id,
    "volume id" => ?volume.id(),
);

status.errors.push(s);
```
Our list of regions to look at is built with `get_running_region_snapshot_replacements`.

That is only good while the region snapshot replacement is actually running, correct?

If so, then I believe any "missed" volume due to repair lock or other errors must be re-searched for and handled before the running region snapshot replacement job finishes.

Is that all correct? And, could we miss the window here where we don't catch it while the replacement is running and then it becomes an orphan?
I was going to ask a related question about this, but I think I managed instead to convince myself this is OK. I think the way this is meant to work is that:

- the `region_snapshot` table row for a given snapshot is garbage collected once no more volumes refer to it
- if there is a snapshot whose dataset is on an expunged sled, the region-snapshot-replace-start background task will try to create a new replacement request for it if none already exists
In the case I think you have in mind, the following would happen:
- the call to `find_volumes_referencing_socket_addr` on line 263 will find all the volumes that refer to the snapshot of interest; let's assume there's just one of these
- assume the call to `create_region_snapshot_replacement_step` above (line 335) fails because the volume is already locked
- the region-snapshot-replace-finish task will call `in_progress_region_snapshot_replacement_steps` and find there are no replacement steps being resolved; this allows the finish task to try to mark the replacement as having Completed
- however, there's still a volume referring to the snapshot, which keeps it from being deleted; this means a subsequent run of the replace-start task will start another replacement cycle for it
This could go on forever if the region snapshot replacement never manages to acquire the volume lock. I'm not sure there's an easy way to prevent that (we'd have to make the lock fair and I haven't thought of a straightforward way to do that). But absent that kind of persistent unfairness, some replacement attempt or another should eventually get the volume lock and do this work.
@jmpesp does this check out? The other question I had was whether it was possible for an ill-timed "finish" task to prematurely decide that a Running replacement was finished (because it happened to run and evaluate the replacement before the corresponding "create steps" background task got to create any steps for it). Is this possible or is it foreclosed upon by the replacement state machine?
> the region-snapshot-replace-finish task will call `in_progress_region_snapshot_replacement_steps` and find there are no replacement steps being resolved; this allows the finish task to try to mark the replacement as having Completed
The only way a region snapshot request transitions out of the Running state (via the finish saga) is if:

1. there are no in-progress region snapshot replacement steps, and
2. the request's associated region snapshot record was deleted.

There shouldn't be a way to miss volumes: if there are any volumes referencing the region snapshot, then the record will not have been deleted. In the case that (I think) Alan's referring to, we don't create a region snapshot replacement step for a volume due to an existing lock, but the request will not be finished because of #2 above.
As well, if the request is in the Running state, then the snapshot volume's already had a successful replacement performed, meaning no new volumes that reference the region snapshot can be created, meaning there can't be a race between something creating a new reference (by copying the snapshot volume as a read-only parent) and deleting the last reference.
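A condensed sketch of those two finish conditions -- the datastore method and request field names here are assumed, not the exact Omicron API:

```rust
// Sketch only: names are assumed for illustration.
let open_steps = datastore
    .in_progress_region_snapshot_replacement_steps(&opctx, request.id)
    .await?;

let region_snapshot_deleted = datastore
    .region_snapshot_get(
        request.old_dataset_id,
        request.old_region_id,
        request.old_snapshot_id,
    )
    .await?
    .is_none();

if open_steps == 0 && region_snapshot_deleted {
    // Only now is it safe to run the region snapshot replacement finish
    // saga, which moves the request Running -> Completing -> Complete.
}
```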
I misread the comment on line 107 of region_snapshot_replacement_finish.rs (it refers to what happens if the conditional branch on line 131 of that file is taken and not to what has happened by passing the condition on line 106). This makes more sense now!
I have a few more general remarks/questions from the newest commits. I still don't see anything that's blocking, but (as you know) there's a ton going on here and I'm not 100% sure I haven't overlooked or misunderstood something.
I don't want to hold up the PR just for this, but I'd be delighted to have something in the `docs` directory that explains the overall theory of how these kinds of replacements are meant to work. (A lot of this information is in the various sagas' and tasks' module comments, but having a single narrative doc to refer to would be great.)
```rust
Err(err.bail(Error::conflict(format!(
    "region snapshot replacement {} set to {:?} \
    (operating saga id {:?})",
    region_snapshot_replacement_id,
    record.replacement_state,
    record.operating_saga_id,
))))
```
This is unexpected, correct? (It means either that the operating saga ID was wrong, or the caller called this on a replacement that wasn't Completing.)
It's unexpected, yeah. Even in the case where the saga node is rerun the state should be set to Complete already and the saga shouldn't unwind.
```rust
// An associated volume repair record isn't _strictly_
// needed: snapshot volumes should never be directly
```
I follow the bit about snapshots not being directly accessed by an upstairs, but I thought the repair record was still needed for mutual exclusion?
It's not strictly speaking necessary - many replacements could occur on the snapshot volume at the same time, and because it's never constructed there wouldn't be any repair operation required.
They wouldn't contend on the snapshot volume's database record?
They would only contend around the volume repair record. If there was no lock for the snapshot volume, then the individual replacement transactions could all fire in whatever order they're going to serialize in, and it would probably work.
task, insert into the `requests_completed_ok` vec! also, add a test for this, which exposed another bug: the continue that would have skipped starting the start saga was in the wrong place.
I've looked over the commits individually (and asked some basic questions in DMs) and I think I grok what's going on here pretty thoroughly now. Thanks @jmpesp for all the back-and-forth on this one!
Just a few final bikeshed comments, great to see this getting done here.