
Handle region snapshot replacement volume deletes #7046

Conversation

@jmpesp (Contributor) commented Nov 12, 2024

Volumes can be deleted at any time, but the tasks and sagas that perform region snapshot replacement did not account for this. This commit adds checks in a few places for whether a volume is soft-deleted or hard-deleted, and bails out of any affected region snapshot replacement accordingly (a rough sketch follows the list):

  • if a volume that has the region snapshot being replaced is soft-deleted, then skip making a region snapshot replacement step for it

  • if a region snapshot replacement step has the volume deleted after the step was created, transition it directly to the VolumeDeleted state

  • if a region snapshot replacement step has the volume deleted during the saga invocation, then skip notifying any Upstairs and allow the saga to transition the request to Complete, where the associated clean up can proceed
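
A rough sketch of the shape of these checks, assuming hypothetical helper names (`volume_get`, `set_step_volume_deleted`, `start_region_snapshot_replacement_step_saga`) rather than the exact Nexus datastore API:

```rust
// Hypothetical sketch only: helper names and signatures are assumptions,
// not the actual omicron datastore API. The intent is the shape of the
// "bail out if the volume is gone" checks described above.
match datastore.volume_get(step.volume_id).await? {
    // hard-deleted: the volume record is gone, nothing to replace into
    None => {
        datastore.set_step_volume_deleted(step.id).await?;
    }

    // soft-deleted: the volume is being torn down, so skip this step
    Some(volume) if volume.time_deleted.is_some() => {
        datastore.set_step_volume_deleted(step.id).await?;
    }

    // still live: run the region snapshot replacement step saga
    Some(_volume) => {
        start_region_snapshot_replacement_step_saga(step).await?;
    }
}
```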

An interesting race condition emerged during unit testing: the read-only region allocated to replace a region snapshot would be swapped into the snapshot volume, but would be susceptible to being deleted by the user, and therefore unable to be swapped into other volumes that have that snapshot volume as a read-only parent.

This requires an additional volume that uses that read-only region in order to bump the reference count associated with that region, so that the user cannot delete it before it has been used to replace all other uses of the region snapshot it was meant to replace.

This additional volume lives as long as the region snapshot replacement, and therefore needs to be deleted when the region snapshot replacement is finished. This required a new region snapshot replacement finish saga, which in turn required a new "Completing" state to perform the same type of state-based lock on the replacement request that is done for all the other sagas.

Testing also revealed that there were scenarios where find_deleted_volume_regions would return volumes for hard-deletion prematurely. The function now returns a struct instead of a list of tuples, and in that struct, regions freed for deletion are now distinct from volumes freed for deletion. Volumes are now only returned for hard-deletion when all associated read/write regions have been (or are going to be) deleted.
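
A minimal sketch of the reshaped return value, assuming illustrative type and field names (the PR's actual names may differ):

```rust
use uuid::Uuid;

/// Sketch only: what `find_deleted_volume_regions` conceptually returns
/// after this change, instead of a flat list of tuples.
pub struct FreedCrucibleResources {
    /// Read/write regions whose backing storage can be cleaned up now.
    pub regions_to_delete: Vec<Uuid>,

    /// Volume records that can be hard-deleted, returned only once every
    /// associated read/write region has been (or is about to be) deleted.
    pub volumes_to_hard_delete: Vec<Uuid>,
}
```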

Fixes #6353

@jmpesp removed the request for review from smklein November 18, 2024 19:46
@jmpesp requested a review from gjcolombo November 22, 2024 16:38
@gjcolombo (Contributor) commented:

> An interesting race condition emerged during unit testing: the read-only region allocated to replace a region snapshot would be swapped into the snapshot volume, but would be susceptible to being deleted by the user, and therefore unable to be swapped into other volumes that have that snapshot volume as a read-only parent.
>
> This requires an additional volume that uses that read-only region in order to bump the reference count associated with that region, so that the user cannot delete it before it has been used to replace all other uses of the region snapshot it was meant to replace.

I'm not sure I'm following this--what is the thing the user can delete that causes the region reference count(?) to reach zero? My understanding is/was that regions aren't user-addressable in the API, but the disks/volumes and snapshots that depend on them are. Is the thing being deleted the snapshot that's now having a region replaced?

@jmpesp (Contributor, Author) commented Nov 22, 2024

> Is the thing being deleted the snapshot that's now having a region replaced?

Yep! The user's able to delete the snapshot (and therefore the snapshot volume) at any time. Here's the race:

  1. region snapshot replacement starts, targeting one of the three region snapshots in a snapshot volume:

     targets: [
       RS1,
       RS2, <- replace this one
       RS3
     ]

  2. a read-only region is cloned from either RS1 or RS3

  3. that read-only region is swapped into the snapshot volume:

     targets: [
       RS1,
       read-only region,
       RS3
     ]

  4. if this was the only volume that read-only region was a part of, then its reference count would be 1. if the user deletes the snapshot now, the reference count would move to zero.

  5. the fix is to create an additional volume to reference that read-only region before swapping it into the snapshot volume (see the sketch after this list):

     "new region" volume:
     targets: [
       read-only region
     ]

     snapshot volume:
     targets: [
       RS1,
       RS2, <- replace this one
       RS3
     ]

  6. now the reference count for the read-only region is 1 before the swap takes place, and cannot move to zero unless the "new region" volume is deleted
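
A sketch of the ordering the fix enforces. `volume_replace_snapshot` is referenced elsewhere in this review; every other helper name below is an illustrative assumption, not the exact Nexus API:

```rust
// Sketch only: the point is the ordering — hold a reference to the new
// read-only region *before* swapping it into the snapshot volume.

// 1. clone the new read-only region from a healthy source (RS1 or RS3)
let new_region = clone_read_only_region(healthy_region_snapshot).await?;

// 2. create the "new region" volume whose only target is that region,
//    bumping its reference count to 1
let holder_volume_id = create_reference_holder_volume(&new_region).await?;

// 3. now swap the new region into the snapshot volume; even if the user
//    deletes the snapshot immediately afterwards, the holder volume keeps
//    the region referenced until every other use of RS2 has been replaced
volume_replace_snapshot(snapshot_volume_id, rs2_target, &new_region).await?;

// 4. the holder volume (the bookkeeping reference) is deleted later by the
//    region snapshot replacement finish saga
delete_volume(holder_volume_id).await?;
```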

@gjcolombo (Contributor) commented:

> if this was the only volume that read-only region was a part of, then its reference count would be 1. if the user deletes the snapshot now, the reference count would move to zero.

That seems... correct? Insofar as there really are no volumes referring to the region in this case. It sounds like the deal here is that

  • the region-replacement procedure needs to hold a reference to the region it's swapping in
  • region references are held by volume records; there is no other way to prevent a region from being GC'ed; ergo
  • the replacement procedure needs to create a fake volume to keep its new region alive until the procedure ends (even if the replacement-target volume is deleted midstream)

Is that the right mental model?

@jmpesp (Contributor, Author) commented Nov 22, 2024

> > if this was the only volume that read-only region was a part of, then its reference count would be 1. if the user deletes the snapshot now, the reference count would move to zero.
>
> That seems... correct? Insofar as there really are no volumes referring to the region in this case. It sounds like the deal here is that
>
> * the region-replacement procedure needs to hold a reference to the region it's swapping in
>
> * region references are held by volume records; there is no other way to prevent a region from being GC'ed; ergo
>
> * the replacement procedure needs to create a fake volume to keep its new region alive until the procedure ends (even if the replacement-target volume is deleted midstream)
>
> Is that the right mental model?

You're not wrong: it would be correct if the only use of RS2 was the snapshot volume, but every disk (or image) that was created using that snapshot as a block source will contain (in the read-only parent) a copy of the snapshot volume, and therefore a reference to RS2 (from the previous example). The new read-only region is meant to replace every copy of RS2 in any volume that references it, so it has to stick around until all those references are replaced.

@gjcolombo (Contributor) left a comment:

Overall I think the direction here makes sense. I have a few general clarifying questions/comments, but don't see any substantial synchronization issues.

@@ -133,6 +140,12 @@ pub struct RegionSnapshotReplacement {
pub replacement_state: RegionSnapshotReplacementState,

pub operating_saga_id: Option<Uuid>,

Contributor:

Not related to this PR, but a doc comment explaining the use of this field would be a welcome addition for slow-on-the-uptake readers like me :)

Comment on lines 412 to 413
// Check if the volume was deleted _after_ the replacement step was
// created.

Contributor:

Is this just an optimization? I don't see anything that'd prevent a volume from being deleted between the check on line 416 and the saga execution on line 478. I don't think that'll break anything, though, because volume_replace_snapshot will bail if the volume is deleted.

Assuming this is right I'd leave a comment noting that this is just here to avoid the (not-inconsiderable!) expense of having to run a saga that we know isn't going to do anything.

@jmpesp (author):

You're correct, this is just an optimization - the region_snapshot_replacement_step saga's volume_replace_snapshot will detect if the volume is deleted and do the right thing. Added comments in 9fc46ab

Comment on lines 84 to 89
///      Running        |
///                     | set in region snapshot replacement
///         |           | finish background task
///         |
///         |           |
///         v           |
///                     | responsibility of region snapshot
///      Completing     | replacement finish saga

Contributor:

nit: there's another version of this diagram (in the saga comments, I think) that has a back-edge from completing to running, though I think this is only if it unwinds, and I'm not sure if you're including those edges here

@jmpesp (author):

not a nit, that's important! added in 2cd7881

Comment on lines +517 to +520
// Create a volume to inflate the reference count of the newly created
// read-only region. If this is not done it's possible that a user could
// delete the snapshot volume _after_ the new read-only region was swapped
// in, removing the last reference to it and causing garbage collection.

Contributor:

Just for my understanding: What prevents the region from being garbage-collected when it's just been ensured but hasn't been linked into anything yet? My guess is that completely orphaned regions don't get GC'ed and that possibly-unused regions are found by looking for them in soft-deleted volumes. Is that correct?

@jmpesp (author):

Kinda yeah:

  • for read-only regions, they're only returned from soft_delete_volume for deletion when they appear in the volume being soft-deleted.
  • it's similar for read/write regions, but an additional way they're returned for GC is from find_deleted_volume_regions. That function joins the regions table with the volume table and then only operates on soft-deleted volumes. The relevant section is:
            // only operate on soft deleted volumes
            let soft_deleted = match &volume {
                Some(volume) => volume.time_deleted.is_some(),
                None => false,
            };

            if !soft_deleted {
                continue;
            }

If the region's volume isn't inserted yet, then volume will be None here.

@@ -423,6 +424,21 @@ async fn rsrss_notify_upstairs(
let params = sagactx.saga_params::<Params>()?;
let log = sagactx.user_data().log();

// If the associated volume was deleted, then skip this notification step.
// It's likely there is no Upstairs to talk to, but continue with the saga

Contributor:

"Likely"? :)

I wonder if it'd be worthwhile to have a #[cfg(debug_assertions)] block here in which we assert that if the volume is deleted, then either it has no disk or the disk is detached.

@jmpesp (author):

a78cabf reworks the comment.

I tried your suggestion by adding a `validate_higher_level_resource_deleted` function to the `validate_volume_invariants` check, but it returned an Err during the `disk_create` saga unwind. I'm not sure how to solve this, or even if it can be solved...?

Contributor:

A high-level comment: my understanding of the underlying issue is that it involves sagas, some initiated by user requests and some by background tasks, racing with each other and creating havoc. Is that correct? If it is, do the new tests in this module fail deterministically without the new fixes in place, or do we need races between the sagas they start to break the right way?

@jmpesp (author):

> A high-level comment: my understanding of the underlying issue is that it involves sagas, some initiated by user requests and some by background tasks, racing with each other and creating havoc. Is that correct?

Yes it is!

> do the new tests in this module fail deterministically without the new fixes in place

test_racing_replacements_for_soft_deleted_disk_volume does fail without the fixes in place - I wrote it before writing the fixes! :) I'm not sure which fix(es) are required for that test to pass, or even if it's exclusively fixed by this PR or needs the previous related ones, but it was the intention of that test to first reproduce it, then fix it.

> do we need races between the sagas

More extensive testing would be possible with a testing harness like app::sagas::test_helpers::action_failure_can_unwind_idempotently where, instead of injecting an error, it would run an arbitrary function, which in this case could be a whole snapshot delete. No such functionality exists, so manual testing is required.

@leftwo (Contributor) left a comment:

This is a ton of great work.

nexus/db-queries/src/db/datastore/volume.rs (comment resolved)
nexus/db-model/src/schema.rs (comment resolved)
nexus/db-queries/src/db/datastore/volume.rs (comment resolved)
log,
"{s}";
"request id" => ?request.id,
"volume id" => ?volume.id(),
);
status.errors.push(s);

Contributor:

Our list of regions to look at is built with get_running_region_snapshot_replacements.
That is only good while the region snapshot replacement is actually running, correct?

If so, then I believe any "missed" volume due to repair lock or other errors, must be re-searched for and handled before the running region snapshot replacement job finishes.

Is that all correct? And, could we miss the window here where we don't catch it while the replacement is running and then it becomes an orphan?

Contributor:

I was going to ask a related question about this, but I think I managed instead to convince myself this is OK. I think the way this is meant to work is that

  • the region_snapshot table row for a given snapshot is garbage collected once no more volumes refer to it
  • if there is a snapshot whose dataset is on an expunged sled, the region-snapshot-replace-start background task will try to create a new replacement request for it if none already exists

In the case I think you have in mind, the following would happen:

  • the call to find_volumes_referencing_socket_addr on line 263 will find all the volumes that refer to the snapshot of interest; let's assume there's just one of these
  • assume the call to create_region_snapshot_replacement_step above (line 335) fails because the volume is already locked
  • the region-snapshot-replace-finish task will call in_progress_region_snapshot_replacement_steps and find there are no replacement steps being resolved; this allows the finish task to try to mark the replacement as having Completed
  • however, there's still a volume referring to the snapshot, which keeps it from being deleted; this means a subsequent run of the replace-start task will start another replacement cycle for it

This could go on forever if the region snapshot replacement never manages to acquire the volume lock. I'm not sure there's an easy way to prevent that (we'd have to make the lock fair and I haven't thought of a straightforward way to do that). But absent that kind of persistent unfairness, some replacement attempt or another should eventually get the volume lock and do this work.

@jmpesp does this check out? The other question I had was whether it was possible for an ill-timed "finish" task to prematurely decide that a Running replacement was finished (because it happened to run and evaluate the replacement before the corresponding "create steps" background task got to create any steps for it). Is this possible or is it foreclosed upon by the replacement state machine?

@jmpesp (author):

> the region-snapshot-replace-finish task will call in_progress_region_snapshot_replacement_steps and find there are no replacement steps being resolved; this allows the finish task to try to mark the replacement as having Completed

The only way a region snapshot request transitions out of the Running state (via the finish saga) is if

  1. there are no in-progress region snapshot replacement steps, and
  2. the request's associated region snapshot record was deleted.

There shouldn't be a way to miss volumes: if there are any volumes referencing the region snapshot, then the record will not have been deleted. In the case that (I think) Alan's referring to, we don't create a region snapshot replacement step for a volume due to an existing lock, but the request will not be finished because of #2 above.

As well, if the request is in the Running state, then the snapshot volume's already had a successful replacement performed, meaning no new volumes that reference the region snapshot can be created, meaning there can't be a race between something creating a new reference (by copying the snapshot volume as a read-only parent) and deleting the last reference.
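
A sketch of the finish condition described above. `in_progress_region_snapshot_replacement_steps` is named in this discussion; the other names and signatures are assumptions, not the exact datastore API:

```rust
// Sketch only: a Running request is only handed to the finish saga when
// both conditions below hold.
let in_progress_steps = datastore
    .in_progress_region_snapshot_replacement_steps(opctx, request.id)
    .await?;

// has the region snapshot record being replaced been deleted, i.e. does
// nothing reference it any more? (hypothetical helper name)
let region_snapshot_deleted =
    datastore.region_snapshot_for_request(&request).await?.is_none();

if in_progress_steps == 0 && region_snapshot_deleted {
    // run the region snapshot replacement finish saga, which moves the
    // request Running -> Completing -> Complete and deletes the
    // bookkeeping "new region" volume along the way
    start_region_snapshot_replacement_finish_saga(opctx, request).await?;
}
```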

Contributor:

I misread the comment on line 107 of region_snapshot_replacement_finish.rs (it refers to what happens if the conditional branch on line 131 of that file is taken and not to what has happened by passing the condition on line 106). This makes more sense now!

nexus/tests/integration_tests/crucible_replacements.rs (outdated; comment resolved)

@gjcolombo (Contributor) left a comment:

I have a few more general remarks/questions from the newest commits. I still don't see anything that's blocking, but (as you know) there's a ton going on here and I'm not 100% sure I haven't overlooked or misunderstood something.

I don't want to hold up the PR just for this, but I'd be delighted to have something in the docs directory that explains the overall theory of how these kinds of replacements are meant to work. (A lot of this information is in the various sagas' and tasks' module comments but having a single narrative doc to refer to would be great.)

nexus/tests/integration_tests/crucible_replacements.rs (outdated; comment resolved)
Comment on lines +741 to +747
Err(err.bail(Error::conflict(format!(
    "region snapshot replacement {} set to {:?} \
    (operating saga id {:?})",
    region_snapshot_replacement_id,
    record.replacement_state,
    record.operating_saga_id,
))))

Contributor:

This is unexpected, correct? (It means either that the operating saga ID was wrong, or the caller called this on a replacement that wasn't Completing.)

@jmpesp (author):

It's unexpected, yeah. Even in the case where the saga node is rerun the state should be set to Complete already and the saga shouldn't unwind.

Comment on lines +107 to +108
// An associated volume repair record isn't _strictly_
// needed: snapshot volumes should never be directly

Contributor:

I follow the bit about snapshots not being directly accessed by an upstairs, but I thought the repair record was still needed for mutual exclusion?

@jmpesp (author):

It's not strictly speaking necessary - many replacements could occur on the snapshot volume at the same time, and because it's never constructed there wouldn't be any repair operation required.

Contributor:

They wouldn't contend on the snapshot volume's database record?

@jmpesp (author):

They would only contend around the volume repair record. If there was no lock for the snapshot volume, then the individual replacement transactions could all fire in whatever order they're going to serialize in, and it would probably work.


task, insert into the `requests_completed_ok` vec!

also, add a test for this, which exposed another bug: the continue that
would have skipped starting the start saga was in the wrong place.

@gjcolombo (Contributor) left a comment:

I've looked over the commits individually (and asked some basic questions in DMs) and I think I grok what's going on here pretty thoroughly now. Thanks @jmpesp for all the back-and-forth on this one!

@leftwo self-requested a review December 19, 2024 01:37

@leftwo (Contributor) left a comment:

Just a few final bikeshed comments, great to see this getting done here.

@jmpesp enabled auto-merge (squash) December 19, 2024 04:26
@jmpesp merged commit 09b150f into oxidecomputer:main Dec 19, 2024
16 checks passed
@jmpesp deleted the region_snapshot_replacement_account_for_deleted_volumes branch December 19, 2024 15:05