
[#5333 5/6] Region snapshot replacement step #6350

Merged
merged 8 commits into oxidecomputer:main on Aug 26, 2024

Conversation

@jmpesp jmpesp (Contributor) commented Aug 15, 2024

After the region snapshot replacement start saga finishes, the snapshot's volume is no longer in a degraded state: the requested read-only region was cloned to a new region, and the reference was replaced in the construction request. Any disk that is now created using the snapshot as a source will work without issues.

The problem now is volumes that still reference the replaced read-only region, and any Upstairs constructed from a VCR that references that region: disks created using a snapshot as a source will clone the snapshot volume and use that as the read-only parent for the new disk.

This commit adds a new background task that finds all volumes that reference the replaced read-only region, creates a "region snapshot replacement step" record for them, and triggers the region snapshot replacement step saga.

This is a much less involved process than region replacement: no continuous monitoring and driving is required, only a single best effort replace call to any relevant propolis. No pantry notification is required because there's nothing that currently attaches a volume to a pantry other than bulk import, and that's not relevant because those volumes do not have read-only parents.

This commit also adds a garbage collection for these new "step" records, as they also stash the old snapshot target in a volume for later deletion.
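To make the flow concrete, here is a minimal, self-contained sketch of the idea behind the new background task. Every type and function name below is a hypothetical stand-in, not the actual omicron/Nexus implementation.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq)]
enum RequestState { Running, Complete }

struct ReplacementRequest {
    id: u64,
    state: RequestState,
    // old read-only target that the start saga replaced, e.g. "[fd00::1]:19001"
    old_snapshot_target: String,
}

struct Volume {
    id: u64,
    // targets referenced by the volume's construction request
    targets: Vec<String>,
}

struct StepRecord {
    request_id: u64,
    volume_id: u64,
}

/// For every running replacement request, create a "step" record for each
/// volume that still references the replaced read-only target, skipping
/// (request, volume) pairs that already have one so repeated activations of
/// the background task stay idempotent. In the real task, inserting such a
/// record would also trigger the region snapshot replacement step saga.
fn create_step_records(
    requests: &[ReplacementRequest],
    volumes: &[Volume],
    existing: &HashSet<(u64, u64)>,
) -> Vec<StepRecord> {
    let mut steps = Vec::new();
    for request in requests.iter().filter(|r| r.state == RequestState::Running) {
        for volume in volumes {
            let references_old_target =
                volume.targets.iter().any(|t| *t == request.old_snapshot_target);
            if references_old_target && !existing.contains(&(request.id, volume.id)) {
                steps.push(StepRecord { request_id: request.id, volume_id: volume.id });
            }
        }
    }
    steps
}
```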

@jmpesp jmpesp requested review from andrewjstone and leftwo August 15, 2024 17:19
@leftwo leftwo (Contributor) left a comment

Is this a two pass operation?

Like, first find all the volumes affected by a snapshot replacement,
then start sagas for each affected volume?

We do prevent any new volumes from the original snapshot, as that
snapshot has the replacement, but if someone did this sequence:

Original snapshot created.
New volume (call this volumeA) created from original snapshot.
Disk source in original snapshot fails.
Disk is replaced in original snapshot.
This saga makes a list of volumes that need updates.
<-- Someone makes a snapshot of VolumeA
This saga starts working on the list of volumes that need updates.

Could we be missing one in that situation?

@@ -127,6 +127,7 @@ saga_recovery.period_secs = 600
lookup_region_port.period_secs = 60
region_snapshot_replacement_start.period_secs = 30
region_snapshot_replacement_garbage_collection.period_secs = 30
region_snapshot_replacement_step.period_secs = 30
Contributor

Could region_snapshot_replacement_start, region_snapshot_replacement_garbage_collection,
and region_snapshot_replacement_step all start at the same time?
Like, do we need to consider thundering herd type problems here and stagger these from each other
a little? Or will the trigger to start each one come in at different times, and keeping them the same
will ensure that they don't collide?

Contributor Author

As far as I know, they usually do start at the same time! :) I think we're ok - each task only operates on requests that are in a certain state, and so is bounded in that sense from becoming a thundering herd.

Contributor

I'm thinking about a sled being expunged, along with all 10 crucible zones and all the things
that could be on those 10 zones, etc. - making sure we can handle an outage like that.

Contributor Author

Right now we only limit the number of operations that can happen to a Volume at one time, and don't currently limit otherwise. I'm not too concerned

Contributor Author

(hit comment too early!) I'm not too concerned, but it will be quite a bit of traffic to have all those reconciliations, live repairs, and clones going on... on second thought maybe I am concerned haha.

I'm not sure what to do other than to test it out and see though. We might be able to come up with an upper bound on the number of regions and snapshots maybe? Even if we do we can't know what the non-expunge load on the rack will be at any given time.

What do you think? Do you think some sort of limit is a good idea?

Contributor

It's hard to say if we should set a limit without knowing what load a repair might generate.

But, that being said, it is probably better to put in a limit that makes things slower but still
able to finish, and then loosen that limit as we discover how much the rack can handle.
The alternative would be to just let repair take as many resources as it wants and then
let things break if it's too much? That second choice does not sound like something
I would want to explain to a customer :)

We don't have anything yet we can use as an overall gauge of rack "busyness". In addition,
we don't even know the load an "average repair" generates. Maybe our first steps would
be to build the tools to determine what the current repair load is so we can measure
its impact?

};

for volume in volumes {
// Otherwise, any volume referencing the old socket addr needs
Contributor

The Otherwise here suggests there was a larger comment at one point?

Contributor Author

Yeah, you're right, this is weird - it's possible something got moved around or deleted? Fixed in 5af0ff9

//
// Also note: this function returns a conflict error if another
// step record references this volume id in the "old snapshot
// volume id" colume - this is ok! Region snapshot replacement
Contributor

nit: colume -> column

Contributor Author

Fixed in 804568b

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Clean up the volume that stashes the target replaced during a region
Contributor

By Clean up we mean Delete here? If so, we should just say delete.

We could say:
Delete the temporary volume we created during the snapshot replacement step saga.

Maybe?

Contributor Author

Cool cool, done in 9a9cc60

@jmpesp jmpesp (Contributor Author) commented Aug 15, 2024

> Is this a two pass operation?
>
> Like, first find all the volumes affected by a snapshot replacement, then start sagas for each affected volume?

Correct yeah :)

> We do prevent any new volumes from the original snapshot, as that snapshot has the replacement, but if someone did this sequence:
>
> Original snapshot created.
> New volume (call this volumeA) created from original snapshot.
> Disk source in original snapshot fails.
> Disk is replaced in original snapshot.
> This saga makes a list of volumes that need updates.
> <-- Someone makes a snapshot of VolumeA
> This saga starts working on the list of volumes that need updates.
>
> Could we be missing one in that situation?

There's no such prevention required :) the request will stay in the "Running" state (which means the step saga will be activated for it) until all the references to the region snapshot being replaced are gone. We're using the accounting machinery of the volume_references column in the region_snapshot table that is updated during all volume creates and deletes to tell us when there are no references left. Every copy of the volume that contains the target representing the region snapshot being replaced will bump this counter, every delete will decrement it, and when it goes to zero (and is deleted) we'll know that the whole replacement process is done.
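A toy sketch of that accounting, using a simple in-memory map as a stand-in for the volume_references column in the region_snapshot table; all names below are hypothetical:

```rust
use std::collections::HashMap;

/// Reference counts keyed by the target address of the region snapshot being
/// replaced (a stand-in for the volume_references column).
type References = HashMap<String, u64>;

/// Called whenever a volume referencing `target` is created.
fn on_volume_create(refs: &mut References, target: &str) {
    *refs.entry(target.to_string()).or_insert(0) += 1;
}

/// Called whenever such a volume is deleted. Returns true once the last
/// reference is gone, i.e. the replacement request can finally move out of
/// the "Running" state.
fn on_volume_delete(refs: &mut References, target: &str) -> bool {
    if let Some(count) = refs.get_mut(target) {
        *count -= 1;
        if *count == 0 {
            refs.remove(target);
            return true;
        }
    }
    false
}
```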

@andrewjstone andrewjstone (Contributor) left a comment

Looks great. Can't believe I have no notes :)

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Background task for detecting volumes affected by a region snapshot
Contributor

This is such a great comment. I see that's why you also put it in the commit message 😄

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Region snapshot replacement is distinct from region replacement: replacing
Contributor

Another awesome header comment!

@leftwo leftwo self-requested a review August 26, 2024 20:22
@jmpesp jmpesp enabled auto-merge (squash) August 26, 2024 20:25
@jmpesp jmpesp merged commit e434307 into oxidecomputer:main Aug 26, 2024
16 checks passed
@jmpesp jmpesp deleted the snapshot_replacement_part_5 branch August 26, 2024 20:59
nickryand pushed a commit to nickryand/omicron that referenced this pull request Aug 27, 2024
…ter#6350)
