
[#5333 5/6] Region snapshot replacement step #6350

Merged
merged 8 commits into oxidecomputer:main on Aug 26, 2024

Conversation

@jmpesp jmpesp (Contributor) commented Aug 15, 2024

After the region snapshot replacement start saga finishes, the snapshot's volume is no longer in a degraded state: the requested read-only region was cloned to a new region, and the reference was replaced in the construction request. Any disk that is now created using the snapshot as a source will work without issues.

The problem now is volumes that still reference the replaced read-only region, and any Upstairs constructed from a VCR that references that region: disks created using a snapshot as a source will clone the snapshot volume and use that as the read-only parent for the new disk.

This commit adds a new background task that finds all volumes that reference the replaced read-only region, creates a "region snapshot replacement step" record for them, and triggers the region snapshot replacement step saga.

This is a much less involved process than region replacement: no continuous monitoring and driving is required, only a single best effort replace call to any relevant propolis. No pantry notification is required because there's nothing that currently attaches a volume to a pantry other than bulk import, and that's not relevant because those volumes do not have read-only parents.

This commit also adds a garbage collection for these new "step" records, as they also stash the old snapshot target in a volume for later deletion.
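To make the flow concrete, here is a minimal, self-contained sketch of the idea behind the new background task. Every type and function name below is a hypothetical stand-in, not the actual omicron/Nexus implementation.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq)]
enum RequestState { Running, Complete }

struct ReplacementRequest {
    id: u64,
    state: RequestState,
    // old read-only target that the start saga replaced, e.g. "[fd00::1]:19001"
    old_snapshot_target: String,
}

struct Volume {
    id: u64,
    // targets referenced by the volume's construction request
    targets: Vec<String>,
}

struct StepRecord {
    request_id: u64,
    volume_id: u64,
}

/// For every running replacement request, create a "step" record for each
/// volume that still references the replaced read-only target, skipping
/// (request, volume) pairs that already have one so repeated activations of
/// the background task stay idempotent. In the real task, inserting such a
/// record would also trigger the region snapshot replacement step saga.
fn create_step_records(
    requests: &[ReplacementRequest],
    volumes: &[Volume],
    existing: &HashSet<(u64, u64)>,
) -> Vec<StepRecord> {
    let mut steps = Vec::new();
    for request in requests.iter().filter(|r| r.state == RequestState::Running) {
        for volume in volumes {
            let references_old_target =
                volume.targets.iter().any(|t| *t == request.old_snapshot_target);
            if references_old_target && !existing.contains(&(request.id, volume.id)) {
                steps.push(StepRecord { request_id: request.id, volume_id: volume.id });
            }
        }
    }
    steps
}
```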

@jmpesp jmpesp requested review from andrewjstone and leftwo August 15, 2024 17:19
@leftwo leftwo (Contributor) left a comment

Is this a two pass operation?

Like, first find all the volumes affected by a snapshot replacement,
then start sagas for each affected volume?

We do prevent any new volumes from the original snapshot, as that
snapshot has the replacement, but if someone did this sequence:

Original snapshot created.
New volume (call this volumeA) created from original snapshot.
Disk source in original snapshot fails.
Disk is replaced in original snapshot.
This saga makes a list of volumes that need updates.
<-- Someone makes a snapshot of VolumeA
This saga starts working on the list of volumes that need updates.

Could we be missing one in that situation?

@@ -127,6 +127,7 @@ saga_recovery.period_secs = 600
lookup_region_port.period_secs = 60
region_snapshot_replacement_start.period_secs = 30
region_snapshot_replacement_garbage_collection.period_secs = 30
region_snapshot_replacement_step.period_secs = 30
Contributor

Could region_snapshot_replacement_start, region_snapshot_replacement_garbage_collection,
and region_snapshot_replacement_step all start at the same time?
Like, do we need to consider thundering herd type problems here and stagger these from each other
a little? Or will the trigger to start each one come in at different times, and keeping them the same
will ensure that they don't collide?

Contributor Author

As far as I know, they usually do start at the same time! :) I think we're ok - each task only operates on requests that are in a certain state, and so is bounded in that sense from becoming a thundering herd.

Contributor

I'm thinking about a sled being expunged, along with all 10 crucible zones and all the things
that could be on those 10 zones, etc. - making sure we can handle an outage like that.

Contributor Author

Right now we only limit the number of operations that can happen to a Volume at one time, and don't currently limit otherwise. I'm not too concerned

Contributor Author

(hit comment too early!) I'm not too concerned, but it will be quite a bit of traffic to have all those reconciliations, live repairs, and clones going on... on second thought maybe I am concerned haha.

I'm not sure what to do other than to test it out and see though. We might be able to come up with an upper bound on the number of regions and snapshots maybe? Even if we do we can't know what the non-expunge load on the rack will be at any given time.

What do you think? Do you think some sort of limit is a good idea?

Contributor

It's hard to say if we should set a limit without knowing what load a repair might generate.

But, that being said, it is probably better to put in a limit that makes things slower but still
able to finish, and then loosen that limit as we discover how much the rack can handle.
The alternative would be to just let repair take as many resources as it wants and then
let things break if it's too much? That second choice does not sound like something
I would want to explain to a customer :)

We don't have anything yet we can use as an overall gauge of rack "busyness". In addition,
we don't even know the load an "average repair" generates. Maybe our first steps would
be to build the tools to determine what the current repair load is so we can measure
its impact?

};

for volume in volumes {
// Otherwise, any volume referencing the old socket addr needs
Contributor

The Otherwise here suggests there was a larger comment at one point?

Contributor Author

Yeah, you're right, this is weird - it's possible something got moved around or deleted? Fixed in 5af0ff9

//
// Also note: this function returns a conflict error if another
// step record references this volume id in the "old snapshot
// volume id" colume - this is ok! Region snapshot replacement
Contributor

nit: colume -> column

Contributor Author

Fixed in 804568b

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Clean up the volume that stashes the target replaced during a region
Contributor

By Clean up we mean Delete here? If so, we should just say delete.

We could say:
Delete the temporary volume we created during the snapshot replacement step saga.

Maybe?

Contributor Author

Cool cool, done in 9a9cc60

@jmpesp jmpesp (Contributor Author) commented Aug 15, 2024

> Is this a two pass operation?
>
> Like, first find all the volumes affected by a snapshot replacement, then start sagas for each affected volume?

Correct yeah :)

> We do prevent any new volumes from the original snapshot, as that snapshot has the replacement, but if someone did this sequence:
>
> Original snapshot created.
> New volume (call this volumeA) created from original snapshot.
> Disk source in original snapshot fails.
> Disk is replaced in original snapshot.
> This saga makes a list of volumes that need updates.
> <-- Someone makes a snapshot of VolumeA
> This saga starts working on the list of volumes that need updates.
>
> Could we be missing one in that situation?

There's no such prevention required :) the request will stay in the "Running" state (which means the step saga will be activated for it) until all the references to the region snapshot being replaced are gone. We're using the accounting machinery of the volume_references column in the region_snapshot table that is updated during all volume creates and deletes to tell us when there are no references left. Every copy of the volume that contains the target representing the region snapshot being replaced will bump this counter, every delete will decrement it, and when it goes to zero (and is deleted) we'll know that the whole replacement process is done.
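A toy sketch of that accounting, using a simple in-memory map as a stand-in for the volume_references column in the region_snapshot table; all names below are hypothetical:

```rust
use std::collections::HashMap;

/// Reference counts keyed by the target address of the region snapshot being
/// replaced (a stand-in for the volume_references column).
type References = HashMap<String, u64>;

/// Called whenever a volume referencing `target` is created.
fn on_volume_create(refs: &mut References, target: &str) {
    *refs.entry(target.to_string()).or_insert(0) += 1;
}

/// Called whenever such a volume is deleted. Returns true once the last
/// reference is gone, i.e. the replacement request can finally move out of
/// the "Running" state.
fn on_volume_delete(refs: &mut References, target: &str) -> bool {
    if let Some(count) = refs.get_mut(target) {
        *count -= 1;
        if *count == 0 {
            refs.remove(target);
            return true;
        }
    }
    false
}
```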

@andrewjstone andrewjstone (Contributor) left a comment

Looks great. Can't believe I have no notes :)

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Background task for detecting volumes affected by a region snapshot
Contributor

This is such a great comment. I see that's why you also put it in the commit message 😄

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Region snapshot replacement is distinct from region replacement: replacing
Contributor

Another awesome header comment!

@leftwo leftwo self-requested a review August 26, 2024 20:22
@jmpesp jmpesp enabled auto-merge (squash) August 26, 2024 20:25
@jmpesp jmpesp merged commit e434307 into oxidecomputer:main Aug 26, 2024
16 checks passed
@jmpesp jmpesp deleted the snapshot_replacement_part_5 branch August 26, 2024 20:59
nickryand pushed a commit to nickryand/omicron that referenced this pull request Aug 27, 2024
…ter#6350)
