[#5333 5/6] Region snapshot replacement step #6350
Conversation
After the region snapshot replacement start saga finishes, the snapshot's volume is no longer in a degraded state: the requested read-only region was cloned to a new region, and the reference was replaced in the construction request. Any disk that is now created using the snapshot as a source will work without issues.
The problem now is volumes that still reference the replaced read-only region, and any Upstairs constructed from a VCR that references that region: disks created using a snapshot as a source will clone the snapshot volume and use that as the read-only parent for the new disk.
This commit adds a new background task that finds all volumes that reference the replaced read-only region, creates a "region snapshot replacement step" record for them, and triggers the region snapshot replacement step saga.
This is a much less involved process than region replacement: no continuous monitoring and driving is required, only a single best effort replace call to any relevant propolis. No pantry notification is required because there's nothing that currently attaches a volume to a pantry other than bulk import, and that's not relevant because those volumes do not have read-only parents.
This commit also adds a garbage collection for these new "step" records, as they also stash the old snapshot target in a volume for later deletion.
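To make the two passes concrete, here is a small self-contained Rust sketch of the flow described above: find the volumes that still reference the old read-only target, record a step for each, and kick off a stand-in "step saga". Every name in it is an illustrative stand-in, not the actual Nexus datastore or saga API.

// Toy model of the background task described above; not the real omicron code.
use std::collections::HashSet;

#[derive(Debug)]
struct StepRecord {
    volume_id: u64,
}

// Pass 1: find every volume whose construction request still points at the
// replaced read-only target (a string stands in for the old socket address).
fn find_affected_volumes(volumes: &[(u64, String)], old_target: &str) -> Vec<u64> {
    volumes
        .iter()
        .filter(|(_, target)| target == old_target)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let volumes = vec![
        (1, String::from("[fd00::101]:19000")), // still references the old target
        (2, String::from("[fd00::102]:19001")), // already points elsewhere
    ];

    // Pass 2: create a "step" record per affected volume, then trigger the
    // step saga for it (here just a print; in the PR this is a saga request).
    let mut already_recorded = HashSet::new();
    for volume_id in find_affected_volumes(&volumes, "[fd00::101]:19000") {
        if already_recorded.insert(volume_id) {
            let step = StepRecord { volume_id };
            println!("would trigger the step saga for {step:?}");
        }
    }
}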
Is this a two-pass operation?
Like, first find all the volumes affected by a snapshot replacement,
then start sagas for each affected volume?
We do prevent any new volumes from the original snapshot, as that
snapshot has the replacement, but if someone did this sequence:
Original snapshot created.
New volume (call this VolumeA) created from original snapshot.
Disk source in original snapshot fails.
Disk is replaced in original snapshot.
This saga makes a list of volumes that need updates.
<-- Someone makes a snapshot of VolumeA
This saga starts working on the list of volumes that need updates.
Could we be missing one in that situation?
@@ -127,6 +127,7 @@ saga_recovery.period_secs = 600
 lookup_region_port.period_secs = 60
 region_snapshot_replacement_start.period_secs = 30
 region_snapshot_replacement_garbage_collection.period_secs = 30
+region_snapshot_replacement_step.period_secs = 30
Could region_snapshot_replacement_start, region_snapshot_replacement_garbage_collection, and region_snapshot_replacement_step all start at the same time?
Like, do we need to consider thundering herd type problems here and stagger these from each other a little? Or will the trigger to start each one come in at different times, and keeping them the same will ensure that they don't collide?
As far as I know, they usually do start at the same time! :) I think we're ok - each task only operates on requests that are in a certain state, so they're bounded in that sense against becoming a thundering herd.
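A tiny sketch of that bounding argument, with made-up states: even if the three tasks all fire on the same tick, each one only picks up requests in the state it owns, so they don't pile onto the same work.

// Illustrative only; the state names are stand-ins, not the real schema.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RequestState {
    Requested,       // picked up by the "start" task
    Running,         // picked up by the "step" task
    ReplacementDone, // picked up by the garbage collection task
}

fn requests_in_state(all: &[(u64, RequestState)], wanted: RequestState) -> Vec<u64> {
    all.iter().filter(|(_, s)| *s == wanted).map(|(id, _)| *id).collect()
}

fn main() {
    let requests = vec![(1, RequestState::Requested), (2, RequestState::Running)];
    // Both calls can happen at the same instant without overlapping work.
    println!("start task sees {:?}", requests_in_state(&requests, RequestState::Requested));
    println!("step task sees  {:?}", requests_in_state(&requests, RequestState::Running));
}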
I'm thinking about a sled being expunged and all 10 crucible zones and all the things
that could be on those 10 zones, etc. Making sure we can handle an outage like that.
Right now we only limit the number of operations that can happen at one time to a Volume, and don't currently limit otherwise. I'm not too concerned
(hit comment too early!) I'm not too concerned, but it will be quite a bit of traffic to have all those reconciliations, live repairs, and clones going on... on second thought maybe I am concerned haha.
I'm not sure what to do other than to test it out and see though. We might be able to come up with an upper bound on the number of regions and snapshots maybe? Even if we do we can't know what the non-expunge load on the rack will be at any given time.
What do you think? Do you think some sort of limit is a good idea?
It's hard to say if we should set a limit without knowing what load a repair might generate.
But, that being said, it is probably better to put in a limit that makes things slower but still
able to finish, and then loosen that limit as we discover how much the rack can handle.
The alternative would be to just let repair take as many resources as it wants and then
let things break if it's too much? That second choice does not sound like something
I would want to explain to a customer :)
We don't have anything yet we can use as an overall gauge of rack "busyness". In addition,
we don't even know the load an "average repair" generates. Maybe our first step would
be to build the tools to determine what the current repair load is so we can measure
its impact?
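For what it's worth, one common shape for such a limit is a semaphore that caps how many repair-like operations run concurrently, so an expunged sled's worth of work drains gradually rather than all at once. The sketch below is only an illustration of that idea (the limit of 4 and the task body are made up), not something this PR adds.

// Hypothetical concurrency cap using tokio's Semaphore; not part of the PR.
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // At most 4 "repairs" in flight at once (number chosen arbitrarily;
    // picking it well is exactly the open question above).
    let limit = Arc::new(Semaphore::new(4));
    let mut handles = Vec::new();

    for repair_id in 0..10 {
        // Waits here whenever 4 repairs are already running.
        let permit = limit.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            println!("running repair {repair_id}");
            // ... the actual reconciliation / live repair / clone would go here ...
            drop(permit); // frees a slot for the next repair
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}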
};

for volume in volumes {
    // Otherwise, any volume referencing the old socket addr needs
The Otherwise here suggests there was a larger comment at one point?
Yeah, you're right, this is weird; it's possible something got moved around or deleted? Fixed in 5af0ff9
//
// Also note: this function returns a conflict error if another
// step record references this volume id in the "old snapshot
// volume id" colume - this is ok! Region snapshot replacement
nit: colume -> column
Fixed in 804568b
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Clean up the volume that stashes the target replaced during a region
By Clean up we mean Delete here? If so, we should just say delete.
We could say:
Delete the temporary volume we created during the snapshot replacement step saga.
Maybe?
Cool cool, done in 9a9cc60
Correct yeah :)
There's no such prevention required :) the request will stay in the "Running" state (which means the step saga will be activated for it) until all the references to the region snapshot being replaced are gone. We're using the accounting machinery of the volume_references column in the region_snapshot table that is updated during all volume creates and deletes to tell us when there are no references left. Every copy of the volume that contains the target representing the region snapshot being replaced will bump this counter, every delete will decrement it, and when it goes to zero (and is deleted) we'll know that the whole replacement process is done.
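A toy model of that volume_references bookkeeping, just to spell out the mechanism (the address and counts below are made up; the real accounting lives in the region_snapshot table):

// Illustration only: copies of the old target bump a counter, deletes
// decrement it, and zero means the replacement can finish.
use std::collections::HashMap;

fn main() {
    // region snapshot target -> number of volumes still referencing it
    let mut volume_references: HashMap<&str, u64> = HashMap::new();
    let old_target = "[fd00::101]:19000";

    // Two volume creates copied the old target into their read-only parents.
    *volume_references.entry(old_target).or_insert(0) += 2;

    // Each volume delete removes one reference.
    for _ in 0..2 {
        let count = volume_references.get_mut(old_target).unwrap();
        *count -= 1;
        if *count == 0 {
            println!("no references left: the replacement request can complete");
        }
    }
}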
Looks great. Can't believe I have no notes :)
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Background task for detecting volumes affected by a region snapshot
This is such a great comment. I see that's why you also put it in the commit message 😄
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Region snapshot replacement is distinct from region replacement: replacing
Another awesome header comment!