Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot creation races with region replacement #6213

Merged
merged 1 commit into from
Aug 6, 2024

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Aug 2, 2024

Prevent the snapshot creation saga from racing with the region replacement saga by having the snapshot creation saga create a volume repair record, locking the volume for the duration of the snapshot creation.

There's a large comment in the new ssc_take_volume_lock talking about the implications of these two sagas racing.

Prevent the snapshot creation saga from racing with the region
replacement saga by having the snapshot creation saga create a volume
repair record, locking the volume for the duration of the snapshot
creation.

There's a large comment in the new `ssc_take_volume_lock` talking about
the implications of these two sagas racing.
@jmpesp jmpesp requested a review from leftwo August 2, 2024 16:40
@leftwo
Copy link
Contributor

leftwo commented Aug 2, 2024

Does locking the volume for the duration of the snapshot creation include setting the snapshot to running?

@@ -200,6 +204,9 @@ declare_saga_actions! {
FINALIZE_SNAPSHOT_RECORD -> "finalized_snapshot" {
+ ssc_finalize_snapshot_record
}
RELEASE_VOLUME_LOCK -> "volume_unlock" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we fail during some previous unwind step and never get to this unlock
step (and, leave the volume locked)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if the snapshot create saga gets stuck, then the volume remains locked.

@@ -137,6 +137,10 @@ pub(crate) struct Params {
// snapshot create saga: actions
declare_saga_actions! {
snapshot_create;
TAKE_VOLUME_LOCK -> "volume_lock" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't also need a volume lock step on the region_replacement saga?
Is that becuase the volume_repair_delete_query() does the locking for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a region replacement request is created, the associated volume repair record is created at the same time. It's deleted when the region replacement request transitions to Completed.

@jmpesp
Copy link
Contributor Author

jmpesp commented Aug 2, 2024

Does locking the volume for the duration of the snapshot creation include setting the snapshot to running?

The snapshot create saga does this, though it doesn't wait for the RunningSnapshot response from the agent to be in state == Created! A fix for that is coming in another PR.

@leftwo leftwo self-requested a review August 2, 2024 19:22
@jmpesp jmpesp merged commit 5831dc6 into oxidecomputer:main Aug 6, 2024
16 checks passed
@jmpesp jmpesp deleted the snapshot_create_volume_lock branch August 6, 2024 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants