-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot creation races with region replacement #6213
Conversation
Prevent the snapshot creation saga from racing with the region replacement saga by having the snapshot creation saga create a volume repair record, locking the volume for the duration of the snapshot creation. There's a large comment in the new `ssc_take_volume_lock` talking about the implications of these two sagas racing.
Does |
@@ -200,6 +204,9 @@ declare_saga_actions! { | |||
FINALIZE_SNAPSHOT_RECORD -> "finalized_snapshot" { | |||
+ ssc_finalize_snapshot_record | |||
} | |||
RELEASE_VOLUME_LOCK -> "volume_unlock" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we fail during some previous unwind step and never get to this unlock
step (and, leave the volume locked)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if the snapshot create saga gets stuck, then the volume remains locked.
@@ -137,6 +137,10 @@ pub(crate) struct Params { | |||
// snapshot create saga: actions | |||
declare_saga_actions! { | |||
snapshot_create; | |||
TAKE_VOLUME_LOCK -> "volume_lock" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't also need a volume lock step on the region_replacement saga?
Is that becuase the volume_repair_delete_query()
does the locking for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a region replacement request is created, the associated volume repair record is created at the same time. It's deleted when the region replacement request transitions to Completed.
The snapshot create saga does this, though it doesn't wait for the RunningSnapshot response from the agent to be in state == Created! A fix for that is coming in another PR. |
Prevent the snapshot creation saga from racing with the region replacement saga by having the snapshot creation saga create a volume repair record, locking the volume for the duration of the snapshot creation.
There's a large comment in the new
ssc_take_volume_lock
talking about the implications of these two sagas racing.