Implement region replacement for Volumes #5683
Conversation
When a disk is expunged, any region that was on that disk is assumed to be gone. A single disk expungement can put many Volumes into degraded states, as one of the three mirrors of a region set is now gone. Volumes that are degraded in this way remain degraded until a new region is swapped in, and the Upstairs performs the necessary repair operation (either through a Live Repair or Reconciliation). Nexus can only initiate these repairs - it does not participate in them, instead requesting that a Crucible Upstairs perform the repair.

These repair operations can only be done by an Upstairs running as part of an activated Volume: either Nexus has to send this Volume to a Pantry and repair it there, or Nexus has to talk to a propolis that has that active Volume. Further complicating things is that the Volumes in question can be activated and deactivated as a result of user action, namely starting and stopping Instances. This will interrupt any ongoing repair. This is ok! Both operations support being interrupted, but as a result it's then Nexus' job to continually monitor these repair operations and initiate further operations if the current one is interrupted.

Nexus starts by creating region replacement requests, either manually or as a result of disk expungement. These region replacement requests go through the following states:

```
Requested   <--
    |         |
    v         |
Allocating  --
    |
    v
Running     <--
    |         |
    v         |
Driving     --
    |
    v
ReplacementDone <--
    |             |
    v             |
Completing      --
    |
    v
Completed
```

A single saga invocation is not enough to continually make sure a Volume is being repaired, so region replacement is structured as a series of background tasks and saga invocations from those background tasks. Here's a high level summary:

- a `region replacement` background task:
  - looks for disks that have been expunged and inserts region replacement requests into CRDB with state `Requested`
  - looks for all region replacement requests in state `Requested` (picking up new requests and requests that failed to transition to `Running`), and invokes a `region replacement start` saga.
- the `region replacement start` saga:
  - transitions the request to state `Allocating`, blocking out other invocations of the same saga
  - allocates a new replacement region
  - alters the Volume Construction Request by swapping out the old region for the replacement one
  - transitions the request to state `Running`
  - any unwind will transition the request back to the `Requested` state.
- a `region replacement drive` background task:
  - looks for requests with state `Running`, and invokes the `region replacement drive` saga for those requests
  - looks for requests with state `ReplacementDone`, and invokes the `region replacement finish` saga for those requests
- the `region replacement drive` saga will:
  - transition a request to state `Driving`, again blocking out other invocations of the same saga
  - check if Nexus has taken an action to initiate a repair yet. if not, then one is needed. if it _has_ previously initiated a repair operation, the state of the system is examined: is that operation still running? has something changed? further action may be required depending on this observation.
  - if an action is required, Nexus will prepare an action that will initiate either Live Repair or Reconciliation based on the current observed state of the system.
  - that action is then executed. if there was an error, then the saga unwinds. if it was successful, it is recorded as a "repair step" in CRDB and will be checked the next time the saga runs.
  - if Nexus observed an Upstairs telling it that a repair was completed or not necessary, then the request is placed into the `ReplacementDone` state, otherwise it is placed back into the `Running` state. if the saga unwinds, it unwinds back to the `Running` state.
- finally, the `region replacement finish` saga will:
  - transition a request into `Completing`
  - delete the old region by deleting a transient Volume that refers to it (in the case where a sled or disk is actually physically gone, expunging that will trigger oxidecomputer#4331, which needs to be fixed!)
  - transition the request to the `Completed` state

More detailed documentation is provided in each of the region replacement sagas' beginning docstrings.

Testing was done manually in the Canada region, using the following test cases:

- a disk needing repair is attached to an instance for the duration of the repair
- a disk needing repair is attached to an instance that is migrated mid-repair
- a disk needing repair is attached to an instance that is stopped mid-repair
- a disk needing repair is attached to an instance that is stopped mid-repair, then started in the middle of the pantry's repair
- a detached disk needs repair
- a detached disk needs repair, and is then attached to an instance that is then started
- a sled is expunged, causing region replacement requests for all regions on it

Fixes oxidecomputer#3886
Fixes oxidecomputer#5191
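As a rough illustration of the request state machine above - not the actual Nexus types, which land in the split-out PRs - here is a minimal self-contained Rust sketch of the states and the transitions (including the "unwind" arrows) that the diagram describes:

```rust
/// Sketch of the request state machine described above; the real
/// `RegionReplacementState` enum in Nexus may differ in derives and naming.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegionReplacementState {
    Requested,
    Allocating,
    Running,
    Driving,
    ReplacementDone,
    Completing,
    Completed,
}

impl RegionReplacementState {
    /// Returns true if `self -> to` is one of the arrows in the diagram,
    /// including the unwind arrows that point back up.
    fn can_transition(self, to: Self) -> bool {
        use RegionReplacementState::*;
        matches!(
            (self, to),
            // start saga: Requested -> Allocating -> Running (unwind: back to Requested)
            (Requested, Allocating)
                | (Allocating, Running)
                | (Allocating, Requested)
                // drive saga: iterates Running <-> Driving, may reach ReplacementDone
                | (Running, Driving)
                | (Driving, Running)
                | (Driving, ReplacementDone)
                // finish saga: ReplacementDone -> Completing -> Completed (unwind: back)
                | (ReplacementDone, Completing)
                | (Completing, ReplacementDone)
                | (Completing, Completed)
        )
    }
}

fn main() {
    use RegionReplacementState::*;
    assert!(Requested.can_transition(Allocating));
    assert!(Driving.can_transition(Running)); // drive saga unwind
    assert!(!Requested.can_transition(Completed)); // no skipping states
}
```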
fix case where mark_region_replacement_as_done wasn't changing the state of a request for which there was a drive saga running.
James, this is epic work. I've only given it a cursory look so far, and will need to spend much more time digging in. Given how far behind current main this is, and the slight alleviation of urgency, I was wondering if you could split this up into multiple logical PRs to make it easier to review. I think this should be feasible by splitting along datastore queries and then saga / background task lines. Each of those can be added to the code and tested without being used. The background tasks, for instance, don't need to be enabled immediately, and the sagas don't need to be triggered by the background tasks and/or omdb. The OMDB change can come in last. My gut feeling is that this would also make it easier to test things in isolation, as you may see issues while doing the split and writing individual commit messages.
```rust
.transaction_async(|conn| async move {
    use db::schema::region_replacement::dsl;

    match (args.state, args.after) {
```
Nit: Rather than match on different filters, you could create a query without the filters, and then append them. This should be much less code. Here's an example:
omicron/dev-tools/omdb/src/bin/omdb/db.rs, lines 765 to 776 in d2ed452:
```rust
use db::schema::disk::dsl;
let mut query = dsl::disk.into_boxed();
if !fetch_opts.include_deleted {
    query = query.filter(dsl::time_deleted.is_null());
}
let disks = query
    .limit(i64::from(u32::from(fetch_opts.fetch_limit)))
    .select(Disk::as_select())
    .load_async(&*datastore.pool_connection_for_tests().await?)
    .await
    .context("loading disks")?;
```
```rust
//! TODO this is currently a placeholder for a future PR
//! This task's responsibility is to create region replacement requests when
//! physical disks are expunged, and trigger the region replacement start saga
//! for any requests that are in state "Requested". See the documentation there
```
See the documentation where? The region replacement start saga?
```toml
blueprints.period_secs_execute = 600
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
# The driver task should wake up frequently, something like every 10 seconds.
```
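For illustration, the driver task's period could be configured alongside the entries above; the key name here is a guess following the pattern of the existing entries, not necessarily the one this PR adds:

```toml
# Hypothetical key name; the comment in the diff suggests a short period.
region_replacement_driver.period_secs = 10
```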
That's unfortunate. It would be nice if this could be redacted out just for this message, but I'm not sure if that's possible.
```rust
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! # first, some Crucible background #
```
great comment!
Closing this, will split it up!
Splitting up #5683 first by separating out the DB models, queries, and schema changes required:

1. region replacement records

This commit adds a Region Replacement record, which is a request to replace a region in a volume. It transitions through the following states:

```
Requested   <--
    |         |
    v         |
Allocating  --
    |
    v
Running     <--
    |         |
    v         |
Driving     --
    |
    v
ReplacementDone <--
    |             |
    v             |
Completing      --
    |
    v
Completed
```

which are captured in the `RegionReplacementState` enum. Transitioning from Requested to Running is the responsibility of the "start" saga, iterating between Running and Driving is the responsibility of the "drive" saga, and transitioning from ReplacementDone to Completed is the responsibility of the "finish" saga. All of these will come in subsequent PRs.

The state transitions themselves are performed by these sagas and all involve a query that:

- checks that the starting state (and other values as required) make sense
- updates the state while setting a unique `operating_saga_id` id (and any other fields as appropriate)

As multiple background tasks will be waking up, checking to see what sagas need to be triggered, and requesting that these region replacement sagas run, this is meant to block multiple sagas from running at the same time in an effort to cut down on interference - most will unwind at the first step instead of somewhere in the middle.

2. region replacement step records

As region replacement takes place, Nexus will be making calls to services in order to trigger the necessary Crucible operations meant to actually perform the replacement. These steps are recorded in the database so that they can be consulted by subsequent steps, and additionally act as breadcrumbs if there is an issue.

3. volume repair records

Nexus should take care to only replace one region (or snapshot!) for a volume at a time. Technically, the Upstairs can support two at a time, but codifying "only one at a time" is safer, and does not allow the possibility for a Nexus bug to replace all three regions of a region set at a time (aka total data loss!). This "one at a time" constraint is enforced by each repair also creating a VolumeRepair record, a table for which there is a UNIQUE CONSTRAINT on the volume ID.

4. also, the `volume_replace_region` function

The `volume_replace_region` function is also included in this PR. In a single transaction, this will:

- set the target region's volume id to the replacement's volume id
- set the replacement region's volume id to the target's volume id
- update the target volume's construction request to replace the target region's SocketAddrV6 with the replacement region's

This is called from the "start" saga, after allocating the replacement region, and is meant to transition the Volume's construction request from "indefinitely degraded, pointing to region that is gone" to "currently degraded, but can be repaired".
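As a rough illustration of the "check the starting state, then set a unique `operating_saga_id`" pattern described above, here is a Diesel-flavored sketch of the start saga's first transition. This is not the actual Nexus datastore code: the column names, and the `conn`, `request_id`, and `saga_id` bindings assumed to be in scope, are all illustrative.

```rust
use async_bb8_diesel::AsyncRunQueryDsl; // assumed, as in omicron's datastore code
use db::schema::region_replacement::dsl;
use diesel::prelude::*;

// Compare-and-swap style update: claim the request only if it is still in
// `Requested` and no other saga has set an operating saga id.
let rows_updated = diesel::update(dsl::region_replacement)
    .filter(dsl::id.eq(request_id))
    .filter(dsl::replacement_state.eq(RegionReplacementState::Requested))
    .filter(dsl::operating_saga_id.is_null())
    .set((
        dsl::replacement_state.eq(RegionReplacementState::Allocating),
        dsl::operating_saga_id.eq(saga_id),
    ))
    .execute_async(&*conn)
    .await?;

// If zero rows changed, another saga invocation claimed the request first;
// this invocation fails here, unwinding at its first step as described.
if rows_updated == 0 {
    return Err(anyhow::anyhow!("request already claimed by another saga"));
}
```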