Implement region replacement for Volumes #5683
Conversation
When a disk is expunged, any region that was on that disk is assumed to be gone. A single disk expungement can put many Volumes into degraded states, as one of the three mirrors of a region set is now gone. Volumes that are degraded in this way remain degraded until a new region is swapped in, and the Upstairs performs the necessary repair operation (either through a Live Repair or Reconciliation). Nexus can only initiate these repairs - it does not participate in them, instead requesting that a Crucible Upstairs perform the repair.

These repair operations can only be done by an Upstairs running as part of an activated Volume: either Nexus has to send this Volume to a Pantry and repair it there, or Nexus has to talk to a propolis that has that active Volume. Further complicating things is that the Volumes in question can be activated and deactivated as a result of user action, namely starting and stopping Instances. This will interrupt any ongoing repair. This is ok! Both operations support being interrupted, but as a result it's then Nexus' job to continually monitor these repair operations and initiate further operations if the current one is interrupted.

Nexus starts by creating region replacement requests, either manually or as a result of disk expungement. These region replacement requests go through the following states:

```
Requested   <--
    |         |
    v         |
Allocating  --
    |
    v
Running     <--
    |         |
    v         |
Driving     --
    |
    v
ReplacementDone <--
    |             |
    v             |
Completing      --
    |
    v
Completed
```

A single saga invocation is not enough to continually make sure a Volume is being repaired, so region replacement is structured as a series of background tasks and saga invocations from those background tasks. Here's a high level summary:

- a `region replacement` background task:
  - looks for disks that have been expunged and inserts region replacement requests into CRDB with state `Requested`
  - looks for all region replacement requests in state `Requested` (picking up new requests and requests that failed to transition to `Running`), and invokes a `region replacement start` saga.
- the `region replacement start` saga:
  - transitions the request to state `Allocating`, blocking out other invocations of the same saga
  - allocates a new replacement region
  - alters the Volume Construction Request by swapping out the old region for the replacement one
  - transitions the request to state `Running`
  - any unwind will transition the request back to the `Requested` state.
- a `region replacement drive` background task:
  - looks for requests with state `Running`, and invokes the `region replacement drive` saga for those requests
  - looks for requests with state `ReplacementDone`, and invokes the `region replacement finish` saga for those requests
- the `region replacement drive` saga will:
  - transition a request to state `Driving`, again blocking out other invocations of the same saga
  - check if Nexus has taken an action to initiate a repair yet. if not, then one is needed. if it _has_ previously initiated a repair operation, the state of the system is examined: is that operation still running? has something changed? further action may be required depending on this observation.
  - if an action is required, Nexus will prepare an action that will initiate either Live Repair or Reconciliation based on the current observed state of the system.
  - that action is then executed. if there was an error, then the saga unwinds. if it was successful, it is recorded as a "repair step" in CRDB and will be checked the next time the saga runs.
  - if Nexus observed an Upstairs telling it that a repair was completed or not necessary, then the request is placed into the `ReplacementDone` state, otherwise it is placed back into the `Running` state. if the saga unwinds, it unwinds back to the `Running` state.
- finally, the `region replacement finish` saga will:
  - transition a request into `Completing`
  - delete the old region by deleting a transient Volume that refers to it (in the case where a sled or disk is actually physically gone, expunging that will trigger oxidecomputer#4331, which needs to be fixed!)
  - transition the request to the `Completed` state

More detailed documentation is provided in each of the region replacement sagas' beginning docstrings.

Testing was done manually in the Canada region, using the following test cases:

- a disk needing repair is attached to an instance for the duration of the repair
- a disk needing repair is attached to an instance that is migrated mid-repair
- a disk needing repair is attached to an instance that is stopped mid-repair
- a disk needing repair is attached to an instance that is stopped mid-repair, then started in the middle of the pantry's repair
- a detached disk needs repair
- a detached disk needs repair, and is then attached to an instance that is then started
- a sled is expunged, causing region replacement requests for all regions on it

Fixes oxidecomputer#3886
Fixes oxidecomputer#5191
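As a rough illustration of the request state machine above - not the actual Nexus types, which land in the split-out PRs - here is a minimal self-contained Rust sketch of the states and the transitions (including the "unwind" arrows) that the diagram describes:

```rust
/// Sketch of the request state machine described above; the real
/// `RegionReplacementState` enum in Nexus may differ in derives and naming.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegionReplacementState {
    Requested,
    Allocating,
    Running,
    Driving,
    ReplacementDone,
    Completing,
    Completed,
}

impl RegionReplacementState {
    /// Returns true if `self -> to` is one of the arrows in the diagram,
    /// including the unwind arrows that point back up.
    fn can_transition(self, to: Self) -> bool {
        use RegionReplacementState::*;
        matches!(
            (self, to),
            // start saga: Requested -> Allocating -> Running (unwind: back to Requested)
            (Requested, Allocating)
                | (Allocating, Running)
                | (Allocating, Requested)
                // drive saga: iterates Running <-> Driving, may reach ReplacementDone
                | (Running, Driving)
                | (Driving, Running)
                | (Driving, ReplacementDone)
                // finish saga: ReplacementDone -> Completing -> Completed (unwind: back)
                | (ReplacementDone, Completing)
                | (Completing, ReplacementDone)
                | (Completing, Completed)
        )
    }
}

fn main() {
    use RegionReplacementState::*;
    assert!(Requested.can_transition(Allocating));
    assert!(Driving.can_transition(Running)); // drive saga unwind
    assert!(!Requested.can_transition(Completed)); // no skipping states
}
```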
fix case where mark_region_replacement_as_done wasn't changing the state of a request for which there was a drive saga running.
James, this is epic work. I've only given it a cursory look so far, and will need to spend much more time digging in. Given how far behind current main this is, and the slight alleviation of urgency, I was wondering if you could split this up into multiple logical PRs to make it easier to review. I think this should be feasible by splitting along datastore queries and then saga / background task lines. Each of those can be added to the code and tested without being used. The background tasks, for instance, don't need to be enabled immediately, and the sagas don't need to be triggered by the background tasks and/or omdb. The OMDB change can come in last. My gut feeling is that this would also make it easier to test things in isolation, as you may see issues while doing the split and writing individual commit messages.
```rust
.transaction_async(|conn| async move {
    use db::schema::region_replacement::dsl;

    match (args.state, args.after) {
```
Nit: Rather than match on different filters, you could create a query without the filters, and then append them. This should be much less code. Here's an example:
omicron/dev-tools/omdb/src/bin/omdb/db.rs, lines 765 to 776 in d2ed452:
```rust
use db::schema::disk::dsl;
let mut query = dsl::disk.into_boxed();
if !fetch_opts.include_deleted {
    query = query.filter(dsl::time_deleted.is_null());
}
let disks = query
    .limit(i64::from(u32::from(fetch_opts.fetch_limit)))
    .select(Disk::as_select())
    .load_async(&*datastore.pool_connection_for_tests().await?)
    .await
    .context("loading disks")?;
```
```rust
//! TODO this is currently a placeholder for a future PR
//! This task's responsibility is to create region replacement requests when
//! physical disks are expunged, and trigger the region replacement start saga
//! for any requests that are in state "Requested". See the documentation there
```
See the documentation where? The region replacement start saga?
```toml
blueprints.period_secs_execute = 600
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
# The driver task should wake up frequently, something like every 10 seconds.
```
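For illustration, the driver task's period could be configured alongside the entries above; the key name here is a guess following the pattern of the existing entries, not necessarily the one this PR adds:

```toml
# Hypothetical key name; the comment in the diff suggests a short period.
region_replacement_driver.period_secs = 10
```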
That's unfortunate. It would be nice if this could be redacted out just for this message, but I'm not sure if that's possible.
```rust
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! # first, some Crucible background #
```
great comment!
Closing this, will split it up!
Splitting up #5683 first by separating out the DB models, queries, and schema changes required:

1. region replacement records

This commit adds a Region Replacement record, which is a request to replace a region in a volume. It transitions through the following states:

```
Requested   <--
    |         |
    v         |
Allocating  --
    |
    v
Running     <--
    |         |
    v         |
Driving     --
    |
    v
ReplacementDone <--
    |             |
    v             |
Completing      --
    |
    v
Completed
```

which are captured in the `RegionReplacementState` enum. Transitioning from Requested to Running is the responsibility of the "start" saga, iterating between Running and Driving is the responsibility of the "drive" saga, and transitioning from ReplacementDone to Completed is the responsibility of the "finish" saga. All of these will come in subsequent PRs.

The state transitions themselves are performed by these sagas and all involve a query that:

- checks that the starting state (and other values as required) make sense
- updates the state while setting a unique `operating_saga_id` id (and any other fields as appropriate)

As multiple background tasks will be waking up, checking to see what sagas need to be triggered, and requesting that these region replacement sagas run, this is meant to block multiple sagas from running at the same time in an effort to cut down on interference - most will unwind at the first step instead of somewhere in the middle.

2. region replacement step records

As region replacement takes place, Nexus will be making calls to services in order to trigger the necessary Crucible operations meant to actually perform the replacement. These steps are recorded in the database so that they can be consulted by subsequent steps, and additionally act as breadcrumbs if there is an issue.

3. volume repair records

Nexus should take care to only replace one region (or snapshot!) for a volume at a time. Technically, the Upstairs can support two at a time, but codifying "only one at a time" is safer, and does not allow the possibility for a Nexus bug to replace all three regions of a region set at a time (aka total data loss!). This "one at a time" constraint is enforced by each repair also creating a VolumeRepair record, a table for which there is a UNIQUE CONSTRAINT on the volume ID.

4. also, the `volume_replace_region` function

The `volume_replace_region` function is also included in this PR. In a single transaction, this will:

- set the target region's volume id to the replacement's volume id
- set the replacement region's volume id to the target's volume id
- update the target volume's construction request to replace the target region's SocketAddrV6 with the replacement region's

This is called from the "start" saga, after allocating the replacement region, and is meant to transition the Volume's construction request from "indefinitely degraded, pointing to region that is gone" to "currently degraded, but can be repaired".
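As a rough illustration of the "check the starting state, then set a unique `operating_saga_id`" pattern described above, here is a Diesel-flavored sketch of the start saga's first transition. This is not the actual Nexus datastore code: the column names, and the `conn`, `request_id`, and `saga_id` bindings assumed to be in scope, are all illustrative.

```rust
use async_bb8_diesel::AsyncRunQueryDsl; // assumed, as in omicron's datastore code
use db::schema::region_replacement::dsl;
use diesel::prelude::*;

// Compare-and-swap style update: claim the request only if it is still in
// `Requested` and no other saga has set an operating saga id.
let rows_updated = diesel::update(dsl::region_replacement)
    .filter(dsl::id.eq(request_id))
    .filter(dsl::replacement_state.eq(RegionReplacementState::Requested))
    .filter(dsl::operating_saga_id.is_null())
    .set((
        dsl::replacement_state.eq(RegionReplacementState::Allocating),
        dsl::operating_saga_id.eq(saga_id),
    ))
    .execute_async(&*conn)
    .await?;

// If zero rows changed, another saga invocation claimed the request first;
// this invocation fails here, unwinding at its first step as described.
if rows_updated == 0 {
    return Err(anyhow::anyhow!("request already claimed by another saga"));
}
```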