[#3886 4/4] Drive and finish a region replacement (#5885)
When a disk is expunged, any region that was on that disk is assumed to
be gone. A single disk expungement can put many Volumes into degraded
states, as one of the three mirrors of a region set is now gone. Volumes
that are degraded in this way remain degraded until a new region is
swapped in, and an Upstairs performs the necessary repair operation
(either through a Live Repair or Reconciliation). Nexus can only
initiate these repairs - it does not participate in them, instead
requesting that an Upstairs perform the repair.

These repair operations can only be done by an Upstairs running as part
of an activated Volume: either Nexus has to send this Volume to a Pantry
and repair it there, or Nexus has to talk to a propolis that has that
active Volume. Further complicating things is that the Volumes in
question can be activated and deactivated as a result of user action,
namely starting and stopping Instances. This will interrupt any ongoing
repair. This is ok! Both operations support being interrupted, but as a
result it's then Nexus' job to continually monitor these repair
operations and initiate further operations if the current one is
interrupted.

A single saga invocation is not enough to continually make sure a Volume
is being repaired, so driving one of the repair operations forward
happens as a saga that is triggered from a background task: this is
called the _region replacement drive_ saga. It will:

- transition a region replacement request to state Driving, again
blocking out other invocations of the same saga

- check if Nexus has taken an action to initiate a repair yet. If not,
then one is needed. If it has previously initiated a repair operation,
the state of the system is examined: is that operation still running?
Has something changed? Further action may be required depending on this
observation.

- if an action is required, Nexus will prepare an action that will
initiate either Live Repair or Reconciliation based on the current
observed state of the system.

- that action is then executed. If there was an error, then the saga
unwinds. If it was successful, it is recorded as a "repair step" in CRDB
and will be checked the next time the saga runs.

- if Nexus observed an Upstairs telling it that a repair was completed
or not necessary, then the request is placed into the ReplacementDone
state; otherwise it is placed back into the Running state. If the saga
unwinds, it unwinds back to the Running state.
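
The drive saga's steps above can be sketched as a small state machine. This is only an illustration under stated assumptions: the type and function names (`ReplacementState`, `Observation`, `begin_drive`, `action_required`, `end_drive`) are hypothetical and not taken from the commit, and the set of observations is a guess at what "the state of the system" might boil down to.

```rust
/// Hypothetical request states, named after the states in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReplacementState {
    Running,
    Driving,
    ReplacementDone,
}

/// Hypothetical summary of what Nexus observes about an earlier repair step.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Observation {
    /// No repair has been initiated yet.
    NothingStarted,
    /// A previously initiated repair is still in progress.
    StillRunning,
    /// A previously initiated repair was interrupted (assumed case).
    Interrupted,
    /// An Upstairs reported the repair as completed or not necessary.
    DoneOrNotNeeded,
}

/// Running -> Driving acts like a lock: any other starting state is refused,
/// which is what blocks out a second invocation of the same saga.
fn begin_drive(state: ReplacementState) -> Option<ReplacementState> {
    match state {
        ReplacementState::Running => Some(ReplacementState::Driving),
        _ => None,
    }
}

/// Whether Nexus needs to prepare and execute a new repair action.
fn action_required(obs: Observation) -> bool {
    matches!(obs, Observation::NothingStarted | Observation::Interrupted)
}

/// The state the request is left in when the saga ends.
fn end_drive(obs: Observation, action_result: Result<(), String>) -> ReplacementState {
    match (obs, action_result) {
        // An error unwinds the saga back to Running.
        (_, Err(_)) => ReplacementState::Running,
        // A completed or unnecessary repair moves the request forward.
        (Observation::DoneOrNotNeeded, Ok(())) => ReplacementState::ReplacementDone,
        // Otherwise the request returns to Running for the next invocation.
        (_, Ok(())) => ReplacementState::Running,
    }
}
```

In this sketch every exit path except a reported completion lands back in Running, which is why a periodic background task is needed to keep invoking the saga.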

The background task responsible for triggering the drive saga will also
scan for notifications of a successful live repair or reconciliation,
and transition region replacement requests to ReplacementDone if it sees
one.

If a region replacement request is in state ReplacementDone, the _region
replacement finish_ saga is triggered, which will:

- transition a request into Completing

- delete the old region by deleting a transient Volume that refers to it

- transition the request to the Complete state
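
As a rough sketch of how the background task might pick a saga per request, and of the finish saga's transitions listed above: the names `ReplacementState`, `SagaToStart`, `saga_for`, and `finish_next` are hypothetical, and triggering the drive saga only for requests in the Running state is an assumption inferred from the description rather than something the commit message states outright.

```rust
/// Hypothetical request states, named after the states in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReplacementState {
    Running,
    Driving,
    ReplacementDone,
    Completing,
    Complete,
}

/// Which saga the background task would start for a request (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SagaToStart {
    Drive,
    Finish,
}

/// Decide what to trigger for a request in the given state.
fn saga_for(state: ReplacementState) -> Option<SagaToStart> {
    match state {
        // Assumed: requests waiting in Running get driven forward.
        ReplacementState::Running => Some(SagaToStart::Drive),
        // A finished repair hands the request to the finish saga.
        ReplacementState::ReplacementDone => Some(SagaToStart::Finish),
        // In-flight or finished requests are left alone.
        ReplacementState::Driving
        | ReplacementState::Completing
        | ReplacementState::Complete => None,
    }
}

/// The finish saga's transitions: ReplacementDone -> Completing -> Complete.
/// The old region is deleted between the two transitions.
fn finish_next(state: ReplacementState) -> Option<ReplacementState> {
    match state {
        ReplacementState::ReplacementDone => Some(ReplacementState::Completing),
        ReplacementState::Completing => Some(ReplacementState::Complete),
        _ => None,
    }
}
```

Returning `None` for the intermediate states mirrors the "blocking out other invocations" idea: a request already in Driving or Completing is owned by a saga that is still running.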

For the entire region replacement work, testing was done manually in
the Canada region with the following test cases:

- a disk needing repair is attached to an instance for the duration of
the repair

- a disk needing repair is attached to an instance that is migrated
mid-repair

- a disk needing repair is attached to an instance that is stopped
mid-repair

- a disk needing repair is attached to an instance that is stopped
mid-repair, then started in the middle of the pantry's repair

- a detached disk needs repair

- a detached disk needs repair, and is then attached to an instance that
is then started

- a sled is expunged, causing region replacement requests for all
regions on it

Fixes #3886
Fixes #5191
jmpesp authored Jun 25, 2024
1 parent 01d8b37 commit faa518c
Showing 20 changed files with 3,077 additions and 25 deletions.
33 changes: 33 additions & 0 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -28,6 +28,7 @@ use nexus_client::types::SledSelector;
use nexus_client::types::UninitializedSledId;
use nexus_db_queries::db::lookup::LookupPath;
use nexus_types::deployment::Blueprint;
use nexus_types::internal_api::background::RegionReplacementDriverStatus;
use nexus_types::inventory::BaseboardId;
use omicron_uuid_kinds::CollectionUuid;
use omicron_uuid_kinds::GenericUuid;
@@ -1049,6 +1050,38 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
);
}
};
} else if name == "region_replacement_driver" {
match serde_json::from_value::<RegionReplacementDriverStatus>(
details.clone(),
) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),

Ok(status) => {
println!(
" number of region replacement drive sagas started ok: {}",
status.drive_invoked_ok.len()
);
for line in &status.drive_invoked_ok {
println!(" > {line}");
}

println!(
" number of region replacement finish sagas started ok: {}",
status.finish_invoked_ok.len()
);
for line in &status.finish_invoked_ok {
println!(" > {line}");
}

println!(" number of errors: {}", status.errors.len());
for line in &status.errors {
println!(" > {line}");
}
}
};
} else {
println!(
"warning: unknown background task: {:?} \
12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -110,6 +110,10 @@ task: "region_replacement"
detects if a region requires replacing and begins the process


task: "region_replacement_driver"
drive region replacements forward to completion


task: "service_firewall_rule_propagation"
propagates VPC firewall rules for Omicron services with external network
connectivity
@@ -234,6 +238,10 @@ task: "region_replacement"
detects if a region requires replacing and begins the process


task: "region_replacement_driver"
drive region replacements forward to completion


task: "service_firewall_rule_propagation"
propagates VPC firewall rules for Omicron services with external network
connectivity
@@ -345,6 +353,10 @@ task: "region_replacement"
detects if a region requires replacing and begins the process


task: "region_replacement_driver"
drive region replacements forward to completion


task: "service_firewall_rule_propagation"
propagates VPC firewall rules for Omicron services with external network
connectivity
13 changes: 13 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -311,6 +311,10 @@ task: "region_replacement"
detects if a region requires replacing and begins the process


task: "region_replacement_driver"
drive region replacements forward to completion


task: "service_firewall_rule_propagation"
propagates VPC firewall rules for Omicron services with external network
connectivity
@@ -505,6 +509,15 @@ task: "region_replacement"
number of region replacements started ok: 0
number of region replacement start errors: 0

task: "region_replacement_driver"
configured period: every 30s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
started at <REDACTED TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
number of region replacement drive sagas started ok: 0
number of region replacement finish sagas started ok: 0
number of errors: 0

task: "service_firewall_rule_propagation"
configured period: every 5m
currently executing: no
18 changes: 17 additions & 1 deletion nexus-config/src/nexus_config.rs
@@ -373,8 +373,10 @@ pub struct BackgroundTaskConfig {
pub bfd_manager: BfdManagerConfig,
/// configuration for the switch port settings manager task
pub switch_port_settings_manager: SwitchPortSettingsManagerConfig,
/// configuration for region replacement task
/// configuration for region replacement starter task
pub region_replacement: RegionReplacementConfig,
/// configuration for region replacement driver task
pub region_replacement_driver: RegionReplacementDriverConfig,
/// configuration for instance watcher task
pub instance_watcher: InstanceWatcherConfig,
/// configuration for service VPC firewall propagation task
@@ -564,6 +566,14 @@ pub struct AbandonedVmmReaperConfig {
pub period_secs: Duration,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct RegionReplacementDriverConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,
}

/// Configuration for a nexus server
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct PackageConfig {
@@ -801,6 +811,7 @@ mod test {
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
region_replacement_driver.period_secs = 30
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300
v2p_mapping_propagation.period_secs = 30
@@ -935,6 +946,10 @@ mod test {
region_replacement: RegionReplacementConfig {
period_secs: Duration::from_secs(30),
},
region_replacement_driver:
RegionReplacementDriverConfig {
period_secs: Duration::from_secs(30),
},
instance_watcher: InstanceWatcherConfig {
period_secs: Duration::from_secs(30),
},
@@ -1015,6 +1030,7 @@ mod test {
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
region_replacement_driver.period_secs = 30
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300
v2p_mapping_propagation.period_secs = 30
1 change: 1 addition & 0 deletions nexus/examples/config.toml
@@ -114,6 +114,7 @@ blueprints.period_secs_collect_crdb_node_ids = 180
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
region_replacement_driver.period_secs = 10
# How frequently to query the status of active instances.
instance_watcher.period_secs = 30
service_firewall_propagation.period_secs = 300
48 changes: 37 additions & 11 deletions nexus/src/app/background/init.rs
@@ -21,6 +21,7 @@ use super::nat_cleanup;
use super::phantom_disks;
use super::physical_disk_adoption;
use super::region_replacement;
use super::region_replacement_driver;
use super::service_firewall_rules;
use super::sync_service_zone_nat::ServiceZoneNatTracker;
use super::sync_switch_configuration::SwitchPortSettingsManager;
@@ -103,6 +104,9 @@ pub struct BackgroundTasks {
/// begins the process
pub task_region_replacement: common::TaskHandle,

/// task handle for the task that drives region replacements forward
pub task_region_replacement_driver: common::TaskHandle,

/// task handle for the task that polls sled agents for instance states.
pub task_instance_watcher: common::TaskHandle,

@@ -395,6 +399,26 @@ impl BackgroundTasks {
task
};

// Background task: drive region replacements forward to completion
let task_region_replacement_driver = {
let detector =
region_replacement_driver::RegionReplacementDriver::new(
datastore.clone(),
saga_request.clone(),
);

let task = driver.register(
String::from("region_replacement_driver"),
String::from("drive region replacements forward to completion"),
config.region_replacement_driver.period_secs,
Box::new(detector),
opctx.child(BTreeMap::new()),
vec![],
);

task
};

let task_instance_watcher = {
let watcher = instance_watcher::InstanceWatcher::new(
datastore.clone(),
@@ -412,6 +436,7 @@ impl BackgroundTasks {
vec![],
)
};

// Background task: service firewall rule propagation
let task_service_firewall_propagation = driver.register(
String::from("service_firewall_rule_propagation"),
@@ -429,17 +454,17 @@ impl BackgroundTasks {

// Background task: abandoned VMM reaping
let task_abandoned_vmm_reaper = driver.register(
String::from("abandoned_vmm_reaper"),
String::from(
"deletes sled reservations for VMMs that have been abandoned by their instances",
),
config.abandoned_vmm_reaper.period_secs,
Box::new(abandoned_vmm_reaper::AbandonedVmmReaper::new(
datastore,
)),
opctx.child(BTreeMap::new()),
vec![],
);
String::from("abandoned_vmm_reaper"),
String::from(
"deletes sled reservations for VMMs that have been abandoned by their instances",
),
config.abandoned_vmm_reaper.period_secs,
Box::new(abandoned_vmm_reaper::AbandonedVmmReaper::new(
datastore,
)),
opctx.child(BTreeMap::new()),
vec![],
);

BackgroundTasks {
driver,
@@ -462,6 +487,7 @@ impl BackgroundTasks {
task_switch_port_settings_manager,
task_v2p_manager,
task_region_replacement,
task_region_replacement_driver,
task_instance_watcher,
task_service_firewall_propagation,
task_abandoned_vmm_reaper,
1 change: 1 addition & 0 deletions nexus/src/app/background/mod.rs
@@ -23,6 +23,7 @@ mod networking;
mod phantom_disks;
mod physical_disk_adoption;
mod region_replacement;
mod region_replacement_driver;
mod service_firewall_rules;
mod status;
mod sync_service_zone_nat;