sled expungement must clean up failed instances #5553

davepacheco · 2024-04-17T20:44:29Z

Depends on #4872. Once we have #4872, we should use that mechanism to deal with any instances on sleds that have been expunged. (I think #4872 is not happening all that soon, so in the meantime, instances running on sleds that get expunged will be stuck similar to what happens today if a sled reboots -- see #3633.)

I'm assuming this will clean up both vmm and sled_resource rows for the sled. If not, we'll need to do that separately.

The text was updated successfully, but these errors were encountered:

As of #6455, instances in the `Failed` state are allowed to be restarted. As described in [RFD 486 § 4.3], Nexus should automatically restart such instances when the instance record's `auto_restart_policy` field (née `boot_on_fault`, see #6499) indicates that it is permitted to do so. This branch implements this behavior, by adding a new `instance_reincarnation` RPW. This RPW consists of a background task which queries CRDB for instances in the `Failed` state which are eligible to be automatically restarted, and starts a new `instance-start` saga for any instances it discovers. Instances are considered eligible for reincarnation if all of the following conditions are true: - **The instance is in the `InstanceState::Failed` state.** Meaning, it must have *no active VMM* --- if the instance is in the `InstanceState::Vmm` state with an active VMM which is in `VmmState::Failed`, that indicates no instance-update saga has run to clean up the old VMM's resources. In this case, we must wait until the instance record itself has transitioned to `Failed` before attempting to restart it. - **The instance's `auto_restart_policy` permits it to be restarted.** The `auto_restart_policy` enum is changed from representing the classess of failures on which an instance may be restarted (`never`, `sled_failures_only`, and `all_failures`) was changed to a more general quality-of-service for automatic restarts. Presently, this can either be `never`, meaning that the instance can never be restarted automatically; or `best_effort`, meaning that the control plane should try to keep the instance running but is permitted to not restart it if necessary to preserve the stability of the whole system. The default policy for instances which don't provide one is `best_effort`. - **A cooldown period has elapsed since the instance's last automatic restart.**. In order to prevent instances which fail frequently from compromising the reliability of the system as a whole, we enforce a cooldown period between automatic restarts. This is tracked by setting a last-auto-restart timestamp in the `instance` record. At present, the cooldown period is always one hour unless the instance record overrides it, which currently can only be done using a test-only Nexus method. In the future, we may consider allowing the cooldown period to be configured by users, but because it presents a denial-of-service risk, setting it should probably be a more priveliged operation than creating an instance. The implementation of the actual reincarnation background task is relatively straightforward: it runs a new `DataStore::find_reincarnatable_instances` query, which returns instances in the `Failed` state, and it spawns `instance-start` sagas for any instances which satisfy the above conditions. The restart cooldown is implemented by adding a new `time_last_auto_restarted` field to the `instance` table and to the `InstanceRuntimeState` type. We add a `Reason` enum to the `instance-start` saga's `Params` that indicates why the instance is being started, and if it is `Reason::AutoRestart`, then the start saga will set the `time_last_auto_restarted` timestamp when moving the instance record to the `Starting` state. Additionally, I've added the `auto_restart_policy` field to the `params::InstanceCreate` type in the Nexus API so that instances can be created with auto-restart policies. Reviewers should note that many of the files changed in this diff were only modified by adding that field to a `params::InstanceCreate` literal. This branch adds a minimal, initial implementation of instance auto-restarting. [RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances Fixes #5553

davepacheco mentioned this issue Apr 17, 2024

what's involved with removing a sled? #4787

Closed

davepacheco mentioned this issue Sep 12, 2024

[nexus] Reincarnate Failed instances #6503

Merged

hawkw closed this as completed in #6503 Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sled expungement must clean up failed instances #5553

sled expungement must clean up failed instances #5553

davepacheco commented Apr 17, 2024 •

edited

Loading

sled expungement must clean up failed instances #5553

sled expungement must clean up failed instances #5553

Comments

davepacheco commented Apr 17, 2024 • edited Loading

davepacheco commented Apr 17, 2024 •

edited

Loading