Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled expungement must clean up failed instances #5553

Closed
davepacheco opened this issue Apr 17, 2024 · 0 comments · Fixed by #6503
Closed

sled expungement must clean up failed instances #5553

davepacheco opened this issue Apr 17, 2024 · 0 comments · Fixed by #6503

Comments

@davepacheco
Copy link
Collaborator

davepacheco commented Apr 17, 2024

Depends on #4872. Once we have #4872, we should use that mechanism to deal with any instances on sleds that have been expunged. (I think #4872 is not happening all that soon, so in the meantime, instances running on sleds that get expunged will be stuck similar to what happens today if a sled reboots -- see #3633.)

I'm assuming this will clean up both vmm and sled_resource rows for the sled. If not, we'll need to do that separately.

hawkw added a commit that referenced this issue Sep 23, 2024
As of #6455, instances in the `Failed` state are allowed to be
restarted. As described in [RFD 486 § 4.3], Nexus should automatically
restart such instances when the instance record's `auto_restart_policy`
field (née `boot_on_fault`, see #6499) indicates that it is permitted to
do so. This branch implements this behavior, by adding a new
`instance_reincarnation` RPW. This RPW consists of a background task
which queries CRDB for instances in the `Failed` state which are
eligible to be automatically restarted, and starts a new
`instance-start` saga for any instances it discovers.

Instances are considered eligible for reincarnation if all of the
following conditions are true:

- **The instance is in the `InstanceState::Failed` state.** Meaning, it
  must have *no active VMM* --- if the instance is in the
  `InstanceState::Vmm` state with an active VMM which is in
  `VmmState::Failed`, that indicates no instance-update saga has run to
  clean up the old VMM's resources. In this case, we must wait until the
  instance record itself has transitioned to `Failed` before attempting
  to restart it.
- **The instance's `auto_restart_policy` permits it to be restarted.**
  The `auto_restart_policy` enum is changed from representing the
  classess of failures on which an instance may be restarted (`never`,
  `sled_failures_only`, and `all_failures`) was changed to a more
  general quality-of-service for automatic restarts. Presently, this can
  either be `never`, meaning that the instance can never be restarted
  automatically; or `best_effort`, meaning that the control plane should
  try to keep the instance running but is permitted to not restart it if
  necessary to preserve the stability of the whole system. The default
  policy for instances which don't provide one is `best_effort`.
- **A cooldown period has elapsed since the instance's last automatic
  restart.**. In order to prevent instances which fail frequently from
  compromising the reliability of the system as a whole, we enforce a
  cooldown period between automatic restarts. This is tracked by setting
  a last-auto-restart timestamp in the `instance` record. At present,
  the cooldown period is always one hour unless the instance record
  overrides it, which currently can only be done using a test-only Nexus
  method. In the future, we may consider allowing the cooldown period to
  be configured by users, but because it presents a denial-of-service
  risk, setting it should probably be a more priveliged operation than
  creating an instance.

The implementation of the actual reincarnation background task is
relatively straightforward: it runs a new
`DataStore::find_reincarnatable_instances` query, which returns
instances in the `Failed` state, and it spawns `instance-start` sagas
for any instances which satisfy the above conditions. The restart
cooldown is implemented by adding a new `time_last_auto_restarted` field
to the `instance` table and to the `InstanceRuntimeState` type. We add a
`Reason` enum to the `instance-start` saga's `Params` that indicates why
the instance is being started, and if it is `Reason::AutoRestart`, then
the start saga will set the `time_last_auto_restarted` timestamp when
moving the instance record to the `Starting` state. Additionally, I've
added the `auto_restart_policy` field to the `params::InstanceCreate`
type in the Nexus API so that instances can be created with auto-restart
policies. Reviewers should note that many of the files changed in this
diff were only modified by adding that field to a
`params::InstanceCreate` literal.

This branch adds a minimal, initial implementation of instance
auto-restarting.

[RFD 486 § 4.3]: https://rfd.shared.oxide.computer/rfd/486#_instances

Fixes #5553
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant