
Tracking: Instance Lifecycle Overhaul #3742

Open
6 of 13 tasks
smklein opened this issue Jul 21, 2023 · 3 comments
Labels
nexus Related to nexus Sled Agent Related to the Per-Sled Configuration and Management virtualization Propolis Integration & VM Management

smklein commented Jul 21, 2023

  • Updating Instance State Information within Nexus
    • "Sled Agent registering itself with Nexus" should also transfer information about the instances the sled agent knows about; this can start as an empty set. See #3633 (restart customer Instances after sled reboot) for much more detail.
    • The Sled Agent should refuse to handle instance requests until it has successfully registered itself with Nexus. This avoids a race where Nexus sends a request to a rebooting sled at the same time as the sled registers with Nexus and reports "all instances are dead now", inadvertently marking a brand-new instance as failed.
    • Nexus should look up all instances that should have been running on the sled and mark them failed.
    • Later Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault", and re-provision them in the background.
    • Idea: We could plausibly update the "normal" instance provisioning workflow to rely on this RPW for provisioning, too. This would let "instance create" return much faster, leaving the work of finding an appropriate sled and starting the instance to a background task that could tolerate slower APIs to the backend.
    • Ensuring metric registration: As part of the above RPW, we should also ensure that running instances have an assignment to an oximeter collector recorded in the omicron.public.metric_producer table. When instances are stopped, that assignment needs to be removed by the cleanup portion of the same RPW.
  • Instances without Sleds
    • We need to make it possible for Instances to not have a propolis ID / sled ID, in the case that they are stopped.
    • We also have cleanup to do: the virtual resources consumed by an instance should be released when the instance is stopped but not deleted.
  • Handling Failed Instances
    • Confirm that instances can be forcefully deleted after being marked failed.
    • Plumb the sled agent API @gjcolombo mentioned for "force-stopping an instance" through the public-facing API for this failed case, to ensure that the instance is truly destroyed.
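The selection step of the auto-restart RPW described above could be sketched roughly as follows. This is an illustrative sketch only: the `Instance` struct and the `auto_boot_on_fault` field follow the wording in this issue, not Omicron's actual data model, and a real implementation would query CRDB rather than filter an in-memory list.

```rust
// Illustrative sketch of one pass of the reprovisioning RPW described
// above. Types and field names are hypothetical, not Omicron's real model.

#[derive(Debug, Clone, Copy, PartialEq)]
enum InstanceState {
    Running,
    Stopped,
    Failed,
}

#[derive(Debug, Clone)]
struct Instance {
    name: String,
    state: InstanceState,
    auto_boot_on_fault: bool,
}

/// One pass of the background task: pick out instances that are
/// Failed *and* opted in to automatic restart.
fn restart_candidates(instances: &[Instance]) -> Vec<&Instance> {
    instances
        .iter()
        .filter(|i| i.state == InstanceState::Failed && i.auto_boot_on_fault)
        .collect()
}

fn main() {
    let instances = vec![
        Instance { name: "web".into(), state: InstanceState::Failed, auto_boot_on_fault: true },
        Instance { name: "db".into(), state: InstanceState::Failed, auto_boot_on_fault: false },
        Instance { name: "cache".into(), state: InstanceState::Running, auto_boot_on_fault: true },
    ];
    // Only "web" is both Failed and opted in to auto-restart.
    for i in restart_candidates(&instances) {
        println!("would reprovision: {}", i.name);
    }
}
```

If the "normal" provisioning path were also routed through this RPW (per the idea above), the same pass would additionally pick up newly created instances awaiting placement.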
@smklein smklein added Sled Agent Related to the Per-Sled Configuration and Management nexus Related to nexus virtualization Propolis Integration & VM Management labels Jul 21, 2023
@gjcolombo
Contributor

#2315 also tracks the "instances without sleds" work. It probably depends on #2824, since starting an instance with no resource reservation is a multi-step process.

Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault", and re-provision them in the background.

If we use the existing Failed state for this, we'll need to make sure that

  • instances we've marked as Failed and are restarting have definitely been torn down (e.g. what happens if a Nexus-to-sled-agent call fails and causes Nexus to move an instance to the Failed state, but the problem was transient and the instance is actually alive?)
  • instances we've marked as Failed have some prospect of being recovered (e.g. suppose an instance is Failed because Propolis's startup sequence failed, and the problem is persistent; will the RPW constantly try and fail to start the VM?)

We might decide to have different failure reasons to help us distinguish these cases.
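The "different failure reasons" idea might look roughly like the following. This is a hypothetical sketch, not an actual Omicron type: the variant names and the retry-cap policy are invented here to show how distinct reasons could gate the restart RPW differently.

```rust
// Hypothetical sketch of distinguishing failure reasons, per the
// suggestion above. Not Omicron's actual data model.

#[derive(Debug, Clone, Copy, PartialEq)]
enum FailureReason {
    /// A Nexus-to-sled-agent call failed; the VM may actually still be
    /// alive, so teardown must be confirmed before any restart.
    SledUnreachable,
    /// Propolis reported a startup error; possibly persistent, so blind
    /// retries could loop forever without a cap.
    VmmStartupFailed,
    /// The sled rebooted and the VMM is definitely gone.
    SledRebooted,
}

/// Whether the restart RPW should consider this instance at all.
fn eligible_for_auto_restart(reason: FailureReason, retries_so_far: u32, max_retries: u32) -> bool {
    match reason {
        // Teardown must be confirmed out-of-band first; skip here.
        FailureReason::SledUnreachable => false,
        // Retry, but with a cap so a persistent failure doesn't spin.
        FailureReason::VmmStartupFailed => retries_so_far < max_retries,
        // Known-clean failure: always safe to reprovision.
        FailureReason::SledRebooted => true,
    }
}

fn main() {
    assert!(eligible_for_auto_restart(FailureReason::SledRebooted, 0, 3));
    assert!(!eligible_for_auto_restart(FailureReason::SledUnreachable, 0, 3));
    assert!(!eligible_for_auto_restart(FailureReason::VmmStartupFailed, 3, 3));
    println!("ok");
}
```

This addresses both concerns above: a `SledUnreachable` instance is never blindly restarted (it may still be alive), and a persistently failing `VmmStartupFailed` instance stops being retried once it hits the cap.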


smklein commented Jul 25, 2023

See also: #2825


hawkw commented Sep 24, 2024

Most of the stuff described in "Updating Instance State Within Nexus" was implemented in a combination of #5611, #5759, and #6503. The proactive registration of sled-agents with Nexus isn't something we've done yet.

@morlandi7 morlandi7 modified the milestones: 11, 12 Sep 26, 2024
@morlandi7 morlandi7 modified the milestones: 12, 13 Nov 26, 2024