Split instance state into Instance and VMM tables (#4194)
Refactor the definition of an `Instance` throughout the control plane so that an `Instance` is separate from the `Vmm`s that incarnate it. This confers several advantages:

- VMMs have their own state that sled agent can update without necessarily changing their instance's state. It's also possible to change an instance's active Propolis ID without having to know or update its Propolis IP or current sled ID, since those change along with the active Propolis ID. This removes a great deal of complexity in sled agent, especially when live migrating an instance, and also simplifies the live migration saga considerably.
- Resource reservations for instances have much clearer lifetimes: a reservation can be released when its VMM has moved to a terminal state. Nexus no longer has to infer VMM lifetimes from changes to an instance's Propolis ID columns.
- It's now possible for an Instance not to have an active Propolis at all! This allows an instance to avoid reserving sled resources when it's not running. It also allows an instance to stop and restart on a different sled.
- It's also possible to get a history of an instance's VMMs for, e.g., zone bundle examination purposes ("my VMM had a problem two days ago but it went away when I stopped and restarted it; can you investigate?").

Rework callers throughout Nexus that depend on knowing an instance's current state and/or its current sled ID. In many cases (e.g. disk and NIC attach and detach), the relevant question is whether the instance has an active Propolis; for simplicity, these checks now look for an active Propolis ID instead of trying to fetch both instance and VMM states. A rough sketch of the split model appears after the list below.

## Known issues/remaining work

- The virtual provisioning table is still updated only at instance creation/deletion time. Usage metrics that depend on this table might report strange and wonderful values if a user creates many more instances than can be started at one time.
- Instances still use the generic "resource attachment" CTE to manage attaching and detaching disks. Previously these queries looked at instance states; now they look at an instance's state and whether it has an active Propolis, but not at the active Propolis's state. This will need to be revisited in the future to support disk hotplug.
- `handle_instance_put_result` is still very aggressive about setting instances to the Failed state if sled agent returns errors other than invalid-request-flavored errors. I think we should reconsider this behavior, but this change is big enough as it is. I will file a TODO for this and update the new comments accordingly before this merges.
- The new live migration logic is not tested yet and differs from the "two-table" TLA+ model in RFD 361. More work will be needed here before we can declare live migration fully ready for self-hosting.
- It would be nice to have an `omdb vmm` command; for now I've just updated existing `omdb` commands to deal with the optionality of Propolises and sleds.
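To make the shape of the split concrete, here is a minimal, hypothetical sketch of the two-record model described above. The names (`Vmm`, `VmmState`, `active_propolis_id`, `has_active_propolis`) are illustrative assumptions for this writeup, not the actual Diesel models or APIs introduced by this change.

```rust
// Illustrative sketch only; field and type names do not mirror the real
// Nexus database models.

use std::net::SocketAddr;
use uuid::Uuid; // assumes the `uuid` crate is available

/// States a VMM (a Propolis process hosting an instance) can be in.
#[derive(Clone, Copy, PartialEq, Eq)]
enum VmmState {
    Starting,
    Running,
    Stopping,
    Destroyed, // terminal: the sled resource reservation can be released
}

/// One incarnation of an instance on a particular sled. Sled agent can
/// update this record without touching the owning instance's record.
struct Vmm {
    id: Uuid,
    instance_id: Uuid,
    sled_id: Uuid,
    propolis_addr: SocketAddr,
    state: VmmState,
}

/// The instance itself no longer carries a sled ID or Propolis IP; it only
/// points (optionally) at its active VMM.
struct Instance {
    id: Uuid,
    active_propolis_id: Option<Uuid>,
}

impl Instance {
    /// The simplified check used for operations like disk and NIC
    /// attach/detach: does the instance currently have an active Propolis?
    fn has_active_propolis(&self) -> bool {
        self.active_propolis_id.is_some()
    }
}

impl Vmm {
    /// A reservation's lifetime follows the VMM: it can be released once
    /// the VMM reaches a terminal state.
    fn can_release_reservation(&self) -> bool {
        self.state == VmmState::Destroyed
    }
}
```

The key point of the sketch is the `Option` on `active_propolis_id`: an instance with no active Propolis is now directly representable, which is what lets a stopped instance hold no sled resources and restart on a different sled.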
Tests:

- Unit/integration tests
- On a single-machine dev cluster, created two instances and verified that:
  - The instances only have resource reservations while they're running (and they reserve reservoir space now)
  - The instances can reach each other over their internal and external IPs when they're both running (and can still reach each other even if you try to delete one while it's active)
  - `scadm` shows the appropriate IP mappings being added/deleted as the instances start/stop
  - The instances' serial consoles work as expected
  - Attaching a new disk to an instance is only possible if the instance is stopped
  - Disk snapshot succeeds when invoked on a running instance's attached disk
  - Deleting an instance detaches its disks
- `omicron-stress` on a single-machine dev cluster ran for about an hour and created ~800 instances without any instances going to the Failed state (previously this would happen in the first 5-10 minutes)