-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race between start saga, start saga unwind, and sled agent instance teardown can cause provisioning counter underflow #5042
Comments
Between #5830 and #5749 I believe this is fixed (virtual provisioning counter deletions are idempotent; if an instance start saga unwinds due to a timeout, a second saga starts the instance, and the first attempt's sled then reports a Propolis state change, the instance update saga will ignore it, since the instance now points to the second VMM). |
hawkw
added a commit
that referenced
this issue
Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have been observed. These include: - Instances getting "stuck" in a transient state, such as `Starting` or `Stopping`, with no way to forcibly terminate them (#4004) - Race conditions between instances starting and receiving state updates, which cause provisioning counters to underflow (#5042) - Instances entering and exiting the `Failed` state when nothing is actually wrong with them, potentially leaking virtual resources (#4226) These typically require support intervention to resolve. Broadly , these issues exist because the control plane's current mechanisms for understanding and managing an instance's lifecycle state machine are "kind of a mess". In particular: - **(Conceptual) ownership of the CRDB `instance` record is currently split between Nexus and sled-agent(s).** Although Nexus is the only entity that actually reads or writes to the database, the instance's runtime state is also modified by the sled-agents that manage its active Propolis (and, if it's migrating, it's target Propolis), and written to CRDB on their behalf by Nexus. This means that there are multiple copies of the instance's state in different places at the same time, which can potentially get out of sync. When an instance is migrating, its state is updated by two different sled-agents, and they may potentially generate state updates that conflict with each other. And, splitting the responsibility between Nexus and sled-agent makes the code more complex and harder to understand: there is no one place where all instance state machine transitions are performed. - **Nexus doesn't ensure that instance state updates are processed reliably.** Instance state transitions triggered by user actions, such as `instance-start` and `instance-delete`, are performed by distributed sagas, ensuring that they run to completion even if the Nexus instance executing them comes to an untimely end. This is *not* the case for operations that result from instance state transitions reported by sled-agents, which just happen in the HTTP APIs for reporting instance states. If the Nexus processing such a transition crashes, gets network partition'd, or encountering a transient error, the instance is left in an incomplete state and the remainder of the operation will not be performed. This branch rewrites much of the control plane's instance state management subsystem to resolve these issues. At a high level, it makes the following high-level changes: - **Nexus is now the sole owner of the `instance` record.** Sled-agents no longer have their own copies of an instance's `InstanceRuntimeState`, and do not generate changes to that state when reporting instance observations to Nexus. Instead, the sled-agent only publishes updates to the `vmm` and `migration` records (which are never modified by Nexus directly) and Nexus is the only entity responsible for determining how an instance's state should change in response to a VMM or migration state update. - **When an instance has an active VMM, its effective external state is determined primarily by the active `vmm` record**, so that fewer state transitions *require* changes to the `instance` record. PR #5854 laid the ground work for this change, but it's relevant here as well. - **All updates to an `instance` record (and resources conceptually owned by that instance) are performed by a distributed saga.** I've introduced a new `instance-update` saga, which is responsible for performing all changes to the `instance` record, virtual provisioning resources, and instance network config that are performed as part of a state transition. Moving this to a saga helps us to ensure that these operations are always run to completion, even in the event of a sudden Nexus death. - **Consistency of instance state changes is ensured by distributed locking.** State changes may be published by multiple sled-agents to different Nexus replicas. If one Nexus replica is processing a state change received from a sled-agent, and then the instance's state changes again, and the sled-agent publishes that state change to a *different* Nexus...lots of bad things can happen, since the second state change may be performed from the previous initial state, when it *should* have a "happens-after" relationship with the other state transition. And, some operations may contradict each other when performed concurrently. To prevent these race conditions, this PR has the dubious honor of using the first _distributed lock_ in the Oxide control plane, the "instance updater lock". I introduced the locking primitives in PR #5831 --- see that branch for more discussion of locking. - **Background tasks are added to prevent missed updates**. To ensure we cannot accidentally miss an instance update even if a Nexus dies, hits a network partition, or just chooses to eat the state update accidentally, we add a new `instance-updater` background task, which queries the database for instances that are in states that require an update saga without such a saga running, and starts the requisite sagas. Currently, the instance update saga runs in the following cases: - An instance's active VMM transitions to `Destroyed`, in which case the instance's virtual resources are cleaned up and the active VMM is unlinked. - Either side of an instance's live migration reports that the migration has completed successfully. - Either side of an instance's live migration reports that the migration has failed. The inner workings of the instance-update saga itself is fairly complex, and has some kind of interesting idiosyncrasies relative to the existing sagas. I've written up a [lengthy comment] that provides an overview of the theory behind the design of the saga and its principles of operation, so I won't reproduce that in this commit message. [lengthy comment]: https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Summary
There appears to be an extremely subtle race between the instance start saga, its unwind path, and the instance cleanup logic in sled agent that can cause an instance's virtual provisioning counters to underflow. It goes something like this:
setup_propolis_locked
) and telling it to create a VM (ensure_propolis_and_tasks
)ensure_unregistered
call to S1Destroyed
, then returns to NexusInstanceInner::observe_state
to apply its state observation from step 12. This is a no-op because the instance's zone was destroyed in step 13. However, the monitoring task still notifies Nexus of the state change!Instance::notify_instance_updated
processes this update; it callsvirtual_provisioning_collection_delete_instance
againvirtual_provisioning_resource
entry for the instance (it was created in step 16), but does not delete it, because the notify call supplied instance generation 3 and the instance was already advanced to generation 5. However, this doesn't suffice to keep thevirtual_provisioning_collection
counters from being modified, for reasons explained below.And there you have it: the same VMM's provisioning counters were removed twice (step 9 and step 20), causing underflow when the project's last instance is stopped.
Analysis, or, "how the heck did you come up with this?"
A dogfood user reported that creating an instance had failed with a mysterious "value is too small for a byte count" error message emanating from a saga node whose output was named "no_result". A quick look through the code reveals that this is the output name for the "allocate disk space" node in the disk create saga, which touches the virtual provisioning tables. As a quick recap:
virtual_provisioning_resource
table has one row for every "thing" in the system (running instance, disk, control plane service, etc.) that is consuming some resources (CPU, swap charges, reservoir, disk space) on some sled.virtual_provisioning_collection
table has one row for the fleet, each silo, and each project. Each of these rows tracks the total usage of the provisioning resources that belong to that fleet/silo/project.Most of the queries that touch the
virtual_provisioning_resource
tables return aVec<VirtualProvisioningCollection>
. This type includes aByteCount
(for memory and disk) that implsTryFrom<i64>
. The conversion fails if the byte count is negative. Looking at this table in dogfood reveals that the user's project has a problem:(Note that this failure occurs after the query has executed--it is a failure to convert the values returned from the database into Rust types, not a failure to execute the query itself!)
How did this underflow occur?
virtual_provisioning_collection_delete_instance
is a complex CTE that has three relevant parts:virtual_provisioning_collection
table entries. This is not caller-provided; it is executed whenever the first subquery returns "true," even if the second subquery decides that the request to delete the provisioning resource is too old and shouldn't be executed.This is a bug in
virtual_provisioning_collection_delete_instance
--both the removal of the resource and the update of the collections should be conditioned on the generation number check, not just the resource removal. Unfortunately the provisioning queries are not quite written to allow this (the caller can supply extra filters to subquery #2, but not to subquery #3).To show that this is in fact the problem, we can look at the history of the instances in this project, most of which are short-lived VMs that are used for Hubris builds. Here's a table showing VMMs that belonged to these instances, along with the instance creation times and the VMMs' creation and destruction times:
Before VMM
a07cf1db
is started, there are no active VMMs in this project; after it's started, there's at least one VMM until3b0e9b33
exits.Rummaging through the Nexus logs looking for state updates from these VMMs reveals some interesting facts. First, we can see where the underflow actually occurs: the last VMM to be destroyed (
3b0e9b33
) fails to update its state with an underflow error, then sends another update that doesn't hit that error:This instance may only be a victim, though: it was the last one to be cleaned up, so it's possible the accounting error occurred with a different VM and was only noticed here. Looking over the Nexus logs for the other VMs shows one that looks a little different from the others:
The "starting to destroyed" transition suggests that something went amiss with starting
3debbe6e
the first time: Propolis doesn't do that, so this is a sign that something may have gone wrong in the instance startup path. And, sure enough, we find that a start attempt for this instance failed and tried to unwind:Not too longer after this we see the instance being registered again as part of a new attempt to start:
Followed by a successful attempt to start it:
The second start attempt's virtual provisioning counters must have been acquired by 17:43:29.666Z (when the attempt to run the instance is logged). The retried attempt to send state from the old VMM lands at 17:43:43.424, which is after this but before the instance has actually started. This send has an obsolete instance generation, but (as described above) that's not enough to prevent the counters from being deleted. And there you have it!
The text was updated successfully, but these errors were encountered: