sled agent should terminate Propolis zones when Propolis indicates a previously-started VM has gone missing #3209
Maybe related to #4872?
Because the …
We discussed this at the 9/26/24 hypervisor huddle. The thing we want to get right here is that when a Propolis zone panics, sled agent should be able to discover as much and then should (a) move the VMM to Failed, (b) terminate the zone, and (c) ensure that a bundle is taken that includes logs and the panic core. (b) and (c) happen naturally as part of sled agent's existing "VMM terminated" logic, and (a) is trivial once the zone is gone (or, more precisely, once sled agent stops watching for updates from the relevant propolis-server). The main issue, then, is making sure that sled agent properly detects that Propolis was terminated and restarted. Fortunately, when a client asks Propolis to do something that requires an active VM, Propolis will return a unique "VM? what VM?" error code if no VM is present. This means that sled agent should be able to detect Propolis restarts: if it knows that it previously started a VM in a Propolis server, and now the server says there's no VM there, then the whole enterprise has failed and sled agent should clean up accordingly.
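To make that detection rule concrete, here is a minimal sketch (the type and function names are invented for illustration, and the exact string form of the error code is an assumption; this is not omicron's actual API):

```rust
/// Hypothetical classification of what sled agent should do after getting an
/// error response from propolis-server for a VM it previously started.
enum VmmStateAction {
    /// Propolis reports no VM where we know we started one: the zone was
    /// restarted out from under us. Mark the VMM Failed, take a zone bundle,
    /// and tear the zone down.
    MarkFailedAndTearDown,
    /// Any other error: surface it to the caller or retry as appropriate.
    Propagate,
}

fn classify_propolis_error(
    vm_previously_started: bool,
    // Error code parsed from the response body; "NoInstance" here stands in
    // for Propolis' "VM? what VM?" code and is an assumed spelling.
    error_code: Option<&str>,
) -> VmmStateAction {
    match (vm_previously_started, error_code) {
        (true, Some("NoInstance")) => VmmStateAction::MarkFailedAndTearDown,
        _ => VmmStateAction::Propagate,
    }
}
```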
Thanks @gjcolombo for renaming this one --- I think the new title is much better. I investigated what we're currently doing in sled-agent in #3238 (comment), and the TL;DR is "it's still pretty bad": the error code Propolis returns when it's been restarted is blindly passed up to Nexus (which isn't terrible, it's at least not a 500 anymore), but Nexus doesn't know what it means, and sled-agent doesn't actually clean up the zone in that case. So I'm gonna go fix that.
Good news: the propolis-server error code for this case already exists. Bad news: the sled-agent side doesn't handle it. The `InstanceMonitorRunner` task, which polls Propolis' instance-state-monitor endpoint, will just bail out on receipt of any error from Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would expect this to drop the send-side of the channel it uses to communicate with the `InstanceRunner`, closing the channel and hitting this select, which would correctly mark the VMM as failed and do what we want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is also holding its own clone of the channel sender, keeping us from ever hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance as soon as we hit any error here, which seems...quite bad. I think that this actually means that when a Propolis process crashes unexpectedly, we'll get an error when the TCP connection closes, bail out, and then never try hitting the instance monitor endpoint ever again[^1]. So, we won't notice that the Propolis is actually out to lunch until we try to ask it to change state. This Is Not Great! We should probably rework this so that the monitor task handles these errors instead of silently giving up.

[^1]: Although I've not actually verified that this is what happens.
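The channel problem described above can be illustrated with a standalone sketch (this is not the omicron code, just a minimal tokio example): because the receiving side keeps its own clone of the sender, the channel is never considered closed when the monitor task drops its sender, so `recv()` never returns `None` and the "channel closed" select arm never fires.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<&'static str>(8);

    // The receiving side also keeps a clone of the sender, analogous to
    // InstanceRunner holding a clone of its own request channel's sender.
    let tx_clone = tx.clone();

    // The monitor task owns `tx` and simply returns (dropping it) when it
    // bails out on an error from Propolis.
    let monitor = tokio::spawn(async move {
        let _tx = tx;
        // ...hit an error, give up, return; `_tx` is dropped here.
    });
    monitor.await.unwrap();

    // Because `tx_clone` is still alive, the channel never closes, so recv()
    // stays pending instead of returning None. A timeout shows that nothing
    // ever arrives.
    let poll = tokio::time::timeout(Duration::from_millis(100), rx.recv()).await;
    assert!(poll.is_err(), "recv() is still pending: channel not closed");

    // Only once the receiver's own clone is dropped does recv() observe
    // closure.
    drop(tx_clone);
    assert!(rx.recv().await.is_none());
}
```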
retagging this one as "bug" because it's turned out that this is kinda bad and i'm going to make fixing it my top priority.
At present, sled-agent's handling of the error codes used by Propolis to indicate that it has crashed and been restarted is woefully incorrect. In particular, there are two cases where such an error may be encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus encounters a client error from Propolis, it will forward that error code to Nexus...but _Nexus_ only moves instances to `Failed` in the face of sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the `NoInstance` error code from Propolis][1], which means that the ghosts left behind by crashed and restarted Propolii still won't be moved to `Failed`. Agh. Furthermore, in that case, sled-agent itself will not perform the necessary cleanup actions to deal with the now-defunct Propolis zone (e.g. collecting a zone bundle and then destroying the zone).

In the second case, where we hit the instance-state-monitor endpoint and get back a `NoInstance` error, things are equally dire. The `InstanceMonitorRunner` task, which is responsible for polling the state monitor endpoint, will just bail out on receipt of any error from Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to communicate with the `InstanceRunner`, closing the channel, and hitting this select, which would correctly mark the VMM as failed and do what we want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_ holding its own clone of the channel sender, keeping us from ever hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance as soon as we hit any error here, which seems...quite bad. I *think* that this actually means that when a Propolis process crashes unexpectedly, we'll get an error when the TCP connection closes, bail out, and then _never try hitting the instance monitor endpoint ever again_[^1]. So, we won't notice that the Propolis is actually out to lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases by making sled-agent actually handle Propolis' error codes correctly. I've added a dependency on the `propolis_api_types` crate, which is where the error code lives, and some code in sled-agent to attempt to parse these codes when receiving an error response from the Propolis client. Now, when we try to PUT a new state to the instance, we'll look at the error code that comes back, mark the VMM as `Failed` if the error indicates that we should do so, and publish the `Failed` VMM state to Nexus before tearing down the zone. The `InstanceMonitorTask` now behaves similarly, and I've changed it to retry all other errors with a backoff, rather than just bailing out immediately on receipt of the first error.

I've manually tested this on `london` as discussed here: #6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this with the current test situation, as none of this code is exercised by the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
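For illustration only (not the actual patch), the monitor-side behavior described above amounts to a loop like the following; the `PollOutcome` type and the specific backoff bounds are invented for this sketch:

```rust
use std::time::Duration;

/// Invented for this sketch: the outcome of one poll of Propolis'
/// instance-state-monitor endpoint, after trying to parse the error code out
/// of an error response.
enum PollOutcome {
    State(String),      // a state update to forward to the runner
    VmGone,             // Propolis says it has no VM: it was restarted under us
    OtherError(String), // transient/unknown error: retry with backoff
}

/// Retry loop sketch: terminal on `VmGone`, exponential backoff on anything
/// else, backoff reset on success.
async fn monitor_loop(mut poll: impl FnMut() -> PollOutcome) {
    let initial = Duration::from_millis(250);
    let max = Duration::from_secs(30);
    let mut backoff = initial;
    loop {
        match poll() {
            PollOutcome::State(state) => {
                backoff = initial; // healthy again; reset the backoff
                println!("observed state: {state}"); // forward to the runner
            }
            PollOutcome::VmGone => {
                // Publish Failed to Nexus and tear down the zone (done by the
                // runner in the real code), then stop polling for good.
                println!("VMM is gone; marking it Failed");
                return;
            }
            PollOutcome::OtherError(err) => {
                println!("transient error: {err}; retrying in {backoff:?}");
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(max);
            }
        }
    }
}
```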
An instance on rack2 appeared to be stuck because:

svc:/system/svc/restarter:default …

It isn't clear to me that restarting propolis-server is the right thing to do here. We should consider having the service go into maintenance and not restart when that happens.
Related: #2825