sled agent should terminate Propolis zones when Propolis indicates a previously-started VM has gone missing #3209

Closed
jordanhendricks opened this issue May 24, 2023 · 6 comments · Fixed by #6726

Labels: bug (Something that isn't working.) · known issue (To include in customer documentation and training) · Sled Agent (Related to the Per-Sled Configuration and Management) · virtualization (Propolis Integration & VM Management)

@jordanhendricks (Contributor) commented May 24, 2023

An instance on rack2 appeared to be stuck because:

  • propolis-server panicked
  • the propolis-server service was restarted by SMF (its restarter is svc:/system/svc/restarter:default)
  • sled-agent failed to notice that propolis-server had panicked (sled-agent failed to notice propolis-server panicked #3206), and the instance was still marked as running
  • a new propolis-server started, and the web console kept sending serial console requests to it, but it had no instance, so the console appeared to hang

It isn't clear to me that restarting propolis-server is the right thing to do here. We should consider having the service go into maintenance and not restart when that happens.

Related: #2825

@davepacheco (Collaborator) commented:

Maybe related to #4872?

@morlandi7 morlandi7 removed this from the 6 milestone Feb 20, 2024
@hawkw hawkw self-assigned this Feb 20, 2024
@hawkw (Member) commented Feb 20, 2024

Because the propolis-server process owns userspace VMM state that isn't really persisted anywhere, it's more or less impossible to recover a VM in a usable state after a restart, so restarting the propolis-server process is basically never correct. When a propolis-server comes back up, it won't know about the VM it was previously managing, and there's no way for it to resume managing that VM correctly, because a bunch of state that exists only in userspace is gone. @gjcolombo and I talked it over today and we agreed that we should change this to never restart propolis-server.

@askfongjojo askfongjojo added the known issue (To include in customer documentation and training) label Mar 9, 2024
@askfongjojo askfongjojo added this to the 8 milestone Mar 9, 2024
@morlandi7 morlandi7 modified the milestones: 8, 9 May 2, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@morlandi7 morlandi7 modified the milestones: 11, 12 Sep 26, 2024
@gjcolombo gjcolombo modified the milestone: 12 Sep 26, 2024
@gjcolombo (Contributor) commented:

We discussed this at the 9/26/24 hypervisor huddle. The thing we want to get right here is that when a Propolis zone panics, sled agent should be able to discover as much and then should (a) move the VMM to Failed, (b) terminate the zone, and (c) ensure that a bundle is taken that includes logs and the panic core.

(b) and (c) happen naturally as part of sled agent's existing "VMM terminated" logic, and (a) is trivial once the zone is gone (or, more precisely, once sled agent stops watching for updates from the relevant propolis-server). The main issue, then, is making sure that sled agent properly detects that Propolis was terminated and restarted.

Fortunately, when a client asks Propolis to do something that requires an active VM, Propolis will return a unique "VM? what VM?" error code if no VM is present. This means that sled agent should be able to detect Propolis restarts: if it knows that it previously started a VM in a Propolis server, and now the server says there's no VM there, then the whole enterprise has failed and sled agent should clean up accordingly.
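
A minimal sketch of that detection rule, assuming the distinct code surfaces as something like a `NoInstance` error-code string (the function names and the exact way the code is carried are illustrative assumptions here, not sled-agent's actual implementation):

/// Returns true if an error response from propolis-server indicates that the
/// server has no VM at all, as opposed to some other kind of failure.
fn propolis_says_vm_is_gone(error_code: Option<&str>) -> bool {
    matches!(error_code, Some("NoInstance"))
}

/// The whole enterprise has failed only if we know we previously started a
/// VM in this propolis-server; before that point, an absent VM is expected.
fn should_tear_down(previously_started_vm: bool, error_code: Option<&str>) -> bool {
    previously_started_vm && propolis_says_vm_is_gone(error_code)
}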

@gjcolombo gjcolombo changed the title consider not restarting propolis-server after failure sled agent should terminate Propolis zones when Propolis indicates a previously-started VM has gone missing Sep 27, 2024
@hawkw (Member) commented Sep 27, 2024

Thanks @gjcolombo for renaming this one --- I think the new title is much better. I investigated what we're currently doing in sled-agent in #3238 (comment), and the TL;DR is "it's still pretty bad": the error code Propolis returns when it's been restarted is blindly passed up to Nexus (which isn't terrible, it's at least not a 500 anymore), but Nexus doesn't know what it means, and sled-agent doesn't actually clean up the zone in that case. So I'm gonna go fix that.

@hawkw hawkw self-assigned this Sep 27, 2024
@hawkw (Member) commented Sep 27, 2024

Good news, the propolis-server InstanceStateMonitor API will also return the error codes we want from it (which, it turns out, I had added, but I forgot about): https://github.com/oxidecomputer/propolis/blob/dfcd4430f0211160c9fffd4688d1be556296c769/bin/propolis-server/src/lib/server.rs#L389-L418

Bad news, the sled-agent InstanceMonitorRunner will just bail out on receipt of any error from Propolis:

impl InstanceMonitorRunner {
    async fn run(self) -> Result<(), anyhow::Error> {
        let mut gen = 0;
        loop {
            // State monitoring always returns the most recent state/gen pair
            // known to Propolis.
            let response = self
                .client
                .instance_state_monitor()
                .body(propolis_client::types::InstanceStateMonitorRequest {
                    gen,
                })
                .send()
                .await?
                .into_inner();
            let observed_gen = response.gen;

            // Now that we have the response from Propolis' HTTP server, we
            // forward that to the InstanceRunner.
            //
            // It will decide the new state, provide that info to Nexus,
            // and possibly identify if we should terminate.
            let (tx, rx) = oneshot::channel();
            self.tx_monitor
                .send(InstanceMonitorRequest::Update { state: response, tx })
                .await?;

            if let Reaction::Terminate = rx.await? {
                return Ok(());
            }

            // Update the generation number we're asking for, to ensure the
            // Propolis will only return more recent values.
            gen = observed_gen + 1;
        }
    }
}

We would expect this to drop the send-side of the channel it uses to communicate with the InstanceRunner, closing the channel, and hitting this select, which would correctly mark the VMM as failed and do what we want, despite eating the actual error message from propolis:

// NOTE: This case shouldn't really happen, as we keep a copy
// of the sender alive in "self.tx_monitor".
None => {
    warn!(self.log, "Instance 'VMM monitor' channel closed; shutting down");
    let mark_failed = true;
    self.terminate(mark_failed).await;
},

HOWEVER, as the comment points out, the InstanceRunner is also holding its own clone of the channel sender, keeping us from ever hitting that case:

// Request channel on which monitor requests are made.
tx_monitor: mpsc::Sender<InstanceMonitorRequest>,
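
To make the channel semantics concrete, here is a tiny self-contained demonstration (using tokio's mpsc directly, not sled-agent's types) that the receiver only observes a closed channel once every Sender clone has been dropped:

use tokio::sync::mpsc::{self, error::TryRecvError};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);
    // Analogous to the InstanceRunner keeping its own clone in self.tx_monitor.
    let tx_clone = tx.clone();

    // Drop the monitor task's sender: the channel is still open, because
    // another Sender clone is alive, so recv() would just keep waiting.
    drop(tx);
    assert!(matches!(rx.try_recv(), Err(TryRecvError::Empty)));

    // Only once every clone is dropped does the receiver see a closed
    // channel, which is the case the select arm above is waiting for.
    drop(tx_clone);
    assert_eq!(rx.recv().await, None);
}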

AFAICT, this means we'll just totally give up on monitoring the instance as soon as we hit any error here, which seems...quite bad. I think that this actually means that when a Propolis process crashes unexpectedly, we'll get an error when the TCP connection closes, bail out, and then never try hitting the instance monitor endpoint ever again[^1]. So, we won't notice that the Propolis is actually out to lunch until we try to ask it to change state. This Is Not Great!

We should probably rework this so that the InstanceMonitorRunner will proactively send any errors which definitively indicate goneness to the InstanceRunner, and just keep trying in the face of any other errors (at least, connection errors), probably with a smallish backoff.
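
As a rough sketch of that rework (names like `MonitorError` and `monitor_loop` are made up, and a synchronous `check_state` closure stands in for the real async call to Propolis' state-monitor endpoint), with a small capped exponential backoff:

use std::time::Duration;

/// Illustrative error classification: either Propolis definitively reports
/// that the VM is gone, or the error is something we should retry through.
enum MonitorError {
    VmGone,
    Transient,
}

async fn monitor_loop(
    mut check_state: impl FnMut() -> Result<u64, MonitorError>,
    mut on_update: impl FnMut(u64),
    mut on_vm_gone: impl FnMut(),
) {
    let mut backoff = Duration::from_millis(250);
    const MAX_BACKOFF: Duration = Duration::from_secs(5);
    loop {
        match check_state() {
            Ok(observed_gen) => {
                // Forward the observed state to the InstanceRunner and reset
                // the backoff now that Propolis is reachable again.
                on_update(observed_gen);
                backoff = Duration::from_millis(250);
            }
            Err(MonitorError::VmGone) => {
                // Definitive goneness: tell the InstanceRunner so it can mark
                // the VMM Failed and tear down the zone, then stop monitoring.
                on_vm_gone();
                return;
            }
            Err(MonitorError::Transient) => {
                // Connection errors and the like: keep polling with a backoff
                // rather than silently giving up on the instance.
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(MAX_BACKOFF);
            }
        }
    }
}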

Footnotes

  1. Although I've not actually verified that this is what happens.

@hawkw hawkw added the bug (Something that isn't working.) and Sled Agent (Related to the Per-Sled Configuration and Management) labels Sep 27, 2024
@hawkw (Member) commented Sep 27, 2024

Retagging this one as "bug" because it's turned out that this is kinda bad, and I'm going to make fixing it my top priority.

@hawkw hawkw closed this as completed in 3093818 Oct 1, 2024
hawkw added a commit that referenced this issue Oct 2, 2024
At present, sled-agent's handling of the error codes used by Propolis to
indicate that it has crashed and been restarted is woefully incorrect.
In particular, there are two cases where such an error may be
encountered by a sled-agent:

1. When attempting to ask the VMM to change state (e.g. reboot or stop
   the instance)
2. When hitting Propolis' `instance-state-monitor` API endpoint to
   proactively check the VM's current state

Neither of these is handled correctly today.

In the first case, if a sled-agent operation on behalf of Nexus
encounters a client error from Propolis, it will forward that error code
to Nexus...but, _Nexus_ only moves instances to `Failed` in the face of
sled-agent's `410 NO_SUCH_INSTANCE`, and _not_ [if it sees the
`NoInstance` error code from Propolis][1], which means that the ghosts
left behind by crashed and restarted Propolii still won't be moved to
`Failed`. Agh. Furthermore, in that case, sled-agent itself will not
perform the necessary cleanup actions to deal with the now-defunct
Propolis zone (e.g. collecting a zone bundle and then destroying the
zone).

In the second case, where we hit the instance-state-monitor endpoint and
get back a `NoInstance` error, things are equally dire. The
`InstanceMonitorRunner` task, which is responsible for polling the state
monitor endpoint, will just bail out on receipt of any error from
Propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L253-L289

We would _expect_ this to drop the send-side of the channel it uses to
communicate with the `InstanceRunner`, closing the channel, and hitting
this select, which would correctly mark the VMM as failed and do what we
want, despite eating the actual error message from propolis:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L388-L394

HOWEVER, as the comment points out, the `InstanceRunner` is _also_
holding its own clone of the channel sender, keeping us from ever
hitting that case:

https://github.com/oxidecomputer/omicron/blob/888f6a1eae91e5e7091f2e174dec7a8ee5bd04b5/sled-agent/src/instance.rs#L308-L309

AFAICT, this means we'll just totally give up on monitoring the instance
as soon as we hit any error here, which seems...quite bad. I *think*
that this actually means that when a Propolis process crashes
unexpectedly, we'll get an error when the TCP connection closes, bail
out, and then _never try hitting the instance monitor endpoint ever
again_[^1]. So, we won't notice that the Propolis is actually out to
lunch until we try to ask it to change state. **This Is Not Great!**

This commit fixes both of these cases, by making sled-agent actually
handle Propolis' error codes correctly. I've added a dependency on the
`propolis_api_types` crate, which is where the error code lives, and
some code in sled-agent to attempt to parse these codes when receiving
an error response from the Propolis client. Now, when we try to PUT a
new state to the instance, we'll look at the error code that comes back,
mark it as `Failed` if the error indicates that we should do so, and
publish the `Failed` VMM state to Nexus before tearing down the zone.
The `InstanceMonitorTask` now behaves similarly, and I've changed it to
retry all other errors with a backoff, rather than just bailing out
immediately on receipt of the first error.

I've manually tested this on `london` as discussed here:

#6726 (comment)

Unfortunately, it's hard to do any kind of automated testing of this
with the current test situation, as none of this code is exercised by
the simulated sled-agent.

Fixes #3209
Fixes #3238

[^1]: Although I've not actually verified that this is what happens.

[1]: https://github.com/oxidecomputer/propolis/blob/516dabe473cecdc3baea1db98a80441968c144cf/crates/propolis-api-types/src/lib.rs#L434-L441
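
For reference, a minimal sketch of the ordering the commit describes (publish the Failed VMM state to Nexus before tearing down the zone), using stand-in stubs rather than sled-agent's real functions:

#[derive(Debug, Clone, Copy)]
enum VmmState {
    Failed,
}

// Stub: in sled-agent this is a call to Nexus' internal API.
async fn publish_vmm_state_to_nexus(state: VmmState) {
    println!("publishing VMM state {state:?} to Nexus");
}

// Stub: in sled-agent this collects a zone bundle (logs, cores) and then
// halts and destroys the Propolis zone.
async fn collect_bundle_and_destroy_zone() {
    println!("collecting zone bundle and destroying Propolis zone");
}

/// Called once we've decided that the VM we previously started no longer
/// exists in this propolis-server (see the error-code handling above).
async fn on_vm_gone_missing() {
    // Per the commit: the Failed state is published to Nexus first, and the
    // zone is torn down afterwards.
    publish_vmm_state_to_nexus(VmmState::Failed).await;
    collect_bundle_and_destroy_zone().await;
}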