-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Crucible and Propolis #6453
Conversation
Does it make sense to wait for oxidecomputer/propolis#747 to land? |
Sure, we can bundle those here, that makes sense. |
Looks like there's a failure to launch an instance in the deploy test, here. It seems like that's real, and may be due to some changes in Propolis I wasn't aware of. Maybe @gjcolombo or @hawkw has more info on that failure? |
Uh, hmm, that doesn't seem great! Here's the log line from Nexus when it returns that error: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APMJS3G46GXHSVPP7YYCX3/oxide-nexus:default.log?format=x-bunyan#L73386 I wonder if the issue here is that the test doesn't wait until the instance has started before trying to get the serial console? Or, perhaps, the Propolis has failed to start? We'll have to look at the sled-agent logs next, I think. Incidentally, that error message should probably be improved; it's using the instance's internal state, when it should really be using the instance's effective state determined from the instance and VMM records. Then, it would actually tell us what the VMM's state was, rather than saying "instance is no VMM", which is both unhelpful and grammatically incorrect :) |
Thanks @hawkw. It's possible Propolis panicked, though I've tested this several times now. I'll give a whirl locally and look at the logs. |
Aaaaand it looks like Propolis has, indeed, failed to start: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APH53H72QA69KDSA063TG0/oxide-sled-agent:default.log?format=x-bunyan#L2604, returning a 500 Internal Server Error with no useful message. Unfortunately, it looks like there aren't any logs from the Propolis process in the test run? |
Yeah, I think the zone is gone by the time we collect them, or we only do so from live zones maybe. I'll keep digging. |
(edit: preempted by the above) Ruh-roh:
Unfortunately we don't seem to have gotten any Propolis logs or a bundle from this particular test run. I think this is because the "ensure request failed" path doesn't actually go through the regular |
Thanks, if you can manage to get logs from the Propolis, that would be great --- it's probably worth looping in @gjcolombo, because this might be related to some of his recent propolis-server API refactoring --- maybe we're parsing something wrong now? Edit: oh, hi greg! |
Well, at least I can repro it locally. I'm patching things to generate a zone bundle in this case, so at least we'll have logs and cores. |
I've got to run for a while, but here's the failure:
This is supposed to be looking up the Crucible backend for the virtual block device, but apparently I've borked the logic for finding that backend. I'm not sure how yet, since this worked deploying block devs backed by Crucible locally. |
I think I broke this in oxidecomputer/propolis#743. Working on a fix. |
Filed oxidecomputer/propolis#750 to track. |
@bnaecker: this should be fixed by oxidecomputer/propolis#751. You'll just need to update the Propolis deps once that merges. Sorry for the trouble! |
@gjcolombo All good, thanks for the quick fix! We're going to wait for the networking metrics to be integrated anyway, so not a huge rush. But I'm glad we found this. Was there any more testing I could have done locally in Propolis to suss this out? |
The automated test gap here is that we don't have anything in Propolis CI that exercises the "regular" instance ensure API that Omicron uses. PHD uses a different instance ensure endpoint that gives it finer-grained control over the VM components it creates. The bug here was in the shim that converts from the Omicron endpoint's arguments to the PHD endpoint's arguments (so a little earlier in the instance startup process than PHD is able to catch). You (or really I, since this was my bug) could have caught this in manual testing by starting a VM using In the long run my hope (I won't call it an "expectation" yet) is that the existing |
@zeeshanlakhani Let me know when your instance NIC PR in Propolis merges, and I can update this PR with those commits. That'll also pull in @gjcolombo's fix for the backend-mapping issue. |
Yep. I'm finishing testing today. |
Co-authored-by: Benjamin Naecker <[email protected]>
3ff6d0a
to
f387bda
Compare
Did rebase here as the deps changed over a few times. Now, with just the bits for crucible and propolis. |
No description provided.