Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Crucible and Propolis #6453

Merged
merged 1 commit into from
Sep 3, 2024
Merged

Conversation

bnaecker
Copy link
Collaborator

No description provided.

@bnaecker bnaecker requested review from pfmooney, jmpesp and leftwo August 27, 2024 19:00
@pfmooney
Copy link
Contributor

Does it make sense to wait for oxidecomputer/propolis#747 to land?

@bnaecker
Copy link
Collaborator Author

Does it make sense to wait for oxidecomputer/propolis#747 to land?

Sure, we can bundle those here, that makes sense.

@bnaecker
Copy link
Collaborator Author

Looks like there's a failure to launch an instance in the deploy test, here. It seems like that's real, and may be due to some changes in Propolis I wasn't aware of. Maybe @gjcolombo or @hawkw has more info on that failure?

@hawkw
Copy link
Member

hawkw commented Aug 27, 2024

Uh, hmm, that doesn't seem great!

Here's the log line from Nexus when it returns that error: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APMJS3G46GXHSVPP7YYCX3/oxide-nexus:default.log?format=x-bunyan#L73386

I wonder if the issue here is that the test doesn't wait until the instance has started before trying to get the serial console? Or, perhaps, the Propolis has failed to start? We'll have to look at the sled-agent logs next, I think.

Incidentally, that error message should probably be improved; it's using the instance's internal state, when it should really be using the instance's effective state determined from the instance and VMM records. Then, it would actually tell us what the VMM's state was, rather than saying "instance is no VMM", which is both unhelpful and grammatically incorrect :)

@bnaecker
Copy link
Collaborator Author

Thanks @hawkw. It's possible Propolis panicked, though I've tested this several times now. I'll give a whirl locally and look at the logs.

@hawkw
Copy link
Member

hawkw commented Aug 27, 2024

Aaaaand it looks like Propolis has, indeed, failed to start: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APH53H72QA69KDSA063TG0/oxide-sled-agent:default.log?format=x-bunyan#L2604, returning a 500 Internal Server Error with no useful message.

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

@bnaecker
Copy link
Collaborator Author

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

Yeah, I think the zone is gone by the time we collect them, or we only do so from live zones maybe. I'll keep digging.

@gjcolombo
Copy link
Contributor

gjcolombo commented Aug 27, 2024

(edit: preempted by the above)

Ruh-roh:

SledAgent (InstanceManager): vmm setup failed: Err(Propolis(Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "d79ab988-e7c4-4413-9380-fb660ccb5ac0", "content-length": "124", "date": "Tue, 27 Aug 2024 19:55:15 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "d79ab988-e7c4-4413-9380-fb660ccb5ac0" }))

Unfortunately we don't seem to have gotten any Propolis logs or a bundle from this particular test run. I think this is because the "ensure request failed" path doesn't actually go through the regular InstanceRunner::terminate flow that would cause a bundle to be collected. I'll file an issue for this, but we'll still need to investigate what happened here, probably by running this branch in a dev environment and seeing if we can grab logs directly from it.

@hawkw
Copy link
Member

hawkw commented Aug 27, 2024

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

Yeah, I think the zone is gone by the time we collect them, or we only do so from live zones maybe. I'll keep digging.

Thanks, if you can manage to get logs from the Propolis, that would be great --- it's probably worth looping in @gjcolombo, because this might be related to some of his recent propolis-server API refactoring --- maybe we're parsing something wrong now?

Edit: oh, hi greg!

@bnaecker
Copy link
Collaborator Author

Well, at least I can repro it locally. I'm patching things to generate a zone bundle in this case, so at least we'll have logs and cores.

@bnaecker
Copy link
Collaborator Author

I've got to run for a while, but here's the failure:

{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-08-27T22:06:02.849971242Z","hostname":"oxz_propolis-server_d28d2aaa-88f6-486c-a5ab-3adbf6859bc5","pid":10432,"uri":"/instance","method":"PUT","req_id":"7943ba37-34d8-40b5-892c-7e807d7be422","remote_addr":"[fd00:1122:3344:101::1]:38280","local_addr":"[fd00:1122:3344:101::1:9]:12400","error_message_external":"Internal Server Error","error_message_internal":"VM initialization failed: Backend test-3be20cee-0d96-4fc9-921e-66d-2857e7 not found for storage device test-3be20cee-0d96-4fc9-921e-66d-2857e7","latency_us":71803,"response_code":"500"}

This is supposed to be looking up the Crucible backend for the virtual block device, but apparently I've borked the logic for finding that backend. I'm not sure how yet, since this worked deploying block devs backed by Crucible locally.

@gjcolombo
Copy link
Contributor

I think I broke this in oxidecomputer/propolis#743. Working on a fix.

@gjcolombo
Copy link
Contributor

Filed oxidecomputer/propolis#750 to track.

@gjcolombo
Copy link
Contributor

gjcolombo commented Aug 28, 2024

@bnaecker: this should be fixed by oxidecomputer/propolis#751. You'll just need to update the Propolis deps once that merges. Sorry for the trouble!

@bnaecker
Copy link
Collaborator Author

@gjcolombo All good, thanks for the quick fix! We're going to wait for the networking metrics to be integrated anyway, so not a huge rush.

But I'm glad we found this. Was there any more testing I could have done locally in Propolis to suss this out?

@gjcolombo
Copy link
Contributor

Was there any more testing I could have done locally in Propolis to suss this out?

The automated test gap here is that we don't have anything in Propolis CI that exercises the "regular" instance ensure API that Omicron uses. PHD uses a different instance ensure endpoint that gives it finer-grained control over the VM components it creates. The bug here was in the shim that converts from the Omicron endpoint's arguments to the PHD endpoint's arguments (so a little earlier in the instance startup process than PHD is able to catch). You (or really I, since this was my bug) could have caught this in manual testing by starting a VM using propolis-cli and passing JSON files containing Crucible disk requests on the command line (via the --crucible-disks parameter when invoking propolis-cli new).

In the long run my hope (I won't call it an "expectation" yet) is that the existing InstanceEnsureRequest API will be deprecated and that sled agent will switch over to using the instance-spec based API that PHD uses. But that's a ways off.

@bnaecker
Copy link
Collaborator Author

@zeeshanlakhani Let me know when your instance NIC PR in Propolis merges, and I can update this PR with those commits. That'll also pull in @gjcolombo's fix for the backend-mapping issue.

@zeeshanlakhani
Copy link
Collaborator

@zeeshanlakhani Let me know when your instance NIC PR in Propolis merges, and I can update this PR with those commits. That'll also pull in @gjcolombo's fix for the backend-mapping issue.

Yep. I'm finishing testing today.

@zeeshanlakhani zeeshanlakhani force-pushed the update-crucible-and-propolis branch from 3ff6d0a to f387bda Compare September 3, 2024 15:52
@zeeshanlakhani
Copy link
Collaborator

zeeshanlakhani commented Sep 3, 2024

Did rebase here as the deps changed over a few times. Now, with just the bits for crucible and propolis.

@zeeshanlakhani zeeshanlakhani merged commit 165d42a into main Sep 3, 2024
24 checks passed
@zeeshanlakhani zeeshanlakhani deleted the update-crucible-and-propolis branch September 3, 2024 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants