Update Crucible and Propolis #6453

bnaecker · 2024-08-27T19:00:26Z

No description provided.

pfmooney · 2024-08-27T19:13:01Z

Does it make sense to wait for oxidecomputer/propolis#747 to land?

bnaecker · 2024-08-27T19:13:28Z

Does it make sense to wait for oxidecomputer/propolis#747 to land?

Sure, we can bundle those here, that makes sense.

bnaecker · 2024-08-27T20:04:29Z

Looks like there's a failure to launch an instance in the deploy test, here. It seems like that's real, and may be due to some changes in Propolis I wasn't aware of. Maybe @gjcolombo or @hawkw has more info on that failure?

hawkw · 2024-08-27T20:12:30Z

Uh, hmm, that doesn't seem great!

Here's the log line from Nexus when it returns that error: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APMJS3G46GXHSVPP7YYCX3/oxide-nexus:default.log?format=x-bunyan#L73386

I wonder if the issue here is that the test doesn't wait until the instance has started before trying to get the serial console? Or, perhaps, the Propolis has failed to start? We'll have to look at the sled-agent logs next, I think.

Incidentally, that error message should probably be improved; it's using the instance's internal state, when it should really be using the instance's effective state determined from the instance and VMM records. Then, it would actually tell us what the VMM's state was, rather than saying "instance is no VMM", which is both unhelpful and grammatically incorrect :)

bnaecker · 2024-08-27T20:14:05Z

Thanks @hawkw. It's possible Propolis panicked, though I've tested this several times now. I'll give a whirl locally and look at the logs.

hawkw · 2024-08-27T20:14:40Z

Aaaaand it looks like Propolis has, indeed, failed to start: https://buildomat.eng.oxide.computer/wg/0/artefact/01J6AKFJRKWJPS37CVJHG4DSS0/ART3IgYELS66I1BhAJN1DYEUcCNv3ju6WVfa08cYtqryQY4R/01J6AKG5T6KS0P7MPDJRB5F6VJ/01J6APH53H72QA69KDSA063TG0/oxide-sled-agent:default.log?format=x-bunyan#L2604, returning a 500 Internal Server Error with no useful message.

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

bnaecker · 2024-08-27T20:15:58Z

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

Yeah, I think the zone is gone by the time we collect them, or we only do so from live zones maybe. I'll keep digging.

gjcolombo · 2024-08-27T20:16:54Z

(edit: preempted by the above)

Ruh-roh:

SledAgent (InstanceManager): vmm setup failed: Err(Propolis(Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "d79ab988-e7c4-4413-9380-fb660ccb5ac0", "content-length": "124", "date": "Tue, 27 Aug 2024 19:55:15 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "d79ab988-e7c4-4413-9380-fb660ccb5ac0" }))

Unfortunately we don't seem to have gotten any Propolis logs or a bundle from this particular test run. I think this is because the "ensure request failed" path doesn't actually go through the regular InstanceRunner::terminate flow that would cause a bundle to be collected. I'll file an issue for this, but we'll still need to investigate what happened here, probably by running this branch in a dev environment and seeing if we can grab logs directly from it.

hawkw · 2024-08-27T20:17:32Z

Unfortunately, it looks like there aren't any logs from the Propolis process in the test run?

Yeah, I think the zone is gone by the time we collect them, or we only do so from live zones maybe. I'll keep digging.

Thanks, if you can manage to get logs from the Propolis, that would be great --- it's probably worth looping in @gjcolombo, because this might be related to some of his recent propolis-server API refactoring --- maybe we're parsing something wrong now?

Edit: oh, hi greg!

bnaecker · 2024-08-27T21:25:02Z

Well, at least I can repro it locally. I'm patching things to generate a zone bundle in this case, so at least we'll have logs and cores.

bnaecker · 2024-08-27T22:08:22Z

I've got to run for a while, but here's the failure:

{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-08-27T22:06:02.849971242Z","hostname":"oxz_propolis-server_d28d2aaa-88f6-486c-a5ab-3adbf6859bc5","pid":10432,"uri":"/instance","method":"PUT","req_id":"7943ba37-34d8-40b5-892c-7e807d7be422","remote_addr":"[fd00:1122:3344:101::1]:38280","local_addr":"[fd00:1122:3344:101::1:9]:12400","error_message_external":"Internal Server Error","error_message_internal":"VM initialization failed: Backend test-3be20cee-0d96-4fc9-921e-66d-2857e7 not found for storage device test-3be20cee-0d96-4fc9-921e-66d-2857e7","latency_us":71803,"response_code":"500"}

This is supposed to be looking up the Crucible backend for the virtual block device, but apparently I've borked the logic for finding that backend. I'm not sure how yet, since this worked deploying block devs backed by Crucible locally.

gjcolombo · 2024-08-27T22:47:37Z

I think I broke this in oxidecomputer/propolis#743. Working on a fix.

gjcolombo · 2024-08-27T22:52:20Z

Filed oxidecomputer/propolis#750 to track.

gjcolombo · 2024-08-28T00:02:52Z

@bnaecker: this should be fixed by oxidecomputer/propolis#751. You'll just need to update the Propolis deps once that merges. Sorry for the trouble!

bnaecker · 2024-08-28T00:15:32Z

@gjcolombo All good, thanks for the quick fix! We're going to wait for the networking metrics to be integrated anyway, so not a huge rush.

But I'm glad we found this. Was there any more testing I could have done locally in Propolis to suss this out?

gjcolombo · 2024-08-28T00:31:29Z

Was there any more testing I could have done locally in Propolis to suss this out?

The automated test gap here is that we don't have anything in Propolis CI that exercises the "regular" instance ensure API that Omicron uses. PHD uses a different instance ensure endpoint that gives it finer-grained control over the VM components it creates. The bug here was in the shim that converts from the Omicron endpoint's arguments to the PHD endpoint's arguments (so a little earlier in the instance startup process than PHD is able to catch). You (or really I, since this was my bug) could have caught this in manual testing by starting a VM using propolis-cli and passing JSON files containing Crucible disk requests on the command line (via the --crucible-disks parameter when invoking propolis-cli new).

In the long run my hope (I won't call it an "expectation" yet) is that the existing InstanceEnsureRequest API will be deprecated and that sled agent will switch over to using the instance-spec based API that PHD uses. But that's a ways off.

bnaecker · 2024-08-28T00:39:16Z

@zeeshanlakhani Let me know when your instance NIC PR in Propolis merges, and I can update this PR with those commits. That'll also pull in @gjcolombo's fix for the backend-mapping issue.

zeeshanlakhani · 2024-08-28T12:45:22Z

@zeeshanlakhani Let me know when your instance NIC PR in Propolis merges, and I can update this PR with those commits. That'll also pull in @gjcolombo's fix for the backend-mapping issue.

Yep. I'm finishing testing today.

Co-authored-by: Benjamin Naecker <[email protected]>

zeeshanlakhani · 2024-09-03T15:55:11Z

Did rebase here as the deps changed over a few times. Now, with just the bits for crucible and propolis.

bnaecker requested review from pfmooney, jmpesp and leftwo August 27, 2024 19:00

hawkw mentioned this pull request Aug 27, 2024

errors returned by instance serial console Nexus API should include the *effective* state #6457

Open

gjcolombo mentioned this pull request Aug 27, 2024

after failing to start a Propolis VM sled agent should collect a zone bundle before removing the zone #6459

Open

[prep deps] update crucible and propolis deps

f387bda

Co-authored-by: Benjamin Naecker <[email protected]>

zeeshanlakhani force-pushed the update-crucible-and-propolis branch from 3ff6d0a to f387bda Compare September 3, 2024 15:52

zeeshanlakhani mentioned this pull request Sep 3, 2024

[network metrics] instance network interface schema #6414

Merged

1 task

iliana approved these changes Sep 3, 2024

View reviewed changes

zeeshanlakhani merged commit 165d42a into main Sep 3, 2024
24 checks passed

zeeshanlakhani deleted the update-crucible-and-propolis branch September 3, 2024 17:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Crucible and Propolis #6453

Update Crucible and Propolis #6453

bnaecker commented Aug 27, 2024

pfmooney commented Aug 27, 2024

bnaecker commented Aug 27, 2024

bnaecker commented Aug 27, 2024

hawkw commented Aug 27, 2024

bnaecker commented Aug 27, 2024

hawkw commented Aug 27, 2024

bnaecker commented Aug 27, 2024

gjcolombo commented Aug 27, 2024 •

edited

Loading

hawkw commented Aug 27, 2024 •

edited

Loading

bnaecker commented Aug 27, 2024

bnaecker commented Aug 27, 2024

gjcolombo commented Aug 27, 2024

gjcolombo commented Aug 27, 2024

gjcolombo commented Aug 28, 2024 •

edited

Loading

bnaecker commented Aug 28, 2024

gjcolombo commented Aug 28, 2024

bnaecker commented Aug 28, 2024

zeeshanlakhani commented Aug 28, 2024

zeeshanlakhani commented Sep 3, 2024 •

edited

Loading

Update Crucible and Propolis #6453

Update Crucible and Propolis #6453

Conversation

bnaecker commented Aug 27, 2024

pfmooney commented Aug 27, 2024

bnaecker commented Aug 27, 2024

bnaecker commented Aug 27, 2024

hawkw commented Aug 27, 2024

bnaecker commented Aug 27, 2024

hawkw commented Aug 27, 2024

bnaecker commented Aug 27, 2024

gjcolombo commented Aug 27, 2024 • edited Loading

hawkw commented Aug 27, 2024 • edited Loading

bnaecker commented Aug 27, 2024

bnaecker commented Aug 27, 2024

gjcolombo commented Aug 27, 2024

gjcolombo commented Aug 27, 2024

gjcolombo commented Aug 28, 2024 • edited Loading

bnaecker commented Aug 28, 2024

gjcolombo commented Aug 28, 2024

bnaecker commented Aug 28, 2024

zeeshanlakhani commented Aug 28, 2024

zeeshanlakhani commented Sep 3, 2024 • edited Loading

gjcolombo commented Aug 27, 2024 •

edited

Loading

hawkw commented Aug 27, 2024 •

edited

Loading

gjcolombo commented Aug 28, 2024 •

edited

Loading

zeeshanlakhani commented Sep 3, 2024 •

edited

Loading