ingest new Propolis VM creation API #7211

Merged: gjcolombo merged 10 commits into main from gjcolombo/ingest-propolis-ensure-api on Dec 20, 2024

Conversation

@gjcolombo (Contributor) commented on Dec 6, 2024:

Update Omicron to use the new Propolis VM creation API defined in oxidecomputer/propolis#813 and oxidecomputer/propolis#816. This API requires clients to pass instance specs to create new VMs and component replacement lists to migrate existing VMs. Construct these in sled agent for now; in the future this logic can move to Nexus and become part of a robust virtual platform abstraction. For now the goal is just to keep everything working for existing VMs while adapting to the new Propolis API.
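For a concrete (but entirely hypothetical) picture of the two request shapes, here is a sketch in which every name is invented for illustration; see oxidecomputer/propolis#813 and oxidecomputer/propolis#816 for the real types:

use std::net::SocketAddr;
use uuid::Uuid;

// Invented stand-ins for the concepts above, not the real Propolis types.
struct InstanceSpec { /* devices, backends, boot order, ... */ }
struct ComponentReplacement { /* component name plus its new backend */ }

enum VmInit {
    // Creating a new VM: the caller supplies a complete description of the
    // virtual platform up front.
    Spec(InstanceSpec),
    // Migrating in an existing VM: the caller supplies only the components
    // whose backends differ on the target sled (e.g. Crucible volumes and
    // NICs), plus enough information to reach the source.
    MigrationTarget {
        migration_id: Uuid,
        src_addr: SocketAddr,
        replace: Vec<ComponentReplacement>,
    },
}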

Slightly adjust the sled agent instance APIs so that Nexus specifies disks and boot orderings using sled-agent-defined types and not re-exported Propolis types.

Finally, adapt Nexus to the fact that Crucible's VolumeConstructionRequest and CrucibleOpts types no longer appear in Propolis's generated client (and so don't appear in sled agent's client either). Instead, the propolis-client crate re-exports these types directly from its crucible-client-types dependency. For the most part, this involves updating use directives and storing SocketAddrs in their natively typed form instead of converting them to and from strings.
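A minimal sketch of the SocketAddr portion of that change, using illustrative stand-in types rather than the actual generated or native definitions:

use std::net::SocketAddr;

// Stand-in for a Progenitor-generated type: addresses arrive as strings.
struct GeneratedOpts {
    target: Vec<String>,
}

// Stand-in for the natively typed form re-exported from crucible-client-types:
// addresses are already SocketAddrs, so parse/format round-trips (and their
// error paths) drop out of Nexus and sled agent code.
struct NativeOpts {
    target: Vec<SocketAddr>,
}

fn to_native(opts: GeneratedOpts) -> Result<NativeOpts, std::net::AddrParseError> {
    Ok(NativeOpts {
        target: opts
            .target
            .iter()
            .map(|t| t.parse::<SocketAddr>())
            .collect::<Result<Vec<_>, _>>()?,
    })
}

fn main() {
    let generated = GeneratedOpts {
        target: vec!["[fd00:1122:3344:101::5]:19000".to_string()],
    };
    let native = to_native(generated).expect("addresses should parse");
    assert_eq!(native.target[0].port(), 19000);
}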

Tests: cargo nextest, plus ad hoc testing in a dev cluster as described in the PR comments.

@gjcolombo (Contributor, Author) left a comment:

Making notes of a couple of things I saw while writing the PR description.

I also need to check that disk snapshots against a running instance work as intended, since that API also changed (I've tested this on the Propolis side but need to make sure it works here too).

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from 222ff66 to dd376b0 on December 6, 2024.
@gjcolombo (Contributor, Author) commented:

> I also need to check that disk snapshots against a running instance work as intended, since that API also changed (I've tested this on the Propolis side but need to make sure it works here too).

I did this by spinning up a dev cluster, creating an instance and logging into it, touching some files there, and taking a snapshot of the image's disk while it was still running. The Propolis logs showed it servicing the snapshot request, and afterward I had a usable snapshot that I could use to create a new image/disk/instance that contained the files I'd touched pre-snapshot.

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from dd376b0 to 17e1ac0 on December 18, 2024.
@gjcolombo (Contributor, Author) commented:

Now that all the relevant Propolis changes have merged, I've rebased this onto the latest main and updated the pinned Propolis commit to d4529fd8247386b422b78e1203315d5baea5ea8b (the last one to contain all the shiny new API changes).

I also tested somewhat more rigorously that similarly configured VMs have the same guest-visible topology and IDs before and after this change, by spinning up dev clusters with and without the changes and doing the following:

  1. Upload a Debian 11 generic cloud image named "debian-genericcloud"
  2. Create and boot an instance with this image
  3. Stop the instance
  4. Create and attach two 1 GiB data disks named "joryu-data-1" and "joryu-data-2"
  5. Start the instance
  6. Check the outputs of lspci and various items in devfs/sysfs, including NVMe disk IDs and serial numbers to make sure everything matches

Next I need to recheck disk snapshots and VCR replacement; if both of those work as intended, I'll open this up for review.
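(An aside on the /dev/disk/by-id entries in the listings below: the nvme-nvme.01de-… names appear to be the PCI vendor ID (0x1de, Oxide) followed by the hex-encoded disk serial and the namespace ID, so they can be checked against the expected disk names directly. A minimal decoder, assuming that encoding:)

// Decode the hex-encoded serial embedded in one of those names, e.g.
// "6a6f7279752d646174612d31" -> "joryu-data-1".
fn main() {
    let hex = "6a6f7279752d646174612d31";
    let bytes: Vec<u8> = (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).expect("valid hex"))
        .collect();
    println!("{}", String::from_utf8_lossy(&bytes)); // prints "joryu-data-1"
}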

Detailed output

With the changes:

debian@joryu:~$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
debian@joryu:~$ ls /dev/disk/by-id
nvme-nvme.01de-6a6f7279752d646174612d31--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001        nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part14
nvme-nvme.01de-6a6f7279752d646174612d32--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part1  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part15

debian@joryu:~$ cat /sys/block/nvme0n1/device/serial
joryu-debian-generic
debian@joryu:~$ cat /sys/block/nvme1n1/device/serial
joryu-data-1
debian@joryu:~$ cat /sys/block/nvme2n1/device/serial
joryu-data-2

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
        Subsystem: Oxide Computer Company 440FX - 82441FX PMC [Natoma]
        Flags: bus master, fast devsel, latency 0

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        Subsystem: Oxide Computer Company 82371SB PIIX3 ISA [Natoma/Triton II]
        Flags: bus master, fast devsel, latency 0

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
        Subsystem: Oxide Computer Company 82371AB/EB/MB PIIX4 ACPI
        Flags: bus master, fast devsel, latency 0, IRQ 9

00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
        Subsystem: Red Hat, Inc. Virtio network device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c200 [size=512]
        Memory at c001a000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c001c000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0010000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0020000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0008000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0024000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0000000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
        Subsystem: Red Hat, Inc. Virtio block device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c000 [size=512]
        Memory at c0018000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

Without:

debian@joryu:~$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
debian@joryu:~$ ls /dev/disk/by-id
nvme-nvme.01de-6a6f7279752d646174612d31--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001        nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part14
nvme-nvme.01de-6a6f7279752d646174612d32--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part1  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part15
debian@joryu:~$ cat /sys/block/nvme0n1/device/serial
joryu-debian-generic
debian@joryu:~$ cat /sys/block/nvme1n1/device/serial
joryu-data-1
debian@joryu:~$ cat /sys/block/nvme2n1/device/serial
joryu-data-2

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
        Subsystem: Oxide Computer Company 440FX - 82441FX PMC [Natoma]
        Flags: bus master, fast devsel, latency 0

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        Subsystem: Oxide Computer Company 82371SB PIIX3 ISA [Natoma/Triton II]
        Flags: bus master, fast devsel, latency 0

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
        Subsystem: Oxide Computer Company 82371AB/EB/MB PIIX4 ACPI
        Flags: bus master, fast devsel, latency 0, IRQ 9

00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
        Subsystem: Red Hat, Inc. Virtio network device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c200 [size=512]
        Memory at c001a000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c001c000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0010000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0020000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0008000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0024000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0000000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
        Subsystem: Red Hat, Inc. Virtio block device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c000 [size=512]
        Memory at c0018000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

@gjcolombo (Contributor, Author) commented:

I've also tested the following:

  • Disk snapshot: started an instance with a disk attached, took a snapshot, verified that the Propolis logs reflected that the disk snapshot API was called and that it returned success
  • VCR replacement: expunged one of the simulated disks backing the instance's disk, verified that the Propolis logs showed a VCR replacement and a subsequent attempt to repair the disk

Looking into the integration test failures now.

@gjcolombo (Contributor, Author) commented:

The most recent integration test failure (in test_disk_move_between_instances) appears to be a preexisting race condition in simulated sled agent that's not related to these changes. I filed #7282 to track it and will rerun this job.

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from a01c51d to 4292e00 on December 19, 2024.
@gjcolombo marked this pull request as ready for review on December 19, 2024.
@jmpesp (Contributor) left a comment:

I like it!

sled-agent/types/src/instance.rs:
@@ -152,9 +184,6 @@ pub struct VmmUnregisterResponse {
/// migration.
#[derive(Copy, Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceMigrationTargetParams {
/// The Propolis ID of the migration source.
pub src_propolis_id: Uuid,
@jmpesp (Contributor) commented:

It seems like a mistake to lose this; it's analogous to wishing the Upstairs were given the expected UUID of each Downstairs it connects to, and to the bugs that not having that information has caused.

@gjcolombo (Contributor, Author) replied:

Can you elaborate on which bugs these are?

I'm trying to think about what kinds of bugs this would prevent in this path. The main one I can think of is a sort of bug in which we intended to migrate from VMM A, but for some reason we passed VMM B's server address to the ensure request. It seems like it would be hard to do that up in Nexus given where all this information comes from.

I guess we could forestall this by having the source Propolis check that its ID matches the one the target supplied. This would still not prevent a different bug, though, in which an instance is currently running in VMM A, but somehow Nexus reads the VMM record for VMM B and passes its ID and address to this request; in that case the migration will be accepted, but we'll still end up migrating from the wrong server.

In any event: I'm not sure I can address this directly in this PR, because the corresponding parameter is also gone from the Propolis API and migration protocol. But if it can effectively prevent a bug I'm all for adding it back.

@jmpesp (Contributor) replied:

#6353 is a manifestation of the bug I'm talking about: an Upstairs constructed by a Pantry panicked when one of the regions it was connecting to did not match the others in the region set. The only information the Upstairs receives about its Downstairs is the SocketAddrs to connect to:

targets: [
  addr1, // downstairs A
  addr2, // downstairs B
  addr3  // downstairs C
]

When the Crucible agent creates and destroys Downstairs, it aggressively reuses the ports it assigns to new Downstairs. If Nexus destroys Downstairs C and then creates a new Downstairs D (without doing anything else with the agent), Downstairs D will bind to the same port that Downstairs C did. If that happens, the targets array the Upstairs was sent is instantly invalid, and constructing the Upstairs will either 1) panic, because the Downstairs D region is a different size than A and B, or 2) not panic, because Downstairs D is the same size as A and B, but its region will be blank, leading to other issues.

Without any additional information in the targets array, the Upstairs has to panic: it would be invalid to continue and operate on regions with mismatched sizes. Adding the expected UUID would give the Upstairs the option to bail instead:

targets: [
  { uuidA, addr1 },
  { uuidB, addr2 },
  { uuidC, addr3 }
]

During negotiation with each Downstairs, the Upstairs would be able to compare the expected UUID in the targets array with what that Downstairs reports, and fail to connect instead of panicking.
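A hypothetical sketch of the check being proposed, with invented names throughout (this is not the actual Crucible Upstairs code):

use std::net::SocketAddr;
use uuid::Uuid;

// Each target carries the UUID the Upstairs expects that Downstairs to have.
struct DownstairsTarget {
    expected_id: Uuid,
    addr: SocketAddr,
}

// During negotiation, compare what the Downstairs reports against the
// expectation and refuse the connection on mismatch, e.g. when a reused
// port now points at a different (blank or differently sized) Downstairs.
fn check_negotiation(target: &DownstairsTarget, reported_id: Uuid) -> Result<(), String> {
    if reported_id != target.expected_id {
        return Err(format!(
            "downstairs at {} reported id {}, expected {}",
            target.addr, reported_id, target.expected_id
        ));
    }
    Ok(())
}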

@hawkw (Member) left a comment:

Overall, this looks good to me! I agree with @jmpesp that it would be nice to explain a few of the weird-looking bits.

@gjcolombo (Contributor, Author) left a comment:

Thanks for the reviews, @jmpesp and @hawkw!


@iximeow (Member) left a comment:

yet more approve-with-comments, now with some type pedantics :) overall: really nice, thanks!

Commit message:

This ends up touching a substantial number of files because of the way
Nexus accessed Crucible client types:

- Nexus imported Crucible types from the generated sled agent client
- Sled agent's code included generated Crucible types from the Propolis
  client
- The Propolis client generates Crucible types because the Propolis
  server API used a DiskRequest type that referred to the Crucible
  client library's VolumeConstructionRequest and CrucibleOpts types

The Propolis server API no longer has a DiskRequest type, so generated
Propolis clients no longer include the Crucible types, and neither do
the generated sled agent clients.

Deriving Crucible types from the Propolis client is useful because it
ensures that Nexus's conception of a VolumeConstructionRequest is the
same as Propolis's. To maintain this property, the Propolis client
depends on the crucible-client-types crate and re-exports types from it;
the sled agent client then re-exports from the Propolis client
library.

The catch is that Progenitor-generated versions of these types use
Strings for socket addresses, while the native types use native
SocketAddrs. This requires a fair bit of munging in the tests to remove
some conversions between strings and socket addresses.
@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from 97d1e40 to f0c692d on December 20, 2024.
@gjcolombo (Contributor, Author) commented:

@iximeow Thanks for the review! I think I've addressed all your comments in f0c692d. I've also retested on a dev cluster that instance creation/start and disk snapshot seem to work (to try to gain some confidence that I didn't bungle any SocketAddr conversions or boot order designations).

I'm planning to merge this if/when CI comes back green.

@gjcolombo (Contributor, Author) commented:

CI failures are #7251, #7255, and #4657; requested a rerun.

@gjcolombo (Contributor, Author) commented:

Hmm, I see #7251 and #7255 should be fixed... looking a little more closely at those failures now (but still letting the rerun go).

@gjcolombo merged commit c1281a9 into main on Dec 20, 2024.
19 checks passed
@gjcolombo deleted the gjcolombo/ingest-propolis-ensure-api branch on December 20, 2024.