ingest new Propolis VM creation API #7211

Merged: gjcolombo merged 10 commits into main from gjcolombo/ingest-propolis-ensure-api on Dec 20, 2024

Conversation

@gjcolombo (Contributor) commented on Dec 6, 2024:

Update Omicron to use the new Propolis VM creation API defined in oxidecomputer/propolis#813 and oxidecomputer/propolis#816. This API requires clients to pass instance specs to create new VMs and component replacement lists to migrate existing VMs. Construct these in sled agent for now; in the future this logic can move to Nexus and become part of a robust virtual platform abstraction. For now the goal is just to keep everything working for existing VMs while adapting to the new Propolis API.
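For a concrete (but entirely hypothetical) picture of the two request shapes, here is a sketch in which every name is invented for illustration; see oxidecomputer/propolis#813 and oxidecomputer/propolis#816 for the real types:

use std::net::SocketAddr;
use uuid::Uuid;

// Invented stand-ins for the concepts above, not the real Propolis types.
struct InstanceSpec { /* devices, backends, boot order, ... */ }
struct ComponentReplacement { /* component name plus its new backend */ }

enum VmInit {
    // Creating a new VM: the caller supplies a complete description of the
    // virtual platform up front.
    Spec(InstanceSpec),
    // Migrating in an existing VM: the caller supplies only the components
    // whose backends differ on the target sled (e.g. Crucible volumes and
    // NICs), plus enough information to reach the source.
    MigrationTarget {
        migration_id: Uuid,
        src_addr: SocketAddr,
        replace: Vec<ComponentReplacement>,
    },
}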

Slightly adjust the sled agent instance APIs so that Nexus specifies disks and boot orderings using sled-agent-defined types and not re-exported Propolis types.

Finally, adapt Nexus to the fact that Crucible's VolumeConstructionRequest and CrucibleOpts types no longer appear in Propolis's generated client (and so don't appear in sled agent's client either). Instead, the propolis-client crate re-exports these types directly from its crucible-client-types dependency. For the most part, this involves updating use directives and storing SocketAddrs in their natively typed form instead of converting them to and from strings.
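A minimal sketch of the SocketAddr portion of that change, using illustrative stand-in types rather than the actual generated or native definitions:

use std::net::SocketAddr;

// Stand-in for a Progenitor-generated type: addresses arrive as strings.
struct GeneratedOpts {
    target: Vec<String>,
}

// Stand-in for the natively typed form re-exported from crucible-client-types:
// addresses are already SocketAddrs, so parse/format round-trips (and their
// error paths) drop out of Nexus and sled agent code.
struct NativeOpts {
    target: Vec<SocketAddr>,
}

fn to_native(opts: GeneratedOpts) -> Result<NativeOpts, std::net::AddrParseError> {
    Ok(NativeOpts {
        target: opts
            .target
            .iter()
            .map(|t| t.parse::<SocketAddr>())
            .collect::<Result<Vec<_>, _>>()?,
    })
}

fn main() {
    let generated = GeneratedOpts {
        target: vec!["[fd00:1122:3344:101::5]:19000".to_string()],
    };
    let native = to_native(generated).expect("addresses should parse");
    assert_eq!(native.target[0].port(), 19000);
}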

Tests: cargo nextest, plus ad hoc testing in a dev cluster as described in the PR comments.

@gjcolombo (Contributor, Author) left a comment:

Making notes of a couple of things I saw while writing the PR description.

I also need to check that disk snapshots against a running instance work as intended, since that API also changed (I've tested this on the Propolis side but need to make sure it works here too).

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from 222ff66 to dd376b0 on December 6, 2024.
@gjcolombo (Contributor, Author) commented:

> I also need to check that disk snapshots against a running instance work as intended, since that API also changed (I've tested this on the Propolis side but need to make sure it works here too).

I did this by spinning up a dev cluster, creating an instance and logging into it, touching some files there, and taking a snapshot of the image's disk while it was still running. The Propolis logs showed it servicing the snapshot request, and afterward I had a usable snapshot that I could use to create a new image/disk/instance that contained the files I'd touched pre-snapshot.

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from dd376b0 to 17e1ac0 on December 18, 2024.
@gjcolombo (Contributor, Author) commented:

Now that all the relevant Propolis changes have merged, I've rebased this onto the latest main and updated the pinned Propolis commit to d4529fd8247386b422b78e1203315d5baea5ea8b (the last one to contain all the shiny new API changes).

I also tested somewhat more rigorously that similarly configured VMs have the same guest-visible topology and IDs before and after this change, by spinning up dev clusters with and without the changes and doing the following:

  1. Upload a Debian 11 generic cloud image named "debian-genericcloud"
  2. Create and boot an instance with this image
  3. Stop the instance
  4. Create and attach two 1 GiB data disks named "joryu-data-1" and "joryu-data-2"
  5. Start the instance
  6. Check the outputs of lspci and various items in devfs/sysfs, including NVMe disk IDs and serial numbers to make sure everything matches

Next I need to recheck disk snapshots and VCR replacement; if both of those work as intended, I'll open this up for review.
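(An aside on the /dev/disk/by-id entries in the listings below: the nvme-nvme.01de-… names appear to be the PCI vendor ID (0x1de, Oxide) followed by the hex-encoded disk serial and the namespace ID, so they can be checked against the expected disk names directly. A minimal decoder, assuming that encoding:)

// Decode the hex-encoded serial embedded in one of those names, e.g.
// "6a6f7279752d646174612d31" -> "joryu-data-1".
fn main() {
    let hex = "6a6f7279752d646174612d31";
    let bytes: Vec<u8> = (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).expect("valid hex"))
        .collect();
    println!("{}", String::from_utf8_lossy(&bytes)); // prints "joryu-data-1"
}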

Detailed output

With the changes:

debian@joryu:~$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
debian@joryu:~$ ls /dev/disk/by-id
nvme-nvme.01de-6a6f7279752d646174612d31--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001        nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part14
nvme-nvme.01de-6a6f7279752d646174612d32--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part1  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part15

debian@joryu:~$ cat /sys/block/nvme0n1/device/serial
joryu-debian-generic
debian@joryu:~$ cat /sys/block/nvme1n1/device/serial
joryu-data-1
debian@joryu:~$ cat /sys/block/nvme2n1/device/serial
joryu-data-2

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
        Subsystem: Oxide Computer Company 440FX - 82441FX PMC [Natoma]
        Flags: bus master, fast devsel, latency 0

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        Subsystem: Oxide Computer Company 82371SB PIIX3 ISA [Natoma/Triton II]
        Flags: bus master, fast devsel, latency 0

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
        Subsystem: Oxide Computer Company 82371AB/EB/MB PIIX4 ACPI
        Flags: bus master, fast devsel, latency 0, IRQ 9

00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
        Subsystem: Red Hat, Inc. Virtio network device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c200 [size=512]
        Memory at c001a000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c001c000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0010000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0020000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0008000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0024000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0000000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
        Subsystem: Red Hat, Inc. Virtio block device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c000 [size=512]
        Memory at c0018000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

Without:

debian@joryu:~$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000
00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
debian@joryu:~$ ls /dev/disk/by-id
nvme-nvme.01de-6a6f7279752d646174612d31--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001        nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part14
nvme-nvme.01de-6a6f7279752d646174612d32--00000001  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part1  nvme-nvme.01de-6a6f7279752d64656269616e2d67656e65726963--00000001-part15
debian@joryu:~$ cat /sys/block/nvme0n1/device/serial
joryu-debian-generic
debian@joryu:~$ cat /sys/block/nvme1n1/device/serial
joryu-data-1
debian@joryu:~$ cat /sys/block/nvme2n1/device/serial
joryu-data-2

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
        Subsystem: Oxide Computer Company 440FX - 82441FX PMC [Natoma]
        Flags: bus master, fast devsel, latency 0

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
        Subsystem: Oxide Computer Company 82371SB PIIX3 ISA [Natoma/Triton II]
        Flags: bus master, fast devsel, latency 0

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
        Subsystem: Oxide Computer Company 82371AB/EB/MB PIIX4 ACPI
        Flags: bus master, fast devsel, latency 0, IRQ 9

00:08.0 Ethernet controller: Red Hat, Inc. Virtio network device
        Subsystem: Red Hat, Inc. Virtio network device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c200 [size=512]
        Memory at c001a000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

00:10.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c001c000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0010000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:11.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0020000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0008000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:12.0 Non-Volatile memory controller: Oxide Computer Company Device 0000 (prog-if 02 [NVM Express])
        Subsystem: Oxide Computer Company Device 0000
        Flags: bus master, fast devsel, latency 0, NUMA node 0
        Memory at c0024000 (64-bit, non-prefetchable) [size=16K]
        Memory at c0000000 (32-bit, non-prefetchable) [size=32K]
        Capabilities: <access denied>
        Kernel driver in use: nvme

00:18.0 SCSI storage controller: Red Hat, Inc. Virtio block device
        Subsystem: Red Hat, Inc. Virtio block device
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c000 [size=512]
        Memory at c0018000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci
        Kernel modules: virtio_pci

@gjcolombo (Contributor, Author) commented:

I've also tested the following:

  • Disk snapshot: started an instance with a disk attached, took a snapshot, verified that the Propolis logs reflected that the disk snapshot API was called and that it returned success
  • VCR replacement: expunged one of the simulated disks backing the instance's disk, verified that the Propolis logs showed a VCR replacement and a subsequent attempt to repair the disk

Looking into the integration test failures now.

@gjcolombo (Contributor, Author) commented:

The most recent integration test failure (in test_disk_move_between_instances) appears to be a preexisting race condition in simulated sled agent that's not related to these changes. I filed #7282 to track it and will rerun this job.

@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from a01c51d to 4292e00 on December 19, 2024.
@gjcolombo marked this pull request as ready for review on December 19, 2024.
@jmpesp (Contributor) left a comment:

I like it!

sled-agent/types/src/instance.rs:
@@ -152,9 +184,6 @@ pub struct VmmUnregisterResponse {
/// migration.
#[derive(Copy, Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceMigrationTargetParams {
/// The Propolis ID of the migration source.
pub src_propolis_id: Uuid,
@jmpesp (Contributor) commented:

It seems like a mistake to lose this; it's analogous to wishing the Upstairs were given the expected UUID of each Downstairs it connects to, and to the bugs that not having that information has caused.

@gjcolombo (Contributor, Author) replied:

Can you elaborate on which bugs these are?

I'm trying to think about what kinds of bugs this would prevent in this path. The main one I can think of is a sort of bug in which we intended to migrate from VMM A, but for some reason we passed VMM B's server address to the ensure request. It seems like it would be hard to do that up in Nexus given where all this information comes from.

I guess we could forestall this by having the source Propolis check that its ID matches the one the target supplied. This would still not prevent a different bug, though, in which an instance is currently running in VMM A, but somehow Nexus reads the VMM record for VMM B and passes its ID and address to this request; in that case the migration will be accepted, but we'll still end up migrating from the wrong server.

In any event: I'm not sure I can address this directly in this PR, because the corresponding parameter is also gone from the Propolis API and migration protocol. But if it can effectively prevent a bug I'm all for adding it back.

@jmpesp (Contributor) replied:

#6353 is a manifestation of the bug I'm talking about: an Upstairs constructed by a Pantry panicked when one of the regions it was connecting to did not match the others in the region set. The only information the Upstairs receives about its Downstairs is the SocketAddrs to connect to:

targets: [
  addr1, // downstairs A
  addr2, // downstairs B
  addr3  // downstairs C
]

When the Crucible agent creates and destroys Downstairs, it aggressively reuses the ports it assigns to new Downstairs. If Nexus destroys Downstairs C and then creates a new Downstairs D (without doing anything else with the agent), Downstairs D will bind to the same port that Downstairs C did. If that happens, the targets array the Upstairs was sent is instantly invalid, and constructing the Upstairs will either 1) panic, because the Downstairs D region is a different size than A and B, or 2) not panic, because Downstairs D is the same size as A and B, but its region will be blank, leading to other issues.

Without any additional information in the targets array, the Upstairs has to panic: it would be invalid to continue and operate on regions with mismatched sizes. Adding the expected UUID would give the Upstairs the option to bail instead:

targets: [
  { uuidA, addr1 },
  { uuidB, addr2 },
  { uuidC, addr3 }
]

During negotiation with each Downstairs, the Upstairs would be able to compare the expected UUID in the targets array with what that Downstairs reports, and fail to connect instead of panicking.
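A hypothetical sketch of the check being proposed, with invented names throughout (this is not the actual Crucible Upstairs code):

use std::net::SocketAddr;
use uuid::Uuid;

// Each target carries the UUID the Upstairs expects that Downstairs to have.
struct DownstairsTarget {
    expected_id: Uuid,
    addr: SocketAddr,
}

// During negotiation, compare what the Downstairs reports against the
// expectation and refuse the connection on mismatch, e.g. when a reused
// port now points at a different (blank or differently sized) Downstairs.
fn check_negotiation(target: &DownstairsTarget, reported_id: Uuid) -> Result<(), String> {
    if reported_id != target.expected_id {
        return Err(format!(
            "downstairs at {} reported id {}, expected {}",
            target.addr, reported_id, target.expected_id
        ));
    }
    Ok(())
}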

@hawkw (Member) left a comment:

Overall, this looks good to me! I agree with @jmpesp that it would be nice to explain a few of the weird-looking bits.

@gjcolombo (Contributor, Author) left a comment:

Thanks for the reviews, @jmpesp and @hawkw!


@iximeow (Member) left a comment:

yet more approve-with-comments, now with some type pedantics :) overall: really nice, thanks!

Commit message:

This ends up touching a substantial number of files because of the way
Nexus accessed Crucible client types:

- Nexus imported Crucible types from the generated sled agent client
- Sled agent's code included generated Crucible types from the Propolis
  client
- The Propolis client generates Crucible types because the Propolis
  server API used a DiskRequest type that referred to the Crucible
  client library's VolumeConstructionRequest and CrucibleOpts types

The Propolis server API no longer has a DiskRequest type, so generated
Propolis clients no longer include the Crucible types, and neither do
the generated sled agent clients.

Deriving Crucible types from the Propolis client is useful because it
ensures that Nexus's conception of a VolumeConstructionRequest is the
same as Propolis's. To maintain this property, the Propolis client
depends on the crucible-client-types crate and re-exports types from it;
the sled agent client then re-exports from the Propolis client
library.

The catch is that Progenitor-generated versions of these types use
Strings for socket addresses, while the native types use native
SocketAddrs. This requires a fair bit of munging in the tests to remove
some conversions between strings and socket addresses.
@gjcolombo force-pushed the gjcolombo/ingest-propolis-ensure-api branch from 97d1e40 to f0c692d on December 20, 2024.
@gjcolombo (Contributor, Author) commented:

@iximeow Thanks for the review! I think I've addressed all your comments in f0c692d. I've also retested on a dev cluster that instance creation/start and disk snapshot seem to work (to try to gain some confidence that I didn't bungle any SocketAddr conversions or boot order designations).

I'm planning to merge this if/when CI comes back green.

@gjcolombo (Contributor, Author) commented:

CI failures are #7251, #7255, and #4657; requested a rerun.

@gjcolombo (Contributor, Author) commented:

Hmm, I see #7251 and #7255 should be fixed... looking a little more closely at those failures now (but still letting the rerun go).

@gjcolombo merged commit c1281a9 into main on Dec 20, 2024.
19 checks passed
@gjcolombo deleted the gjcolombo/ingest-propolis-ensure-api branch on December 20, 2024.