sled agent: index running VMMs by VMM ID, not instance ID #6429

gjcolombo · 2024-08-24T00:34:18Z

Change sled agent's instance lookup tables so that Propolis jobs are indexed by Propolis/VMM IDs instead of instance IDs.
This is a prerequisite to revisiting how the Failed instance state works. See RFD 486 section 6.1 for all the details of why this is needed. Very broadly: when an instance's VMM is Destroyed, we'd like sled agent to tell Nexus that before the agent deregisters the instance from the sled, for reasons described in the RFD. If we do that with no other changes, though, there's a race: Nexus may try to restart the instance on the same sled before sled agent can update its instance table, causing instance start to fail.

To achieve this:

- In sled agent, change the InstanceManagerRunner's instance map to a BTreeMap<PropolisUuid, Instance>, then clean up all the compilation errors.
- In Nexus:
  - Make callers of instance APIs furnish a Propolis ID instead of an instance ID. This is generally very straightforward because they already had to get a VMM record to figure out what sled to talk to.
  - Change cpapi_instances_put to take a Propolis ID instead of an instance ID. Regular sled agent still has both IDs, but with these changes, simulated sled agents only have a Propolis ID to work with, and plumbing an instance ID down to them requires significant code changes.
- Update test code:
  - Unify the Nexus helper routines that let integration tests get sled agent clients or sled IDs; now they get a single struct containing both of those and the instance's Propolis IDs.
  - Update users of the simulated agent's poke endpoints to use Propolis IDs.
  - Delete the "detach disks on instance stop" bits of simulated sled agent. These don't appear to be load-bearing, they don't correspond to any behavior in the actual sled agent (which doesn't manage disk attachment or detachment), and it was a pain to rework them to work with Propolis IDs.
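To illustrate why the map key matters, here is a minimal sketch, not the actual sled agent code. It assumes that a restarted instance gets a fresh Propolis ID. The integer-backed `InstanceUuid` and `PropolisUuid` newtypes, the `InstanceManager` struct, and its methods are hypothetical stand-ins for illustration. With the table keyed by Propolis ID, a new VMM for the same instance can be registered while the old, Destroyed VMM's entry is still present:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real typed UUIDs; plain integers keep the
// sketch dependency-free.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct InstanceUuid(u128);
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PropolisUuid(u128);

#[derive(Debug, PartialEq)]
enum VmmState {
    Running,
    Destroyed,
}

#[derive(Debug)]
struct Instance {
    instance_id: InstanceUuid,
    state: VmmState,
}

// Hypothetical manager: the table is keyed by Propolis (VMM) ID.
#[derive(Default)]
struct InstanceManager {
    jobs: BTreeMap<PropolisUuid, Instance>,
}

impl InstanceManager {
    /// Registers a VMM. Fails only if this exact VMM is already present.
    fn ensure_registered(
        &mut self,
        propolis_id: PropolisUuid,
        instance_id: InstanceUuid,
    ) -> Result<(), String> {
        if self.jobs.contains_key(&propolis_id) {
            return Err(format!("VMM {propolis_id:?} is already registered"));
        }
        let job = Instance { instance_id, state: VmmState::Running };
        self.jobs.insert(propolis_id, job);
        Ok(())
    }

    /// Marks a VMM as Destroyed without removing its table entry.
    fn mark_destroyed(&mut self, propolis_id: PropolisUuid) {
        if let Some(job) = self.jobs.get_mut(&propolis_id) {
            job.state = VmmState::Destroyed;
        }
    }

    /// Removes a VMM's entry from the table.
    fn unregister(&mut self, propolis_id: PropolisUuid) -> Option<Instance> {
        self.jobs.remove(&propolis_id)
    }
}

fn main() {
    let mut mgr = InstanceManager::default();
    let instance = InstanceUuid(1);
    mgr.ensure_registered(PropolisUuid(10), instance).unwrap();

    // The old VMM is Destroyed and reported, but not yet removed.
    mgr.mark_destroyed(PropolisUuid(10));

    // A restart of the same instance uses a new Propolis ID, so it does not
    // collide with the stale entry.
    mgr.ensure_registered(PropolisUuid(11), instance).unwrap();
    assert_eq!(mgr.jobs.len(), 2);

    // Cleanup of the old VMM can happen later.
    let old = mgr.unregister(PropolisUuid(10)).unwrap();
    assert_eq!(old.state, VmmState::Destroyed);
    assert_eq!(old.instance_id, instance);
}
```

Had the table been keyed by `InstanceUuid` in this sketch, the second registration would have hit the stale entry, which is the race described above.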

Tests: cargo nextest.

Related: #4226 and #4872, among others.

hawkw

This is great! Even setting aside that this is a prerequisite for implementing RFD 486, I think that, in a post-#5749 world, it feels a lot more conceptually correct for sled-agents to refer to things by their VMM IDs rather than their instance IDs. And, thanks so much for cleaning up some of the weird naming and such that I left behind in #5749! :)

I left some suggestions for some small improvements we could make, but I don't have any high-level concerns about the change overall. It would be nice to get rid of the code in sled-agent that loops over the BTreeMap trying to find a Propolis UUID that looks the same as a Propolis zone name when we can just look up the UUID now, and it might be worth eliminating the VMM refetch by having vmm_and_migration_update_runtime return the instance ID? But, for the most part, I love this change.
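The zone-name suggestion above could be sketched roughly as follows. This is an illustration only: the `oxz_propolis-server_` prefix, the hand-rolled `parse_uuid` helper, and the `find_job` function are assumptions made for the example, and `PropolisUuid` is an integer-backed stand-in for the real typed UUID. The idea is to parse the ID out of the zone name and do a single map lookup, rather than formatting every registered ID as a zone name and comparing:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the real typed UUID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct PropolisUuid(u128);

// Assumed zone name prefix, used here only for illustration.
const PROPOLIS_ZONE_PREFIX: &str = "oxz_propolis-server_";

/// Parses a hyphenated 8-4-4-4-12 hex UUID string into a u128.
fn parse_uuid(s: &str) -> Option<u128> {
    let groups: Vec<&str> = s.split('-').collect();
    let lens: Vec<usize> = groups.iter().map(|g| g.len()).collect();
    if lens != [8, 4, 4, 4, 12] {
        return None;
    }
    // Reject signs and other non-hex characters before parsing.
    if !groups.iter().all(|g| g.chars().all(|c| c.is_ascii_hexdigit())) {
        return None;
    }
    u128::from_str_radix(&groups.concat(), 16).ok()
}

/// Extracts the Propolis ID from a zone name, if it names a Propolis zone.
fn propolis_id_from_zone_name(zone_name: &str) -> Option<PropolisUuid> {
    let id = zone_name.strip_prefix(PROPOLIS_ZONE_PREFIX)?;
    parse_uuid(id).map(PropolisUuid)
}

/// Looks up a job with one keyed lookup instead of scanning the whole map.
fn find_job<'a, T>(
    jobs: &'a BTreeMap<PropolisUuid, T>,
    zone_name: &str,
) -> Option<&'a T> {
    let id = propolis_id_from_zone_name(zone_name)?;
    jobs.get(&id)
}

fn main() {
    let mut jobs = BTreeMap::new();
    jobs.insert(PropolisUuid(0x0123456789abcdef0123456789abcdef), "vmm-a");

    let zone = "oxz_propolis-server_01234567-89ab-cdef-0123-456789abcdef";
    assert_eq!(find_job(&jobs, zone), Some(&"vmm-a"));

    // A name without the assumed prefix is not treated as a Propolis zone.
    assert_eq!(find_job(&jobs, "some-other-zone"), None);
}
```

The lookup cost drops from a scan over all registered VMMs to a single `BTreeMap::get`, and malformed zone names are rejected before the table is consulted.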

hawkw

this all looks good to me whenever you're happy with it!

gjcolombo · 2024-08-26T23:25:59Z

Thanks as always for the review, @hawkw -- I'd like to check manually that instance zone bundles still work as intended; provided that they do I'll merge once main is open again.

gjcolombo added 10 commits August 23, 2024 23:23

index sled agent instances by propolis id, not instance id

def1ceb

fix up Nexus to call new APIs

79b6816

fix simulated sled agent client paths

36fe7c3

turn cpapi_instances_put on its head

6e574f5

fix instance ip tests

d56af4f

fix assorted uuid misuses

6e4ac97

remove unnecessary propolis_id

63d0990

cleanup

fadceea

explain why vmm record refetch is safe

f20099a

fmt, fix doc links

1f9aa26

hawkw self-requested a review August 24, 2024 01:00

hawkw reviewed Aug 24, 2024

gjcolombo added 5 commits August 26, 2024 19:52

PR feedback

6d3d843

This is mostly renaming and cleanup; the main functional change is to the instance zone bundle collection routine, which now parses the input zone name to get a Propolis ID instead of converting all registered Propolis IDs to zone names to look for a match.

get instance id from vmm find-and-update

9f4156a

remove obsolete comment

88a488a

fmt

8dd6e9e

fix doc links

3ab2431

gjcolombo requested a review from hawkw August 26, 2024 21:35

hawkw approved these changes Aug 26, 2024

PR feedback

fb9ee7e

gjcolombo merged commit a0cdce7 into main Aug 27, 2024
23 checks passed

gjcolombo deleted the gjcolombo/sa-index-by-vmm-id branch August 27, 2024 16:54
