Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled agent: index running VMMs by VMM ID, not instance ID #6429

Merged
merged 16 commits into from
Aug 27, 2024

Conversation

gjcolombo
Copy link
Contributor

Change sled agent's instance lookup tables so that Propolis jobs are indexed by Propolis/VMM IDs instead of instance IDs.
This is a prerequisite to revisiting how the Failed instance state works. See RFD 486 section 6.1 for all the details of why this is needed, but very broadly: when an instance's VMM is Destroyed, we'd like sled agent to tell Nexus that before the agent deregisters the instance from the sled, for reasons described in the RFD; but if we do that with no other changes, there's a race where Nexus may try to restart the instance on the same sled before sled agent can update its instance table, causing instance start to fail.

To achieve this:

  • In sled agent, change the InstanceManagerRunner's instance map to a BTreeMap<PropolisUuid, Instance>, then clean up all the compilation errors.
  • In Nexus:
    • Make callers of instance APIs furnish a Propolis ID instead of an instance ID. This is generally very straightforward because they already had to get a VMM record to figure out what sled to talk to.
    • Change cpapi_instances_put to take a Propolis ID instead of an instance ID. Regular sled agent still has both IDs, but with these changes, simulated sled agents only have a Propolis ID to work with, and plumbing an instance ID down to them requires significant code changes.
  • Update test code:
    • Unify the Nexus helper routines that let integration tests get sled agent clients or sled IDs; now they get a single struct containing both of those and the instance's Propolis IDs.
    • Update users of the simulated agent's poke endpoints to use Propolis IDs.
    • Delete the "detach disks on instance stop" bits of simulated sled agent. These don't appear to be load-bearing, they don't correspond to any behavior in the actual sled agent (which doesn't manage disk attachment or detachment), and it was a pain to rework them to work with Propolis IDs.

Tests: cargo nextest.

Related: #4226 and #4872, among others.

@hawkw hawkw self-requested a review August 24, 2024 01:00
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Even setting aside that this is a prerequisite for implementing RFD 486, I think that, in a post-#5749 world, it feels a lot more conceptually correct for sled-agents to refer to things by their VMM IDs rather than their instance IDs. And, thanks so much for cleaning up some of the weird naming and such that I left behind in #5749! :)

I left some suggestions for some small improvements we could make, but I don't have any high-level concerns about the change overall. It would be nice to get rid of the code in sled-agent that loops over the BTreeMap trying to find a Propolis UUID that looks the same as a Propolis zone name when we can just look up the UUID now, and it might be worth eliminating the VMM refetch by having vmm_and_migration_update_runtime return the instance ID? But, for the most part, I love this change.

nexus/src/app/background/tasks/abandoned_vmm_reaper.rs Outdated Show resolved Hide resolved
nexus/src/app/background/tasks/instance_watcher.rs Outdated Show resolved Hide resolved
nexus/src/app/instance.rs Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/instance_manager.rs Outdated Show resolved Hide resolved
sled-agent/src/instance_manager.rs Outdated Show resolved Hide resolved
sled-agent/src/instance_manager.rs Outdated Show resolved Hide resolved
This is mostly renaming and cleanup; the main functional change is to
the instance zone bundle collection routine, which now parses the input
zone name to get a Propolis ID instead of converting all registered
Propolis IDs to zone names to look for a match.
@gjcolombo gjcolombo requested a review from hawkw August 26, 2024 21:35
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this all looks good to me whenever you're happy with it!

nexus/db-queries/src/db/datastore/vmm.rs Show resolved Hide resolved
nexus/src/app/instance.rs Outdated Show resolved Hide resolved
sled-agent/src/instance_manager.rs Outdated Show resolved Hide resolved
@gjcolombo
Copy link
Contributor Author

Thanks as always for the review, @hawkw -- I'd like to check manually that instance zone bundles still work as intended; provided that they do I'll merge once main is open again.

@gjcolombo gjcolombo merged commit a0cdce7 into main Aug 27, 2024
23 checks passed
@gjcolombo gjcolombo deleted the gjcolombo/sa-index-by-vmm-id branch August 27, 2024 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants