-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add migration
table and explicit migration tracking in sled-agent
#5859
Conversation
ef871e2
to
9362651
Compare
miigration
table and explicit migration tracking in sled-agent
miigration
table and explicit migration tracking in sled-agentmigration
table and explicit migration tracking in sled-agent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid, great idea to split out the source/destination columns so explicitly. Already seems easier to follow, and I think the omdb commands are a great idea.
/// The state of the migration target VMM. | ||
pub target_state: MigrationState, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to starting with the "state" being the same at source and destination, but I wonder if we'll eventually want different types here. Seems possible that there's a source-only or target-only state in the future.
(No change needed, just musing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, the intention for this table is to store pretty low-resolution data about the migration state, basically just whether it's completed or failed, so we're not including more detailed migration status reported by Propolis. So, we don't really expect these to include states that are specific to one side of the migration currently, although I suppose we could...
ObservedMigrationStatus::NoMigration | ||
| ObservedMigrationStatus::InProgress | ||
| ObservedMigrationStatus::Pending => {} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to leave this comment on right-side line 382 but GitHub.
What happens in the following case?
- migration saga calls
set_migration_ids
on a migration source - before the target VMM launches, the guest in the source shuts down
- the source's sled reaches
apply_propolis_observation
and the call toclear_migration_ids
Does the source migration status just get stuck in Pending? (The migration writ large should eventually fail, provided Propolis pushes the updates we expect: the target VMM will start, try to connect to the source, and find that it's missing, which should cause that side of the migration to fail.)
If the answer's "yes," does that matter given that the target should eventually report migration failure, or should we do something to sanitize this in clear_migration_ids
? (For example, what happens if the target sled crashes just after it launches the target Propolis, such that it never reports anything else, either?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, my thinking was that the target observing a failure here would be sufficient to handle this case, but your point that the target sled could also have crashed and might not see the source's failure... I'll see about having this fail the migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, as of f6c9875 we should report failure immediately if a VMM fails/is destroyed while it's part of an in-progress migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I suspect that in the long run we're probably going to need Nexus to deal with this in a health check task: if Nexus discovers that a VMM is unexpectedly gone, then the VMM's state changes and any migrations that it was a part of fail. But we'll address that when we add those checks.
Thanks @gjcolombo and @smklein for the very thorough reviews, I really appreciate it! 🖤 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for going another lap on this!
ObservedMigrationStatus::NoMigration | ||
| ObservedMigrationStatus::InProgress | ||
| ObservedMigrationStatus::Pending => {} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I suspect that in the long run we're probably going to need Nexus to deal with this in a health check task: if Nexus discovers that a VMM is unexpectedly gone, then the VMM's state changes and any migrations that it was a part of fail. But we'll address that when we add those checks.
As part of ongoing work on improving instance lifecycle management (see #5749) , we intend to remove the
InstanceRuntimeState
tracking fromsled-agent
, and make Nexus the sole owner ofinstance
records in CRDB, with a newinstance-update
saga taking over the responsibility of managing the instance's state transitions.In order to properly manage the instance state machine, Nexus will need information about the status of active migrations that are currently only available to sled-agents. For example, if an instance is migrating, and a sled agent reports that the source VMM is
Destroyed
, Nexus doesn't presently have the capability to determine whether the source VMM was destroyed because the migration completed successfully, or that the source shut down prior to starting the migration, resulting in a failure.In order for Nexus to correctly manage state updates during live migration, we introduce a new
migration
table to the schema for tracking the state of ongoing migrations. Theinstance-migrate
saga creates amigration
record when beginning a migration. The Nexus and sled-agent APIs are extended to include migration state updates from sled-agents to Nexus. In this branch, themigration
table is (basically) write-only; Nexus doesn't really read from it, and just stuffs updates into it. In the future, however, this will be used by theinstance-update
saga.It occurred to me that, in addition to using the migration table for instance updates, it might also be useful to add an OMDB command to look up the status of a migration using this table. However, I decided that made more sense as a follow-up change, as I'd like to get back to integrating this into #5749.
Fixes #2948