add `migration` table and explicit migration tracking in sled-agent #5859

hawkw · 2024-06-05T20:49:13Z

As part of ongoing work on improving instance lifecycle management (see #5749), we intend to remove the InstanceRuntimeState tracking from sled-agent and make Nexus the sole owner of instance records in CRDB, with a new instance-update saga taking over responsibility for managing the instance's state transitions.

In order to properly manage the instance state machine, Nexus will need information about the status of active migrations that is currently only available to sled-agents. For example, if an instance is migrating, and a sled-agent reports that the source VMM is Destroyed, Nexus doesn't presently have the capability to determine whether the source VMM was destroyed because the migration completed successfully, or because the source shut down prior to starting the migration, resulting in a failure.

In order for Nexus to correctly manage state updates during live migration, we introduce a new migration table to the schema for tracking the state of ongoing migrations. The instance-migrate saga creates a migration record when beginning a migration. The Nexus and sled-agent APIs are extended to include migration state updates from sled-agents to Nexus. In this branch, the migration table is (basically) write-only; Nexus doesn't really read from it, and just stuffs updates into it. In the future, however, this will be used by the instance-update saga.
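For illustration, here is a rough sketch of the shape of a migration record and how a sled-agent update might be folded into it. This is not the code in this branch: the `MigrationRecord` type, the `Role` enum, the `apply_update` method, the integer ID, and the exact set of `MigrationState` variants are all assumptions made for the example, and the rule that an update only lands when it carries a newer generation number is likewise assumed rather than taken from the real query.

```rust
/// Coarse, per-side migration state. The variant set is an assumption
/// inferred from the discussion (pending, in progress, completed, failed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MigrationState {
    Pending,
    InProgress,
    Completed,
    Failed,
}

/// Which side of the migration an update came from (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Source,
    Target,
}

/// Hypothetical in-memory stand-in for one row of the `migration` table,
/// with separate state and generation columns for source and target.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct MigrationRecord {
    /// Stand-in for the migration's UUID.
    pub id: u128,
    pub source_state: MigrationState,
    pub source_gen: u64,
    pub target_state: MigrationState,
    pub target_gen: u64,
}

impl MigrationRecord {
    /// Create the record as the migrate saga would: both sides pending.
    pub fn new(id: u128) -> Self {
        Self {
            id,
            source_state: MigrationState::Pending,
            source_gen: 1,
            target_state: MigrationState::Pending,
            target_gen: 1,
        }
    }

    /// Apply an update reported by one side. Returns `true` if the update
    /// was written, `false` if it was stale (assumed generation rule).
    pub fn apply_update(
        &mut self,
        role: Role,
        state: MigrationState,
        generation: u64,
    ) -> bool {
        let (cur_state, cur_gen) = match role {
            Role::Source => (&mut self.source_state, &mut self.source_gen),
            Role::Target => (&mut self.target_state, &mut self.target_gen),
        };
        if generation <= *cur_gen {
            return false;
        }
        *cur_state = state;
        *cur_gen = generation;
        true
    }
}
```

Keeping the two sides in separate columns means an update from the source sled never clobbers what the target sled last reported, and vice versa.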

It occurred to me that, in addition to using the migration table for instance updates, it might also be useful to add an OMDB command to look up the status of a migration using this table. However, I decided that made more sense as a follow-up change, as I'd like to get back to integrating this into #5749.

Fixes #2948

schema/crdb/dbinit.sql

common/src/api/internal/nexus.rs

smklein

Looks solid, great idea to split out the source/destination columns so explicitly. Already seems easier to follow, and I think the omdb commands are a great idea.

schema/crdb/dbinit.sql

smklein · 2024-06-10T17:30:06Z

nexus/db-model/src/migration.rs

+    /// The state of the migration target VMM.
+    pub target_state: MigrationState,


I'm not opposed to starting with the "state" being the same at source and destination, but I wonder if we'll eventually want different types here. Seems possible that there's a source-only or target-only state in the future.

(No change needed, just musing)

Right now, the intention for this table is to store pretty low-resolution data about the migration state, basically just whether it's completed or failed, so we're not including more detailed migration status reported by Propolis. So, we don't really expect these to include states that are specific to one side of the migration currently, although I suppose we could...

nexus/db-queries/src/db/datastore/instance.rs

nexus/db-queries/src/db/queries/instance.rs

nexus/src/app/instance.rs

sled-agent/src/common/instance.rs

nexus/db-model/src/migration.rs

gjcolombo · 2024-06-10T16:44:58Z

sled-agent/src/common/instance.rs

            ObservedMigrationStatus::NoMigration
-            | ObservedMigrationStatus::InProgress
            | ObservedMigrationStatus::Pending => {}
        }



I want to leave this comment on right-side line 382 but GitHub.

What happens in the following case?

- migration saga calls set_migration_ids on a migration source
- before the target VMM launches, the guest in the source shuts down
- the source's sled reaches apply_propolis_observation and the call to clear_migration_ids

Does the source migration status just get stuck in Pending? (The migration writ large should eventually fail, provided Propolis pushes the updates we expect: the target VMM will start, try to connect to the source, and find that it's missing, which should cause that side of the migration to fail.)

If the answer's "yes," does that matter given that the target should eventually report migration failure, or should we do something to sanitize this in clear_migration_ids? (For example, what happens if the target sled crashes just after it launches the target Propolis, such that it never reports anything else, either?)

Hmm, my thinking was that the target observing a failure here would be sufficient to handle this case, but you're right that the target sled could also have crashed and might not see the source's failure. I'll see about having this fail the migration.

Okay, as of f6c9875 we should report failure immediately if a VMM fails/is destroyed while it's part of an in-progress migration.

Looks good. I suspect that in the long run we're probably going to need Nexus to deal with this in a health check task: if Nexus discovers that a VMM is unexpectedly gone, then the VMM's state changes and any migrations that it was a part of fail. But we'll address that when we add those checks.
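The rule discussed in this thread can be sketched roughly as follows. This is an illustrative model only, not the sled-agent code from f6c9875: the `VmmState` enum and the `migration_state_after_vmm_update` function are hypothetical names, and treating both Pending and InProgress as "unfinished" is an assumption based on the Pending case raised above.

```rust
/// Coarse migration state for one side of a migration (assumed variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MigrationState {
    Pending,
    InProgress,
    Completed,
    Failed,
}

/// Simplified, hypothetical view of the VMM states relevant here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VmmState {
    Running,
    Migrating,
    Failed,
    Destroyed,
}

/// Decide what this side's migration state should be after observing a
/// new VMM state. If the VMM has failed or been destroyed while its side
/// of the migration is still unfinished, report the migration as failed
/// right away instead of leaving it stuck. A migration that has already
/// completed or failed keeps its state.
pub fn migration_state_after_vmm_update(
    current: MigrationState,
    vmm: VmmState,
) -> MigrationState {
    let vmm_gone = matches!(vmm, VmmState::Failed | VmmState::Destroyed);
    match current {
        MigrationState::Pending | MigrationState::InProgress if vmm_gone => {
            MigrationState::Failed
        }
        other => other,
    }
}
```

With a rule like this, a Destroyed source VMM is no longer ambiguous: if its side of the migration reads Completed, the migration succeeded before the VMM went away, and if it reads Failed, the VMM disappeared first.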

nexus/db-queries/src/db/queries/instance.rs

nexus/db-queries/src/db/datastore/migration.rs

nexus/src/app/instance.rs

nexus/src/app/sagas/instance_migrate.rs

sled-agent/src/sim/instance.rs

hawkw · 2024-06-10T18:41:51Z

Thanks @gjcolombo and @smklein for the very thorough reviews, I really appreciate it! 🖤

gjcolombo

Thanks for going another lap on this!

sled-agent/src/common/instance.rs

nexus/db-queries/src/db/datastore/migration.rs


hawkw added 3 commits June 5, 2024 13:51

add migration table

370b73c

add migration state to nexus API

d017eb0

okay i really hopep the CTE works

9362651

hawkw force-pushed the eliza/migration-table branch from ef871e2 to 9362651 June 5, 2024 20:51

hawkw added 2 commits June 5, 2024 14:45

unbreak CTE, et cetera

e6e5e14

fix migration name

c8a0c7c

gjcolombo reviewed Jun 5, 2024

schema/crdb/dbinit.sql (outdated, resolved)

hawkw added 10 commits June 5, 2024 15:41

add generation numbers

00c2c93

regen openapi again

7f95255

add timestamps

6304817

actually create and delete migration records

d57060f

sled-agent api plumbing

77e3080

fix CTE syntax error

6b63d22

add test that migration state updates

7f3c7fa

clippy

45d7844

make simulated sled agent complete src migration

c6311fb

do the actual correct simulated migration

43254ad

hawkw changed the title ~~WIP: migration table~~ add miigration table and explicit migration tracking in sled-agent Jun 7, 2024

hawkw changed the title ~~add miigration table and explicit migration tracking in sled-agent~~ add migration table and explicit migration tracking in sled-agent Jun 7, 2024

hawkw marked this pull request as ready for review June 7, 2024 18:10

Merge branch 'main' into eliza/migration-table

f649f82

hawkw requested a review from gjcolombo June 7, 2024 18:11

hawkw commented Jun 7, 2024

common/src/api/internal/nexus.rs (outdated, resolved)

common/src/api/internal/nexus.rs (resolved)

hawkw added 4 commits June 7, 2024 11:25

Update common/src/api/internal/nexus.rs

fd5e49c

add pending state

631ba0f

fix sled-agent sim panicking when unsetting migration

1aecc88

fix unexpected null

270368f

hawkw requested a review from smklein June 10, 2024 17:11

smklein reviewed Jun 10, 2024

gjcolombo reviewed Jun 10, 2024

hawkw added 9 commits June 10, 2024 14:25

fix places where i got source/target backwards

47cfd55

misc. @gjcolombo review feedback

f399eb1

@smklein's review feedback

8e75658

update tests to expect migration states

762febc

update tests, start out in pending

7634307

don't generate spurious state transitions

e4f7c6b

delete records when a migration completes

9d0de2b

report migration failure if an in-progress vmm vanishes

f6c9875

update comment

77a6dfe

hawkw requested review from gjcolombo and smklein June 11, 2024 20:18

hawkw added 4 commits June 11, 2024 15:42

fix comment

29eae36

rm code from other branch

18ccc7e

oops

00ad191

Merge branch 'main' into eliza/migration-table

4e3e4f2

gjcolombo approved these changes Jun 12, 2024

Update sled-agent/src/common/instance.rs

a1ca946

hawkw enabled auto-merge (squash) June 12, 2024 17:42

hawkw merged commit 0a0db97 into main Jun 12, 2024
20 checks passed

hawkw deleted the eliza/migration-table branch June 12, 2024 18:59
