Add a background task for update plan execution #4891

andrewjstone · 2024-01-24T21:22:21Z

This PR is the first step in creating a background task that is capable of taking a Blueprint and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the related OmicronZonesConfig to the appropriate sled-agents for newly added sleds was created.

A test is included that shows how a hypothetical planner for an add-sled workflow will deploy Omicron zones in a manner similar to RSS, where first the internal DNS zone is deployed and then the internal DNS and NTP zones are deployed. Deployment always contains all zones expected to be running on the sled-agent. Any zones running that are not included are expected to be shut down.

I still need to hook this into nexus/src/app/background/init.rs
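
The "full set" semantics described above (each deployment lists every zone that should be running, and anything not listed is shut down) can be illustrated with a minimal sketch. The types here are hypothetical stand-ins, not the real `OmicronZonesConfig` or sled-agent code, and zones are reduced to plain names for illustration:

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the complete set of zones a sled should run.
/// The real `OmicronZonesConfig` carries far more than zone names.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DesiredZones {
    pub zones: BTreeSet<String>,
}

/// The changes a sled would need to make to converge on the desired set.
#[derive(Debug, PartialEq, Eq)]
pub struct ZoneChanges {
    pub to_start: BTreeSet<String>,
    pub to_stop: BTreeSet<String>,
}

/// Compare the zones currently running against the full desired set.
/// Zones that are desired but not running are started; zones that are
/// running but not listed are stopped.
pub fn reconcile(running: &BTreeSet<String>, desired: &DesiredZones) -> ZoneChanges {
    ZoneChanges {
        to_start: desired.zones.difference(running).cloned().collect(),
        to_stop: running.difference(&desired.zones).cloned().collect(),
    }
}
```

Because the config is always the whole set rather than a delta, sending the same config twice yields no changes the second time, which is what makes it safe for a periodic background task to resend it.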


nexus/src/app/background/plan_execution.rs

andrewjstone · 2024-01-29T07:32:43Z

This is probably good enough to review. There are still 1-2 more automated tests that should be added: one for the background task that loads the target blueprint, and probably one for testing interaction between the two background tasks, similar to what's done with DNS. I would have already written those, but instead I have spent all weekend trying to get the whole thing running on the a4x2 testbed to see if we could rapidly iterate on testing e2e add-sled and run it in CI. I'm pretty close I think, but am stuck currently with some weird rack init setup errors related to lrtq that need more detailed debugging. I expect a few small PRs to come out of that effort as well.

davepacheco

Very nice!

nexus/src/app/background/blueprint_load.rs

nexus/src/app/background/blueprint_execution.rs

davepacheco · 2024-01-29T18:24:19Z

nexus/src/app/background/blueprint_execution.rs

+    }
+
+    #[nexus_test(server = crate::Server)]
+    async fn test_deploy_omicron_zones(cptestctx: &ControlPlaneTestContext) {


Nice test!

I think we could also/instead have one that uses an actual simulated sled agent and fetches the inventory back to verify it. I'm not sure there'd be much advantage here, except that eventually we'll probably want to make sure that works so we can test higher-level stuff. (I wouldn't really worry about this now, just mentioning it.)

I particularly didn't do that, because I just wanted to test the behavior of the background task. My hope is that I can get the whole thing running in the a4x2 job and we can use the real sled-agent in CI.

davepacheco · 2024-01-29T18:26:04Z

nexus/src/app/background/blueprint_execution.rs

+/// the state of the system based on the `Blueprint`.
+pub struct BlueprintExecutor {
+    datastore: Arc<DataStore>,
+    rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>,


I think it's pretty important that we have a way to turn this thing off at runtime if we find the system is doing something harmful. I believe we already have an enabled bool on the BlueprintTarget for this purpose. So I think maybe this thing should accept (BlueprintTarget, BlueprintTarget) (or something similar, instead of just Blueprint). Then this task could just do nothing if !target.enabled.

(I assume the suggestion of (BlueprintTarget, BlueprintTarget) is a typo and should've been (Blueprint, BlueprintTarget).)

Just thinking out loud here, this may be dumb: Should we squish BlueprintTarget into Blueprint somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing from BlueprintTarget that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:

Add an enabled field to Blueprint (I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense)

Add a blueprint field to BlueprintTarget

Add a struct that has Blueprint and BlueprintTarget as fields

Add EnabledBlueprint / DisabledBlueprint newtypes around Blueprint, which can be created from a BlueprintTarget and its Blueprint

> (I assume the suggestion of (BlueprintTarget, BlueprintTarget) is a typo and should've been (Blueprint, BlueprintTarget).)

I also assumed this, but reversed the types in the tuple ;)

> Just thinking out loud here, this may be dumb: Should we squish BlueprintTarget into Blueprint somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing from BlueprintTarget that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:
>
> Add an enabled field to Blueprint (I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense)
>
> Add a blueprint field to BlueprintTarget
>
> Add a struct that has Blueprint and BlueprintTarget as fields
>
> Add EnabledBlueprint / DisabledBlueprint newtypes around Blueprint, which can be created from a BlueprintTarget and its Blueprint

I think dealing with tuples is a pain in the ass, but I don't like the idea of squishing these in the DB. Ideally the Blueprint is immutable, and the pointer to it (the target) is mutable. This would change that. The newtype idea could work though.

Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like

enum CurrentTargetBlueprint { Enabled(Blueprint), Disabled(Blueprint), }

?

Hm, this loses the extra debugging info that's present in BlueprintTarget. Maybe I'm now leaning toward squishing the tuple into a struct just to give it a name?

struct CurrentTargetBlueprint { blueprint: Blueprint, metadata: BlueprintTarget, }

or something.

Deleted my last comment, because it made no sense. lol.

I implemented this using the current tuple version in 9d95fd8

I'm happy to change this to use a new type if we agree, although it can probably wait for a follow-up.

I'm cool with waiting, and if we don't like the tuple (reasonable!) I'd vote for the named struct that combines this stuff (#4891 (comment)). I'd also be cool with flattening the target fields into the struct (so it'd be like blueprint, then enabled, etc.).
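
For reference, the named-struct option discussed in this thread could look roughly like the sketch below. This is not the code merged in this PR (which uses the tuple, per 9d95fd8). `Blueprint` and `BlueprintTarget` are hypothetical, heavily reduced stand-ins for the real types, and `ExecutionDecision` and `decide` are names invented here purely to show the "do nothing if `!target.enabled`" check:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `Blueprint` (immutable once created).
#[derive(Debug)]
pub struct Blueprint {
    pub id: u64,
}

/// Hypothetical stand-in for `BlueprintTarget`, the mutable pointer to a
/// blueprint. Only the `enabled` flag from the discussion is modeled here.
#[derive(Debug, Clone)]
pub struct BlueprintTarget {
    pub enabled: bool,
}

/// Names the (blueprint, target) pair instead of passing a bare tuple.
#[derive(Debug, Clone)]
pub struct CurrentTargetBlueprint {
    pub blueprint: Arc<Blueprint>,
    pub metadata: BlueprintTarget,
}

/// Invented for this sketch: what the executor would do on one activation.
#[derive(Debug, PartialEq, Eq)]
pub enum ExecutionDecision {
    NoTarget,
    Disabled { blueprint_id: u64 },
    Execute { blueprint_id: u64 },
}

/// Decide whether to act on the current target. A disabled target is a
/// no-op, which gives operators a runtime off switch.
pub fn decide(current: Option<&CurrentTargetBlueprint>) -> ExecutionDecision {
    match current {
        None => ExecutionDecision::NoTarget,
        Some(t) if !t.metadata.enabled => {
            ExecutionDecision::Disabled { blueprint_id: t.blueprint.id }
        }
        Some(t) => ExecutionDecision::Execute { blueprint_id: t.blueprint.id },
    }
}
```

Keeping `metadata` as a separate field preserves the debugging information in `BlueprintTarget` that the enum variant of this idea would lose, while leaving the blueprint itself untouched.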

jgallagher · 2024-01-30T16:34:37Z

nexus/src/app/background/blueprint_load.rs

+
+    /// Expose the target blueprint
+    pub fn watcher(&self) -> watch::Receiver<Option<Arc<Blueprint>>> {
+        self.rx.clone()


Tiny nit - I think we could use self.tx.subscribe() here and drop self.rx entirely (unless somewhere else we're assuming there's always at least one receiver, which I'm not seeing at a glance)

Good call! Fixed in 769cda3

jgallagher · 2024-01-30T16:38:29Z

nexus/src/app/background/blueprint_load.rs

+                            log,
+                            "found new target blueprint";
+                            "target_id" => &target_id,
+                            "time_created" => &time_created


Tiny nit - I'd maybe prefix these as new_target_id and new_time_created to contrast with the current_* properties set on the logger when we created it above.

I dislike prefixing all the properties. What if I changed current_* to original_* and left everything else as is?

I used the original prefix in aab1bb0

jgallagher · 2024-01-30T16:39:38Z

nexus/src/app/background/blueprint_load.rs

+                        json!({
+                            "target_id": target_id,
+                            "time_created": time_created
+                        })


Happy to defer to y'all on saga output, but would it make sense to have some kind of serializable status enum that both this crate and omdb could use? It would round-trip through serde_json but would address the immediate issue of all the different fields.

jgallagher · 2024-01-30T16:41:05Z

nexus/src/app/background/blueprint_execution.rs

+/// the state of the system based on the `Blueprint`.
+pub struct BlueprintExecutor {
+    datastore: Arc<DataStore>,
+    rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>,


Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like

enum CurrentTargetBlueprint { Enabled(Blueprint), Disabled(Blueprint), }

?

andrewjstone · 2024-01-30T23:03:08Z

@davepacheco @jgallagher
I believe I've resolved all your comments, minus the omdb changes. I also added another test. This is ready for another round of review.

davepacheco · 2024-01-31T18:27:05Z

I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)

I can't tell if you have another problem here which is that blueprint_loader ran twice instead of once while the test was running, which spuriously affected the output. I think the easiest way to deal with this would be to tune the period up in the test suite's config file (which I think is separate from the other ones, so that's hopefully easy).

andrewjstone · 2024-01-31T18:38:03Z

> I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)

I actually had those tasks listed and removed them because I was getting that error. It's possible I used the wrong names though. I'll try again.

> I can't tell if you have another problem here which is that blueprint_loader ran twice instead of once while the test was running, which spuriously affected the output. I think the easiest way to deal with this would be to tune the period up in the test suite's config file (which I think is separate from the other ones, so that's hopefully easy).

I will do that! Thanks.

davepacheco · 2024-01-31T18:47:36Z

> > I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)
>
> I actually had those tasks listed and removed them because I was getting that error. It's possible I used the wrong names though. I'll try again.

Ah, I didn't look closely enough at the omdb code. I think it does print the unknown tasks in alphabetical order. So I'm guessing what happened here is the checked-in expectorate output is from a run where they were in that list in the other order.
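
The ordering behavior described here (tasks in a known list printed in that list's order, everything else printed alphabetically) can be sketched as follows. This is an illustration only, not omdb's actual code. The function name `display_order` and its signature are assumptions made for this example:

```rust
/// Order task names for display: tasks in `known_order` come first, in
/// that fixed order; any other task present is appended alphabetically.
pub fn display_order(known_order: &[&str], present: &[&str]) -> Vec<String> {
    // Known tasks keep the order of the hard-coded list.
    let mut out: Vec<String> = known_order
        .iter()
        .copied()
        .filter(|k| present.contains(k))
        .map(|k| k.to_string())
        .collect();

    // Unknown tasks are sorted so the output is deterministic.
    let mut unknown: Vec<String> = present
        .iter()
        .copied()
        .filter(|p| !known_order.contains(p))
        .map(|p| p.to_string())
        .collect();
    unknown.sort();
    unknown.dedup();

    out.extend(unknown);
    out
}
```

Under this scheme the output is deterministic either way, but it differs depending on whether a task is in the known list, which is consistent with checked-in expected output going stale when tasks are added to or removed from that list.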

andrewjstone mentioned this pull request Jan 24, 2024

Add a background task for update plan execution #4732

Closed

jgallagher reviewed Jan 25, 2024


andrewjstone added 5 commits January 25, 2024 20:08

Add blueprint loader background task

0eced33

Add blueprint related bg task init

2b2ad63

Merge branch 'main' into ajs/plan-execution-bg-task

864bca7

Use DB backed blueprints from #4899

24b34f7

Add BlueprintTasksConfig

aa88e65

andrewjstone changed the title ~~WIP: Add a background task for update plan execution~~ Add a background task for update plan execution Jan 29, 2024

andrewjstone marked this pull request as ready for review January 29, 2024 07:27

davepacheco reviewed Jan 29, 2024


davepacheco mentioned this pull request Jan 29, 2024

planner should wait on NTP zones before adding more zones #4924

Merged

andrewjstone added 4 commits January 30, 2024 14:32

Merge branch 'main' into ajs/plan-execution-bg-task

16cd3c8

test config fix

80dfeea

some review fixes

596c9be

some more review fixes

5d72471

jgallagher reviewed Jan 30, 2024


andrewjstone added 7 commits January 30, 2024 17:05

Only execute enabled target blueprint

9d95fd8

fix omdb tests

bbc1318

nit fix

769cda3

nit fix

aab1bb0

Add a test for loading blueprints

ea5a8e0

clippy

61c3a00

remove unnecessary changes to omdb

57641df

clippy

dcf982d

davepacheco assigned andrewjstone Jan 31, 2024

davepacheco approved these changes Jan 31, 2024


andrewjstone added 2 commits January 31, 2024 19:12

fix tests

5661c86

Merge branch 'main' into ajs/plan-execution-bg-task

9d1e6ac

andrewjstone enabled auto-merge (squash) February 1, 2024 16:21

andrewjstone merged commit e72625c into main Feb 1, 2024
20 checks passed

andrewjstone deleted the ajs/plan-execution-bg-task branch February 1, 2024 17:01
