Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a background task for update plan execution #4891

Merged
merged 20 commits into from
Feb 1, 2024

Conversation

andrewjstone
Copy link
Contributor

This PR is the first step in creating a background task that is capable of taking a Blueprint and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the related OmicronZonesConfig to the appropriate sled-agents for newly added sleds was created.

A test is included that shows how a hypothetical planner for an add-sled workflow will deploy Omicron zones in a manner similar to RSS, where first the internal DNS zone is deployed and then the internal DNS and NTP zones are deployed. Deployment alwyas contains all zones expected to be running on the sled-agent. Any zones running that are not included are expected to be shut down.

I still need to hook this into nexus/src/app/background/init.rs

This PR is the first step in creating a background task that is
capable of taking a `Blueprint` and then reifying that blueprint into
deployed or updated software. This PR uses the initial version of a
Blueprint introduced in #4804. A basic executor that sends the related
`OmicronZonesConfig` to the appropriate sled-agents for newly added
sleds was created.

A test is included that shows how a hypothetical planner for an
`add-sled` workflow will deploy Omicron zones in a manner similar to
RSS, where first the internal DNS zone is deployed and then the internal
DNS and NTP zones are deployed. Deployment alwyas contains all zones
expected to be running on the sled-agent. Any zones running that are not
included are expected to be shut down.
nexus/src/app/background/plan_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/plan_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/plan_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/plan_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/plan_execution.rs Outdated Show resolved Hide resolved
@andrewjstone andrewjstone changed the title WIP: Add a background task for update plan execution Add a background task for update plan execution Jan 29, 2024
@andrewjstone andrewjstone marked this pull request as ready for review January 29, 2024 07:27
@andrewjstone
Copy link
Contributor Author

andrewjstone commented Jan 29, 2024

This is probably good enough to review. There are still 1-2 more automated tests that should be added: one for the background task that loads the target blueprint, and probably one for testing interaction between the two background tasks, similar to what's done with DNS. I would have already written those, but instead I have spent all weekend trying to get the whole thing running on the a4x2 testbed to see if we could rapidly iterate on testing e2e add-sled and run it in CI. I'm pretty close I think, but am stuck currently with some weird rack init setup errors related to lrtq that need more detailed debugging. I expect a few small PRs to come out of that effort as well.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

nexus/src/app/background/blueprint_load.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_load.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_load.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_load.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_load.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_execution.rs Outdated Show resolved Hide resolved
nexus/src/app/background/blueprint_execution.rs Outdated Show resolved Hide resolved
}

#[nexus_test(server = crate::Server)]
async fn test_deploy_omicron_zones(cptestctx: &ControlPlaneTestContext) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test!

I think we could also/instead have one that uses an actual simulated sled agent and fetches the inventory back to verify it. I'm not sure there'd be much advantage here, except that eventually we'll probably want to make sure that works so we can test higher-level stuff. (I wouldn't really worry about this now, just mentioning it.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I particularly didn't do that, because I just wanted to test the behavior of the background task. My hope is that I can get the whole thing running in the a4x2 job and we can use the real sled-agent in CI.

/// the state of the system based on the `Blueprint`.
pub struct BlueprintExecutor {
datastore: Arc<DataStore>,
rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty important that we have a way to turn this thing off at runtime if we find the system is doing something harmful. I believe we already have an enabled bool on the BlueprintTarget for this purpose. So I think maybe this thing should accept (BlueprintTarget, BlueprintTarget) (or something similar, instead of just Blueprint). Then this task could just do nothing if !target.enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I assume the suggestion of (BlueprintTarget, BlueprintTarget) is a typo and should've been (Blueprint, BlueprintTarget).)

Just thinking out loud here, this may dumb: Should we squish BlueprintTarget into Blueprint somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing from BlueprintTarget that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:

  • Add an enabled field to Blueprint (I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense)
  • Add a blueprint field to BlueprintTarget
  • Add a struct that has Blueprint and BlueprintTarget as fields
  • Add EnabledBlueprint / DisabledBlueprint newtypes around Blueprint, which can be created from a BlueprintTarget and its Blueprint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I assume the suggestion of (BlueprintTarget, BlueprintTarget) is a typo and should've been (Blueprint, BlueprintTarget).)

I also assumed this, but reversed the types in the tuple ;)

Just thinking out loud here, this may dumb: Should we squish BlueprintTarget into Blueprint somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing from BlueprintTarget that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:

  • Add an enabled field to Blueprint (I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense)
  • Add a blueprint field to BlueprintTarget
  • Add a struct that has Blueprint and BlueprintTarget as fields
  • Add EnabledBlueprint / DisabledBlueprint newtypes around Blueprint, which can be created from a BlueprintTarget and its Blueprint

I think dealing with tuples is a pain in the ass, but I don't like the idea of squishing these in the DB. Ideally the Blueprint is immutable, and the pointer to it (the target) is mutable. This would change that. The newtype idea could work though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like

enum CurrentTargetBlueprint {
    Enabled(Blueprint),
    Disabled(Blueprint),
}

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this loses the extra debugging info that's present in BlueprintTarget. Maybe I'm now leaning toward squishing the tuple into a struct just to give it a name?

struct CurrentTargetBlueprint {
    blueprint: Blueprint,
    metadata: BlueprintTarget,
}

or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted my last comment, because it made no sense. lol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this using the current tuple version in 9d95fd8

I'm happy to change to use to a new type if we agree, although it can probably wait for a follow up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm cool with waiting, and if we don't like the tuple (reasonable!) I'd vote for the named struct that combines this stuff (#4891 (comment)). I'd also be cool with flattening the target fields into the struct (so it'd be like blueprint, then enabled, etc.).


/// Expose the target blueprint
pub fn watcher(&self) -> watch::Receiver<Option<Arc<Blueprint>>> {
self.rx.clone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit - I think we could use self.tx.subscribe() here and drop self.rx entirely (unless somewhere else we're assuming there's always at least one receiver, which I'm not seeing at a glance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Fixed in 769cda3

log,
"found new target blueprint";
"target_id" => &target_id,
"time_created" => &time_created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit - I'd maybe prefix these as new_target_id and new_time_created to contrast with the current_* properties set on the logger when we created it above.

Copy link
Contributor Author

@andrewjstone andrewjstone Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike prefixing all the properties. What if I changed current_* to original_* and left everything else as is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the original prefix in aab1bb0

json!({
"target_id": target_id,
"time_created": time_created
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to defer to y'all on saga output, but would it make sense to have some kind of serializable status enum that both this crate and omdb could use? It would round-trip through serde_json but would address the immediate issue of all the different fields.

/// the state of the system based on the `Blueprint`.
pub struct BlueprintExecutor {
datastore: Arc<DataStore>,
rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like

enum CurrentTargetBlueprint {
    Enabled(Blueprint),
    Disabled(Blueprint),
}

?

@andrewjstone
Copy link
Contributor Author

@davepacheco @jgallagher
I believe I've resolved all your comments, minus the omdb changes. I also added another test. This is ready for another round of review.

@davepacheco
Copy link
Collaborator

I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)

I can't tell if you have another problem here which is that blueprint_loader ran twice instead of once while the test was running, which spuriously affected the output. I think the easiest way to deal with this would be to tune the period up in the test suite's config file (which I think is separate from the other ones, so that's hopefully easy).

@andrewjstone
Copy link
Contributor Author

I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)

I actually had those tasks listed and removed them because I was getting that error. It's possible I used the wrong names though. I'll try agian.

I can't tell if you have another problem here which is that blueprint_loader ran twice instead of once while the test was running, which spuriously affected the output. I think the easiest way to deal with this would be to tune the period up in the test suite's config file (which I think is separate from the other ones, so that's hopefully easy).

I will do that! Thanks.

@davepacheco
Copy link
Collaborator

I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.)

I actually had those tasks listed and removed them because I was getting that error. It's possible I used the wrong names though. I'll try agian.

Ah, I didn't look closely enough at the omdb code. I think it does print the unknown tasks in alphabetical order. So I'm guessing what happened here is the checked-in expectorate output is from a run where they were in that list in the other order.

@andrewjstone andrewjstone enabled auto-merge (squash) February 1, 2024 16:21
@andrewjstone andrewjstone merged commit e72625c into main Feb 1, 2024
20 checks passed
@andrewjstone andrewjstone deleted the ajs/plan-execution-bg-task branch February 1, 2024 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants