-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a background task for update plan execution #4891
Conversation
This PR is the first step in creating a background task that is capable of taking a `Blueprint` and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the related `OmicronZonesConfig` to the appropriate sled-agents for newly added sleds was created. A test is included that shows how a hypothetical planner for an `add-sled` workflow will deploy Omicron zones in a manner similar to RSS, where first the internal DNS zone is deployed and then the internal DNS and NTP zones are deployed. Deployment alwyas contains all zones expected to be running on the sled-agent. Any zones running that are not included are expected to be shut down.
This is probably good enough to review. There are still 1-2 more automated tests that should be added: one for the background task that loads the target blueprint, and probably one for testing interaction between the two background tasks, similar to what's done with DNS. I would have already written those, but instead I have spent all weekend trying to get the whole thing running on the a4x2 testbed to see if we could rapidly iterate on testing e2e add-sled and run it in CI. I'm pretty close I think, but am stuck currently with some weird rack init setup errors related to lrtq that need more detailed debugging. I expect a few small PRs to come out of that effort as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
} | ||
|
||
#[nexus_test(server = crate::Server)] | ||
async fn test_deploy_omicron_zones(cptestctx: &ControlPlaneTestContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test!
I think we could also/instead have one that uses an actual simulated sled agent and fetches the inventory back to verify it. I'm not sure there'd be much advantage here, except that eventually we'll probably want to make sure that works so we can test higher-level stuff. (I wouldn't really worry about this now, just mentioning it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I particularly didn't do that, because I just wanted to test the behavior of the background task. My hope is that I can get the whole thing running in the a4x2 job and we can use the real sled-agent in CI.
/// the state of the system based on the `Blueprint`. | ||
pub struct BlueprintExecutor { | ||
datastore: Arc<DataStore>, | ||
rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's pretty important that we have a way to turn this thing off at runtime if we find the system is doing something harmful. I believe we already have an enabled
bool on the BlueprintTarget for this purpose. So I think maybe this thing should accept (BlueprintTarget, BlueprintTarget)
(or something similar, instead of just Blueprint
). Then this task could just do nothing if !target.enabled
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I assume the suggestion of (BlueprintTarget, BlueprintTarget)
is a typo and should've been (Blueprint, BlueprintTarget)
.)
Just thinking out loud here, this may dumb: Should we squish BlueprintTarget
into Blueprint
somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing from BlueprintTarget
that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:
- Add an
enabled
field toBlueprint
(I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense) - Add a
blueprint
field toBlueprintTarget
- Add a struct that has
Blueprint
andBlueprintTarget
as fields - Add
EnabledBlueprint
/DisabledBlueprint
newtypes aroundBlueprint
, which can be created from aBlueprintTarget
and itsBlueprint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I assume the suggestion of
(BlueprintTarget, BlueprintTarget)
is a typo and should've been(Blueprint, BlueprintTarget)
.)
I also assumed this, but reversed the types in the tuple ;)
Just thinking out loud here, this may dumb: Should we squish
BlueprintTarget
intoBlueprint
somehow, or possibly add a new type for that combination? The db method for reading the current target also returns the 2-tuple, which felt a little awkward. I think the only thing fromBlueprintTarget
that anything other than a debugging human will care about is whether it's enabled, right? Some options in no particular order:
- Add an
enabled
field toBlueprint
(I think this doesn't make sense, because what does this field mean if the blueprint isn't the target? but maybe there's some way to make it make sense)- Add a
blueprint
field toBlueprintTarget
- Add a struct that has
Blueprint
andBlueprintTarget
as fields- Add
EnabledBlueprint
/DisabledBlueprint
newtypes aroundBlueprint
, which can be created from aBlueprintTarget
and itsBlueprint
I think dealing with tuples is a pain in the ass, but I don't like the idea of squishing these in the DB. Ideally the Blueprint is immutable, and the pointer to it (the target) is mutable. This would change that. The newtype idea could work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like
enum CurrentTargetBlueprint {
Enabled(Blueprint),
Disabled(Blueprint),
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this loses the extra debugging info that's present in BlueprintTarget
. Maybe I'm now leaning toward squishing the tuple into a struct just to give it a name?
struct CurrentTargetBlueprint {
blueprint: Blueprint,
metadata: BlueprintTarget,
}
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted my last comment, because it made no sense. lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this using the current tuple version in 9d95fd8
I'm happy to change to use to a new type if we agree, although it can probably wait for a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm cool with waiting, and if we don't like the tuple (reasonable!) I'd vote for the named struct that combines this stuff (#4891 (comment)). I'd also be cool with flattening the target fields into the struct (so it'd be like blueprint
, then enabled
, etc.).
|
||
/// Expose the target blueprint | ||
pub fn watcher(&self) -> watch::Receiver<Option<Arc<Blueprint>>> { | ||
self.rx.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit - I think we could use self.tx.subscribe()
here and drop self.rx
entirely (unless somewhere else we're assuming there's always at least one receiver, which I'm not seeing at a glance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! Fixed in 769cda3
log, | ||
"found new target blueprint"; | ||
"target_id" => &target_id, | ||
"time_created" => &time_created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit - I'd maybe prefix these as new_target_id
and new_time_created
to contrast with the current_*
properties set on the logger when we created it above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike prefixing all the properties. What if I changed current_*
to original_*
and left everything else as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the original
prefix in aab1bb0
json!({ | ||
"target_id": target_id, | ||
"time_created": time_created | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to defer to y'all on saga output, but would it make sense to have some kind of serializable status enum that both this crate and omdb could use? It would round-trip through serde_json
but would address the immediate issue of all the different fields.
/// the state of the system based on the `Blueprint`. | ||
pub struct BlueprintExecutor { | ||
datastore: Arc<DataStore>, | ||
rx_blueprint: watch::Receiver<Option<Arc<Blueprint>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't mean squishing in the db itself, just in the types / return values from the db query. But yeah I think the newtype is maybe my preference, although maybe it should be an enum? Something like
enum CurrentTargetBlueprint {
Enabled(Blueprint),
Disabled(Blueprint),
}
?
@davepacheco @jgallagher |
I noticed about the test failure: this one seems to be because your new tasks get printed in non-deterministic order. Adding your tasks to this list should fix this. (omdb could probably print the tasks it doesn't know about in alphabetical order to avoid having to do this. Sorry.) I can't tell if you have another problem here which is that |
I actually had those tasks listed and removed them because I was getting that error. It's possible I used the wrong names though. I'll try agian.
I will do that! Thanks. |
Ah, I didn't look closely enough at the omdb code. I think it does print the unknown tasks in alphabetical order. So I'm guessing what happened here is the checked-in expectorate output is from a run where they were in that list in the other order. |
This PR is the first step in creating a background task that is capable of taking a
Blueprint
and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the relatedOmicronZonesConfig
to the appropriate sled-agents for newly added sleds was created.A test is included that shows how a hypothetical planner for an
add-sled
workflow will deploy Omicron zones in a manner similar to RSS, where first the internal DNS zone is deployed and then the internal DNS and NTP zones are deployed. Deployment alwyas contains all zones expected to be running on the sled-agent. Any zones running that are not included are expected to be shut down.I still need to hook this into
nexus/src/app/background/init.rs