-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
planner should wait on NTP zones before adding more zones #4924
Conversation
…adding more zones
// be there before we can proceed to add anything else. Otherwise, | ||
// we may wind up trying to provision this zone at the same time as | ||
// other zones, and Sled Agent will reject requests to provision | ||
// other zones before the clock is synchronized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is exactly what we want to check. During RSS we wait not only for the NTP zone to exist, but for time to be synchronized - I assume that's what the subsequent service zones really care about. Should inventory collect the timesync state of each sled, so here we can wait for timesync.sync
instead of the NTP zone's presence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we discussed a few weeks ago (not sure where) and opted to do #4778 to handle that. So the idea is: we provision the NTP zone first, then once it's there, we're free to provision other zones knowing that sled agent will report a (retryable) error for that request until time is sync'd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Right, completely forgot about that.
// general criteria we'll want to use in picking a collection here. But | ||
// as of this writing, this is only used for one specific case, which is | ||
// to implement a gate that prevents the planner from provisioning | ||
// non-NTP zones on a sled unless we know there's an NTP zone already on | ||
// that sled. For that purpose, it's okay if this collection is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we soon going to have other uses for the latest inventory in the planner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question. I am not sure yet. We mostly don't need it for zones (aside from this case) because we can always tell sled agents exactly what zones to run and it will just do nothing if it's already running those zones. We might eventually want the inventory when we have operations that don't work that way, where it's expensive to do them and you don't want to re-do them if you don't have to. The MGS-managed updates (like SP update, host OS flash update) come to mind, where you want to avoid doing the update if you know it's already at the right version. We may instead want to update slot 0, wait for the inventory to reflect the new version in slot 0, then update slot 1, etc. But at this point I think that use case is pretty far away.
if !has_ntp_inventory { | ||
info!( | ||
&self.log, | ||
"parent blueprint contains NTP zone, but it's not in \ | ||
inventory yet"; | ||
"sled_id" => ?sled_id, | ||
); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thought that occurred to me: is there some place in the blueprint, or elsewhere, that we should be storing these decisions and the reasons for them? Just having them in logs seems hard to diagnose on in the future. I know eventually we want to build a full-fledged event log for operators, but I think an incremental step where we just store things like "here are all the places we skipped making a decision, and here's why" would be pretty useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. One thought is that we could have an enum of different events ("event" isn't quite the right word for it) -- basically one for each branch or decision made. Then we could store a list of these somewhere. So you might have [ FoundNtpBlueprint, FoundNtpInventory, ProvisionedCrucible ]
. But I'm not sure where would make sense for it. We could stick an array in the Blueprint itself, or use a separate table, or assume that we can pull this out of the log...other ideas? I think we should be sure that this is only for debugging. I don't think we want programmatic consumers looking at this to try to pick apart what changes were part of a blueprint, despite it being tempting to do that.
Tangentially: I kind of expect that in the long run, we'll wind up breaking the planner into more orthogonal steps. Originally I'd envisioned a sequence of trait objects that provide something like a plan(previous: &Blueprint, next: &mut BlueprintBuilder)
method, and the (generic) planner would just run them in sequence and each one would make whatever modifications it wanted. If we do wind up doing something like that, these methods could return a description of what choice(s) they made. I'm not convinced we've got enough concrete impls here yet for this to be worthwhile though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I've spent some more time in this code, it looks pretty good to me!
(This is not urgent and I'm planning to wait for #4891 before landing this.)
The planner needs to wait for NTP zones to show up in inventory before provisioning other zones. This change teaches adds an inventory collection reference to the planner and uses it to check this one thing. This does make things a little more annoying to test because you need to fake up inventory too but I believe it's necessary. I have some ideas for improving testing here in a follow on PR.