
planner should wait on NTP zones before adding more zones #4924

Merged · 3 commits · Feb 1, 2024

Conversation

davepacheco (Collaborator) commented Jan 29, 2024

(This is not urgent and I'm planning to wait for #4891 before landing this.)

The planner needs to wait for NTP zones to show up in inventory before provisioning other zones. This change adds an inventory collection reference to the planner and uses it to check this one thing. It does make things a little more annoying to test, because you have to fake up inventory too, but I believe it's necessary. I have some ideas for improving testing here in a follow-on PR.
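To make the gate concrete, here is a minimal sketch of the check described above; every type and name in it (`SledId`, `Inventory`, `sleds_ready_for_more_zones`) is an illustrative stand-in, not the PR's actual code:

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for the real omicron types; SledId is really a
// UUID newtype, but a plain integer keeps the sketch short.
type SledId = u32;

struct Inventory {
    sleds_with_ntp_zone: BTreeSet<SledId>,
}

impl Inventory {
    /// True if a collected inventory shows an NTP zone running on this sled.
    fn has_ntp_zone(&self, sled_id: SledId) -> bool {
        self.sleds_with_ntp_zone.contains(&sled_id)
    }
}

/// The gate: of the sleds we'd like to add zones to, keep only those whose
/// NTP zone has already appeared in inventory. The rest are skipped until a
/// later planning pass observes their NTP zone.
fn sleds_ready_for_more_zones(
    candidate_sleds: &[SledId],
    inventory: &Inventory,
) -> Vec<SledId> {
    candidate_sleds
        .iter()
        .copied()
        .filter(|&sled_id| inventory.has_ntp_zone(sled_id))
        .collect()
}
```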

A review thread on this code comment in the diff:

```rust
// be there before we can proceed to add anything else. Otherwise,
// we may wind up trying to provision this zone at the same time as
// other zones, and Sled Agent will reject requests to provision
// other zones before the clock is synchronized.
```
Contributor commented:

I'm not sure this is exactly what we want to check. During RSS we wait not only for the NTP zone to exist, but for time to be synchronized - I assume that's what the subsequent service zones really care about. Should inventory collect the timesync state of each sled, so here we can wait for timesync.sync instead of the NTP zone's presence?

davepacheco (Collaborator, Author) commented:

Yeah, we discussed this a few weeks ago (not sure where) and opted to do #4778 to handle that. So the idea is: we provision the NTP zone first; then, once it's there, we're free to provision other zones, knowing that sled agent will report a (retryable) error for those requests until time is synced.
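Schematically, the contract on the provisioning side looks something like the sketch below; the error type and function here are invented for illustration and are not the real sled agent API:

```rust
use std::{thread, time::Duration};

// Invented error type for illustration; the real sled agent errors differ.
#[derive(Debug)]
enum ProvisionError {
    /// Rejected until the sled's clock is synchronized; safe to retry.
    TimeNotSynchronized,
    Other(String),
}

/// Keep retrying a provision request as long as the only failure is the
/// retryable "clock not synchronized yet" rejection.
fn provision_with_retry(
    mut try_provision: impl FnMut() -> Result<(), ProvisionError>,
) -> Result<(), ProvisionError> {
    loop {
        match try_provision() {
            Ok(()) => return Ok(()),
            Err(ProvisionError::TimeNotSynchronized) => {
                // The NTP zone is up but time isn't synced; back off, retry.
                thread::sleep(Duration::from_secs(1));
            }
            Err(other) => return Err(other),
        }
    }
}
```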

Contributor commented:

Oh! Right, completely forgot about that.

Comment on lines +188 to +192
```rust
// general criteria we'll want to use in picking a collection here. But
// as of this writing, this is only used for one specific case, which is
// to implement a gate that prevents the planner from provisioning
// non-NTP zones on a sled unless we know there's an NTP zone already on
// that sled. For that purpose, it's okay if this collection is
```
Contributor commented:

Are we soon going to have other uses for the latest inventory in the planner?

davepacheco (Collaborator, Author) commented:

It's a good question; I'm not sure yet. We mostly don't need it for zones (aside from this case) because we can always tell sled agents exactly which zones to run, and they will just do nothing if they're already running those zones. We might eventually want the inventory for operations that don't work that way, where doing them is expensive and you don't want to redo them if you don't have to. The MGS-managed updates (like SP updates and host OS flash updates) come to mind: you want to avoid doing an update if you know the target is already at the right version. We may also want to update slot 0, wait for the inventory to reflect the new version in slot 0, then update slot 1, etc. But at this point I think that use case is pretty far away.
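As a rough illustration of that slot-sequencing idea, a hypothetical sketch (none of these types or functions exist in omicron as written):

```rust
// Hypothetical sketch of the staged-update idea above.
#[derive(Clone, Copy, PartialEq)]
struct Version(u64);

struct Update {
    slot: u8,
    target: Version,
}

trait InventoryView {
    /// Version the latest inventory collection reports for a boot slot.
    fn reported_version(&self, slot: u8) -> Version;
}

/// Plan at most one slot update per pass: don't touch slot 1 until
/// inventory confirms slot 0 is at the target version.
fn next_update(inv: &dyn InventoryView, target: Version) -> Option<Update> {
    for slot in [0u8, 1] {
        if inv.reported_version(slot) != target {
            return Some(Update { slot, target });
        }
    }
    None // both slots confirmed at target; the expensive update is skipped
}
```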

Comment on lines +108 to +115
```rust
if !has_ntp_inventory {
    info!(
        &self.log,
        "parent blueprint contains NTP zone, but it's not in \
         inventory yet";
        "sled_id" => ?sled_id,
    );
    continue;
```
sunshowers (Contributor) commented on Feb 1, 2024:

A thought that occurred to me: is there some place in the blueprint, or elsewhere, where we should be storing these decisions and the reasons for them? Having them only in logs seems hard to diagnose from in the future. I know eventually we want to build a full-fledged event log for operators, but I think an incremental step where we just store things like "here are all the places we skipped making a decision, and here's why" would be pretty useful.

davepacheco (Collaborator, Author) commented:

That's a good idea. One thought is that we could have an enum of different events ("event" isn't quite the right word for it) -- basically one for each branch or decision made. Then we could store a list of these somewhere, so you might end up with [FoundNtpBlueprint, FoundNtpInventory, ProvisionedCrucible]. I'm not sure where it would make sense to put it, though: we could stick an array in the Blueprint itself, use a separate table, or assume we can pull this out of the log. Other ideas? I think we should be sure this is only for debugging; I don't think we want programmatic consumers looking at it to try to pick apart what changes were part of a blueprint, tempting as that is.
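In sketch form, assuming the variant names floated above (nothing like this exists in the codebase yet):

```rust
type SledId = u32; // stand-in for the real sled identifier type

#[derive(Debug, Clone)]
enum PlannerDecision {
    /// Parent blueprint already has an NTP zone on this sled.
    FoundNtpBlueprint { sled_id: SledId },
    /// Inventory confirms the NTP zone is running on this sled.
    FoundNtpInventory { sled_id: SledId },
    /// A Crucible zone was added for this sled.
    ProvisionedCrucible { sled_id: SledId },
    /// A sled was skipped, and why -- the "places we skipped making a
    /// decision" mentioned above.
    SkippedSled { sled_id: SledId, reason: String },
}

/// Accumulated during a planning run and stored somewhere (in the
/// blueprint, a separate table, or just the log), purely for debugging.
struct PlanningReport {
    decisions: Vec<PlannerDecision>,
}
```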

Tangentially: I expect that in the long run we'll wind up breaking the planner into more orthogonal steps. Originally I'd envisioned a sequence of trait objects, each providing something like a plan(previous: &Blueprint, next: &mut BlueprintBuilder) method; the (generic) planner would run them in sequence, and each one would make whatever modifications it wanted. If we do end up doing something like that, these methods could return a description of the choice(s) they made. I'm not convinced we have enough concrete impls here yet for this to be worthwhile, though.
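A minimal sketch of that shape, reusing the hypothetical `PlannerDecision` from the previous sketch; `Blueprint` and `BlueprintBuilder` here are empty stand-ins for the real types:

```rust
struct Blueprint;
struct BlueprintBuilder;

trait PlanningStep {
    /// Apply this step's changes to the blueprint being built, returning a
    /// description of the choice(s) it made.
    fn plan(
        &self,
        previous: &Blueprint,
        next: &mut BlueprintBuilder,
    ) -> Vec<PlannerDecision>;
}

/// The generic planner just runs each step in sequence and collects the
/// decisions for the debug record discussed above.
fn run_planner(
    steps: &[Box<dyn PlanningStep>],
    previous: &Blueprint,
    next: &mut BlueprintBuilder,
) -> Vec<PlannerDecision> {
    let mut decisions = Vec::new();
    for step in steps {
        decisions.extend(step.plan(previous, next));
    }
    decisions
}
```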

sunshowers (Contributor) left a review comment:

Now that I've spent some more time in this code, it looks pretty good to me!

davepacheco enabled auto-merge (squash) on February 1, 2024, 17:17
davepacheco merged commit b88e966 into main on February 1, 2024
20 checks passed
davepacheco deleted the dap/planner-inventory branch on February 1, 2024, 19:31