planner should wait on NTP zones before adding more zones #4924

davepacheco · 2024-01-29T23:42:58Z

(This is not urgent and I'm planning to wait for #4891 before landing this.)

The planner needs to wait for NTP zones to show up in inventory before provisioning other zones. This change teaches adds an inventory collection reference to the planner and uses it to check this one thing. This does make things a little more annoying to test because you need to fake up inventory too but I believe it's necessary. I have some ideas for improving testing here in a follow on PR.

…adding more zones

jgallagher · 2024-01-30T19:01:20Z

nexus/deployment/src/planner.rs

+            // be there before we can proceed to add anything else.  Otherwise,
+            // we may wind up trying to provision this zone at the same time as
+            // other zones, and Sled Agent will reject requests to provision
+            // other zones before the clock is synchronized.


I'm not sure this is exactly what we want to check. During RSS we wait not only for the NTP zone to exist, but for time to be synchronized - I assume that's what the subsequent service zones really care about. Should inventory collect the timesync state of each sled, so here we can wait for timesync.sync instead of the NTP zone's presence?

Yeah, we discussed a few weeks ago (not sure where) and opted to do #4778 to handle that. So the idea is: we provision the NTP zone first, then once it's there, we're free to provision other zones knowing that sled agent will report a (retryable) error for that request until time is sync'd.

Oh! Right, completely forgot about that.

sunshowers · 2024-01-30T22:15:09Z

nexus/src/app/deployment.rs

+        // general criteria we'll want to use in picking a collection here.  But
+        // as of this writing, this is only used for one specific case, which is
+        // to implement a gate that prevents the planner from provisioning
+        // non-NTP zones on a sled unless we know there's an NTP zone already on
+        // that sled.  For that purpose, it's okay if this collection is


Are we soon going to have other uses for the latest inventory in the planner?

It's a good question. I am not sure yet. We mostly don't need it for zones (aside from this case) because we can always tell sled agents exactly what zones to run and it will just do nothing if it's already running those zones. We might eventually want the inventory when we have operations that don't work that way, where it's expensive to do them and you don't want to re-do them if you don't have to. The MGS-managed updates (like SP update, host OS flash update) come to mind, where you want to avoid doing the update if you know it's already at the right version. We may instead want to update slot 0, wait for the inventory to reflect the new version in slot 0, then update slot 1, etc. But at this point I think that use case is pretty far away.

sunshowers · 2024-02-01T01:15:36Z

nexus/deployment/src/planner.rs

+            if !has_ntp_inventory {
+                info!(
+                    &self.log,
+                    "parent blueprint contains NTP zone, but it's not in \
+                    inventory yet";
+                    "sled_id" => ?sled_id,
+                );
+                continue;


A thought that occurred to me: is there some place in the blueprint, or elsewhere, that we should be storing these decisions and the reasons for them? Just having them in logs seems hard to diagnose on in the future. I know eventually we want to build a full-fledged event log for operators, but I think an incremental step where we just store things like "here are all the places we skipped making a decision, and here's why" would be pretty useful.

That's a good idea. One thought is that we could have an enum of different events ("event" isn't quite the right word for it) -- basically one for each branch or decision made. Then we could store a list of these somewhere. So you might have [ FoundNtpBlueprint, FoundNtpInventory, ProvisionedCrucible ]. But I'm not sure where would make sense for it. We could stick an array in the Blueprint itself, or use a separate table, or assume that we can pull this out of the log...other ideas? I think we should be sure that this is only for debugging. I don't think we want programmatic consumers looking at this to try to pick apart what changes were part of a blueprint, despite it being tempting to do that.

Tangentially: I kind of expect that in the long run, we'll wind up breaking the planner into more orthogonal steps. Originally I'd envisioned a sequence of trait objects that provide something like a plan(previous: &Blueprint, next: &mut BlueprintBuilder) method, and the (generic) planner would just run them in sequence and each one would make whatever modifications it wanted. If we do wind up doing something like that, these methods could return a description of what choice(s) they made. I'm not convinced we've got enough concrete impls here yet for this to be worthwhile though.

sunshowers

Now that I've spent some more time in this code, it looks pretty good to me!

blueprint planner should wait for NTP zones to be provisioned before …

bb38674

…adding more zones

davepacheco assigned sunshowers and jgallagher Jan 29, 2024

jgallagher reviewed Jan 30, 2024

View reviewed changes

jgallagher approved these changes Jan 30, 2024

View reviewed changes

sunshowers reviewed Jan 30, 2024

View reviewed changes

davepacheco assigned davepacheco and unassigned sunshowers and jgallagher Jan 31, 2024

sunshowers reviewed Feb 1, 2024

View reviewed changes

sunshowers approved these changes Feb 1, 2024

View reviewed changes

davepacheco added 2 commits February 1, 2024 08:50

Merge branch 'main' into dap/planner-inventory

6237e15

Merge branch 'main' into dap/planner-inventory

7f7c7d9

davepacheco enabled auto-merge (squash) February 1, 2024 17:17

davepacheco merged commit b88e966 into main Feb 1, 2024
20 checks passed

davepacheco deleted the dap/planner-inventory branch February 1, 2024 19:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

planner should wait on NTP zones before adding more zones #4924

planner should wait on NTP zones before adding more zones #4924

davepacheco commented Jan 29, 2024 •

edited

Loading

jgallagher Jan 30, 2024

davepacheco Jan 30, 2024

jgallagher Jan 30, 2024

sunshowers Jan 30, 2024

davepacheco Jan 30, 2024

sunshowers Feb 1, 2024 •

edited

Loading

davepacheco Feb 1, 2024

sunshowers left a comment

planner should wait on NTP zones before adding more zones #4924

planner should wait on NTP zones before adding more zones #4924

Conversation

davepacheco commented Jan 29, 2024 • edited Loading

jgallagher Jan 30, 2024

Choose a reason for hiding this comment

davepacheco Jan 30, 2024

Choose a reason for hiding this comment

jgallagher Jan 30, 2024

Choose a reason for hiding this comment

sunshowers Jan 30, 2024

Choose a reason for hiding this comment

davepacheco Jan 30, 2024

Choose a reason for hiding this comment

sunshowers Feb 1, 2024 • edited Loading

Choose a reason for hiding this comment

davepacheco Feb 1, 2024

Choose a reason for hiding this comment

sunshowers left a comment

Choose a reason for hiding this comment

davepacheco commented Jan 29, 2024 •

edited

Loading

sunshowers Feb 1, 2024 •

edited

Loading