[reconfigurator] turn expunged sleds into expunged zones #5493

sunshowers · 2024-04-10T09:02:58Z

It actually looks like this bit falls out nicely. Turns out we were already
disregarding Nexus zones on expunged sleds, so more of it works than I thought
it did!

Still to do:

Fix and add to blueprint diff tests
Split up handling of internal and external IPs while constructing a BlueprintBuilder
Add tests for this

Depends on #5488 and #5555.

Created using spr 1.3.6-beta.1

sunshowers · 2024-04-17T20:26:48Z

nexus/reconfigurator/planning/src/blueprint_builder.rs

@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> {
        // of Nexus instances), but wouldn't be ideal if we have many resources
        // we need to skip. We could do something smarter here based on the sets
        // of used resources we built above if needed.
-        let nexus_v4_ips = Box::new(
+        let nexus_v4_ips = AvailableIterator::new(


I ended up not using the extra functionality provided by AvailableIterator, but this will be essential for testing #5552.

What is the extra functionality?

This lets you look at which resources were considered in-use at the time the iterator was constructed.

davepacheco

Nice!
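To make the exchange above concrete, here is a minimal sketch of an iterator wrapper that skips in-use resources and remembers which ones it treated as in-use when it was constructed. This is not the PR's actual AvailableIterator: the field names, the `in_use()` accessor, and the example addresses are assumptions made for illustration.

```rust
use std::collections::BTreeSet;
use std::net::Ipv4Addr;

/// Hypothetical stand-in for the PR's `AvailableIterator`: yields candidates
/// from `base` that are not in `in_use`, and keeps the in-use set around so
/// tests can inspect it later.
struct AvailableIterator<I, T> {
    base: I,
    in_use: BTreeSet<T>,
}

impl<I, T> AvailableIterator<I, T>
where
    I: Iterator<Item = T>,
    T: Ord,
{
    fn new(base: I, in_use: impl IntoIterator<Item = T>) -> Self {
        Self { base, in_use: in_use.into_iter().collect() }
    }

    /// The resources considered in-use at construction time.
    fn in_use(&self) -> &BTreeSet<T> {
        &self.in_use
    }
}

impl<I, T> Iterator for AvailableIterator<I, T>
where
    I: Iterator<Item = T>,
    T: Ord,
{
    type Item = T;

    fn next(&mut self) -> Option<T> {
        // Advance the underlying iterator past anything already in use.
        let in_use = &self.in_use;
        self.base.find(|item| !in_use.contains(item))
    }
}

fn main() {
    let pool = (1..=4u8).map(|n| Ipv4Addr::new(192, 0, 2, n));
    let used = [Ipv4Addr::new(192, 0, 2, 1), Ipv4Addr::new(192, 0, 2, 3)];
    let mut ips = AvailableIterator::new(pool, used);
    println!("first free: {:?}", ips.next());
    println!("in use at construction: {:?}", ips.in_use());
}
```

Keeping the in-use set on the iterator, rather than consuming it, is what would let a test assert on it afterwards.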

davepacheco · 2024-04-17T23:07:34Z

nexus/reconfigurator/planning/src/blueprint_builder.rs

+    ///
+    /// This method does not check whether the sled actually should be expunged
+    /// -- that is the responsibility of higher-level code.
+    pub fn build_pending_expunge_for_sled(


Why is this separated into two steps? I'm wondering if that makes it possible to misuse it by accident.

Good question -- the main reason is that I wanted to be able to unify the set of zones that should be expunged across all the possible sources of zone expungement, so it can be done in one go. The current ones are:

sled expungement => all zones on the sled

disk expungement => crucible zones backed by the disk

So the idea would be: first, build up the set of zones that should be expunged, then actually perform the expungement. It's a pattern I've generally found to be very good for introspection and testing against.

re misuse, one way is to add #[must_use] to a pending expunge, and I'll do so. But in general this should be caught pretty quickly by tests.
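A rough sketch of the two-step shape being discussed, with #[must_use] on the pending value. Only the two method names come from the diff and the discussion; the struct layouts, the u32 ids, and the return types are simplified assumptions.

```rust
use std::collections::BTreeSet;

/// Hypothetical pending-expunge value. `#[must_use]` makes the compiler warn
/// if a caller computes one and then drops it without applying it.
#[must_use = "a pending expunge does nothing until it is applied"]
#[derive(Debug)]
struct PendingExpunge {
    sled_id: u32,
    zone_ids: BTreeSet<u32>,
}

/// Simplified builder: ids are plain integers for the sake of the example.
struct Builder {
    /// (sled_id, zone_id) pairs for zones still in service.
    in_service: BTreeSet<(u32, u32)>,
    /// Zone ids that have been expunged.
    expunged: BTreeSet<u32>,
}

impl Builder {
    /// Step 1: compute what would be expunged, without mutating anything.
    fn build_pending_expunge_for_sled(&self, sled_id: u32) -> PendingExpunge {
        let zone_ids = self
            .in_service
            .iter()
            .filter(|(sled, _)| *sled == sled_id)
            .map(|(_, zone)| *zone)
            .collect();
        PendingExpunge { sled_id, zone_ids }
    }

    /// Step 2: apply it. Taking `pending` by value means it cannot be
    /// applied twice unless someone adds `Clone`.
    fn apply_pending_expunge(&mut self, pending: PendingExpunge) -> usize {
        for zone_id in &pending.zone_ids {
            self.in_service.remove(&(pending.sled_id, *zone_id));
            self.expunged.insert(*zone_id);
        }
        pending.zone_ids.len()
    }
}

fn main() {
    let mut builder = Builder {
        in_service: BTreeSet::from([(1, 10), (1, 11), (2, 20)]),
        expunged: BTreeSet::new(),
    };
    let pending = builder.build_pending_expunge_for_sled(1);
    println!("would expunge: {:?}", pending);
    let applied = builder.apply_pending_expunge(pending);
    println!("expunged {applied} zones");
}
```

The introspection benefit is visible in `main`: the pending value can be printed or asserted on before anything changes.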

I think you'd get the same behavior from a function that was just let summary = builder.expunge_sled(sled_id, reason) because as far as I can tell, the caller always:

calls build_pending_expunge_for_sled() 0 or more times

for each returned value, calls apply_pending_expunge()

i.e., there's no merging or other processing of the returned values before calling apply().

I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

It makes sense to use #[must_use] to ensure that the caller doesn't forget to call apply(). There's also the risk that they try to call it more than once. I think this is impossible today because apply() takes ownership and PendingExpunge doesn't impl Clone (though I can see somebody wanting to add Clone for some other reason and then introducing this risk).

I think what I'm most worried about is actually that separating the computation of PendingExpunge from applying it means there's a window in which a caller could do any number of other operations that invalidate the PendingExpunge. What if they use other BlueprintBuilder functions to deploy more zones to that sled in the meantime? Or use sled_ensure_disks() or something like that. Or it might be easy to do something like:

    // there are 3 nexus zones, one of which is on $sled_id
    builder.build_pending_expunge_for_sled($sled_id, ...)
    // go calculate how many Nexus zones should exist and erroneously include
    // the one on $sled_id because it hasn't actually been expunged yet
    builder.apply_pending_expunge()

Of course we can add validation to various functions to prevent this stuff but it just seems more complicated than structurally ensuring that external callers can't do anything in the meantime because they have no opportunity to.

That's my pitch but this isn't that important either way and I'm fine if you want to keep it as is!
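For comparison, a sketch of the single-step alternative. The name expunge_sled is taken from the comment above; the summary type, the u32 ids, and the map layout are assumptions. Because the computation and the mutation happen inside one `&mut self` call, a caller has no opportunity to run other builder operations in between.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Disposition {
    InService,
    Expunged,
}

/// Hypothetical summary returned to the caller for logging or comments.
#[derive(Debug, PartialEq, Eq)]
struct ExpungeSummary {
    sled_id: u32,
    zones_expunged: usize,
}

struct Builder {
    /// zone_id -> (sled_id, disposition); ids simplified to integers.
    zones: BTreeMap<u32, (u32, Disposition)>,
}

impl Builder {
    /// Compute and apply in one call. There is no intermediate value for a
    /// caller to hold while the builder changes underneath it.
    fn expunge_sled(&mut self, sled_id: u32) -> ExpungeSummary {
        let mut zones_expunged = 0;
        for (zone_sled, disposition) in self.zones.values_mut() {
            if *zone_sled == sled_id && *disposition == Disposition::InService {
                *disposition = Disposition::Expunged;
                zones_expunged += 1;
            }
        }
        ExpungeSummary { sled_id, zones_expunged }
    }

    /// Counting after `expunge_sled` cannot accidentally include zones on
    /// the expunged sled.
    fn in_service_count(&self) -> usize {
        self.zones
            .values()
            .filter(|(_, d)| *d == Disposition::InService)
            .count()
    }
}

fn main() {
    let mut builder = Builder {
        zones: BTreeMap::from([
            (10, (7, Disposition::InService)),
            (11, (8, Disposition::InService)),
            (12, (9, Disposition::InService)),
        ]),
    };
    let summary = builder.expunge_sled(7);
    println!("{:?}, {} zones remain", summary, builder.in_service_count());
}
```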

> I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

I think I would vote for this, with the caveat that it doesn't look like we currently have any tests of these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?

> i.e., there's no merging or other processing of the returned values before calling apply().

So... this is how the code is today, but that's only because the only source of zone expungement is sled expungement. In the very near future (release 9 I believe), we'll have a second source of zone expungement: individual disk expungement. In that case, it is possible that a zone is expunged for multiple reasons -- e.g. for a crucible zone, more than one of the backing disks was expunged.

In that case, there are two ways to address this:

1. Compute and apply each expungement directly to the builder, relying on the fact that expungement is idempotent.

2. Build a set of pending expunges, merge all of them building up the reasons for each, and then apply all of them together.

From experience, absent other overriding concerns, the solution I generally gravitate towards in these situations is option 2. Option 1 is definitely harder to misuse, but option 2 allows for many things, including dry-runs and easier testing. Many, many times in my life I've rewritten code from 1 to 2, and so I just start from pattern 2 these days. (In this case, one of the immediate things pattern 2 does is make us able to produce better comments for the blueprint.)
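A sketch of what pattern 2 could look like once a zone can be expunged for more than one reason. All type and function names here are hypothetical, and zone ids are simplified to integers. The point is that merging happens on plain data before the single apply step.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Reason {
    SledExpunged,
    DiskExpunged,
}

/// Hypothetical merged set of pending expunges: zone_id -> every reason
/// the zone should be expunged.
#[derive(Debug, Default)]
struct PendingExpunges {
    by_zone: BTreeMap<u32, BTreeSet<Reason>>,
}

impl PendingExpunges {
    fn add(&mut self, zone_id: u32, reason: Reason) {
        self.by_zone.entry(zone_id).or_default().insert(reason);
    }

    /// Merge another source of expungement, accumulating reasons per zone.
    fn merge(&mut self, other: PendingExpunges) {
        for (zone_id, reasons) in other.by_zone {
            self.by_zone.entry(zone_id).or_default().extend(reasons);
        }
    }
}

/// Apply everything in one go and produce one comment per newly expunged
/// zone, listing all of its reasons.
fn apply(expunged: &mut BTreeSet<u32>, pending: PendingExpunges) -> Vec<String> {
    let mut comments = Vec::new();
    for (zone_id, reasons) in pending.by_zone {
        if expunged.insert(zone_id) {
            comments.push(format!("zone {zone_id} expunged: {reasons:?}"));
        }
    }
    comments
}

fn main() {
    let mut from_sled = PendingExpunges::default();
    from_sled.add(1, Reason::SledExpunged);
    from_sled.add(2, Reason::SledExpunged);

    let mut from_disk = PendingExpunges::default();
    from_disk.add(2, Reason::DiskExpunged);

    from_sled.merge(from_disk);

    let mut expunged = BTreeSet::new();
    for comment in apply(&mut expunged, from_sled) {
        println!("{comment}");
    }
}
```

The merged value is also what a dry run would print, which is the testing and introspection benefit described above.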

But let me try and make the case along the same lines for pattern 1:

A blueprint itself is pattern 2, a dry-run. A blueprint builder is a dry-run of a dry-run. This is a dry-run^3, which maybe is too much dry-run.

Maybe the solution to providing better comments is to actually do the merging within the comment layer. So instead of storing raw strings, we store an operation log that we can merge intelligently, and then resolve to the list of comments while finishing up the blueprint.

That is on top of the undeniable argument that pattern 1 is harder to misuse. So given that both of you feel this way, I think I'll switch to pattern 1.

I've switched to pattern 1 in this case, though (as outlined above) I think it'll have some repercussions down the road that we'll have to deal with.
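The operation-log idea mentioned above could be sketched roughly as follows. Nothing here is from the PR: the Operation enum, the log type, and the comment wording are all assumptions. The builder records structured operations as it goes (pattern 1), and the merging happens only when the log is resolved into comments.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical structured record of something the builder did.
#[derive(Clone, Debug, PartialEq, Eq)]
enum Operation {
    ZoneExpunged { sled_id: u32, zone_id: u32, reason: &'static str },
}

/// Hypothetical log kept by the builder instead of raw comment strings.
#[derive(Default)]
struct OperationLog {
    ops: Vec<Operation>,
}

impl OperationLog {
    fn record(&mut self, op: Operation) {
        self.ops.push(op);
    }

    /// Resolve the log into comments while finishing the blueprint. Zones
    /// are grouped per (sled, reason), and repeated records for the same
    /// zone collapse because expungement is idempotent.
    fn into_comments(self) -> Vec<String> {
        let mut grouped: BTreeMap<(u32, &'static str), BTreeSet<u32>> = BTreeMap::new();
        for Operation::ZoneExpunged { sled_id, zone_id, reason } in self.ops {
            grouped.entry((sled_id, reason)).or_default().insert(zone_id);
        }
        grouped
            .into_iter()
            .map(|((sled_id, reason), zones)| {
                format!("sled {sled_id}: expunged {} zone(s) ({reason})", zones.len())
            })
            .collect()
    }
}

fn main() {
    let mut log = OperationLog::default();
    let reason = "sled expunged";
    log.record(Operation::ZoneExpunged { sled_id: 7, zone_id: 10, reason });
    log.record(Operation::ZoneExpunged { sled_id: 7, zone_id: 11, reason });
    // Recording the same zone again is harmless.
    log.record(Operation::ZoneExpunged { sled_id: 7, zone_id: 10, reason });
    for comment in log.into_comments() {
        println!("{comment}");
    }
}
```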

davepacheco · 2024-04-17T23:32:05Z

nexus/reconfigurator/planning/src/blueprint_builder.rs

@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> {
        // of Nexus instances), but wouldn't be ideal if we have many resources
        // we need to skip. We could do something smarter here based on the sets
        // of used resources we built above if needed.
-        let nexus_v4_ips = Box::new(
+        let nexus_v4_ips = AvailableIterator::new(


What is the extra functionality?

sunshowers · 2024-04-18T08:14:07Z

Want to add a few more tests tomorrow, but hopefully we can get some cycles going with this.

jgallagher · 2024-04-19T13:11:30Z

nexus/reconfigurator/planning/src/blueprint_builder.rs

+    ///
+    /// This method does not check whether the sled actually should be expunged
+    /// -- that is the responsibility of higher-level code.
+    pub fn build_pending_expunge_for_sled(


> I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.

I think I would vote for this, with the caveat that it doesn't look like we currently have any tests of these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?

jgallagher · 2024-04-19T13:18:27Z

nexus/reconfigurator/planning/src/blueprint_builder.rs

+                    self.zones.iter_mut().find(|z| z.zone.id == *zone_id)
+                {
+                    // Just check that the zone is still expungeable.
+                    is_already_expunged(&zone.zone, zone.state)?;


I don't entirely follow what this is guarding here. In particular, what would a return value of Ok(true) mean? We already expunged the zone but forgot to mark the state as Modified somehow?

This is making sure that any zones passed in at this level are in a state where marking them as expunged is valid. This is an unavoidable check that happens a second time. The list of zones to expunge has to be passed in -- it cannot be constructed here, because a BuilderZonesConfig is only created after we've decided to change the sled. If there are no zones to expunge then a BuilderZonesConfig shouldn't be created.

Hm, but am I right that Ok(true) would be a nonsensical return value? Would it make sense to separate "validate that this zone is expungeable" from "has the zone already been marked expunged", and only call the first one here?

Ok(true) indicates a no-op (idempotence). I changed the definition of is_already_expunged slightly, and especially with this new definition -- but even with the old one -- I don't think it makes sense to split them out. The simplest definition of a validate_expungeable function would be to just call is_already_expunged()?; and drop the return value.

(I went through this series of arguments myself while writing this code. Options I considered included renaming is_already_expunged to also include validation -- I decided that was too wordy, and already indicated by the fact that it returns a Result.)
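A sketch of the three-way result discussed in this thread. The real signature takes a zone config and a builder state; the enums, the error type, and the specific rule that an added zone is not expungeable are simplified assumptions, the last one loosely based on the invariant that a zone should not be added and expunged in the same blueprint.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Disposition {
    InService,
    Expunged,
}

/// Hypothetical per-zone state tracked while building a blueprint.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZoneState {
    Unchanged,
    Modified,
    Added,
}

#[derive(Debug, PartialEq, Eq)]
enum ExpungeError {
    /// Assumed invariant: a zone must not be added and expunged in the
    /// same blueprint.
    AddedInThisBlueprint,
}

/// Ok(true)  => already expunged, so expunging again is a no-op.
/// Ok(false) => in service and valid to expunge.
/// Err(_)    => expunging this zone would violate an invariant.
fn is_already_expunged(
    disposition: Disposition,
    state: ZoneState,
) -> Result<bool, ExpungeError> {
    match (disposition, state) {
        (Disposition::Expunged, _) => Ok(true),
        (Disposition::InService, ZoneState::Added) => {
            Err(ExpungeError::AddedInThisBlueprint)
        }
        (Disposition::InService, _) => Ok(false),
    }
}

/// The "validate only" variant is the same call with the bool dropped.
fn validate_expungeable(
    disposition: Disposition,
    state: ZoneState,
) -> Result<(), ExpungeError> {
    is_already_expunged(disposition, state)?;
    Ok(())
}

fn main() {
    println!("{:?}", is_already_expunged(Disposition::InService, ZoneState::Unchanged));
    println!("{:?}", is_already_expunged(Disposition::Expunged, ZoneState::Modified));
    println!("{:?}", validate_expungeable(Disposition::InService, ZoneState::Added));
}
```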

In #5493 we'd like to track invariants like: a zone should not be added and expunged in the same blueprint. In order to do that, we need to track this state. (There are probably other ways to do it, but this is the most explicit method and I really like that.) This lives in a submodule because I don't want the rest of the blueprint builder to reach into the internals here. I split this from #5493 because it became somewhat complex in its own right, with its own tests.

andrewjstone

Great work @sunshowers. I tested this on a4x2 and it seems to properly mark zones as expunged. I suggest we get it in asap and build upon it.

sunshowers · 2024-04-20T06:18:53Z

Thanks @andrewjstone! Going to land this and we can take care of any remaining comments in followups.

sunshowers added 2 commits April 10, 2024 02:02

[𝘀𝗽𝗿] initial version

07063d9

[𝘀𝗽𝗿] changes to main this commit is based on

ed83d43

sunshowers marked this pull request as draft April 10, 2024 09:03

wip

0e2b4a1

sunshowers mentioned this pull request Apr 10, 2024

Add BlueprintZoneFilter to all_omicron_zones #5348

Merged

3 tasks

sunshowers added 4 commits April 11, 2024 12:48

[𝘀𝗽𝗿] changes introduced through rebase

9a8763e

rebase

ac84f20

[𝘀𝗽𝗿] changes introduced through rebase

d7243ef

ready for review?

c92255d

sunshowers marked this pull request as ready for review April 12, 2024 06:04

sunshowers commented Apr 12, 2024

sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones to main April 12, 2024 06:08

sunshowers marked this pull request as draft April 14, 2024 06:32

still a wip, need to fix up planning input

0738a88

morlandi7 added this to the 8 milestone Apr 16, 2024

davepacheco assigned sunshowers Apr 16, 2024

sunshowers added 2 commits April 16, 2024 14:19

further wip, will not do garbage collection for r8

d13063e

MVP for r8

cf33653

sunshowers commented Apr 17, 2024

sunshowers marked this pull request as ready for review April 17, 2024 20:27

sunshowers requested review from jgallagher, davepacheco, smklein and andrewjstone April 17, 2024 20:27

davepacheco reviewed Apr 17, 2024

sunshowers mentioned this pull request Apr 18, 2024

[reconfigurator-planning] track more state while building zones #5555

Merged

sunshowers added 2 commits April 18, 2024 01:13

[𝘀𝗽𝗿] changes introduced through rebase

5502f76

Rebase on top of 5555

4b22868

sunshowers changed the base branch from main to sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones April 18, 2024 08:13

sunshowers added 2 commits April 18, 2024 01:15

[𝘀𝗽𝗿] changes introduced through rebase

9770b70

I think I hit a bug in github

ab0577f

jgallagher reviewed Apr 19, 2024

andrewjstone reviewed Apr 19, 2024

labbott and others added 2 commits April 19, 2024 16:22

[𝘀𝗽𝗿] changes introduced through rebase

d70bc05

Rebase, address comments

681bb44

sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-turn-expunged-sleds-into-expunged-zones to main April 19, 2024 23:38

sunshowers mentioned this pull request Apr 20, 2024

Blueprints should store the reasons zone dispositions were changed #5586

Open

andrewjstone mentioned this pull request Apr 20, 2024

Testing: Expunge Sled #5480

Closed

andrewjstone approved these changes Apr 20, 2024

sunshowers merged commit ffd72fd into main Apr 20, 2024
26 checks passed

sunshowers deleted the sunshowers/spr/reconfigurator-turn-expunged-sleds-into-expunged-zones branch April 20, 2024 06:19

sunshowers mentioned this pull request Apr 23, 2024

[nexus] Remove zones on expunged disks #5599

Merged
