-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reconfigurator] turn expunged sleds into expunged zones #5493
[reconfigurator] turn expunged sleds into expunged zones #5493
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> { | |||
// of Nexus instances), but wouldn't be ideal if we have many resources | |||
// we need to skip. We could do something smarter here based on the sets | |||
// of used resources we built above if needed. | |||
let nexus_v4_ips = Box::new( | |||
let nexus_v4_ips = AvailableIterator::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up not using the extra functionality provided by AvailableIterator
, but this will be essential for testing #5552.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the extra functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lets you look at which resources were considered in-use at the time the iterator was constructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
/// | ||
/// This method does not check whether the sled actually should be expunged | ||
/// -- that is the responsibility of higher-level code. | ||
pub fn build_pending_expunge_for_sled( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this separated into two steps? I'm wondering if that makes it possible to misuse it by accident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- the main reason is that I wanted to be able to unify the set of zones that should be expunged across all the possible sources of zone expungement, so it can be done in one go. The current ones are:
- sled expungement => all zones on the sled
- disk expungement => crucible zones backed by the disk
So the idea would be: first, build up the set of zones that should be expunged, then actually perform the expungement. It's a pattern I've generally found to be very good for introspection and testing against.
re misuse, one way is to add #[must_use]
to a pending expunge, and I'll do so. But in general this should be caught pretty quickly by tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'd get the same behavior from a function that was just let summary = builder.expunge_sled(sled_id, reason)
because as far as I can tell, the caller always:
- calls
build_pending_expunge_for_sled()
0 or more times - for each returned value, call
apply_pending_expunge
i.e., there's no merging or other processing of the returned values before calling apply()
.
I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.
It makes sense to use #[must_use]
to ensure that the caller doesn't forget to call apply()
. There's also the risk that they try to call it more than once. I think this is impossible today because apply()
takes ownership and PendingExpunge
doesn't impl Clone
(though I can see somebody wanting to add Clone
for some other reason and then introducing this risk). I think what I'm most worried about is actually that separating the computation of PendingExpunge
from applying it means there's a window in which a caller could do any number of other operations that invalidate the PendingExpunge
. What if they use other BlueprintBuilder
functions to deploy more zones to that sled in the meantime? Or use sled_ensure_disks()
or something like that. Or it might be easy to do something like:
// there are 3 nexus zones, one of which is on $sled_id
builder.build_pending_expunge_for_sled($sled_id, ...)
// go calculate how many Nexus zones should exist and erroneously include the one on $sled_id because it hasn't actually been expunged yet
builder.apply_pending_expunge()
Of course we can add validation to various functions to prevent this stuff but it just seems more complicated than structurally ensuring that external callers can't do anything in the meantime because they have no opportunity to.
That's my pitch but this isn't that important either way and I'm fine if you want to keep it as is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.
I think I would vote for this, with the caveat that it doesn't look like we currently have any tests these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e., there's no merging or other processing of the returned values before calling apply().
So... this is how the code is today, but that's only because the only source of zone expungement is sled expungement. In the very near future (release 9 I believe), we'll have a second source of zone expungement: individual disk expungement. In that case, it is possible that a zone is expunged for multiple reasons -- e.g. for a crucible zone, more than one of the backing disks was expunged.
In that case, there are two ways to address this:
- Compute and apply each expungement directly to the builder, relying on the fact that expungement is idempotent.
- Build a set of pending expunges, merge all of them building up the reasons for each, and then apply all of them together.
From experience, absent other overriding concerns the solution I generally tend to gravitate towards in these situations is 2. 1 is definitely harder to misuse, but 2 allows for many things, including dry-runs and easier testing. Many, many times in my life I've rewritten code from 1 to 2, and so I just start from pattern 2 these days. (In this case, one of the immediate things pattern 2 does is make us able to produce better comments for the blueprint.)
But let me try and make the case along the same lines for pattern 1:
- A blueprint itself is pattern 2, a dry-run. A blueprint builder is a dry-run of a dry-run. This is a dry-run^3, which maybe is too much dry-run.
- Maybe the solution to providing better comments is to actually do the merging within the comment layer. So instead of storing raw strings, we store an operation log that we can merge intelligently, and then resolve to the list of comments while finishing up the blueprint.
That is on top of the undeniable argument that pattern 1 is harder to misuse. So given that both of you feel this way, I think I'll switch to pattern 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switched to pattern 1 in this case, though (as outlined above) I think it'll have some repercussions down the road that we'll have to deal with.
nexus/reconfigurator/planning/tests/output/planner_nonprovisionable_1_2.txt
Outdated
Show resolved
Hide resolved
nexus/reconfigurator/planning/tests/output/planner_nonprovisionable_1_2.txt
Show resolved
Hide resolved
@@ -353,30 +366,26 @@ impl<'a> BlueprintBuilder<'a> { | |||
// of Nexus instances), but wouldn't be ideal if we have many resources | |||
// we need to skip. We could do something smarter here based on the sets | |||
// of used resources we built above if needed. | |||
let nexus_v4_ips = Box::new( | |||
let nexus_v4_ips = AvailableIterator::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the extra functionality?
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
Want to add a few more tests tomorrow, but hopefully we can get some cycles going with this. |
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
/// | ||
/// This method does not check whether the sled actually should be expunged | ||
/// -- that is the responsibility of higher-level code. | ||
pub fn build_pending_expunge_for_sled( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can definitely see separating it for testing but callers could use a wrapper that always composes the two functions correctly.
I think I would vote for this, with the caveat that it doesn't look like we currently have any tests these two in isolation, so maybe the split might be premature even for this case vs joining them into one function that returns a summary as you suggested?
self.zones.iter_mut().find(|z| z.zone.id == *zone_id) | ||
{ | ||
// Just check that the zone is still expungeable. | ||
is_already_expunged(&zone.zone, zone.state)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't entirely follow what this is guarding here. In particular, what would a return value of Ok(true)
mean? We already expunged the zone but forgot to mark the state as Modified
somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is making sure that any zones passed in at this level are in a state where marking them as expunged is valid. This is an unavoidable check that happens a second time. The list of zones to expunge has to be passed in -- it cannot be constructed here, because a BuilderZonesConfig
is only created after we've decided to change the sled. If there are no zones to expunge then a BuilderZonesConfig
shouldn't be created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, but am I right that Ok(true)
would be a nonsensical return value? Would it make sense to separate "validate that this zone is expungeable" from "has the zone already been marked expunged", and only call the first one here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok(true)
indicates a no-op (idempotence). I changed the definition of is_already_expunged
slightly, and especially with this new definition -- but even with the old one -- I don't think it makes sense to split them out. The simplest definition of a validate_expungeable
function would be to just call is_already_expunged()?;
and drop the return value.
(I went through this series of arguments myself while writing this code. Options I considered included renaming is_already_expunged
to also include validation -- I decided that was too wordy, and already indicated by the fact that it returns a Result
.)
In #5493 we'd like to track invariants like: a zone should not be added and expunged in the same blueprint. In order to do that, we need to track this state. (There are probably other ways to do it, but this is the most explicit method and I really like that.) This lives in a submodule because I don't want the rest of the blueprint builder to reach into the internals here. I split this from #5493 because it became somewhat complex in its own right, with its own tests.
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @sunshowers. I tested this on a4x2 and it seems to properly mark zones as expunged. I suggest we get it in asap and build upon it.
Thanks @andrewjstone! Going to land this and we can take care of any remaining comments in followups. |
It actually looks like this bit falls out nicely. Turns out we were already
disregarding Nexus zones on expunged sleds, so more of it works than I thought
it did!
Still to do:
BlueprintBuilder
Depends on #5488 and #5555.