
[Reconfigurator] Blippy #7276

Merged: 14 commits merged into main from john/reconfigurator-blippy on Dec 20, 2024

Conversation

jgallagher (Contributor):

This PR introduces Blippy for linting blueprints (see #6987). It initially only reports FATAL errors associated with specific sleds, but hopefully provides enough structure to see how that can expand to include other severities and other components. (At a minimum, there will be some blueprint- or policy-level component for things like "there aren't enough Nexus zones" that aren't associated with any particular sled.)

As of this PR, the only user of Blippy is the builder test's verify_blueprint, from which I imported most of the checks it currently performs. I made a few of these checks slightly stricter; as a result I had to patch up a handful of tests that were doing weird things (e.g., manually expunging a zone without expunging its datasets), and also found one legitimate planner bug (noted in a separate comment below).

There aren't many doc comments (yet?) - the interface should be pretty straightforward, but I'm happy to add docs if folks are okay with the general structure of this. I'm not at all wedded to the Note / Severity / Kind / Component breakdown, so any feedback there is welcome.
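For concreteness, here is a minimal self-contained sketch of the Note / Severity / Kind / Component breakdown described above. All type and variant names are illustrative placeholders (e.g. String standing in for SledUuid); this is not the actual blippy API.

```rust
// Hypothetical sketch of the Note / Severity / Kind / Component breakdown;
// names are placeholders, not the real blippy types.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Component {
    // The real crate would hold a SledUuid here; a String stands in.
    Sled(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    // Only FATAL exists as of this PR; other severities could be added later.
    Fatal,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Kind {
    // One example kind; the real enum has one variant per check.
    ConflictingSledSubnets { other_sled: String },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Note {
    pub component: Component,
    pub severity: Severity,
    pub kind: Kind,
}

// A report could then filter or group notes by severity.
pub fn fatal_notes(notes: &[Note]) -> Vec<&Note> {
    notes.iter().filter(|n| n.severity == Severity::Fatal).collect()
}
```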

    skip_zpools.insert(zpool);
}
if let Some(zpool) = &zone_config.filesystem_pool {
    skip_zpools.insert(zpool);
jgallagher (author):

This is the legit planner bug I mentioned in the PR description; without this fix, we have several tests that fail with a blippy report that looks something like:

blippy report for blueprint 30f46d73-5dd5-4125-800f-6c910bb24638: 2 notes
  sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 2220bf46-72e3-4d88-baae-6929f1891219 and CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc)
  sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc and CruciblePantry 6f6f16f9-3b33-408a-8508-2192249e3b0f)

This fix is also obvious in the blueprint diffs below in the PR, where we can see that before this change there were multiple zones of the same kind placed on the same zpool. I believe this bug does not affect production due to #7229.

    let mut slf = Self { blueprint, notes: Vec::new() };
    checks::perform_all_blueprint_only_checks(&mut slf);
    slf
}
jgallagher (author) commented Dec 18, 2024:

My intent is to add at least one more method here that would let us perform additional checks that need the PlanningInput (or at least the Policy), but make that optional so Blippy can still be used to check just a standalone blueprint. So something like:

let just_blueprint = Blippy::new(blueprint).into_report();                     // today
let full_context   = Blippy::new(blueprint).with_policy(policy).into_report(); // future

smklein (Collaborator) left a comment:

Looks great, thanks for taking this on!

nexus/reconfigurator/blippy/src/blippy.rs (outdated review thread, resolved)
},
/// Two sleds are using the same sled subnet.
ConflictingSledSubnets {
    other_sled: SledUuid,
smklein (Collaborator):

Should we have both sled UUIDs here? Unclear which one is the "other"

jgallagher (author):

Yeah I went back and forth on this. Every Note has a Component; this kind is assuming its note's component is a sled, and then this value is the "other" sled. I could put both sled IDs in here and just duplicate one with the Component? Or maybe this is pointing to a more serious structural problem (e.g., some Kinds are only sensible with specific Components, so maybe breaking them up the way I have is not right)?

smklein (Collaborator):

ahhhh I didn't put two and two together to see the higher-level usage of Component here.

I'm looking at the structure of:

pub struct Note<'a> {
    pub component: Component,
    pub severity: Severity,
    pub kind: Kind<'a>,
}

I think I agree with your comment, this Kind is actually a SledKind - I wouldn't want to lump non-sled stuff into this Kind enum.

Maybe we should do something like:

pub struct Note<'a> {
    pub severity: Severity,
    pub component_note: ComponentNote<'a>,
}

pub enum ComponentNote<'a> {
    Sled {
        id: SledUuid,
        kind: SledKind<'a>,
    },
    OtherComponentKind {
        id: OtherUuid,
        kind: OtherKind<'a>,
    },
}

jgallagher (author):

Yeah that seems better, thanks 👍

jgallagher (author):

Reworked in 15f04e8

nexus/reconfigurator/blippy/src/checks.rs (outdated review thread, resolved)
Comment on lines 105 to 151
/// Two zones with the same durable dataset kind are on the same zpool.
ZoneDurableDatasetCollision {
    zone1: &'a BlueprintZoneConfig,
    zone2: &'a BlueprintZoneConfig,
    zpool: ZpoolName,
},
/// Two zones with the same filesystem dataset kind are on the same zpool.
ZpoolFilesystemDatasetCollision {
    zone1: &'a BlueprintZoneConfig,
    zone2: &'a BlueprintZoneConfig,
    zpool: ZpoolName,
},
/// One zpool has two datasets of the same kind.
ZpoolWithDuplicateDatasetKinds {
    dataset1: &'a BlueprintDatasetConfig,
    dataset2: &'a BlueprintDatasetConfig,
    zpool: ZpoolUuid,
},
smklein (Collaborator):

I'm not opposed to more specific error kinds, but ZpoolWithDuplicateDatasetKinds will cover all three of these cases, right?

jgallagher (author):

I don't think so:

  • Zone*DatasetCollision come from scanning blueprint_zones, from which we don't have a BlueprintDatasetConfig needed to construct a ZpoolWithDuplicateDatasetKinds
  • ZpoolWithDuplicateDatasetKinds comes from scanning blueprint_datasets (so the inner fields are BlueprintDatasetConfigs, but these don't necessarily have a corresponding BlueprintZoneConfig anywhere)

We could squish all of these down to one if we dropped the zone/dataset specifics to something like DatasetKind? That seems like it's dropping a lot of information, though.

smklein (Collaborator):

Ah, gotcha, nevermind -- happy to keep all the cases distinct. I do think it's useful to give the most specific errors possible here.

I was (perhaps incorrectly?) thinking:

  • If we check blueprint_datasets for per-zpool uniqueness (checked by ZpoolWithDuplicateDatasetKinds)
  • ... and all blueprint_zones have associated datasets (checked by ZoneMissingFilesystemDataset + ZoneMissingDurableDataset)
  • ... then we can infer: all datasets that should exist do exist, and within the set of datasets, there are no collisions.

Granted - I don't have a beef with also attributing the dataset collisions to specific zones, so this is fine.

jgallagher (author):

I think you're right, and I debated collapsing these, but I decided to keep the separate cases for explicitness. It's okay if we get some "double notes", and I think it's easier to reason about each of the checks locally if they're checking their side of the condition (e.g., zones check that their corresponding dataset exists and datasets check that their corresponding zone exists, even though strictly speaking we could drop one of those and not miss anything).

nexus/reconfigurator/blippy/src/checks.rs (outdated review thread, resolved)
Comment on lines +443 to +457
// TODO-john Add a Severity::BackwardsCompatibility and note the
// missing filesystem pool
smklein (Collaborator):

Good idea

nexus/reconfigurator/blippy/src/checks.rs (review thread, resolved)
andrewjstone (Contributor) left a comment:

Looks great! Thanks for taking this on @jgallagher!

I agree with all @smklein's comments and appreciate the thorough review.

There are a few TODOs around expunged and decommissioned disks in blueprints. I'll take that on when my fix for #7098 goes out for review.


// Any given sled should have IPs within at most one subnet.
//
// The blueprint doesn't store sled subnets explicitly, so we can't
andrewjstone (Contributor):

Do you think the rack subnet is something we want to move into the blueprint eventually?

I think we do, mainly based on the "make everything in the blueprint explicit" POV.

jgallagher (author):

I think so, yeah. I don't think it's particularly urgent.
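Since the blueprint doesn't record sled subnets explicitly, the "infer the subnet from the IPs" idea can be sketched as follows. This is illustrative only (the actual check works over blueprint zone addresses per sled, and the /64 prefix width is an assumption):

```rust
use std::net::Ipv6Addr;

// Derive an address's /64 prefix (assumed sled-subnet width for this sketch).
fn prefix64(addr: Ipv6Addr) -> u64 {
    (u128::from(addr) >> 64) as u64
}

// A sled's IPs should all fall within one subnet: every address must share
// the same /64 prefix. An empty set trivially passes.
pub fn ips_within_one_subnet(ips: &[Ipv6Addr]) -> bool {
    let mut prefixes = ips.iter().map(|&ip| prefix64(ip));
    match prefixes.next() {
        None => true,
        Some(first) => prefixes.all(|p| p == first),
    }
}
```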

jgallagher force-pushed the john/reconfigurator-blippy branch from 972e820 to 09a46f5 on December 20, 2024 at 19:32
smklein (Collaborator) left a comment:

Tests look great!

Comment on lines 677 to 678
fn test_bad_internnal_dns_subnet() {
static TEST_NAME: &str = "test_bad_internnal_dns_subnet";
smklein (Collaborator):

Suggested change:

-fn test_bad_internnal_dns_subnet() {
-    static TEST_NAME: &str = "test_bad_internnal_dns_subnet";
+fn test_bad_internal_dns_subnet() {
+    static TEST_NAME: &str = "test_bad_internal_dns_subnet";

}
}
}
assert!(found_sled_missing_note, "found sled missing datasets note");
smklein (Collaborator):

Nitpick; this isn't an expect -- we'll see this message if the note isn't found, right?

smklein (Collaborator):

To be super clear, I think the error message should be "Did not find sled with missing datasets note".

jgallagher (author):

Hah, yes, thanks.
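For reference, the nit above comes down to assert!'s message semantics: the message argument is printed only when the condition is false, so it should describe the failure, not the success. A minimal illustration (hypothetical helper, not the actual test code):

```rust
// assert!'s message is shown only when the condition is FALSE, so it should
// describe the failure case, unlike expect-style "what we were doing" text.
fn check(found_sled_missing_note: bool) {
    assert!(
        found_sled_missing_note,
        "did not find sled missing datasets note"
    );
}
```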

}
}
}
assert!(found_sled_missing_note, "found sled missing disks note");
smklein (Collaborator):

Same comment here re: expect vs assert messages

jgallagher (author):

> Tests look great!

Thanks, and thanks for the super fast review. I didn't even have a chance to ping with a "tests are here, feel free to look them over"!

The tests are pretty repetitive; I was tempted to abstract the "mutate the blueprint, check for expected notes" pattern, but (a) I worry about abstracting tests too much and (b) in many of the tests the details of how the mutation / note creation happens are subtly different from each other in ways that might be hard to abstract. So I ended up with "just repeat a lot of similar things", which I think is okay for tests 🤷‍♂️.

smklein (Collaborator) commented Dec 20, 2024:

> So I ended up with "just repeat a lot of similar things", which I think is okay for tests 🤷‍♂️.

I'm totally on board with the pattern of "repeat a lot at first to make things explicit, de-dup only later once everyone is crystal clear on what's happening under the hood".

jgallagher merged commit b1978e2 into main on Dec 20, 2024; 18 checks passed.
jgallagher deleted the john/reconfigurator-blippy branch on December 20, 2024 at 21:47.