
[Reconfigurator] Blippy #7276

Merged: 14 commits merged into main from john/reconfigurator-blippy on Dec 20, 2024

Conversation

jgallagher (Contributor):

This PR introduces Blippy for linting blueprints (see #6987). It initially only reports FATAL errors associated with specific sleds, but hopefully provides enough structure to see how that can expand to include other severities and other components. (At a minimum, there will be some blueprint- or policy-level component for things like "there aren't enough Nexus zones" that aren't associated with any particular sled.)

As of this PR, the only user of Blippy is the builder test's verify_blueprint, from which I imported most of the checks it currently performs. I made a few of these checks slightly stricter; as a result I had to patch up a handful of tests that were doing weird things (e.g., manually expunging a zone without expunging its datasets), and also found one legitimate planner bug (noted in a separate comment below).

There aren't many doc comments (yet?) - the interface should be pretty straightforward, but I'm happy to add docs if folks are okay with the general structure of this. I'm not at all wedded to the Note / Severity / Kind / Component breakdown, so any feedback there is welcome.
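For concreteness, here is a minimal self-contained sketch of the Note / Severity / Kind / Component breakdown described above. All type and variant names are illustrative placeholders (e.g. String standing in for SledUuid); this is not the actual blippy API.

```rust
// Hypothetical sketch of the Note / Severity / Kind / Component breakdown;
// names are placeholders, not the real blippy types.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Component {
    // The real crate would hold a SledUuid here; a String stands in.
    Sled(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    // Only FATAL exists as of this PR; other severities could be added later.
    Fatal,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Kind {
    // One example kind; the real enum has one variant per check.
    ConflictingSledSubnets { other_sled: String },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Note {
    pub component: Component,
    pub severity: Severity,
    pub kind: Kind,
}

// A report could then filter or group notes by severity.
pub fn fatal_notes(notes: &[Note]) -> Vec<&Note> {
    notes.iter().filter(|n| n.severity == Severity::Fatal).collect()
}
```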

    skip_zpools.insert(zpool);
}
if let Some(zpool) = &zone_config.filesystem_pool {
    skip_zpools.insert(zpool);
jgallagher (author):

This is the legit planner bug I mentioned in the PR description; without this fix, we have several tests that fail with a blippy report that looks something like:

blippy report for blueprint 30f46d73-5dd5-4125-800f-6c910bb24638: 2 notes
  sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 2220bf46-72e3-4d88-baae-6929f1891219 and CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc)
  sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc and CruciblePantry 6f6f16f9-3b33-408a-8508-2192249e3b0f)

This fix is also obvious in the blueprint diffs below in the PR, where we can see that before this change there were multiple zones of the same kind placed on the same zpool. I believe this bug does not affect production due to #7229.

    let mut slf = Self { blueprint, notes: Vec::new() };
    checks::perform_all_blueprint_only_checks(&mut slf);
    slf
}
jgallagher (author) commented Dec 18, 2024:

My intent is to add at least one more method here that would let us perform additional checks that need the PlanningInput (or at least the Policy), but make that optional so Blippy can still be used to check just a standalone blueprint. So something like:

let just_blueprint = Blippy::new(blueprint).into_report();                     // today
let full_context   = Blippy::new(blueprint).with_policy(policy).into_report(); // future

smklein (Collaborator) left a comment:

Looks great, thanks for taking this on!

nexus/reconfigurator/blippy/src/blippy.rs (outdated review thread, resolved)
},
/// Two sleds are using the same sled subnet.
ConflictingSledSubnets {
    other_sled: SledUuid,
smklein (Collaborator):

Should we have both sled UUIDs here? Unclear which one is the "other"

jgallagher (author):

Yeah I went back and forth on this. Every Note has a Component; this kind is assuming its note's component is a sled, and then this value is the "other" sled. I could put both sled IDs in here and just duplicate one with the Component? Or maybe this is pointing to a more serious structural problem (e.g., some Kinds are only sensible with specific Components, so maybe breaking them up the way I have is not right)?

smklein (Collaborator):

ahhhh I didn't put two and two together to see the higher-level usage of Component here.

I'm looking at the structure of:

pub struct Note<'a> {
    pub component: Component,
    pub severity: Severity,
    pub kind: Kind<'a>,
}

I think I agree with your comment, this Kind is actually a SledKind - I wouldn't want to lump non-sled stuff into this Kind enum.

Maybe we should do something like:

pub struct Note<'a> {
    pub severity: Severity,
    pub component_note: ComponentNote<'a>,
}

pub enum ComponentNote<'a> {
    Sled {
        id: SledUuid,
        kind: SledKind<'a>,
    },
    OtherComponentKind {
        id: OtherUuid,
        kind: OtherKind<'a>,
    },
}

jgallagher (author):

Yeah that seems better, thanks 👍

jgallagher (author):

Reworked in 15f04e8

nexus/reconfigurator/blippy/src/checks.rs (outdated review thread, resolved)
Comment on lines 105 to 151
/// Two zones with the same durable dataset kind are on the same zpool.
ZoneDurableDatasetCollision {
    zone1: &'a BlueprintZoneConfig,
    zone2: &'a BlueprintZoneConfig,
    zpool: ZpoolName,
},
/// Two zones with the same filesystem dataset kind are on the same zpool.
ZpoolFilesystemDatasetCollision {
    zone1: &'a BlueprintZoneConfig,
    zone2: &'a BlueprintZoneConfig,
    zpool: ZpoolName,
},
/// One zpool has two datasets of the same kind.
ZpoolWithDuplicateDatasetKinds {
    dataset1: &'a BlueprintDatasetConfig,
    dataset2: &'a BlueprintDatasetConfig,
    zpool: ZpoolUuid,
},
smklein (Collaborator):

I'm not opposed to more specific error kinds, but ZpoolWithDuplicateDatasetKinds will cover all three of these cases, right?

jgallagher (author):

I don't think so:

  • Zone*DatasetCollision come from scanning blueprint_zones, from which we don't have a BlueprintDatasetConfig needed to construct a ZpoolWithDuplicateDatasetKinds
  • ZpoolWithDuplicateDatasetKinds comes from scanning blueprint_datasets (so the inner fields are BlueprintDatasetConfigs, but these don't necessarily have a corresponding BlueprintZoneConfig anywhere)

We could squish all of these down to one if we dropped the zone/dataset specifics to something like DatasetKind? That seems like it's dropping a lot of information, though.

smklein (Collaborator):

Ah, gotcha, nevermind -- happy to keep all the cases distinct. I do think it's useful to give the most specific errors possible here.

I was (perhaps incorrectly?) thinking:

  • If we check blueprint_datasets for per-zpool uniqueness (checked by ZpoolWithDuplicateDatasetKinds)
  • ... and all blueprint_zones have associated datasets (checked by ZoneMissingFilesystemDataset + ZoneMissingDurableDataset)
  • ... then we can infer: all datasets that should exist do exist, and within the set of datasets, there are no collisions.

Granted - I don't have a beef with also attributing the dataset collisions to specific zones, so this is fine.

jgallagher (author):

I think you're right, and I debated collapsing these, but I decided to keep the separate cases for explicitness. It's okay if we get some "double notes", and I think it's easier to reason about each of the checks locally if they're checking their side of the condition (e.g., zones check that their corresponding dataset exists and datasets check that their corresponding zone exists, even though strictly speaking we could drop one of those and not miss anything).

nexus/reconfigurator/blippy/src/checks.rs (outdated review thread, resolved)
Comment on lines +443 to +457
// TODO-john Add a Severity::BackwardsCompatibility and note the
// missing filesystem pool
smklein (Collaborator):

Good idea

nexus/reconfigurator/blippy/src/checks.rs (review thread, resolved)
andrewjstone (Contributor) left a comment:

Looks great! Thanks for taking this on @jgallagher!

I agree with all @smklein's comments and appreciate the thorough review.

There are a few TODOs around expunged and decommissioned disks in blueprints. I'll take that on when my fix for #7098 goes out for review.


// Any given sled should have IPs within at most one subnet.
//
// The blueprint doesn't store sled subnets explicitly, so we can't
andrewjstone (Contributor):

Do you think the rack subnet is something we want to move into the blueprint eventually?

I think we do, mainly based on the "make everything in the blueprint explicit" POV.

jgallagher (author):

I think so, yeah. I don't think it's particularly urgent.
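Since the blueprint doesn't record sled subnets explicitly, the "infer the subnet from the IPs" idea can be sketched as follows. This is illustrative only (the actual check works over blueprint zone addresses per sled, and the /64 prefix width is an assumption):

```rust
use std::net::Ipv6Addr;

// Derive an address's /64 prefix (assumed sled-subnet width for this sketch).
fn prefix64(addr: Ipv6Addr) -> u64 {
    (u128::from(addr) >> 64) as u64
}

// A sled's IPs should all fall within one subnet: every address must share
// the same /64 prefix. An empty set trivially passes.
pub fn ips_within_one_subnet(ips: &[Ipv6Addr]) -> bool {
    let mut prefixes = ips.iter().map(|&ip| prefix64(ip));
    match prefixes.next() {
        None => true,
        Some(first) => prefixes.all(|p| p == first),
    }
}
```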

jgallagher force-pushed the john/reconfigurator-blippy branch from 972e820 to 09a46f5 on December 20, 2024 at 19:32
smklein (Collaborator) left a comment:

Tests look great!

Comment on lines 677 to 678
fn test_bad_internnal_dns_subnet() {
static TEST_NAME: &str = "test_bad_internnal_dns_subnet";
smklein (Collaborator):

Suggested change:

-fn test_bad_internnal_dns_subnet() {
-    static TEST_NAME: &str = "test_bad_internnal_dns_subnet";
+fn test_bad_internal_dns_subnet() {
+    static TEST_NAME: &str = "test_bad_internal_dns_subnet";

}
}
}
assert!(found_sled_missing_note, "found sled missing datasets note");
smklein (Collaborator):

Nitpick; this isn't an expect -- we'll see this message if the note isn't found, right?

smklein (Collaborator):

To be super clear, I think the error message should be "Did not find sled with missing datasets note".

jgallagher (author):

Hah, yes, thanks.
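For reference, the nit above comes down to assert!'s message semantics: the message argument is printed only when the condition is false, so it should describe the failure, not the success. A minimal illustration (hypothetical helper, not the actual test code):

```rust
// assert!'s message is shown only when the condition is FALSE, so it should
// describe the failure case, unlike expect-style "what we were doing" text.
fn check(found_sled_missing_note: bool) {
    assert!(
        found_sled_missing_note,
        "did not find sled missing datasets note"
    );
}
```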

}
}
}
assert!(found_sled_missing_note, "found sled missing disks note");
smklein (Collaborator):

Same comment here re: expect vs assert messages

jgallagher (author):

> Tests look great!

Thanks, and thanks for the super fast review. I didn't even have a chance to ping with a "tests are here, feel free to look them over"!

The tests are pretty repetitive; I was tempted to abstract the "mutate the blueprint, check for expected notes" pattern, but (a) I worry about abstracting tests too much and (b) in many of the tests the details of how the mutation / note creation happens are subtly different from each other in ways that might be hard to abstract. So I ended up with "just repeat a lot of similar things", which I think is okay for tests 🤷‍♂️.

smklein (Collaborator) commented Dec 20, 2024:

> So I ended up with "just repeat a lot of similar things", which I think is okay for tests 🤷‍♂️.

I'm totally on board with the pattern of "repeat a lot at first to make things explicit, de-dup only later once everyone is crystal clear on what's happening under the hood".

jgallagher merged commit b1978e2 into main on Dec 20, 2024; 18 checks passed.
jgallagher deleted the john/reconfigurator-blippy branch on December 20, 2024 at 21:47.