[reconfigurator] Decommission sleds #5698
Conversation
(not yet in DB serialization)
Quick review, but looks reasonable to me!
dev-tools/omdb/src/bin/omdb/db.rs
```diff
@@ -1456,7 +1456,7 @@ async fn cmd_db_sleds(
         Some(filter) => filter,
         None => {
             eprintln!("note: listing all sleds (use -F to filter, e.g. -F in-service)");
```
The `note:` message likely needs to be changed.
```rust
// Observe that the disk state is now decommissioned
assert_eq!(
    PhysicalDiskState::Decommissioned,
    lookup_physical_disk(&datastore, disk1.id()).await.disk_state
);
assert_eq!(
    PhysicalDiskState::Decommissioned,
    lookup_physical_disk(&datastore, disk2.id()).await.disk_state
);
```
Thanks for both fixing and testing this
```diff
-    mut self,
-    sled_ids: impl Iterator<Item = SledUuid>,
+    self,
+    commissioned_sled_ids: impl Iterator<Item = SledUuid>,
```
Nitpick: there really isn't any part of this argument that validates the state of these sled IDs as "commissioned", right? Should this just be called `added_sleds`?
Yeah, that does seem better. Renamed and expanded the doc comment in 64e8660
Looks great. Mostly nits from me.
```rust
{
    result.found.state()
} else {
    return Err(err.bail(
```
Just for understanding: I'm not used to this pattern. Is the `err.bail` here related to running in a transaction?
I believe so! I see this pattern in other transactions, and `err.bail`'s comments say it sets `err` to the value we pass here but then returns `DieselError::RollbackTransaction`.
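To illustrate that pattern outside of diesel: a minimal, dependency-free sketch of what a `bail`-style helper does, assuming (as the comment thread describes) that it stashes the caller's error and returns a rollback sentinel. The names `OptionalError`, `TxnError`, and `take` here are simplifications, not omicron's actual API:

```rust
use std::sync::{Arc, Mutex};

// Sentinel mimicking DieselError::RollbackTransaction.
#[derive(Debug, PartialEq)]
enum TxnError {
    RollbackTransaction,
}

// Shared slot for the "real" error, visible after the transaction unwinds.
#[derive(Clone)]
struct OptionalError<E>(Arc<Mutex<Option<E>>>);

impl<E> OptionalError<E> {
    fn new() -> Self {
        Self(Arc::new(Mutex::new(None)))
    }

    /// Stash `err` for later retrieval and return the rollback sentinel.
    fn bail(&self, err: E) -> TxnError {
        *self.0.lock().unwrap() = Some(err);
        TxnError::RollbackTransaction
    }

    fn take(&self) -> Option<E> {
        self.0.lock().unwrap().take()
    }
}

fn main() {
    let err = OptionalError::new();
    // Inside a transaction closure, a domain error becomes a rollback:
    let result: Result<(), TxnError> =
        Err(err.bail("sled not in expected state"));
    assert_eq!(result, Err(TxnError::RollbackTransaction));
    // After the transaction fails, the stashed error explains why:
    assert_eq!(err.take(), Some("sled not in expected state"));
    println!("ok");
}
```

The point of the indirection is that the closure passed to the transaction machinery can only return the database error type, so the domain-specific error has to travel out-of-band.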
```rust
// the case.
for sled_id in self.blueprint.sled_ids_with_zones() {
    if !commissioned_sled_ids.contains(&sled_id) {
        self.blueprint.expunge_all_zones_for_sled(
```
When wouldn't this be the case?
If we see this happen, rather than doing an ensure-like operation, should we check what the issue was and log an error? That seems like an invariant violation to me.
Agreed; I changed this to a hard error in 1e629df
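The shape of that change can be sketched as follows. This is a simplified standalone model, not the code in 1e629df: `SledUuid` is reduced to a string, and the error type and message are assumptions. The idea is just that "a sled with zones that is not commissioned" stops being silently repaired and becomes a hard error:

```rust
use std::collections::BTreeSet;

// Simplified stand-in for omicron's SledUuid newtype.
type SledUuid = String;

/// Return an error (instead of quietly expunging zones) if any sled that
/// still has zones is missing from the commissioned set.
fn check_decommission_invariant(
    commissioned_sled_ids: &BTreeSet<SledUuid>,
    sled_ids_with_zones: &[SledUuid],
) -> Result<(), String> {
    for sled_id in sled_ids_with_zones {
        if !commissioned_sled_ids.contains(sled_id) {
            return Err(format!(
                "invariant violation: sled {sled_id} is not commissioned \
                 but still has zones"
            ));
        }
    }
    Ok(())
}

fn main() {
    let commissioned: BTreeSet<_> =
        ["g0".to_string(), "g1".to_string()].into();
    // Commissioned sled with zones: fine.
    assert!(
        check_decommission_invariant(&commissioned, &["g0".to_string()])
            .is_ok()
    );
    // Decommissioned sled that still has zones: hard error.
    assert!(
        check_decommission_invariant(&commissioned, &["g2".to_string()])
            .is_err()
    );
    println!("ok");
}
```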
```rust
    datastore: &DataStore,
    sled_id: SledUuid,
) -> anyhow::Result<()> {
    let (authz_sled,) = LookupPath::new(opctx, datastore)
```
Why do the auth here, rather than in the datastore method?
`DataStore::sled_set_state_to_decommissioned()` was written in such a way as to require that - it wants an `authz_sled`, so we have to do the lookup to call it at all. I'm not sure why that is, though - @sunshowers do you remember?
This is a pretty small delta on top of #5698 that I've been sitting on for a few days before getting a chance to test it on a4x2 this afternoon. I set up a4x2 with three sleds (g0, g1, g3), and went through 3.5 add/remove cycles on sled g2 (3 add/remove pairs and a 4th add). g2 is present in `omdb db sleds`:

```
root@oxz_switch:~# omdb db sleds
SERIAL  IP                             ROLE      POLICY      STATE   ID
g3      [fd00:1122:3344:103::1]:12345  scrimlet  in service  active  13b8d30d-b66d-4333-b156-9aa2527e130b
g2      [fd00:1122:3344:124::1]:12345  -         in service  active  218e4e9c-0a27-4460-a65c-7e93bef531c4
g1      [fd00:1122:3344:102::1]:12345  -         in service  active  661e9e3e-0beb-43fe-9606-109b7145b258
g0      [fd00:1122:3344:101::1]:12345  scrimlet  in service  active  9f19235f-bfa4-4032-9cad-608448b4f1a0
```

I added `SledFilter::Decommissioned` specifically for use with omdb, where we can see the 3 previous times `g2` was present (each with a different sled ID):

```
root@oxz_switch:~# omdb db sleds -F decommissioned
SERIAL  IP                             ROLE  POLICY    STATE           ID
g2      [fd00:1122:3344:121::1]:12345  -     expunged  decommissioned  a7b09236-c8ba-4025-934d-b5f0539d9379
g2      [fd00:1122:3344:123::1]:12345  -     expunged  decommissioned  ec0e81bf-6366-4273-97bc-47223334dc90
g2      [fd00:1122:3344:122::1]:12345  -     expunged  decommissioned  f5f8ba44-9681-4e6f-84f2-fd654c83dd23
```

The current blueprint shows history of all four times the sled has been present, because we don't currently prune expunged zones:

```
root@oxz_switch:~# omdb nexus blueprints show current
blueprint 5bfb182e-df8a-41e5-a8a3-862bb42f8feb
parent: 5b2823c8-ca41-46b5-9efd-ffaae0e6e028

-----------------------------------------------------------------------------------------------
 zone type        zone ID                               disposition  underlay IP
-----------------------------------------------------------------------------------------------

 sled 13b8d30d-b66d-4333-b156-9aa2527e130b: blueprint zones at generation 5
  clickhouse       9ee21438-4a2f-499a-b867-58993a36f2f0  in service   fd00:1122:3344:103::6
  cockroach_db     b0f22c64-a8e7-4c83-af26-f30a0dc413d5  in service   fd00:1122:3344:103::3
  crucible         29f3db85-722f-41ed-a66c-66aa31a31591  in service   fd00:1122:3344:103::b
  crucible         40317028-eedf-487f-88ae-821632f05f39  in service   fd00:1122:3344:103::8
  crucible         50153da9-da61-4cd9-829c-18129e5a6c52  in service   fd00:1122:3344:103::9
  crucible         765dc0e2-1850-43b1-ae08-531d51772f14  in service   fd00:1122:3344:103::a
  crucible         bcc91ace-f816-400d-9198-22ed20e00ca3  in service   fd00:1122:3344:103::c
  crucible_pantry  bf1c3443-fbf5-4e16-affb-ee4e2598cbcb  in service   fd00:1122:3344:103::7
  external_dns     ed00e862-094c-449b-bfdb-ddba26f36bb2  in service   fd00:1122:3344:103::4
  internal_dns     6b4f3315-ce80-4d89-a495-ff96c5f573cd  in service   fd00:1122:3344:3::1
  internal_ntp     6b9d837b-9bb9-46f1-8f48-c0c8a7d89882  in service   fd00:1122:3344:103::d
  nexus            327c035e-628e-4880-b434-8d77ca362f21  in service   fd00:1122:3344:103::5

 sled 218e4e9c-0a27-4460-a65c-7e93bef531c4: blueprint zones at generation 3
  crucible         5a7167ff-04ef-47fd-93bc-2adc7c8a7087  in service   fd00:1122:3344:124::26
  crucible         9ca15af1-79dc-4248-86d1-5d7d25987609  in service   fd00:1122:3344:124::23
  crucible         a80accc8-3fdd-42b2-8425-eb246a7f0ba0  in service   fd00:1122:3344:124::22
  crucible         d80a6050-f2e3-4827-8fd0-37c5c88ac87d  in service   fd00:1122:3344:124::25
  crucible         e92e8930-6f81-4f09-a520-415ef896a05f  in service   fd00:1122:3344:124::24
  internal_ntp     a37275e8-eb02-4cb3-b216-a47a892313e7  in service   fd00:1122:3344:124::21

 sled 661e9e3e-0beb-43fe-9606-109b7145b258: blueprint zones at generation 5
  boundary_ntp     8eb79dcb-b6a0-4a24-835e-cf9e55b12495  in service   fd00:1122:3344:102::d
  cockroach_db     294868f1-d76b-4ef6-a87b-399ec06ba9a3  in service   fd00:1122:3344:102::3
  cockroach_db     373fa1d2-597c-4245-a87b-31a7f734c806  in service   fd00:1122:3344:102::4
  crucible         00611d19-0752-4e6f-a57a-43df54089987  in service   fd00:1122:3344:102::c
  crucible         05697e11-b936-4319-b1a6-ba8e46c77988  in service   fd00:1122:3344:102::b
  crucible         612ad32d-321d-4159-9265-4c0eca972e2c  in service   fd00:1122:3344:102::9
  crucible         c03f67ee-7f38-445f-b9a5-dcfd1feb976c  in service   fd00:1122:3344:102::a
  crucible         e2e46038-8ecb-4fe0-be1a-94530842cd65  in service   fd00:1122:3344:102::8
  crucible_pantry  4fea76f9-d88e-43e9-ade7-2345dd77d2a7  in service   fd00:1122:3344:102::7
  internal_dns     504f14cf-42ea-4457-b922-fb400eea1495  in service   fd00:1122:3344:2::1
  nexus            8d3ba915-17f6-4c2f-a94d-31120e89f17b  in service   fd00:1122:3344:102::5
  oximeter         4d98514a-a352-4ddc-996f-912958d9a80d  in service   fd00:1122:3344:102::6

 sled 9f19235f-bfa4-4032-9cad-608448b4f1a0: blueprint zones at generation 5
  boundary_ntp     603a1fe0-1fd1-4fc1-968c-af95e1400f2d  in service   fd00:1122:3344:101::d
  cockroach_db     d0f66442-cc46-48b3-9770-d473517a81f5  in service   fd00:1122:3344:101::3
  cockroach_db     ed701854-eb03-43eb-8a7c-5c1aea73cc51  in service   fd00:1122:3344:101::4
  crucible         56151b6f-fa7d-4459-9540-6b04460d2c64  in service   fd00:1122:3344:101::8
  crucible         8b37469e-2660-498e-bad7-9c9158b4fbae  in service   fd00:1122:3344:101::9
  crucible         b602ce7d-2b92-4e12-8b95-dd2c4cad3397  in service   fd00:1122:3344:101::a
  crucible         d87cae8c-d85d-4a25-b7d6-f7bbce50f6e9  in service   fd00:1122:3344:101::b
  crucible         ea549438-9531-491e-b85b-b92668184c9f  in service   fd00:1122:3344:101::c
  crucible_pantry  51e18a3d-a4ce-4ee1-a900-3f23444c9bfd  in service   fd00:1122:3344:101::7
  external_dns     60f77e70-af63-4b4d-b3b2-62610f707d47  in service   fd00:1122:3344:101::5
  internal_dns     d47c3ade-ae61-49b0-8bb7-217a825517de  in service   fd00:1122:3344:1::1
  nexus            a1df997e-3595-4f24-979c-76232ae16164  in service   fd00:1122:3344:101::6

 sled a7b09236-c8ba-4025-934d-b5f0539d9379: blueprint zones at generation 4
  crucible         8aac1836-3d74-4d88-af41-b9e2fda8d3f7  expunged     fd00:1122:3344:121::23
  crucible         a0468efe-3f76-4244-9d92-03a70608c777  expunged     fd00:1122:3344:121::22
  crucible         afc45a35-0860-41d8-bac1-c06ac442fd43  expunged     fd00:1122:3344:121::24
  crucible         b596d696-6521-4ebb-b0b4-22a772f5d1d0  expunged     fd00:1122:3344:121::25
  crucible         b5f1851a-695f-4223-bc67-d30df30e77d0  expunged     fd00:1122:3344:121::26
  internal_ntp     219fdc9b-fe80-436a-9d81-c8f7bb3f0364  expunged     fd00:1122:3344:121::21

 sled ec0e81bf-6366-4273-97bc-47223334dc90: blueprint zones at generation 4
  crucible         5fe4544e-d44b-45fe-bf99-505007654e97  expunged     fd00:1122:3344:123::22
  crucible         76f11abf-cfbf-41ee-972b-1c38c6babd22  expunged     fd00:1122:3344:123::26
  crucible         87821b5d-ed10-4787-ab24-6b7411bcc5d8  expunged     fd00:1122:3344:123::25
  crucible         8ceb3b5b-eb61-4195-b873-6dac8108fd73  expunged     fd00:1122:3344:123::23
  crucible         c1d0cfb9-9489-4fce-a213-40fb5c1b7b0f  expunged     fd00:1122:3344:123::24
  internal_ntp     86a61155-e468-4906-b0b6-d12fa2f3f4dd  expunged     fd00:1122:3344:123::21

 sled f5f8ba44-9681-4e6f-84f2-fd654c83dd23: blueprint zones at generation 4
  crucible         4c82a05d-bbc0-40b2-8b85-f688b56b2f7b  expunged     fd00:1122:3344:122::22
  crucible         6b6a6770-f552-438d-bca0-b1d8bdd8c2ed  expunged     fd00:1122:3344:122::24
  crucible         83eccc0d-afea-4e27-a906-3a629c2094d1  expunged     fd00:1122:3344:122::23
  crucible         8f6cea3e-5489-4665-a37b-c6336c0e54c8  expunged     fd00:1122:3344:122::26
  crucible         bdd61a6c-50b8-450c-b963-443d77a1c7f4  expunged     fd00:1122:3344:122::25
  internal_ntp     ab8b9602-a861-4985-809d-30e4724a63ab  expunged     fd00:1122:3344:122::21

 METADATA:
  created by:            a1df997e-3595-4f24-979c-76232ae16164
  created at:            2024-05-10T19:09:09.547Z
  comment:               sled 218e4e9c-0a27-4460-a65c-7e93bef531c4: add zones
  internal DNS version:  11
  external DNS version:  2
```

Also visible in that blueprint is evidence that we went through the "add NTP" -> "add crucible" reconfigurator steps. I did not go so far as to add a Nexus to g2 each time (it's considerably more manually intensive to do so, which we might need to address if that's a thing we want to test more thoroughly).

This also fixes a bug in `allocate_sled_underlay_subnet_octets` that I accidentally introduced in #5675, restoring idempotence up until the point where the sled has upserted itself. (The comments added should clarify this.)
We can and should still fail on an attempt to add a sled where we (a) have an allocation and (b) have an entry in the `sled` table, but we no longer fail on an attempt to add a sled where we (a) have an allocation but (b) do NOT have an entry in the `sled` table.
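That rule reduces to a small predicate. The sketch below is an illustration of the intended semantics only (the real logic lives in `allocate_sled_underlay_subnet_octets` and involves actual DB queries; the function name `should_fail_add` is hypothetical):

```rust
/// Adding a sled must fail only if we already have BOTH an underlay octet
/// allocation AND a row in the `sled` table. An allocation alone means the
/// sled never finished upserting itself, so a retry is allowed (idempotence).
fn should_fail_add(has_allocation: bool, in_sled_table: bool) -> bool {
    has_allocation && in_sled_table
}

fn main() {
    // Fresh sled: no prior state, add proceeds.
    assert!(!should_fail_add(false, false));
    // Allocation exists but the sled never upserted itself: retry is allowed.
    assert!(!should_fail_add(true, false));
    // Fully added sled: a second add is an error.
    assert!(should_fail_add(true, true));
    println!("ok");
}
```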
This builds on (and is currently pointed at) #5663, and is on the path to fixing #5625, although there is still more work to do there. (Should be small, and I'll start on it while this is in review to make sure it really fixes it.) I don't plan on landing this before R8 is out the door, but wanted to go ahead and open it up to review.
Much of this is fallout from our discussions about what it means to decommission a sled. Major changes are:
- `SledFilter::All` has been renamed to `SledFilter::Commissioned`, and no longer returns sleds that are in `SledState::Decommissioned`.
- Sled state is set to `Decommissioned` for sleds which satisfy our conditions. (See `do_plan_decommission()`.)
- Sleds are not re-marked `Decommissioned` once the inputs report that the sled has already been decommissioned.
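The filter split described above can be sketched roughly like this. The variant and state names come from the PR; the `matches` helper and the exact state layout are simplifying assumptions, not omicron's real definitions:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum SledState {
    Active,
    Decommissioned,
}

#[derive(Clone, Copy)]
enum SledFilter {
    /// Formerly `SledFilter::All`: every sled that is not decommissioned.
    Commissioned,
    /// Added for omdb: only sleds that have been decommissioned.
    Decommissioned,
}

impl SledFilter {
    fn matches(self, state: SledState) -> bool {
        match self {
            SledFilter::Commissioned => state != SledState::Decommissioned,
            SledFilter::Decommissioned => state == SledState::Decommissioned,
        }
    }
}

fn main() {
    assert!(SledFilter::Commissioned.matches(SledState::Active));
    // The rename is behavioral, not cosmetic: decommissioned sleds
    // no longer show up under the default filter.
    assert!(!SledFilter::Commissioned.matches(SledState::Decommissioned));
    assert!(SledFilter::Decommissioned.matches(SledState::Decommissioned));
    println!("ok");
}
```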