[Nexus] Add a sled to an initialized rack #4545

andrewjstone · 2023-11-21T17:20:39Z

This commit provides an external API for adding a sled to an already initialized rack.

I still need to add a few automated tests and test on madrid, but I'd like to get eyes on the code ASAP due to urgency.

This commit provides an external API for adding a sled to an already initialized rack.

davepacheco

None of my comments here is critical.

What about automated testing? Did we decide the complexity isn't worth it? It does make me nervous because I think for the most part Nexus does have strong test coverage so people might not realize they need to do something special if they touch the code used by this feature.

nexus/db-queries/src/db/datastore/inventory.rs

nexus/db-queries/src/db/datastore/rack.rs

davepacheco · 2023-11-21T18:20:57Z

schema/crdb/dbinit.sql

+-- This table does not include subnet octets allocated during RSS and therefore
+-- all of the octets start at 33. This makes the data in this table purely additive
+-- post-RSS, which also implies that we cannot re-use subnet octets if an original
+-- sled that was part of RSS was removed from the cluster.


Why this choice?

It's much simpler than the alternative.

There's already a bifurcation where the rack is initialized via wicket and so there is no place where nexus can go ahead and allocate subnets before sleds get added. This is the order things must happen when adding a sled to an already initialized rack, because sleds only get added to the sled table after they are initialized.

Therefore if we wanted to track all allocations we'd need to add them after sleds get added to the sled table with RSS. This would require checking the allocations table for the added sled and adding the allocation if it's not there. For generality, we'd then need to do the same thing any sled was added to the sled table, since the nexus internal endpoint doesn't differentiate between sleds added during RSS or any other time. This is ok, but mainly just extra work.

I'm also not really sure what other issues I'd be running into trying to do this and I'm already running low on time.

I think this is similar to the way the internal DNS names and services get populated. RSS makes all those decisions, hands them to Nexus in the handoff, and in the handoff request Nexus goes and writes all that stuff to the database. I understand urgency is a priority here too.

This is a great point. In chat @davepacheco pointed me to how this initialization is done:
https://github.com/oxidecomputer/omicron/blob/main/nexus/src/app/rack.rs#L86
and
https://github.com/oxidecomputer/omicron/blob/main/nexus/db-queries/src/db/datastore/rack.rs#L423

I'd also need to do a migration for existing deployments. I'm not sure I'll have time to make these changes in the next day. I'm going to go through the other review comments and testing and then will see If I can do this. Otherwise, perhaps a follow up.

CC @smklein

schema/crdb/dbinit.sql

davepacheco · 2023-11-21T18:26:36Z

nexus/src/app/rack.rs

+        Ok(())
+    }
+
+    async fn get_any_sled_agent(


At some point we probably want to commonize this with sled_client() in nexus/src/app/sled.rs. The intent there was to have one way to get a sled agent client for a given sled. That way when we add connection pooling we only have to update one place. Probably a better reason is that we should probably add an authz check there. Anyway, that code is not currently factored well for what you want here (because it takes an id and does an extra lookup) so I'm not sure if it's worth it here.

Yeah, I just factored out the way Ry was already doing for bootstore init where any sled-agent can be talked to and put it into a function for re-use.

nexus/src/app/rack.rs

davepacheco · 2023-11-21T18:33:00Z

nexus/src/app/rack.rs

+        sa.add_sled_to_initialized_rack(&req).await.map_err(|e| {
+            Error::InternalError {
+                internal_message: format!(
+                    "failed to add sled with baseboard {:?} to rack {}: {e}",
+                    sled.baseboard, new_allocation.rack_id
+                ),
+            }
+        })?;


I think some of the failure modes should be 503s (e.g., failure to establish a TCP connection), though I don't think it matters that much in this case. We presumably want a general-purpose function for determining that, similar to the ErrorHandler mechanism for the database. I'm not sure if that exists for internal API calls.

nexus/src/external_api/http_entrypoints.rs

andrewjstone · 2023-11-21T18:39:29Z

None of my comments here is critical.

What about automated testing? Did we decide the complexity isn't worth it? It does make me nervous because I think for the most part Nexus does have strong test coverage so people might not realize they need to do something special if they touch the code used by this feature.

Thanks for the review @davepacheco. I'm about to add some automated tests now. I just wanted to get eyes on the code first and not wait for the tests for review.

smklein · 2023-11-21T20:01:45Z

schema/crdb/12.0.0/up1.sql

+
+    -- The octet that extends a /56 rack subnet to a /64 sled subnet
+    --
+    -- Always between 33 and 255 inclusive


CHECK constraints can be used to validate this: https://www.cockroachlabs.com/docs/v23.1/check

That being said -- do we actually want this bifurcation? Shouldn't we have "sleds added after RSS" and "sleds added during RSS" look as close as possible? Why do we only want to enable this allocation for late-added sleds?

Put another way: is it worthwhile to allocate rows to this table for existing sleds, when we perform the schema upgrade?

I left an answer for this in another comment: #4545 (comment)

While I agree it would be nice to have, it's unclear to me what the semantics should be. It's not just a matter of a migration, because we'd also need to add entries as a result of RSS on new racks being installed at customers.

Thanks for the tip about CHECK constraints!

Check constraints added in c9c1f90

nexus/src/app/rack.rs

smklein · 2023-11-21T20:19:34Z

nexus/db-queries/src/db/datastore/rack.rs

+            .filter(dsl::rack_subnet.is_not_null())
+            .select(dsl::rack_subnet)
+            .limit(1)
+            .first_async::<Option<IpNetwork>>(&*conn)


You could use .get_result_async here instead of .first_async here, if you want to throw a "Not Found" error.

See: https://docs.rs/async-bb8-diesel/0.1.0/async_bb8_diesel/trait.AsyncRunQueryDsl.html

These are all the ..._async versions of: https://docs.rs/diesel/latest/diesel/query_dsl/trait.RunQueryDsl.html

From get_result specifically:

Err(NotFound) will be returned if the query affected 0 rows.
You can call .optional() on the result of this if the command was optional to get back a Result<Option>

(this might be applicable for other spots you're using .first_async ?)

This is a great tip! Thank you! I had incorrectly assumed that first_async would also return a "Not Found" error. I will fix and use thisl.

Hmm, So it looks like first_async does the same thing:

Returns Ok(record) if found, and Err(NotFound) if no results are returned. If the query truly is optional, you can call .optional() on the result of this to get a Result<Option>.

The problem requiring the Option is the nullable field.

The comment about sled address allocation was removed because it was implemented in #4545

[Nexus] Add a sled to an initialized rack

81f28dc

This commit provides an external API for adding a sled to an already initialized rack.

andrewjstone requested review from davepacheco and smklein November 21, 2023 17:20

Merge remote-tracking branch 'origin/main' into nexus-add-sled-2

bef875e

davepacheco approved these changes Nov 21, 2023

View reviewed changes

openapi updates

7729c2e

smklein reviewed Nov 21, 2023

View reviewed changes

jgallagher mentioned this pull request Nov 21, 2023

Detect and un-delete phantom disks #4547

Merged

andrewjstone added 7 commits November 21, 2023 22:18

some review fixes and schema update

c9c1f90

Merge remote-tracking branch 'origin/main' into nexus-add-sled-2

c4d6073

add a test

61db1c9

Add integration test for listing uninitialized sleds

e9aec55

add another test

96a8037

Add a better roundtrip test for subnet allocations

20dfcf4

remove bunk file

df773b9

andrewjstone mentioned this pull request Nov 23, 2023

Tracking issue for adding a sled to an existing rack cluster #3924

Closed

10 tasks

Fix the dumbest error

3533973

andrewjstone enabled auto-merge (squash) November 25, 2023 06:19

andrewjstone merged commit 47968b8 into main Nov 25, 2023
20 of 21 checks passed

andrewjstone deleted the nexus-add-sled-2 branch November 25, 2023 06:57

askfongjojo mentioned this pull request Dec 6, 2023

500 error when attempting to list uninitialized sleds #4621

Closed

andrewjstone added a commit that referenced this pull request Dec 14, 2023

[Trivial] Fix wording and comments

a5bfe0d

The comment about sled address allocation was removed because it was implemented in #4545

andrewjstone mentioned this pull request Dec 14, 2023

[Trivial] Fix wording and comments #4695

Merged

andrewjstone added a commit that referenced this pull request Dec 14, 2023

[Trivial] Fix wording and comments (#4695)

c82d4fc

The comment about sled address allocation was removed because it was implemented in #4545

andrewjstone mentioned this pull request Dec 14, 2023

Ability to add a new sled to the lrtq cluster #3757

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Nexus] Add a sled to an initialized rack #4545

[Nexus] Add a sled to an initialized rack #4545

andrewjstone commented Nov 21, 2023

davepacheco left a comment

davepacheco Nov 21, 2023

andrewjstone Nov 21, 2023

davepacheco Nov 21, 2023

andrewjstone Nov 21, 2023

davepacheco Nov 21, 2023

andrewjstone Nov 21, 2023

davepacheco Nov 21, 2023

andrewjstone commented Nov 21, 2023

smklein Nov 21, 2023

smklein Nov 21, 2023

andrewjstone Nov 21, 2023

andrewjstone Nov 21, 2023

andrewjstone Nov 21, 2023

smklein Nov 21, 2023

smklein Nov 21, 2023

andrewjstone Nov 21, 2023

andrewjstone Nov 21, 2023

[Nexus] Add a sled to an initialized rack #4545

[Nexus] Add a sled to an initialized rack #4545

Conversation

andrewjstone commented Nov 21, 2023

davepacheco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewjstone commented Nov 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment