Implement sled-agent API for adding a sled #4415

andrewjstone · 2023-11-02T17:23:03Z

This code largely reuses the existing code paths for starting sled-agent with a small change to configured the bootstore as a learner node if trust quorum is in use.

This code path does not attempt any retries, as it should be idempotent as long as the same sled and rack UUIDs are used.This is a command driven by an operator and we want to know right away if it succeeded or not. Multiple sleds can be added by Nexus with individual calls, so that individual errors can be returned without added complexity at the sled- agent level. These requests can also be sent to different sled-agents, since the sled-agent is just acting as a proxy to a remote bootstrap agent.

In preparation for sled-addition, a new version of `StartSledAgentRequest` was created and the old version coverted to V0. Deterministic conversion from v0 to v1 is possible because we know that `ntp_servers` and `dns_servers` fields are not in use. We also know the values of the new fields. `is_lrtq_learner` must be false because we only support RSS addition on existing sleds. We then implement a deserialize method that will return the new version regardless of what's on disk. This is similar to how we handled the bootstore upgrade for `EarlyNetworkConfig`. Also similar to that, we extended the format of `StartSledAgentRequest` to ease future deserializations. In those deserializations we can deserialize the "header" and use [serde_json::value::RawValue](https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html) in order to defer deserialization of the `body` field until we know the `schema_version`.

jgallagher

Looks great - this is a lot less than I expected it to be! Just a couple minor questions.

sled-agent/src/http_entrypoints.rs

jgallagher · 2023-11-03T14:58:59Z

sled-agent/src/sled_agent.rs

+        log.new(o!("BootstrapAgentClient" => bootstrap_addr.to_string())),
+    );
+
+    client.start_sled_agent(&request).await.map_err(|err| {


What happens if we send this request to a sled that's already in the cluster?

Great question.

We'll end up in the ServerStarted clause which tries to be idempotent but only on a few fields. It will currently return an error if the sled-id or address changed.

I think it would probably be better to make this truly idempotent and check the whole request. What do you think?

I think it would probably be better to make this truly idempotent and check the whole request. What do you think?

It looks like as written, if we get a request with matching fields that are checked but non-matching other fields, we just ignore them entirely and return Ok(_), which would presumably leave the client confused (it could reasonably assume all the params it passed are in use, but really the sled-agent was still running with some other set of params)? I agree making it check the whole request seems better, unless there's some case in RSS that we don't know or understand where we need the current behavior.

I made it idempotent in 87d7d41

This code largely reuses the existing code paths for starting sled-agent with a small change to configured the bootstore as a learner node if trust quorum is in use. This code path does not attempt any retries, as it should be idempotent as long as the same sled and rack UUIDs are used.This is a command driven by an operator and we want to know right away if it succeeded or not. Multiple sleds can be added by Nexus with individual calls, so that individual errors can be returned without added complexity at the sled- agent level. These requests can also be sent to different sled-agents, since the sled-agent is just acting as a proxy to a remote bootstrap agent.

andrewjstone · 2023-11-08T00:12:37Z

I tested adding sled 15 to an existing madrid cluster of sleds (0,1,14) and it worked. The sled learned its key share and ended up starting sled agent which registered itself in CRDB via Nexus.

I also ensured that the command is idempotent and we properly error when trying to add a sled that is already part of the cluster.

andrewjstone requested a review from jgallagher November 2, 2023 17:23

jgallagher approved these changes Nov 3, 2023

View reviewed changes

andrewjstone added 2 commits November 6, 2023 16:11

Fix v0 deserialization

52e4932

andrewjstone force-pushed the sled-agent-add-sled-api branch from 2a068db to 0d50df0 Compare November 6, 2023 16:16

andrewjstone added 3 commits November 6, 2023 16:29

Externally visible error for add sled fail

6552bb8

fix error code

f151bb8

Error if StartSledAgentRequest is not idempotent

87d7d41

Base automatically changed from sled-agent-add-sled-to-rack-cluster to main November 7, 2023 20:36

Merge branch 'main' into sled-agent-add-sled-api

e048753

andrewjstone enabled auto-merge (squash) November 8, 2023 00:29

andrewjstone merged commit 3db5ae8 into main Nov 8, 2023
20 of 21 checks passed

andrewjstone deleted the sled-agent-add-sled-api branch November 8, 2023 01:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement sled-agent API for adding a sled #4415

Implement sled-agent API for adding a sled #4415

andrewjstone commented Nov 2, 2023

jgallagher left a comment

jgallagher Nov 3, 2023

andrewjstone Nov 6, 2023

jgallagher Nov 6, 2023 •

edited

Loading

andrewjstone Nov 6, 2023

andrewjstone commented Nov 8, 2023

Implement sled-agent API for adding a sled #4415

Implement sled-agent API for adding a sled #4415

Conversation

andrewjstone commented Nov 2, 2023

jgallagher left a comment

Choose a reason for hiding this comment

jgallagher Nov 3, 2023

Choose a reason for hiding this comment

andrewjstone Nov 6, 2023

Choose a reason for hiding this comment

jgallagher Nov 6, 2023 • edited Loading

Choose a reason for hiding this comment

andrewjstone Nov 6, 2023

Choose a reason for hiding this comment

andrewjstone commented Nov 8, 2023

jgallagher Nov 6, 2023 •

edited

Loading