Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sled-agent API for adding a sled #4415

Merged
merged 7 commits into from
Nov 8, 2023

Conversation

andrewjstone
Copy link
Contributor

This code largely reuses the existing code paths for starting sled-agent with a small change to configured the bootstore as a learner node if trust quorum is in use.

This code path does not attempt any retries, as it should be idempotent as long as the same sled and rack UUIDs are used.This is a command driven by an operator and we want to know right away if it succeeded or not. Multiple sleds can be added by Nexus with individual calls, so that individual errors can be returned without added complexity at the sled- agent level. These requests can also be sent to different sled-agents, since the sled-agent is just acting as a proxy to a remote bootstrap agent.

In preparation for sled-addition, a new version of
`StartSledAgentRequest` was created and the old version coverted to V0.
Deterministic conversion from v0 to v1 is possible because we know that
`ntp_servers` and `dns_servers` fields are not in use. We also know the
values of the new fields. `is_lrtq_learner` must be false because we
only support RSS addition on existing sleds.

We then implement a deserialize method that will return the new version
regardless of  what's on disk. This is similar to how we handled the
bootstore upgrade for `EarlyNetworkConfig`. Also similar to that,
we extended the format of `StartSledAgentRequest` to ease future
deserializations. In those deserializations we can deserialize the
"header" and use [serde_json::value::RawValue](https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html)
in order to defer deserialization of the `body` field until we know
the `schema_version`.
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - this is a lot less than I expected it to be! Just a couple minor questions.

sled-agent/src/http_entrypoints.rs Outdated Show resolved Hide resolved
log.new(o!("BootstrapAgentClient" => bootstrap_addr.to_string())),
);

client.start_sled_agent(&request).await.map_err(|err| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we send this request to a sled that's already in the cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question.

We'll end up in the ServerStarted clause which tries to be idempotent but only on a few fields. It will currently return an error if the sled-id or address changed.

I think it would probably be better to make this truly idempotent and check the whole request. What do you think?

Copy link
Contributor

@jgallagher jgallagher Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would probably be better to make this truly idempotent and check the whole request. What do you think?

It looks like as written, if we get a request with matching fields that are checked but non-matching other fields, we just ignore them entirely and return Ok(_), which would presumably leave the client confused (it could reasonably assume all the params it passed are in use, but really the sled-agent was still running with some other set of params)? I agree making it check the whole request seems better, unless there's some case in RSS that we don't know or understand where we need the current behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it idempotent in 87d7d41

This code largely reuses the existing code paths for starting sled-agent
with a small change to configured the bootstore as a learner node if
trust quorum is in use.

This code path does not attempt any retries, as it should be idempotent
as long as the same sled and rack UUIDs are used.This is a command
driven by an operator and we want to know right away if it succeeded or
not. Multiple sleds can be added by Nexus with individual calls, so that
individual errors can be returned without added complexity at the sled-
agent level. These requests can also be sent to different sled-agents,
since the sled-agent is just acting as a proxy to a remote
bootstrap agent.
@andrewjstone andrewjstone force-pushed the sled-agent-add-sled-api branch from 2a068db to 0d50df0 Compare November 6, 2023 16:16
Base automatically changed from sled-agent-add-sled-to-rack-cluster to main November 7, 2023 20:36
@andrewjstone
Copy link
Contributor Author

I tested adding sled 15 to an existing madrid cluster of sleds (0,1,14) and it worked. The sled learned its key share and ended up starting sled agent which registered itself in CRDB via Nexus.

I also ensured that the command is idempotent and we properly error when trying to add a sled that is already part of the cluster.

@andrewjstone andrewjstone enabled auto-merge (squash) November 8, 2023 00:29
@andrewjstone andrewjstone merged commit 3db5ae8 into main Nov 8, 2023
20 of 21 checks passed
@andrewjstone andrewjstone deleted the sled-agent-add-sled-api branch November 8, 2023 01:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants