-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement sled-agent API for adding a sled #4415
Conversation
In preparation for sled-addition, a new version of `StartSledAgentRequest` was created and the old version coverted to V0. Deterministic conversion from v0 to v1 is possible because we know that `ntp_servers` and `dns_servers` fields are not in use. We also know the values of the new fields. `is_lrtq_learner` must be false because we only support RSS addition on existing sleds. We then implement a deserialize method that will return the new version regardless of what's on disk. This is similar to how we handled the bootstore upgrade for `EarlyNetworkConfig`. Also similar to that, we extended the format of `StartSledAgentRequest` to ease future deserializations. In those deserializations we can deserialize the "header" and use [serde_json::value::RawValue](https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html) in order to defer deserialization of the `body` field until we know the `schema_version`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - this is a lot less than I expected it to be! Just a couple minor questions.
log.new(o!("BootstrapAgentClient" => bootstrap_addr.to_string())), | ||
); | ||
|
||
client.start_sled_agent(&request).await.map_err(|err| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we send this request to a sled that's already in the cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question.
We'll end up in the ServerStarted
clause which tries to be idempotent but only on a few fields. It will currently return an error if the sled-id or address changed.
I think it would probably be better to make this truly idempotent and check the whole request. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would probably be better to make this truly idempotent and check the whole request. What do you think?
It looks like as written, if we get a request with matching fields that are checked but non-matching other fields, we just ignore them entirely and return Ok(_)
, which would presumably leave the client confused (it could reasonably assume all the params it passed are in use, but really the sled-agent was still running with some other set of params)? I agree making it check the whole request seems better, unless there's some case in RSS that we don't know or understand where we need the current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it idempotent in 87d7d41
This code largely reuses the existing code paths for starting sled-agent with a small change to configured the bootstore as a learner node if trust quorum is in use. This code path does not attempt any retries, as it should be idempotent as long as the same sled and rack UUIDs are used.This is a command driven by an operator and we want to know right away if it succeeded or not. Multiple sleds can be added by Nexus with individual calls, so that individual errors can be returned without added complexity at the sled- agent level. These requests can also be sent to different sled-agents, since the sled-agent is just acting as a proxy to a remote bootstrap agent.
2a068db
to
0d50df0
Compare
I tested adding sled 15 to an existing madrid cluster of sleds (0,1,14) and it worked. The sled learned its key share and ended up starting sled agent which registered itself in CRDB via Nexus. I also ensured that the command is idempotent and we properly error when trying to add a sled that is already part of the cluster. |
This code largely reuses the existing code paths for starting sled-agent with a small change to configured the bootstore as a learner node if trust quorum is in use.
This code path does not attempt any retries, as it should be idempotent as long as the same sled and rack UUIDs are used.This is a command driven by an operator and we want to know right away if it succeeded or not. Multiple sleds can be added by Nexus with individual calls, so that individual errors can be returned without added complexity at the sled- agent level. These requests can also be sent to different sled-agents, since the sled-agent is just acting as a proxy to a remote bootstrap agent.