metropolis: implement removing nodes #262
Comments
There are two things here:
We mostly need the latter, so that's what we should probably implement first. The decommissioning flow will then include the latter, either automatically or as a step to be done by operators.
(good first issue for the 'clean up' part)
Isn't this already implemented?
Partially - we can remove everything but control plane nodes.
We need this to be done by mid-September.
With https://review.monogon.dev/c/monogon/+/3343 merged we will be able to safely and fully remove nodes from consensus. A full decommissioning flow is blocked by SPIFFE support and general auth refactor, as that mostly revolves around distributing some revocation lists to make sure a node can't reconnect to the cluster. However, we can still remove non-decommissioned nodes by setting |
After removing etcd membership from a node, etcd panics, which takes down the entire node. The panic is in IsLocalMemberLearner, stack trace:
I'm not sure how to fix the panic.
Another way to improve the situation is to move etcd into a separate process, so that a panic in etcd does not restart the entire machine. This is #349.
Removing consensus nodes is now possible with these changes merged: https://review.monogon.dev/c/monogon/+/3437
There are still some rough edges that could be improved:

Bootstrap node does not stop consensus
The bootstrap node will not stop consensus when it has never rebooted and the role is removed, because the bootstrap data takes priority over the role.
monogon/metropolis/node/core/roleserve/worker_controlplane.go
Lines 134 to 136 in d5538b5

Move leadership
When removing the consensus role from a node which is currently either etcd or curator leader, there is some downtime until a new leader is elected. It would be nicer to first move leadership to another node in this case. The challenge with moving the leadership is that maintenance.MoveLeader can only be called on the etcd leader node, and etcd leadership is currently independent from curator leadership. We could solve that by making the curator leadership follow etcd leadership, which might have performance benefits as well.
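To make the "move leadership first" idea a bit more concrete, here is a minimal sketch in Go of how a transfer target could be picked before the role is removed. This is not code from the Monogon tree: the `Member` type, `pickTransferee`, `needsLeaderMove` and `ErrNoTransferee` are all hypothetical names made up for illustration, and the sketch assumes that learners have to be skipped because they are non-voting members. The actual MoveLeader call (which, as noted above, has to be issued on the etcd leader) is intentionally left out.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Member is a hypothetical, simplified view of an etcd cluster member.
// It is not the etcd or Metropolis type.
type Member struct {
	ID        uint64
	Name      string
	IsLearner bool
}

// ErrNoTransferee is returned when no other voting member exists.
var ErrNoTransferee = errors.New("no eligible member to transfer leadership to")

// needsLeaderMove reports whether the node about to lose its consensus
// role is the current leader, i.e. whether leadership should be moved first.
func needsLeaderMove(leader, removing uint64) bool {
	return leader == removing
}

// pickTransferee returns the ID of a member that could take over leadership
// before the member with ID removing is removed. It skips the member being
// removed and all learners. The lowest remaining ID is chosen so that the
// result is deterministic; a real implementation might prefer other criteria.
func pickTransferee(members []Member, removing uint64) (uint64, error) {
	var candidates []uint64
	for _, m := range members {
		if m.ID == removing || m.IsLearner {
			continue
		}
		candidates = append(candidates, m.ID)
	}
	if len(candidates) == 0 {
		return 0, ErrNoTransferee
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i] < candidates[j] })
	return candidates[0], nil
}

func main() {
	members := []Member{
		{ID: 1, Name: "node-a"},
		{ID: 2, Name: "node-b", IsLearner: true},
		{ID: 3, Name: "node-c"},
	}
	const leader, removing = 1, 1
	if needsLeaderMove(leader, removing) {
		target, err := pickTransferee(members, removing)
		if err != nil {
			fmt.Println("cannot move leadership:", err)
			return
		}
		// At this point the leader itself would request the transfer.
		fmt.Println("would move leadership to member", target)
	}
}
```

If the curator leadership followed etcd leadership as suggested above, a single selection step like this could cover both leaders; with independent leaderships, the same check would need to be done for each.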
Nothing left here that's not part of the auth work, thus closing this.
Currently there is no way to remove a node from a cluster.