Reconfigurator: Decommission cockroach nodes that belong to expunged cockroach omicron zones #5903
Conversation
async fn invoke_cli_with_format_csv<'a, F, I, T>(
Fancy schmancy
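(For readers outside the diff: the helper being praised runs a cockroach CLI subcommand with `--format csv` and deserializes the resulting rows generically.) A minimal sketch of that shape, deliberately simplified to a synchronous, non-borrowing signature; it is not the PR's actual code, and it assumes the `csv`, `serde`, and `anyhow` crates:

```rust
use std::process::Command;

use serde::de::DeserializeOwned;

/// Run a CLI command with `--format csv` appended and deserialize each CSV
/// row into the caller's chosen type.
fn invoke_cli_with_format_csv<T, I, S>(binary: &str, args: I) -> anyhow::Result<Vec<T>>
where
    T: DeserializeOwned,
    I: IntoIterator<Item = S>,
    S: AsRef<std::ffi::OsStr>,
{
    // Force CSV output so the rows can be parsed structurally.
    let output = Command::new(binary)
        .args(args)
        .args(["--format", "csv"])
        .output()?;
    anyhow::ensure!(
        output.status.success(),
        "{binary} exited with {}",
        output.status
    );

    // Deserialize every record; the header row drives the field names.
    let mut reader = csv::Reader::from_reader(output.stdout.as_slice());
    let mut rows = Vec::new();
    for record in reader.deserialize() {
        rows.push(record?);
    }
    Ok(rows)
}
```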
@@ -203,17 +229,146 @@ impl NodeStatus {
// The cockroach CLI and `crdb_internal.gossip_liveness` table use a string for
Nice way to deal with this!
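The comment refers to the fact that the cockroach CLI and the `crdb_internal.gossip_liveness` table report node membership as a plain string. One way to handle that robustly (a sketch; the enum and variant names here are illustrative rather than the PR's actual types) is to parse the known values and preserve anything unrecognized instead of failing, so a future CockroachDB release that adds a state doesn't break deserialization:

```rust
use std::str::FromStr;

/// Node membership as reported by CockroachDB, with a catch-all for values
/// this code doesn't know about yet.
#[derive(Debug, Clone, PartialEq, Eq)]
enum NodeMembership {
    Active,
    Decommissioning,
    Decommissioned,
    Unknown(String),
}

impl FromStr for NodeMembership {
    type Err = std::convert::Infallible;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(match s {
            "active" => Self::Active,
            "decommissioning" => Self::Decommissioning,
            "decommissioned" => Self::Decommissioned,
            // Keep the raw string rather than erroring on new or unexpected states.
            other => Self::Unknown(other.to_string()),
        })
    }
}
```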
@@ -435,4 +653,82 @@ mod tests {
// Ensure that if `cockroach node decommission` changes in a future CRDB
I was going to suggest something like this, and here it is! Thanks for adding this.
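For readers following along, the idea of that test is to run the real `cockroach node decommission` against a throwaway cluster and fail if its output no longer parses, so a CockroachDB upgrade that changes the format is caught in CI. A rough sketch of that shape, where `TestCluster`, `run_cockroach_cli`, and `parse_decommission_output` are hypothetical stand-ins rather than omicron's real test helpers:

```rust
#[tokio::test]
async fn decommission_output_format_still_parses() {
    // Start a disposable single-node CockroachDB and look up its node id.
    // (`TestCluster` is a stand-in for the repo's actual test harness.)
    let db = TestCluster::start().await;
    let node_id = db.local_node_id().await;

    // Run the real CLI. `--wait none` returns without waiting for replicas to
    // drain, and `--format csv` matches what the parser expects.
    let stdout = run_cockroach_cli(
        &db,
        &["node", "decommission", &node_id, "--wait", "none", "--format", "csv"],
    )
    .await
    .expect("`cockroach node decommission` failed against the test node");

    // If a future CRDB release adds, removes, or renames columns, this parse
    // fails and points directly at the format change.
    parse_decommission_output(&stdout)
        .expect("decommission output changed format; update the parser");

    db.cleanup().await;
}
```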
@@ -914,8 +914,8 @@ impl Nexus {
pub(crate) async fn resolver(&self) -> internal_dns::resolver::Resolver {
I definitely came across this recently as well. Good fix.
@@ -85,29 +100,235 @@ pub(crate) async fn deploy_zones(
/// Idempontently perform any cleanup actions necessary for expunged zones.
///
/// # Panics
Presumably we could instead log an error and return rather than taking out Nexus. I'm a bit concerned about taking out Nexus, because it limits our debugging ability via omdb and the API. This is a pretty serious error, though, and hard to do accidentally. It also leaves things in a state where invariants may be violated, which is almost certainly why a panic is the right thing to do.
Presumably in the future we will have an E2E test that actually runs the reconfigurator and will panic if the caller in execution/src/lib.rs gets changed. However, even with that test we cannot guarantee the absence of new callers doing the wrong thing.
Thinking about whether it's better to panic or log is a false dichotomy anyway, and only exists because we don't have other options. It would be pretty great if we could instead trigger an alert and prevent the reconfigurator from running without an override, rather than killing Nexus, which will end up with all Nexuses eventually panicking because of the timer-based activation of the executor.
I think we can also reasonably do nothing here; just make this function filter out any non-expunged zones. It's a little less obvious, but much easier to use safely: call it with whatever zones you want, and it cleans up any expunged ones. Changed in 2d0c388.
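A rough sketch of that "filter instead of panic" shape, using stand-in types (the real blueprint types in omicron carry much more information than this):

```rust
/// Stand-in for a blueprint zone's disposition.
enum ZoneDisposition {
    InService,
    Expunged,
}

/// Stand-in for the kind of zone; only cockroach needs cleanup today.
enum ZoneKind {
    CockroachDb,
    Other,
}

/// Stand-in for a blueprint zone config (id, addresses, etc. omitted).
struct ZoneConfig {
    disposition: ZoneDisposition,
    kind: ZoneKind,
}

/// Idempotently clean up expunged zones. Callers may pass every zone in the
/// blueprint; anything not expunged is simply ignored.
async fn clean_up_expunged_zones(zones: &[ZoneConfig]) -> anyhow::Result<()> {
    for zone in zones
        .iter()
        .filter(|z| matches!(z.disposition, ZoneDisposition::Expunged))
    {
        match zone.kind {
            // Decommission the cockroach node so the cluster stops expecting
            // it to come back.
            ZoneKind::CockroachDb => decommission_cockroach_node(zone).await?,
            // Nothing to do for other zone types (yet).
            ZoneKind::Other => {}
        }
    }
    Ok(())
}

/// Stand-in for the call into cockroach-admin's decommission endpoint.
async fn decommission_cockroach_node(_zone: &ZoneConfig) -> anyhow::Result<()> {
    Ok(())
}
```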
Testing this in a4x2 looks good. I used
Confirming that the add sled / add crdb worked, we see the new sixth CRDB in DNS:
and as part of the cockroach cluster:
I then hyperstopped g2, and cockroach realized the node was no longer available / live:
I then expunged the sled and generated a new blueprint which expunged all its zones:
After setting this as the new target blueprint, the sixth CRDB was removed from the cluster:
If I ask for the status of
I also confirmed we had a DNS version bump that goes back to just the original 5 CRDB zones:
It took about 6 minutes for cockroach to transition the node from
This adds a "decommission a node" endpoint to the `cockroach-admin` server, and a step to blueprint execution to clean up any expunged zones (which, for now, only does anything for cockroach). Before merging, I need to run this through a4x2 and try expunging a sled running cockroach to ensure this is all working end-to-end, but I think it's close enough to be ready for review.
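For context on the new `cockroach-admin` endpoint, here is a hedged sketch of what a dropshot handler for "decommission a node" might look like. The path, request/response types, and `CockroachCli` wrapper are guesses for illustration, not the PR's actual API:

```rust
use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext, TypedBody};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Stand-in CLI wrapper; the real one shells out to the `cockroach` binary.
struct CockroachCli;

impl CockroachCli {
    async fn node_decommission(&self, node_id: &str) -> anyhow::Result<NodeDecommission> {
        // Real implementation: run `cockroach node decommission <id> --format csv`
        // and parse the resulting row. Stubbed out here.
        Ok(NodeDecommission { node_id: node_id.to_string(), is_decommissioning: true })
    }
}

/// Stand-in server context holding the CLI wrapper.
struct AdminServerContext {
    cockroach_cli: CockroachCli,
}

#[derive(Deserialize, JsonSchema)]
struct NodeId {
    node_id: String,
}

#[derive(Serialize, JsonSchema)]
struct NodeDecommission {
    node_id: String,
    is_decommissioning: bool,
}

/// Decommission a CockroachDB node by its cluster-assigned node id.
#[endpoint {
    method = POST,
    path = "/node/decommission",
}]
async fn node_decommission(
    rqctx: RequestContext<AdminServerContext>,
    body: TypedBody<NodeId>,
) -> Result<HttpResponseOk<NodeDecommission>, HttpError> {
    let ctx = rqctx.context();
    let NodeId { node_id } = body.into_inner();

    // Ask the wrapper to decommission the node and report the cluster's view.
    ctx.cockroach_cli
        .node_decommission(&node_id)
        .await
        .map(HttpResponseOk)
        .map_err(|e| HttpError::for_internal_error(e.to_string()))
}
```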