Implement ClusterRecoveryAction #243

Merged
yisheng-zhou merged 3 commits into master from yishengzhou_231206
Dec 13, 2023

Conversation

yisheng-zhou
Contributor

Implement ClusterRecoveryAction: Orion can now automatically recover more than one broker.

Broker issues are detected by BrokerHealingOperator. Before this change, if one broker had an outage, BrokerHealingOperator triggered a single BrokerRecoveryAction to recover it. If more than one broker had issues, BrokerHealingOperator just threw errors and did nothing.

This PR adds a ClusterRecoveryAction layer between BrokerHealingOperator and BrokerRecoveryAction. When BrokerHealingOperator detects one or more brokers with issues, it creates and dispatches a ClusterRecoveryAction to handle them. ClusterRecoveryAction decides, based on its own logic, whether to create multiple BrokerRecoveryActions to fix the broker issues.
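
For readers unfamiliar with the layering, here is a minimal sketch of the fan-out idea, not the code in this PR. The class names `ClusterRecoverySketch` and `BrokerRecoverySketch` and the `maxConcurrentRecoveries` cap are hypothetical; the PR description does not say what logic ClusterRecoveryAction actually uses to decide how many per-broker actions to create.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a per-broker recovery action; not Orion's real class.
class BrokerRecoverySketch {
    final int brokerId;

    BrokerRecoverySketch(int brokerId) {
        this.brokerId = brokerId;
    }
}

// Hypothetical stand-in for the cluster-level layer between the operator
// and the per-broker actions.
class ClusterRecoverySketch {
    // Assumed policy knob for illustration only; not taken from the PR.
    private final int maxConcurrentRecoveries;

    ClusterRecoverySketch(int maxConcurrentRecoveries) {
        this.maxConcurrentRecoveries = maxConcurrentRecoveries;
    }

    // Decide which unhealthy brokers get a per-broker action in this round.
    List<BrokerRecoverySketch> plan(Set<Integer> unhealthyBrokerIds) {
        List<BrokerRecoverySketch> actions = new ArrayList<>();
        // Sort the ids so the selection is deterministic.
        for (Integer id : new TreeSet<Integer>(unhealthyBrokerIds)) {
            if (actions.size() >= maxConcurrentRecoveries) {
                break;
            }
            actions.add(new BrokerRecoverySketch(id));
        }
        return actions;
    }
}
```

The point of the extra layer is that the operator only reports the set of unhealthy brokers, while the decision about how many to recover at once lives in one place.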

To avoid one broker being fixed multiple times, the PR introduces a method called removeRecoveringNodesFromCandidates. This Set-based lock ensures that, within one large time frame, a broker is healed only once.
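
A Set-based guard of this kind could look roughly like the sketch below. Only the method name comes from this PR; the signature, the `RecoveringNodesSketch` class, and the `release` method are assumptions, and the description does not say how or when a broker leaves the set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder for the set of nodes that are currently being recovered.
class RecoveringNodesSketch {
    private final Set<String> recoveringNodes = new HashSet<>();

    // Returns the candidates that are not already being recovered and
    // records them as recovering, so a later call filters them out.
    synchronized Set<String> removeRecoveringNodesFromCandidates(Set<String> candidates) {
        Set<String> remaining = new HashSet<>(candidates);
        remaining.removeAll(recoveringNodes);
        recoveringNodes.addAll(remaining);
        return remaining;
    }

    // Assumed release hook: makes a node eligible for healing again.
    synchronized void release(String nodeId) {
        recoveringNodes.remove(nodeId);
    }
}
```

With this shape, two detection rounds that both report the same broker produce only one recovery for it until the broker is released.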

@yisheng-zhou yisheng-zhou requested a review from a team as a code owner December 6, 2023 18:57
@@ -54,7 +52,7 @@ public class BrokerHealingOperator extends KafkaOperator {
private long maxNumStaleIntervals = 2; // default 2 times
private double unhealthyBrokerURPRatioThreshold = 0; // default no URPs allowed
private int unhealthyAlertFailThreshold = 3;
Contributor

Was this not converted to a config?

Contributor Author

These configs are not part of my change. They are the thresholds for alerting.

They are already configurable. See lines 46-50 and the initialize method.

@yisheng-zhou yisheng-zhou merged commit c8622dc into master Dec 13, 2023
1 check passed
@yisheng-zhou yisheng-zhou deleted the yishengzhou_231206 branch December 13, 2023 22:05
2 participants