Implement ClusterRecoveryAction #243
@@ -54,7 +52,7 @@ public class BrokerHealingOperator extends KafkaOperator {
private long maxNumStaleIntervals = 2; // default 2 times
private double unhealthyBrokerURPRatioThreshold = 0; // default no URPs allowed
private int unhealthyAlertFailThreshold = 3;
Was this not converted to a config?
These configs are not part of my change. They are the thresholds for alerting.
They are already configurable. You can check lines 46-50 and the initialize method.
Implement ClusterRecoveryAction: Orion can now automatically recover more than one broker.
Broker issues are detected by BrokerHealingOperator. Before this change, if one broker had an outage, BrokerHealingOperator triggered a single BrokerRecoveryAction to recover it. If more than one broker had issues, BrokerHealingOperator just threw errors and did nothing.
This PR adds a ClusterRecoveryAction layer between BrokerHealingOperator and BrokerRecoveryAction. When BrokerHealingOperator detects one or more brokers with issues, it creates and dispatches a ClusterRecoveryAction to take care of them. ClusterRecoveryAction decides, based on its own logic, whether to create multiple BrokerRecoveryActions to fix the broker issues.
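For readers unfamiliar with the layering, here is a rough sketch of the fan-out idea only. It does not use Orion's real classes: `ClusterRecoveryFanOutSketch`, `BrokerRecoverySketch`, `planRecoveries`, and the `maxConcurrent` cap are all hypothetical names, and the "skip when too many brokers are unhealthy" policy is an assumed example of the "own logic" mentioned above, not what the PR necessarily does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are part of Orion.
class ClusterRecoveryFanOutSketch {

  /** Stand-in for a per-broker recovery action. */
  static class BrokerRecoverySketch {
    final String brokerId;

    BrokerRecoverySketch(String brokerId) {
      this.brokerId = brokerId;
    }
  }

  /**
   * Plans one per-broker recovery for each unhealthy broker.
   * Assumed policy: if more brokers are unhealthy than maxConcurrent,
   * plan nothing and leave the decision to a human.
   */
  static List<BrokerRecoverySketch> planRecoveries(List<String> unhealthyBrokers,
                                                   int maxConcurrent) {
    List<BrokerRecoverySketch> actions = new ArrayList<>();
    if (unhealthyBrokers.size() > maxConcurrent) {
      return actions; // too many failures at once, so no automatic recovery
    }
    for (String brokerId : unhealthyBrokers) {
      actions.add(new BrokerRecoverySketch(brokerId));
    }
    return actions;
  }
}
```

The point is only that the cluster-level step sees the whole set of unhealthy brokers before any per-broker action is created, which is what makes a cluster-wide decision possible.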
To avoid one broker being fixed multiple times, the PR introduces a method called removeRecoveringNodesFromCandidates. This Set-based lock ensures that, within one large time frame, each broker is healed only once.
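A minimal sketch of how such a Set-based lock could work, assuming brokers are identified by string IDs. Only the method name `removeRecoveringNodesFromCandidates` comes from the PR description; the class `RecoveringNodesSketch`, the `recoveringNodes` field, and `markRecovered` are hypothetical. How the real code bounds the time frame is not described here, so this sketch releases entries explicitly instead.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not the actual Orion implementation.
class RecoveringNodesSketch {

  // IDs of brokers that already have a recovery in flight.
  private final Set<String> recoveringNodes = ConcurrentHashMap.newKeySet();

  /**
   * Returns the candidates that are not already recovering and marks
   * them as recovering, so a later call will filter them out.
   */
  Set<String> removeRecoveringNodesFromCandidates(Set<String> candidates) {
    Set<String> toRecover = new LinkedHashSet<>();
    for (String nodeId : candidates) {
      // Set.add returns false when the ID is already present,
      // so each broker is claimed by at most one caller.
      if (recoveringNodes.add(nodeId)) {
        toRecover.add(nodeId);
      }
    }
    return toRecover;
  }

  /** Assumed release hook: makes the broker eligible for healing again. */
  void markRecovered(String nodeId) {
    recoveringNodes.remove(nodeId);
  }
}
```

With this shape, a second detection pass that reports the same broker while its recovery is still running gets an empty candidate set for it, and no duplicate action is created.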