-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change stoppable to perform min active check sequentially in mz #2886
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be consistent with behavior. What if user does not specific MZs?
helix-rest/src/main/java/org/apache/helix/rest/clusterMaintenanceService/HealthCheck.java
Outdated
Show resolved
Hide resolved
If there is no topology set up for the cluster, I believe batch stoppable will actually return an empty. the per instance stoppable check should work normally in this case. |
If this PR just make the fix without adding the non-zone based support, let's create ticket and making a TODO item in the code to make sure that support will happen later. |
helix-rest/src/main/java/org/apache/helix/rest/clusterMaintenanceService/HealthCheck.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/helix/rest/clusterMaintenanceService/MaintenanceManagementService.java
Outdated
Show resolved
Hide resolved
for (Map.Entry<String, Future<StoppableCheck>> entry : futureStoppableCheckByInstance.entrySet()) { | ||
try { | ||
String instanceName = entry.getKey(); | ||
StoppableCheck stoppableCheck = entry.getValue().get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, by placing the min active replica check before submitting the asynchronous checks, there is no longer a need to check the future and get the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Created issue to track this |
Yeah. Please add it to TODO comment in the code as well. |
Added comment in InstancesAccessor class
|
Issues
#2885
Helix stoppable does not account for multiple replicas in the same fault zone. This can occur during instance operations such as swap but also as a result of the issue linked above.
Description
Currently, the batch stoppable check parallelizes the performHelixOwnInstanceCheck, which includes the min active replica check. This parallelization works on the assumption that a single partition will not have more than 2 replicas hosted in 1 fault zone. This assumption is not fully correct as n+1 transitions can result in 2 replicas in the same fault zone, as well instance operations such as SWAP.
The parallelization of the other helix health checks can be kept to not lose out on that optimization. The min active check can then be done sequentially for each instance.
There is a potential for optimizing the min active replica check. Currently it is done per instance and reads the externalView and idealStates for each resource from ZK. It could be optimized to read the EVs and IS's once and use those for all instances the check is being ran on. However, I did not want to complicate this first draft of the PR with that refactoring.
Tests
TestInstancesAccessor.java#testMultipleReplicasInSameMZ
Changes that Break Backward Compatibility (Optional)
Not a significant incompatibility, but the STOPPABLE_CHECK_LIST no longer contains the min_active_replica check as it is done separately.
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)