Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change stoppable to perform min active check sequentially in mz #2886

Merged
merged 7 commits into from
Aug 28, 2024

Conversation

GrantPSpencer
Copy link
Contributor

@GrantPSpencer GrantPSpencer commented Aug 16, 2024

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#2885
Helix stoppable does not account for multiple replicas in the same fault zone. This can occur during instance operations such as swap but also as a result of the issue linked above.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Currently, the batch stoppable check parallelizes the performHelixOwnInstanceCheck, which includes the min active replica check. This parallelization works on the assumption that a single partition will not have more than 2 replicas hosted in 1 fault zone. This assumption is not fully correct as n+1 transitions can result in 2 replicas in the same fault zone, as well instance operations such as SWAP.

The parallelization of the other helix health checks can be kept to not lose out on that optimization. The min active check can then be done sequentially for each instance.

There is a potential for optimizing the min active replica check. Currently it is done per instance and reads the externalView and idealStates for each resource from ZK. It could be optimized to read the EVs and IS's once and use those for all instances the check is being ran on. However, I did not want to complicate this first draft of the PR with that refactoring.

Tests

  • The following tests are written for this issue:

TestInstancesAccessor.java#testMultipleReplicasInSameMZ

  • The following is the result of the "mvn test" command on the appropriate module:
$ mvn test -o -Dtest=TestPerInstanceAccessor,TestInstancesAccessor,TestMaintenanceManagementService -pl=helix-rest

[INFO] Results:
[INFO] 
[INFO] Tests run: 45, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:31 min
[INFO] Finished at: 2024-08-15T20:43:09-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

Not a significant incompatibility, but the STOPPABLE_CHECK_LIST no longer contains the min_active_replica check as it is done separately.

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Copy link
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent with behavior. What if user does not specific MZs?

@GrantPSpencer
Copy link
Contributor Author

GrantPSpencer commented Aug 19, 2024

We should be consistent with behavior. What if user does not specific MZs?

If there is no topology set up for the cluster, I believe batch stoppable will actually return an empty.
My current assumption is that this is due to the single instance stoppable check calling getZoneBasedInstances which seems will return an empty list when no topology set.
For the batch instances check, it will return empty because it relies on _orderOfZones not being empty.

the per instance stoppable check should work normally in this case.

@junkaixue
Copy link
Contributor

We should be consistent with behavior. What if user does not specific MZs?

If there is no topology set up for the cluster, I believe batch stoppable will actually return an empty. My current assumption is that this is due to the single instance stoppable check calling getZoneBasedInstances which seems will return an empty list when no topology set. For the batch instances check, it will return empty because it relies on _orderOfZones not being empty.

the per instance stoppable check should work normally in this case.

If this PR just make the fix without adding the non-zone based support, let's create ticket and making a TODO item in the code to make sure that support will happen later.

Comment on lines +810 to +813
for (Map.Entry<String, Future<StoppableCheck>> entry : futureStoppableCheckByInstance.entrySet()) {
try {
String instanceName = entry.getKey();
StoppableCheck stoppableCheck = entry.getValue().get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, by placing the min active replica check before submitting the asynchronous checks, there is no longer a need to check the future and get the value.

Copy link
Contributor

@zpinto zpinto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@GrantPSpencer
Copy link
Contributor Author

Pull request approved by: @MarkGaox @zpinto
Commit message: Batch stoppable check now performs min_active check sequentially in zone

@GrantPSpencer
Copy link
Contributor Author

We should be consistent with behavior. What if user does not specific MZs?

If there is no topology set up for the cluster, I believe batch stoppable will actually return an empty. My current assumption is that this is due to the single instance stoppable check calling getZoneBasedInstances which seems will return an empty list when no topology set. For the batch instances check, it will return empty because it relies on _orderOfZones not being empty.
the per instance stoppable check should work normally in this case.

If this PR just make the fix without adding the non-zone based support, let's create ticket and making a TODO item in the code to make sure that support will happen later.

Created issue to track this
#2893

@junkaixue
Copy link
Contributor

Yeah. Please add it to TODO comment in the code as well.

@GrantPSpencer
Copy link
Contributor Author

GrantPSpencer commented Aug 28, 2024

Yeah. Please add it to TODO comment in the code as well.

Added comment in InstancesAccessor class

      // TODO: Add support for clusters that do not have topology set up.
      // Issue #2893: https://github.com/apache/helix/issues/2893

@junkaixue junkaixue merged commit aaf2294 into apache:master Aug 28, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants