Fix flaky updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor) #2825

Merged
merged 1 commit into from
Jun 28, 2024

zpinto (Contributor) commented Jun 28, 2024

Description

The test case was flaky because it switched the resources from SEMI_AUTO to FULL_AUTO while the cluster was in MaintenanceMode.

When a resource is SEMI_AUTO, the MM rebalancer is not used, because using it could change the preferenceList in a way that never recovers to its previous value. The test case switched the resources from SEMI_AUTO to FULL_AUTO, which caused the MM rebalancer to be used. That created a race condition between two actions: the controller computing a new IdealState, which drops the offline instances from the preferenceList and makes the IdealState invalid for SEMI_AUTO, and the test setting the resources back to SEMI_AUTO. If the controller wins, persisting the IdealState again with SEMI_AUTO throws an exception.
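
To make the race easier to follow, here is a minimal toy model of the two orderings. It does not use any Helix classes: `ToyIdealState`, `controllerRebalance`, and `persistAsSemiAuto` are hypothetical names, and the validation rule (the preferenceList must have one entry per replica) is an assumption made for illustration, not the actual Helix check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types or methods are Helix APIs.
final class RebalanceRaceSketch {
    enum Mode { SEMI_AUTO, FULL_AUTO }

    // Hypothetical stand-in for an IdealState with a single partition.
    static final class ToyIdealState {
        Mode mode;
        final int replicas;
        List<String> preferenceList;

        ToyIdealState(Mode mode, int replicas, List<String> preferenceList) {
            this.mode = mode;
            this.replicas = replicas;
            this.preferenceList = new ArrayList<>(preferenceList);
        }
    }

    // Assumed controller behaviour: only FULL_AUTO resources get a recomputed
    // preferenceList, and offline instances are dropped from it.
    static void controllerRebalance(ToyIdealState idealState, Set<String> offline) {
        if (idealState.mode != Mode.FULL_AUTO) {
            return; // SEMI_AUTO: the existing preferenceList is left untouched
        }
        List<String> recomputed = new ArrayList<>(idealState.preferenceList);
        recomputed.removeAll(offline);
        idealState.preferenceList = recomputed;
    }

    // Assumed validation rule: SEMI_AUTO needs one preferenceList entry per replica.
    static void persistAsSemiAuto(ToyIdealState idealState) {
        if (idealState.preferenceList.size() != idealState.replicas) {
            throw new IllegalStateException(
                "preferenceList " + idealState.preferenceList + " is invalid for SEMI_AUTO");
        }
        idealState.mode = Mode.SEMI_AUTO;
    }

    // Replays what the old test did; controllerWins selects the ordering.
    static boolean switchBackSucceeds(boolean controllerWins) {
        ToyIdealState idealState =
            new ToyIdealState(Mode.SEMI_AUTO, 3, List.of("i1", "i2", "i3"));
        Set<String> offline = Set.of("i3");

        idealState.mode = Mode.FULL_AUTO; // the switch made by the old test
        if (controllerWins) {
            controllerRebalance(idealState, offline); // runs before the switch back
        }
        try {
            persistAsSemiAuto(idealState);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("test wins:       " + switchBackSucceeds(false));
        System.out.println("controller wins: " + switchBackSucceeds(true));
    }
}
```

In this sketch the switch back to SEMI_AUTO succeeds only when the controller has not yet rebalanced, which mirrors the intermittent failure described above.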

This PR removes that logic so the test only checks that isEvacuateFinished is true, since all resources are SEMI_AUTO. isEvacuateFinished is already tested on FULL_AUTO resources elsewhere, for example in TestZkHelixAdmin and TestInstanceOperation.

Tests

  • updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor)

Changes that Break Backward Compatibility (Optional)

NA

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@junkaixue junkaixue merged commit bf20e2a into apache:master Jun 28, 2024
2 checks passed
Successfully merging this pull request may close these issues.

[Failed CI Test] updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor)