Fix flaky updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor) #2825

zpinto · 2024-06-28T00:33:07Z

Issues

Fixes #2744: [Failed CI Test] updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor)

Description

The test case is flaky because it switches the resources from SEMI_AUTO to FULL_AUTO while the cluster is in MaintenanceMode.

When a resource is SEMI_AUTO, the MM rebalancer is not used, because using it could change the preferenceList in a way that never recovers to its previous value. The test case switched the resources from SEMI_AUTO to FULL_AUTO, which caused the MM rebalancer to be used. This created a race condition between two writers: the controller, which computes a new IdealState that drops the offline instances from the preferenceList (making the IdealState invalid for SEMI_AUTO), and the test, which sets the resources back to SEMI_AUTO. If the controller wins, persisting the IdealState again with SEMI_AUTO throws an exception.
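The two interleavings can be illustrated with a small, deterministic toy model. This is only a sketch and does not use the Helix API: `ToyIdealState`, `controllerRebalance`, `persistAsSemiAuto`, and `switchBackSucceeds` are hypothetical names, and the validity rule (a SEMI_AUTO preferenceList must hold at least as many instances as the replica count) is an assumption chosen for illustration, not the actual Helix check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PreferenceListRaceSketch {
    enum Mode { SEMI_AUTO, FULL_AUTO }

    // Hypothetical stand-in for an IdealState; NOT the Helix class.
    static final class ToyIdealState {
        Mode mode;
        final int replicas;
        List<String> preferenceList;

        ToyIdealState(Mode mode, int replicas, List<String> preferenceList) {
            this.mode = mode;
            this.replicas = replicas;
            this.preferenceList = preferenceList;
        }
    }

    // Toy controller step: only FULL_AUTO resources are recomputed, and the
    // recomputation drops offline instances from the preference list.
    static void controllerRebalance(ToyIdealState is, Set<String> offline) {
        if (is.mode != Mode.FULL_AUTO) {
            return; // SEMI_AUTO preference lists are left untouched
        }
        List<String> remaining = new ArrayList<>(is.preferenceList);
        remaining.removeAll(offline);
        is.preferenceList = remaining;
    }

    // Toy persist step. ASSUMPTION: a SEMI_AUTO IdealState is rejected when
    // its preference list is shorter than the replica count.
    static void persistAsSemiAuto(ToyIdealState is) {
        if (is.preferenceList.size() < is.replicas) {
            throw new IllegalStateException("preferenceList invalid for SEMI_AUTO");
        }
        is.mode = Mode.SEMI_AUTO;
    }

    // Replays the old test logic under one of the two orderings and reports
    // whether switching back to SEMI_AUTO succeeded.
    public static boolean switchBackSucceeds(boolean controllerWins) {
        ToyIdealState is =
            new ToyIdealState(Mode.SEMI_AUTO, 3, List.of("i0", "i1", "i2"));
        Set<String> offline = Set.of("i2");

        is.mode = Mode.FULL_AUTO; // the test flips the resource to FULL_AUTO
        if (controllerWins) {
            controllerRebalance(is, offline); // controller runs before the switch back
        }
        try {
            persistAsSemiAuto(is); // the test sets the resource back to SEMI_AUTO
        } catch (IllegalStateException e) {
            return false;
        }
        controllerRebalance(is, offline); // late controller run is a no-op
        return is.preferenceList.size() == is.replicas;
    }

    public static void main(String[] args) {
        System.out.println("test wins:       " + switchBackSucceeds(false));
        System.out.println("controller wins: " + switchBackSucceeds(true));
    }
}
```

In this model the switch back succeeds only when the test writes first; when the controller writes first, the shortened preferenceList is rejected, which mirrors the intermittent exception described above.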

This PR removes that logic, so the test now only checks that isEvacuateFinished is true, which holds because all resources are SEMI_AUTO. isEvacuateFinished on FULL_AUTO resources is already tested elsewhere, for example in TestZkHelixAdmin and TestInstanceOperation.

Tests

updateInstance(org.apache.helix.rest.server.TestPerInstanceAccessor)

Changes that Break Backward Compatibility (Optional)

NA

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

junkaixue approved these changes Jun 28, 2024

junkaixue merged commit bf20e2a into apache:master Jun 28, 2024
2 checks passed
