Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix partition assignment npe #2903

Merged
merged 2 commits into from
Sep 6, 2024

Conversation

GrantPSpencer
Copy link
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Failing tests in CI:
#2901
#2902

Description

  • Here are some details about my PR, including screenshots of any UI changes:

From #2877 , autoRebalanceStrategy now reads the idealState and StateModelDefinition from the cache. This causes an NPE when computing the potential assignment. getIdealAssignmentForFullAuto instantiates a cache (ResourceControllerDataProvider) and only instantiates the necessary values. I'm assuming this was done as an optimization and to avoid a full cache refresh. My change adds the other necessary fields to the cache for the calculation to complete. It's also possible to fully reresh the cache, but I believe this would require changing public constructors in order to pass the accessor down to the relevant method.

Here is the NPE trace:

1271778 [qtp462777594-3583] ERROR org.apache.helix.rest.server.resources.helix.ResourceAssignmentOptimizerAccessor [] - Failed to compute partition assignment
java.lang.NullPointerException: null
	at org.apache.helix.controller.rebalancer.strategy.AutoRebalanceStrategy.calculateExpectedReplicaCount(AutoRebalanceStrategy.java:643) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.strategy.AutoRebalanceStrategy.computePartitionAssignment(AutoRebalanceStrategy.java:111) ~[classes/:?]
	at org.apache.helix.controller.rebalancer.strategy.AutoRebalanceStrategy.computePartitionAssignment(AutoRebalanceStrategy.java:44) ~[classes/:?]
	at org.apache.helix.util.HelixUtil.getIdealAssignmentForFullAuto(HelixUtil.java:409) ~[classes/:?]
	at org.apache.helix.rest.server.resources.helix.ResourceAssignmentOptimizerAccessor.computeOptimalAssignmentForResources(ResourceAssignmentOptimizerAccessor.java:287) ~[classes/:?]
	at org.apache.helix.rest.server.resources.helix.ResourceAssignmentOptimizerAccessor.computePotentialAssignment(ResourceAssignmentOptimizerAccessor.java:147) ~[classes/:?]

Code changes:

  • helix-core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java: Added a new method setStateModelDefMap to allow manually setting the cache's state model definition map.
  • helix-core/src/main/java/org/apache/helix/util/HelixUtil.java: Modified the initialization of ResourceControllerDataProvider to include setting the state model definition map and ideal states, which are now necesasry fields

Tests

  • The following tests are written for this issue:

N/A

  • The following is the result of the "mvn test" command on the appropriate module:
    Test passes on this branch:
$ mvn test -o -Dtest=TestResourceAssignmentOptimizerAccessor,TestPartitionAssignmentAPI -pl=helix-rest

[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.565 s - in TestSuite
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:52 min
[INFO] Finished at: 2024-09-05T15:46:33-07:00
[INFO] ------------------------------------------------------------------------

Test will fail on current master:

$ mvn test -o -Dtest=TestResourceAssignmentOptimizerAccessor,TestPartitionAssignmentAPI -pl=helix-rest

Failed run
[ERROR] Tests run: 6, Failures: 2, Errors: 0, Skipped: 2, Time elapsed: 158.233 s <<< FAILURE! - in TestSuite
[ERROR] testComputePartitionAssignment(org.apache.helix.rest.server.TestResourceAssignmentOptimizerAccessor)  Time elapsed: 0.496 s  <<< FAILURE!
java.lang.AssertionError: expected:<200> but was:<400>
        at org.apache.helix.rest.server.TestResourceAssignmentOptimizerAccessor.testComputePartitionAssignment(TestResourceAssignmentOptimizerAccessor.java:111)

[ERROR] testComputePartitionAssignmentMaintenanceMode(org.apache.helix.rest.server.TestPartitionAssignmentAPI)  Time elapsed: 19.54 s  <<< FAILURE!
java.lang.AssertionError: expected:<200> but was:<400>
        at org.apache.helix.rest.server.TestPartitionAssignmentAPI.testComputePartitionAssignmentMaintenanceMode(TestPartitionAssignmentAPI.java:453)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)


// TODO: Consider full cache refresh to prevent needing to manually set necessary fields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess core does not impact, right? Only rest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge this change now for the 1.4.1 release. Please think of impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method getIdealAssignmentForFullAuto is only called by ResourceAssignmentOptimizerAccessor in helix-rest module (the partitionAssignment API). It seems the only impact was partitionAssignment API calls for FULL_AUTO resources that are not on WAGED. Core should not be affected from what I've seen.

The equivalent ideal state calculation method used by partitionAssignment API for WAGED resources getAssignmentForWagedFullAutoImpl creates and fully refreshes the ResourceControllerDataProvider.

@junkaixue junkaixue merged commit 057a4a8 into apache:master Sep 6, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants