Fix partitionAssignment API failing due to NPE when no resource config #2653

Merged

GrantPSpencer
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

The partitionAssignment API fails for WAGED clusters in which a resource has no corresponding resource config defined in ZK.

This is the error that is shown to users:

{
  "error" : "Failed to compute partition assignment: org.apache.helix.HelixException: getIdealAssignmentForWagedFullAuto(): Calculation failed: Failed to compute BestPossibleState!"
}

This is the error found in the helix-rest logs (truncated):

2023/10/11 03:20:59.336 ERROR [HelixUtil] [qtp1938380262-5394685] [helix-rest] [] getIdealAssignmentForWagedFullAuto(): Failed to compute ResourceAssignments!
java.lang.NullPointerException: null
at java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:177) ~[?:?]
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:?]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
at org.apache.helix.util.HelixUtil.getAssignmentForWagedFullAutoImpl(HelixUtil.java:318) ~[helix-core-1.1.1-dev-202303311728.jar:1.1.1-dev-202303311728]
at org.apache.helix.util.HelixUtil.getTargetAssignmentForWagedFullAuto(HelixUtil.java:219) ~[helix-core-1.1.1-dev-202303311728.jar:1.1.1-dev-202303311728]

Description

  • Here are some details about my PR, including screenshots of any UI changes:

The partitionAssignment API fails with an NPE for clusters where resource configs aren't set. getResourceConfig() returns null when the resource config does not exist, and that null is then added to the wagedResourceConfigs list. The NPE is thrown in the code below because one of the items in the list is null.

      dataProvider.setResourceConfigMap(resourceConfigs.stream()
          .collect(Collectors.toMap(ResourceConfig::getResourceName, Function.identity())));
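
The failure mode can be reproduced outside Helix. The sketch below is a minimal illustration, not the code from this PR: `FakeResourceConfig`, `indexUnsafe`, and `indexSkippingNulls` are hypothetical names, and the stand-in class models only the resource name. It shows that `Collectors.toMap` with a method-reference key mapper throws a NullPointerException as soon as the stream contains a null element, and that dropping nulls first avoids it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

public class NullConfigDemo {
  // Hypothetical stand-in for Helix's ResourceConfig; only the name is modeled.
  static final class FakeResourceConfig {
    private final String resourceName;

    FakeResourceConfig(String resourceName) {
      this.resourceName = resourceName;
    }

    String getResourceName() {
      return resourceName;
    }
  }

  // Same shape as the failing call: throws NullPointerException if any element is null,
  // because the key mapper is invoked on the null element.
  static Map<String, FakeResourceConfig> indexUnsafe(List<FakeResourceConfig> configs) {
    return configs.stream()
        .collect(Collectors.toMap(FakeResourceConfig::getResourceName, Function.identity()));
  }

  // Null-tolerant variant: missing configs are dropped before the map is built.
  static Map<String, FakeResourceConfig> indexSkippingNulls(List<FakeResourceConfig> configs) {
    return configs.stream()
        .filter(Objects::nonNull)
        .collect(Collectors.toMap(FakeResourceConfig::getResourceName, Function.identity()));
  }

  public static void main(String[] args) {
    // A lookup that found no config is modeled here as a null entry.
    List<FakeResourceConfig> configs = Arrays.asList(new FakeResourceConfig("db1"), null);
    try {
      indexUnsafe(configs);
    } catch (NullPointerException e) {
      System.out.println("unsafe variant threw NullPointerException");
    }
    System.out.println(indexSkippingNulls(configs).keySet());
  }
}
```

Whether to filter at map-building time, as here, or to avoid adding the null to the list in the first place is a design choice; guarding at the point of insertion keeps the list free of nulls for every later consumer.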

Tests

  • The following tests are written for this issue:
    No new unit tests, but I tested this by deploying helix-rest locally and confirming that the partitionAssignment API works after the change.

  • The following is the result of the "mvn test" command on the appropriate module:

$ mvn test -o -Dtest=TestResourceAssignmentOptimizerAccessor -pl=helix-rest

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-rest ---
[INFO] Loading execution data file /Users/gspencer/Desktop/git-repos/helix/helix-rest/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Restful Interface' with 92 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  54.333 s
[INFO] Finished at: 2023-10-10T21:11:40-07:00
[INFO] ------------------------------------------------------------------------

Contributor

@desaikomal left a comment

Thanks @GrantPSpencer. Except for one comment, the code change is simple.
Thanks once again.

@@ -359,8 +359,10 @@ private void computeWagedAssignmentResult(List<IdealState> wagedResourceIdealSta
     ConfigAccessor cfgAccessor = getConfigAccessor();
     List<ResourceConfig> wagedResourceConfigs = new ArrayList<>();
     for (IdealState idealState : wagedResourceIdealState) {
-      wagedResourceConfigs
-          .add(cfgAccessor.getResourceConfig(clusterId, idealState.getResourceName()));
+      ResourceConfig resourceConfig = cfgAccessor.getResourceConfig(clusterId, idealState.getResourceName());
Contributor

Do we not check whether the resource is WAGED-enabled?

Contributor Author

The check for WAGED resources happens in computeOptimalAssignmentForResources(), the function that calls computeWagedAssignmentResult(). My understanding is that the callee assumes it is only ever passed WAGED resources.

The calling function is computeOptimalAssignmentForResources() in ResourceAssignmentOptimizerAccessor, line 254:

      // Compute all Waged resources in a batch later.
      if (idealState.getRebalancerClassName() != null && idealState.getRebalancerClassName()
          .equals(WagedRebalancer.class.getName())) {
        wagedResourceIdealState.add(idealState);
        continue;
      }

and then:

    if (!wagedResourceIdealState.isEmpty()) {
      computeWagedAssignmentResult(wagedResourceIdealState, inputFields, clusterState, clusterId,
          result);
    }

Contributor

Aha, so the user has configured the resource as WAGED but hasn't provided a WAGED resource config? Isn't this a user error? We can prevent the null pointer exception, but shouldn't the user also know that the config is wrong?

Contributor Author

Technically, WAGED rebalancing "works" for clusters without resource configs. This is actually the current setup of our super clusters: they use WAGED without any resource configs or relevant instance capacity configs.

I'm not 100% sure about this part, but I believe that if there are no resource configs, the score calculated for each node will be 0 and the tiebreak will go to a node without any resources assigned to it. I don't think evenness is guaranteed when there are no resource and instance capacity configs, but each node is guaranteed to have at least 1 replica assigned to it (given # replicas > # nodes).
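
The behavior described in the comment above can be pictured with a toy model. This is only a sketch under the commenter's stated (and unverified) belief, not Helix's WAGED algorithm: `ZeroScoreTieBreak` and `assign` are hypothetical names, every node is treated as having the same score of 0, ties go to the first node that holds nothing, and once no node is empty the choice falls back to node 0 as an arbitrary assumption.

```java
import java.util.Arrays;

public class ZeroScoreTieBreak {
  // Toy model only, not Helix's WAGED algorithm. With no capacity information every
  // node scores 0, so the only distinguishing rule is "prefer a node that is empty".
  static int[] assign(int numNodes, int numReplicas) {
    int[] replicasPerNode = new int[numNodes];
    for (int r = 0; r < numReplicas; r++) {
      int chosen = 0; // arbitrary fallback when no node is empty (an assumption)
      for (int n = 0; n < numNodes; n++) {
        if (replicasPerNode[n] == 0) {
          chosen = n; // tie-break: first node with nothing assigned
          break;
        }
      }
      replicasPerNode[chosen]++;
    }
    return replicasPerNode;
  }

  public static void main(String[] args) {
    // 5 replicas over 3 nodes: every node gets at least one, but the split is uneven.
    System.out.println(Arrays.toString(assign(3, 5))); // prints [3, 1, 1]
  }
}
```

In this model every node ends up with at least one replica whenever there are at least as many replicas as nodes, while the remainder piles up wherever the fallback points, which matches the "at least 1 replica, no guarantee of evenness" intuition.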

Contributor Author

The NPE only occurs in the partitionAssignment API; the actual controller rebalance algorithm works fine in the same scenario.

Contributor

That means we fill in default values in the WAGED workflow but not in this workflow.
Please look at WagedValidationUtil::validateAndGetPartitionCapacity.

But your fix should be good too.

@GrantPSpencer
Contributor Author

Pull request approved by @xyuanlu
Commit message: Fix partitionAssignment NPE when no resource configs

@xyuanlu merged commit 65e657d into apache:master Oct 18, 2023
2 checks passed