NullPointerException in IntermediateStateCalcStage #2973

Open
GrantPSpencer opened this issue Nov 27, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

An NPE can occur in IntermediateStateCalcStage when applying pending messages to the intermediateStateMap. Specifically, when it applies a message whose toState is DROPPED, it calls .remove(..) on the partition's state map, which can be null.

2024/10/29 01:48:13.046 ERROR [GenericHelixController] [HelixController-pipeline-default-CLUSTERNAME-(70ae9461_DEFAULT)] [helix] [] Exception while executing DEFAULT pipeline for cluster CLUSTERNAME. Will not continue to next pipeline
java.lang.NullPointerException: null
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.lambda$computeIntermediateMap$2(IntermediateStateCalcStage.java:868) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediateMap(IntermediateStateCalcStage.java:864) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.computeIntermediatePartitionState(IntermediateStateCalcStage.java:402) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.compute(IntermediateStateCalcStage.java:180) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.stages.IntermediateStateCalcStage.process(IntermediateStateCalcStage.java:85) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [org.apache.helix.helix-core-1.3.2-dev-202404301535-hotfix.jar:1.3.2-dev-202404301535-hotfix]

Code at the failing call site:

    for (Map.Entry<Partition, Map<String, Message>> entry : pendingMessageMap.entrySet()) {
      entry.getValue().forEach((key, value) -> {
        if (!value.getToState().equals(HelixDefinedState.DROPPED.name())) {
          intermediateStateMap.setState(entry.getKey(), value.getTgtName(), value.getToState());
        } else {
          // Line 868 in the trace: get(..) returns null when the partition has no state map.
          intermediateStateMap.getStateMap().get(entry.getKey()).remove(value.getTgtName());
        }
      });
    }

To Reproduce

Unable to reproduce outside of unit tests. Currently I think the behavior occurs when:

  1. A resource has a partition with 1 replica.
  2. A message is sent to instance A to drop the replica, but the replica no longer exists in the instance's current state.
  3. The controller snapshots the cluster and runs the pipeline.
  4. IntermediateStateCalcStage attempts to call .remove() on a map that does not exist.

I think the above state can be reached when:

  1. A race condition occurs where the node reads the message and drops the current state, but has not yet deleted the message, so it is still seen as a pending message.
  2. The node goes offline, so there is no current state.
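
To make the suspected failure mode concrete, here is a minimal sketch that mimics the DROPPED branch with plain Java maps. It does not use Helix classes: the String keys stand in for Partition and instance names, and `DropOnMissingPartition`, `applyDrop`, and `dropThrowsNpe` are hypothetical names made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: plain String keys replace Helix's Partition and Message types.
final class DropOnMissingPartition {
  // Mirrors the DROPPED branch from the snippet: look up the partition's
  // instance-to-state map and remove the target instance from it.
  static void applyDrop(Map<String, Map<String, String>> stateMap,
                        String partition, String instance) {
    stateMap.get(partition).remove(instance); // NPE if the partition has no entry
  }

  // Returns true if applying the drop throws a NullPointerException.
  static boolean dropThrowsNpe(Map<String, Map<String, String>> stateMap,
                               String partition, String instance) {
    try {
      applyDrop(stateMap, partition, instance);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> stateMap = new HashMap<>();
    // No state was recorded for "partition_0", yet a drop message is still pending.
    System.out.println(dropThrowsNpe(stateMap, "partition_0", "instanceA")); // prints true
  }
}
```

This only shows that a pending drop for a partition with no state map entry is enough to trigger the exception. It does not confirm which of the two scenarios above actually produces that state in a live cluster.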

Expected behavior

Failing to remove because the map is null should not throw, in my opinion. We could add a null check, or use getOrDefault to return an empty map.
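
The two options could look roughly like the sketch below. It again uses plain maps rather than Helix types, and `NullSafeDrop`, `dropWithNullCheck`, and `dropWithDefault` are hypothetical names. This is not a patch against IntermediateStateCalcStage, only an illustration of the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: String keys replace Helix's Partition and instance names.
final class NullSafeDrop {
  // Option 1: explicit null check before removing the instance.
  static void dropWithNullCheck(Map<String, Map<String, String>> stateMap,
                                String partition, String instance) {
    Map<String, String> instanceStates = stateMap.get(partition);
    if (instanceStates != null) {
      instanceStates.remove(instance);
    }
  }

  // Option 2: fall back to a throwaway empty map when the partition is absent.
  // Note: getOrDefault still returns null if the key is explicitly mapped to null.
  static void dropWithDefault(Map<String, Map<String, String>> stateMap,
                              String partition, String instance) {
    stateMap.getOrDefault(partition, new HashMap<>()).remove(instance);
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> stateMap = new HashMap<>();
    // Neither call throws, and neither adds an entry for the missing partition.
    dropWithNullCheck(stateMap, "partition_0", "instanceA");
    dropWithDefault(stateMap, "partition_0", "instanceA");
    System.out.println(stateMap.isEmpty()); // prints true
  }
}
```

The null check avoids allocating a map per missing partition, while getOrDefault keeps the call on one line. Either way, a drop for a replica that is already gone becomes a no-op.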

GrantPSpencer added the bug label on Nov 27, 2024