Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition while registering nodes #2965

Open
jacob-netguardians opened this issue Nov 18, 2024 · 1 comment · May be fixed by #2978
Open

Race condition while registering nodes #2965

jacob-netguardians opened this issue Nov 18, 2024 · 1 comment · May be fixed by #2978
Labels
bug Something isn't working

Comments

@jacob-netguardians
Copy link

Describe the bug

When two nodes are registered very closely to eachother, it can happen that the ZKHelixAdmin.addInstance() method is called at about the same time on the two nodes. The one that comes slightly before the other one will start creating ZNodes for the "INSTANCE" hierarchy in Zookeeper.

Let's say the second node starts the run of the ZKHelixAdmin.addInstance() method now. Its call to findInstancesWithMatchingLogicalId() will fail badly, throwing an exception (see below), because the first node's INSTANCE hierarchy is not complete:

[o.a.h.m.z.ZKHelixAdmin#202] Add instance node-2 to cluster ecwbq5PO.
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: /ecwbq5PO/INSTANCES/node-1/MESSAGES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: /ecwbq5PO/INSTANCES/node-1/CURRENTSTATES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: /ecwbq5PO/INSTANCES/node-1/STATUSUPDATES
[o.a.h.m.z.ZKUtil#163] Invalid instance setup, missing znode path: /ecwbq5PO/INSTANCES/node-1/ERRORS
[o.a.h.m.z.ZKHelixAdmin#232] Failed to add instance node-2 to cluster ecwbq5PO with instance operation ENABLE. Setting INSTANCE_OPERATION to UNKNOWN instead.
org.apache.helix.HelixException: fail to get config. instance: node-1 is NOT setup in cluster: ecwbq5PO
	at org.apache.helix.ConfigAccessor.getInstanceConfig(ConfigAccessor.java:908) ~[helix-core-1.4.1.1.jar:1.4.1.1]
	at org.apache.helix.util.InstanceUtil.lambda$findInstancesWithMatchingLogicalId$7(InstanceUtil.java:143) ~[helix-core-1.4.1.1.jar:1.4.1.1]
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
	at org.apache.helix.util.InstanceUtil.findInstancesWithMatchingLogicalId(InstanceUtil.java:148) ~[helix-core-1.4.1.1.jar:1.4.1.1]
	at org.apache.helix.util.InstanceUtil.validateInstanceOperationTransition(InstanceUtil.java:118) ~[helix-core-1.4.1.1.jar:1.4.1.1]
	at org.apache.helix.manager.zk.ZKHelixAdmin.addInstance(ZKHelixAdmin.java:228) ~[helix-core-1.4.1.1.jar:1.4.1.1]
...

So... while adding node-2, it complains about node-1 being invalid, and consequently sets the "instance-operation" on node-2 to UNKNOWN instead of ENABLE.

Version of helix (1.4.1.1) above is due to the fact that I had already fixed #2963 and had a local version number change for this. Line numbers in the logs above may also not exactly match what is known in version 1.4.1, due to the addition of debug logging I had to do before actually understanding the problem.

Expected behavior

A node should be registered completely or not at all, not having that kind of impact on the fellow nodes' registration.

Additional context

I propose the following fix, in the ZKHelixAdmin class, creating all those nodes in the node's INSTANCE hierarchy in an atomic fashion:

Replace, in the ZKHelixAdmin class, addInstance() method, this:

    _zkClient.createPersistent(PropertyPathBuilder.instanceMessage(clusterName, nodeId), true);
    _zkClient.createPersistent(PropertyPathBuilder.instanceCurrentState(clusterName, nodeId), true);
    _zkClient
        .createPersistent(PropertyPathBuilder.instanceTaskCurrentState(clusterName, nodeId), true);
    _zkClient.createPersistent(PropertyPathBuilder.instanceCustomizedState(clusterName, nodeId), true);
    _zkClient.createPersistent(PropertyPathBuilder.instanceError(clusterName, nodeId), true);
    _zkClient.createPersistent(PropertyPathBuilder.instanceStatusUpdate(clusterName, nodeId), true);
    _zkClient.createPersistent(PropertyPathBuilder.instanceHistory(clusterName, nodeId), true);

with

    // if those nodes are not created atomically, then having two nodes registering almost exactly
    // at the same time could be problematic, because one node would not able to check for already
    // existing instance with matching logical ID (see
    // InstanceUtil.findInstancesWithMatchingLogicalId call above, where finding an incomplete
    // "INSTANCE" node is a killer)
    final List<Op> ops = Arrays.asList(
        createPersistentNodeOp(PropertyPathBuilder.instance(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceMessage(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceCurrentState(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceTaskCurrentState(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceCustomizedState(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceError(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceStatusUpdate(clusterName, nodeId)),
        createPersistentNodeOp(PropertyPathBuilder.instanceHistory(clusterName, nodeId))
    );
    _zkClient.multi(ops);

where createPersistentNodeOp method is defined as:

  private static Op createPersistentNodeOp(String path) {
    return Op.create(path, null, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }
@jacob-netguardians jacob-netguardians added the bug Something isn't working label Nov 18, 2024
@jacob-netguardians
Copy link
Author

PR could be created (sorry, I do not have that sort of right, still a newbie here) using the following branch: https://github.com/jacob-netguardians/helix/tree/bugfix/2965-atomic-znode-creation-on-node-registration

jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Nov 19, 2024
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Nov 19, 2024
@jacob-netguardians jacob-netguardians linked a pull request Dec 6, 2024 that will close this issue
5 tasks
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Dec 9, 2024
Reformatted changed file according to the "helix format"
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Dec 11, 2024
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Dec 11, 2024
Reformatted changed file according to the "helix format"
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Dec 11, 2024
Reformatted changed file according to the "helix format"
jacob-netguardians added a commit to jacob-netguardians/helix that referenced this issue Dec 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant