Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metaclient updater retry logic #2805

Merged
merged 4 commits into from
Jun 21, 2024

Conversation

GrantPSpencer
Copy link
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Retry logic for metaclient updater

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Metaclient updater currently does not retry on no node or version mismatch. This change includes logic to attempt to create mode if it does not exist and also retry on version mismatch.

Please take a look at the test case as well, is there a better way to recreate the version mismatch race condition?

Tests

  • The following tests are written for this issue:

meta-client/src/main/java/org/apache/helix/metaclient/impl/zk/ZkMetaClient.java #testUpdate

  • The following is the result of the "mvn test" command on the appropriate module:
$ mvn test -o -pl=meta-client

[INFO] Tests run: 68, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 755.799 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 68, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ meta-client ---
[INFO] Loading execution data file /Users/gspencer/Desktop/git-repos/helix/meta-client/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Meta Client' with 78 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  12:41 min
[INFO] Finished at: 2024-05-30T21:50:37-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Sorry, something went wrong.

@GrantPSpencer
Copy link
Contributor Author

11bf562
Refactored updater tests to remove use of threads.

$ mvn test -o -Dtest=TestZkMetaClient.java -pl=meta-client
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.671 s - in org.apache.helix.metaclient.impl.zk.TestZkMetaClient
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  21.240 s
[INFO] Finished at: 2024-06-05T15:41:14-07:00
[INFO] ------------------------------------------------------------------------

T newData = updater.update(null);
if (newData != null) {
try {
create(key, newData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating the PR!
Question about the retry logic here.
For example, when user try to update node "/a/b/c", and there is no node "/a/b", set in line 228 with throw NoNodeException, and we reach line 243. Since there is no node "/a/b", create will also throw NoNodeException, and we keep retry until expire. Should we do recursive create instead? Or we just let it expire and forward the NoNodeException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good question. I am leaning towards not recursively creating, because that is the same behavior as helix zk client. However, if this API's contract with customers is "give me a node path, and I will make sure that it has the data you want" then I think there's a good argument for making our create attempt recursive rather than just on the leaf node. cc @junkaixue do you have thoughts on this?

And to clarify, this is the current flow if you tried to update path /a/b/c when no nodes in that path exist yet. (It should fail before retrying update)
try to update /a/b/c --> MetaClientNoNodeException is caught
try to create /a/b/ --> MetaClientNoNodeException this is not caught and so immediately fails
we do not retry update as method failed due to above exception

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also thinking of no creation. If this is update, it supposed to be updating something was there. It has the possibility that the node just be deleted recursively before update. Then update should be an invalidate operation and fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would second no creation as well. Say if we try to update "/a/b/c" and get no node exception, we just fail immediately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke with @xyuanlu offline. Added createIfAbsent parameter. Base functionality is to not create the node, but will attempt to create leaf node if it is not present. It will not recursively create and will fail on first create() call if parent path does not exist.

Added relevant test cases to cover this
cc @junkaixue

@GrantPSpencer
Copy link
Contributor Author

https://github.com/apache/helix/actions/runs/9392174978
Failed due to #2788
no failures in metaclient module

2024-06-06T01:14:54.1015466Z [info] ./helix-core/target/surefire-reports/TestSuite.txt: Tests run: 1420, Failures: 1, Errors: 0, Skipped: 17, Time elapsed: 6,000.559 s <<< FAILURE! - in TestSuite
2024-06-06T01:14:54.1042519Z ##[error] Test failed: testNodeSwap(org.apache.helix.integration.rebalancer.TestInstanceOperation)  Time elapsed: 18.28 s  <<< FAILURE!
2024-06-06T01:14:54.1045991Z [info] ./zookeeper-api/target/surefire-reports/TestSuite.txt: Tests run: 99, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 234.346 s - in TestSuite
2024-06-06T01:14:54.1048431Z [info] ./metrics-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.247 s - in TestSuite
2024-06-06T01:14:54.1050817Z [info] ./meta-client/target/surefire-reports/TestSuite.txt: Tests run: 70, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 674.295 s - in TestSuite
2024-06-06T01:14:54.1053642Z [info] ./metadata-store-directory-common/target/surefire-reports/TestSuite.txt: Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.88 s - in TestSuite
2024-06-06T01:14:54.1056149Z [info] ./helix-common/target/surefire-reports/TestSuite.txt: Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.265 s - in TestSuite
2024-06-06T01:14:54.1133664Z Post job cleanup.

Copy link
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@GrantPSpencer
Copy link
Contributor Author

Pull request approved by @xyuanlu , @junkaixue
Commit message: Add retry logic to MetaClient Updater

@junkaixue junkaixue merged commit 3055f26 into apache:master Jun 21, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants