Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling throwable during Helix message processing #2696

Conversation

csudharsanan
Copy link
Contributor

Issues

Description

  • When we encounter any Throwable Error like OOME, we do not handle it currently. We should ideally cleanup ZK state (remove the message) and allow Helix to continue working on failed task. Clean-up and retry code path should be executed on encountering Throwable of any type.

Tests

  • TestHelixTaskExecutor:

  • testThrowableHandlingOnMessageProcess

[INFO] Reactor Summary for Apache Helix 1.3.2-SNAPSHOT:
[INFO] 
[INFO] Apache Helix ....................................... SUCCESS [  1.793 s]
[INFO] Apache Helix :: Metrics Common ..................... SUCCESS [  2.393 s]
[INFO] Apache Helix :: Metadata Store Directory Common .... SUCCESS [  1.479 s]
[INFO] Apache Helix :: ZooKeeper API ...................... SUCCESS [  3.975 s]
[INFO] Apache Helix :: Helix Common ....................... SUCCESS [  1.533 s]
[INFO] Apache Helix :: Core ............................... SUCCESS [ 24.984 s]
[INFO] Apache Helix :: Admin Webapp ....................... SUCCESS [  2.978 s]
[INFO] Apache Helix :: Restful Interface .................. SUCCESS [  5.904 s]
[INFO] Apache Helix :: Distributed Lock ................... SUCCESS [  1.160 s]
[INFO] Apache Helix :: HelixAgent ......................... SUCCESS [  1.959 s]
[INFO] Apache Helix :: Front End .......................... SUCCESS [03:20 min]
[INFO] Apache Helix :: Recipes ............................ SUCCESS [  0.031 s]
[INFO] Apache Helix :: Recipes :: Rabbitmq Consumer Group . SUCCESS [  1.484 s]
[INFO] Apache Helix :: Recipes :: Rsync Replicated File Store SUCCESS [  1.493 s]
[INFO] Apache Helix :: Recipes :: distributed lock manager  SUCCESS [  1.121 s]
[INFO] Apache Helix :: Recipes :: distributed task execution SUCCESS [  1.137 s]
[INFO] Apache Helix :: Recipes :: service discovery ....... SUCCESS [  1.030 s]
[INFO] Apache Helix :: View Aggregator .................... SUCCESS [  2.253 s]
[INFO] Apache Helix :: Meta Client ........................ SUCCESS [  2.138 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:20 min
[INFO] Finished at: 2023-11-07T19:21:16-08:00
[INFO] ------------------------------------------------------------------------
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 47.314 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/csudhars/Documents/GitHub/csudharsanan_helix/helix/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 947 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:04 min
[INFO] Finished at: 2023-11-07T19:24:28-08:00
[INFO] ------------------------------------------------------------------------
  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@csudharsanan csudharsanan force-pushed the csudhars/error-handling-message-processing branch 3 times, most recently from e045d61 to 87b6192 Compare November 8, 2023 17:41
@csudharsanan csudharsanan force-pushed the csudhars/error-handling-message-processing branch from 87b6192 to 00f4569 Compare November 8, 2023 20:55
@csudharsanan
Copy link
Contributor Author

This PR is ready to be merged

@xyuanlu
Copy link
Contributor

xyuanlu commented Nov 16, 2023

Thanks @csudharsanan for composing the PR! Normally we require a "approval" before merge. We will take another round of review :D

@csudharsanan
Copy link
Contributor Author

csudharsanan commented Nov 16, 2023

Thanks @csudharsanan for composing the PR! Normally we require a "approval" before merge. We will take another round of review :D

Ahh my bad.

@csudharsanan csudharsanan force-pushed the csudhars/error-handling-message-processing branch from 00f4569 to 9a596c4 Compare November 16, 2023 21:17
@csudharsanan csudharsanan force-pushed the csudhars/error-handling-message-processing branch from 9a596c4 to 2c1a38b Compare November 29, 2023 19:14
Copy link
Contributor

@desaikomal desaikomal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw what is CM in TestCM ??

@@ -488,11 +490,11 @@ public boolean scheduleTask(MessageTask task) {
"Message handling task already sheduled for " + taskId, manager);
}
}
} catch (Exception e) {
} catch (Throwable e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Helix, we use general Exception instead of Throwable. Even Throwable, you need to replace variable name to t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated variable name to t. Since OOME encountered in this GCN is an error, we had to update this to throwable instead of Exception.

OOME has following lineage:

java.lang.Throwable
-->java.lang.Error
-->java.lang.VirtualMachineError
--->java.lang.OutOfMemoryError

@csudharsanan
Copy link
Contributor Author

btw what is CM in TestCM ??

CM stands for ClusterManager, there were two TestHelixTaskExecutors within the same module and had to rename the new Test class to this.

@csudharsanan csudharsanan force-pushed the csudhars/error-handling-message-processing branch from 2c1a38b to ded7aa5 Compare December 8, 2023 13:29
_statusUpdateUtil
.logError(message, HelixTaskExecutor.class, e, "Error while executing task " + e,
.logError(message, HelixTaskExecutor.class, "Error while executing task " + t,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider not integrate the Throwable in message. If we do it this way, it only print out a short description of this throwable.
We may want to print the stack trace and other information. Please consider something like.

 public void logError(Message message, Class classInfo, Exception e, String additionalInfo,
      HelixManager manager) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can close this one, as created #2730 for addressing the comments

@junkaixue junkaixue closed this Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failure path not executed on encountering throwable error during message processing.
5 participants