[apache/helix] -- Added detail in the Exception message for WAGED rebalance (hard constraint) failures. #2829

Merged

@himanshukandwal himanshukandwal commented Jul 8, 2024

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Currently, when the WAGED rebalancer fails because of a hard constraint (for example, insufficient capacity), the exception does not say which capacity key fell short or why. Those details are logged only at DEBUG level, so triage requires turning on Helix DEBUG logging and turning it off again afterwards.

In this PR, we introduce a ValidationResult construct that records the details of each validation result; it is used to produce a specific, detailed error message when a rebalance fails.
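
To make the idea concrete, here is a minimal sketch of what a result object of this kind could look like. It is not the code from this PR: the class shape, the method names (`ok`, `fail`, `checkCapacity`, `describeFailures`), and the message wording are all assumptions chosen for illustration. The point is only that a hard-constraint check returns a reason along with the pass/fail outcome, so the reason can be carried into the exception message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ValidationResultSketch {

  /** Hypothetical result type: the outcome of one hard-constraint check plus a reason. */
  public static final class ValidationResult {
    private static final ValidationResult OK = new ValidationResult(true, "");
    private final boolean success;
    private final String errorMessage;

    private ValidationResult(boolean success, String errorMessage) {
      this.success = success;
      this.errorMessage = errorMessage;
    }

    public static ValidationResult ok() {
      return OK;
    }

    public static ValidationResult fail(String format, Object... args) {
      return new ValidationResult(false, String.format(format, args));
    }

    public boolean isSuccess() {
      return success;
    }

    public String getErrorMessage() {
      return errorMessage;
    }
  }

  /** Hypothetical capacity check that names the first capacity key that runs short. */
  public static ValidationResult checkCapacity(String instance,
      Map<String, Integer> remaining, Map<String, Integer> required) {
    for (Map.Entry<String, Integer> entry : required.entrySet()) {
      int available = remaining.getOrDefault(entry.getKey(), 0);
      if (available < entry.getValue()) {
        return ValidationResult.fail(
            "instance %s has insufficient capacity for key %s: required %d, remaining %d",
            instance, entry.getKey(), entry.getValue(), available);
      }
    }
    return ValidationResult.ok();
  }

  /** Joins the reasons of all failed checks into one message for the caller to throw. */
  public static String describeFailures(String replica, List<ValidationResult> results) {
    List<String> reasons = new ArrayList<>();
    for (ValidationResult result : results) {
      if (!result.isSuccess()) {
        reasons.add(result.getErrorMessage());
      }
    }
    return "Unable to place replica " + replica + ": " + String.join("; ", reasons);
  }
}
```

Returning a value object instead of a bare boolean keeps the constraint checks free of logging decisions; the caller decides whether the reason is logged, thrown, or ignored.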

Tests

  • The following tests are written for this issue:
    Updated existing tests:
      • TestConstraintBasedAlgorithm.java
      • TestFaultZoneAwareConstraint.java
      • TestNodeCapacityConstraint.java
      • TestNodeMaxPartitionLimitConstraint.java
      • TestPartitionActivateConstraint.java
      • TestReplicaActivateConstraint.java
      • TestSamePartitionOnInstanceConstraint.java
      • TestValidGroupTagConstraint.java
  • The following is the result of the "mvn test" command on the appropriate module:
mvn test -o "-Dtest=Test*Constraint" -pl=helix-core
-------------------------------------------------------------------------------------------
START TestPartitionLevelTransitionConstraint_test at Mon Jul 08 16:20:17 PDT 2024
END TestPartitionLevelTransitionConstraint_test at Mon Jul 08 16:20:25 PDT 2024
AfterClass: TestPartitionLevelTransitionConstraint called.
START testMsgConstraint() at Mon Jul 08 16:20:29 PDT 2024
ZnRecord=msgId-001, {CREATE_TIMESTAMP=1720480829984, FROM_STATE=OFFLINE, MSG_ID=msgId-001, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, RESOURCE_NAME=TestDB, TGT_NAME=localhost_12918, TO_STATE=SLAVE}{}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0} matches(5): [{MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=TestDB, INSTANCE=localhost_12918}:5, {MESSAGE_TYPE=STATE_TRANSITION}:ANY, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=TestDB, INSTANCE=.*}:2, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=.*, INSTANCE=.*}:10, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE}:50]
ZnRecord=msgId-002, {CREATE_TIMESTAMP=1720480829992, FROM_STATE=OFFLINE, MSG_ID=msgId-002, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, RESOURCE_NAME=TestDB, TGT_NAME=localhost_12919, TO_STATE=SLAVE}{}{}, Stat=Stat {_version=0, _creationTime=0, _modifiedTime=0, _ephemeralOwner=0} matches(5): [{MESSAGE_TYPE=STATE_TRANSITION}:ANY, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=.*, INSTANCE=localhost_12919}:1, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=TestDB, INSTANCE=.*}:2, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE, RESOURCE=.*, INSTANCE=.*}:10, {MESSAGE_TYPE=STATE_TRANSITION, TRANSITION=OFFLINE-SLAVE}:50]
END testMsgConstraint() at Mon Jul 08 16:20:29 PDT 2024
START testStateConstraint() at Mon Jul 08 16:20:29 PDT 2024
{STATE=MASTER, RESOURCE=TestDB} matches(3): [{STATE=MASTER, RESOURCE=.*}:2, {STATE=MASTER}:1, {STATE=MASTER, RESOURCE=TestDB}:1]
{STATE=MASTER, RESOURCE=MyDB} matches(2): [{STATE=MASTER, RESOURCE=.*}:2, {STATE=MASTER}:1]
END testStateConstraint() at Mon Jul 08 16:20:30 PDT 2024
AfterClass: TestConstraint called.
Shut down zookeeper at port 2183 in thread main
[INFO] Tests run: 34, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.544 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 34, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/hkandwal/Documents/workspaces/projects/helix_os_hk/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 812 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:13 min
[INFO] Finished at: 2024-07-08T16:20:34-07:00
[INFO] ------------------------------------------------------------------------

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@himanshukandwal himanshukandwal force-pushed the hkandwal/enable-waged-error-logging branch from 08100f7 to 87dd9e3 Compare July 15, 2024 19:44

@junkaixue junkaixue left a comment

lgtm. thanks for working on this!

@himanshukandwal
Contributor Author

This PR has been approved by @junkaixue.

Final Commit Message: Enable constraint-level logging when the WAGED algorithm cannot find a placement for a replica. The logging is controlled by a flag that is enabled only when the failure criteria are met and is disabled otherwise.
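
One possible reading of the flag described in the commit message is sketched below. It is an illustration under stated assumptions, not the merged implementation: `NamedCheck`, `filter`, `place`, and the `collectDetails` parameter are hypothetical names, and a plain `IllegalStateException` stands in for whatever exception the rebalancer actually throws. The normal pass runs with detail collection off; only when no candidate node survives is the pass repeated with the flag on, so the reasons are gathered just for the failing replica.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FailureDetailSketch {

  /** Hypothetical named check standing in for a hard constraint. */
  public static final class NamedCheck {
    final String name;
    final Predicate<String> accepts;

    public NamedCheck(String name, Predicate<String> accepts) {
      this.name = name;
      this.accepts = accepts;
    }
  }

  /** Returns the nodes that pass every check; records reasons only if collectDetails is true. */
  public static List<String> filter(List<String> nodes, List<NamedCheck> checks,
      boolean collectDetails, List<String> details) {
    List<String> valid = new ArrayList<>();
    for (String node : nodes) {
      boolean accepted = true;
      for (NamedCheck check : checks) {
        if (!check.accepts.test(node)) {
          accepted = false;
          if (collectDetails) {
            details.add(node + " rejected by " + check.name);
          }
          break; // the first failing check is enough to reject this node
        }
      }
      if (accepted) {
        valid.add(node);
      }
    }
    return valid;
  }

  /** Picks the first valid node, or throws with per-node reasons when none is valid. */
  public static String place(List<String> nodes, List<NamedCheck> checks) {
    List<String> details = new ArrayList<>();
    List<String> valid = filter(nodes, checks, false, details);
    if (!valid.isEmpty()) {
      return valid.get(0);
    }
    // Failure criterion met: repeat the pass with detail collection enabled.
    filter(nodes, checks, true, details);
    throw new IllegalStateException(
        "No valid node for replica. Reasons: " + String.join("; ", details));
  }
}
```

The second pass costs extra work only on the failure path, which is the trade-off that makes it acceptable to keep detail collection off during normal rebalancing.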

@junkaixue junkaixue merged commit 6609c37 into apache:master Jul 17, 2024
1 of 2 checks passed
Development

Successfully merging this pull request may close these issues.

Enable logging detailed and specific error message for WAGED Rebalance failures