[apache/helix] -- Add SetPartitionToError for participants to self annotate a node to ERROR state #2792

Merged: 1 commit into apache:master from csudhars/SetPartitionToError on May 8, 2024

Conversation

csudharsanan

Issues

Fixes #2791

Description

What: An API endpoint that validates the incoming request and sends a state transition message to set one or more partitions from any current state to the ERROR state.

Why: Currently, participants cannot explicitly set a partition to the ERROR state when it appears to be stuck in a specific current state. The only way a replica can be set to ERROR is from within a state model. An endpoint that allows this would let clients then call the resetPartition endpoint to set the replica back to the INIT state and recover it. resetPartition works only on partitions in the ERROR state.
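
The recovery flow described above can be sketched with a small in-memory model. This is only an illustration, not Helix code: the class `PartitionRecoverySketch`, its methods, and the use of `OFFLINE` as the initial state are assumptions made for the example. It shows why a stuck replica cannot be reset directly, and how forcing it to ERROR first makes the reset possible.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the recovery flow; it does not use or mirror Helix classes.
public class PartitionRecoverySketch {
  public static final String ERROR = "ERROR";
  // Assumption for this sketch: the state model's initial state is OFFLINE.
  public static final String INITIAL = "OFFLINE";

  private final Map<String, String> currentStates = new HashMap<>();

  public void assign(String partition, String state) {
    currentStates.put(partition, state);
  }

  public String stateOf(String partition) {
    return currentStates.get(partition);
  }

  // Validate the whole request first, then move every partition to ERROR
  // regardless of its current state.
  public void setPartitionsToError(List<String> partitions) {
    for (String partition : partitions) {
      if (!currentStates.containsKey(partition)) {
        throw new IllegalArgumentException("Unknown partition: " + partition);
      }
    }
    for (String partition : partitions) {
      currentStates.put(partition, ERROR);
    }
  }

  // A reset is accepted only for partitions that are currently in ERROR.
  public void resetPartitions(List<String> partitions) {
    for (String partition : partitions) {
      if (!ERROR.equals(currentStates.get(partition))) {
        throw new IllegalStateException(partition + " is not in ERROR state");
      }
    }
    for (String partition : partitions) {
      currentStates.put(partition, INITIAL);
    }
  }

  public static void main(String[] args) {
    PartitionRecoverySketch sketch = new PartitionRecoverySketch();
    sketch.assign("TestDB0_7", "SLAVE"); // pretend this replica is stuck
    List<String> stuck = Arrays.asList("TestDB0_7");

    try {
      sketch.resetPartitions(stuck); // rejected: the replica is not in ERROR
    } catch (IllegalStateException e) {
      System.out.println("Reset rejected: " + e.getMessage());
    }

    sketch.setPartitionsToError(stuck); // explicit move to ERROR
    sketch.resetPartitions(stuck);      // now the reset is accepted
    System.out.println("Final state: " + sketch.stateOf("TestDB0_7"));
  }
}
```

The request is validated in full before any state changes, so a bad partition name in the list leaves every partition untouched.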

Tests


  [INFO] ------------------------------------------------------------------------
  [INFO] Reactor Summary for Apache Helix 1.3.2-SNAPSHOT:
  [INFO] 
  [INFO] Apache Helix ....................................... SUCCESS [  1.504 s]
  [INFO] Apache Helix :: Metrics Common ..................... SUCCESS [  0.244 s]
  [INFO] Apache Helix :: Metadata Store Directory Common .... SUCCESS [  0.363 s]
  [INFO] Apache Helix :: ZooKeeper API ...................... SUCCESS [  0.380 s]
  [INFO] Apache Helix :: Helix Common ....................... SUCCESS [  0.291 s]
  [INFO] Apache Helix :: Core ............................... SUCCESS [  0.306 s]
  [INFO] Apache Helix :: Admin Webapp ....................... SUCCESS [  0.606 s]
  [INFO] Apache Helix :: Restful Interface .................. SUCCESS [  0.941 s]
  [INFO] Apache Helix :: Distributed Lock ................... SUCCESS [  0.228 s]
  [INFO] Apache Helix :: HelixAgent ......................... SUCCESS [  0.187 s]
  [INFO] Apache Helix :: Recipes ............................ SUCCESS [  0.033 s]
  [INFO] Apache Helix :: Recipes :: Rabbitmq Consumer Group . SUCCESS [  0.205 s]
  [INFO] Apache Helix :: Recipes :: Rsync Replicated File Store SUCCESS [  0.248 s]
  [INFO] Apache Helix :: Recipes :: distributed lock manager  SUCCESS [  0.169 s]
  [INFO] Apache Helix :: Recipes :: distributed task execution SUCCESS [  0.246 s]
  [INFO] Apache Helix :: Recipes :: service discovery ....... SUCCESS [  0.186 s]
  [INFO] Apache Helix :: View Aggregator .................... SUCCESS [  0.167 s]
  [INFO] Apache Helix :: Meta Client ........................ SUCCESS [  0.146 s]
  [INFO] ------------------------------------------------------------------------
  [INFO] BUILD SUCCESS
  [INFO] ------------------------------------------------------------------------
  [INFO] Total time:  9.219 s
  [INFO] Finished at: 2024-04-16T13:13:02-07:00
  [INFO] ------------------------------------------------------------------------
  • The following tests are written/updated for this issue:

        - TestSetPartitionToErrorState (Integration)
        - TestZkHelixAdmin : testSetPartitionToError 
        - TestPerInstanceAccessor : updateInstance
    

    mvn test -o -Dtest=TestSetPartitionToErrorState -pl=helix-core

          AfterClass: TestSetPartitionToErrorState called.
          Shut down zookeeper at port 2183 in thread main
          [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.697 s - in   org.apache.helix.integration.TestSetPartitionToErrorState
          [INFO] 
          [INFO] Results:
          [INFO] 
          [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
          [INFO] 
          [INFO] 
          [INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
          [INFO] Loading execution data file /Users/csudhars/Apr15Helix-ForkedHelix/helix/helix-core/target/jacoco.exec
          [INFO] Analyzed bundle 'Apache Helix :: Core' with 806 classes
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time:  59.426 s
          [INFO] Finished at: 2024-04-16T14:14:54-07:00
          [INFO] ------------------------------------------------------------------------
    

    mvn test -o -Dtest=TestZkHelixAdmin -pl=helix-core

        END testZkHelixAdmin at Tue Apr 16 12:57:29 PDT 2024
        AfterClass: TestZkHelixAdmin called.
        Shut down zookeeper at port 2183 in thread main
        [INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.246 s - in TestSuite
        [INFO] 
        [INFO] Results:
        [INFO] 
        [INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0
        [INFO] 
        [INFO] 
        [INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
        [INFO] Loading execution data file /Users/csudhars/Apr15Helix-ForkedHelix/helix/helix-core/target/jacoco.exec
        [INFO] Analyzed bundle 'Apache Helix :: Core' with 951 classes
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time:  02:00 min
        [INFO] Finished at: 2024-04-16T12:57:38-07:00
        [INFO] -
    

    mvn test -o -Dtest=TestPerInstanceAccessor -pl=helix-rest

        [INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 219.352 s - in org.apache.helix.rest.server.TestPerInstanceAccessor
        [INFO] 
        [INFO] Results:
        [INFO] 
        [INFO] Tests run: 24, Failures: 0, Errors: 0, Skipped: 0
        [INFO] 
        [INFO] 
        [INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-rest ---
        [INFO] Loading execution data file /Users/csudhars/Apr15Helix-ForkedHelix/helix/helix-rest/target/jacoco.exec
        [INFO] Analyzed bundle 'Apache Helix :: Restful Interface' with 95 classes
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time:  03:52 min
        [INFO] Finished at: 2024-04-16T13:03:56-07:00
        [INFO] -
    

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)


@junkaixue junkaixue left a comment


Overall, I think we should make the logic similar to resetPartition. That would simplify a lot of the assumptions without adding much logic there.

@csudharsanan csudharsanan force-pushed the csudhars/SetPartitionToError branch 3 times, most recently from 01e180e to 1dddca9 Compare May 1, 2024 21:37
@csudharsanan csudharsanan force-pushed the csudhars/SetPartitionToError branch from 1dddca9 to d519193 Compare May 7, 2024 00:14

@zpinto zpinto left a comment


Nice work!

All looks good to me, but I want to follow up on one thing. I remember we saw some exceptions being thrown when trying to create sensors for * -> ERROR state transition metrics. Can we also resolve this issue in the PR? I think this may be a long-standing issue for * -> * transitions as well. It would be good to fix it now, since users of helix-agent will want those metrics, as will users of * -> ERROR.


@MarkGaox MarkGaox left a comment


LGTM

@csudharsanan csudharsanan force-pushed the csudhars/SetPartitionToError branch from d519193 to 6210d78 Compare May 7, 2024 19:36
@csudharsanan csudharsanan force-pushed the csudhars/SetPartitionToError branch from 6210d78 to c3c2cee Compare May 7, 2024 19:58
@csudharsanan

Fixed the mbean issue in HelixTask. It now supports * -> * transitions. Since this wasn't failing any tests, I'm adding some logs from before and after the fix.

Before:

Start zookeeper at localhost:2183 in thread main
START TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:28:02 PDT 2024
true: wait 332ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
javax.management.RuntimeOperationsException
        at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:298)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1848)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:945)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:880)
        at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:315)
        at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:523)
        at org.apache.helix.monitoring.mbeans.MBeanRegistrar.register(MBeanRegistrar.java:60)
        at org.apache.helix.monitoring.mbeans.dynamicMBeans.DynamicMBeanProvider.doRegister(DynamicMBeanProvider.java:89)
        at org.apache.helix.monitoring.mbeans.dynamicMBeans.DynamicMBeanProvider.doRegister(DynamicMBeanProvider.java:95)
        at org.apache.helix.monitoring.mbeans.StateTransitionStatMonitor.register(StateTransitionStatMonitor.java:83)
        at org.apache.helix.monitoring.mbeans.ParticipantStatusMonitor.reportTransitionStat(ParticipantStatusMonitor.java:113)
        at org.apache.helix.messaging.handling.HelixTask.reportMessageStat(HelixTask.java:335)
        at org.apache.helix.messaging.handling.HelixTask.finalCleanup(HelixTask.java:386)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:185)
        at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.IllegalArgumentException: Repository: cannot add mbean for pattern name CLMParticipantReport:Cluster=TestSetPartitionsToErrorState_testSetPartitionsToErrorState,Transition=*--ERROR
        ... 19 more
true: wait 233ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 216ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
16468 [ZkClient-EventThread-162-localhost:2183] ERROR org.apache.helix.messaging.handling.HelixTaskExecutor [] - Message xyz cannot be processed: ***, {CREATE_TIMESTAMP=1715110092791, FROM_STATE=*, MSG_ID=***, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TestDB0_7, RESOURCE_NAME=TestDB0, SRC_NAME=*****, STATE_MODEL_DEF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=localhost_12918, TGT_SESSION_ID=***, TO_STATE=ERROR}{}{}Partition TestDB0_7 current state is same as toState (*->ERROR) from message.
true: wait 53ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
END TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:28:15 PDT 2024

After:



Start zookeeper at localhost:2183 in thread main
START TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:23:24 PDT 2024
true: wait 302ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 202ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
true: wait 185ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
16489 [ZkClient-EventThread-162-localhost:2183] ERROR org.apache.helix.messaging.handling.HelixTaskExecutor [] - Message xyz cannot be processed: ***, {CREATE_TIMESTAMP=1715109814097, FROM_STATE=*, MSG_ID=***, MSG_STATE=new, MSG_TYPE=STATE_TRANSITION, PARTITION_NAME=TestDB0_7, RESOURCE_NAME=TestDB0, SRC_NAME=*****, STATE_MODEL_DEF=MasterSlave, STATE_MODEL_FACTORY_NAME=DEFAULT, TGT_NAME=localhost_12918, TGT_SESSION_ID=***, TO_STATE=ERROR}{}{}Partition TestDB0_7 current state is same as toState (*->ERROR) from message.
true: wait 51ms, ClusterStateVerifier$BestPossAndExtViewZkVerifier(TestSetPartitionsToErrorState_testSetPartitionsToErrorState@localhost:2183)
END TestSetPartitionsToErrorState_testSetPartitionsToErrorState at Tue May 07 12:23:36 PDT 2024
AfterClass: TestSetPartitionsToErrorState called.
Shut down zookeeper at port 2183 in thread main
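
The exception in the "Before" log is raised because JMX treats an unquoted `*` in a key property value as a wildcard, so a name ending in `Transition=*--ERROR` is a pattern and cannot be registered as an MBean. The sketch below only demonstrates that behavior with the standard `javax.management.ObjectName` API. The helper names are hypothetical, and quoting the value with `ObjectName.quote` is one possible way to get a literal name; it is not necessarily what the HelixTask fix does.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Illustration only: hypothetical helpers, not the code changed in this PR.
public class TransitionBeanNameSketch {
  private static final String DOMAIN = "CLMParticipantReport";

  // Builds the name with the transition value left unquoted, as in the "Before" log.
  public static ObjectName rawName(String cluster, String from, String to) {
    return build(DOMAIN + ":Cluster=" + cluster + ",Transition=" + from + "--" + to);
  }

  // Builds the name with the transition value quoted; ObjectName.quote escapes '*' and '?'.
  public static ObjectName quotedName(String cluster, String from, String to) {
    return build(DOMAIN + ":Cluster=" + cluster + ",Transition=" + ObjectName.quote(from + "--" + to));
  }

  private static ObjectName build(String name) {
    try {
      return new ObjectName(name);
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException("Invalid JMX name: " + name, e);
    }
  }

  public static void main(String[] args) {
    ObjectName raw = rawName("demo", "*", "ERROR");
    ObjectName quoted = quotedName("demo", "*", "ERROR");
    // A pattern name can be used in queries but cannot be registered as an MBean.
    System.out.println(raw + " isPattern=" + raw.isPattern());
    System.out.println(quoted + " isPattern=" + quoted.isPattern());
  }
}
```

Replacing the wildcard with a literal token before building the name would work as well; the point is only that the registered name must not be a pattern.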


@junkaixue junkaixue left a comment


LGTM. Please make sure the last commit message is appended.

@csudharsanan

csudharsanan commented May 7, 2024

This PR is ready to be merged. It adds the SetPartitionToError endpoint for participants to self-annotate a node to the ERROR state.

@junkaixue junkaixue merged commit 1d47d6b into apache:master May 8, 2024
1 of 2 checks passed
5 participants