Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[apache/helix] -- Fixes #2646 Optimize WagedInstanceCapacity Calculation to improve Helix Controller Pipeline #2649

Conversation

himanshukandwal
Copy link
Contributor

@himanshukandwal himanshukandwal commented Oct 9, 2023

Issues

Description

WagedInstanceCapacity data-structure is computed every time during a pipeline run and in case of large clusters, this computation takes ~80% of total time and hence, it's important to improve its performance.

  • Here are some details about my PR, including screenshots of any UI changes:

In this PR, we are performing following changes:

  • Created WagedInstanceCapacityManager class to cleanly factor-out the operations on WagedInstanceCapacity cache. It contains the condition on when the full build of cache should or shouldn't happen. With this we are skipping the full cache building in case of TaskCurrentStateChange, MessageChange, and Resume events.
  • In WagedInstanceCapacity, we are creating merged ResourceConfig once to determine the partition weights for all the partitions and not every time for each partition.

Tests

  • TestWagedInstanceCapacityManager
  • testProcessEventWithNoWagedResources
  • testProcessEventWithSomeWagedResources
  • testProcessEventWithAllWagedResources
  • testShouldNoOp
  • TestWagedInstanceCapacity
  • testProcessCurrentState
  • testProcessCurrentStateWithDoubleCharge
  • testProcessCurrentStateWithUnableToAssignPart
  • TestPipelinePerformance
  • testWagedInstanceCapacityCalculationPerformance
  • The following is the result of the "mvn test" command on the appropriate module:
mvn clean install -Dmaven.test.skip.exec=true && mvn test -o -Dtest=TestWagedInstanceCapacity,TestWagedInstanceCapacityManager -pl=helix-core

1425 [main] ERROR org.apache.helix.model.IdealState [] - could NOT determine replicas. set to 0
1438 [main] ERROR org.apache.helix.model.IdealState [] - could NOT determine replicas. set to 0
1447 [main] ERROR org.apache.helix.model.IdealState [] - could NOT determine replicas. set to 0
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.21 s - in TestSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- jacoco:0.8.6:report (generate-code-coverage-report) @ helix-core ---
[INFO] Loading execution data file /Users/hkandwal/Documents/workspaces/projects/helix_os_hk/helix-core/target/jacoco.exec
[INFO] Analyzed bundle 'Apache Helix :: Core' with 949 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  5.627 s
[INFO] Finished at: 2023-10-09T15:00:30-07:00
[INFO] ------------------------------------------------------------------------

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

Copy link
Contributor

@desaikomal desaikomal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments. will do second pass later. THANKS a lot for working.
wondering if we can have some profile numbers

@himanshukandwal
Copy link
Contributor Author

Initial comments. will do second pass later. THANKS a lot for working. wondering if we can have some profile numbers

Yes, this is quite important. I am working on the Integ test and will share the numbers.

@himanshukandwal
Copy link
Contributor Author

Initial comments. will do second pass later. THANKS a lot for working. wondering if we can have some profile numbers

Hey @desaikomal here are the results from the performance testing (CurrentStateComputationStage):

durationWithComputation: 502, durationWithoutComputation: 32, pctDecrease: 93.0
durationWithComputation: 996, durationWithoutComputation: 29, pctDecrease: 97.0

@himanshukandwal himanshukandwal force-pushed the hkandwal/optimize-calculation-of-resource-partition-weights branch from d6eb518 to 0d8c202 Compare October 13, 2023 16:41
Copy link
Contributor

@junkaixue junkaixue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to chat offline... I have few questions about this PR.

Copy link
Contributor

@desaikomal desaikomal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, performance number looks good. also please update the change description as events are different. once you address Junkai's comment, will review it last time. almost there.

@himanshukandwal himanshukandwal force-pushed the hkandwal/optimize-calculation-of-resource-partition-weights branch from 0d8c202 to 6430b59 Compare October 18, 2023 07:13
…apacity Calculation to improve Helix Controller Pipeline
…apacity Calculation to improve Helix Controller Pipeline
…apacity Calculation to improve Helix Controller Pipeline
@himanshukandwal
Copy link
Contributor Author

This PR is approved by @junkaixue and is ready to be merged.

Final commit message:
WagedInstanceCapacity data-structure is computed every time during a pipeline run and in case of large clusters, this computation takes ~80% of total time. Hence, in this PR we are skipping certain cluster events when this cache should not be rebuild, and improving the overall runtime.

@xyuanlu xyuanlu merged commit bd1d28c into apache:master Oct 23, 2023
2 checks passed
@himanshukandwal himanshukandwal changed the title [apache/helix] -- Fixes #2646 (Part-1), Optimize WagedInstanceCapacity Calculation to improve Helix Controller Pipeline [apache/helix] -- Fixes #2646 Optimize WagedInstanceCapacity Calculation to improve Helix Controller Pipeline May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants