Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge feature branch - Application cluster manager #2714

Merged
merged 11 commits into from
Dec 20, 2023

Conversation

xyuanlu
Copy link
Contributor

@xyuanlu xyuanlu commented Dec 18, 2023

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

#2662
#2655

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Merge feature branch ApplicationClusterManager, including the following changes.
Refactor stoppable check logic for enhanced zone analysis (https://github.com/apache/helix/pull/2654[)](https://github.com/apache/helix/commit/23eba7d9fcc4c979bf02f0872412af92343a1e99)

@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
Implement the cross-zone-based stoppable check (#2680)

@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
HelixAdmin APIs and pipeline changes to support Helix Node Swap (#2661)

@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Enhanced stoppable checks with node evacuation filtering and introduc…

@MarkGaox
MarkGaox authored and Xiaoyuan Lu committed 5 days ago
Change canCompleteSwap and completeSwapIfPossible to return json with…

@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Expose Evacuate Finished API in Helix-Rest (#2694)

@GrantPSpencer
GrantPSpencer authored and Xiaoyuan Lu committed 5 days ago
Fix WAGED to only use logicalId when computing baseline and centraliz…

@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Make logic to determine state of replicas on SWAP_IN instance simpler…

@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Prevent the spectator routing table from containing SWAP_IN instances.(…

@zpinto
zpinto authored and Xiaoyuan Lu committed 5 days ago
Commits on Dec 14, 2023
Build Topology with only required levels (FaultZone and EndNode) (#2713)

Tests

  • The following tests are written for this issue:

Please refer to original PRs

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

MarkGaox and others added 11 commits December 13, 2023 12:47
Refactor the zone-based stoppable check logic and add support to randomly select zone order for the zone-based stoppable check.
Implement the cross-zone-based stoppable check and add to_be_stopped_instances query parameter to the stoppable check API
Add ability for up to 2 nodes with the same logicalId to be added to the cluster at the same time when a SWAP is happening.
During all paritionAssignment for WAGED and DelayedAutoRebalancer, we select just one instance for each logicalId. Achieves n -> n+1 for all replicas on SWAP_OUT node and back n when SWAP is marked complete, making it cancelable.

Adding and updating Helix Admin APIs to support swap operation:

setInstanceOperation
addInstance
canCompleteSwap
completeSwapIfPossible
* Refactor sanity checks for HelixAdmin swap APIs.

* Helix Node Swap pipeline changes and integration tests.

* Fix integration tests to properly restore stopped MockParticipant so following tests are not affected.

* Add comments and docstrings.

* Fix tests to clean up after themselves.

* Optimize duplicate logicalId filtering to only be called on allNodes and then used to remove duplicate logicalIds from enabledLiveNodes.

* Add handling for clusterConfig == null in updateSwappingInstances and fix AssignableNode to check for clusterTopologyConfig when attempting to get logicalId.

* Fix integ tests.

* Fix testGetDomainInformation since we no longer allow an instance to join the cluster with an invalid DOMAIN field.

* Add checks to ensure that the SWAP_IN instance has a matching FAULT_ZONE and matching INSTANCE_CAPACITY_MAP to SWAP_OUT node.

* Rename canSwapBeCompleted to canCompleteSwap.

* Add sanity checks to allow SWAP_IN node to join the cluster in disabled state before SWAP_OUT node has instance operation set.

* Fix print in test case.

* Add canCompleteSwap to PerInstanceAccessor and fix formatting.

* Fix flaky node swap after completion by making sure replica has is computed with logicalIds intead of instanceNames.
…ed blacklisting capabilities (#2687)

Enhanced stoppable checks with node evacuation filtering and introduced blacklisting capabilities
… kv pair for result of check or attempt to complete swap. (#2697)
…e picking assignable instances in the cache. (#2702)

Fix WAGED to only use logicalId when computing baseline and centralize picking assignable instances in the cache.
… and more predictable during an in-flight node swap. (#2706)
…#2710)

Prevent the spectator routing table from containing SWAP_IN instances.
* Change all rebalancer strategies to create Topology without additional non-FaultZone or EndNode levels of the tree. This will allow for swap to work in clusters where the non-FaultZone or EndNode domain kv pairs don't directly match the swapping node.
* Stabilize TestInstanceOperation. clusterVerifier is evaluating to true once the partitionAssignment matches the expected value; however, it is before the TopState is transfered back to the SWAP_IN node. This can be fixed by using TestHelper.verify to check that states converge within TIMEOUT.

* Moved evacuate tests with long ST resources to the end because it was taking long time to drop DBs which was causing flakyness in later tests. Ran TestInstanceOperation 5 times locally with success.
Copy link
Contributor

@zpinto zpinto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

One failing test seems to be flaky.

@xyuanlu xyuanlu merged commit 8cfe977 into master Dec 20, 2023
1 of 2 checks passed
@zpinto zpinto mentioned this pull request Dec 20, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants