Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build Topology with only required levels (FaultZone and EndNode) #2712

Conversation

zpinto
Copy link
Contributor

@zpinto zpinto commented Dec 13, 2023

Issues

  • Change all rebalancer strategies to create Topology without additional non-FaultZone or EndNode levels of the tree. This will allow for swap to work in clusters where the non-FaultZone or EndNode domain kv pairs don't directly match the swapping node. Helix Node/Instance Swap #2662

Description

  • When attempting to use Swap in clusters that had more complex topologies with levels that were not fault zone or end node, it was discovered that Swap assignment was being computed correctly. This is because the levels and all nodes in the Topology tree affect the assignment in CRUSH and CRUSHED. This changes the behavior of those rebalancer strategies to only create levels in the Topology tree for FaultZone and EndNode. This is the only thing required as the topology aware placement should only be done based on FaultZone.

Tests

  • Updated TestInstanceOperation to reproduce issues seen in testing on production like cluster. After topology changes, those tests pass.
  • Used current changes to deploy local controller for CRUSHED and WAGED clusters copied from production. Successfully carried out SWAP operation for both.

Changes that Break Backward Compatibility (Optional)

Changes to Topology class are not backwards compatible; however, the rebalancer will have different behavior and produce new assignment for existing clusters using more than a 2-layer topology. This constitutes bumping minor version and calling out in release notes in next open source release.

Documentation (Optional)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

…l non-FaultZone or EndNode levels of the tree. This will allow for swap to work in clusters where the non-FaultZone or EndNode domain kv pairs don't directly match the swapping node.
@zpinto zpinto marked this pull request as ready for review December 13, 2023 04:30
…ate same instance config. Changes to copy getAssignableLiveInstances keySet to avoid ConcurrentModificationException.
public Topology(final List<String> allNodes, final List<String> liveNodes,
final Map<String, InstanceConfig> instanceConfigMap, ClusterConfig clusterConfig) {
final Map<String, InstanceConfig> instanceConfigMap, ClusterConfig clusterConfig,
boolean faultZoneLevelOnly) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, Please correct me if I cam wrong, I did not find anywhere we are using the default constructor. We always set faultZoneLevelOnly to true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct; however, this constructor is public and may be used outside our code base by others.

This will keep backwards compatibility, while only changing the behavior of the rebalancer to limit the backwards incompatibility concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should always keep the default constructor.

@zpinto
Copy link
Contributor Author

zpinto commented Dec 13, 2023

Failed CI due to flaky test testDisconnectWhenConnectionBreak #2691

This PR is ready to be merged, thanks for the review.

Final commit message:
Change all rebalancer strategies to create Topology without additional non-FaultZone or EndNode levels of the tree.

@xyuanlu xyuanlu merged commit 5d6396e into apache:ApplicationClusterManager Dec 13, 2023
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants