Cluster Change Detector for Helix Rebalancer
Note: the actual implementation may differ from what's outlined in this document.
This document outlines the design and details the implementation of the cluster change detector for Helix rebalancers.
The distributed nature of the applications Helix manages requires the Helix controller to rebalance in response to the various scenarios and changes that take place in such systems. Currently, Helix uses ZooKeeper's child/data change callbacks to be notified of changes happening around the cluster. Cluster Change Detector aims to become the central component in which the various changes/callbacks/notifications are resolved, so that Helix's rebalancer is efficiently told when rebalancing is needed.
Currently, the Controller relies on callbacks generated by ZooKeeper Watchers to trigger the rebalancing pipeline. However, depending on the kind of change, no rebalancing may be needed at all, and there are types of rebalancing that Helix performs in parallel to the original controller pipeline. It has become evident that Helix rebalancers should not react directly to every change in the cluster; rather, they should only be triggered by relevant changes that actually require cluster rebalancing. This means that Helix's rebalancers can no longer rely blindly on callbacks; instead, we need a component that can intelligently tell the rebalancer when to rebalance, filtering out the various sources of noise in the cluster.
Once ready, Helix's rebalancers will rely on Cluster Change Detector's APIs to determine whether a rebalance is needed.
Cluster Change Detector is a critical component for the next-generation rebalancer for Helix: The New Helix Rebalancer: Weight-Aware Globally-Even Distribute Rebalancer.
The primary objectives of Cluster Change Detector are the following:
- Detect any changes happening around the cluster so that the rebalancer doesn't have to react to changes directly. *We want a clear separation of responsibility where the rebalancer purely does rebalancing and change detection comes from a separate, independent component.
- Determine what kind of rebalance is needed and which resources/partitions/replicas are affected. *Previously, Helix in FULL-AUTO mode would trigger a rebalance for every event that entered the pipeline. This caused unnecessary rebalances, which increased latency and produced redundant computation. Cluster Change Detector aims to solve this problem.
- Provide an enhanced audit log for changes. *Helix outputs a lot of logs, sometimes to the point that they are not useful. Cluster Change Detector will be the central place for Helix-related audit logs, and the logs will contain relevant information about the changes happening in the cluster, which should aid debugging.
- (Optional) Detect changes as fast as possible to reduce reaction time. *Helix's FULL-AUTO rebalancer relies on the creation of pipeline events that are queued as they come in, which means a slow pipeline run can increase the rebalancer's reaction time.
There are two types of changes: permanent and transient.
Permanent changes alter the nature of the cluster. The following are example scenarios:
- Helix Participants added/removed
- Participants' configs (for example, fault zone, capacity, traffic load, etc.) changed
- Resources added/removed
Transient changes do not alter the nature of the cluster. They include common failure scenarios experienced by distributed systems, such as network issues, hardware issues, and connection loss. In other words, nodes in a cluster may come and go. In Helix, this translates to a LiveInstance change.
With the possible types of changes defined, we can now move on to what actually needs to be done about these changes. Any change will require Helix to take action; that is, Helix will trigger state transitions to temporarily accommodate such changes. These reactive state transitions have to be sent out quickly (preferably within milliseconds) to prevent situations like masterless partitions. They are makeshift transitions, calculated on the fly and optimized for speed rather than global optimality. We will call this partial rebalance.
On the other hand, there may be partition assignments that are more ideal than the result of a partial rebalance - more ideal in the sense that a set of partition mappings may exist that is more evenly distributed (when all constraints are accounted for). However, finding such an ideal, or good-enough, assignment takes more time because the calculation is more involved. We will refer to the computation of this more globally-optimized set of mappings as global baseline calculation.
The following summarizes what kind of rebalance would be needed by change type:
| Change Type | Change Source | Rebalance Needed |
| --- | --- | --- |
| Permanent | ClusterConfig | Global Baseline Calculation + Partial Rebalance |
| Permanent | InstanceConfig | Global Baseline Calculation + Partial Rebalance |
| Permanent | IdealStates/ResourceConfig | Global Baseline Calculation + Partial Rebalance |
| Transient | LiveInstance | Partial Rebalance |
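To make the mapping concrete, the sketch below encodes the table as a simple lookup. The ChangeType and RebalanceType enums and their values are hypothetical names chosen for illustration; they are not the actual Helix API.

```java
import java.util.EnumSet;
import java.util.Set;

public final class RebalanceScopeResolver {

  // Hypothetical change types mirroring the change sources in the table above.
  public enum ChangeType {
    CLUSTER_CONFIG, INSTANCE_CONFIG, IDEAL_STATE, RESOURCE_CONFIG, LIVE_INSTANCE
  }

  // Hypothetical rebalance categories derived from the table above.
  public enum RebalanceType {
    GLOBAL_BASELINE_CALCULATION, // expensive, globally-optimized recomputation
    PARTIAL_REBALANCE            // fast, on-the-fly fix-up
  }

  /**
   * Maps a detected change type to the rebalance work it requires: permanent
   * changes need both the global baseline calculation and a partial rebalance,
   * while transient (LiveInstance) changes only need a partial rebalance.
   */
  public static Set<RebalanceType> rebalancesFor(ChangeType changeType) {
    switch (changeType) {
      case CLUSTER_CONFIG:
      case INSTANCE_CONFIG:
      case IDEAL_STATE:
      case RESOURCE_CONFIG:
        return EnumSet.of(RebalanceType.GLOBAL_BASELINE_CALCULATION,
            RebalanceType.PARTIAL_REBALANCE);
      case LIVE_INSTANCE:
        return EnumSet.of(RebalanceType.PARTIAL_REBALANCE);
      default:
        throw new IllegalArgumentException("Unhandled change type: " + changeType);
    }
  }
}
```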
There are two types of logging that are crucial for maintaining online clusters: 1) what changes were made to the cluster by external entities (such as an operator action or a connection loss), and 2) what changes Helix is making to the cluster internally (for example, moving partitions to react to external changes).
We will have an additional stage in which the change detector reconciles the difference between the cache created from the previous controller pipeline run and the cache created from the current run.
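As a rough illustration of that reconciliation step, the sketch below diffs two snapshots represented as maps from component name to a version (or content hash). The snapshot representation is an assumption made for illustration; the real pipeline caches are richer Helix data structures.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class SnapshotDiff {

  /**
   * Returns the names of components that were added, removed, or modified
   * between the snapshot from the previous pipeline run and the current one.
   * Keys are component names; values are assumed version strings or hashes.
   */
  public static Set<String> diff(Map<String, String> previous, Map<String, String> current) {
    Set<String> changed = new HashSet<>();
    // Components that were added or modified in the current snapshot.
    for (Map.Entry<String, String> entry : current.entrySet()) {
      if (!entry.getValue().equals(previous.get(entry.getKey()))) {
        changed.add(entry.getKey());
      }
    }
    // Components that were removed since the previous snapshot.
    for (String name : previous.keySet()) {
      if (!current.containsKey(name)) {
        changed.add(name);
      }
    }
    return changed;
  }
}
```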
An additional consideration at implementation time could be to make the stages in the dataProcess pipeline run asynchronously, because they do not depend on each other. This is an optional step because these pipeline stages only involve in-memory computation and are not expected to be a latency bottleneck in the Controller pipeline.
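If that asynchronous option were pursued, the independent stages could be fanned out and joined along the lines of the sketch below. Representing stages as plain Runnables is an assumption for illustration; the actual pipeline stages have their own interface.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class AsyncStageRunner {

  /**
   * Runs mutually independent pipeline stages concurrently and waits for all
   * of them to finish before returning, so the pipeline can proceed to its
   * next (dependent) stage.
   */
  public static void runIndependentStages(List<Runnable> stages) {
    if (stages.isEmpty()) {
      return;
    }
    ExecutorService executor = Executors.newFixedThreadPool(stages.size());
    try {
      CompletableFuture<?>[] futures = stages.stream()
          .map(stage -> CompletableFuture.runAsync(stage, executor))
          .toArray(CompletableFuture[]::new);
      // Block until every independent stage has completed.
      CompletableFuture.allOf(futures).join();
    } finally {
      executor.shutdown();
    }
  }
}
```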
The APIs listed here are loosely defined; that is, they are subject to change during implementation.
```java
public class ClusterChangeDetector {

  public ClusterChangeDetector() {}

  /**
   * Returns all change types detected during the ClusterDetection stage.
   */
  public Set<ChangeType> getChangeTypes();

  /**
   * Returns a set of the names of components that changed based on the given change type.
   */
  public Set<String> getChangesBasedOnType(ChangeType changeType);
}
```
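For illustration, a rebalancer might consume these APIs along the following lines. The ChangeType values and the decision logic are assumptions based on the rebalance-type table above, not a confirmed implementation.

```java
// Hypothetical usage from a rebalancer's point of view; ChangeType values are
// assumed to mirror the change sources discussed earlier in this document.
void onPipelineIteration(ClusterChangeDetector changeDetector) {
  Set<ChangeType> changes = changeDetector.getChangeTypes();
  if (changes.contains(ChangeType.LIVE_INSTANCE)) {
    // Transient change: run a fast partial rebalance scoped to the affected instances.
    Set<String> changedInstances =
        changeDetector.getChangesBasedOnType(ChangeType.LIVE_INSTANCE);
    // ... trigger a partial rebalance for changedInstances ...
  }
  if (changes.stream().anyMatch(type -> type != ChangeType.LIVE_INSTANCE)) {
    // Permanent change: also recompute the global baseline.
    // ... trigger the global baseline calculation, followed by a partial rebalance ...
  }
}
```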
In every iteration of the Helix Controller pipeline, the cluster change detector will run its change-detection logic in the ChangeDetector stage. During that stage, Helix will log what types of changes were detected. Note that the changes referred to in this section will not contain individual state or details, such as a listing of all names of changed instances and partitions; they will only cover changes to the cluster topology. Logging cluster information at that level of detail would be too verbose and would pollute the log.
Helix could emit inGraph metrics for the aforementioned changes for easier monitoring. This will provide insight into how frequently a given cluster undergoes topology changes, and both Helix developers and application teams will more easily be able to tell how often changes take place and what kind they are. This information is currently not available and will be useful in maintaining clusters.
An alternative design considered was to implement IZkChildListener and/or IZkDataListener in order to bypass Helix's controller event queue, with the expected result of faster detection of permanent/transient changes. We will not go this route because we agreed that Cluster Change Detector does not need a faster cadence than the rebalancer: the speed of rebalancing is capped by how fast Helix processes events, regardless of how fast Cluster Change Detector detects changes.