Cluster Change Detector for Helix Rebalancer
Note: the actual implementation may differ from what's outlined in this document.
This document outlines the design and details the implementation of the cluster change detector for Helix rebalancers.
The distributed nature of the applications Helix manages requires the Helix controller to rebalance in response to the various scenarios and changes that take place in such systems. Currently, Helix uses ZooKeeper's child/data change callbacks to be notified of changes happening around the cluster. Cluster Change Detector aims to become the central component in which the various changes/callbacks/notifications are resolved, so that Helix's rebalancer is efficiently told when rebalancing is needed.
Currently, the Controller relies on callbacks generated by ZooKeeper Watchers to trigger the rebalancing pipeline. However, depending on the kind of change, no rebalancing may be needed at all, and there are types of rebalancing that Helix performs in parallel to the original controller pipeline. It has become evident that Helix rebalancers should not react directly to every change in the cluster; rather, they should only be triggered by relevant changes that actually require cluster rebalancing. This means that Helix's rebalancers can no longer rely blindly on callbacks; instead, we need a component that can intelligently tell the rebalancer when to rebalance, filtering out the various sources of noise in the cluster.
Once ready, Helix's rebalancers will rely on Cluster Change Detector's APIs to determine whether a rebalance is needed.
Cluster Change Detector is a critical component for the next-generation rebalancer for Helix: The New Helix Rebalancer: Weight-Aware Globally-Even Distribute Rebalancer.
The primary objectives of Cluster Change Detector are the following:
- Detect any changes happening around the cluster so that the rebalancer doesn't have to react to changes directly. *We want a clear separation of responsibility where the rebalancer purely does rebalancing and change detection comes from a separate, independent component.
- Determine what kind of rebalance is needed and which resources/partitions/replicas are affected. *Previously, Helix in FULL-AUTO mode would trigger a rebalance for every event that entered the pipeline. This caused unnecessary rebalances, which increased latency and produced redundant computation. Cluster Change Detector aims to solve this problem.
- Provide an enhanced audit log for changes. *Helix outputs a lot of logs, sometimes to the point that they are not useful. Cluster Change Detector will be the central place for Helix-related audit logs, and the logs will contain relevant information about the changes happening in the cluster, which should aid debugging.
- (Optional) Detect changes as fast as possible to reduce reaction time. *Helix's FULL-AUTO rebalancer relies on the creation of pipeline events that are queued as they come in, which means a slow pipeline run can increase the rebalancer's reaction time.
There are two types of changes: permanent and transient.
Permanent changes alter the nature of the cluster. The following are example scenarios:
- Helix Participants added/removed
- Participants' configs (for example, fault zone, capacity, traffic load, etc.) changed
- Resources added/removed
Transient changes do not alter the nature of the cluster. They include common failure scenarios experienced by distributed systems, such as network issues, hardware issues, and connection loss. In other words, nodes in a cluster may come and go. In Helix, this translates to a LiveInstance change.
With the possible types of changes defined, we can now move on to what actually needs to be done about these changes. Any change will require Helix to take action; that is, Helix will trigger state transitions to temporarily accommodate such changes. These reactive state transitions have to be sent out quickly (preferably within milliseconds) to prevent situations like masterless partitions. They are makeshift transitions, calculated on the fly and optimized for speed rather than global optimality. We will call this partial rebalance.
On the other hand, there may be partition assignments that are more ideal than the result of a partial rebalance - more ideal in the sense that a set of partition mappings may exist that is more evenly distributed (when all constraints are accounted for). However, finding such an ideal, or good-enough, assignment takes more time because the calculation is more involved. We will refer to the computation of this more globally-optimized set of mappings as global baseline calculation.
The following summarizes what kind of rebalance would be needed by change type:
| Change Type | Change Source | Rebalance Needed |
| --- | --- | --- |
| Permanent | ClusterConfig | Global Baseline Calculation + Partial Rebalance |
| Permanent | InstanceConfig | Global Baseline Calculation + Partial Rebalance |
| Permanent | IdealStates/ResourceConfig | Global Baseline Calculation + Partial Rebalance |
| Transient | LiveInstance | Partial Rebalance |
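To make the mapping concrete, the sketch below encodes the table as a simple lookup. The ChangeType and RebalanceType enums and their values are hypothetical names chosen for illustration; they are not the actual Helix API.

```java
import java.util.EnumSet;
import java.util.Set;

public final class RebalanceScopeResolver {

  // Hypothetical change types mirroring the change sources in the table above.
  public enum ChangeType {
    CLUSTER_CONFIG, INSTANCE_CONFIG, IDEAL_STATE, RESOURCE_CONFIG, LIVE_INSTANCE
  }

  // Hypothetical rebalance categories derived from the table above.
  public enum RebalanceType {
    GLOBAL_BASELINE_CALCULATION, // expensive, globally-optimized recomputation
    PARTIAL_REBALANCE            // fast, on-the-fly fix-up
  }

  /**
   * Maps a detected change type to the rebalance work it requires: permanent
   * changes need both the global baseline calculation and a partial rebalance,
   * while transient (LiveInstance) changes only need a partial rebalance.
   */
  public static Set<RebalanceType> rebalancesFor(ChangeType changeType) {
    switch (changeType) {
      case CLUSTER_CONFIG:
      case INSTANCE_CONFIG:
      case IDEAL_STATE:
      case RESOURCE_CONFIG:
        return EnumSet.of(RebalanceType.GLOBAL_BASELINE_CALCULATION,
            RebalanceType.PARTIAL_REBALANCE);
      case LIVE_INSTANCE:
        return EnumSet.of(RebalanceType.PARTIAL_REBALANCE);
      default:
        throw new IllegalArgumentException("Unhandled change type: " + changeType);
    }
  }
}
```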
There are two types of logging that are crucial for maintaining online clusters: 1) what changes were made to the cluster by external entities (such as an operator action or a connection loss), and 2) what changes Helix is making to the cluster internally (for example, moving partitions to react to external changes).
We will have an additional stage in which the change detector reconciles the difference between the cache created from the previous controller pipeline run and the cache created from the current run.
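As a rough illustration of that reconciliation step, the sketch below diffs two snapshots represented as maps from component name to a version (or content hash). The snapshot representation is an assumption made for illustration; the real pipeline caches are richer Helix data structures.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class SnapshotDiff {

  /**
   * Returns the names of components that were added, removed, or modified
   * between the snapshot from the previous pipeline run and the current one.
   * Keys are component names; values are assumed version strings or hashes.
   */
  public static Set<String> diff(Map<String, String> previous, Map<String, String> current) {
    Set<String> changed = new HashSet<>();
    // Components that were added or modified in the current snapshot.
    for (Map.Entry<String, String> entry : current.entrySet()) {
      if (!entry.getValue().equals(previous.get(entry.getKey()))) {
        changed.add(entry.getKey());
      }
    }
    // Components that were removed since the previous snapshot.
    for (String name : previous.keySet()) {
      if (!current.containsKey(name)) {
        changed.add(name);
      }
    }
    return changed;
  }
}
```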
An additional consideration at implementation time could be to make the stages in the dataProcess pipeline run asynchronously, because they do not depend on each other. This is an optional step because these pipeline stages only involve in-memory computation and are not expected to be a latency bottleneck in the Controller pipeline.
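If that asynchronous option were pursued, the independent stages could be fanned out and joined along the lines of the sketch below. Representing stages as plain Runnables is an assumption for illustration; the actual pipeline stages have their own interface.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class AsyncStageRunner {

  /**
   * Runs mutually independent pipeline stages concurrently and waits for all
   * of them to finish before returning, so the pipeline can proceed to its
   * next (dependent) stage.
   */
  public static void runIndependentStages(List<Runnable> stages) {
    if (stages.isEmpty()) {
      return;
    }
    ExecutorService executor = Executors.newFixedThreadPool(stages.size());
    try {
      CompletableFuture<?>[] futures = stages.stream()
          .map(stage -> CompletableFuture.runAsync(stage, executor))
          .toArray(CompletableFuture[]::new);
      // Block until every independent stage has completed.
      CompletableFuture.allOf(futures).join();
    } finally {
      executor.shutdown();
    }
  }
}
```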
The APIs listed here are loosely defined; that is, they are subject to change during implementation.
```java
public class ClusterChangeDetector {

  public ClusterChangeDetector() {}

  /**
   * Returns all change types detected during the ClusterDetection stage.
   */
  public Set<ChangeType> getChangeTypes();

  /**
   * Returns a set of the names of components that changed based on the given change type.
   */
  public Set<String> getChangesBasedOnType(ChangeType changeType);
}
```
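For illustration, a rebalancer might consume these APIs along the following lines. The ChangeType values and the decision logic are assumptions based on the rebalance-type table above, not a confirmed implementation.

```java
// Hypothetical usage from a rebalancer's point of view; ChangeType values are
// assumed to mirror the change sources discussed earlier in this document.
void onPipelineIteration(ClusterChangeDetector changeDetector) {
  Set<ChangeType> changes = changeDetector.getChangeTypes();
  if (changes.contains(ChangeType.LIVE_INSTANCE)) {
    // Transient change: run a fast partial rebalance scoped to the affected instances.
    Set<String> changedInstances =
        changeDetector.getChangesBasedOnType(ChangeType.LIVE_INSTANCE);
    // ... trigger a partial rebalance for changedInstances ...
  }
  if (changes.stream().anyMatch(type -> type != ChangeType.LIVE_INSTANCE)) {
    // Permanent change: also recompute the global baseline.
    // ... trigger the global baseline calculation, followed by a partial rebalance ...
  }
}
```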
In every iteration of the Helix Controller pipeline, the cluster change detector will run its change-detection logic in the ChangeDetector stage. During that stage, Helix will log what types of changes were detected. Note that the changes referred to in this section will not contain individual state or details, such as a listing of all names of changed instances and partitions; they will only cover changes to the cluster topology. Logging cluster information at that level of detail would be too verbose and would pollute the log.
Helix could emit inGraph metrics for the aforementioned changes for easier monitoring. This will provide insight into how frequently a given cluster undergoes topology changes, and both Helix developers and application teams will more easily be able to tell how often changes take place and what kind they are. This information is currently not available and will be useful in maintaining clusters.
An alternative design considered was to implement IZkChildListener and/or IZkDataListener in order to bypass Helix's controller event queue, with the expected result of faster detection of permanent/transient changes. We will not go this route because we agreed that Cluster Change Detector does not need a faster cadence than the rebalancer: the speed of rebalancing is capped by how fast Helix processes events, regardless of how fast Cluster Change Detector detects changes.