Helix Cloud Support

Introduction

Helix is a cluster management framework for partitioned, replicated resources in distributed systems. Historically, it has mostly been used in on-premise environments, where companies handle resource deployment, hardware maintenance, security, privacy, and so on themselves. With more high-performance, low-cost cloud environments now available, companies have started moving their software to the cloud. Well-known cloud service providers include AWS, Azure, and GCP. In a cloud environment, a company can easily scale up or down depending on overall usage, without worrying about provisioning. This trend brings both challenges and opportunities for Helix to better serve systems deployed in the cloud.

Scope of this feature

This feature focuses on one opportunity for Helix in the cloud: helping customers automatically register participants with a Helix cluster. Currently, after a Helix cluster is created, there are two ways to add instances (participants) to it. One is manual addition, where customers add the instance config to the cluster themselves. The other is auto join, where customers set the cluster's auto join config to true, and each participant populates its own instance config when it connects. However, auto join only works fully in a non-rack-aware environment, that is, one with no fault domain concept. In a rack-aware environment, users still need to enter the domain information into the instance config manually. Since most customers use Helix in a rack-aware environment, it would be beneficial for Helix to give participants a fully automatic way to join the cluster.

In an on-premise environment, it is hard for a participant to obtain its own fault domain information. In a cloud environment, however, full automation becomes feasible, because many cloud providers expose this information to each instance through a metadata endpoint. For example, AWS, Azure, and GCP all use the fixed IP address http://169.254.169.254/ for an instance to retrieve its metadata, which includes domain information. In AWS, the field is named "placement"; in Azure, "PlatformUpdateDomain"; in GCP, "zone". The value is typically a short identifier (in Azure, an integer) indicating which fault domain the instance belongs to.
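As a rough illustration of how a participant might address such an endpoint, the sketch below builds (but does not send) an HTTP request for the Azure case. The /metadata/instance path, the api-version value, and the "Metadata: true" header follow Azure's public Instance Metadata Service conventions rather than anything stated in this document, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper for illustration only; not part of Helix.
final class MetadataRequestSketch {
  // The IP address comes from the text above; the path and api-version are assumptions.
  static final String AZURE_METADATA_URL =
      "http://169.254.169.254/metadata/instance?api-version=2021-02-01";

  // Builds a GET request for the instance metadata without sending it.
  static HttpRequest buildAzureMetadataRequest(Duration timeout) {
    return HttpRequest.newBuilder(URI.create(AZURE_METADATA_URL))
        .header("Metadata", "true") // Azure rejects metadata requests without this header
        .timeout(timeout)
        .GET()
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request = buildAzureMetadataRequest(Duration.ofSeconds(5));
    System.out.println(request.method() + " " + request.uri());
  }
}
```

The request is only constructed here because the endpoint is reachable solely from inside a cloud instance; sending it would be done with java.net.http.HttpClient.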

High Level Flow

Implementation

At a high level, Helix provides the following main components.

Provide enhanced REST and Java APIs for cluster creation and cloud config update

Helix will enhance the existing cluster creation REST API and Java API with an extra field that denotes whether cloud support is enabled. If it is enabled, Helix creates the cluster with the default Azure cloud environment.

Besides the enhanced cluster creation API, Helix also provides a set of cloud-specific APIs to get, add, and update the cloud config. These APIs are mainly for users who need a customized cloud config for an environment other than Azure.

Provide Helix cloud configs

Helix provides cloud configs at two levels: the cluster level and the participant level. We describe them separately. At the cluster level, there is a new znode called CloudConfig in ZooKeeper. As with other existing configs, Helix provides a builder, validation, and get/set functions for the cloud config. The znode has the following fields, which store relatively static cloud information for the whole cluster.

CLOUD_ENABLED: determines whether the cluster is inside a cloud environment
CLOUD_PROVIDER: denotes what environment the cluster is in, e.g. Azure, AWS, or Customized
CLOUD_ID: the cloud ID that belongs to this cluster
CLOUD_INFO_SOURCE: the source for retrieving the cloud information
CLOUD_INFO_PROCESSOR_NAME: the name of the function that handles fetching and parsing of cloud information

The first three fields are required, and the last two are optional. If the user chooses a provider that already has a default implementation in Helix, e.g., Azure, they do not need to provide the last two fields, because Helix already supplies default values for them in system properties, bundled with the Azure implementation mentioned above. If the user uses a customized provider, or a cloud environment that Helix does not yet implement, they must provide the last two fields.
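The required/optional rule above can be sketched as a small builder with validation. This is a hypothetical stand-in written for illustration, not Helix's own config class: the class name, the enum, and the method names are all assumptions, and only the five field names come from the list above.

```java
// Hypothetical stand-in for the cluster-level cloud config; not Helix's own class.
final class CloudConfigSketch {
  enum Provider { AZURE, CUSTOMIZED }

  final boolean cloudEnabled;
  final Provider cloudProvider;
  final String cloudId;
  final String cloudInfoSource;        // null when the provider's defaults are used
  final String cloudInfoProcessorName; // null when the provider's defaults are used

  private CloudConfigSketch(Builder b) {
    this.cloudEnabled = b.cloudEnabled;
    this.cloudProvider = b.cloudProvider;
    this.cloudId = b.cloudId;
    this.cloudInfoSource = b.cloudInfoSource;
    this.cloudInfoProcessorName = b.cloudInfoProcessorName;
  }

  static final class Builder {
    private Boolean cloudEnabled;
    private Provider cloudProvider;
    private String cloudId;
    private String cloudInfoSource;
    private String cloudInfoProcessorName;

    Builder cloudEnabled(boolean v) { cloudEnabled = v; return this; }
    Builder cloudProvider(Provider v) { cloudProvider = v; return this; }
    Builder cloudId(String v) { cloudId = v; return this; }
    Builder cloudInfoSource(String v) { cloudInfoSource = v; return this; }
    Builder cloudInfoProcessorName(String v) { cloudInfoProcessorName = v; return this; }

    CloudConfigSketch build() {
      // The first three fields are always required.
      if (cloudEnabled == null || cloudProvider == null || cloudId == null) {
        throw new IllegalArgumentException(
            "CLOUD_ENABLED, CLOUD_PROVIDER and CLOUD_ID are required");
      }
      // The last two are required only when no default implementation exists.
      boolean hasDefaultImplementation = cloudProvider == Provider.AZURE;
      if (!hasDefaultImplementation
          && (cloudInfoSource == null || cloudInfoProcessorName == null)) {
        throw new IllegalArgumentException(
            "CLOUD_INFO_SOURCE and CLOUD_INFO_PROCESSOR_NAME are required for " + cloudProvider);
      }
      return new CloudConfigSketch(this);
    }
  }

  public static void main(String[] args) {
    CloudConfigSketch config = new Builder()
        .cloudEnabled(true)
        .cloudProvider(Provider.AZURE)
        .cloudId("example-cloud-id") // made-up value
        .build();
    System.out.println(config.cloudProvider + " " + config.cloudId);
  }
}
```

Validating in build() keeps an incomplete config from ever being constructed, which mirrors the idea of validating the config before it is written to ZooKeeper.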

At the participant level, Helix provides detailed cloud properties that relate more closely to the participant’s actions. By default, Helix provides Azure cloud properties, such as the HTTP timeout for querying the Azure Instance Metadata Service (AIMS) and the maximum number of retries. Users who would like customized properties for participants can supply them through ZKHelixManager, which is passed into participants.
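One way to picture the default-plus-override behavior is a plain java.util.Properties merge, sketched below. The property keys and default values are invented for illustration and are not the names Helix uses; the sketch also does not show how the properties reach ZKHelixManager.

```java
import java.util.Properties;

// Hypothetical illustration of participant-level cloud properties; not Helix code.
final class CloudPropertiesSketch {
  // Made-up keys standing in for the HTTP timeout and max retry settings.
  static final String HTTP_TIMEOUT_MS = "cloud.http.timeout.ms";
  static final String MAX_RETRY = "cloud.max.retry";

  // Made-up default values standing in for the bundled Azure defaults.
  static Properties azureDefaults() {
    Properties defaults = new Properties();
    defaults.setProperty(HTTP_TIMEOUT_MS, "5000");
    defaults.setProperty(MAX_RETRY, "5");
    return defaults;
  }

  // User-supplied values win; anything not supplied falls back to the defaults.
  static Properties resolve(Properties userOverrides) {
    Properties merged = new Properties();
    merged.putAll(azureDefaults());
    merged.putAll(userOverrides);
    return merged;
  }

  public static void main(String[] args) {
    Properties overrides = new Properties();
    overrides.setProperty(HTTP_TIMEOUT_MS, "2000");
    Properties effective = resolve(overrides);
    System.out.println(effective.getProperty(HTTP_TIMEOUT_MS));
    System.out.println(effective.getProperty(MAX_RETRY));
  }
}
```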

Provide generic interface for fetching and parsing cloud instance information

Helix provides an interface for users to implement fetching and parsing functions for cloud instance information. The interface is kept generic so that users can implement their own fetching and parsing logic with maximum freedom.
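To make the idea of a generic fetch-and-parse interface concrete, here is a minimal sketch. The interface name, method names, and return types are assumptions chosen for illustration and may differ from the interface Helix actually ships; the toy implementation reads canned strings instead of calling a cloud endpoint.

```java
import java.util.List;

// Illustrative sketch of a generic fetch/parse contract; names are hypothetical.
final class ProcessorSketch {
  // T is the raw response type, so implementations are free to use strings,
  // HTTP response objects, or anything else their cloud provider returns.
  interface CloudInstanceInfoProcessor<T> {
    List<T> fetchCloudInstanceInformation();
    String parseFaultDomain(List<T> responses);
  }

  // Toy implementation that "fetches" a fixed list of key=value strings.
  static final class FixedResponseProcessor implements CloudInstanceInfoProcessor<String> {
    private final List<String> cannedResponses;

    FixedResponseProcessor(List<String> cannedResponses) {
      this.cannedResponses = List.copyOf(cannedResponses);
    }

    @Override
    public List<String> fetchCloudInstanceInformation() {
      return cannedResponses;
    }

    @Override
    public String parseFaultDomain(List<String> responses) {
      for (String response : responses) {
        if (response.startsWith("faultDomain=")) {
          return response.substring("faultDomain=".length());
        }
      }
      throw new IllegalStateException("no fault domain found in responses");
    }
  }

  // Callers depend only on the interface, not on a specific provider.
  static <T> String resolveFaultDomain(CloudInstanceInfoProcessor<T> processor) {
    return processor.parseFaultDomain(processor.fetchCloudInstanceInformation());
  }

  public static void main(String[] args) {
    CloudInstanceInfoProcessor<String> processor =
        new FixedResponseProcessor(List.of("host=example-host", "faultDomain=2"));
    System.out.println(resolveFaultDomain(processor));
  }
}
```

Parameterizing over the raw response type is what lets each provider choose its own transport and payload format while the caller stays unchanged.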

Implement Azure cloud instance information processor

Helix provides an implementation for processing (fetching and parsing) Azure cloud instance information. The fetching function retrieves instance information from Azure AIMS, and the parsing function validates the response and extracts the fields needed for participant auto registration.
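The validate-then-extract step might look roughly like the following sketch. It is not the Helix implementation: the class and method names are hypothetical, the sample payload is made up, and a regular expression stands in for the JSON parsing a real processor would more likely use. The only detail taken from this document is the "PlatformUpdateDomain" field name, matched case-insensitively.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parsing sketch; a real processor would likely use a JSON library.
final class AzureParseSketch {
  // Matches "platformUpdateDomain": "3" or "platformUpdateDomain": 3, ignoring case.
  private static final Pattern UPDATE_DOMAIN = Pattern.compile(
      "\"platformUpdateDomain\"\\s*:\\s*\"?(\\d{1,9})\"?", Pattern.CASE_INSENSITIVE);

  // Validates the response body and returns the domain number it carries.
  static int parseDomain(String responseBody) {
    if (responseBody == null || responseBody.isBlank()) {
      throw new IllegalArgumentException("empty metadata response");
    }
    Matcher matcher = UPDATE_DOMAIN.matcher(responseBody);
    if (!matcher.find()) {
      throw new IllegalArgumentException("platformUpdateDomain not found in response");
    }
    return Integer.parseInt(matcher.group(1));
  }

  public static void main(String[] args) {
    // Made-up payload shaped like a metadata response; not real service output.
    String sample = "{\"compute\":{\"name\":\"example-vm\",\"platformUpdateDomain\":\"3\"}}";
    System.out.println(parseDomain(sample));
  }
}
```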

Implement participant auto registration logic

In the auto registration process, the participant first queries the cloud config to determine what environment it is in. Based on this information, the participant calls the corresponding cloud instance information processor described above. The processor works in two steps: first it retrieves the cloud instance information, and then it parses and validates that information and returns the desired response to the participant. Taking Azure as an example, as shown below, a few steps are needed before the participant can auto join the cluster.