
Helix Cloud Support

Meng Zhang edited this page May 4, 2020 · 8 revisions

Introduction

Helix is a cluster management framework for partitioned, replicated resources in distributed systems. In the past, it was mostly used in on-premise environments, where companies handle resource deployment, hardware maintenance, security, privacy, and so on themselves. Now that more high-performance, low-cost cloud environments are available, companies have started moving their software to the cloud. There are several well-known cloud service providers, such as AWS, Azure, and GCP. In a cloud environment, a company can easily scale up or down depending on overall usage, without worrying about provisioning. This trend brings both challenges and opportunities for Helix to better serve systems deployed in the cloud.

Scope of this feature

This feature focuses on one opportunity for Helix in the cloud: helping customers automatically register participants with a Helix cluster. Currently, after a Helix cluster is created, there are two ways to add instances (participants) to it. The first is manual: customers add each instance config to the cluster themselves. The second is auto join: customers set the cluster's auto join config to true, and each participant populates its own instance config when it connects. However, auto join is only fully automatic in a non-rack-aware environment, that is, one with no fault domain concept. In a rack-aware environment, users still need to enter the domain information into the instance config manually. Since most customers use Helix in a rack-aware environment, it would be beneficial for Helix to provide a fully automatic way for participants to join the cluster.
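The gap described above can be illustrated with a small self-contained sketch. It does not use the Helix API: the instance config is modelled as a plain map, and the class `AutoJoinSketch`, its methods, the field names, and the `DOMAIN` value format are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Helix instance config: a plain map of fields.
// The field names below are assumptions, not taken from this page.
class AutoJoinSketch {

    // Fields a participant can fill in by itself when it auto joins.
    static Map<String, String> autoJoinConfig(String host, int port) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("HELIX_HOST", host);
        config.put("HELIX_PORT", Integer.toString(port));
        config.put("HELIX_ENABLED", "true");
        return config;
    }

    // In a rack-aware cluster the config is only complete once a
    // non-empty DOMAIN entry is present; otherwise nothing more is needed.
    static boolean isComplete(Map<String, String> config, boolean rackAware) {
        if (!rackAware) {
            return true;
        }
        String domain = config.get("DOMAIN");
        return domain != null && !domain.isEmpty();
    }
}
```

With this model, `isComplete(autoJoinConfig("host1", 12918), false)` is true, while the same config in a rack-aware cluster is incomplete until someone adds a `DOMAIN` entry. That missing entry is the manual step this feature aims to remove.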

In an on-premise environment, it is hard for each participant to obtain its own fault domain information. In a cloud environment, however, full automation becomes feasible, because many cloud providers expose this information to each individual instance through a metadata endpoint. For example, AWS, Azure, and GCP all use the fixed IP address http://169.254.169.254/ for an instance to retrieve its own metadata, which includes domain information. In AWS, the field is named "placement"; in Azure, it is named "PlatformUpdateDomain"; in GCP, it is named "zone". The value is usually a short identifier, such as an integer, indicating which fault domain the instance belongs to.
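As a rough sketch of the idea, the code below builds a request against the metadata address and pulls a single field out of a JSON-like response. The class `MetadataDomainSketch` and its methods are hypothetical, the regular-expression parsing is a simplification rather than a real JSON parser, and the provider-specific paths and request headers are deliberately left out because this page does not specify them. It assumes Java 11 or later for `java.net.http`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Helix.
class MetadataDomainSketch {

    // Link-local metadata address quoted in the text above.
    static final String METADATA_BASE = "http://169.254.169.254/";

    // Builds (but does not send) a GET request for a metadata path.
    // Real providers need their own paths and headers, omitted here.
    static HttpRequest buildRequest(String path) {
        return HttpRequest.newBuilder()
            .uri(URI.create(METADATA_BASE + path))
            .timeout(Duration.ofSeconds(2))
            .GET()
            .build();
    }

    // Extracts the first scalar value of a named field from a JSON-like
    // string. Simplified: handles quoted or bare values, not nested objects.
    static Optional<String> extractField(String json, String field) {
        Pattern pattern = Pattern.compile(
            "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"?([^\",}\\s]+)\"?");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Given a made-up response such as `{"PlatformUpdateDomain":"2"}`, calling `extractField(response, "PlatformUpdateDomain")` yields `2`, which a participant could then write into its own instance config as its domain. A production version would send the request with `java.net.http.HttpClient` and use a proper JSON library.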

Implementation

How to use
