This project provides several reference architectures for running distributed training on Amazon EKS for different use cases using p4d.24xlarge instances (you can replace them with p5 or trn1 instances). These examples use eksctl and a cluster manifest to create your specified Amazon EKS cluster.
To deploy the architectures you must install the dependencies below. You are advised to go through the first two steps of the Getting started with Amazon EKS guide from the AWS Documentation.
- AWS CLI - the AWS command line interface.
- eksctl - command line tool to manage EKS clusters.
- kubectl - command line tool for Kubernetes.
The following diagram shows a common architecture that can be used for distributed model training on EKS.
The EKS cluster has two nodegroups. A system nodegroup is used to run pods such as kube-dns and the Kubeflow training operator, which provide internal cluster-scope services and can run on CPU. A worker nodegroup, built with an accelerated instance type, is used to run the distributed training workload.
The cluster configuration is specified via a yaml manifest file. If a cluster version is not specified in the manifest, then the default EKS API version will be used. For our examples we set the version to 1.27. This setting may be adjusted before creating clusters as needed. The following example cluster configurations for distributed training are provided:
- `eks-g4dn-vpc.yaml`: Cluster using an existing VPC with a nodegroup of 2 x g4dn.8xlarge instances. This instance type supports Elastic Fabric Adapter (EFA), usually does not require a capacity reservation, and is a good starting point when developing distributed training architectures. To use this manifest, edit the VPC ID and subnets, and specify the desired private subnet for the nodes.
- `eks-g4dn.yaml`: Cluster with a nodegroup of 2 x g4dn.8xlarge instances, created in a new VPC. This example shows that when a VPC is not specified, one is created for the cluster. The manifest works without any modifications; however, if you wish to change the cluster name, API version, region, availability zones, etc., you can modify the file before using it to create the cluster.
- `eks-p4de-odcr-vpc.yaml`: Cluster using an existing VPC with a nodegroup of 2 x p4de.24xlarge instances from an existing on-demand capacity reservation (ODCR). This is the most common configuration for distributed training workloads. Edit the file to specify the VPC ID, subnets, and capacityReservationID. Please note that the subnet of the nodegroup should match that of the capacity reservation.
- `eks-p4de-odcr.yaml`: Cluster with 2 x p4de.24xlarge instances from an existing ODCR. A new VPC will be created for this cluster. This configuration is useful for distributed training when no VPC is already available. Note that you have to match the AZ of your ODCR in the nodegroup section of the manifest. Nodegroups in this and the previous examples are fully managed and can be accessed via the EKS console. If you are using an instance type that is not yet supported in managed nodegroups by EKS, you can define a self-managed nodegroup as shown at the end of this example.
- `eks-p5-odcr.yaml`: Cluster with 1 x p5.48xlarge instance from an existing ODCR and an existing VPC. Note that you have to match the AZ of your ODCR in the nodegroup section of the manifest. If you are using an instance type that is not yet supported in managed nodegroups by EKS, you can define a self-managed nodegroup by using the `eks-p5-capacity-block.yaml` template.
- `eks-p5-capacity-block.yaml`: Deploys a cluster without a nodegroup, allowing you to create an unmanaged nodegroup for Capacity Blocks for ML. See the Capacity Block section for further detail.
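The common shape of these manifests can be sketched as follows. This is an illustrative outline only, not one of the actual templates: the cluster name, AZs, and nodegroup values below are assumptions.

```yaml
# Minimal eksctl cluster manifest sketch (illustrative values, not a real template).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: distributed-training-example   # hypothetical cluster name
  version: "1.27"                      # cluster version pinned in the manifest
  region: us-east-1

availabilityZones:
  - us-east-1a
  - us-east-1c

managedNodeGroups:
  # System nodegroup for cluster-scope services (runs on CPU)
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
  # Worker nodegroup with accelerated instances for distributed training
  - name: workers
    instanceType: g4dn.8xlarge
    desiredCapacity: 2
    availabilityZones: ["us-east-1c"]  # match your capacity's AZ
    efaEnabled: true                   # attach EFA interfaces
    privateNetworking: true            # place nodes in a private subnet
```

The two nodegroups mirror the architecture described above: a small CPU system nodegroup and an accelerated worker nodegroup.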
To configure your desired cluster, edit the cluster manifest file that most closely matches your desired configuration or copy the file and customize it, following the cluster manifest schema. Any of the values in the manifests can be changed and more node groups can be added to the same cluster. The minimal set of values to specify for each file are described above.
You will need to replace the following placeholders to deploy your clusters:

- `PLACEHOLDER_AWS_REGION`: region in which to deploy the cluster, replace by `us-east-1` for example.
- `PLACEHOLDER_AZ_1`: we use 2 AZs for the cluster, replace by `us-east-1a` for example.
- `PLACEHOLDER_AZ_2`: this AZ is where your compute capacity is located, replace by `us-east-1c` for example if that's where your capacity is located.
- `PLACEHOLDER_VPC_ID`: ID of the VPC in which you deploy the cluster, it should take the form `vpc-12356790abcd`.
- `PLACEHOLDER_SUBNET_PUBLIC_1` and `PLACEHOLDER_SUBNET_PUBLIC_2`: change to the IDs of public subnets (`subnet-12356790abcd`).
- `PLACEHOLDER_SUBNET_PRIVATE_1` and `PLACEHOLDER_SUBNET_PRIVATE_2`: change to the IDs of private subnets to host the compute nodes (`subnet-12356790abcd`). The second private subnet holds your compute capacity; ensure it is in the right AZ.
- `PLACEHOLDER_CAPACITY_RESERVATION_ID`: if using a capacity reservation, put the ID here (`cr-12356790abcd`).
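As a sketch of this substitution, the placeholders can be filled in with `sed`. The manifest fragment below is a stand-in created for the example, not one of the real templates, and the region/AZ values are illustrative:

```shell
# Create a stand-in manifest fragment containing a few of the placeholders.
cat > cluster.yaml <<'EOF'
metadata:
  region: PLACEHOLDER_AWS_REGION
availabilityZones:
  - PLACEHOLDER_AZ_1
  - PLACEHOLDER_AZ_2
EOF

# Replace the placeholders in place, keeping a .bak copy of the original.
sed -i.bak \
  -e 's/PLACEHOLDER_AWS_REGION/us-east-1/' \
  -e 's/PLACEHOLDER_AZ_1/us-east-1a/' \
  -e 's/PLACEHOLDER_AZ_2/us-east-1c/' \
  cluster.yaml

# Count any placeholders left behind (should be zero).
grep -c PLACEHOLDER cluster.yaml || true
```

The same pattern applies to the VPC, subnet, and capacity reservation placeholders once you have your account-specific IDs.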
- Let's assume that your desired cluster configuration is stored in file `cluster.yaml`. Then to create the cluster, execute the following command:

  ```bash
  eksctl create cluster -f ./cluster.yaml
  ```

  Example output:

  ```text
  YYYY-MM-DD HH:mm:SS [ℹ] eksctl version x.yyy.z
  YYYY-MM-DD HH:mm:SS [ℹ] using region <region_name>
  ...
  YYYY-MM-DD HH:mm:SS [✔] EKS cluster "<cluster_name>" in "<region_name>" region is ready
  ```

  Cluster creation may take between 15 and 30 minutes. Upon successful creation your local `~/.kube/config` file gets updated with connection information to your cluster.

- Execute the following command to verify that the cluster is accessible:

  ```bash
  kubectl get nodes
  ```
You should see a list of three nodes. One would be a system node instance of type c5.2xlarge, and the others will belong to the nodegroup of instances with your desired instance type for distributed training.
To remove your cluster, execute the following command:
```bash
eksctl delete cluster -f ./cluster.yaml
```
Example output:
```text
YYYY-MM-DD HH:mm:SS [ℹ] deleting EKS cluster "<cluster_name>"
...
YYYY-MM-DD HH:mm:SS [ℹ] waiting for CloudFormation stack "<stack_name>"
```
Capacity Blocks for ML have a restriction that they can't be used in a managed node group. To create an unmanaged node group, we'll first deploy an EKS cluster and then deploy a CloudFormation stack using the values created with the cluster:
- Deploy the EKS cluster using the template `eks-p5-capacity-block.yaml`:

  ```bash
  eksctl create cluster -f ./eks-p5-capacity-block.yaml
  ```
- After this cluster deployment finishes, we'll deploy the following stack. Its parameters are:
  - `ClusterName`: this needs to be the same as the cluster you created above, defaults to `eks-p5-odcr-vpc`.
  - `ClusterControlPlaneSecurityGroup`: grab this by visiting the EKS Console > Cluster > Networking > Additional Security Group.
  - `NodeImageIdSSMParam`: defaults to the EKS GPU AMI 1.29, but you can override this with the `NodeImageId` parameter.
- This sets up a security group for EFA.
- After the nodegroup is created, we need to update the `aws-auth` config map.
3.1 Check to see if you already have an `aws-auth` `ConfigMap`.
```bash
kubectl describe configmap -n kube-system aws-auth
```
3.2 If you are shown an `aws-auth` `ConfigMap`, then update it as needed.
3.2.1 Open the `ConfigMap` for editing.
```bash
kubectl edit -n kube-system configmap/aws-auth
```
3.2.2 Add a new `mapRoles` entry as needed. Set the `rolearn` value to the NodeInstanceRole value that you recorded in the previous procedure.
```yaml
[...]
data:
  mapRoles: |
    - rolearn: <ARN of instance role (not instance profile)>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
[...]
```
3.2.3 Save the file and exit your text editor.
3.3 If you received an error stating `Error from server (NotFound): configmaps "aws-auth" not found`, then apply the stock `ConfigMap`.
3.3.1 Download the configuration map.
```bash
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/cloudformation/2020-10-29/aws-auth-cm.yaml
```
3.3.2 In the `aws-auth-cm.yaml` file, set the `rolearn` value to the NodeInstanceRole value that you recorded in the previous procedure. You can do this with a text editor, or by replacing `my-node-instance-role` and running the following command:
```bash
sed -i.bak -e 's|<ARN of instance role (not instance profile)>|my-node-instance-role|' aws-auth-cm.yaml
```
3.3.3 Apply the configuration. This command may take a few minutes to finish.
```bash
kubectl apply -f aws-auth-cm.yaml
```
- After the cluster is created we can list the nodes:

  ```bash
  kubectl get nodes
  ```
- Apply the NVIDIA device plugin for Kubernetes:

  ```bash
  kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
  ```
- If using EFA, make sure to install the EFA device plugin:

  ```bash
  kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml
  ```
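Once both plugins are running, training pods can request the resources they advertise (`nvidia.com/gpu` from the NVIDIA device plugin and `vpc.amazonaws.com/efa` from the EFA device plugin). A hedged sketch follows; the pod name, container image, and resource counts are illustrative assumptions, not values from this project:

```yaml
# Illustrative pod spec requesting GPU and EFA resources (example values).
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-example      # hypothetical pod name
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example training image
      resources:
        limits:
          nvidia.com/gpu: 8          # advertised by the NVIDIA device plugin
          vpc.amazonaws.com/efa: 4   # advertised by the EFA device plugin
```

In practice these requests usually appear in the worker template of a Kubeflow training job rather than a bare pod.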
For further information regarding EKS cluster infrastructure see the aws-do-eks project. More cluster configurations are available here.
Related resources for further reading can be found at the links below: