Update nvidia-gpu-efa template
iankouls-aws committed Jun 21, 2024
1 parent 7125f2c commit d35ac91
Showing 6 changed files with 395 additions and 50 deletions.
2 changes: 2 additions & 0 deletions patterns/nvidia-gpu-efa/.gitignore
@@ -0,0 +1,2 @@
efa-info-test.yaml
efa-nccl-test.yaml
198 changes: 149 additions & 49 deletions patterns/nvidia-gpu-efa/README.md
@@ -1,6 +1,6 @@
# EKS Cluster w/ NVIDIA GPUs and EFA for AI/ML Workloads

This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled nodegroup that utilizes `p5.48xlarge` instances with H100 NVIDIA GPUs used in distributed, multi-node AI/ML workloads.

The following components are demonstrated in this pattern:

@@ -31,32 +31,38 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started

## Validate

Note:

Desired instance type can be specified in [eks.tf](eks.tf#L36).
Values shown below will change based on the instance type selected (e.g., `p5.48xlarge` has 8 GPUs and 32 EFA interfaces).
A list of EFA-enabled instance types is available [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
If you are using an on-demand capacity reservation (ODCR) for your instance type, uncomment the `capacity_reservation_specification` block in `eks.tf`
and specify a `capacity_reservation_id`. Ensure that the region and availability zone of your ODCR match the ones used in `main.tf`.
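EFA support can also be checked programmatically; the EC2 API exposes it as an instance-type filter. A quick query (assuming the AWS CLI is configured, and using the same region as `main.tf`) looks like this:

```sh
# List instance types with EFA support in the pattern's region
aws ec2 describe-instance-types \
  --region us-east-2 \
  --filters Name=network-info.efa-supported,Values=true \
  --query 'InstanceTypes[].InstanceType' \
  --output text | tr '\t' '\n' | sort
```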

1. List the nodes with instance type details:

```sh
kubectl get nodes -L node.kubernetes.io/instance-type
```

```text
NAME                                        STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-10-0-1-16.us-east-2.compute.internal     Ready    <none>   12h   v1.29.3-eks-ae9a62a   p5.48xlarge
ip-10-0-12-113.us-east-2.compute.internal   Ready    <none>   14h   v1.29.3-eks-ae9a62a   m5.large
ip-10-0-12-201.us-east-2.compute.internal   Ready    <none>   12h   v1.29.3-eks-ae9a62a   p5.48xlarge
ip-10-0-46-217.us-east-2.compute.internal   Ready    <none>   14h   v1.29.3-eks-ae9a62a   m5.large
```

You should see two EFA-enabled (in this example `p5.48xlarge`) nodes in the list.
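To show only the EFA-enabled nodes, you can filter on the instance-type label directly:

```sh
kubectl get nodes -l node.kubernetes.io/instance-type=p5.48xlarge
```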

2. Deploy Kubeflow MPI Operator

The Kubeflow MPI Operator is required to run MPIJobs on EKS. We will use an MPIJob to test EFA.
To deploy the MPI Operator, execute the following:

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
```

```text
...
clusterrole.rbac.authorization.k8s.io/mpi-operator configured
```
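Before moving on, it is worth confirming that the operator pod is running and the MPIJob CRD is registered (the manifest installs the operator into the `mpi-operator` namespace):

```sh
kubectl get pods -n mpi-operator
kubectl get crd mpijobs.kubeflow.org
```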

3. EFA info test

This test prints a list of available EFA interfaces by using the `/opt/amazon/efa/bin/fi_info` utility.
Edit the script [generate-efa-info-test.sh](generate-efa-info-test.sh) and adjust the following environment variables if needed prior to running it:

```sh
export NUM_WORKERS=2
export GPU_PER_WORKER=8
export EFA_PER_WORKER=32
```

- `NUM_WORKERS` - number of nodes you want to run the test on
- `GPU_PER_WORKER` - number of GPUs available on each node
- `EFA_PER_WORKER` - number of EFA interfaces available on each node

```sh
./generate-efa-info-test.sh
```

This script generates an MPIJob manifest file named `efa-info-test.yaml`.
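If you would like to validate the generated manifest before submitting it, a client-side dry run is one option:

```sh
kubectl apply --dry-run=client -f ./efa-info-test.yaml
```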

To start the test, apply the generated manifest to the cluster:

```sh
kubectl apply -f ./efa-info-test.yaml
```

```log
mpijob.kubeflow.org/efa-info-test created
```

Observe the pods in the current namespace. You should see a launcher pod and worker pods.
It is normal for the launcher pod to restart a few times until the worker pods are fully running.

```sh
watch kubectl get pods
```

```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 0/1 CrashLoopBackOff 1 (16s ago) 19s
efa-info-test-worker-0 1/1 Running 0 19s
efa-info-test-worker-1 1/1 Running 0 19s
```

```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 1/1 Running 2 (18s ago) 21s
efa-info-test-worker-0 1/1 Running 0 21s
efa-info-test-worker-1 1/1 Running 0 21s
```

```log
NAME READY STATUS RESTARTS AGE
efa-info-test-launcher-wm8pm 0/1 Completed 2 5m20s
```

Once the test launcher pod enters status `Running` or `Completed`,
see the test logs using the command below:

```sh
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```

```log
Warning: Permanently added 'efa-info-test-worker-1.efa-info-test.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'efa-info-test-worker-0.efa-info-test.default.svc' (ED25519) to the list of known hosts.
[1,1]<stdout>:provider: efa
[1,1]<stdout>: fabric: efa
[1,1]<stdout>: domain: rdmap79s0-rdm
[1,1]<stdout>: version: 120.10
[1,1]<stdout>: type: FI_EP_RDM
[1,1]<stdout>: protocol: FI_PROTO_EFA
...
[1,0]<stdout>:provider: efa
[1,0]<stdout>: fabric: efa
[1,0]<stdout>: domain: rdmap201s0-rdm
[1,0]<stdout>: version: 120.10
[1,0]<stdout>: type: FI_EP_RDM
[1,0]<stdout>: protocol: FI_PROTO_EFA
```
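As a sanity check, the EFA interface count reported for each node should line up with the `EFA_PER_WORKER` value used above; one way to inspect what a node advertises (node name taken from the earlier listing) is:

```sh
kubectl describe node ip-10-0-1-16.us-east-2.compute.internal | grep vpc.amazonaws.com/efa
```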

Finally, remove the job:

```sh
kubectl delete -f ./efa-info-test.yaml
```

4. EFA NCCL test

The EFA NCCL test is used to measure network bandwidth by running the `/opt/nccl-tests/build/all_reduce_perf` utility.
Edit the script [generate-efa-nccl-test.sh](generate-efa-nccl-test.sh) and modify the environment variables at the beginning of the script as needed.
Then generate the MPIJob manifest:

```sh
./generate-efa-nccl-test.sh
```
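For reference, the tunable variables at the top of the generator script are assumed to mirror those of the EFA info test; consult the script itself for the authoritative names and defaults:

```sh
# Assumed to mirror generate-efa-info-test.sh - verify against the actual script
export NUM_WORKERS=2      # nodes participating in the test
export GPU_PER_WORKER=8   # GPUs per node
export EFA_PER_WORKER=32  # EFA interfaces per node
```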

This script creates a file named `efa-nccl-test.yaml`. Apply the manifest to start the EFA NCCL test:

```sh
kubectl apply -f ./efa-nccl-test.yaml
```

```text
mpijob.kubeflow.org/efa-nccl-test created
```
Similar to the EFA info test, a launcher pod and worker pods will be created. The launcher pod will be
in `CrashLoopBackOff` state until the worker pods are fully running.
As soon as the launcher pod enters the `Running` state as well, execute the following command to see the test logs:
```sh
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```
```log
...
[1,0]<stdout>:# out-of-place in-place
[1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,0]<stdout>: 0 0 float sum -1 0.13 0.00 0.00 0 0.12 0.00 0.00 0
[1,0]<stdout>: 0 0 float sum -1 0.12 0.00 0.00 0 0.12 0.00 0.00 0
[1,0]<stdout>: 4 1 float sum -1 65.43 0.00 0.00 0 65.82 0.00 0.00 0
[1,0]<stdout>: 8 2 float sum -1 64.86 0.00 0.00 0 65.67 0.00 0.00 0
[1,0]<stdout>: 16 4 float sum -1 64.72 0.00 0.00 0 64.83 0.00 0.00 0
[1,0]<stdout>: 32 8 float sum -1 65.47 0.00 0.00 0 65.16 0.00 0.00 0
[1,0]<stdout>: 64 16 float sum -1 65.34 0.00 0.00 0 65.58 0.00 0.00 0
[1,0]<stdout>: 128 32 float sum -1 65.99 0.00 0.00 0 66.28 0.00 0.00 0
[1,0]<stdout>: 256 64 float sum -1 75.81 0.00 0.01 0 66.76 0.00 0.01 0
[1,0]<stdout>: 512 128 float sum -1 69.43 0.01 0.01 0 67.18 0.01 0.01 0
[1,0]<stdout>: 1024 256 float sum -1 82.35 0.01 0.02 0 69.03 0.01 0.03 0
[1,0]<stdout>: 2048 512 float sum -1 72.49 0.03 0.05 0 71.37 0.03 0.05 0
[1,0]<stdout>: 4096 1024 float sum -1 77.47 0.05 0.10 0 77.42 0.05 0.10 0
[1,0]<stdout>: 8192 2048 float sum -1 78.10 0.10 0.20 0 78.01 0.11 0.20 0
[1,0]<stdout>: 16384 4096 float sum -1 93.35 0.18 0.33 0 80.11 0.20 0.38 0
[1,0]<stdout>: 32768 8192 float sum -1 106.6 0.31 0.58 0 96.22 0.34 0.64 0
[1,0]<stdout>: 65536 16384 float sum -1 120.6 0.54 1.02 0 89.06 0.74 1.38 0
[1,0]<stdout>: 131072 32768 float sum -1 93.62 1.40 2.62 0 106.3 1.23 2.31 0
[1,0]<stdout>: 262144 65536 float sum -1 111.5 2.35 4.41 0 111.6 2.35 4.41 0
[1,0]<stdout>: 524288 131072 float sum -1 121.2 4.33 8.11 0 109.9 4.77 8.94 0
[1,0]<stdout>: 1048576 262144 float sum -1 119.7 8.76 16.43 0 118.7 8.83 16.56 0
[1,0]<stdout>: 2097152 524288 float sum -1 143.9 14.58 27.33 0 144.2 14.55 27.28 0
[1,0]<stdout>: 4194304 1048576 float sum -1 163.7 25.62 48.03 0 163.6 25.64 48.08 0
[1,0]<stdout>: 8388608 2097152 float sum -1 195.3 42.95 80.54 0 194.9 43.03 80.69 0
[1,0]<stdout>: 16777216 4194304 float sum -1 278.6 60.22 112.91 0 279.9 59.94 112.38 0
[1,0]<stdout>: 33554432 8388608 float sum -1 459.7 73.00 136.87 0 433.9 77.34 145.01 0
[1,0]<stdout>: 67108864 16777216 float sum -1 587.2 114.29 214.29 0 587.1 114.31 214.34 0
[1,0]<stdout>: 134217728 33554432 float sum -1 926.6 144.85 271.60 0 851.5 157.63 295.55 0
[1,0]<stdout>: 268435456 67108864 float sum -1 1497.8 179.22 336.03 0 1496.0 179.44 336.45 0
[1,0]<stdout>: 536870912 134217728 float sum -1 2558.6 209.83 393.42 0 2560.8 209.65 393.10 0
[1,0]<stdout>: 1073741824 268435456 float sum -1 4553.6 235.80 442.13 0 4553.0 235.83 442.19 0
[1,0]<stdout>: 2147483648 536870912 float sum -1 9062.5 236.96 444.31 0 9060.4 237.02 444.41 0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth : 79.9352
[1,0]<stdout>:#
```
The two `busbw` columns (the 8th and 12th data columns, ignoring the `[1,0]<stdout>:` prefix) show the out-of-place and in-place bus bandwidth calculated for the data size listed in the first column.
In this run they peak at 444.31 and 444.41 GB/s respectively.
Your actual results may be slightly different. The calculated average bus bandwidth is displayed at the end of the log.
In this test run the average bus bandwidth was 79.9352 GB/s.
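To pull just the summary line out of a long test log, a simple filter over the launcher logs works:

```sh
kubectl logs $(kubectl get pods | grep launcher | cut -d ' ' -f 1) | grep "Avg bus bandwidth"
```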
Lastly, delete the MPIJob:
```sh
kubectl delete -f ./efa-nccl-test.yaml
```
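If you are finished with MPIJobs altogether, the MPI Operator deployed in step 2 can be removed the same way it was installed:

```sh
kubectl delete -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
```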
## Destroy
9 changes: 9 additions & 0 deletions patterns/nvidia-gpu-efa/eks.tf
@@ -35,6 +35,15 @@ module "eks" {
ami_type = "AL2_x86_64_GPU"
instance_types = ["p5.48xlarge"]


# Uncomment this block and specify capacity_reservation_id
# to use an on-demand capacity reservation (ODCR) for your EFA instances
#capacity_reservation_specification = {
# capacity_reservation_target = {
# capacity_reservation_id = "cr-xxxxxxxxxxxxxxxxx"
# }
#}
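# Tip (illustrative, not part of the pattern): the ID of an existing
# reservation can be looked up with the AWS CLI, e.g.:
#   aws ec2 describe-capacity-reservations \
#     --query "CapacityReservations[].[CapacityReservationId,InstanceType,AvailabilityZone]" \
#     --output table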

pre_bootstrap_user_data = <<-EOT
# Mount instance store volumes in RAID-0 for kubelet and containerd
# https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
93 changes: 93 additions & 0 deletions patterns/nvidia-gpu-efa/generate-efa-info-test.sh
@@ -0,0 +1,93 @@
#!/bin/bash

# Generates efa-info-test.yaml, an MPIJob manifest that runs the fi_info
# utility on each worker pod to list the EFA interfaces available to it.

export MPI_JOB_NAME=efa-info-test
export IMAGE_URI=public.ecr.aws/hpc-cloud/nccl-tests:latest
export NUM_WORKERS=2      # number of worker pods (one per node)
export GPU_PER_WORKER=8   # GPUs available on each node
export EFA_PER_WORKER=32  # EFA interfaces available on each node
export TOTAL_GPUS=$((${NUM_WORKERS}*${GPU_PER_WORKER}))

# Overwrite (rather than append to) any previously generated manifest
cat <<EOF > efa-info-test.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: ${MPI_JOB_NAME}
spec:
runPolicy:
cleanPodPolicy: Running
backoffLimit: 20
slotsPerWorker: ${GPU_PER_WORKER}
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
restartPolicy: OnFailure
tolerations:
- key: "nvidia.com/gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
containers:
- image: ${IMAGE_URI}
name: ${MPI_JOB_NAME}-launcher
imagePullPolicy: IfNotPresent
env:
- name: LD_LIBRARY_PATH
value: "/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib"
- name: PATH
value: "/opt/amazon/efa/bin:/usr/bin"
- name: XLA_FLAGS
value: "--xla_gpu_cuda_data_dir=/usr/local/cuda"
- name: TF_XLA_FLAGS
value: "--tf_xla_cpu_global_jit"
- name: NCCL_DEBUG
value: INFO
command:
- /opt/amazon/openmpi/bin/mpirun
- --allow-run-as-root
- --tag-output
- -np
- "${TOTAL_GPUS}"
- -bind-to
- none
- -map-by
- slot
- -x
- PATH
- -x
- LD_LIBRARY_PATH
- -x
- XLA_FLAGS
- -x
- TF_XLA_FLAGS
- -x
- NCCL_DEBUG=INFO
- --mca
- pml
- ^cm
- --mca
- pml_rsh_agent=ssh
- --oversubscribe
- /opt/amazon/efa/bin/fi_info
- -p
- "efa"
- -t
- "FI_EP_RDM"
Worker:
replicas: ${NUM_WORKERS}
template:
spec:
containers:
- image: ${IMAGE_URI}
name: ${MPI_JOB_NAME}-worker
imagePullPolicy: IfNotPresent
resources:
limits:
nvidia.com/gpu: ${GPU_PER_WORKER}
vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
requests:
nvidia.com/gpu: ${GPU_PER_WORKER}
vpc.amazonaws.com/efa: ${EFA_PER_WORKER}
EOF
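Note that the script sets its variables unconditionally, so values exported in the calling shell are overwritten. One possible variant (illustrative, not part of the pattern) that honors caller-provided values uses parameter-expansion defaults:

```sh
# Hypothetical variant: keep a caller-supplied value, fall back to the default
export NUM_WORKERS="${NUM_WORKERS:-2}"
export GPU_PER_WORKER="${GPU_PER_WORKER:-8}"
export EFA_PER_WORKER="${EFA_PER_WORKER:-32}"
```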
