diff --git a/docs/eks/gpumon.md b/docs/eks/gpumon.md
index d6432116..c0315932 100644
--- a/docs/eks/gpumon.md
+++ b/docs/eks/gpumon.md
@@ -1,7 +1,7 @@
 # Monitoring NVIDIA GPU Workloads
 
-GPUs play an integral part in data intensive workloads. The base infrastructure module of the Observability Accelerator proivdes the ability to deploy the NVIDIA DCGM Exporter Dashboard
-The dashboard utilizes metrics scraped from the '/metrics' endpoint that are exposed when running the nvidia gpu operator.
+GPUs play an integral part in data intensive workloads. The base infrastructure module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
+The dashboard utilizes metrics scraped from the `/metrics` endpoint that are exposed when running the nvidia gpu operator and NVSMI binary.
 
 !!!note
     In order to make use of this dashboard, you will need to have a GPU backed EKS cluster and deploy the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html)
@@ -11,4 +11,31 @@ The dashboard utilizes metrics scraped from the '/metrics' endpoint that are exp
 
 This is enabled by default in the [base infrasturcture module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).
 
+## Dashboards
+
+In order to start producing diagnostic metrics you must first deploy the nvidia SMI binary. nvidia-smi (also NVSMI) provides monitoring and management capabilities for each of NVIDIA’s devices from Fermi and higher architecture families. We can now deploy the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container:
+
+```
+cat << EOF | kubectl apply -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nvidia-smi
+spec:
+  restartPolicy: OnFailure
+  containers:
+  - name: nvidia-smi
+    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
+    args:
+    - "nvidia-smi"
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+EOF
+```
+After producing the metrics they should populate the DCGM exporter dashboard:
+
+![image](https://github.com/aws-observability/terraform-aws-observability-accelerator/assets/97046295/66e8ae83-3a78-48b8-a9fc-4460a5a4d173)
+
+
 