# Monitoring NVIDIA GPU Workloads

GPUs play an integral part in data-intensive workloads. The base infrastructure module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
The dashboard utilizes metrics scraped from the `/metrics` endpoint that is exposed when running the NVIDIA GPU operator and the NVSMI binary.
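
Once the GPU operator and DCGM exporter are running (see the note below), you can inspect that endpoint directly. As a minimal sketch, the service name `nvidia-dcgm-exporter`, namespace `gpu-operator`, and port `9400` are the GPU operator defaults and may differ in your installation:

```
# Forward the DCGM exporter service locally (assumed service name/namespace/port)
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &

# GPU metrics such as DCGM_FI_DEV_GPU_UTIL should appear in the scrape output
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```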

!!!note
    In order to make use of this dashboard, you will need to have a GPU backed EKS cluster and deploy the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html).
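
As a rough sketch of that prerequisite, the GPU operator can be installed with Helm. The repository URL, chart, and namespace below follow NVIDIA's public instructions; confirm them against the guide linked above:

```
# Add NVIDIA's Helm repository and install the GPU operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```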

This is enabled by default in the [base infrastructure module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).

## Dashboards

In order to start producing diagnostic metrics, you must first deploy the NVIDIA SMI binary. nvidia-smi (also NVSMI) provides monitoring and management capabilities for each of NVIDIA's devices from the Fermi and higher architecture families. Deploy a pod that runs the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container:

```
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
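
Once the pod has completed, its logs contain the familiar nvidia-smi table, which is a quick way to confirm the container can see the GPU. The pod name matches the manifest above:

```
# View the nvidia-smi output captured by the pod
kubectl logs nvidia-smi

# Remove the one-off pod when finished
kubectl delete pod nvidia-smi
```
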
After the metrics are produced, they should populate the DCGM Exporter dashboard:

![image](https://github.com/aws-observability/terraform-aws-observability-accelerator/assets/97046295/66e8ae83-3a78-48b8-a9fc-4460a5a4d173)


