# Monitoring NVIDIA GPU Workloads

GPUs play an integral part in data-intensive workloads. The base infrastructure module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
The dashboard utilizes metrics scraped from the `/metrics` endpoint that is exposed when running the NVIDIA GPU operator and the NVSMI binary.
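
Once the GPU operator and DCGM exporter are running (see the note below), you can inspect that endpoint directly. As a minimal sketch, the service name `nvidia-dcgm-exporter`, namespace `gpu-operator`, and port `9400` are the GPU operator defaults and may differ in your installation:

```
# Forward the DCGM exporter service locally (assumed service name/namespace/port)
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &

# GPU metrics such as DCGM_FI_DEV_GPU_UTIL should appear in the scrape output
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```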

!!!note
    In order to make use of this dashboard, you will need to have a GPU backed EKS cluster and deploy the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html).
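
As a rough sketch of that prerequisite, the GPU operator can be installed with Helm. The repository URL, chart, and namespace below follow NVIDIA's public instructions; confirm them against the guide linked above:

```
# Add NVIDIA's Helm repository and install the GPU operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```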

This is enabled by default in the [base infrastructure module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).

## Dashboards

In order to start producing diagnostic metrics, you must first deploy the NVIDIA SMI binary. nvidia-smi (also NVSMI) provides monitoring and management capabilities for each of NVIDIA's devices from the Fermi and higher architecture families. Deploy a pod that runs the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container:

```
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
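
Once the pod has completed, its logs contain the familiar nvidia-smi table, which is a quick way to confirm the container can see the GPU. The pod name matches the manifest above:

```
# View the nvidia-smi output captured by the pod
kubectl logs nvidia-smi

# Remove the one-off pod when finished
kubectl delete pod nvidia-smi
```
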
After the metrics are produced, they should populate the DCGM Exporter dashboard:

![image](https://github.com/aws-observability/terraform-aws-observability-accelerator/assets/97046295/66e8ae83-3a78-48b8-a9fc-4460a5a4d173)


