feat(module: eks-monitoring) Add NVIDIA gpu monitoring dashboards (#257)
* gpu dashboards

* fixing locals

* doc start

* Update gpumon.md

* fixing typos and doc names

* fixing module name

* fixing mkdocs

* gpu to nvidia

* Apply pre-commit

---------

Co-authored-by: Rodrigue Koffi <[email protected]>
lewinkedrs and bonclay7 authored Jan 24, 2024
1 parent d8b3067 commit ada16d5
Showing 7 changed files with 95 additions and 0 deletions.
38 changes: 38 additions & 0 deletions docs/eks/gpu-monitoring.md
@@ -0,0 +1,38 @@
# Monitoring NVIDIA GPU Workloads

GPUs play an integral part in data-intensive workloads. The eks-monitoring module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter dashboard.
The dashboard uses metrics scraped from the `/metrics` endpoint exposed when running the NVIDIA GPU operator with the [DCGM exporter](https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/) and the NVSMI binary.

!!!note
    To make use of this dashboard, you need a GPU-backed EKS cluster with the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) deployed.
    The recommended way of deploying the GPU operator is the [Data on EKS Blueprint](https://github.com/aws-ia/terraform-aws-eks-data-addons/blob/main/nvidia-gpu-operator.tf); a minimal sketch follows below.
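
A minimal Terraform sketch of enabling the operator through that blueprint is shown below. The module source, version, and the `enable_nvidia_gpu_operator` and `oidc_provider_arn` input names are assumptions based on the Data on EKS addons project's conventions, so verify them against the linked repository before use:

```hcl
# Hypothetical sketch -- confirm the module source and input names against the
# Data on EKS addons repository linked above.
module "eks_data_addons" {
  source  = "aws-ia/eks-data-addons/aws"
  version = "~> 1.0"

  # OIDC provider of your GPU-backed EKS cluster (placeholder reference)
  oidc_provider_arn = module.eks.oidc_provider_arn

  # Installs the NVIDIA GPU operator Helm chart, which bundles the DCGM exporter
  enable_nvidia_gpu_operator = true
}
```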

## Deployment

NVIDIA GPU monitoring is enabled by default in the [eks-monitoring module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/) via the `enable_nvidia_monitoring` variable.
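
If you need to disable the dashboards, or enable them explicitly, set the `enable_nvidia_monitoring` input on your module call. The sketch below is illustrative only: the cluster name is a placeholder and the remaining required inputs of your existing eks-monitoring configuration are omitted.

```hcl
module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring"

  # Placeholder: your GPU-backed EKS cluster
  eks_cluster_id = "my-gpu-cluster"

  # NVIDIA DCGM exporter dashboards are deployed by default; set to false to skip them
  enable_nvidia_monitoring = true

  # ... remaining inputs (Managed Prometheus / Grafana settings, etc.) as in your existing setup
}
```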

## Dashboards

To start producing diagnostic metrics, you must first deploy the NVIDIA SMI binary. nvidia-smi (also NVSMI) provides monitoring and management capabilities for every NVIDIA device from the Fermi architecture family onward. The pod below runs nvidia-smi, which reports diagnostic information about all GPUs visible to the container:

```
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:11.0.3-base-ubuntu20.04"
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
Once the metrics are produced, they should populate the DCGM exporter dashboard:

![image](https://github.com/aws-observability/terraform-aws-observability-accelerator/assets/97046295/66e8ae83-3a78-48b8-a9fc-4460a5a4d173)
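
The dashboard manifests are delivered through Flux. If you maintain a fork of the Grafana dashboard artifacts, the module's `nvidia_monitoring_config` input lets you point the dashboard Kustomization at a different Flux `GitRepository` and path. The values below are placeholders for illustration only, and the referenced `GitRepository` must exist in the cluster:

```hcl
module "eks_monitoring" {
  source = "github.com/aws-observability/terraform-aws-observability-accelerator//modules/eks-monitoring"

  nvidia_monitoring_config = {
    # Placeholder values -- point Flux at your own copy of the GPU dashboard manifests
    flux_gitrepository_name   = "my-dashboards"
    flux_gitrepository_url    = "https://github.com/example-org/grafana-dashboards" # hypothetical fork
    flux_gitrepository_branch = "main"
    flux_kustomization_name   = "grafana-dashboards-nvidia"
    flux_kustomization_path   = "./artifacts/grafana-operator-manifests/eks/gpu"
  }

  # ... remaining inputs as in your existing setup
}
```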
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
  - Amazon EKS:
    - Infrastructure: eks/index.md
    - EKS API server: eks/eks-apiserver.md
    - EKS GPU monitoring: eks/gpu-monitoring.md
    - Multicluster:
      - Single AWS account: eks/multicluster.md
      - Cross AWS account: eks/multiaccount.md
3 changes: 3 additions & 0 deletions modules/eks-monitoring/README.md
@@ -61,6 +61,7 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
| [kubectl_manifest.flux_gitrepository](https://registry.terraform.io/providers/alekc/kubectl/latest/docs/resources/manifest) | resource |
| [kubectl_manifest.flux_kustomization](https://registry.terraform.io/providers/alekc/kubectl/latest/docs/resources/manifest) | resource |
| [kubectl_manifest.kubeproxy_monitoring_dashboard](https://registry.terraform.io/providers/alekc/kubectl/latest/docs/resources/manifest) | resource |
| [kubectl_manifest.nvidia_monitoring_dashboards](https://registry.terraform.io/providers/alekc/kubectl/latest/docs/resources/manifest) | resource |
| [aws_caller_identity.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/caller_identity) | data source |
| [aws_eks_cluster.eks_cluster](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/eks_cluster) | data source |
| [aws_partition.current](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/partition) | data source |
@@ -93,6 +94,7 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
| <a name="input_enable_managed_prometheus"></a> [enable\_managed\_prometheus](#input\_enable\_managed\_prometheus) | Creates a new Amazon Managed Service for Prometheus Workspace | `bool` | `true` | no |
| <a name="input_enable_nginx"></a> [enable\_nginx](#input\_enable\_nginx) | Enable NGINX workloads monitoring, alerting and default dashboards | `bool` | `false` | no |
| <a name="input_enable_node_exporter"></a> [enable\_node\_exporter](#input\_enable\_node\_exporter) | Enables or disables Node exporter. Disabling this might affect some data in the dashboards | `bool` | `true` | no |
| <a name="input_enable_nvidia_monitoring"></a> [enable\_nvidia\_monitoring](#input\_enable\_nvidia\_monitoring) | Enables monitoring of nvidia metrics | `bool` | `true` | no |
| <a name="input_enable_recording_rules"></a> [enable\_recording\_rules](#input\_enable\_recording\_rules) | Enables or disables Managed Prometheus recording rules | `bool` | `true` | no |
| <a name="input_enable_tracing"></a> [enable\_tracing](#input\_enable\_tracing) | Enables tracing with OTLP traces receiver to X-Ray | `bool` | `true` | no |
| <a name="input_flux_config"></a> [flux\_config](#input\_flux\_config) | FluxCD configuration | <pre>object({<br> create_namespace = bool<br> k8s_namespace = string<br> helm_chart_name = string<br> helm_chart_version = string<br> helm_release_name = string<br> helm_repo_url = string<br> helm_settings = map(string)<br> helm_values = map(any)<br> })</pre> | <pre>{<br> "create_namespace": true,<br> "helm_chart_name": "flux2",<br> "helm_chart_version": "2.12.2",<br> "helm_release_name": "observability-fluxcd-addon",<br> "helm_repo_url": "https://fluxcd-community.github.io/helm-charts",<br> "helm_settings": {},<br> "helm_values": {},<br> "k8s_namespace": "flux-system"<br>}</pre> | no |
@@ -127,6 +129,7 @@ See examples using this Terraform modules in the **Amazon EKS** section of [this
| <a name="input_managed_prometheus_workspace_region"></a> [managed\_prometheus\_workspace\_region](#input\_managed\_prometheus\_workspace\_region) | Amazon Managed Prometheus Workspace's Region | `string` | `null` | no |
| <a name="input_ne_config"></a> [ne\_config](#input\_ne\_config) | Node exporter configuration | <pre>object({<br> create_namespace = bool<br> k8s_namespace = string<br> helm_chart_name = string<br> helm_chart_version = string<br> helm_release_name = string<br> helm_repo_url = string<br> helm_settings = map(string)<br> helm_values = map(any)<br><br> scrape_interval = string<br> scrape_timeout = string<br> })</pre> | <pre>{<br> "create_namespace": true,<br> "helm_chart_name": "prometheus-node-exporter",<br> "helm_chart_version": "4.24.0",<br> "helm_release_name": "prometheus-node-exporter",<br> "helm_repo_url": "https://prometheus-community.github.io/helm-charts",<br> "helm_settings": {},<br> "helm_values": {},<br> "k8s_namespace": "prometheus-node-exporter",<br> "scrape_interval": "60s",<br> "scrape_timeout": "60s"<br>}</pre> | no |
| <a name="input_nginx_config"></a> [nginx\_config](#input\_nginx\_config) | Configuration object for NGINX monitoring | <pre>object({<br> enable_alerting_rules = bool<br> enable_recording_rules = bool<br> enable_dashboards = bool<br> scrape_sample_limit = number<br><br> flux_gitrepository_name = string<br> flux_gitrepository_url = string<br> flux_gitrepository_branch = string<br> flux_kustomization_name = string<br> flux_kustomization_path = string<br><br> grafana_dashboard_url = string<br><br> prometheus_metrics_endpoint = string<br> })</pre> | `null` | no |
| <a name="input_nvidia_monitoring_config"></a> [nvidia\_monitoring\_config](#input\_nvidia\_monitoring\_config) | Config object for nvidia monitoring | <pre>object({<br> flux_gitrepository_name = string<br> flux_gitrepository_url = string<br> flux_gitrepository_branch = string<br> flux_kustomization_name = string<br> flux_kustomization_path = string<br> })</pre> | `null` | no |
| <a name="input_prometheus_config"></a> [prometheus\_config](#input\_prometheus\_config) | Controls default values such as scrape interval, timeouts and ports globally | <pre>object({<br> global_scrape_interval = string<br> global_scrape_timeout = string<br> })</pre> | <pre>{<br> "global_scrape_interval": "120s",<br> "global_scrape_timeout": "15s"<br>}</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | Additional tags (e.g. `map('BusinessUnit`,`XYZ`) | `map(string)` | `{}` | no |
| <a name="input_target_secret_name"></a> [target\_secret\_name](#input\_target\_secret\_name) | Target secret in Kubernetes to store the Grafana API Key Secret | `string` | `"grafana-admin-credentials"` | no |
20 changes: 20 additions & 0 deletions modules/eks-monitoring/dashboards.tf
@@ -95,6 +95,26 @@ YAML
  depends_on = [module.external_secrets]
}

# nvidia dashboards
resource "kubectl_manifest" "nvidia_monitoring_dashboards" {
  yaml_body = <<YAML
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: ${local.nvidia_monitoring_config.flux_kustomization_name}
  namespace: flux-system
spec:
  interval: 1m0s
  path: ${local.nvidia_monitoring_config.flux_kustomization_path}
  prune: true
  sourceRef:
    kind: GitRepository
    name: ${local.nvidia_monitoring_config.flux_gitrepository_name}
YAML
  count      = var.enable_nvidia_monitoring ? 1 : 0
  depends_on = [module.external_secrets]
}

resource "kubectl_manifest" "kubeproxy_monitoring_dashboard" {
  yaml_body = <<YAML
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
9 changes: 9 additions & 0 deletions modules/eks-monitoring/locals.tf
@@ -132,6 +132,15 @@ locals {
    }
  }

  nvidia_monitoring_config = {
    # can be overridden by providing a config
    flux_gitrepository_name   = try(var.nvidia_monitoring_config.flux_gitrepository_name, var.flux_gitrepository_name)
    flux_gitrepository_url    = try(var.nvidia_monitoring_config.flux_gitrepository_url, var.flux_gitrepository_url)
    flux_gitrepository_branch = try(var.nvidia_monitoring_config.flux_gitrepository_branch, var.flux_gitrepository_branch)
    flux_kustomization_name   = try(var.nvidia_monitoring_config.flux_kustomization_name, "grafana-dashboards-nvidia")
    flux_kustomization_path   = try(var.nvidia_monitoring_config.flux_kustomization_path, "./artifacts/grafana-operator-manifests/eks/gpu")
  }

  kubeproxy_monitoring_config = {
    # can be overridden by providing a config
    flux_gitrepository_name = try(var.kubeproxy_monitoring_config.flux_gitrepository_name, var.flux_gitrepository_name)
4 changes: 4 additions & 0 deletions modules/eks-monitoring/main.tf
@@ -196,6 +196,10 @@ module "helm_addon" {
      name  = "enableAdotcollectorMetrics"
      value = var.enable_adotcollector_metrics
    },
    {
      name  = "enableGpuMonitoring"
      value = var.enable_nvidia_monitoring
    },
    {
      name  = "serviceAccount"
      value = local.kube_service_account_name
20 changes: 20 additions & 0 deletions modules/eks-monitoring/variables.tf
@@ -552,6 +552,26 @@ variable "enable_adotcollector_metrics" {
  default     = true
}

variable "enable_nvidia_monitoring" {
  description = "Enables monitoring of NVIDIA metrics"
  type        = bool
  default     = true
}

variable "nvidia_monitoring_config" {
  description = "Config object for NVIDIA monitoring"
  type = object({
    flux_gitrepository_name   = string
    flux_gitrepository_url    = string
    flux_gitrepository_branch = string
    flux_kustomization_name   = string
    flux_kustomization_path   = string
  })

  # defaults are pre-computed in locals.tf, provide a full definition to override
  default = null
}

variable "adothealth_monitoring_config" {
  description = "Config object for ADOT health monitoring"
  type = object({
