# Monitor NVIDIA GPU Usage and GPU-Related Metrics in Grafana

## Overview of the Stack
| Component | Purpose |
|---|---|
| NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet) |
| Prometheus | Scrapes GPU metrics from DCGM Exporter |
| Grafana | Visualizes metrics using dashboards |
## Step-by-Step Setup

### 1. Install NVIDIA DCGM Exporter
The DCGM Exporter (Data Center GPU Manager) exposes GPU metrics in Prometheus format.
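One way to install it is via NVIDIA's Helm chart, which deploys the exporter as a DaemonSet on GPU nodes (repo alias and namespace below are illustrative choices, not requirements):

```shell
# Add NVIDIA's dcgm-exporter Helm repository and install the chart
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

# Install into a "monitoring" namespace (any namespace works)
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace
```

The chart requires nodes with NVIDIA drivers and the NVIDIA device plugin already in place.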
You should see Prometheus-format metrics such as `DCGM_FI_DEV_GPU_UTIL`.
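A quick way to check is to port-forward the exporter's Service and curl its metrics endpoint (the Service name assumes the Helm release above; 9400 is the exporter's default port):

```shell
# Forward the exporter's metrics port to localhost
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &

# Fetch the metrics and look for GPU utilization
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```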
### 2. Ensure Prometheus Is Installed and Scraping

If you are using the Prometheus Operator (installed via Helm), add the scrape configuration through a ServiceMonitor resource. With kube-prometheus-stack, this happens automatically as long as the exporter's Service carries the labels your Prometheus instance selects.
To scrape manually, add this ServiceMonitor:
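A ServiceMonitor along these lines should work; the `release` label, namespace, and port name are assumptions that must match your Prometheus Operator's selectors and the exporter Service's actual port name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus   # must match your Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics       # port name as defined on the exporter Service
      interval: 15s
```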
### 3. Import a Grafana Dashboard

Use NVIDIA's official GPU dashboards:

- Open Grafana
- Go to Dashboards > Import
- Use one of the following dashboard IDs:
| Name | Grafana.com ID |
|---|---|
| NVIDIA DCGM Exporter GPU Dashboard | 12239 |
| Kubernetes GPU Monitoring | 15176 |
You can also customize and clone these dashboards.
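If you prefer to script the import instead of using the UI, a sketch using Grafana's HTTP API looks like this (`GRAFANA_URL`, `GRAFANA_TOKEN`, the data source name, and the `DS_PROMETHEUS` input name are assumptions for your environment):

```shell
# Download the dashboard JSON for ID 12239 from grafana.com
curl -s https://grafana.com/api/dashboards/12239/revisions/latest/download \
  -o dcgm-dashboard.json

# Import it via the Grafana HTTP API, wiring the Prometheus data source
curl -s -X POST "$GRAFANA_URL/api/dashboards/import" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat dcgm-dashboard.json),
       \"overwrite\": true,
       \"inputs\": [{\"name\": \"DS_PROMETHEUS\",
                     \"type\": \"datasource\",
                     \"pluginId\": \"prometheus\",
                     \"value\": \"Prometheus\"}]}"
```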
### 4. Verify GPU Metrics in Prometheus
Run queries in the Prometheus UI or via Grafana Explore to confirm GPU metrics are present.

## Optional: Node Exporter GPU Plugin (Advanced)

If you want host-level detail alongside GPU metrics, you can also use node_exporter with custom GPU collection scripts, but dcgm-exporter is the preferred method on Kubernetes.
## Summary
| Tool | Role |
|---|---|
| DCGM Exporter | Exposes GPU metrics |
| Prometheus | Scrapes and stores GPU metrics |
| Grafana | Visualizes with dashboards (12239, 15176) |
