Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana
✅ Overview of the Stack
| Component | Purpose |
|---|---|
| NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet) |
| Prometheus | Scrapes GPU metrics from DCGM Exporter |
| Grafana | Visualizes metrics using dashboards |
🛠 Step-by-Step Setup
✅ 1. Install NVIDIA DCGM Exporter
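The DCGM (Data Center GPU Manager) Exporter exposes GPU metrics in Prometheus format and runs as a DaemonSet, with one pod per GPU node.
Once it is running, the exporter's /metrics endpoint (port 9400 by default) should return Prometheus-format metrics like DCGM_FI_DEV_GPU_UTIL.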
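To make the expected output concrete, here is a minimal Python sketch that pulls one metric out of Prometheus-format text. The sample text, its label names (`gpu`, `Hostname`) and its values are made up for illustration and were not captured from a real exporter. The parser is deliberately simplified and does not handle every corner of the exposition format.

```python
import re

# Illustrative sample only: labels and values are invented for this sketch.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 0
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 1024
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text, metric_name):
    """Return a list of (labels_dict, float_value) for one metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for labels, value in parse_samples(SAMPLE_METRICS, "DCGM_FI_DEV_GPU_UTIL"):
        print(f"GPU {labels.get('gpu')}: {value:.0f}% utilization")
```

To try it against a live exporter, you could replace `SAMPLE_METRICS` with the body fetched from the exporter's /metrics endpoint, for example through a `kubectl port-forward` to the exporter pod.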
✅ 2. Ensure Prometheus is Installed and Scraping
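If you use the Prometheus Operator (installed from Helm), configure scraping through a ServiceMonitor.
With kube-prometheus-stack, discovery is automatic as long as the ServiceMonitor carries the labels your Prometheus instance selects on.
To scrape manually, add a ServiceMonitor that selects the DCGM Exporter Service and its metrics port.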
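As a rough illustration of what such a ServiceMonitor contains, the Python sketch below builds the manifest as a dict and prints it as JSON, which `kubectl apply -f -` accepts just like YAML. Every default here is an assumption you will need to match to your own cluster: the `monitoring` namespace, the Service label `app.kubernetes.io/name: dcgm-exporter`, the port name `metrics`, and the `release` label value that your Prometheus instance selects on.

```python
import json


def build_service_monitor(
    name="dcgm-exporter",
    namespace="monitoring",                    # assumed namespace
    service_labels=None,
    port_name="metrics",                       # assumed Service port name
    interval="15s",
    release_label="kube-prometheus-stack",     # assumed Helm release name
):
    """Build a ServiceMonitor manifest as a plain dict.

    All defaults are placeholders; check them against your Service
    (kubectl get svc --show-labels) and your Prometheus selector.
    """
    if service_labels is None:
        service_labels = {"app.kubernetes.io/name": "dcgm-exporter"}  # assumed label
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label the Prometheus instance is assumed to select on.
            "labels": {"release": release_label},
        },
        "spec": {
            # Which Services to scrape, matched by label.
            "selector": {"matchLabels": service_labels},
            # Look for matching Services in every namespace.
            "namespaceSelector": {"any": True},
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_service_monitor(), indent=2))
```

If the script is saved as, say, `make_servicemonitor.py` (a hypothetical file name), its output can be piped straight into `kubectl apply -f -`. If Prometheus does not pick up the target, the `release` label and the Service label selector are the first things to check.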
✅ 3. Install Grafana Dashboard
Use NVIDIA’s official GPU dashboards:
- Open Grafana
- Go to Dashboards > Import
- Use one of the following dashboard IDs:
| Name | Grafana.com ID |
|---|---|
| NVIDIA DCGM Exporter GPU Dashboard | 12239 |
| Kubernetes GPU Monitoring | 15176 |
You can also customize and clone these dashboards.
✅ 4. Verify GPU Metrics in Prometheus
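If you would rather check from a script than from a UI, the Prometheus HTTP API exposes instant queries at `/api/v1/query`. The sketch below is a minimal example under some assumptions: Prometheus is reachable at a base URL you supply (for example via a port-forward to `localhost:9090`), and the series carry `Hostname` and `gpu` labels. Adjust the label names if your exporter uses different ones.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def extract_values(response):
    """Map (Hostname, gpu) to the current sample value.

    The label names are assumptions about the exporter's label set.
    """
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    values = {}
    for series in response["data"]["result"]:
        labels = series["metric"]
        key = (labels.get("Hostname", "?"), labels.get("gpu", "?"))
        # An instant-vector sample is [unix_timestamp, "value as string"].
        values[key] = float(series["value"][1])
    return values


def query_prometheus(base_url, promql, timeout=10):
    """Run one instant query and return the parsed values."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    # Only prints the URL; call query_prometheus() once Prometheus is reachable.
    print(build_query_url("http://localhost:9090", "DCGM_FI_DEV_GPU_UTIL"))
```

Calling `query_prometheus("http://localhost:9090", "DCGM_FI_DEV_GPU_UTIL")` should return one entry per GPU when scraping works. An empty result usually means Prometheus has not discovered the exporter target yet.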
Run a query such as DCGM_FI_DEV_GPU_UTIL in the Prometheus UI or via Grafana Explore; it should return one series per GPU.
🔒 Optional: Node Exporter GPU Plugin (Advanced)
If you want host-level detail alongside GPU metrics, you can also use node_exporter with custom GPU scripts, but dcgm-exporter is the preferred method for Kubernetes.
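One common way to wire a custom GPU script into node_exporter is its textfile collector, which reads `*.prom` files from the directory passed with `--collector.textfile.directory`. The sketch below shows the idea: it converts `nvidia-smi` CSV output into Prometheus text. The metric names (`custom_gpu_*`) are hypothetical names chosen for this example, and the script assumes `nvidia-smi` is on the PATH of the host.

```python
import os
import shutil
import subprocess
import sys

QUERY_FIELDS = "index,utilization.gpu,memory.used"


def read_nvidia_smi():
    """Run nvidia-smi and return its CSV output, one line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def to_prom_text(csv_text):
    """Convert 'index, util, mem' rows into Prometheus exposition text."""
    util_lines, mem_lines = [], []
    for row in csv_text.strip().splitlines():
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 3:
            continue
        index, util, mem = fields
        try:
            float(util)
            float(mem)
        except ValueError:
            continue  # skip rows with non-numeric fields such as "[N/A]"
        # Metric names below are hypothetical, picked for this sketch.
        util_lines.append(f'custom_gpu_utilization_percent{{gpu="{index}"}} {util}')
        mem_lines.append(f'custom_gpu_memory_used_mib{{gpu="{index}"}} {mem}')
    lines = ["# TYPE custom_gpu_utilization_percent gauge", *util_lines,
             "# TYPE custom_gpu_memory_used_mib gauge", *mem_lines]
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temp file, then rename, so a partial file is never scraped."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w") as handle:
        handle.write(text)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        text = to_prom_text(read_nvidia_smi())
    else:
        text = to_prom_text("0, 37, 1024\n")  # made-up sample for hosts without a GPU
    if len(sys.argv) > 1:
        write_atomically(sys.argv[1], text)   # e.g. <textfile dir>/gpu.prom
    else:
        print(text, end="")
```

You would typically run something like this from cron or a systemd timer and pass it a path inside the textfile directory. It covers far fewer metrics than dcgm-exporter, which is one reason dcgm-exporter remains the better default on Kubernetes.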
✅ Summary
| Tool | Role |
|---|---|
| DCGM Exporter | Exposes GPU metrics |
| Prometheus | Scrapes and stores GPU metrics |
| Grafana | Visualizes with dashboards (12239, 15176) |
Would you like a Helm chart-based setup for Prometheus + Grafana + DCGM on EKS?
