
Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana


✅ Overview of the Stack

| Component | Purpose |
| --- | --- |
| NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet) |
| Prometheus | Scrapes GPU metrics from DCGM Exporter |
| Grafana | Visualizes metrics using dashboards |

🛠 Step-by-Step Setup

✅ 1. Install NVIDIA DCGM Exporter

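DCGM (Data Center GPU Manager) is NVIDIA's GPU management toolkit; the DCGM Exporter publishes its GPU metrics in Prometheus format. Deploy it as a DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployments/k8s/dcgm-exporter.yaml
Verify that the pods are running and that the metrics endpoint responds. The commands below assume the exporter and its Service live in the gpu-operator namespace; substitute the namespace your installation uses. Run curl in a second terminal, because port-forward stays in the foreground:
kubectl get pods -n gpu-operator
kubectl port-forward svc/dcgm-exporter 9400:9400 -n gpu-operator
curl http://localhost:9400/metrics
The response should contain Prometheus-format metrics such as DCGM_FI_DEV_GPU_UTIL.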
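
The raw /metrics output is long, so it can help to filter it down to a single metric. The following is a minimal sketch, not part of the exporter: gpu_util is a hypothetical helper name, and it assumes each sample line carries a gpu="<index>" label, which you should confirm against your own output.

```shell
# Hypothetical helper: read Prometheus text format on stdin and print
# one "gpu=<index> util=<value>" line per DCGM_FI_DEV_GPU_UTIL sample.
# Assumes each sample has a gpu="<index>" label; check your exporter output.
gpu_util() {
  awk '/^DCGM_FI_DEV_GPU_UTIL[{]/ {
    if (match($0, /gpu="[^"]*"/)) {
      id = substr($0, RSTART + 5, RLENGTH - 6)  # strip gpu=" and the closing quote
      print "gpu=" id " util=" $NF              # the sample value is the last field
    }
  }'
}
```

With the port-forward from the verification step still running, it could be used as: curl -s http://localhost:9400/metrics | gpu_util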

✅ 2. Ensure Prometheus is Installed and Scraping

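If you run the Prometheus Operator (for example, installed from Helm), register the exporter as a scrape target with a ServiceMonitor.

With kube-prometheus-stack, Prometheus picks up a ServiceMonitor automatically when its labels match the chart's ServiceMonitor selector, which by default is the Helm release label. To add the target manually, apply this ServiceMonitor (the release: prometheus label assumes a Helm release named prometheus, and the port name must match the named port on the exporter's Service):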
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    - port: "metrics"
      interval: 30s
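
A ServiceMonitor only produces a scrape target if a Service matches both its label selector and its named port. As a rough check under the same assumptions as the manifest above (namespace gpu-operator, label app.kubernetes.io/name: dcgm-exporter, port name metrics), this sketch lists the matching Services and their port names. matching_service_ports is a hypothetical helper name.

```shell
# Hypothetical helper: list Services that the ServiceMonitor selector would match,
# one per line, as "<service name><TAB><port names>".
# The namespace defaults to gpu-operator (an assumption carried over from the manifest).
matching_service_ports() {
  kubectl get svc -n "${1:-gpu-operator}" \
    -l app.kubernetes.io/name=dcgm-exporter \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ports[*].name}{"\n"}{end}'
}
```

Call it as matching_service_ports, or pass another namespace as the first argument. If the output is empty, or no listed port is named metrics, the ServiceMonitor will not yield any targets.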

✅ 3. Install Grafana Dashboard

Import a prebuilt GPU dashboard from grafana.com:
  • Open Grafana
  • Go to Dashboards > Import
  • Enter one of the following dashboard IDs:

| Name | Grafana.com ID |
| --- | --- |
| NVIDIA DCGM Exporter GPU Dashboard | 12239 |
| Kubernetes GPU Monitoring | 15176 |

After import, you can clone and customize these dashboards.

✅ 4. Verify GPU Metrics in Prometheus

Run these queries in the Prometheus UI or in Grafana Explore:
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED
They report GPU utilization (%), memory copy utilization (%), and framebuffer memory used (MiB), respectively.
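
Raw per-GPU series are usually aggregated before they go on a dashboard. The sketch below sends PromQL expressions to the Prometheus HTTP API at /api/v1/query. It assumes Prometheus is reachable at http://localhost:9090 (for example through a port-forward), that the exporter attaches Hostname and gpu labels, and that DCGM_FI_DEV_FB_FREE is enabled in your exporter's metric list. prom_query and PROM_URL are hypothetical names.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumed variable; it defaults to a local port-forward address.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example expressions (label names and DCGM_FI_DEV_FB_FREE are assumptions):
#   prom_query 'avg(DCGM_FI_DEV_GPU_UTIL)'                      # mean utilization across all GPUs
#   prom_query 'max by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL)'  # utilization per GPU per node
#   prom_query '100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)'  # % framebuffer used
```

The same expressions can be pasted directly into the Prometheus UI or Grafana Explore; the helper is only a convenience for checking them from a terminal.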

🔒 Optional: Node Exporter GPU Plugin (Advanced)

If you want host-level detail alongside GPU metrics, you can also run node_exporter with custom GPU scripts (for example, scripts that write metrics for its textfile collector), but dcgm-exporter is the preferred method for Kubernetes.
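
As a sketch only, here is one way such a custom script could look. node_exporter's textfile collector reads *.prom files from the directory given by --collector.textfile.directory, so a script can convert nvidia-smi output into Prometheus text format and write it there. The metric names, the smi_to_prom helper, and the output path below are all hypothetical.

```shell
# Hypothetical helper: convert "index, utilization, memory_used" CSV rows on stdin
# into Prometheus text format. Metric names are illustrative, not a standard.
smi_to_prom() {
  awk -F', *' '{
    printf "nvidia_smi_gpu_utilization_percent{gpu=\"%s\"} %s\n", $1, $2
    printf "nvidia_smi_memory_used_mib{gpu=\"%s\"} %s\n", $1, $3
  }'
}

# Usage on a GPU host (the output path is an assumption; it must match the
# directory passed to node_exporter via --collector.textfile.directory):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#     --format=csv,noheader,nounits | smi_to_prom > /var/lib/node_exporter/textfile/gpu.prom
```

A script like this would typically run on a timer (cron or similar), which is part of why dcgm-exporter is the simpler choice on Kubernetes.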

✅ Summary

| Tool | Role |
| --- | --- |
| DCGM Exporter | Exposes GPU metrics |
| Prometheus | Scrapes and stores GPU metrics |
| Grafana | Visualizes with dashboards (12239, 15176) |
