
Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana


✅ Overview of the Stack

Component | Purpose
NVIDIA DCGM Exporter | Exposes GPU metrics from nodes (via DaemonSet)
Prometheus | Scrapes GPU metrics from DCGM Exporter
Grafana | Visualizes metrics using dashboards

🛠 Step-by-Step Setup

✅ 1. Install NVIDIA DCGM Exporter

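DCGM Exporter exposes GPU metrics in Prometheus format. It is built on NVIDIA's Data Center GPU Manager (DCGM).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployments/k8s/dcgm-exporter.yaml
Verify. The Service name and namespace depend on how the exporter was installed; the gpu-operator namespace used here applies when it was deployed through the NVIDIA GPU Operator, so substitute your own if it differs:
kubectl get pods -n gpu-operator
kubectl port-forward svc/dcgm-exporter 9400:9400 -n gpu-operator
The port-forward keeps running in the foreground, so run the next command from a second terminal:
curl http://localhost:9400/metrics
You should see Prometheus-format metrics such as DCGM_FI_DEV_GPU_UTIL.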
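To check the exporter output programmatically rather than by eye, the text returned by the curl command can be parsed. The following is a minimal sketch, assuming the standard Prometheus text exposition format. The helper name parse_gauge is hypothetical, and the sample payload (including its label values) is made up for illustration, not captured from a real exporter.

```python
import re

# Matches sample lines such as: NAME{label="value",...} 42  (labels are optional)
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_gauge(text, metric_name):
    """Return a list of (labels_dict, float_value) for one metric name."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_RE.match(line)
        if m is None or m.group(1) != metric_name:
            continue
        labels = dict(_LABEL_RE.findall(m.group(2) or ""))
        samples.append((labels, float(m.group(3))))
    return samples


# Hand-written sample in the exposition format (values are illustrative only).
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 0
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 1024
'''

for labels, value in parse_gauge(SAMPLE, "DCGM_FI_DEV_GPU_UTIL"):
    print(f"GPU {labels['gpu']}: {value:.0f}% utilization")
```

An empty result for DCGM_FI_DEV_GPU_UTIL on a real payload would suggest the exporter is running but not seeing any GPUs on that node.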

✅ 2. Ensure Prometheus is Installed and Scraping

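If you run the Prometheus Operator (for example through the kube-prometheus-stack Helm chart), scrape targets are declared with ServiceMonitor resources.

With kube-prometheus-stack, a ServiceMonitor is picked up automatically when its labels match the chart's ServiceMonitor selector, which by default is release: <Helm release name>. If the exporter is not being scraped yet, add this ServiceMonitor and set the release label to your own release name: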
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    - port: "metrics"
      interval: 30s
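A frequent reason for the target not showing up in Prometheus is a label mismatch between spec.selector.matchLabels and the labels on the exporter's Service. The sketch below models the matching rule only (every key/value pair in the selector must be present on the object). The function name match_labels and the example Service labels are hypothetical; compare against the real labels from kubectl get svc --show-labels.

```python
def match_labels(selector, labels):
    """True if every key/value pair in `selector` is present in `labels`."""
    return all(labels.get(key) == value for key, value in selector.items())


# Selector copied from the ServiceMonitor above.
selector = {"app.kubernetes.io/name": "dcgm-exporter"}

# Hypothetical Service labels, for illustration only.
matching_service = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "example",
}
other_service = {"app": "dcgm-exporter"}

print(match_labels(selector, matching_service))  # selector key is present with the same value
print(match_labels(selector, other_service))     # selector key is missing on this Service
```

The same rule applies one level up: the Prometheus resource selects ServiceMonitors by their metadata labels, which is why the release label matters.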

✅ 3. Install Grafana Dashboard

Import a prebuilt GPU dashboard from grafana.com:
  • Open Grafana
  • Go to Dashboards > Import
  • Enter one of the following dashboard IDs:
Name | Grafana.com ID
NVIDIA DCGM Exporter GPU Dashboard | 12239
Kubernetes GPU Monitoring | 15176
You can also clone these dashboards and customize the copies.

✅ 4. Verify GPU Metrics in Prometheus

Run in Prometheus UI or via Grafana Explore:
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED
They report, respectively, GPU utilization (%), memory copy utilization (%), and framebuffer memory in use (MiB).
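The same check can be scripted against the Prometheus HTTP API, which serves instant queries at /api/v1/query. This is a minimal sketch using only the standard library. The function names are hypothetical, the base URL is an assumption (use whatever address your Prometheus is reachable at, for example through a port-forward), and the sample response is hand-written in the documented response shape rather than taken from a real cluster.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(payload):
    """Turn an instant-vector response into [(labels, float_value), ...]."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample's "value" is [unix_timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(base_url, promql, timeout=10):
    """Run an instant query against a live Prometheus (needs network access)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Offline example: a hand-written response, illustrative values only.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "0"},
             "value": [1700000000, "37"]},
        ],
    },
}

for labels, value in parse_vector(SAMPLE_RESPONSE):
    print(f"gpu={labels.get('gpu')} util={value}")
```

Against a live server this would be called as query("http://localhost:9090", "DCGM_FI_DEV_GPU_UTIL"), assuming Prometheus is reachable on that address. An empty list means the query succeeded but no series matched, which usually points back to the scrape configuration in step 2.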

🔒 Optional: Node Exporter GPU Plugin (Advanced)

If you want host-level detail alongside GPU metrics, you can also run node_exporter with custom GPU scripts (for example, scripts that write to its textfile collector directory), but dcgm-exporter is the preferred method for Kubernetes.
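For reference, a custom GPU script of this kind could look like the sketch below. It assumes nvidia-smi is available on the host and that node_exporter's textfile collector is enabled; the metric names custom_gpu_utilization_percent and custom_gpu_memory_used_mib are hypothetical, not names that any exporter defines. Only the formatting step runs in the example; collect() needs a real GPU host.

```python
import subprocess

# nvidia-smi query flags; memory.used is reported in MiB when nounits is set.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def to_prom_text(csv_output):
    """Convert nvidia-smi CSV rows into Prometheus text-format lines."""
    # Keep all samples of one metric together, after its TYPE line.
    util_lines = ["# TYPE custom_gpu_utilization_percent gauge"]
    mem_lines = ["# TYPE custom_gpu_memory_used_mib gauge"]
    for row in csv_output.strip().splitlines():
        index, util, mem_used = (field.strip() for field in row.split(","))
        util_lines.append(f'custom_gpu_utilization_percent{{gpu="{index}"}} {util}')
        mem_lines.append(f'custom_gpu_memory_used_mib{{gpu="{index}"}} {mem_used}')
    return "\n".join(util_lines + mem_lines) + "\n"


def collect():
    """Run nvidia-smi on the host and return text for a .prom file."""
    out = subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True, check=True)
    return to_prom_text(out.stdout)


# Formatting step only, on made-up nvidia-smi output.
print(to_prom_text("0, 37, 1024\n1, 0, 3\n"), end="")
```

The sketch does not handle fields that nvidia-smi reports as unavailable on some GPUs; a real script would need to skip or sanitize those values before writing the file.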

✅ Summary

Tool | Role
DCGM Exporter | Exposes GPU metrics
Prometheus | Scrapes and stores GPU metrics
Grafana | Visualizes with dashboards (12239, 15176)
