GPU DCGM Exporter - ahmadrazalab

✅ Overview of the Stack

Component	Purpose
NVIDIA DCGM Exporter	Exposes GPU metrics from nodes (via DaemonSet)
Prometheus	Scrapes GPU metrics from DCGM Exporter
Grafana	Visualizes metrics using dashboards

🛠 Step-by-Step Setup

✅ 1. Install NVIDIA DCGM Exporter

DCGM Exporter is built on NVIDIA Data Center GPU Manager (DCGM) and exposes GPU metrics in Prometheus format.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/deployments/k8s/dcgm-exporter.yaml

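Verify (these commands assume the exporter runs in the `gpu-operator` namespace; change `-n` if you deployed it elsewhere):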

kubectl get pods -n gpu-operator
kubectl port-forward svc/dcgm-exporter 9400:9400 -n gpu-operator
curl http://localhost:9400/metrics

You should see Prometheus-format metrics like DCGM_FI_DEV_GPU_UTIL.
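
To check the scrape output programmatically rather than by eye, the response body can be parsed as Prometheus text exposition format. The sketch below is a minimal example under stated assumptions: the sample payload, its label values, and the helper name `parse_gauge` are illustrative, not real exporter output, and the parser only handles simple `name{labels} value` lines (no escaped quotes or commas inside label values).

```python
import re

# Illustrative sample of what `curl http://localhost:9400/metrics` might return.
# The label values here are made up for the example.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 0
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 1024
'''

# metric name, optional {labels}, whitespace, value
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_gauge(text, metric):
    """Return [(labels_dict, float_value)] for every sample of `metric`."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m and m.group("name") == metric:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((labels, float(m.group("value"))))
    return samples


if __name__ == "__main__":
    for labels, value in parse_gauge(SAMPLE, "DCGM_FI_DEV_GPU_UTIL"):
        print(f"GPU {labels.get('gpu')} on {labels.get('Hostname')}: {value:.0f}%")
```

To run it against a live exporter, replace `SAMPLE` with the body returned by the port-forwarded endpoint.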

✅ 2. Ensure Prometheus is Installed and Scraping

If you run the Prometheus Operator (installed via Helm), scrape targets are declared with a `ServiceMonitor` resource.

Add scrape config to `ServiceMonitor`

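With kube-prometheus-stack, Prometheus by default only discovers ServiceMonitors labeled with the Helm release name (`release: <release-name>`), so scraping starts automatically once the labels match. To add the scrape target manually, apply the ServiceMonitor below, changing `release: prometheus` if your release is named differently: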

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    - port: "metrics"
      interval: 30s
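
Two label matches have to line up for this to work: Prometheus must select the ServiceMonitor (by its `metadata.labels`), and the ServiceMonitor must select the exporter's Service (by `spec.selector.matchLabels`). As a rough illustration of the `matchLabels` rule, where every key/value pair in the selector must be present on the target, here is a small sketch. The Service labels and the Prometheus selector shown are assumptions for the example; check the real values in your cluster, for instance with `kubectl get svc -n gpu-operator --show-labels`.

```python
def match_labels(selector, labels):
    """Kubernetes matchLabels semantics: every selector pair must appear in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Values taken from the ServiceMonitor above.
service_monitor_labels = {"release": "prometheus"}
service_monitor_selector = {"app.kubernetes.io/name": "dcgm-exporter"}

# Assumed values: what the Prometheus resource selects on, and how the
# exporter's Service is labelled. Replace with what your cluster reports.
prometheus_selector = {"release": "prometheus"}
service_labels = {
    "app.kubernetes.io/name": "dcgm-exporter",
    "app.kubernetes.io/version": "example",
}

if __name__ == "__main__":
    print("Prometheus selects ServiceMonitor:",
          match_labels(prometheus_selector, service_monitor_labels))
    print("ServiceMonitor selects Service:",
          match_labels(service_monitor_selector, service_labels))
```

If either check would be false in your cluster, the target will not appear in Prometheus, even though the ServiceMonitor applies without error.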

✅ 3. Install Grafana Dashboard

Use NVIDIA’s official GPU dashboards:

Open Grafana
Go to Dashboards > Import
Use one of the following dashboard IDs:

Name	Grafana.com ID
NVIDIA DCGM Exporter GPU Dashboard	`12239`
Kubernetes GPU Monitoring	`15176`

You can also customize and clone these dashboards.

✅ 4. Verify GPU Metrics in Prometheus

Run these queries in the Prometheus UI or in Grafana Explore:

DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED

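DCGM_FI_DEV_GPU_UTIL reports GPU utilization (in %), DCGM_FI_DEV_MEM_COPY_UTIL reports memory utilization (in %), and DCGM_FI_DEV_FB_USED reports framebuffer memory used (in MiB).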
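
The same checks can be scripted against the Prometheus HTTP API (`GET /api/v1/query`). The sketch below builds instant-query URLs and, when run directly, fetches them. It assumes Prometheus is reachable at `http://localhost:9090` (for example through a port-forward) and that the exporter attaches a `Hostname` label to its series; adjust both to your setup.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local port-forward

# Example PromQL expressions built on the metrics listed above.
QUERIES = {
    "gpu_util_percent_by_host": "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)",
    "fb_used_gib": "DCGM_FI_DEV_FB_USED / 1024",  # metric is reported in MiB
}


def query_url(base, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base, promql):
    """Execute an instant query and return the list of result series."""
    with urllib.request.urlopen(query_url(base, promql), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


if __name__ == "__main__":
    for name, promql in QUERIES.items():
        for series in run_query(PROMETHEUS_URL, promql):
            # Each instant-vector sample is [timestamp, "value-as-string"].
            print(name, series["metric"], series["value"][1])
```

An empty result list for a query usually means Prometheus is not scraping the exporter yet, so the Targets page is the next place to look.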

🔒 Optional: Node Exporter GPU Plugin (Advanced)

If you want host-level metrics alongside GPU metrics, you can also run node_exporter with custom GPU scripts (for example, through its textfile collector). For Kubernetes, dcgm-exporter remains the preferred method.
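
As a rough idea of what such a custom script could look like, the sketch below turns `nvidia-smi` CSV output into a file for node_exporter's textfile collector. The metric names (`node_gpu_utilization_percent`, `node_gpu_memory_used_mib`) and the output path are hypothetical choices for this example. node_exporter must be started with `--collector.textfile.directory` pointing at the directory that holds the `.prom` file.

```python
import subprocess

# Query per-GPU index, utilization (%), and memory used (MiB) as bare CSV.
NVIDIA_SMI = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def to_prom(csv_text):
    """Convert 'index, util, mem_used' CSV rows into Prometheus text format."""
    util, mem = [], []
    for row in csv_text.strip().splitlines():
        index, gpu_util, mem_used = (field.strip() for field in row.split(","))
        util.append(f'node_gpu_utilization_percent{{gpu="{index}"}} {gpu_util}')
        mem.append(f'node_gpu_memory_used_mib{{gpu="{index}"}} {mem_used}')
    lines = ["# TYPE node_gpu_utilization_percent gauge", *util,
             "# TYPE node_gpu_memory_used_mib gauge", *mem]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical output path; must match --collector.textfile.directory.
    output = "/var/lib/node_exporter/textfile/gpu.prom"
    csv_text = subprocess.run(
        NVIDIA_SMI, capture_output=True, text=True, check=True
    ).stdout
    with open(output, "w") as fh:
        fh.write(to_prom(csv_text))
```

A production script would typically write to a temporary file and rename it into place, so node_exporter never reads a half-written file, and would handle non-numeric fields that `nvidia-smi` can report on some GPUs.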

✅ Summary

Tool	Role
DCGM Exporter	Exposes GPU metrics
Prometheus	Scrapes and stores GPU metrics
Grafana	Visualizes with dashboards (`12239`, `15176`)

Monitor GPU usage (NVIDIA GPUs) and GPU-related metrics in Grafana