Enable GPU-level metrics on EKS using CloudWatch Container Insights and visualize them in Grafana, without the complexity of operator/runtime troubleshooting.
📦 Step 1: Enable Container Insights with GPU Support
AWS now supports GPU observability natively through the Container Insights Enhanced Observability add-on.
✅ Automatic Setup (recommended):
- Ensure your EKS cluster has an OIDC provider enabled.
- Attach the necessary IAM permissions for the CloudWatch agent.
- Enable the Container Insights add-on with GPU support.
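The three steps above can be sketched with boto3 roughly as follows. This is a minimal illustration, not the exact commands of the original setup: the function name `enable_container_insights` is hypothetical, and the add-on name and managed policy ARN are assumptions based on the standard CloudWatch Observability add-on. The function takes already-created boto3 `eks` and `iam` clients so that it has no hidden dependencies.

```python
"""Sketch: enable the CloudWatch Observability add-on on an existing EKS cluster."""

# Assumption: AWS-managed policy commonly attached for the CloudWatch agent.
AGENT_POLICY_ARN = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
# Assumption: name of the EKS add-on that ships Container Insights.
ADDON_NAME = "amazon-cloudwatch-observability"


def enable_container_insights(eks, iam, cluster_name, node_role_name):
    """Run the three setup steps; `eks` and `iam` are boto3 clients."""
    # 1. Read the cluster's OIDC issuer URL. An IAM OIDC provider still has
    #    to be associated with this issuer if it is not already.
    cluster = eks.describe_cluster(name=cluster_name)["cluster"]
    issuer = cluster["identity"]["oidc"]["issuer"]

    # 2. Allow the worker-node role to publish metrics and logs to CloudWatch.
    iam.attach_role_policy(RoleName=node_role_name, PolicyArn=AGENT_POLICY_ARN)

    # 3. Install the add-on on the cluster.
    eks.create_addon(clusterName=cluster_name, addonName=ADDON_NAME)
    return issuer
```

A call might look like `enable_container_insights(boto3.client("eks"), boto3.client("iam"), "my-gpu-cluster", "my-node-role")`, where both names are placeholders for your own cluster and node role.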
🧩 Step 2: Verify GPU Metrics in CloudWatch
After a few minutes, go to the CloudWatch console → Container Insights → EKS. You’ll see multiple built-in dashboards. Look for GPU-specific panels showing:
- GPU Utilization
- GPU Memory Usage
- GPU Temperature
- GPU Power Consumption
No need to install nvidia-device-plugin by hand or wade through heaps of Helm setup.
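If you prefer to confirm from a script that GPU metrics are arriving, a small check against the CloudWatch `ListMetrics` API can help. The sketch below assumes the metrics live in the `ContainerInsights` namespace with a `ClusterName` dimension; the helper name `list_gpu_metrics` is hypothetical, and `cloudwatch` is expected to be a boto3 CloudWatch client.

```python
def list_gpu_metrics(cloudwatch, cluster_name):
    """Return the sorted names of ContainerInsights metrics containing 'gpu'."""
    names = set()
    kwargs = {
        "Namespace": "ContainerInsights",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
    }
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        for metric in page.get("Metrics", []):
            if "gpu" in metric["MetricName"]:
                names.add(metric["MetricName"])
        # ListMetrics is paginated; keep going until no token is returned.
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(names)
```

An empty result a few minutes after enabling the add-on usually means the agent is not yet running on the GPU nodes or lacks permission to publish metrics.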
📈 Step 3: Visualize in Grafana
You can use Amazon Managed Grafana or your own self-managed Grafana.
If using Managed Grafana:
- Create or use an existing Grafana workspace.
- Add CloudWatch as a data source (supports Container Insights).
- Import a GPU dashboard:
  - Browse dashboards on Grafana.com, or import a custom one.
  - Alternatively, build your own panel using metrics like ContainerInsights → node_gpu_usage_total across ClusterName, NodeName ([aws.amazon.com][5], [grafana.com][6], [docs.aws.amazon.com][7]).
If using self-hosted Grafana:
- Configure the CloudWatch data source plugin (Grafana 6.5 or later).
- Follow the same import or build dashboard steps.
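Adding the CloudWatch data source can also be scripted through Grafana's HTTP API instead of the UI. The sketch below only builds the request; it assumes a Grafana API or service account token, default AWS credential resolution on the Grafana side (`authType: default`), and uses hypothetical helper names. The URL and token shown in the usage note are placeholders.

```python
import json
import urllib.request


def cloudwatch_datasource_payload(region, name="CloudWatch"):
    """Body for POST /api/datasources describing a CloudWatch data source."""
    return {
        "name": name,
        "type": "cloudwatch",
        "access": "proxy",
        # Assumption: Grafana picks up AWS credentials from its environment.
        "jsonData": {"authType": "default", "defaultRegion": region},
    }


def build_datasource_request(grafana_url, api_token, payload):
    """Prepare (but do not send) the HTTP request that creates the data source."""
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually create the data source, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_datasource_request("https://grafana.example.com", token, cloudwatch_datasource_payload("us-east-1")))`.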
🧭 Dashboard & Metrics to Use
Metrics collected under ContainerInsights:
- node_gpu_limit, node_gpu_usage_total, node_gpu_reserved_capacity — for node-level GPU capacity and usage.
- pod_gpu_usage_total, pod_gpu_request — for pod-level GPU consumption metrics
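As an illustration of how these metrics combine, the sketch below builds a `GetMetricData` query list that divides `node_gpu_usage_total` by `node_gpu_limit` using CloudWatch metric math. It assumes both series are published with a cluster-level `ClusterName` dimension; the function name `gpu_allocation_queries` and the label text are hypothetical.

```python
def gpu_allocation_queries(cluster_name, period=300):
    """Build MetricDataQueries for GPU usage as a percentage of the GPU limit."""

    def stat(query_id, metric_name):
        # One raw series; hidden from the result and only used in the expression.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "ContainerInsights",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
                },
                "Period": period,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    return [
        stat("used", "node_gpu_usage_total"),
        stat("limit", "node_gpu_limit"),
        {
            "Id": "pct",
            "Expression": "100 * used / limit",
            "Label": "GPU allocation %",
            "ReturnData": True,
        },
    ]
```

The list can be passed to a boto3 CloudWatch client as `cloudwatch.get_metric_data(MetricDataQueries=gpu_allocation_queries("my-gpu-cluster"), StartTime=start, EndTime=end)`, where the cluster name and time range are yours to supply.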
🎯 Why This Approach Works
- Fully automated: no manual device plugin, driver, or containerd tweaks
- Managed and supported by AWS
- Integrates seamlessly with Grafana
- Includes logs + metrics for end-to-end observability
Terraform-based solution to set up GPU-level monitoring on your existing EKS cluster using CloudWatch Container Insights and visualize it in Grafana.
🚀 Step 1: Enable Container Insights with GPU support
Add the add-on resource to your Terraform configuration (assuming you already have an EKS cluster managed via Terraform), together with a cloudwatch_sa IAM role for the agent's service account.
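As a rough sketch of what such a configuration might contain, the Python snippet below emits the resources in Terraform's JSON syntax (a file such as `gpu_insights.tf.json` sits alongside regular `.tf` files). The resource labels, the add-on name, and the policy ARN are assumptions, and `aws_iam_role.cloudwatch_sa` is expected to be defined elsewhere in your configuration.

```python
import json


def addon_terraform(cluster_name):
    """Return a Terraform JSON document for the add-on and its role policy."""
    return {
        "resource": {
            # Grants the cloudwatch_sa role permission to publish to CloudWatch.
            "aws_iam_role_policy_attachment": {
                "cloudwatch_agent": {
                    "role": "${aws_iam_role.cloudwatch_sa.name}",
                    "policy_arn": "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy",
                }
            },
            # Installs the add-on and binds it to the cloudwatch_sa role.
            "aws_eks_addon": {
                "cloudwatch_observability": {
                    "cluster_name": cluster_name,
                    "addon_name": "amazon-cloudwatch-observability",
                    "service_account_role_arn": "${aws_iam_role.cloudwatch_sa.arn}",
                }
            },
        }
    }


# "my-gpu-cluster" is a placeholder; redirect the output to a *.tf.json file.
print(json.dumps(addon_terraform("my-gpu-cluster"), indent=2))
```

Writing the same two resources directly in HCL works equally well; the JSON form is used here only so the example stays in one language.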
🛠 Step 2: Grant the IAM Role to the Service Account
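Granting the role to the service account comes down to a trust policy that lets the cluster's OIDC provider assume the role on behalf of that service account. The sketch below builds such a policy document; the namespace `amazon-cloudwatch` and service account `cloudwatch-agent` are assumed defaults for the add-on, and the function name is hypothetical.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer_url,
                      namespace="amazon-cloudwatch",
                      service_account="cloudwatch-agent"):
    """Trust policy allowing one Kubernetes service account to assume a role."""
    # The IAM provider ARN uses the issuer URL without its scheme.
    provider = oidc_issuer_url.split("://", 1)[-1]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{provider}:aud": "sts.amazonaws.com",
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }
```

The resulting document, serialized with `json.dumps`, is what would go into the role's `assume_role_policy` in Terraform.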
🔍 Step 3: Validate in CloudWatch
After terraform apply, wait a few minutes, then check:
- CloudWatch → Container Insights → EKS
- You should see panels for GPU usage, memory, temperature, and power ([docs.aws.amazon.com][4], [blog.devops.dev][5]).
📊 Step 4: Visualize GPU Metrics in Grafana
Option A: Amazon Managed Grafana
- Add CloudWatch data source (select Container Insights).
- Upload/import a dashboard watching:
  - ContainerInsights/node_gpu_usage_total
  - ContainerInsights/node_gpu_limit
  - ContainerInsights/pod_gpu_usage_total

You can also import a premade GPU dashboard or build from scratch using these metrics ([aws-observability.github.io][6]).
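If you build the dashboard from scratch, a small generator can produce importable JSON with one panel per metric. This is a sketch under assumptions: the query field names follow the shape Grafana's CloudWatch data source typically exports, the `schemaVersion` and `datasource_uid` values are placeholders, and the function name `gpu_dashboard` is hypothetical. Export an existing panel from your own Grafana to confirm the exact fields for your version.

```python
import json

GPU_METRICS = ["node_gpu_usage_total", "node_gpu_limit", "pod_gpu_usage_total"]


def gpu_dashboard(cluster_name, region="default", datasource_uid="cloudwatch"):
    """Return a dashboard dict with one time-series panel per GPU metric."""
    panels = []
    for index, metric in enumerate(GPU_METRICS):
        panels.append({
            "id": index + 1,
            "type": "timeseries",
            "title": metric,
            "datasource": {"type": "cloudwatch", "uid": datasource_uid},
            # Full-width panels stacked vertically, 8 grid rows each.
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": index * 8},
            "targets": [{
                "refId": "A",
                "queryMode": "Metrics",
                "namespace": "ContainerInsights",
                "metricName": metric,
                "dimensions": {"ClusterName": cluster_name},
                "statistic": "Average",
                "period": "",
                "region": region,
                "matchExact": True,
            }],
        })
    return {
        "title": f"EKS GPU - {cluster_name}",
        "schemaVersion": 39,  # assumption: adjust to your Grafana version
        "panels": panels,
    }
```

Dumping the result with `json.dumps(gpu_dashboard("my-gpu-cluster"), indent=2)` gives a file you can try under Dashboards → Import.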
✅ Summary of the Full Flow
| Stage | Tool | Outcome |
|---|---|---|
| Terraform Setup | AWS EKS Add-on | Deploys CloudWatch Agent + DCGM exporter across GPU nodes |
| CloudWatch | Container Insights | GPU metrics available (GPU Utilization, Temp, Memory, Power) |
| Grafana Visualization | Amazon Managed Grafana or self-hosted Grafana | Visualize GPU metrics using CloudWatch data source |
🔧 What You Should Do Next
- Copy the Terraform snippets above into your existing config.
- Run terraform init && terraform apply.
- Wait ~5 minutes for the add-on to deploy.
- Confirm GPU metrics exist in CloudWatch.
- Set up Grafana with CloudWatch integration and visualize your GPU dashboards.
