
Enable GPU-Level Metrics on EKS

This guide enables GPU-level metrics on EKS using CloudWatch Container Insights and visualizes them in Grafana, without the complexity of operator/runtime troubleshooting.

📦 Step 1: Enable Container Insights with GPU Support

AWS now supports GPU observability natively through the Container Insights Enhanced Observability add-on.
  1. Ensure your EKS cluster has OIDC enabled:
    eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve
    
  2. Attach the necessary IAM access:
    aws iam attach-role-policy \
      --role-name <EKSNodeInstanceRole> \
      --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    
  3. Enable the Container Insights add-on with GPU support:
    aws eks create-addon \
      --cluster-name my-cluster \
      --addon-name amazon-cloudwatch-observability \
      --service-account-role-arn <CloudWatchAgentServiceAccountRoleArn> \
      --resolve-conflicts OVERWRITE
    
This deploys the CloudWatch Agent, the NVIDIA DCGM exporter, and Fluent Bit log collection across your EKS nodes automatically.
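As a quick sanity check (assuming you have kubectl access to the cluster and the AWS CLI configured, with the same placeholder cluster name as above), you can confirm the add-on and its pods are healthy:

```shell
# Pods deployed by the add-on live in the amazon-cloudwatch namespace;
# expect cloudwatch-agent, dcgm-exporter, and fluent-bit pods.
kubectl get pods -n amazon-cloudwatch

# The add-on itself should report ACTIVE once fully rolled out.
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --query 'addon.status'
```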

🧩 Step 2: Verify GPU Metrics in CloudWatch

After a few minutes, go to the CloudWatch console → Container Insights → EKS. You’ll see multiple built-in dashboards. Look for GPU-specific panels showing:
  • GPU Utilization
  • GPU Memory Usage
  • GPU Temperature
  • GPU Power Consumption
These metrics are collected without deploying the DCGM exporter yourself or heaps of manual setup.
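To double-check from the CLI instead of the console (cluster name and region come from your configured AWS CLI profile), you can list the GPU metrics now present in the ContainerInsights namespace:

```shell
# Show which dimension sets (ClusterName, NodeName, InstanceId, ...) report GPU usage.
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --query 'Metrics[].Dimensions'
```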

📈 Step 3: Visualize in Grafana

You can use Amazon Managed Grafana or your own self-managed Grafana.

If using Managed Grafana:

  1. Create or use an existing Grafana workspace.
  2. Add CloudWatch as a data source (supports Container Insights).
  3. Import GPU Dashboard:
    • Browse dashboards in Grafana.com, or import a custom one.
    • Alternatively, build your own panel using metrics such as node_gpu_usage_total from the ContainerInsights namespace, with the ClusterName and NodeName dimensions.

If using self-hosted Grafana:

  1. Configure the CloudWatch data source (built into Grafana 6.5 and later).
  2. Follow the same import or build dashboard steps.

🧭 Dashboard & Metrics to Use

Metrics collected under ContainerInsights:
  • node_gpu_limit, node_gpu_usage_total, node_gpu_reserved_capacity — for node-level GPU capacity and usage.
  • pod_gpu_usage_total, pod_gpu_request — for pod-level GPU consumption metrics.
Use these to build graphs or alerts in Grafana.
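For example, a minimal CloudWatch alarm on sustained GPU utilization might look like the following sketch (the alarm name, threshold, and dimension values are illustrative assumptions, not part of the add-on):

```shell
# Fire when average cluster GPU usage stays above 90 for three 5-minute periods.
# Threshold and dimensions are illustrative; adjust for your workload.
aws cloudwatch put-metric-alarm \
  --alarm-name eks-gpu-high-utilization \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold
```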

🎯 Why This Approach Works

  • Fully automated: no manual device plugin, driver, or containerd tweaks
  • Managed and supported by AWS
  • Integrates seamlessly with Grafana
  • Includes logs + metrics for end-to-end observability




Terraform-Based Solution

Alternatively, use Terraform to set up GPU-level monitoring on your existing EKS cluster using CloudWatch Container Insights and visualize it in Grafana.

🚀 Step 1: Enable Container Insights with GPU support

Add this to your Terraform (assuming you already have an EKS cluster managed via Terraform):
data "aws_eks_cluster" "this" {
  name = var.eks_cluster_name
}

resource "aws_eks_addon" "cw_observability" {
  cluster_name                = data.aws_eks_cluster.this.name
  addon_name                  = "amazon-cloudwatch-observability"
  addon_version               = "v2.1.2-eksbuild.1" # or latest
  resolve_conflicts_on_update = "OVERWRITE"
  service_account_role_arn    = aws_iam_role.cloudwatch_sa.arn

  configuration_values = jsonencode({
    agent = {
      config = {
        logs = {
          metrics_collected = {
            kubernetes = {
              enhanced_container_insights = true
              accelerated_compute_metrics = true
            }
          }
        }
      }
    }
  })
}
And define the cloudwatch_sa role:
resource "aws_iam_role" "cloudwatch_sa" {
  name = "${var.eks_cluster_name}-cw-agent"

  assume_role_policy = data.aws_iam_policy_document.eks_assume_sa.json
}

resource "aws_iam_role_policy_attachment" "cw_agent_attachment" {
  role       = aws_iam_role.cloudwatch_sa.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}
This deploys the CloudWatch Agent + DCGM exporter, enabling GPU metric collection in CloudWatch.

🛠 Step 2: Grant the IAM Role to the Service Account

data "aws_iam_policy_document" "eks_assume_sa" {
  statement {
    effect = "Allow"
    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.oidc.arn]
    }
    actions = ["sts:AssumeRoleWithWebIdentity"]
    condition {
      test     = "StringEquals"
      # The IAM condition key is the issuer URL without the "https://" scheme.
      variable = "${replace(data.aws_eks_cluster.this.identity[0].oidc[0].issuer, "https://", "")}:sub"
      values   = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
    }
  }
}

data "aws_iam_openid_connect_provider" "oidc" {
  url = data.aws_eks_cluster.this.identity[0].oidc[0].issuer
}

🔍 Step 3: Validate in CloudWatch

After terraform apply, wait a few minutes then check:
  • CloudWatch → Container Insights → EKS
  • You should see panels for GPU usage, memory, temperature, and power.

📊 Step 4: Visualize GPU Metrics in Grafana

Option A: Amazon Managed Grafana
module "managed_grafana" {
  source  = "terraform-aws-modules/managed-service-grafana/aws"
  version = "x.y.z" # pin to a released version

  name                     = "eks-gpu-monitoring"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"

  # Let the workspace read CloudWatch metrics.
  data_sources = ["CLOUDWATCH"]
}
Then in the Grafana UI:
  1. Add the CloudWatch data source (it supports Container Insights).
  2. Import or build a dashboard watching:
    • ContainerInsights/node_gpu_usage_total
    • ContainerInsights/node_gpu_limit
    • ContainerInsights/pod_gpu_usage_total
You can also import a premade GPU dashboard or build one from scratch using these metrics.
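As a rough starting point, a single CloudWatch query target inside a Grafana dashboard JSON might look like this sketch (the region, datasource name, and cluster name are assumptions you should replace):

```json
{
  "datasource": "CloudWatch",
  "region": "us-east-1",
  "namespace": "ContainerInsights",
  "metricName": "node_gpu_usage_total",
  "dimensions": { "ClusterName": "my-cluster", "NodeName": "*" },
  "statistic": "Average",
  "period": "300"
}
```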

✅ Summary of the Full Flow

Stage                 | Tool                   | Outcome
Terraform Setup       | AWS EKS Add-on         | Deploys CloudWatch Agent + DCGM exporter across GPU nodes
CloudWatch            | Container Insights     | GPU metrics available (utilization, memory, temperature, power)
Grafana Visualization | Managed or self-hosted | Visualize GPU metrics using the CloudWatch data source

🔧 What You Should Do Next

  1. Copy the Terraform snippets above into your existing config.
  2. Run terraform init && terraform apply.
  3. Wait ~5 minutes for the add-on to deploy.
  4. Confirm GPU metrics exist in CloudWatch.
  5. Set up Grafana with CloudWatch integration and visualize your GPU dashboards.
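Once everything is deployed, a quick end-to-end check is to pull recent datapoints directly from the CLI (this sketch assumes GNU date for the timestamp arithmetic and a GPU workload that has been running for a while):

```shell
# Fetch the last hour of average GPU usage aggregated at the cluster level.
aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_gpu_usage_total \
  --dimensions Name=ClusterName,Value=my-cluster \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

A non-empty Datapoints array confirms the full pipeline, from the DCGM exporter through the CloudWatch Agent, is working.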