ALB CloudWatch Alarms Configuration

Overview

This configuration creates CloudWatch alarms for Application Load Balancers (ALBs) and their Target Groups. Thresholds and evaluation periods are set to reduce noise, so alarms fire only on sustained disruptions rather than brief spikes.

Monitored Resources

Application Load Balancers (4 ALBs)

  1. AppNameDashboard-PROD-ALB
  2. AppName-PROD-ALB
  3. AppName-PROD-ALB
  4. AppName-elb

Target Groups (4 TGs)

  1. AppNameDashboard-Prod-Primary-TG
  2. AppName-PROD-Primary-TG
  3. AppName-PROD-TG-Primary
  4. AppName-target-grp

Alarm Configuration

🔴 Critical Alarms (Lower Threshold for Quick Response)

AllTargetsUnhealthy

  • Metric: HealthyHostCount
  • Threshold: < 1 (triggers when 0 healthy hosts)
  • Evaluation Periods: 2 (10 minutes)
  • Level: Target Group
  • Rationale: Critical - all instances are down and need immediate attention
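
The settings above could be expressed roughly as follows. This is a minimal sketch, not the contents of ALB-ALARMS.tf: the variable names tg_arn_suffix and alb_arn_suffix are hypothetical, and the statistic and treat_missing_data values are taken from the Key Features section of this document.

```hcl
# Hypothetical inputs: the ARN suffixes CloudWatch uses as dimension values.
variable "tg_arn_suffix" {
  type        = string
  description = "Target group ARN suffix, e.g. targetgroup/<name>/<id>"
}

variable "alb_arn_suffix" {
  type        = string
  description = "Load balancer ARN suffix, e.g. app/<name>/<id>"
}

resource "aws_cloudwatch_metric_alarm" "all_targets_unhealthy" {
  alarm_name          = "CW-TG-AppName-target-grp-AllTargetsUnhealthy"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Average"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 300 # one 5-minute period
  evaluation_periods  = 2   # 2 x 5 min = 10 minutes
  treat_missing_data  = "breaching" # no data is treated as the worst case

  # Target-group metrics are published with both dimensions.
  dimensions = {
    TargetGroup  = var.tg_arn_suffix
    LoadBalancer = var.alb_arn_suffix
  }
}
```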

🟡 Performance Alarms (Higher Thresholds to Reduce Noise)

ALB Target Response Time

  • Metric: TargetResponseTime (Average)
  • Threshold: > 5 seconds
  • Evaluation Periods: 3 consecutive breaches (15 minutes total)
  • Level: ALB
  • Rationale: Only alerts on sustained high response times (> 5s for 15 mins)

Target Group Response Time

  • Metric: TargetResponseTime (Average)
  • Threshold: > 5 seconds
  • Evaluation Periods: 3 consecutive breaches (15 minutes total)
  • Level: Target Group
  • Rationale: Only alerts on sustained high response times (> 5s for 15 mins)
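
As an illustration of the two response-time alarms, the ALB-level one might look like the sketch below. The data source lookup by name is an assumption about how the ALB is referenced; the actual configuration may pass the dimension value differently. The Target Group variant would differ only in its alarm name, its threshold local, and an additional TargetGroup dimension.

```hcl
locals {
  alb_response_time_threshold = 5 # seconds
}

# Assumed lookup: resolve an existing ALB by name to get its dimension value.
data "aws_lb" "dashboard" {
  name = "AppNameDashboard-PROD-ALB"
}

resource "aws_cloudwatch_metric_alarm" "alb_response_time" {
  alarm_name          = "CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = local.alb_response_time_threshold
  period              = 300 # one 5-minute period
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = data.aws_lb.dashboard.arn_suffix
  }
}
```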

🟠 Error Rate Alarms (Higher Thresholds to Reduce Noise)

HTTP 4xx Errors (Target Group)

  • Metric: HTTPCode_Target_4XX_Count (Sum)
  • Threshold: > 100 errors per 5 minutes
  • Evaluation Periods: 3 consecutive breaches (15 minutes total)
  • Level: Target Group
  • Rationale: Client errors - only alert on sustained high volume (> 100 errors/5min for 15 mins)

HTTP 5xx Errors (Target Group)

  • Metric: HTTPCode_Target_5XX_Count (Sum)
  • Threshold: > 50 errors per 5 minutes
  • Evaluation Periods: 3 consecutive breaches (15 minutes total)
  • Level: Target Group
  • Rationale: Server errors - more sensitive than 4xx but still requires sustained issues
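
Because the 4xx and 5xx alarms differ only in metric name and threshold, one way to define them is a single resource with for_each. The sketch below assumes hypothetical variables for the dimension values and a hypothetical local map named tg_error_alarms; it is not necessarily how ALB-ALARMS.tf is organized.

```hcl
# Hypothetical inputs for the CloudWatch dimension values.
variable "tg_arn_suffix" { type = string }
variable "alb_arn_suffix" { type = string }

locals {
  # Map key doubles as the <MetricType> suffix of the alarm name.
  tg_error_alarms = {
    HTTPCode4XX = { metric = "HTTPCode_Target_4XX_Count", threshold = 100 }
    HTTPCode5XX = { metric = "HTTPCode_Target_5XX_Count", threshold = 50 }
  }
}

resource "aws_cloudwatch_metric_alarm" "tg_errors" {
  for_each = local.tg_error_alarms

  alarm_name          = "CW-TG-AppName-PROD-TG-Primary-${each.key}"
  namespace           = "AWS/ApplicationELB"
  metric_name         = each.value.metric
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = each.value.threshold
  period              = 300 # errors are summed per 5-minute period
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = var.tg_arn_suffix
    LoadBalancer = var.alb_arn_suffix
  }
}
```

The notBreaching setting matters for these metrics in particular: the error-count metrics are only published when they are non-zero, so quiet periods produce no datapoints at all.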

🔵 Connection Alarms (Higher Thresholds to Reduce Noise)

Rejected Connection Count

  • Metric: RejectedConnectionCount (Sum)
  • Threshold: > 25 connections per 5 minutes
  • Evaluation Periods: 3 consecutive breaches (15 minutes total)
  • Level: ALB
  • Rationale: Only alerts on sustained connection rejections (> 25/5min for 15 mins)

Alarm Thresholds Summary

Metric                | Threshold               | Evaluation Periods | Total Time | Level
ALB Response Time     | > 5 seconds             | 3 × 5 min          | 15 minutes | ALB
TG Response Time      | > 5 seconds             | 3 × 5 min          | 15 minutes | TG
TG 4xx Errors         | > 100 errors per 5 min  | 3 × 5 min          | 15 minutes | TG
TG 5xx Errors         | > 50 errors per 5 min   | 3 × 5 min          | 15 minutes | TG
Rejected Connections  | > 25 connections per 5 min | 3 × 5 min       | 15 minutes | ALB
All Targets Unhealthy | < 1 host                | 2 × 5 min          | 10 minutes | TG

Key Features

🎯 Reduced Noise Design

  • 3 Evaluation Periods for most alarms = 15 minutes of sustained issues before alerting
  • Higher Thresholds = Only alert on real disruptions, not minor spikes
  • treat_missing_data set to "notBreaching" = Missing data doesn’t trigger false alarms

🚨 Critical Alerts Stay Sensitive

  • AllTargetsUnhealthy uses only 2 evaluation periods (10 min) since it’s critical
  • treat_missing_data set to "breaching" for health checks = Assume the worst case if no data arrives

📊 Smart Statistic Selection

  • Average for response times = Smooths out occasional spikes
  • Sum for error counts = Total impact over the period
  • Average for healthy host count = Smooths the count over each period; with a threshold of < 1, a period breaches when the count was 0 for at least part of it

Naming Convention

CW-ALB-<ALB-Name>-<MetricType>        # ALB-level alarms
CW-TG-<TG-Name>-<MetricType>          # Target Group-level alarms
Examples:
  • CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime
  • CW-TG-AppName-PROD-TG-Primary-HTTPCode5XX
  • CW-TG-AppName-target-grp-AllTargetsUnhealthy
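
The convention can be generated rather than typed by hand. The following sketch builds ALB-level names from every (ALB, metric type) pair; the local names are hypothetical, and the RejectedConnectionCount suffix is an assumption, since the examples above do not show the suffix used for that alarm.

```hcl
locals {
  # Illustrative subset of the monitored ALBs.
  alb_names = ["AppNameDashboard-PROD-ALB", "AppName-elb"]

  # Assumed <MetricType> suffixes for the two ALB-level alarms.
  alb_metric_types = ["TargetResponseTime", "RejectedConnectionCount"]

  # One name per pair, following CW-ALB-<ALB-Name>-<MetricType>.
  alb_alarm_names = [
    for pair in setproduct(local.alb_names, local.alb_metric_types) :
    "CW-ALB-${pair[0]}-${pair[1]}"
  ]
}

output "alb_alarm_names" {
  value = local.alb_alarm_names
}
```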

Tags

All alarms include these tags:
Name        = "<Alarm Name>"
Application = "CommonInfraResource"
Environment = "Production"
ManagedBy   = "Terraform"
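
One way to keep these tags consistent is to define the shared ones once and merge in the per-alarm Name. This is a sketch with hypothetical local names (common_tags, alarm_name, alarm_tags), not a description of how the existing configuration does it.

```hcl
locals {
  # Tags shared by every alarm.
  common_tags = {
    Application = "CommonInfraResource"
    Environment = "Production"
    ManagedBy   = "Terraform"
  }

  alarm_name = "CW-TG-AppName-target-grp-AllTargetsUnhealthy"

  # Shared tags plus the alarm-specific Name tag.
  alarm_tags = merge(local.common_tags, { Name = local.alarm_name })
}

output "alarm_tags" {
  value = local.alarm_tags
}
```

An alarm resource would then set tags = local.alarm_tags.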

Total Alarms Created

24 CloudWatch Alarms:
  • 8 ALB-level alarms (2 per ALB × 4 ALBs)
  • 16 Target Group-level alarms (4 per TG × 4 TGs)

Deployment

# Validate configuration
terraform validate

# Preview changes
terraform plan

# Apply changes
terraform apply

# View outputs
terraform output

Adding SNS Notifications

To receive notifications, update the alarm_actions parameter in each alarm:
alarm_actions = ["arn:aws:sns:ap-south-1:ACCOUNT_ID:your-topic-name"]
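
If the topic is managed in the same Terraform configuration, the ARN can be referenced instead of hard-coded. The sketch below assumes a hypothetical topic name and a placeholder email address; neither comes from the existing setup.

```hcl
# Hypothetical topic; the name is a placeholder.
resource "aws_sns_topic" "alb_alarms" {
  name = "alb-alarms"
}

# Hypothetical email subscription; the address is a placeholder.
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alb_alarms.arn
  protocol  = "email"
  endpoint  = "oncall@example.com"
}

# In each aws_cloudwatch_metric_alarm resource:
#   alarm_actions = [aws_sns_topic.alb_alarms.arn]
#   ok_actions    = [aws_sns_topic.alb_alarms.arn] # optional: notify on recovery
```

Email subscriptions must be confirmed by the recipient before SNS delivers any notifications.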

Adjusting Thresholds

Edit the locals block in ALB-ALARMS.tf:
locals {
  tg_4xx_threshold                  = 100  # Increase/decrease as needed
  tg_5xx_threshold                  = 50   # Increase/decrease as needed
  alb_response_time_threshold       = 5    # In seconds
  tg_response_time_threshold        = 5    # In seconds
  alb_rejected_connection_threshold = 25   # Number of connections
}

Monitoring Best Practices

  1. Review alarm history after 1-2 weeks to fine-tune thresholds
  2. Set up SNS notifications for production environments
  3. Create dashboards in CloudWatch to visualize metrics
  4. Document expected baseline values for each metric
  5. Review and adjust thresholds based on traffic patterns

Troubleshooting

If you’re getting too many alarms:

  • Increase thresholds in the locals block
  • Increase evaluation periods (e.g., from 3 to 4)
  • Review if the baseline traffic has changed

If you’re missing critical issues:

  • Decrease thresholds (be more sensitive)
  • Reduce evaluation periods (e.g., from 3 to 2)
  • Add more alarm types for additional metrics
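
A middle ground between those two adjustments is CloudWatch's "M out of N" evaluation, set with the datapoints_to_alarm argument. The sketch below is an assumption about how it could be applied to the 5xx alarm, not a setting this configuration is known to use; the variable names and the alarm name are hypothetical.

```hcl
# Hypothetical inputs for the CloudWatch dimension values.
variable "tg_arn_suffix" { type = string }
variable "alb_arn_suffix" { type = string }

resource "aws_cloudwatch_metric_alarm" "tg_5xx_m_of_n" {
  alarm_name          = "CW-TG-AppName-target-grp-HTTPCode5XX"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 50
  period              = 300 # one 5-minute period
  evaluation_periods  = 4   # look at the last 4 periods (20 minutes)
  datapoints_to_alarm = 3   # alarm if any 3 of those 4 breach
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = var.tg_arn_suffix
    LoadBalancer = var.alb_arn_suffix
  }
}
```

Compared with three consecutive breaches, this still catches an incident that briefly dips below the threshold in the middle.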

Additional Metrics to Consider

You may want to add alarms for:
  • ActiveConnectionCount - Monitor concurrent connections
  • NewConnectionCount - Track connection rate
  • ProcessedBytes - Monitor throughput
  • TargetConnectionErrorCount - Backend connection issues
  • UnHealthyHostCount - Track degraded targets (not just all unhealthy)
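
As an example of the last item, a degraded-target alarm could follow the same pattern as the existing ones. The threshold, statistic, and alarm name below are assumptions chosen for illustration and would need tuning against real baselines.

```hcl
# Hypothetical inputs for the CloudWatch dimension values.
variable "tg_arn_suffix" { type = string }
variable "alb_arn_suffix" { type = string }

resource "aws_cloudwatch_metric_alarm" "tg_unhealthy_hosts" {
  alarm_name          = "CW-TG-AppName-target-grp-UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0   # assumed: any unhealthy target counts
  period              = 300 # one 5-minute period
  evaluation_periods  = 3   # unhealthy targets seen in 3 consecutive periods
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = var.tg_arn_suffix
    LoadBalancer = var.alb_arn_suffix
  }
}
```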

Support & Documentation