ALB CloudWatch Alarms Configuration

Overview

This configuration creates CloudWatch alarms for Application Load Balancers (ALBs) and their Target Groups. Thresholds and evaluation periods are tuned to reduce noise, so alarms fire only on sustained disruptions rather than brief spikes.

Monitored Resources

Application Load Balancers (4 ALBs)

AppNameDashboard-PROD-ALB
AppName-PROD-ALB
AppName-PROD-ALB
AppName-elb

Target Groups (4 TGs)

AppNameDashboard-Prod-Primary-TG
AppName-PROD-Primary-TG
AppName-PROD-TG-Primary
AppName-target-grp

Alarm Configuration

🔴 Critical Alarms (Lower Threshold for Quick Response)

AllTargetsUnhealthy

Metric: HealthyHostCount (Average)
Threshold: < 1 (triggers when 0 healthy hosts)
Evaluation Periods: 2 consecutive breaches (10 minutes total)
Level: Target Group
Rationale: Critical - all targets are down, so this needs immediate attention
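
As a rough illustration of how the settings above might map onto a Terraform resource, here is a minimal sketch. The data source lookups, the resource labels, and the pairing of AppName-elb with AppName-target-grp are assumptions made for the example, not taken from ALB-ALARMS.tf.

```hcl
# Assumption: AppName-target-grp is attached to AppName-elb.
data "aws_lb" "example" {
  name = "AppName-elb"
}

data "aws_lb_target_group" "example" {
  name = "AppName-target-grp"
}

resource "aws_cloudwatch_metric_alarm" "all_targets_unhealthy" {
  alarm_name          = "CW-TG-${data.aws_lb_target_group.example.name}-AllTargetsUnhealthy"
  alarm_description   = "No healthy targets behind the target group"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Average"
  period              = 300 # 5-minute periods
  evaluation_periods  = 2   # 2 x 5 min = 10 minutes
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  treat_missing_data  = "breaching" # assume the worst case if no data arrives

  # HealthyHostCount is reported per target group and load balancer pair.
  dimensions = {
    TargetGroup  = data.aws_lb_target_group.example.arn_suffix
    LoadBalancer = data.aws_lb.example.arn_suffix
  }
}
```

The dimensions use the ARN suffix of each resource rather than its full ARN, which is the form CloudWatch expects in the AWS/ApplicationELB namespace.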

🟡 Performance Alarms (Higher Thresholds to Reduce Noise)

ALB Target Response Time

Metric: TargetResponseTime (Average)
Threshold: > 5 seconds
Evaluation Periods: 3 consecutive breaches (15 minutes total)
Level: ALB
Rationale: Only alerts on sustained high response times (> 5s for 15 mins)

Target Group Response Time

Metric: TargetResponseTime (Average)
Threshold: > 5 seconds
Evaluation Periods: 3 consecutive breaches (15 minutes total)
Level: Target Group
Rationale: Only alerts on sustained high response times (> 5s for 15 mins)
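
The ALB-level response time alarm could look roughly like the sketch below. The variable alb_arn_suffix and the resource label are hypothetical, and the locals block is repeated only so the sketch stands alone.

```hcl
variable "alb_arn_suffix" {
  description = "Hypothetical input: ARN suffix of the ALB, in the form app/<name>/<id>"
  type        = string
}

locals {
  # Omit this block when pasting into ALB-ALARMS.tf, which already defines the value.
  alb_response_time_threshold = 5 # seconds
}

resource "aws_cloudwatch_metric_alarm" "alb_response_time" {
  alarm_name          = "CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Average"
  period              = 300 # 5-minute periods
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanThreshold"
  threshold           = local.alb_response_time_threshold
  treat_missing_data  = "notBreaching" # no traffic should not raise an alarm

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }
}
```

A Target Group-level variant would presumably differ only in its name, its threshold local (tg_response_time_threshold), and an extra TargetGroup entry in dimensions.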

🟠 Error Rate Alarms (Higher Thresholds to Reduce Noise)

HTTP 4xx Errors (Target Group)

Metric: HTTPCode_Target_4XX_Count (Sum)
Threshold: > 100 errors per 5 minutes
Evaluation Periods: 3 consecutive breaches (15 minutes total)
Level: Target Group
Rationale: Client errors - only alert on sustained high volume (> 100 errors/5min for 15 mins)

HTTP 5xx Errors (Target Group)

Metric: HTTPCode_Target_5XX_Count (Sum)
Threshold: > 50 errors per 5 minutes
Evaluation Periods: 3 consecutive breaches (15 minutes total)
Level: Target Group
Rationale: Server errors - lower threshold than 4xx, but still requires a sustained issue (> 50 errors/5min for 15 mins)
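
One way to create both error alarms for every Target Group without repeating the resource is a for_each over (target group, error class) pairs. This is a sketch only: the target_groups variable, the local names, and the resource label are assumptions, and the actual ALB-ALARMS.tf may be organized differently.

```hcl
variable "target_groups" {
  description = "Hypothetical map: TG name => ARN suffixes of the TG and its ALB"
  type = map(object({
    tg_arn_suffix  = string
    alb_arn_suffix = string
  }))
}

locals {
  tg_error_classes = {
    HTTPCode4XX = { metric = "HTTPCode_Target_4XX_Count", threshold = 100 }
    HTTPCode5XX = { metric = "HTTPCode_Target_5XX_Count", threshold = 50 }
  }

  # One entry per (target group, error class) pair, keyed "<TG-Name>-<MetricType>".
  tg_error_alarms = merge([
    for tg_name, tg in var.target_groups : {
      for metric_type, cfg in local.tg_error_classes :
      "${tg_name}-${metric_type}" => merge(tg, cfg)
    }
  ]...)
}

resource "aws_cloudwatch_metric_alarm" "tg_http_errors" {
  for_each = local.tg_error_alarms

  alarm_name          = "CW-TG-${each.key}"
  namespace           = "AWS/ApplicationELB"
  metric_name         = each.value.metric
  statistic           = "Sum"
  period              = 300 # errors are summed per 5-minute period
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanThreshold"
  threshold           = each.value.threshold
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = each.value.tg_arn_suffix
    LoadBalancer = each.value.alb_arn_suffix
  }
}
```

With four entries in target_groups, this yields eight alarms, for example CW-TG-AppName-PROD-TG-Primary-HTTPCode5XX.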

🔵 Connection Alarms (Higher Thresholds to Reduce Noise)

Rejected Connection Count

Metric: RejectedConnectionCount (Sum)
Threshold: > 25 connections per 5 minutes
Evaluation Periods: 3 consecutive breaches (15 minutes total)
Level: ALB
Rationale: Only alerts on sustained connection rejections (> 25/5min for 15 mins)

Alarm Thresholds Summary

Metric	Threshold	Evaluation Periods	Total Time	Level
ALB Response Time	> 5 seconds	3 × 5 min	15 minutes	ALB
TG Response Time	> 5 seconds	3 × 5 min	15 minutes	TG
TG 4xx Errors	> 100 errors per 5 min	3 × 5 min	15 minutes	TG
TG 5xx Errors	> 50 errors per 5 min	3 × 5 min	15 minutes	TG
Rejected Connections	> 25 connections per 5 min	3 × 5 min	15 minutes	ALB
All Targets Unhealthy	< 1 healthy host	2 × 5 min	10 minutes	TG

Key Features

🎯 Reduced Noise Design

3 Evaluation Periods for most alarms = 15 minutes of sustained issues before alerting
Higher Thresholds = Only alert on real disruptions, not minor spikes
treat_missing_data = "notBreaching" = Missing data doesn’t trigger false alarms

🚨 Critical Alerts Stay Sensitive

AllTargetsUnhealthy uses only 2 evaluation periods (10 min) since it’s critical
treat_missing_data = "breaching" for health checks = Assume worst case if no data

📊 Smart Statistic Selection

Average for response times = Smooths out occasional spikes
Sum for error counts = Total impact over the period
Average for healthy host count = Breaches when the 5-minute average falls below 1 healthy host

Naming Convention

CW-ALB-<ALB-Name>-<MetricType>        # ALB-level alarms
CW-TG-<TG-Name>-<MetricType>          # Target Group-level alarms

Examples:

CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime
CW-TG-AppName-PROD-TG-Primary-HTTPCode5XX
CW-TG-AppName-target-grp-AllTargetsUnhealthy
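
The convention can be expressed with plain string interpolation. The local and output names below are hypothetical and exist only to illustrate the pattern.

```hcl
locals {
  # Resource names taken from the examples above
  example_alb_name = "AppNameDashboard-PROD-ALB"
  example_tg_name  = "AppName-target-grp"

  example_alb_alarm_name = "CW-ALB-${local.example_alb_name}-TargetResponseTime"
  example_tg_alarm_name  = "CW-TG-${local.example_tg_name}-AllTargetsUnhealthy"
}

output "example_alarm_names" {
  # Evaluates to:
  #   CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime
  #   CW-TG-AppName-target-grp-AllTargetsUnhealthy
  value = [local.example_alb_alarm_name, local.example_tg_alarm_name]
}
```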

Total Alarms Created

24 CloudWatch Alarms:

8 ALB-level alarms (2 per ALB × 4 ALBs)
16 Target Group-level alarms (4 per TG × 4 TGs)
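
The breakdown follows from the alarm types listed earlier: two per ALB (response time, rejected connections) and four per Target Group (all targets unhealthy, response time, 4xx, 5xx). A small sketch of the arithmetic, using hypothetical local names:

```hcl
locals {
  alb_count      = 4
  tg_count       = 4
  alarms_per_alb = 2 # response time, rejected connections
  alarms_per_tg  = 4 # all targets unhealthy, response time, 4xx, 5xx

  total_alarm_count = (
    local.alb_count * local.alarms_per_alb +
    local.tg_count * local.alarms_per_tg
  )
}

output "total_alarm_count" {
  value = local.total_alarm_count # 4 * 2 + 4 * 4 = 24
}
```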

Deployment

# Validate configuration
terraform validate

# Preview changes
terraform plan

# Apply changes
terraform apply

# View outputs
terraform output

Adding SNS Notifications

To receive notifications, set the alarm_actions parameter in each alarm to your SNS topic ARN:

alarm_actions = ["arn:aws:sns:ap-south-1:ACCOUNT_ID:your-topic-name"]
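
If the topic does not exist yet, it could be managed in the same configuration and referenced by attribute instead of a hard-coded ARN. The topic name, the subscription endpoint, and the local name below are placeholders, not values from this project.

```hcl
resource "aws_sns_topic" "alb_alarms" {
  name = "alb-alarms" # hypothetical topic name
}

resource "aws_sns_topic_subscription" "alb_alarms_email" {
  topic_arn = aws_sns_topic.alb_alarms.arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # placeholder; the recipient must confirm the subscription
}

locals {
  # Reference this from each alarm, for example:
  #   alarm_actions = local.alb_alarm_actions
  #   ok_actions    = local.alb_alarm_actions   (optional: also notify on recovery)
  alb_alarm_actions = [aws_sns_topic.alb_alarms.arn]
}
```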

Adjusting Thresholds

Edit the locals block in ALB-ALARMS.tf:

locals {
  tg_4xx_threshold                  = 100  # Increase/decrease as needed
  tg_5xx_threshold                  = 50   # Increase/decrease as needed
  alb_response_time_threshold       = 5    # In seconds
  tg_response_time_threshold        = 5    # In seconds
  alb_rejected_connection_threshold = 25   # Number of connections
}

Monitoring Best Practices

Review alarm history after 1-2 weeks to fine-tune thresholds
Set up SNS notifications for production environments
Create dashboards in CloudWatch to visualize metrics
Document expected baseline values for each metric
Review and adjust thresholds based on traffic patterns

Troubleshooting

If you’re getting too many alarms:

Increase thresholds in the locals block
Increase evaluation periods (e.g., from 3 to 4)
Review if the baseline traffic has changed

If you’re missing critical issues:

Decrease thresholds (be more sensitive)
Reduce evaluation periods (e.g., from 3 to 2)
Add more alarm types for additional metrics

Additional Metrics to Consider

You may want to add alarms for:

ActiveConnectionCount - Monitor connection pool
NewConnectionCount - Track connection rate
ProcessedBytes - Monitor throughput
TargetConnectionErrorCount - Backend connection issues
UnHealthyHostCount - Track degraded targets (not just all unhealthy)
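
As one example, a partial-degradation alarm on UnHealthyHostCount might be sketched as follows. The threshold, statistic, and evaluation periods are illustrative assumptions that should be tuned to your baseline, and the variable names are hypothetical.

```hcl
variable "degraded_tg_arn_suffix" {
  description = "Hypothetical input: ARN suffix of the target group"
  type        = string
}

variable "degraded_tg_alb_arn_suffix" {
  description = "Hypothetical input: ARN suffix of the ALB in front of the target group"
  type        = string
}

resource "aws_cloudwatch_metric_alarm" "tg_unhealthy_hosts" {
  alarm_name          = "CW-TG-AppName-target-grp-UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum" # assumption: react to any unhealthy target in the period
  period              = 300
  evaluation_periods  = 3 # assumption: same 15-minute window as the other alarms
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # assumption: alert when at least one target stays unhealthy
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = var.degraded_tg_arn_suffix
    LoadBalancer = var.degraded_tg_alb_arn_suffix
  }
}
```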
