ALB CloudWatch Alarms Configuration
Overview
This configuration creates CloudWatch alarms for Application Load Balancers (ALBs) and their Target Groups, with reduced-noise settings so that they alert only on real disruptions.
Monitored Resources
Application Load Balancers (4 ALBs)
- AppNameDashboard-PROD-ALB
- AppName-PROD-ALB
- AppName-PROD-ALB
- AppName-elb
Target Groups (4 TGs)
- AppNameDashboard-Prod-Primary-TG
- AppName-PROD-Primary-TG
- AppName-PROD-TG-Primary
- AppName-target-grp
Alarm Configuration
🔴 Critical Alarms (Lower Threshold for Quick Response)
AllTargetsUnhealthy
- Metric: HealthyHostCount
- Threshold: < 1 (triggers when 0 healthy hosts)
- Evaluation Periods: 2 (10 minutes)
- Level: Target Group
- Why: Critical - all instances down, needs immediate attention
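As a rough illustration, the settings above might translate into a Terraform resource like the one below. This is a standalone sketch, not the contents of ALB-ALARMS.tf: the two variables are hypothetical inputs, and the alarm name is borrowed from the Naming Convention section.

```hcl
# Hypothetical inputs: ARN suffixes identifying the target group and its ALB.
variable "target_group_arn_suffix" {
  type        = string
  description = "Target group ARN suffix, e.g. targetgroup/<name>/<id>"
}

variable "load_balancer_arn_suffix" {
  type        = string
  description = "Load balancer ARN suffix, e.g. app/<name>/<id>"
}

resource "aws_cloudwatch_metric_alarm" "all_targets_unhealthy" {
  alarm_name          = "CW-TG-AppName-target-grp-AllTargetsUnhealthy"
  alarm_description   = "No healthy targets in the target group"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Average"
  period              = 300 # 5-minute periods
  evaluation_periods  = 2   # 2 x 5 min = 10 minutes
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  treat_missing_data  = "breaching" # missing data is treated as an outage

  # HealthyHostCount is reported per target group and load balancer pair.
  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```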
🟡 Performance Alarms (Higher Thresholds to Reduce Noise)
ALB Target Response Time
- Metric: TargetResponseTime (Average)
- Threshold: > 5 seconds
- Evaluation Periods: 3 consecutive breaches (15 minutes total)
- Level: ALB
- Rationale: Only alerts on sustained high response times (> 5s for 15 mins)
Target Group Response Time
- Metric: TargetResponseTime (Average)
- Threshold: > 5 seconds
- Evaluation Periods: 3 consecutive breaches (15 minutes total)
- Level: Target Group
- Rationale: Only alerts on sustained high response times (> 5s for 15 mins)
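The ALB-level response time alarm could be expressed once and repeated per load balancer with for_each. The sketch below assumes a hypothetical variable alb_arn_suffixes that maps each ALB name to its ARN suffix; the real module may wire this up differently.

```hcl
# Hypothetical input: map of ALB name => ARN suffix (app/<name>/<id>).
variable "alb_arn_suffixes" {
  type = map(string)
}

resource "aws_cloudwatch_metric_alarm" "alb_target_response_time" {
  for_each = var.alb_arn_suffixes

  alarm_name          = "CW-ALB-${each.key}-TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  statistic           = "Average"
  period              = 300 # 5-minute periods
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5 # seconds
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = each.value
  }
}
```

A Target Group-level variant would use the same settings and add a TargetGroup entry to dimensions alongside LoadBalancer.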
🟠 Error Rate Alarms (Higher Thresholds to Reduce Noise)
HTTP 4xx Errors (Target Group)
- Metric: HTTPCode_Target_4XX_Count (Sum)
- Threshold: > 100 errors per 5 minutes
- Evaluation Periods: 3 consecutive breaches (15 minutes total)
- Level: Target Group
- Rationale: Client errors - only alert on sustained high volume (> 100 errors/5min for 15 mins)
HTTP 5xx Errors (Target Group)
- Metric: HTTPCode_Target_5XX_Count (Sum)
- Threshold: > 50 errors per 5 minutes
- Evaluation Periods: 3 consecutive breaches (15 minutes total)
- Level: Target Group
- Rationale: Server errors - more sensitive than 4xx but still requires sustained issues
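Because the two error alarms differ only in metric and threshold, one way to sketch them is a single resource driven by a small map. The variables and the local name tg_error_alarms are hypothetical, and the HTTPCode4XX name suffix is assumed by analogy with the HTTPCode5XX example in the Naming Convention section.

```hcl
# Hypothetical inputs for a single target group and its load balancer.
variable "target_group_arn_suffix" {
  type = string
}

variable "load_balancer_arn_suffix" {
  type = string
}

locals {
  # Thresholds are error counts per 5-minute period, as listed above.
  tg_error_alarms = {
    HTTPCode4XX = { metric = "HTTPCode_Target_4XX_Count", threshold = 100 }
    HTTPCode5XX = { metric = "HTTPCode_Target_5XX_Count", threshold = 50 }
  }
}

resource "aws_cloudwatch_metric_alarm" "tg_http_errors" {
  for_each = local.tg_error_alarms

  alarm_name          = "CW-TG-AppName-PROD-TG-Primary-${each.key}"
  namespace           = "AWS/ApplicationELB"
  metric_name         = each.value.metric
  statistic           = "Sum"
  period              = 300 # 5-minute periods
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanThreshold"
  threshold           = each.value.threshold
  treat_missing_data  = "notBreaching" # no datapoints means no errors were reported

  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```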
🔵 Connection Alarms (Higher Thresholds to Reduce Noise)
Rejected Connection Count
- Metric: RejectedConnectionCount (Sum)
- Threshold: > 25 connections per 5 minutes
- Evaluation Periods: 3 consecutive breaches (15 minutes total)
- Level: ALB
- Rationale: Only alerts on sustained connection rejections (> 25/5min for 15 mins)
Alarm Thresholds Summary
| Metric | Threshold | Evaluation Periods | Total Time | Level |
|---|---|---|---|---|
| ALB Response Time | > 5 seconds | 3 × 5 min | 15 minutes | ALB |
| TG Response Time | > 5 seconds | 3 × 5 min | 15 minutes | TG |
| TG 4xx Errors | > 100 errors per 5 min | 3 × 5 min | 15 minutes | TG |
| TG 5xx Errors | > 50 errors per 5 min | 3 × 5 min | 15 minutes | TG |
| Rejected Connections | > 25 connections per 5 min | 3 × 5 min | 15 minutes | ALB |
| All Targets Unhealthy | < 1 healthy host | 2 × 5 min | 10 minutes | TG |
Key Features
🎯 Reduced Noise Design
- 3 Evaluation Periods for most alarms = 15 minutes of sustained issues before alerting
- Higher Thresholds = Only alert on real disruptions, not minor spikes
- treat_missing_data = "notBreaching" = Missing data doesn’t trigger false alarms
🚨 Critical Alerts Stay Sensitive
- AllTargetsUnhealthy uses only 2 evaluation periods (10 min) since it’s critical
- treat_missing_data = "breaching" for health checks = Assume worst case if no data
📊 Smart Statistic Selection
- Average for response times = Smooths out occasional spikes
- Sum for error counts = Total impact over the period
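- Average for healthy host count = Reflects the whole 5-minute period rather than a single sample; together with 2 evaluation periods, alerts only on a sustained unhealthy state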
Naming Convention
CW-ALB-AppNameDashboard-PROD-ALB-TargetResponseTime
CW-TG-AppName-PROD-TG-Primary-HTTPCode5XX
CW-TG-AppName-target-grp-AllTargetsUnhealthy
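The examples suggest a pattern of CW-ALB-<ALB name>-<alarm type> for ALB-level alarms and CW-TG-<target group name>-<alarm type> for Target Group-level alarms. That pattern is inferred from the names above rather than stated by the module; a minimal sketch of building such names, with hypothetical local and output names, might look like this:

```hcl
locals {
  # Resource names taken from the Monitored Resources lists.
  alb_name = "AppNameDashboard-PROD-ALB"
  tg_name  = "AppName-PROD-TG-Primary"

  # Inferred pattern: CW-<level>-<resource name>-<alarm type>
  alb_alarm_name = format("CW-ALB-%s-%s", local.alb_name, "TargetResponseTime")
  tg_alarm_name  = format("CW-TG-%s-%s", local.tg_name, "HTTPCode5XX")
}

output "example_alarm_names" {
  value = [local.alb_alarm_name, local.tg_alarm_name]
}
```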
Tags
All alarms include these tags:
Total Alarms Created
24 CloudWatch Alarms:
- 8 ALB-level alarms (2 per ALB × 4 ALBs)
- 16 Target Group-level alarms (4 per TG × 4 TGs)
Deployment
Adding SNS Notifications
To receive notifications, update the alarm_actions parameter in each alarm:
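A minimal sketch of wiring an alarm to SNS, using the Rejected Connection Count alarm as the example. The topic name, the variable, and the use of ok_actions are assumptions for illustration; an existing topic ARN could be referenced instead of creating a new topic.

```hcl
# Hypothetical input: ARN suffix of the ALB (app/<name>/<id>).
variable "load_balancer_arn_suffix" {
  type = string
}

# Hypothetical topic; subscriptions (email, chat, paging) are configured separately.
resource "aws_sns_topic" "alb_alarms" {
  name = "alb-alarms"
}

resource "aws_cloudwatch_metric_alarm" "rejected_connections" {
  alarm_name          = "CW-ALB-AppName-PROD-ALB-RejectedConnectionCount"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "RejectedConnectionCount"
  statistic           = "Sum"
  period              = 300 # 5-minute periods
  evaluation_periods  = 3   # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanThreshold"
  threshold           = 25
  treat_missing_data  = "notBreaching"

  # Notify when the alarm fires and, optionally, when it recovers.
  alarm_actions = [aws_sns_topic.alb_alarms.arn]
  ok_actions    = [aws_sns_topic.alb_alarms.arn]

  dimensions = {
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```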
Adjusting Thresholds
Edit the locals block in ALB-ALARMS.tf:
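The exact layout of that block is not reproduced here. As a hedged sketch, a locals block holding the values from the Alarm Thresholds Summary could look like the following; every key name is hypothetical and should be matched against the real file before editing.

```hcl
locals {
  # Hypothetical key names; values mirror the Alarm Thresholds Summary table.
  alarm_thresholds = {
    target_response_time_seconds = 5
    tg_4xx_count                 = 100
    tg_5xx_count                 = 50
    rejected_connection_count    = 25
    healthy_host_count_min       = 1
  }

  alarm_period_seconds        = 300 # one evaluation period = 5 minutes
  default_evaluation_periods  = 3   # 15 minutes of sustained breach
  critical_evaluation_periods = 2   # 10 minutes for AllTargetsUnhealthy
}
```

Keeping thresholds in one place means a tuning change touches a single value instead of every alarm resource.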
Monitoring Best Practices
- Review alarm history after 1-2 weeks to fine-tune thresholds
- Set up SNS notifications for production environments
- Create dashboards in CloudWatch to visualize metrics
- Document expected baseline values for each metric
- Review and adjust thresholds based on traffic patterns
Troubleshooting
If you’re getting too many alarms:
- Increase thresholds in the locals block
- Increase evaluation periods (e.g., from 3 to 4)
- Review if the baseline traffic has changed
If you’re missing critical issues:
- Decrease thresholds (be more sensitive)
- Reduce evaluation periods (e.g., from 3 to 2)
- Add more alarm types for additional metrics
Additional Metrics to Consider
You may want to add alarms for:
- ActiveConnectionCount - Monitor connection pool
- NewConnectionCount - Track connection rate
- ProcessedBytes - Monitor throughput
- TargetConnectionErrorCount - Backend connection issues
- UnHealthyHostCount - Track degraded targets (not just all unhealthy)
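As one possible starting point, an UnHealthyHostCount alarm for partially degraded target groups might be sketched as below. The threshold, statistic, and alarm name are illustrative assumptions rather than tuned values, and the variables are hypothetical inputs.

```hcl
# Hypothetical inputs for a single target group and its load balancer.
variable "target_group_arn_suffix" {
  type = string
}

variable "load_balancer_arn_suffix" {
  type = string
}

resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  alarm_name          = "CW-TG-AppName-PROD-Primary-TG-UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum" # assumption: flag any unhealthy target in the period
  period              = 300       # 5-minute periods
  evaluation_periods  = 3         # 3 x 5 min = 15 minutes
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1 # assumption: one or more unhealthy targets
  treat_missing_data  = "notBreaching"

  dimensions = {
    TargetGroup  = var.target_group_arn_suffix
    LoadBalancer = var.load_balancer_arn_suffix
  }
}
```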
