ECS Deployment Circuit Breaker: Automatic Rollback with CloudWatch Alerting
Protect production deployments with ECS deployment circuit breaker for automatic rollback, and a CloudWatch alarm that alerts your team when it triggers.
A bad deployment that stays live for 15 minutes can cause more damage than the bug itself. The ECS deployment circuit breaker automatically detects failed deployments and rolls back to the last stable state, with no human intervention required. Add an EventBridge rule that publishes to SNS on top, and your team gets notified the moment a rollback happens.
How the Circuit Breaker Works
The ECS deployment circuit breaker monitors task health during a rolling deployment. If new tasks keep failing to reach a healthy state, it stops the deployment and rolls back to the previous task definition.
```text
Deployment starts
├── New tasks launch with updated task definition
├── ECS monitors task health (container starts, health checks pass)
├── If tasks keep crashing or failing health checks:
│   ├── Circuit breaker TRIGGERS
│   ├── New tasks are stopped
│   └── Service rolls back to previous task definition
└── If all tasks become healthy:
    └── Deployment completes successfully
```
The key insight: Without the circuit breaker, ECS will keep trying to launch failing tasks indefinitely. Your service stays in a degraded state with a mix of old and new (broken) tasks until someone notices and manually intervenes.
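To see which state a deployment ended in, you can inspect the deployments list that the ECS DescribeServices API returns for a service. The following is a minimal sketch that works on an already-fetched response entry. The helper name rollout_summary and the sample data are hypothetical; rolloutState and rolloutStateReason are fields of the ECS Deployment object.

```python
from typing import Any


def rollout_summary(service: dict[str, Any]) -> list[str]:
    """Return one line per deployment describing its rollout state.

    `service` is one element of the `services` list returned by the ECS
    DescribeServices API (for example via boto3's `describe_services`).
    """
    lines = []
    for deployment in service.get("deployments", []):
        lines.append(
            "{id} [{status}] {state}: {reason}".format(
                id=deployment.get("id", "?"),
                status=deployment.get("status", "?"),
                state=deployment.get("rolloutState", "UNKNOWN"),
                reason=deployment.get("rolloutStateReason", ""),
            )
        )
    return lines


# Hypothetical sample shaped like a DescribeServices response entry.
sample_service = {
    "serviceName": "api",
    "deployments": [
        {
            "id": "ecs-svc/111",
            "status": "PRIMARY",
            "rolloutState": "IN_PROGRESS",
            "rolloutStateReason": "ECS deployment ecs-svc/111 in progress.",
        },
        {
            "id": "ecs-svc/222",
            "status": "ACTIVE",
            "rolloutState": "FAILED",
            "rolloutStateReason": "ECS deployment circuit breaker: tasks failed to start.",
        },
    ],
}

for line in rollout_summary(sample_service):
    print(line)
```

A rolloutState of FAILED on a recent deployment is the signal that the circuit breaker stopped it.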
Enabling the Circuit Breaker
AWS CLI
```bash
aws ecs create-service \
  --cluster production \
  --service-name api \
  --task-definition api:42 \
  --desired-count 3 \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }'
```
Terraform
```hcl
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }
}
```
CloudFormation
```yaml
ApiService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref ProductionCluster
    ServiceName: api
    TaskDefinition: !Ref ApiTaskDefinition
    DesiredCount: 3
    LaunchType: FARGATE
    DeploymentConfiguration:
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true
      MaximumPercent: 200
      MinimumHealthyPercent: 100
```
Why rollback: true? With enable: true alone, the circuit breaker stops the bad deployment but leaves the service in a partially updated state. Setting rollback: true tells ECS to actively roll back to the last successful task definition, restoring full capacity automatically.
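For a service that already exists, the same settings can be applied through the ECS UpdateService API. The sketch below assumes boto3 and valid AWS credentials for the actual call; the function names circuit_breaker_config and enable_circuit_breaker are hypothetical helpers, not part of any AWS SDK.

```python
from typing import Any


def circuit_breaker_config(
    rollback: bool = True,
    maximum_percent: int = 200,
    minimum_healthy_percent: int = 100,
) -> dict[str, Any]:
    """Build the deploymentConfiguration payload used in the examples above."""
    return {
        "deploymentCircuitBreaker": {"enable": True, "rollback": rollback},
        "maximumPercent": maximum_percent,
        "minimumHealthyPercent": minimum_healthy_percent,
    }


def enable_circuit_breaker(cluster: str, service: str) -> None:
    """Apply the configuration to an existing service (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ecs = boto3.client("ecs")
    ecs.update_service(
        cluster=cluster,
        service=service,
        deploymentConfiguration=circuit_breaker_config(),
    )


print(circuit_breaker_config())
```

Calling enable_circuit_breaker("production", "api") would update the service from the examples above; the cluster and service names are the ones this article uses, so substitute your own.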
What Triggers the Circuit Breaker
The circuit breaker counts failed tasks against a threshold derived from your desiredCount. ECS evaluates deployment health in phases:
- Task launch failures: the container fails to start (bad image, OOM on startup, missing secrets), so new tasks never reach the RUNNING state
- Health check failures: the container starts but fails ELB health checks or ECS container health checks, and ECS replaces the task
- Repeated failures: once the number of failed tasks reaches the threshold for your desiredCount, the circuit breaker trips
The exact threshold logic, per the AWS documentation: the threshold is a number of failed tasks, not a failure rate. ECS takes 0.5 × desiredCount, rounds up, and bounds the result to a minimum of 3 and a maximum of 200. A service with a desiredCount of 3 therefore trips after 3 failed tasks, one with a desiredCount of 50 trips after 25, and anything above 400 is capped at 200.
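As a rough illustration of the threshold arithmetic, here is a small sketch of the formula as AWS currently documents it: half the desired count, rounded up, bounded to a minimum of 3 and a maximum of 200. Older material cites a minimum of 10, so treat the bounds as assumptions and check the current ECS documentation. The function name is hypothetical.

```python
def circuit_breaker_threshold(
    desired_count: int, minimum: int = 3, maximum: int = 200
) -> int:
    """Return the number of failed tasks that trips the circuit breaker.

    Computes ceil(0.5 * desired_count) with integer math, then clamps the
    result to the [minimum, maximum] range.
    """
    calculated = (desired_count + 1) // 2
    return max(minimum, min(maximum, calculated))


for count in (1, 3, 10, 25, 400, 1000):
    print(count, circuit_breaker_threshold(count))
```

Small services are dominated by the lower bound: a service with one or two tasks still needs several failed launches before ECS declares the deployment failed.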
Common failure patterns that trip the circuit breaker:
| Failure | Root Cause |
|---|---|
| CannotPullContainerError | Image doesn't exist in ECR, or task execution role lacks ecr:GetDownloadUrlForLayer |
| ResourceInitializationError | VPC endpoints missing, security group blocks ECR/S3 access |
| Essential container exited | App crashes on startup (bad config, missing env vars, migration failure) |
| Health check timeout | App takes too long to start, or health check path returns non-200 |
| OOM killed | Container exceeds its memory limit on startup |
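When triaging a tripped circuit breaker, the table above can be turned into a quick lookup against a stopped task's reason text. This is only a sketch: the substrings are assumptions about how these failures commonly appear in a task's stoppedReason or a container's reason field, and the names FAILURE_HINTS and diagnose are hypothetical.

```python
# Substrings are assumptions about typical ECS reason strings; adjust them
# to match what you actually observe in your stopped tasks.
FAILURE_HINTS = [
    ("CannotPullContainerError", "Image missing in ECR, or execution role cannot pull it"),
    ("ResourceInitializationError", "VPC endpoints missing, or security group blocks ECR/S3 access"),
    ("OutOfMemoryError", "Container exceeded its memory limit"),
    ("health check", "App too slow to start, or health check path returns non-200"),
    ("Essential container", "App crashed on startup (bad config, missing env vars, migration failure)"),
]


def diagnose(reason: str) -> str:
    """Map a stopped-task reason string to the likely root cause."""
    lowered = reason.lower()
    for needle, hint in FAILURE_HINTS:
        if needle.lower() in lowered:
            return hint
    return "Unrecognized failure; inspect the stopped reason and container logs"


print(diagnose("Task failed ELB health checks in (target-group example)"))
```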
Alerting When the Circuit Breaker Triggers
The circuit breaker handles rollback automatically, but your team still needs to know it happened. A deployment that silently rolls back is a bug that silently stays in your codebase. Use EventBridge to catch the SERVICE_DEPLOYMENT_FAILED event and publish it to an SNS topic that notifies your team.
EventBridge Rule + SNS Notification
```hcl
# --- SNS topic for deployment alerts ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.deployment_alerts.arn
  protocol  = "email"
  endpoint  = "[email protected]"
}

# --- EventBridge rule: fires when circuit breaker triggers ---
resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name        = "ecs-deployment-circuit-breaker-triggered"
  description = "Fires when an ECS deployment fails (circuit breaker rollback)"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

# Allow EventBridge to publish to SNS
resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}
```
CloudFormation
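```yaml
DeploymentAlertsTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: ecs-deployment-alerts

DeploymentAlertsSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref DeploymentAlertsTopic
    Protocol: email
    # Quoted so YAML reads it as a string; replace with your team's address
    Endpoint: "[email protected]"

DeploymentFailedRule:
  Type: AWS::Events::Rule
  Properties:
    Name: ecs-deployment-circuit-breaker-triggered
    Description: Fires when an ECS deployment fails (circuit breaker rollback)
    EventPattern:
      source:
        - aws.ecs
      detail-type:
        - "ECS Deployment State Change"
      detail:
        eventName:
          - SERVICE_DEPLOYMENT_FAILED
    Targets:
      - Id: deployment-failure-alert
        Arn: !Ref DeploymentAlertsTopic

# Allow EventBridge to publish to SNS (without this, the rule matches but delivery fails)
DeploymentAlertsTopicPolicy:
  Type: AWS::SNS::TopicPolicy
  Properties:
    Topics:
      - !Ref DeploymentAlertsTopic
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: events.amazonaws.com
          Action: sns:Publish
          Resource: !Ref DeploymentAlertsTopic
```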
Scoping to a Specific Cluster or Service
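The rule above fires for any ECS deployment failure in the account. To scope it to a specific cluster or service, add a filter to the event pattern. In the event format AWS documents for ECS Deployment State Change, the detail block carries no cluster or service fields; the service ARN appears in the top-level resources array, so filter there:
```hcl
event_pattern = jsonencode({
  source      = ["aws.ecs"]
  detail-type = ["ECS Deployment State Change"]

  # Service ARNs look like arn:aws:ecs:<region>:<account>:service/<cluster>/<service>,
  # so a prefix ending in "service/<cluster>/" matches every service in that cluster.
  resources = [{
    prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
  }]
  # Or match a single service by its full ARN:
  # resources = [aws_ecs_service.api.id]

  detail = {
    eventName = ["SERVICE_DEPLOYMENT_FAILED"]
  }
})
```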
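To make the filtering behavior concrete, here is a simplified sketch of how an event pattern is evaluated against an event. It covers only exact values and the prefix operator, which is a small subset of what EventBridge supports, and it is not the real matching engine. The function names and sample events are hypothetical, and placing the service ARN in the top-level resources array is an assumption based on the AWS documentation example for this event type.

```python
from typing import Any


def _value_matches(allowed: list[Any], actual: Any) -> bool:
    """True if any rule in `allowed` matches `actual` (or any element of it)."""
    candidates = actual if isinstance(actual, list) else [actual]
    for rule in allowed:
        for candidate in candidates:
            if isinstance(rule, dict) and "prefix" in rule:
                if isinstance(candidate, str) and candidate.startswith(rule["prefix"]):
                    return True
            elif rule == candidate:
                return True
    return False


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Every key in the pattern must be present in the event and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _value_matches(expected, event[key]):
            return False
    return True


account_wide = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Deployment State Change"],
    "detail": {"eventName": ["SERVICE_DEPLOYMENT_FAILED"]},
}
scoped = dict(
    account_wide,
    resources=[{"prefix": "arn:aws:ecs:us-east-1:123456789012:service/production/"}],
)


def make_event(cluster: str) -> dict[str, Any]:
    """Build a hypothetical failure event for the 'api' service in `cluster`."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Deployment State Change",
        "resources": [f"arn:aws:ecs:us-east-1:123456789012:service/{cluster}/api"],
        "detail": {"eventName": "SERVICE_DEPLOYMENT_FAILED"},
    }


print(matches(scoped, make_event("production")))
print(matches(scoped, make_event("staging")))
```

A key that the pattern names but the event lacks makes the whole pattern fail to match, which is why a filter on a field that is absent from the event silently suppresses every alert.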
What the Event Looks Like
When the circuit breaker triggers, EventBridge delivers an event shaped like this (the structure follows the AWS documentation example; the IDs are placeholders):
```json
{
  "source": "aws.ecs",
  "detail-type": "ECS Deployment State Change",
  "resources": [
    "arn:aws:ecs:us-east-1:123456789012:service/production/api"
  ],
  "detail": {
    "eventType": "ERROR",
    "eventName": "SERVICE_DEPLOYMENT_FAILED",
    "deploymentId": "ecs-svc/1234567890",
    "reason": "ECS deployment circuit breaker: task failed to start."
  }
}
```
The reason field tells you it was a circuit breaker rollback. For Slack or PagerDuty integration, use AWS Chatbot or a small Lambda that parses this event and forwards a formatted message.
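If you go the Lambda route, the function only needs to unwrap the SNS record and format the fields it cares about. Below is a minimal sketch, assuming the Lambda is subscribed to the SNS topic and that EventBridge delivers the full event JSON as the SNS message. The names format_alert and handler are hypothetical, and the actual Slack or PagerDuty call is left out because it depends on your webhook setup.

```python
import json
from typing import Any


def format_alert(event: dict[str, Any]) -> str:
    """Turn an ECS deployment failure event into a short text alert."""
    detail = event.get("detail", {})
    resources = event.get("resources", [])
    if resources:
        # Service ARNs end in "service/<cluster>/<service>".
        service = resources[0].split(":service/")[-1]
    else:
        service = detail.get("serviceName", "unknown")
    return (
        f"ECS deployment failed for {service}\n"
        f"Deployment: {detail.get('deploymentId', 'unknown')}\n"
        f"Reason: {detail.get('reason', 'not provided')}"
    )


def handler(sns_event: dict[str, Any], context: Any = None) -> list[str]:
    """Lambda entry point for an SNS subscription: one alert per record."""
    messages = []
    for record in sns_event.get("Records", []):
        ecs_event = json.loads(record["Sns"]["Message"])
        messages.append(format_alert(ecs_event))
    # Forwarding to Slack or PagerDuty would go here.
    return messages


# Hypothetical SNS-wrapped event for local testing.
sample = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "source": "aws.ecs",
                "detail-type": "ECS Deployment State Change",
                "resources": ["arn:aws:ecs:us-east-1:123456789012:service/production/api"],
                "detail": {
                    "eventName": "SERVICE_DEPLOYMENT_FAILED",
                    "deploymentId": "ecs-svc/1234567890",
                    "reason": "ECS deployment circuit breaker: task failed to start.",
                },
            })
        }
    }]
}

print(handler(sample)[0])
```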
Adding Slack Notifications
For Slack, the simplest path is AWS Chatbot:
```hcl
resource "aws_chatbot_slack_channel_configuration" "deployment_alerts" {
  configuration_name = "ecs-deployment-alerts"
  iam_role_arn       = aws_iam_role.chatbot.arn
  slack_channel_id   = "C0123456789"
  slack_team_id      = "T0123456789"
  sns_topic_arns     = [aws_sns_topic.deployment_alerts.arn]
  logging_level      = "INFO"
}
```
Complete Example: Circuit Breaker + Alert
Here's everything wired together:
```hcl
# --- ECS Service with circuit breaker ---
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }

  lifecycle {
    ignore_changes = [task_definition] # Managed by CI/CD
  }
}

# --- Alert when circuit breaker triggers ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name = "ecs-deployment-circuit-breaker-triggered"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    # Scope to the production cluster: the documented event carries the service ARN
    # (arn:aws:ecs:<region>:<account>:service/<cluster>/<service>) in "resources".
    resources = [{
      prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
    }]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}
```
Circuit Breaker (Rolling) vs. Blue/Green
| Feature | Circuit Breaker (Rolling) | Blue/Green (CodeDeploy) |
|---|---|---|
| Rollback speed | Seconds (stops bad tasks) | Minutes (shifts traffic back) |
| Setup complexity | Low (service config only) | High (CodeDeploy + ALB listener rules) |
| Traffic control | No (all-or-nothing per task) | Yes (canary %, linear shifts) |
| Cost during deployment | Brief surge up to maximumPercent (2x with the settings above) | 2x for the whole traffic shift (runs both versions) |
| Best for | Most services | Critical paths needing canary rollout |
For most teams, the circuit breaker provides the best balance of safety and simplicity. Reserve blue/green with CodeDeploy for services where you need fine-grained traffic shifting (10% canary → 50% → 100%).
Production Checklist
- Deployment circuit breaker enabled with rollback: true
- EventBridge rule capturing SERVICE_DEPLOYMENT_FAILED events
- SNS topic with email/Slack subscriptions for deployment failure alerts
- SNS topic policy allowing EventBridge to publish
- EventBridge rule scoped to the correct cluster (not account-wide)
- ECS container health checks configured in task definition
- deployment_minimum_healthy_percent set to 100 (no capacity drop during deploy)
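The service-level items on this checklist can be verified from a DescribeServices response. The sketch below checks only the deployment configuration; the function name audit_service and the sample data are hypothetical, and the EventBridge and SNS items would need separate checks.

```python
from typing import Any


def audit_service(service: dict[str, Any]) -> list[str]:
    """Return checklist findings for one DescribeServices `services` entry."""
    findings = []
    config = service.get("deploymentConfiguration", {})
    breaker = config.get("deploymentCircuitBreaker", {})

    if not breaker.get("enable"):
        findings.append("circuit breaker is not enabled")
    elif not breaker.get("rollback"):
        findings.append("circuit breaker is enabled without rollback")

    if config.get("minimumHealthyPercent", 0) < 100:
        findings.append("minimumHealthyPercent is below 100")

    return findings


# Hypothetical sample: breaker enabled, but rollback and capacity settings are off.
sample_service = {
    "serviceName": "api",
    "deploymentConfiguration": {
        "deploymentCircuitBreaker": {"enable": True, "rollback": False},
        "maximumPercent": 200,
        "minimumHealthyPercent": 50,
    },
}

for finding in audit_service(sample_service):
    print(finding)
```

An empty list means the service passes these two checks.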
Conclusion
The deployment circuit breaker is the minimum viable safety net every ECS service should have. It's a single configuration flag that prevents bad deployments from taking down your service. Wire up an EventBridge rule to catch the failure event, push it to SNS, and your team knows the moment a rollback happens—without anyone staring at the ECS console.
Key principles:
- Enable circuit breaker with rollback: true on every ECS service
- Use EventBridge + SNS to alert on SERVICE_DEPLOYMENT_FAILED
- Scope EventBridge rules to the relevant cluster to avoid noise
- A silent rollback is a bug hiding in your codebase; always alert on it