Tags: AWS, ECS, DevOps, Monitoring

ECS Deployment Circuit Breaker: Automatic Rollback with CloudWatch Alerting

Protect production deployments with ECS deployment circuit breaker for automatic rollback, and a CloudWatch alarm that alerts your team when it triggers.

Azynth Team
8 min read

A bad deployment that stays live for 15 minutes can cause more damage than the bug itself. The ECS deployment circuit breaker detects failed deployments and rolls back to the last stable state with no human intervention required. Add an EventBridge rule with an SNS notification on top, and your team is notified the moment a rollback happens.

How the Circuit Breaker Works

The ECS deployment circuit breaker monitors task health during a rolling deployment. If new tasks keep failing to reach a healthy state, it stops the deployment and rolls back to the previous task definition.

Deployment starts
├── New tasks launch with updated task definition
├── ECS monitors task health (container starts, health checks pass)
├── If tasks keep crashing or failing health checks:
│   ├── Circuit breaker TRIGGERS
│   ├── New tasks are stopped
│   └── Service rolls back to previous task definition
└── If all tasks become healthy:
    └── Deployment completes successfully

The key insight: Without the circuit breaker, ECS will keep trying to launch failing tasks indefinitely. Your service stays in a degraded state with a mix of old and new (broken) tasks until someone notices and manually intervenes.

Enabling the Circuit Breaker

AWS CLI

aws ecs create-service \
  --cluster production \
  --service-name api \
  --task-definition api:42 \
  --desired-count 3 \
  --deployment-configuration '{
    "deploymentCircuitBreaker": { "enable": true, "rollback": true },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }'

Terraform

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }
}

CloudFormation

ApiService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref ProductionCluster
    ServiceName: api
    TaskDefinition: !Ref ApiTaskDefinition
    DesiredCount: 3
    LaunchType: FARGATE
    DeploymentConfiguration:
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true
      MaximumPercent: 200
      MinimumHealthyPercent: 100

Why rollback: true? With enable: true alone, the circuit breaker stops the bad deployment but leaves the service in a partially updated state. Setting rollback: true tells ECS to actively roll back to the last successful task definition, restoring full capacity automatically.
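To catch services that have the breaker enabled but rollback switched off, a small audit helper can inspect the output of the ECS DescribeServices API. The sketch below is a minimal example, not a definitive tool: the function name circuit_breaker_findings and the sample dictionary are hypothetical, and the sample is hand-written to mirror the response shape rather than captured from a real account.

```python
from typing import Any


def circuit_breaker_findings(service: dict[str, Any]) -> list[str]:
    """Return human-readable problems with a service's circuit breaker settings.

    `service` is assumed to be one entry from the "services" list of an
    ECS DescribeServices response.
    """
    config = service.get("deploymentConfiguration", {})
    breaker = config.get("deploymentCircuitBreaker", {})
    findings = []
    if not breaker.get("enable", False):
        findings.append("circuit breaker is disabled")
    elif not breaker.get("rollback", False):
        findings.append("circuit breaker is enabled but rollback is off")
    return findings


# Hand-written sample (hypothetical values) with rollback left off.
sample = {
    "serviceName": "api",
    "deploymentConfiguration": {
        "deploymentCircuitBreaker": {"enable": True, "rollback": False},
        "maximumPercent": 200,
        "minimumHealthyPercent": 100,
    },
}
print(circuit_breaker_findings(sample))
```

With boto3, the input would come from something like boto3.client("ecs").describe_services(cluster="production", services=["api"])["services"], with each entry passed to the helper.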

What Triggers the Circuit Breaker

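The circuit breaker counts failed tasks against a threshold derived from your desiredCount. ECS evaluates deployment health in phases:

  1. Task launch failures: a new task never reaches the RUNNING state because the container fails to start (bad image, OOM on startup, missing secrets). Each such task adds one to the failure count.
  2. Health check failures: once at least one new task is RUNNING, ECS checks ELB health checks, AWS Cloud Map health checks, and ECS container health checks. Each task that fails one adds one to the failure count.
  3. Threshold reached: when the failure count equals the threshold, the circuit breaker trips and the deployment is marked FAILED.

The threshold logic, as described in the AWS documentation: the threshold is half of desiredCount, rounded up, with a minimum of 3 and a maximum of 200. A service with a desiredCount of 1 trips after 3 failed tasks, a service with 25 trips after 13, and a service with 400 or more trips after 200. Check the current AWS documentation for the bounds in effect before relying on exact numbers.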
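To estimate how many failed tasks a given service tolerates before the breaker trips, the calculation can be sketched as follows. It assumes the formula in the AWS documentation at the time of writing (half the desired count, rounded up, bounded between 3 and 200); the constant and function names are hypothetical, and the bounds should be verified against the current documentation.

```python
import math

MIN_THRESHOLD = 3    # assumed lower bound, per AWS documentation
MAX_THRESHOLD = 200  # assumed upper bound, per AWS documentation


def failure_threshold(desired_count: int) -> int:
    """Number of failed tasks at which the deployment is marked FAILED."""
    half = math.ceil(0.5 * desired_count)
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, half))


for desired in (1, 3, 25, 400, 800):
    print(f"desiredCount={desired} -> threshold={failure_threshold(desired)}")
```

Small services always sit at the minimum, so a service with three tasks still has to fail three launches before the rollback starts. That delay is worth factoring into how long a bad deployment can stay in progress.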

Common failure patterns that trip the circuit breaker:

| Failure | Root Cause |
| --- | --- |
| CannotPullContainerError | Image doesn't exist in ECR, or task execution role lacks ecr:GetDownloadUrlForLayer |
| ResourceInitializationError | VPC endpoints missing, security group blocks ECR/S3 access |
| Essential container exited | App crashes on startup (bad config, missing env vars, migration failure) |
| Health check timeout | App takes too long to start, or health check path returns non-200 |
| OOM killed | Container exceeds its memory limit on startup |
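When triaging a tripped breaker, it can help to map a stopped task's reason text onto the failure patterns above. The following is a rough sketch: the marker substrings and the function name likely_root_cause are assumptions made for illustration, and real stoppedReason strings vary, so treat the result as a hint rather than a diagnosis.

```python
# (marker substring, likely root cause) pairs based on the failure patterns
# listed above. The markers are assumptions about the reason text, not an
# exhaustive or official list.
KNOWN_FAILURES = [
    ("CannotPullContainerError", "image missing in ECR or execution role lacks pull permissions"),
    ("ResourceInitializationError", "VPC endpoints missing or security group blocks ECR/S3"),
    ("OutOfMemory", "container exceeded its memory limit"),
    ("health check", "app slow to start or health check path returns non-200"),
    ("Essential container", "app crashed on startup"),
]


def likely_root_cause(stopped_reason: str) -> str:
    """Return the first matching root cause for a task's stopped reason."""
    lowered = stopped_reason.lower()
    for marker, cause in KNOWN_FAILURES:
        if marker.lower() in lowered:
            return cause
    return "unknown; inspect the task's container logs"


# Illustrative reason strings, not captured from a real cluster.
print(likely_root_cause("CannotPullContainerError: pull access denied"))
print(likely_root_cause("Essential container in task exited"))
```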

Alerting When the Circuit Breaker Triggers

The circuit breaker handles rollback automatically, but your team still needs to know it happened. A deployment that silently rolls back is a bug that silently stays in your codebase. Use an EventBridge rule to catch the SERVICE_DEPLOYMENT_FAILED event and publish a notification to an SNS topic.

EventBridge Rule + SNS Notification

# --- SNS topic for deployment alerts ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.deployment_alerts.arn
  protocol  = "email"
  # The address is redacted in the source; substitute your team's email.
  endpoint  = "[email protected]"
}

# --- EventBridge rule: fires when circuit breaker triggers ---
resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name        = "ecs-deployment-circuit-breaker-triggered"
  description = "Fires when an ECS deployment fails (circuit breaker rollback)"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

# Allow EventBridge to publish to SNS
resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}

CloudFormation

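DeploymentAlertsTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: ecs-deployment-alerts

DeploymentAlertsSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref DeploymentAlertsTopic
    Protocol: email
    # The address is redacted in the source; substitute your team's email.
    Endpoint: "[email protected]"

DeploymentFailedRule:
  Type: AWS::Events::Rule
  Properties:
    Name: ecs-deployment-circuit-breaker-triggered
    Description: Fires when an ECS deployment fails (circuit breaker rollback)
    EventPattern:
      source:
        - aws.ecs
      detail-type:
        - "ECS Deployment State Change"
      detail:
        eventName:
          - SERVICE_DEPLOYMENT_FAILED
    Targets:
      - Id: deployment-failure-alert
        Arn: !Ref DeploymentAlertsTopic

# Allow EventBridge to publish to SNS (mirrors the Terraform topic policy)
DeploymentAlertsTopicPolicy:
  Type: AWS::SNS::TopicPolicy
  Properties:
    Topics:
      - !Ref DeploymentAlertsTopic
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: events.amazonaws.com
          Action: sns:Publish
          Resource: !Ref DeploymentAlertsTopic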

Scoping to a Specific Cluster or Service

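The rule above fires for any ECS deployment failure in the account. In AWS's documented examples, deployment state change events identify the service through the top-level resources array (the service ARN) rather than through cluster or service fields inside detail, so a filter on detail.clusterArn may never match. To scope the rule to a specific cluster or service, match on resources instead, and confirm the shape against a real event from your account:

event_pattern = jsonencode({
  source      = ["aws.ecs"]
  detail-type = ["ECS Deployment State Change"]

  # Service ARNs (long format) look like
  # arn:aws:ecs:REGION:ACCOUNT:service/CLUSTER/SERVICE, so a prefix derived
  # from the cluster ARN matches every service in that cluster.
  resources = [{
    prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
  }]

  # Optionally filter by service: append the service name to the prefix,
  # for example ".../service/production/api".

  detail = {
    eventName = ["SERVICE_DEPLOYMENT_FAILED"]
  }
})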

What the Event Looks Like

When the circuit breaker triggers, EventBridge emits an event shaped like this (abridged; the reason text is illustrative and the exact wording may differ):

{
  "source": "aws.ecs",
  "detail-type": "ECS Deployment State Change",
  "resources": [
    "arn:aws:ecs:us-east-1:123456789012:service/production/api"
  ],
  "detail": {
    "eventType": "ERROR",
    "eventName": "SERVICE_DEPLOYMENT_FAILED",
    "deploymentId": "ecs-svc/1234567890",
    "reason": "ECS deployment circuit breaker: tasks failed to start."
  }
}

The reason field tells you why the deployment failed, and the service ARN in resources tells you which service was affected. For Slack or PagerDuty integration, use AWS Chatbot or a small Lambda that parses this event and forwards a formatted message.
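The small Lambda mentioned above could look roughly like the sketch below. It assumes the function is subscribed to the SNS topic, so the EventBridge event arrives as a JSON string in Records[].Sns.Message, and that the target is a Slack incoming webhook whose URL is stored in a hypothetical SLACK_WEBHOOK_URL environment variable. The names format_message and handler are illustrative. The service name is taken from the service ARN in the top-level resources list when present, falling back to detail.serviceName.

```python
import json
import os
import urllib.request


def format_message(event: dict) -> str:
    """Build a short, readable alert from a deployment state change event."""
    detail = event.get("detail", {})
    resources = event.get("resources") or []
    if resources:
        # Last path segment of the service ARN is the service name.
        service = resources[0].rsplit("/", 1)[-1]
    else:
        service = detail.get("serviceName", "unknown service")
    return (
        f"ECS deployment failed for {service}\n"
        f"Deployment: {detail.get('deploymentId', 'unknown')}\n"
        f"Reason: {detail.get('reason', 'not provided')}"
    )


def handler(event, context):
    """Lambda entry point for SNS-delivered EventBridge events."""
    for record in event.get("Records", []):
        message = format_message(json.loads(record["Sns"]["Message"]))
        request = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],  # hypothetical env var
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=5)
```

Keeping the formatting in its own function makes it easy to test locally with a saved event before wiring the handler to SNS.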

Adding Slack Notifications

For Slack, the simplest path is AWS Chatbot:

resource "aws_chatbot_slack_channel_configuration" "deployment_alerts" {
  configuration_name = "ecs-deployment-alerts"
  iam_role_arn       = aws_iam_role.chatbot.arn
  slack_channel_id   = "C0123456789"
  slack_team_id      = "T0123456789"
  sns_topic_arns     = [aws_sns_topic.deployment_alerts.arn]
  logging_level      = "INFO"
}

Complete Example: Circuit Breaker + Alert

Here's everything wired together:

# --- ECS Service with circuit breaker ---
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }

  lifecycle {
    ignore_changes = [task_definition] # Managed by CI/CD
  }
}

# --- Alert when circuit breaker triggers ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name = "ecs-deployment-circuit-breaker-triggered"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]

    # Scope to services in the production cluster by service ARN prefix.
    # Assumes the event carries the service ARN in "resources" rather than a
    # clusterArn field in "detail"; verify against a real event.
    resources = [{
      prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
    }]

    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}

Circuit Breaker vs. Blue/Green vs. Rolling

| Feature | Circuit Breaker (Rolling) | Blue/Green (CodeDeploy) |
| --- | --- | --- |
| Rollback speed | Seconds (stops bad tasks) | Minutes (shifts traffic back) |
| Setup complexity | Low (service config only) | High (CodeDeploy + ALB listener rules) |
| Traffic control | No (all-or-nothing per task) | Yes (canary %, linear shifts) |
| Cost during deployment | Same (replaces tasks in place) | 2x (runs both versions) |
| Best for | Most services | Critical paths needing canary rollout |

For most teams, the circuit breaker provides the best balance of safety and simplicity. Reserve blue/green with CodeDeploy for services where you need fine-grained traffic shifting (10% canary → 50% → 100%).

Production Checklist

  • Deployment circuit breaker enabled with rollback: true
  • EventBridge rule capturing SERVICE_DEPLOYMENT_FAILED events
  • SNS topic with email/Slack subscriptions for deployment failure alerts
  • SNS topic policy allowing EventBridge to publish
  • EventBridge rule scoped to the correct cluster (not account-wide)
  • ECS container health checks configured in task definition
  • deployment_minimum_healthy_percent set to 100 (no capacity drop during deploy)

Conclusion

The deployment circuit breaker is the minimum viable safety net every ECS service should have. It's a single configuration flag that prevents bad deployments from taking down your service. Wire up an EventBridge rule to catch the failure event, push it to SNS, and your team knows the moment a rollback happens—without anyone staring at the ECS console.

Key principles:

  • Enable circuit breaker with rollback: true on every ECS service
  • Use EventBridge + SNS to alert on SERVICE_DEPLOYMENT_FAILED
  • Scope EventBridge rules to the relevant cluster to avoid noise
  • A silent rollback is a bug hiding in your codebase—always alert on it

Need help building deployment safety into your infrastructure? Let's talk about making your deployments boring (in the best way).
