Protect production deployments with the ECS deployment circuit breaker for automatic rollback, and an EventBridge rule that alerts your team through SNS when it triggers.

ECS Deployment Circuit Breaker: Automatic Rollback with CloudWatch Alerting

A bad deployment that stays live for 15 minutes can cause more damage than the bug itself. The ECS deployment circuit breaker automatically detects failed deployments and rolls back to the last stable state, with no human intervention required. Add an EventBridge rule and an SNS topic on top, and your team gets notified the moment a rollback happens.

How the Circuit Breaker Works

The ECS deployment circuit breaker monitors task health during a rolling deployment. If new tasks keep failing to reach a healthy state, it stops the deployment and rolls back to the previous task definition.

Deployment starts
├── New tasks launch with updated task definition
├── ECS monitors task health (container starts, health checks pass)
├── If tasks keep crashing or failing health checks:
│   ├── Circuit breaker TRIGGERS
│   ├── New tasks are stopped
│   └── Service rolls back to previous task definition
└── If all tasks become healthy:
    └── Deployment completes successfully

The key insight: Without the circuit breaker, ECS will keep trying to launch failing tasks indefinitely. Your service stays in a degraded state with a mix of old and new (broken) tasks until someone notices and manually intervenes.

Enabling the Circuit Breaker

AWS CLI

aws ecs create-service \
  --cluster production \
  --service-name api \
  --task-definition api:42 \
  --desired-count 3 \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }'

Terraform

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }
}

CloudFormation

ApiService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref ProductionCluster
    ServiceName: api
    TaskDefinition: !Ref ApiTaskDefinition
    DesiredCount: 3
    LaunchType: FARGATE
    DeploymentConfiguration:
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true
      MaximumPercent: 200
      MinimumHealthyPercent: 100

Why rollback: true? With enable: true alone, the circuit breaker stops the bad deployment but leaves the service in a partially updated state. Setting rollback: true tells ECS to actively roll back to the last successful task definition, restoring full capacity automatically.
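For a service that already exists, the same settings can be applied through the ECS UpdateService API. The following is a minimal sketch using boto3, not a definitive implementation: the function names are hypothetical, the cluster and service names are the ones used in the examples above, and the live call assumes boto3 is installed and AWS credentials are configured.

```python
def circuit_breaker_update_args(cluster: str, service: str) -> dict:
    """Build keyword arguments for ECS UpdateService that turn on the
    circuit breaker with automatic rollback."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
            "maximumPercent": 200,
            "minimumHealthyPercent": 100,
        },
    }


def enable_circuit_breaker(cluster: str, service: str) -> dict:
    """Apply the settings to a live service (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ecs = boto3.client("ecs")
    return ecs.update_service(**circuit_breaker_update_args(cluster, service))
```

Keeping the argument builder separate from the API call makes it easy to review or unit test the configuration before anything touches production, for example by printing `circuit_breaker_update_args("production", "api")`.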

What Triggers the Circuit Breaker

The circuit breaker counts failed tasks against a threshold derived from the service's desiredCount. ECS evaluates deployment health in stages:

Task launch failures: the container fails to start (bad image, OOM on startup, missing secrets)
Health check failures: the container starts but fails ELB health checks or ECS container health checks
Repeated failures: if the failure count reaches the threshold for your desiredCount, the circuit breaker trips

The threshold logic, as AWS documents it at the time of writing: the threshold is half of desiredCount, with a floor of 3 and a ceiling of 200 failed tasks. A service with a desiredCount of 3 trips after 3 failed tasks, while a service with a desiredCount of 400 or more trips after 200. The threshold is a count of failed tasks, not a percentage failure rate, so small services still need several consecutive failures before a rollback starts.
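To make the scaling concrete, here is a small sketch of the threshold arithmetic. It assumes the bounds AWS currently documents (half of desiredCount, minimum 3, maximum 200); the rounding-up of fractional values and the function name are assumptions made for illustration, so check the current ECS documentation before relying on exact numbers.

```python
import math

MIN_THRESHOLD = 3    # assumed floor on failed tasks before the breaker trips
MAX_THRESHOLD = 200  # assumed ceiling


def failure_threshold(desired_count: int) -> int:
    """Approximate number of failed tasks that trips the circuit breaker."""
    half = math.ceil(0.5 * desired_count)  # rounding up is an assumption
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, half))


for count in (1, 3, 25, 400, 800):
    print(f"desiredCount={count:>3} -> threshold={failure_threshold(count)}")
```

Under these assumptions a three-task service needs three failed tasks before rollback begins, which is why a broken deployment can take a few minutes to trip the breaker even on a small service.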

Common failure patterns that trip the circuit breaker:

Failure	Root Cause
`CannotPullContainerError`	Image doesn't exist in ECR, or task execution role lacks `ecr:GetDownloadUrlForLayer`
`ResourceInitializationError`	VPC endpoints missing, security group blocks ECR/S3 access
`Essential container exited`	App crashes on startup (bad config, missing env vars, migration failure)
Health check timeout	App takes too long to start, or health check path returns non-200
OOM killed	Container exceeds its memory limit on startup
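When a rollback does happen, the stop reasons on the failed tasks usually point at one of the rows above. The sketch below groups a reason string into those categories. The substrings are assumptions based on commonly seen ECS stop reason text, and the function and category names are hypothetical, so treat it as a starting point rather than a complete classifier.

```python
# Substring -> category, checked in order. The substrings are assumptions
# based on commonly seen ECS task and container stop reasons.
PATTERNS = [
    ("CannotPullContainerError", "image pull failure"),
    ("ResourceInitializationError", "resource initialization failure"),
    ("OutOfMemoryError", "out of memory"),
    ("health check", "health check failure"),
    ("Essential container in task exited", "container exited"),
]


def classify_stop_reason(reason: str) -> str:
    """Map a stop reason string to one of the failure categories."""
    lowered = reason.lower()
    for needle, category in PATTERNS:
        if needle.lower() in lowered:
            return category
    return "unknown"


print(classify_stop_reason("Task failed ELB health checks in (target-group ...)"))
```

The input would typically be the `stoppedReason` of each task (or the `reason` of each container) returned by the ECS DescribeTasks API for recently stopped tasks.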

Alerting When the Circuit Breaker Triggers

The circuit breaker handles rollback automatically, but your team still needs to know it happened. A deployment that silently rolls back is a bug that silently stays in your codebase. Use an EventBridge rule to catch the SERVICE_DEPLOYMENT_FAILED event and publish a notification through SNS.

EventBridge Rule + SNS Notification

# --- SNS topic for deployment alerts ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.deployment_alerts.arn
  protocol  = "email"
  endpoint  = "alerts@example.com" # placeholder address; use your team's inbox
}

# --- EventBridge rule: fires when circuit breaker triggers ---
resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name        = "ecs-deployment-circuit-breaker-triggered"
  description = "Fires when an ECS deployment fails (circuit breaker rollback)"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

# Allow EventBridge to publish to SNS
resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}

CloudFormation

DeploymentAlertsTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: ecs-deployment-alerts

DeploymentAlertsSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref DeploymentAlertsTopic
    Protocol: email
    Endpoint: alerts@example.com  # placeholder address; use your team's inbox

DeploymentFailedRule:
  Type: AWS::Events::Rule
  Properties:
    Name: ecs-deployment-circuit-breaker-triggered
    Description: Fires when an ECS deployment fails (circuit breaker rollback)
    EventPattern:
      source:
        - aws.ecs
      detail-type:
        - "ECS Deployment State Change"
      detail:
        eventName:
          - SERVICE_DEPLOYMENT_FAILED
    Targets:
      - Id: deployment-failure-alert
        Arn: !Ref DeploymentAlertsTopic

# Allow EventBridge to publish to SNS
DeploymentAlertsTopicPolicy:
  Type: AWS::SNS::TopicPolicy
  Properties:
    Topics:
      - !Ref DeploymentAlertsTopic
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: events.amazonaws.com
          Action: sns:Publish
          Resource: !Ref DeploymentAlertsTopic

Scoping to a Specific Cluster or Service

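The rule above fires for any ECS deployment failure in the account. Deployment state change events identify the service through the ARN in the top-level resources field rather than through cluster or service fields inside detail, so scope the rule by matching on resources. A prefix match on the service ARN covers every service in one cluster:

event_pattern = jsonencode({
  source      = ["aws.ecs"]
  detail-type = ["ECS Deployment State Change"]
  # Service ARNs look like arn:aws:ecs:<region>:<account>:service/<cluster>/<service>,
  # so derive the prefix from the cluster ARN.
  resources = [{
    prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
  }]
  # Or match a single service by its full ARN instead:
  # resources = [aws_ecs_service.api.id]
  detail = {
    eventName = ["SERVICE_DEPLOYMENT_FAILED"]
  }
})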

What the Event Looks Like

When the circuit breaker triggers, ECS sends EventBridge an event like this (abridged):

{
  "source": "aws.ecs",
  "detail-type": "ECS Deployment State Change",
  "resources": [
    "arn:aws:ecs:us-east-1:123456789012:service/production/api"
  ],
  "detail": {
    "eventType": "ERROR",
    "eventName": "SERVICE_DEPLOYMENT_FAILED",
    "deploymentId": "ecs-svc/1234567890",
    "reason": "DEPLOYMENT_FAILURE - rolling back to deploymentId ecs-svc/0987654321"
  }
}

The reason field tells you it was a circuit breaker rollback. For Slack or PagerDuty integration, use AWS Chatbot or a small Lambda that parses this event and forwards a formatted message.
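As one way to picture the "small Lambda" option, here is a minimal sketch of a handler that turns the event into a readable message and posts it to a webhook. It assumes the function is attached directly as the EventBridge rule target (an event delivered through SNS would arrive wrapped in an SNS record instead), that the webhook accepts a JSON body with a `text` field as Slack incoming webhooks do, and that `WEBHOOK_URL` is a hypothetical environment variable you define yourself.

```python
import json
import os
import urllib.request


def format_alert(event: dict) -> str:
    """Build a human-readable message from a deployment state change event."""
    detail = event.get("detail", {})
    resources = event.get("resources") or []
    # Prefer the ARNs in the event envelope; fall back to a detail field if present.
    target = ", ".join(resources) if resources else detail.get("serviceName", "unknown service")
    return (
        f"ECS deployment failed: {target}\n"
        f"Event: {detail.get('eventName', 'unknown')}\n"
        f"Deployment: {detail.get('deploymentId', 'unknown')}\n"
        f"Reason: {detail.get('reason', 'not provided')}"
    )


def handler(event, context):
    """Lambda entry point when the function is the EventBridge rule target."""
    message = format_alert(event)
    webhook_url = os.environ["WEBHOOK_URL"]  # hypothetical variable, set by you
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Splitting formatting from delivery keeps the message logic testable without network access, and the same `format_alert` output could be forwarded to PagerDuty or another channel by swapping the delivery step.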

Adding Slack Notifications

For Slack, the simplest path is AWS Chatbot:

resource "aws_chatbot_slack_channel_configuration" "deployment_alerts" {
  configuration_name = "ecs-deployment-alerts"
  iam_role_arn       = aws_iam_role.chatbot.arn
  slack_channel_id   = "C0123456789"
  slack_team_id      = "T0123456789"
  sns_topic_arns     = [aws_sns_topic.deployment_alerts.arn]

  logging_level = "INFO"
}

Complete Example: Circuit Breaker + Alert

Here's everything wired together:

# --- ECS Service with circuit breaker ---
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.production.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }

  lifecycle {
    ignore_changes = [task_definition]  # Managed by CI/CD
  }
}

# --- Alert when circuit breaker triggers ---
resource "aws_sns_topic" "deployment_alerts" {
  name = "ecs-deployment-alerts"
}

resource "aws_cloudwatch_event_rule" "deployment_failed" {
  name = "ecs-deployment-circuit-breaker-triggered"

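  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Deployment State Change"]
    # Scope to services in the production cluster via the service ARN prefix
    resources = [{
      prefix = "${replace(aws_ecs_cluster.production.arn, ":cluster/", ":service/")}/"
    }]
    detail = {
      eventName = ["SERVICE_DEPLOYMENT_FAILED"]
    }
  })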
}

resource "aws_cloudwatch_event_target" "notify_sns" {
  rule      = aws_cloudwatch_event_rule.deployment_failed.name
  target_id = "deployment-failure-alert"
  arn       = aws_sns_topic.deployment_alerts.arn
}

resource "aws_sns_topic_policy" "eventbridge_publish" {
  arn = aws_sns_topic.deployment_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sns:Publish"
      Resource  = aws_sns_topic.deployment_alerts.arn
    }]
  })
}
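Because the task definition in this example is managed by CI/CD, the pipeline itself may also want to know whether its deployment survived. After a rollback the service settles on the previous task definition, so a step that only waits for the service to stabilize may report success. The sketch below inspects an ECS DescribeServices response for the rollout state of the task definition the pipeline deployed. The function name and the "NOT_FOUND" sentinel are hypothetical, and it assumes the response shape returned by boto3's `describe_services`.

```python
def rollout_state_for(describe_services_response: dict, task_definition_arn: str) -> str:
    """Return the rolloutState of the deployment that uses the given task
    definition, or "NOT_FOUND" if no deployment references it."""
    service = describe_services_response["services"][0]
    for deployment in service.get("deployments", []):
        if deployment.get("taskDefinition") == task_definition_arn:
            return deployment.get("rolloutState", "UNKNOWN")
    return "NOT_FOUND"
```

A pipeline step could poll this and treat anything other than "COMPLETED" for its own task definition as a failed release, including "NOT_FOUND", which may indicate that the failed deployment has already been cleaned up after a rollback.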

Circuit Breaker vs. Blue/Green vs. Rolling

Feature	Circuit Breaker (Rolling)	Blue/Green (CodeDeploy)
Rollback speed	Minutes (failure threshold must be reached, then new tasks are stopped)	Seconds to minutes (shifts traffic back to the original task set)
Setup complexity	Low (service config only)	High (CodeDeploy + ALB listener rules)
Traffic control	No (all-or-nothing per task)	Yes (canary %, linear shifts)
Cost during deployment	Up to maximumPercent of desiredCount (2x at 200%)	2x (runs both versions)
Best for	Most services	Critical paths needing canary rollout

For most teams, the circuit breaker provides the best balance of safety and simplicity. Reserve blue/green with CodeDeploy for services where you need fine-grained traffic shifting (10% canary → 50% → 100%).

Production Checklist

Deployment circuit breaker enabled with rollback: true
EventBridge rule capturing SERVICE_DEPLOYMENT_FAILED events
SNS topic with email/Slack subscriptions for deployment failure alerts
SNS topic policy allowing EventBridge to publish
EventBridge rule scoped to the correct cluster (not account-wide)
ECS container health checks configured in task definition
deployment_minimum_healthy_percent set to 100 (no capacity drop during deploy)

Conclusion

The deployment circuit breaker is the minimum viable safety net every ECS service should have. It's a single configuration block that stops a bad deployment from taking down your service. Wire up an EventBridge rule to catch the failure event, push it to SNS, and your team knows the moment a rollback happens, without anyone staring at the ECS console.

Key principles:

Enable circuit breaker with rollback: true on every ECS service
Use EventBridge + SNS to alert on SERVICE_DEPLOYMENT_FAILED
Scope EventBridge rules to the relevant cluster to avoid noise
A silent rollback is a bug hiding in your codebase—always alert on it

