Incident Response at Scale: From Alert to Resolution
Build resilient systems with effective incident response: on-call best practices, runbooks, blameless postmortems, and SLO-driven reliability.
Production incidents are inevitable. The difference between good and great engineering teams is how quickly they detect, respond to, and learn from incidents.
The Incident Response Lifecycle
- Alert → Monitoring system detects issue
- Acknowledge → Responder accepts the incident
- Triage → Determine severity and impact
- Mitigate → Restore service (stop the bleeding)
- Resolve → Fix root cause
- Post-mortem → Analyze and learn
- Improve → Feedback loop (update alerts, runbooks, code)
Let's build a world-class incident response system.
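The lifecycle above can be sketched as a simple state machine. This is an illustrative sketch, not a prescribed standard; the stage names mirror the list above and the strict linear ordering is an assumption (real incidents sometimes loop back from mitigate to triage).

```python
from enum import Enum

class Stage(Enum):
    ALERT = "alert"
    ACKNOWLEDGE = "acknowledge"
    TRIAGE = "triage"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    POSTMORTEM = "postmortem"
    IMPROVE = "improve"

# Each stage advances to the next one in the lifecycle.
TRANSITIONS = {s: n for s, n in zip(list(Stage), list(Stage)[1:])}

def advance(current: Stage) -> Stage:
    """Return the next lifecycle stage, or raise if the incident is closed."""
    if current not in TRANSITIONS:
        raise ValueError(f"{current.value} is the final stage")
    return TRANSITIONS[current]
```

Encoding the lifecycle explicitly makes it easy for tooling (Slack bots, incident trackers) to enforce that, say, a postmortem cannot be skipped.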
1. Alerting: Signal vs Noise
Bad alerts create alert fatigue. Good alerts are actionable.
Bad Alert
```yaml
# Fires every 5 minutes
alert: HighCPU
expr: cpu_usage > 50%
```
Problems:
- Not actionable (50% is fine)
- No context (which service?)
- Page for non-impact
Good Alert
```yaml
alert: ServiceDegraded
expr: |
  (
    rate(http_requests_total{code=~"5.."}[5m])
    /
    rate(http_requests_total[5m])
  ) > 0.05
  and
  rate(http_requests_total[5m]) > 10
for: 5m
labels:
  severity: page
  service: "{{ $labels.service }}"
annotations:
  summary: "{{ $labels.service }} error rate >5% for 5min"
  description: "Current error rate: {{ $value | humanizePercentage }}"
  runbook: https://runbooks.company.com/high-error-rate
  dashboard: https://grafana.company.com/d/service?var-service={{ $labels.service }}
```
Why it's better:
- ✅ Impact-based (error rate, not CPU)
- ✅ Minimum traffic threshold (avoids false positives)
- ✅ 5-minute window (filters flapping)
- ✅ Links to runbook and dashboard
- ✅ Actionable: check the dashboard, follow runbook
Alert Severity Levels
```yaml
# P0: Page immediately
alerts:
  - name: SystemDown
    severity: critical
    notify: pagerduty
  - name: ErrorRateHigh
    severity: critical
    notify: pagerduty

  # P1: Create ticket, notify during business hours
  - name: LatencyP99High
    severity: warning
    notify: slack

  # P2: Log, no immediate action
  - name: DiskUsageHigh
    severity: info
    notify: slack
```
Golden rule: Only page for customer-impacting issues.
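A minimal routing sketch for the severity levels above; the channel names and the function are illustrative, not a real PagerDuty or Slack integration:

```python
def route_alert(severity: str) -> str:
    """Map alert severity to a notification channel, per the policy above."""
    routes = {
        "critical": "pagerduty",  # P0: page immediately
        "warning": "slack",       # P1: ticket, business hours
        "info": "slack",          # P2: log, no immediate action
    }
    if severity not in routes:
        raise ValueError(f"unknown severity: {severity}")
    return routes[severity]
```

Keeping this mapping in one place (rather than per-alert) makes the "only page for customer impact" rule auditable.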
2. On-Call Rotation
Fair, sustainable on-call schedules prevent burnout:
```yaml
# PagerDuty schedule example
schedules:
  - name: Platform Team Primary
    rotation:
      type: weekly
      handoff_time: "09:00"
      timezone: "America/New_York"
    users:
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
  - name: Platform Team Secondary
    rotation:
      type: weekly
      handoff_time: "09:00"
    users:
      - [email protected]
      - [email protected]

escalation_policy:
  - level: 1
    targets: [Platform Team Primary]
    timeout: 5min  # Escalate if not acked
  - level: 2
    targets: [Platform Team Secondary]
    timeout: 10min
  - level: 3
    targets: [Engineering Manager]
```
Best practices:
- Follow-the-sun: Handoff between US/EU/Asia teams (no 3am pages)
- Secondary on-call: Backup if primary unavailable
- Escalation: Auto-escalate if not acknowledged
- Compensation: Extra PTO or pay for on-call weeks
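The escalation policy can be modeled as a lookup over how long an alert has gone unacknowledged. A sketch, with the levels and timeouts mirroring the example config above (the function name is illustrative):

```python
ESCALATION_POLICY = [
    {"level": 1, "target": "Platform Team Primary", "timeout_min": 5},
    {"level": 2, "target": "Platform Team Secondary", "timeout_min": 10},
    {"level": 3, "target": "Engineering Manager", "timeout_min": None},
]

def next_target(minutes_unacked: float) -> str:
    """Who should be paged, given how long the alert has been unacknowledged."""
    elapsed = 0
    for step in ESCALATION_POLICY:
        # The final level (timeout None) catches everything that remains.
        if step["timeout_min"] is None or minutes_unacked < elapsed + step["timeout_min"]:
            return step["target"]
        elapsed += step["timeout_min"]
    return ESCALATION_POLICY[-1]["target"]
```

With these timeouts, the primary has 5 minutes to ack, the secondary has the next 10, and anything past 15 minutes reaches the engineering manager.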
3. Runbooks: Incident Playbooks
Runbooks guide responders through common incidents:
Runbook: High Error Rate (5xx)
Symptoms
- Error rate >5% for 5 minutes
- Dashboard: https://grafana.company.com/d/errors
Impact
- Customer requests failing
- Severity: HIGH
Investigation
Step 1: Check recent deployments
```bash
# Was there a recent deploy?
kubectl rollout history deployment/api-service

# If yes, rollback
kubectl rollout undo deployment/api-service
```
Step 2: Check dependencies
- Database: https://grafana.company.com/d/postgres
- Redis: https://grafana.company.com/d/redis
- External API: https://status.stripe.com
Step 3: Check logs
```bash
# Recent errors
kubectl logs deployment/api-service --since=10m | grep ERROR

# Common patterns
kubectl logs deployment/api-service --since=10m | grep ERROR | sort | uniq -c | sort -rn
```
Step 4: Check database connections
```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity;

-- Long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
```
Mitigation
If database is the issue:
```bash
# Scale out the API deployment to spread load
kubectl scale deployment api-service --replicas=10

# Or: enable read-only mode to shed write load
kubectl set env deployment/api-service READ_ONLY_MODE=true
```
Communication Template
```
🚨 Incident: High error rate on API service
Status: Investigating
Impact: ~10% of API requests failing
Started: 2026-02-04 14:32 UTC
Last Update: 14:35 UTC

Actions:
- Rolled back recent deployment
- Monitoring error rate

Next update: 14:45 UTC
```
Escalation
If not resolved in 20 minutes, escalate to @platform-leads
Runbooks should be:
- Step-by-step (junior engineer can follow)
- Include copy-paste commands
- Link to dashboards
- Updated after every incident
4. Incident Communication
Keep stakeholders informed without overwhelming them:
Slack Incident Channel
```python
# Automatically create incident channel
import os
import time

import slack_sdk

def create_incident_channel(incident_id: str, severity: str):
    client = slack_sdk.WebClient(token=os.environ["SLACK_TOKEN"])

    # Create channel
    channel_name = f"incident-{incident_id}"
    response = client.conversations_create(name=channel_name)
    channel_id = response["channel"]["id"]

    # Set topic
    client.conversations_setTopic(
        channel=channel_id,
        topic=f"🚨 {severity} - Incident {incident_id}"
    )

    # Invite key people
    client.conversations_invite(
        channel=channel_id,
        users=["U123", "U456"]  # On-call engineers
    )

    # Post initial message
    client.chat_postMessage(
        channel=channel_id,
        text=f"Incident {incident_id} started",
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f"*Incident:* {incident_id}\n"
                        f"*Severity:* {severity}\n"
                        f"*Started:* <!date^{int(time.time())}^{{date_short_pretty}} {{time}}|now>"
                    )
                }
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Dashboard"},
                        "url": "https://grafana.company.com/incident"
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Runbook"},
                        "url": "https://runbooks.company.com"
                    }
                ]
            }
        ]
    )

    return channel_id
```
Status Page Updates
```python
# Update public status page
import requests

def update_status_page(incident_id: str, status: str, message: str):
    # Using the Statuspage.io API
    requests.patch(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents/{incident_id}",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={
            "incident": {
                "status": status,  # investigating, identified, monitoring, resolved
                "body": message,
                "components": {"API": "degraded_performance"}
            }
        }
    )
```
5. SLOs: Reliability Targets
Service Level Objectives define acceptable reliability:
```yaml
# SLO definitions
slos:
  - name: API Availability
    sli: |
      (
        sum(rate(http_requests_total{code!~"5.."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
    target: 0.999  # 99.9% availability
    window: 30d

  - name: API Latency P99
    sli: |
      histogram_quantile(0.99,
        rate(http_request_duration_seconds_bucket[30d])
      )
    target: 0.5  # 500ms
    window: 30d
```
Error budget:
- 99.9% SLO = 0.1% error budget
- 30 days = 43,200 minutes
- Error budget = 43.2 minutes downtime/month
If budget exhausted: Freeze feature releases, focus on reliability.
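The error-budget arithmetic above generalizes to any SLO and window; a one-function sketch (the function name is illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed in the window before the SLO is violated."""
    total_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo) * total_minutes
```

A 99.9% SLO over 30 days yields the 43.2 minutes quoted above; tightening to 99.99% shrinks the budget to about 4.3 minutes, which is why each extra nine is so expensive.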
SLO-Based Alerting
```yaml
# Alert when burning error budget too fast
alert: ErrorBudgetBurnRateFast
expr: |
  (
    1 - (
      sum(rate(http_requests_total{code!~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
  ) > (0.001 * 14.4)  # Burning budget 14.4x faster
for: 5m
annotations:
  summary: "Burning error budget too fast (will exhaust in 2 days at this rate)"
```
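As a sanity check on the 14.4x threshold: at that burn rate, a 30-day budget is exhausted in 30 / 14.4 ≈ 2.08 days, which matches the annotation's "2 days". A small helper to reason about other thresholds (a sketch; the function name is illustrative):

```python
def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the error budget is gone at the current burn rate.

    burn_rate = observed error rate / budgeted error rate;
    a rate of 1.0 spends the budget exactly over the full window.
    """
    if burn_rate <= 0:
        return float("inf")  # Not burning budget at all
    return window_days / burn_rate
```

This is why multi-window burn-rate alerting typically pairs a fast threshold like 14.4x (page) with a slower one like 1x-3x (ticket).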
6. Blameless Postmortems
After every incident, write a postmortem—not to blame, but to learn:
Example Postmortem: API Outage 2026-02-04
```
Incident ID: INC-2026-042
Date: 2026-02-04
Duration: 32 minutes (14:32 - 15:04 UTC)
Severity: Critical
Impact: 15% of API requests failed
```
Summary
Database connection pool exhausted due to memory leak in auth middleware.
Timeline (UTC)
| Time | Event |
|---|---|
| 14:32 | Alert: High error rate (5%) |
| 14:35 | On-call engineer acked, checking logs |
| 14:40 | Identified database connections at max |
| 14:43 | Restarted API pods (temporary fix) |
| 14:45 | Error rate dropped to 2% |
| 14:50 | Identified memory leak in auth middleware |
| 14:55 | Rolled back to previous version |
| 15:04 | Error rate back to normal |
Root Cause
Memory leak in auth-middleware v2.3.0:
```javascript
// Bug: listeners not cleaned up
function authMiddleware(req, res, next) {
  req.on('data', handleData);  // ❌ Never removed
  // ...
}
```
What Went Well
- ✅ Alert fired within 1 minute of issue
- ✅ Runbook helped guide investigation
- ✅ Rollback process was quick and reliable
What Went Wrong
- ❌ No memory leak detection in tests
- ❌ Gradual deploy not enabled (would have caught at 10%)
- ❌ Database connection metric not monitored
Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Add memory leak tests | @alice | 2026-02-11 | Done |
| Enable gradual deploys | @bob | 2026-02-08 | Done |
| Alert on DB connection pool usage | @charlie | 2026-02-06 | Done |
| Document connection pool tuning | @dana | 2026-02-15 | In Progress |
Lessons Learned
- Always test for resource leaks (memory, connections, file handles)
- Gradual/canary deploys catch bugs before 100% rollout
- Monitor connection pools, not just query latency
Key principles:
- Blameless: Focus on systems, not people
- Timeline: Precise timestamps help identify patterns
- Action items: Concrete next steps with owners
- Follow-up: Review action items in 2 weeks
7. Incident Metrics
Track incident response effectiveness:
```python
# Incident metrics dashboard
metrics = {
    # Detection
    "mean_time_to_detect": "2.3 minutes",

    # Response
    "mean_time_to_acknowledge": "1.8 minutes",
    "mean_time_to_mitigate": "12 minutes",
    "mean_time_to_resolve": "28 minutes",

    # Volume
    "incidents_this_month": 8,
    "incidents_by_severity": {
        "critical": 2,
        "high": 3,
        "medium": 3
    },

    # Quality
    "repeat_incidents": 1,       # Same root cause
    "postmortems_completed": 7,  # 7/8 = 87.5%
}
```
Trends to watch:
- MTTR increasing? → Need better runbooks
- Repeat incidents? → Action items not completed
- Detection time high? → Improve monitoring
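These means can be computed directly from incident timestamps. A sketch, assuming each incident record carries detection, acknowledgement, and resolution times (the field names are illustrative):

```python
from datetime import datetime

def mean_minutes(incidents, start_key: str, end_key: str) -> float:
    """Average elapsed minutes between two timestamps across incidents."""
    deltas = [
        (inc[end_key] - inc[start_key]).total_seconds() / 60
        for inc in incidents
    ]
    return sum(deltas) / len(deltas)

# Example: the postmortem incident above (detected 14:32, resolved 15:04)
incidents = [
    {"detected": datetime(2026, 2, 4, 14, 32),
     "acked": datetime(2026, 2, 4, 14, 35),
     "resolved": datetime(2026, 2, 4, 15, 4)},
]

mtta = mean_minutes(incidents, "detected", "acked")     # 3.0 minutes
mttr = mean_minutes(incidents, "detected", "resolved")  # 32.0 minutes
```

Computing these from raw timestamps, rather than hand-maintaining a dashboard, keeps the trend lines honest.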
Tools of the Trade
Alerting:
- Prometheus + Alertmanager
- Grafana Cloud
- Datadog
On-call:
- PagerDuty
- Opsgenie
- VictorOps
Incident management:
- incident.io (Slack-native)
- FireHydrant
- Jeli
Status pages:
- Statuspage.io
- Atlassian Statuspage
Conclusion
Effective incident response is a competitive advantage. With the right processes:
- Incidents detected in seconds
- Responders guided by runbooks
- Communication clear and timely
- Learning captured in postmortems
- Reliability improves over time
Remember: The goal isn't zero incidents—it's learning and improving from every incident.
Need help building your incident response process? Let's talk about improving your reliability.