SRE · Incident Response · Reliability · DevOps

Incident Response at Scale: From Alert to Resolution

Build resilient systems with effective incident response: on-call best practices, runbooks, blameless postmortems, and SLO-driven reliability.

Azynth Team
13 min read


Production incidents are inevitable. The difference between good and great engineering teams is how quickly they detect, respond to, and learn from incidents.

The Incident Response Lifecycle

  1. Alert → Monitoring system detects issue
  2. Acknowledge → Responder accepts the incident
  3. Triage → Determine severity and impact
  4. Mitigate → Restore service (stop the bleeding)
  5. Resolve → Fix root cause
  6. Post-mortem → Analyze and learn
  7. Improve → Feedback loop (update alerts, runbooks, code)
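
Conceptually, the lifecycle is a state machine: an incident moves forward one stage at a time, and tooling should reject skipped steps (e.g. resolving before triage). A minimal sketch in Python — the `Stage` enum and `advance` helper are illustrative, not taken from any real incident tool:

```python
from enum import Enum

class Stage(Enum):
    ALERT = 1
    ACKNOWLEDGED = 2
    TRIAGED = 3
    MITIGATED = 4
    RESOLVED = 5
    POSTMORTEM = 6
    IMPROVED = 7

def advance(current: Stage, target: Stage) -> Stage:
    # Each stage may only advance to the next one; skipping steps
    # (e.g. ALERT straight to RESOLVED) is rejected.
    if target.value != current.value + 1:
        raise ValueError(f"cannot jump from {current.name} to {target.name}")
    return target
```

Enforcing the order in tooling is what turns the lifecycle from a diagram into a process: nobody closes an incident without a postmortem slot being created.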

Let's build a world-class incident response system.

1. Alerting: Signal vs Noise

Bad alerts create alert fatigue. Good alerts are actionable.

Bad Alert

```yaml
# Fires every 5 minutes
alert: HighCPU
expr: cpu_usage > 50%
```

Problems:

  • Not actionable (50% is fine)
  • No context (which service?)
  • Page for non-impact

Good Alert

```yaml
alert: ServiceDegraded
expr: |
  (
    rate(http_requests_total{code=~"5.."}[5m])
    /
    rate(http_requests_total[5m])
  ) > 0.05
  and
  rate(http_requests_total[5m]) > 10
for: 5m
labels:
  severity: page
  service: "{{ $labels.service }}"
annotations:
  summary: "{{ $labels.service }} error rate >5% for 5min"
  description: "Current error rate: {{ $value | humanizePercentage }}"
  runbook: https://runbooks.company.com/high-error-rate
  dashboard: https://grafana.company.com/d/service?var-service={{ $labels.service }}
```

Why it's better:

  • ✅ Impact-based (error rate, not CPU)
  • ✅ Minimum traffic threshold (avoids false positives)
  • ✅ 5-minute window (filters flapping)
  • ✅ Links to runbook and dashboard
  • ✅ Actionable: check the dashboard, follow runbook

Alert Severity Levels

```yaml
# P0: Page immediately
alerts:
  - name: SystemDown
    severity: critical
    notify: pagerduty
  - name: ErrorRateHigh
    severity: critical
    notify: pagerduty

  # P1: Create ticket, notify during business hours
  - name: LatencyP99High
    severity: warning
    notify: slack

  # P2: Log, no immediate action
  - name: DiskUsageHigh
    severity: info
    notify: slack
```

Golden rule: Only page for customer-impacting issues.

2. On-Call Rotation

Fair, sustainable on-call schedules prevent burnout:

```yaml
# PagerDuty schedule example
schedules:
  - name: Platform Team Primary
    rotation:
      type: weekly
      handoff_time: "09:00"
      timezone: "America/New_York"
    users:
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]

  - name: Platform Team Secondary
    rotation:
      type: weekly
      handoff_time: "09:00"
    users:
      - [email protected]
      - [email protected]

escalation_policy:
  - level: 1
    targets: [Platform Team Primary]
    timeout: 5min   # Escalate if not acked
  - level: 2
    targets: [Platform Team Secondary]
    timeout: 10min
  - level: 3
    targets: [Engineering Manager]
```

Best practices:

  • Follow-the-sun: Handoff between US/EU/Asia teams (no 3am pages)
  • Secondary on-call: Backup if primary unavailable
  • Escalation: Auto-escalate if not acknowledged
  • Compensation: Extra PTO or pay for on-call weeks

3. Runbooks: Incident Playbooks

Runbooks guide responders through common incidents:

Runbook: High Error Rate (5xx)

Symptoms
Impact
  • Customer requests failing
  • Severity: HIGH
Investigation
Step 1: Check recent deployments
```bash
# Was there a recent deploy?
kubectl rollout history deployment/api-service

# If yes, rollback
kubectl rollout undo deployment/api-service
```
Step 2: Check dependencies
Step 3: Check logs
```bash
# Recent errors
kubectl logs deployment/api-service --since=10m | grep ERROR

# Common patterns
kubectl logs deployment/api-service --since=10m | grep ERROR | sort | uniq -c | sort -rn
```
Step 4: Check database connections
```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity;

-- Long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
```
Mitigation

If database is the issue:

```bash
# Scale out the API deployment
kubectl scale deployment api-service --replicas=10

# Or: enable read-only mode to shed write load
kubectl set env deployment/api-service READ_ONLY_MODE=true
```
Communication Template
```
🚨 Incident: High error rate on API service
Status: Investigating
Impact: ~10% of API requests failing
Started: 2026-02-04 14:32 UTC
Last Update: 14:35 UTC

Actions:
- Rolled back recent deployment
- Monitoring error rate

Next update: 14:45 UTC
```
Escalation

If not resolved in 20 minutes, escalate to @platform-leads

Runbooks should be:

  • Step-by-step, so a junior engineer can follow them
  • Stocked with copy-paste-ready commands
  • Linked to the relevant dashboards
  • Updated after every incident

4. Incident Communication

Keep stakeholders informed without overwhelming them:

Slack Incident Channel

```python
# Automatically create incident channel
import os
import time

import slack_sdk

def create_incident_channel(incident_id: str, severity: str):
    client = slack_sdk.WebClient(token=os.environ["SLACK_TOKEN"])

    # Create channel
    channel_name = f"incident-{incident_id}"
    response = client.conversations_create(name=channel_name)
    channel_id = response["channel"]["id"]

    # Set topic
    client.conversations_setTopic(
        channel=channel_id,
        topic=f"🚨 {severity} - Incident {incident_id}"
    )

    # Invite key people
    client.conversations_invite(
        channel=channel_id,
        users=["U123", "U456"]  # On-call engineers
    )

    # Post initial message
    client.chat_postMessage(
        channel=channel_id,
        text=f"Incident {incident_id} started",
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f"*Incident:* {incident_id}\n"
                        f"*Severity:* {severity}\n"
                        f"*Started:* <!date^{int(time.time())}^{{date_short_pretty}} {{time}}|now>"
                    )
                }
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Dashboard"},
                        "url": "https://grafana.company.com/incident"
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Runbook"},
                        "url": "https://runbooks.company.com"
                    }
                ]
            }
        ]
    )

    return channel_id
```

Status Page Updates

```python
# Update public status page using the Statuspage.io API
import requests

def update_status_page(incident_id: str, status: str, message: str):
    requests.patch(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents/{incident_id}",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={
            "incident": {
                "status": status,  # investigating, identified, monitoring, resolved
                "body": message,
                "components": {"API": "degraded_performance"}
            }
        }
    )
```

5. SLOs: Reliability Targets

Service Level Objectives define acceptable reliability:

```yaml
# SLO definitions
slos:
  - name: API Availability
    sli: |
      (
        sum(rate(http_requests_total{code!~"5.."}[30d]))
        /
        sum(rate(http_requests_total[30d]))
      )
    target: 0.999  # 99.9% availability
    window: 30d

  - name: API Latency P99
    sli: |
      histogram_quantile(0.99,
        rate(http_request_duration_seconds_bucket[30d])
      )
    target: 0.5  # 500ms
    window: 30d
```

Error budget:

  • 99.9% SLO = 0.1% error budget
  • 30 days = 43,200 minutes
  • Error budget = 43.2 minutes downtime/month

If budget exhausted: Freeze feature releases, focus on reliability.
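
The budget arithmetic is worth encoding so nobody recomputes it by hand during a review. A small sketch (function names are our own):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime per SLO window, in minutes."""
    return (1 - slo) * window_days * 24 * 60

def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """If the budget burns at `burn_rate`x the sustainable rate, how long it lasts."""
    return window_days / burn_rate

budget = error_budget_minutes(0.999)   # ≈ 43.2 minutes per 30 days
runway = days_to_exhaustion(14.4)      # ≈ 2.08 days — the 14.4x figure used in burn-rate alerting
```

The second helper explains why a 14.4x burn rate is a common fast-burn threshold: at that pace a 30-day budget is gone in about two days.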

SLO-Based Alerting

```yaml
# Alert when burning error budget too fast
alert: ErrorBudgetBurnRateFast
expr: |
  (
    1 - (
      sum(rate(http_requests_total{code!~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
  ) > (0.001 * 14.4)  # Burning budget 14.4x faster than sustainable
for: 5m
annotations:
  summary: "Burning error budget too fast (will exhaust in ~2 days at this rate)"
```

6. Blameless Postmortems

After every incident, write a postmortem—not to blame, but to learn:

Example Postmortem: API Outage 2026-02-04

Incident ID: INC-2026-042
Date: 2026-02-04
Duration: 32 minutes (14:32 - 15:04 UTC)
Severity: Critical
Impact: 15% of API requests failed

Summary

Database connection pool exhausted due to memory leak in auth middleware.

Timeline (UTC)

| Time  | Event |
| ----- | ----- |
| 14:32 | Alert: High error rate (5%) |
| 14:35 | On-call engineer acked, checking logs |
| 14:40 | Identified database connections at max |
| 14:43 | Restarted API pods (temporary fix) |
| 14:45 | Error rate dropped to 2% |
| 14:50 | Identified memory leak in auth middleware |
| 14:55 | Rolled back to previous version |
| 15:04 | Error rate back to normal |

Root Cause

Memory leak in auth-middleware v2.3.0:

```javascript
// Bug: listeners not cleaned up
function authMiddleware(req, res, next) {
  req.on('data', handleData); // ❌ Never removed
  // ...
}
```

What Went Well

  • ✅ Alert fired within 1 minute of issue
  • ✅ Runbook helped guide investigation
  • ✅ Rollback process was quick and reliable

What Went Wrong

  • ❌ No memory leak detection in tests
  • ❌ Gradual deploy not enabled (would have caught at 10%)
  • ❌ Database connection metric not monitored

Action Items

| Action | Owner | Due Date | Status |
| ------ | ----- | -------- | ------ |
| Add memory leak tests | @alice | 2026-02-11 | Done |
| Enable gradual deploys | @bob | 2026-02-08 | Done |
| Alert on DB connection pool usage | @charlie | 2026-02-06 | Done |
| Document connection pool tuning | @dana | 2026-02-15 | In Progress |

Lessons Learned

  1. Always test for resource leaks (memory, connections, file handles)
  2. Gradual/canary deploys catch bugs before 100% rollout
  3. Monitor connection pools, not just query latency

Key principles:

  • Blameless: Focus on systems, not people
  • Timeline: Precise timestamps help identify patterns
  • Action items: Concrete next steps with owners
  • Follow-up: Review action items in 2 weeks

7. Incident Metrics

Track incident response effectiveness:

```python
# Incident metrics dashboard
metrics = {
    # Detection
    "mean_time_to_detect": "2.3 minutes",

    # Response
    "mean_time_to_acknowledge": "1.8 minutes",
    "mean_time_to_mitigate": "12 minutes",
    "mean_time_to_resolve": "28 minutes",

    # Volume
    "incidents_this_month": 8,
    "incidents_by_severity": {
        "critical": 2,
        "high": 3,
        "medium": 3,
    },

    # Quality
    "repeat_incidents": 1,        # Same root cause
    "postmortems_completed": 7,   # 7/8 = 87.5%
}
```

Trends to watch:

  • MTTR increasing? → Need better runbooks
  • Repeat incidents? → Action items not completed
  • Detection time high? → Improve monitoring
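
These metrics fall straight out of incident timestamps, so they are easy to automate. A sketch computing MTTA and MTTR from a hypothetical incident log (data and helper names are our own):

```python
from datetime import datetime

# Hypothetical incident log: (detected, acknowledged, resolved) timestamps, UTC
INCIDENTS = [
    ("2026-02-04 14:32", "2026-02-04 14:35", "2026-02-04 15:04"),
    ("2026-02-10 09:10", "2026-02-10 09:11", "2026-02-10 09:40"),
]

def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def mean_minutes(pairs) -> float:
    """Average gap in minutes between each (start, end) timestamp pair."""
    gaps = [(_parse(end) - _parse(start)).total_seconds() / 60 for start, end in pairs]
    return sum(gaps) / len(gaps)

mtta = mean_minutes([(d, a) for d, a, _ in INCIDENTS])  # mean time to acknowledge
mttr = mean_minutes([(d, r) for d, _, r in INCIDENTS])  # mean time to resolve
```

Tracking these as a trend line, not a single number, is what makes them actionable: one bad incident can double a monthly average.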

Tools of the Trade

Alerting:

  • Prometheus + Alertmanager
  • Grafana Cloud
  • Datadog

On-call:

  • PagerDuty
  • Opsgenie
  • Splunk On-Call (formerly VictorOps)

Incident management:

  • incident.io (Slack-native)
  • FireHydrant
  • Jeli

Status pages:

  • Atlassian Statuspage (statuspage.io)

Conclusion

Effective incident response is a competitive advantage. With the right processes:

  • Incidents detected in seconds
  • Responders guided by runbooks
  • Communication clear and timely
  • Learning captured in postmortems
  • Reliability improves over time

Remember: The goal isn't zero incidents—it's learning and improving from every incident.


Need help building your incident response process? Let's talk about improving your reliability.
