When things break at 3am, nobody wants to think from first principles. Generate runbooks in advance.
Generate from Your Codebase
Analyze our infrastructure code in terraform/ and our services in src/.
For each service, generate a runbook covering:
1. Health check endpoints and how to verify they work
2. Common failure modes (DB connection, memory, disk)
3. Diagnostic commands to run for each failure
4. Recovery steps
5. Escalation criteria
Output as markdown files in docs/runbooks/.
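Before running a prompt like the one above, it can help to know which services still lack a runbook. The following is a minimal sketch, not a definitive tool. It assumes each service lives in its own directory directly under src/ and that its runbook is named docs/runbooks/<service>.md. Both conventions and the function name `missing_runbooks` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list services that have no runbook yet.
# Assumptions (not from the original guide): each service is a directory
# directly under the source dir, and its runbook is <runbook_dir>/<service>.md.

missing_runbooks() {
  local src_dir="$1" runbook_dir="$2" dir name
  for dir in "$src_dir"/*/; do
    # Skip the unexpanded glob when src_dir has no subdirectories.
    [ -d "$dir" ] || continue
    name="$(basename "$dir")"
    # Print the service name only when its runbook file is absent.
    [ -f "$runbook_dir/$name.md" ] || printf '%s\n' "$name"
  done
}
```

Calling `missing_runbooks src docs/runbooks` from the repository root prints one service name per line, which can be pasted into the prompt to narrow its scope.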
Template Structure
Claude generates runbooks like this:
# API Service Runbook
## Health Check
curl https://api.example.com/health
## Symptoms: 5xx Spike
1. Check pod status: `kubectl get pods -l app=api`
2. Check recent deploys: `kubectl rollout history deployment/api`
3. Check DB connections: `kubectl exec deploy/api -- pg_isready`
4. If DB unreachable: check `terraform/rds.tf` for config
## Recovery: Rollback
kubectl rollout undo deployment/api
## Escalation
If not resolved within 15 minutes, page the on-call SRE.
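The read-only diagnostic steps in a runbook like this one can also be wrapped in a small script, so that nobody has to retype them at 3am. This is a sketch under stated assumptions: the label `app=api` and `deployment/api` come from the template above, while the `run` helper, the `triage_5xx` function, and the `DRY_RUN` switch are hypothetical names introduced for this example. The rollback is deliberately left out because it changes cluster state.

```shell
#!/usr/bin/env bash
# Sketch: run the read-only 5xx diagnostics from the runbook in one go.
# DRY_RUN is a hypothetical switch for this example, not a kubectl feature.

run() {
  # Echo the command, then execute it unless DRY_RUN=1.
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

triage_5xx() {
  local app="${1:-api}"
  run kubectl get pods -l "app=$app"              # step 1: pod status
  run kubectl rollout history "deployment/$app"   # step 2: recent deploys
}
```

Running `DRY_RUN=1 triage_5xx api` prints the commands without touching the cluster, which is a cheap way to review the script before an incident.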
Keep Them Updated
Compare docs/runbooks/ against the current infrastructure code.
Flag any runbooks that reference services, commands, or configs
that no longer exist. Update them.
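Part of this comparison can be approximated mechanically before asking Claude to do the full review. The sketch below flags runbooks that mention repository paths which no longer exist. It assumes paths are written in backticks and begin with terraform/ or src/, as `terraform/rds.tf` does in the template. The function name `stale_references` is hypothetical, and the check does not cover commands or service names.

```shell
#!/usr/bin/env bash
# Sketch: flag runbooks that mention repo paths which no longer exist.
# Assumption: paths appear in backticks and start with terraform/ or src/.

stale_references() {
  local runbook_dir="$1" repo_root="$2" file path
  for file in "$runbook_dir"/*.md; do
    [ -f "$file" ] || continue
    # Extract backticked paths, strip the backticks, and de-duplicate.
    grep -oE '`(terraform|src)/[^` ]+`' "$file" | tr -d '`' | sort -u |
    while IFS= read -r path; do
      # Report "<runbook>: <path>" for every path missing from the repo.
      [ -e "$repo_root/$path" ] || printf '%s: %s\n' "$file" "$path"
    done
  done
}
```

Calling `stale_references docs/runbooks .` from the repository root prints one line per broken reference. Empty output only means that no referenced file has gone missing, not that the runbooks are otherwise current.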
Tip
Store runbooks in the repo alongside the code they describe. When the code changes, Claude can update the runbook in the same PR.
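One way to make this habit stick is a small check that notices when a change touches terraform/ or src/ but leaves docs/runbooks/ alone. The sketch below reads changed file paths from standard input, one per line. The function name `runbooks_need_review` is hypothetical, and the directory layout is taken from the prompts above.

```shell
#!/usr/bin/env bash
# Sketch: detect changes that touch code but no runbook.
# Assumption: changed file paths arrive on stdin, one per line.

runbooks_need_review() {
  local code_changed=0 runbook_changed=0 path
  while IFS= read -r path; do
    case "$path" in
      docs/runbooks/*) runbook_changed=1 ;;
      terraform/*|src/*) code_changed=1 ;;
    esac
  done
  # Exit 0 only when code changed and no runbook changed with it.
  [ "$code_changed" -eq 1 ] && [ "$runbook_changed" -eq 0 ]
}
```

A possible use in CI, assuming the base branch is `origin/main`, is `git diff --name-only origin/main...HEAD | runbooks_need_review && echo "Code changed without a runbook update"`. A warning is probably a better fit than a hard failure here, since many code changes do not affect operations.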