Module Runbook Template
On-call playbook skeleton. Copy to `apps/<service>/<module>/RUNBOOK.md` and fill in the placeholders. The runbook lives next to the code so it stays current.
# <Module Name> — Runbook
- **Module ID:** <domain.module_id>
- **Owner:** @<team-lead>
- **On-call rotation:** <link to rotation table>
- **Status dashboard:** <link to Datadog / Grafana>
- **SLO:** <p95 latency, error rate, reconciliation tolerance>
## 1. Dependencies
```mermaid
flowchart LR
Me[this module] --> DB[(Postgres)]
Me --> Cache[(Redis)]
Me --> Bus[(queue)]
Me --> Upstream[upstream service]
Downstream[downstream consumer] --> Me
```
| Dependency | Connection | Failure mode |
|---|---|---|
| Postgres | pool | writes fail; reads degrade to cache |
| Redis | cache + queue | performance drop; still runnable |
| Queue | event publish | cross-module events delayed |
| External API | HTTP | circuit breaker; fallback TBD |
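
The "circuit breaker" failure mode in the table can be sketched as a small state machine. This is a minimal illustration only, not this module's actual implementation: the `CircuitBreaker` class, its method names, and the threshold and cooldown parameters are all hypothetical, and the clock is injected so the behavior can be checked without waiting.

```typescript
type BreakerState = "closed" | "open" | "half_open";

// Hypothetical breaker: opens after N consecutive failures, then allows one
// trial request once the cooldown has elapsed.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Call before each upstream request; false means "skip the call, use the fallback".
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAtMs >= this.cooldownMs) {
      this.state = "half_open"; // cooldown elapsed: let a trial request through
    }
    return this.state !== "open";
  }

  // Call after a successful upstream response.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  // Call after an upstream error or timeout.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === "half_open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before each upstream HTTP call and report the outcome with `recordSuccess()` or `recordFailure()`; what to return while the breaker is open is the fallback the table still lists as TBD.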
## 2. Common incident playbooks
### 2.1 5xx error rate spike
**Detect:** alert `<module>.error_rate > 1% for 5min`
**Triage:**
1. Sentry latest errors: <URL>
2. DB pool status: <URL>
3. Upstream health: <URL>
4. Recent deploys: `gh run list --workflow=deploy`
**Common causes & action:**
- DB connection exhaustion → scale api pods, kill long-running queries
- Upstream timeout → open circuit breaker, page upstream team
- Bad deploy → roll back within 5 min
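
The detection rule for this playbook fires only when the error rate stays above 1% for the whole 5-minute window, not on a single bad sample. A rough sketch of that "sustained breach" check follows, assuming one error-rate sample per minute. `ErrorRateSample` and `isSustainedBreach` are hypothetical names for illustration and do not correspond to any Datadog or Grafana API.

```typescript
// Hypothetical sample shape: one error-rate reading per minute.
interface ErrorRateSample {
  timestampMs: number; // epoch milliseconds
  errorRate: number; // fraction of requests answered with 5xx, in [0, 1]
}

// True when the trailing window holds enough samples and every one of them
// exceeds the threshold.
function isSustainedBreach(
  samples: ErrorRateSample[],
  nowMs: number,
  threshold: number = 0.01, // 1%, from the alert rule
  windowMs: number = 5 * 60 * 1000, // 5 min, from the alert rule
  minSamples: number = 5, // assumption: one sample per minute
): boolean {
  // Keep samples inside the half-open window (now - windowMs, now].
  const inWindow = samples.filter(
    (s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  );
  // Too few samples means we cannot claim the breach was sustained.
  if (inWindow.length < minSamples) return false;
  return inWindow.every((s) => s.errorRate > threshold);
}
```

Requiring a minimum sample count keeps a gap in metrics from being read as a sustained breach; whether missing data should page instead is a separate decision for the module owner.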
### 2.2 Reconciliation diff over SLO (migration only)
**Detect:** `<module>.reconciliation.diff_rate > SLO`
**Triage:**
1. `SELECT * FROM reconciliation_diffs WHERE module = '<module>' ORDER BY detected_at DESC LIMIT 50;`
2. Recent schema changes? Check `git log -- packages/db/schema/<module>.*.ts`
3. Bridge worker lag: <URL>
**Action:**
- Small batch → manual reconcile + patch
- Large batch → halt stage progression, roll back to previous stage, investigate root cause
- Persistent > 24h → escalate to P1, engage architect
### 2.3 HITL queue stuck
**Detect:** pending HITL tasks > 50, or oldest > 4h
**Action:**
1. Identify stuck workflows in HITL dashboard
2. Approver absent? Escalate to backup
3. Agent loop? Pause workflow, disable feature flag
4. Pending tasks > 200 → P2 incident
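
The thresholds in this playbook (more than 50 pending or oldest older than 4h to detect, more than 200 pending for a P2) can be expressed as one small classifier. This is a sketch under the assumption that the queue exposes a pending count and the age of its oldest task; `HitlStatus` and `classifyHitlQueue` are hypothetical names.

```typescript
type HitlStatus = "ok" | "stuck" | "p2";

// Maps queue metrics to the playbook's thresholds.
function classifyHitlQueue(pendingCount: number, oldestAgeHours: number): HitlStatus {
  // More than 200 pending tasks is a P2 incident regardless of task age.
  if (pendingCount > 200) return "p2";
  // Detection rule: pending > 50, or oldest task older than 4h.
  if (pendingCount > 50 || oldestAgeHours > 4) return "stuck";
  return "ok";
}
```

All comparisons are strict, matching the `>` in the rules above, so exactly 50 pending tasks or an oldest task of exactly 4h does not trigger.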
### 2.4 Agent / LLM cost spike
**Detect:** `llm.cost.per_hour > baseline × 2`
**Action:**
1. Token dashboard: which feature / tenant / model?
2. Suspected infinite loop? Check the retry pattern
3. Enable rate limit, disable affected feature flag
4. Notify AI team + cost owner
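
The detection rule and the first step ("which feature / tenant / model?") amount to a threshold check plus a group-by. A minimal sketch follows, assuming usage can be exported as flat records; the `LlmUsage` shape and the `isCostSpike` and `costBy` helpers are hypothetical and not tied to any real billing or token-dashboard API.

```typescript
// Hypothetical usage record; field names are illustrative only.
interface LlmUsage {
  feature: string;
  tenant: string;
  model: string;
  costUsd: number;
}

type Dimension = "feature" | "tenant" | "model";

// Mirrors the alert rule: current hourly cost above baseline × factor.
function isCostSpike(costPerHour: number, baselinePerHour: number, factor: number = 2): boolean {
  return costPerHour > baselinePerHour * factor;
}

// Sums cost per value of one dimension; returns [value, total] pairs, highest cost first.
function costBy(records: LlmUsage[], dimension: Dimension): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const key = r[dimension];
    totals.set(key, (totals.get(key) ?? 0) + r.costUsd);
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}
```

Running `costBy` once per dimension shows whether the spike concentrates in a single feature, tenant, or model, which in turn suggests which feature flag or rate limit to reach for in step 3.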
## 3. Common operations
### Toggle migration feature flag
```bash
# Stage transitions require 2-person sign-off
gh pr create --base main --title "flag: migration.<module>.mode = double_write"
```
### Manual reconciliation
```bash
pnpm --filter @<org>/worker reconcile --module <module> --window 24h
```
### Deploy rollback
```bash
# Cloud Run
gcloud run services update-traffic api --to-revisions=<prev>=100
# Vercel
vercel rollback <deployment-url>
```
### Trace lookup
```
https://<otel-backend>/traces?service=<module>&trace_id=<id>
```
## 4. Escalation
| Level | Trigger | Contacts |
|---|---|---|
| P3 | single user; can wait | module owner |
| P2 | multi-user; workaround exists | owner + on-call manager |
| P1 | module unavailable / data at risk | architect + owner + VP |
| P0 | money / PII / regulatory | war-room: architect, CTO, finance, legal |
## 5. Module-specific gotchas
<!-- What surprised the last person on-call. Examples:
- Invoice numbers are monotonic and must not gap/duplicate
- Cache TTL is 30s; reads can be stale for up to 30s after an upstream write
- This module's rows are immutable after status='locked' — don't try to update
-->
## 6. Post-incident
Any P0/P1 incident triggers a postmortem within 24h, using this template:
```
# Postmortem: <title>
Date: YYYY-MM-DD
Duration: <start–resolved>
Impact: <users/txns/$>
Root cause:
Timeline:
Action items:
- [ ] add regression test
- [ ] add runbook entry
- [ ] add alert
```