Monitoring Dashboard Spec
Every migrating module gets a dashboard with the same six blocks. Implementation (Datadog / Grafana / Vercel Analytics / stack-of-choice) is flexible; the panel layout is not.
Header
- Module: <domain.module_id>
- Current migration stage: read from feature flag
- Owner + on-call: <link>
- Last incident: <date + severity>
Block A — Health (RED)
- Request rate per minute, split by method × status_group
- Error rate (% 5xx), SLO line: < 0.5%
- Latency (p50 / p95 / p99), SLO line: module's specified target
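A minimal sketch of the error-rate SLO check behind this panel, assuming request counts are bucketed by status group (function and bucket names are illustrative, not part of the spec):

```python
def error_rate_pct(status_counts: dict[str, int]) -> float:
    """Block A error rate: percent of requests in the 5xx status group."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * status_counts.get("5xx", 0) / total


def breaches_error_slo(status_counts: dict[str, int], slo_pct: float = 0.5) -> bool:
    """True when the error rate crosses the 0.5% SLO line."""
    return error_rate_pct(status_counts) > slo_pct
```

The same shape works for the latency panel by swapping the percentage for a p95/p99 quantile against the module's specified target.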
Block B — Business KPIs (module-specific)
Tie these back to the module's success metrics in its playbook § 11. Typical examples:
- Entity create / update rate
- Pending / failed records
- Throughput for the module's main business event
Each KPI panel in Block B should have a counterpart in the module playbook — no dashboard metric without a documented target.
Block C — Migration (Stage 1–3 only)
- Reconciliation diff rate over 24h, SLO line from module playbook
- Dual-write lag (new side behind old, seconds)
- Stage card: current stage, entered at, exit-threshold progress
- Rollback drill: last rehearsal date
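One way to compute the reconciliation diff rate is the fraction of keys whose values disagree between the old and new sides, counting rows missing from either side as diffs. A sketch under that assumption (names are illustrative):

```python
def recon_diff_rate(old_rows: dict, new_rows: dict) -> float:
    """Fraction of reconciled records that differ between old and new
    systems; rows present on only one side count as diffs."""
    keys = set(old_rows) | set(new_rows)
    if not keys:
        return 0.0
    diffs = sum(1 for k in keys if old_rows.get(k) != new_rows.get(k))
    return diffs / len(keys)
```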
Block D — AI / Agent
- Tool call rate by tool name
- HITL checkpoints: pending count + average wait time
- LLM cost: hourly spend + cumulative daily spend
- Cache hit rate: prompt caching
- Eval regression pass rate: latest PR + daily scheduled
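The llm_cost_spike trigger in the alerts table compares hourly spend to a baseline. A hedged sketch, assuming the baseline is the mean of the preceding hours in the lookback (names are illustrative):

```python
from statistics import mean


def cost_spike(hourly_costs: list[float], multiplier: float = 2.0) -> bool:
    """Flag when the latest hour's LLM spend exceeds multiplier x the
    baseline, where baseline = mean of the preceding hours."""
    *history, latest = hourly_costs
    if not history:
        return False  # no baseline yet; do not alert
    return latest > multiplier * mean(history)
```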
Block E — Dependencies
- External API response rate / latency
- Upstream event consume lag
- Downstream event delivery success
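Upstream consume lag can be tracked as offsets outstanding per partition. A sketch assuming an offset-based consumer (Kafka-style); the function name and data shapes are assumptions:

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> int:
    """Total events produced but not yet consumed, summed across
    partitions (end offset minus committed offset)."""
    return sum(end - committed.get(partition, 0)
               for partition, end in end_offsets.items())
```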
Block F — Data Quality
- Ontology drift alerts (last 24h)
- Data integrity (e.g. `budget IS NULL` row count)
- Audit log write rate — should equal mutation rate
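The audit-log invariant (write rate should equal mutation rate) can be checked per interval. A sketch with an assumed 1% tolerance; names are illustrative:

```python
def audit_writes_missing(mutations: int, audit_writes: int,
                         tolerance: float = 0.01) -> bool:
    """Block F check: flag when audit log writes deviate from the
    mutation count by more than `tolerance` (fraction of mutations)."""
    if mutations == 0:
        return audit_writes != 0
    return abs(mutations - audit_writes) / mutations > tolerance
```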
Alerts (mandatory set)
| Name | Trigger | Severity | Notify |
|---|---|---|---|
| <module>.error_rate_high | > 0.5% for 5m | P2 | Slack #<module>-alerts |
| <module>.latency_p95_high | > SLO × 1.5 for 10m | P2 | Slack |
| <module>.recon_diff_high | > SLO for 30m | P1 | PagerDuty |
| <module>.dual_write_lag | > 5m | P2 | Slack |
| <module>.hitl_stuck | pending > 50 or oldest > 4h | P2 | Slack + email |
| <module>.llm_cost_spike | hourly > 2× baseline | P2 | Slack + AI team |
| <module>.eval_regression_fail | any regression fails | P1 (merge block) | PR check |
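The "for 5m" / "for 10m" qualifiers mean the condition must hold continuously, not fire on a single sample. A sketch of that windowed evaluation, assuming one sample per minute (class name is illustrative):

```python
from collections import deque


class SustainedTrigger:
    """Fires only when the breach condition holds for every consecutive
    sample in the window, e.g. five one-minute samples for a 'for 5m' rule."""

    def __init__(self, window: int) -> None:
        self._window = window
        self._samples: deque = deque(maxlen=window)

    def observe(self, breached: bool) -> bool:
        self._samples.append(breached)
        return len(self._samples) == self._window and all(self._samples)
```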
Deployment checklist
Before a module can enter Stage 0, all of these must be green:
- Dashboard URL referenced in the module's runbook
- Block A and Block C show data (even if only in shadow mode)
- At least 3 alerts validated with synthetic triggers
- On-call rotation points to this dashboard
- Weekly review meeting agenda includes dashboard walkthrough
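The "at least 3 alerts validated" item can be enforced as a simple gate over recorded validation timestamps. A sketch; the registry shape and alert names are assumptions:

```python
from datetime import datetime, timezone


def stage0_gate(validations: dict, minimum: int = 3) -> bool:
    """Checklist gate: a module may enter Stage 0 only if at least
    `minimum` alerts have a synthetic-trigger validation on record
    (a None timestamp means the alert has not been validated)."""
    return sum(ts is not None for ts in validations.values()) >= minimum


# Illustrative registry: three alerts validated, one still pending.
stage0_gate({
    "error_rate_high": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "latency_p95_high": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "recon_diff_high": datetime(2024, 5, 2, tzinfo=timezone.utc),
    "dual_write_lag": None,
})  # True
```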