v0.11 migration notes

Operator-facing summary of what changes when you upgrade pmk gateway from v0.10.x to v0.11.0. Three PRs land together: #45 (graceful-restart suppression), #46 (per-channel audience), #47 (monthly-partitioned audit logs). Closes #23 and #44.

TL;DR

✅ Zero config changes required.
✅ Existing ~/.pmk/gateway.json works as-is (new fields are back-filled with defaults).
✅ Existing ~/.pmk/gateway/{events,admin}.log are still readable.
⚠️ The first kill → restart after the upgrade still broadcasts "重新上線" ("back online") once, because the old code left no marker behind. All subsequent graceful restarts within 5 minutes are silent.

New runtime files

| Path | Purpose | Lifetime |
| --- | --- | --- |
| ~/.pmk/gateway/shutdown-marker | Single-use marker dropped on graceful shutdown so the next start can suppress a spurious "重新上線" broadcast (PR #45) | Written on SIGTERM / SIGINT (any graceful shutdown); deleted on next gateway start |
| ~/.pmk/gateway/events-YYYY-MM.log | Monthly partition of the structured event ledger; replaces the single-file events.log (PR #47) | Append-only; a new file appears at each UTC month boundary |
| ~/.pmk/gateway/admin-YYYY-MM.log | Same scheme for the /pmk admin audit trail | Same |

The legacy events.log / admin.log files from v0.10 are still read by pmk gateway audit — never written by v0.11, never deleted automatically. You can keep them in place, archive them, or rm them once you no longer need the old history.

Behaviour changes

1. Graceful restart no longer spams "重新上線"

Before (v0.10): every kill → restart cycle, even one within 3 seconds, broadcast pmk gateway 暫離 ("going offline") and then pmk gateway 重新上線 ("back online") to every channel and DM with recent activity. Restart-heavy debugging sessions stacked notices out of order in Slack.

After (v0.11): graceful shutdown writes a shutdown-marker with the kill timestamp. The next start reads the marker, computes offlineDurationMs, and suppresses the back-online broadcast if the gap is under 5 minutes. The transition is still recorded in the event ledger (events-YYYY-MM.log) as gateway.online ... broadcast: false, so an operator running pmk gateway audit can see it.

A genuine crash (process killed without a graceful path, machine slept, etc.) leaves no marker — the next start treats it as real downtime and broadcasts normally. The distinction is recorded on the event as reason: "crash-recovery" vs reason: "graceful-fast-restart".
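The exact on-disk marker format isn't documented here, so the following TypeScript sketch assumes a JSON timestamp payload; treat the shapes as illustrative, not as the shipped implementation:

import { existsSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

const MARKER = join(homedir(), ".pmk", "gateway", "shutdown-marker");
const SUPPRESS_WINDOW_MS = 5 * 60 * 1000; // gaps under 5 minutes stay silent

// On SIGTERM / SIGINT: record when the gateway went down.
function writeShutdownMarker(): void {
  writeFileSync(MARKER, JSON.stringify({ at: Date.now() }));
}

// On start: no marker means crash-recovery (broadcast normally); a marker
// means graceful restart (broadcast only if the gap reached the window).
function shouldBroadcastOnline(): { broadcast: boolean; offlineDurationMs?: number } {
  if (!existsSync(MARKER)) return { broadcast: true };
  const { at } = JSON.parse(readFileSync(MARKER, "utf8")) as { at: number };
  rmSync(MARKER); // single-use: consumed by the first start that sees it
  const offlineDurationMs = Date.now() - at;
  return { broadcast: offlineDurationMs >= SUPPRESS_WINDOW_MS, offlineDurationMs };
}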

2. Per-channel audience override

pickAudience resolution chain now has three tiers (was two):

pickAudience(cfg, userId, channelId?)
1. cfg.audience.users[userId] — explicit per-user override (unchanged)
2. cfg.audience.channels[channelId] — channel default (NEW)
3. cfg.audience.default — workspace default
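
A short TypeScript sketch of that chain; the tier names mirror the config keys above, but the Audience values other than exec and the exact config types are assumptions:

type Audience = "exec" | "eng" | "all"; // "exec" is real, the others are placeholders

interface AudienceConfig {
  users: Record<string, Audience>;    // tier 1: explicit per-user override
  channels: Record<string, Audience>; // tier 2: channel default (new in v0.11)
  default: Audience;                  // tier 3: workspace default
}

// First defined tier wins; per-user always beats per-channel.
function pickAudience(
  cfg: { audience: AudienceConfig },
  userId: string,
  channelId?: string,
): Audience {
  return (
    cfg.audience.users[userId] ??
    (channelId ? cfg.audience.channels[channelId] : undefined) ??
    cfg.audience.default
  );
}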

Per-user always wins. Channel default lets you say "everyone in #leadership defaults to exec" without writing 12 individual user overrides. Configure either way:

# CLI
pmk gateway audience set-channel C0AVD1XD946 exec
pmk gateway audience unset-channel C0AVD1XD946

# Slack
/pmk admin audience set-channel #leadership exec
/pmk admin audience unset-channel #leadership

pmk gateway audience list and /pmk admin audience list now display per-channel overrides alongside per-user.

Caveat (pre-existing, not v0.11-specific): pmk gateway reads ~/.pmk/gateway.json once at startup and keeps it in memory for the life of the process. Running audience set / unset / set-channel / unset-channel writes the change to disk, but the live gateway keeps using its in-memory copy until the next graceful restart. After mutating audience config, run kill $(cat ~/.pmk/gateway/gateway.pid) && nohup pmk gateway start & — the v0.11 marker mechanism keeps the restart silent in Slack as long as it lands within 5 minutes. This was surfaced during v0.11 integration verification on 2026-05-05; the caveat applies equally to v0.10.x.

3. Concurrent broadcast fan-out

broadcastOffline / broadcastBackOnline previously awaited each chat.postMessage in series — 20+ seconds for ~24 recipients, vulnerable to mid-flight SIGTERM leaving a half-finished broadcast. v0.11 uses runWithConcurrency(limit=3), finishing in seconds. Errors per recipient stay isolated, so one kicked bot doesn't abort the rest.
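
The shipped helper isn't reproduced here; a minimal sketch of a limit-3 worker pool with per-recipient error isolation, under an assumed signature, looks like this:

// Assumed shape of runWithConcurrency; the real signature may differ.
async function runWithConcurrency<T>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  let next = 0;
  // Each worker pulls the next unclaimed item until the list is exhausted.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const item = items[next++];
      try {
        await task(item);
      } catch (err) {
        // Per-recipient failures stay isolated: log and keep going.
        console.error("broadcast failed for recipient", item, err);
      }
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker()),
  );
}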

4. New event types in events-YYYY-MM.log

Two new event types, gateway.online and gateway.offline, join mra-ask.end, turn.processed, escalate.triggered, and escalate.absorbed:

{"at":"2026-05-05T03:22:50.409Z","type":"gateway.online","seq":1,"reason":"crash-recovery","broadcast":true}
{"at":"2026-05-05T03:24:23.850Z","type":"gateway.offline","seq":2,"reason":"shutdown","broadcast":true}
{"at":"2026-05-05T04:46:25.773Z","type":"gateway.online","seq":1,"reason":"graceful-fast-restart","broadcast":false,"offlineDurationMs":1332}

seq is monotonic per process (resets each gateway start). broadcast: false means the suppression kicked in. Future pmk gateway audit releases will surface a "presence churn" section based on these.
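
Until that audit section lands, a throwaway TypeScript script can answer the same question from one monthly partition (the partition name below is just an example):

import { readFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Count back-online transitions that the v0.11 suppression silenced.
const partition = join(homedir(), ".pmk", "gateway", "events-2026-05.log");
const events = readFileSync(partition, "utf8")
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const silenced = events.filter(
  (e) => e.type === "gateway.online" && e.broadcast === false,
);
console.log(`suppressed graceful restarts: ${silenced.length}`);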

Upgrade checklist

  1. Upgrade your binary: git pull && npm run cli:build (no schema migration needed).
  2. First restart after upgrade: the v0.10 gateway already shut down (and ran clearHeartbeat()), so the new build sees no marker — the first start still broadcasts 重新上線 ("back online") once. From the second graceful restart onward, suppression works.
  3. Existing audit log readers (pmk gateway audit) Just Work — they merge legacy events.log with the new monthly partitions and present one stream.
  4. Tail-style debugging: use tail -f ~/.pmk/gateway/events-$(date -u +%Y-%m).log instead of the v0.10 tail -f ~/.pmk/gateway/events.log. Note the -u (UTC) — partitions roll over at UTC midnight, not local TZ.
  5. Config mutations need a graceful restart — every pmk gateway audience / pmk gateway escalation / pmk gateway admin / /pmk admin … write goes to ~/.pmk/gateway.json, but the running daemon keeps its in-memory snapshot. Restart with kill $(cat ~/.pmk/gateway/gateway.pid) && nohup pmk gateway start &; the marker mechanism makes the restart silent in Slack within the 5-minute window.

Audit log paths

For scripts that consume the audit logs directly (rare — most operators use pmk gateway audit):

# Glob for "all events ever"
~/.pmk/gateway/events.log # legacy v0.10 monolith (read-only after upgrade)
~/.pmk/gateway/events-*.log # v0.11 partitions

# Same for admin
~/.pmk/gateway/admin.log
~/.pmk/gateway/admin-*.log

Both path sets must be unioned for a complete audit. The shipped readGatewayEvents() / readAdminLog() functions handle this for you.
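
For script authors who want to do it by hand anyway, the union amounts to something like this sketch (assumed behaviour, not the shipped readGatewayEvents()):

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Merge the legacy monolith and all monthly partitions into one stream,
// ordered by timestamp.
function readAllEvents(): Array<Record<string, unknown>> {
  const dir = join(homedir(), ".pmk", "gateway");
  const files = readdirSync(dir).filter(
    (f) => f === "events.log" || /^events-\d{4}-\d{2}\.log$/.test(f),
  );
  return files
    .flatMap((f) => readFileSync(join(dir, f), "utf8").split("\n").filter(Boolean))
    .map((line) => JSON.parse(line) as Record<string, unknown>)
    .sort((a, b) => String(a.at).localeCompare(String(b.at)));
}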

v0.11.1: context-safety hardening (additive)

No breaking changes. Operator-facing additions on top of v0.11.0.

Env vars

| Env var | Default | What it caps |
| --- | --- | --- |
| PMK_MAX_SESSION_TOKENS | 25000 (was 60000) | Soft session-prune budget. Lower it if your host has a heavy ~/.claude/ config (lots of skills/hooks/MCP servers): that overhead is inherited every time claude-agent-sdk spawns the local CLI. |
| PMK_SEED_CAP | 12000 | Maximum chars in the PKB ingest seed pushed at session start. |
| PMK_MRA_RESULT_CAP | 16000 | Maximum chars in any mra-ask stdout pushed into session history. Replaces the prior hardcoded 24_000 in buildMraSuccessMessage. |
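
A hedged sketch of how such a write-time cap plausibly applies; the event shape copies the message.capped sample in the next section, but capMessage and logEvent are illustrative names, not the shipped code:

// Stand-in for the gateway's JSONL ledger writer.
function logEvent(event: Record<string, unknown>): void {
  console.log(JSON.stringify(event));
}

const SEED_CAP = Number(process.env.PMK_SEED_CAP ?? 12_000);

function capMessage(kind: "seed" | "mra-result", text: string, cap: number): string {
  if (text.length <= cap) return text;
  logEvent({
    at: new Date().toISOString(),
    type: "message.capped",
    kind,
    originalChars: text.length,
    cappedChars: cap,
  });
  return text.slice(0, cap);
}

// e.g. const seed = capMessage("seed", rawSeed, SEED_CAP);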

New event types in events-YYYY-MM.log

JSONL lines (synthetic samples — actor: "U…" is a redacted Slack user ID; at is ISO-8601):

{"at":"…","type":"context.exceeded","actor":"U…","sessionTokensBefore":31578,"retrievalAtoms":1,"phase":"first-call"}
{"at":"…","type":"context.force-pruned","actor":"U…","droppedPairs":4,"tokensAfter":1200}
{"at":"…","type":"message.capped","actor":"U…","kind":"seed","originalChars":88292,"cappedChars":12000}

phase is "first-call" or "synthesise"; kind is "seed" or "mra-result".

pmk gateway audit — new Context safety section

Always rendered (even when all counts are zero, so an operator can see the safety net is in place):

Context safety
context.exceeded: 0 (first-call 0, synthesise 0)
force-pruned: 0
messages capped: 0 (seed 0, mra-result 0)

If context.exceeded is non-zero in your weekly audit, the cap defaults are too loose for your host — tighten PMK_MAX_SESSION_TOKENS first, then PMK_SEED_CAP / PMK_MRA_RESULT_CAP.

User-visible Slack changes

When msg_too_long does fire and the auto-retry recovers, the user sees a one-line scissors prefix on the reply (roughly: "conversation too long; automatically trimmed N older turns"):

:scissors: 對話過長,已自動裁掉 N 輪舊訊息

<the actual answer>

When even the auto-retry fails (extremely rare — a single message larger than the model can take even after force-prune), the user sees a message telling them the conversation is too long and to re-ask in a new thread:

:x: 對話太長,請開新 thread 重新提問

Both replace the previously leaked raw error, pmk 內部錯誤 ("internal error"): An API error occurred: msg_too_long.

Upgrade checklist (v0.11.0 → v0.11.1)

  1. git pull && npm run cli:build — no schema migration needed.
  2. Existing sessions on disk: nothing to do. Pre-existing oversized messages stay as-is; the new caps apply going forward at write-time.
  3. Tail the new event types: tail -f ~/.pmk/gateway/events-$(date -u +%Y-%m).log | grep -E 'context\.|message\.capped'.
  4. After a week, run pmk gateway audit --days 7 and check the Context safety section.

See also

  • Gateway lifecycle — end-to-end flow of a single DM
  • Changelog — release-by-release narrative
  • Issues #23 and #44 — the bugs / feature requests this milestone closes