Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2.5 KiB
REPORT — slice 8B.2 (controller half): resume at snapshotted (v0.38.0) (2026-06-10)
Overwrite-latest report. Cumulative history: CHANGELOG.md. Implements the controller half of
TASK — Slice 8B.2. Pairs withfelhom-agentv0.13.0. No hub change.
Outcome
The quiesce loop (8B) kept the app stopped for the whole backup. In snapshot mode the app only
needs to be stopped until the storage snapshot is taken; after that vzdump reads from the
snapshot. The controller now resumes its app at the agent's snapshotted phase instead of
done — app downtime drops from whole-backup to until-snapshot, with no loss of app-consistency.
Measured live: ~3s vs ~23s (~87% cut), restore still clean.
What landed (internal/quiesce)
- The status-poll loop resumes (
StartStack+ clears the marker) atsnapshotted, then keeps polling todone/failed— so a new backup isn't started until this one truly finishes and a post-snapshot failure is still observed (the backup isn't "successful" untildone; the early resume does not mark it done). - Fallback: if
snapshottednever arrives (stop/downgraded storage), it resumes atdoneexactly as 8B. The agent only emitssnapshottedwhen the actual mode is snapshot. - Crash-safety unchanged: marker written before stop; guaranteed unquiesce (deferred); startup
Recover(). A failure aftersnapshottedis harmless — the app is already up.
Tests
go build ./... + go test ./... green. quiesce: resume at snapshotted (RESUME event before
done, marker cleared, then tracked to done); stop-mode fallback (resume at done, no
snapshotted); fail-after-snapshotted (single resume, app stays up); the 8B crash-safety tests
stay green.
Live validation (demo-felhom)
A provisioned controller v0.38.0 with a postgres stack, short quiesce poll: timeline —
quiescing [pgtest] 12:58:45 → snapshotted — resuming app early 12:58:48 → backup done 12:59:08.
App downtime ≈ 3s (vs ≈ 23s to done). The snapshot backup restored to a scratch guest came up
clean (database system was shut down at 12:58:45, no WAL replay) — the early resume preserved
app-consistency. The controller kept tracking to done after resuming (no overlapping backup).
Deferred / dependency
Snapshot-capable storage (lvm-thin/ZFS) required for the win; stop/downgraded storage falls back to
resume-at-done (8B). No consistency-contract or crash-safety change. No secrets committed.