Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3.6 KiB
REPORT — slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0) (2026-06-10)
Overwrite-latest report (most recent significant work only). Cumulative history lives in CHANGELOG.md. Implements the in-guest controller half of
TASK — Slice 8B. The host-agent half is infelhom-agentv0.11.0. No hub change.
Outcome
An agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze). This makes
app-consistency the controller's job: a background loop polls the agent's GET /backup/due, and when
due stops its app stacks around the agent backup so the captured state is clean-shutdown-consistent,
then restarts exactly those stacks. Validated live end-to-end, including the load-bearing restore
contrast and crash-safety.
What landed
internal/quiesce— the loop:GET /backup/due→ (if due) quiesce (stop deployed, non-protected, running stacks) →POST /backup→ pollGET /backup/statustodone/failed→ unquiesce (restart exactly the stacks it stopped).- Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):
a persisted marker (atomic,
0600) written before stopping anything; guaranteed unquiesce via a deferred closure (restart on a backup error, a status-poll error, the max-quiesce bound, or context cancellation); a max-quiesce-duration hard bound (restart no matter what — the backup finishes on the agent); crash recovery at startup (Recover()restarts stacks left stopped by a mid-quiesce crash, then clears the marker); the marker also single-flights the loop.
- Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):
a persisted marker (atomic,
agentapi:BackupDue/StartBackup/BackupStatus+ aposthelper (leaf-pinned client).stacks.Manager.RunningAppStacks()— deployed, non-protected, currently-up stacks (protected infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order.config.QuiesceConfig+main.gowiring:Recover()at startup, then the loop goroutine, gated on the local API being configured (a provisioned guest) + quiesce enabled.
Tests
go test ./... green. quiesce: happy path (stop → backup → poll done → restart exactly those, in
order; marker cleared); backup-start failure → stacks STILL restarted; failed phase → restarted;
max-quiesce guard → restarted at the bound; crash recovery → marker stacks restarted + cleared;
single-flight; only the stacks we stopped are restarted; marker-written-before-stop ordering.
Live validation (demo-felhom)
A provisioned guest with a postgres app stack: the loop quiesced [pgtest] → POST /backup (real
agent vzdump, taken with pg stopped) → done → unquiesce (postgres uptime reset confirms stop+restart);
/backup/due then went false (no re-loop). Load-bearing restore contrast: the quiesced backup
restored to a scratch guest → postgres database system was shut down (clean, no recovery); an
un-quiesced (crash-consistent, pg running) backup restored → postgres redo starts … redo done
(WAL crash recovery). Crash-safety live: the controller was hard-killed mid-quiesce (pg down,
marker present) → on restart [quiesce] crash recovery … restarting them fired and postgres came back
up, marker cleared. The backup-failure → restart path is unit-tested.
Deferred (stated, not built)
8B.2 downtime optimization (snapshot mode + a snapshotted phase so the controller resumes at
snapshot-taken). Controller de-privileging + customer disk endpoints → 8C. No secrets committed.