Files
felhom-controller/REPORT.md
T

3.6 KiB

REPORT — slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0) (2026-06-10)

Overwrite-latest report (most recent significant work only). Cumulative history lives in CHANGELOG.md. Implements the in-guest controller half of TASK — Slice 8B. The host-agent half is in felhom-agent v0.11.0. No hub change.

Outcome

An agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze). This makes app-consistency the controller's job: a background loop polls the agent's GET /backup/due, and when due stops its app stacks around the agent backup so the captured state is clean-shutdown-consistent, then restarts exactly those stacks. Validated live end-to-end, including the load-bearing restore contrast and crash-safety.

What landed

  • internal/quiesce — the loop: GET /backup/due → (if due) quiesce (stop deployed, non-protected, running stacks) → POST /backup → poll GET /backup/status to done/failedunquiesce (restart exactly the stacks it stopped).
    • Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup): a persisted marker (atomic, 0600) written before stopping anything; guaranteed unquiesce via a deferred closure (restart on a backup error, a status-poll error, the max-quiesce bound, or context cancellation); a max-quiesce-duration hard bound (restart no matter what — the backup finishes on the agent); crash recovery at startup (Recover() restarts stacks left stopped by a mid-quiesce crash, then clears the marker); the marker also single-flights the loop.
  • agentapi: BackupDue / StartBackup / BackupStatus + a post helper (leaf-pinned client).
  • stacks.Manager.RunningAppStacks() — deployed, non-protected, currently-up stacks (protected infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order.
  • config.QuiesceConfig + main.go wiring: Recover() at startup, then the loop goroutine, gated on the local API being configured (a provisioned guest) + quiesce enabled.

Tests

go test ./... green. quiesce: happy path (stop → backup → poll done → restart exactly those, in order; marker cleared); backup-start failure → stacks STILL restarted; failed phase → restarted; max-quiesce guard → restarted at the bound; crash recovery → marker stacks restarted + cleared; single-flight; only the stacks we stopped are restarted; marker-written-before-stop ordering.

Live validation (demo-felhom)

A provisioned guest with a postgres app stack: the loop quiesced [pgtest]POST /backup (real agent vzdump, taken with pg stopped) → done → unquiesce (postgres uptime reset confirms stop+restart); /backup/due then went false (no re-loop). Load-bearing restore contrast: the quiesced backup restored to a scratch guest → postgres database system was shut down (clean, no recovery); an un-quiesced (crash-consistent, pg running) backup restored → postgres redo starts … redo done (WAL crash recovery). Crash-safety live: the controller was hard-killed mid-quiesce (pg down, marker present) → on restart [quiesce] crash recovery … restarting them fired and postgres came back up, marker cleared. The backup-failure → restart path is unit-tested.

Deferred (stated, not built)

8B.2 downtime optimization (snapshot mode + a snapshotted phase so the controller resumes at snapshot-taken). Controller de-privileging + customer disk endpoints → 8C. No secrets committed.