# REPORT — slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0) (2026-06-10) > Overwrite-latest report (most recent significant work only). Cumulative history lives in > [CHANGELOG.md](CHANGELOG.md). Implements the in-guest controller half of `TASK — Slice 8B`. The > host-agent half is in `felhom-agent` v0.11.0. No hub change. ## Outcome An agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze). This makes app-consistency the controller's job: a background loop polls the agent's `GET /backup/due`, and when due **stops its app stacks** around the agent backup so the captured state is clean-shutdown-consistent, then restarts exactly those stacks. **Validated live end-to-end**, including the load-bearing restore contrast and crash-safety. ## What landed - **`internal/quiesce`** — the loop: `GET /backup/due` → (if due) **quiesce** (stop deployed, non-protected, running stacks) → `POST /backup` → poll `GET /backup/status` to `done`/`failed` → **unquiesce** (restart exactly the stacks it stopped). - **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):** a persisted **marker** (atomic, `0600`) written **before** stopping anything; **guaranteed unquiesce** via a deferred closure (restart on a backup error, a status-poll error, the max-quiesce bound, or context cancellation); a **max-quiesce-duration** hard bound (restart no matter what — the backup finishes on the agent); **crash recovery** at startup (`Recover()` restarts stacks left stopped by a mid-quiesce crash, then clears the marker); the marker also single-flights the loop. - **`agentapi`**: `BackupDue` / `StartBackup` / `BackupStatus` + a `post` helper (leaf-pinned client). - **`stacks.Manager.RunningAppStacks()`** — deployed, non-protected, currently-up stacks (protected infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order. - **`config.QuiesceConfig`** + `main.go` wiring: `Recover()` at startup, then the loop goroutine, gated on the local API being configured (a provisioned guest) + quiesce enabled. ## Tests `go test ./...` green. quiesce: happy path (stop → backup → poll done → restart exactly those, in order; marker cleared); **backup-start failure → stacks STILL restarted**; failed phase → restarted; **max-quiesce guard → restarted at the bound**; **crash recovery → marker stacks restarted + cleared**; single-flight; **only the stacks we stopped are restarted**; **marker-written-before-stop** ordering. ## Live validation (demo-felhom) A provisioned guest with a postgres app stack: the loop quiesced `[pgtest]` → `POST /backup` (real agent vzdump, taken with pg stopped) → `done` → unquiesce (postgres uptime reset confirms stop+restart); `/backup/due` then went false (no re-loop). **Load-bearing restore contrast:** the **quiesced** backup restored to a scratch guest → postgres `database system was shut down` (**clean, no recovery**); an **un-quiesced** (crash-consistent, pg running) backup restored → postgres `redo starts … redo done` (**WAL crash recovery**). **Crash-safety live:** the controller was hard-killed mid-quiesce (pg down, marker present) → on restart `[quiesce] crash recovery … restarting them` fired and postgres came back up, marker cleared. The backup-failure → restart path is unit-tested. ## Deferred (stated, not built) **8B.2** downtime optimization (snapshot mode + a `snapshotted` phase so the controller resumes at snapshot-taken). Controller de-privileging + customer disk endpoints → **8C**. No secrets committed.