diff --git a/documentation/architecture/03-host-agent.md b/documentation/architecture/03-host-agent.md index 122e579..af6f6a7 100644 --- a/documentation/architecture/03-host-agent.md +++ b/documentation/architecture/03-host-agent.md @@ -197,12 +197,27 @@ Tiers double as backup *and* restore-source priority (fastest surviving source f per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) → **local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate). -- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze - (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup - is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack → - `POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is - crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5). - Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return. +- **Quiescing (controller-driven for app-consistency) — implemented (slice 8B):** an LXC has no + fsfreeze (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a + backup is due (`GET /backup/due`, §6) → **quiesces** (stops its app stacks) → `POST /backup` → + polls `GET /backup/status` to `done` → **unquiesces** (restarts exactly the stacks it stopped). + Implemented in `felhom-controller` v0.36.0 (`internal/quiesce`) + `felhom-agent` v0.11.0 (the + `/backup/due` cadence policy + `/backup/status` phases). **An agent-initiated vzdump is + crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5); the + controller stopping its stacks first is what makes the captured state **clean-shutdown-consistent** + (validated live: a quiesced postgres restore comes up clean — "database system was shut down" — vs + a crash-consistent restore doing WAL recovery — "redo starts… redo done"). Every Proxmox op is + async → the agent polls `task exitstatus`, never trusts the POST return. + - **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):** + a persisted marker written **before** stopping anything; **guaranteed unquiesce** (restart on a + backup error, a status-poll error, the max-quiesce bound, or controller shutdown); a + **max-quiesce-duration** hard bound (restart the app no matter what — the backup finishes on the + agent); and **crash recovery** at controller startup (restart stacks left stopped by a mid-quiesce + crash). The marker also single-flights the loop. All proven live + unit-tested. + - **8B.2 fast-follow (downtime optimization):** `vzdump --mode snapshot` + a `snapshotted` status + phase so the controller resumes at *snapshot-taken* instead of *backup-done* — needs + snapshot-capable storage validated for LXC. The agent's `/backup/status` already carries the phase + enum for this; only the `snapshotted` value + snapshot-mode wiring remain. - **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On @@ -365,7 +380,7 @@ this path — bring up + reattach external storage and it is whole. This is full | **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) | | PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) | | **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. | -| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | deferred — `/backup/due` is thin in 8A; the controller quiesce-then-`POST /backup` loop is 8B | +| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. | | **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred | | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) | | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR | @@ -446,6 +461,15 @@ This doc hands the implementation three contracts it was waiting on: ## Changelog — design-review + Phase-3 fold-in (2026-06-08) +### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10) +- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented** + (controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status` + phases). Documented the **crash-safety** centerpiece (marker-before-stop, guaranteed unquiesce, + max-quiesce bound, startup crash-recovery, single-flight) and the **8B.2** downtime-optimization + fast-follow (snapshot mode + a `snapshotted` phase). Validated live: a **quiesced** postgres restore + comes up clean ("database system was shut down") vs a **crash-consistent** restore doing WAL recovery. +- §9 slice table: **8B → implemented**; 8C (controller de-privileging) still pending. + ### Slice-8A implemented: local API + provisioning back-half (2026-06-10) - NEW §6a: the **local-API implementation** (agent v0.10.0 `internal/localapi`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`) — persisted self-signed leaf with a **stable