doc 03: slice 8B implemented — §8 controller-driven quiesce, §9 table, changelog (2026-06-10)

§8: controller-driven quiesce (stop stacks -> POST /backup -> restart) implemented
(controller v0.36.0 internal/quiesce + agent v0.11.0 cadence/phases); crash-safety
centerpiece + 8B.2 snapshot-mode fast-follow documented. Validated live: quiesced
postgres restore clean vs crash-consistent WAL recovery. §9 table: 8B implemented.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-10 11:04:36 +02:00
parent e436b61368
commit d1a3cd0625
+31 -7
View File
@@ -197,12 +197,27 @@ Tiers double as backup *and* restore-source priority (fastest surviving source f
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) → per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate). **local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).
- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze - **Quiescing (controller-driven for app-consistency) — implemented (slice 8B):** an LXC has no
(`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup fsfreeze (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a
is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack backup is due (`GET /backup/due`, §6) → **quiesces** (stops its app stacks) → `POST /backup`
`POST /backup` polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is polls `GET /backup/status` to `done`**unquiesces** (restarts exactly the stacks it stopped).
crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5). Implemented in `felhom-controller` v0.36.0 (`internal/quiesce`) + `felhom-agent` v0.11.0 (the
Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return. `/backup/due` cadence policy + `/backup/status` phases). **An agent-initiated vzdump is
crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5); the
controller stopping its stacks first is what makes the captured state **clean-shutdown-consistent**
(validated live: a quiesced postgres restore comes up clean — "database system was shut down" — vs
a crash-consistent restore doing WAL recovery — "redo starts… redo done"). Every Proxmox op is
async → the agent polls `task exitstatus`, never trusts the POST return.
- **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):**
a persisted marker written **before** stopping anything; **guaranteed unquiesce** (restart on a
backup error, a status-poll error, the max-quiesce bound, or controller shutdown); a
**max-quiesce-duration** hard bound (restart the app no matter what — the backup finishes on the
agent); and **crash recovery** at controller startup (restart stacks left stopped by a mid-quiesce
crash). The marker also single-flights the loop. All proven live + unit-tested.
- **8B.2 fast-follow (downtime optimization):** `vzdump --mode snapshot` + a `snapshotted` status
phase so the controller resumes at *snapshot-taken* instead of *backup-done* — needs
snapshot-capable storage validated for LXC. The agent's `/backup/status` already carries the phase
enum for this; only the `snapshotted` value + snapshot-mode wiring remain.
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every - **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
`bulk` volume needs an explicit own-backup decision: its own backup target per the manifest `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
`policy`, **or deliberately none** when the data is re-downloadable (customer informed). On `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
@@ -365,7 +380,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) | | **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) | | PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. | | **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | deferred — `/backup/due` is thin in 8A; the controller quiesce-then-`POST /backup` loop is 8B | | **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. |
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred | | **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) | | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR | | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
@@ -446,6 +461,15 @@ This doc hands the implementation three contracts it was waiting on:
## Changelog — design-review + Phase-3 fold-in (2026-06-08) ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)
- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented**
(controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`
phases). Documented the **crash-safety** centerpiece (marker-before-stop, guaranteed unquiesce,
max-quiesce bound, startup crash-recovery, single-flight) and the **8B.2** downtime-optimization
fast-follow (snapshot mode + a `snapshotted` phase). Validated live: a **quiesced** postgres restore
comes up clean ("database system was shut down") vs a **crash-consistent** restore doing WAL recovery.
- §9 slice table: **8B → implemented**; 8C (controller de-privileging) still pending.
### Slice-8A implemented: local API + provisioning back-half (2026-06-10) ### Slice-8A implemented: local API + provisioning back-half (2026-06-10)
- NEW §6a: the **local-API implementation** (agent v0.10.0 `internal/localapi`; controller v0.35.0 - NEW §6a: the **local-API implementation** (agent v0.10.0 `internal/localapi`; controller v0.35.0
`internal/bootstrap` + `internal/agentapi`) — persisted self-signed leaf with a **stable `internal/bootstrap` + `internal/agentapi`) — persisted self-signed leaf with a **stable