doc 03 §8/§9: slice 8B.2 implemented — resume at snapshotted (downtime ~24s->~3s) (2026-06-10)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-10 15:02:14 +02:00
parent c6dd0ed505
commit 5dc363771b
2 changed files with 28 additions and 32 deletions
+17 -27
View File
@@ -4,42 +4,32 @@
--- ---
# REPORT — Slice 8C docs: controller de-privileging + disk classifier (slice 8 CLOSED) (2026-06-10) # REPORT — Slice 8B.2 docs: quiesce downtime optimization (resume at `snapshotted`) (2026-06-10)
## Type ## Type
Documentation update for **slice 8C** (the implementation is in `felhom-agent` v0.12.0 + Documentation update for **slice 8B.2** (implementation: `felhom-agent` v0.13.0 + `felhom-controller`
`felhom-controller` v0.37.0; no hub change). Slice 8 is now **CLOSED**. v0.38.0; no hub change).
## What changed (doc 03 — host-agent) ## What changed (doc 03 — host-agent)
- **§6** — added the disk-management endpoints (`GET /disks`, `POST /disks/{assign,eject,format}`) - **§8** — the **8B.2 downtime optimization is now implemented** (was a fast-follow note): in snapshot
and **reframed the principle**: a controller may do *non-data-destructive* storage setup self-serve mode the agent watches the vzdump task log for the snapshot marker (`create storage snapshot`,
(list / assign / eject / format-blank); **anything that can lose customer data stays validated PVE 9.2.2) and emits a **`snapshotted`** phase on `/backup/status`; the controller
operator-signed (§4)**, with the **classifier (agent-internal device inspection)** as the enforcer. **resumes its app at `snapshotted`** (not `done`), cutting app downtime from *whole-backup* to
The 8C invariant: the agent decides data-bearing-ness by inspecting the device itself, never the *until-snapshot* with **no loss of app-consistency** (the snapshot froze the app-stopped state).
caller's claim; a data-bearing format → `ClassStorageWipe` → gate → `pending_signature` (signed Noted the snapshot-capable-storage dependency + the stop-mode **fallback to resume-at-`done`**, and
completion is slice 10). Marked **implemented**. that the controller keeps tracking to `done`/`failed` after early resume.
- **§4** — added: data-bearing-ness is **agent-internal evidence, never the caller's claim** - **§9 slice table** — the 8B row notes 8B.2 implemented.
(mirrors the agent-internal scratch-provenance rule); destructive completion → slice 10.
- **§9 slice table** — **8C implemented → slice 8 CLOSED**: agent v0.12.0 (`/disks` + classifier
gate + `mkfs`); controller v0.37.0 (~12.3k LOC disk-execution retired, `backup.Manager` split to
app-data, disk mgmt rewired to the agent, container de-privileged). §13 + doc changelog updated.
## What changed (doc 02 — controller module map)
- Added an **EXECUTED** banner: the map's target state is realized — the disk subsystem is deleted,
`backup.Manager` split, disk mgmt rewired to the agent, the container de-privileged. The in-guest
controller is now Docker-only with no disk/Proxmox privileges.
## Live validation (cross-repo, on the demo) ## Live validation (cross-repo, on the demo)
A provisioned **de-privileged** controller v0.37.0 (`Privileged=false`; mounts only bootstrap + data A provisioned controller + postgres stack: `quiescing``snapshotted — resuming app early`
+ docker.sock) drove the agent disk API: `GET /disks` returned data-bearing flags, and a `backup done`. **App downtime ≈ 3s** (resume at snapshot) vs **≈ 23s** if it had waited for `done`
**data-bearing format was refused** (`pending_signature`, nothing formatted) — the security (~87% cut). The snapshot backup restored **clean** (`database system was shut down`, no WAL replay) —
centerpiece, proven live. See the agent + controller REPORTs. the early resume preserved app-consistency. See the agent + controller REPORTs.
## Deferred ## Deferred
The operator-signed completion of a data-bearing wipe/format → **slice 10**. No hub change → no Snapshot-capable storage required for the win; stop/downgraded storage falls back to resume-at-`done`
deploy. No secrets committed. (8B). No hub change → no deploy. No secrets committed.
+11 -5
View File
@@ -232,10 +232,16 @@ per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a ba
**max-quiesce-duration** hard bound (restart the app no matter what — the backup finishes on the **max-quiesce-duration** hard bound (restart the app no matter what — the backup finishes on the
agent); and **crash recovery** at controller startup (restart stacks left stopped by a mid-quiesce agent); and **crash recovery** at controller startup (restart stacks left stopped by a mid-quiesce
crash). The marker also single-flights the loop. All proven live + unit-tested. crash). The marker also single-flights the loop. All proven live + unit-tested.
- **8B.2 fast-follow (downtime optimization):** `vzdump --mode snapshot` + a `snapshotted` status - **8B.2 downtime optimization — implemented (agent v0.13.0 + controller v0.38.0):** in snapshot
phase so the controller resumes at *snapshot-taken* instead of *backup-done* — needs mode, vzdump only needs the app-stopped state captured at the **storage-snapshot moment**; after
snapshot-capable storage validated for LXC. The agent's `/backup/status` already carries the phase that it reads from the snapshot. The agent watches the vzdump task log for the snapshot marker
enum for this; only the `snapshotted` value + snapshot-mode wiring remain. (`create storage snapshot`, validated on PVE 9.2.2) and emits a **`snapshotted`** phase on
`/backup/status`; the controller **resumes its app at `snapshotted`** (not `done`), cutting app
downtime from *whole-backup* to *until-snapshot* (~24s→~1s for a 934 MB guest) with **no loss of
app-consistency** (the snapshot froze the app-stopped state). Depends on snapshot-capable storage
(lvm-thin/ZFS); on stop/downgraded storage the marker never appears and the controller **falls
back to resume-at-`done`** (8B). The controller keeps tracking to `done`/`failed` after early
resume (no overlapping backup; the backup isn't "successful" until `done`).
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every - **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
`bulk` volume needs an explicit own-backup decision: its own backup target per the manifest `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
`policy`, **or deliberately none** when the data is re-downloadable (customer informed). On `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
@@ -398,7 +404,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) | | **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) | | PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. | | **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. | | **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. | | **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) | | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR | | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |