diff --git a/REPORT.md b/REPORT.md index 3d08663..062d7c4 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,57 +1,43 @@ -# REPORT — slice 8C (controller half): de-privileging + disk management via the agent (v0.37.0) (2026-06-10) +# REPORT — slice 8B.2 (controller half): resume at `snapshotted` (v0.38.0) (2026-06-10) -> Overwrite-latest report. Cumulative history: [CHANGELOG.md](CHANGELOG.md). Implements the in-guest -> controller half of `TASK — Slice 8C` (closes slice 8). Pairs with `felhom-agent` v0.12.0. +> Overwrite-latest report. Cumulative history: [CHANGELOG.md](CHANGELOG.md). Implements the +> controller half of `TASK — Slice 8B.2`. Pairs with `felhom-agent` v0.13.0. No hub change. ## Outcome -The disk-execution subsystem moved to the host agent. The in-guest controller is now **Docker-only -with no disk/Proxmox privileges** and drives disk management through the agent's local API. -**~12.3k LOC retired.** The controller-side re-platform milestone — **slice 8 CLOSED**. +The quiesce loop (8B) kept the app stopped for the **whole backup**. In snapshot mode the app only +needs to be stopped until the **storage snapshot** is taken; after that vzdump reads from the +snapshot. The controller now resumes its app at the agent's **`snapshotted`** phase instead of +`done` — app downtime drops from *whole-backup* to *until-snapshot*, with no loss of app-consistency. +**Measured live: ~3s vs ~23s (~87% cut)**, restore still clean. -## What landed +## What landed (`internal/quiesce`) -- **Disk management via the agent** (`internal/web/agent_disk_handlers.go`, `ServeDiskAPI`): thin - proxies `GET /api/disks` / `POST /api/disks/{assign,eject,format}` over the slice-8A `agentapi` - client (leaf-pinned, own token). Execution is the agent's; the UX stays here. A **data-bearing - format refusal** (`agentapi.ErrFormatRefused`) surfaces as **HTTP 409 "operator authorization - required"** — the agent inspects the device; the controller's claim is irrelevant. -- **`internal/agentapi`**: `Disks`/`AssignDisk`/`EjectDisk`/`FormatDisk` + `ErrFormatRefused`. +- The status-poll loop **resumes (`StartStack` + clears the marker) at `snapshotted`**, then **keeps + polling to `done`/`failed`** — so a new backup isn't started until this one truly finishes and a + post-snapshot failure is still observed (the backup isn't "successful" until `done`; the early + resume does not mark it done). +- **Fallback:** if `snapshotted` never arrives (stop/downgraded storage), it resumes at `done` + exactly as 8B. The agent only emits `snapshotted` when the actual mode is snapshot. +- **Crash-safety unchanged:** marker written before stop; guaranteed unquiesce (deferred); startup + `Recover()`. A failure *after* `snapshotted` is harmless — the app is already up. -## Retired (→ agent / obsolete) +## Tests -- **`internal/storage/`** (whole package: scan/format/attach/migrate/safety, DriveMigrator). -- **`internal/backup/`**: restic, crossdrive, restore_drives, disk_layout, local_infra, restore_scan, - drive-restore, restic paths. **`backup.Manager` surgically split to app-data only** (DB dumps + - Docker-volume tars + per-app restore kept; restic/cross-drive/snapshot-history/integrity dropped; - `RestoreApp` now restores from on-disk volume-tar dumps — snapshot restore is the agent's domain). -- **`report/infra_backup*` + `infra_pull`** (kept the setup fresh-install config download), - **`setup/scanner`** + the wizard's drive-recovery flows, **`monitor/watchdog` + `pinger`** - (watchdog → agent; Healthchecks pinging → the Hub owns monitoring), **`web/storage_handlers` + - `handler_restore`**. Wiring dropped from main/router/web (CrossDriveRunner, DriveMigrator, watchdog, - infra push, restic scheduler jobs; kept the db-dump job). - -## De-privileged - -`scripts/docker-setup.sh` controller compose: dropped `privileged: true`, `/mnt` rshared, `/sys`, -`/dev`, `/etc/fstab`, `/run/udev`. The golden's bootstrap `docker run` was already minimal (8A). - -## Tests / build - -`go build ./...` + `go test ./...` green (app-data backup / stacks / quiesce / bootstrap / agentapi / -disk-client tests pass). The data-bearing-format refusal is proven in `agentapi` tests. +`go build ./...` + `go test ./...` green. quiesce: resume at `snapshotted` (RESUME event before +`done`, marker cleared, then tracked to `done`); stop-mode fallback (resume at `done`, no +`snapshotted`); fail-after-`snapshotted` (single resume, app stays up); the 8B crash-safety tests +stay green. ## Live validation (demo-felhom) -A provisioned controller v0.37.0 from the refreshed golden: **`Privileged=false`**, container mounts -ONLY bootstrap + data + docker.sock (no `/dev`/`/etc/fstab`/`/sys`/`/mnt`), **healthy + configured** -(not setup). It drove the agent disk API: `GET /disks` (data-bearing flags) and a **data-bearing -format → refused** (the agent inspected the device, the gate returned `pending_signature`, nothing -formatted). The de-privileged controller runs app management normally (8A bootstrap / 8B quiesce -unaffected). +A provisioned controller v0.38.0 with a postgres stack, short quiesce poll: timeline — +`quiescing [pgtest]` 12:58:45 → `snapshotted — resuming app early` 12:58:48 → `backup done` 12:59:08. +**App downtime ≈ 3s** (vs ≈ 23s to `done`). The snapshot backup restored to a scratch guest came up +**clean** (`database system was shut down at 12:58:45`, no WAL replay) — the early resume preserved +app-consistency. The controller kept tracking to `done` after resuming (no overlapping backup). -## Deferred / note +## Deferred / dependency -The operator-signed completion of a data-bearing wipe → slice 10. `RestoreApp` restore semantics -changed (volume-tar dumps, not restic) — restic/snapshot restore is the agent's domain now. Unused -`config.Backup` restic/retention fields left in place (harmless; out of scope). No secrets committed. +Snapshot-capable storage (lvm-thin/ZFS) required for the win; stop/downgraded storage falls back to +resume-at-`done` (8B). No consistency-contract or crash-safety change. No secrets committed.