diff --git a/REPORT.md b/REPORT.md index 5e0f916..3d08663 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,53 +1,57 @@ -# REPORT — slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0) (2026-06-10) +# REPORT — slice 8C (controller half): de-privileging + disk management via the agent (v0.37.0) (2026-06-10) -> Overwrite-latest report (most recent significant work only). Cumulative history lives in -> [CHANGELOG.md](CHANGELOG.md). Implements the in-guest controller half of `TASK — Slice 8B`. The -> host-agent half is in `felhom-agent` v0.11.0. No hub change. +> Overwrite-latest report. Cumulative history: [CHANGELOG.md](CHANGELOG.md). Implements the in-guest +> controller half of `TASK — Slice 8C` (closes slice 8). Pairs with `felhom-agent` v0.12.0. ## Outcome -An agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze). This makes -app-consistency the controller's job: a background loop polls the agent's `GET /backup/due`, and when -due **stops its app stacks** around the agent backup so the captured state is clean-shutdown-consistent, -then restarts exactly those stacks. **Validated live end-to-end**, including the load-bearing restore -contrast and crash-safety. +The disk-execution subsystem moved to the host agent. The in-guest controller is now **Docker-only +with no disk/Proxmox privileges** and drives disk management through the agent's local API. +**~12.3k LOC retired.** The controller-side re-platform milestone — **slice 8 CLOSED**. ## What landed -- **`internal/quiesce`** — the loop: `GET /backup/due` → (if due) **quiesce** (stop deployed, - non-protected, running stacks) → `POST /backup` → poll `GET /backup/status` to `done`/`failed` → - **unquiesce** (restart exactly the stacks it stopped). - - **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):** - a persisted **marker** (atomic, `0600`) written **before** stopping anything; **guaranteed - unquiesce** via a deferred closure (restart on a backup error, a status-poll error, the max-quiesce - bound, or context cancellation); a **max-quiesce-duration** hard bound (restart no matter what — the - backup finishes on the agent); **crash recovery** at startup (`Recover()` restarts stacks left - stopped by a mid-quiesce crash, then clears the marker); the marker also single-flights the loop. -- **`agentapi`**: `BackupDue` / `StartBackup` / `BackupStatus` + a `post` helper (leaf-pinned client). -- **`stacks.Manager.RunningAppStacks()`** — deployed, non-protected, currently-up stacks (protected - infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order. -- **`config.QuiesceConfig`** + `main.go` wiring: `Recover()` at startup, then the loop goroutine, gated - on the local API being configured (a provisioned guest) + quiesce enabled. +- **Disk management via the agent** (`internal/web/agent_disk_handlers.go`, `ServeDiskAPI`): thin + proxies `GET /api/disks` / `POST /api/disks/{assign,eject,format}` over the slice-8A `agentapi` + client (leaf-pinned, own token). Execution is the agent's; the UX stays here. A **data-bearing + format refusal** (`agentapi.ErrFormatRefused`) surfaces as **HTTP 409 "operator authorization + required"** — the agent inspects the device; the controller's claim is irrelevant. +- **`internal/agentapi`**: `Disks`/`AssignDisk`/`EjectDisk`/`FormatDisk` + `ErrFormatRefused`. -## Tests +## Retired (→ agent / obsolete) -`go test ./...` green. quiesce: happy path (stop → backup → poll done → restart exactly those, in -order; marker cleared); **backup-start failure → stacks STILL restarted**; failed phase → restarted; -**max-quiesce guard → restarted at the bound**; **crash recovery → marker stacks restarted + cleared**; -single-flight; **only the stacks we stopped are restarted**; **marker-written-before-stop** ordering. +- **`internal/storage/`** (whole package: scan/format/attach/migrate/safety, DriveMigrator). +- **`internal/backup/`**: restic, crossdrive, restore_drives, disk_layout, local_infra, restore_scan, + drive-restore, restic paths. **`backup.Manager` surgically split to app-data only** (DB dumps + + Docker-volume tars + per-app restore kept; restic/cross-drive/snapshot-history/integrity dropped; + `RestoreApp` now restores from on-disk volume-tar dumps — snapshot restore is the agent's domain). +- **`report/infra_backup*` + `infra_pull`** (kept the setup fresh-install config download), + **`setup/scanner`** + the wizard's drive-recovery flows, **`monitor/watchdog` + `pinger`** + (watchdog → agent; Healthchecks pinging → the Hub owns monitoring), **`web/storage_handlers` + + `handler_restore`**. Wiring dropped from main/router/web (CrossDriveRunner, DriveMigrator, watchdog, + infra push, restic scheduler jobs; kept the db-dump job). + +## De-privileged + +`scripts/docker-setup.sh` controller compose: dropped `privileged: true`, `/mnt` rshared, `/sys`, +`/dev`, `/etc/fstab`, `/run/udev`. The golden's bootstrap `docker run` was already minimal (8A). + +## Tests / build + +`go build ./...` + `go test ./...` green (app-data backup / stacks / quiesce / bootstrap / agentapi / +disk-client tests pass). The data-bearing-format refusal is proven in `agentapi` tests. ## Live validation (demo-felhom) -A provisioned guest with a postgres app stack: the loop quiesced `[pgtest]` → `POST /backup` (real -agent vzdump, taken with pg stopped) → `done` → unquiesce (postgres uptime reset confirms stop+restart); -`/backup/due` then went false (no re-loop). **Load-bearing restore contrast:** the **quiesced** backup -restored to a scratch guest → postgres `database system was shut down` (**clean, no recovery**); an -**un-quiesced** (crash-consistent, pg running) backup restored → postgres `redo starts … redo done` -(**WAL crash recovery**). **Crash-safety live:** the controller was hard-killed mid-quiesce (pg down, -marker present) → on restart `[quiesce] crash recovery … restarting them` fired and postgres came back -up, marker cleared. The backup-failure → restart path is unit-tested. +A provisioned controller v0.37.0 from the refreshed golden: **`Privileged=false`**, container mounts +ONLY bootstrap + data + docker.sock (no `/dev`/`/etc/fstab`/`/sys`/`/mnt`), **healthy + configured** +(not setup). It drove the agent disk API: `GET /disks` (data-bearing flags) and a **data-bearing +format → refused** (the agent inspected the device, the gate returned `pending_signature`, nothing +formatted). The de-privileged controller runs app management normally (8A bootstrap / 8B quiesce +unaffected). -## Deferred (stated, not built) +## Deferred / note -**8B.2** downtime optimization (snapshot mode + a `snapshotted` phase so the controller resumes at -snapshot-taken). Controller de-privileging + customer disk endpoints → **8C**. No secrets committed. +The operator-signed completion of a data-bearing wipe → slice 10. `RestoreApp` restore semantics +changed (volume-tar dumps, not restic) — restic/snapshot restore is the agent's domain now. Unused +`config.Backup` restic/retention fields left in place (harmless; out of scope). No secrets committed.