docs: reflow CLAUDE.md; switch REPORT.md to overwrite-latest; add no-secrets rule

Unify the REPORT/CHANGELOG convention with the sibling repos (REPORT.md was append/cumulative -> now overwrite-latest; CHANGELOG stays cumulative). Reflow removes hard mid-paragraph line wraps; rendered output unchanged. CHANGELOG entry in hub/CHANGELOG.md. No hub code change -> no version bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 20:54:32 +02:00 · 2026-06-08 20:06:11 +02:00 · 2026-06-08 19:17:41 +02:00 · 2026-06-08 18:31:44 +02:00 · 2026-06-08 16:38:18 +02:00 · 2026-06-08 16:36:16 +02:00
28 changed files with 4877 additions and 198 deletions
@@ -1,191 +1,94 @@
-# CLAUDE.md — Project Instructions for Claude Code
+# CLAUDE.md — Project Instructions for Claude Code (`felhom.eu`)
-> This file is read automatically by Claude Code at the start of every session.
+> Read automatically by Claude Code when it works in this repo. Keep it updated as the project evolves. Cross-repo orientation (the felhom system, artifact taxonomy, access) lives in the workspace-root `e:\git\CLAUDE.md`; this file is `felhom.eu`-specific.
 > It replaces the "Instructions" panel from the claude.ai Project.
 > Keep it updated as the project evolves.
 ## Project overview
 This repo (`felhom.eu`) contains:
- **Website** (`website/`) — Static HTML pages at felhom.eu, served via k3s nginx + git-sync sidecar
+- **Website** (`website/`) — static HTML at felhom.eu, served via k3s nginx + git-sync sidecar.
- **Hub** (`hub/`) — Go application (felhom-hub) — centralized dashboard for monitoring customer controllers, runs on k3s at hub.felhom.eu
+- **Hub** (`hub/`) — Go application (felhom-hub) — the **operator backend**, on k3s at `hub.felhom.eu`.
- **K8s manifests** (`manifests/`) — k3s deployment manifests for all felhom-system services
+- **K8s manifests** (`manifests/`) — k3s deployment manifests for felhom-system services.
 - **Architecture docs** (`documentation/`) — the **authoritative design home for the whole Felhom system**: `architecture/01..05-*.md` (topology/trust, controller module map, host-agent, signing, hub), `proxmox-platform.md`, and `tests/phase{0,1-2,3,4}-findings.md`. Read these before designing.
-See `README.md` for full architecture, DNS, email, and SEO documentation.
+See `README.md` for full architecture/DNS/email/SEO docs. See `TASK.md` for the current task (if any).
-See `TASK.md` for the current task to implement (if it exists).
+
 ## The Felhom system (so the hub's role is in context)
 Felhom is **Proxmox-based**, with a locked **three-component model**:
 - **Hub** (this repo, `hub/`) — operator backend. Authors operator *intent*; mirrors box *reality*; holds **no data-plane role** and never connects inbound to a box.
 - **Host agent** (repo `felhom-agent/`) — one per Proxmox host; owns all Proxmox interaction.
 - **In-guest controller** (repo `felhom-controller/`) — one per customer LXC; Docker-only.
 The hub is **not** just controller monitoring anymore. As of slice 3 it ingests **two report streams**: the agent's host-domain report (`POST /api/v1/host-report`, the heartbeat) and the legacy controller report (`POST /api/v1/report`). The controller path is **frozen and retires at the slice-10 cutover** — do not modify it until then.
 ## Hub — current state (v0.7.x)
 - **Tables:** `customer_configs`, `events`, `app_telemetry`/`app_log_issues`, the legacy `reports`, and the slice-3 host-domain additions `hosts` / `guests` / `host_reports` (additive; columns marked inert exist for the slice-10 cutover but are unused now).
 - **Auth:** Bearer — global key, per-customer key (legacy), and per-host key (`GetHostByAPIKey`, slice 3). Provisional global-key host mint at `POST /api/v1/admin/hosts`.
 - **Monitoring:** the controller `StalenessChecker` (over `reports`) AND a sibling `HostStalenessChecker` (over `host_reports`, emitting `host_stale`/`host_down`/`host_recovered`).
 - Two-tier notifications (operator English / customer Hungarian, Resend, cooldowns); `events` audit.
 ## Code quality rules
- Always double-check generated code for bugs, logic issues, syntax errors
+- Always double-check generated code for bugs, logic issues, syntax errors.
- Handle edge cases without overcomplicating the script/program
+- Handle edge cases without overcomplicating.
- Add debug capabilities (logging, verbose output) for easier troubleshooting
+- Add debug capabilities (logging, verbose output).
- If you need more input or troubleshooting command output, ask first — don't guess
+- If you need more input or troubleshooting output, **ask first — don't guess**.
-## Workspace layout
+## Workflow & artifacts
-```
+The planning/architecture assistant ("project Claude", in claude.ai) writes specs and validates pushes; **you (Claude Code) implement**. A file being open in the editor is NOT an instruction.
 E:\git\felhom.eu\                  (or /e/git/felhom.eu/ in Git Bash)
 ├── hub/                           # felhom-hub Go application
 │   ├── cmd/hub/                   # Entry point (main.go)
 │   ├── internal/
 │   │   ├── api/                   # Report ingestion API
 │   │   ├── store/                 # SQLite storage + queries
 │   │   └── web/                   # Dashboard UI
 │   │       ├── server.go          # Server, routing, template funcs
 │   │       ├── embed.go           # go:embed for templates
 │   │       └── templates/         # HTML templates + CSS
 │   ├── configs/                   # Example config files
 │   ├── Dockerfile
 │   ├── Makefile
 │   └── go.mod
 ├── manifests/                     # k3s deployment manifests
 │   ├── hub.yaml                   # Hub deployment (hub.felhom.eu)
 │   ├── webpage.yaml               # Website + FileBrowser + git-sync
 │   ├── contact-mailer.yaml        # Contact form email sender
 │   ├── healthchecks.yaml          # Healthchecks (status.felhom.eu)
 │   └── umami.yaml                 # Analytics (stats.felhom.eu)
 ├── website/                       # Static HTML pages (felhom.eu)
 │   ├── index.html
 │   ├── alkalmazasok.html
 │   ├── ... (all Hungarian, UTF-8 with BOM)
 │   └── assets/                    # Logos, screenshots, OG images
 ├── CLAUDE.md                      # This file
 ├── README.md                      # Full project documentation
 └── TASK.md                        # Current task (if exists)
 ```
-Related repos (same parent directory):
+- **`TASK.md` / `TASK-*.md`** — a spec for you to implement. Then push and update this repo's changelog (`hub/CHANGELOG.md`) and root `REPORT.md` per the convention below.
-```
+- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
-E:\git\deploy-felhom-compose\     # felhom-controller Go app + deploy scripts
+- Validation of a push against a spec's criteria is project Claude's job, not yours, unless asked.
 E:\git\app-catalog-felhom.eu\     # Docker Compose templates per app
 E:\git\homelab-manifests\         # k3s cluster manifests (dooplex.hu services)
 E:\git\misc-scripts\              # Helper scripts (build scripts, repo collector)
 ```
-All repos hosted at `gitea.dooplex.hu/admin/`.
+> **In every repository where you make a change, update both files in that repo:**
-
+> - **`CHANGELOG.md`** — a cumulative log of **all** changes; newest entry on top.
-## SSH access
+> - **`REPORT.md`** — **overwrite** with a summary of the **most recent** implementation (or significant validation/operational run) only; not cumulative.
-
+>
-SSH key-based authentication configured. No password prompts.
+> **Never write secrets** — tokens, passwords, private keys, API keys — into `CHANGELOG.md`, `REPORT.md`, or any committed file. Reference them as "stored out-of-band" instead.
 **IMPORTANT — SSH binary:** Claude Code runs in Git Bash, which has its own SSH at
 `/usr/bin/ssh` (= `C:\Program Files\Git\usr\bin\ssh.exe`). This binary does NOT have
 access to the Windows SSH agent and will fail silently. Always use the Windows native
 OpenSSH binary:
 ```
 SSH=/c/Windows/System32/OpenSSH/ssh.exe
 ```
 All SSH commands below use `$SSH` — set it at the start of your session.
 | Host | IP | User | Role |
 |------|----|------|------|
 | Build server (k3s node) | 192.168.0.180 | kisfenyo | Build + push images, kubectl |
 | Demo node | 192.168.0.162 | kisfenyo | Test deployment (demo-felhom.eu) |
 **Note:** `kubectl` on the build server requires `sudo` (k3s kubeconfig permissions).
 ## Build & deploy workflow — Hub
 After making code changes to `hub/`, you **MUST** build, push, and deploy the new image.
 Do NOT leave code changes uncommitted or undeployed.
 ### Step 1: Commit and push changes
 ```bash
 cd /e/git/felhom.eu
 git add -A && git commit -m "<descriptive message>" && git push
 ```
 ### Step 2: Build + push the container image on the build server
 The build server (192.168.0.180) has the build toolchain. The build script lives at
 `~/build/felhom-hub/build.sh` on the build server (NOT in this repo).
 First, check the current running version:
 ```bash
 $SSH kisfenyo@192.168.0.180 "sudo kubectl get deploy -n felhom-system hub -o jsonpath='{.spec.template.spec.containers[0].image}'"
 ```
 Then build with the next version (e.g., if current is 0.1.2, use 0.1.3):
 ```bash
 $SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <NEW_VERSION> --push"
 ```
 The build script:
 - Pulls latest code from Gitea (`git pull` on the felhom.eu repo)
 - Copies `hub/` source to a clean build workspace
 - Builds Docker image with version + build-time ldflags
 - Pushes to `gitea.dooplex.hu/admin/felhom-hub:<VERSION>` and `:latest`
 ### Step 3: Deploy to k3s
 ```bash
 $SSH kisfenyo@192.168.0.180 "sudo kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:<NEW_VERSION>"
 ```
 ### Step 4: Verify the deployment
 ```bash
 $SSH kisfenyo@192.168.0.180 "sudo kubectl get pods -n felhom-system -l app=hub && echo '---' && sudo kubectl logs -n felhom-system -l app=hub --tail 10"
 ```
 Should show pod Running and `[INFO] felhom-hub <VERSION> starting` in logs.
 ### Build workflow summary
 | Step | Command | Where |
 |------|---------|-------|
 | 1. Commit + push | `git add -A && git commit && git push` | Local (this repo) |
 | 2. Build + push image | `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <VER> --push"` | Build server |
 | 3. Deploy | `$SSH kisfenyo@192.168.0.180 "sudo kubectl set image -n felhom-system deploy/hub hub=...:<VER>"` | Build server (kubectl) |
 | 4. Verify | `$SSH kisfenyo@192.168.0.180 "sudo kubectl get pods -n felhom-system -l app=hub"` | Build server |
 ## Build & deploy workflow — Website
 The website auto-deploys via git-sync sidecar. Just push to `main`:
 ```bash
 cd /e/git/felhom.eu
 git add -A && git commit -m "<message>" && git push
 ```
 Changes are live within 1-2 minutes. No build step needed.
 For emergency edits, use FileBrowser at `https://files.felhom.eu`.
 ## Build & deploy workflow — K8s Manifests
 Manifests are applied manually:
 ```bash
 ssh kisfenyo@192.168.0.180 "sudo kubectl apply -f /home/kisfenyo/git/felhom.eu/manifests/<manifest>.yaml"
 ```
 Remember to `git pull` on the build server first if you pushed changes locally.
 ## Tech stack (Hub)
- **Language:** Go 1.24+
+- **Language:** Go 1.24+ (build server is go1.26.0).
- **Web framework:** stdlib `net/http` + `html/template`
+- **Web:** stdlib `net/http` + `html/template`. **DB:** SQLite via `modernc.org/sqlite` (pure Go).
- **Database:** SQLite via `modernc.org/sqlite` (pure Go, no CGo)
+- **Auth:** bcrypt + Bearer tokens. **Deploy:** Docker on k3s (felhom-system ns).
- **Auth:** bcrypt password hash + basic auth
+- **Storage:** Longhorn PVC at `/data/` (SQLite DB). **Config:** YAML via ConfigMap at `/etc/felhom-hub/hub.yaml`.
- **Deployment:** Docker container on k3s (felhom-system namespace)
+
- **Storage:** Longhorn PVC at `/data/` (SQLite DB)
+## SSH access
- **Config:** YAML file mounted via k8s ConfigMap at `/etc/felhom-hub/hub.yaml`
+
 Use the Windows OpenSSH binary (Git Bash's `/usr/bin/ssh` can't reach the Windows agent and fails silently): `SSH=/c/Windows/System32/OpenSSH/ssh.exe`. All SSH commands below use `$SSH`.
 | Host | IP | User | Role |
 |------|----|------|------|
 | Build server (k3s node) | 192.168.0.180 | kisfenyo | Build + push images, kubectl (needs `sudo`) |
 | Demo Proxmox host | 192.168.0.162 | root@pam (SSH alias felhom-pve, root, no sudo) | pveum/pct + live Proxmox validation — available to CC |
 ## Build & deploy — Hub
 After code changes to `hub/`, you **MUST** build, push, and deploy.
 1. **Commit + push:** `cd /e/git/felhom.eu && git add -A && git commit -m "<msg>" && git push`
 2. **Check running version:** `$SSH kisfenyo@192.168.0.180 "sudo kubectl get deploy -n felhom-system hub -o jsonpath='{.spec.template.spec.containers[0].image}'"`
 3. **Build + push image** (next version; build script lives on the build server, not in this repo): `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <NEW_VERSION> --push"` (pulls latest from Gitea, builds with version+build-time ldflags into `main.Version`, pushes `gitea.dooplex.hu/admin/felhom-hub:<VER>` and `:latest`.)
 4. **Deploy:** `$SSH kisfenyo@192.168.0.180 "sudo kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:<NEW_VERSION>"`
 5. **Verify:** `$SSH kisfenyo@192.168.0.180 "sudo kubectl get pods -n felhom-system -l app=hub && sudo kubectl logs -n felhom-system -l app=hub --tail 10"` (expect Running + `[INFO] felhom-hub <VERSION> starting`.)
 > If the hub deployment is ArgoCD-managed (auto-sync), a manual `kubectl set image` may be reverted by ArgoCD drift-correction — confirm the deploy path before relying on step 4.
 ## Build & deploy — Website / Manifests
 - **Website** auto-deploys via git-sync; just push to `main` (live in 1–2 min). Emergency edits: FileBrowser at `https://files.felhom.eu`.
 - **Manifests** are applied manually (git pull on the build server first if you pushed): `$SSH kisfenyo@192.168.0.180 "sudo kubectl apply -f /home/kisfenyo/git/felhom.eu/manifests/<manifest>.yaml"`
 ## Key patterns
- Hub receives reports from customer controllers via `POST /api/v1/report` (Bearer token auth)
+- Hub ingests **host-reports from agents** (`POST /api/v1/host-report`, Bearer per-host) and legacy **controller reports** (`POST /api/v1/report`). The host-report `received_at` is the dead-man's-switch liveness signal.
- Dashboard shows all customers in a table with status, CPU, memory, disk, containers, backup age
+- Status logic: OK (report < 30m), WARN (30m–1h or health=warn), DOWN (> 1h or health=fail).
- Customer detail page shows system info, report history, full JSON report
+- SQLite timestamps vary in format — use `parseSQLiteTime()`.
- Status logic: OK (report < 30m), WARN (30m-1h or health=warn), DOWN (> 1h or health=fail)
+- Dashboard/detail auto-refresh every 60s via `<meta http-equiv="refresh">`. Geo-restricted to Hungary via nginx ingress annotation.
 - SQLite timestamps may vary in format — use `parseSQLiteTime()` for robust parsing
 - Auto-refresh: dashboard and detail pages refresh every 60 seconds via `<meta http-equiv="refresh">`
 - Geo-restricted to Hungary via nginx ingress annotation
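
The status rule in the list above can be sketched as a pure function. This is a minimal illustration, not the hub's code: the name `classify` and the exact boundary handling (30m and 1h both counted as WARN) are assumptions read off the "OK (< 30m), WARN (30m–1h), DOWN (> 1h)" wording.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds taken from the status rule in the key-patterns list.
const (
	warnAfter = 30 * time.Minute
	downAfter = time.Hour
)

// classify maps a report's age and self-reported health to a dashboard
// status. The function name and boundary handling are assumptions.
func classify(age time.Duration, health string) string {
	switch {
	case age > downAfter || health == "fail":
		return "DOWN"
	case age >= warnAfter || health == "warn":
		return "WARN"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(classify(5*time.Minute, "ok"))
	fmt.Println(classify(45*time.Minute, "ok"))
	fmt.Println(classify(5*time.Minute, "warn"))
	fmt.Println(classify(2*time.Hour, "ok"))
}
```

Checking the DOWN case first means a failing health value always wins over a fresh report.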
 ## File encoding
-All HTML files in `website/` are **UTF-8 with BOM**. Ensure your editor preserves this.
+All `website/` HTML is **UTF-8 with BOM** — preserve it. Hub Go source is standard UTF-8 (no BOM).
 Hub Go source files are standard UTF-8 (no BOM).
@@ -217,7 +217,7 @@ Every page includes:
 | Repository | Purpose |
 |------------|---------|
 | [app-catalog-felhom.eu](https://gitea.dooplex.hu/admin/app-catalog-felhom.eu) | Docker Compose templates + .felhom.yml metadata for 45+ apps |
-| [deploy-felhom-compose](https://gitea.dooplex.hu/admin/deploy-felhom-compose) | felhom-controller Go app + customer deploy scripts |
+| [felhom-controller](https://gitea.dooplex.hu/admin/felhom-controller) | felhom-controller Go app + customer deploy scripts |
 | [deploy-portainer](https://gitea.dooplex.hu/admin/deploy-portainer) | Legacy — Portainer-based deploy scripts (deprecated) |
 | [homelab-manifests](https://gitea.dooplex.hu/admin/homelab-manifests) | k3s cluster manifests for dooplex.hu services |
 | [misc-scripts](https://gitea.dooplex.hu/admin/misc-scripts) | Utility scripts (collect-repos.sh, etc.) |
@@ -0,0 +1,132 @@
 # felhom.eu — task reports
 > **Overwrite** this file with a summary of the most recent task only (uniform with the other repos; not cumulative). The cumulative hub history lives in [hub/CHANGELOG.md](hub/CHANGELOG.md). Sections below predate this convention change and are retained as history.
 ---
 ## Hub slice 3 — host-domain ingest (v0.7.0) — 2026-06-08
 Purely **additive** host-domain ingest in `hub/`: new tables, the agent's
 `/host-report` heartbeat endpoint, per-host Bearer auth, a provisional host mint, and a
 host-domain dead-man's-switch. The existing controller path is **untouched**; the schema/
 auth cutover remains **slice 10**. Pushed to `main`; build/vet/test green locally and on
 the build server.
 ### New tables (`store.go migrate()`, idempotent — `// v0.7.0: host-domain`)
 - **`hosts`** — one per customer agent. Reality columns (`agent_version`, `last_report_at`)
  + operator-intent columns **INERT until slice 10** (`desired_json`, `desired_generation`,
  `dr_record_json`).
 - **`guests`** — one per controller LXC, PK `guest_id = "<host_id>/<vmid>"` (hub-derived).
  Reality columns (`display_name`, `status`, `controller_version`, `vmid`, `last_seen_at`)
  + **INERT** `api_key`, `desired_spec_json`.
 - **`host_reports`** — the report stream + denormalized columns (cpu/mem/disk %, guest
  counts, cloudflared status); pruned by `Prune(maxDays)` alongside `reports`.
 > Inert columns exist **now** so slice 10 needs no `ALTER`; nothing reads/writes them this
 > slice. Migration is additive-only (no `DROP`, no edits to `reports`/`customer_configs`)
 > and idempotent.
 ### New store methods
 `GetHostByAPIKey`, `GetHost`, `ListHosts`, `UpsertHost` (updates only identity + `updated_at`
 on conflict), `SaveHostReport` (inserts a report row + bumps reality columns only),
 `UpsertGuestFromReport` (updates reality columns only — **preserves** `api_key`/
 `desired_spec_json`), `GetHostStaleness` (skips never-reported hosts), `GuestID`.
 Structs: `Host`, `Guest`, `HostReportDenorm`, `HostStaleRow`.
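
A minimal sketch of two of the ideas above: the hub-derived guest key and a "reality columns only" upsert. The `GuestID` signature, the `host_id` column on `guests`, and the SQL text are assumptions for illustration; only the column names listed in this report come from the source.

```go
package main

import (
	"fmt"
	"strings"
)

// GuestID derives the guests primary key "<host_id>/<vmid>".
// The signature is an assumption; the report only names the method.
func GuestID(hostID string, vmid int) string {
	return fmt.Sprintf("%s/%d", hostID, vmid)
}

// upsertGuestSQL is a hypothetical SQLite statement. On conflict it updates
// reality columns only, so api_key and desired_spec_json keep their values.
const upsertGuestSQL = `
INSERT INTO guests (guest_id, host_id, vmid, display_name, status, controller_version, last_seen_at)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(guest_id) DO UPDATE SET
  display_name       = excluded.display_name,
  status             = excluded.status,
  controller_version = excluded.controller_version,
  last_seen_at       = excluded.last_seen_at`

// touchesColumn reports whether the statement mentions a column at all.
func touchesColumn(sql, col string) bool {
	return strings.Contains(sql, col)
}

func main() {
	fmt.Println(GuestID("acme-1a2b", 101))
	fmt.Println("writes api_key:", touchesColumn(upsertGuestSQL, "api_key"))
	fmt.Println("writes desired_spec_json:", touchesColumn(upsertGuestSQL, "desired_spec_json"))
}
```

Leaving the inert columns out of both the insert list and the update clause is what keeps a report-path write from clobbering operator intent.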
 ### Auth (added; existing path unchanged)
 `checkAuthHost(r)` → `(hostID, customerID, isGlobal, ok)`: global key → trust `body.host_id`;
 per-host key → bound identity; failure → not-ok. `checkAuthCustomer` is byte-for-byte unchanged.
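
The three outcomes described above could look roughly like this. It is a sketch under assumptions: the `authenticator` type, the `lookup` stand-in for `GetHostByAPIKey`, the constant-time comparison, and the example keys are all hypothetical; only the return shape `(hostID, customerID, isGlobal, ok)` comes from the report.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type hostIdentity struct{ HostID, CustomerID string }

// authenticator is a hypothetical holder for the keys the check needs.
type authenticator struct {
	globalKey string
	lookup    func(apiKey string) (hostIdentity, bool) // stand-in for GetHostByAPIKey
}

// checkAuthHost mirrors the described contract: a global key authenticates
// without a bound identity (the caller then trusts body.host_id), a per-host
// key returns its bound identity, anything else is not-ok.
func (a authenticator) checkAuthHost(r *http.Request) (hostID, customerID string, isGlobal, ok bool) {
	token, found := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
	if !found || token == "" {
		return "", "", false, false
	}
	if subtle.ConstantTimeCompare([]byte(token), []byte(a.globalKey)) == 1 {
		return "", "", true, true
	}
	if h, hit := a.lookup(token); hit {
		return h.HostID, h.CustomerID, false, true
	}
	return "", "", false, false
}

// demoAuth builds an authenticator with made-up example keys.
func demoAuth() authenticator {
	hosts := map[string]hostIdentity{"host-key-example": {HostID: "acme-1a2b", CustomerID: "acme"}}
	return authenticator{
		globalKey: "global-key-example",
		lookup:    func(k string) (hostIdentity, bool) { h, ok := hosts[k]; return h, ok },
	}
}

// requestWithToken builds a POST carrying the given Bearer token (if any).
func requestWithToken(token string) *http.Request {
	r := httptest.NewRequest(http.MethodPost, "/api/v1/host-report", nil)
	if token != "" {
		r.Header.Set("Authorization", "Bearer "+token)
	}
	return r
}

func main() {
	a := demoAuth()
	for _, tok := range []string{"global-key-example", "host-key-example", "wrong"} {
		hostID, customerID, isGlobal, ok := a.checkAuthHost(requestWithToken(tok))
		fmt.Printf("host=%q customer=%q global=%v ok=%v\n", hostID, customerID, isGlobal, ok)
	}
}
```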
 ### Endpoints
 - **`POST /api/v1/host-report`** (the heartbeat): per-host auth; 4 MiB body; computes denorm
  (`guest_running` counts only `status=="running"`); `SaveHostReport` + per-guest
  `UpsertGuestFromReport` (a guest upsert failure is logged, not fatal — liveness); returns the
  control envelope `{status:"ok", poll_interval_seconds:900, blocked, desired_generation:0,
  has_signed_ops:false}`. `blocked` reflects `customer_configs.status`; the other two are
  reserved placeholders (slice 4). Global-key bootstrap requires the host to already exist
  (else 400); per-host key requires `body.host_id == hostID` (else 403).
 - **`POST /api/v1/admin/hosts`** — **PROVISIONAL**, global-key only. Mints `host_id` (legible
  `<customer>-<hex>`) + a random `api_key` (`configgen.RandomHex(32)`); 201 `{host_id, api_key}`.
  Flagged in code as the slice-3 bootstrap to be removed/locked at enrollment (slices 7–8).
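
To make the heartbeat's denorm and control envelope concrete, here is a small sketch. The JSON keys follow the envelope and guest fields quoted above; the Go field layout, the helper names `guestCounts` and `newEnvelope`, and the sample host id are assumptions, and only a subset of the report payload is modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// guestReport and hostReportPayload model only the fields this sketch needs.
type guestReport struct {
	VMID   int    `json:"vmid"`
	Name   string `json:"name"`
	Status string `json:"status"`
}

type hostReportPayload struct {
	HostID string        `json:"host_id"`
	Guests []guestReport `json:"guests"`
}

// controlEnvelope mirrors the response keys listed for /host-report.
type controlEnvelope struct {
	Status              string `json:"status"`
	PollIntervalSeconds int    `json:"poll_interval_seconds"`
	Blocked             bool   `json:"blocked"`
	DesiredGeneration   int    `json:"desired_generation"`
	HasSignedOps        bool   `json:"has_signed_ops"`
}

// guestCounts returns the total and the number with status == "running".
func guestCounts(guests []guestReport) (total, running int) {
	for _, g := range guests {
		if g.Status == "running" {
			running++
		}
	}
	return len(guests), running
}

// newEnvelope fills the fixed values; the two reserved fields stay zero.
func newEnvelope(blocked bool) controlEnvelope {
	return controlEnvelope{Status: "ok", PollIntervalSeconds: 900, Blocked: blocked}
}

func main() {
	body := []byte(`{"host_id":"acme-1a2b","guests":[
		{"vmid":101,"name":"a","status":"running"},
		{"vmid":102,"name":"b","status":"stopped"}]}`)
	var p hostReportPayload
	if err := json.Unmarshal(body, &p); err != nil {
		panic(err)
	}
	total, running := guestCounts(p.Guests)
	fmt.Println("guest_total:", total, "guest_running:", running)

	out, err := json.Marshal(newEnvelope(false))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```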
 ### Host dead-man's-switch
 `monitor.HostStalenessChecker` (`host_staleness.go`) — a **sibling** of the controller
 `StalenessChecker`, keyed on host↔`host_reports`, emitting `host_stale`/`host_down`/
 `host_recovered` (30m / 60m), attributed to the host's customer (so the existing per-customer
 notification UX picks them up). Registered in `allowedEventTypes`; wired in `main.go` on the
 existing 60s ticker. The controller staleness/deadline checkers are untouched and keep running.
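
The transition behaviour (seed silently, then emit one event per state change) can be sketched as a tiny state machine. Everything except the event names and the 30m/60m thresholds is an assumption here, including the `checker` type, the inclusive boundaries, and the in-memory state map; this is not the `HostStalenessChecker` itself.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter = 30 * time.Minute
	downAfter  = 60 * time.Minute
)

type hostState int

const (
	stateOK hostState = iota
	stateStale
	stateDown
)

// stateFor maps the age of the latest host report to a state.
func stateFor(age time.Duration) hostState {
	switch {
	case age >= downAfter:
		return stateDown
	case age >= staleAfter:
		return stateStale
	default:
		return stateOK
	}
}

// checker remembers the last state per host so it can emit on change only.
type checker struct {
	last map[string]hostState
}

func newChecker() *checker { return &checker{last: make(map[string]hostState)} }

// observe records the current state and returns the event to emit, or ""
// when nothing changed. The first observation of a host only seeds state.
func (c *checker) observe(hostID string, age time.Duration) string {
	next := stateFor(age)
	prev, seen := c.last[hostID]
	c.last[hostID] = next
	if !seen || prev == next {
		return ""
	}
	switch next {
	case stateStale:
		return "host_stale"
	case stateDown:
		return "host_down"
	default:
		return "host_recovered"
	}
}

func main() {
	c := newChecker()
	ages := []time.Duration{0, 35 * time.Minute, 36 * time.Minute, 61 * time.Minute, time.Minute}
	for _, age := range ages {
		fmt.Printf("age=%v event=%q\n", age, c.observe("acme-1a2b", age))
	}
}
```

Seeding without events avoids a burst of notifications for hosts that were already stale when the hub starts.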
 ### Contract
 The `/host-report` JSON matches the agent spec §4 field-for-field (host_id, reported_at,
 agent_version, host{…}, guests[{vmid,name,status,controller_version,spec}], cloudflared{status},
 and the empty storage_targets/backups/restore_tests/pbs_snapshots/audit_tail — accepted
 empty/absent). The envelope matches agent spec §5.
 ### Test matrix (new, hermetic — temp SQLite, no live data)
 - **store**: upsert/lookup; a report-path update **preserves** `desired_json`/`desired_generation`;
  guest upsert **preserves** `api_key`/`desired_spec_json` while updating reality; `GuestID`;
  staleness skips never-reported.
 - **auth**: `checkAuthHost` global / per-host / unknown.
 - **ingest**: valid → 200 + envelope + denorm (`guest_running` = 1 of 2); host_id mismatch → 403;
  unknown host under global key → 400; blocked customer → `blocked:true`; oversize body → 400.
 - **admin mint**: non-global → 403; unknown customer → 400; success → 201 + minted key
  round-trips through `/host-report`.
 - **host staleness**: seed emits no events; ok→stale→down→recovered transitions.
 ### Untouched / deferred (explicit)
 - **Controller path unchanged**: `/api/v1/report`, `reports`, `customer_configs`,
  `checkAuthCustomer`, existing staleness + deadline checkers — additions only, all still green.
 - **Not built** (per scope): desired-state serving, `signed_ops`, geo→hub, DR-record migration,
  dashboard re-design. The cutover (drop `reports`→`guest_reports`, merge checkers, tighten the
  provisional admin/global-key auth) remains **slice 10**.
 ### Versioning / deploy
 Hub version is the `main.Version` ldflags var (`build.sh <VER>`), default `"dev"`; recorded
 **v0.7.0** in `hub/CHANGELOG.md`. The image build + ArgoCD deploy are **not** part of this task
 (no deploy performed).
 ### Repo state
 Branch: `main`. Verified `go build/vet/test ./...` green in `hub/` locally (go1.26) and on the
 build server (go1.26).
 ---
 ## Hub slice-3 follow-ups (v0.7.1) — 2026-06-08
 Validation follow-ups (hub half). Pushed to `main`; build/vet/test green locally (go1.26) and on
 the build server.
 ### §3 — `/host-report` rejects oversize with 413 (not silent truncation)
 `handleHostReport` now reads `maxHostReportBytes+1` (const `4 << 20`, defined near
 `defaultHostPollSeconds`) and returns **`413 Payload too large`** when exceeded, instead of relying
 on `LimitReader` truncation (which could accept a truncated-but-valid JSON as a partial report,
 dropping guests from the mirror). **Scope-frozen:** the controller `handleReport` 1 MiB read is
 **unchanged** (diff touches only the host path); the small divergence is acceptable until cutover.
 `TestHandleHostReport_OversizeRejected` now asserts 413.
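
The "read limit+1, then reject" technique described in this section can be sketched as follows. `maxHostReportBytes` and the 413 status come from the text above; the helper `readBounded`, the simplified handler body, and the response strings are assumptions, not the hub's actual `handleHostReport`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxHostReportBytes = 4 << 20 // 4 MiB

// readBounded reads at most limit+1 bytes. Seeing the extra byte proves the
// body exceeded the limit, so the caller can reject it instead of parsing a
// silently truncated prefix.
func readBounded(r io.Reader, limit int64) (data []byte, tooLarge bool, err error) {
	data, err = io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, false, err
	}
	if int64(len(data)) > limit {
		return nil, true, nil
	}
	return data, false, nil
}

// handleHostReport is a stripped-down stand-in that only enforces the size.
func handleHostReport(w http.ResponseWriter, r *http.Request) {
	body, tooLarge, err := readBounded(r.Body, maxHostReportBytes)
	switch {
	case err != nil:
		http.Error(w, "read error", http.StatusBadRequest)
	case tooLarge:
		http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
	default:
		fmt.Fprintf(w, "accepted %d bytes\n", len(body))
	}
}

// postSize sends an n-byte body through the handler and returns the status.
func postSize(n int) int {
	body := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/api/v1/host-report", body)
	rec := httptest.NewRecorder()
	handleHostReport(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("at limit:", postSize(maxHostReportBytes))
	fmt.Println("one byte over:", postSize(maxHostReportBytes+1))
}
```

A plain `LimitReader` of exactly the limit cannot distinguish "body was exactly 4 MiB" from "body was cut at 4 MiB"; the extra byte is what makes the distinction possible.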
 ### §4 — cross-repo contract golden fixture (hub half)
 - `hub/internal/api/testdata/host-report.golden.json` — a **byte-identical copy** of felhom-agent's
  golden (verified by md5).
 - `TestHostReport_GoldenContract` — mints a host, POSTs the golden through the **real**
  `handleHostReport`, asserts 200 + denorm (`guest_total=2`, `guest_running=1`,
  `cloudflared_status="active"`) + both guests upserted. Proves `hostReportPayload` still extracts
  the contract from the real wire shape.
 **Caveat (called out):** the two golden files are a *duplicated* contract with no shared source of
 truth. JSON can't hold a comment, so the mandatory "keep byte-identical" marker lives in each test
 file's doc comment. When slices 5/6 add real `storage_targets`/`backups` fields, promote this to a
 shared Go types module (the proper fix); this fixture is the bridge.
 ### Versioning / scope
 Recorded **v0.7.1** in `hub/CHANGELOG.md`. The hub version is the `main.Version` ldflags var
 (`build.sh <VER>`, default `"dev"`) — there is no in-repo version constant to bump (the task's
 pointer to `web/version.go` is the controller-image `VersionChecker`, unrelated); the image tag is
 applied at build/deploy (ArgoCD), not in this task. No deploy performed.
 ### Untouched (confirmed)
 Controller path (`handleReport`/`reports`/`customer_configs`/`checkAuthCustomer`/existing checkers)
 unchanged. The agent's proxmox client timeout was a "confirm" item — already bounded (30s default),
 no change.
 ### Repo state
 Branch: `main`. Verified `go build/vet/test ./...` green in `hub/` locally (go1.26) and on the build server (go1.26).
@@ -0,0 +1,224 @@
 # Felhom Controller Architecture — Part 1: Topology & Trust
 **Status:** draft (decisions from the topology/trust design sessions).
 **Platform facts** referenced here live in `docs/proxmox-platform.md`; this document
 records *Felhom's decisions*, not Proxmox behaviour.
 ---
 ## 1. Model at a glance
 Three components. **Control is always box-initiated** — the hub never connects *into* a
 customer box.
 ```
        operator side                     customer box (per Proxmox host)
   ┌───────────────────┐         ┌───────────────────────────────────────────┐
   │       HUB         │         │  Proxmox host                              │
   │ (dooplex.hu, k3s) │         │   ┌──────────────┐                         │
   │  - report sink    │◀──poll──┤   │  HOST AGENT  │  operator-tier          │
   │  - signed jobs    │  signed │   │  (Proxmox    │  • all Proxmox ops      │
   │  - dashboard      │  jobs   │   │   token)     │  • provision / restore  │
   │  - customer record│         │   └──────┬───────┘  • storage mgmt         │
   │  - PBS namespace  │         │          │ local constrained API           │
   └─────────▲─────────┘         │   ┌──────▼───────────────────────────────┐ │
             │                   │   │  customer LXC (one per customer)      │ │
             │  direct, app-     │   │   ┌──────────────┐   Docker:          │ │
             └───────────────────┼───┤   │ IN-GUEST     │   [app] [app] ...  │ │
                domain reports   │   │   │ CONTROLLER   │   (Docker containers)│
                                 │   │   │ (Docker-only)│                    │ │
                                 │   │   └──────────────┘                    │ │
                                 │   └───────────────────────────────────────┘ │
                                 └───────────────────────────────────────────┘
   PBS (offsite) ◀── outbound, client-side-encrypted backups ── customer box
   end-users / customer ◀── Cloudflare Tunnel ── apps + controller UI
 ```
 ---
 ## 2. The customer node
 - One **Proxmox host** per box (PVE 9.2, Debian 13, LVM-thin).
 - **Default workload topology:** one **customer LXC**, Docker inside it, each app a Docker
  container/stack. Apps are isolated at the Docker layer (separate containers, networks,
  volumes, cgroup limits); they share one LXC/kernel/Docker daemon.
 - **Escape hatch:** promote an individual app to its own guest (LXC or VM) only for a
  specific reason — a non-Linux/Windows app, a genuinely untrusted or exposed app needing
  hard isolation, or a resource hog needing guarantees.
 - **Multi-tenant:** one customer per host is the home default; multiple customer LXCs on
  one host (a company environment) is **not precluded** — the agent manages a *set* of
  guests. The only multi-tenant-specific work deferred to "if it becomes real" is resource
  fairness (per-guest disk/RAM/CPU quotas).
 ---
 ## 3. Components & responsibilities
 | | **Hub** | **Host agent** | **In-guest controller** |
 |---|---|---|---|
 | Runs on | dooplex.hu (k3s) | the Proxmox host | the customer LXC |
 | Tier | operator backend | operator (high-privilege) | customer-facing (app) |
 | Holds | customer records, signed-job source, PBS namespaces, escrowed keys | the **only** Proxmox API token; per-host operator identity | **no Proxmox creds**; its own hub API key + a local-API token to the agent |
 | Does | reporting sink, dashboard, job queue, source of durable truth | all Proxmox ops (provision, restore, snapshot, backup, storage mgmt, LXC lifecycle); polls hub for signed jobs; exposes a constrained local API to the controller; **per-guest authorization gate** | Docker/app lifecycle, catalog deploy, customer UI, app-level (data-layer) backup; reports app-domain to the hub directly |
 | Never does | initiate a connection *into* a box | — | touch the Proxmox API directly |
 **Key separation:** the controller manages Docker; the agent manages Proxmox. The controller's
 only path to guest-level operations (snapshot-before-deploy, "grow my RAM") is a constrained
 **local API call to the agent**, which the agent authorizes (scoped to that controller's own
 guest) and executes with its operator-tier token. This consolidates all Proxmox access and
 all per-guest authorization in one auditable place and leaves the guest with zero Proxmox
 credentials.
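
A minimal sketch of the per-guest authorization gate described above, assuming a simple token-to-guest binding. The types, error values, token strings, and operation names (`snapshot`, `resize`, `backup`, borrowed from the trust-boundary wording) are illustrative assumptions, not the agent's actual local API.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errUnknownToken = errors.New("unknown local-API token")
	errOpNotAllowed = errors.New("operation not exposed to controllers")
	errForeignGuest = errors.New("request targets another guest")
)

// request is what a controller would send over the constrained local API.
type request struct {
	Token string
	Op    string
	VMID  int
}

// gate binds each controller token to exactly one guest and whitelists ops.
type gate struct {
	tokenToVMID map[string]int
	allowedOps  map[string]bool
}

// authorize accepts a request only if the token is known, the operation is
// whitelisted, and the target is the token's own guest.
func (g gate) authorize(req request) error {
	vmid, ok := g.tokenToVMID[req.Token]
	if !ok {
		return errUnknownToken
	}
	if !g.allowedOps[req.Op] {
		return errOpNotAllowed
	}
	if req.VMID != vmid {
		return errForeignGuest
	}
	return nil
}

// demoGate builds a gate with two made-up controllers.
func demoGate() gate {
	return gate{
		tokenToVMID: map[string]int{"token-guest-101": 101, "token-guest-102": 102},
		allowedOps:  map[string]bool{"snapshot": true, "resize": true, "backup": true},
	}
}

func main() {
	g := demoGate()
	fmt.Println(g.authorize(request{Token: "token-guest-101", Op: "snapshot", VMID: 101}))
	fmt.Println(g.authorize(request{Token: "token-guest-101", Op: "snapshot", VMID: 102}))
	fmt.Println(g.authorize(request{Token: "token-guest-101", Op: "destroy", VMID: 101}))
}
```

Only after `authorize` returns nil would the agent run the operation with its own Proxmox token, so the guest never needs Proxmox credentials.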
 ---
 ## 4. Control plane — box-initiated
 - CGNAT does **not** force this: the Cloudflare Tunnel already makes a box reachable through
  Cloudflare's edge. We *choose* box-initiated control for the smallest attack surface — the
  box exposes no control endpoint at all.
 - The agent and the controller **poll** the hub; the hub never initiates inbound.
 - Operator actions are delivered as **signed jobs**: the agent verifies an operator signature
  before executing, so a compromised hub database alone cannot forge commands.
 - All operator-initiated actions are recorded in a **customer-visible audit log**.
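
The signed-job idea can be illustrated with a short sketch. The document does not name a signature scheme; Ed25519 is an assumption chosen for the example, as are the `signedJob` shape, the payload format, and the helper names.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

var errBadSignature = errors.New("job signature invalid")

// signedJob is a hypothetical wire shape: opaque payload plus signature.
type signedJob struct {
	Payload   []byte
	Signature []byte
}

// verifyJob returns the payload only if the operator's key signed it. The
// agent would hold just the public key, so a party that can write jobs into
// the hub database but lacks the private key cannot produce a valid job.
func verifyJob(operatorKey ed25519.PublicKey, job signedJob) ([]byte, error) {
	if !ed25519.Verify(operatorKey, job.Payload, job.Signature) {
		return nil, errBadSignature
	}
	return job.Payload, nil
}

// newSignedJob generates a throwaway operator key pair and signs a payload.
func newSignedJob(payload string) (ed25519.PublicKey, signedJob) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	job := signedJob{Payload: []byte(payload)}
	job.Signature = ed25519.Sign(priv, job.Payload)
	return pub, job
}

func main() {
	pub, job := newSignedJob(`{"op":"snapshot","vmid":101}`)
	_, err := verifyJob(pub, job)
	fmt.Println("genuine job error:", err)

	job.Payload = []byte(`{"op":"destroy","vmid":101}`) // tampered after signing
	_, err = verifyJob(pub, job)
	fmt.Println("tampered job error:", err)
}
```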
 ---
 ## 5. Trust boundaries
 | Boundary | What crosses | Mechanism | Blast radius if breached |
 |---|---|---|---|
 | end-user ↔ apps | app traffic | Cloudflare Tunnel → Traefik (Host routing) | that app |
 | customer ↔ controller UI | management UI | Cloudflare Tunnel; UI auth (bcrypt) | the customer's own box |
 | controller ↔ agent | snapshot/resize/backup requests | local constrained RPC; agent authorizes per-guest | the controller's own guest only |
 | agent ↔ hub | reports + signed jobs | outbound poll; signed jobs | one box; signed jobs limit forgery |
 | controller ↔ hub | app-domain reports/jobs (incl. geo desired-state) | outbound, own API key | app-domain of one customer |
 | box ↔ PBS | encrypted backups | outbound; per-customer namespace; client-side encryption | ciphertext only (operator can't read) |
 | guest ↔ Proxmox host | **(none direct)** | the guest holds no Proxmox creds; all via the agent | — |
 | hub ↔ Cloudflare API | geo-restriction WAF (enforcement) | the **hub** holds the CF API token; reconciles geo desired-state → WAF | the customer's zone/WAF |
 ---
 ## 6. Enrollment & identity
 - **Physical presence at provisioning** (on-site install, or pre-imaged-and-delivered).
  This removes any zero-touch remote-enrollment problem.
 - A **one-time retrieval code** mints durable identity. Single-use (burned on the successful
  config fetch) plus a short *pre-use* TTL; one-click regenerate for the only real failure
  case (fetch fails before anything is persisted). After the fetch, the code is irrelevant —
  everything downstream runs on durable credentials, so retries don't need it.
 - **Order:** the agent enrolls first (and, running as root at setup, mints its own scoped
  operator-tier Proxmox token), then provisions the customer LXC from the golden template and
  deploys the controller into it — injecting the controller's hub API key and its local-API
  token. The controller is the agent's product, never the other way around.
 - The **hub customer record is the durable source of truth**, and it survives box loss:
  identity, domain, **Cloudflare tunnel token**, **PBS namespace**, **storage manifest**, a
  **mirrored app inventory** (bottom-up reality, not operator-declared intent — apps themselves
  restore from the PBS guest snapshot, never re-deployed from this record; see `05` §1/§9), and the
  **escrowed (zero-knowledge) backup key**. This is what makes hardware replacement possible.
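
The retrieval-code rules above (single-use, burned on the successful fetch, short pre-use TTL) can be sketched as a small state machine. This is an illustration under assumptions: `RetrievalCode`, `Redeem`, `demoTime`, and the error values are hypothetical names, and where the code is actually stored and checked is not specified here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	errExpired     = errors.New("retrieval code expired before use")
	errBurned      = errors.New("retrieval code already used")
	errFetchFailed = errors.New("config fetch failed")
)

// RetrievalCode is a hypothetical model of the one-time code.
type RetrievalCode struct {
	mu        sync.Mutex
	expiresAt time.Time // end of the pre-use TTL
	burned    bool
}

func NewRetrievalCode(expiresAt time.Time) *RetrievalCode {
	return &RetrievalCode{expiresAt: expiresAt}
}

// Redeem runs fetch and burns the code only when fetch succeeds, so a fetch
// that fails before anything is persisted does not consume the code.
func (c *RetrievalCode) Redeem(now time.Time, fetch func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.burned {
		return errBurned
	}
	if now.After(c.expiresAt) {
		return errExpired // the operator regenerates a fresh code
	}
	if err := fetch(); err != nil {
		return err
	}
	c.burned = true
	return nil
}

// demoTime returns a fixed instant n minutes after an arbitrary start.
func demoTime(minutes int) time.Time {
	start := time.Date(2026, time.January, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(minutes) * time.Minute)
}

func main() {
	code := NewRetrievalCode(demoTime(10))
	fetchOK := func() error { return nil }
	fetchBroken := func() error { return errFetchFailed }

	fmt.Println("failed fetch:", code.Redeem(demoTime(1), fetchBroken))
	fmt.Println("successful fetch:", code.Redeem(demoTime(2), fetchOK))
	fmt.Println("replay:", code.Redeem(demoTime(3), fetchOK))
}
```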
 ---
 ## 7. Networking
 - **Cloudflare Tunnel** provides inbound access to apps and the controller UI (the CGNAT
  solution). Tunnel token lives in the hub record → **reused on new hardware during DR**, so
  DNS/routing stay intact through an outage.
 - **Outbound only** for control/report/backup (poll to hub, push to PBS). No inbound control
  endpoint exists in the chosen model.
 - **Tunnel placement: host** (resolved, Part 3 §3/§5). `cloudflared` runs on the Proxmox host
  as its own **agent-managed systemd service** — not inside the guest — so the data path
  survives control-plane death by construction. Geo-restriction WAF is **hub-enforced** (the
  hub holds the CF API token; the controller only reports geo desired-state).
 ---
 ## 8. Storage & backup
 **Tiers** (escalating failure scope):
 | Layer | Mechanism | Survives | Note |
 |---|---|---|---|
 | Snapshot | LVM-thin snapshot (transient) | *logical* loss only | whole-LXC rollback; **not a backup** |
 | Local — second storage | vzdump to `dir`/`nfs`/`cifs` | primary-disk failure (USB) / box death (NAS) | first *real* backup tier |
 | Offsite — PBS | dedup'd, incremental, encrypted | site loss | the DR substrate; paid tier |
 - **Storage manifest** (hub-held, agent-reconciled): per target → type, durable identity
  (UUID / `server:/export` / repo+fingerprint), **class** (fast/slow + rough IOPS, set once
  at attach), role, encrypted credentials, schedule/retention. The agent creates the Proxmox
  storages, continuously checks presence/reachability, and reports per-target status (a
  disconnected target → actionable notification).
 - **App data placement is per-volume, not per-app:** `.felhom.yml` classifies each volume
  **hot** (DB/config/cache → fast storage, enforced) vs **bulk** (media/files → may be slow).
  A photo app's DB stays on SSD while its blobs go to the USB.
 - **Backup scoping:** hot data (LXC rootfs) rides the guest `vzdump` → tiers + PBS. Bulk data
  on external mount points is **excluded** from the guest vzdump (per-mount `backup` flag) and
  gets its own per-volume policy (file-level to a tier, slower cadence — or explicitly *not*
  backed up for re-downloadable content, with the customer informed).
 - **Tiers double as the DR restore-source priority:** restore from the fastest *surviving*
  source (local if still attachable, PBS on true site loss).
 - **Key custody (zero-knowledge default):** three tiers the customer chooses —
  *customer-only* / *zero-knowledge escrow (default)* / *operator-managed*. Default escrows
  the **PBS passphrase-protected keyfile** in the hub, wrapped under a **customer recovery
  code** the operator can't open; DR needs the customer's code. Access-notification is an
  audit signal, never the primary guard. (Don't build bespoke crypto — use PBS's native
  keyfile passphrase.)
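
The restore-source rule above (fastest surviving source: local if still attachable, PBS on true site loss) reduces to a priority pick over reachable targets. A minimal sketch, assuming a per-target reachability flag like the status the agent reports; `Tier`, `Source`, and `pickRestoreSource` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier orders the real backup tiers fastest-first for restore.
// The LVM-thin snapshot is left out: it is not a backup.
type Tier int

const (
	TierLocal Tier = iota // vzdump on second storage
	TierPBS               // offsite PBS
)

func (t Tier) String() string {
	if t == TierLocal {
		return "local"
	}
	return "pbs"
}

// Source is a hypothetical per-target status record.
type Source struct {
	Tier      Tier
	Reachable bool
}

var errNoSource = errors.New("no surviving restore source")

// pickRestoreSource returns the fastest tier that is still reachable.
func pickRestoreSource(sources []Source) (Tier, error) {
	best, found := TierPBS, false
	for _, s := range sources {
		if !s.Reachable {
			continue
		}
		if !found || s.Tier < best {
			best, found = s.Tier, true
		}
	}
	if !found {
		return 0, errNoSource
	}
	return best, nil
}

func main() {
	// USB backup disk died with the box; only the offsite copy survives.
	tier, err := pickRestoreSource([]Source{
		{Tier: TierLocal, Reachable: false},
		{Tier: TierPBS, Reachable: true},
	})
	fmt.Println(tier, err)
}
```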
 ---
 ## 9. Disaster recovery
 - **Guest-loss (host + agent alive):** the agent restores the guest from the fastest
  surviving tier, **resets identity** (MAC/hostname — see `proxmox-platform.md`), boots it,
  controller returns. Validated mechanics: Phase 2.
 - **Host / hardware-loss (agent gone):** re-provision (§6) in **restore mode** — the hub,
  knowing the customer has PBS backups, hands the freshly-enrolled agent the existing identity
  + PBS namespace + a restore directive instead of a clean-provision directive. The agent
  restores from PBS; the controller returns on the same domain (tunnel reused from the hub
  record). DR = provisioning + a restore mode, not a separate mechanism.
 - **Snapshot-before-deploy:** controller asks the agent to snapshot, deploys, runs its
  post-deploy health check, asks the agent to roll back on failure. (Transient snapshot, §8.)
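
The snapshot-before-deploy sequence can be sketched as a wrapper around the existing deploy step. This is an illustration only: the `Agent` interface, `deployWithSnapshot`, and `fakeAgent` are hypothetical, the real local-API shape is not defined here, and cleanup of the transient snapshot after a healthy deploy is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// Agent is the hypothetical slice of the local API the deploy path needs.
type Agent interface {
	Snapshot() (id string, err error)
	Rollback(id string) error
}

var errUnhealthy = errors.New("post-deploy health check failed")

// deployWithSnapshot runs: snapshot, deploy, health check, rollback on failure.
func deployWithSnapshot(a Agent, deploy, healthy func() error) error {
	id, err := a.Snapshot()
	if err != nil {
		// Never deploy without a rollback point.
		return fmt.Errorf("snapshot: %w", err)
	}
	err = deploy()
	if err == nil {
		err = healthy()
	}
	if err == nil {
		return nil
	}
	if rbErr := a.Rollback(id); rbErr != nil {
		return errors.Join(err, fmt.Errorf("rollback: %w", rbErr))
	}
	return fmt.Errorf("deploy rolled back: %w", err)
}

// fakeAgent records rollbacks so the flow can be exercised without Proxmox.
type fakeAgent struct {
	rolledBack []string
}

func (f *fakeAgent) Snapshot() (string, error) { return "snap-1", nil }

func (f *fakeAgent) Rollback(id string) error {
	f.rolledBack = append(f.rolledBack, id)
	return nil
}

func main() {
	agent := &fakeAgent{}
	deployOK := func() error { return nil }
	unhealthy := func() error { return errUnhealthy }

	err := deployWithSnapshot(agent, deployOK, unhealthy)
	fmt.Println("result:", err)
	fmt.Println("rolled back:", agent.rolledBack)
}
```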
 ---
 ## 10. How this embodies the product values
 - **Zero-knowledge offsite** — the operator holds the offsite backup but cannot read it.
 - **Box-initiated control + signed jobs** — no standing operator backdoor; a hub compromise
  alone can't forge commands.
 - **Customer-visible audit log** — every operator action is visible to the customer.
 - **Never hold data hostage** — subscriptions cover ongoing labour (monitoring, offsite,
  support, new deployments); the customer's data and deployed apps remain recoverable by the
  customer (recovery code), with nothing locked behind the operator.
 ---
 ## 11. Open sub-decisions (carried into later parts)
 - **RTO/RPO targets** → drive the backup + offsite-replication schedule (§8).
 - Offboarding / decommission (scenario 6) — not yet designed; must honour "never hold data
  hostage" in credential revocation + data hand-off.
 - Multi-tenant resource fairness — deferred until multi-tenant is real (§2).
 ---
 ## Appendix — relationship to the spike
 - **Phase 0** → §2: LXC-default for the workload; overhead numbers.
 - **Phase 1** → §3/§5: validated the privilege boundary (create/allocate is operator-tier).
  The guest-side scoped-backup-token it proved possible is **not** used — we chose the
  agent-mediated path — but it confirmed restore = operator-tier, which shapes the agent.
 - **Phase 2** → §8/§9: backup→restore round-trip; identity reset on restore.
 ---
 ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
 - §5 trust boundaries: **added `hub ↔ Cloudflare API`** row (hub holds the CF token, enforces
  geo→WAF); controller↔hub row notes it carries geo desired-state (S4).
 - §7 networking: **tunnel placement resolved → host** (agent-managed systemd service); geo is
  hub-enforced (S4/S5).
 - §11 open items: removed the now-resolved **tunnel placement** and **self-update flow** entries
  (S5; self-update designed in 03 §11).
 - §6 durable record: **"declarative app inventory" → "mirrored app inventory"** — aligns the wording
  with the locked two-driver model (`05` §1: apps are bottom-up mirror, never operator-declared;
  `05` §9: apps restore from the PBS guest snapshot, not re-deployed from this record).
@@ -0,0 +1,374 @@
 # Felhom Controller Architecture — Part 2: Controller Module Map
 **Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.
 **Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files,
 ~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
 [../proxmox-platform.md](../proxmox-platform.md).
 > This is a **planning map, not the port.** No controller code was changed. Source
 > citations use `controller/internal/...:line` (a different repo, so links are not
 > clickable). Classifications reflect the **target model**: the in-guest controller is
 > **Docker-only and holds no Proxmox credentials**; everything host/disk/Proxmox moves to
 > a new **host agent** (out of scope here); the controller reaches the agent through a
 > constrained **local API**.
 ## Classification scheme
 **KEEP** (host-agnostic, ~unchanged) · **PORT** (survives, needs rework) ·
 **DELETE (→agent)** (responsibility moves to the host agent) ·
 **DELETE (obsolete)** (no longer needed) · **MODIFY** (stays, materially changes) ·
 **NEW** (no v0.33 equivalent).
 Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-target with a keep/port target).
 ---
 ## 0. Executive summary
 - The **app domain is largely intact and portable**: stack lifecycle (`stacks/`), catalog
  git-sync (`sync/`), app-to-app integrations (`integrations/`), `.fab` export/import
  (`appexport/`), the scheduler, crypto, asset sync, the hub report/notify *channels*, and
  most of the web UI **KEEP/PORT cleanly**.
 - The **disk/storage/host half deletes wholesale to the agent**: all of `storage/`,
  `monitor/watchdog.go`, the restic/cross-drive/disk-layout/drive-mount parts of `backup/`,
  `report/infra_backup*`+`infra_pull`, and the host-physical parts of `system/`.
 - The **setup wizard (`setup/`) is obsolete** — the agent provisions the controller.
 - **The single biggest hazard is `backup/`**: the keep side (DB dumps, Docker-volume
  archive, per-app restore — needed by `appexport/` and the backup UI) and the delete side
  (restic, cross-drive, drive-mount) are **interleaved inside the same files**
  (`backup.go`, `restore.go`, `paths.go`), not cleanly file-separated. Extracting the
  app-data-backup subset into a clean retained package is the critical refactor.
 - **Intent-vs-reality corrections** (vs the task's provisional split): `monitor/pinger.go`
  is already **dead** (legacy Healthchecks.io, "deprecated… now handled by Hub" per
  `main.go`) → DELETE(obsolete), not keep. `backup.go`/`restore.go`/`paths.go` do **not**
  split on file boundaries — they split *within* the file. `settings/` is **not** pure app
  domain — it stores disk/disconnect/decommission state. `system/` is genuinely
  mixed-per-function, not per-file.
 ---
 ## 1. v0.33 module inventory (package → purpose, key deps)
 | Package | Purpose | Key internal deps |
 |---|---|---|
 | `cmd/controller/main.go` | Entry point; wires all subsystems; 6 adapters break import cycles; branches into setup mode | imports **every** package |
 | `api/` | REST API (`router.go`) + geo endpoints (`geo.go`) | stacks, backup, metrics, notify, selfupdate, sync, system, assets, integrations, cloudflare, config, settings |
 | `appexport/` | `.fab` app export/import (config+DB+volumes, AES-256-CTR+scrypt) | **backup** (DB dump), (provider iface → stacks) |
 | `assets/` | Download/cache app assets from Hub API | — (HTTP only) |
 | `backup/` | DB dumps, Docker-volume archive, **restic**, **cross-drive rsync**, per-app restore, **drive mount**, disk-layout, infra-backup metadata | config, monitor, settings, system, util |
 | `cloudflare/` | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries) — **enforcement → hub** (S4) | settings |
 | `config/` | `controller.yaml` schema + load | — |
 | `crypto/` | AES-256-GCM for app.yaml secrets | — |
 | `integrations/` | App-to-app (OnlyOffice→FileBrowser/Nextcloud) via docker exec / config patch | stacks, crypto, settings |
 | `metrics/` | SQLite time-series: system + container metrics, log scan | system |
 | `monitor/` | App health (`healthcheck`,`pinger`) + **storage/USB watchdog** | config, notify, settings, system |
 | `notify/` | Hub event push (direct, own API key) | settings |
 | `recovery/` | Generate `recovery-info.txt` (DR guide) | — |
 | `report/` | Build+push hub report; **infra-backup payload**; **recovery pull** | backup, config, metrics, monitor, scheduler, settings, stacks, system |
 | `scheduler/` | Cron/interval jobs, Budapest TZ | — |
 | `selftest/` | Startup checks (docker/dirs/catalog/hub/**restic repos**/mountpoint) | backup, config, settings, system |
 | `selfupdate/` | Self-update: pull image, edit compose, `up -d` | config |
 | `settings/` | `settings.json` persistent state: **storage paths/disconnect/decommission**, cross-drive cfg, notif prefs, geo, integration state, DB-validation cache | — |
 | `setup/` | **First-run wizard** (scan drives, hub-restore, manual config) | backup, config, report, settings, web |
 | `stacks/` | Docker Compose lifecycle, deploy + memory validation, metadata (`.felhom.yml`), HDD-data delete | config, crypto, system |
 | `storage/` | **Physical disk** scan/format/attach/mount/migrate/fstab/safety | backup, settings, util |
 | `sync/` | Catalog git-sync (pull templates) | config |
 | `system/` | Resource info: mem/cpu/load (guest) + **temp/disk-model/USB/mount topology (host)** | — |
 | `util/` | String helper | — |
 | `web/` | Hungarian dashboard: pages, auth, deploy, backup UI, **storage/disk UI**, DR restore UI, export UI, debug | appexport, backup, config, crypto, integrations, monitor, notify, scheduler, selfupdate, settings, stacks, storage, system |
 ---
 ## 2. Classification table (per package/file)
 ### `cmd/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `cmd/controller/main.go` | **MODIFY** | Wiring stays, but drop the setup-mode branch, the storage/watchdog/drive-migrator/restic/cross-drive/infra-backup wiring, and add the **agent local-API client**. 6 adapters shrink. | hazard |
 ### `api/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `api/router.go` | **PORT/MODIFY** | Keep stacks/deploy/integrations/metrics/sync/assets/selfupdate routes; **remove `/api/storage/*` (disk)**; backup routes become **agent-coordinated guest-backup** requests; `config/apply` (hub-pushes-yaml) changes since the **agent** now injects config at provision. | needs-rework |
 | `api/geo.go` | **PORT/MODIFY** | Keep the customer-facing geo **preference** endpoints (set/get global + per-app); **drop the Cloudflare-sync trigger** — enforcement → hub (S4). The controller reports geo desired-state up instead of calling the CF API. | needs-rework |
 ### `appexport/` — KEEP/PORT (Docker-volume + DB level, no disk ops)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `crypto.go` | **KEEP** | Self-contained AES-256-CTR+HMAC+scrypt for `.fab`. | clean |
 | `manifest.go`, `provider.go` | **KEEP** | Bundle metadata; provider interface (impl in main). | clean |
 | `export.go` | **PORT** | Docker-volume `tar`, DB dump via `backup.DumpOne`, config copy. Depends on the **retained** app-data-backup subset of `backup/`; HDD-mount enumeration reworked to **per-volume placement**. | needs-rework |
 | `restore.go` | **PORT** | `docker volume create`/`tar xf`, DB import, compose up. Same per-volume rework. | needs-rework |
 | `estimate.go` | **PORT** | `du`/`df` on mounts → per-volume sizing. | clean |
 ### `assets/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `syncer.go` | **KEEP** | Hub API download + checksum cache; already a direct hub channel. | clean |
 ### `backup/` — THE SPLIT (delete side interleaved with keep side; see §3)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `dbdump.go` | **KEEP** | Pure `docker exec pg_dump`/`mariadb-dump` — app/DB data layer; the retained per-app backup. | clean |
 | `appdata.go` | **PORT** | App-data discovery (stacks/volumes/DB containers, `du`). "HDD mount" concept → per-volume. | needs-rework |
 | `backup.go` (1478 L) | **MODIFY (split)** | Mixes **keep** (`RunDBDumps`, `DumpAppVolumes(Safe)`, app restore) with **delete→agent** (`RunBackup`/`backupDrive`/restic snapshot/prune/check on per-drive repos). Must be torn in two. | hazard |
 | `restore.go` (442 L) | **MODIFY (split)** | `RestoreApp` restic path → agent; Docker-volume + Tier-2 rsync restore (app layer) → keep. | hazard |
 | `restore_app_linux.go`/`_other.go` | **PORT** | Per-app restore: compose pull/up, rsync app data, DB-dump restore. App layer; depends on the backup location, which changes. | needs-rework |
 | `paths.go` | **MODIFY (split)** | `AppDBDumpPath`/`AppVolumeDumpPath` keep; `Primary/SecondaryResticRepoPath`, `InfraBackupDir` → agent. | needs-rework |
 | `restic.go` | **DELETE (→agent)** | restic repos on drives = infra backup tier; agent does vzdump/PBS. | hazard |
 | `crossdrive.go` | **DELETE (→agent)** | Tier-2 cross-drive rsync to secondary storage = storage-tier (agent + storage manifest). | hazard |
 | `restore_drives_linux.go`/`_other.go` | **DELETE (→agent)** | `lsblk`/`blkid`/`mount`/fstab — pure host disk. | hazard |
 | `disk_layout.go` | **DELETE (→agent)** | Disk topology for DR → agent. | clean |
 | `local_infra.go` | **DELETE (→agent)** | Per-drive infra-backup metadata → agent. | clean |
 | `restore_scan.go` | **DELETE (→agent)** | Scans drives to build a DR restore plan = agent-tier DR. | needs-rework |
 ### `cloudflare/` — DELETE (→hub): CF-API enforcement moves to the hub (S4)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `client.go`,`zone.go`,`waf.go`,`geosync.go`,`countries.go` | **DELETE (→hub)** | The **hub** holds the CF API token and reconciles geo desired-state → WAF (doc 01 §5, doc 03 §2). The controller no longer calls the Cloudflare API — it reports geo desired-state up. The customer-facing geo *preference UI/data* stays (see `api/geo.go`). | needs-rework |
 ### `config/`, `crypto/`, `util/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `config/config.go` | **MODIFY** | Drop `BackupConfig` (restic/retention), storage-drive keys, and `InfrastructureConfig.cf_api_token` (→hub, S4); keep customer/paths/web/git/stacks/monitoring/hub/assets/system; **add agent local-API endpoint+token**. | needs-rework |
 | `crypto/crypto.go` | **KEEP** | App.yaml secret encryption. | clean |
 | `util/strings.go` | **KEEP** | Trivial helper. | clean |
 ### `integrations/` — all KEEP (pure app-domain)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `integrations.go`,`lifecycle.go`,`manager.go`,`onlyoffice_filebrowser.go`,`onlyoffice_nextcloud.go` | **KEEP** | App-to-app via `docker exec` / compose-config patch; no host ops. | clean |
 ### `metrics/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `store.go`,`logscanner.go`,`telemetry.go`,`types.go` | **KEEP** | SQLite store, `docker logs` scan, container telemetry — app-domain. | clean |
 | `collector.go` | **PORT** | Container metrics (`docker stats`) keep; host metrics via `system.GetInfo` (temp, physical disk) become **agent-provided or dropped**. | needs-rework |
 | `sysinfo.go`/`sysinfo_other.go` | **MODIFY** | Reads `/host/etc`, `/proc/cpuinfo`, uptime (host static info). In-guest, some of it remains meaningful; hardware identity comes via the agent. | needs-rework |
 ### `monitor/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `healthcheck.go` | **PORT (split)** | Keep guest health (mem/cpu/docker/protected-containers); host health (temp, **physical disk**, storage-path mount status) becomes **agent-fed**. | needs-rework |
 | `pinger.go` | **DELETE (obsolete)** | Legacy Healthchecks.io; `main.go` itself marks it "deprecated… now handled by Hub". *(Corrects the task's KEEP/PORT guess.)* | clean |
 | `watchdog.go` (902 L) | **DELETE (→agent)** | Storage/USB disconnect monitoring: `umount -l`, `mount -T /host-fstab`, UUID probing, restic-lock cleanup — pure host storage. | hazard |
 ### `notify/`, `recovery/`, `scheduler/`, `selftest/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `notify/notifier.go` | **KEEP/MODIFY** | Direct hub event channel (own API key) — keep; prune infra event types that move to the agent (`storage_disconnected`, `crossdrive_*`, `disaster_recovery_*`). | clean |
 | `recovery/info.go` | **DELETE (obsolete)** | Generates a DR text guide (OS install, docker-setup.sh, hub restore UI); DR is now agent+hub provisioning. | clean |
 | `scheduler/scheduler.go` | **KEEP** | Generic cron/interval, Budapest TZ. | clean |
 | `selftest/selftest.go` | **PORT** | Keep docker/dirs/catalog/hub checks; drop restic-repo + system-data **mountpoint** checks (→agent). | needs-rework |
 ### `report/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `pusher.go` | **KEEP** | Direct hub push (`/api/v1/report`, Bearer). | clean |
 | `telemetry.go` | **KEEP** | Per-app telemetry section. | clean |
 | `builder.go` (326 L) | **MODIFY** | Keep containers/telemetry/stacks/geo/app-health; drop/relocate host system info, physical storage, **restic backup status incl. restic password**. | hazard |
 | `types.go` | **MODIFY** | Schema: drop infra fields (`restic password`, physical storage), keep app-domain. | needs-rework |
 | `infra_backup.go`/`_linux.go`/`_other.go` | **DELETE (→agent)** | Builds infra-backup payload (disk layout, restic/enc passwords) for hub. | hazard |
 | `infra_pull.go` | **DELETE (→agent)** | Pulls recovery config + infra backup from hub (setup-wizard DR). | needs-rework |
 ### `selfupdate/` — controller is agent-managed (doc 03 §11)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `version.go` | **KEEP** | Semver parse / version string (still used for reporting). | clean |
 | `state.go` | **DELETE (obsolete)** | Self-update audit state — the agent owns controller updates now (doc 03 §11). | clean |
 | `updater.go` | **DELETE (→agent)** | Resolved (doc 03 §11): the controller is **agent-managed** — the agent snapshots → redeploys → health-gates → rolls back the controller. The controller's old self-update path (image pull + compose edit) is **removed**. | clean |
 ### `settings/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `settings/settings.go` (1101 L) | **MODIFY (split)** | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive *intent*. The **storage-path registry** (`StoragePath` with `Disconnected`/`DisconnectedAt`/`StoppedStacks`/decommission) is disk-management state → reshape to **per-volume placement** fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. (UUID is *not* a persisted field — runtime-derived from fstab.) | hazard |
 ### `setup/` — all DELETE (obsolete); the agent provisions the controller
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `handlers.go`,`setup.go`,`csrf.go`,`network.go` | **DELETE (obsolete)** | First-run wizard (hub-restore, manual config, LAN-IP detection). | needs-rework |
 | `scanner.go` | **DELETE (→agent)** | Drive scan (`lsblk`+temp mounts) for backup discovery — host op; its capability informs the agent. | clean |
 ### `stacks/` — core app domain (KEEP/PORT)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `manager.go` (1074 L) | **KEEP/PORT** | Docker Compose orchestration, scan/state/start/stop/logs — the heart. Minor port. | clean |
 | `deploy.go` | **PORT** | Memory validation (`system.GetMemoryMB` — **guest** mem, fine in LXC), secret gen, encrypted app.yaml. **Add snapshot-before-deploy → agent** hook. | needs-rework |
 | `healthprobe.go` | **KEEP** | TCP/HTTP app probes. | clean |
 | `metadata.go` | **PORT** | `.felhom.yml` parse. **Add per-volume hot/bulk classification** (doc 01 §8). | needs-rework |
 | `delete.go` | **PORT** | Stack delete + HDD-data `os.RemoveAll` on bind mounts → per-volume cleanup. | needs-rework |
 ### `storage/` — entire package DELETE (→agent)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `scan*`,`format*`,`attach*`,`migrate*`,`migrate_drive*`,`safety*` | **DELETE (→agent)** | Physical disk: `lsblk`/`sfdisk`/`wipefs`/`mkfs.ext4`/`partprobe`/`mount`/`umount`/fstab/`blkid`/drive-rsync. The agent owns all of this (doc 01 §3, §8). | hazard |
 ### `sync/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `sync/sync.go` | **KEEP** | Catalog git-sync (clone/fetch/reset, copy compose+`.felhom.yml`, never overwrite app.yaml). | clean |
 ### `system/` — split per-function (not per-file)
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `cpu_linux.go`/`cpu_other.go` | **KEEP** | `/proc/stat` works inside an LXC. | clean |
 | `info.go`/`info_other.go` | **KEEP** | Structs/stubs. | clean |
 | `info_linux.go` | **MODIFY (split)** | Keep mem (`/proc/meminfo`)/load/statfs (guest); **temp via `/host/sys`, hwmon → agent**. | needs-rework |
 | `mounts_linux.go`/`mounts_other.go` | **DELETE (→agent)** mostly | Mount-point detection, USB, disk model, fstab, probe — host/disk. Guest-meaningful `statfs` disk-usage is the only keep-candidate → fold into the kept `info`. | hazard |
 ### `web/` — split by UI surface
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `auth.go`,`csrf.go`,`logbuffer.go`,`embed.go`,`templates.go` | **KEEP** | Session/CSRF, log ring buffer, embeds/logo. | clean |
 | `funcmap.go` | **KEEP/PORT** | Template helpers; a few backup/state labels track the backup rework. | clean |
 | `server.go` (559 L) | **MODIFY** | Routing/wiring; remove storage/DR-restore/watchdog wiring; keep app/deploy/backup/settings/export/debug. | needs-rework |
 | `handlers.go` (1883 L) | **PORT/MODIFY** | Core pages keep; the embedded **storage-path management** (add/remove/label/schedulable, storage bars, FileBrowser mount sync) → per-volume / agent-fed. | hazard |
 | `handler_export.go` | **KEEP/PORT** | `.fab` UI. | clean |
 | `handler_debug.go` (823 L) | **PORT** | Drop storage-simulate/infra-push/DR debug; keep the rest. | needs-rework |
 | `alerts.go` | **PORT/MODIFY** | Storage-disconnect alert now sourced from **agent** status; backup/update alerts keep. | needs-rework |
 | `handler_restore.go` | **DELETE (→agent) / MODIFY** | DR restore-mode UI; DR is agent-tier — replace with an agent-status view or remove. | needs-rework |
 | `storage_handlers.go` (1600 L) | **DELETE (→agent)** | Format/attach/mount/disconnect/migrate-drive/decommission disk UI. Any survivor is a **thin client calling the agent API** (e.g. per-volume placement requests). | hazard |
 | `templates/` (HTML, non-Go) | **PORT** | Remove disk-wizard + DR pages; keep app/deploy/backup/settings pages. | needs-rework |
 ### `scripts/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `scripts/hashpass.go` | **KEEP** | Standalone bcrypt helper. | clean |
 ---
 ## 3. Coupling hazards (delete-targets depended on by keep/port)
 1. **`backup/` is half-deleted but split *inside files*, not across them.** `backup.go`
   contains both `RunDBDumps`/`DumpAppVolumesSafe`/app-restore (keep) and
   `RunBackup`/`backupDrive` + restic (delete→agent); `restore.go` and `paths.go` are
   likewise mixed. **Keep/port consumers reach into this same package:**
   - `appexport/export.go:295` → `backup.DiscoverDatabases`/`DumpOne` (DB dump is app-layer — must survive)
   - `report/builder.go:buildBackupReport` → backup status (MODIFY)
   - `web/handlers.go` (backups page, `buildAppBackupRows`), `web/funcmap.go`, `web/alerts.go`, `web/handler_restore.go`, `web/handler_debug.go`
   - `selftest/selftest.go:217` → `checkResticRepos` (restic path — delete)
   - `main.go` scheduler chain `RunFullBackup` (DB→volume→restic→infra-push) interleaves both sides.
   **Action:** extract the app-data-backup subset (DB dump, volume archive, per-app
   restore) into a clean retained package *before* deleting the restic/cross-drive code,
   or every keep consumer breaks.
 2. **`backup/crossdrive.go` (delete→agent) is wired as `crossDriveRunner` into**
   `main.go`, `api/router.go`, `web/server.go`, and surfaced by `report/builder.go` and the
   backups page. Removing it requires reworking the backup UI/report to the agent's
   guest-backup status.
 3. **`storage/` (delete→agent) depended on by keep/port UI:** `web/storage_handlers.go`
   (delete) and `web/server.go`/`web/handlers.go` (port) — the latter renders storage
   labels/bars and runs **FileBrowser mount sync** off the storage-path registry.
   `storage/migrate*.go` also imports `backup` (also being split). Untangle the per-volume
   placement UI from the disk-management UI.
 4. **`monitor/watchdog.go` (delete→agent) depended on by** `web/alerts.go` (port),
   `web/server.go`, `web/handler_debug.go`, `main.go`. The disconnect **alert** must instead
   consume agent-reported storage status.
 5. **`system/` mixed-per-function, consumed by both sides.** Keep consumers —
   `stacks/deploy.go` (`GetMemoryMB`, guest), `metrics/collector.go` (container) — must not
   drag in the host-disk/temp/USB code that goes to the agent (`mounts_linux.go`,
   `info_linux.go` temp). Also consumed by `report/builder.go` (MODIFY), `monitor/healthcheck.go`
   (PORT), `selftest`, `crossdrive` (delete). **Split `system/` cleanly into guest-info vs
   host-info first.**
 6. **`settings/StoragePath` carries disk state into an app-domain store.** Disk fields
    (`Disconnected`, `DisconnectedAt`, `StoppedStacks`, decommission) are written by
    `watchdog.go`/`storage_handlers.go`/`crossdrive.go` (all delete), but the same struct is
    read by `stacks`/`web` for labels and **placement** (keep). UUID is *not* persisted; it is
    runtime-derived from fstab via `system.ParseFstabUUID`/`watchdog.go`. Reshape
    `StoragePath` to a placement record fed by the agent manifest.
 7. **`report/builder.go` imports almost everything** (backup, monitor, scheduler, stacks,
   system, metrics, settings, config). Its MODIFY must land *after* the backup and system
   splits, or it pulls deleted code along.
 8. **`backup/paths.go` shared both ways** — `appexport` + `selftest` + the kept DB-dump
   flow use the app-dump path helpers; the same file holds the restic/secondary helpers
   that leave.
 9. **DR/provisioning chain is cross-cut:** `setup/` (obsolete) → `report/infra_pull` +
   `recovery/info` + `backup.MountDrivesFromLayout` + `backup.ReadLocalInfraBackup`. All
   obsolete/→agent, but `main.go`'s setup branch and `web/handler_restore.go` reference
   them; remove together.
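
One way to keep hazard 5 contained is for each keep consumer to declare the narrow interface it needs, so the guest-info half of `system/` can be split out without pulling host-disk code along. A minimal sketch, assuming the deploy path only needs available guest memory; `GuestMemory`, `AvailableMemoryMB`, `validateMemory`, and `fixedMemory` are hypothetical and do not mirror the real `system.GetMemoryMB` signature.

```go
package main

import (
	"errors"
	"fmt"
)

// GuestMemory is the only capability the deploy path takes from system/.
// Declaring it on the consumer side keeps mount/USB/temp code out of the
// consumer's import graph.
type GuestMemory interface {
	AvailableMemoryMB() (int, error)
}

var errInsufficientMemory = errors.New("not enough guest memory for this app")

// validateMemory is a stand-in for the deploy-time memory validation.
func validateMemory(m GuestMemory, requiredMB int) error {
	avail, err := m.AvailableMemoryMB()
	if err != nil {
		return fmt.Errorf("read guest memory: %w", err)
	}
	if avail < requiredMB {
		return errInsufficientMemory
	}
	return nil
}

// fixedMemory is a test double; the real implementation would read
// /proc/meminfo inside the guest.
type fixedMemory int

func (f fixedMemory) AvailableMemoryMB() (int, error) { return int(f), nil }

func main() {
	fmt.Println("2048 MB free, 512 needed:", validateMemory(fixedMemory(2048), 512))
	fmt.Println("256 MB free, 512 needed:", validateMemory(fixedMemory(256), 512))
}
```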
 ---
 ## 4. Moves to the host agent (consolidated — feeds the future agent design)
 > Reporting only; **not** designing the agent here.
 - **All physical-disk management** — `storage/` in full: scan/classify, format
  (`wipefs`/`sfdisk`/`mkfs.ext4`/`partprobe`), attach (raw mount + bind + fstab), per-app
  and full-drive migration (rsync), safety checks (system-disk detection).
 - **Storage/USB watchdog** — `monitor/watchdog.go`: disconnect/reconnect detection,
  `umount -l`, `mount -T /host-fstab`, UUID-by-id probing, safe-disconnect, restic-lock
  cleanup.
 - **Infra/disk backup tier** — `backup/restic.go`, `crossdrive.go`,
  `restore_drives_*`, `disk_layout.go`, `local_infra.go`, `restore_scan.go`, plus the
  restic-snapshot half of `backup.go`, the restic-restore half of `restore.go`, and the
  restic/secondary path helpers in `paths.go`. (Maps to the agent's `vzdump`→tiers→PBS in
  doc 01 §8.)
 - **Infra-backup payload + recovery pull** — `report/infra_backup*`, `report/infra_pull`.
 - **Host-physical telemetry** — `system/mounts_linux.go` (mount topology, USB, disk
  model), the temp/hwmon parts of `system/info_linux.go`, and the host-hardware parts of
  `metrics/sysinfo.go`.
 - **Drive scanning for provisioning/DR** — `setup/scanner.go`.
 - **Self-restore-test execution** — the agent performs the restore-to-scratch-guest; the
  controller only orchestrates/validates (see §5).
 ---
 ## 5. New components to build (no v0.33 equivalent)
 1. **Agent local-API client** — the controller's only path to guest-level Proxmox
   operations (doc 01 §3, §5): `snapshot-before-deploy` + rollback, "grow my RAM", request
   guest backup/restore, read the storage manifest / mount placement, query per-target
   storage status. Replaces the deleted direct host/disk code with constrained RPC. The
   controller holds **no Proxmox creds** — only a local-API token.
 2. **Per-volume storage placement** (doc 01 §8) — `.felhom.yml` `hot`/`bulk` volume
   classification (extend `stacks/metadata.go`), enforcement at deploy (extend
   `stacks/deploy.go`), and a placement record in `settings`. Replaces the per-app
   HDD-path + cross-drive model. A `bulk` volume must be realized as a `backup=0` mount point,
   **never** a rootfs Docker named volume (validated recipe: `phase3-findings.md` B2 / doc 03 §7).
 3. **Self-restore-test status display** (read-only) — the **agent owns orchestration** (it
   holds the PBS key and creates the scratch guest — operator-tier, doc 03 §8); the controller
   only surfaces `GET /restore-test/status` in its UI. (Round-trip validated: Phase 2,
   [../proxmox-platform.md](../proxmox-platform.md) §4.)
 4. **Snapshot-before-deploy/rollback flow** in the deploy path — wraps the existing
   compose deploy with agent snapshot → health check → agent rollback-on-failure
   (doc 01 §9). New behaviour on top of `stacks/deploy.go` + `stacks/healthprobe.go`.
 5. **Agent-provisioning bootstrap receiver** — the controller accepts its injected hub API
   key + local-API token from the agent at provision time (doc 01 §6), replacing the
   deleted `setup/` wizard.
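
The per-volume placement rules in item 2 can be sketched as a deploy-time check: `hot` volumes must land on fast storage, and `bulk` volumes must be a mount point excluded from the guest backup. The type and field names below (`Placement`, `MountPoint`, `InGuestBackup`, `checkPlacement`) are assumptions for illustration; the storage-manifest contract is still unspecified.

```go
package main

import (
	"errors"
	"fmt"
)

type VolumeClass string

const (
	Hot  VolumeClass = "hot"  // DB/config/cache
	Bulk VolumeClass = "bulk" // media/files
)

type StorageClass string

const (
	Fast StorageClass = "fast"
	Slow StorageClass = "slow"
)

// Placement is a hypothetical record of where one volume is realized.
type Placement struct {
	Storage       StorageClass
	MountPoint    bool // guest mount point rather than a rootfs Docker named volume
	InGuestBackup bool // the per-mount backup flag (true means backup=1)
}

var (
	errHotOnSlow    = errors.New("hot volume must be on fast storage")
	errBulkOnRootfs = errors.New("bulk volume must be a mount point, not a rootfs named volume")
	errBulkBackedUp = errors.New("bulk mount point must be excluded from the guest backup")
	errUnknownClass = errors.New("unknown volume class")
)

// checkPlacement enforces the hot/bulk rules for a single volume.
func checkPlacement(class VolumeClass, p Placement) error {
	switch class {
	case Hot:
		if p.Storage != Fast {
			return errHotOnSlow
		}
	case Bulk:
		if !p.MountPoint {
			return errBulkOnRootfs
		}
		if p.InGuestBackup {
			return errBulkBackedUp
		}
	default:
		return errUnknownClass
	}
	return nil
}

func main() {
	// A photo app: the DB stays on SSD, the blobs go to the USB disk.
	fmt.Println("db:", checkPlacement(Hot, Placement{Storage: Fast}))
	fmt.Println("blobs:", checkPlacement(Bulk, Placement{Storage: Slow, MountPoint: true}))
	fmt.Println("db on usb:", checkPlacement(Hot, Placement{Storage: Slow}))
}
```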
 ---
 ## 6. Open / blocked items
 - **Geo — resolved (S4):** CF-API **enforcement moves to the hub** (it holds the CF token and
  reconciles geo → WAF); the controller keeps the geo **preference UI/data** and reports
  desired-state up. Tunnel placement is settled (host, agent-managed, doc 03 §3/§5). The
  `cloudflare/` package + `api/geo.go`'s CF-sync are DELETE-from-controller → hub.
 - **Self-update — resolved (doc 03 §11):** the controller is agent-managed; its self-update
  path is removed.
 - **`settings`/`stacks` per-volume reshape** — depends on the storage-manifest contract
  between hub ↔ agent ↔ controller (doc 01 §8), not yet specified.
 - **Backup UI/report surface** — depends on the agent's guest-backup status API shape
  (what the controller can see about vzdump/PBS state) — undefined.
 - **Notification event taxonomy** — which infra events (`storage_disconnected`,
  `crossdrive_*`, `disaster_recovery_*`) the **agent** emits vs the controller, once those
  responsibilities move.
 ---
 ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
 - **M1:** removed `UUID` from the `settings.StoragePath` field lists (§ settings, hazard #6) —
  it is runtime-derived from fstab, not persisted.
 - **S4 (geo):** `cloudflare/` reclassified **PORT(blocked) → DELETE(→hub)** (CF-API enforcement
  moves to the hub); `api/geo.go` → **PORT/MODIFY** (keep geo *preference* endpoints, drop the
  CF-sync trigger); `config/config.go` also drops `cf_api_token`. §6 + §1 updated.
 - **S5:** cloudflare/geo no longer "blocked on tunnel placement" (resolved).
 - **S6:** §5(3) self-restore-test → **status-display only**; the agent owns orchestration.
 - **Self-update resolved (03 §11):** `updater.go` → **DELETE(→agent)**, `state.go` →
  DELETE(obsolete), `version.go` KEEP; §6 + §5(2) updated (bulk = `backup=0` mountpoint recipe).
@@ -0,0 +1,299 @@
 # Architecture Part 3 — The Host Agent
 > Status: design draft (decision content). To be grounded by Claude Code against
 > `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
 > then placed at `docs/architecture/03-host-agent.md`.
 >
 > Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
 > Where this doc and the locked decisions disagree, the locked decisions win and this
 > draft is wrong — flag it.
 ## 1. Purpose & scope
 The **host agent** is the operator-tier component that runs on each Proxmox host and
 owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
 guests, manages host storage, orchestrates backups and restore-tests, watches the host
 and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
 controllers it deploys.
 It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
 (Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
 **moved here**. That makes the agent's hardening and blast-radius discipline the most
 security-sensitive part of the platform.
 The agent manages a **set** of guests on its host (usually one customer = one guest, but
 the multi-tenant/company case is not precluded — the agent's data model is per-host,
 N-guests, never "the guest").
 ## 2. Responsibilities (and explicit non-responsibilities)
 Owns:
 1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
 2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
 3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
 4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
 5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
 6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
 7. **Local API** — the per-guest authorization gate the controller calls.
 8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.
 Explicitly does **not**:
 - Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
 - Hold or proxy customer application data.
 - Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
 - Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.
 ## 3. Process model & host integration
 - **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
 - **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work + a **narrow `sudoers` allowlist** for true host ops. Per Phase 3 (B3) the boundary is settled: the entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered. Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
 - **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.
 ## 4. Control model — reconcile + signed destructive ops
 Two channels, split by **reversibility**, not by transport.
 **(a) Desired-state reconciliation — steady state.**
 The hub holds desired state for the host: which guests should exist (and at what spec),
 the storage manifest, backup/retention policies, controller image versions. The agent
 runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
 and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
 re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
 reconciliation for free.
 **(b) Signed one-shot jobs — operator actions.**
 Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
 (idempotency key), written to the customer-visible audit log, and **outside** the reconcile
 loop — they are point-in-time and often destructive, and a reconciler must never re-run a
 restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
 snapshot S"), not a procedure; the agent owns the *how*.
 **The reversibility gate (security-critical).**
 "Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
 *desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
 by verb**:
 - **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
 - **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
 - **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
 Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
 re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
 Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
 could both issue and suppress the notice, which is exactly why the *signature* (not the
 notification) is the control.
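The gate above can be read as a small decision function. The Go sketch below is illustrative only: `Origin`, `Change`, and `NeedsOperatorSignature` are hypothetical names, and it makes the conservative assumption that any destructive change whose provenance the agent did not establish itself is treated as data-bearing.

```go
package main

// Origin records where the agent learned that a resource may be destroyed.
// It is derived by the agent itself and never parsed from hub input.
type Origin int

const (
	FromHub         Origin = iota // arrived as a job or a desired-state delta
	SameTransaction               // created earlier in the same journaled transaction
	AgentScratch                  // tagged ephemeral/scratch by the agent itself
)

// Change is a simplified view of one pending mutation.
type Change struct {
	Destructive bool // destroys or overwrites an existing resource
	Origin      Origin
}

// NeedsOperatorSignature applies the reversibility gate: non-destructive
// changes and agent-provenanced cleanup run unsigned, everything else
// destructive requires an operator signature.
func NeedsOperatorSignature(c Change) bool {
	if !c.Destructive {
		return false // create/start/restart
	}
	switch c.Origin {
	case SameTransaction, AgentScratch:
		return false // compensating rollback or scratch teardown
	default:
		return true // assumed data-bearing unless the agent proved otherwise
	}
}
```

Because `Origin` is never read from the hub, relabelling a live guest as scratch is not expressible in a hub-supplied delta.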
 ## 5. Hub ↔ agent protocol (host domain)
 **Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:
 - **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
 - **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).
 **Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
 *is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
 alerts the operator when an agent misses its expected check-in window. This is the worst
 failure mode for a managed service, so it gets first-class treatment hub-side.
 **Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
 (agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
 **off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
 the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
 logged operation; it auto-expires.
 ## 6. Agent ↔ controller local API
 The controller (in its LXC) reaches the agent (on the host) over the local bridge.
 - **Transport:** HTTPS to the host's bridge IP on a fixed port.
 - **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
 - **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
 Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
 attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
 A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
 token → guest and authorizes per guest, so a compromised controller's blast radius is
 **self-scoped and bounded** to its own guest.
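One way to picture the per-guest gate: no endpoint takes a guest id, so the only guest a request can reach is the one its token resolves to. The sketch below assumes a `Bearer` authorization header and an in-memory token table; both are assumptions for illustration, and `ResolveGuest` is a hypothetical name.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"strings"
)

// ErrUnauthorized is returned for a missing, malformed, or unknown token.
var ErrUnauthorized = errors.New("unauthorized")

// ResolveGuest maps the caller's token to the single guest it may act on.
// tokens is keyed by local token and holds the owning guest id.
func ResolveGuest(tokens map[string]string, authorization string) (string, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authorization, prefix) {
		return "", ErrUnauthorized
	}
	token := strings.TrimPrefix(authorization, prefix)
	if token == "" {
		return "", ErrUnauthorized
	}
	for known, guestID := range tokens {
		// Constant-time compare so response timing does not leak token bytes.
		if subtle.ConstantTimeCompare([]byte(known), []byte(token)) == 1 {
			return guestID, nil
		}
	}
	return "", ErrUnauthorized
}
```

A handler for `POST /rollback` would call `ResolveGuest` first and pass only the returned guest id to the Proxmox layer.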
 ## 7. Storage manifest & reconciliation
 The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
 `settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
 re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
 improvement. Held in the hub, reconciled by the agent.
 Per target:
 | field | meaning |
 |---|---|
 | `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
 | `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
 | `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
 | `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
 | `creds` | encrypted (NFS/CIFS/PBS); USB has none |
 | `policy` | schedule + retention for this target |
 | `state` | `attached` / `disconnected` / `decommissioned` |
 Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
 allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
 the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).
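To make the reconciliation step concrete, here is a small sketch that plans one pass from the manifest and the current mount state. `Target` mirrors the table above with `creds` and `policy` left out; `Plan` and the `mounted` map are hypothetical and not the agent's actual types.

```go
package main

// Target mirrors the manifest fields above; creds and policy are omitted
// to keep the sketch short.
type Target struct {
	Type      string // local-dir, usb, nfs, cifs, pbs
	DurableID string // UUID, server:export, or repo+fingerprint
	Class     string // fast or slow
	Role      string // primary, vzdump-target, pbs-offsite, bulk-data
	State     string // attached, disconnected, decommissioned
}

// Plan compares the manifest with what is mounted right now (keyed by
// durable id) and returns the targets to mount and the targets to report.
func Plan(manifest []Target, mounted map[string]bool) (toMount, toReport []Target) {
	for _, t := range manifest {
		switch {
		case t.State == "attached" && !mounted[t.DurableID]:
			toMount = append(toMount, t) // attached in the manifest, missing on the host
		case t.State == "disconnected":
			toReport = append(toReport, t) // surfaced to the hub
		}
	}
	return toMount, toReport
}
```

Decommissioned targets fall through both cases, so a reconcile pass never remounts them.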
 **Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
 **enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.
 A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
 bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
 (verified, `phase3-findings.md` B2). Proven recipe: attach
 `-mpN <storage>:<size>,mp=/mnt/bulk,backup=0`, then
 `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol>` (or a
 compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The
 **DR consequence** of excluding bulk is covered in §8.
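The two placement rules can be checked before a deploy touches anything. The sketch below is an assumption-laden illustration: the `Backing` values (`rootfs-named`, `mountpoint-backup0`, `external-bind`) and the `Volume` / `ValidatePlacement` names are invented for this example and are not the `.felhom.yml` schema.

```go
package main

import "fmt"

// Volume is a simplified view of one declared volume at deploy time.
type Volume struct {
	Name        string
	Kind        string // "hot" or "bulk", as classified in .felhom.yml
	TargetClass string // "fast" or "slow", from the storage manifest
	Backing     string // hypothetical: "rootfs-named", "mountpoint-backup0", "external-bind"
}

// ValidatePlacement rejects a hot volume that is not on a fast target and a
// bulk volume that would end up inside the vzdump-captured rootfs.
func ValidatePlacement(v Volume) error {
	switch v.Kind {
	case "hot":
		if v.TargetClass != "fast" {
			return fmt.Errorf("hot volume %q must sit on a fast target, got %q", v.Name, v.TargetClass)
		}
	case "bulk":
		if v.Backing != "mountpoint-backup0" && v.Backing != "external-bind" {
			return fmt.Errorf("bulk volume %q must be a backup=0 mount point or external bind, got %q", v.Name, v.Backing)
		}
	default:
		return fmt.Errorf("volume %q has unknown kind %q", v.Name, v.Kind)
	}
	return nil
}
```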
 **Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
 `IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
 `StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
 reconnect, not a host concern).
 ## 8. Backup/restore orchestration
 Tiers double as backup *and* restore-source priority (fastest surviving source first),
 per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
 **local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).
 - **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze
  (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup
  is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack →
  `POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is
  crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5).
  Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return.
 - **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
  `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
  `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
  host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
  *same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
  never a silent loss.
 - **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
  *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
  cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
  read the data; the customer's offsite recovery code is the irreducible residual.
 - **Self-restore-test:** this closes the "tested restore is the critical gap" theme. The
  agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
  health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
  restore-tested by the box (the operator lacks the key) — so this lives in the agent by
  necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
  as the lighter check.
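The controller-driven quiescing sequence from the first bullet (due → quiesce → `POST /backup` → poll status → unquiesce) might be wired up roughly as follows. The `Agent` struct, the status strings `running` / `ok` / `failed`, and the polling parameters are assumptions for this sketch; the actual status API shape is still listed as undefined in Part 2.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Agent is a hypothetical client for the three local-API calls used here.
type Agent struct {
	BackupDue    func() (bool, error)   // GET /backup/due
	StartBackup  func() error           // POST /backup
	BackupStatus func() (string, error) // GET /backup/status (assumed values)
}

// RunDueBackup quiesces the app stack around an agent backup when one is
// due. It reports whether a backup was attempted.
func RunDueBackup(a Agent, quiesce, unquiesce func() error, poll time.Duration, maxPolls int) (attempted bool, err error) {
	due, err := a.BackupDue()
	if err != nil || !due {
		return false, err
	}
	if err := quiesce(); err != nil {
		return false, fmt.Errorf("quiesce: %w", err)
	}
	// Always unquiesce, even when the backup fails or times out.
	defer func() {
		if uErr := unquiesce(); uErr != nil && err == nil {
			err = fmt.Errorf("unquiesce: %w", uErr)
		}
	}()
	if err := a.StartBackup(); err != nil {
		return true, fmt.Errorf("start backup: %w", err)
	}
	for i := 0; i < maxPolls; i++ {
		status, err := a.BackupStatus()
		if err != nil {
			return true, fmt.Errorf("backup status: %w", err)
		}
		switch status {
		case "ok":
			return true, nil
		case "failed":
			return true, errors.New("backup failed")
		}
		time.Sleep(poll)
	}
	return true, errors.New("backup still running after max polls")
}
```

The deferred unquiesce is the important part: an app stack left quiesced after a failed backup would be a self-inflicted outage.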
 ## 9. Provisioning & DR flows
 **Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
 the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
 scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3), so the agent
 provisions **by restoring a golden base image**, never by `pct create` on the per-customer path:
 - A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`, overlayfs — is
  built once as `root@pam` **at enrollment** (when the agent legitimately holds root to mint its
  Proxmox token) and refreshed on a maintenance cadence. This is the one place `keyctl`/root
  provisioning lives — off the per-customer path.
 - To provision guest G: restore the golden archive → new VMID (token-covered: `VM.Allocate` +
  `Datastore.AllocateSpace`; `keyctl` preserved) → reset identity (MAC/hostname) → size the guest
  (CPU/mem config + `pct resize` rootfs, token-covered) → attach storage mounts per the manifest
  → deploy the controller → hand it bootstrap config. A mid-flight failure is journaled and
  compensating-rolled-back (destroy the just-restored guest — allowed without a signature per §4,
  same-transaction provenance).
 **Unified bring-up primitive.** Provisioning and DR-restore share the same token-covered front
 half — *restore an archive → reset identity* — and differ only in the archive and the back half:
 provisioning restores the **golden base** then deploys a fresh controller; DR-restore restores
 the **customer's backup** (already containing controller + data), brings it up, and reattaches
 external storage. One code path, exercised by every restore-test (§8).
 **Guest loss.** Agent restores G from the fastest surviving tier and resets identity
 (MAC/hostname) so the restored guest rejoins cleanly — this *is* the unified restore primitive
 above (customer-backup archive, DR back half).
 **Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub — the durable
 source of truth that survives box death — hands the new agent the existing identity, PBS
 namespace, tunnel token, storage manifest, and a restore directive. The tunnel is reused from
 the hub record, so DNS stays intact.
 ## 10. Concurrency, crash-safety, idempotency
 - **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
  work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
  ops on the same guest). Independent guests proceed in parallel.
 - **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
  self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
  journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
  half-built guest.
 - **Idempotency keys** on one-shot jobs (run-once across retries and restarts).
 ## 11. Self-update
 - **Agent (the hard case — a host service, no snapshot-rollback).** **A/B layout:** download →
  verify signature → stage as the inactive slot → flip a `current → good|new` symlink → restart.
  **Revert authority lives outside the swapped binary** — `Restart=always` alone just
  crash-loops a bad binary — so a **separate health-gate** (a systemd oneshot `ExecStartPost`
  probe, or a tiny supervisor unit) flips `current` back to last-good and restarts on a failed
  health window. The new version is **committed as "good" only after a clean health window**.
  Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
 - **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
  so the **agent updates the controller**: snapshot-before-update (free rollback, because the
  controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
   on failure. This resolves the Part-2 `selfupdate/` open item: the controller is **agent-managed**,
  not self-updating; the controller's old self-update path is removed.
 ## 12. Secrets at rest on the host
 The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
 operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
 tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
 and the secret footprint that left the controller now concentrate here — which is the whole
 argument for §3's root-minimization and a small, auditable agent.
 ## 13. Open items / what this unblocks
 Resolved here: tunnel placement (host, agent-managed, own systemd service), the
 reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
 ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, and
 the **root-vs-API boundary** (Phase 3, B3).
 Still open:
 - Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
 - Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
 - Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
 - **Golden base image** refresh cadence + fleet versioning — who triggers a rebuild, how the per-host image version is tracked (operational detail, not blocking; §9).
 This doc hands the implementation three contracts it was waiting on:
 1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
 2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
 3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.
 ---
 ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
 - **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image**
  (token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified
  restore primitive shared with DR. §2 responsibility + §3 boundary updated.
 - **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator
  role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART).
 - **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is
  agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
 - **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence**
  (excluded bulk needs its own backup decision).
 - **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is
  crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority;
  controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing.
  **S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id".
  **§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open.
@@ -0,0 +1,154 @@
 # Architecture Part 4 — Control-plane authorization (operator signing)
 > Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
 > To be reviewed by Claude Code against that spike + `03` §4, then placed at
 > `docs/architecture/04-control-plane-authorization.md`.
 >
 > Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
 > This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."
 ## 1. Purpose & scope
 `03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
 forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
 primitive, the keys, rotation, the three components' roles, and the operator workflow. The
 *policy* (what needs a signature) lives in `03` §4; this is the *how*.
 **Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
 overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
 storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
 or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
 guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
 signed ops are rare and deliberate.
 ## 2. Primitive — SSH signatures (SSHSIG)
 Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
 `SSHSIG` format), verified by the agent in Go (`golang.org/x/crypto/ssh`) — `pem.Decode` →
 `ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no hand-rolled crypto.
 **Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
 embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
 a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
 foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
 different signed-data), so it would force a verifier change on every box at hardware-adoption time.
 SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).
 ### 2.1 The signed object — canonical op blob
 The signature covers an op blob (Phase 4 §2):
 ```
 { op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
 ```
 - **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
  insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
  **verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
  parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
  side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
 - `nonce` ≥128-bit random; `issued_at`/`expires_at` a short window (minutes); `key_id` identifies
  the signing key (rotation/audit).
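On the signer side, Go's `encoding/json` already emits map keys in sorted order at every nesting level and without insignificant whitespace, so a canonical encoder can be very small. The sketch below is one possible approach, not the operator CLI's code; `Canonical` is a hypothetical name, and it assumes all blob values are strings or nested maps (timestamps as strings) so number formatting never comes into play.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// Canonical encodes an op blob as compact JSON with keys sorted at every
// level. encoding/json sorts map keys, which gives the determinism.
func Canonical(blob map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, >, & literal so the bytes stay readable
	if err := enc.Encode(blob); err != nil {
		return nil, err
	}
	// Encode appends a newline; it is not part of the canonical form.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}
```

The verifier never needs this function: it checks the signature over the raw bytes it received and parses those same bytes.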
 ### 2.2 Domain separation — the namespace
 The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
 caller-supplied. A signature minted for any other namespace must not verify (proven). This stops a
 signature made for one purpose being reused for another.
 ### 2.3 Verify pipeline (order is load-bearing)
 `namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
 recorded last**, only after everything else passes, so an invalid signature can never consume a
 nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
 - **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
  A cannot be replayed at box B);
 - **time window** — `now ∈ [issued_at, expires_at]`;
 - **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
  and expiry-pruned; a non-persistent store reopens the replay window after a restart).
 The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
 ## 3. The keys — two-key model, software now
 Both software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no box
 change (§7).
 - **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
  the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
  for destructive ops — rare, so its exposure is low.
 - **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
  printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
  if that key is lost or compromised.
 Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
 operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
 instructions.
 **Allowed-signers is a set** → single signer today; **quorum (N-of-M) for the highest-blast ops is
 just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
 now.
 ## 4. Rotation & compromise recovery
 The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
 or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
 op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.
 - **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
  the agent accepts it because it's signed by the trusted current key (key-signs-key).
 - **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
  it because the recovery key is pinned and authorized for rotation. The compromised key is removed
  from the allowed set in the same signed op.
 - **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
  way initial enrollment did).
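The split of authority between the two pinned keys reduces to a small rule, sketched here. `KeyRole`, `MaySign`, and the op name `rotate-operational-key` are hypothetical; the real op vocabulary is not defined in this doc.

```go
package main

// KeyRole is the authority a pinned public key carries on the agent.
type KeyRole int

const (
	Operational KeyRole = iota // signs destructive ops and planned rotations
	Recovery                   // signs key-rotation ops only
)

const opRotateKey = "rotate-operational-key" // hypothetical op name

// MaySign reports whether a pinned key with the given role may authorize op.
func MaySign(role KeyRole, op string) bool {
	switch role {
	case Operational:
		return true
	case Recovery:
		return op == opRotateKey
	default:
		return false
	}
}
```

This check would sit next to the allow-list step of the verify pipeline, so a stolen recovery key still cannot authorize a destroy.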
 ## 5. Component roles
 - **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
  (`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
  later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
  key (passphrase-protected); can reach the cold recovery key when rotation is needed.
 - **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
  status in the operator UI. Holds **no** private key and cannot sign — a compromised hub can only
  queue blobs the agent rejects. (Matches `03` §4 / box-initiated poll.)
 - **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
  verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
  customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
  (a compromised hub could issue *and* suppress notice — the signature is the control).
 - **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
  physical-presence provisioning step (the trust root is established on-site, not via the hub).
 ## 6. Operator workflow
 - **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
  overhead.
 - **A destructive op** (rare): the operator runs the signing CLI on their workstation — which builds
  the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub
  queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the
  desk. **Never** a site visit.
 ## 7. Hardware readiness (Viktor's "build the foundation now")
 Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later is a **no-op on the
 boxes** — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a
 spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the
 `Signer` backend at it, and updates the allowed-signers entry; nothing on the boxes changes.
 Two honest notes:
 - **Confirm with a real device at adoption.** Phase 4 §5 was validated against the spec, not against
  live hardware; a 5-minute real-key round-trip should confirm it (no surprise expected:
  signer, library, and device all follow the same spec).
 - **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
  crypto-only today (correct for software keys); enforcing the flag is a small later option once
  hardware is in use.
 ## 8. Open items
 - **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
  allowed-signers-set foundation supports it.
 - **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
  detail.
 - **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.
 ## 9. What this unblocks
 Closes the `03` §4 "undesigned signing path." Hands off to implementation: the **canonical blob spec**
 (§2.1) + the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's verify path, the
 **`Signer` interface** for the operator CLI, and the **allowed-signers pinning** step for enrollment.
 The hub's signed-job queue + pending-op UI carry into the hub architecture doc.
@@ -0,0 +1,223 @@
 # Architecture Part 5 — The Hub
 > Status: design draft (decision content). To be validated by Claude Code against the **actual
 > felhom-hub source** (`felhom.eu` repo, `hub/`) + Parts 01–04, then placed at
 > `docs/architecture/05-hub-architecture.md`.
 >
 > The hub is **not** greenfield — it's a mature service (felhom-hub v0.6.3, Go + SQLite on k3s,
 > `hub.felhom.eu`). This doc is the **deltas** to evolve it for the Proxmox model, plus the new
 > data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).
 ## 1. Source-of-truth model — two drivers, two directions
 The single most important framing, and the one that governs everything below: the hub is **not** a
 monolithic source of truth. State flows in two directions with opposite drivers.
 - **Operator-driven *intent* — hub authors, agent reconciles (top-down).** Which guests should
  exist and their spec, storage *policy* (a target's role/class/backup schedule), controller +
  golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges
  toward them. Here the hub *is* the source of truth.
 - **Box/customer-driven *reality* — box authors, pushes up, hub mirrors (bottom-up).** Which USB
  drive is *physically* attached (and its `durable_id`), what apps are deployed and where, the
  customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The
  customer or the physical world drives these; the box reports them; the hub stays an up-to-date
  **mirror** but is **never** the driver.
 They meet at a **handshake**, not a tug-of-war. Storage is the clearest case: the customer plugs in
 a drive → the agent *detects* it and reports `durable_id X attached` (reality) → the operator
 assigns `role=bulk, class=slow, backup=weekly` (policy, intent) → the agent reconciles that policy
 *onto the detected drive*. **Apps never enter the reconcile loop** — app deployment is the
 controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the
 resulting inventory. **Reconciliation applies to infrastructure; the app/customer layer is mirrored.**
 ## 2. Data model (Part 1 decision (b): customer-anchored)
 A customer's deployment is one **Host** (its agent) plus one-or-more **Guests** (its controllers).
 1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the
 `hosts` table is the seam it would use).
 - **`customer_configs`** (existing) — the Customer anchor: identity, domain, email,
  `retrieval_password`, status, config_json. Unchanged role.
 - **`hosts`** (new) — `host_id PK, customer_id, api_key` (the agent's hub key), `agent_version`,
  desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host
  **`desired_generation`** counter, the slim DR record (§9), timestamps.
 - **`guests`** (new) — `guest_id PK, customer_id, host_id, api_key` (the controller's hub key),
  `display_name, controller_version`, per-guest **`desired_spec_json`** (CPU/mem/disk, versions),
  timestamps.
 **Per-reporter keys:** today's per-customer `customer_configs.api_key` becomes per-reporter —
 `hosts.api_key` (agent) and `guests.api_key` (controller). The hub resolves a presented Bearer key →
 host or guest → customer; `customer_configs.api_key` goes unused once auth resolves via the new keys.
 **Clean cutover:** no dual-model support; the demo re-enrolls fresh into `host + guests`.
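
The per-reporter key resolution (Bearer key → host or guest → customer) can be sketched as follows. This is a minimal illustration under assumptions: in-memory slices stand in for the `hosts` and `guests` tables, and `Reporter`, `resolveKey`, and `sameKey` are hypothetical names; the real hub would resolve the key with a store query.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// Host and Guest carry only the columns needed for key resolution.
type Host struct{ HostID, CustomerID, APIKey string }
type Guest struct{ GuestID, CustomerID, HostID, APIKey string }

// Reporter is what a presented Bearer key resolves to.
type Reporter struct {
	Kind       string // "host" (agent) or "guest" (controller)
	ID         string
	CustomerID string
}

var errUnknownKey = errors.New("unknown api key")

// sameKey compares keys in constant time to avoid leaking match length.
func sameKey(a, b string) bool {
	return subtle.ConstantTimeCompare([]byte(a), []byte(b)) == 1
}

// resolveKey maps a presented key to a host or guest, and through it to the
// owning customer. An empty or unknown key is rejected.
func resolveKey(hosts []Host, guests []Guest, key string) (Reporter, error) {
	if key == "" {
		return Reporter{}, errUnknownKey
	}
	for _, h := range hosts {
		if sameKey(h.APIKey, key) {
			return Reporter{Kind: "host", ID: h.HostID, CustomerID: h.CustomerID}, nil
		}
	}
	for _, g := range guests {
		if sameKey(g.APIKey, key) {
			return Reporter{Kind: "guest", ID: g.GuestID, CustomerID: g.CustomerID}, nil
		}
	}
	return Reporter{}, errUnknownKey
}

func main() {
	hosts := []Host{{HostID: "h1", CustomerID: "c1", APIKey: "k-agent"}}
	guests := []Guest{{GuestID: "g1", CustomerID: "c1", HostID: "h1", APIKey: "k-ctrl"}}
	fmt.Println(resolveKey(hosts, guests, "k-ctrl"))
}
```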
 ## 3. Report ingest — two domains
 The single controller report splits. The de-privileged controller no longer sees host disks/storage/
 backup, so its report **slims** (it loses System/Storage/Backup, keeps app-domain).
 - **`POST /api/v1/host-report`** (new, agent) → **`host_reports`**: host CPU/RAM/disk, per-guest
  up/down + spec, storage-target status (attached drives + `durable_id` + reachability), last backup
  + restore-test per target, latest PBS snapshot pointers, `cloudflared` health, agent + controller
  versions. Denormalized columns for the dashboard; full `report_json`. Index `(host_id, received_at
  DESC)` + `(customer_id, received_at DESC)`.
 - **`POST /api/v1/report`** (existing, slimmed controller) → the renamed **`guest_reports`**: it
  gains `guest_id` + `host_id`; its `cpu/memory` denorm now means *guest-level*; `backup_last_snapshot`
  goes quiet (backup status lives in `host_reports`). App telemetry / log issues stay.
 These two streams are the bottom-up mirror of §1 — they keep the hub current without a separate push.
 ## 4. Liveness / dead-man's-switch
 Evolves the existing staleness checker (60s **cadence**; 30m/1h **thresholds**: OK under 30m, stale
 from 30m, down at 2× = over 1h; today it maps controller-report recency →
 `node_stale`/`down`/`recovered`):
 - **Primary = host-report recency → `host_stale` / `host_down`.** The agent heartbeat is the box's
  liveness signal; a silent agent = the box is gone (the critical alert).
 - **Guest up/down comes from the host report's per-guest status** — authoritative, every poll, faster
  than waiting for a guest report to go stale.
 - **Guest-report recency = secondary** app-level signal.
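
A minimal sketch of the host-recency classification, assuming the 30m/1h thresholds quoted above and assuming the band between them maps to `host_stale`. The names `classifyAge` and `guestUp` are hypothetical and the real checker may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter = 30 * time.Minute // threshold from the text
	downAfter  = 2 * staleAfter   // "down at 2x", i.e. more than 1h
)

type Liveness string

const (
	LiveOK    Liveness = "ok"
	HostStale Liveness = "host_stale"
	HostDown  Liveness = "host_down"
)

// classifyAge maps the age of the latest host report to a liveness state.
func classifyAge(age time.Duration) Liveness {
	switch {
	case age > downAfter:
		return HostDown
	case age >= staleAfter:
		return HostStale
	default:
		return LiveOK
	}
}

// guestUp reads guest liveness from the per-guest status in the latest host
// report rather than from guest-report recency. A guest missing from the
// report counts as down.
func guestUp(perGuest map[string]bool, guestID string) bool {
	return perGuest[guestID]
}

func main() {
	last := time.Now().Add(-45 * time.Minute)
	fmt.Println(classifyAge(time.Since(last)))
	fmt.Println(guestUp(map[string]bool{"g1": true}, "g1"))
}
```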
 **Backup-deadline checker:** today it is *event-based* — it scans for `backup_completed`/`backup_failed`
 events since local midnight and alerts if none. Two changes: (1) **mechanism** — move it to a field
 check on `host_reports`' last-backup-per-target (cleaner now that backup state arrives in the host
 report); (2) **emitter** — the de-privileged controller no longer runs backups, so the **agent** is the
 source of the last-backup status (Part 3 §8). Without re-homing the source, the deadline check would go
 silent after the controller stops backing up.
 ## 5. Desired-state serving
 The operator's **intent** (§1 top-down) lives as JSON on `hosts`/`guests` (storage manifest +
 policies + golden version on the host; per-guest spec + versions on the guest) with a per-host
 `desired_generation`. The agent pulls its host's desired state on poll (with the generation, so it
 reconciles only on change and reports which generation it has converged to).
 - **Benign convergence** (create a guest, attach storage per policy, bump a version, adjust a
  non-destructive policy) → the agent reconciles freely.
 - **Destructive convergence** (guest removal = destroy, storage detach/wipe, data-losing resize) →
  the agent requires a **matching signed op** (§6) before executing that delta; absent/invalid → it
  refuses and reports `pending_signature`.
 **Geo is *not* in the agent's desired state** — it's customer→hub→Cloudflare (§7); the agent never
 touches WAF.
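
The generation check and the benign/destructive split can be sketched roughly as below. Several things are assumptions made for illustration: the delta kind names, the `signed` map standing in for ops that already passed the Part 4 verify pipeline, and the choice to hold the converged generation back while any delta awaits a signature.

```go
package main

import "fmt"

// Delta is one difference between desired and actual state.
type Delta struct {
	Kind   string
	Target string
}

// destructiveKinds lists hypothetical names for the destructive classes the
// text names: guest removal, storage detach/wipe, data-losing resize.
var destructiveKinds = map[string]bool{
	"guest_destroy":  true,
	"storage_detach": true,
	"storage_wipe":   true,
	"resize_shrink":  true,
}

type Outcome string

const (
	Applied          Outcome = "applied"
	PendingSignature Outcome = "pending_signature"
)

func opKey(d Delta) string { return d.Kind + ":" + d.Target }

// reconcile applies deltas only when the desired generation changed. Benign
// deltas are applied freely; a destructive delta is applied only if a
// matching verified signed op exists, otherwise it is reported as
// pending_signature and the converged generation does not advance.
func reconcile(converged, desired int64, deltas []Delta, signed map[string]bool) (int64, map[string]Outcome) {
	results := map[string]Outcome{}
	if desired == converged {
		return converged, results
	}
	blocked := false
	for _, d := range deltas {
		if destructiveKinds[d.Kind] && !signed[opKey(d)] {
			results[opKey(d)] = PendingSignature
			blocked = true
			continue
		}
		results[opKey(d)] = Applied
	}
	if blocked {
		return converged, results
	}
	return desired, results
}

func main() {
	deltas := []Delta{{Kind: "guest_create", Target: "g2"}, {Kind: "guest_destroy", Target: "g1"}}
	gen, res := reconcile(3, 4, deltas, nil)
	fmt.Println(gen, res)
}
```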
 ## 6. Authorization — signed-op queue + editing flow
 Implements Part 4's gate on the hub side. The hub holds **no signing key**.
 - **`signed_ops`** (new): `op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical
  JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed /
  failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result`.
 - **Editing flow:** the operator edits a customer's desired state, reusing the existing config-form +
  diff UX. Note the **transport inverts**: today's "Push" is a hub→box *inbound* POST (forbidden by the
  box-initiated model); here "publish" means **write to desired state, delivered on the next agent/
  controller poll**. The form and diff carry over; the push transport does not. The hub diffs vs current
  and **classifies each delta** (B1 rule):
  - **benign** → published straight to desired state;
  - **destructive** → the hub generates the canonical op blob and routes it through signing.
 - **Signing hand-off (Part 4 option (b)):** a local operator CLI (`felhom-sign --pending`) fetches
  the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the
  signature back into `signed_ops`. The hub never sees the key.
 - The agent polls `signed_ops` for its host alongside desired state, verifies (Part 4 pipeline),
  executes, and reports status → the hub logs to the existing **`events`** audit trail.
 - **Classification lives in both places, with different jobs:** the hub classifies at *edit time*
  for UX (prompt to sign); the **agent's classification is the authoritative guard** (a compromised
  hub could skip the prompt, but the agent still enforces the signature).
 - A **pending-ops view** per customer shows the lifecycle (awaiting signature → awaiting agent →
  executed).
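
The `signed_ops` status lifecycle reads naturally as a small state machine. The sketch below uses the status names from the table definition above; the exact set of allowed transitions is an assumption, since the text gives only the overall ordering.

```go
package main

import "fmt"

type Status string

const (
	PendingSignature Status = "pending_signature"
	Signed           Status = "signed"
	Delivered        Status = "delivered"
	Executed         Status = "executed"
	Failed           Status = "failed"
	Expired          Status = "expired"
	Rejected         Status = "rejected"
)

// allowed is an assumed transition table: an op can expire at any
// non-terminal stage, and only a delivered op can be executed, fail, or be
// rejected by the agent's verify pipeline.
var allowed = map[Status][]Status{
	PendingSignature: {Signed, Expired},
	Signed:           {Delivered, Expired},
	Delivered:        {Executed, Failed, Rejected, Expired},
}

// advance returns nil if the transition is allowed and an error otherwise.
// Terminal states have no entry in the table, so nothing leaves them.
func advance(from, to Status) error {
	for _, s := range allowed[from] {
		if s == to {
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	fmt.Println(advance(PendingSignature, Signed))
	fmt.Println(advance(PendingSignature, Executed))
}
```

Rejecting `pending_signature → executed` at the hub is bookkeeping only; as the classification bullet above says, the agent's own verification remains the authoritative guard.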
 ## 7. Geo enforcement (Part-2 S4)
 The hub already holds the CF API token and already has a remove-all path
 (`internal/web/configs.go` `handleGeoDisable` → `cloudflare.RemoveGeoRules`). **But the token is
 dual-purpose today** — DNS-01/ACME *and* WAF/geo — and `configgen.Generate` deep-merges it (via
 `config_json`) into the generated `controller.yaml`, so it currently ships **down to the box**. Two
 things follow:
 - **ACME assumption (must be stated, not skipped):** in the Cloudflare-Tunnel-default model the edge
  terminates TLS, so the box needs no public certificate and the **DNS-01/ACME use of the token goes
  away**. Granting that, the token comes fully off the box and lives hub-only. (If any box still does
  DNS-01, the token cannot fully come off — so this assumption is load-bearing.)
 - **`configgen` must stop emitting `cf_api_token`** into `controller.yaml` (drop it from the merge /
  relocate it to a hub-only field).
 The delta: the **customer sets geo in the controller UI → the controller reports the geo desired-state
 up → the hub reconciles it into the Cloudflare WAF** (rather than the box calling the CF API). The hub
 keeps the remove-all override for self-lockout. The controller no longer calls the CF API.
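
One way to picture the `configgen` change is a filter that drops `cf_api_token` from the merged config before `controller.yaml` is rendered. The sketch below is hypothetical: `stripKey` and `containsKey` are illustrative helpers, not functions from `configgen`, and because the nesting of the token inside `config_json` is not stated here, the key is removed at any depth.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripKey returns a copy of v with every occurrence of key removed from
// nested maps and slices. The input is left unmodified.
func stripKey(v any, key string) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if k == key {
				continue
			}
			out[k] = stripKey(val, key)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = stripKey(val, key)
		}
		return out
	default:
		return v
	}
}

// containsKey reports whether key appears anywhere in v.
func containsKey(v any, key string) bool {
	switch t := v.(type) {
	case map[string]any:
		for k, val := range t {
			if k == key || containsKey(val, key) {
				return true
			}
		}
	case []any:
		for _, val := range t {
			if containsKey(val, key) {
				return true
			}
		}
	}
	return false
}

func main() {
	// The nesting under "infrastructure" is an assumed example layout.
	raw := []byte(`{"infrastructure":{"cf_api_token":"secret","domain":"example.test"}}`)
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	out, err := json.Marshal(stripKey(cfg, "cf_api_token"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A guard like `containsKey` could also back a unit test asserting that no generated `controller.yaml` ever carries the token.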
 ## 8. Enrollment (evolution of the existing retrieval-password/config-gen flow)
 Today: `GET /config/{id}` with an `X-Retrieval-Password` (Hungarian passphrase) returns a deep-merged
 `controller.yaml`. New:
 - Enrollment mints the **agent identity first** (the agent then provisions controllers), pins the
  **operator signing public keys** (Part 4 — operational + cold recovery) onto the agent, and the
  agent mints each controller's bootstrap (its hub guest key + local-API token).
 - A **restore-mode** re-enrollment (§9) hands an existing identity to a fresh agent.
 The existing `configgen` deep-merge + Hungarian-passphrase machinery is the base; it grows the
 agent-first + key-pinning + restore-mode steps.
 ## 9. DR model
 The headline: the **old heavy infra-backup push retires** — not because the hub authors everything
 (§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams,
 and (b) the actual app **data + configs live inside the PBS guest snapshot**. So a separate
 config+secrets+restic-password infra-backup blob is redundant.
 What remains:
 - the **report streams** keep the hub's mirror current (storage layout + `durable_id`s, app inventory,
  snapshot pointers) — but this mirror is **convenience, not the DR source of record** (reports are
  pruned by age);
 - the agent **escrows the recovery-code-wrapped PBS key** to the hub (the one artifact only the box
  can produce — zero-knowledge: the hub stores it, cannot open it);
 - a **slim DR record** on the `hosts` row (PBS namespace + repo fingerprint + the wrapped escrow key).
  These last two are *box-reported* columns on an otherwise operator-intent row — labelled as such so
  the §1 two-driver split stays legible per column.
 Both existing infra-backup tables retire — `infra_backup_versions` (the current/live one, all readers
 hit it) **and** `infra_backups` (the deprecated legacy mirror). The slim DR record folds onto `hosts`
 instead. The **controller's infra-backup push is removed** (it's de-privileged).
 **Recovery (host loss):** the new agent re-enrolls in **restore mode**; the hub hands it the durable
 record — and DR reads from the **durable sources, not the prunable report mirror**: operator intent
 (desired-state on `hosts`/`guests` — identity, tunnel token, storage manifest), the slim DR record
 (PBS namespace + repo fingerprint), the **wrapped escrow key**, and **PBS's own snapshot enumeration**
 (the agent lists snapshots once it has the namespace + unwrapped key). Guest inventory + app data come
 from **inside the PBS guest snapshots**, not from a retained `host_report`, so recovery doesn't degrade
 when the last report has aged out. The **customer provides their recovery code at the agent**, which
 unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets
 identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium
 operator-managed custody tier avoids it, at the cost of the operator holding the key). The old
 controller-targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.
 ## 10. What persists from today (unchanged or lightly adapted)
 The Customer record (`customer_configs`); config generation/retrieval (`configgen`); the two-tier
 notification system (operator English / customer Hungarian, Resend, cooldowns); `events` + audit;
 `app_telemetry` / `app_log_issues`; customer lifecycle actions (block/unblock, trigger-update,
 delete); the asset manager; and the dashboard — adapted to render the **host + guests** view per
 customer instead of a single controller.
 ## 11. Schema deltas (grounded in store.go's idempotent style; clean cutover)
 - **NEW:** `hosts`, `guests`, `host_reports`, `signed_ops`.
 - **DROP `reports` + CREATE `guest_reports`** (under the clean cutover this is drop+create with no data
  migration, not an in-place rename); `guest_reports` adds `guest_id`, `host_id`; `cpu/memory` mean
  guest-level; `backup_last_snapshot` goes quiet.
 - **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`; the
  slim DR record (PBS namespace + repo fingerprint + wrapped escrow key) onto `hosts`.
 - **DROP both** `infra_backup_versions` (current/live) **and** `infra_backups` (legacy mirror) — the DR
  record replaces them on `hosts`.
 - **KEEP** `customer_configs`, `events`, `customer_notifications`, `notification_log`,
  `app_telemetry`, `app_log_issues`.
 - **Authz cleanup the cutover enables:** several endpoints today use global-or-any-customer-key auth
  rather than customer-scoped (the infra-backup GETs, `/notify`). Most retire with the infra-backup
  push; any that carry over should scope to the resolved host/guest → customer under §2.
 ## 12. Open items
 - Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the
  key custody/rotation tooling is Part 4's.
 - Multi-tenant resource fairness (deferred shared-host case).
 - Hub-side desired-state **editing UX** specifics (form/diff wiring) — to be grounded against
  `hub/internal/web/configs.go` at implementation.
 - Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).
@@ -0,0 +1,260 @@
 # Critical design review — Proxmox re-platform doc set
 > ✅ **RESOLVED (2026-06-08).** All findings folded into 01/02/03 + `proxmox-platform.md`
 > (Phase-3 spike run for B2/B3 → `tests/phase3-findings.md`). **Folded:** B1 (03 §4), B2
 > (03 §7/§8 + platform §4.7), B3 (03 §2/§3 + platform §3.6), S1 (03 §6/§8), S2 (03 §10/§11),
 > S3 (03 §7), S4 (01 §5/§7 + 02 + 03 §2), S5 (01 §7/§11 + 02 §6), S6 (02 §5), M1 (02 §3),
 > M2 (03 §7), M3 (03 §10), §6-residual (03 §6). Plus the two Phase-3 design updates:
 > provision-by-restore (03 §9) and the settled root-vs-API boundary (03 §3). **Deferred/none:**
 > no finding was deferred; the pre-existing open items (operator signing-key mechanics,
 > multi-tenant fairness, hub-side desired-state UX, golden-image refresh cadence) remain
 > flagged in 03 §13. This artifact can be deleted once confirmed.
 Working artifact. Review pass over `01-topology-and-trust.md`, `02-controller-module-map.md`,
 `03-host-agent.md`, `proxmox-platform.md`, and the Phase 0 / Phase 1-2 findings, grounded
 against the v0.33 source (`felhom-controller/controller/`). Every finding cites a
 file+line or a doc section. Severity: **blocking** / **should-fix** / **minor**.
 Two findings are self-corrections of my own earlier work (`02` and `proxmox-platform.md`) —
 flagged as such.
 ---
 ## Ranked summary
 | # | Severity | Finding | Where |
 |---|---|---|---|
 | B1 | **blocking** | Reversibility gate contradicts the self-heal reconcile loop — crashed-guest healing can require a signature-gated destroy → reconcile stalls | `03` §4 vs §4(a) |
 | B2 | **blocking** | vzdump bulk-exclusion only works for **volume** mount points; Docker **named volumes live in the LXC rootfs and ARE captured** → naive placement silently backs up the 1 TB media drive. Unvalidated by spike. | `03` §7 vs `proxmox-platform.md` §4.3 + pct manpage |
 | B3 | **blocking** | Agent's Proxmox role is called "the minimal role from Phase 1" — but that role is the *narrow self-backup* role that Phase 1 proved is **denied** create/allocate/restore. The agent's operator-tier role is undefined. | `03` §2/§3 vs `phase1-2` §1.3-1.4, `01` appendix |
 | S1 | should-fix | Quiescing for agent/hub-scheduled backups has **no agent→controller channel** — the local API is controller→agent only | `03` §6, §8 |
 | S2 | should-fix | Agent self-update revert authority unspecified — if the new binary won't boot, nothing outside it can flip back | `03` §11 |
 | S3 | should-fix | Storage manifest drops fields `settings.StoragePath` carries today (Label, Schedulable/default, StoppedStacks, MigratedTo) with no re-homing stated | `03` §7 vs `settings.go:90-103` |
 | S4 | should-fix | Geo-restriction WAF ownership + Cloudflare **API token** placement unspecified after tunnel placement was locked; zone-wide token in a guest is a blast-radius concern | `03` (absent), `01` §3, `config.go` InfrastructureConfig |
 | S5 | should-fix | Cross-doc staleness: `01` §11 still lists tunnel placement OPEN; `02` §6 lists geo "blocked on tunnel placement" — both resolved by `03` §13 | `01` §11, `02` §6 vs `03` §13 |
 | S6 | should-fix (self-correct) | `02` put self-restore-test **orchestration** in the controller; `03` correctly makes it agent-owned (controller only reads status) | `02` §5(3) vs `03` §6/§8 |
 | M1 | minor (self-correct) | `02` §3 lists `UUID` as a `settings.StoragePath` field — it isn't; UUID is derived from fstab at runtime | `02` §3 vs `settings.go:91-103` |
 | M2 | minor | `03` §7 says the manifest "absorbs the disk-state fields StoragePath carries today" incl. UUID — UUID isn't persisted today, so the manifest *adds* it (an improvement, not absorption) | `03` §7 |
 | M3 | minor | controller-update is not in `03` §10's journaled-ops list, though it's a multi-step async op | `03` §10 vs §11 |
 **Values check: clean.** No DR/key-custody/offboarding path leaves a customer locked out.
 Zero-knowledge DR (`03` §8, `01` §8) correctly makes the customer recovery code the
 irreducible residual; the operator cannot read data and the box can still restore-test.
 No hostage path found.
 **Locked premises:** reviewed for soundness/consistency only; not relitigated.
 ---
 ## Blocking findings
 ### B1 — The reversibility gate stalls the self-healing reconcile loop
 **Where:** `03` §4(a) vs the gate in §4.
 **What:** §4(a) lists "redeploy of a crashed controller" as benign convergence that "falls
 out of reconciliation for free." The gate then lists **guest destroy** among the
 irreversible ops that require an operator signature "*regardless of whether they arrive as a
 job or as a desired-state delta*." These collide: if healing a wedged guest requires
 destroy+recreate (corrupt rootfs, failed in-place restart, half-built guest from an
 interrupted provision), the reconciler hits a signature-gated op and **cannot proceed
 without an operator** — the loop either stalls or silently gives up, defeating "self-healing
 … tolerant of missed polls."
 **Why it matters:** This is the security-critical control model. A fuzzy benign/destructive
 line is unimplementable: either the reconciler can destroy (and a compromised hub's desired
 state can wipe guests — the exact threat §4 exists to stop), or it can't (and self-heal is a
 fiction for the crashed-guest case).
 **Grounding:** `03` §4 self-describes the gate as "security-critical"; §9/§10 already rely on
 the reconciler rolling back "a half-built guest" — which *is* a destroy of a customer-id-bound
 resource, contradicting the blanket "guest destroy needs a signature."
 **Suggested fix (crisp, implementable rule):** Scope the reconciler's destructive verbs by
 *provenance and data-bearing-ness*, not by verb:
 - The reconciler MAY, without a signature: (a) create/start/restart; (b) destroy resources it
  **created earlier in the same journaled transaction** (compensating rollback, §10); (c)
  destroy resources **tagged ephemeral/scratch** (restore-test scratch guests, §8).
 - Destroying or overwriting any resource that **holds the only/primary copy of customer data**
  always needs an operator signature.
 - **Healing a crashed controller is non-destructive by construction:** the controller is
  reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the
  LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. State
  this explicitly so the two clauses stop colliding. (The v0.33 self-heal precedent is already
  in-place restart: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
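
The suggested rule is small enough to state as a single predicate. The sketch below is one possible encoding, not the agent's code: the `Resource` fields and the `needsSignature` name are hypothetical, and the fail-closed default for a destroy that matches none of the listed exemptions is an assumption the text does not spell out.

```go
package main

import "fmt"

type Verb string

const (
	Create    Verb = "create"
	Start     Verb = "start"
	Restart   Verb = "restart"
	Destroy   Verb = "destroy"
	Overwrite Verb = "overwrite"
)

// Resource describes provenance and data-bearing-ness, the two properties
// the rule scopes on.
type Resource struct {
	CreatedInTxn     string // journal transaction that created it; "" if pre-existing
	Ephemeral        bool   // tagged ephemeral/scratch (e.g. restore-test guests)
	HoldsPrimaryData bool   // holds the only/primary copy of customer data
}

// needsSignature reports whether the reconciler must hold the op until an
// operator signature arrives.
func needsSignature(v Verb, r Resource, currentTxn string) bool {
	switch v {
	case Create, Start, Restart:
		return false // (a) never destructive
	}
	if r.HoldsPrimaryData {
		return true // customer data always needs a signature
	}
	if r.Ephemeral {
		return false // (c) scratch resources
	}
	if currentTxn != "" && r.CreatedInTxn == currentTxn {
		return false // (b) compensating rollback within the same transaction
	}
	return true // assumed default: fail closed
}

func main() {
	halfBuilt := Resource{CreatedInTxn: "txn-7"}
	fmt.Println(needsSignature(Destroy, halfBuilt, "txn-7"))
	fmt.Println(needsSignature(Destroy, Resource{HoldsPrimaryData: true}, "txn-7"))
}
```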
 ### B2 — vzdump bulk-exclusion: the rootfs-Docker-volume trap
 **Where:** `03` §7 ("Bulk external mounts are excluded from the guest's vzdump (a per-mount
 backup flag)").
 **What:** Two grounded problems:
 1. The flag is real but narrow. The pct manpage (verified): `backup=<boolean>` —
   *"Whether to include the mount point in backups (**only used for volume mount points**)."*
   It does **not** apply to bind mounts / device mounts (those are handled separately).
 2. The trap: `proxmox-platform.md` §4.3 (validated in `phase1-2` §2.2) proved that **Docker
   named volumes live inside the LXC rootfs and ARE captured by vzdump** — a sentinel in
   `pgdata` survived. The default Felhom app uses Docker named volumes. So unless bulk data is
   deliberately placed on a **dedicated Proxmox volume mount point** (backup=0) or a bind
   mount, a "bulk" volume will be an ordinary named volume in rootfs and will be **silently
   swept into the whole-guest image** — exactly the 1 TB-media-in-every-backup outcome §7 says
   it prevents.
 **Why it matters:** Backup size/cost and RPO blow up silently; the failure is invisible until
 a media drive fills the vzdump target. This is load-bearing for the §8 tier model.
 **Grounding:** pct manpage (fetched 2026); `proxmox-platform.md` §4.3; `phase1-2` §2.2.
 Not covered by any spike — `proxmox-platform.md` §6 "not yet validated" should gain this row.
 **Suggested fix:** Make the placement contract explicit: a `bulk` volume **must** be realized
 as a dedicated LXC mount point (volume mountpoint with `backup=0`, or an external bind mount),
 **never** a Docker named volume in rootfs. The per-volume placement component (`02` §5(2))
 must enforce this at deploy. Add a Phase-3 spike: create an LXC with a `backup=0` volume
 mountpoint + a bind mount, vzdump it, confirm both are excluded and the rootfs+`backup=1`
 volume are included.
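
The placement contract could be enforced at deploy time with a check along these lines. It is a sketch under assumptions: the `Volume` shape, the placement names, and the `bulk` class string are hypothetical, and it encodes only the rule stated above.

```go
package main

import "fmt"

type Placement string

const (
	NamedVolume      Placement = "docker_named_volume"   // lives in the LXC rootfs
	VolumeMountPoint Placement = "lxc_volume_mountpoint" // honours the backup flag
	BindMount        Placement = "bind_mount"            // the backup flag does not apply
)

type Volume struct {
	Name      string
	Class     string // e.g. "bulk"
	Placement Placement
	Backup    bool // the per-mount backup flag (volume mount points only)
}

// validatePlacement rejects any bulk volume that vzdump would capture: a
// Docker named volume in rootfs, or a volume mount point with backup enabled.
func validatePlacement(v Volume) error {
	if v.Class != "bulk" {
		return nil
	}
	switch v.Placement {
	case BindMount:
		return nil
	case VolumeMountPoint:
		if v.Backup {
			return fmt.Errorf("bulk volume %q: mount point must set backup=0", v.Name)
		}
		return nil
	default:
		return fmt.Errorf("bulk volume %q: placement %q would be captured in the guest image", v.Name, v.Placement)
	}
}

func main() {
	media := Volume{Name: "media", Class: "bulk", Placement: NamedVolume}
	fmt.Println(validatePlacement(media))
}
```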
 ### B3 — The agent's Proxmox role is mis-grounded as "the Phase-1 minimal role"
 **Where:** `03` §2 ("scoped Proxmox API token (minimal role from Phase 1)"), §3 ("the
 Phase-1 minimal role is the API floor").
 **What:** Phase 1's minimal role (`FelhomSelfBackup` = `VM.Audit, VM.Snapshot, VM.Backup,
 Datastore.AllocateSpace, Datastore.Audit`) is the **narrow self-backup** role scoped to one
 guest, and Phase 1 explicitly proved it is **denied (403)** on create/allocate
 (`phase1-2` §1.3 call #7) — i.e. exactly the operator-tier ops the agent's whole job consists
 of (provision, restore, storage allocation). Worse, `01` appendix states that guest-side role
 "**is not used** — we chose the agent-mediated path." So `03` cites, as the agent's role
 floor, a role that (a) the architecture discarded and (b) is provably insufficient for the
 agent.
 **Why it matters:** The agent's actual operator-tier role is **undefined**. Provisioning,
 restore, and storage management cannot be built or hardened against an undefined privilege
 set, and §3's root-minimization argument ("the Phase-1 minimal role is the API floor")
 collapses because that floor can't create a guest.
 **Grounding:** `phase1-2` §1.3 (create CT = 403), §1.4 (role = self-backup only); `01`
 appendix ("not used … confirmed restore = operator-tier"); `proxmox-platform.md` §3.4.
 **Suggested fix:** Replace the Phase-1 reference with a **new agent operator role** to be
 defined and least-privilege-tested in a Phase-3 spike: minimally `VM.Allocate`, `VM.Config.*`,
 `VM.PowerMgmt`, `VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, `VM.Audit`,
 `Datastore.Allocate`, `Datastore.AllocateSpace`,
 `Datastore.Audit`, plus whatever storage-attach needs (see S4/root-boundary below). Keep §3's
 "API token, not root, where the API suffices" principle — that part is sound — but stop
 calling it the Phase-1 role.
 ---
 ## Should-fix findings
 ### S1 — No agent→controller channel for backup quiescing
 **Where:** `03` §6 (local API is controller→agent only) vs §8 ("the controller stops the app
 stack … before a guest vzdump where app-consistency matters").
 **What:** App-consistent LXC backup requires the controller to quiesce (no fsfreeze for LXC —
 `proxmox-platform.md` §4.2, `phase1-2` §2.1). But the §6 surface is entirely controller→agent;
 the box-initiated model forbids the hub calling in, and there is no agent→controller call
 defined. For a **hub/agent-scheduled** backup (schedule lives in the manifest `policy`, §7),
 the agent has no way to tell the controller "quiesce now."
 **Why it matters:** Either scheduled backups silently fall back to crash-consistent (relying
 on WAL recovery, which `phase1-2` §3 warns is unvalidated under write load), or the feature
 can't be built as drawn.
 **Suggested fix:** Make backups **controller-driven for app-consistency**: the controller
 learns when a backup is due, and under which policy, via its own hub channel (or a
 `GET /backup/due` on the local API), quiesces, calls the existing `POST /backup`, then unquiesces
 on completion. Document that agent-initiated vzdump is crash-consistent only. (No inbound-to-guest
 channel is needed, which preserves §3/§5.)
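
The controller-driven sequence (quiesce, call the agent's backup endpoint, unquiesce) might look like the sketch below. The `Stack` and `AgentAPI` interfaces and the `recorder` fake are hypothetical; `Backup()` stands in for the controller's call to `POST /backup` on the local API. The point it illustrates is that unquiesce must run even when the backup fails.

```go
package main

import (
	"errors"
	"fmt"
)

// Stack is the controller's handle on the app stack.
type Stack interface {
	Quiesce() error
	Unquiesce() error
}

// AgentAPI stands in for the controller -> agent local API.
type AgentAPI interface {
	Backup() error
}

// runDueBackup quiesces the stack, asks the agent for a backup, and always
// unquiesces afterwards, whether or not the backup succeeded.
func runDueBackup(s Stack, a AgentAPI) (err error) {
	if err = s.Quiesce(); err != nil {
		return fmt.Errorf("quiesce: %w", err)
	}
	defer func() {
		if uerr := s.Unquiesce(); uerr != nil && err == nil {
			err = fmt.Errorf("unquiesce: %w", uerr)
		}
	}()
	if err = a.Backup(); err != nil {
		return fmt.Errorf("backup: %w", err)
	}
	return nil
}

// recorder is a fake that records call order for demonstration.
type recorder struct {
	calls      []string
	failBackup bool
}

func (r *recorder) Quiesce() error   { r.calls = append(r.calls, "quiesce"); return nil }
func (r *recorder) Unquiesce() error { r.calls = append(r.calls, "unquiesce"); return nil }
func (r *recorder) Backup() error {
	r.calls = append(r.calls, "backup")
	if r.failBackup {
		return errors.New("vzdump failed")
	}
	return nil
}

func main() {
	r := &recorder{failBackup: true}
	err := runDueBackup(r, r)
	fmt.Println(r.calls, err)
}
```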
 ### S2 — Agent self-update revert authority unspecified
 **Where:** `03` §11 ("a watchdog reverts to last-good if the new binary fails to come up
 healthy").
 **What:** The agent is a single host systemd service with `Restart=always` (§3). If the new
 binary crashes on startup, systemd just restarts the **same bad binary** in a loop. "Revert
 to last-good" cannot be done *by* the thing that won't boot. §11 doesn't name the actor.
 **Why it matters:** A bad self-update can brick the crown-jewel host agent — the one component
 that recovers everything else — with no automatic recovery, requiring break-glass.
 **Suggested fix:** Put revert authority **outside** the swapped binary: e.g. an A/B symlink
 (`current → good|new`) guarded by a separate systemd oneshot health gate (`ExecStartPost` probe; on
 failure it flips the symlink back and restarts the agent), or a tiny supervisor unit.
 Boot-into-last-good plus an explicit "commit" after a clean health window is the robust pattern.
 Add agent-update to the §10 journal so an interrupted swap is resumable.
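
A rough sketch of the A/B symlink flip, assuming a Linux host and a layout where `current`, `good`, and `new` sit in one directory. The function names and the layout are hypothetical; restarting the agent after a revert is left out. The symlink is replaced by creating a temporary link and renaming it over `current`, so there is no window in which `current` is missing.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pointCurrent repoints dir/current at target by creating a temporary
// symlink and renaming it over the old one.
func pointCurrent(dir, target string) error {
	tmp := filepath.Join(dir, "current.tmp")
	_ = os.Remove(tmp) // clear a leftover from an interrupted swap
	if err := os.Symlink(target, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "current"))
}

// healthGate is meant to run outside the swapped binary. If the probe
// passes, current stays on "new" (the commit); otherwise it flips back.
func healthGate(dir string, healthy func() bool) error {
	if healthy() {
		return nil
	}
	return pointCurrent(dir, "good")
}

// demo stages both binaries in a temp dir, swaps to "new", runs the gate,
// and returns where current points afterwards.
func demo(newIsHealthy bool) (string, error) {
	dir, err := os.MkdirTemp("", "agent-ab")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"good", "new"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(name), 0o755); err != nil {
			return "", err
		}
	}
	if err := pointCurrent(dir, "new"); err != nil {
		return "", err
	}
	if err := healthGate(dir, func() bool { return newIsHealthy }); err != nil {
		return "", err
	}
	return os.Readlink(filepath.Join(dir, "current"))
}

func main() {
	fmt.Println(demo(false))
	fmt.Println(demo(true))
}
```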
 ### S3 — Manifest schema omits live `StoragePath` fields without re-homing them
 **Where:** `03` §7 table vs `settings.go:90-103`.
 **What:** Today's `StoragePath` carries `Label`, `IsDefault`, `Schedulable`, `StoppedStacks`,
 `Decommissioned`/`DecommissionedAt`/`MigratedTo`. The manifest covers state (attached/
 disconnected/decommissioned) and durable_id, but drops: **Label** (human name, e.g. "Külső
 HDD 1TB" — UI), **Schedulable/IsDefault** (default placement target for new apps),
 **StoppedStacks** (which apps to restart on reconnect — app-domain), **MigratedTo** (decommission
 target pointer).
 **Why it matters:** `02` named this manifest as the contract that the `settings.StoragePath`
 reshape depends on. Silently dropped fields become lost behavior (no default-drive choice, no
 restart-after-reconnect list, no friendly labels).
 **Suggested fix:** Either add Label + a placement-default marker to the manifest, or explicitly
 state which fields re-home to the controller's `settings` (StoppedStacks and Label are
 plausibly controller-side; default/schedulable placement must live wherever placement decisions
 are made). Make the split explicit so neither side assumes the other owns it.
 ### S4 — Geo-WAF ownership + Cloudflare API token placement unspecified
 **Where:** `03` covers `cloudflared` (tunnel) health but is silent on geo-restriction WAF; `02`
 §6 had `cloudflare/`+`geo` "blocked on tunnel placement"; `01` §3 lists the controller's creds
 as "hub API key + local-API token" only.
 **What:** Now that tunnel placement is locked (host), the **geo-restriction WAF** management
 (`cloudflare/` package: zone/waf/geosync) still has no home. It requires a Cloudflare **API
 token** (`config.go` InfrastructureConfig.cf_api_token) with zone-wide WAF edit rights. If geo
 stays in the controller (app-domain, per `02`), a **zone-wide Cloudflare token sits inside the
 customer guest** — a real blast-radius concern (compromise → edit/disable WAF for the whole
 zone, potentially other customers on the same zone).
 **Why it matters:** Trust-boundary gap. `01` §5's boundary table has no row for controller↔
 Cloudflare-API. Unspecified ownership blocks the `02` geo classification from being unblocked.
 **Suggested fix:** Decide geo-WAF ownership explicitly and add it to `01` §5. Options: (a) move
 WAF management to the **agent/hub** (operator-tier, token off the customer box); (b) keep it in
 the controller but scope the CF token per-zone/per-customer if the account model allows. Note
 this is now *unblocked* by the tunnel decision and should leave `02` §6's "blocked" state.
 ### S5 — Cross-doc staleness on the now-locked tunnel placement
 **Where:** `01` §11 ("Cloudflare Tunnel placement: host vs guest (§7)") and `02` §6
 ("`cloudflare/` + `api/geo.go` — blocked on tunnel placement") vs `03` §13 ("Resolved here:
 tunnel placement (host, agent-managed)") and the LOCKED list.
 **What:** `01` and `02` still present as OPEN/blocked a decision `03` and the locked set have
 resolved.
 **Why it matters:** A dev reading `01`/`02` would treat a settled decision as open (or a
 classification as blocked when only geo-ownership, S4, actually remains).
 **Suggested fix:** When folding this review in: update `01` §7/§11 to record tunnel=host
 (agent-managed systemd service); update `02` §6 to reduce the cloudflare item from "blocked on
 tunnel placement" to the narrower "blocked on geo-WAF ownership (S4)."
 ### S6 — (self-correction) self-restore-test orchestration belongs to the agent, not the controller
 **Where:** `02` §5(3) said "Self-restore-test orchestration — *controller* asks the agent to
 restore to scratch guest, validates, reports." `03` §8 makes the **agent** drive it
 autonomously; §6 gives the controller only `GET /restore-test/status` (read-only).
 **What:** `03` is right and `02` overreached. Zero-knowledge means only the box/agent holds the
 PBS key (`03` §8); creating a scratch guest is operator-tier (create/allocate — `phase1-2`
 §1.3 #7); the controller cannot do either. The controller's only piece is surfacing status.
 **Why it matters:** Keeps the NEW-component list honest — this is not a controller component to
 build beyond a status read.
 **Suggested fix:** Amend `02` §5(3) to "self-restore-test **status display** (read-only); the
 agent owns orchestration."
 ---
 ## Minor findings
 - **M1 (self-correction):** `02` §3 lists `UUID` among `settings.StoragePath` fields. It is
  **not** there (`settings.go:91-103`: Path, Label, IsDefault, Schedulable, AddedAt,
  Disconnected/At, StoppedStacks, Decommissioned/At, MigratedTo). UUID is derived at runtime
  from fstab / `/host-dev/disk/by-uuid` by `system.ParseFstabUUID` and `watchdog.go`. The
  classification (settings = MODIFY/split) is unaffected; the field list was wrong.
 - **M2:** Consequently `03` §7's "absorbs the disk-state fields `settings.StoragePath` carries
  today" overstates: `durable_id`/UUID is *not* carried today, so the manifest **adds** durable
  identity (a genuine improvement — today the controller re-derives UUID from fstab each boot,
  which is fragile). Reword "absorbs" → "absorbs + adds durable_id."
 - **M3:** `03` §10 journals "provision, restore" but not **controller-update** (§11), which is
  also a multi-step async op (snapshot→pull→redeploy→health→rollback). Add it so that an agent crash
  mid-controller-update is handled by resume-or-rollback, like the others.
 ---
 ## Verified-correct (no action) — grounding that held up
 - LXC flags `nesting=1,keyctl=1` + overlayfs (`03` §9) match `proxmox-platform.md` §2.3 /
  `phase0` §3. ✓
 - async `task exitstatus`, not POST return (`03` §8) matches `proxmox-platform.md` §3.5. ✓
 - stop-mode backup not requiring `VM.PowerMgmt` (`03` §8 "per Phase 1") matches
  `proxmox-platform.md` §3.4. ✓ (applies to the agent role too.)
 - running-LXC snapshot on LVM-thin (`03` §6/§8/§11) matches `proxmox-platform.md` §4.5 /
  `phase1-2` §1.6. ✓
 - `monitor/pinger.go` deprecation (`02` DELETE-obsolete) confirmed in `main.go:168,175`
  ("legacy, will be removed" / "no longer used — monitoring is now handled by the Hub"). ✓
 - backup keep/delete **intra-file tear** (`02` hazard) confirmed: `backup.go` holds both
  `RunDBDumps`/`DumpAppVolumes(Safe)` (keep) and `RunBackup`/`RunFullBackup` (restic, delete);
  `restore.go` holds `RestoreApp` (restic) + `RestoreAppFromTier2` (app). The §7-8 backup
  contract gives the extracted app-data-backup package a coherent destination. ✓
 - Control-plane-not-data-plane (`03` §2/§43): apps keep serving if the agent dies — consistent
  with Docker-in-LXC running independently (`phase0` §3). ✓
 - §6 per-guest local-API authorization (token→guest map): sound; a leaked token acts only on
  its own guest. Residual: a compromised controller can `POST /rollback` its **own** guest
  (blast radius = self) — acceptable per design; worth a one-line note that rollback is
  self-scoped and bounded.
@@ -0,0 +1,221 @@
 # `05-hub-architecture.md` — critical review (grounded against felhom-hub v0.6.3 source + Parts 01–04)
 Method: every claim about the existing hub was checked against `felhom.eu/hub/` source; every
 cross-doc claim against Parts 01/03/04. Citations are `file:line`. Severity: **blocking** (wrong /
 breaks an assumption) · **should-fix** (real gap or contradiction, low blast) · **minor**.
 The two highest-value catches (doc assumes something the code contradicts) are **S1** and **S2**.
 ---
 ## Ranked summary
 | # | What | Where (doc → code) | Severity |
 |---|---|---|---|
 | S1 | §9/§11 name the **wrong infra-backup table as current** — `infra_backup_versions` is the live/primary one; `infra_backups` is the deprecated write-only mirror | 05 §9/§11 → `store.go:198-217,541-578` | should-fix (code-contradiction) |
 | S2 | §7 treats the CF token as **geo-only**; it is **dual-purpose (DNS-01/ACME + WAF)** and is injected into the generated `controller.yaml` | 05 §7 → `config_form.html:76-80`, `controller.yaml.default:26`, `configgen.go:28-37`, `configs.go:1041` | should-fix (code-contradiction / unverified assumption) |
 | S3 | §6 leans on the existing **"Push"**, but that is a hub→box **inbound** POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → `configs.go:569-570,1148-1150`; Part 1 §4/§5/§11; Part 3 §5 | should-fix |
 | S4 | Part 1 §6 calls app inventory **"declarative"**; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) |
 | S5 | §9 hands "guest inventory + snapshots" **from the prunable report mirror**; DR soundness actually rests on durable sources | 05 §9/§3 → `store.go:809-816` | should-fix (DR robustness) |
 | S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is **event-based** and controller-emitted | 05 §4 → `deadline.go:31-86` | should-fix (mechanism) |
 | M1 | "60s staleness checker" conflates the 60s **cadence** with the 30m/1h **threshold** | 05 §4 → `main.go:207-217,99-102`, `staleness.go:33-37` | minor |
 | M2 | §2 `customer_configs` field list omits `api_key` — the very field the per-reporter plan retires | 05 §2 → `store.go:102-112` | minor |
 | M3 | §11 `reports`→`guest_reports` "rename" is really drop+create under the locked clean cutover | 05 §11 → `store.go:55-119` | minor |
 | M4 | Pre-existing weak authz on infra-backup GET / `/notify` (any valid key, not customer-scoped) | handler.go:407,536,568,596 | minor |
 No **blocking** findings — the data model and the two-driver framing are sound, and the LOCKED clean
 cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc
 drives work.
 ---
 ## Highest-value: doc assumes something the code contradicts
 ### S1 — `infra_backups` vs `infra_backup_versions` is inverted (should-fix, code-contradiction)
 05 §9: *"`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record."*
 §11 repeats: *"RETIRE `infra_backup_versions`; repurpose `infra_backups`."*
 The code is the other way round:
 - `infra_backup_versions` (added v0.7.0, `store.go:198-211`) is the **live/primary** table. **Every read
  path hits it**: `GetInfraBackup` (`store.go:565-578`), `GetInfraBackupByID` (`store.go:581-593`),
  `GetInfraBackupMeta` (`store.go:604`), `ListInfraBackupVersions` (`store.go:640`), and the recovery
  endpoint (`handler.go:670-686`).
 - `infra_backups` (original single-row, `store.go:96-100`) is **deprecated**. It is now **written only
  as a legacy mirror** ("for backward compatibility during rollback window", `store.go:552-558`) and is
  **never read** except as the one-time migration *source* (`store.go:214-217`).
 So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean
 cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11
 literally would point the DR record at the wrong table.
 **Fix:** take §11's own alternative — *fold the slim DR record onto `hosts`* and **drop both**
 infra-backup tables. If a standalone table is kept, base it on `infra_backup_versions` (the one with the
 data/readers), and correct the "which is current" framing.
 ### S2 — the CF API token is **not** geo-only; it is the ACME token too, and ships into `controller.yaml` (should-fix, code-contradiction)
 05 §7: *"The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than
 pushing the token down to the controller… The controller no longer calls the CF API."*
 Grounding confirms the hub **does** hold the token and **does** have a remove-all path:
 `config_json → infrastructure.cf_api_token` (`configs.go:714-715,1041-1042,1089-1096`) →
 `cfClient.RemoveGeoRules(cfToken, cfg.Domain, …)` in `handleGeoDisable` (`configs.go:1112`), route
 `/customers/{id}/geo/disable` (`server.go:201-205`). ✓ The §7 framing of geo-enforcement-moves-to-hub
 is also consistent with Part 1 §5/§7 and Part 3 §2/§46.
 **But the doc's assumption that the token is *for geo* is contradicted by the code:** the same
 `cf_api_token` is **dual-purpose** —
 - the config-form hint says **"Zone DNS:Edit (ACME), Zone WAF:Edit (geo)"** (`config_form.html:80`),
 - `controller.yaml.default:26` documents it as the **"Cloudflare API token (DNS-01 challenge)"**,
 - and it is **deep-merged into the generated `controller.yaml`** via `configgen.Generate` (config_json
  overrides, `configgen.go:28-37`), i.e. **today it is shipped down to the box** and served at
  `/config/{id}` and `/recovery/{id}`.
 Consequences §7 must address:
 1. **"Token off the controller" is incomplete** if the box still does DNS-01/ACME. In the CF-Tunnel
   model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot —
   but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token
   cannot fully come off the box.
 2. **`configgen` must stop emitting `cf_api_token` into `controller.yaml`** (or relocate it to a
   hub-only field). As written, the generated config still carries it.
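 A minimal sketch of consequence 2, written in Python rather than the hub's Go: merge the overrides onto the defaults, then strip hub-only secrets before the config is emitted. `deep_merge` only approximates what `configgen.deepMerge` is described as doing; `HUB_ONLY_PATHS`, `strip_paths`, and `generate_controller_config` are hypothetical names, not hub code. Stripping after the merge (instead of filtering `config_json` first) also removes a token that arrives via the defaults.

```python
from copy import deepcopy

# Hypothetical: dotted paths that must never leave the hub.
HUB_ONLY_PATHS = ["infrastructure.cf_api_token"]


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict: nested dicts merge, all other values are replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def strip_paths(config: dict, dotted_paths: list) -> dict:
    """Return a copy of config with each dotted path removed, if present."""
    result = deepcopy(config)
    for dotted in dotted_paths:
        *parents, leaf = dotted.split(".")
        node = result
        for part in parents:
            node = node.get(part)
            if not isinstance(node, dict):
                break  # parent section missing: nothing to strip
        else:
            node.pop(leaf, None)
    return result


def generate_controller_config(defaults: dict, config_json: dict) -> dict:
    """Merge overrides onto defaults, then drop hub-only secrets before emitting."""
    return strip_paths(deep_merge(defaults, config_json), HUB_ONLY_PATHS)
```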
 ---
 ## Should-fix
 ### S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids
 05 §6: *"the operator edits a customer's desired state (building on the existing config-form +
 Push/Pull/Diff)."* The form + diff/pull/push handlers exist — `handlePushConfig` (`configs.go:569`),
 `handlePullConfig` (`configs.go:952`), `handleConfigDiff` (`configs.go:861`), routes at
 `server.go:209-229`. ✓ So the UI base is real.
 The wrinkle: **"Push" today is a POST from the hub into the controller** (`handlePushConfig` "sends the generated
 YAML config to the controller", `configs.go:569-570`), as is the geo-disable notify
 (`notifyControllerGeoDisable` → `POST controllerURL/api/geo/settings`, `configs.go:1148-1153`). Both are
 the hub **connecting into the box** — explicitly disallowed by the box-initiated model (Part 1 §4
 "the hub never initiates inbound"; §5 row `agent↔hub`/`controller↔hub` = outbound poll; Part 3 §5 "The
 hub never connects inbound"). 05's own §5 already resolves this (desired state is **pulled** on poll
 with a `desired_generation`). So the doc is internally consistent in *mechanism* but loose in *wording*:
 **make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller
 poll," not a reuse of the inbound push transport.** The form/diff UX carries over; the transport inverts.
 (Same applies to the geo-disable controller-notify path.)
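 The inversion can be sketched as follows, in Python rather than the hub's Go. It is an in-memory illustration under stated assumptions, not the hub's schema or API: `DesiredStateStore`, `publish`, and `poll` are hypothetical names, and only `desired_generation` comes from the doc. The point is that the operator's action writes intent and makes no outbound call; the box learns of it on its next poll.

```python
from dataclasses import dataclass, field


@dataclass
class DesiredState:
    generation: int = 0
    spec: dict = field(default_factory=dict)


class DesiredStateStore:
    """In-memory stand-in for a per-host desired-state row (hypothetical)."""

    def __init__(self) -> None:
        self._by_host = {}

    def publish(self, host_id: str, spec: dict) -> int:
        """Operator action: record new intent and bump the generation.

        Nothing is sent to the box here; delivery happens in poll().
        """
        current = self._by_host.get(host_id, DesiredState())
        updated = DesiredState(generation=current.generation + 1, spec=dict(spec))
        self._by_host[host_id] = updated
        return updated.generation

    def poll(self, host_id: str, applied_generation: int) -> dict:
        """Box-initiated request: return the spec only if the box is behind."""
        current = self._by_host.get(host_id, DesiredState())
        if current.generation > applied_generation:
            return {"desired_generation": current.generation, "spec": dict(current.spec)}
        return {"desired_generation": current.generation, "spec": None}
```

 In this sketch a box that reports `applied_generation` equal to the current generation gets `spec: None`, so an unchanged config costs one small response per poll.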
 ### S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)
 Part 1 §6 lists the durable record as including a **"declarative app inventory"** that survives box loss
 — wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is
 explicit the opposite way: *"Apps never enter the reconcile loop… the hub only mirrors the resulting
 inventory… the app/customer layer is mirrored,"* and 05 §9 restores apps **from the PBS guest snapshot**,
 not by re-deploying a declared inventory. These are reconcilable (the mirror *is* durable last-known
 truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path.
 **Fix (align the older doc to the locked model):** in Part 1 §6 change "declarative app inventory" →
 "mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not
 re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)
 ### S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources
 05 §9 hands the recovering agent *"identity, tunnel token, storage manifest, PBS namespace, guest
 inventory + snapshots."* §3 places "guest inventory" and "latest PBS snapshot pointers" in
 `host_reports` — the bottom-up mirror. But reports are **pruned** (`Prune` deletes rows older than
 `maxDays`, `store.go:809-816`; the doc keeps this), so after a long pre-DR outage the last `host_report`
 can be gone or stale. The actually-durable DR inputs are: desired-state on `hosts`/`guests` (§5), the
 slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and **PBS's own snapshot
 enumeration** (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored
 inventory/pointers are convenience, not the source of record.
 **Fix:** state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), **not**
 from prunable `host_reports`, so recovery doesn't degrade when the last report has aged out. This also
 keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained.
 (Note: the `hosts` row legitimately mixes top-down intent columns with a few box-reported columns —
 repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split
 stays legible at the column level.)
 ### S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes)
 05 §4: *"The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target."* The
 existing checker is **event-based**, not field-based: `CheckBackupDeadlines` looks for
 `backup_completed` / `backup_failed` (and `db_dump_*`) **events** since Budapest midnight and emits
 `expected_backup_missed` if neither is present (`deadline.go:31-86`). Two changes the doc should make
 explicit:
 1. **Mechanism:** either keep it event-based (someone emits `backup_completed`) or genuinely move it to
   a `host_reports.last_backup_per_target` field check — the doc says the latter but the impl is the
   former.
 2. **Emitter:** today the **controller** emits backup events; in the de-privileged model the **agent**
   owns backup/PBS (Part 3 §8), so the agent must now emit `backup_completed`/`backup_failed` (or the
   host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes
   silent after the controller stops doing backups.
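 The two mechanisms in point 1 differ as sketched here, in Python rather than the hub's Go. This is an illustration of the distinction, not a port of `CheckBackupDeadlines`: the function names and the event and field shapes are assumptions, the `db_dump_*` events are left out, and the caller is assumed to pass the cutoff (for example, local midnight) as `since`. The event-based form goes quiet if nobody emits events; the field-based form depends on the host report carrying a timestamp per target.

```python
from datetime import datetime

BACKUP_EVENT_TYPES = {"backup_completed", "backup_failed"}


def missed_by_events(events: list, since: datetime) -> bool:
    """Event-based: missed if no backup event was recorded at or after `since`.

    Each event is assumed to be a dict with "type" and "at" keys.
    """
    return not any(
        e["type"] in BACKUP_EVENT_TYPES and e["at"] >= since for e in events
    )


def missed_by_field(last_backup_per_target: dict, since: datetime) -> list:
    """Field-based: return the targets whose last reported backup predates `since`."""
    return sorted(
        target for target, at in last_backup_per_target.items() if at < since
    )
```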
 ---
 ## Minor
 - **M1 — "60s staleness checker" (§4).** 60s is the **check cadence** (`main.go:207-217`,
  `ticker := time.NewTicker(60 * time.Second)`); the **staleness threshold** is 30m (default,
  `main.go:99-102`) with down at 2× = 60m (`staleness.go:33-37`; CLAUDE.md "OK <30m, DOWN >1h"). The
  event-transition mechanism (`node_stale`/`node_down`/`node_recovered`) is described correctly
  (`staleness.go:155-185`). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)."
 - **M2 — `customer_configs` fields (§2).** The list ("identity, domain, email, retrieval_password,
  status, config_json") omits **`api_key`** (`store.go:108`) — the field §2's per-reporter plan
  actually retires. Worth noting `customer_configs.api_key` becomes unused once auth resolves via
  `hosts.api_key` / `guests.api_key`.
 - **M3 — rename under clean cutover (§11).** `migrate()` is all `CREATE TABLE IF NOT EXISTS` +
  idempotent `ALTER` (`store.go:55-119,146-149`). §11's claim "grounded in store.go's idempotent style"
  is accurate. But a `reports`→`guest_reports` **rename** isn't part of that style; under the LOCKED
  clean cutover (demo re-enrolls fresh, §2) it is really **drop `reports` + create `guest_reports`**
  with no data migration. Name it as such to avoid implying an in-place rename + backfill.
 - **M4 — pre-existing weak authz.** `handleInfraBackupGet`/`Versions` and `handleNotify`/
  `handleSavePreferences`/`handleInfraBackupPush` use `checkAuth` (global **or any** customer key,
  `handler.go:63-66`), not customer-scoped `checkAuthCustomer`. Most retire with the infra-backup push
  (§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved
  host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.
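 To make M1's cadence-versus-threshold distinction concrete, here is a small Python sketch (not the hub's Go code). The 60s, 30m, and 2× values come from the citations in M1; the function names, the strict `>` comparisons at the boundaries, and the exact transition rules are assumptions for illustration only.

```python
from datetime import timedelta
from typing import Optional

CHECK_CADENCE = timedelta(seconds=60)    # how often the checker runs
STALE_THRESHOLD = timedelta(minutes=30)  # default staleness threshold
DOWN_THRESHOLD = 2 * STALE_THRESHOLD     # 60 minutes


def classify(age: timedelta) -> str:
    """Map time since the last report to a state (boundary handling assumed)."""
    if age > DOWN_THRESHOLD:
        return "down"
    if age > STALE_THRESHOLD:
        return "stale"
    return "ok"


def transition_event(previous: str, current: str) -> Optional[str]:
    """Return the event to emit when the state changes, or None if unchanged."""
    if previous == current:
        return None
    if current == "ok":
        return "node_recovered"
    return "node_" + current
```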
 ---
 ## Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)
 - **§10 KEEP list** matches the schema exactly: `customer_configs`, `events`, `customer_notifications`,
  `notification_log`, `app_telemetry`, `app_log_issues` all present (`store.go:74-189,102-135`). The
  asset manager exists (`handler.go:57,834-867`). ✓
 - **§10 two-tier notifications** (operator English / customer Hungarian, Resend, cooldowns) match
  `notify/dispatcher.go`: `processOperator` (1h cooldown, `FormatOperatorEmail`, gated by `operatorOn`,
  `dispatcher.go:91-114`) + `processCustomer` (prefs-driven, default 6h, `FormatCustomerEmail`,
  `dispatcher.go:116-158`); wired in `main.go:134`. ✓
 - **§8 enrollment / §11 configgen** — deep-merge + Hungarian passphrase base is real:
  `configgen.deepMerge` (`configgen.go:76-91`), programmatic overrides + `hub.api_key = cfg.APIKey`
  (`configgen.go:40-47`), retrieval-password gate (`handler.go:709-753`). The evolution to agent-first +
  per-guest keys + key-pinning is a clean extension. ✓
 - **§2 auth extension** (Bearer → reporter → customer) is clean against today's
  `checkAuthCustomer` (global key, else `GetCustomerConfigByAPIKey`, `handler.go:72-90`,
  `store.go:913-935`); adding host/guest key lookups slots straight in. ✓
 - **§11 "idempotent style"** is accurate (`store.go:55-119`). New tables/columns (`hosts`, `guests`,
  `host_reports`, `signed_ops`, `desired_generation`, `desired_spec_json`) follow the existing
  `CREATE IF NOT EXISTS` / `ALTER … ` pattern cleanly.
 - **§9 escrow/custody** is consistent with Part 1 §8 (three-tier custody, zero-knowledge default,
  recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for
  backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the
  irreducible residual; operator-managed tier avoids it" matches Part 1 §8 in spirit. ✓
 - **§4 dead-man's-switch** (host-report recency = primary liveness) is consistent with Part 3 §5
  ("the heartbeat *is* the liveness signal… first-class treatment hub-side"). ✓
 - **§5/§6 signed-op + desired-state** are consistent with Part 4 and Part 3 §4:
  hub holds **no** signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing
  key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6
  "the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only.
  05 §6's `signed_ops` columns are a consistent superset of Part 4 §2.1's blob
  `{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}` (05 adds hub-side
  lifecycle states `delivered`/`rejected` — fine). The local-CLI hand-off (`felhom-sign --pending`)
  matches Part 4 §5–6's `Signer`-on-the-workstation model. ✓
 ## Two-driver soundness (axis 3) — holds
 No place in 05 has the hub **drive** box/customer-owned state. Desired-state (§5) is all infrastructure
 intent (guests, storage *policy*, versions, identity, tunnel) — top-down and legitimate. Apps are
 explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect →
 assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the
 `hosts` row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow
 key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage
 reality because the reality columns (`durable_id` attached, snapshot pointers, app inventory) are
 mirror-only and never serve as desired state.
 ## DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification
 Retiring the controller's infra-backup push is safe **given** that DR reads from durable sources, not
 the prunable mirror (S5). What the old push carried — `deployed_stacks` + `disk_layout.mounts`
 (`store.go:768-795`, surfaced by `handleRecovery`, `handler.go:620-705`) — is reconstructible:
 storage layout/`durable_id`s from the storage manifest (desired-state, durable) + host-report mirror;
 app inventory from the guest **inside the PBS snapshot** (so it need not be separately stored); snapshot
 list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is
 explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing
 DR-essential is lost by removing the push **provided** §9 is amended per S5 to name durable sources and
 not lean on `host_reports` retention.
@@ -0,0 +1,385 @@
 # Proxmox Platform Reference
 Authoritative, living reference for the Proxmox platform underneath `felhom-agent`.
 It records **facts about Proxmox and what we validated about it** — not Felhom design
 decisions. Where a design choice exists, this doc points to the (future) controller
 architecture document rather than making the choice here.
 **Evidence base** (raw, chronological spike logs — kept as the underlying record):
 - [tests/phase0-findings.md](tests/phase0-findings.md) — VM-vs-LXC overhead, Docker-in-LXC viability
 - [tests/phase1-2-findings.md](tests/phase1-2-findings.md) — privilege model, backup/restore round-trip
 - [tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md](tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md) — **superseded** pre-spike reference (contains a known privsep error; do not cite as authoritative)
 Every nontrivial claim links to its evidence section. Validated on a single host
 (`demo-felhom`, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and
 measurements as indicative, not benchmarks.
 ---
 ## 1. Platform baseline
 Validated stack [[phase0 §1](tests/phase0-findings.md)]:
 | Component | Version |
 |---|---|
 | Proxmox VE (`pve-manager`) | **9.2.2** (`b9984c6d90a4bd80`) |
 | OS | Debian 13 (Trixie) |
 | Kernel | proxmox-kernel **7.0.2-6-pve** |
 | `pve-qemu-kvm` | 11.0.0-3 |
 | `qemu-server` | 9.1.15 |
 | `pve-container` | 6.1.10 |
 | `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 |
 | `criu` | 4.1.1-1 |
 `pvesh get /version` → release 9.2. Always confirm the node name on the box
 (`pvesh get /nodes`) rather than hard-coding it.
 ### 1.1 Storage backends
 Two backends were present and exercised [[phase0 §1](tests/phase0-findings.md), [phase1-2 §pre-flight](tests/phase1-2-findings.md)]:
 | Storage | Type | Path / VG | Content types | Holds |
 |---|---|---|---|---|
 | `local` | `dir` | `/var/lib/vz` | `iso, vztmpl, backup, import` | ISOs, CT templates, **vzdump archives** |
 | `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | `rootdir, images` | guest disk volumes |
 **Why backups cannot live on LVM-thin:** LVM-thin is a *block* backend — it allocates
 logical volumes for guest disks. Backup archives and templates are *files*, which require a
 file-level backend (`dir`, NFS, CIFS, or PBS). A `vzdump` target must therefore be a
 storage whose content types include `backup` (here, `local`); pointing `vzdump` at
 `local-lvm` is not valid. [[phase1-2 §pre-flight / §2.1](tests/phase1-2-findings.md)]
 ### 1.2 Repositories
 PVE 9 uses **deb822** `.sources` files under `/etc/apt/sources.list.d/`. For a host
 without a subscription, the enterprise repos (`pve-enterprise.sources`,
 `ceph-*-enterprise.sources`) must be disabled (they return 401) and a no-subscription repo
 enabled. *The spike host arrived with the no-subscription repo already configured and the
 host updated [[phase0 baseline](tests/phase0-findings.md)]; the repo setup itself was not a
 spike deliverable* — the canonical no-subscription `.sources` is the standard Proxmox 9
 procedure (`/etc/apt/sources.list.d/pve-no-subscription.sources` with
 `Components: pve-no-subscription`). Treat the exact commands as standard setup, not
 spike-validated.
 **Docker repository (validated):** Docker's official apt repo **has a `trixie` channel**;
 no fallback to Debian's `docker.io` was needed. Installed Docker **29.5.3** from it in both
 guest types. [[phase0 §1](tests/phase0-findings.md)]
 ---
 ## 2. Guest model (LXC vs VM) — validated facts
 Both guest types ran the **identical** workload (Debian 13, Docker 29.5.3, a
 postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB)
 [[phase0](tests/phase0-findings.md)].
 ### 2.1 Isolation characteristic (fact, not recommendation)
 - **LXC** is an OS-level container: it **shares the host kernel**. Docker-in-LXC needs the
  container configured for nesting (see §2.3).
 - **VM** runs its **own guest kernel** under KVM/QEMU, with full hardware-level isolation
  and its own firmware.
 The trade-offs below follow directly from this difference.
 ### 2.2 Resource overhead (measured)
 Host RAM used = `MemTotal − MemAvailable`; deltas are relative to a baseline of 1702 MB with both
 guests stopped, and one guest was measured at a time [[phase0 §2](tests/phase0-findings.md)]:
 | Metric | LXC | VM | Note |
 |---|---|---|---|
 | Idle host-RAM delta | **+211 MB** | **+2056 MB** | structural, see below |
 | Under-load host-RAM delta | **+410 MB** | **+2084 MB** | |
 | Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | |
 | Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
 | Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as `%guest` (31.9 %) |
 | pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
 | Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
 | Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |
 ¹ `cgroup memory.current` counts reclaimable page cache shared with the host and
 **overstates** the LXC's true incremental cost; the +211 MB host delta is the honest
 number [[phase0 §4.4](tests/phase0-findings.md)].
 **Why the RAM gap is structural** [[phase0 §4.3](tests/phase0-findings.md)]: LXC processes
 share the host kernel and page cache, so only the working set counts against the host. A VM
 with **no ballooning configured** has KVM back every guest-touched page (including the
 guest's own page cache), so its host cost ≈ the full RAM allocation and is largely
 load-independent. *Ballooning / KSM were not tested* and could change the VM figure.
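 For reference, the host-RAM metric used in this section (`MemTotal − MemAvailable`, then a delta against the 1702 MB baseline) can be computed from `/proc/meminfo` as sketched here. This is not the script the spike used; the function names are hypothetical, and dividing the kB values by 1024 is an assumption about the unit behind the "MB" figures.

```python
def host_ram_used_mb(meminfo_text: str) -> float:
    """Return MemTotal - MemAvailable from /proc/meminfo content, in units of 1024 kB."""
    values_kb = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values_kb[name.strip()] = int(fields[0])
    return (values_kb["MemTotal"] - values_kb["MemAvailable"]) / 1024


def delta_vs_baseline(used_mb: float, baseline_mb: float = 1702.0) -> float:
    """Return the increase over the baseline measured with both guests stopped."""
    return used_mb - baseline_mb
```

 On a live host the input would be the content of `/proc/meminfo`, read once with the guest stopped and once with it running.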
 ### 2.3 Docker-in-LXC viability (validated)
 Docker ran **cleanly in an *unprivileged* LXC** configured with
 `--features nesting=1,keyctl=1 --unprivileged 1` (PVE 9 syntax, accepted by `pct create`)
 [[phase0 §3](tests/phase0-findings.md)]:
 - `docker run hello-world` → success; full 3-container stack healthy.
 - **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
  fallback**. (Docker 29 names the overlay driver `overlayfs` via the containerd
  snapshotter image store; same overlay technology as the legacy `overlay2`.)
 - Named volume persisted writes; multi-container networking + published port worked
  (`curl localhost:8080` → 200); 0 failed transactions under load.
 - No privileged-container fallback was needed.
 ### 2.4 Guest agent & app-consistency capability
 - **VM:** `qemu-guest-agent` installs and reports (`agent: 1`), enabling
  `guest-fsfreeze`-based app-consistent `snapshot` backups [[phase0 §4.8](tests/phase0-findings.md)].
  The Debian genericcloud image does **not** ship the agent — it must be installed
  in-guest.
 - **LXC:** no guest agent exists → **no fsfreeze** (see §4.2).
 ---
 ## 3. API & access control
 ### 3.1 Fundamentals
 - **Base URL:** `https://<host>:8006/api2/json`. Every `pve*` CLI is a thin wrapper over
  this REST API.
 - **Token auth header:** `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The
  secret is shown **once** at creation. Response envelope: `{"data": ...}`.
 - **TLS reality:** the host serves the default **self-signed** certificate. `curl` without
  `-k` fails with `SSL certificate problem: unable to get local issuer certificate`
  [[phase1-2 §1.5](tests/phase1-2-findings.md)]. Production trust (pin the PVE CA / install
  a real cert) is a separate, not-yet-decided concern.
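 The fundamentals above can be sketched with Python's standard library as follows. The helper names (`build_request`, `unwrap`) are hypothetical; the sketch only constructs the request and unwraps the envelope, and it deliberately leaves TLS trust out because that decision is still open.

```python
import urllib.request


def build_request(host: str, path: str, user: str, realm: str,
                  token_id: str, secret: str) -> urllib.request.Request:
    """Build an API request carrying the PVEAPIToken authorization header."""
    url = f"https://{host}:8006/api2/json{path}"
    header = f"PVEAPIToken={user}@{realm}!{token_id}={secret}"
    return urllib.request.Request(url, headers={"Authorization": header})


def unwrap(envelope: dict):
    """Return the payload from the {"data": ...} response envelope."""
    return envelope["data"]
```

 Sending the request (for example with `urllib.request.urlopen`) would fail certificate verification against the default self-signed certificate unless an `ssl.SSLContext` that trusts the host's CA is supplied.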
 ### 3.2 RBAC model
 An ACL entry is a triple **(path, principal, role)**; a role is a bundle of privileges,
 assigned at the most specific path. Paths include `/`, `/vms/<vmid>`, `/nodes/<node>`,
 `/storage/<store>`, `/pool/<pool>`, `/access/...`.
 Introspection (**corrected for PVE 9**) [[phase1-2 §1.1](tests/phase1-2-findings.md)]:
 - `pveum role list` — lists roles **with their privileges**.
 - ⚠️ `pveum role info <role>` **does not exist in PVE 9** (the old reference used it).
 - `pveum acl list`, `pveum user permissions <user> --path <path>`.
 ### 3.3 Privilege-separated tokens — the intersection rule (corrected)
 > **A privsep token's (`--privsep 1`) effective permissions are the *intersection* of (a)
 > the backing user's permissions and (b) the token's own ACLs.** The role must therefore be
 > granted on **BOTH the user AND the token** for the same path. Granting it on the token
 > only yields an **empty intersection** and a **403 even on self-calls.**
 > [[phase1-2 §1.2](tests/phase1-2-findings.md)]
 This corrects the superseded reference (§3 there grants the ACL to the token only). The
 intersection is what keeps a privsep token ≤ its user while still being independently
 scopeable to a narrow path.
 Working pattern (validated):
 ```bash
 pveum role add <Role> -privs "<priv> <priv> ..."          # NB: -privs is space-separated
 pveum user add <user>@pve
 pveum user token add <user>@pve <tokenid> --privsep 1     # capture SECRET (shown once)
 pveum acl modify <path> -user  '<user>@pve'         -role <Role>   # BOTH the user...
 pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role> # ...AND the token
 ```
 `pveum acl delete` **requires `--roles`** (a bare `-user`/`-token` invocation fails with
 `400 roles: property is missing`). Deleting the token/user/role auto-invalidates the
 referencing ACLs. [[phase1-2 §5](tests/phase1-2-findings.md)]
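 The intersection rule can be modelled with plain sets, as in this Python sketch. It is a simplified mental model, not how PVE evaluates ACLs: it ignores path inheritance and propagation, and `effective_privs` is a hypothetical name.

```python
def effective_privs(user_acl: dict, token_acl: dict, path: str) -> set:
    """Return the privileges a privsep token holds on `path`.

    Each ACL maps a path to the set of privileges granted there. The token
    gets only what BOTH the backing user and the token itself were granted.
    """
    return set(user_acl.get(path, set())) & set(token_acl.get(path, set()))
```

 With the role granted on the token only, `user_acl` has no entry for the path, the intersection is empty, and every call is denied, which matches the 403 described above.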
 ### 3.4 Validated minimal self-backup role
 A token scoped to **one VMID + the backup datastore** can audit, snapshot, and back up
 **only that guest**, and is denied on every other guest and on create/allocate
 [[phase1-2 §1.3–1.4](tests/phase1-2-findings.md)]:
 > **Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode
 > self-backup:**
 > `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
 ⚠️ **`VM.PowerMgmt` is NOT required for stop-mode backup** — `vzdump` performs the guest
 shutdown/restart internally under `VM.Backup` (tested: stop-mode self-backup returned
 `exitstatus OK` without it) [[phase1-2 §1.4](tests/phase1-2-findings.md)]. This corrects the
 old reference's "likely yes" guess.
 Validated boundary (token scoped to `/vms/<self>` + `/storage/local`):
 | Operation | Result |
 |---|---|
 | `GET /version` | 200 |
 | `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task `OK` |
 | `GET`/`POST` against **another** guest's vmid | **403** (read) / task **403** (backup) |
 | `POST /nodes/<node>/lxc` (create/allocate a guest) | **403** — create/allocate is operator-tier |
 ### 3.5 Async tasks — trust `exitstatus`, not the POST
 Long operations (`vzdump`, `snapshot`, clone, restore) return a **UPID**, not a result.
 Poll `GET /nodes/<node>/tasks/<upid>/status` until `status: stopped`, then read
 `exitstatus` [[phase1-2 §1.3](tests/phase1-2-findings.md)].
 > ⚠️ **Authorization can surface at task execution, not at the HTTP POST.** A `vzdump`
 > against an unauthorized vmid returns **HTTP 200 + a UPID**, but the task then ends
 > `exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)"` and produces **no
 > archive**. A caller that trusts the 200 would wrongly believe the backup ran. Always poll
 > the task and check `exitstatus`.
 (The task owner — including a token — can read its own task status: 200.)
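
A minimal Go sketch of the poll-then-check rule, assuming the caller supplies the HTTP call. `taskStatus`, `waitTask`, and the `fetch` callback are hypothetical names; `fetch` stands in for a real `GET /nodes/<node>/tasks/<upid>/status`. Treating every `exitstatus` other than `OK` as a failure is an assumption of this sketch, chosen to be strict.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// taskStatus holds the two fields read from the task status response.
type taskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // "OK" on success; an error string otherwise
}

// waitTask polls fetch until the task stops, then judges it by exitstatus,
// never by the HTTP status of the original POST.
func waitTask(fetch func() (taskStatus, error), interval time.Duration, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		st, err := fetch()
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task failed: %s", st.ExitStatus)
			}
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("task did not stop within the poll budget")
}

func main() {
	// Simulated replies: one "running" poll, then the 403 outcome described above.
	replies := []taskStatus{
		{Status: "running"},
		{Status: "stopped", ExitStatus: "403 Permission check failed (/vms/<id>, VM.Backup)"},
	}
	i := 0
	fetch := func() (taskStatus, error) {
		st := replies[i]
		i++
		return st, nil
	}
	fmt.Println(waitTask(fetch, 10*time.Millisecond, 5))
}
```

Injecting `fetch` keeps the decision logic testable without a PVE host; a real caller would wrap its authenticated HTTP client in that callback.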
 ### 3.6 Operator-tier agent role & root-vs-API boundary (validated)
 The operator-tier **host agent** (`03-host-agent.md`) needs a far broader role than the
 Phase-1 *guest self-backup* role (which is denied create/allocate — §3.4). The minimal role
 that drives the full guest lifecycle via an API token, validated by paring the privilege
 set down to the minimum [[phase3 §B3](tests/phase3-findings.md)]:
 > **`FelhomAgent` (operator-tier, 16 privileges):**
 > `VM.Allocate, VM.Audit, VM.Config.Disk, VM.Config.CPU, VM.Config.Memory, VM.Config.Network,
 > VM.Config.Options, VM.PowerMgmt, VM.Snapshot, VM.Snapshot.Rollback, VM.Backup,
 > Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit, Sys.Audit, SDN.Use`
 >
 > Paring proved: `SDN.Use` is **required** (PVE 9 gates bridge use; omitting it → `403
 > (/sdn/zones/localnetwork/vmbr0, SDN.Use)`); `Sys.Audit` required for host metrics
 > (`GET /nodes/<node>/status`); `VM.Config.Network`/`VM.Config.Options` required for NIC/onboot
 > config; `Datastore.AllocateTemplate` **not** needed (drop it). NB `VM.Config.CPUMemory` is
 > not a real privilege — it is `VM.Config.CPU` + `VM.Config.Memory`.
 **Root-vs-API boundary** [[phase3 §B3](tests/phase3-findings.md)] — nearly the entire guest
 lifecycle, **including restore**, is API-token-covered; the genuine OS-root residual is narrow:
 | Operation | Coverage |
 |---|---|
 | Create LXC (nesting-only), config, allocate, start/stop, snapshot/rollback, vzdump, **restore**, destroy, add storage definition, host metrics | **scoped API token** (the `FelhomAgent` role) |
 | ⚠️ **Create LXC with `keyctl=1`** (Docker needs it — §2.3) | **OS root `root@pam` only** |
 | USB physical mount-by-UUID / systemd mount unit / fstab; SMART/sensors | OS root / narrow sudoers |
 > ⚠️ **`keyctl=1` (and any feature flag except `nesting`) can be set only by an actual
 > `root@pam` session** — `changing feature flags (except nesting) is only allowed for
 > root@pam`. **No API token qualifies**, not even a non-privsep `root@pam` token (same 403).
 > So *fresh provisioning* of a Docker-capable LXC needs `pct create` as OS root (or a narrow
 > sudoers entry). **Restore is exempt:** a token-authorized `vzrestore` **preserves
 > `keyctl=1`** from the archive — the DR path needs no root.
 ---
 ## 4. Backup & restore (`vzdump` / `pct restore`)
 ### 4.1 Modes
 - **`stop`** — orderly guest shutdown → backup → restart. Highest consistency, defined
  downtime. (For LXC the shutdown/restart is internal to `vzdump`, so it needs only
  `VM.Backup` — §3.4.)
 - **`snapshot`** — lowest downtime; copies blocks while running. Consistency depends on the
  guest cooperating (§4.2).
 - **`suspend`** — legacy/compat, not used.
 ### 4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC
 > ⚠️ **An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze.** A
 > running-stack LXC backup is therefore **crash-consistent** (filesystem-level), not
 > app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first
 > (stop the stack / flush DBs) or use `stop` mode. A **VM** with `qemu-guest-agent` gets
 > `guest-fsfreeze` around the copy → near-free app-consistency. [[phase1-2 §2.1](tests/phase1-2-findings.md), [phase0 §4.8](tests/phase0-findings.md)]
 **Validated restore behaviour** (LXC, Postgres) [[phase1-2 §2.2](tests/phase1-2-findings.md)]:
 - **Crash-consistent (running):** on first start Postgres ran **automatic WAL recovery**
  (`database system was interrupted … not properly shut down; automatic recovery in
  progress … redo done … ready to accept connections`) and the data was intact.
 - **Quiesced (stack stopped):** clean start, no recovery, data intact.
 - Both restored correctly here on an idle-at-backup DB; this is **not** a durability
  guarantee under heavy write load (§6).
 ### 4.3 What a backup captures
 A single LXC `vzdump` captures the container rootfs **including the Docker named volumes**
 (they live in the rootfs) — one backup = the whole guest and its data. Validated: a
 sentinel row survived both variants [[phase1-2 §2.2](tests/phase1-2-findings.md)].
 Sizes/timings (2.5 GiB source, zstd) [[phase1-2 §2.1–2.2](tests/phase1-2-findings.md)]:
 backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.
 ### 4.4 Restore = recreate-from-archive (identity is preserved)
 There is no single "restore" call — you recreate the guest from the archive into a **new
 VMID**:
 - **LXC:** `pct restore <newid> <archive> --storage <store>`
 - **VM:** `qmrestore <archive> <newid>` (or `POST /nodes/<node>/qemu` with `archive=`)
 > ⚠️ **`pct restore` preserves the source config — including the MAC address and
 > hostname.** Restoring while the original still runs causes a **MAC/hostname collision** on
 > the bridge; reset network identity (`pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp`
 > regenerates the MAC) before starting. [[phase1-2 §2.2](tests/phase1-2-findings.md)]
 **Restored config survives intact:** `unprivileged: 1` and `features: nesting=1,keyctl=1`
 are preserved, so Docker runs in the restored CT [[phase1-2 §2.2](tests/phase1-2-findings.md)].
 ### 4.5 Snapshots
 A **running, unprivileged LXC can be snapshotted on LVM-thin** with no stop required
 (`exitstatus OK`; snapshot listed while the CT stays `running`)
 [[phase1-2 §1.6](tests/phase1-2-findings.md)]. This is the mechanism available for a
 snapshot-before-change rollback flow.
 ### 4.6 PBS (Proxmox Backup Server)
 **Not yet validated.** No PBS datastore was configured or tested in the spike. All backup
 findings above are for `vzdump` to a `dir` storage. PBS (dedup, incremental, remote,
 dirty-bitmap) is pending.
 ### 4.7 vzdump scope by LXC mount type (validated)
 A stop-mode `vzdump` includes/excludes each LXC mount point by **type and the `backup` flag**
 [[phase3 §B2](tests/phase3-findings.md)]. Validated three ways (vzdump log, archive grep,
 restore):
 | Location | `backup` flag | In the vzdump? |
 |---|---|---|
 | rootfs (and anything inside it) | — | **included** (always) |
 | **Docker named volume** (default driver) | — | **included** — it lives in the rootfs (`/var/lib/docker/volumes/<v>/_data`) |
 | volume mount point (`mpN`) | `backup=1` | included |
 | volume mount point (`mpN`) | `backup=0` | **excluded** (vol recreated empty on restore) |
 | bind mount point (`mpN: /host/path`) | n/a | **excluded** ("not a volume"); data is *not* in the archive |
 > ⚠️ **The `backup=<boolean>` flag is honoured ONLY for *volume* mount points.** A **Docker
 > named volume is in the rootfs and is always captured** — so a "bulk" volume left as a
 > default named volume is silently swept into the whole-guest image. To keep bulk data **out**,
 > realize it as a dedicated `backup=0` volume mount point (proven recipe:
 > `pct set <id> -mpN <storage>:<size>,mp=/mnt/bulk,backup=0` then
 > `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol`).
 > A **bind mount's** data is excluded from the archive entirely; on same-host restore it
 > reappears only because the bind config re-attaches the same host dir — on a *different* host
 > (true DR) it is gone unless backed up separately.
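
The inclusion rules in the table above reduce to a small decision function. This is a sketch under stated assumptions: `mountKind` and `inArchive` are hypothetical names, and the rule encodes only what was validated for stop-mode `vzdump` on LXC.

```go
package main

import "fmt"

// mountKind classifies where data lives inside an LXC (hypothetical type).
type mountKind int

const (
	rootfs      mountKind = iota // container rootfs, incl. default Docker named volumes
	volumeMount                  // mpN backed by a storage volume
	bindMount                    // mpN: /host/path
)

// inArchive reports whether data at a mount of the given kind lands in the
// vzdump archive. backupFlag is the mount point's backup=<boolean> setting.
func inArchive(kind mountKind, backupFlag bool) bool {
	switch kind {
	case rootfs:
		return true // always included; the flag does not apply
	case volumeMount:
		return backupFlag // the only kind where the flag is honoured
	default:
		return false // bind mounts are never archived
	}
}

func main() {
	fmt.Println("rootfs / named volume:", inArchive(rootfs, false))
	fmt.Println("mpN backup=1:", inArchive(volumeMount, true))
	fmt.Println("mpN backup=0:", inArchive(volumeMount, false))
	fmt.Println("bind mount:", inArchive(bindMount, true))
}
```

A default Docker named volume is passed as `rootfs` here because it lives inside the rootfs, which is why it cannot be excluded by a flag.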
 ---
 ## 5. Gotchas & operational notes (quick reference)
 | Gotcha | Detail | Evidence |
 |---|---|---|
 | **deb822 repos** | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup |
 | **Privsep dual-grant** | privsep token needs the role on **both** user and token, else empty intersection → 403 | [phase1-2 §1.2](tests/phase1-2-findings.md) |
 | **Async authz** | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | [phase1-2 §1.3](tests/phase1-2-findings.md) |
 | **No fsfreeze for LXC** | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | [phase1-2 §2.1](tests/phase1-2-findings.md) |
 | **Restore identity collision** | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | [phase1-2 §2.2](tests/phase1-2-findings.md) |
 | **Restart policy for self-heal** | after a restore or reboot, Docker containers with no restart policy stay `exited`; set a restart policy or run an explicit `compose up -d` so they return automatically | [phase1-2 §2.2/§3](tests/phase1-2-findings.md) |
 | **Self-signed TLS** | host cert is self-signed; `curl` needs `-k` until trust is set up | [phase1-2 §1.5](tests/phase1-2-findings.md) |
 | **`pveum role info` gone** | use `pveum role list` in PVE 9 | [phase1-2 §1.1](tests/phase1-2-findings.md) |
 | **`pveum acl delete` needs `--roles`** | bare `-user`/`-token` path errors `400 roles: property is missing` | [phase1-2 §5](tests/phase1-2-findings.md) |
 | **`VM.PowerMgmt` not needed** | stop-mode backup works under `VM.Backup` alone | [phase1-2 §1.4](tests/phase1-2-findings.md) |
 | **`keyctl=1` is root-only** | feature flags except `nesting` need a `root@pam` session; no API token (even root's) can set them; restore preserves them | [phase3 §B3](tests/phase3-findings.md) |
 | **`SDN.Use` gates bridge use** | PVE 9 needs `SDN.Use` to attach a NIC to `vmbr0`; omit it → 403 | [phase3 §B3](tests/phase3-findings.md) |
 | **Docker named vol = always backed up** | named volumes live in rootfs; only *volume mountpoints* honour `backup=0`; bulk must be a dedicated `backup=0` mp | [phase3 §B2](tests/phase3-findings.md) |
 ---
 ## 6. Validated vs open
 ### Validated by the spike
 | Fact | Evidence |
 |---|---|
 | PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | [phase0 §1](tests/phase0-findings.md), [phase1-2 pre-flight](tests/phase1-2-findings.md) |
 | Docker runs in an **unprivileged** LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | [phase0 §3](tests/phase0-findings.md) |
 | LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | [phase0 §2](tests/phase0-findings.md) |
 | Privsep token = intersection of user ∩ token ACLs (dual-grant required) | [phase1-2 §1.2](tests/phase1-2-findings.md) |
 | Minimal self-backup role; `VM.PowerMgmt` unnecessary | [phase1-2 §1.4](tests/phase1-2-findings.md) |
 | Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | [phase1-2 §1.3](tests/phase1-2-findings.md) |
 | Async UPID model; vzdump authz surfaces in `exitstatus`, not the POST | [phase1-2 §1.3](tests/phase1-2-findings.md) |
 | Running, unprivileged LXC snapshots on LVM-thin (no stop) | [phase1-2 §1.6](tests/phase1-2-findings.md) |
 | `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | [phase1-2 §2](tests/phase1-2-findings.md) |
 | Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | [phase1-2 §2.2](tests/phase1-2-findings.md) |
 | LXC vzdump scope by mount type; `backup=0` excludes volume mps; Docker named vols ride rootfs; proven bulk-exclusion recipe | [phase3 §B2](tests/phase3-findings.md) |
 | Operator agent role (16 privs); guest lifecycle incl. restore is API-token-covered; `keyctl` create is `root@pam`-only | [phase3 §B3](tests/phase3-findings.md) |
 ### Not yet validated (do not assume)
 | Open item | Why it matters |
 |---|---|
 | **PBS** (dedup/incremental/remote backup) | the only backup path tested was `vzdump` to a `dir` |
 | **The real controller running inside an LXC** reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller |
 | **App-consistency under heavy write load** | WAL recovery was validated only on an idle-at-backup DB |
 | **Live migration / restore to a different host** | single-node spike only |
 | **Ballooning / KSM** effect on VM RAM cost | VM RAM measured with neither configured |
 | **Cluster / HA** behaviour | single node |
 | **Production TLS trust** for the API | all calls used `-k` against a self-signed cert |
 | **deb822 no-subscription repo setup** as a controlled step | host arrived pre-configured |
 ---
 ## 7. Scope boundary
 This document holds **platform facts only.** Felhom design decisions — e.g. which guest
 type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are
 **out of scope** and belong in the controller-architecture document. Where this reference
 notes a decision exists, the decision itself is recorded there, not here.
@@ -0,0 +1,176 @@
 > ⚠️ **SUPERSEDED — spike evidence only, not authoritative.** This is the *pre-spike*
 > reference and contains at least one known error (the privsep/ACL mechanism in §3 — it
 > grants the ACL to the token only, which yields an empty intersection and a 403 even on
 > self-calls). For the corrected, validated facts read
 > [`../proxmox-platform.md`](../proxmox-platform.md). Kept here unchanged as the record of
 > what we believed going into the spike.
 # Proxmox Spike — API & Access-Control Reference
 Reference for the **controller-as-guest** architecture, synthesized from current
 Proxmox VE 9.x documentation (June 2026).
 Items marked **[confirm on box]** should be verified once PVE is installed —
 treat them as Phase 0/1 verification steps, not gospel. Every Proxmox CLI tool
 is a thin wrapper over the same REST API, so anything below is reachable from Go.
 ---
 ## 1. API fundamentals
 - **Base URL:** `https://192.168.0.162:8006/api2/json`
 - **Auth (API token):** HTTP header
  `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`
  The secret is shown **once** at creation — capture it immediately, it can't be
  retrieved again.
 - **Response shape:** `{ "data": ... }`; errors come back via HTTP status + body.
 - **Discovery (do this live on the box instead of trusting any doc):**
  - `pvesh get /version`
  - `pvesh ls /nodes/<node>/qemu/<vmid>`
  - Full schema browser: `https://pve.proxmox.com/pve-docs/api-viewer/`
  - "What call does the GUI make?" → perform the action in the web UI with
    browser DevTools → Network open and read the request. Fastest way to find
    the exact endpoint + params for anything.
 - **Async tasks:** long operations (backup, restore, clone) return a **UPID**
  (task id), not a result. Poll `GET /nodes/<node>/tasks/<upid>/status` until
  `status: stopped`, then check `exitstatus`. The controller must poll, not
  block. **[confirm on box]** the exact polling/response shape.
 ---
 ## 2. RBAC model — (path, principal, role)
 An ACL entry is a triple of **(path, user/group/token, role)**. A role is a
 bundle of privileges, assigned at the most specific path possible.
 - **Paths:** `/`, `/vms/<vmid>`, `/nodes/<node>`, `/storage/<store>`,
  `/pool/<pool>`, `/access/...`
 - **Predefined roles include:** `PVEAuditor` (read-only), `PVEVMUser`,
  `PVEVMAdmin`, `PVEDatastoreUser`, `PVEAdmin`, `PVEUserAdmin`.
 - **API tokens with privilege separation (`--privsep 1`):** the token's
  effective permissions are the **intersection** of (a) the backing user's
  permissions and (b) the token's own ACLs. A privsep token can therefore never
  exceed its user, and you grant it a separate, minimal ACL. This is exactly the
  property the in-guest controller needs.
 Introspection:
 ```bash
 pveum role list
 pveum role info PVEVMAdmin
 pveum user permissions <user> --path /vms/<vmid>
 ```
 ---
 ## 3. Two-tier privilege model (our architecture decision)
 **Tier A — in-guest controller (customer-facing, NARROW).**
 Runs inside the customer's guest. Token scoped to *that guest's own VMID only*:
 read its own status/config, snapshot itself, back itself up, write the backup to
 the datastore. Cannot see or touch other guests. The LXC/VM's own privilege
 level is irrelevant here — reaching `host:8006` is just an HTTPS call + token.
 **Tier B — operator (provisioning, BROAD).**
 Creates/destroys guests, builds the golden template, attaches storage, wires PBS.
 Lives operator-side (hub / tooling), never on the customer box.
 ### Phase 1 runbook — minimal self-backup role + scoped token
 ```bash
 # 1. Custom least-privilege role: "back up / snapshot myself"
 #    [confirm on box: exact privilege names via `pveum role list` / api-viewer]
 pveum role add FelhomSelfBackup \
  -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
 # 2. Dedicated API-only user in the PVE realm (no login password)
 pveum user add felhom-ctl@pve --comment "In-guest controller (self-backup)"
 # 3. Privsep token for that user (SECRET shown once)
 pveum user token add felhom-ctl@pve ctl --privsep 1
 # 4. Scope the TOKEN to one guest + the backup datastore only
 pveum acl modify /vms/<vmid>      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
 pveum acl modify /storage/<store> -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
 # 5. Test FROM INSIDE the guest
 curl -k https://<host>:8006/api2/json/version \
  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>"
 curl -k -X POST https://<host>:8006/api2/json/nodes/<node>/vzdump \
  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>" \
  -d "vmid=<vmid>&storage=<store>&mode=snapshot"
 ```
 **Pass criteria:** the token backs up its OWN vmid, and returns **403** on any
 other vmid. That single result validates the whole controller-as-guest design.
 **Open question to settle here:** does Tier A also need `VM.PowerMgmt` so it can
 stop/start its own guest for `stop`-mode backups? Likely yes — add it and re-test.
 ---
 ## 4. Backup / restore (vzdump)
 **Modes:**
 - **`stop`** — orderly guest shutdown → live backup → resume. Highest
  consistency, short defined downtime.
 - **`snapshot`** — lowest downtime; copies blocks while running. *Small
  inconsistency risk* unless the guest cooperates (see below).
 - **`suspend`** — legacy/compat, longer downtime, not recommended.
 **App-consistency — the concrete version of the earlier warning:**
 - **VM:** install `qemu-guest-agent` in the guest and set `agent: 1`.
  `snapshot`-mode vzdump then calls `guest-fsfreeze-freeze` / `-thaw` around the
  copy → near-free filesystem consistency. **This is a real point in the VM's
  favour over LXC.**
 - **LXC:** no guest agent → no fsfreeze. App-consistency becomes the
  *controller's* job: quiesce in-guest first (stop stacks / flush DBs) **then**
  vzdump, or use `stop` mode. Same lesson as the restic work, moved to the guest
  layer.
 **CLI / API:**
 ```bash
 vzdump <vmid> --mode snapshot --storage <store>                 # CLI
 # API (async → UPID):
 POST /api2/json/nodes/<node>/vzdump        params: vmid, storage, mode, ...
 ```
 **Restore is NOT a single "restore" call** — you recreate the guest from the
 archive:
 - **VM:** `qmrestore <archive> <newvmid>`  /  `POST /nodes/<node>/qemu` with `archive=...`
 - **LXC:** `pct restore <newvmid> <archive>`  /  `POST /nodes/<node>/lxc` with the archive as source
 Phase 2's real-restore test = restore to a **fresh vmid** and boot it. Do not
 declare the backup "working" until a restored guest actually runs.
 ---
 ## 5. Key REST endpoints (qemu shown; lxc is parallel under `/lxc`)
 ```
 GET  /nodes
 GET  /nodes/<node>/qemu                          list VMs
 GET  /nodes/<node>/qemu/<vmid>/status/current    live status
 GET  /nodes/<node>/qemu/<vmid>/config            config
 POST /nodes/<node>/qemu/<vmid>/status/{start,stop,shutdown,reboot}
 POST /nodes/<node>/qemu/<vmid>/snapshot          (snapname, description)
 GET  /nodes/<node>/qemu/<vmid>/snapshot          list snapshots
 POST /nodes/<node>/qemu/<vmid>/snapshot/<snap>/rollback
 POST /nodes/<node>/vzdump                         backup (async, UPID)
 GET  /nodes/<node>/tasks/<upid>/status            poll async task
 ```
 LXC: replace `/qemu/` with `/lxc/`. For **Docker-in-LXC** the container needs
 `features nesting=1,keyctl=1` (`pct set <vmid> -features nesting=1,keyctl=1`, or
 the `features` property on `POST /nodes/<node>/lxc`) — **[confirm on box]**.
 ---
 ## 6. Phase 0 confirm-on-box checklist
 - [ ] PVE 9.2 installed; storage = LVM-thin (leave free space to also test dir/qcow2)
 - [ ] Exact privilege set for `FelhomSelfBackup` (`pveum role info`)
 - [ ] UPID task-polling response shape
 - [ ] Docker official apt repo has a `trixie` channel
 - [ ] LXC `features nesting=1,keyctl=1` syntax + Docker actually runs inside an LXC
 - [ ] Baseline idle + under-load RAM/CPU: one Debian VM vs one Debian LXC, identical resources
@@ -0,0 +1,331 @@
 # Phase 0 — VM vs LXC Overhead Spike: Findings
 **Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie),
 kernel 7.0.2-6-pve, 4 vCPU, 16 GB RAM (15771 MB `MemTotal`).
 **Date:** 2026-06-07. **Measured one guest at a time, the other fully stopped.**
 > This document presents **data and observations only**. No recommendation or verdict —
 > the architecture decision is made elsewhere.
 ---
 ## 1. Provenance
 ### Platform
 | Component | Version |
 |---|---|
 | pve-manager | 9.2.2 (`b9984c6d90a4bd80`) |
 | kernel | proxmox-kernel 7.0.2-6-pve |
 | pve-qemu-kvm | 11.0.0-3 |
 | qemu-server | 9.1.15 |
 | pve-container | 6.1.10 |
 | lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
 | criu | 4.1.1-1 |
 `pvesh get /version` → release 9.2, version 9.2.2.
 ### Guest images
 | | LXC (9001) | VM (9000) |
 |---|---|---|
 | Source | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` | `debian-13-genericcloud-amd64.qcow2` |
 | Build | Debian 13.1 standard CT template (downloaded via `pveam`, checksum verified) | cloud build **20260601-2496**; in-guest reports Debian **13.5** after `apt update` |
 | qcow2 | n/a | virtual 3 GiB, on-disk 323 MiB, compat 1.1/zlib |
 ### Docker (identical in both guests)
 | | LXC | VM |
 |---|---|---|
 | Source | Docker official apt repo, **`trixie` channel** (confirmed present) | same |
 | Version | **29.5.3** build d1c06ef | **29.5.3** build d1c06ef |
 | Storage Driver | **`overlayfs`** (not vfs) | **`overlayfs`** (not vfs) |
 | Cgroup Version / Driver | **v2 / systemd** | **v2 / systemd** |
 | `hello-world` | OK | OK |
 > Docker's official repo **does** have a `trixie` channel — no fallback to Debian's
 > `docker.io` was needed. Docker 29 reports the driver as `overlayfs` (the containerd
 > snapshotter image store) rather than the legacy name `overlay2`; this is the same
 > overlay technology and is **not** a `vfs` fallback.
 ---
 ## 2. Comparison table
 Baseline (both guests stopped): host RAM used **median 1702 MB** (range 1699–1703);
 host CPU **~0.1 % used** (99.9 % idle). All RAM deltas below are vs this baseline.
 Host RAM used = `MemTotal − MemAvailable`, 5 samples ~3 s apart (median reported).
 | Metric | LXC (9001) | VM (9000) | Δ (VM − LXC) |
 |---|---|---|---|
 | **Idle host-RAM delta** | **+211 MB** (1913) | **+2056 MB** (3758) | **+1845 MB** |
 | **Under-load host-RAM delta** | **+410 MB** (2112) | **+2084 MB** (3786) | **+1674 MB** |
 | **Per-guest mem attribution** | cgroup `memory.current` = **1961 MB**¹ | KVM process RSS = **2031 MB** (idle) / **2047 MB** (load) | — |
 | **Idle host CPU used** | **~0.3 %** (0.20 usr + 0.10 sys) | **~6.0 %** (3.37 usr + 2.31 sys + 0.29 guest) | **+5.7 pp** |
 | **Under-load host CPU used** | **~39.4 %** (17.1 usr + 7.5 sys + 14.5 iowait + 0.3 soft) | **~53.9 %** (31.9 guest + 16.4 iowait + 3.4 sys + 1.7 usr + 0.6 soft) | **+14.5 pp** |
 | **pgbench throughput** | **2211.7 tps**, lat 1.809 ms, 132 710 tx/60 s, 0 failed | **1819.6 tps**, lat 2.198 ms, 163 764 tx/90 s, 0 failed² | **−392 tps** |
 | **Disk allocated** | 10 GiB | 10 GiB | 0 |
 | **Disk used (host thin-LV)** | 26.73 % ≈ **2.67 GiB** | 29.33 % ≈ **2.93 GiB** | +0.26 GiB |
 | **Disk used (inside guest)** | 2.1 GiB / 9.7 GiB | 2.4 GiB / 9.7 GiB | +0.3 GiB |
 | **Provisioning (rough, create→ready)** | ~10–15 s³ | ~60–75 s³ | — |
 ¹ `memory.current` counts reclaimable page cache shared with the host and therefore
 **overstates** the LXC's true incremental cost; the +211 MB host-RAM delta is the honest
 number. ² VM 60 s runs gave 1739 & 1759 tps — consistent with the 90 s definitive run.
 ³ Guest-creation step only; see §4. Docker install + first image pull (network-bound and
 roughly identical for both) is excluded.
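
The host-RAM metric used in this table (`MemTotal − MemAvailable`, median of five samples) can be sketched in Go as follows. `usedMB` and `median` are hypothetical helper names, not the commands actually run in the spike; "MB" here means kB divided by 1024, and the `MemAvailable` value in `main` is an illustrative number chosen to match the baseline, not a logged reading.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// usedMB returns MemTotal - MemAvailable from /proc/meminfo-style text,
// converted from kB to MB by dividing by 1024.
func usedMB(meminfo string) (int, error) {
	vals := map[string]int{}
	for _, line := range strings.Split(meminfo, "\n") {
		f := strings.Fields(line)
		if len(f) < 2 {
			continue
		}
		if kb, err := strconv.Atoi(f[1]); err == nil {
			vals[strings.TrimSuffix(f[0], ":")] = kb
		}
	}
	total, okT := vals["MemTotal"]
	avail, okA := vals["MemAvailable"]
	if !okT || !okA {
		return 0, errors.New("MemTotal or MemAvailable missing")
	}
	return (total - avail) / 1024, nil
}

// median returns the middle value of a non-empty, odd-length sample set.
func median(samples []int) int {
	s := append([]int(nil), samples...) // copy so the caller's slice is untouched
	sort.Ints(s)
	return s[len(s)/2]
}

func main() {
	// Illustrative meminfo excerpt (MemAvailable is an assumed value).
	sample := "MemTotal:       16149504 kB\nMemAvailable:   14406656 kB\n"
	mb, err := usedMB(sample)
	fmt.Println(mb, err)

	// The five baseline samples recorded in the raw log (§5.2).
	fmt.Println(median([]int{1699, 1702, 1702, 1702, 1703}))
}
```

Reporting the median rather than the mean keeps a single drifting sample (see the page-cache note in §4) from shifting the headline figure.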
 ### Inside-guest `free -m` (context only — not the decisive number)
 | | total | used | buff/cache | available |
 |---|---|---|---|---|
 | LXC idle | 2048 | 125 | 1851 | 1922 |
 | VM idle | 1974 | 509 | 1524 | 1464 |
 The VM sees **1974 MB** usable of 2048 allocated (firmware/kernel reservation).
 ---
 ## 3. Docker-in-LXC viability
 **Worked cleanly in an *unprivileged* LXC with `--features nesting=1,keyctl=1`. No
 privileged fallback was needed.**
 - `--features nesting=1,keyctl=1 --unprivileged 1` accepted by `pct create` (PVE 9
  syntax confirmed via `pct help create`).
 - `docker run hello-world` → success.
 - **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
  fallback**.
 - Full 3-container stack (`postgres:17`, `redis:7`, `nginx:alpine`) came up healthy.
 - Named volume `pgdata` persisted a write (`SELECT count` returned 1 after table
  create/insert).
 - Multi-container networking + published port worked: `curl localhost:8080` → **HTTP 200**.
 - 60 s pgbench load: **0 failed transactions**.
 No errors, no `dmesg`/`journalctl` anomalies, no workarounds. The privileged-LXC
 fallback path (step A5) was therefore **not exercised**.
 ---
 ## 4. Observations & confounds
 1. **VM under-load CPU required a re-measurement (diagnosed, not hidden).** The first
   VM-load sample showed host CPU ~5 % — identical to *idle* — while pgbench nonetheless
   completed a full 60 s run (1739 tps). Root cause: the VM load was launched through a
   **nested SSH + `nohup &`** layer (host→VM), which started pgbench *after* the sampling
   window. The LXC path used local `pct exec` (no nested SSH) so its first sample was
   valid. Re-running with pgbench held in the **foreground of a long-lived SSH channel**
   (guaranteed active) and sampling during a confirmed window gave the true **53.9 %**
   (`%guest`=31.9). **Confound:** the two guests' load was driven through different
   plumbing (`pct exec` vs nested SSH); the *throughput* numbers are unaffected
   (pgbench self-reports its own duration), but the CPU figures came from
   methodologically asymmetric harnesses.
 2. **Baseline drift from residual page cache.** After stopping each guest, host RAM did
   not snap back to 1702 MB immediately (e.g. 1895 MB just after the LXC stopped;
   1965→1794 MB drifting down after the VM). This is reclaimable cache, not a leak.
   Treat all RAM deltas as ±~100 MB.
 3. **The headline RAM gap is structural, not incidental.** LXC processes share the host
   kernel and page cache, so only the working set counts against the host (+211 MB idle).
   The VM, with **no ballooning configured**, has KVM back every guest-touched page —
   including the guest's own 1.5 GB page cache — so the host cost ≈ the full 2 GB
   allocation (KVM RSS ≈ 2031 MB) and is **largely load-independent** (3758 idle → 3786
   load). Ballooning / KSM were not tested and could change this.
 4. **`cgroup memory.current` ≠ host cost.** For the LXC it read 1961 MB (near the 2 GB
   limit) because it includes reclaimable page cache; the real incremental host cost was
   +211 MB. Per the protocol, `MemTotal − MemAvailable` is the decisive metric.
 5. **VM idle CPU floor (~6 %) vs LXC (~0.3 %).** QEMU device emulation + a full guest
   kernel's timer/housekeeping impose a small constant CPU cost even at rest.
 6. **Throughput vs CPU trade.** The VM did slightly *less* work (1820 vs 2212 tps) for
   *more* host CPU (53.9 vs 39.4 %). The extra cost surfaces as `%guest` (31.9 %) — the
   actual DB work *plus* virtualization overhead — whereas in the LXC the same DB work
   appears directly as host `%usr`/`%sys`. iowait was comparable (~15–16 %, WAL fsync).
 7. **Workload fits in RAM.** pgbench scale `-s 10` (~150 MB) fits in cache in both
   guests, so the test is commit/CPU-bound rather than disk-bound; a larger-than-RAM
   dataset would stress the storage paths differently and is not covered here.
 8. **qemu-guest-agent confirmed on the VM** (`qm guest cmd 9000 ping` → OK). This enables
   `guest-fsfreeze`-based app-consistent `snapshot`-mode vzdump for the VM — a capability
   the LXC has no equivalent for. The genericcloud image does **not** ship the agent;
   it had to be installed in-guest (and the VM IP had to be found via `nmap`/MAC until
   the agent was up).
 9. **Provisioning asymmetry foreshadows cloning.** LXC create is template-extract-bound
   (526 MiB at 387 MiB/s + SSH keygen, ~10–15 s). VM create is qcow2-import-bound (3 GiB
   → LVM ≈ 30 s) plus a full firmware boot to SSH-ready (~30–45 s). Figures are rough,
   single-run, and exclude the shared network-bound Docker install + first image pull.
 ---
 ## 5. Raw command log (appendix)
 ### 5.1 Provenance
 ```
 $ pveversion -v | grep ...
 pve-manager: 9.2.2 (running version: 9.2.2/b9984c6d90a4bd80)
 proxmox-kernel-7.0: 7.0.2-6
 criu: 4.1.1-1
 lxc-pve: 7.0.0-2
 lxcfs: 7.0.0-pve1
 pve-container: 6.1.10
 pve-qemu-kvm: 11.0.0-3
 qemu-server: 9.1.15
 $ pvesm status
 local         dir      active  98497780  4333576  89114656  4.40%
 local-lvm  lvmthin    active 365760512        0 365760512  0.00%
 # Docker repo trixie channel:
 $ curl -fsSL https://download.docker.com/linux/debian/dists/ | grep -oE 'trixie|bookworm|bullseye'
 bookworm / bullseye / trixie        # trixie present
 # Cloud image:
 $ qemu-img info debian-13-genericcloud-amd64.qcow2
 virtual size: 3 GiB ; disk size: 323 MiB ; compat 1.1 ; build 20260601-2496
 ```
 ### 5.2 Baseline (both guests stopped)
 ```
 $ for i in 1..5; awk MemTotal-MemAvailable /proc/meminfo ; sleep 3
 used=1699 MB / 1702 / 1702 / 1702 / 1703 MB      (median 1702)
 $ mpstat 1 5
 Average: all 0.05 usr 0.05 sys ... 99.90 idle
 ```
 ### 5.3 LXC 9001 — create + Docker
 ```
 $ pct create 9001 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
    --hostname spike-lxc --cores 2 --memory 2048 --rootfs local-lvm:10 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp --features nesting=1,keyctl=1 \
    --unprivileged 1 --start 1
  Logical volume "vm-9001-disk-0" created.
  extracting archive ... Total bytes read: 551505920 (526MiB, 387MiB/s)
  Creating SSH host key ... done
 === exit: 0 ; status: running
 features: nesting=1,keyctl=1 ; unprivileged: 1 ; ip 192.168.0.115/24
 # Docker install (official repo, trixie stable): DOCKER-INSTALL-OK
 $ docker --version            -> Docker version 29.5.3, build d1c06ef
 $ docker run --rm hello-world -> Hello from Docker!
 $ docker info | grep -iE 'Storage Driver|Cgroup'
 Storage Driver: overlayfs
 Cgroup Driver: systemd
 Cgroup Version: 2
 Server Version: 29.5.3 ; Kernel: 7.0.2-6-pve ; OS: Debian GNU/Linux 13 (trixie)
 ```
 ### 5.4 LXC 9001 — stack health
 ```
 $ docker compose ps
 spike-cache-1  running   Up
 spike-db-1     running   Up
 spike-web-1    running   Up
 $ curl -s -o /dev/null -w 'HTTP %{http_code}' localhost:8080   -> HTTP 200
 $ psql CREATE TABLE spike_persist; INSERT; SELECT count(*)     -> 1   (volume persists)
 ```
 ### 5.5 LXC 9001 — idle measurement
 ```
 Host RAM used (5x3s): 1913 / 1914 / 1913 / 1914 / 1913 MB     (median 1913, Δ +211)
 cgroup memory.current: 2056036352 B = 1961 MB
 inside free -m: total 2048 used 125 buff/cache 1851 available 1922
 mpstat 1 5 Average: 0.20 usr 0.10 sys ... 99.70 idle   (~0.3% used)
 pct df 9001: rootfs 9.7G size, 2.1G used, 21.6%
 ```
 ### 5.6 LXC 9001 — under-load measurement
 ```
 $ pgbench -i -s 10  -> done in 1.39 s
 $ pgbench -T 60 -c 4 (run concurrently with sampling):
 Host RAM used (5x3s): 2149 / 2143 / 2112 / 2086 / 2071 MB     (median 2112, Δ +410)
 cgroup memory.current: 2130382848 B = 2032 MB
 mpstat 1 5 Average: 17.10 usr 7.50 sys 14.50 iowait 0.31 soft 60.59 idle  (~39.4% used)
 pgbench result: scaling 10, clients 4, 60 s
  transactions: 132710 ; failed 0 (0.000%)
  latency average = 1.809 ms ; tps = 2211.713864
 host thin LV vm-9001-disk-0: 10240 MB, Data% 26.73  (≈2.67 GiB)
 ```
 ### 5.7 VM 9000 — create + cloud-init
 ```
 $ qm create 9000 --name spike-vm --cores 2 --memory 2048 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single --agent 1
 $ qm set 9000 --scsi0 local-lvm:0,import-from=/var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
  transferred 3.0 GiB of 3.0 GiB (100.00%)
  scsi0: successfully created disk 'local-lvm:vm-9000-disk-0,size=3G'
 $ qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
 $ qm disk resize 9000 scsi0 10G        -> resized 3.00 -> 10.00 GiB
 $ qm set 9000 --ciuser spike --cipassword spike --sshkeys /root/spike-pubkey.pub --ipconfig0 ip=dhcp
   # pubkey file = the two real keys from the host's /etc/pve/priv/authorized_keys
   #   (incl. ssh-ed25519 ...kisfenyo@windows — the same workstation key)
 $ qm start 9000   -> start-ok
 ```
 ### 5.8 VM 9000 — IP discovery + guest agent + Docker
 ```
 # genericcloud has no guest-agent at first boot -> qm guest cmd ping failed.
 # IP found via MAC on the bridge:
 $ nmap -sn 192.168.0.0/24 | grep -B2 BC:24:11:C7:41:87
  Nmap scan report for 192.168.0.155 ; MAC BC:24:11:C7:41:87 (Proxmox)
 $ ssh -i /root/.ssh/id_rsa spike@192.168.0.155 'hostname; cat /etc/debian_version'
  spike-vm ; 13.5
 # install qemu-guest-agent + Docker (official repo, trixie): VM-INSTALL-OK
 $ qm guest cmd 9000 ping            -> AGENT OK   (fsfreeze available)
 $ docker --version                  -> Docker version 29.5.3, build d1c06ef
 $ docker run --rm hello-world       -> Hello from Docker!
 $ docker info | grep -iE 'Storage Driver|Cgroup'
 Storage Driver: overlayfs ; Cgroup Driver: systemd ; Cgroup Version: 2
 ```
 ### 5.9 VM 9000 — stack health
 ```
 $ docker compose ps -> spike-cache-1 / spike-db-1 / spike-web-1 all running
 $ curl ... localhost:8080 -> HTTP 200
 $ psql ... SELECT count(*) -> 1   (volume persists)
 ```
 ### 5.10 VM 9000 — idle measurement
 ```
 Host RAM used (5x3s): 3758 / 3757 / 3754 / 3759 / 3758 MB     (median 3758, Δ +2056)
 KVM process RSS / VSZ: 2079988 / 3380896 KiB  (RSS = 2031 MB)
 inside free -m: total 1974 used 509 buff/cache 1524 available 1464
 mpstat 1 5 Average: 3.37 usr 2.31 sys 0.29 guest ... 94.04 idle  (~6.0% used)
 qm config: scsi0 local-lvm:vm-9000-disk-0,size=10G
 host thin LV vm-9000-disk-0: 10240 MB, Data% 29.33  (≈2.93 GiB)
 inside df -h /: 9.7G size, 2.4G used, 25%
 ```
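The approximate allocated size of a thin LV follows from its provisioned size and the `Data%` column. A small sketch of that arithmetic, assuming the "MB" figures in the log are MiB as reported by LVM; `thin_lv_used_gib` is a hypothetical helper name:

```python
def thin_lv_used_gib(size_mib, data_percent):
    """Approximate allocated space of a thin LV from its size (MiB) and Data%."""
    return size_mib * data_percent / 100 / 1024

print(f"{thin_lv_used_gib(10240, 29.33):.2f} GiB")  # VM 9000  -> 2.93 GiB
print(f"{thin_lv_used_gib(10240, 26.73):.2f} GiB")  # LXC 9001 -> 2.67 GiB
```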
 ### 5.11 VM 9000 — under-load measurement (definitive, load confirmed active)
 ```
 # First attempt (nested-ssh + nohup &) launched pgbench AFTER the sample window ->
 # host CPU read a false ~5% (identical to idle). Diagnosed; re-run below holds
 # pgbench in the foreground of a long-lived SSH channel and samples during it.
 $ pgbench -T 90 -c 4 (foreground, channel held):
  transactions: 163764 ; failed 0 (0.000%)
  latency average = 2.198 ms ; tps = 1819.602345
  (60 s confirmation runs: 1739 & 1759 tps)
 # Sampled 10 s into the confirmed-active load:
 Host RAM used (5x3s): 3784 / 3786 / 3786 / 3786 / 3786 MB     (median 3786, Δ +2084)
 KVM process RSS / VSZ: 2096508 / 4495008 KiB  (RSS = 2047 MB)
 guest uptime: load average 1.71 (2 vCPU)  -> vCPUs busy
 mpstat 1 8 Average:
  1.70 usr  3.40 sys  16.35 iowait  0.58 soft  31.89 guest  46.08 idle   (~53.9% used)
 ```
 ### 5.12 Teardown state
 ```
 $ qm list  -> 9000 spike-vm stopped
 $ pct list -> 9001 spike-lxc stopped
 # both present, both stopped (numbers can be re-checked)
 ```
 ---
 ## 6. Teardown — destroy commands (NOT run)
 Both guests were left **stopped but present**. To remove them:
 ```bash
 qm destroy 9000 --purge            # VM   (also removes cloudinit + disks)
 pct destroy 9001 --purge           # LXC
 # optional spike artifacts on the host:
 rm -f /var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
 rm -f /root/spike-pubkey.pub /root/vm-install.sh
 # (Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst)
 ```
@@ -0,0 +1,315 @@
 # Phase 1 + 2 — Privilege Model & Backup/Restore Round-Trip: Findings
 **Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, node confirmed via
 `pvesh get /nodes` → `demo-felhom`. Storage: `local` (dir, content
 `iso,vztmpl,backup,import`), `local-lvm` (LVM-thin, `rootdir,images`).
 **Subject:** LXC `9001` (`spike-lxc`, unprivileged, `nesting=1,keyctl=1`, Docker +
 postgres/redis/nginx stack). **Date:** 2026-06-07.
 > Data and observations only — **no recommendation or verdict**.
 ## Hypotheses — verdicts at a glance
 | | Hypothesis | Result |
 |---|---|---|
 | **H1** | Backup scopes to one VMID; restore/create needs node/pool allocate → denied to narrow token | **CONFIRMED** (create CT = 403) |
 | **H2** | An LXC vzdump captures the Docker volumes (they live in the container rootfs) | **CONFIRMED** (sentinel survived both restores) |
 | **H3** | Crash-consistent (running) *and* quiesced (stopped) backups both restore cleanly | **CONFIRMED** (A via WAL recovery, B clean start) |
 | **H4** | Running unprivileged LXC snapshots on LVM-thin; restored CT keeps unprivileged+nesting/keyctl | **CONFIRMED** (live snapshot OK; config survived) |
 ---
 ## 1. Phase 1 — Privilege model
 ### 1.1 Setup (operator side, root)
 ```
 pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
 pveum user add felhom-ctl@pve --comment "spike in-guest controller"
 pveum user token add felhom-ctl@pve ctl --privsep 1   # secret: b6547d9d-... (ephemeral, spike-only)
 pveum acl modify /vms/9001      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
 pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
 ```
 Privilege names were verified against `PVEVMAdmin` / `PVEDatastoreUser` via
 `pveum role list` first. **Note:** the reference doc's introspection command
 `pveum role info <role>` **does not exist in PVE 9** — only `pveum role list` works.
 ### 1.2 ⚠️ Privsep gotcha — the doc's runbook is incomplete
 With `--privsep 1`, a token's effective rights are the **intersection of the backing
 user's permissions AND the token's own ACLs**. The reference doc (§3) grants ACLs to the
 **token only**. With the user `felhom-ctl@pve` holding **no** permissions, the
 intersection was **empty** — the first self-audit call returned:
 ```
 HTTP 403  {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}
 ```
 **Fix applied:** also grant the user the role on the same paths
 (`pveum acl modify /vms/9001 -user felhom-ctl@pve -role FelhomSelfBackup`, same for
 `/storage/local`). After that the self-calls succeeded. **A privsep token needs the
 permission present on *both* the user and the token** (the token ACL is what keeps the
 token ≤ user / narrowly scoped). This must be reflected in the controller provisioning.
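The intersection rule can be modelled with plain sets. This is a simplified sketch for one ACL path only; it ignores propagation and group ACLs, and `effective_privs` is an illustrative name rather than a Proxmox API:

```python
ROLE = {"VM.Audit", "VM.Snapshot", "VM.Backup",
        "Datastore.AllocateSpace", "Datastore.Audit"}  # FelhomSelfBackup

def effective_privs(user_privs, token_privs, privsep=True):
    """Privileges a token can exercise on a single ACL path.

    With privsep the token is limited to what BOTH the user and the token
    hold; without it the token inherits the user's privileges.
    """
    return set(user_privs) & set(token_privs) if privsep else set(user_privs)

# Runbook as written: role granted to the token only -> nothing usable.
print(effective_privs(set(), ROLE))            # set()
# After the fix: role granted to user and token on the same path.
print(sorted(effective_privs(ROLE, ROLE)))
```

A provisioning check built on this idea would flag any path where the token holds a role that the backing user lacks.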
 ### 1.3 Test matrix (every call run from **inside** the unprivileged LXC, `pct exec 9001`)
 `H=192.168.0.162  N=demo-felhom  AUTH="PVEAPIToken=felhom-ctl@pve!ctl=<secret>"`
 | # | Call | Expected | **Actual** | Notes |
 |---|---|---|---|---|
 | 1 | `GET /version` | 200 | **200** | reachable + auth from inside LXC (no privilege needed) |
 | 2 | `GET /nodes/$N/lxc/9001/status/current` | 200 | **200**¹ | self audit (after privsep fix) |
 | 3 | `POST /nodes/$N/lxc/9001/snapshot snapname=spk1` | 200/UPID→OK | **200, task exitstatus OK** | **running-LXC self-snapshot (H4)** |
 | 4 | `POST /nodes/$N/vzdump vmid=9001 storage=local mode=snapshot` | 200/UPID→OK | **200, task exitstatus OK** | self backup, archive produced |
 | 5 | `GET /nodes/$N/qemu/9000/status/current` | 403 | **403** | `Permission check failed (/vms/9000, VM.Audit)` |
 | 6 | `POST /nodes/$N/vzdump vmid=9000 storage=local` | 403 | **200 POST → task exitstatus 403**² | see note |
 | 7 | `POST /nodes/$N/lxc` (create CT) | 403 | **403** | `Permission check failed` — **proves create/allocate is operator-tier (H1)** |
 ¹ before the privsep fix this was 403; see §1.2.
 ² **Important nuance:** the `vzdump` endpoint accepts the POST and returns a UPID even for
 an unauthorized vmid; the authorization failure surfaces at **task execution**, not at the
 HTTP layer. Polled from root:
 `exitstatus: "403 Permission check failed (/vms/9000, VM.Backup)"`, and **no 9000 archive
 was created**. The boundary holds — but a controller must **poll the task exitstatus**, not
 trust the POST's 200, to know a cross-guest backup was actually refused.
 **Pass criteria met:** self-ops (1–4) succeed; cross-guest read (5), cross-guest backup
 (6, at task level), and create/allocate (7) are denied. The controller-as-guest boundary
 and the two-tier split are validated.
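The "poll the task, do not trust the POST" rule from note ² could be sketched as follows. `fetch_status` is a caller-supplied, hypothetical function that returns the `data` object of the task-status endpoint for a UPID; treating only the literal `OK` as success is a choice made for this sketch:

```python
import time

def wait_for_task(fetch_status, upid, interval_s=1.0, timeout_s=300.0):
    """Poll a task until it stops and return its exitstatus string."""
    deadline = time.monotonic() + timeout_s
    while True:
        data = fetch_status(upid)
        if data.get("status") == "stopped":
            return data.get("exitstatus", "")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {upid} still running after {timeout_s}s")
        time.sleep(interval_s)

def task_ok(exitstatus):
    """This sketch accepts only the literal 'OK' as success."""
    return exitstatus == "OK"

# Canned responses mirroring call #6 (cross-guest vzdump accepted, then refused):
responses = iter([
    {"status": "running"},
    {"status": "stopped",
     "exitstatus": "403 Permission check failed (/vms/9000, VM.Backup)"},
])
result = wait_for_task(lambda upid: next(responses), "UPID:example", interval_s=0)
print(task_ok(result), result)
```

Injecting the fetcher keeps the polling logic testable without a Proxmox host; a real fetcher would issue the authenticated GET and unwrap `data`.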
 ### 1.4 Final minimal role — `VM.PowerMgmt` **not** required
 The doc left an open question: does Tier A need `VM.PowerMgmt` for stop-mode backups? Its
 tentative answer was "likely yes". **Tested and refuted:** a **stop-mode** self-vzdump submitted by the token
 (`vmid=9001 mode=stop`) completed with **`exitstatus: OK`** using the role *without*
 `VM.PowerMgmt`. `vzdump` performs the guest shutdown/restart internally under
 `VM.Backup`; no separate power privilege is needed.
 > **Final minimal role (`FelhomSelfBackup`) — satisfies self-audit, self-snapshot, and
 > both `snapshot`- and `stop`-mode self-backup:**
 > `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
 > (`VM.PowerMgmt` deliberately omitted — confirmed unnecessary.)
 ### 1.5 TLS observation
 From inside the LXC, `curl` **without** `-k`:
 ```
 curl: (60) SSL certificate problem: unable to get local issuer certificate
 ```
 The host serves the default self-signed PVE cert; all tests used `-k`. Production trust
 (pin the PVE CA / issue a proper cert) is a separate design decision, flagged here.
 ### 1.6 Running-LXC snapshot (H4)
 Call #3 snapshotted the **running** unprivileged LXC on LVM-thin (`exitstatus OK`).
 `pct listsnapshot 9001` shows `spk1` with `pct status 9001 = running`. **No stop
 required** — the snapshot-before-update rollback flow is viable on a live container.
 ---
 ## 2. Phase 2 — Backup → real restore round-trip
 Sentinel written pre-flight into the `pgdata` volume:
 `restore_check(42,'phase2-sentinel')` → clean read `42|phase2-sentinel`.
 ### 2.1 Backups (operator/root side)
 | Variant | Mode | Stack state | Task time | Wall | Archive | Size (zstd) |
 |---|---|---|---|---|---|---|
 | **A — crash-consistent** | `snapshot` | **running** | 00:00:24 | 25 s | `vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst` | **934 MB** (979,718,569 B) |
 | **B — quiesced** | `snapshot` | **stopped** (`docker compose stop`) | 00:00:21 | 22 s | `vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst` | **934 MB** (979,671,582 B) |
 Both from a 2.5 GiB source; zstd → ~934 MB (~2.6:1). The stack was restarted after
 Variant B. **LXC snapshot-mode vzdump does *not* fsfreeze** (no guest agent in an LXC —
 consistent with the Phase 0 finding) → Variant A is genuinely crash-consistent.
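The compression ratio can be checked against the exact byte counts, using the `Total bytes written` values from the vzdump output in §4.4 and the archive sizes in the table. A short sketch of the arithmetic (the helper name is illustrative):

```python
def compression_ratio(source_bytes, archive_bytes):
    """Uncompressed bytes divided by archive bytes."""
    return source_bytes / archive_bytes

variant_a = compression_ratio(2585589760, 979718569)
variant_b = compression_ratio(2585825280, 979671582)
print(f"A {variant_a:.2f}:1, B {variant_b:.2f}:1")  # A 2.64:1, B 2.64:1
```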
 ### 2.2 Restore → fresh VMID → boot → verify
 | Check | 9002 (Variant A) | 9003 (Variant B) |
 |---|---|---|
 | Restore time (`pct restore … --storage local-lvm`) | **12 s** | **11 s** |
 | `unprivileged: 1` survived | **yes** | **yes** |
 | `features: nesting=1,keyctl=1` survived | **yes** | **yes** |
 | Containers after boot | `exited` (no restart policy) → `docker compose up -d` | same |
 | 3 containers healthy | **yes** | **yes** |
 | `curl localhost:8080` | **HTTP 200** | **HTTP 200** |
 | **Sentinel `(42,'phase2-sentinel')`** | **PRESENT** | **PRESENT** |
 | Postgres first-start | **WAL crash recovery** (see below) | **clean start, no recovery** |
 > Restored CTs inherit 9001's fixed `hwaddr`. To avoid a MAC clash with the still-running
 > 9001 on `vmbr0`, `net0` was reset to auto-generate a fresh MAC before boot. All
 > verification (stack health, `curl localhost`, sentinel) is guest-internal and needs no
 > external network — and the Docker images are inside the restored rootfs, so no pulls.
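One way a restore flow might avoid the MAC clash is to rewrite the restored `net0` value without its `hwaddr=` key, matching the `pct set` call in §4.5 that omitted it. A minimal sketch; the MAC below is a made-up placeholder and `drop_hwaddr` is a hypothetical helper:

```python
def drop_hwaddr(net_spec):
    """Return a netN config string with any hwaddr= key removed."""
    parts = [p for p in net_spec.split(",")
             if not p.lower().startswith("hwaddr=")]
    return ",".join(parts)

# Placeholder value, not taken from the spike:
restored = "name=eth0,bridge=vmbr0,hwaddr=BC:24:11:00:00:01,ip=dhcp,type=veth"
print(drop_hwaddr(restored))  # name=eth0,bridge=vmbr0,ip=dhcp,type=veth
```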
 **Variant A — Postgres automatic WAL recovery on 9002 (verbatim, post-restore boot):**
 ```
 LOG:  database system was interrupted; last known up at 2026-06-07 18:13:21 UTC
 LOG:  database system was not properly shut down; automatic recovery in progress
 LOG:  redo starts at 0/CB12838
 LOG:  invalid record length at 0/CB12870: expected at least 24, got 0   # normal end-of-WAL
 LOG:  redo done at 0/CB12838 ...
 LOG:  checkpoint starting: end-of-recovery immediate wait
 LOG:  database system is ready to accept connections
 ```
 **Variant B — clean start on 9003 (verbatim, post-restore boot):**
 ```
 LOG:  database system was shut down at 2026-06-07 18:14:39 UTC
 LOG:  database system is ready to accept connections
 ```
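A controller could tell the two first-start outcomes apart by scanning the Postgres log for the marker phrases quoted above. A rough sketch, assuming those phrases are stable for the Postgres version in use; `classify_pg_start` is an illustrative name:

```python
def classify_pg_start(log_lines):
    """Return 'crash-recovery', 'clean', or 'unknown' for a first-start log."""
    text = "\n".join(log_lines)
    if "automatic recovery in progress" in text:
        return "crash-recovery"
    if "database system was shut down at" in text:
        return "clean"
    return "unknown"

variant_b = [
    "LOG:  database system was shut down at 2026-06-07 18:14:39 UTC",
    "LOG:  database system is ready to accept connections",
]
print(classify_pg_start(variant_b))  # clean
```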
 **H2 confirmed:** one LXC vzdump captured the whole customer including the Docker named
 volume — the sentinel data restored in both guests. **H3 confirmed:** both variants
 restored to a bootable guest with intact data; the crash-consistent one recovered via WAL
 with no manual intervention, the quiesced one started clean. **H4 confirmed:** restored
 config preserved `unprivileged` + `nesting/keyctl`, so Docker ran in the restored CT.
 ---
 ## 3. Observations & confounds
 1. **Privsep token needs perms on user *and* token** (§1.2) — the single most important
   correction to the reference runbook; without it every scoped call 403s.
 2. **vzdump authorization is task-level, not POST-level** (§1.3 note ²) — a 200 + UPID
   does **not** mean authorized. The controller must poll `exitstatus`. This is also the
   general async-task lesson: every backup/snapshot/restore returns a UPID and the real
   result is in the task status.
 3. **`pveum role info` is gone in PVE 9** — use `pveum role list`. Minor doc drift.
 4. **`VM.PowerMgmt` not needed for stop-mode backup** (§1.4) — narrower role than the doc
   assumed.
 5. **No fsfreeze for LXC** — Variant A relied on Postgres's own WAL crash recovery, which
   worked here for an idle-at-backup DB. Under heavy write load, app-consistency for LXC
   still rests on the controller quiescing first (or stop-mode), exactly as the reference
   warned. This single test is not a durability guarantee under load.
 6. **Restore MAC collision** (§2.2) — `pct restore` preserves the source `hwaddr`;
   restoring while the original runs needs a MAC reset (or the original stopped). The
   controller's restore flow must handle identity (MAC/hostname/IP) to avoid clashes.
 7. **No restart policy on the compose services** — restored containers came up `exited`;
   `docker compose up -d` (or a restart policy / systemd unit) is required for the stack
   to return automatically after a restore or guest reboot.
 8. **Restore is fast, backup dominated by I/O** — restores were 11–12 s (extract at
   ~524 MiB/s); backups ~22–25 s (read 2.5 GiB at ~108–119 MiB/s + zstd). Single runs,
   idle host, ~150 MB DB; not a throughput benchmark.
 9. **Sequencing artifact:** a Phase-1 stop-mode self-backup ran before Phase 2 and
   stopped/started 9001; the stack was brought back up and the sentinel re-verified
   before the Variant A/B backups, so it does not affect the round-trip results.
 ---
 ## 4. Raw command log (appendix)
 ### 4.1 Pre-flight
 ```
 $ pvesh get /nodes  -> node: demo-felhom
 $ cat /etc/pve/storage.cfg
 dir: local   ... content iso,vztmpl,backup,import        # 'backup' present
 lvmthin: local-lvm ... content rootdir,images            # no backup (expected)
 $ pct start 9001 ; docker compose up -d  -> 3 containers Started
 $ curl localhost:8080  -> HTTP 200
 # sentinel:
 CREATE TABLE ; INSERT 0 1 ; SELECT count -> 1 ; SELECT * -> 42 | phase2-sentinel
 ```
 ### 4.2 Phase 1 — role/user/token/ACL
 ```
 $ pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"  -> role-ok
 $ pveum user add felhom-ctl@pve --comment "spike in-guest controller"  -> user-ok
 $ pveum user token add felhom-ctl@pve ctl --privsep 1
  {"full-tokenid":"felhom-ctl@pve!ctl","info":{"privsep":"1"},"value":"b6547d9d-08ec-4f22-beb8-a551dc2cd69d"}
 $ pveum acl modify /vms/9001 -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup   -> ok
 $ pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup -> ok
 $ pveum role list | grep FelhomSelfBackup
  FelhomSelfBackup | Datastore.AllocateSpace,Datastore.Audit,VM.Audit,VM.Backup,VM.Snapshot
 $ pveum role info FelhomSelfBackup   -> ERROR: unknown command 'pveum role info'   # PVE9 has no 'role info'
 ```
 ### 4.3 Phase 1 — matrix (from inside LXC)
 ```
 # TLS without -k:
 curl: (60) SSL certificate problem: unable to get local issuer certificate
 # BEFORE privsep fix:
 #2 GET self status -> HTTP 403 {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}
 # privsep fix:
 $ pveum acl modify /vms/9001 -user 'felhom-ctl@pve' -role FelhomSelfBackup  -> ok
 $ pveum acl modify /storage/local -user 'felhom-ctl@pve' -role FelhomSelfBackup -> ok
 # AFTER fix:
 #1 GET /version                         -> HTTP 200
 #2 GET /nodes/.../lxc/9001/status/current -> HTTP 200 {"data":{...,"status":"running",...}}
 #5 GET /nodes/.../qemu/9000/status/current -> HTTP 403 (/vms/9000, VM.Audit)
 #6 POST vzdump vmid=9000 -> HTTP 200 {"data":"UPID:...vzdump:9000:felhom-ctl@pve!ctl:"}
   root poll: exitstatus="403 Permission check failed (/vms/9000, VM.Backup)"
   task log: TASK ERROR: 403 Permission check failed (/vms/9000, VM.Backup)
   /var/lib/vz/dump: no 9000 archive created
 #7 POST /nodes/.../lxc (create CT vmid=9009) -> HTTP 403 {"message":"Permission check failed\n"}
 #3 POST lxc/9001/snapshot snapname=spk1 -> HTTP 200 UPID:...vzsnapshot:9001...
   root: exitstatus "OK" ; pct listsnapshot 9001 -> spk1 ; pct status 9001 -> running
 #4 POST vzdump vmid=9001 storage=local mode=snapshot -> HTTP 200 UPID:...vzdump:9001...
   root: exitstatus "OK"
   token can read own task status: HTTP 200 {"...exitstatus":"OK"}   # earlier poll TIMEOUTs were a shell-quoting bug in the helper, not a perms issue
 # stop-mode self-backup (VM.PowerMgmt test):
 $ token POST vzdump vmid=9001 storage=local mode=stop -> HTTP 200 UPID:...vzdump:9001...
   root poll: exitstatus "OK"     # SUCCEEDED without VM.PowerMgmt in the role
 ```
 ### 4.4 Phase 2 — backups
 ```
 # Variant A (running):
 $ vzdump 9001 --mode snapshot --storage local --compress zstd
 INFO: Total bytes written: 2585589760 (2.5GiB, 108MiB/s)
 INFO: archive file size: 934MB
 INFO: Finished Backup of VM 9001 (00:00:24)   ; WALL_SECONDS=25
 -> vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst  (979718569 B)
 # Variant B (stopped):
 $ docker compose stop   (cache,db,web Stopped)
 $ vzdump 9001 --mode snapshot --storage local --compress zstd
 INFO: Total bytes written: 2585825280 (2.5GiB, 119MiB/s)
 INFO: Finished Backup of VM 9001 (00:00:21)   ; WALL_SECONDS=22
 -> vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst  (979671582 B)
 $ docker compose start   (db,cache,web Started)
 ```
 ### 4.5 Phase 2 — restores + verification
 ```
 # A -> 9002:
 $ pct restore 9002 .../20_13_43.tar.zst --storage local-lvm
  Total bytes read: 2585589760 (2.5GiB, 524MiB/s) ; RESTORE_A_SECONDS=12
 $ pct config 9002 -> features: nesting=1,keyctl=1 ; unprivileged: 1
 $ pct set 9002 -net0 name=eth0,bridge=vmbr0,ip=dhcp   # fresh MAC BC:24:11:E3:F4:64
 $ pct start 9002 ; docker compose up -d -> 3 running ; curl -> HTTP 200
 $ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
  db log: "was interrupted ... not properly shut down; automatic recovery in progress
           redo starts/redo done ... database system is ready to accept connections"
 # B -> 9003:
 $ pct restore 9003 .../20_14_40.tar.zst --storage local-lvm
  Total bytes read: 2585825280 (2.5GiB, 524MiB/s) ; RESTORE_B_SECONDS=11
 $ pct config 9003 -> features: nesting=1,keyctl=1 ; unprivileged: 1
 $ pct set 9003 -net0 ... (fresh MAC) ; pct start 9003 ; docker compose up -d -> 3 running ; curl 200
 $ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
  db log: "database system was shut down at ... ; database system is ready to accept connections"  # clean
 ```
 ---
 ## 5. Teardown (executed)
 Restore targets destroyed; Phase 1 objects and spike artifacts removed; `9000`/`9001`
 left **stopped-but-present**. Verified clean: `felhom-ctl@pve` deleted, no spike ACLs,
 empty `dump/`, `spk1` removed.
 > **Correction:** `pveum acl delete` **requires `--roles`** (a bare `-user`/`-token`
 > path errors `400 roles: property is missing`). In practice the explicit ACL deletes
 > are unnecessary — deleting the token/user/role **auto-invalidates** the referencing
 > ACLs (PVE logs `ignore invalid acl token …` and drops them).
 ```bash
 pct stop 9002 ; pct stop 9003 ; pct destroy 9002 --purge ; pct destroy 9003 --purge
 # correct ACL-delete syntax (needs --roles), or just let user/role deletion clean them:
 pveum acl delete /vms/9001      --roles FelhomSelfBackup --users  'felhom-ctl@pve'
 pveum acl delete /vms/9001      --roles FelhomSelfBackup --tokens 'felhom-ctl@pve!ctl'
 pveum acl delete /storage/local --roles FelhomSelfBackup --users  'felhom-ctl@pve'
 pveum acl delete /storage/local --roles FelhomSelfBackup --tokens 'felhom-ctl@pve!ctl'
 pveum user token remove felhom-ctl@pve ctl ; pveum user delete felhom-ctl@pve ; pveum role delete FelhomSelfBackup
 pct delsnapshot 9001 spk1
 rm -f /var/lib/vz/dump/vzdump-lxc-9001-*.tar.zst /var/lib/vz/dump/vzdump-lxc-9001-*.log
 pct stop 9001     # back to stopped-but-present
 ```
 ## 6. To destroy 9000/9001 later (NOT run — left stopped-but-present)
 ```bash
 qm destroy 9000 --purge        # VM  (Phase 0 subject)
 pct destroy 9001 --purge       # LXC (Phase 0/1/2 subject)
 # Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst
 ```
@@ -0,0 +1,234 @@
 # Phase 3 — vzdump exclusion (B2) & agent operator role + root boundary (B3): Findings
 **Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, node confirmed via
 `pvesh get /nodes` → `demo-felhom`. **Date:** 2026-06-08. Throwaway resources (VMIDs
 9010-9023, role/user `FelhomAgent`/`felhom-agent@pve`); all torn down (only the pre-existing
 9000/9001 remain, stopped). Every Proxmox op polled to `task exitstatus` (not the POST
 return).
 > Validates the two items the design review (`_design-review.md`) flagged as unvalidated:
 > **B2** (what vzdump includes/excludes per LXC mount type + how to keep bulk out) and **B3**
 > (the least-privilege operator role + the root-vs-API boundary). Data only.
 ---
 ## B2 — vzdump inclusion/exclusion matrix
 **Setup:** one unprivileged LXC `9010` (`nesting=1,keyctl=1`, overlayfs), Docker 29.5.3
 installed, with six sentinel locations:
 | # | location | config |
 |---|---|---|
 | 1 | rootfs file `/SENTINEL_ROOTFS` | rootfs (`local-lvm:8`) |
 | 2 | Docker **named** volume `b2vol` → `SENTINEL_DOCKERVOL` | default driver |
 | 3 | `mp1` volume mount `/mnt/mp1` → `SENTINEL_MP1` | `local-lvm:1,backup=1` |
 | 4 | `mp2` volume mount `/mnt/mp2` → `SENTINEL_MP2` | `local-lvm:1,backup=0` |
 | 5 | `mp3` **bind** mount `/mnt/mp3` → `SENTINEL_MP3` | host `/root/b2-bindsrc` |
 | 6 | bulk Docker vol `bulkvol` bound onto mp2 → `SENTINEL_BULK` | `--driver local -o type=none -o o=bind -o device=/mnt/mp2` |
 **The "trap" confirmed at setup:** the Docker named volume's on-disk path is
 `/var/lib/docker/volumes/b2vol/_data` — **inside the LXC rootfs**.
 ### Result matrix (stop-mode vzdump → `local`, verified 3 ways: vzdump log, archive grep, restore to 9011)
 | Sentinel | location | flag | **in archive?** | restored 9011 |
 |---|---|---|---|---|
 | `SENTINEL_ROOTFS` | rootfs | — | **INCLUDED** | present |
 | `SENTINEL_DOCKERVOL` | Docker named vol (in rootfs) | — | **INCLUDED** ⚠️ the trap | present |
 | `SENTINEL_MP1` | volume mp | `backup=1` | **INCLUDED** | present |
 | `SENTINEL_MP2` | volume mp | `backup=0` | **EXCLUDED** | absent (vol recreated empty) |
 | `SENTINEL_MP3` | bind mount | n/a | **EXCLUDED** | reappears via re-bind only¹ |
 | `SENTINEL_BULK` | Docker vol on mp2 | `backup=0` | **EXCLUDED** | absent |
 ¹ The bind-mount **data is not in the archive** (archive grep shows no mp3 path). It
 reappears in the restored 9011 only because `pct restore` preserves the bind config
 `mp3: /root/b2-bindsrc` and re-attaches the **same host dir**. On a *different* host (true DR)
 the bind data would be gone unless backed up separately — important for DR planning.
 **vzdump log (verbatim) — the authoritative per-mount decision:**
 ```
 INFO: including mount point rootfs ('/') in backup
 INFO: including mount point mp1 ('/mnt/mp1') in backup
 INFO: excluding volume mount point mp2 ('/mnt/mp2') from backup (disabled)
 INFO: excluding bind mount point mp3 ('/mnt/mp3') from backup (not a volume)
 ```
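Because the vzdump log carries the per-mount decision, it can be parsed rather than inferred. A sketch that matches only the four line shapes quoted above; other vzdump messages are not covered, and the regex is an assumption about this log format, not a documented contract:

```python
import re

LINE = re.compile(
    r"INFO: (including|excluding) (?:(volume|bind) )?mount point (\S+) "
    r"\('([^']*)'\) (?:in|from) backup(?: \(([^)]*)\))?"
)

def mount_decisions(log_text):
    """Map mount point name -> (included?, path, reason) from a vzdump log."""
    out = {}
    for m in LINE.finditer(log_text):
        action, _kind, name, path, reason = m.groups()
        out[name] = (action == "including", path, reason)
    return out

log = """\
INFO: including mount point rootfs ('/') in backup
INFO: including mount point mp1 ('/mnt/mp1') in backup
INFO: excluding volume mount point mp2 ('/mnt/mp2') from backup (disabled)
INFO: excluding bind mount point mp3 ('/mnt/mp3') from backup (not a volume)
"""
print(mount_decisions(log)["mp2"])  # (False, '/mnt/mp2', 'disabled')
```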
 **Archive contents (verbatim) — `tar --zstd -tf … | grep SENTINEL`:**
 ```
 ./var/lib/docker/volumes/b2vol/_data/SENTINEL_DOCKERVOL
 ./SENTINEL_ROOTFS
 ./mnt/mp1/SENTINEL_MP1
 ```
 **Restore verification (verbatim) — sentinels in restored 9011:**
 ```
 PRESENT : /SENTINEL_ROOTFS
 PRESENT : /var/lib/docker/volumes/b2vol/_data/SENTINEL_DOCKERVOL
 PRESENT : /mnt/mp1/SENTINEL_MP1
 ABSENT  : /mnt/mp2/SENTINEL_MP2
 ABSENT  : /mnt/mp2/SENTINEL_BULK
 PRESENT : /mnt/mp3/SENTINEL_MP3   # via re-bind to same host dir, NOT from archive
 ```
 ### Proven bulk-exclusion recipe
 A "bulk" Docker volume is kept out of the guest vzdump by binding it onto a **volume
 mountpoint with `backup=0`**:
 1. Attach a Proxmox volume mountpoint with the flag:
   `pct set <id> -mpN <storage>:<size>,mp=/mnt/bulk,backup=0`
 2. Realize the Docker volume on that path:
   `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol`
   (or a compose bind to `/mnt/bulk`).
 3. Data written through `bulkvol` lands on the `backup=0` mountpoint → **excluded** from
   vzdump, while rootfs/hot sentinels are **included**. Verified: `SENTINEL_BULK` absent from
   archive and restore; `SENTINEL_ROOTFS` present.
 ### The trap, stated for the placement component
 `backup=<boolean>` is **only honoured for volume mount points** (confirmed: pct manpage +
 vzdump log "excluding volume mount point … (disabled)"). A Docker **named volume uses the
 default driver and lands in the rootfs**, which is **always backed up** — so a "bulk" volume
 left as an ordinary named volume is **silently swept into the whole-guest image**. The
 per-volume placement component **must** realize every `bulk` volume as a dedicated `backup=0`
 mountpoint (or external bind mount), never a default named volume.
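A placement component could guard against the trap by validating the mountpoint spec before realizing a `bulk` volume on it. A conservative sketch covering only the two exclusion cases observed in this test (bind mount, or volume mountpoint with an explicit `backup=0`); `excluded_from_vzdump` is a hypothetical helper and it assumes the source is either a storage volume or a host directory:

```python
def excluded_from_vzdump(mp_spec):
    """True if a pct mpN spec is known to keep its data out of the archive."""
    source, *opts = mp_spec.split(",")
    if source.startswith("/"):      # host path -> bind mount ("not a volume")
        return True
    flags = dict(o.split("=", 1) for o in opts if "=" in o)
    # Anything without an explicit backup=0 is treated as NOT proven excluded.
    return flags.get("backup") == "0"

print(excluded_from_vzdump("local-lvm:1,mp=/mnt/mp2,backup=0"))  # True
print(excluded_from_vzdump("local-lvm:1,mp=/mnt/mp1,backup=1"))  # False
```

Returning `False` for an unspecified flag is deliberate: the sketch only vouches for configurations that were actually verified.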
 ---
 ## B3 — agent operator role + root-vs-API boundary
 **Caveat applied (Phase 1):** privsep token needs the role on **both** user and token. Setup:
 user `felhom-agent@pve` + privsep token `agent`, role `FelhomAgent`, dual-granted at `/`.
 All ops driven **as the token** via the REST API; task `exitstatus` polled.
 > ⚠️ **Terminology:** the Phase-1 `FelhomSelfBackup` role is the discarded **guest-side
 > self-backup** role (scoped to one guest, *denied* create/allocate). `FelhomAgent` here is
 > its **operator-tier replacement** — a different, broader role. Do not conflate.
 ### Op matrix (as the scoped token)
 | # | Operation | API call | Result |
 |---|---|---|---|
 | read | host status | `GET /nodes/$N/status` | **200** (needs `Sys.Audit`) |
 | read | storage list | `GET /storage` | **200** (`Datastore.Audit`) |
 | 1 | **create LXC, `nesting=1,keyctl=1`** | `POST /nodes/$N/lxc` | **403** — `changing feature flags (except nesting) is only allowed for root@pam` |
 | 1′ | create LXC, **nesting-only** | `POST /nodes/$N/lxc` | **200 / OK** |
 | 2 | set config (mem/cpu/options + mountpoint w/ `backup` flag) | `PUT /nodes/$N/lxc/<id>/config` | **200** |
 | 3 | allocate volume | `POST /nodes/$N/storage/local-lvm/content` | **200** (`Datastore.AllocateSpace`) |
 | 4 | start | `POST …/status/start` | **OK** (`VM.PowerMgmt`) |
 | 5 | stop | `POST …/status/stop` | **OK** |
 | 6a | snapshot | `POST …/snapshot` | **OK** (`VM.Snapshot`) |
 | 6b | rollback | `POST …/snapshot/s1/rollback` | **OK** (`VM.Snapshot.Rollback`) |
 | 7 | stop-mode backup | `POST /nodes/$N/vzdump mode=stop` | **OK** (`VM.Backup`) |
 | 8 | restore → fresh vmid | `POST /nodes/$N/lxc restore=1` | **OK** — and **restored CT kept `features: nesting=1,keyctl=1`** |
 | 9 | destroy CT | `DELETE /nodes/$N/lxc/<id>?purge=1` | **OK** (`VM.Allocate`) |
 | 9b | add storage definition (dir) | `POST /storage` | **200** (`Datastore.Allocate`, **no root**) |
 **The two headline results:**
 1. **`keyctl=1` on create is `root@pam`-only.** Verbatim:
   `Permission check failed (changing feature flags (except nesting) is only allowed for root@pam)`.
   Confirmed this is **not** token-fixable: a **non-privsep `root@pam` token** got the **same
   403**. Only an actual `root@pam` session (OS root / `pct create` as root) can set it.
   `nesting` alone is allowed for a scoped token.
 2. **Restore preserves `keyctl`.** A token-authorized `vzrestore` of a keyctl archive produced
   `9021` with `features: nesting=1,keyctl=1` and `unprivileged: 1`. So the **DR/restore path is
   fully token-covered**; only *fresh provisioning* needs root for the keyctl flag.
 ### Paring (each drop shown to still pass, or proven needed)
 | Privilege | Verdict | Evidence |
 |---|---|---|
 | `Datastore.AllocateTemplate` | **DROP** (unnecessary) | create-from-template succeeded without it (200/OK) |
 | `Sys.Audit` | **KEEP** | `GET /nodes/$N/status` → **403** without it (host metrics, `03` §5) |
 | `VM.Config.Network` | **KEEP** | create with `net0` → **403 (/vms/…, VM.Config.Network)** without it |
 | `VM.Config.Options` | **KEEP** | config `onboot=1` → **403 (/vms/…, VM.Config.Options)** without it |
 | `SDN.Use` | **KEEP (added vs review sketch)** | create → **403 (/sdn/zones/localnetwork/vmbr0, SDN.Use)** without it |
 > Corrections to the review's candidate sketch: `VM.Config.CPUMemory` is **not a real
 > privilege** — split into `VM.Config.CPU` + `VM.Config.Memory`. `SDN.Use` was **missing** and
 > is **required** (PVE 9 gates bridge use behind it). `Datastore.AllocateTemplate` is **not
 > needed**.
 ### Final minimal `FelhomAgent` role (proven sufficient for ops 1′–9b)
 ```
 VM.Allocate  VM.Audit  VM.Config.Disk  VM.Config.CPU  VM.Config.Memory
 VM.Config.Network  VM.Config.Options  VM.PowerMgmt  VM.Snapshot  VM.Snapshot.Rollback
 VM.Backup  Datastore.Allocate  Datastore.AllocateSpace  Datastore.Audit  Sys.Audit  SDN.Use
 ```
 (16 privileges. `Datastore.Allocate` is for the storage-definition add; drop it if the agent
 never creates Proxmox storage entries via the API. `VM.PowerMgmt` is for start/stop lifecycle
 — not for the backup itself, consistent with `proxmox-platform.md` §3.4.)
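The paring result can be expressed as a set difference between the candidate role from the raw command log and the final role. A small sketch that only restates the privilege names listed in this document:

```python
CANDIDATE = set("""
VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU VM.Config.Memory
VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback
VM.Backup Datastore.Allocate Datastore.AllocateSpace Datastore.AllocateTemplate
Datastore.Audit Sys.Audit
""".split())

# Drop the privilege shown to be unnecessary, add the one shown to be required.
FINAL = (CANDIDATE - {"Datastore.AllocateTemplate"}) | {"SDN.Use"}

print(len(FINAL))                 # 16
print(sorted(CANDIDATE - FINAL))  # ['Datastore.AllocateTemplate']
print(sorted(FINAL - CANDIDATE))  # ['SDN.Use']
```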
 ### Root-vs-API boundary table (answers `03` §3)
 | Agent host operation | Coverage | Notes |
 |---|---|---|
 | Create unprivileged LXC, **nesting-only** | **API token** | `VM.Allocate`+`VM.Config.*`+`Datastore.AllocateSpace`+`SDN.Use` |
 | **Create with `keyctl=1` (Docker needs it — Phase 0)** | **OS root `root@pam`** (`pct create` as root / sudoers) | no API token works, incl. a root@pam token |
 | Set config (mem/cpu/net/options/mountpoint + `backup` flag) | API token | |
 | Allocate guest volume | API token | `Datastore.AllocateSpace` |
 | Start / stop / snapshot / rollback | API token | `VM.PowerMgmt` / `VM.Snapshot(.Rollback)` |
 | vzdump backup (stop/snapshot mode) | API token | `VM.Backup` |
 | **Restore from vzdump (preserves keyctl)** | **API token** | DR path needs no root |
 | Destroy guest (scratch + compensating rollback, B1) | API token | `VM.Allocate` |
 | Add Proxmox **storage definition** (dir/nfs/cifs/pbs) | API token | `Datastore.Allocate`; the *definition* only |
 | Host status / metrics report | API token | `Sys.Audit` |
 | **USB physical mount-by-UUID / systemd mount unit / fstab** | **OS root / narrow sudoers** | not a Proxmox API op (host-level mount; not tested here) |
 | **SMART / hardware sensors** | OS root | not API-exposed |
 **Boundary summary:** nearly the entire guest lifecycle — including **restore** — is covered
 by the scoped token. The genuine OS-root residual is narrow: **(1) fresh creation of a
 Docker-capable LXC (the `keyctl` flag), (2) physical USB mount-by-UUID / systemd mount units /
 fstab, (3) hardware/SMART.** This supports `03` §3's "non-root service + scoped token + narrow
 sudoers" model — with the **specific** sudoers/root entries being: `pct create` (or just the
 keyctl-setting step) and the host mount operations.
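
The boundary table can also be carried as data, so the agent picks the token path or the root path from one place. The following is a minimal sketch under assumptions: the operation keys, `execPath`, and `pathFor` are hypothetical names for illustration, not the agent's actual API. Only the token-versus-root split is taken from the table.

```go
package main

import "fmt"

// execPath says how the agent would have to perform an operation.
type execPath string

const (
	viaToken execPath = "api-token" // scoped FelhomAgent token
	viaRoot  execPath = "os-root"   // narrow sudoers / root@pam
)

// boundary mirrors the table above. The operation keys are hypothetical names.
var boundary = map[string]execPath{
	"guest_create_nesting": viaToken,
	"guest_create_keyctl":  viaRoot, // feature flags other than nesting are root@pam-only
	"guest_config":         viaToken,
	"guest_power":          viaToken,
	"guest_snapshot":       viaToken,
	"guest_backup":         viaToken,
	"guest_restore":        viaToken, // restore preserves keyctl without root
	"guest_destroy":        viaToken,
	"storage_define":       viaToken,
	"host_status":          viaToken,
	"usb_mount":            viaRoot,
	"smart_read":           viaRoot,
}

// pathFor fails closed: an operation missing from the table is an error, never a default.
func pathFor(op string) (execPath, error) {
	p, ok := boundary[op]
	if !ok {
		return "", fmt.Errorf("operation %q is not in the boundary table", op)
	}
	return p, nil
}

func main() {
	for _, op := range []string{"guest_restore", "guest_create_keyctl", "guest_resize"} {
		p, err := pathFor(op)
		fmt.Println(op, p, err)
	}
}
```

Failing closed on unknown operations keeps a newly added op from silently defaulting to either path before it has been tested against the token.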
 ---
 ## Raw command log (appendix)
 ### B2
 ```
 pct create 9010 ... --features nesting=1,keyctl=1 --unprivileged 1   # rootfs local-lvm:8
 pct set 9010 -mp1 local-lvm:1,mp=/mnt/mp1,backup=1
 pct set 9010 -mp2 local-lvm:1,mp=/mnt/mp2,backup=0
 pct set 9010 -mp3 /root/b2-bindsrc,mp=/mnt/mp3
 # docker named vol: docker volume inspect b2vol -> /var/lib/docker/volumes/b2vol/_data
 # bulk: docker volume create --driver local -o type=none -o o=bind -o device=/mnt/mp2 bulkvol
 vzdump 9010 --mode stop --storage local --compress zstd
 #   INFO: including mount point rootfs ('/') in backup
 #   INFO: including mount point mp1 ('/mnt/mp1') in backup
 #   INFO: excluding volume mount point mp2 ('/mnt/mp2') from backup (disabled)
 #   INFO: excluding bind mount point mp3 ('/mnt/mp3') from backup (not a volume)
 tar --zstd -tf <archive> | grep SENTINEL   # -> rootfs, dockervol, mp1 only
 pct restore 9011 <archive> --storage local-lvm   # -> mp2/bulk absent, mp3 via re-bind
 ```
 ### B3
 ```
 pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace Datastore.AllocateTemplate Datastore.Audit Sys.Audit"   # candidate (pre-SDN)
 pveum user add felhom-agent@pve ; pveum user token add felhom-agent@pve agent --privsep 1
 pveum acl modify / -user  'felhom-agent@pve'        -role FelhomAgent
 pveum acl modify / -token 'felhom-agent@pve!agent'  -role FelhomAgent
 # token create with keyctl:
 POST /nodes/demo-felhom/lxc ... features=nesting=1,keyctl=1
  -> 403 "changing feature flags (except nesting) is only allowed for root@pam"
 # + SDN.Use missing initially:
  -> 403 "Permission check failed (/sdn/zones/localnetwork/vmbr0, SDN.Use)"
 # root@pam non-privsep token, keyctl create:
  -> 403 (same "only allowed for root@pam")   # tokens never qualify
 # token nesting-only create / config(PUT) / start / stop / snapshot / rollback /
 # vzdump(stop) / restore->9021 (kept keyctl) / destroy / POST /storage  -> all 200/OK
 # paring:
 GET /nodes/$N/status  without Sys.Audit            -> 403   (KEEP)
 create net0           without VM.Config.Network     -> 403   (KEEP)
 config onboot=1       without VM.Config.Options      -> 403   (KEEP)
 create from template  without Datastore.AllocateTemplate -> OK (DROP)
 ```
 ### Teardown
 ```
 pct destroy 9010 9011 9021 --purge   # 9020/9022/9023 already destroyed during tests
 pveum user token remove felhom-agent@pve agent ; pveum user delete felhom-agent@pve
 pveum role delete FelhomAgent        # ACLs at / auto-invalidated
 rm -f /var/lib/vz/dump/vzdump-lxc-9010-* /var/lib/vz/dump/vzdump-lxc-9020-*
 # verified: only 9000/9001 remain (stopped-but-present); no felhom-agent user/role; dump dir empty
 ```
@@ -0,0 +1,257 @@
 # Phase 4 — Control-plane signing primitive (SSHSIG + Go verify): Findings
 **Where run:** build server `192.168.0.180` (Debian 13, **Go 1.24.4**, **OpenSSH 10.0p2**),
 no Proxmox. **Date:** 2026-06-08. Throwaway key generated, used, and **deleted** — no private
 key, passphrase, or `.sig` committed.
 > De-risks the signing primitive *before* it is written into `04-control-plane-authorization.md`
 > or the agent's verify code. **Verdict up front: the approach works cleanly and is
 > key-type-agnostic, so no fallback is needed.** Go verifies the armored `SSHSIG` format, every
 > tamper/replay/authorization case is rejected, and a synthetic FIDO2 `sk-ssh-ed25519` signature
 > verifies through the **unchanged** code path (true hardware drop-in).
 ---
 ## 0. Result at a glance — 14/14 checks pass
 ```
 == Step 2: SSHSIG signature verification (key-type-agnostic path) ==
  PASS  correct                verified, op="guest_destroy"
  PASS  wrong key              rejected: signer not in allowed set
  PASS  tampered blob          rejected: signature invalid: ssh: signature did not verify
  PASS  wrong namespace        rejected: namespace mismatch: got "felhom-op-wrong" want "felhom-op-v1"
 == Step 3: anti-replay / authorization (valid signature, still rejected) ==
  PASS  first use              verified, op="guest_destroy"
  PASS  replay (same nonce)    rejected: replay: nonce a1b2c3d4...8f90 already seen
  PASS  expired                rejected: expired (expires_at=2020-01-02 ..., now=2026-06-08 ...)
  PASS  not-yet-valid          rejected: not yet valid (issued_at=2030-01-01 ...)
  PASS  retargeted host        rejected: target mismatch: blob=demo-felhom/9001 this=other-host/9001
  PASS  retargeted guest       rejected: target mismatch: blob=demo-felhom/9001 this=demo-felhom/8888
 == Step 4: key-type-agnosticism — FIDO2 sk-ssh-ed25519 (synthetic, no device) ==
  PASS  parses sk pubkey       type="sk-ssh-ed25519@openssh.com"
  PASS  authorized_keys form   sk-ssh-ed25519@openssh.com AAAAGnNrLXNzaC1lZDI1NTE5...
  PASS  sk end-to-end verify   verified, op="guest_destroy"
 ```
 ---
 ## 1. Software round-trip (baseline, CLI)
 - Key: `ssh-keygen -t ed25519 -f felhom-op -N '<passphrase>' -C felhom-operator`.
  (Signing non-interactively used an `SSH_ASKPASS` helper + `setsid -w`; in production the
  operator key lives behind an agent or a FIDO2 device, so the passphrase prompt at signing time
  is a non-issue. The passphrase mechanics are **not** what this spike de-risks.)
 - Sign with a **domain-separated namespace**:
  `ssh-keygen -Y sign -f felhom-op -n felhom-op-v1 blob.json` → `blob.json.sig`
  (armored `-----BEGIN SSH SIGNATURE-----`).
 - Baseline verify (CLI sanity) with an allow-list:
  ```
  allowed_signers:  felhom-operator namespaces="felhom-op-v1" ssh-ed25519 AAAAC3...
  $ ssh-keygen -Y verify -f allowed_signers -I felhom-operator -n felhom-op-v1 \
        -s blob.json.sig < blob.json
  Good "felhom-op-v1" signature for felhom-operator with ED25519 key SHA256:y0Lj8dIYTM6...
  ```
 ## 2. Canonical op blob spec (documented)
 The signature covers **these exact bytes**; the operator CLI (also Go) must reproduce them
 byte-for-byte. **Canonical form: JSON, keys sorted lexicographically at every level, no
 insignificant whitespace, no trailing newline, UTF-8.**
 ```json
 {"expires_at":"<RFC3339 UTC>","issued_at":"<RFC3339 UTC>","key_id":"<id>","nonce":"<128-bit hex>","op":"<op>","params":{...},"target":{"guest_id":"<vmid>","host_id":"<node>"}}
 ```
 | field | meaning |
 |---|---|
 | `op` | the operation, e.g. `guest_destroy`, `storage_detach`, `restore_overwrite` |
 | `target.host_id` / `target.guest_id` | the box + guest the op is bound to (anti-retarget) |
 | `params` | op-specific arguments (themselves canonical-sorted) |
 | `nonce` | unique per op (anti-replay); ≥128-bit random |
 | `issued_at` / `expires_at` | validity window (short — minutes) |
 | `key_id` | which operator key (for rotation / audit) |
 Exact test blob (236 bytes): `{"expires_at":"2026-06-09T00:00:00Z","issued_at":"2026-06-08T00:00:00Z","key_id":"felhom-op-1","nonce":"a1b2c3d4e5f60718293a4b5c6d7e8f90","op":"guest_destroy","params":{"purge":true},"target":{"guest_id":"9001","host_id":"demo-felhom"}}`
 > Note: the SSHSIG **namespace** (`felhom-op-v1`) is the cryptographic domain separator and is
 > a **fixed constant in the verifier**, never caller-supplied — a signature minted for any
 > other namespace must not verify (proven: "wrong namespace" rejected).
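
One way to produce this canonical form in Go is to lean on `encoding/json`, which emits map keys in sorted order and no insignificant whitespace when marshaling a map. The sketch below rebuilds the exact test blob. `canonicalOp` is a hypothetical helper name, not the operator CLI's actual code. One caveat: `json.Marshal` escapes `<`, `>`, and `&` by default, so values containing those characters would need an encoder with `SetEscapeHTML(false)` (and its trailing newline trimmed) for the contract to stay byte-exact.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// canonicalOp is a hypothetical helper: it marshals an op as nested maps so that
// encoding/json emits keys in sorted order with no extra whitespace.
func canonicalOp(op, hostID, guestID, nonce, keyID, issuedAt, expiresAt string,
	params map[string]interface{}) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"op":         op,
		"target":     map[string]interface{}{"host_id": hostID, "guest_id": guestID},
		"params":     params,
		"nonce":      nonce,
		"issued_at":  issuedAt,
		"expires_at": expiresAt,
		"key_id":     keyID,
	})
}

func main() {
	blob, err := canonicalOp("guest_destroy", "demo-felhom", "9001",
		"a1b2c3d4e5f60718293a4b5c6d7e8f90", "felhom-op-1",
		"2026-06-08T00:00:00Z", "2026-06-09T00:00:00Z",
		map[string]interface{}{"purge": true})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes\n%s\n", len(blob), blob)
}
```

Passing timestamps as pre-formatted RFC3339 strings, rather than `time.Time` values, keeps the signed bytes independent of how a given Go version renders sub-second precision.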
 ## 3. Go SSHSIG verify — approach + implementation cost
 **It is not a one-call verify, but it is clean — no hand-rolled crypto.** The only manual work
 is SSHSIG *framing*; all crypto and key-type dispatch is the library's. Steps:
 1. `pem.Decode` the armor → `block.Type == "SSH SIGNATURE"`, `block.Bytes` is the binary SSHSIG.
   *(Go's `encoding/pem` parses the armor directly — no manual base64/line handling.)*
 2. Strip the literal 6-byte `SSHSIG` magic preamble (it is **not** length-prefixed).
 3. `ssh.Unmarshal` the rest into a struct `{Version uint32; PublicKey, Namespace, Reserved,
   HashAlgo, Signature string}` — library does the SSH wire parsing.
 4. `ssh.ParsePublicKey([]byte(PublicKey))` → an `ssh.PublicKey`.
 5. Recompute the signed data per spec: `"SSHSIG" || string(namespace) || string(reserved) ||
   string(hash_algorithm) || string(H(message))`, where `H` is the **named** hash
   (`sha256`/`sha512`) — built with one `ssh.Marshal`.
 6. `ssh.Unmarshal([]byte(Signature))` into `ssh.Signature`, then **`pub.Verify(signed, &sig)`** —
   which **dispatches on the key's own algorithm** (this is what makes it key-agnostic).
 **Cost verdict:** ~40 lines of framing in one file, zero crypto implemented by us. Well within
 the agent's budget; **no reason to fall back** to a different primitive.
 ## 4. Anti-replay / authorization layer (on top of signature validity)
 Enforced in `VerifySignedOp` *after* the signature check, each proven to reject **even with a
 valid signature** (Step 3 output above):
 - **replay** — nonce already recorded in the window → reject;
 - **expired / not-yet-valid** — `now ∉ [issued_at, expires_at]` → reject (both sides shown);
 - **retargeted** — `target.host_id`/`guest_id` ≠ this box/guest → reject (both shown).
 (Order matters: armor parse → namespace → allow-list → crypto verify → target → time → nonce, so
 a replayed *but otherwise valid* op is still caught, and an invalid sig never consumes a nonce.)
 ## 5. Key-type-agnosticism — **TRUE DROP-IN** (no box change for FIDO2 later)
 No FIDO2 device was used (by choice). Instead the spike **emulated the authenticator exactly**:
 - Synthesized a well-formed `sk-ssh-ed25519@openssh.com` public key; `ssh.ParsePublicKey` parses
  it and `ssh.MarshalAuthorizedKey` round-trips it.
 - Constructed a real `SSHSIG` whose inner signature follows the sk scheme (per OpenSSH
  `PROTOCOL.u2f`): `ed25519` over `sha256(application) || flags || counter || sha256(signed_data)`,
  with the blob `string(format) string(ed25519_sig) byte(flags) uint32(counter)` — i.e. exactly
  what a FIDO2 key emits.
 - Ran it through the **unchanged `VerifySignedOp`** → **verified** (`op="guest_destroy"`).
 **Verdict: true drop-in.** `pub.Verify` for `sk-ssh-ed25519` is implemented in
 `golang.org/x/crypto/ssh` **v0.52.0** (it reconstructs `appDigest‖flags‖counter‖dataDigest` and
 `ed25519.Verify`s it). Introducing a hardware operator key later is a **no-op on the boxes** —
 the agent's verify code is identical; only the operator's signer key (and the allowed-signers
 set entry) changes. No sk-specific handler is needed.
 > Because verification dispatches on the key type embedded in the signature, the same path also
 > accepts `ssh-ed25519`, `rsa-sha2-*`, `ecdsa-sha2-*`, etc. — algorithm choice is the operator's,
 > not the agent's.
 ## 6. Fallback (not taken) and its cost
 A fallback would be a **raw Ed25519 detached signature** (or `minisign`): trivially one
 `ed25519.Verify` call, no SSHSIG framing. **Rejected** because it **loses the clean FIDO2 path** —
 a raw-Ed25519 verifier cannot consume an `sk-ssh-ed25519` signature (which carries flags+counter
 and a different signed-data construction), so the future hardware swap would require **changing
 the verifier on every box**. SSHSIG buys exactly the key-type-agnosticism (§5) that a raw scheme
 forfeits, at a one-file framing cost (§3). **No fallback is warranted.**
 ## 7. Reference verifier (seed of the agent's verify code)
 Verified working on Go 1.24.4 / `x/crypto` v0.52.0. (Test harness omitted; this is the verify
 core + SSHSIG framing + anti-replay/authz.)
 ```go
 const Namespace = "felhom-op-v1"   // FIXED domain separator, never caller-supplied
 const sshsigMagic = "SSHSIG"
 // json tags are required: encoding/json does not map host_id onto HostID by itself.
 type Target struct {
 	HostID  string `json:"host_id"`
 	GuestID string `json:"guest_id"`
 }
 type OpBlob struct {
 	Op        string          `json:"op"`
 	Target    Target          `json:"target"`
 	Params    json.RawMessage `json:"params"`
 	Nonce     string          `json:"nonce"`
 	IssuedAt  time.Time       `json:"issued_at"`
 	ExpiresAt time.Time       `json:"expires_at"`
 	KeyID     string          `json:"key_id"`
 }
 type NonceStore interface{ SeenOrRecord(nonce string, exp time.Time) bool }
 type sshsigBlob struct {
 	Version                                       uint32
 	PublicKey, Namespace, Reserved, HashAlgo, Signature string
 }
 func hashByName(n string) (hash.Hash, error) {
 	switch n {
 	case "sha256": return sha256.New(), nil
 	case "sha512": return sha512.New(), nil
 	}
 	return nil, fmt.Errorf("unsupported SSHSIG hash %q", n)
 }
 func parseArmoredSSHSIG(armored []byte) (*sshsigBlob, error) {
 	block, _ := pem.Decode(armored)
 	if block == nil || block.Type != "SSH SIGNATURE" {
 		return nil, errors.New("not an SSH SIGNATURE armor")
 	}
 	if len(block.Bytes) < 6 || string(block.Bytes[:6]) != sshsigMagic {
 		return nil, errors.New("missing SSHSIG magic")
 	}
 	var sb sshsigBlob
 	if err := ssh.Unmarshal(block.Bytes[6:], &sb); err != nil { return nil, err }
 	if sb.Version != 1 { return nil, fmt.Errorf("bad version %d", sb.Version) }
 	return &sb, nil
 }
 func signedData(sb *sshsigBlob, msg []byte) ([]byte, error) {
 	h, err := hashByName(sb.HashAlgo); if err != nil { return nil, err }
 	h.Write(msg); md := h.Sum(nil)
 	body := ssh.Marshal(struct{ Namespace, Reserved, HashAlgo string; Hash []byte }{
 		sb.Namespace, sb.Reserved, sb.HashAlgo, md})
 	return append([]byte(sshsigMagic), body...), nil
 }
 // VerifySignedOp: key-type-agnostic signature verify + anti-replay/authorization.
 // allowedSigners is the trusted operator set (one key now; a quorum set later).
 func VerifySignedOp(blob, sigArmored []byte, allowedSigners []ssh.PublicKey,
 	thisHostID, thisGuestID string, seenNonces NonceStore) (string, error) {
 	sb, err := parseArmoredSSHSIG(sigArmored)
 	if err != nil { return "", err }
 	if sb.Namespace != Namespace {
 		return "", fmt.Errorf("namespace mismatch: got %q want %q", sb.Namespace, Namespace)
 	}
 	pub, err := ssh.ParsePublicKey([]byte(sb.PublicKey))
 	if err != nil { return "", err }
 	allowed := false
 	for _, a := range allowedSigners {
 		if bytes.Equal(a.Marshal(), pub.Marshal()) { allowed = true; break }
 	}
 	if !allowed { return "", errors.New("signer not in allowed set") }
 	signed, err := signedData(sb, blob)
 	if err != nil { return "", err }
 	var inner ssh.Signature
 	if err := ssh.Unmarshal([]byte(sb.Signature), &inner); err != nil { return "", err }
 	if err := pub.Verify(signed, &inner); err != nil {     // dispatches on key algorithm
 		return "", fmt.Errorf("signature invalid: %w", err)
 	}
 	var op OpBlob
 	if err := json.Unmarshal(blob, &op); err != nil { return "", err }
 	if op.Target.HostID != thisHostID || op.Target.GuestID != thisGuestID {
 		return "", fmt.Errorf("target mismatch: blob=%s/%s this=%s/%s",
 			op.Target.HostID, op.Target.GuestID, thisHostID, thisGuestID)
 	}
 	now := time.Now().UTC()
 	if now.Before(op.IssuedAt) { return "", errors.New("not yet valid") }
 	if now.After(op.ExpiresAt) { return "", errors.New("expired") }
 	if seenNonces.SeenOrRecord(op.Nonce, op.ExpiresAt) {
 		return "", fmt.Errorf("replay: nonce %s already seen", op.Nonce)
 	}
 	return op.Op, nil
 }
 ```
 ## 8. Inputs to the design doc (`04-control-plane-authorization.md`)
 - **Primitive confirmed:** SSHSIG (`ssh-keygen -Y sign` / armored `BEGIN SSH SIGNATURE`),
  verified in Go via `pem.Decode` + `ssh.Unmarshal` + `ssh.ParsePublicKey` + `pub.Verify`. Low
  implementation cost; no crypto hand-rolled.
 - **Hub cannot forge:** the operator private key never touches the hub; the hub only queues the
  opaque armored blob (matches `03` §4).
 - **Key-type-agnostic / hardware-ready:** software `ed25519` now, FIDO2 `sk-ssh-ed25519` later is
  a **box no-op** (proven end-to-end). The verifier hardcodes neither key type nor algorithm.
 - **`allowedSigners` is a set:** single signer today; **threshold/quorum is just set sizing** plus
  an N-of-M policy on top (out of scope here).
 - **Anti-replay/authz are mandatory and cheap:** namespace (fixed), allow-list, then crypto,
  then target-binding, time-window, nonce — all enforced and tested.
 - **Canonical blob (§2)** is the shared contract between the operator CLI and the agent verifier.
@@ -1,5 +1,33 @@
 # Felhom Hub — Changelog
 ## Repo docs — no hub version change (2026-06-08)
 ### Changed
 - **Reflowed `felhom.eu/CLAUDE.md`** — removed hard mid-paragraph line wraps (prose, list items, blockquotes now single-line); tables untouched; rendered output unchanged.
 - **Unified the REPORT/CHANGELOG convention**: this repo's `REPORT.md` switches from *append/cumulative* to **overwrite-latest** (uniform with the sibling repos); `CHANGELOG.md` (this file) stays the cumulative log, newest on top. Updated `REPORT.md`'s header note accordingly (existing sections retained as history). Added an explicit **no-secrets** rule. No hub code change → no version bump.
 ## v0.7.1 (2026-06-08)
 ### Changed
 - **`/host-report` rejects oversize bodies explicitly with 413** (`handler.go`) instead of silently truncating at the 4 MiB `LimitReader` cap. Reads one byte past `maxHostReportBytes` and returns `413 Payload too large` — a truncated-but-valid JSON could otherwise be accepted as a partial report (silently dropping guests from the mirror). The controller `handleReport` 1 MiB path is **unchanged** (frozen until slice-10 cutover).
 ### Added
 - **Cross-repo contract fixture** `hub/internal/api/testdata/host-report.golden.json` (byte-identical with felhom-agent's copy) + `TestHostReport_GoldenContract` — POSTs the golden through the real `handleHostReport` and asserts 200 + denorm (`guest_total`/`guest_running`/`cloudflared_status`) + both guests upserted, proving `hostReportPayload` still extracts the contract from the real shape. Duplicated contract (no shared types module yet); revisit at slices 5/6.
 ## v0.7.0 (2026-06-08)
 ### Added — host-domain ingest (slice 3, additive; controller path untouched)
 - **New tables** `hosts`, `guests`, `host_reports` (`store.go migrate()`, idempotent). Full schema now, including columns **inert until slice 10** (`hosts.desired_json`/`desired_generation`/`dr_record_json`, `guests.api_key`/`desired_spec_json`) so the cutover needs no `ALTER`. Nothing reads/writes the inert columns this slice.
 - **`POST /api/v1/host-report`** — the agent's heartbeat. Per-host Bearer auth; 4 MiB body; persists the full report + denormalized fields (cpu/mem/disk %, guest counts, cloudflared status); upserts each guest's **reality** columns (`guest_id = "<host_id>/<vmid>"`, hub-derived); returns the control envelope `{status, poll_interval_seconds:900, blocked, desired_generation:0, has_signed_ops:false}` (`blocked` reflects the customer's status; the latter two are reserved/placeholder for slice 4).
 - **Per-host key auth** — `checkAuthHost` (Bearer → host → customer), added alongside the unchanged `checkAuthCustomer`. Global key remains a bootstrap fallback.
 - **`POST /api/v1/admin/hosts`** — **PROVISIONAL** global-key-only host mint (host_id + per-host api_key); the slice-3 bootstrap until enrollment (slices 7–8) replaces it.
 - **Host dead-man's-switch** — `monitor.HostStalenessChecker` over `host_reports`, emitting `host_stale`/`host_down`/`host_recovered` (30m/60m), attributed to the host's customer; registered in `allowedEventTypes`; wired in `cmd/hub/main.go` on the existing 60s ticker. A deliberate **sibling** of the controller `StalenessChecker` (both run until slice 10).
 - **Store methods**: `GetHostByAPIKey`, `GetHost`, `ListHosts`, `UpsertHost`, `SaveHostReport`, `UpsertGuestFromReport` (preserves inert columns on conflict), `GetHostStaleness` (skips never-reported hosts), `GuestID`. `Prune` now also prunes `host_reports` (same retention).
 - **Tests** (new, hermetic): store, auth (`checkAuthHost`), ingest (valid+envelope+denorm, host_id mismatch→403, unknown-host-under-global→400, blocked→true, oversize→400), admin mint (non-global→403, unknown customer→400, mint+round-trip), host staleness transitions.
 ### Unchanged (explicit)
 - The controller path — `/api/v1/report`, `reports`, `customer_configs`, `checkAuthCustomer`, the existing staleness/deadline checkers — is untouched and still green. The old controller and the new agent report in parallel during slices 3–9; the schema/auth cutover is **slice 10**.
 ## v0.6.2 (2026-02-26)
 ### Added
@@ -206,6 +206,9 @@ func main() {
 	// Staleness checker — runs every 60s
 	stalenessChecker := monitor.NewStalenessChecker(dataStore, staleThreshold, dispatcher.ProcessEvent, logger)
 	// v0.7.0: host-domain dead-man's-switch (sibling; the controller checker above is
 	// unchanged and keeps running until the slice-10 cutover). Same 60s cadence.
 	hostStalenessChecker := monitor.NewHostStalenessChecker(dataStore, staleThreshold, dispatcher.ProcessEvent, logger)
 	go func() {
 		ticker := time.NewTicker(60 * time.Second)
 		defer ticker.Stop()
@@ -215,6 +218,7 @@ func main() {
 				return
 			case <-ticker.C:
 				stalenessChecker.Check()
 				hostStalenessChecker.Check()
 			}
 		}
 	}()
@@ -89,6 +89,30 @@ func (h *Handler) checkAuthCustomer(r *http.Request) (customerID string, isGloba
 	return cfg.CustomerID, false, true
 }
 // checkAuthHost resolves a Bearer token to a HOST identity (the agent's auth
 // path). It is a sibling of checkAuthCustomer — the controller path is unchanged.
 //   - global key  -> ("", "", true, true)    caller trusts body.host_id
 //   - per-host key -> (hostID, customerID, false, true)
 //   - failure      -> ("", "", false, false)
 func (h *Handler) checkAuthHost(r *http.Request) (hostID, customerID string, isGlobal, ok bool) {
 	auth := r.Header.Get("Authorization")
 	if !strings.HasPrefix(auth, "Bearer ") {
 		return "", "", false, false
 	}
 	token := strings.TrimPrefix(auth, "Bearer ")
 	// Global key first (same constant-time compare as checkAuthCustomer).
 	if h.apiKey != "" && subtle.ConstantTimeCompare([]byte(token), []byte(h.apiKey)) == 1 {
 		return "", "", true, true
 	}
 	host, err := h.store.GetHostByAPIKey(token)
 	if err != nil || host == nil {
 		return "", "", false, false
 	}
 	return host.HostID, host.CustomerID, false, true
 }
 // ServeHTTP routes API requests.
 func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	path := strings.TrimPrefix(r.URL.Path, "/api/v1")
@@ -96,6 +120,10 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	switch {
 	case r.Method == http.MethodPost && path == "/report":
 		h.handleReport(w, r)
 	case r.Method == http.MethodPost && path == "/host-report":
 		h.handleHostReport(w, r)
 	case r.Method == http.MethodPost && path == "/admin/hosts":
 		h.handleAdminCreateHost(w, r)
 	case r.Method == http.MethodPost && path == "/event":
 		h.handleEvent(w, r)
 	case r.Method == http.MethodPost && path == "/notify":
@@ -194,6 +222,203 @@ func (h *Handler) handleReport(w http.ResponseWriter, r *http.Request) {
 	json.NewEncoder(w).Encode(resp)
 }
 // defaultHostPollSeconds is the cadence the hub hands every agent this slice (no
 // per-host override UI yet — that is a later slice).
 const defaultHostPollSeconds = 900
 // maxHostReportBytes bounds a host-report body. Larger than the controller path's
 // 1 MiB because host reports carry the full guest list + (later) storage/backup
 // arrays. We read one byte past it and reject explicitly (413) rather than letting
 // LimitReader silently truncate — a truncated-but-valid JSON would otherwise be
 // accepted as a partial report, dropping guests from the mirror.
 const maxHostReportBytes = 4 << 20 // 4 MiB
 // hostReportPayload is the subset of the agent host-report (slice-3 contract,
 // §3 / agent spec §4) the hub needs for denorm + guest reality. Unknown fields
 // (storage_targets/backups/restore_tests/pbs_snapshots/audit_tail) are ignored,
 // so an empty or absent collection is accepted without error.
 type hostReportPayload struct {
 	HostID       string `json:"host_id"`
 	AgentVersion string `json:"agent_version"`
 	Host         struct {
 		CPUPercent    float64 `json:"cpu_percent"`
 		MemoryPercent float64 `json:"memory_percent"`
 		DiskPercent   float64 `json:"disk_percent"`
 	} `json:"host"`
 	Guests []struct {
 		VMID              int    `json:"vmid"`
 		Name              string `json:"name"`
 		Status            string `json:"status"`
 		ControllerVersion string `json:"controller_version"`
 	} `json:"guests"`
 	Cloudflared struct {
 		Status string `json:"status"`
 	} `json:"cloudflared"`
 }
 // handleHostReport ingests the agent's host-report (the heartbeat) and returns the
 // control envelope (agent spec §5).
 func (h *Handler) handleHostReport(w http.ResponseWriter, r *http.Request) {
 	hostID, custID, isGlobal, ok := h.checkAuthHost(r)
 	if !ok {
 		http.Error(w, "Unauthorized", http.StatusUnauthorized)
 		return
 	}
 	body, err := io.ReadAll(io.LimitReader(r.Body, maxHostReportBytes+1))
 	if err != nil {
 		http.Error(w, "Bad request", http.StatusBadRequest)
 		return
 	}
 	if len(body) > maxHostReportBytes {
 		http.Error(w, "Payload too large", http.StatusRequestEntityTooLarge)
 		return
 	}
 	var rep hostReportPayload
 	if err := json.Unmarshal(body, &rep); err != nil || rep.HostID == "" {
 		http.Error(w, "Invalid payload: host_id required", http.StatusBadRequest)
 		return
 	}
 	if isGlobal {
 		// Global-key bootstrap: trust body.host_id but require the host to exist
 		// (it must be minted first) and resolve its customer from the row.
 		host, err := h.store.GetHost(rep.HostID)
 		if err != nil {
 			h.logger.Printf("[ERROR] host lookup failed for %s: %v", rep.HostID, err)
 			http.Error(w, "Internal error", http.StatusInternalServerError)
 			return
 		}
 		if host == nil {
 			http.Error(w, "Unknown host_id (mint via /admin/hosts first)", http.StatusBadRequest)
 			return
 		}
 		hostID, custID = rep.HostID, host.CustomerID
 	} else if rep.HostID != hostID {
 		http.Error(w, "Forbidden: host_id mismatch", http.StatusForbidden)
 		return
 	}
 	running := 0
 	for _, g := range rep.Guests {
 		if g.Status == "running" {
 			running++
 		}
 	}
 	denorm := store.HostReportDenorm{
 		AgentVersion:      rep.AgentVersion,
 		CPUPercent:        rep.Host.CPUPercent,
 		MemoryPercent:     rep.Host.MemoryPercent,
 		DiskPercent:       rep.Host.DiskPercent,
 		GuestTotal:        len(rep.Guests),
 		GuestRunning:      running,
 		CloudflaredStatus: rep.Cloudflared.Status,
 	}
 	if err := h.store.SaveHostReport(hostID, custID, body, denorm); err != nil {
 		h.logger.Printf("[ERROR] Failed to save host-report from %s: %v", hostID, err)
 		http.Error(w, "Internal error", http.StatusInternalServerError)
 		return
 	}
 	for _, g := range rep.Guests {
 		status := g.Status
 		if status == "" {
 			status = "unknown"
 		}
 		guest := &store.Guest{
 			GuestID:           store.GuestID(hostID, g.VMID),
 			CustomerID:        custID,
 			HostID:            hostID,
 			VMID:              g.VMID,
 			DisplayName:       g.Name,
 			Status:            status,
 			ControllerVersion: g.ControllerVersion,
 		}
 		if err := h.store.UpsertGuestFromReport(guest); err != nil {
 			// A guest upsert failure must not drop the whole report (liveness).
 			h.logger.Printf("[WARN] Failed to upsert guest %s: %v", guest.GuestID, err)
 		}
 	}
 	h.logger.Printf("[INFO] host-report from %s (%d guests, %d bytes)", hostID, len(rep.Guests), len(body))
 	blocked := false
 	if cc, err := h.store.GetCustomerConfig(custID); err == nil && cc != nil && cc.Status == "blocked" {
 		blocked = true
 	}
 	resp := map[string]interface{}{
 		"status":                "ok",
 		"poll_interval_seconds": defaultHostPollSeconds,
 		"blocked":               blocked,
 		"desired_generation":    0,     // reserved (slice 4)
 		"has_signed_ops":        false, // reserved (slice 4)
 	}
 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
 	json.NewEncoder(w).Encode(resp)
 }
 // handleAdminCreateHost mints a host identity (host_id + per-host api_key).
 //
 // PROVISIONAL (slice-3 bootstrap): global-key only, so the demo agent can
 // authenticate before enrollment (slices 7–8) exists. Enrollment will mint host
 // identity + pin signing keys; this endpoint should be removed/locked down then
 // (tracked under doc 05 §11 auth-tightening at cutover).
 func (h *Handler) handleAdminCreateHost(w http.ResponseWriter, r *http.Request) {
 	_, _, isGlobal, ok := h.checkAuthHost(r)
 	if !ok || !isGlobal {
 		http.Error(w, "Forbidden: global key required", http.StatusForbidden)
 		return
 	}
 	body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
 	if err != nil {
 		http.Error(w, "Bad request", http.StatusBadRequest)
 		return
 	}
 	var req struct {
 		CustomerID  string `json:"customer_id"`
 		HostID      string `json:"host_id"`
 		DisplayName string `json:"display_name"`
 	}
 	if err := json.Unmarshal(body, &req); err != nil || req.CustomerID == "" {
 		http.Error(w, "Invalid payload: customer_id required", http.StatusBadRequest)
 		return
 	}
 	cc, err := h.store.GetCustomerConfig(req.CustomerID)
 	if err != nil {
 		http.Error(w, "Internal error", http.StatusInternalServerError)
 		return
 	}
 	if cc == nil {
 		http.Error(w, "Unknown customer_id", http.StatusBadRequest)
 		return
 	}
 	hostID := req.HostID
 	if hostID == "" {
 		sfx, err := configgen.RandomHex(3) // 6 hex chars — human-legible for the demo
 		if err != nil {
 			http.Error(w, "Internal error", http.StatusInternalServerError)
 			return
 		}
 		hostID = req.CustomerID + "-" + sfx
 	}
 	apiKey, err := configgen.RandomHex(32)
 	if err != nil {
 		http.Error(w, "Internal error", http.StatusInternalServerError)
 		return
 	}
 	if err := h.store.UpsertHost(&store.Host{HostID: hostID, CustomerID: req.CustomerID, APIKey: apiKey}); err != nil {
 		h.logger.Printf("[ERROR] Failed to mint host for %s: %v", req.CustomerID, err)
 		http.Error(w, "Internal error", http.StatusInternalServerError)
 		return
 	}
 	h.logger.Printf("[INFO] provisional host mint: %s (customer %s)", hostID, req.CustomerID)
 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusCreated)
 	json.NewEncoder(w).Encode(map[string]string{"host_id": hostID, "api_key": apiKey})
 }
 // allowedEventTypes lists all valid event_type values the Hub accepts.
 var allowedEventTypes = map[string]bool{
 	// Controller-pushed events
@@ -219,11 +444,15 @@ var allowedEventTypes = map[string]bool{
 	"disaster_recovery_started":   true,
 	"disaster_recovery_completed": true,
 	// Hub-generated events
-	"node_stale":              true,
-	"node_down":               true,
-	"node_recovered":          true,
-	"expected_backup_missed":  true,
-	"expected_dbdump_missed":  true,
+	"node_stale":     true,
+	"node_down":      true,
+	"node_recovered": true,
+	// Hub-generated host-domain events (v0.7.0, slice 3)
+	"host_stale":             true,
 	"host_down":              true,
 	"host_recovered":         true,
 	"expected_backup_missed": true,
 	"expected_dbdump_missed": true,
 	// Special
 	"test": true,
 }
@@ -686,10 +915,10 @@ func (h *Handler) handleRecovery(w http.ResponseWriter, r *http.Request, custome
 	}
 	resp := struct {
-		CustomerID     string                    `json:"customer_id"`
+		CustomerID     string                     `json:"customer_id"`
-		ConfigYAML     string                    `json:"config_yaml"`
+		ConfigYAML     string                     `json:"config_yaml"`
-		InfraBackup    json.RawMessage           `json:"infra_backup"`
+		InfraBackup    json.RawMessage            `json:"infra_backup"`
-		HasInfraBackup bool                      `json:"has_infra_backup"`
+		HasInfraBackup bool                       `json:"has_infra_backup"`
 		BackupVersions []store.InfraBackupVersion `json:"backup_versions,omitempty"`
 	}{
 		CustomerID:     customerID,
@@ -0,0 +1,232 @@
 package api
 import (
 	"database/sql"
 	"encoding/json"
 	"io"
 	"log"
 	"net/http"
 	"net/http/httptest"
 	"os"
 	"path/filepath"
 	"strings"
 	"testing"
 	"gitea.dooplex.hu/admin/felhom-hub/internal/store"
 	_ "modernc.org/sqlite"
 )
 const globalKey = "GLOBALKEY"
 func newTestHandler(t *testing.T) (*Handler, *store.Store, string) {
 	t.Helper()
 	path := filepath.Join(t.TempDir(), "test.db")
 	st, err := store.New(path, log.New(io.Discard, "", 0))
 	if err != nil {
 		t.Fatalf("store.New: %v", err)
 	}
 	t.Cleanup(func() { st.Close() })
 	h := New(st, globalKey, "", "", nil, log.New(io.Discard, "", 0))
 	return h, st, path
 }
 func do(h *Handler, method, path, bearer, body string) *httptest.ResponseRecorder {
 	req := httptest.NewRequest(method, "/api/v1"+path, strings.NewReader(body))
 	if bearer != "" {
 		req.Header.Set("Authorization", "Bearer "+bearer)
 	}
 	rr := httptest.NewRecorder()
 	h.ServeHTTP(rr, req)
 	return rr
 }
 func TestCheckAuthHost(t *testing.T) {
 	h, st, _ := newTestHandler(t)
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "HKEY"})
 	mk := func(bearer string) *http.Request {
 		req := httptest.NewRequest(http.MethodPost, "/api/v1/host-report", nil)
 		if bearer != "" {
 			req.Header.Set("Authorization", "Bearer "+bearer)
 		}
 		return req
 	}
 	if _, _, isGlobal, ok := h.checkAuthHost(mk(globalKey)); !ok || !isGlobal {
 		t.Error("global key should resolve isGlobal=true")
 	}
 	hostID, custID, isGlobal, ok := h.checkAuthHost(mk("HKEY"))
 	if !ok || isGlobal || hostID != "h1" || custID != "c1" {
 		t.Errorf("per-host key = %q/%q global=%v ok=%v", hostID, custID, isGlobal, ok)
 	}
 	if _, _, _, ok := h.checkAuthHost(mk("bogus")); ok {
 		t.Error("unknown key should fail")
 	}
 }
 func validReportBody(hostID string) string {
 	return `{"host_id":"` + hostID + `","agent_version":"0.3.0",` +
 		`"host":{"cpu_percent":3.2,"memory_percent":25,"disk_percent":19,"loadavg":["0.1"],"uptime_seconds":100},` +
 		`"guests":[{"vmid":100,"name":"acme","status":"running","controller_version":""},` +
 		`{"vmid":101,"name":"beta","status":"stopped"}],` +
 		`"storage_targets":[],"backups":[],"cloudflared":{"status":"active"},"audit_tail":[]}`
 }
 func TestHandleHostReport_ValidAndEnvelopeAndDenorm(t *testing.T) {
 	h, st, dbPath := newTestHandler(t)
 	st.SaveCustomerConfig(&store.CustomerConfig{CustomerID: "c1", APIKey: "ckey", RetrievalPassword: "p"})
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "HKEY"})
 	rr := do(h, http.MethodPost, "/host-report", "HKEY", validReportBody("h1"))
 	if rr.Code != 200 {
 		t.Fatalf("status = %d, body=%s", rr.Code, rr.Body.String())
 	}
 	var env struct {
 		Status              string `json:"status"`
 		PollIntervalSeconds int    `json:"poll_interval_seconds"`
 		Blocked             bool   `json:"blocked"`
 		DesiredGeneration   int    `json:"desired_generation"`
 		HasSignedOps        bool   `json:"has_signed_ops"`
 	}
 	if err := json.Unmarshal(rr.Body.Bytes(), &env); err != nil {
 		t.Fatalf("decode envelope: %v (body=%s)", err, rr.Body.String())
 	}
 	if env.Status != "ok" || env.PollIntervalSeconds != 900 || env.Blocked || env.DesiredGeneration != 0 || env.HasSignedOps {
 		t.Errorf("envelope = %+v", env)
 	}
 	// Denorm: guest_running counts only "running" (1 of 2). Read via a 2nd connection.
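 	db, err := sql.Open("sqlite", dbPath)
 	if err != nil {
 		t.Fatalf("open second connection: %v", err)
 	}
 	defer db.Close()
 	var total, running int
 	var cf string
 	if err := db.QueryRow(`SELECT guest_total, guest_running, cloudflared_status FROM host_reports WHERE host_id='h1' ORDER BY id DESC LIMIT 1`).
 		Scan(&total, &running, &cf); err != nil {
 		t.Fatalf("read host_reports denorm: %v", err)
 	}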
 	if total != 2 || running != 1 || cf != "active" {
 		t.Errorf("denorm total=%d running=%d cloudflared=%q (want 2,1,active)", total, running, cf)
 	}
 	// Guests upserted.
 	var gname, gstatus string
 	if err := db.QueryRow(`SELECT display_name, status FROM guests WHERE guest_id='h1/100'`).Scan(&gname, &gstatus); err != nil {
 		t.Fatalf("guest h1/100 not upserted: %v", err)
 	}
 	if gname != "acme" || gstatus != "running" {
 		t.Errorf("guest = %q/%q", gname, gstatus)
 	}
 }
 func TestHandleHostReport_HostIDMismatch(t *testing.T) {
 	h, st, _ := newTestHandler(t)
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "HKEY"})
 	rr := do(h, http.MethodPost, "/host-report", "HKEY", validReportBody("other-host"))
 	if rr.Code != http.StatusForbidden {
 		t.Errorf("status = %d, want 403", rr.Code)
 	}
 }
 func TestHandleHostReport_UnknownHostUnderGlobalKey(t *testing.T) {
 	h, _, _ := newTestHandler(t)
 	rr := do(h, http.MethodPost, "/host-report", globalKey, validReportBody("ghost"))
 	if rr.Code != http.StatusBadRequest {
 		t.Errorf("status = %d, want 400 (unknown host_id)", rr.Code)
 	}
 }
 func TestHandleHostReport_BlockedCustomer(t *testing.T) {
 	h, st, _ := newTestHandler(t)
 	st.SaveCustomerConfig(&store.CustomerConfig{CustomerID: "c1", APIKey: "ckey", RetrievalPassword: "p"})
 	st.SetCustomerConfigStatus("c1", "blocked")
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "HKEY"})
 	rr := do(h, http.MethodPost, "/host-report", "HKEY", validReportBody("h1"))
 	if rr.Code != 200 {
 		t.Fatalf("status = %d", rr.Code)
 	}
 	var env struct {
 		Blocked bool `json:"blocked"`
 	}
 	if err := json.Unmarshal(rr.Body.Bytes(), &env); err != nil {
 		t.Fatalf("decode envelope: %v (body=%s)", err, rr.Body.String())
 	}
 	if !env.Blocked {
 		t.Error("blocked customer should yield blocked:true")
 	}
 }
 func TestHandleHostReport_OversizeRejected(t *testing.T) {
 	h, st, _ := newTestHandler(t)
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "HKEY"})
 	big := `{"host_id":"h1","guests":[{"vmid":1,"name":"` + strings.Repeat("a", 5<<20) + `"}]}`
 	rr := do(h, http.MethodPost, "/host-report", "HKEY", big)
 	if rr.Code != http.StatusRequestEntityTooLarge {
 		t.Errorf("oversize body status = %d, want 413", rr.Code)
 	}
 }
 func TestAdminCreateHost(t *testing.T) {
 	h, st, _ := newTestHandler(t)
 	st.SaveCustomerConfig(&store.CustomerConfig{CustomerID: "c1", APIKey: "ckey", RetrievalPassword: "p"})
 	// non-global key (per-customer) → 403
 	if rr := do(h, http.MethodPost, "/admin/hosts", "ckey", `{"customer_id":"c1"}`); rr.Code != http.StatusForbidden {
 		t.Errorf("per-customer key status = %d, want 403", rr.Code)
 	}
 	// missing/unknown customer → 400
 	if rr := do(h, http.MethodPost, "/admin/hosts", globalKey, `{"customer_id":"nope"}`); rr.Code != http.StatusBadRequest {
 		t.Errorf("unknown customer status = %d, want 400", rr.Code)
 	}
 	// success → 201 + usable key (round-trip)
 	rr := do(h, http.MethodPost, "/admin/hosts", globalKey, `{"customer_id":"c1"}`)
 	if rr.Code != http.StatusCreated {
 		t.Fatalf("mint status = %d, body=%s", rr.Code, rr.Body.String())
 	}
 	var minted struct {
 		HostID string `json:"host_id"`
 		APIKey string `json:"api_key"`
 	}
 	if err := json.Unmarshal(rr.Body.Bytes(), &minted); err != nil {
 		t.Fatalf("decode mint response: %v (body=%s)", err, rr.Body.String())
 	}
 	if minted.HostID == "" || minted.APIKey == "" {
 		t.Fatalf("mint response = %+v", minted)
 	}
 	// the minted key authenticates a host-report
 	rr2 := do(h, http.MethodPost, "/host-report", minted.APIKey, validReportBody(minted.HostID))
 	if rr2.Code != 200 {
 		t.Errorf("round-trip host-report with minted key = %d, body=%s", rr2.Code, rr2.Body.String())
 	}
 }
 // TestHostReport_GoldenContract drives the real handler with the shared golden
 // host-report and proves hostReportPayload still extracts what it needs from the
 // real wire shape (denorm + guest upsert).
 //
 // testdata/host-report.golden.json MUST be kept byte-identical with felhom-agent's
 // internal/hub/testdata/host-report.golden.json — it is a duplicated contract until
 // a shared types module exists (revisit when slices 5/6 add real fields).
 func TestHostReport_GoldenContract(t *testing.T) {
 	h, st, dbPath := newTestHandler(t)
 	st.SaveCustomerConfig(&store.CustomerConfig{CustomerID: "c1", APIKey: "ckey", RetrievalPassword: "p"})
 	st.UpsertHost(&store.Host{HostID: "demo-host-01", CustomerID: "c1", APIKey: "HKEY"})
 	golden, err := os.ReadFile("testdata/host-report.golden.json")
 	if err != nil {
 		t.Fatal(err)
 	}
 	rr := do(h, http.MethodPost, "/host-report", "HKEY", string(golden))
 	if rr.Code != 200 {
 		t.Fatalf("golden report status = %d, body=%s", rr.Code, rr.Body.String())
 	}
 	db, _ := sql.Open("sqlite", dbPath)
 	defer db.Close()
 	var total, running int
 	var cf string
 	if err := db.QueryRow(`SELECT guest_total, guest_running, cloudflared_status FROM host_reports WHERE host_id='demo-host-01' ORDER BY id DESC LIMIT 1`).
 		Scan(&total, &running, &cf); err != nil {
 		t.Fatal(err)
 	}
 	if total != 2 || running != 1 || cf != "active" {
 		t.Errorf("denorm total=%d running=%d cloudflared=%q (want 2,1,active)", total, running, cf)
 	}
 	var guestCount int
 	if err := db.QueryRow(`SELECT COUNT(*) FROM guests WHERE host_id='demo-host-01'`).Scan(&guestCount); err != nil {
 		t.Fatal(err)
 	}
 	if guestCount != 2 {
 		t.Errorf("guests upserted = %d, want 2", guestCount)
 	}
 }
@@ -0,0 +1,38 @@
 {
  "host_id": "demo-host-01",
  "reported_at": "2026-06-08T12:00:00Z",
  "agent_version": "0.3.1",
  "host": {
    "node": "demo-felhom",
    "cpu_percent": 3.2,
    "memory_total_bytes": 16777216000,
    "memory_used_bytes": 4194304000,
    "memory_percent": 25,
    "disk_total_bytes": 152000000000,
    "disk_used_bytes": 30000000000,
    "disk_percent": 19.7,
    "loadavg": ["0.10", "0.20", "0.15"],
    "uptime_seconds": 86400
  },
  "guests": [
    {
      "vmid": 100,
      "name": "felhom-cust-acme",
      "status": "running",
      "controller_version": "",
      "spec": { "cores": 2, "memory_bytes": 2147483648, "disk_bytes": 21474836480 }
    },
    {
      "vmid": 101,
      "name": "felhom-cust-beta",
      "status": "stopped",
      "controller_version": ""
    }
  ],
  "storage_targets": [],
  "backups": [],
  "restore_tests": [],
  "pbs_snapshots": [],
  "cloudflared": { "status": "active" },
  "audit_tail": []
 }
@@ -0,0 +1,176 @@
 package monitor
 import (
 	"log"
 	"sync"
 	"time"
 	"gitea.dooplex.hu/admin/felhom-hub/internal/store"
 )
 // HostStalenessChecker is the host-domain dead-man's-switch (v0.7.0, slice 3). It
 // is a deliberate SIBLING of StalenessChecker, not a rename: during slices 3–9 the
 // controller report stream (reports) and the agent host-report stream
 // (host_reports) are both live, so both checkers run. It keys on host↔host_reports
 // and emits host_stale / host_down / host_recovered. Merging is a slice-10 job.
 //
 // Events are attributed to the host's CUSTOMER (SaveEvent + onEvent take the
 // customer_id) so the existing per-customer notification/event UX picks them up
 // unchanged.
 type HostStalenessChecker struct {
 	store     *store.Store
 	threshold time.Duration // "stale" after this (default 30m — same as the controller checker)
 	downAfter time.Duration // "down" after this (2x threshold)
 	logger    *log.Logger
 	onEvent   EventNotifyFunc
 	mu            sync.Mutex
 	states        map[string]string    // hostID → "ok" | "stale" | "down"
 	customerOf    map[string]string    // hostID → customerID (for event attribution)
 	downtimeStart map[string]time.Time // hostID → when it first became unreachable
 }
 // NewHostStalenessChecker creates the checker and seeds state from current
 // host-report recency. No events are generated during initialization.
 func NewHostStalenessChecker(s *store.Store, threshold time.Duration, onEvent EventNotifyFunc, logger *log.Logger) *HostStalenessChecker {
 	sc := &HostStalenessChecker{
 		store:         s,
 		threshold:     threshold,
 		downAfter:     2 * threshold,
 		logger:        logger,
 		onEvent:       onEvent,
 		states:        make(map[string]string),
 		customerOf:    make(map[string]string),
 		downtimeStart: make(map[string]time.Time),
 	}
 	rows, err := s.GetHostStaleness()
 	if err != nil {
 		logger.Printf("[WARN] Host staleness checker: failed to seed states: %v", err)
 		return sc
 	}
 	var okCount, staleCount, downCount int
 	for _, row := range rows {
 		if s.IsCustomerBlocked(row.CustomerID) {
 			continue
 		}
 		sc.customerOf[row.HostID] = row.CustomerID
 		age := time.Since(row.LastReportAt)
 		switch {
 		case age > sc.downAfter:
 			sc.states[row.HostID] = "down"
 			downCount++
 		case age > sc.threshold:
 			sc.states[row.HostID] = "stale"
 			staleCount++
 		default:
 			sc.states[row.HostID] = "ok"
 			okCount++
 		}
 	}
 	logger.Printf("[INFO] Host staleness checker initialized: %d ok, %d stale, %d down", okCount, staleCount, downCount)
 	return sc
 }
 // Check evaluates all hosts and emits events on state transitions. Call every 60s.
 func (sc *HostStalenessChecker) Check() {
 	rows, err := sc.store.GetHostStaleness()
 	if err != nil {
 		sc.logger.Printf("[WARN] Host staleness check failed: %v", err)
 		return
 	}
 	sc.mu.Lock()
 	defer sc.mu.Unlock()
 	seen := make(map[string]bool, len(rows))
 	for _, row := range rows {
 		seen[row.HostID] = true
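 		if sc.store.IsCustomerBlocked(row.CustomerID) {
 			// Drop all tracking for blocked customers so a stale downtime
 			// start cannot leak into a later host_recovered message.
 			delete(sc.states, row.HostID)
 			delete(sc.downtimeStart, row.HostID)
 			continue
 		}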
 		sc.customerOf[row.HostID] = row.CustomerID
 		age := time.Since(row.LastReportAt)
 		var newState string
 		switch {
 		case age > sc.downAfter:
 			newState = "down"
 		case age > sc.threshold:
 			newState = "stale"
 		default:
 			newState = "ok"
 		}
 		oldState := sc.states[row.HostID]
 		if oldState == "" {
 			sc.states[row.HostID] = newState // first observation — no event
 			continue
 		}
 		if oldState == newState {
 			continue
 		}
 		sc.states[row.HostID] = newState
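 		// Start the downtime clock whenever a host leaves "ok", including a
 		// direct ok → down jump, so host_recovered reports the real outage length.
 		if oldState == "ok" {
 			sc.downtimeStart[row.HostID] = time.Now()
 		}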
 		downtimeDur := age
 		if newState == "ok" {
 			if t, ok := sc.downtimeStart[row.HostID]; ok {
 				downtimeDur = time.Since(t)
 			}
 			delete(sc.downtimeStart, row.HostID)
 		}
 		sc.emitTransition(row.HostID, row.CustomerID, oldState, newState, downtimeDur)
 	}
 	for id := range sc.states {
 		if !seen[id] {
 			delete(sc.states, id)
 			delete(sc.downtimeStart, id)
 			delete(sc.customerOf, id)
 		}
 	}
 }
 // GetState returns the current staleness state for a host.
 func (sc *HostStalenessChecker) GetState(hostID string) string {
 	sc.mu.Lock()
 	defer sc.mu.Unlock()
 	s := sc.states[hostID]
 	if s == "" {
 		return "unknown"
 	}
 	return s
 }
 func (sc *HostStalenessChecker) emitTransition(hostID, customerID, oldState, newState string, age time.Duration) {
 	var eventType, severity, message string
 	switch {
 	case newState == "stale":
 		eventType = "host_stale"
 		severity = "warning"
 		message = "Host " + hostID + ": no report for " + formatDuration(age)
 	case newState == "down":
 		eventType = "host_down"
 		severity = "error"
 		message = "Host " + hostID + ": no report for " + formatDuration(age)
 	case newState == "ok" && (oldState == "stale" || oldState == "down"):
 		eventType = "host_recovered"
 		severity = "info"
 		message = "Host " + hostID + ": reports resumed (was " + oldState + " for " + formatDuration(age) + ")"
 	default:
 		return
 	}
 	sc.logger.Printf("[INFO] Host staleness: %s %s → %s (%s)", hostID, oldState, newState, eventType)
 	if _, err := sc.store.SaveEvent(customerID, eventType, severity, message, "{}", "hub"); err != nil {
 		sc.logger.Printf("[WARN] Failed to save host staleness event for %s: %v", hostID, err)
 		return
 	}
 	if sc.onEvent != nil {
 		sc.onEvent(customerID, eventType, severity, message, "{}", "hub")
 	}
 }
@@ -0,0 +1,88 @@
 package monitor
 import (
 	"database/sql"
 	"fmt"
 	"io"
 	"log"
 	"path/filepath"
 	"testing"
 	"time"
 	"gitea.dooplex.hu/admin/felhom-hub/internal/store"
 	_ "modernc.org/sqlite"
 )
 // backdate sets a host's last_report_at to N minutes ago, simulating the passage
 // of time without sleeping. Uses a second connection (the checker reads via store).
 func backdate(t *testing.T, db *sql.DB, hostID string, minutesAgo int) {
 	t.Helper()
 	if _, err := db.Exec(`UPDATE hosts SET last_report_at = datetime('now', ?) WHERE host_id = ?`,
 		fmt.Sprintf("-%d minutes", minutesAgo), hostID); err != nil {
 		t.Fatal(err)
 	}
 }
 func TestHostStalenessChecker(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "test.db")
 	st, err := store.New(path, log.New(io.Discard, "", 0))
 	if err != nil {
 		t.Fatal(err)
 	}
 	defer st.Close()
 	db, _ := sql.Open("sqlite", path)
 	defer db.Close()
 	st.SaveCustomerConfig(&store.CustomerConfig{CustomerID: "c1", APIKey: "ck", RetrievalPassword: "p"})
 	st.UpsertHost(&store.Host{HostID: "h1", CustomerID: "c1", APIKey: "k1"})
 	st.SaveHostReport("h1", "c1", []byte(`{}`), store.HostReportDenorm{}) // sets last_report_at
 	var events []string
 	onEvent := func(customerID, eventType, severity, message, detailsJSON, source string) {
 		events = append(events, eventType)
 	}
 	// Seed already-stale (40m) → state stale, but NO event on init.
 	backdate(t, db, "h1", 40)
 	sc := NewHostStalenessChecker(st, 30*time.Minute, onEvent, log.New(io.Discard, "", 0))
 	if len(events) != 0 {
 		t.Fatalf("seed must not emit events, got %v", events)
 	}
 	if sc.GetState("h1") != "stale" {
 		t.Fatalf("seeded state = %q, want stale", sc.GetState("h1"))
 	}
 	// Same age → no transition.
 	sc.Check()
 	if len(events) != 0 {
 		t.Fatalf("no transition expected, got %v", events)
 	}
 	// Fresh report → host_recovered.
 	backdate(t, db, "h1", 2)
 	sc.Check()
 	if last(events) != "host_recovered" {
 		t.Fatalf("events = %v, want last host_recovered", events)
 	}
 	// Aged to stale → host_stale.
 	backdate(t, db, "h1", 40)
 	sc.Check()
 	if last(events) != "host_stale" {
 		t.Fatalf("events = %v, want last host_stale", events)
 	}
 	// Aged past 2× → host_down.
 	backdate(t, db, "h1", 130)
 	sc.Check()
 	if last(events) != "host_down" {
 		t.Fatalf("events = %v, want last host_down", events)
 	}
 }
 func last(s []string) string {
 	if len(s) == 0 {
 		return ""
 	}
 	return s[len(s)-1]
 }
@@ -0,0 +1,122 @@
 package store
 import (
 	"io"
 	"log"
 	"path/filepath"
 	"testing"
 )
 func newTestStore(t *testing.T) *Store {
 	t.Helper()
 	s, err := New(filepath.Join(t.TempDir(), "test.db"), log.New(io.Discard, "", 0))
 	if err != nil {
 		t.Fatalf("store.New: %v", err)
 	}
 	t.Cleanup(func() { s.Close() })
 	return s
 }
 func TestGuestID(t *testing.T) {
 	if got := GuestID("demo-host-01", 100); got != "demo-host-01/100" {
 		t.Errorf("GuestID = %q", got)
 	}
 }
 func TestUpsertHost_AndLookup(t *testing.T) {
 	s := newTestStore(t)
 	if err := s.UpsertHost(&Host{HostID: "h1", CustomerID: "c1", APIKey: "k1"}); err != nil {
 		t.Fatalf("UpsertHost: %v", err)
 	}
 	h, err := s.GetHost("h1")
 	if err != nil || h == nil {
 		t.Fatalf("GetHost: %v / %v", h, err)
 	}
 	if h.CustomerID != "c1" || h.APIKey != "k1" || h.DesiredJSON != "{}" || h.LastReportAt != nil {
 		t.Errorf("host = %+v", h)
 	}
 	byKey, err := s.GetHostByAPIKey("k1")
 	if err != nil || byKey == nil || byKey.HostID != "h1" {
 		t.Errorf("GetHostByAPIKey hit = %+v / %v", byKey, err)
 	}
 	miss, err := s.GetHostByAPIKey("nope")
 	if err != nil || miss != nil {
 		t.Errorf("GetHostByAPIKey miss = %+v / %v (want nil,nil)", miss, err)
 	}
 }
 func TestSaveHostReport_BumpsRealityPreservesIntent(t *testing.T) {
 	s := newTestStore(t)
 	if err := s.UpsertHost(&Host{HostID: "h1", CustomerID: "c1", APIKey: "k1"}); err != nil {
 		t.Fatal(err)
 	}
 	// Operator-owned intent columns (inert this slice) set out-of-band.
 	if _, err := s.db.Exec(`UPDATE hosts SET desired_json='{"want":1}', desired_generation=7 WHERE host_id='h1'`); err != nil {
 		t.Fatal(err)
 	}
 	denorm := HostReportDenorm{AgentVersion: "0.3.0", CPUPercent: 3.2, MemoryPercent: 25, DiskPercent: 19, GuestTotal: 2, GuestRunning: 1, CloudflaredStatus: "active"}
 	if err := s.SaveHostReport("h1", "c1", []byte(`{"host_id":"h1"}`), denorm); err != nil {
 		t.Fatalf("SaveHostReport: %v", err)
 	}
 	h, _ := s.GetHost("h1")
 	if h.AgentVersion != "0.3.0" || h.LastReportAt == nil {
 		t.Errorf("reality not bumped: %+v", h)
 	}
 	if h.DesiredJSON != `{"want":1}` || h.DesiredGeneration != 7 {
 		t.Errorf("a report must NOT clobber intent columns: desired_json=%q gen=%d", h.DesiredJSON, h.DesiredGeneration)
 	}
 	var n int
 	s.db.QueryRow(`SELECT COUNT(*) FROM host_reports WHERE host_id='h1'`).Scan(&n)
 	if n != 1 {
 		t.Errorf("host_reports rows = %d, want 1", n)
 	}
 }
 func TestUpsertGuestFromReport_PreservesInertColumns(t *testing.T) {
 	s := newTestStore(t)
 	gid := GuestID("h1", 100)
 	if err := s.UpsertGuestFromReport(&Guest{GuestID: gid, CustomerID: "c1", HostID: "h1", VMID: 100, DisplayName: "acme", Status: "running"}); err != nil {
 		t.Fatal(err)
 	}
 	// Slice-10 columns set out-of-band; a report upsert must not touch them.
 	if _, err := s.db.Exec(`UPDATE guests SET api_key='controllerkey', desired_spec_json='{"cores":4}' WHERE guest_id=?`, gid); err != nil {
 		t.Fatal(err)
 	}
 	// A later report changes reality (status/name).
 	if err := s.UpsertGuestFromReport(&Guest{GuestID: gid, CustomerID: "c1", HostID: "h1", VMID: 100, DisplayName: "acme-renamed", Status: "stopped"}); err != nil {
 		t.Fatal(err)
 	}
 	var apiKey, desiredSpec, status, name string
 	err := s.db.QueryRow(`SELECT api_key, desired_spec_json, status, display_name FROM guests WHERE guest_id=?`, gid).
 		Scan(&apiKey, &desiredSpec, &status, &name)
 	if err != nil {
 		t.Fatal(err)
 	}
 	if apiKey != "controllerkey" || desiredSpec != `{"cores":4}` {
 		t.Errorf("inert columns clobbered: api_key=%q desired_spec_json=%q", apiKey, desiredSpec)
 	}
 	if status != "stopped" || name != "acme-renamed" {
 		t.Errorf("reality not updated: status=%q name=%q", status, name)
 	}
 }
 func TestGetHostStaleness_SkipsNeverReported(t *testing.T) {
 	s := newTestStore(t)
 	s.UpsertHost(&Host{HostID: "h1", CustomerID: "c1", APIKey: "k1"})
 	rows, err := s.GetHostStaleness()
 	if err != nil {
 		t.Fatal(err)
 	}
 	if len(rows) != 0 {
 		t.Errorf("never-reported host should be skipped, got %d rows", len(rows))
 	}
 	s.SaveHostReport("h1", "c1", []byte(`{}`), HostReportDenorm{})
 	rows, _ = s.GetHostStaleness()
 	if len(rows) != 1 || rows[0].HostID != "h1" {
 		t.Errorf("after a report expected 1 row, got %+v", rows)
 	}
 }
@@ -5,6 +5,7 @@ import (
 	"encoding/json"
 	"fmt"
 	"log"
 	"strconv"
 	"time"
 	_ "modernc.org/sqlite"
@@ -18,18 +19,18 @@ type Store struct {
 // CustomerSummary holds the latest status for a customer (for dashboard).
 type CustomerSummary struct {
-	CustomerID        string
+	CustomerID         string
-	CustomerName      string
+	CustomerName       string
-	ControllerVersion string
+	ControllerVersion  string
-	ReceivedAt        time.Time
+	ReceivedAt         time.Time
-	HealthStatus      string
+	HealthStatus       string
-	CPUPercent        float64
+	CPUPercent         float64
-	MemoryPercent     float64
+	MemoryPercent      float64
-	ContainerTotal    int
+	ContainerTotal     int
-	ContainerRunning  int
+	ContainerRunning   int
 	BackupLastSnapshot *time.Time
-	ReportJSON        string
+	ReportJSON         string
-	ControllerURL     string
+	ControllerURL      string
 	// Computed fields (not stored)
 	TimeSinceReport time.Duration
@@ -216,6 +217,63 @@ func (s *Store) migrate() error {
 		WHERE NOT EXISTS (SELECT 1 FROM infra_backup_versions
 			WHERE infra_backup_versions.customer_id = infra_backups.customer_id)`)
 	// v0.7.0: host-domain (slice 3). Purely additive — the controller path
 	// (reports/customer_configs) is untouched; the schema cutover is slice 10.
 	// Columns marked INERT exist now so slice 10 needs no ALTER; nothing reads or
 	// writes them this slice.
 	_, err = s.db.Exec(`
 		CREATE TABLE IF NOT EXISTS hosts (
 			host_id              TEXT PRIMARY KEY,
 			customer_id          TEXT NOT NULL,
 			api_key              TEXT NOT NULL,
 			agent_version        TEXT NOT NULL DEFAULT '',
 			last_report_at       DATETIME,
 			desired_json         TEXT NOT NULL DEFAULT '{}',
 			desired_generation   INTEGER NOT NULL DEFAULT 0,
 			dr_record_json       TEXT NOT NULL DEFAULT '{}',
 			created_at           DATETIME NOT NULL DEFAULT (datetime('now')),
 			updated_at           DATETIME NOT NULL DEFAULT (datetime('now'))
 		);
 		CREATE INDEX IF NOT EXISTS idx_hosts_customer ON hosts(customer_id);
 		CREATE TABLE IF NOT EXISTS guests (
 			guest_id            TEXT PRIMARY KEY,
 			customer_id         TEXT NOT NULL,
 			host_id             TEXT NOT NULL,
 			vmid                INTEGER NOT NULL,
 			display_name        TEXT NOT NULL DEFAULT '',
 			status              TEXT NOT NULL DEFAULT 'unknown',
 			controller_version  TEXT NOT NULL DEFAULT '',
 			last_seen_at        DATETIME,
 			api_key             TEXT NOT NULL DEFAULT '',
 			desired_spec_json   TEXT NOT NULL DEFAULT '{}',
 			created_at          DATETIME NOT NULL DEFAULT (datetime('now')),
 			updated_at          DATETIME NOT NULL DEFAULT (datetime('now'))
 		);
 		CREATE INDEX IF NOT EXISTS idx_guests_host     ON guests(host_id);
 		CREATE INDEX IF NOT EXISTS idx_guests_customer ON guests(customer_id);
 		CREATE TABLE IF NOT EXISTS host_reports (
 			id                 INTEGER PRIMARY KEY AUTOINCREMENT,
 			host_id            TEXT NOT NULL,
 			customer_id        TEXT NOT NULL,
 			received_at        DATETIME NOT NULL DEFAULT (datetime('now')),
 			report_json        TEXT NOT NULL,
 			agent_version      TEXT,
 			cpu_percent        REAL,
 			memory_percent     REAL,
 			disk_percent       REAL,
 			guest_total        INTEGER,
 			guest_running      INTEGER,
 			cloudflared_status TEXT
 		);
 		CREATE INDEX IF NOT EXISTS idx_host_reports_host     ON host_reports(host_id, received_at DESC);
 		CREATE INDEX IF NOT EXISTS idx_host_reports_customer ON host_reports(customer_id, received_at DESC);
 	`)
 	if err != nil {
 		return err
 	}
 	return nil
 }
@@ -812,7 +870,13 @@ func (s *Store) Prune(maxDays int) (int64, error) {
 	if err != nil {
 		return 0, err
 	}
-	return res.RowsAffected()
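+	n, err := res.RowsAffected()
 	if err != nil {
 		return 0, err
 	}
 	// v0.7.0: prune the parallel host-domain report stream, same retention.
 	// Failures are returned with the controller-path count, not swallowed.
 	hres, err := s.db.Exec("DELETE FROM host_reports WHERE received_at < ?", cutoff)
 	if err != nil {
 		return n, err
 	}
 	hn, err := hres.RowsAffected()
 	if err != nil {
 		return n, err
 	}
 	return n + hn, nil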
 }
 // Close closes the database connection.
@@ -1138,11 +1202,11 @@ func scanEvents(rows *sql.Rows) ([]Event, error) {
 // parseSQLiteTime tries multiple formats that modernc.org/sqlite may return.
 func parseSQLiteTime(s string) time.Time {
 	formats := []string{
-		"2006-01-02 15:04:05",          // SQLite datetime('now')
+		"2006-01-02 15:04:05",           // SQLite datetime('now')
-		"2006-01-02T15:04:05Z",         // RFC3339 without fractional
+		"2006-01-02T15:04:05Z",          // RFC3339 without fractional
 		time.RFC3339,                    // 2006-01-02T15:04:05Z07:00
-		time.RFC3339Nano,               // with fractional seconds
+		time.RFC3339Nano,                // with fractional seconds
-		"2006-01-02 15:04:05+00:00",    // with explicit UTC offset
+		"2006-01-02 15:04:05+00:00",     // with explicit UTC offset
 		"2006-01-02 15:04:05.999999999", // with fractional, no TZ
 	}
 	for _, f := range formats {
@@ -1180,3 +1244,200 @@ func parseDiskSummary(reportJSON string) string {
 	}
 	return result
 }
 // ---- v0.7.0: host-domain (slice 3) ----
 // Additive store surface for the agent's host-report stream. The controller-path
 // methods above are untouched.
 // Host is one customer agent. Mixes operator-intent columns (Desired*, DRRecord —
 // INERT until slice 10) with box-reported reality (AgentVersion, LastReportAt).
 type Host struct {
 	HostID            string
 	CustomerID        string
 	APIKey            string
 	AgentVersion      string
 	LastReportAt      *time.Time
 	DesiredJSON       string
 	DesiredGeneration int64
 	DRRecordJSON      string
 	CreatedAt         time.Time
 	UpdatedAt         time.Time
 }
 // Guest is one controller LXC. Reality columns are report-driven; APIKey and
 // DesiredSpecJSON are INERT until slice 10 and must survive report upserts.
 type Guest struct {
 	GuestID           string
 	CustomerID        string
 	HostID            string
 	VMID              int
 	DisplayName       string
 	Status            string
 	ControllerVersion string
 	LastSeenAt        *time.Time
 	APIKey            string
 	DesiredSpecJSON   string
 	CreatedAt         time.Time
 	UpdatedAt         time.Time
 }
 // HostReportDenorm are the denormalized fields pulled from a host-report for the
 // dashboard / staleness, mirroring the reports table's denorm pattern.
 type HostReportDenorm struct {
 	AgentVersion      string
 	CPUPercent        float64
 	MemoryPercent     float64
 	DiskPercent       float64
 	GuestTotal        int
 	GuestRunning      int
 	CloudflaredStatus string
 }
 // HostStaleRow is the minimal per-host recency row the dead-man's-switch reads.
 type HostStaleRow struct {
 	HostID       string
 	CustomerID   string
 	LastReportAt time.Time
 }
 // GuestID derives the interim guest primary key from host + vmid. The hub owns the
 // id scheme (locked decision 3) so the slice-10 swap to durable ids is hub-only.
 func GuestID(hostID string, vmid int) string {
 	return hostID + "/" + strconv.Itoa(vmid)
 }
 func scanHost(scan func(dest ...any) error) (*Host, error) {
 	var h Host
 	var lastReport sql.NullString
 	var createdAt, updatedAt string
 	err := scan(&h.HostID, &h.CustomerID, &h.APIKey, &h.AgentVersion, &lastReport,
 		&h.DesiredJSON, &h.DesiredGeneration, &h.DRRecordJSON, &createdAt, &updatedAt)
 	if err != nil {
 		return nil, err
 	}
 	if lastReport.Valid {
 		t := parseSQLiteTime(lastReport.String)
 		h.LastReportAt = &t
 	}
 	h.CreatedAt = parseSQLiteTime(createdAt)
 	h.UpdatedAt = parseSQLiteTime(updatedAt)
 	return &h, nil
 }
 const hostSelectCols = `host_id, customer_id, api_key, agent_version, last_report_at,
 	desired_json, desired_generation, dr_record_json, created_at, updated_at`
 // GetHostByAPIKey looks up a host by its per-host hub key. Returns nil (no error)
 // if no match — parallels GetCustomerConfigByAPIKey.
 func (s *Store) GetHostByAPIKey(apiKey string) (*Host, error) {
 	h, err := scanHost(s.db.QueryRow(`SELECT `+hostSelectCols+` FROM hosts WHERE api_key = ?`, apiKey).Scan)
 	if err == sql.ErrNoRows {
 		return nil, nil
 	}
 	return h, err
 }
 // GetHost looks up a host by id. Returns nil (no error) if not found.
 func (s *Store) GetHost(hostID string) (*Host, error) {
 	h, err := scanHost(s.db.QueryRow(`SELECT `+hostSelectCols+` FROM hosts WHERE host_id = ?`, hostID).Scan)
 	if err == sql.ErrNoRows {
 		return nil, nil
 	}
 	return h, err
 }
 // ListHosts returns all hosts (debug / host-domain views).
 func (s *Store) ListHosts() ([]Host, error) {
 	rows, err := s.db.Query(`SELECT ` + hostSelectCols + ` FROM hosts ORDER BY host_id`)
 	if err != nil {
 		return nil, err
 	}
 	defer rows.Close()
 	var hosts []Host
 	for rows.Next() {
 		h, err := scanHost(rows.Scan)
 		if err != nil {
 			return nil, err
 		}
 		hosts = append(hosts, *h)
 	}
 	return hosts, rows.Err()
 }
 // UpsertHost creates or updates a host identity (used by the admin mint). On
 // conflict it updates only operator-settable identity fields + updated_at; it does
 // NOT touch the reality columns (agent_version/last_report_at) or the inert intent
 // columns (desired_*/dr_record_json) — those are owned elsewhere.
 func (s *Store) UpsertHost(h *Host) error {
 	_, err := s.db.Exec(`
 		INSERT INTO hosts (host_id, customer_id, api_key, updated_at)
 		VALUES (?, ?, ?, datetime('now'))
 		ON CONFLICT(host_id) DO UPDATE SET
 			customer_id = excluded.customer_id,
 			api_key = excluded.api_key,
 			updated_at = datetime('now')`,
 		h.HostID, h.CustomerID, h.APIKey,
 	)
 	return err
 }
 // SaveHostReport inserts a host_reports row and bumps the host's reality columns
 // (agent_version/last_report_at/updated_at) — never the inert intent columns.
 func (s *Store) SaveHostReport(hostID, customerID string, reportJSON []byte, d HostReportDenorm) error {
 	_, err := s.db.Exec(`
 		INSERT INTO host_reports (host_id, customer_id, report_json, agent_version,
 			cpu_percent, memory_percent, disk_percent, guest_total, guest_running, cloudflared_status)
 		VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)`,
 		hostID, customerID, string(reportJSON), d.AgentVersion,
 		d.CPUPercent, d.MemoryPercent, d.DiskPercent, d.GuestTotal, d.GuestRunning, d.CloudflaredStatus,
 	)
 	if err != nil {
 		return err
 	}
 	_, err = s.db.Exec(`
 		UPDATE hosts SET agent_version = ?, last_report_at = datetime('now'), updated_at = datetime('now')
 		WHERE host_id = ?`, d.AgentVersion, hostID)
 	return err
 }
 // UpsertGuestFromReport upserts the REALITY columns of a guest. On conflict it
 // must NOT clobber the inert columns (api_key / desired_spec_json).
 func (s *Store) UpsertGuestFromReport(g *Guest) error {
 	_, err := s.db.Exec(`
 		INSERT INTO guests (guest_id, customer_id, host_id, vmid, display_name, status,
 			controller_version, last_seen_at, updated_at)
 		VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'), datetime('now'))
 		ON CONFLICT(guest_id) DO UPDATE SET
 			vmid = excluded.vmid,
 			display_name = excluded.display_name,
 			status = excluded.status,
 			controller_version = excluded.controller_version,
 			last_seen_at = datetime('now'),
 			updated_at = datetime('now')`,
 		g.GuestID, g.CustomerID, g.HostID, g.VMID, g.DisplayName, g.Status,
 		g.ControllerVersion,
 	)
 	return err
 }
 // GetHostStaleness returns per-host recency for the dead-man's-switch. Hosts that
 // have never reported (NULL last_report_at) are skipped — a freshly-minted host is
 // not "down" until it has checked in at least once.
 func (s *Store) GetHostStaleness() ([]HostStaleRow, error) {
 	rows, err := s.db.Query(`SELECT host_id, customer_id, last_report_at FROM hosts WHERE last_report_at IS NOT NULL`)
 	if err != nil {
 		return nil, err
 	}
 	defer rows.Close()
 	var out []HostStaleRow
 	for rows.Next() {
 		var r HostStaleRow
 		var last string
 		if err := rows.Scan(&r.HostID, &r.CustomerID, &last); err != nil {
 			return nil, err
 		}
 		r.LastReportAt = parseSQLiteTime(last)
 		out = append(out, r)
 	}
 	return out, rows.Err()
 }
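As a reading aid for the dead-man's-switch that consumes `GetHostStaleness`, here is a minimal sketch of how a per-host timestamp might be turned into a state. It assumes `last_report_at` holds the UTC `YYYY-MM-DD HH:MM:SS` text that SQLite's `datetime('now')` produces. The names `parseSQLiteUTC`, `classify` and `classifyAt`, and both thresholds, are hypothetical; the hub's actual `parseSQLiteTime` and checker thresholds are not shown in this diff.

```go
package main

import (
	"fmt"
	"time"
)

// sqliteLayout matches the text produced by SQLite's datetime('now'):
// UTC, second precision, no timezone suffix.
const sqliteLayout = "2006-01-02 15:04:05"

// Hypothetical thresholds, chosen only for illustration.
const (
	staleAfter = 5 * time.Minute
	downAfter  = 15 * time.Minute
)

// parseSQLiteUTC parses a datetime('now')-style string as UTC.
func parseSQLiteUTC(s string) (time.Time, error) {
	return time.ParseInLocation(sqliteLayout, s, time.UTC)
}

// classify maps the age of the last report to "ok", "stale" or "down".
func classify(lastReport, now time.Time) string {
	age := now.Sub(lastReport)
	switch {
	case age >= downAfter:
		return "down"
	case age >= staleAfter:
		return "stale"
	default:
		return "ok"
	}
}

// classifyAt is a convenience wrapper taking both timestamps as SQLite text.
func classifyAt(lastReport, now string) (string, error) {
	last, err := parseSQLiteUTC(lastReport)
	if err != nil {
		return "", err
	}
	cur, err := parseSQLiteUTC(now)
	if err != nil {
		return "", err
	}
	return classify(last, cur), nil
}

func main() {
	now := "2026-06-08 18:00:00"
	for _, last := range []string{
		"2026-06-08 17:58:00", // 2 minutes old
		"2026-06-08 17:50:00", // 10 minutes old
		"2026-06-08 17:00:00", // 60 minutes old
	} {
		state, err := classifyAt(last, now)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", last, state)
	}
}
```

Hosts with a NULL `last_report_at` never reach this step, because the query above filters them out: a freshly minted host is not classified until its first report.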
@@ -10,7 +10,7 @@ import (
 	"time"
 )
-const templateRawURL = "https://gitea.dooplex.hu/admin/deploy-felhom-compose/raw/branch/main/controller/configs/controller.yaml.example"
+const templateRawURL = "https://gitea.dooplex.hu/admin/felhom-controller/raw/branch/main/controller/configs/controller.yaml.example"
 // TemplateFetcher periodically fetches controller.yaml.example from the Gitea
 // repo and caches it for config generation. Falls back to go:embed default.
@@ -187,7 +187,7 @@ spec:
              cpu: "50m"
      containers:
        - name: umami
-          image: ghcr.io/umami-software/umami:postgresql-latest
+          image: ghcr.io/umami-software/umami:3.1.0
          ports:
            - containerPort: 3000
          env:
@@ -105,9 +105,22 @@ spec:
      labels:
        app: filebrowser
    spec:
      # filebrowser v2.63.13 (debian default) runs as a non-root UID by default
      # and can't write to PVC files left by the previous v2-alpine image (which
      # ran as root). Force root explicitly so the existing PVC contents are
      # readable + writable. (The alternative -- chown the PVC then drop perms --
      # needs a one-shot initContainer; not worth the moving parts here.)
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      containers:
        - name: filebrowser
-          image: filebrowser/filebrowser:v2-alpine
+          image: filebrowser/filebrowser:v2.63.13
          # v2.63.x default config path is `/config/settings.json`; our ConfigMap
          # is mounted at `/.filebrowser.json`. Tell filebrowser to read it
          # explicitly so it picks up port 8080 (else it falls back to port 80
          # and the readiness probe on 8080 fails).
          args: ["-c", "/.filebrowser.json"]
          ports:
            - containerPort: 8080
          volumeMounts:
Author	SHA1	Message	Date
admin	2f8658981d	docs: reflow CLAUDE.md; switch REPORT.md to overwrite-latest; add no-secrets rule Unify the REPORT/CHANGELOG convention with the sibling repos (REPORT.md was append/cumulative -> now overwrite-latest; CHANGELOG stays cumulative). Reflow removes hard mid-paragraph line wraps; rendered output unchanged. CHANGELOG entry in hub/CHANGELOG.md. No hub code change -> no version bump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 20:54:32 +02:00
admin	7bc27c38de	update	2026-06-08 20:06:11 +02:00
admin	aab3e137c5	updated CLAUDE.md	2026-06-08 19:17:41 +02:00
admin	4be3bdf486	fix(hub): slice-3 follow-ups — /host-report 413 oversize + contract golden (v0.7.1) - handleHostReport: read maxHostReportBytes+1 (4 MiB const) and reject oversize with 413 instead of silent LimitReader truncation. Controller handleReport (1 MiB) is unchanged. Test asserts 413. - contract: hub/internal/api/testdata/host-report.golden.json (byte-identical with felhom-agent's copy) + TestHostReport_GoldenContract drives the real handler and asserts 200 + denorm + both guests upserted. - CHANGELOG v0.7.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 18:31:44 +02:00
admin	23611c20ef	chore(hub): revert incidental gofmt-only reformatting outside slice-3 scope Restores notify/templates.go, store/telemetry.go, web/configs.go to upstream — those were alignment-only churn from a tree-wide gofmt, not part of slice 3. Keeps the host-domain diff additions-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:38:18 +02:00
admin	7c0c75457f	feat(hub): host-domain ingest — tables + /host-report + per-host auth + host dead-man's-switch (v0.7.0, slice 3) Purely additive; the controller path (reports/customer_configs/checkAuthCustomer/ existing checkers) is untouched. Cutover remains slice 10. - store: new hosts/guests/host_reports tables (full schema incl. columns INERT until slice 10, so no later ALTER); GetHostByAPIKey/GetHost/ListHosts/UpsertHost/ SaveHostReport/UpsertGuestFromReport (preserves inert cols)/GetHostStaleness/ GuestID; Prune also prunes host_reports. - api: checkAuthHost (sibling of checkAuthCustomer); POST /host-report (per-host Bearer, 4MiB, denorm + guest upsert, control envelope); POST /admin/hosts (PROVISIONAL global-key host mint); host_* event types registered. - monitor: HostStalenessChecker sibling over host_reports (host_stale/down/ recovered), wired on the existing 60s ticker; controller checkers unchanged. - tests (hermetic): store intent/inert-column preservation, auth, ingest (envelope+denorm, mismatch/unknown/blocked/oversize), admin mint round-trip, host staleness transitions. CHANGELOG v0.7.0. Contract matches the agent host-report spec field-for-field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:36:16 +02:00
admin	0d832def7b	fix: update repo-name refs after deploy-felhom-compose -> felhom-controller rename - hub/internal/web/templatefetcher.go: raw-template URL now points at the renamed repo (was relying on Gitea's post-rename redirect) - documentation/ (moved here from the felhom-agent repo): fix controller-source path refs (deploy-felhom-compose -> felhom-controller) and the platform repo name (proxmox-controller -> felhom-agent) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 14:03:13 +02:00
admin	cb1d964620	Merge pull request 'moved documentation to felhom.eu' (#7) from fix/filebrowser-config-args into main Reviewed-on: #7	2026-06-08 11:54:53 +00:00
admin	3d6cde8080	Merge pull request 'docs: rework repo-name references for renames' (#6) from chore/rename-repo-refs into main Reviewed-on: #6	2026-06-08 11:52:04 +00:00
admin	715f644bf0	moved documentation to felhom.eu	2026-06-08 13:50:14 +02:00
admin	0f12e17175	docs: rework repo-name references for renames deploy-felhom-compose -> felhom-controller, proxmox-controller -> felhom-agent in README.md and CLAUDE.md. Hub source (templatefetcher.go) intentionally left untouched per scope; its raw-template URL is flagged separately for the operator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 13:39:53 +02:00
admin	7b545c1ec7	Merge pull request 'fix: pass --config to filebrowser (v2.63.x changed default lookup path)' (#5) from fix/filebrowser-config-args into main	2026-06-06 12:22:05 +00:00
admin	ea66afa960	manifests: pass --config to filebrowser so it reads our ConfigMap The previous PR pinned filebrowser to v2.63.13 + runAsUser:0 which solved the PVC permission issue, but the pod was still 0/1 Ready because v2.63.x changed the default config-file lookup path: Old (v2-alpine): /.filebrowser.json (matched our existing mount) New (v2.63.13) : /config/settings.json (NOT mounted in this pod) So the new image ran with its built-in defaults (port 80, in-memory db), and the readiness probe on 8080/health timed out. Fix: pass `args: ["-c", "/.filebrowser.json"]` so filebrowser uses the ConfigMap we already mount there. No volumeMount changes needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 14:22:04 +02:00
admin	87b062e84a	Merge pull request 'feat: umami 3.1.0 + filebrowser v2.63.13 (root)' (#4) from feat/umami-v3-filebrowser-root into main	2026-06-06 12:17:21 +00:00
admin	bd0531e4a8	manifests: umami -> 3.1.0 (v3 line) + filebrowser v2.63.13 with runAsUser:0 umami: Switch from SHA-pinned v3.0.3 to the tagged v3.1.0 release (the v3 line proper -- same schema lineage, normal Prisma minor-version migration). This is the documented forward path that the version- checker hint `postgresql-latest -> 3.1` indicated. The v1.x postgresql-vX.Y.Z line we briefly tried earlier today is a DIFFERENT image lineage with incompatible migrations -- avoid. filebrowser: Re-pin to v2.63.13 (debian-based default) so Renovate can track future bumps. The non-root UID in that image can't write to the existing PVC contents (chowned to root by the previous v2-alpine image), so set pod-level securityContext runAsUser:0 + runAsGroup:0 to keep using the same volume layout without a chown initContainer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 14:17:20 +02:00
admin	dc64bb2d79	Merge pull request 'fix(URGENT): pin umami to exact SHA (v1.38.0 has schema lineage mismatch)' (#3) from fix/umami-sha-pin into main	2026-06-06 11:53:55 +00:00
admin	7e6ea9d66c	manifests: pin umami to exact image SHA (schema mismatch with v1.38.0) Previous PR pinned `ghcr.io/umami-software/umami:postgresql-v1.38.0`. The new pod crashlooped on Prisma: ERROR: relation "event" does not exist Migration name: 02_add_event_data Database error code: 42P01 The 120-day-old working pod's actual image is: ghcr.io/umami-software/umami@sha256:28f263fe06f79ebffa5a6a6e9b... It runs an older umami build whose schema doesn't have the `event` table that the v1 migration `02_add_event_data` operates on. The DB has migrations 10-14 applied (newer than 02 by name) but 02 isn't in its applied set -- likely a schema fork between the line our 120d pod runs and the postgresql-vX.Y.Z line that v1.38.0 advances toward. Pin to the exact SHA that the working pod uses, so pod restarts + ArgoCD syncs both keep producing pods on the same known-good image (cached on the node, no registry pull needed). Renovate also stops chasing the broken upgrade path. Proper fix (deferred): plan a v3.x migration. The version-checker dashboard hint `postgresql-latest → 3.1` suggests umami v3.x dropped the `postgresql-` prefix and is what we'd want long-term. That needs a real DB migration plan since the schema lineage is genuinely different from this image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 13:53:54 +02:00
admin	a964dc20a4	Merge pull request 'fix: revert filebrowser to v2-alpine (PVC permission issue with v2.63.13)' (#2) from fix/filebrowser-revert into main	2026-06-06 11:45:19 +00:00
admin	df2a1259d9	manifests: revert filebrowser v2.63.13 -> v2-alpine (PVC permission issue) The previous PR pinned `filebrowser/filebrowser:v2-alpine` to v2.63.13 but it crashlooped on: Error: open /database/filebrowser.db: permission denied The v2.63.13 image (debian-based default) runs as a non-root UID and can't write to files on the PVC that were created by the v2-alpine image (which ran as root). No `v2.63.13-alpine` tag exists upstream (filebrowser stopped publishing per-version alpine variants), so we can't trivially preserve the same runtime. Quick recovery: revert to v2-alpine so filebrowser is usable again. Proper fix (deferred): either an initContainer that `chown -R 1000:1000 /database /srv` or a `securityContext.fsGroup: 1000` on the pod spec to let the non-root UID write to the existing PVC. Both require some care since the chown is destructive if the UID is wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 13:45:18 +02:00
admin	e363c6594d	Merge pull request 'manifests: re-pin moving tags (umami / filebrowser)' (#1) from fix/version-pins into main	2026-06-06 11:41:51 +00:00
admin	ce80dce497	manifests: re-pin moving tags so Renovate can track them - umami postgresql-latest -> postgresql-v1.38.0 - filebrowser v2-alpine -> v2.63.13 These two were "latest"-style moving tags that Renovate physically cannot propose updates for. Pinning to current upstream versions so future bumps go through the normal Renovate PR flow. Note: Renovate operates from the homelab-manifests repo, not this one yet — but felhom-system/* copies exist in homelab-manifests for discoverability, and Renovate already tracks the pinned forms via a new customManager for the umami `postgresql-vX.Y.Z` pattern (added in homelab-manifests admin-system/renovate.yaml). For now, future bumps will need to be applied to both repos until we consolidate the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 13:41:50 +02:00
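
The v0.7.1 commit in the list above (4be3bdf486) describes reading `maxHostReportBytes+1` and answering 413 instead of letting `io.LimitReader` silently truncate an oversize body. A minimal sketch of that pattern follows, assuming a plain `net/http` handler. `readCapped`, `handleReport`, `postStatus` and `maxBodyBytes` are hypothetical names for illustration, not the hub's actual code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBodyBytes mirrors the 4 MiB limit named in the commit message.
const maxBodyBytes = 4 << 20

// readCapped reads at most max+1 bytes. Seeing that extra byte proves the
// body exceeded max, so the caller can reject it rather than truncate it.
func readCapped(r io.Reader, max int64) (body []byte, tooBig bool, err error) {
	body, err = io.ReadAll(io.LimitReader(r, max+1))
	if err != nil {
		return nil, false, err
	}
	if int64(len(body)) > max {
		return nil, true, nil
	}
	return body, false, nil
}

// handleReport is an illustrative handler: 413 on oversize, 200 otherwise.
func handleReport(w http.ResponseWriter, r *http.Request) {
	body, tooBig, err := readCapped(r.Body, maxBodyBytes)
	switch {
	case err != nil:
		http.Error(w, "read error", http.StatusBadRequest)
	case tooBig:
		http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
	default:
		fmt.Fprintf(w, "accepted %d bytes", len(body))
	}
}

// postStatus sends an n-byte body through the handler and returns the status.
func postStatus(n int) int {
	payload := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/host-report", payload)
	rec := httptest.NewRecorder()
	handleReport(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(postStatus(1024))             // within the limit
	fmt.Println(postStatus(maxBodyBytes))     // exactly at the limit
	fmt.Println(postStatus(maxBodyBytes + 1)) // one byte over
}
```

Reading one byte past the limit is what distinguishes "exactly at the limit" from "over the limit"; a reader capped at exactly `max` cannot tell the two apart.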