# Critical design review — Proxmox re-platform doc set > ✅ **RESOLVED (2026-06-08).** All findings folded into 01/02/03 + `proxmox-platform.md` > (Phase-3 spike run for B2/B3 → `tests/phase3-findings.md`). **Folded:** B1 (03 §4), B2 > (03 §7/§8 + platform §4.7), B3 (03 §2/§3 + platform §3.6), S1 (03 §6/§8), S2 (03 §10/§11), > S3 (03 §7), S4 (01 §5/§7 + 02 + 03 §2), S5 (01 §7/§11 + 02 §6), S6 (02 §5), M1 (02 §3), > M2 (03 §7), M3 (03 §10), §6-residual (03 §6). Plus the two Phase-3 design updates: > provision-by-restore (03 §9) and the settled root-vs-API boundary (03 §3). **Deferred/none:** > no finding was deferred; the pre-existing open items (operator signing-key mechanics, > multi-tenant fairness, hub-side desired-state UX, golden-image refresh cadence) remain > flagged in 03 §13. This artifact can be deleted once confirmed. Working artifact. Review pass over `01-topology-and-trust.md`, `02-controller-module-map.md`, `03-host-agent.md`, `proxmox-platform.md`, and the Phase 0 / Phase 1-2 findings, grounded against the v0.33 source (`felhom-controller/controller/`). Every finding cites a file+line or a doc section. Severity: **blocking** / **should-fix** / **minor**. Two findings are self-corrections of my own earlier work (`02` and `proxmox-platform.md`) — flagged as such. --- ## Ranked summary | # | Severity | Finding | Where | |---|---|---|---| | B1 | **blocking** | Reversibility gate contradicts the self-heal reconcile loop — crashed-guest healing can require a signature-gated destroy → reconcile stalls | `03` §4 vs §4(a) | | B2 | **blocking** | vzdump bulk-exclusion only works for **volume** mount points; Docker **named volumes live in the LXC rootfs and ARE captured** → naive placement silently backs up the 1 TB media drive. Unvalidated by spike. | `03` §7 vs `proxmox-platform.md` §4.3 + pct manpage | | B3 | **blocking** | Agent's Proxmox role is called "the minimal role from Phase 1" — but that role is the *narrow self-backup* role that Phase 1 proved is **denied** create/allocate/restore. The agent's operator-tier role is undefined. | `03` §2/§3 vs `phase1-2` §1.3-1.4, `01` appendix | | S1 | should-fix | Quiescing for agent/hub-scheduled backups has **no agent→controller channel** — the local API is controller→agent only | `03` §6, §8 | | S2 | should-fix | Agent self-update revert authority unspecified — if the new binary won't boot, nothing outside it can flip back | `03` §11 | | S3 | should-fix | Storage manifest drops fields `settings.StoragePath` carries today (Label, Schedulable/default, StoppedStacks, MigratedTo) with no re-homing stated | `03` §7 vs `settings.go:90-103` | | S4 | should-fix | Geo-restriction WAF ownership + Cloudflare **API token** placement unspecified after tunnel placement was locked; zone-wide token in a guest is a blast-radius concern | `03` (absent), `01` §3, `config.go` InfrastructureConfig | | S5 | should-fix | Cross-doc staleness: `01` §11 still lists tunnel placement OPEN; `02` §6 lists geo "blocked on tunnel placement" — both resolved by `03` §13 | `01` §11, `02` §6 vs `03` §13 | | S6 | should-fix (self-correct) | `02` put self-restore-test **orchestration** in the controller; `03` correctly makes it agent-owned (controller only reads status) | `02` §5(3) vs `03` §6/§8 | | M1 | minor (self-correct) | `02` §3 lists `UUID` as a `settings.StoragePath` field — it isn't; UUID is derived from fstab at runtime | `02` §3 vs `settings.go:91-103` | | M2 | minor | `03` §7 says the manifest "absorbs the disk-state fields StoragePath carries today" incl. UUID — UUID isn't persisted today, so the manifest *adds* it (an improvement, not absorption) | `03` §7 | | M3 | minor | controller-update is not in `03` §10's journaled-ops list, though it's a multi-step async op | `03` §10 vs §11 | **Values check: clean.** No DR/key-custody/offboarding path leaves a customer locked out. Zero-knowledge DR (`03` §8, `01` §8) correctly makes the customer recovery code the irreducible residual; the operator cannot read data and the box can still restore-test. No hostage path found. **Locked premises:** reviewed for soundness/consistency only; not relitigated. --- ## Blocking findings ### B1 — The reversibility gate stalls the self-healing reconcile loop **Where:** `03` §4(a) vs the gate in §4. **What:** §4(a) lists "redeploy of a crashed controller" as benign convergence that "falls out of reconciliation for free." The gate then lists **guest destroy** among the irreversible ops that require an operator signature "*regardless of whether they arrive as a job or as a desired-state delta*." These collide: if healing a wedged guest requires destroy+recreate (corrupt rootfs, failed in-place restart, half-built guest from an interrupted provision), the reconciler hits a signature-gated op and **cannot proceed without an operator** — the loop either stalls or silently gives up, defeating "self-healing … tolerant of missed polls." **Why it matters:** This is the security-critical control model. A fuzzy benign/destructive line is unimplementable: either the reconciler can destroy (and a compromised hub's desired state can wipe guests — the exact threat §4 exists to stop), or it can't (and self-heal is a fiction for the crashed-guest case). **Grounding:** `03` §4 self-describes the gate as "security-critical"; §9/§10 already rely on the reconciler rolling back "a half-built guest" — which *is* a destroy of a customer-id-bound resource, contradicting the blanket "guest destroy needs a signature." **Suggested fix (crisp, implementable rule):** Scope the reconciler's destructive verbs by *provenance and data-bearing-ness*, not by verb: - The reconciler MAY, without a signature: (a) create/start/restart; (b) destroy resources it **created earlier in the same journaled transaction** (compensating rollback, §10); (c) destroy resources **tagged ephemeral/scratch** (restore-test scratch guests, §8). - Destroying or overwriting any resource that **holds the only/primary copy of customer data** always needs an operator signature. - **Healing a crashed controller is non-destructive by construction:** the controller is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. State this explicitly so the two clauses stop colliding. (The v0.33 self-heal precedent is already in-place restart: `watchdog.go` restarts stopped stacks, it never destroys the guest.) ### B2 — vzdump bulk-exclusion: the rootfs-Docker-volume trap **Where:** `03` §7 ("Bulk external mounts are excluded from the guest's vzdump (a per-mount backup flag)"). **What:** Two grounded problems: 1. The flag is real but narrow. The pct manpage (verified): `backup=` — *"Whether to include the mount point in backups (**only used for volume mount points**)."* It does **not** apply to bind mounts / device mounts (those are handled separately). 2. The trap: `proxmox-platform.md` §4.3 (validated in `phase1-2` §2.2) proved that **Docker named volumes live inside the LXC rootfs and ARE captured by vzdump** — a sentinel in `pgdata` survived. The default Felhom app uses Docker named volumes. So unless bulk data is deliberately placed on a **dedicated Proxmox volume mount point** (backup=0) or a bind mount, a "bulk" volume will be an ordinary named volume in rootfs and will be **silently swept into the whole-guest image** — exactly the 1 TB-media-in-every-backup outcome §7 says it prevents. **Why it matters:** Backup size/cost and RPO blow up silently; the failure is invisible until a media drive fills the vzdump target. This is load-bearing for the §8 tier model. **Grounding:** pct manpage (fetched 2026); `proxmox-platform.md` §4.3; `phase1-2` §2.2. Not covered by any spike — `proxmox-platform.md` §6 "not yet validated" should gain this row. **Suggested fix:** Make the placement contract explicit: a `bulk` volume **must** be realized as a dedicated LXC mount point (volume mountpoint with `backup=0`, or an external bind mount), **never** a Docker named volume in rootfs. The per-volume placement component (`02` §5(2)) must enforce this at deploy. Add a Phase-3 spike: create an LXC with a `backup=0` volume mountpoint + a bind mount, vzdump it, confirm both are excluded and the rootfs+`backup=1` volume are included. ### B3 — The agent's Proxmox role is mis-grounded as "the Phase-1 minimal role" **Where:** `03` §2 ("scoped Proxmox API token (minimal role from Phase 1)"), §3 ("the Phase-1 minimal role is the API floor"). **What:** Phase 1's minimal role (`FelhomSelfBackup` = `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`) is the **narrow self-backup** role scoped to one guest, and Phase 1 explicitly proved it is **denied (403)** on create/allocate (`phase1-2` §1.3 call #7) — i.e. exactly the operator-tier ops the agent's whole job consists of (provision, restore, storage allocation). Worse, `01` appendix states that guest-side role "**is not used** — we chose the agent-mediated path." So `03` cites, as the agent's role floor, a role that (a) the architecture discarded and (b) is provably insufficient for the agent. **Why it matters:** The agent's actual operator-tier role is **undefined**. Provisioning, restore, and storage management cannot be built or hardened against an undefined privilege set, and §3's root-minimization argument ("the Phase-1 minimal role is the API floor") collapses because that floor can't create a guest. **Grounding:** `phase1-2` §1.3 (create CT = 403), §1.4 (role = self-backup only); `01` appendix ("not used … confirmed restore = operator-tier"); `proxmox-platform.md` §3.4. **Suggested fix:** Replace the Phase-1 reference with a **new agent operator role** to be defined and least-privilege-tested in a Phase-3 spike — minimally `VM.Allocate`, `VM.Config.*`, `VM.PowerMgmt`, `VM.Snapshot(.Rollback)`, `VM.Backup`, `VM.Audit`, `Datastore.Allocate(Space)`, `Datastore.Audit`, plus whatever storage-attach needs (see S4/root-boundary below). Keep §3's "API token, not root, where the API suffices" principle — that part is sound — but stop calling it the Phase-1 role. --- ## Should-fix findings ### S1 — No agent→controller channel for backup quiescing **Where:** `03` §6 (local API is controller→agent only) vs §8 ("the controller stops the app stack … before a guest vzdump where app-consistency matters"). **What:** App-consistent LXC backup requires the controller to quiesce (no fsfreeze for LXC — `proxmox-platform.md` §4.2, `phase1-2` §2.1). But the §6 surface is entirely controller→agent; the box-initiated model forbids the hub calling in, and there is no agent→controller call defined. For a **hub/agent-scheduled** backup (schedule lives in the manifest `policy`, §7), the agent has no way to tell the controller "quiesce now." **Why it matters:** Either scheduled backups silently fall back to crash-consistent (relying on WAL recovery, which `phase1-2` §3 warns is unvalidated under write load), or the feature can't be built as drawn. **Suggested fix:** Make backups **controller-driven for app-consistency**: the controller learns due/policy via its own hub channel (or a `GET /backup/due` on the local API), quiesces, calls the existing `POST /backup`, then unquiesces on completion. Document that agent-initiated vzdump is crash-consistent only. (No inbound-to-guest channel needed — preserves §3/§5.) ### S2 — Agent self-update revert authority unspecified **Where:** `03` §11 ("a watchdog reverts to last-good if the new binary fails to come up healthy"). **What:** The agent is a single host systemd service with `Restart=always` (§3). If the new binary crashes on startup, systemd just restarts the **same bad binary** in a loop. "Revert to last-good" cannot be done *by* the thing that won't boot. §11 doesn't name the actor. **Why it matters:** A bad self-update can brick the crown-jewel host agent — the one component that recovers everything else — with no automatic recovery, requiring break-glass. **Suggested fix:** Put revert authority **outside** the swapped binary: e.g. an A/B symlink (`current → good|new`) where a separate systemd oneshot health-gate (`ExecStartPost` probe; on failure flip the symlink back and restart), or a tiny supervisor unit. Boot-into-last-good + explicit "commit" after a clean health window is the robust pattern. Add agent-update to the §10 journal so an interrupted swap is resumable. ### S3 — Manifest schema omits live `StoragePath` fields without re-homing them **Where:** `03` §7 table vs `settings.go:90-103`. **What:** Today's `StoragePath` carries `Label`, `IsDefault`, `Schedulable`, `StoppedStacks`, `Decommissioned`/`DecommissionedAt`/`MigratedTo`. The manifest covers state (attached/ disconnected/decommissioned) and durable_id, but drops: **Label** (human name, e.g. "Külső HDD 1TB" — UI), **Schedulable/IsDefault** (default placement target for new apps), **StoppedStacks** (which apps to restart on reconnect — app-domain), **MigratedTo** (decommission target pointer). **Why it matters:** `02` named this manifest as the contract that the `settings.StoragePath` reshape depends on. Silently dropped fields become lost behavior (no default-drive choice, no restart-after-reconnect list, no friendly labels). **Suggested fix:** Either add Label + a placement-default marker to the manifest, or explicitly state which fields re-home to the controller's `settings` (StoppedStacks and Label are plausibly controller-side; default/schedulable placement must live wherever placement decisions are made). Make the split explicit so neither side assumes the other owns it. ### S4 — Geo-WAF ownership + Cloudflare API token placement unspecified **Where:** `03` covers `cloudflared` (tunnel) health but is silent on geo-restriction WAF; `02` §6 had `cloudflare/`+`geo` "blocked on tunnel placement"; `01` §3 lists the controller's creds as "hub API key + local-API token" only. **What:** Now that tunnel placement is locked (host), the **geo-restriction WAF** management (`cloudflare/` package: zone/waf/geosync) still has no home. It requires a Cloudflare **API token** (`config.go` InfrastructureConfig.cf_api_token) with zone-wide WAF edit rights. If geo stays in the controller (app-domain, per `02`), a **zone-wide Cloudflare token sits inside the customer guest** — a real blast-radius concern (compromise → edit/disable WAF for the whole zone, potentially other customers on the same zone). **Why it matters:** Trust-boundary gap. `01` §5's boundary table has no row for controller↔ Cloudflare-API. Unspecified ownership blocks the `02` geo classification from being unblocked. **Suggested fix:** Decide geo-WAF ownership explicitly and add it to `01` §5. Options: (a) move WAF management to the **agent/hub** (operator-tier, token off the customer box); (b) keep it in the controller but scope the CF token per-zone/per-customer if the account model allows. Note this is now *unblocked* by the tunnel decision and should leave `02` §6's "blocked" state. ### S5 — Cross-doc staleness on the now-locked tunnel placement **Where:** `01` §11 ("Cloudflare Tunnel placement: host vs guest (§7)") and `02` §6 ("`cloudflare/` + `api/geo.go` — blocked on tunnel placement") vs `03` §13 ("Resolved here: tunnel placement (host, agent-managed)") and the LOCKED list. **What:** `01` and `02` still present as OPEN/blocked a decision `03` and the locked set have resolved. **Why it matters:** A dev reading `01`/`02` would treat a settled decision as open (or a classification as blocked when only geo-ownership, S4, actually remains). **Suggested fix:** When folding this review in: update `01` §7/§11 to record tunnel=host (agent-managed systemd service); update `02` §6 to reduce the cloudflare item from "blocked on tunnel placement" to the narrower "blocked on geo-WAF ownership (S4)." ### S6 — (self-correction) self-restore-test orchestration belongs to the agent, not the controller **Where:** `02` §5(3) said "Self-restore-test orchestration — *controller* asks the agent to restore to scratch guest, validates, reports." `03` §8 makes the **agent** drive it autonomously; §6 gives the controller only `GET /restore-test/status` (read-only). **What:** `03` is right and `02` overreached. Zero-knowledge means only the box/agent holds the PBS key (`03` §8); creating a scratch guest is operator-tier (create/allocate — `phase1-2` §1.3 #7); the controller cannot do either. The controller's only piece is surfacing status. **Why it matters:** Keeps the NEW-component list honest — this is not a controller component to build beyond a status read. **Suggested fix:** Amend `02` §5(3) to "self-restore-test **status display** (read-only); the agent owns orchestration." --- ## Minor findings - **M1 (self-correction):** `02` §3 lists `UUID` among `settings.StoragePath` fields. It is **not** there (`settings.go:91-103`: Path, Label, IsDefault, Schedulable, AddedAt, Disconnected/At, StoppedStacks, Decommissioned/At, MigratedTo). UUID is derived at runtime from fstab / `/host-dev/disk/by-uuid` by `system.ParseFstabUUID` and `watchdog.go`. The classification (settings = MODIFY/split) is unaffected; the field list was wrong. - **M2:** Consequently `03` §7's "absorbs the disk-state fields `settings.StoragePath` carries today" overstates: `durable_id`/UUID is *not* carried today, so the manifest **adds** durable identity (a genuine improvement — today the controller re-derives UUID from fstab each boot, which is fragile). Reword "absorbs" → "absorbs + adds durable_id." - **M3:** `03` §10 journals "provision, restore" but not **controller-update** (§11), which is also a multi-step async op (snapshot→pull→redeploy→health→rollback). Add it so an agent crash mid-controller-update is resume-or-rollback like the others. --- ## Verified-correct (no action) — grounding that held up - LXC flags `nesting=1,keyctl=1` + overlayfs (`03` §9) match `proxmox-platform.md` §2.3 / `phase0` §3. ✓ - async `task exitstatus`, not POST return (`03` §8) matches `proxmox-platform.md` §3.5. ✓ - stop-mode backup not requiring `VM.PowerMgmt` (`03` §8 "per Phase 1") matches `proxmox-platform.md` §3.4. ✓ (applies to the agent role too.) - running-LXC snapshot on LVM-thin (`03` §6/§8/§11) matches `proxmox-platform.md` §4.5 / `phase1-2` §1.6. ✓ - `monitor/pinger.go` deprecation (`02` DELETE-obsolete) confirmed in `main.go:168,175` ("legacy, will be removed" / "no longer used — monitoring is now handled by the Hub"). ✓ - backup keep/delete **intra-file tear** (`02` hazard) confirmed: `backup.go` holds both `RunDBDumps`/`DumpAppVolumes(Safe)` (keep) and `RunBackup`/`RunFullBackup` (restic, delete); `restore.go` holds `RestoreApp` (restic) + `RestoreAppFromTier2` (app). The §7-8 backup contract gives the extracted app-data-backup package a coherent destination. ✓ - Control-plane-not-data-plane (`03` §2/§43): apps keep serving if the agent dies — consistent with Docker-in-LXC running independently (`phase0` §3). ✓ - §6 per-guest local-API authorization (token→guest map): sound; a leaked token acts only on its own guest. Residual: a compromised controller can `POST /rollback` its **own** guest (blast radius = self) — acceptable per design; worth a one-line note that rollback is self-scoped and bounded.