Files
felhom-agent/docs/architecture/_design-review.md
T
2026-06-08 09:15:16 +02:00

19 KiB

Critical design review — Proxmox re-platform doc set

RESOLVED (2026-06-08). All findings folded into 01/02/03 + proxmox-platform.md (Phase-3 spike run for B2/B3 → tests/phase3-findings.md). Folded: B1 (03 §4), B2 (03 §7/§8 + platform §4.7), B3 (03 §2/§3 + platform §3.6), S1 (03 §6/§8), S2 (03 §10/§11), S3 (03 §7), S4 (01 §5/§7 + 02 + 03 §2), S5 (01 §7/§11 + 02 §6), S6 (02 §5), M1 (02 §3), M2 (03 §7), M3 (03 §10), §6-residual (03 §6). Plus the two Phase-3 design updates: provision-by-restore (03 §9) and the settled root-vs-API boundary (03 §3). Deferred/none: no finding was deferred; the pre-existing open items (operator signing-key mechanics, multi-tenant fairness, hub-side desired-state UX, golden-image refresh cadence) remain flagged in 03 §13. This artifact can be deleted once confirmed.

Working artifact. Review pass over 01-topology-and-trust.md, 02-controller-module-map.md, 03-host-agent.md, proxmox-platform.md, and the Phase 0 / Phase 1-2 findings, grounded against the v0.33 source (deploy-felhom-compose/controller/). Every finding cites a file+line or a doc section. Severity: blocking / should-fix / minor.

Two findings are self-corrections of my own earlier work (02 and proxmox-platform.md) — flagged as such.


Ranked summary

# Severity Finding Where
B1 blocking Reversibility gate contradicts the self-heal reconcile loop — crashed-guest healing can require a signature-gated destroy → reconcile stalls 03 §4 vs §4(a)
B2 blocking vzdump bulk-exclusion only works for volume mount points; Docker named volumes live in the LXC rootfs and ARE captured → naive placement silently backs up the 1 TB media drive. Unvalidated by spike. 03 §7 vs proxmox-platform.md §4.3 + pct manpage
B3 blocking Agent's Proxmox role is called "the minimal role from Phase 1" — but that role is the narrow self-backup role that Phase 1 proved is denied create/allocate/restore. The agent's operator-tier role is undefined. 03 §2/§3 vs phase1-2 §1.3-1.4, 01 appendix
S1 should-fix Quiescing for agent/hub-scheduled backups has no agent→controller channel — the local API is controller→agent only 03 §6, §8
S2 should-fix Agent self-update revert authority unspecified — if the new binary won't boot, nothing outside it can flip back 03 §11
S3 should-fix Storage manifest drops fields settings.StoragePath carries today (Label, Schedulable/default, StoppedStacks, MigratedTo) with no re-homing stated 03 §7 vs settings.go:90-103
S4 should-fix Geo-restriction WAF ownership + Cloudflare API token placement unspecified after tunnel placement was locked; zone-wide token in a guest is a blast-radius concern 03 (absent), 01 §3, config.go InfrastructureConfig
S5 should-fix Cross-doc staleness: 01 §11 still lists tunnel placement OPEN; 02 §6 lists geo "blocked on tunnel placement" — both resolved by 03 §13 01 §11, 02 §6 vs 03 §13
S6 should-fix (self-correct) 02 put self-restore-test orchestration in the controller; 03 correctly makes it agent-owned (controller only reads status) 02 §5(3) vs 03 §6/§8
M1 minor (self-correct) 02 §3 lists UUID as a settings.StoragePath field — it isn't; UUID is derived from fstab at runtime 02 §3 vs settings.go:91-103
M2 minor 03 §7 says the manifest "absorbs the disk-state fields StoragePath carries today" incl. UUID — UUID isn't persisted today, so the manifest adds it (an improvement, not absorption) 03 §7
M3 minor controller-update is not in 03 §10's journaled-ops list, though it's a multi-step async op 03 §10 vs §11

Values check: clean. No DR/key-custody/offboarding path leaves a customer locked out. Zero-knowledge DR (03 §8, 01 §8) correctly makes the customer recovery code the irreducible residual; the operator cannot read data and the box can still restore-test. No hostage path found.

Locked premises: reviewed for soundness/consistency only; not relitigated.


Blocking findings

B1 — The reversibility gate stalls the self-healing reconcile loop

Where: 03 §4(a) vs the gate in §4. What: §4(a) lists "redeploy of a crashed controller" as benign convergence that "falls out of reconciliation for free." The gate then lists guest destroy among the irreversible ops that require an operator signature "regardless of whether they arrive as a job or as a desired-state delta." These collide: if healing a wedged guest requires destroy+recreate (corrupt rootfs, failed in-place restart, half-built guest from an interrupted provision), the reconciler hits a signature-gated op and cannot proceed without an operator — the loop either stalls or silently gives up, defeating "self-healing … tolerant of missed polls." Why it matters: This is the security-critical control model. A fuzzy benign/destructive line is unimplementable: either the reconciler can destroy (and a compromised hub's desired state can wipe guests — the exact threat §4 exists to stop), or it can't (and self-heal is a fiction for the crashed-guest case). Grounding: 03 §4 self-describes the gate as "security-critical"; §9/§10 already rely on the reconciler rolling back "a half-built guest" — which is a destroy of a customer-id-bound resource, contradicting the blanket "guest destroy needs a signature." Suggested fix (crisp, implementable rule): Scope the reconciler's destructive verbs by provenance and data-bearing-ness, not by verb:

  • The reconciler MAY, without a signature: (a) create/start/restart; (b) destroy resources it created earlier in the same journaled transaction (compensating rollback, §10); (c) destroy resources tagged ephemeral/scratch (restore-test scratch guests, §8).
  • Destroying or overwriting any resource that holds the only/primary copy of customer data always needs an operator signature.
  • Healing a crashed controller is non-destructive by construction: the controller is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / docker compose up -d inside the existing guest — never a guest destroy. State this explicitly so the two clauses stop colliding. (The v0.33 self-heal precedent is already in-place restart: watchdog.go restarts stopped stacks, it never destroys the guest.)

B2 — vzdump bulk-exclusion: the rootfs-Docker-volume trap

Where: 03 §7 ("Bulk external mounts are excluded from the guest's vzdump (a per-mount backup flag)"). What: Two grounded problems:

  1. The flag is real but narrow. The pct manpage (verified): backup=<boolean>"Whether to include the mount point in backups (only used for volume mount points)." It does not apply to bind mounts / device mounts (those are handled separately).
  2. The trap: proxmox-platform.md §4.3 (validated in phase1-2 §2.2) proved that Docker named volumes live inside the LXC rootfs and ARE captured by vzdump — a sentinel in pgdata survived. The default Felhom app uses Docker named volumes. So unless bulk data is deliberately placed on a dedicated Proxmox volume mount point (backup=0) or a bind mount, a "bulk" volume will be an ordinary named volume in rootfs and will be silently swept into the whole-guest image — exactly the 1 TB-media-in-every-backup outcome §7 says it prevents. Why it matters: Backup size/cost and RPO blow up silently; the failure is invisible until a media drive fills the vzdump target. This is load-bearing for the §8 tier model. Grounding: pct manpage (fetched 2026); proxmox-platform.md §4.3; phase1-2 §2.2. Not covered by any spike — proxmox-platform.md §6 "not yet validated" should gain this row. Suggested fix: Make the placement contract explicit: a bulk volume must be realized as a dedicated LXC mount point (volume mountpoint with backup=0, or an external bind mount), never a Docker named volume in rootfs. The per-volume placement component (02 §5(2)) must enforce this at deploy. Add a Phase-3 spike: create an LXC with a backup=0 volume mountpoint + a bind mount, vzdump it, confirm both are excluded and the rootfs+backup=1 volume are included.

B3 — The agent's Proxmox role is mis-grounded as "the Phase-1 minimal role"

Where: 03 §2 ("scoped Proxmox API token (minimal role from Phase 1)"), §3 ("the Phase-1 minimal role is the API floor"). What: Phase 1's minimal role (FelhomSelfBackup = VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit) is the narrow self-backup role scoped to one guest, and Phase 1 explicitly proved it is denied (403) on create/allocate (phase1-2 §1.3 call #7) — i.e. exactly the operator-tier ops the agent's whole job consists of (provision, restore, storage allocation). Worse, 01 appendix states that guest-side role "is not used — we chose the agent-mediated path." So 03 cites, as the agent's role floor, a role that (a) the architecture discarded and (b) is provably insufficient for the agent. Why it matters: The agent's actual operator-tier role is undefined. Provisioning, restore, and storage management cannot be built or hardened against an undefined privilege set, and §3's root-minimization argument ("the Phase-1 minimal role is the API floor") collapses because that floor can't create a guest. Grounding: phase1-2 §1.3 (create CT = 403), §1.4 (role = self-backup only); 01 appendix ("not used … confirmed restore = operator-tier"); proxmox-platform.md §3.4. Suggested fix: Replace the Phase-1 reference with a new agent operator role to be defined and least-privilege-tested in a Phase-3 spike — minimally VM.Allocate, VM.Config.*, VM.PowerMgmt, VM.Snapshot(.Rollback), VM.Backup, VM.Audit, Datastore.Allocate(Space), Datastore.Audit, plus whatever storage-attach needs (see S4/root-boundary below). Keep §3's "API token, not root, where the API suffices" principle — that part is sound — but stop calling it the Phase-1 role.


Should-fix findings

S1 — No agent→controller channel for backup quiescing

Where: 03 §6 (local API is controller→agent only) vs §8 ("the controller stops the app stack … before a guest vzdump where app-consistency matters"). What: App-consistent LXC backup requires the controller to quiesce (no fsfreeze for LXC — proxmox-platform.md §4.2, phase1-2 §2.1). But the §6 surface is entirely controller→agent; the box-initiated model forbids the hub calling in, and there is no agent→controller call defined. For a hub/agent-scheduled backup (schedule lives in the manifest policy, §7), the agent has no way to tell the controller "quiesce now." Why it matters: Either scheduled backups silently fall back to crash-consistent (relying on WAL recovery, which phase1-2 §3 warns is unvalidated under write load), or the feature can't be built as drawn. Suggested fix: Make backups controller-driven for app-consistency: the controller learns due/policy via its own hub channel (or a GET /backup/due on the local API), quiesces, calls the existing POST /backup, then unquiesces on completion. Document that agent-initiated vzdump is crash-consistent only. (No inbound-to-guest channel needed — preserves §3/§5.)

S2 — Agent self-update revert authority unspecified

Where: 03 §11 ("a watchdog reverts to last-good if the new binary fails to come up healthy"). What: The agent is a single host systemd service with Restart=always (§3). If the new binary crashes on startup, systemd just restarts the same bad binary in a loop. "Revert to last-good" cannot be done by the thing that won't boot. §11 doesn't name the actor. Why it matters: A bad self-update can brick the crown-jewel host agent — the one component that recovers everything else — with no automatic recovery, requiring break-glass. Suggested fix: Put revert authority outside the swapped binary: e.g. an A/B symlink (current → good|new) where a separate systemd oneshot health-gate (ExecStartPost probe; on failure flip the symlink back and restart), or a tiny supervisor unit. Boot-into-last-good + explicit "commit" after a clean health window is the robust pattern. Add agent-update to the §10 journal so an interrupted swap is resumable.

S3 — Manifest schema omits live StoragePath fields without re-homing them

Where: 03 §7 table vs settings.go:90-103. What: Today's StoragePath carries Label, IsDefault, Schedulable, StoppedStacks, Decommissioned/DecommissionedAt/MigratedTo. The manifest covers state (attached/ disconnected/decommissioned) and durable_id, but drops: Label (human name, e.g. "Külső HDD 1TB" — UI), Schedulable/IsDefault (default placement target for new apps), StoppedStacks (which apps to restart on reconnect — app-domain), MigratedTo (decommission target pointer). Why it matters: 02 named this manifest as the contract that the settings.StoragePath reshape depends on. Silently dropped fields become lost behavior (no default-drive choice, no restart-after-reconnect list, no friendly labels). Suggested fix: Either add Label + a placement-default marker to the manifest, or explicitly state which fields re-home to the controller's settings (StoppedStacks and Label are plausibly controller-side; default/schedulable placement must live wherever placement decisions are made). Make the split explicit so neither side assumes the other owns it.

S4 — Geo-WAF ownership + Cloudflare API token placement unspecified

Where: 03 covers cloudflared (tunnel) health but is silent on geo-restriction WAF; 02 §6 had cloudflare/+geo "blocked on tunnel placement"; 01 §3 lists the controller's creds as "hub API key + local-API token" only. What: Now that tunnel placement is locked (host), the geo-restriction WAF management (cloudflare/ package: zone/waf/geosync) still has no home. It requires a Cloudflare API token (config.go InfrastructureConfig.cf_api_token) with zone-wide WAF edit rights. If geo stays in the controller (app-domain, per 02), a zone-wide Cloudflare token sits inside the customer guest — a real blast-radius concern (compromise → edit/disable WAF for the whole zone, potentially other customers on the same zone). Why it matters: Trust-boundary gap. 01 §5's boundary table has no row for controller↔ Cloudflare-API. Unspecified ownership blocks the 02 geo classification from being unblocked. Suggested fix: Decide geo-WAF ownership explicitly and add it to 01 §5. Options: (a) move WAF management to the agent/hub (operator-tier, token off the customer box); (b) keep it in the controller but scope the CF token per-zone/per-customer if the account model allows. Note this is now unblocked by the tunnel decision and should leave 02 §6's "blocked" state.

S5 — Cross-doc staleness on the now-locked tunnel placement

Where: 01 §11 ("Cloudflare Tunnel placement: host vs guest (§7)") and 02 §6 ("cloudflare/ + api/geo.go — blocked on tunnel placement") vs 03 §13 ("Resolved here: tunnel placement (host, agent-managed)") and the LOCKED list. What: 01 and 02 still present as OPEN/blocked a decision 03 and the locked set have resolved. Why it matters: A dev reading 01/02 would treat a settled decision as open (or a classification as blocked when only geo-ownership, S4, actually remains). Suggested fix: When folding this review in: update 01 §7/§11 to record tunnel=host (agent-managed systemd service); update 02 §6 to reduce the cloudflare item from "blocked on tunnel placement" to the narrower "blocked on geo-WAF ownership (S4)."

S6 — (self-correction) self-restore-test orchestration belongs to the agent, not the controller

Where: 02 §5(3) said "Self-restore-test orchestration — controller asks the agent to restore to scratch guest, validates, reports." 03 §8 makes the agent drive it autonomously; §6 gives the controller only GET /restore-test/status (read-only). What: 03 is right and 02 overreached. Zero-knowledge means only the box/agent holds the PBS key (03 §8); creating a scratch guest is operator-tier (create/allocate — phase1-2 §1.3 #7); the controller cannot do either. The controller's only piece is surfacing status. Why it matters: Keeps the NEW-component list honest — this is not a controller component to build beyond a status read. Suggested fix: Amend 02 §5(3) to "self-restore-test status display (read-only); the agent owns orchestration."


Minor findings

  • M1 (self-correction): 02 §3 lists UUID among settings.StoragePath fields. It is not there (settings.go:91-103: Path, Label, IsDefault, Schedulable, AddedAt, Disconnected/At, StoppedStacks, Decommissioned/At, MigratedTo). UUID is derived at runtime from fstab / /host-dev/disk/by-uuid by system.ParseFstabUUID and watchdog.go. The classification (settings = MODIFY/split) is unaffected; the field list was wrong.
  • M2: Consequently 03 §7's "absorbs the disk-state fields settings.StoragePath carries today" overstates: durable_id/UUID is not carried today, so the manifest adds durable identity (a genuine improvement — today the controller re-derives UUID from fstab each boot, which is fragile). Reword "absorbs" → "absorbs + adds durable_id."
  • M3: 03 §10 journals "provision, restore" but not controller-update (§11), which is also a multi-step async op (snapshot→pull→redeploy→health→rollback). Add it so an agent crash mid-controller-update is resume-or-rollback like the others.

Verified-correct (no action) — grounding that held up

  • LXC flags nesting=1,keyctl=1 + overlayfs (03 §9) match proxmox-platform.md §2.3 / phase0 §3. ✓
  • async task exitstatus, not POST return (03 §8) matches proxmox-platform.md §3.5. ✓
  • stop-mode backup not requiring VM.PowerMgmt (03 §8 "per Phase 1") matches proxmox-platform.md §3.4. ✓ (applies to the agent role too.)
  • running-LXC snapshot on LVM-thin (03 §6/§8/§11) matches proxmox-platform.md §4.5 / phase1-2 §1.6. ✓
  • monitor/pinger.go deprecation (02 DELETE-obsolete) confirmed in main.go:168,175 ("legacy, will be removed" / "no longer used — monitoring is now handled by the Hub"). ✓
  • backup keep/delete intra-file tear (02 hazard) confirmed: backup.go holds both RunDBDumps/DumpAppVolumes(Safe) (keep) and RunBackup/RunFullBackup (restic, delete); restore.go holds RestoreApp (restic) + RestoreAppFromTier2 (app). The §7-8 backup contract gives the extracted app-data-backup package a coherent destination. ✓
  • Control-plane-not-data-plane (03 §2/§43): apps keep serving if the agent dies — consistent with Docker-in-LXC running independently (phase0 §3). ✓
  • §6 per-guest local-API authorization (token→guest map): sound; a leaked token acts only on its own guest. Residual: a compromised controller can POST /rollback its own guest (blast radius = self) — acceptable per design; worth a one-line note that rollback is self-scoped and bounded.