review
This commit is contained in:
@@ -0,0 +1,229 @@
|
|||||||
|
# Architecture Part 3 — The Host Agent
|
||||||
|
|
||||||
|
> Status: design draft (decision content). To be grounded by Claude Code against
|
||||||
|
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
|
||||||
|
> then placed at `docs/architecture/03-host-agent.md`.
|
||||||
|
>
|
||||||
|
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
|
||||||
|
> Where this doc and the locked decisions disagree, the locked decisions win and this
|
||||||
|
> draft is wrong — flag it.
|
||||||
|
|
||||||
|
## 1. Purpose & scope
|
||||||
|
|
||||||
|
The **host agent** is the operator-tier component that runs on each Proxmox host and
|
||||||
|
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
|
||||||
|
guests, manages host storage, orchestrates backups and restore-tests, watches the host
|
||||||
|
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
|
||||||
|
controllers it deploys.
|
||||||
|
|
||||||
|
It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
|
||||||
|
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
|
||||||
|
**moved here**. That makes the agent's hardening and blast-radius discipline the most
|
||||||
|
security-sensitive part of the platform.
|
||||||
|
|
||||||
|
The agent manages a **set** of guests on its host (usually one customer = one guest, but
|
||||||
|
the multi-tenant/company case is not precluded — the agent's data model is per-host,
|
||||||
|
N-guests, never "the guest").
|
||||||
|
|
||||||
|
## 2. Responsibilities (and explicit non-responsibilities)
|
||||||
|
|
||||||
|
Owns:
|
||||||
|
|
||||||
|
1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (minimal role from Phase 1) for everything the API covers; raw host ops only where unavoidable.
|
||||||
|
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
|
||||||
|
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
|
||||||
|
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
|
||||||
|
5. **Provisioning** — build a guest, deploy the controller into it, hand it its bootstrap config.
|
||||||
|
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
|
||||||
|
7. **Local API** — the per-guest authorization gate the controller calls.
|
||||||
|
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.
|
||||||
|
|
||||||
|
Explicitly does **not**:
|
||||||
|
|
||||||
|
- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
|
||||||
|
- Hold or proxy customer application data.
|
||||||
|
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
|
||||||
|
|
||||||
|
## 3. Process model & host integration
|
||||||
|
|
||||||
|
- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
|
||||||
|
- **Root-minimized.** Default to a **non-root** service user with the scoped Proxmox token for API-covered work + a **narrow `sudoers` allowlist** for the handful of true host ops (USB mount-by-UUID, systemd mount units). Full root on the crown-jewel host is what a compromise most wants; avoid it where the API or a scoped sudoers entry suffices. *(Open: confirm during build which ops genuinely need host root vs. are API-covered — the Phase-1 minimal role is the API floor.)*
|
||||||
|
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.
|
||||||
|
|
||||||
|
## 4. Control model — reconcile + signed destructive ops
|
||||||
|
|
||||||
|
Two channels, split by **reversibility**, not by transport.
|
||||||
|
|
||||||
|
**(a) Desired-state reconciliation — steady state.**
|
||||||
|
The hub holds desired state for the host: which guests should exist (and at what spec),
|
||||||
|
the storage manifest, backup/retention policies, controller image versions. The agent
|
||||||
|
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
|
||||||
|
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
|
||||||
|
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
|
||||||
|
reconciliation for free.
|
||||||
|
|
||||||
|
**(b) Signed one-shot jobs — operator actions.**
|
||||||
|
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
|
||||||
|
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
|
||||||
|
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
|
||||||
|
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
|
||||||
|
snapshot S"), not a procedure; the agent owns the *how*.
|
||||||
|
|
||||||
|
**The reversibility gate (security-critical).**
|
||||||
|
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
|
||||||
|
*desired state* for destructive changes. So:
|
||||||
|
|
||||||
|
- **Irreversible/destructive operations** — guest destroy, storage detach/wipe, restore-overwrite, decommission — require a valid **operator signature**, *regardless of whether they arrive as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
|
||||||
|
- **Benign convergence** — deploy a guest, attach storage, adjust a non-destructive policy, bump a controller version — runs on normal hub API auth, no signature.
|
||||||
|
|
||||||
|
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
|
||||||
|
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
|
||||||
|
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
|
||||||
|
could both issue and suppress the notice, which is exactly why the *signature* (not the
|
||||||
|
notification) is the control.
|
||||||
|
|
||||||
|
## 5. Hub ↔ agent protocol (host domain)
|
||||||
|
|
||||||
|
**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:
|
||||||
|
|
||||||
|
- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
|
||||||
|
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).
|
||||||
|
|
||||||
|
**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
|
||||||
|
*is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
|
||||||
|
alerts the operator when an agent misses its expected check-in window. This is the worst
|
||||||
|
failure mode for a managed service, so it gets first-class treatment hub-side.
|
||||||
|
|
||||||
|
**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
|
||||||
|
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
|
||||||
|
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
|
||||||
|
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
|
||||||
|
logged operation; it auto-expires.
|
||||||
|
|
||||||
|
## 6. Agent ↔ controller local API
|
||||||
|
|
||||||
|
The controller (in its LXC) reaches the agent (on the host) over the local bridge.
|
||||||
|
|
||||||
|
- **Transport:** HTTPS to the host's bridge IP on a fixed port.
|
||||||
|
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
|
||||||
|
- **Surface (minimal, all scoped to the caller's own guest):**
|
||||||
|
- `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
|
||||||
|
- `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
|
||||||
|
- `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
|
||||||
|
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
||||||
|
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
||||||
|
|
||||||
|
Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
|
||||||
|
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
|
||||||
|
|
||||||
|
## 7. Storage manifest & reconciliation
|
||||||
|
|
||||||
|
The manifest is the load-bearing contract (it absorbs the disk-state fields that
|
||||||
|
`settings.StoragePath` carries today — see Part 2). Held in the hub, reconciled by the agent.
|
||||||
|
|
||||||
|
Per target:
|
||||||
|
|
||||||
|
| field | meaning |
|
||||||
|
|---|---|
|
||||||
|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
|
||||||
|
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
|
||||||
|
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
|
||||||
|
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
|
||||||
|
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
|
||||||
|
| `policy` | schedule + retention for this target |
|
||||||
|
| `state` | `attached` / `disconnected` / `decommissioned` |
|
||||||
|
|
||||||
|
Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
|
||||||
|
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
|
||||||
|
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).
|
||||||
|
|
||||||
|
**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
|
||||||
|
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`. **Bulk
|
||||||
|
external mounts are excluded from the guest's vzdump** (a per-mount backup flag) and carry
|
||||||
|
their own per-volume policy (file-level to a tier, or explicitly *not* backed up for
|
||||||
|
re-downloadable media). This is what keeps a 1 TB media drive out of the whole-guest image.
|
||||||
|
|
||||||
|
## 8. Backup/restore orchestration
|
||||||
|
|
||||||
|
Tiers double as backup *and* restore-source priority (fastest surviving source first),
|
||||||
|
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
|
||||||
|
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).
|
||||||
|
|
||||||
|
- **Quiescing:** the controller stops the app stack (volume-consistent) before a guest
|
||||||
|
vzdump where app-consistency matters; stop-mode/snapshot-mode per Phase 1. Every Proxmox
|
||||||
|
op is async → the agent polls `task exitstatus`, never trusts the POST return.
|
||||||
|
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
|
||||||
|
*and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
|
||||||
|
cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
|
||||||
|
read the data; the customer's offsite recovery code is the irreducible residual.
|
||||||
|
- **Self-restore-test:** the closing of the "tested restore is the critical gap" theme. The
|
||||||
|
agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
|
||||||
|
health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
|
||||||
|
restore-tested by the box (the operator lacks the key) — so this lives in the agent by
|
||||||
|
necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
|
||||||
|
as the lighter check.
|
||||||
|
|
||||||
|
## 9. Provisioning & DR flows
|
||||||
|
|
||||||
|
**Provisioning (reconcile-driven).** Desired state says "this customer should have guest G
|
||||||
|
with controller C." The agent: enrolls (mints its scoped Proxmox token as root at setup) →
|
||||||
|
creates the LXC (unprivileged, `nesting=1,keyctl=1`, overlayfs — Phase 0) → deploys the
|
||||||
|
controller → hands it the bootstrap config (identity, hub API key, local-API token, mount
|
||||||
|
map). If any step fails, reconciliation retries; a half-built guest is journaled (§10) and
|
||||||
|
rolled back, never orphaned.
|
||||||
|
|
||||||
|
**Guest loss.** Agent restores G from the fastest surviving tier and resets identity
|
||||||
|
(MAC/hostname) so the restored guest rejoins cleanly.
|
||||||
|
|
||||||
|
**Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub — the durable
|
||||||
|
source of truth that survives box death — hands the new agent the existing identity, PBS
|
||||||
|
namespace, tunnel token, storage manifest, and a restore directive. Tunnel is reused from
|
||||||
|
the hub record, so DNS stays intact.
|
||||||
|
|
||||||
|
## 10. Concurrency, crash-safety, idempotency
|
||||||
|
|
||||||
|
- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
|
||||||
|
work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
|
||||||
|
ops on the same guest). Independent guests proceed in parallel.
|
||||||
|
- **Operation journaling.** Multi-step async ops (provision, restore) are journaled with
|
||||||
|
their in-flight Proxmox task ids. On agent restart, the journal is replayed:
|
||||||
|
resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
|
||||||
|
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).
|
||||||
|
|
||||||
|
## 11. Self-update
|
||||||
|
|
||||||
|
- **Agent (the hard case — a host service, no snapshot-rollback).** Atomic binary swap:
|
||||||
|
download → verify signature → atomic rename → restart; **keep last-known-good**; a watchdog
|
||||||
|
reverts to last-good if the new binary fails to come up healthy. Triggered by a hub signed
|
||||||
|
job within the update window; manual always allowed.
|
||||||
|
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
|
||||||
|
so the **agent updates the controller**: snapshot-before-update (free rollback, because the
|
||||||
|
controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
|
||||||
|
on failure. This resolves the Part-2 `selfupdate/` open: the controller is **agent-managed**,
|
||||||
|
not self-updating; the controller's old self-update path is removed.
|
||||||
|
|
||||||
|
## 12. Secrets at rest on the host
|
||||||
|
|
||||||
|
The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
|
||||||
|
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
|
||||||
|
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
|
||||||
|
and the secret footprint that left the controller now concentrate here — which is the whole
|
||||||
|
argument for §3's root-minimization and a small, auditable agent.
|
||||||
|
|
||||||
|
## 13. Open items / what this unblocks
|
||||||
|
|
||||||
|
Resolved here: tunnel placement (host, agent-managed, own systemd service), the
|
||||||
|
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
|
||||||
|
ownership, the local-API surface, and the storage-manifest schema.
|
||||||
|
|
||||||
|
Still open:
|
||||||
|
|
||||||
|
- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
|
||||||
|
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
|
||||||
|
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
|
||||||
|
|
||||||
|
This doc hands the implementation three contracts it was waiting on:
|
||||||
|
|
||||||
|
1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
|
||||||
|
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
|
||||||
|
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.
|
||||||
@@ -0,0 +1,250 @@
|
|||||||
|
# Critical design review — Proxmox re-platform doc set
|
||||||
|
|
||||||
|
Working artifact. Review pass over `01-topology-and-trust.md`, `02-controller-module-map.md`,
|
||||||
|
`03-host-agent.md`, `proxmox-platform.md`, and the Phase 0 / Phase 1-2 findings, grounded
|
||||||
|
against the v0.33 source (`deploy-felhom-compose/controller/`). Every finding cites a
|
||||||
|
file+line or a doc section. Severity: **blocking** / **should-fix** / **minor**.
|
||||||
|
|
||||||
|
Two findings are self-corrections of my own earlier work (`02` and `proxmox-platform.md`) —
|
||||||
|
flagged as such.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ranked summary
|
||||||
|
|
||||||
|
| # | Severity | Finding | Where |
|
||||||
|
|---|---|---|---|
|
||||||
|
| B1 | **blocking** | Reversibility gate contradicts the self-heal reconcile loop — crashed-guest healing can require a signature-gated destroy → reconcile stalls | `03` §4 vs §4(a) |
|
||||||
|
| B2 | **blocking** | vzdump bulk-exclusion only works for **volume** mount points; Docker **named volumes live in the LXC rootfs and ARE captured** → naive placement silently backs up the 1 TB media drive. Unvalidated by spike. | `03` §7 vs `proxmox-platform.md` §4.3 + pct manpage |
|
||||||
|
| B3 | **blocking** | Agent's Proxmox role is called "the minimal role from Phase 1" — but that role is the *narrow self-backup* role that Phase 1 proved is **denied** create/allocate/restore. The agent's operator-tier role is undefined. | `03` §2/§3 vs `phase1-2` §1.3-1.4, `01` appendix |
|
||||||
|
| S1 | should-fix | Quiescing for agent/hub-scheduled backups has **no agent→controller channel** — the local API is controller→agent only | `03` §6, §8 |
|
||||||
|
| S2 | should-fix | Agent self-update revert authority unspecified — if the new binary won't boot, nothing outside it can flip back | `03` §11 |
|
||||||
|
| S3 | should-fix | Storage manifest drops fields `settings.StoragePath` carries today (Label, Schedulable/default, StoppedStacks, MigratedTo) with no re-homing stated | `03` §7 vs `settings.go:90-103` |
|
||||||
|
| S4 | should-fix | Geo-restriction WAF ownership + Cloudflare **API token** placement unspecified after tunnel placement was locked; zone-wide token in a guest is a blast-radius concern | `03` (absent), `01` §3, `config.go` InfrastructureConfig |
|
||||||
|
| S5 | should-fix | Cross-doc staleness: `01` §11 still lists tunnel placement OPEN; `02` §6 lists geo "blocked on tunnel placement" — both resolved by `03` §13 | `01` §11, `02` §6 vs `03` §13 |
|
||||||
|
| S6 | should-fix (self-correct) | `02` put self-restore-test **orchestration** in the controller; `03` correctly makes it agent-owned (controller only reads status) | `02` §5(3) vs `03` §6/§8 |
|
||||||
|
| M1 | minor (self-correct) | `02` §3 lists `UUID` as a `settings.StoragePath` field — it isn't; UUID is derived from fstab at runtime | `02` §3 vs `settings.go:91-103` |
|
||||||
|
| M2 | minor | `03` §7 says the manifest "absorbs the disk-state fields StoragePath carries today" incl. UUID — UUID isn't persisted today, so the manifest *adds* it (an improvement, not absorption) | `03` §7 |
|
||||||
|
| M3 | minor | controller-update is not in `03` §10's journaled-ops list, though it's a multi-step async op | `03` §10 vs §11 |
|
||||||
|
|
||||||
|
**Values check: clean.** No DR/key-custody/offboarding path leaves a customer locked out.
|
||||||
|
Zero-knowledge DR (`03` §8, `01` §8) correctly makes the customer recovery code the
|
||||||
|
irreducible residual; the operator cannot read data and the box can still restore-test.
|
||||||
|
No hostage path found.
|
||||||
|
|
||||||
|
**Locked premises:** reviewed for soundness/consistency only; not relitigated.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Blocking findings
|
||||||
|
|
||||||
|
### B1 — The reversibility gate stalls the self-healing reconcile loop
|
||||||
|
**Where:** `03` §4(a) vs the gate in §4.
|
||||||
|
**What:** §4(a) lists "redeploy of a crashed controller" as benign convergence that "falls
|
||||||
|
out of reconciliation for free." The gate then lists **guest destroy** among the
|
||||||
|
irreversible ops that require an operator signature "*regardless of whether they arrive as a
|
||||||
|
job or as a desired-state delta*." These collide: if healing a wedged guest requires
|
||||||
|
destroy+recreate (corrupt rootfs, failed in-place restart, half-built guest from an
|
||||||
|
interrupted provision), the reconciler hits a signature-gated op and **cannot proceed
|
||||||
|
without an operator** — the loop either stalls or silently gives up, defeating "self-healing
|
||||||
|
… tolerant of missed polls."
|
||||||
|
**Why it matters:** This is the security-critical control model. A fuzzy benign/destructive
|
||||||
|
line is unimplementable: either the reconciler can destroy (and a compromised hub's desired
|
||||||
|
state can wipe guests — the exact threat §4 exists to stop), or it can't (and self-heal is a
|
||||||
|
fiction for the crashed-guest case).
|
||||||
|
**Grounding:** `03` §4 self-describes the gate as "security-critical"; §9/§10 already rely on
|
||||||
|
the reconciler rolling back "a half-built guest" — which *is* a destroy of a customer-id-bound
|
||||||
|
resource, contradicting the blanket "guest destroy needs a signature."
|
||||||
|
**Suggested fix (crisp, implementable rule):** Scope the reconciler's destructive verbs by
|
||||||
|
*provenance and data-bearing-ness*, not by verb:
|
||||||
|
- The reconciler MAY, without a signature: (a) create/start/restart; (b) destroy resources it
|
||||||
|
**created earlier in the same journaled transaction** (compensating rollback, §10); (c)
|
||||||
|
destroy resources **tagged ephemeral/scratch** (restore-test scratch guests, §8).
|
||||||
|
- Destroying or overwriting any resource that **holds the only/primary copy of customer data**
|
||||||
|
always needs an operator signature.
|
||||||
|
- **Healing a crashed controller is non-destructive by construction:** the controller is
|
||||||
|
reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the
|
||||||
|
LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. State
|
||||||
|
this explicitly so the two clauses stop colliding. (The v0.33 self-heal precedent is already
|
||||||
|
in-place restart: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
|
||||||
|
|
||||||
|
### B2 — vzdump bulk-exclusion: the rootfs-Docker-volume trap
|
||||||
|
**Where:** `03` §7 ("Bulk external mounts are excluded from the guest's vzdump (a per-mount
|
||||||
|
backup flag)").
|
||||||
|
**What:** Two grounded problems:
|
||||||
|
1. The flag is real but narrow. The pct manpage (verified): `backup=<boolean>` —
|
||||||
|
*"Whether to include the mount point in backups (**only used for volume mount points**)."*
|
||||||
|
It does **not** apply to bind mounts / device mounts (those are handled separately).
|
||||||
|
2. The trap: `proxmox-platform.md` §4.3 (validated in `phase1-2` §2.2) proved that **Docker
|
||||||
|
named volumes live inside the LXC rootfs and ARE captured by vzdump** — a sentinel in
|
||||||
|
`pgdata` survived. The default Felhom app uses Docker named volumes. So unless bulk data is
|
||||||
|
deliberately placed on a **dedicated Proxmox volume mount point** (backup=0) or a bind
|
||||||
|
mount, a "bulk" volume will be an ordinary named volume in rootfs and will be **silently
|
||||||
|
swept into the whole-guest image** — exactly the 1 TB-media-in-every-backup outcome §7 says
|
||||||
|
it prevents.
|
||||||
|
**Why it matters:** Backup size/cost and RPO blow up silently; the failure is invisible until
|
||||||
|
a media drive fills the vzdump target. This is load-bearing for the §8 tier model.
|
||||||
|
**Grounding:** pct manpage (fetched 2026); `proxmox-platform.md` §4.3; `phase1-2` §2.2.
|
||||||
|
Not covered by any spike — `proxmox-platform.md` §6 "not yet validated" should gain this row.
|
||||||
|
**Suggested fix:** Make the placement contract explicit: a `bulk` volume **must** be realized
|
||||||
|
as a dedicated LXC mount point (volume mountpoint with `backup=0`, or an external bind mount),
|
||||||
|
**never** a Docker named volume in rootfs. The per-volume placement component (`02` §5(2))
|
||||||
|
must enforce this at deploy. Add a Phase-3 spike: create an LXC with a `backup=0` volume
|
||||||
|
mountpoint + a bind mount, vzdump it, confirm both are excluded and the rootfs+`backup=1`
|
||||||
|
volume are included.
|
||||||
|
|
||||||
|
### B3 — The agent's Proxmox role is mis-grounded as "the Phase-1 minimal role"
|
||||||
|
**Where:** `03` §2 ("scoped Proxmox API token (minimal role from Phase 1)"), §3 ("the
|
||||||
|
Phase-1 minimal role is the API floor").
|
||||||
|
**What:** Phase 1's minimal role (`FelhomSelfBackup` = `VM.Audit, VM.Snapshot, VM.Backup,
|
||||||
|
Datastore.AllocateSpace, Datastore.Audit`) is the **narrow self-backup** role scoped to one
|
||||||
|
guest, and Phase 1 explicitly proved it is **denied (403)** on create/allocate
|
||||||
|
(`phase1-2` §1.3 call #7) — i.e. exactly the operator-tier ops the agent's whole job consists
|
||||||
|
of (provision, restore, storage allocation). Worse, `01` appendix states that guest-side role
|
||||||
|
"**is not used** — we chose the agent-mediated path." So `03` cites, as the agent's role
|
||||||
|
floor, a role that (a) the architecture discarded and (b) is provably insufficient for the
|
||||||
|
agent.
|
||||||
|
**Why it matters:** The agent's actual operator-tier role is **undefined**. Provisioning,
|
||||||
|
restore, and storage management cannot be built or hardened against an undefined privilege
|
||||||
|
set, and §3's root-minimization argument ("the Phase-1 minimal role is the API floor")
|
||||||
|
collapses because that floor can't create a guest.
|
||||||
|
**Grounding:** `phase1-2` §1.3 (create CT = 403), §1.4 (role = self-backup only); `01`
|
||||||
|
appendix ("not used … confirmed restore = operator-tier"); `proxmox-platform.md` §3.4.
|
||||||
|
**Suggested fix:** Replace the Phase-1 reference with a **new agent operator role** to be
|
||||||
|
defined and least-privilege-tested in a Phase-3 spike — minimally `VM.Allocate`, `VM.Config.*`,
|
||||||
|
`VM.PowerMgmt`, `VM.Snapshot(.Rollback)`, `VM.Backup`, `VM.Audit`, `Datastore.Allocate(Space)`,
|
||||||
|
`Datastore.Audit`, plus whatever storage-attach needs (see S4/root-boundary below). Keep §3's
|
||||||
|
"API token, not root, where the API suffices" principle — that part is sound — but stop
|
||||||
|
calling it the Phase-1 role.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Should-fix findings
|
||||||
|
|
||||||
|
### S1 — No agent→controller channel for backup quiescing
|
||||||
|
**Where:** `03` §6 (local API is controller→agent only) vs §8 ("the controller stops the app
|
||||||
|
stack … before a guest vzdump where app-consistency matters").
|
||||||
|
**What:** App-consistent LXC backup requires the controller to quiesce (no fsfreeze for LXC —
|
||||||
|
`proxmox-platform.md` §4.2, `phase1-2` §2.1). But the §6 surface is entirely controller→agent;
|
||||||
|
the box-initiated model forbids the hub calling in, and there is no agent→controller call
|
||||||
|
defined. For a **hub/agent-scheduled** backup (schedule lives in the manifest `policy`, §7),
|
||||||
|
the agent has no way to tell the controller "quiesce now."
|
||||||
|
**Why it matters:** Either scheduled backups silently fall back to crash-consistent (relying
|
||||||
|
on WAL recovery, which `phase1-2` §3 warns is unvalidated under write load), or the feature
|
||||||
|
can't be built as drawn.
|
||||||
|
**Suggested fix:** Make backups **controller-driven for app-consistency**: the controller
|
||||||
|
learns due/policy via its own hub channel (or a `GET /backup/due` on the local API), quiesces,
|
||||||
|
calls the existing `POST /backup`, then unquiesces on completion. Document that agent-initiated
|
||||||
|
vzdump is crash-consistent only. (No inbound-to-guest channel needed — preserves §3/§5.)
|
||||||
|
|
||||||
|
### S2 — Agent self-update revert authority unspecified
|
||||||
|
**Where:** `03` §11 ("a watchdog reverts to last-good if the new binary fails to come up
|
||||||
|
healthy").
|
||||||
|
**What:** The agent is a single host systemd service with `Restart=always` (§3). If the new
|
||||||
|
binary crashes on startup, systemd just restarts the **same bad binary** in a loop. "Revert
|
||||||
|
to last-good" cannot be done *by* the thing that won't boot. §11 doesn't name the actor.
|
||||||
|
**Why it matters:** A bad self-update can brick the crown-jewel host agent — the one component
|
||||||
|
that recovers everything else — with no automatic recovery, requiring break-glass.
|
||||||
|
**Suggested fix:** Put revert authority **outside** the swapped binary: e.g. an A/B symlink
|
||||||
|
(`current → good|new`) where a separate systemd oneshot health-gate (`ExecStartPost` probe; on
|
||||||
|
failure flip the symlink back and restart), or a tiny supervisor unit. Boot-into-last-good +
|
||||||
|
explicit "commit" after a clean health window is the robust pattern. Add agent-update to the
|
||||||
|
§10 journal so an interrupted swap is resumable.
|
||||||
|
|
||||||
|
### S3 — Manifest schema omits live `StoragePath` fields without re-homing them
|
||||||
|
**Where:** `03` §7 table vs `settings.go:90-103`.
|
||||||
|
**What:** Today's `StoragePath` carries `Label`, `IsDefault`, `Schedulable`, `StoppedStacks`,
|
||||||
|
`Decommissioned`/`DecommissionedAt`/`MigratedTo`. The manifest covers state (attached/
|
||||||
|
disconnected/decommissioned) and durable_id, but drops: **Label** (human name, e.g. "Külső
|
||||||
|
HDD 1TB" — UI), **Schedulable/IsDefault** (default placement target for new apps),
|
||||||
|
**StoppedStacks** (which apps to restart on reconnect — app-domain), **MigratedTo** (decommission
|
||||||
|
target pointer).
|
||||||
|
**Why it matters:** `02` named this manifest as the contract that the `settings.StoragePath`
|
||||||
|
reshape depends on. Silently dropped fields become lost behavior (no default-drive choice, no
|
||||||
|
restart-after-reconnect list, no friendly labels).
|
||||||
|
**Suggested fix:** Either add Label + a placement-default marker to the manifest, or explicitly
|
||||||
|
state which fields re-home to the controller's `settings` (StoppedStacks and Label are
|
||||||
|
plausibly controller-side; default/schedulable placement must live wherever placement decisions
|
||||||
|
are made). Make the split explicit so neither side assumes the other owns it.
|
||||||
|
|
||||||
|
### S4 — Geo-WAF ownership + Cloudflare API token placement unspecified
|
||||||
|
**Where:** `03` covers `cloudflared` (tunnel) health but is silent on geo-restriction WAF; `02`
|
||||||
|
§6 had `cloudflare/`+`geo` "blocked on tunnel placement"; `01` §3 lists the controller's creds
|
||||||
|
as "hub API key + local-API token" only.
|
||||||
|
**What:** Now that tunnel placement is locked (host), the **geo-restriction WAF** management
|
||||||
|
(`cloudflare/` package: zone/waf/geosync) still has no home. It requires a Cloudflare **API
|
||||||
|
token** (`config.go` InfrastructureConfig.cf_api_token) with zone-wide WAF edit rights. If geo
|
||||||
|
stays in the controller (app-domain, per `02`), a **zone-wide Cloudflare token sits inside the
|
||||||
|
customer guest** — a real blast-radius concern (compromise → edit/disable WAF for the whole
|
||||||
|
zone, potentially other customers on the same zone).
|
||||||
|
**Why it matters:** Trust-boundary gap. `01` §5's boundary table has no row for controller↔
|
||||||
|
Cloudflare-API. Unspecified ownership blocks the `02` geo classification from being unblocked.
|
||||||
|
**Suggested fix:** Decide geo-WAF ownership explicitly and add it to `01` §5. Options: (a) move
|
||||||
|
WAF management to the **agent/hub** (operator-tier, token off the customer box); (b) keep it in
|
||||||
|
the controller but scope the CF token per-zone/per-customer if the account model allows. Note
|
||||||
|
this is now *unblocked* by the tunnel decision and should leave `02` §6's "blocked" state.
|
||||||
|
|
||||||
|
### S5 — Cross-doc staleness on the now-locked tunnel placement
|
||||||
|
**Where:** `01` §11 ("Cloudflare Tunnel placement: host vs guest (§7)") and `02` §6
|
||||||
|
("`cloudflare/` + `api/geo.go` — blocked on tunnel placement") vs `03` §13 ("Resolved here:
|
||||||
|
tunnel placement (host, agent-managed)") and the LOCKED list.
|
||||||
|
**What:** `01` and `02` still present as OPEN/blocked a decision `03` and the locked set have
|
||||||
|
resolved.
|
||||||
|
**Why it matters:** A dev reading `01`/`02` would treat a settled decision as open (or a
|
||||||
|
classification as blocked when only geo-ownership, S4, actually remains).
|
||||||
|
**Suggested fix:** When folding this review in: update `01` §7/§11 to record tunnel=host
|
||||||
|
(agent-managed systemd service); update `02` §6 to reduce the cloudflare item from "blocked on
|
||||||
|
tunnel placement" to the narrower "blocked on geo-WAF ownership (S4)."
|
||||||
|
|
||||||
|
### S6 — (self-correction) self-restore-test orchestration belongs to the agent, not the controller
|
||||||
|
**Where:** `02` §5(3) said "Self-restore-test orchestration — *controller* asks the agent to
|
||||||
|
restore to scratch guest, validates, reports." `03` §8 makes the **agent** drive it
|
||||||
|
autonomously; §6 gives the controller only `GET /restore-test/status` (read-only).
|
||||||
|
**What:** `03` is right and `02` overreached. Zero-knowledge means only the box/agent holds the
|
||||||
|
PBS key (`03` §8); creating a scratch guest is operator-tier (create/allocate — `phase1-2`
|
||||||
|
§1.3 #7); the controller cannot do either. The controller's only piece is surfacing status.
|
||||||
|
**Why it matters:** Keeps the NEW-component list honest — this is not a controller component to
|
||||||
|
build beyond a status read.
|
||||||
|
**Suggested fix:** Amend `02` §5(3) to "self-restore-test **status display** (read-only); the
|
||||||
|
agent owns orchestration."
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Minor findings
|
||||||
|
|
||||||
|
- **M1 (self-correction):** `02` §3 lists `UUID` among `settings.StoragePath` fields. It is
|
||||||
|
**not** there (`settings.go:91-103`: Path, Label, IsDefault, Schedulable, AddedAt,
|
||||||
|
Disconnected/At, StoppedStacks, Decommissioned/At, MigratedTo). UUID is derived at runtime
|
||||||
|
from fstab / `/host-dev/disk/by-uuid` by `system.ParseFstabUUID` and `watchdog.go`. The
|
||||||
|
classification (settings = MODIFY/split) is unaffected; the field list was wrong.
|
||||||
|
- **M2:** Consequently `03` §7's "absorbs the disk-state fields `settings.StoragePath` carries
|
||||||
|
today" overstates: `durable_id`/UUID is *not* carried today, so the manifest **adds** durable
|
||||||
|
identity (a genuine improvement — today the controller re-derives UUID from fstab each boot,
|
||||||
|
which is fragile). Reword "absorbs" → "absorbs + adds durable_id."
|
||||||
|
- **M3:** `03` §10 journals "provision, restore" but not **controller-update** (§11), which is
|
||||||
|
also a multi-step async op (snapshot→pull→redeploy→health→rollback). Add it so an agent crash
|
||||||
|
mid-controller-update is resume-or-rollback like the others.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Verified-correct (no action) — grounding that held up
|
||||||
|
|
||||||
|
- LXC flags `nesting=1,keyctl=1` + overlayfs (`03` §9) match `proxmox-platform.md` §2.3 /
|
||||||
|
`phase0` §3. ✓
|
||||||
|
- async `task exitstatus`, not POST return (`03` §8) matches `proxmox-platform.md` §3.5. ✓
|
||||||
|
- stop-mode backup not requiring `VM.PowerMgmt` (`03` §8 "per Phase 1") matches
|
||||||
|
`proxmox-platform.md` §3.4. ✓ (applies to the agent role too.)
|
||||||
|
- running-LXC snapshot on LVM-thin (`03` §6/§8/§11) matches `proxmox-platform.md` §4.5 /
|
||||||
|
`phase1-2` §1.6. ✓
|
||||||
|
- `monitor/pinger.go` deprecation (`02` DELETE-obsolete) confirmed in `main.go:168,175`
|
||||||
|
("legacy, will be removed" / "no longer used — monitoring is now handled by the Hub"). ✓
|
||||||
|
- backup keep/delete **intra-file tear** (`02` hazard) confirmed: `backup.go` holds both
|
||||||
|
`RunDBDumps`/`DumpAppVolumes(Safe)` (keep) and `RunBackup`/`RunFullBackup` (restic, delete);
|
||||||
|
`restore.go` holds `RestoreApp` (restic) + `RestoreAppFromTier2` (app). The §7-8 backup
|
||||||
|
contract gives the extracted app-data-backup package a coherent destination. ✓
|
||||||
|
- Control-plane-not-data-plane (`03` §2/§43): apps keep serving if the agent dies — consistent
|
||||||
|
with Docker-in-LXC running independently (`phase0` §3). ✓
|
||||||
|
- §6 per-guest local-API authorization (token→guest map): sound; a leaked token acts only on
|
||||||
|
its own guest. Residual: a compromised controller can `POST /rollback` its **own** guest
|
||||||
|
(blast radius = self) — acceptable per design; worth a one-line note that rollback is
|
||||||
|
self-scoped and bounded.
|
||||||
Reference in New Issue
Block a user