# Architecture Part 3 — The Host Agent

> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong; flag it.

## 1. Purpose & scope

The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.

It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1): the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.

The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded: the agent's data model is per-host,
N-guests, never "the guest").

## 2. Responsibilities (and explicit non-responsibilities)

Owns:

1. **Proxmox lifecycle**: create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (minimal role from Phase 1) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management**: attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration**: vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring**: host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning**: build a guest, deploy the controller into it, hand it its bootstrap config.
6. **Hub control loop**: poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API**: the per-guest authorization gate the controller calls.
8. **Self-update**: update itself (carefully, since it is a host service) and update the controllers it owns.

Explicitly does **not**:

- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades: no new backups, no provisioning, and the hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.

## 3. Process model & host integration

- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized.** Default to a **non-root** service user with the scoped Proxmox token for API-covered work + a **narrow `sudoers` allowlist** for the handful of true host ops (USB mount-by-UUID, systemd mount units). Full root on the crown-jewel host is what a compromise most wants; avoid it where the API or a scoped sudoers entry suffices. *(Open: confirm during build which ops genuinely need host root vs. are API-covered; the Phase-1 minimal role is the API floor.)*
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.

## 4. Control model: reconcile + signed destructive ops

Two channels, split by **reversibility**, not by transport.

**(a) Desired-state reconciliation: steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, and redeploy of a crashed controller all fall out of
reconciliation for free.

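The convergence step in (a) can be sketched as a pure diff between desired and actual state. This is a minimal illustration, not the agent's real data model: `GuestSpec`, `Action`, and `Plan` are hypothetical names, and the spec is reduced to two fields. Guests that exist on the host but are missing from desired state are only reported, because removing them would be a destructive change.

```go
package main

import (
	"fmt"
	"sort"
)

// GuestSpec is a hypothetical, trimmed-down desired-state entry for one guest.
type GuestSpec struct {
	Cores    int
	MemoryMB int
}

// Action is one benign convergence step the reconciler wants to take.
type Action struct {
	Kind  string // "create" or "update"
	Guest string
}

// Plan compares desired and actual state and returns the benign steps needed
// to converge. Guests present on the host but absent from desired state are
// returned separately: removing them is destructive, so this sketch only
// reports them rather than acting on them.
func Plan(desired, actual map[string]GuestSpec) (actions []Action, unexpected []string) {
	for id, want := range desired {
		have, ok := actual[id]
		switch {
		case !ok:
			actions = append(actions, Action{"create", id})
		case have != want:
			actions = append(actions, Action{"update", id})
		}
	}
	for id := range actual {
		if _, ok := desired[id]; !ok {
			unexpected = append(unexpected, id)
		}
	}
	// Sort for a stable, reproducible plan regardless of map iteration order.
	sort.Slice(actions, func(i, j int) bool { return actions[i].Guest < actions[j].Guest })
	sort.Strings(unexpected)
	return actions, unexpected
}

func main() {
	desired := map[string]GuestSpec{"g1": {2, 2048}, "g2": {4, 4096}}
	actual := map[string]GuestSpec{"g1": {2, 1024}, "g9": {1, 512}}
	actions, unexpected := Plan(desired, actual)
	fmt.Println(actions, unexpected)
}
```

Because `Plan` has no side effects and depends only on its inputs, running it again after a missed poll or a partial failure produces the remaining work, which is the self-healing property described above.
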
**(b) Signed one-shot jobs: operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop: they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.

**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. So:

- **Irreversible/destructive operations** (guest destroy, storage detach/wipe, restore-overwrite, decommission) require a valid **operator signature**, *regardless of whether they arrive as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Benign convergence** (deploy a guest, attach storage, adjust a non-destructive policy, bump a controller version) runs on normal hub API auth, no signature.

Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard**: a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.

## 5. Hub ↔ agent protocol (host domain)

**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:

- **Up:** heartbeat + a host-domain state report: host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal: a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. A box that silently
stops reporting is the worst failure mode for a managed service, so it gets first-class
treatment hub-side.

**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.

## 6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage`: mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot`: snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback`: roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup`: request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/status`, `GET /restore-test/status`: read-only status for the controller's UI.

Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).

## 7. Storage manifest & reconciliation

The manifest is the load-bearing contract (it absorbs the disk-state fields that
`settings.StoragePath` carries today; see Part 2). Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS); survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |

Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog: detect a USB drop in seconds, not at the next health cycle).

**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`. **Bulk
external mounts are excluded from the guest's vzdump** (a per-mount backup flag) and carry
their own per-volume policy (file-level to a tier, or explicitly *not* backed up for
re-downloadable media). This is what keeps a 1 TB media drive out of the whole-guest image.

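The per-volume placement rule can be sketched against a reduced form of the manifest. `Target` mirrors a subset of the fields in the table above; `Volume` and `Place` are hypothetical, since the `.felhom.yml` schema is not defined in this document. Two simplifications are assumptions of the sketch: it ignores `role`, and it lets a bulk volume fall back to a `fast` target when no `slow` one is attached.

```go
package main

import (
	"errors"
	"fmt"
)

// Target mirrors a subset of the manifest fields in the table above.
type Target struct {
	Type      string // local-dir, usb, nfs, cifs, pbs
	DurableID string
	Class     string // fast or slow
	Role      string
	State     string // attached, disconnected, decommissioned
}

// Volume is a hypothetical per-volume declaration as it might be parsed from
// `.felhom.yml`; the real schema is not defined in this document.
type Volume struct {
	Name string
	Hot  bool // DB/config volumes are hot; media is bulk
}

// Place picks a target for one volume. Hot volumes must land on an attached
// fast target; bulk volumes prefer slow but may fall back to fast.
func Place(v Volume, targets []Target) (Target, error) {
	var fallback *Target
	for i := range targets {
		t := &targets[i]
		if t.State != "attached" {
			continue // disconnected or decommissioned targets are never eligible
		}
		switch {
		case v.Hot && t.Class == "fast":
			return *t, nil
		case !v.Hot && t.Class == "slow":
			return *t, nil
		case !v.Hot && fallback == nil:
			fallback = t // first attached fast target, kept for bulk fallback
		}
	}
	if fallback != nil {
		return *fallback, nil
	}
	return Target{}, errors.New("no eligible target for volume " + v.Name)
}

func main() {
	targets := []Target{
		{Type: "usb", DurableID: "uuid-1", Class: "slow", Role: "bulk-data", State: "attached"},
		{Type: "local-dir", DurableID: "local", Class: "fast", Role: "primary", State: "attached"},
	}
	db, _ := Place(Volume{"db", true}, targets)
	media, _ := Place(Volume{"media", false}, targets)
	fmt.Println(db.Type, media.Type)
}
```

Returning an error for a hot volume with no attached `fast` target, rather than silently using a `slow` one, is what "enforced" means in this sketch.
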
## 8. Backup/restore orchestration

Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback; not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).

- **Quiescing:** the controller stops the app stack (volume-consistent) before a guest
  vzdump where app-consistency matters; stop-mode/snapshot-mode per Phase 1. Every Proxmox
  op is async, so the agent polls `task exitstatus` and never trusts the POST return.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
  *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
  cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
  read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** this closes the "tested restore is the critical gap" theme. The
  agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
  health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
  restore-tested by the box (the operator lacks the key), so this lives in the agent by
  necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
  as the lighter check.

## 9. Provisioning & DR flows

**Provisioning (reconcile-driven).** Desired state says "this customer should have guest G
with controller C." The agent: enrolls (mints its scoped Proxmox token as root at setup) →
creates the LXC (unprivileged, `nesting=1,keyctl=1`, overlayfs; Phase 0) → deploys the
controller → hands it the bootstrap config (identity, hub API key, local-API token, mount
map). If any step fails, reconciliation retries; a half-built guest is journaled (§10) and
rolled back, never orphaned.

**Guest loss.** Agent restores G from the fastest surviving tier and resets identity
(MAC/hostname) so the restored guest rejoins cleanly.

**Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub (the durable
source of truth that survives box death) hands the new agent the existing identity, PBS
namespace, tunnel token, storage manifest, and a restore directive. Tunnel is reused from
the hub record, so DNS stays intact.

## 10. Concurrency, crash-safety, idempotency

- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
  work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
  ops on the same guest). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore) are journaled with
  their in-flight Proxmox task ids. On agent restart, the journal is replayed:
  resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).

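Per-guest serialization and run-once idempotency keys can be combined in one small gate. The sketch below uses a mutex per guest rather than an explicit queue, and keeps completed keys in memory only; `Serializer` and its methods are hypothetical names. A real agent would persist the keys with the operation journal so "run-once" also holds across restarts, and would distinguish in-flight from completed operations.

```go
package main

import (
	"fmt"
	"sync"
)

// Serializer runs mutations one at a time per guest and at most once per
// idempotency key. Names are illustrative; a real agent would persist the
// completed keys alongside its operation journal.
type Serializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // guest id -> lock
	done  map[string]bool        // idempotency key -> already claimed
}

func NewSerializer() *Serializer {
	return &Serializer{locks: map[string]*sync.Mutex{}, done: map[string]bool{}}
}

// Do runs op under the guest's lock. It returns false without running op when
// the key has already been claimed by an earlier call.
func (s *Serializer) Do(guest, key string, op func()) bool {
	// Look up or create the guest's lock under the registry mutex.
	s.mu.Lock()
	l, ok := s.locks[guest]
	if !ok {
		l = &sync.Mutex{}
		s.locks[guest] = l
	}
	s.mu.Unlock()

	// Only one mutation per guest at a time; other guests are not blocked.
	l.Lock()
	defer l.Unlock()

	s.mu.Lock()
	ran := s.done[key]
	s.done[key] = true
	s.mu.Unlock()
	if ran {
		return false
	}
	op()
	return true
}

func main() {
	s := NewSerializer()
	var wg sync.WaitGroup
	counter := 0 // only touched while holding guest g1's lock
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Do("g1", fmt.Sprintf("job-%d", i%50), func() { counter++ })
		}(i)
	}
	wg.Wait()
	fmt.Println(counter) // prints 50: each of the 50 distinct keys ran once
}
```
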
## 11. Self-update

- **Agent (the hard case: a host service, no snapshot-rollback).** Atomic binary swap:
  download → verify signature → atomic rename → restart; **keep last-known-good**; a watchdog
  reverts to last-good if the new binary fails to come up healthy. Triggered by a signed
  job delivered through the hub within the update window; manual always allowed.
- **Controller (the easy case: it's a guest).** The agent owns the controller's lifecycle,
  so the **agent updates the controller**: snapshot-before-update (free rollback, because the
  controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
  on failure. This resolves the Part-2 `selfupdate/` open item: the controller is **agent-managed**,
  not self-updating; the controller's old self-update path is removed.

## 12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures; public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here, which is the whole
argument for §3's root-minimization and a small, auditable agent.

## 13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface, and the storage-manifest schema.

Still open:

- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor); deferred to the company-case pass.
- Operator-side **signing tooling**: where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details; these belong to the hub architecture doc.

This doc hands the implementation three contracts it was waiting on:

1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.