# Architecture Part 3 — The Host Agent
> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong — flag it.
## 1. Purpose & scope
The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.
It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.
The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded — the agent's data model is per-host,
N-guests, never "the guest").
## 2. Responsibilities (and explicit non-responsibilities)
Owns:
1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (minimal role from Phase 1) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning** — build a guest, deploy the controller into it, hand it its bootstrap config.
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API** — the per-guest authorization gate the controller calls.
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.
Explicitly does **not**:
- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
## 3. Process model & host integration
- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized.** Default to a **non-root** service user with the scoped Proxmox token for API-covered work + a **narrow `sudoers` allowlist** for the handful of true host ops (USB mount-by-UUID, systemd mount units). Full root on the crown-jewel host is what a compromise most wants; avoid it where the API or a scoped sudoers entry suffices. *(Open: confirm during build which ops genuinely need host root vs. are API-covered — the Phase-1 minimal role is the API floor.)*
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.
## 4. Control model — reconcile + signed destructive ops
Two channels, split by **reversibility**, not by transport.
**(a) Desired-state reconciliation — steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
reconciliation for free.
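
The reconcile step can be sketched as a pure diff between desired and actual state. This is a minimal illustration, not the agent's real data model: `GuestSpec`, `Action`, and `Plan` are hypothetical names. It assumes that a guest missing from desired state is only reported, never destroyed, because destruction is irreversible and has to go through the signed path.

```go
package main

import (
	"fmt"
	"sort"
)

// GuestSpec is a hypothetical, trimmed-down desired-state entry for one guest.
type GuestSpec struct {
	Cores    int
	MemoryMB int
}

// Action is one benign convergence step the agent would take.
type Action struct {
	Kind  string // "create" or "update"
	Guest string
}

// Plan compares desired state against actual state and returns the
// non-destructive steps needed to converge. Guests that exist on the host
// but are absent from desired state are returned separately and are NOT
// destroyed here (assumption: destroy always needs an operator signature).
func Plan(desired, actual map[string]GuestSpec) (actions []Action, unmanaged []string) {
	for id, want := range desired {
		have, ok := actual[id]
		switch {
		case !ok:
			actions = append(actions, Action{Kind: "create", Guest: id})
		case have != want:
			actions = append(actions, Action{Kind: "update", Guest: id})
		}
	}
	for id := range actual {
		if _, ok := desired[id]; !ok {
			unmanaged = append(unmanaged, id)
		}
	}
	// Map iteration order is random in Go, so sort for a stable plan.
	sort.Slice(actions, func(i, j int) bool { return actions[i].Guest < actions[j].Guest })
	sort.Strings(unmanaged)
	return actions, unmanaged
}

func main() {
	desired := map[string]GuestSpec{
		"g1": {Cores: 2, MemoryMB: 2048},
		"g2": {Cores: 4, MemoryMB: 4096},
	}
	actual := map[string]GuestSpec{
		"g1": {Cores: 2, MemoryMB: 1024},
		"g9": {Cores: 1, MemoryMB: 512},
	}
	actions, unmanaged := Plan(desired, actual)
	fmt.Println("actions:", actions)
	fmt.Println("unmanaged (report only):", unmanaged)
}
```

Because `Plan` is idempotent, running it again after the actions succeed yields an empty plan, which is what makes missed polls harmless.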
**(b) Signed one-shot jobs — operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.
**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. So:
- **Irreversible/destructive operations** — guest destroy, storage detach/wipe, restore-overwrite, decommission — require a valid **operator signature**, *regardless of whether they arrive as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Benign convergence** — deploy a guest, attach storage, adjust a non-destructive policy, bump a controller version — runs on normal hub API auth, no signature.
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.
## 5. Hub ↔ agent protocol (host domain)
**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:
- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).
**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. This is the worst
failure mode for a managed service, so it gets first-class treatment hub-side.
**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.
## 6. Agent ↔ controller local API
The controller (in its LXC) reaches the agent (on the host) over the local bridge.
- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
## 7. Storage manifest & reconciliation
The manifest is the load-bearing contract (it absorbs the disk-state fields that
`settings.StoragePath` carries today — see Part 2). Held in the hub, reconciled by the agent.
Per target:
| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |
Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).
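
As a rough illustration, the manifest row and one reconcile pass might look like this in Go. The struct mirrors the table above but omits `creds` and `policy` for brevity. `ReconcileStorage` and the `mounted` lookup are hypothetical, and how a target's state flips to `disconnected` is left out.

```go
package main

import "fmt"

// Target is one storage-manifest entry (creds and policy omitted here).
type Target struct {
	Type      string // local-dir | usb | nfs | cifs | pbs
	DurableID string // UUID, server:export, or repo+fingerprint
	Class     string // fast | slow, set once at attach
	Role      string // primary | vzdump-target | pbs-offsite | bulk-data
	State     string // attached | disconnected | decommissioned
}

// ReconcileStorage returns which targets need a mount attempt and which
// must be surfaced to the hub. `mounted` is keyed by DurableID and stands in
// for whatever the agent learns from the host.
func ReconcileStorage(manifest []Target, mounted map[string]bool) (toMount, toReport []string) {
	for _, t := range manifest {
		switch t.State {
		case "attached":
			if !mounted[t.DurableID] {
				toMount = append(toMount, t.DurableID)
			}
		case "disconnected":
			toReport = append(toReport, t.DurableID)
		}
		// decommissioned targets are intentionally ignored.
	}
	return toMount, toReport
}

func main() {
	manifest := []Target{
		{Type: "local-dir", DurableID: "local-1", Class: "fast", Role: "primary", State: "attached"},
		{Type: "usb", DurableID: "uuid-1234", Class: "slow", Role: "vzdump-target", State: "attached"},
		{Type: "nfs", DurableID: "nas:/export", Class: "slow", Role: "bulk-data", State: "disconnected"},
	}
	mounted := map[string]bool{"local-1": true}
	toMount, toReport := ReconcileStorage(manifest, mounted)
	fmt.Println("mount:", toMount)
	fmt.Println("report to hub:", toReport)
}
```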
**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`. **Bulk
external mounts are excluded from the guest's vzdump** (a per-mount backup flag) and carry
their own per-volume policy (file-level to a tier, or explicitly *not* backed up for
re-downloadable media). This is what keeps a 1 TB media drive out of the whole-guest image.
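
The placement rule can be sketched as a small function: hot volumes are forced onto a `fast` mount regardless of what was declared, and bulk volumes follow their declaration. `Volume` and `Place` are hypothetical names, and defaulting an undeclared bulk volume to `fast` is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// Volume is a hypothetical per-volume entry derived from .felhom.yml.
type Volume struct {
	Name          string
	Hot           bool   // DB/config volumes
	DeclaredClass string // "fast", "slow", or "" if undeclared
}

// Place picks a mount path for v. `mounts` maps mount path -> class, as a
// controller might learn it from GET /storage.
func Place(v Volume, mounts map[string]string) (string, error) {
	want := v.DeclaredClass
	if v.Hot || want == "" {
		want = "fast" // enforced for hot volumes; assumed default otherwise
	}
	paths := make([]string, 0, len(mounts))
	for p := range mounts {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic choice when several mounts qualify
	for _, p := range paths {
		if mounts[p] == want {
			return p, nil
		}
	}
	return "", fmt.Errorf("no %s mount available for volume %q", want, v.Name)
}

func main() {
	mounts := map[string]string{"/mnt/nvme": "fast", "/mnt/usb-media": "slow"}

	dbPath, err := Place(Volume{Name: "db", Hot: true, DeclaredClass: "slow"}, mounts)
	fmt.Println(dbPath, err)

	mediaPath, err := Place(Volume{Name: "media", DeclaredClass: "slow"}, mounts)
	fmt.Println(mediaPath, err)
}
```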
## 8. Backup/restore orchestration
Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).
- **Quiescing:** the controller stops the app stack (volume-consistent) before a guest
vzdump where app-consistency matters; stop-mode/snapshot-mode per Phase 1. Every Proxmox
op is async → the agent polls `task exitstatus`, never trusts the POST return.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
*and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** this closes the "tested restore is the critical gap" theme. The
agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
restore-tested by the box (the operator lacks the key) — so this lives in the agent by
necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
as the lighter check.
## 9. Provisioning & DR flows
**Provisioning (reconcile-driven).** Desired state says "this customer should have guest G
with controller C." The agent: enrolls (mints its scoped Proxmox token as root at setup) →
creates the LXC (unprivileged, `nesting=1,keyctl=1`, overlayfs — Phase 0) → deploys the
controller → hands it the bootstrap config (identity, hub API key, local-API token, mount
map). If any step fails, reconciliation retries; a half-built guest is journaled (§10) and
rolled back, never orphaned.
**Guest loss.** Agent restores G from the fastest surviving tier and resets identity
(MAC/hostname) so the restored guest rejoins cleanly.
**Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub — the durable
source of truth that survives box death — hands the new agent the existing identity, PBS
namespace, tunnel token, storage manifest, and a restore directive. Tunnel is reused from
the hub record, so DNS stays intact.
## 10. Concurrency, crash-safety, idempotency
- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
ops on the same guest). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore) are journaled with
their in-flight Proxmox task ids. On agent restart, the journal is replayed:
resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).
## 11. Self-update
- **Agent (the hard case — a host service, no snapshot-rollback).** Atomic binary swap:
download → verify signature → atomic rename → restart; **keep last-known-good**; a watchdog
reverts to last-good if the new binary fails to come up healthy. Triggered by a signed job
delivered by the hub within the update window; manual always allowed.
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
so the **agent updates the controller**: snapshot-before-update (free rollback, because the
controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
on failure. This resolves the Part-2 `selfupdate/` open item: the controller is **agent-managed**,
not self-updating; the controller's old self-update path is removed.
## 12. Secrets at rest on the host
The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here — which is the whole
argument for §3's root-minimization and a small, auditable agent.
## 13. Open items / what this unblocks
Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface, and the storage-manifest schema.
Still open:
- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
This doc hands the implementation three contracts it was waiting on:
1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.