Architecture Part 3 — The Host Agent

Status: design draft (decision content). To be grounded by Claude Code against docs/proxmox-platform.md and docs/architecture/02-controller-module-map.md, then placed at docs/architecture/03-host-agent.md.

Builds on Part 1 (01-topology-and-trust.md) and Part 2 (02-controller-module-map.md). Where this doc and the locked decisions disagree, the locked decisions win and this draft is wrong — flag it.

1. Purpose & scope

The host agent is the operator-tier component that runs on each Proxmox host and owns all Proxmox interaction. It is the trusted host actor: it provisions and restores guests, manages host storage, orchestrates backups and restore-tests, watches the host and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest controllers it deploys.

It is the privileged tier. The controller deliberately holds no Proxmox credentials (Part 1) — the privilege the controller shed by losing storage/ did not disappear, it moved here. That makes the agent's hardening and blast-radius discipline the most security-sensitive part of the platform.

The agent manages a set of guests on its host (usually one customer = one guest, but the multi-tenant/company case is not precluded — the agent's data model is per-host, N-guests, never "the guest").

2. Responsibilities (and explicit non-responsibilities)

Owns:

  1. Proxmox lifecycle: create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the FelhomAgent operator role, proxmox-platform.md §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
  2. Storage management — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
  3. Backup/restore orchestration — vzdump to the tiers, PBS, snapshot management, and the self-restore-test.
  4. Host & tunnel monitoring — host metrics, guest up/down, storage-target status, and cloudflared health; reports the host domain to the hub.
  5. Provisioning — provision a guest by restoring the golden base image (§9), deploy the controller into it, hand it its bootstrap config; also build and refresh the golden base image itself.
  6. Hub control loop — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
  7. Local API — the per-guest authorization gate the controller calls.
  8. Self-update — update itself (carefully — it is a host service) and update the controllers it owns.

Explicitly does not:

  • Serve application traffic or sit in the data path. Control plane, not data plane: if the agent dies, apps keep serving (Docker + LXC run without it); only management degrades — no new backups, no provisioning, hub loses the heartbeat.
  • Hold or proxy customer application data.
  • Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
  • Manage geo-restriction / the Cloudflare API. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the hub (holding the CF API token) reconciles the WAF (S4). The agent manages only the tunnel service (cloudflared, §3/§5), never WAF rules.

3. Process model & host integration

  • Native Go binary, systemd service on the host: boot-start, Restart=always, systemd watchdog (kill+restart on hang), journald logging, resource limits.
  • Root-minimized (boundary settled in Phase 3, B3). The agent runs as a non-root service user with the scoped FelhomAgent token for all API-covered work, plus a narrow sudoers allowlist for true host ops. The entire per-customer guest lifecycle (provision by restore (§9), config, start/stop, snapshot, backup, restore, destroy) is token-covered. Genuine OS-root is confined to: (1) building/refreshing the golden base image (creating a guest with keyctl=1 is root@pam-only; done once at enrollment plus on a maintenance cadence, §9); (2) host mounts (USB mount-by-UUID, systemd mount units / fstab); (3) SMART / hardware sensors. Root therefore never sits on the per-customer path. See proxmox-platform.md §3.6 for the role + boundary table.
  • cloudflared is a separate systemd service, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent manages and health-watches it (see §5) but the tunnel does not live or die with the agent process.

4. Control model — reconcile + signed destructive ops

Two channels, split by reversibility, not by transport.

(a) Desired-state reconciliation — steady state. The hub holds desired state for the host: which guests should exist (and at what spec), the storage manifest, backup/retention policies, controller image versions. The agent runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing, and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries, re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of reconciliation for free.
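The reconcile loop in (a) can be sketched as a pure planning function that is safe to run on every poll. This is a minimal illustration, not the agent's actual code: the `GuestSpec`, `ActualGuest`, and `Action` types and the action names are hypothetical. Note that it never plans a destroy; removal of a guest is subject to the reversibility gate.

```go
package main

import (
	"fmt"
	"sort"
)

// GuestSpec is a hypothetical desired-state entry for one guest.
type GuestSpec struct {
	Cores    int
	MemoryMB int
}

// ActualGuest is what the agent observed on the host (hypothetical shape).
type ActualGuest struct {
	Spec    GuestSpec
	Running bool
}

// Action is one non-destructive step the reconciler plans.
type Action struct {
	Kind  string // "provision", "resize", or "start"
	Guest string
}

// Plan compares desired and actual state and returns converging actions.
// Guests present on the host but absent from desired state are left alone.
func Plan(desired map[string]GuestSpec, actual map[string]ActualGuest) []Action {
	var out []Action
	for id, want := range desired {
		have, ok := actual[id]
		switch {
		case !ok:
			out = append(out, Action{"provision", id})
		case have.Spec != want:
			out = append(out, Action{"resize", id})
		case !have.Running:
			out = append(out, Action{"start", id})
		}
	}
	// Sort for a deterministic plan; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Guest < out[j].Guest })
	return out
}

func main() {
	desired := map[string]GuestSpec{"g1": {2, 2048}, "g2": {4, 4096}}
	actual := map[string]ActualGuest{"g1": {GuestSpec{2, 2048}, false}}
	for _, a := range Plan(desired, actual) {
		fmt.Println(a.Kind, a.Guest)
	}
}
```

Because `Plan` returns an empty slice once actual matches desired, a missed or repeated poll costs nothing: the next loop simply recomputes the difference.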

(b) Signed one-shot jobs — operator actions. Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once (idempotency key), written to the customer-visible audit log, and outside the reconcile loop — they are point-in-time and often destructive, and a reconciler must never re-run a restore because it "sees drift." A one-shot job names a target ("restore guest X from snapshot S"), not a procedure; the agent owns the how.

The reversibility gate (security-critical). "Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied desired state for destructive changes. The gate is by provenance + data-bearing-ness, not by verb:

  • The reconciler MAY act without an operator signature when: (a) creating/starting/restarting; (b) destroying resources it created earlier within the same journaled transaction (compensating rollback, §10); (c) destroying resources it tagged ephemeral/scratch (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is agent-internal provenance and is never accepted from the hub — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
  • An operator signature is always required to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — regardless of whether it arrives as a job or as a desired-state delta. A compromised hub cannot forge them because the signing key is not held by the hub (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
  • Healing a crashed controller is non-destructive by construction: it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / docker compose up -d inside the existing guest — never a guest destroy. (v0.33 precedent: watchdog.go restarts stopped stacks, it never destroys the guest.)
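The gate above reduces to a small decision function over the operation and the resource's provenance. The sketch below assumes hypothetical `Op` and `Provenance` enums; the essential property is that `Provenance` is derived from the agent's own journal and tags, and is never parsed from hub input.

```go
package main

import "fmt"

// Provenance records how the agent came to know a resource. In this
// sketch it is set only by the agent itself, never taken from the hub.
type Provenance int

const (
	Customer        Provenance = iota // live, data-bearing guest or storage
	SameTransaction                   // created earlier in the current journaled transaction
	Scratch                           // agent-tagged ephemeral (e.g. a restore-test guest)
)

// Op is the kind of change being requested.
type Op int

const (
	Create Op = iota
	Start
	Restart
	Destroy
	Overwrite
)

// NeedsOperatorSignature applies the gate: non-destructive ops pass
// unsigned; Destroy passes unsigned only for resources the agent created
// in this transaction or tagged as scratch; everything else needs a
// signature.
func NeedsOperatorSignature(op Op, p Provenance) bool {
	switch op {
	case Create, Start, Restart:
		return false
	case Destroy:
		return p != SameTransaction && p != Scratch
	default:
		return true
	}
}

func main() {
	fmt.Println(NeedsOperatorSignature(Restart, Customer)) // false
	fmt.Println(NeedsOperatorSignature(Destroy, Scratch))  // false
	fmt.Println(NeedsOperatorSignature(Destroy, Customer)) // true
}
```

The default branch fails closed: any operation not explicitly listed as safe requires a signature.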

Signed payloads carry a nonce + expiry (anti-replay: a captured "restore" job cannot be re-injected later) and a target binding (host + guest id) so a signature can't be retargeted. Notification-on-destructive-op is an audit signal, never the guard — a compromised hub could both issue and suppress the notice, which is exactly why the signature (not the notification) is the control.
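A minimal sketch of the verification order this implies, assuming Ed25519 signatures over a JSON payload. The document does not fix the signature scheme or the payload format, so the `Job` fields, the `Verifier` type, and the in-memory nonce set are all illustrative; a real agent would persist seen nonces so replay protection survives a restart.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Job is a hypothetical signed-payload shape.
type Job struct {
	Action  string `json:"action"`
	Host    string `json:"host"`
	Guest   string `json:"guest"`
	Nonce   string `json:"nonce"`
	Expires int64  `json:"expires"` // Unix seconds
}

// Verifier checks jobs against the operator's public key and remembers
// the nonces it has already accepted.
type Verifier struct {
	Key  ed25519.PublicKey
	Host string
	seen map[string]bool
}

// Verify checks the signature first, then target binding, expiry, nonce.
func (v *Verifier) Verify(payload, sig []byte, now time.Time) (*Job, error) {
	if !ed25519.Verify(v.Key, payload, sig) {
		return nil, errors.New("bad signature")
	}
	var j Job
	if err := json.Unmarshal(payload, &j); err != nil {
		return nil, err
	}
	if j.Host != v.Host {
		return nil, errors.New("job is bound to another host")
	}
	if now.Unix() >= j.Expires {
		return nil, errors.New("job expired")
	}
	if v.seen[j.Nonce] {
		return nil, errors.New("nonce replayed")
	}
	if v.seen == nil {
		v.seen = map[string]bool{}
	}
	v.seen[j.Nonce] = true
	return &j, nil
}

// demo signs one job, submits it twice, then submits it to another host.
func demo() (first, replay, wrongHost error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err, err, err
	}
	exp := time.Now().Add(time.Minute).Unix()
	payload, _ := json.Marshal(Job{"restore", "host-1", "guest-101", "n-1", exp})
	sig := ed25519.Sign(priv, payload)
	v := &Verifier{Key: pub, Host: "host-1"}
	_, first = v.Verify(payload, sig, time.Now())
	_, replay = v.Verify(payload, sig, time.Now())
	other := &Verifier{Key: pub, Host: "host-2"}
	_, wrongHost = other.Verify(payload, sig, time.Now())
	return
}

func main() {
	fmt.Println(demo())
}
```

The guest id travels inside the signed bytes, so the executor acts only on `Job.Guest` from a verified payload and never on a guest id supplied alongside it.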

5. Hub ↔ agent protocol (host domain)

Box-initiated poll. The hub never connects inbound. Each poll cycle exchanges:

  • Up: heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, cloudflared health, agent + controller versions, audit-log tail.
  • Down: the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

Dead-man's-switch (essential, not optional). In a box-initiated model the heartbeat is the liveness signal — a box that stops checking in is otherwise invisible. The hub alerts the operator when an agent misses its expected check-in window. This is the worst failure mode for a managed service, so it gets first-class treatment hub-side.
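Hub-side, the check is a comparison of each agent's last heartbeat against its expected window. The sketch below is an assumption about how that could look; the function name, the map shape, and the window value (poll interval plus grace) are illustrative, and the hub's real schema belongs to the hub architecture doc.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Overdue returns the agents whose last heartbeat is older than window.
func Overdue(lastSeen map[string]time.Time, now time.Time, window time.Duration) []string {
	var out []string
	for id, t := range lastSeen {
		if now.Sub(t) > window {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

// demo builds two agents, one fresh and one silent for ten minutes.
func demo() []string {
	now := time.Now()
	lastSeen := map[string]time.Time{
		"host-a": now.Add(-30 * time.Second),
		"host-b": now.Add(-10 * time.Minute),
	}
	return Overdue(lastSeen, now, 2*time.Minute)
}

func main() {
	fmt.Println(demo()) // [host-b]
}
```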

Break-glass. Standing inbound control is off. But when the poll loop itself is wedged (agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit, off-by-default, customer-consented, fully-audited emergency path: SSH to the host via the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed, logged operation; it auto-expires.

6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

  • Transport: HTTPS to the host's bridge IP on a fixed port.
  • Auth: a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and authorizes per guest: a controller can only act on its own guest. This is the agent acting as the per-guest authorization gate from Part 1.
  • Surface (minimal, all scoped to the caller's own guest):
    • GET /storage — mounts available to this guest and their class (fast/slow), so the controller can place hot vs bulk volumes per .felhom.yml. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
    • POST /snapshot — snapshot this guest (the snapshot-before-deploy primitive).
    • POST /rollback — roll this guest back to a named snapshot (post-deploy failure recovery).
    • POST /backup — request a backup-now of this guest (enqueued; non-destructive).
    • GET /backup/due — whether a policy-scheduled backup is due for this guest, so the controller can quiesce then call POST /backup (the app-consistent path, §8).
    • GET /backup/status, GET /restore-test/status — read-only status for the controller's UI.

Note what is absent: nothing here lets a controller touch another guest, the host, storage attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).

A controller can only POST /rollback (or snapshot/backup) its own guest — the agent maps token → guest and authorizes per guest, so a compromised controller's blast radius is self-scoped and bounded to its own guest.
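The self-scoping property comes from one design choice: the guest id is derived from the token and is never read from the request. A minimal in-process sketch, assuming bearer-token auth and an in-memory token map (the header format, token store, and response text are all hypothetical):

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// tokens maps each per-guest token to the guest it was minted for.
type tokens map[string]string

// guestFor resolves a bearer token to its guest, comparing in constant
// time so the lookup does not leak token prefixes through timing.
func (t tokens) guestFor(bearer string) (string, bool) {
	for tok, guest := range t {
		if subtle.ConstantTimeCompare([]byte(tok), []byte(bearer)) == 1 {
			return guest, true
		}
	}
	return "", false
}

// handler serves POST /snapshot for the caller's own guest only.
func handler(t tokens) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/snapshot", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		bearer := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		guest, ok := t.guestFor(bearer)
		if !ok {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		fmt.Fprintf(w, "snapshot queued for %s\n", guest)
	})
	return mux
}

// call issues one in-process request and returns status code and body.
func call(h http.Handler, token string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/snapshot", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := handler(tokens{"tok-a": "guest-101"})
	fmt.Println(call(h, "tok-a"))
	fmt.Println(call(h, "wrong"))
}
```

Since no route accepts a guest id as a parameter, there is no input a compromised controller could alter to reach a sibling guest.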

7. Storage manifest & reconciliation

The manifest is the load-bearing contract. It absorbs the persisted disk-state fields that settings.StoragePath carries today and adds durable_id/UUID — today the controller re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an improvement. Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
| --- | --- |
| type | local-dir / usb / nfs / cifs / pbs |
| durable_id | UUID (USB), server:export (NFS/CIFS), repo+fingerprint (PBS); survives box loss |
| class | fast or slow, set once at attach, with an IOPS marker; no runtime speed-test |
| role | primary / vzdump-target / pbs-offsite / bulk-data |
| creds | encrypted (NFS/CIFS/PBS); USB has none |
| policy | schedule + retention for this target |
| state | attached / disconnected / decommissioned |
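
As a rough illustration, the per-target fields could map onto a Go struct with a validation step. The Go field names, the JSON tags, the opaque `Policy` string, and the specific validation rules are assumptions; only the field set and the allowed values come from the table.

```go
package main

import (
	"errors"
	"fmt"
)

// Target mirrors the manifest fields listed above.
type Target struct {
	Type      string `json:"type"`
	DurableID string `json:"durable_id"`
	Class     string `json:"class"`
	Role      string `json:"role"`
	Creds     []byte `json:"creds,omitempty"` // encrypted blob; empty for USB
	Policy    string `json:"policy"`          // schedule + retention, opaque in this sketch
	State     string `json:"state"`
}

// oneOf reports whether v equals one of the allowed values.
func oneOf(v string, allowed ...string) bool {
	for _, a := range allowed {
		if v == a {
			return true
		}
	}
	return false
}

// Validate rejects values outside the enumerations in the table.
func (t Target) Validate() error {
	switch {
	case !oneOf(t.Type, "local-dir", "usb", "nfs", "cifs", "pbs"):
		return fmt.Errorf("unknown type %q", t.Type)
	case !oneOf(t.Class, "fast", "slow"):
		return fmt.Errorf("unknown class %q", t.Class)
	case !oneOf(t.Role, "primary", "vzdump-target", "pbs-offsite", "bulk-data"):
		return fmt.Errorf("unknown role %q", t.Role)
	case !oneOf(t.State, "attached", "disconnected", "decommissioned"):
		return fmt.Errorf("unknown state %q", t.State)
	case t.Type != "local-dir" && t.DurableID == "":
		return errors.New("durable_id is required for usb/nfs/cifs/pbs")
	case t.Type == "usb" && len(t.Creds) != 0:
		return errors.New("usb targets carry no creds")
	}
	return nil
}

func main() {
	usb := Target{Type: "usb", DurableID: "1234-ABCD", Class: "slow",
		Role: "vzdump-target", State: "attached"}
	fmt.Println(usb.Validate())
	usb.Class = "medium"
	fmt.Println(usb.Validate())
}
```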

Reconciliation: ensure each attached target is mounted (USB-by-UUID via the sudoers allowlist), each Proxmox storage entry matches, and disconnected targets are surfaced to the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).

Placement is per-volume, not per-app. Hot volumes (DB/config) → a fast target, enforced; bulk volumes (media) → may live on slow, declared in .felhom.yml.

A bulk volume MUST be realized as a backup=0 volume mount point (or an external bind mount) — never a Docker named volume in rootfs, which vzdump always captures (verified, phase3-findings.md B2). Proven recipe: attach -mpN <storage>:<size>,mp=/mnt/bulk,backup=0, then docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol> (or a compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The DR consequence of excluding bulk is covered in §8.
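The two placement rules above (hot volumes must land on a fast target; bulk volumes become a backup=0 mount point) can be sketched as follows. The function names and the "hot"/"bulk" kind strings are hypothetical; `BulkMountSpec` only renders the value shown in the recipe and does not invoke any Proxmox tooling.

```go
package main

import (
	"errors"
	"fmt"
)

// CheckPlacement enforces the class rule: a hot volume requires a fast
// target, while a bulk volume may live on either class.
func CheckPlacement(kind, targetClass string) error {
	switch kind {
	case "hot":
		if targetClass != "fast" {
			return errors.New("hot volume requires a fast target")
		}
		return nil
	case "bulk":
		return nil
	}
	return fmt.Errorf("unknown volume kind %q", kind)
}

// BulkMountSpec renders the mount-point value from the recipe:
// <storage>:<size>,mp=<path>,backup=0.
func BulkMountSpec(storage string, sizeGiB int, path string) string {
	return fmt.Sprintf("%s:%d,mp=%s,backup=0", storage, sizeGiB, path)
}

func main() {
	fmt.Println(CheckPlacement("hot", "slow"))
	fmt.Println(CheckPlacement("bulk", "slow"))
	fmt.Println(BulkMountSpec("usb-bulk", 500, "/mnt/bulk"))
}
```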

Field re-homing (from settings.StoragePath, Part 2): Label → manifest (canonical); IsDefault/Schedulable → manifest policy; MigratedTo + decommission → manifest state; StoppedStacks → the controller's settings (app-domain: which apps to restart on reconnect, not a host concern).

8. Backup/restore orchestration

Tiers double as backup and restore-source priority (fastest surviving source first), per Part 1: snapshot (LVM-thin, transient, whole-guest rollback — not a backup) → local second storage (vzdump to dir/NFS/CIFS) → PBS offsite (the DR substrate).
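
Choosing a restore source then amounts to walking the tiers in priority order and taking the first one that survived. A minimal sketch with hypothetical tier names:

```go
package main

import "fmt"

// priority lists the tiers fastest first, following the order above.
var priority = []string{"snapshot", "local", "pbs"}

// PickSource returns the fastest tier marked available, if any.
func PickSource(available map[string]bool) (string, bool) {
	for _, tier := range priority {
		if available[tier] {
			return tier, true
		}
	}
	return "", false
}

func main() {
	// The snapshot is gone (e.g. the guest's disk was lost), so the
	// local vzdump archive wins over PBS.
	fmt.Println(PickSource(map[string]bool{"local": true, "pbs": true}))
}
```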

  • Quiescing (controller-driven for app-consistency): an LXC has no fsfreeze (proxmox-platform.md §4.2), so app-consistency is the controller's job: it learns a backup is due (GET /backup/due, §6, or via its hub channel) → quiesces the app stack → POST /backup → polls GET /backup/status → unquiesces. An agent-initiated vzdump is crash-consistent only (there is no inbound-to-guest channel to trigger a quiesce — §3/§5). Every Proxmox op is async → the agent polls task exitstatus, never trusts the POST return.
  • Bulk volumes have no DR coverage from the guest vzdump — they are excluded (§7). Every bulk volume needs an explicit own-backup decision: its own backup target per the manifest policy, or deliberately none when the data is re-downloadable (customer informed). On host-loss, un-backed-up bulk is gone; a bind-mounted bulk volume re-attaches only on the same host, so cross-host DR needs the separate backup. A deliberate per-volume choice, never a silent loss.
  • Key custody (PBS): the live PBS key sits on the box so the agent can both back up and run restore-tests. The hub holds only the recovery-code-wrapped escrow copy it cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot read the data; the customer's offsite recovery code is the irreducible residual.
  • Self-restore-test: this closes the "tested restore is the critical gap" theme. The agent periodically restores a backup into a throwaway scratch guest, boots it, runs health checks, reports pass/fail, and tears it down. Zero-knowledge backups can only be restore-tested by the box (the operator lacks the key), so this lives in the agent by necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often as the lighter check.
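
The "poll task exitstatus, never trust the POST return" rule from the quiescing bullet can be sketched as a wait loop. Proxmox task status reports a `status` of running or stopped and an `exitstatus` of OK on success; the `TaskStatus` struct, the `poll` callback, and the fake task in `demo` are assumptions so the example runs without a Proxmox host.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// TaskStatus holds the two task-status fields this sketch relies on.
type TaskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // "OK" on success; meaningful once stopped
}

// WaitTask polls until the task stops, then succeeds only on "OK".
func WaitTask(ctx context.Context, poll func() (TaskStatus, error), every time.Duration) error {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		st, err := poll()
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task failed: %s", st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo fakes a task that runs for two polls, then stops with exit.
func demo(exit string) error {
	calls := 0
	poll := func() (TaskStatus, error) {
		calls++
		if calls < 3 {
			return TaskStatus{Status: "running"}, nil
		}
		return TaskStatus{Status: "stopped", ExitStatus: exit}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return WaitTask(ctx, poll, time.Millisecond)
}

func main() {
	fmt.Println(demo("OK"))
	fmt.Println(demo("backup aborted"))
}
```

The context deadline bounds the wait, so a task that never stops surfaces as an error instead of hanging the per-guest queue.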

9. Provisioning & DR flows

Provisioning (reconcile-driven, by restore). Fresh creation of a Docker-capable LXC needs the keyctl=1 feature flag, which Proxmox permits only for root@pam (Phase 3, B3) — not the scoped token. But a token-authorized restore preserves keyctl (Phase 3, B3), so the agent provisions by restoring a golden base image, never by pct create on the per-customer path:

  • A golden base archive — minimal Debian + Docker, nesting=1,keyctl=1, overlayfs — is built once as root@pam at enrollment (when the agent legitimately holds root to mint its Proxmox token) and refreshed on a maintenance cadence. This is the one place keyctl/root provisioning lives — off the per-customer path.
  • To provision guest G: restore the golden archive → new VMID (token-covered: VM.Allocate + Datastore.AllocateSpace; keyctl preserved) → reset identity (MAC/hostname) → size the guest (CPU/mem config + pct resize rootfs, token-covered) → attach storage mounts per the manifest → deploy the controller → hand it bootstrap config. A mid-flight failure is journaled and compensating-rolled-back (destroy the just-restored guest — allowed without a signature per §4, same-transaction provenance).
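
The journaled, compensating flow above can be sketched as an ordered list of steps with optional undo actions. This is an illustration under stated assumptions: the `Step` type and the step names are hypothetical, and the journal here is an in-memory slice, whereas the agent's journal would need to be durable to survive a restart (§10).

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one provisioning stage with an optional compensation.
type Step struct {
	Name string
	Do   func() error
	Undo func() error // nil when the step needs no compensation
}

// Run executes steps in order, appending to journal. On failure it
// undoes completed steps in reverse order and returns the error.
func Run(steps []Step, journal *[]string) error {
	var done []Step
	for _, s := range steps {
		*journal = append(*journal, "begin "+s.Name)
		if err := s.Do(); err != nil {
			*journal = append(*journal, "fail "+s.Name)
			for i := len(done) - 1; i >= 0; i-- {
				if done[i].Undo == nil {
					continue
				}
				*journal = append(*journal, "undo "+done[i].Name)
				if uerr := done[i].Undo(); uerr != nil {
					return fmt.Errorf("%w (rollback of %s failed: %v)", err, done[i].Name, uerr)
				}
			}
			return err
		}
		*journal = append(*journal, "done "+s.Name)
		done = append(done, s)
	}
	return nil
}

// demo fails at storage attach, which destroys the just-restored guest.
func demo() ([]string, error) {
	ok := func() error { return nil }
	steps := []Step{
		{"restore-golden", ok, ok}, // Undo stands in for "destroy the new guest"
		{"reset-identity", ok, nil},
		{"attach-storage", func() error { return errors.New("target disconnected") }, nil},
		{"deploy-controller", ok, nil},
	}
	var journal []string
	err := Run(steps, &journal)
	return journal, err
}

func main() {
	journal, err := demo()
	for _, line := range journal {
		fmt.Println(line)
	}
	fmt.Println("error:", err)
}
```

Only the restore step carries an undo, because destroying the guest created in this transaction discards everything the later steps did to it.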

Unified bring-up primitive. Provisioning and DR-restore share the same token-covered front half — restore an archive → reset identity — and differ only in the archive and the back half: provisioning restores the golden base then deploys a fresh controller; DR-restore restores the customer's backup (already containing controller + data), brings it up, and reattaches external storage. One code path, exercised by every restore-test (§8).

Guest loss. Agent restores G from the fastest surviving tier and resets identity (MAC/hostname) so the restored guest rejoins cleanly — this is the unified restore primitive above (customer-backup archive, DR back half).

Host/hardware loss. Re-enroll the new host in restore mode; the hub — the durable source of truth that survives box death — hands the new agent the existing identity, PBS namespace, tunnel token, storage manifest, and a restore directive. Tunnel is reused from the hub record, so DNS stays intact.

10. Concurrency, crash-safety, idempotency

  • Per-guest serialization. Reconcile, one-shot jobs, and local-API calls all feed a work queue that serializes mutations per guest (Proxmox dislikes concurrent conflicting ops on the same guest). Independent guests proceed in parallel.
  • Operation journaling. Multi-step async ops (provision, restore, controller-update, agent self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
  • Idempotency keys on one-shot jobs (run-once across retries and restarts).
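
Per-guest serialization and run-once keys can be combined in one small dispatcher. The sketch assumes an in-memory `Queue` with one mutex per guest; the type and method names are hypothetical, and a real agent would persist the completed-key set so that run-once holds across restarts.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue serializes work per guest and drops repeated idempotency keys.
type Queue struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
	done  map[string]bool
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{locks: map[string]*sync.Mutex{}, done: map[string]bool{}}
}

// Do runs fn while holding the guest's lock. A non-empty key makes the
// call run-once: a second call with the same key returns false without
// running fn. Reconcile steps can pass an empty key.
func (q *Queue) Do(guest, key string, fn func()) bool {
	q.mu.Lock()
	if key != "" {
		if q.done[key] {
			q.mu.Unlock()
			return false
		}
		q.done[key] = true
	}
	l := q.locks[guest]
	if l == nil {
		l = &sync.Mutex{}
		q.locks[guest] = l
	}
	q.mu.Unlock()

	l.Lock() // other guests use other locks, so they proceed in parallel
	defer l.Unlock()
	fn()
	return true
}

func main() {
	q := NewQueue()
	n := 0
	fmt.Println(q.Do("guest-101", "job-1", func() { n++ })) // true
	fmt.Println(q.Do("guest-101", "job-1", func() { n++ })) // false
	fmt.Println(n)                                          // 1
}
```

Recording the key before running gives at-most-once behaviour, which suits destructive jobs: a crash mid-job is resolved by journal replay rather than by blindly re-running.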

11. Self-update

  • Agent (the hard case: a host service, no snapshot-rollback). A/B layout: download → verify signature → stage as the inactive slot → flip the current symlink (current → good|new) → restart. Revert authority lives outside the swapped binary, because Restart=always alone just crash-loops a bad binary: a separate health-gate (a systemd oneshot ExecStartPost probe, or a tiny supervisor unit) flips current back to last-good and restarts on a failed health window. The new version is committed as "good" only after a clean health window. Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
  • Controller (the easy case — it's a guest). The agent owns the controller's lifecycle, so the agent updates the controller: snapshot-before-update (free rollback, because the controller is a snapshottable guest) → pull new image → redeploy → health-check → rollback on failure. This resolves the Part-2 selfupdate/ open: the controller is agent-managed, not self-updating; the controller's old self-update path is removed.
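
The symlink flip in the agent's A/B layout can be made atomic by creating a temporary symlink and renaming it over the old one. The sketch below assumes a Linux host and hypothetical slot names ("good", "new") inside one directory; in the design above, the revert logic in `Update` would live in the separate health-gate, not in the binary being swapped.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Flip atomically repoints dir/current at slot: it creates a temporary
// symlink, then renames it over the existing link.
func Flip(dir, slot string) error {
	tmp := filepath.Join(dir, "current.tmp")
	if err := os.Remove(tmp); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := os.Symlink(slot, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "current"))
}

// Update flips to the new slot, runs the health probe, and flips back
// to the last-good slot if the probe fails. It returns the active slot.
func Update(dir string, healthy func() bool) (string, error) {
	if err := Flip(dir, "new"); err != nil {
		return "", err
	}
	if !healthy() {
		if err := Flip(dir, "good"); err != nil {
			return "", err
		}
	}
	return os.Readlink(filepath.Join(dir, "current"))
}

// demo runs one update in a scratch directory.
func demo(healthy bool) (string, error) {
	dir, err := os.MkdirTemp("", "agent-slots")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := Flip(dir, "good"); err != nil {
		return "", err
	}
	return Update(dir, func() bool { return healthy })
}

func main() {
	fmt.Println(demo(true))
	fmt.Println(demo(false))
}
```

Because the rename replaces the link in one step, a crash at any point leaves current pointing at either the old slot or the new one, never at nothing.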

12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the operator's public verify key (for §4 signatures — public, low-risk), the Cloudflare tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the live PBS key. The privilege and the secret footprint that left the controller now concentrate here — which is the whole argument for §3's root-minimization and a small, auditable agent.

13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update ownership, the local-API surface, the storage-manifest schema, provision-by-restore, and the root-vs-API boundary (Phase 3, B3).

Still open:

  • Multi-tenant resource fairness on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
  • Operator-side signing tooling — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
  • Hub-side desired-state editing UX and the host-domain report schema details — belong to the hub architecture doc.
  • Golden base image refresh cadence + fleet versioning — who triggers a rebuild, how the per-host image version is tracked (operational detail, not blocking; §9).

This doc hands the implementation three contracts it was waiting on:

  1. the local-API surface (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
  2. the storage-manifest schema (§7) → the settings.StoragePath reshape and per-volume hot/bulk placement (Part 2);
  3. the backup contract (§7/§8) → the destination for the app-data-backup package extracted in the Part-2 refactor.

Changelog — design-review + Phase-3 fold-in (2026-06-08)

  • NEW provision-by-restore (§9): the agent provisions by restoring a golden base image (token-covered, preserves keyctl), never pct create on the per-customer path; one unified restore primitive shared with DR. §2 responsibility + §3 boundary updated.
  • B3 (§2/§3): replaced "Phase-1 minimal role" with the validated FelhomAgent operator role; root-vs-API boundary settled (root only for golden-image build, host mounts, SMART).
  • B1 (§4): reversibility gate rewritten as provenance + data-bearing (scratch tag is agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
  • B2 (§7/§8): validated bulk-as-backup=0-mountpoint recipe + the bulk-DR consequence (excluded bulk needs its own backup decision).
  • S1 (§6/§8): GET /backup/due added; controller-driven quiescing; agent vzdump is crash-consistent only.
  • S2 (§10/§11): A/B self-update with external revert authority; controller-update + agent self-update journaled.
  • S3 (§7): StoragePath field re-homing.
  • S4: geo non-responsibility added (§2).
  • M2 (§7): manifest "absorbs + adds durable_id".
  • §6: rollback is self-scoped/bounded.
  • §13: golden-image refresh cadence added as open.