admin e54f882e70 slice 10A: hub desired-state serving + signed-jobs queue (Down channel) (hub v0.9.0)

Serve operator intent to authenticated hosts: PUT /admin/hosts/{id}/desired-state
(global key) bumps desired_generation; GET /hosts/{id}/desired-state + /jobs are
per-host self-scoped; the host-report envelope now carries the real generation +
has_signed_ops. New signed_jobs table + store methods. Desired-state stored/served
opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state)
byte-identical with felhom-agent; doc 03 §4/§9 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-10 19:03:14 +02:00

Architecture Part 3 — The Host Agent

Status: design draft (decision content). To be grounded by Claude Code against docs/proxmox-platform.md and docs/architecture/02-controller-module-map.md, then placed at docs/architecture/03-host-agent.md.

Builds on Part 1 (01-topology-and-trust.md) and Part 2 (02-controller-module-map.md). Where this doc and the locked decisions disagree, the locked decisions win and this draft is wrong — flag it.

1. Purpose & scope

The host agent is the operator-tier component that runs on each Proxmox host and owns all Proxmox interaction. It is the trusted host actor: it provisions and restores guests, manages host storage, orchestrates backups and restore-tests, watches the host and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest controllers it deploys.

It is the privileged tier. The controller deliberately holds no Proxmox credentials (Part 1) — the privilege the controller shed by losing storage/ did not disappear, it moved here. That makes the agent's hardening and blast-radius discipline the most security-sensitive part of the platform.

The agent manages a set of guests on its host (usually one customer = one guest, but the multi-tenant/company case is not precluded — the agent's data model is per-host, N-guests, never "the guest").

2. Responsibilities (and explicit non-responsibilities)

Owns:

Proxmox lifecycle — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the FelhomAgent operator role — proxmox-platform.md §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
Storage management — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
Backup/restore orchestration — vzdump to the tiers, PBS, snapshot management, and the self-restore-test.
Host & tunnel monitoring — host metrics, guest up/down, storage-target status, and cloudflared health; reports the host domain to the hub.
Provisioning — provision a guest by restoring the golden base image (§9), deploy the controller into it, hand it its bootstrap config; also build and refresh the golden base image itself.
Hub control loop — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
Local API — the per-guest authorization gate the controller calls.
Self-update — update itself (carefully — it is a host service) and update the controllers it owns.

Explicitly does not:

Serve application traffic or sit in the data path. Control plane, not data plane: if the agent dies, apps keep serving (Docker + LXC run without it); only management degrades — no new backups, no provisioning, hub loses the heartbeat.
Hold or proxy customer application data.
Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
Manage geo-restriction / the Cloudflare API. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the hub (holding the CF API token) reconciles the WAF (S4). The agent manages only the tunnel service (cloudflared, §3/§5), never WAF rules.

3. Process model & host integration

Native Go binary, systemd service on the host: boot-start, Restart=always, systemd watchdog (kill+restart on hang), journald logging, resource limits.
Root-minimized (boundary settled in Phase 3 B3). The agent runs as a non-root service user with the scoped FelhomAgent token for all API-covered work, plus a narrow sudoers allowlist for true host ops. The entire per-customer guest lifecycle is token-covered: provision (by restore, §9), config, start/stop, snapshot, backup, restore, destroy. Genuine OS-root is confined to: (1) building/refreshing the golden base image (keyctl create is root@pam-only; one-time at enrollment plus a maintenance cadence, §9); (2) host mounts (USB mount-by-UUID, systemd mount units / fstab); (3) SMART / hardware sensors. Root therefore never sits on the per-customer path. See proxmox-platform.md §3.6 for the role + boundary table.
cloudflared is a separate systemd service, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent manages and health-watches it (see §5) but the tunnel does not live or die with the agent process.

4. Control model — reconcile + signed destructive ops

Two channels, split by reversibility, not by transport.

(a) Desired-state reconciliation — steady state. The hub holds desired state for the host: which guests should exist (and at what spec), the storage manifest, backup/retention policies, controller image versions. The agent runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing, and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries, re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of reconciliation for free.

(b) Signed one-shot jobs — operator actions. Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once (idempotency key), written to the customer-visible audit log, and outside the reconcile loop — they are point-in-time and often destructive, and a reconciler must never re-run a restore because it "sees drift." A one-shot job names a target ("restore guest X from snapshot S"), not a procedure; the agent owns the how.

The reversibility gate (security-critical). "Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied desired state for destructive changes. The gate is by provenance + data-bearing-ness, not by verb:

The reconciler MAY act without an operator signature when: (a) creating/starting/restarting; (b) destroying resources it created earlier within the same journaled transaction (compensating rollback, §10); (c) destroying resources it tagged ephemeral/scratch (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is agent-internal provenance and is never accepted from the hub — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
An operator signature is always required to destroy/overwrite any resource holding the only/primary copy of customer data (live-guest destroy, storage detach/wipe, restore-overwrite, decommission), regardless of whether the request arrives as a job or as a desired-state delta. A compromised hub cannot forge such a signature because the signing key is not held by the hub (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
Data-bearing-ness is agent-internal evidence, never a caller's claim (slice 8C). For a customer-driven storage op (POST /disks/format, §6) the agent inspects the actual device (filesystem signature / partition table / partitions / mount, conservative — ambiguous → data-bearing) to decide the class. A blank device → benign self-serve mkfs; a data-bearing device → ClassStorageWipe → this gate → pending_signature. The destructive completion of a data-bearing wipe is slice 10 (the operator-signed path); 8C refuses it. This mirrors the provenance rule above: just as the scratch tag is agent-internal (never hub-sourced), data-bearing-ness is agent-observed (never controller-asserted) — a compromised controller cannot relabel a data-bearing drive "blank" to walk the gate.
Healing a crashed controller is non-destructive by construction: it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / docker compose up -d inside the existing guest — never a guest destroy. (v0.33 precedent: watchdog.go restarts stopped stacks, it never destroys the guest.)
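The gate rules above can be read as a pure decision function over the requested action and the agent's own provenance record. The following is a minimal sketch under that reading, not the agent's code: every identifier (`Provenance`, `ProvScratch`, `NeedsOperatorSignature`, and so on) is hypothetical, and the provenance value is assumed to come only from agent-internal state.

```go
package main

import "fmt"

// Provenance records how the agent itself came to know a resource.
// Assumption: it is agent-internal state and is never parsed from hub input.
type Provenance int

const (
	ProvUnknown         Provenance = iota // default: treated as data-bearing
	ProvSameTransaction                   // created earlier in the current journaled transaction
	ProvScratch                           // tagged ephemeral/scratch by the agent
)

// Action is the kind of change the reconciler wants to apply.
type Action int

const (
	ActCreate Action = iota
	ActStart
	ActRestart
	ActDestroy
	ActOverwrite
)

// NeedsOperatorSignature reports whether a change must carry an operator
// signature before the agent will execute it.
func NeedsOperatorSignature(a Action, p Provenance) bool {
	switch a {
	case ActCreate, ActStart, ActRestart:
		return false // non-destructive: the reconciler may act freely
	case ActDestroy:
		// Unsigned destroy only for resources the agent can prove are disposable.
		return !(p == ProvSameTransaction || p == ProvScratch)
	default:
		return true // overwrite and anything unrecognized: fail closed
	}
}

func main() {
	fmt.Println(NeedsOperatorSignature(ActRestart, ProvUnknown))
	fmt.Println(NeedsOperatorSignature(ActDestroy, ProvScratch))
	fmt.Println(NeedsOperatorSignature(ActDestroy, ProvUnknown))
}
```

The default branch fails closed, which matches the intent that an unknown verb or unknown provenance is never a way around the gate.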

Signed payloads carry a nonce + expiry (anti-replay: a captured "restore" job cannot be re-injected later) and a target binding (host + guest id) so a signature can't be retargeted. Notification-on-destructive-op is an audit signal, never the guard — a compromised hub could both issue and suppress the notice, which is exactly why the signature (not the notification) is the control.
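As an illustration of the nonce + expiry + target-binding checks, here is a hedged sketch of agent-side verification. The signature scheme (Ed25519), the JSON encoding, and every field and type name are assumptions made for the example; the source only states that the payload carries a nonce, an expiry and a host + guest binding, and that the signing key is not held by the hub.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// JobPayload is the signed body. Field names are illustrative assumptions.
type JobPayload struct {
	Action  string    `json:"action"`
	HostID  string    `json:"host_id"`
	GuestID int       `json:"guest_id"`
	Nonce   string    `json:"nonce"`
	Expires time.Time `json:"expires"`
}

// Verifier holds the operator public key and this agent's host id.
type Verifier struct {
	OperatorKey ed25519.PublicKey
	HostID      string
	seen        map[string]bool // in-memory here; a real agent would persist used nonces
}

// Verify checks signature, target binding, expiry and replay, in that order.
// The guest id is only ever read from the signed bytes.
func (v *Verifier) Verify(raw, sig []byte, now time.Time) (*JobPayload, error) {
	if !ed25519.Verify(v.OperatorKey, raw, sig) {
		return nil, errors.New("bad operator signature")
	}
	var p JobPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	if p.HostID != v.HostID {
		return nil, errors.New("job is bound to a different host")
	}
	if !now.Before(p.Expires) {
		return nil, errors.New("job expired")
	}
	if v.seen[p.Nonce] {
		return nil, errors.New("nonce already used")
	}
	v.seen[p.Nonce] = true
	return &p, nil
}

// demo signs one job and submits it twice; the second attempt is a replay.
func demo() (first, second error) {
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		return err, err
	}
	now := time.Now()
	raw, _ := json.Marshal(JobPayload{
		Action: "restore", HostID: "host-1", GuestID: 101,
		Nonce: "n-1", Expires: now.Add(time.Minute),
	})
	sig := ed25519.Sign(priv, raw)
	v := &Verifier{OperatorKey: pub, HostID: "host-1", seen: map[string]bool{}}
	_, first = v.Verify(raw, sig, now)
	_, second = v.Verify(raw, sig, now)
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first:", first)
	fmt.Println("second:", second)
}
```

Because the hub only relays the raw bytes and the signature, it can neither change the target nor extend the expiry without invalidating the signature.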

5. Hub ↔ agent protocol (host domain)

Box-initiated poll. The hub never connects inbound. Each poll cycle exchanges:

Up: heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, cloudflared health, agent + controller versions, audit-log tail.
Down: the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

Dead-man's-switch (essential, not optional). In a box-initiated model the heartbeat is the liveness signal — a box that stops checking in is otherwise invisible. The hub alerts the operator when an agent misses its expected check-in window. This is the worst failure mode for a managed service, so it gets first-class treatment hub-side.
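A hub-side liveness check of this kind could look like the sketch below. The `Liveness` type, the grace factor of three missed polls, and the host ids are all hypothetical; the source only requires that the hub alerts when an agent misses its expected check-in window.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Liveness is what the hub remembers about one agent's check-ins.
type Liveness struct {
	LastSeen     time.Time
	PollInterval time.Duration
}

// Overdue returns the ids of hosts whose silence exceeds `grace` poll windows.
func Overdue(hosts map[string]Liveness, now time.Time, grace int) []string {
	var out []string
	for id, h := range hosts {
		if now.Sub(h.LastSeen) > time.Duration(grace)*h.PollInterval {
			out = append(out, id)
		}
	}
	sort.Strings(out) // stable order for alerting and tests
	return out
}

// demo builds two hosts: one that checked in recently and one long silent.
func demo() []string {
	now := time.Now()
	hosts := map[string]Liveness{
		"host-a": {LastSeen: now.Add(-2 * time.Minute), PollInterval: 5 * time.Minute},
		"host-b": {LastSeen: now.Add(-40 * time.Minute), PollInterval: 5 * time.Minute},
	}
	return Overdue(hosts, now, 3)
}

func main() {
	fmt.Println(demo())
}
```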

Break-glass. Standing inbound control is off. But when the poll loop itself is wedged (agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit, off-by-default, customer-consented, fully-audited emergency path: SSH to the host via the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed, logged operation; it auto-expires.

6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

Transport: HTTPS to the host's bridge IP on a fixed port.
Auth: a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and authorizes per guest: a controller can only act on its own guest. This is the agent acting as the per-guest authorization gate from Part 1.
Surface (minimal, all scoped to the caller's own guest):
- GET /storage — mounts available to this guest and their class (fast/slow), so the controller can place hot vs bulk volumes per .felhom.yml. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
- POST /snapshot — snapshot this guest (the snapshot-before-deploy primitive).
- POST /rollback — roll this guest back to a named snapshot (post-deploy failure recovery).
- POST /backup — request a backup-now of this guest (enqueued; non-destructive).
- GET /backup/due — whether a policy-scheduled backup is due for this guest, so the controller can quiesce then call POST /backup (the app-consistent path, §8).
- GET /backup/status, GET /restore-test/status — read-only status for the controller's UI.
- Host metrics (slice 9): GET /host/metrics — host-wide health for the customer's monitoring view: cpu%/mem/load/uptime, CPU/chassis temperature (cpu_temp_c, nullable — "n/a" when the hardware exposes no sensor), and per-storage capacity (total/used/fraction, thin-pool fill, disk SMART temp+wear). It reuses the slice-4 collector (no duplicate collection) and serves a fresh collect (current cpu%/temp, not the 15-min hub snapshot). Unlike the rest of the surface this is host-wide, not per-guest (the box, not the caller's guest) — correct for "see my box's health" — but still token-authed via the per-guest token. Assumption: one customer per host (the home-server model); if a host ever served multiple customers, host-wide CPU/mem would leak cross-customer load → revisit then. The de-privileged controller (slice 8C) sees only its own cgroup, so it cannot read host health itself; this re-serves the agent's existing host + storage observation to the customer. Status: implemented (agent v0.14.0 internal/localapi + internal/hub/cputemp.go; controller v0.39.0 internal/web/agent_host_metrics_handler.go + the monitoring page's host-health card).
- Disk management (slice 8C): GET /disks (host drives + a data-bearing flag), POST /disks/assign (attach a drive as a mount — benign, additive, self-serve), POST /disks/eject (safe-unmount, data preserved, returns the dependent guests so the controller warns which apps lose that storage — benign), POST /disks/format (see the reframed principle below). The controller is Docker-only (de-privileged, slice 8C); execution is the agent's.

The principle (reframed for 8C): a controller may do non-data-destructive storage setup self-serve (list, assign, eject, format a blank drive); anything that can lose customer data stays operator-signed (§4). The enforcer is the classifier: for POST /disks/format the agent inspects the actual device itself (filesystem signature / partition table / partitions / mount — agent-internal evidence, NEVER the caller's claim) and classifies conservatively (ambiguous → data-bearing). A blank device → benign → mkfs. A data-bearing device → ClassStorageWipe → destructive → the §4 gate → refused pending_signature (the operator-signed completion is slice 10). So a compromised controller asserting "this drive is blank" cannot wipe a data-bearing drive — the 8C analog of self-scoping. Status: implemented (agent v0.12.0 internal/localapi + internal/storage; controller v0.37.0 internal/web/agent_disk_handlers.go).
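The conservative classification described above can be sketched as follows. Only `ClassStorageWipe` and the `pending_signature` result are names from this document; `DeviceEvidence`, `ClassBenignFormat`, `Classify` and `HandleFormat` are hypothetical, and the evidence fields are a simplified stand-in for whatever the agent actually probes.

```go
package main

import "fmt"

// DeviceEvidence is what the agent observed on the device itself.
// Field names are assumptions made for this sketch.
type DeviceEvidence struct {
	ProbeOK           bool   // false when probing failed or was incomplete
	FSSignature       string // e.g. "ext4"; empty when none was detected
	HasPartitionTable bool
	PartitionCount    int
	Mounted           bool
}

type Class int

const (
	ClassBenignFormat Class = iota // blank device: self-serve mkfs is allowed
	ClassStorageWipe               // data-bearing: operator signature required
)

// Classify treats any sign of data, and any ambiguity, as data-bearing.
func Classify(e DeviceEvidence) Class {
	if !e.ProbeOK {
		return ClassStorageWipe // ambiguous → data-bearing
	}
	if e.FSSignature != "" || e.HasPartitionTable || e.PartitionCount > 0 || e.Mounted {
		return ClassStorageWipe
	}
	return ClassBenignFormat
}

// HandleFormat decides the outcome of a format request. The caller's claim
// is accepted as a parameter only to show that it is deliberately ignored.
func HandleFormat(e DeviceEvidence, callerSaysBlank bool) string {
	_ = callerSaysBlank
	if Classify(e) == ClassStorageWipe {
		return "pending_signature"
	}
	return "mkfs"
}

func main() {
	blank := DeviceEvidence{ProbeOK: true}
	used := DeviceEvidence{ProbeOK: true, FSSignature: "ext4"}
	fmt.Println(HandleFormat(blank, true))
	fmt.Println(HandleFormat(used, true))
	fmt.Println(HandleFormat(DeviceEvidence{}, true))
}
```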

Note what is absent: nothing here lets a controller touch another guest, reach the host beyond this narrow disk surface, or restore-overwrite. Within the disk surface, data-destructive power stays operator-signed (§4), as does every other destructive or cross-guest operation.

A controller can only POST /rollback (or snapshot/backup) its own guest — the agent maps token → guest and authorizes per guest, so a compromised controller's blast radius is self-scoped and bounded to its own guest.
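A minimal sketch of that self-scoping check follows. It stores only a SHA-256 hash of each token, as the slice 8A implementation notes describe, and returns 401 or 403 in the cases those notes name. The `TokenStore` type and its methods are hypothetical names for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
)

// TokenStore maps sha256(token) → VMID. The plaintext token is never stored.
type TokenStore struct {
	byHash map[string]int
}

func NewTokenStore() *TokenStore {
	return &TokenStore{byHash: map[string]int{}}
}

func hashToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// Register records a freshly minted token for a guest (last write wins).
func (s *TokenStore) Register(token string, vmid int) {
	s.byHash[hashToken(token)] = vmid
}

// Authorize resolves the guest from the token alone. A caller-supplied
// VMID (0 = not supplied) is only checked for disagreement, never trusted.
func (s *TokenStore) Authorize(token string, claimedVMID int) (int, int) {
	vmid, ok := s.byHash[hashToken(token)]
	if !ok {
		return 0, http.StatusUnauthorized
	}
	if claimedVMID != 0 && claimedVMID != vmid {
		return 0, http.StatusForbidden
	}
	return vmid, http.StatusOK
}

func main() {
	s := NewTokenStore()
	s.Register("token-for-guest-101", 101)
	fmt.Println(s.Authorize("token-for-guest-101", 0))
	fmt.Println(s.Authorize("token-for-guest-101", 102))
	fmt.Println(s.Authorize("unknown-token", 101))
}
```

The returned VMID is the only value a handler would pass on to Proxmox, so a disagreeing request never reaches the other guest.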

6a. Implementation (slice 8A — implemented)

Status: implemented (agent v0.10.0 internal/localapi; controller v0.35.0 internal/bootstrap + internal/agentapi). Grounded by documentation/tests/slice8a-channel-deploy-spike-findings.md (commit 4a81a96). The 7 endpoints above are live; GET /backup/due is thin in 8A (the quiesce-on-due consumer is 8B), and the rest wrap the existing slice-5/6/7 machinery.

Transport / pin. The agent serves a persisted self-signed leaf bound to the host bridge IP on a fixed port (default :8443). The controller pins the leaf-cert SHA-256 (decision: consistency with the agent's Proxmox/PBS cert pinning), carried in its bootstrap. The leaf is generated once and persisted, so its fingerprint is stable across agent restarts (a fresh cert each boot would invalidate every already-issued bootstrap pin). Defense-in-depth: the listener binds the bridge IP (not 0.0.0.0) and a host firewall rule narrows the port to the guest bridge subnet (configs/felhom-localapi-firewall.example) — the per-guest token stays the gate.
Token custody. The per-guest token is minted by the back-half (§9), persisted as a SHA-256 hash only (the plaintext exists transiently at mint→write-to-mount, then is discarded), in a durable last-write-wins map. Self-scoping is enforced by the token→guest map alone: the VMID is resolved from the token, never from a caller-supplied id; an explicit vmid that disagrees is refused (403) and the Proxmox op is never issued for the other guest. Absent/unknown token → 401.
The bootstrap contract (c). The agent emits a stable bootstrap.json (schema: felhom.bootstrap/v1: customer identity, hub, and the local-API {endpoint, fingerprint, token}) into a read-only config mount; the controller ingests it on first run and seeds its own controller.yaml, skipping setup mode (idempotent — never clobbers an existing config; fail-safe — a malformed/absent bootstrap stays in setup). The agent emits the contract; the controller owns the translation — they stay decoupled (no shared config schema). No registry credential ever enters a guest: the controller image is baked into the golden (§9), so deploy does no docker login/pull.
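On the controller side, pinning a self-signed leaf by its SHA-256 fingerprint can be done with Go's standard `crypto/tls` hooks. The sketch below is one way to do it, not the controller's actual code: the function names are hypothetical, and it assumes the fingerprint is taken over the DER bytes of the leaf certificate.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
)

// verifyLeaf compares SHA-256(leaf DER) against the pinned fingerprint.
func verifyLeaf(leafDER, pin []byte) error {
	sum := sha256.Sum256(leafDER)
	if subtle.ConstantTimeCompare(sum[:], pin) != 1 {
		return errors.New("leaf certificate does not match the pinned fingerprint")
	}
	return nil
}

// PinnedTLSConfig returns a client config that accepts exactly one leaf.
func PinnedTLSConfig(pinHex string) (*tls.Config, error) {
	pin, err := hex.DecodeString(pinHex)
	if err != nil || len(pin) != sha256.Size {
		return nil, errors.New("pin must be a hex-encoded SHA-256 digest")
	}
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// The leaf is self-signed, so CA chain validation is switched off
		// and replaced by the pin check in VerifyConnection.
		InsecureSkipVerify: true,
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("no peer certificate presented")
			}
			return verifyLeaf(cs.PeerCertificates[0].Raw, pin)
		},
	}, nil
}

// demo checks a matching and a non-matching certificate body.
func demo() (match, mismatch error) {
	der := []byte("stand-in for the leaf certificate DER bytes")
	sum := sha256.Sum256(der)
	return verifyLeaf(der, sum[:]), verifyLeaf([]byte("some other cert"), sum[:])
}

func main() {
	match, mismatch := demo()
	fmt.Println(match, mismatch)
	_, err := PinnedTLSConfig("not-hex")
	fmt.Println(err)
}
```

`VerifyConnection` is used rather than `VerifyPeerCertificate` because it also runs on resumed sessions, so the pin is checked on every connection.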

7. Storage manifest & reconciliation

The manifest is the load-bearing contract. It absorbs the persisted disk-state fields that settings.StoragePath carries today and adds durable_id/UUID — today the controller re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an improvement. Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
| --- | --- |
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS); survives box loss |
| `class` | `fast` or `slow`, set once at attach, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |
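One possible Go shape for a manifest target, with a validator for the enumerated fields, is sketched below. The JSON keys follow the table; the Go type, the opaque `Policy` field, the omission of the IOPS marker, and the rule that only non-`local-dir` targets need a `durable_id` are assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Target is one storage-manifest entry.
type Target struct {
	Type      string          `json:"type"`
	DurableID string          `json:"durable_id"`
	Class     string          `json:"class"`
	Role      string          `json:"role"`
	Creds     string          `json:"creds,omitempty"`  // encrypted blob; empty for USB
	Policy    json.RawMessage `json:"policy,omitempty"` // schedule + retention, left opaque here
	State     string          `json:"state"`
}

// allowed lists the values the table permits for each enumerated field.
var allowed = map[string][]string{
	"type":  {"local-dir", "usb", "nfs", "cifs", "pbs"},
	"class": {"fast", "slow"},
	"role":  {"primary", "vzdump-target", "pbs-offsite", "bulk-data"},
	"state": {"attached", "disconnected", "decommissioned"},
}

func oneOf(v string, set []string) bool {
	for _, s := range set {
		if v == s {
			return true
		}
	}
	return false
}

// Validate rejects values outside the table and USB targets that carry creds.
func (t Target) Validate() error {
	fields := map[string]string{"type": t.Type, "class": t.Class, "role": t.Role, "state": t.State}
	for _, name := range []string{"type", "class", "role", "state"} {
		if !oneOf(fields[name], allowed[name]) {
			return fmt.Errorf("invalid %s %q", name, fields[name])
		}
	}
	// Assumption: local-dir has no durable id; every other type needs one.
	if t.Type != "local-dir" && t.DurableID == "" {
		return fmt.Errorf("durable_id is required for type %q", t.Type)
	}
	if t.Type == "usb" && t.Creds != "" {
		return fmt.Errorf("usb targets carry no creds")
	}
	return nil
}

func main() {
	raw := `{"type":"usb","durable_id":"example-uuid","class":"slow","role":"vzdump-target","state":"attached"}`
	var t Target
	if err := json.Unmarshal([]byte(raw), &t); err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println(t.Validate())
}
```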

Reconciliation: ensure each attached target is mounted (USB-by-UUID via the sudoers allowlist), each Proxmox storage entry matches, and disconnected targets are surfaced to the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).

Placement is per-volume, not per-app. Hot volumes (DB/config) → a fast target, enforced; bulk volumes (media) → may live on slow, declared in .felhom.yml.

A bulk volume MUST be realized as a backup=0 volume mount point (or an external bind mount) — never a Docker named volume in rootfs, which vzdump always captures (verified, phase3-findings.md B2). Proven recipe: attach -mpN <storage>:<size>,mp=/mnt/bulk,backup=0, then docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol> (or a compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The DR consequence of excluding bulk is covered in §8.

Field re-homing (from settings.StoragePath, Part 2): Label → manifest (canonical); IsDefault/Schedulable → manifest policy; MigratedTo + decommission → manifest state; StoppedStacks → the controller's settings (app-domain: which apps to restart on reconnect, not a host concern).

8. Backup/restore orchestration

Tiers double as backup and restore-source priority (fastest surviving source first), per Part 1: snapshot (LVM-thin, transient, whole-guest rollback — not a backup) → local second storage (vzdump to dir/NFS/CIFS) → PBS offsite (the DR substrate).

Quiescing (controller-driven for app-consistency) — implemented (slice 8B): an LXC has no fsfreeze (proxmox-platform.md §4.2), so app-consistency is the controller's job: it learns a backup is due (GET /backup/due, §6) → quiesces (stops its app stacks) → POST /backup → polls GET /backup/status to done → unquiesces (restarts exactly the stacks it stopped). Implemented in felhom-controller v0.36.0 (internal/quiesce) + felhom-agent v0.11.0 (the /backup/due cadence policy + /backup/status phases). An agent-initiated vzdump is crash-consistent only (there is no inbound-to-guest channel to trigger a quiesce — §3/§5); the controller stopping its stacks first is what makes the captured state clean-shutdown-consistent (validated live: a quiesced postgres restore comes up clean — "database system was shut down" — vs a crash-consistent restore doing WAL recovery — "redo starts… redo done"). Every Proxmox op is async → the agent polls task exitstatus, never trusts the POST return.
- Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup): a persisted marker written before stopping anything; guaranteed unquiesce (restart on a backup error, a status-poll error, the max-quiesce bound, or controller shutdown); a max-quiesce-duration hard bound (restart the app no matter what — the backup finishes on the agent); and crash recovery at controller startup (restart stacks left stopped by a mid-quiesce crash). The marker also single-flights the loop. All proven live + unit-tested.
- 8B.2 downtime optimization — implemented (agent v0.13.0 + controller v0.38.0): in snapshot mode, vzdump only needs the app-stopped state captured at the storage-snapshot moment; after that it reads from the snapshot. The agent watches the vzdump task log for the snapshot marker (create storage snapshot, validated on PVE 9.2.2) and emits a snapshotted phase on /backup/status; the controller resumes its app at snapshotted (not done), cutting app downtime from whole-backup to until-snapshot (~24s→~1s for a 934 MB guest) with no loss of app-consistency (the snapshot froze the app-stopped state). Depends on snapshot-capable storage (lvm-thin/ZFS); on stop/downgraded storage the marker never appears and the controller falls back to resume-at-done (8B). The controller keeps tracking to done/failed after early resume (no overlapping backup; the backup isn't "successful" until done).
Bulk volumes have no DR coverage from the guest vzdump — they are excluded (§7). Every bulk volume needs an explicit own-backup decision: its own backup target per the manifest policy, or deliberately none when the data is re-downloadable (customer informed). On host-loss, un-backed-up bulk is gone; a bind-mounted bulk volume re-attaches only on the same host, so cross-host DR needs the separate backup. A deliberate per-volume choice, never a silent loss.
Key custody (PBS): the live PBS key sits on the box so the agent can both back up and run restore-tests. The hub holds only the recovery-code-wrapped escrow copy it cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot read the data; the customer's offsite recovery code is the irreducible residual.
Self-restore-test: this closes the "tested restore is the critical gap" theme. The agent periodically restores a backup into a throwaway scratch guest, boots it, runs health checks, reports pass/fail, and tears it down. Zero-knowledge backups can only be restore-tested by the box (the operator lacks the key), so this lives in the agent by necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often as the lighter check.
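The controller-side quiesce flow described earlier in this section (marker first, guaranteed unquiesce, max-quiesce bound, early resume at `snapshotted`) can be sketched as below. The interfaces, the `running` phase name and the in-memory fake are hypothetical; the phases `snapshotted`, `done` and `failed` are the ones this document names. Crash recovery at controller startup is out of scope for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// Agent is the part of the local API this flow needs (hypothetical shape).
type Agent interface {
	BackupDue() (bool, error)
	StartBackup() error
	BackupStatus() (string, error)
}

// Stacks controls the app stacks inside the guest.
type Stacks interface {
	Running() []string
	Stop(names []string) error
	Start(names []string) error
}

// Marker is the persisted record of what was stopped.
type Marker interface {
	Write(stopped []string) error
	Clear() error
}

// QuiesceAndBackup stops the stacks, requests a backup, and restarts exactly
// the stacks it stopped on snapshot, completion, error, or the time bound.
func QuiesceAndBackup(a Agent, s Stacks, m Marker, maxQuiesce, poll time.Duration) error {
	due, err := a.BackupDue()
	if err != nil || !due {
		return err
	}
	names := s.Running()
	if err := m.Write(names); err != nil { // marker before stopping anything
		return err
	}
	resumed := false
	resume := func() {
		if resumed {
			return
		}
		resumed = true
		if s.Start(names) == nil {
			_ = m.Clear() // marker stays if the restart failed
		}
	}
	defer resume() // guaranteed unquiesce on every return path
	if err := s.Stop(names); err != nil {
		return err
	}
	if err := a.StartBackup(); err != nil {
		return err
	}
	deadline := time.Now().Add(maxQuiesce)
	for {
		phase, err := a.BackupStatus()
		if err != nil {
			return err
		}
		switch phase {
		case "snapshotted":
			resume() // early resume; keep tracking to done/failed
		case "done":
			return nil
		case "failed":
			return errors.New("backup failed")
		}
		if time.Now().After(deadline) {
			resume() // hard bound: restart the app no matter what
		}
		time.Sleep(poll)
	}
}

// fake implements all three interfaces and records the call order.
type fake struct{ phases, log []string }

func (f *fake) BackupDue() (bool, error)      { return true, nil }
func (f *fake) StartBackup() error            { f.log = append(f.log, "backup"); return nil }
func (f *fake) BackupStatus() (string, error) { p := f.phases[0]; f.phases = f.phases[1:]; return p, nil }
func (f *fake) Running() []string             { return []string{"app"} }
func (f *fake) Stop([]string) error           { f.log = append(f.log, "stop"); return nil }
func (f *fake) Start([]string) error          { f.log = append(f.log, "start"); return nil }
func (f *fake) Write([]string) error          { f.log = append(f.log, "marker"); return nil }
func (f *fake) Clear() error                  { f.log = append(f.log, "clear"); return nil }

func demo(phases ...string) (string, error) {
	f := &fake{phases: phases}
	err := QuiesceAndBackup(f, f, f, time.Minute, 0)
	return strings.Join(f.log, ","), err
}

func main() {
	fmt.Println(demo("running", "snapshotted", "running", "done"))
}
```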

8a. PBS recovery-code escrow + the key-custody posture model (zero-knowledge offsite-key recovery)

The DR substrate is the PBS offsite tier, client-side encrypted (zero-knowledge): if the box dies, restoring the offsite backups requires the PBS client encryption key K, which died with the box. The escrow is how K comes back without Felhom ever being able to read customer data. Status: implemented — escrow creation (agent v0.9.0, internal/escrow) + hub opaque storage (hub v0.8.0, PUT /api/v1/hosts/{host_id}/escrow). Validated end-to-end on a throwaway in documentation/tests/slice7-escrow-spike-findings.md. Restore-mode serving/consumption is slice 10.

The separation principle (the rule that governs every posture)

Reading customer data needs BOTH the encrypted chunks AND a usable key. Zero-knowledge holds for exactly as long as Felhom never holds both at once. Every posture below is just a choice about where the data and the key live; the principle decides who can read.

Topology matrix (data location × key custody → who can read)

| Data location | Key custody | Who can read | Notes |
| --- | --- | --- | --- |
| Felhom storage | customer-only key | only the customer | the DEFAULT: genuine zero-knowledge |
| Felhom storage | Felhom also holds a key | Felhom can read | the one dangerous cell: explicit, informed opt-in only; never default, never silent |
| Customer's own offsite | customer key | only the customer | self-hosted data; key XOR data |
| Customer's own offsite | Felhom holds a key | only the customer | safe by separation (key and data never co-located at Felhom) |
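The matrix reduces to a single conjunction, which the following sketch makes explicit. The `Posture` type and `FelhomCanRead` function are hypothetical names used only to restate the separation principle in code.

```go
package main

import "fmt"

// Posture describes where the data and a usable key live.
type Posture struct {
	DataAtFelhom   bool // encrypted chunks stored on Felhom storage
	FelhomHoldsKey bool // Felhom holds a usable decryption key
}

// FelhomCanRead is true only when Felhom holds both the data and a key.
func FelhomCanRead(p Posture) bool {
	return p.DataAtFelhom && p.FelhomHoldsKey
}

func main() {
	rows := []Posture{
		{DataAtFelhom: true, FelhomHoldsKey: false},
		{DataAtFelhom: true, FelhomHoldsKey: true},
		{DataAtFelhom: false, FelhomHoldsKey: false},
		{DataAtFelhom: false, FelhomHoldsKey: true},
	}
	for _, p := range rows {
		fmt.Printf("%+v -> Felhom can read: %v\n", p, FelhomCanRead(p))
	}
}
```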

The escrow mechanism (decisions + the rationale that pins them)

Live key unencrypted on the box (0600, root): the agent backs up and runs restore-tests unattended — no passphrase prompt on the management path. The privilege concentration this implies is the whole argument for §3 root-minimization + a small auditable agent.
Wrap — PBS-native, not custom crypto. At enrollment the agent generates a high-entropy recovery code R and produces a passphrase-protected copy of K under R via PBS's own key passphrase KDF (proxmox-backup-client key change-passphrase --kdf scrypt; no bespoke AEAD). The spike pinned two implementation constraints: that command is TTY-only (drive it over a pty), and the pty echoes the passphrase (discard the pty output so R can't leak) — F-A1/F-A2.
Agent-side generation. R is generated on the box (it already holds K and does the wrapping), so R never touches the hub even in transit — zero-knowledge by construction. R is ≥128 bits, word-list form (EFF large wordlist, 10 words ≈ 129 bits) for off-paper transcription.
Self-verify before shipping. Creation unwraps a copy of the blob with R and checks the key fingerprint matches — "an escrow you haven't recovered isn't an escrow."
Escrow = the R-wrapped blob → hub (opaque storage, slice 7). The hub stores the ciphertext bytes against the host record and never decrypts them (it has no R; there is no decrypt path). Per-host-key authed; rotation is last-write-wins. Restore-mode serving is slice 10.
Recovery code custody. R is surfaced to the customer exactly once at enrollment (printed/displayed) and never stored by Felhom in any recoverable form.
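The "10 words ≈ 129 bits" figure follows from the size of the EFF large wordlist (7776 words, so log2(7776) ≈ 12.92 bits per word). The sketch below checks that arithmetic and shows one way to draw words uniformly with `crypto/rand`. The function names and the hyphen separator are assumptions, and the tiny list in `main` is a stand-in for the real wordlist.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math"
	"math/big"
	"strings"
)

// EntropyBits is the entropy of `words` independent uniform picks from a
// list of `listSize` entries.
func EntropyBits(listSize, words int) float64 {
	return float64(words) * math.Log2(float64(listSize))
}

// RecoveryCode draws `words` entries uniformly at random from wordlist.
func RecoveryCode(wordlist []string, words int) (string, error) {
	if len(wordlist) == 0 {
		return "", fmt.Errorf("empty wordlist")
	}
	limit := big.NewInt(int64(len(wordlist)))
	picked := make([]string, words)
	for i := range picked {
		n, err := rand.Int(rand.Reader, limit) // uniform in [0, len(wordlist))
		if err != nil {
			return "", err
		}
		picked[i] = wordlist[n.Int64()]
	}
	return strings.Join(picked, "-"), nil
}

func main() {
	fmt.Printf("9 words:  %.1f bits\n", EntropyBits(7776, 9))
	fmt.Printf("10 words: %.1f bits\n", EntropyBits(7776, 10))

	// Stand-in list; the real code would load the 7776-word EFF list.
	code, err := RecoveryCode([]string{"alpha", "bravo", "charlie", "delta"}, 10)
	fmt.Println(code, err)
}
```

Nine words fall short of the 128-bit floor, which is why ten is the smallest count that satisfies it with this list.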

Default posture + the anti-lockout ladder (opt-in, increasing trust)

Default: Felhom storage + customer-only key, and R is delivered durably (printed) always — note this is distinct from a raw-key paperkey: R is a safe two-factor passphrase (useless without the hub's blob); the raw key is the footgun. The ladder trades resilience for trust:

(b) R-wrapped offline copy — the same two-factor blob, for the customer to print/store. No extra trust; resilience if the hub ever vanishes (still needs R). Implemented (opt-in).
(a) raw paperkey — proxmox-backup-client key paperkey of the unwrapped key, for a safe. Covers losing R, but it is single-factor and unrevocable. Implemented (opt-in, loud caveat).
Felhom-holds-a-key — maximum convenience, but gives up zero-knowledge (the dangerous matrix cell). Not implemented — it needs a separate Felhom-side secure key store + explicit opt-in UX, built only when a customer asks.

SSH-for-support is a SEPARATE grant — deliberately not coupled to key custody

Support access (active / consented / observable — customer-toggleable, commands shown) is not the same as a standing / passive / invisible decryption capability. The transparency features prove controlled support access without Felhom holding a key. Conflating the two is exactly the mistake the separation principle prevents.

Why zero-knowledge stays the default (breach + legal)

Holding data and a key makes a single hub breach an all-customer data leak, and makes Felhom compellable — a court can order what Felhom can produce. Genuine zero-knowledge means "we can't be forced to hand over what we can't read." This is core to the sovereignty pitch, not a nicety.

Honesty properties (stated to the customer at enrollment)

Irreducible residual: losing R and the box (and, if not opted in, having no paperkey) = the offsite backups are unrecoverable, by anyone, including Felhom. The cost of genuine zero-knowledge — communicated, not buried.
Rotation ≠ key rotation: rotating R re-wraps the escrow blob (and re-shows a new code) but does not re-encrypt existing PBS data — that stays keyed by K. Changing K itself is a separate, heavier op (new key → new backups; old backups still need old K), out of scope for routine recovery-code rotation.
Integrity caveat (self-hosted-data postures): moving data to the customer's own offsite loses Felhom's backup guarantees — no PBS verify / monitoring on storage we can't reach. An honest signup-time tradeoff, not a hidden one.

9. Provisioning & DR flows

Provisioning (reconcile-driven, by restore). Fresh creation of a Docker-capable LXC needs the keyctl=1 feature flag, which Proxmox permits only for root@pam (Phase 3, B3) — not the scoped token. But a token-authorized restore preserves keyctl (Phase 3, B3, empirically: a token vzrestore of a keyctl archive produced a guest that kept features: nesting=1,keyctl=1,unprivileged:1), so the agent provisions by restoring a golden base image, never by pct create on the per-customer path.

Golden base image. A golden base archive — minimal Debian + Docker, nesting=1,keyctl=1, overlayfs — is built once as root@pam at enrollment (when the agent legitimately holds root to mint its Proxmox token) and refreshed on a maintenance cadence. This is the one place keyctl/root provisioning lives — off the per-customer path. Refresh cadence + fleet versioning remain an operational open item (§13).

Unified bring-up primitive (shared front half — NOT shared identity policy). Provisioning and DR-restore share one token-covered front-half code path:

restore an archive → reset identity → size the guest (CPU/mem config + pct resize rootfs, token-covered) → attach storage mounts per the manifest

run as a journaled reconcile job; a mid-flight failure is compensating-rolled-back (destroy the just-restored guest — allowed unsigned per §4, same-transaction provenance). They diverge in the archive and the back half, and in identity policy (below).
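A journaled job with compensating rollback can be sketched as a list of steps that each carry an optional undo. This is an illustration only: `Step`, `RunJournaled` and the step names are hypothetical, and the journal is an in-memory slice where the real job would persist each entry before acting.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one unit of the bring-up; Undo is nil when no compensation is needed.
type Step struct {
	Name string
	Do   func() error
	Undo func() error
}

// RunJournaled executes steps in order, journaling each transition. On a
// failure it undoes the completed steps in reverse order.
func RunJournaled(journal *[]string, steps []Step) error {
	var done []Step
	for _, st := range steps {
		*journal = append(*journal, "begin "+st.Name)
		if err := st.Do(); err != nil {
			*journal = append(*journal, "fail "+st.Name)
			for i := len(done) - 1; i >= 0; i-- {
				if done[i].Undo == nil {
					continue
				}
				*journal = append(*journal, "undo "+done[i].Name)
				if uerr := done[i].Undo(); uerr != nil {
					return fmt.Errorf("%s failed (%v); rollback of %s failed: %w",
						st.Name, err, done[i].Name, uerr)
				}
			}
			return fmt.Errorf("%s failed, rolled back: %w", st.Name, err)
		}
		*journal = append(*journal, "ok "+st.Name)
		done = append(done, st)
	}
	return nil
}

// demo fails at the last step, so the just-restored guest is destroyed.
func demo() (destroyed bool, journal []string, err error) {
	ok := func() error { return nil }
	steps := []Step{
		{Name: "restore", Do: ok, Undo: func() error { destroyed = true; return nil }},
		{Name: "reset-identity", Do: ok},
		{Name: "size", Do: ok},
		{Name: "attach-storage", Do: func() error { return errors.New("usb target missing") }},
	}
	err = RunJournaled(&journal, steps)
	return destroyed, journal, err
}

func main() {
	destroyed, journal, err := demo()
	for _, line := range journal {
		fmt.Println(line)
	}
	fmt.Println("destroyed:", destroyed, "err:", err)
}
```

Only the restore step carries an undo here, because destroying the just-restored guest also discards the identity, sizing and mount changes made to it.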

Identity reset is scenario-specific — this is a correctness boundary, not a detail. "Reset identity" is shorthand for two different operations:

Provision (golden base) → fresh identity, everything. A provisioned guest is new: reset MAC + hostname host-side via the token config (the agent does NOT touch guest internals), while /etc/machine-id (a duplicate breaks journald/DHCP/systemd) and SSH host keys regenerate guest-side on first boot — machine-id by systemd for free, host keys by a baked, Condition-gated felhom-regen-hostkeys.service unit in the golden (the F3 decision: Debian does NOT auto-regenerate host keys after a restore, so the golden carries the regeneration, keeping the agent host-side-only). It then receives a fresh controller identity (host-id, local token, hub channel), fresh restic repo identity, and a fresh tunnel association — all minted in the back half (slice 8A — implemented).
Guest-loss DR (customer backup) → preserve continuity identity, reset only what would collide. The restored guest must continue the customer's world: keep the restic repo identity (resetting it orphans the existing backup chain — a silent data-continuity bug), the tunnel/DNS association, and the hub host/customer binding. Reset only collision-prone host-local identity (machine-id, SSH host keys, hostname as needed). MAC is reset only when a source guest may still be live (e.g. partial loss, or the restore-test which boots link-down for exactly this reason); in a true total guest-loss the original is gone, so the MAC can be kept to preserve DHCP reservations. The agent decides MAC handling from the scenario, not a fixed rule.
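The two identity policies can be encoded as a plan derived from the scenario, as in the sketch below. All names are hypothetical. The hostname rule in the DR branch is an assumption: the text says "hostname as needed", and this sketch resets it only when the source guest may still be live.

```go
package main

import "fmt"

type Scenario int

const (
	Provision   Scenario = iota // new guest from the golden base
	DRGuestLoss                 // restore of a customer backup
)

// ResetPlan lists which identities are replaced (true) or preserved (false).
type ResetPlan struct {
	// Host-local identity.
	MAC, Hostname, MachineID, SSHHostKeys bool
	// Continuity identity.
	ControllerIdentity, ResticRepo, Tunnel bool
}

// PlanReset derives the reset set from the scenario rather than a fixed rule.
func PlanReset(s Scenario, sourceMayBeLive bool) ResetPlan {
	if s == Provision {
		// A provisioned guest is new: everything is fresh.
		return ResetPlan{
			MAC: true, Hostname: true, MachineID: true, SSHHostKeys: true,
			ControllerIdentity: true, ResticRepo: true, Tunnel: true,
		}
	}
	// DR: keep the customer's world, reset only what would collide.
	return ResetPlan{
		MAC:         sourceMayBeLive, // keep the MAC after a true total loss
		Hostname:    sourceMayBeLive, // assumption, see the note above
		MachineID:   true,
		SSHHostKeys: true,
		// ControllerIdentity, ResticRepo and Tunnel stay false: preserved.
	}
}

func main() {
	fmt.Printf("provision:        %+v\n", PlanReset(Provision, false))
	fmt.Printf("dr, total loss:   %+v\n", PlanReset(DRGuestLoss, false))
	fmt.Printf("dr, partial loss: %+v\n", PlanReset(DRGuestLoss, true))
}
```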

The exact reset set was pinned empirically by the slice-7 bring-up spike (live, link-up — documentation/tests/slice7-bringup-spike-findings.md, commit 3342993) and implemented in the unified bring-up reconcile job (agent v0.8.0, internal/reconcile/bringup.go): F1 — a restore preserves the archived MAC, so provision reset is unconditional (PUT net0 with hwaddr omitted); F3 — host keys via the baked golden unit, not an agent guest-internal op.

Guest loss (slice 7). Agent restores G from the fastest surviving tier (snapshot → local → PBS) and applies the DR identity policy above so the restored guest rejoins cleanly. The customer backup already contains the controller + data, so there is no controller deploy in this path — bring up + reattach external storage and it is whole. This is fully in slice 7.

Slice mapping (what is built where — keep this current)

Capability	Slice	Status
Golden base image build (root@pam, at enrollment)	7	recipe implemented (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit; now also bakes the controller image + a controller-bootstrap unit, slice 8A); golden archived at enrollment
Unified bring-up front half (restore→reset identity→size→attach storage), journaled + compensating rollback	7	implemented (agent v0.8.0, `internal/reconcile/bringup.go`)
Guest-loss DR (front half + DR identity policy; no controller deploy)	7	implemented (v0.8.0, `dr_guest_loss` mode — continuity identity preserved)
PBS recovery-code escrow creation + hub opaque storage (§8a)	7	implemented (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`)
Local API server (§6) + provisioning back half — deploy controller, hand bootstrap config, mint per-guest local token	8A	implemented (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is baked into the golden (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo.
Quiesced app-consistent backup (`/backup/due`-driven stack-stop)	8B	implemented (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. 8B.2 downtime optimization (resume at `snapshotted`) implemented (agent v0.13.0 + controller v0.38.0 — §8).
Controller de-privileging (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier)	8C	implemented — slice 8 CLOSED (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece.
Host metrics to the controller (`GET /host/metrics` — the customer host-health view)	9	implemented (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). Host-wide, token-authed, fresh (not the 15-min hub snapshot). Assumption: one customer per host (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot).
Hub desired-state serving (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending)	10A	implemented (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical.
Signed-op execution (verify + run the gated destructive op)	10B	deferred — 10A lays the queue/flag/serving + the gate marks pending; 10B verifies the signature (role-scoped, action-bound, idempotent — `internal/authz`/`internal/reconcile` gate already built) and runs the executor (e.g. the decommission).
PBS escrow consumption (recover `K` on a new box)	10C	spike validated (2026-06-10, `documentation/tests/slice10-escrow-consumption-spike-findings.md` — recover-from-`(blob,R)` on a key-less box + real-data restore proven, GO). Productionizing the consumption path is 10C; exercised by host-loss DR (10D).
Host/hardware loss DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here)	10D	deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization
Golden base refresh cadence + fleet versioning	post-launch	operational, non-blocking (§13)

Host/hardware loss (design intent, slice 10D). Re-enroll the new host in restore mode; the hub, the durable source of truth that survives box death, hands the new agent the existing identity, PBS namespace, tunnel token, storage manifest, a restore directive, and the escrow blob (§8a) for the customer to unlock with their recovery code. The tunnel is reused from the hub record, so DNS stays intact. This depends on hub desired-state serving (slice 10A, implemented), escrow consumption (10C), and re-enrollment authorization, and is not buildable until those land; it is recorded here so the front half built in slice 7 lands ready for it.

10. Concurrency, crash-safety, idempotency

Per-guest serialization. Reconcile, one-shot jobs, and local-API calls all feed a work queue that serializes mutations per guest (Proxmox dislikes concurrent conflicting ops on the same guest). Independent guests proceed in parallel.
Operation journaling. Multi-step async ops (provision, restore, controller-update, agent self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
Idempotency keys on one-shot jobs (run-once across retries and restarts).
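A minimal sketch of the first and third points, assuming an in-memory structure: `GuestQueue` and its methods are hypothetical names, not the agent's real types. A production version would persist the executed keys in the journal so that run-once also holds across restarts; this sketch only covers retries within one process.

```go
package main

import (
	"fmt"
	"sync"
)

// GuestQueue serializes mutations per guest and runs each idempotency key at
// most once per guest. Hypothetical type, for illustration only.
type GuestQueue struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // one lock per guest id
	done  map[string]bool        // executed "guest/key" pairs
}

func NewGuestQueue() *GuestQueue {
	return &GuestQueue{locks: map[string]*sync.Mutex{}, done: map[string]bool{}}
}

// guestLock returns the lock for a guest, creating it on first use.
func (q *GuestQueue) guestLock(guest string) *sync.Mutex {
	q.mu.Lock()
	defer q.mu.Unlock()
	l, ok := q.locks[guest]
	if !ok {
		l = &sync.Mutex{}
		q.locks[guest] = l
	}
	return l
}

// Do runs op while holding the guest's lock, so ops on one guest never overlap
// while other guests proceed in parallel. It reports false when the key already
// ran, which makes a retry a no-op.
func (q *GuestQueue) Do(guest, key string, op func() error) (bool, error) {
	l := q.guestLock(guest)
	l.Lock()
	defer l.Unlock()

	id := guest + "/" + key
	q.mu.Lock()
	seen := q.done[id]
	q.mu.Unlock()
	if seen {
		return false, nil
	}
	if err := op(); err != nil {
		return true, err // not marked done: a later retry may run it again
	}
	q.mu.Lock()
	q.done[id] = true
	q.mu.Unlock()
	return true, nil
}

func main() {
	q := NewGuestQueue()
	runs := 0
	op := func() error { runs++; return nil }
	first, _ := q.Do("guest-101", "provision-abc", op)
	retry, _ := q.Do("guest-101", "provision-abc", op)
	fmt.Println(first, retry, runs) // prints "true false 1"
}
```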

11. Self-update

Agent (the hard case — a host service, no snapshot-rollback). A/B layout: download → verify signature → stage as the inactive slot → flip a current → good|new symlink → restart. Revert authority lives outside the swapped binary — Restart=always alone just crash-loops a bad binary — so a separate health-gate (a systemd oneshot ExecStartPost probe, or a tiny supervisor unit) flips current back to last-good and restarts on a failed health window. The new version is committed as "good" only after a clean health window. Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
Controller (the easy case — it's a guest). The agent owns the controller's lifecycle, so the agent updates the controller: snapshot-before-update (free rollback, because the controller is a snapshottable guest) → pull new image → redeploy → health-check → rollback on failure. This resolves the Part-2 selfupdate/ open: the controller is agent-managed, not self-updating; the controller's old self-update path is removed.

12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the operator's public verify key (for §4 signatures — public, low-risk), the Cloudflare tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the live PBS key. The privilege and the secret footprint that left the controller now concentrate here — which is the whole argument for §3's root-minimization and a small, auditable agent.

13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update ownership, the local-API surface (implemented, slice 8A — §6a), the storage-manifest schema, provision-by-restore, the provision/DR slice boundary (7 front-half + guest-loss DR + escrow creation; 8A provisioning back-half + local API — implemented; 8B quiesced backup; 8C controller de-privileging; 10 host-loss DR + escrow consumption — §9 table), the PBS recovery-code escrow design (§8a), and the root-vs-API boundary (Phase 3, B3 — the slice-8A back-half's host-side chown/pct set bind-mount is a deliberate, narrow addition OUTSIDE the API token, in internal/provision, not the 3-exception proxmox.Privileged fence).

Still open:

Multi-tenant resource fairness on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
Operator-side signing tooling — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
Hub-side desired-state editing UX and the host-domain report schema details — belong to the hub architecture doc.
Golden base image refresh cadence + fleet versioning — operational, non-blocking (§9).
Identity-reset set (live, link-up) — pinned empirically by the slice-7 bring-up spike; the scenario-specific policy is settled in §9, the exact field list is the spike's deliverable.
Escrow restore-mode serving / consumption — handing the opaque blob back to a re-enrolling box and unwrapping K with R is slice-10 / doc-05 (§8a, §9 host-loss). Escrow creation + hub opaque storage are done (slice 7).

This doc hands the implementation three contracts it was waiting on:

the local-API surface (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
the storage-manifest schema (§7) → the settings.StoragePath reshape and per-volume hot/bulk placement (Part 2);
the backup contract (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.

Changelog — design-review + Phase-3 fold-in (2026-06-08)

Slice-10A implemented — hub desired-state serving (the "Down" channel) (2026-06-10)

§4: the control loop is live. The report IS the heartbeat; its response, the control envelope, is the Down channel. The envelope is a cheap change-notification: desired_generation (version) + has_signed_ops (flag) + poll_interval_seconds. The agent caches the desired-state + its generation and re-fetches the heavy state (GET /hosts/{id}/desired-state, self-scoped) only when the generation advances. The engine reconciles benign deltas; an explicit destructive delta (a guest decommission) is classified Destructive → the gate refuses it as pending_signature (no signer in 10A → never executed). Signed-job execution is 10B; the restore_directive field is carried in desired-state now but consumed in 10D.
§9 slice table: 10A done (hub serves desired-state + bumps generation + signed-jobs queue/flag; agent activates the envelope + a hub-backed CachingProvider feeding the engine). 10B/10C/10D pending.
Wire: the envelope's now-active fields + the desired-state response are a cross-repo contract — control-envelope.golden.json + desired-state.golden.json, byte-identical agent↔hub. Status: implemented (hub v0.9.0; agent v0.15.0). Out of 10A (deliberate): the hub stores/serves desired-state opaquely (the agent owns the schema); signed-op execution + verification is 10B; restore-mode/re-enroll consumption (a new box's first directive) is 10D — 10A serves only already-authenticated hosts.
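The change-notification pattern in §4 can be sketched like this. The three JSON field names come from the text; the Go struct, `GenCache`, and the `fetch` callback are assumptions standing in for the real `ControlEnvelope`, the `internal/desired` Syncer, and `Client.FetchDesiredState`, whose actual shapes are not shown here.

```go
package main

import "fmt"

// ControlEnvelope carries the three fields named in the text. The Go layout
// is an assumption; only the JSON names are taken from the document.
type ControlEnvelope struct {
	DesiredGeneration   int64 `json:"desired_generation"`
	HasSignedOps        bool  `json:"has_signed_ops"`
	PollIntervalSeconds int   `json:"poll_interval_seconds"`
}

// GenCache (hypothetical) holds the opaque desired-state together with the
// generation it was fetched at.
type GenCache struct {
	generation int64
	state      []byte
	fetch      func() ([]byte, error) // stands in for GET /hosts/{id}/desired-state
}

// OnEnvelope re-fetches the heavy state only when the generation advances and
// reports whether a fetch happened.
func (c *GenCache) OnEnvelope(env ControlEnvelope) (bool, error) {
	if env.DesiredGeneration <= c.generation {
		return false, nil // unchanged: the cheap heartbeat path
	}
	state, err := c.fetch()
	if err != nil {
		return false, err // keep the old cache and generation; retry next report
	}
	c.state, c.generation = state, env.DesiredGeneration
	return true, nil
}

func main() {
	fetches := 0
	c := &GenCache{fetch: func() ([]byte, error) {
		fetches++
		return []byte("opaque desired-state"), nil
	}}
	for _, gen := range []int64{1, 1, 2} {
		fetched, _ := c.OnEnvelope(ControlEnvelope{DesiredGeneration: gen})
		fmt.Println("generation", gen, "fetched:", fetched)
	}
	fmt.Println("total fetches:", fetches)
}
```

Leaving the generation untouched on a failed fetch means the next envelope triggers the fetch again without extra retry state.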

Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)

§6: added GET /host/metrics — host-wide health (cpu%/mem/load/uptime/cpu_temp_c) + per-storage capacity for the customer's monitoring view. Reuses the slice-4 collector (no duplicate collection); host-wide, token-authed, fresh (not the 15-min hub snapshot).
§9 slice table: defined + marked slice 9 (the roadmap previously jumped 8→10; this fills it). Noted the one-customer-per-host assumption (host-wide CPU/mem would leak cross-customer load on a multi-customer host) and the out-of-scope items (multi-tenant filtering; time-series history).
The one new collector is CPU/chassis temp (internal/hub/cputemp.go, sysfs hwmon/thermal-zone, graceful-null), added to the shared HostMetrics → the hub report gains cpu_temp_c too (operator freebie) → cross-repo host-report golden updated byte-identical. Status: implemented (agent v0.14.0; controller v0.39.0).

Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)

§6: added the disk-management endpoints (/disks, /disks/assign|eject|format) and reframed the principle — a controller may do non-data-destructive storage setup self-serve; anything that can lose customer data stays operator-signed (§4), with the classifier (agent-internal device inspection) as the enforcer. The 8C invariant: the agent decides data-bearing-ness by inspecting the device itself, never the caller's claim; a data-bearing format → ClassStorageWipe → gate → pending_signature (signed completion is slice 10).
§9 slice table: 8C implemented — slice 8 CLOSED (agent v0.12.0 /disks + classifier gate + mkfs; controller v0.37.0 retired ~12.3k LOC of disk-execution + de-privileged + rewired to the agent). The controller-side re-platform milestone: the in-guest controller is now Docker-only with no disk/Proxmox privileges.
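The 8C invariant can be illustrated with a small gate. Everything here is a sketch under assumptions: `Device`, `FormatRequest`, and `GateFormat` are hypothetical names, and "has a filesystem signature or a partition table" is only a stand-in for whatever the real classifier inspects. What matters is that the caller's claim is carried in the request and never consulted.

```go
package main

import "fmt"

// Device is what the agent's own inspection found. Fields are hypothetical.
type Device struct {
	Path          string
	HasFilesystem bool
	HasPartitions bool
}

// Decision is the gate outcome; pending_signature is the value named in the text.
type Decision string

const (
	Allowed          Decision = "allowed"
	PendingSignature Decision = "pending_signature"
)

// FormatRequest carries the caller's claim, which the gate ignores by design.
type FormatRequest struct {
	Path        string
	ClaimsEmpty bool
}

// GateFormat decides from the agent's inspection of the device itself.
// A data-bearing device is never formatted self-serve.
func GateFormat(req FormatRequest, inspect func(path string) Device) Decision {
	dev := inspect(req.Path)
	if dev.HasFilesystem || dev.HasPartitions {
		return PendingSignature // data-bearing: needs an operator signature
	}
	return Allowed
}

func main() {
	// The caller claims the disk is empty, but inspection finds a filesystem.
	inspect := func(path string) Device {
		return Device{Path: path, HasFilesystem: true}
	}
	req := FormatRequest{Path: "/dev/sdb", ClaimsEmpty: true}
	fmt.Println(GateFormat(req, inspect))
}
```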

Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)

§8: the controller-driven quiesce (stop app stacks → POST /backup → restart) is implemented (controller v0.36.0 internal/quiesce + agent v0.11.0 /backup/due cadence + /backup/status phases). Documented the crash-safety centerpiece (marker-before-stop, guaranteed unquiesce, max-quiesce bound, startup crash-recovery, single-flight) and the 8B.2 downtime-optimization fast-follow (snapshot mode + a snapshotted phase). Validated live: a quiesced postgres restore comes up clean ("database system was shut down") vs a crash-consistent restore doing WAL recovery.
§9 slice table: 8B → implemented; 8C (controller de-privileging) still pending.

Slice-8A implemented: local API + provisioning back-half (2026-06-10)

NEW §6a: the local-API implementation (agent v0.10.0 internal/localapi; controller v0.35.0 internal/bootstrap + internal/agentapi) — persisted self-signed leaf with a stable leaf-SHA-256 pin, the token→guest self-scoping (explicit cross-guest id → 403, op never issued), the stable bootstrap.json contract + controller ingestion (c) (seed controller.yaml, skip setup; idempotent + fail-safe), and the baked-controller deploy (no registry credential in any guest). Firewall narrowing = defense-in-depth; the token stays the gate.
§9: the provisioning back half row is now slice 8A — implemented (split from the old "8"); build-golden.sh now bakes the controller + a bootstrap unit; quiesced backup → 8B, controller de-privileging → 8C. The host-side chown/pct set bind-mount is a deliberate narrow surface in internal/provision (NOT the 3-exception proxmox.Privileged fence). Validated live end-to-end.
§13 updated accordingly.
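The token→guest self-scoping rule from §6a reduces to a small check that runs before any operation is issued. A minimal sketch, assuming a plain token-to-guest map: `Authorize` is a hypothetical name, and the 401 for an unknown token is an assumption (the text only specifies the 403 for an explicit cross-guest id).

```go
package main

import (
	"fmt"
	"net/http"
)

// Authorize resolves the guest a call may act on. tokens maps a per-guest
// local token to its guest id; requested is the guest id the caller named
// explicitly, or "" when it named none.
func Authorize(tokens map[string]string, token, requested string) (string, int) {
	own, ok := tokens[token]
	if !ok {
		return "", http.StatusUnauthorized // assumption: unknown token → 401
	}
	if requested != "" && requested != own {
		// Explicit cross-guest id: refuse here, so the op is never issued.
		return "", http.StatusForbidden
	}
	return own, http.StatusOK // the scope always comes from the token
}

func main() {
	tokens := map[string]string{"tok-a": "101", "tok-b": "102"}
	fmt.Println(Authorize(tokens, "tok-a", ""))    // own guest, implicit
	fmt.Println(Authorize(tokens, "tok-a", "102")) // cross-guest attempt
}
```

Deriving the guest from the token, and treating a caller-supplied id only as something to reject on mismatch, is what keeps the token the gate rather than the firewall.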

Slice-7 scope + escrow design (2026-06-09)

§9 rewritten: the bring-up primitive is a shared front half only — identity-reset policy is scenario-specific (provision = fresh everything; guest-loss DR = preserve restic/tunnel/hub continuity identity, reset only collision-prone host-local identity). Added the slice 7/8/10 mapping table (front half + guest-loss DR + escrow creation in 7; provisioning back-half in 8; host-loss DR + escrow consumption in 10).
NEW §8a: PBS recovery-code escrow — live key unencrypted on box for unattended ops; agent generates recovery code R; PBS-native passphrase-wrap of K under R escrowed to the hub (zero-knowledge); consumption is slice 10. Irreducible-residual + rotation≠key-rotation stated.
§13 updated accordingly.
NEW provision-by-restore (§9): the agent provisions by restoring a golden base image (token-covered, preserves keyctl), never pct create on the per-customer path; one unified restore primitive shared with DR. §2 responsibility + §3 boundary updated.
B3 (§2/§3): replaced "Phase-1 minimal role" with the validated FelhomAgent operator role; root-vs-API boundary settled (root only for golden-image build, host mounts, SMART).
B1 (§4): reversibility gate rewritten as provenance + data-bearing (scratch tag is agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
B2 (§7/§8): validated bulk-as-backup=0-mountpoint recipe + the bulk-DR consequence (excluded bulk needs its own backup decision).
S1 (§6/§8): GET /backup/due added; controller-driven quiescing; agent vzdump is crash-consistent only. S2 (§10/§11): A/B self-update with external revert authority; controller-update + agent self-update journaled. S3 (§7): StoragePath field re-homing. S4: geo non-responsibility added (§2). M2 (§7): manifest "absorbs + adds durable_id". §6: rollback is self-scoped/bounded. §13: golden-image refresh cadence added as open.
