# Architecture Part 3 — The Host Agent

> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong — flag it.

## 1. Purpose & scope

The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.

It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.

The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded — the agent's data model is per-host,
N-guests, never "the guest").

## 2. Responsibilities (and explicit non-responsibilities)

Owns:

1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API** — the per-guest authorization gate the controller calls.
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.

Explicitly does **not**:

- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
- Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.

## 3. Process model & host integration

- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work + a **narrow `sudoers` allowlist** for true host ops. Per Phase 3 (B3) the boundary is settled: the entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered. Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.

## 4. Control model — reconcile + signed destructive ops

Two channels, split by **reversibility**, not by transport.

**(a) Desired-state reconciliation — steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
reconciliation for free.
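
A minimal sketch of the converge step, to make "idempotent" concrete. The `GuestSpec`, `Action`, and `Diff` names and the action kinds are hypothetical, not the agent's real types. The point it illustrates: the diff is a pure function of desired and actual state, so a missed poll costs nothing, and a guest that exists on the host but is absent from desired state is only flagged, never destroyed by the loop.

```go
package main

import (
	"fmt"
	"sort"
)

// GuestSpec is a hypothetical, simplified desired-state entry for one guest.
type GuestSpec struct {
	Cores    int
	MemoryMB int
}

// Action is one step the reconciler would take. The kinds are illustrative.
type Action struct {
	Kind string // "provision", "resize", or "flag-extra"
	VMID int
}

// Diff compares desired and actual guests and returns the converging actions.
// It has no side effects, so calling it on converged state returns nothing.
// Guests found on the host but missing from desired state are only flagged:
// destroying a data-bearing guest needs an operator signature.
func Diff(desired, actual map[int]GuestSpec) []Action {
	var out []Action
	for id, want := range desired {
		have, ok := actual[id]
		switch {
		case !ok:
			out = append(out, Action{"provision", id})
		case have != want:
			out = append(out, Action{"resize", id})
		}
	}
	for id := range actual {
		if _, ok := desired[id]; !ok {
			out = append(out, Action{"flag-extra", id})
		}
	}
	// Each VMID appears at most once, so this ordering is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].VMID < out[j].VMID })
	return out
}

func main() {
	desired := map[int]GuestSpec{101: {2, 2048}, 102: {4, 4096}}
	actual := map[int]GuestSpec{101: {2, 2048}, 103: {1, 512}}
	for _, a := range Diff(desired, actual) {
		fmt.Printf("%s vmid=%d\n", a.Kind, a.VMID)
	}
}
```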

**(b) Signed one-shot jobs — operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.

**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
by verb**:

- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Data-bearing-ness is agent-internal evidence, never a caller's claim (slice 8C).** For a customer-driven storage op (`POST /disks/format`, §6) the agent **inspects the actual device** (filesystem signature / partition table / partitions / mount, conservative — ambiguous → data-bearing) to decide the class. A blank device → benign self-serve `mkfs`; a data-bearing device → `ClassStorageWipe` → this gate → `pending_signature`. The **destructive completion of a data-bearing wipe is slice 10** (the operator-signed path); 8C refuses it. This mirrors the provenance rule above: just as the scratch tag is agent-internal (never hub-sourced), data-bearing-ness is agent-observed (never controller-asserted) — a compromised controller cannot relabel a data-bearing drive "blank" to walk the gate.
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
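
The rules above can be read as a small decision function. This is a sketch under stated assumptions: the `Op` struct and its field names are hypothetical, and the two provenance flags are assumed to be set only from the agent's own journal and tags, never parsed from hub input. Note that the function has no input for "job vs desired-state delta", which is the point of gating by provenance rather than by channel or verb.

```go
package main

import "fmt"

// Op describes a requested change. All names here are illustrative.
type Op struct {
	Name            string
	Destructive     bool // destroys or overwrites a resource
	SameTransaction bool // target was created earlier in this journaled transaction
	AgentScratchTag bool // agent-internal scratch tag, never populated from the hub
	OperatorSigned  bool // a verified operator signature accompanies the op
}

// Decision is the gate's verdict.
type Decision string

const (
	Allow            Decision = "allow"
	PendingSignature Decision = "pending_signature"
)

// Gate allows non-destructive ops and destructive ops on resources the agent
// can prove are its own (same transaction, or scratch). Every other
// destructive op needs an operator signature.
func Gate(op Op) Decision {
	switch {
	case !op.Destructive:
		return Allow
	case op.SameTransaction, op.AgentScratchTag:
		return Allow
	case op.OperatorSigned:
		return Allow
	default:
		return PendingSignature
	}
}

func main() {
	ops := []Op{
		{Name: "start guest"},
		{Name: "destroy restore-test scratch guest", Destructive: true, AgentScratchTag: true},
		{Name: "destroy live guest (desired-state delta)", Destructive: true},
		{Name: "destroy live guest (signed job)", Destructive: true, OperatorSigned: true},
	}
	for _, op := range ops {
		fmt.Printf("%-42s %s\n", op.Name, Gate(op))
	}
}
```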

Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.
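
A sketch of the agent-side checks on a signed payload. The draft does not name a signature scheme or a wire format, so Ed25519, the JSON encoding, and every field name in `Job` are assumptions made for illustration. `NewDemoKey` and `SignJob` stand in for the operator's signing path and exist only so the example runs. A real agent would persist the used-nonce set so a restart does not reopen the replay window.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
)

// Job is a hypothetical payload shape: a target, a nonce, and an expiry.
type Job struct {
	Action  string `json:"action"`
	HostID  string `json:"host_id"`
	GuestID int    `json:"guest_id"`
	Nonce   string `json:"nonce"`
	Expires int64  `json:"expires_unix"`
}

// Verifier holds the operator public key, this host's id, and used nonces.
type Verifier struct {
	OperatorKey ed25519.PublicKey
	HostID      string
	seen        map[string]bool // in-memory here; would need to be durable
}

// Verify checks the signature first, then target binding, expiry, and replay.
func (v *Verifier) Verify(payload, sig []byte, nowUnix int64) (Job, error) {
	var j Job
	if !ed25519.Verify(v.OperatorKey, payload, sig) {
		return j, errors.New("bad signature")
	}
	if err := json.Unmarshal(payload, &j); err != nil {
		return j, err
	}
	if j.HostID != v.HostID {
		return j, errors.New("job is bound to another host")
	}
	if nowUnix >= j.Expires {
		return j, errors.New("job expired")
	}
	if v.seen[j.Nonce] {
		return j, errors.New("nonce already used")
	}
	if v.seen == nil {
		v.seen = map[string]bool{}
	}
	v.seen[j.Nonce] = true
	return j, nil
}

// NewDemoKey generates a throwaway operator key pair for the example.
func NewDemoKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// SignJob plays the operator: it serializes and signs a job.
func SignJob(priv ed25519.PrivateKey, j Job) (payload, sig []byte) {
	payload, _ = json.Marshal(j) // Job contains only marshalable fields
	return payload, ed25519.Sign(priv, payload)
}

func main() {
	pub, priv := NewDemoKey()
	v := &Verifier{OperatorKey: pub, HostID: "host-a"}
	payload, sig := SignJob(priv, Job{"restore", "host-a", 101, "n-1", 2000})
	_, err := v.Verify(payload, sig, 1000)
	fmt.Println("first delivery:", err)
	_, err = v.Verify(payload, sig, 1000)
	fmt.Println("replayed:", err)
}
```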

## 5. Hub ↔ agent protocol (host domain)

**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:

- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal: a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. A silently dead box
is the worst failure mode for a managed service, so the check gets first-class treatment hub-side.
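
A minimal hub-side sketch of the missed-check-in test. The `Monitor` type, its method names, and the 15-minute window in `main` are assumptions for illustration; times are plain Unix seconds to keep the example small. A host that has never checked in is not in the map, so a real hub would seed the entry at enrollment.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Monitor records the last check-in per host. Names are illustrative.
type Monitor struct {
	WindowSec int64            // allowed silence before the operator is alerted
	lastSeen  map[string]int64 // host id -> Unix seconds of the last poll
}

// NewMonitor returns a Monitor with the given check-in window in seconds.
func NewMonitor(windowSec int64) *Monitor {
	return &Monitor{WindowSec: windowSec, lastSeen: map[string]int64{}}
}

// CheckIn is called on every poll the hub receives from a host.
func (m *Monitor) CheckIn(host string, nowUnix int64) {
	m.lastSeen[host] = nowUnix
}

// Overdue returns, in sorted order, the hosts silent for longer than the window.
func (m *Monitor) Overdue(nowUnix int64) []string {
	var out []string
	for host, seen := range m.lastSeen {
		if nowUnix-seen > m.WindowSec {
			out = append(out, host)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	m := NewMonitor(15 * 60)
	now := time.Now().Unix()
	m.CheckIn("host-a", now-60)   // polled a minute ago
	m.CheckIn("host-b", now-3600) // silent for an hour
	fmt.Println("alert operator for:", m.Overdue(now))
}
```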

**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.

## 6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
  - **Host metrics (slice 9):** `GET /host/metrics` — **host-wide** health for the customer's
    monitoring view: cpu%/mem/load/uptime, **CPU/chassis temperature** (`cpu_temp_c`, nullable —
    "n/a" when the hardware exposes no sensor), and per-storage capacity (total/used/fraction,
    thin-pool fill, disk SMART temp+wear). It **reuses the slice-4 collector** (no duplicate
    collection) and serves a **fresh** collect (current cpu%/temp, not the 15-min hub snapshot).
    Unlike the rest of the surface this is **host-wide, not per-guest** (the box, not the caller's
    guest) — correct for "see my box's health" — but still **token-authed** via the per-guest token.
    **Assumption: one customer per host** (the home-server model); if a host ever served multiple
    customers, host-wide CPU/mem would leak cross-customer load → revisit then. The de-privileged
    controller (slice 8C) sees only its own cgroup, so it cannot read host health itself; this
    re-serves the agent's existing host + storage observation to the customer. **Status:
    implemented** (agent v0.14.0 `internal/localapi` + `internal/hub/cputemp.go`; controller v0.39.0
    `internal/web/agent_host_metrics_handler.go` + the monitoring page's host-health card).
  - **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
    `POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
    /disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
    warns which apps lose that storage — benign), `POST /disks/format` (see the reframed principle
    below). The controller is Docker-only (de-privileged, slice 8C); **execution is the agent's**.

**The principle (reframed for 8C):** a controller may do **non-data-destructive** storage setup
**self-serve** (list, assign, eject, format a *blank* drive); **anything that can lose customer data
stays operator-signed (§4)**. The enforcer is the **classifier**: for `POST /disks/format` the agent
**inspects the actual device itself** (filesystem signature / partition table / partitions / mount —
agent-internal evidence, NEVER the caller's claim) and classifies conservatively (ambiguous →
data-bearing). A blank device → benign → `mkfs`. A data-bearing device → `ClassStorageWipe` →
destructive → the §4 gate → refused **`pending_signature`** (the operator-signed completion is slice
10). So a compromised controller asserting "this drive is blank" **cannot** wipe a data-bearing
drive — the 8C analog of self-scoping. **Status: implemented** (agent v0.12.0 `internal/localapi` +
`internal/storage`; controller v0.37.0 `internal/web/agent_disk_handlers.go`).
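
A sketch of the conservative classifier, assuming the agent has already probed the device. The `Evidence` struct, its field names, and `HandleFormat` are hypothetical; only `ClassStorageWipe` and `pending_signature` come from the text. The request carries no "blank" flag at all, so there is nothing for a caller to assert, and a failed or inconclusive probe counts as data-bearing.

```go
package main

import (
	"errors"
	"fmt"
)

// Evidence is what the agent observed on the device. Names are illustrative.
type Evidence struct {
	FSSignature    bool  // a filesystem signature was found
	PartitionTable bool  // a partition table was found
	Partitions     int   // number of partitions seen
	Mounted        bool  // the device or a partition is mounted
	ProbeErr       error // set when any probe failed or was inconclusive
}

// ErrProbeInconclusive is a sample probe failure used by the example.
var ErrProbeInconclusive = errors.New("probe inconclusive")

// Class is the agent's classification of a format request.
type Class string

const (
	ClassBlank       Class = "blank"
	ClassStorageWipe Class = "ClassStorageWipe"
)

// Classify treats any sign of data, and any ambiguity, as data-bearing.
func Classify(e Evidence) Class {
	if e.ProbeErr != nil || e.FSSignature || e.PartitionTable || e.Partitions > 0 || e.Mounted {
		return ClassStorageWipe
	}
	return ClassBlank
}

// HandleFormat returns what the agent would do with a format request:
// a self-serve mkfs for a blank device, otherwise a refusal pending signature.
func HandleFormat(e Evidence) string {
	if Classify(e) == ClassBlank {
		return "mkfs"
	}
	return "pending_signature"
}

func main() {
	fmt.Println("blank disk:       ", HandleFormat(Evidence{}))
	fmt.Println("has filesystem:   ", HandleFormat(Evidence{FSSignature: true}))
	fmt.Println("probe failed:     ", HandleFormat(Evidence{ProbeErr: ErrProbeInconclusive}))
}
```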

Note what is *absent*: nothing here lets a controller touch **another guest**, the **host** beyond
this narrow disk surface, or **restore-overwrite**. Within the disk surface, **data-destructive**
power stays operator-signed (§4), as does all other destructive or cross-guest power.

A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
token → guest and authorizes per guest, so a compromised controller's blast radius is
**self-scoped and bounded** to its own guest.
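
A sketch of token-to-guest self-scoping. The `Authorizer` type and its methods are hypothetical, and it assumes the agent keeps only a SHA-256 hash of each token. The guest is always resolved from the token; a caller-supplied VMID is used only to detect disagreement, never to select a target.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// Authorizer maps SHA-256(token) to the guest that token was minted for.
type Authorizer struct {
	byHash map[[sha256.Size]byte]int
}

// NewAuthorizer returns an empty Authorizer.
func NewAuthorizer() *Authorizer {
	return &Authorizer{byHash: map[[sha256.Size]byte]int{}}
}

// Register stores only the hash of the token, never the plaintext.
func (a *Authorizer) Register(token string, vmid int) {
	a.byHash[sha256.Sum256([]byte(token))] = vmid
}

// Resolve returns the caller's own VMID and an HTTP status code.
// claimedVMID is 0 when the request named no guest.
func (a *Authorizer) Resolve(token string, claimedVMID int) (int, int) {
	vmid, ok := a.byHash[sha256.Sum256([]byte(token))]
	if !ok {
		return 0, http.StatusUnauthorized // absent or unknown token
	}
	if claimedVMID != 0 && claimedVMID != vmid {
		return 0, http.StatusForbidden // explicit id disagrees with the token
	}
	return vmid, http.StatusOK
}

func main() {
	a := NewAuthorizer()
	a.Register("token-for-101", 101)
	a.Register("token-for-102", 102)

	fmt.Println(a.Resolve("token-for-101", 0))   // own guest, no explicit id
	fmt.Println(a.Resolve("token-for-101", 102)) // tries to name another guest
	fmt.Println(a.Resolve("made-up-token", 101)) // unknown token
}
```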

### 6a. Implementation (slice 8A — implemented)

**Status: implemented** (agent v0.10.0 `internal/localapi`; controller v0.35.0 `internal/bootstrap`
+ `internal/agentapi`). Grounded by `documentation/tests/slice8a-channel-deploy-spike-findings.md`
(commit `4a81a96`). The 7 endpoints above are live; `GET /backup/due` is **thin** in 8A (the
quiesce-on-due consumer is 8B), the rest wrap the existing slice-5/6/7 machinery.

- **Transport / pin.** The agent serves a **persisted self-signed leaf** bound to the host bridge IP
  on a fixed port (default `:8443`). The controller pins the **leaf-cert SHA-256** (decision:
  consistency with the agent's Proxmox/PBS cert pinning), carried in its bootstrap. The leaf is
  generated **once and persisted**, so its fingerprint is stable across agent restarts (a fresh cert
  each boot would invalidate every already-issued bootstrap pin). Defense-in-depth: the listener
  binds the **bridge IP** (not `0.0.0.0`) and a host firewall rule narrows the port to the guest
  bridge subnet (`configs/felhom-localapi-firewall.example`) — the **per-guest token stays the gate**.
- **Token custody.** The per-guest token is minted by the back-half (§9), persisted as a **SHA-256
  hash** only (the plaintext exists transiently at mint→write-to-mount, then is discarded), in a
  durable last-write-wins map. **Self-scoping** is enforced by the token→guest map alone: the VMID is
  resolved from the token, never from a caller-supplied id; an explicit `vmid` that disagrees is
  refused (**403**) and the Proxmox op is never issued for the other guest. Absent/unknown token → 401.
- **The bootstrap contract `(c)`.** The agent emits a stable `bootstrap.json`
  (`schema: felhom.bootstrap/v1`: customer identity, hub, and the local-API `{endpoint, fingerprint,
  token}`) into a read-only config mount; the controller **ingests it on first run and seeds its own
  `controller.yaml`, skipping setup mode** (idempotent — never clobbers an existing config; fail-safe
  — a malformed/absent bootstrap stays in setup). The agent emits the contract; the controller owns
  the translation — they stay decoupled (no shared config schema). **No registry credential ever
  enters a guest**: the controller image is **baked into the golden** (§9), so deploy does no
  `docker login`/`pull`.
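
A controller-side sketch of pinning the leaf certificate by SHA-256. It assumes the bootstrap carries the fingerprint as a hex string; that encoding and the `PinnedTLSConfig`, `CheckPin`, and `Fingerprint` names are illustrative. Because the leaf is self-signed, normal chain verification is switched off and the pin comparison becomes the only acceptance check, so the callback must never be removed while `InsecureSkipVerify` is set.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"fmt"
)

// Fingerprint returns the hex SHA-256 of a certificate's DER bytes.
func Fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// CheckPin accepts the connection only if the leaf hashes to the pin.
func CheckPin(rawCerts [][]byte, pin []byte) error {
	if len(rawCerts) == 0 {
		return errors.New("no peer certificate")
	}
	sum := sha256.Sum256(rawCerts[0]) // rawCerts[0] is the leaf
	if subtle.ConstantTimeCompare(sum[:], pin) != 1 {
		return errors.New("leaf fingerprint does not match pin")
	}
	return nil
}

// PinnedTLSConfig builds a client config that trusts exactly one leaf.
func PinnedTLSConfig(pinHex string) (*tls.Config, error) {
	pin, err := hex.DecodeString(pinHex)
	if err != nil || len(pin) != sha256.Size {
		return nil, errors.New("pin must be a hex SHA-256 digest")
	}
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// Chain verification is replaced by the pin check below.
		InsecureSkipVerify: true,
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			return CheckPin(rawCerts, pin)
		},
	}, nil
}

func main() {
	// Stand-in bytes; a real caller would see the agent's DER-encoded leaf.
	der := []byte("stand-in for the leaf certificate's DER bytes")
	cfg, err := PinnedTLSConfig(Fingerprint(der))
	if err != nil {
		panic(err)
	}
	fmt.Println("pinned leaf accepted:", cfg.VerifyPeerCertificate([][]byte{der}, nil) == nil)
	fmt.Println("other leaf accepted: ", cfg.VerifyPeerCertificate([][]byte{[]byte("other")}, nil) == nil)
}
```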

## 7. Storage manifest & reconciliation

The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
`settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
improvement. Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |

Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).
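
A sketch of the mount half of that reconciliation, using a subset of the table's fields. The `Target` struct mirrors the manifest columns, but the `Step` type, the action strings, and the `present`/`mounted` observation maps are assumptions. How a target that is already marked `disconnected` comes back is not modelled here.

```go
package main

import "fmt"

// Target carries a subset of the manifest fields from the table above.
type Target struct {
	Type      string // local-dir, usb, nfs, cifs, pbs
	DurableID string // UUID, server:export, or repo+fingerprint
	Class     string // fast or slow
	Role      string
	State     string // attached, disconnected, decommissioned
}

// Step is one reconcile action. The action names are illustrative.
type Step struct {
	DurableID string
	Action    string // "mount" or "report-disconnected"
}

// Plan compares the manifest with what the host currently observes.
// present: durable ids whose device or export is reachable right now.
// mounted: durable ids that are already mounted.
func Plan(manifest []Target, present, mounted map[string]bool) []Step {
	var steps []Step
	for _, t := range manifest {
		if t.State != "attached" {
			continue // only attached targets are kept mounted by the loop
		}
		switch {
		case mounted[t.DurableID]:
			// already converged, nothing to do
		case present[t.DurableID]:
			steps = append(steps, Step{t.DurableID, "mount"})
		default:
			steps = append(steps, Step{t.DurableID, "report-disconnected"})
		}
	}
	return steps
}

func main() {
	manifest := []Target{
		{Type: "usb", DurableID: "uuid-1", Class: "slow", Role: "vzdump-target", State: "attached"},
		{Type: "usb", DurableID: "uuid-2", Class: "slow", Role: "bulk-data", State: "attached"},
		{Type: "nfs", DurableID: "nas:/export", Class: "slow", Role: "vzdump-target", State: "attached"},
	}
	present := map[string]bool{"uuid-1": true, "nas:/export": true}
	mounted := map[string]bool{"nas:/export": true}
	for _, s := range Plan(manifest, present, mounted) {
		fmt.Printf("%s -> %s\n", s.DurableID, s.Action)
	}
}
```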

**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.

A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
(verified, `phase3-findings.md` B2). Proven recipe: attach
`-mpN <storage>:<size>,mp=/mnt/bulk,backup=0`, then
`docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol>` (or a
compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The
**DR consequence** of excluding bulk is covered in §8.

**Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
`IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
`StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
reconnect, not a host concern).

## 8. Backup/restore orchestration

Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).

- **Quiescing (controller-driven for app-consistency) — implemented (slice 8B):** an LXC has no
  fsfreeze (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a
  backup is due (`GET /backup/due`, §6) → **quiesces** (stops its app stacks) → `POST /backup` →
  polls `GET /backup/status` to `done` → **unquiesces** (restarts exactly the stacks it stopped).
  Implemented in `felhom-controller` v0.36.0 (`internal/quiesce`) + `felhom-agent` v0.11.0 (the
  `/backup/due` cadence policy + `/backup/status` phases). **An agent-initiated vzdump is
  crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5); the
  controller stopping its stacks first is what makes the captured state **clean-shutdown-consistent**
  (validated live: a quiesced postgres restore comes up clean — "database system was shut down" — vs
  a crash-consistent restore doing WAL recovery — "redo starts… redo done"). Every Proxmox op is
  async → the agent polls `task exitstatus`, never trusts the POST return.
  - **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):**
    a persisted marker written **before** stopping anything; **guaranteed unquiesce** (restart on a
    backup error, a status-poll error, the max-quiesce bound, or controller shutdown); a
    **max-quiesce-duration** hard bound (restart the app no matter what — the backup finishes on the
    agent); and **crash recovery** at controller startup (restart stacks left stopped by a mid-quiesce
    crash). The marker also single-flights the loop. All proven live + unit-tested.
  - **8B.2 downtime optimization — implemented (agent v0.13.0 + controller v0.38.0):** in snapshot
    mode, vzdump only needs the app-stopped state captured at the **storage-snapshot moment**; after
    that it reads from the snapshot. The agent watches the vzdump task log for the snapshot marker
    (`create storage snapshot`, validated on PVE 9.2.2) and emits a **`snapshotted`** phase on
    `/backup/status`; the controller **resumes its app at `snapshotted`** (not `done`), cutting app
    downtime from *whole-backup* to *until-snapshot* (~24s→~1s for a 934 MB guest) with **no loss of
    app-consistency** (the snapshot froze the app-stopped state). Depends on snapshot-capable storage
    (lvm-thin/ZFS); on stop/downgraded storage the marker never appears and the controller **falls
    back to resume-at-`done`** (8B). The controller keeps tracking to `done`/`failed` after early
    resume (no overlapping backup; the backup isn't "successful" until `done`).
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
  `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
  `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
  host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
  *same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
  never a silent loss.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
  *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
  cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
  read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** this is what closes the "tested restore is the critical gap" theme. The
  agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
  health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
  restore-tested by the box (the operator lacks the key) — so this lives in the agent by
  necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
  as the lighter check.
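
A sketch of the two halves of the early-resume handshake described in the quiescing bullet. The phase names `snapshotted`, `done`, and `failed` and the log marker come from the text; the `running` phase name, the function names, and the sample log lines are assumptions. The controller resumes on any terminal or `snapshotted` phase, and the max-quiesce bound forces a resume even if no such phase ever arrives.

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotMarker is the vzdump task-log text that signals the snapshot moment.
const snapshotMarker = "create storage snapshot"

// Phase is the agent side: derive the backup phase from the task state.
func Phase(taskLog []string, finished, succeeded bool) string {
	if finished {
		if succeeded {
			return "done"
		}
		return "failed"
	}
	for _, line := range taskLog {
		if strings.Contains(line, snapshotMarker) {
			return "snapshotted"
		}
	}
	return "running" // hypothetical name for "in progress, no snapshot yet"
}

// ShouldResume is the controller side: restart the stopped stacks once the
// app-stopped state is captured, the backup ended, or the bound is hit.
func ShouldResume(phase string, quiescedSec, maxQuiesceSec int64) bool {
	switch phase {
	case "snapshotted", "done", "failed":
		return true
	}
	return quiescedSec >= maxQuiesceSec
}

func main() {
	// Illustrative log lines, not a captured task log.
	log := []string{
		"INFO: starting new backup job",
		"INFO: create storage snapshot 'vzdump'",
	}
	for i := 1; i <= len(log); i++ {
		p := Phase(log[:i], false, false)
		fmt.Printf("after %d log line(s): phase=%s resume=%v\n", i, p, ShouldResume(p, 1, 600))
	}
}
```

On storage where the marker never appears, `Phase` stays at `running` until the task finishes, which reproduces the resume-at-`done` fallback without a separate code path.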

### 8a. PBS recovery-code escrow + the key-custody posture model (zero-knowledge offsite-key recovery)

The DR substrate is the PBS offsite tier, client-side encrypted (zero-knowledge): if the box dies,
restoring the offsite backups requires the **PBS client encryption key `K`**, which died with the
box. The escrow is how `K` comes back **without** Felhom ever being able to read customer data.
**Status: implemented** — escrow *creation* (agent v0.9.0, `internal/escrow`) + hub *opaque storage*
(hub v0.8.0, `PUT /api/v1/hosts/{host_id}/escrow`). Validated end-to-end on a throwaway in
`documentation/tests/slice7-escrow-spike-findings.md`. Restore-mode *serving/consumption* is slice 10.

#### The separation principle (the rule that governs every posture)
Reading customer data needs **BOTH** the encrypted chunks **AND** a usable key. **Zero-knowledge
holds for exactly as long as Felhom never holds both at once.** Every posture below is just a
choice about where the data and the key live; the principle decides who can read.

#### Topology matrix (data location × key custody → who can read)
| Data location | Key custody | Who can read | Notes |
|---|---|---|---|
| **Felhom storage** | customer-only key | **only the customer** | **the DEFAULT** — genuine zero-knowledge |
| **Felhom storage** | Felhom also holds a key | **Felhom can read** | the one dangerous cell — explicit, informed opt-in only; never default, never silent |
| Customer's own offsite | customer key | only the customer | self-hosted data; key XOR data |
| Customer's own offsite | Felhom holds a key | only the customer | safe by separation (key and data never co-located at Felhom) |

#### The escrow mechanism (decisions + the rationale that pins them)
- **Live key unencrypted on the box** (`0600`, root): the agent backs up *and* runs restore-tests
  unattended — no passphrase prompt on the management path. The privilege concentration this implies
  is the whole argument for §3 root-minimization + a small auditable agent.
- **Wrap — PBS-native, not custom crypto.** At enrollment the agent generates a high-entropy
  **recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`** via PBS's own
  key passphrase KDF (`proxmox-backup-client key change-passphrase --kdf scrypt`; no bespoke AEAD).
  The spike pinned two implementation constraints: that command is **TTY-only** (drive it over a
  pty), and the pty **echoes the passphrase** (discard the pty output so `R` can't leak) — F-A1/F-A2.
- **Agent-side generation.** `R` is generated **on the box** (it already holds `K` and does the
  wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction. `R` is
  ≥128 bits, **word-list form** (EFF large wordlist, 10 words ≈ 129 bits) for off-paper transcription.
- **Self-verify before shipping.** Creation unwraps a copy of the blob with `R` and checks the key
  fingerprint matches — "an escrow you haven't recovered isn't an escrow."
- **Escrow = the `R`-wrapped blob → hub (opaque storage, slice 7).** The hub stores the ciphertext
  bytes against the host record and **never decrypts them** (it has no `R`; there is no decrypt
  path). Per-host-key authed; rotation is last-write-wins. **Restore-mode serving is slice 10.**
- **Recovery code custody.** `R` is surfaced to the customer **exactly once** at enrollment
  (printed/displayed) and **never stored by Felhom in any recoverable form**.
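
A sketch of the word-list recovery code and its entropy arithmetic. The function names are hypothetical and the eight-word list in `main` is a toy stand-in; a real run would load the 7776-word EFF large wordlist. With a uniform pick per word the entropy is words × log2(list size), so 10 words from 7776 give 10 × log2(7776) ≈ 129.2 bits, matching the figure above. This covers only generating `R`; the wrapping itself stays with the PBS client.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"math"
	"math/big"
	"strings"
)

// EntropyBits is the entropy of `words` independent uniform picks.
func EntropyBits(listSize, words int) float64 {
	return float64(words) * math.Log2(float64(listSize))
}

// RecoveryCode draws `words` entries uniformly at random from wordlist.
func RecoveryCode(wordlist []string, words int) (string, error) {
	if len(wordlist) < 2 || words < 1 {
		return "", errors.New("need at least two words in the list and one pick")
	}
	limit := big.NewInt(int64(len(wordlist)))
	picked := make([]string, words)
	for i := range picked {
		// rand.Int is uniform in [0, limit), so there is no modulo bias.
		n, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		picked[i] = wordlist[n.Int64()]
	}
	return strings.Join(picked, " "), nil
}

func main() {
	// Toy list for the demo only; far too small for real use.
	toy := []string{"acorn", "badge", "cedar", "delta", "ember", "fjord", "grove", "heron"}
	code, err := RecoveryCode(toy, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println("sample code:", code)
	fmt.Printf("toy list:  %.1f bits\n", EntropyBits(len(toy), 10))
	fmt.Printf("EFF large: %.1f bits\n", EntropyBits(7776, 10))
}
```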

#### Default posture + the anti-lockout ladder (opt-in, increasing trust)
**Default:** *Felhom storage + customer-only key*, and **`R` is delivered durably (printed) always**
— note this is distinct from a raw-key paperkey: `R` is a safe two-factor *passphrase* (useless
without the hub's blob); the raw key is the footgun. The ladder trades resilience for trust:
- **(b) `R`-wrapped offline copy** — the same two-factor blob, for the customer to print/store. **No
  extra trust**; resilience if the hub ever vanishes (still needs `R`). *Implemented (opt-in).*
- **(a) raw paperkey** — `proxmox-backup-client key paperkey` of the unwrapped key, for a safe.
  Covers **losing `R`**, but it is **single-factor and unrevocable**. *Implemented (opt-in, loud
  caveat).*
- **Felhom-holds-a-key** — maximum convenience, but **gives up zero-knowledge** (the dangerous
  matrix cell). **Not implemented** — it needs a separate Felhom-side secure key store + explicit
  opt-in UX, built only when a customer asks.

#### SSH-for-support is a SEPARATE grant — deliberately not coupled to key custody
Support access (active / consented / observable — customer-toggleable, commands shown) is **not**
the same as a standing / passive / invisible decryption capability. The transparency features prove
*controlled* support access **without Felhom holding a key**. Conflating the two is exactly the
mistake the separation principle prevents.

#### Why zero-knowledge stays the default (breach + legal)
Holding data **and** a key makes a single hub breach an **all-customer data leak**, and makes Felhom
**compellable** — a court can order what Felhom *can* produce. Genuine zero-knowledge means *"we
can't be forced to hand over what we can't read."* This is core to the sovereignty pitch, not a
nicety.

#### Honesty properties (stated to the customer at enrollment)
- **Irreducible residual:** losing `R` *and* the box (and, if not opted in, having no paperkey) =
  the offsite backups are **unrecoverable, by anyone, including Felhom.** The cost of genuine
  zero-knowledge — communicated, not buried.
- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows a new code) but
  does **not** re-encrypt existing PBS data — that stays keyed by `K`. Changing `K` itself is a
  separate, heavier op (new key → new backups; old backups still need old `K`), out of scope for
  routine recovery-code rotation.
- **Integrity caveat (self-hosted-data postures):** moving data to the customer's own offsite
  **loses Felhom's backup guarantees** — no PBS verify / monitoring on storage we can't reach. An
  honest signup-time tradeoff, not a hidden one.

## 9. Provisioning & DR flows

**Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3, empirically:
a token `vzrestore` of a keyctl archive produced a guest that kept
`features: nesting=1,keyctl=1` and `unprivileged: 1`), so the agent provisions **by restoring a
golden base image**, never by `pct create` on the per-customer path.

**Golden base image.** A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`,
overlayfs — is built once as `root@pam` **at enrollment** (when the agent legitimately holds root
to mint its Proxmox token) and refreshed on a maintenance cadence. This is the one place
`keyctl`/root provisioning lives — off the per-customer path. Refresh cadence + fleet versioning
remain an operational open item (§13).

**Unified bring-up primitive (shared *front half* — NOT shared identity policy).** Provisioning
and DR-restore share one token-covered front-half code path:

> restore an archive → **reset identity** → size the guest (CPU/mem config + `pct resize`
> rootfs, token-covered) → attach storage mounts per the manifest

run as a **journaled reconcile job**; a mid-flight failure is compensating-rolled-back (destroy
the just-restored guest — allowed unsigned per §4, same-transaction provenance). They diverge in
the *archive* and the *back half*, **and in identity policy** (below).
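
A sketch of a journaled job with compensating rollback. The `Step` type, the `Run` and `rollback` functions, and the journal-as-callback are hypothetical, and a real journal would be written durably before each step so a crash mid-flight can be recovered. In the bring-up case only the restore step carries an undo (destroy the just-restored guest); the later steps modify that guest and disappear with it. The example uses `errors.Join`, so it assumes Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one unit of the job. Undo is nil when there is nothing to compensate.
type Step struct {
	Name string
	Do   func() error
	Undo func() error
}

// ErrDemo is a sample failure used by the example.
var ErrDemo = errors.New("simulated failure")

// Run executes steps in order and journals each transition. On the first
// failure it compensates the completed steps in reverse order.
func Run(steps []Step, journal func(string)) error {
	var done []Step
	for _, s := range steps {
		journal("begin " + s.Name)
		if err := s.Do(); err != nil {
			journal("failed " + s.Name)
			return errors.Join(fmt.Errorf("%s: %w", s.Name, err), rollback(done, journal))
		}
		journal("done " + s.Name)
		done = append(done, s)
	}
	return nil
}

// rollback undoes completed steps newest-first and collects any undo errors.
func rollback(done []Step, journal func(string)) error {
	var errs []error
	for i := len(done) - 1; i >= 0; i-- {
		if done[i].Undo == nil {
			continue
		}
		journal("undo " + done[i].Name)
		if err := done[i].Undo(); err != nil {
			errs = append(errs, fmt.Errorf("undo %s: %w", done[i].Name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	ok := func() error { return nil }
	steps := []Step{
		{Name: "restore", Do: ok, Undo: ok}, // Undo stands for "destroy the restored guest"
		{Name: "reset-identity", Do: ok},
		{Name: "size", Do: func() error { return ErrDemo }},
		{Name: "attach-storage", Do: ok},
	}
	err := Run(steps, func(entry string) { fmt.Println("journal:", entry) })
	fmt.Println("result:", err)
}
```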

**Identity reset is scenario-specific — this is a correctness boundary, not a detail.** "Reset
identity" is shorthand for two different operations:

- **Provision (golden base) → fresh identity, everything.** A provisioned guest is new: reset
  MAC + hostname **host-side via the token config** (the agent does NOT touch guest internals),
  while **`/etc/machine-id`** (a duplicate breaks journald/DHCP/systemd) and **SSH host keys**
  regenerate **guest-side on first boot** — machine-id by systemd for free, host keys by a baked,
  Condition-gated `felhom-regen-hostkeys.service` unit in the golden (the F3 decision: Debian does
  NOT auto-regenerate host keys after a restore, so the golden carries the regeneration, keeping
  the agent host-side-only). It then receives a **fresh** controller identity (host-id, local
  token, hub channel), **fresh restic repo identity**, and a fresh tunnel association — all minted
  in the back half (slice 8A — implemented).
- **Guest-loss DR (customer backup) → preserve continuity identity, reset only what would
  collide.** The restored guest must *continue* the customer's world: **keep** the restic repo
  identity (resetting it orphans the existing backup chain — a silent data-continuity bug), the
  tunnel/DNS association, and the hub host/customer binding. Reset only collision-prone host-local
  identity (`machine-id`, SSH host keys, hostname as needed). **MAC is reset only when a source
  guest may still be live** (e.g. partial loss, or the restore-*test* which boots link-down for
  exactly this reason); in a true total guest-loss the original is gone, so the MAC can be kept to
  preserve DHCP reservations. The agent decides MAC handling from the scenario, not a fixed rule.
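
The two policies above can be written down as a small decision function. The Go sketch below is illustrative only: `Scenario`, `Conditions`, `IdentityPlan`, and `planIdentity` are hypothetical names rather than the agent's real types, and folding "host-id, local token, hub channel" and the hub host/customer binding into one `ControllerIdentity` flag is a simplifying assumption.

```go
package main

import "fmt"

// Scenario selects the identity policy; the shared front half is the same for both.
type Scenario int

const (
	Provision   Scenario = iota // restore of the golden base: a brand-new guest
	GuestLossDR                 // restore of a customer backup: continue the customer's world
)

// Conditions are the scenario facts the DR policy depends on (hypothetical inputs).
type Conditions struct {
	SourceMayBeLive  bool // partial loss or restore-test: the original may still be running
	HostnameCollides bool // covers "hostname as needed"
}

// IdentityPlan marks each identity element as reset (true) or preserved (false).
type IdentityPlan struct {
	MAC, Hostname          bool // reset host-side through the guest config
	MachineID, SSHHostKeys bool // collision-prone host-local identity
	ControllerIdentity     bool // host-id, local token, hub binding
	ResticRepo, Tunnel     bool // continuity identity
}

// planIdentity returns the reset set for a scenario. Provision resets
// everything; guest-loss DR keeps continuity identity and decides MAC and
// hostname from the conditions.
func planIdentity(s Scenario, c Conditions) (IdentityPlan, error) {
	switch s {
	case Provision:
		return IdentityPlan{
			MAC: true, Hostname: true,
			MachineID: true, SSHHostKeys: true,
			ControllerIdentity: true,
			ResticRepo:         true, Tunnel: true,
		}, nil
	case GuestLossDR:
		return IdentityPlan{
			MAC:         c.SourceMayBeLive,
			Hostname:    c.HostnameCollides,
			MachineID:   true,
			SSHHostKeys: true,
		}, nil
	default:
		return IdentityPlan{}, fmt.Errorf("unknown scenario %d", s)
	}
}

func main() {
	p, _ := planIdentity(Provision, Conditions{})
	d, _ := planIdentity(GuestLossDR, Conditions{SourceMayBeLive: true})
	fmt.Printf("provision: %+v\n", p)
	fmt.Printf("dr (source may be live): %+v\n", d)
}
```

Keeping the policy a pure function of scenario and conditions makes the correctness boundary testable without a live guest: a DR plan that ever sets `ResticRepo` is the silent data-continuity bug described above.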

The exact reset set was pinned empirically by the slice-7 bring-up spike (live, link-up —
`documentation/tests/slice7-bringup-spike-findings.md`, commit `3342993`) and **implemented in the
unified bring-up reconcile job** (agent v0.8.0, `internal/reconcile/bringup.go`): F1 — a restore
preserves the archived MAC, so provision reset is unconditional (`PUT net0` with `hwaddr` omitted);
F3 — host keys via the baked golden unit, not an agent guest-internal op.

**Guest loss (slice 7).** Agent restores G from the fastest surviving tier (snapshot → local →
PBS) and applies the **DR identity policy** above so the restored guest rejoins cleanly. The
customer backup already contains the controller + data, so there is **no controller deploy** in
this path — bring up + reattach external storage and it is whole. This is fully in slice 7.
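
The tier preference (snapshot → local → PBS) amounts to a first-match scan over an ordered list. A minimal Go sketch follows; `Tier`, `restoreOrder`, `pickRestoreTier`, and `ErrNoSurvivingTier` are hypothetical names, and how the agent actually discovers which tiers survived is out of scope here.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier names a place a guest can be restored from.
type Tier string

const (
	TierSnapshot Tier = "snapshot"
	TierLocal    Tier = "local"
	TierPBS      Tier = "pbs"
)

// restoreOrder lists tiers fastest-first, matching snapshot → local → PBS.
var restoreOrder = []Tier{TierSnapshot, TierLocal, TierPBS}

// ErrNoSurvivingTier is returned when no tier holds a usable copy.
var ErrNoSurvivingTier = errors.New("no surviving backup tier for guest")

// pickRestoreTier returns the fastest tier that still holds a usable copy.
func pickRestoreTier(surviving map[Tier]bool) (Tier, error) {
	for _, t := range restoreOrder {
		if surviving[t] {
			return t, nil
		}
	}
	return "", ErrNoSurvivingTier
}

func main() {
	// The local snapshot was lost with the guest; local and PBS copies remain.
	tier, err := pickRestoreTier(map[Tier]bool{TierLocal: true, TierPBS: true})
	fmt.Println(tier, err)
}
```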

### Slice mapping (what is built where — keep this current)

| Capability | Slice | Status |
|---|---|---|
| Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit; **now also bakes the controller image + a controller-bootstrap unit**, slice 8A); golden archived at enrollment |
| Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | **implemented** (agent v0.8.0, `internal/reconcile/bringup.go`) |
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
| **Hub desired-state serving** (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending) | **10A** | **implemented** (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical. |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | deferred — 10A lays the queue/flag/serving + the gate marks pending; 10B verifies the signature (role-scoped, action-bound, idempotent — `internal/authz`/`internal/reconcile` gate already built) and runs the executor (e.g. the decommission). |
| **PBS escrow consumption** (recover `K` on a new box) | **10C** | **spike validated** (2026-06-10, `documentation/tests/slice10-escrow-consumption-spike-findings.md` — recover-from-`(blob,R)` on a key-less box + real-data restore proven, GO). Productionizing the consumption path is 10C; exercised by host-loss DR (10D). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here) | **10D** | deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |

**Host/hardware loss (design intent, slice 10D).** Re-enroll the new host in **restore mode**;
the hub, the durable source of truth that survives box death, hands the new agent the existing
identity, PBS namespace, tunnel token, storage manifest, a restore directive, and the **escrow
blob** (§8a) for the customer to unlock with their recovery code. The tunnel is reused from the
hub record, so DNS stays intact. This depends on hub desired-state serving (slice 10A,
implemented), escrow consumption (10C), and re-enrollment authorization, so it is deferred to
10D; it is recorded here so the front half built in slice 7 lands ready for it.

## 10. Concurrency, crash-safety, idempotency

- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
  work queue that serializes mutations **per guest** (Proxmox holds a per-guest lock during
  operations such as backup and snapshot, so conflicting concurrent ops on the same guest fail
  or time out). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
  self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
  journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
  half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).
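
The first and third bullets can be sketched together in Go. This is an illustration under stated assumptions, not the agent's work queue: `GuestQueue`, `NewGuestQueue`, and `Do` are hypothetical names, a keyed mutex stands in for a real queue (so it gives mutual exclusion per guest but no FIFO ordering), and the set of completed keys lives in memory, whereas run-once across restarts would need it persisted alongside the journal.

```go
package main

import (
	"fmt"
	"sync"
)

// GuestQueue serializes mutations per guest and runs each idempotency key
// at most once per guest.
type GuestQueue struct {
	mu    sync.Mutex             // guards locks and done
	locks map[string]*sync.Mutex // one lock per guest id
	done  map[string]bool        // completed idempotency keys, scoped per guest
}

func NewGuestQueue() *GuestQueue {
	return &GuestQueue{locks: map[string]*sync.Mutex{}, done: map[string]bool{}}
}

// lockFor returns the guest's lock, creating it on first use.
func (q *GuestQueue) lockFor(guest string) *sync.Mutex {
	q.mu.Lock()
	defer q.mu.Unlock()
	l, ok := q.locks[guest]
	if !ok {
		l = &sync.Mutex{}
		q.locks[guest] = l
	}
	return l
}

// Do runs fn while holding the guest's lock. A non-empty key that already
// completed successfully for this guest is skipped. It reports whether fn ran.
// A failed fn is not marked done, so a retry with the same key runs again.
func (q *GuestQueue) Do(guest, key string, fn func() error) (bool, error) {
	l := q.lockFor(guest)
	l.Lock()
	defer l.Unlock()

	id := guest + "/" + key
	if key != "" {
		q.mu.Lock()
		seen := q.done[id]
		q.mu.Unlock()
		if seen {
			return false, nil
		}
	}
	if err := fn(); err != nil {
		return true, err
	}
	if key != "" {
		q.mu.Lock()
		q.done[id] = true
		q.mu.Unlock()
	}
	return true, nil
}

func main() {
	q := NewGuestQueue()
	var wg sync.WaitGroup
	for _, guest := range []string{"101", "102"} {
		wg.Add(1)
		go func(g string) { // independent guests proceed in parallel
			defer wg.Done()
			ran, err := q.Do(g, "resize-1", func() error { return nil })
			fmt.Println("guest", g, "ran:", ran, "err:", err)
		}(guest)
	}
	wg.Wait()
	ran, _ := q.Do("101", "resize-1", func() error { return nil })
	fmt.Println("retry on guest 101 ran:", ran)
}
```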

## 11. Self-update

- **Agent (the hard case: a host service, no snapshot-rollback).** **A/B layout:** download →
  verify signature → stage as the inactive slot → flip the `current` symlink from the last-good
  slot to the new one → restart. **Revert authority lives outside the swapped binary**
  (`Restart=always` alone just crash-loops a bad binary), so a **separate health-gate** (a
  systemd oneshot `ExecStartPost` probe, or a tiny supervisor unit) flips `current` back to
  last-good and restarts on a failed health window. The new version is **committed as "good"
  only after a clean health window**. Triggered by a hub signed job within the update window;
  manual is always allowed. Journaled (§10).
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
  so the **agent updates the controller**: snapshot-before-update (free rollback, because the
  controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
  on failure. This resolves the Part-2 `selfupdate/` open: the controller is **agent-managed**,
  not self-updating; the controller's old self-update path is removed.
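
The A/B flip and the external revert can be sketched in Go. This is a minimal illustration of the health-gate's logic, assuming a Linux host where renaming over a symlink is atomic. `flip`, `activate`, `runDemo`, and the `slot-a`/`slot-b` names are hypothetical, and download, signature verification, the service restart, and journaling are all omitted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// flip atomically repoints dir/current at slot: create a temporary symlink,
// then rename it over the old one.
func flip(dir, slot string) error {
	tmp := filepath.Join(dir, ".current.tmp")
	// Clear a leftover from an interrupted flip.
	if err := os.Remove(tmp); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := os.Symlink(slot, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "current"))
}

// activate is meant to run in the supervisor, not in the swapped binary.
// It flips to next, asks the health probe, and flips back to lastGood if the
// probe fails. It returns the slot that ends up committed as good.
func activate(dir, lastGood, next string, healthy func() bool) (string, error) {
	if err := flip(dir, next); err != nil {
		return lastGood, err
	}
	// A real gate would restart the agent here and wait out the health window.
	if healthy() {
		return next, nil
	}
	if err := flip(dir, lastGood); err != nil {
		return next, fmt.Errorf("revert to %s failed: %w", lastGood, err)
	}
	return lastGood, nil
}

// runDemo exercises one update in a throwaway directory and reports where
// current points afterwards.
func runDemo(healthy bool) (string, error) {
	dir, err := os.MkdirTemp("", "agent-slots")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := flip(dir, "slot-a"); err != nil {
		return "", err
	}
	if _, err := activate(dir, "slot-a", "slot-b", func() bool { return healthy }); err != nil {
		return "", err
	}
	return os.Readlink(filepath.Join(dir, "current"))
}

func main() {
	for _, healthy := range []bool{true, false} {
		slot, err := runDemo(healthy)
		fmt.Println("healthy:", healthy, "current ->", slot, "err:", err)
	}
}
```

Because the flip is a single rename, a crash at any point leaves `current` pointing at one complete slot, never at a half-written one.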

## 12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here — which is the whole
argument for §3's root-minimization and a small, auditable agent.

## 13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface (**implemented, slice 8A — §6a**), the storage-manifest schema,
**provision-by-restore**, the **provision/DR slice boundary** (7 front-half + guest-loss DR +
escrow creation; **8A provisioning back-half + local API — implemented**; 8B quiesced backup; 8C
controller de-privileging; 10 host-loss DR + escrow consumption — §9 table), the **PBS
recovery-code escrow design** (§8a), and the **root-vs-API boundary** (Phase 3, B3 — the slice-8A
back-half's host-side `chown`/`pct set` bind-mount is a deliberate, narrow addition OUTSIDE the
API token, in `internal/provision`, not the 3-exception `proxmox.Privileged` fence).

Still open:

- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
- **Golden base image** refresh cadence + fleet versioning — operational, non-blocking (§9).
- **Identity-reset set** (live, link-up): settled rather than open. The scenario-specific policy
  is in §9, and the exact field list was pinned by the slice-7 bring-up spike and implemented in
  agent v0.8.0 (§9).
- **Escrow restore-mode serving / consumption** — handing the opaque blob back to a re-enrolling
  box and unwrapping `K` with `R` is slice-10 / doc-05 (§8a, §9 host-loss). *Escrow creation + hub
  opaque storage are done (slice 7).*

This doc hands the implementation three contracts it was waiting on:

1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.

---

## Changelog — design-review + Phase-3 fold-in (2026-06-08)

### Slice-10A implemented — hub desired-state serving (the "Down" channel) (2026-06-10)
- §4: the **control loop is live**. The report IS the heartbeat; its response — the **control
  envelope** — is the Down channel. The envelope is a cheap change-notification: `desired_generation`
  (version) + `has_signed_ops` (flag) + `poll_interval_seconds`. The agent **caches** the desired-state
  + its generation and re-fetches the heavy state (`GET /hosts/{id}/desired-state`, self-scoped) **only
  when the generation advances**. The engine reconciles **benign** deltas; an explicit **destructive**
  delta (a guest `decommission`) is classified Destructive → the gate refuses it **`pending_signature`**
  (no signer in 10A → never executed). **Signed-job execution is 10B**; the `restore_directive` field
  is carried in desired-state now but **consumed in 10D**.
- §9 slice table: **10A done** (hub serves desired-state + bumps generation + signed-jobs queue/flag;
  agent activates the envelope + a hub-backed `CachingProvider` feeding the engine). 10B/10C/10D pending.
- Wire: the envelope's now-active fields + the `desired-state` response are a cross-repo contract —
  `control-envelope.golden.json` + `desired-state.golden.json`, **byte-identical** agent↔hub. Status:
  implemented (hub v0.9.0; agent v0.15.0). **Out of 10A (deliberate):** the hub stores/serves
  desired-state **opaquely** (the agent owns the schema); signed-op **execution** + verification is 10B;
  **restore-mode/re-enroll** consumption (a new box's first directive) is 10D — 10A serves only
  already-authenticated hosts.

### Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)
- §6: added **`GET /host/metrics`** — host-wide health (cpu%/mem/load/uptime/**`cpu_temp_c`**) +
  per-storage capacity for the customer's monitoring view. **Reuses the slice-4 collector** (no
  duplicate collection); host-wide, **token-authed**, **fresh** (not the 15-min hub snapshot).
- §9 slice table: **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills it).
  Noted the **one-customer-per-host** assumption (host-wide CPU/mem would leak cross-customer load on
  a multi-customer host) and the out-of-scope items (multi-tenant filtering; time-series history).
- The one new collector is **CPU/chassis temp** (`internal/hub/cputemp.go`, sysfs hwmon/thermal-zone,
  **graceful-null**), added to the **shared `HostMetrics`** → the hub report gains `cpu_temp_c` too
  (operator freebie) → **cross-repo host-report golden updated** byte-identical. Status: implemented
  (agent v0.14.0; controller v0.39.0).

### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
  **reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
  **anything that can lose customer data stays operator-signed (§4)**, with the **classifier
  (agent-internal device inspection)** as the enforcer. The 8C invariant: the agent decides
  data-bearing-ness by **inspecting the device itself**, never the caller's claim; a data-bearing
  format → `ClassStorageWipe` → gate → `pending_signature` (signed completion is slice 10).
- §9 slice table: **8C implemented — slice 8 CLOSED** (agent v0.12.0 `/disks` + classifier gate +
  `mkfs`; controller v0.37.0 retired ~12.3k LOC of disk-execution + de-privileged + rewired to the
  agent). The controller-side re-platform milestone: the in-guest controller is now Docker-only with
  no disk/Proxmox privileges.

### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)
- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented**
  (controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`
  phases). Documented the **crash-safety** centerpiece (marker-before-stop, guaranteed unquiesce,
  max-quiesce bound, startup crash-recovery, single-flight) and the **8B.2** downtime-optimization
  fast-follow (snapshot mode + a `snapshotted` phase). Validated live: a **quiesced** postgres restore
  comes up clean ("database system was shut down") vs a **crash-consistent** restore doing WAL recovery.
- §9 slice table: **8B → implemented**; 8C (controller de-privileging) still pending.

### Slice-8A implemented: local API + provisioning back-half (2026-06-10)
- NEW §6a: the **local-API implementation** (agent v0.10.0 `internal/localapi`; controller v0.35.0
  `internal/bootstrap` + `internal/agentapi`) — persisted self-signed leaf with a **stable
  leaf-SHA-256 pin**, the **token→guest self-scoping** (explicit cross-guest id → 403, op never
  issued), the stable **`bootstrap.json` contract + controller ingestion `(c)`** (seed
  `controller.yaml`, skip setup; idempotent + fail-safe), and the **baked-controller deploy** (no
  registry credential in any guest). Firewall narrowing = defense-in-depth; the token stays the gate.
- §9: the provisioning **back half** row is now **slice 8A — implemented** (split from the old "8");
  `build-golden.sh` now **bakes the controller + a bootstrap unit**; quiesced backup → 8B, controller
  de-privileging → 8C. The host-side `chown`/`pct set` bind-mount is a deliberate narrow surface in
  `internal/provision` (NOT the 3-exception `proxmox.Privileged` fence). Validated live end-to-end.
- §13 updated accordingly.

### Slice-7 scope + escrow design (2026-06-09)
- §9 rewritten: the bring-up primitive is a **shared front half only** — identity-reset policy is
  **scenario-specific** (provision = fresh everything; guest-loss DR = preserve restic/tunnel/hub
  continuity identity, reset only collision-prone host-local identity). Added the **slice 7/8/10
  mapping table** (front half + guest-loss DR + escrow creation in 7; provisioning back-half in 8;
  host-loss DR + escrow consumption in 10).
- NEW §8a: **PBS recovery-code escrow** — live key unencrypted on box for unattended ops; agent
  generates recovery code `R`; PBS-native passphrase-wrap of `K` under `R` escrowed to the hub
  (zero-knowledge); consumption is slice 10. Irreducible-residual + rotation≠key-rotation stated.
- §13 updated accordingly.

### Design-review + Phase-3 fold-in (2026-06-08)

- **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image**
  (token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified
  restore primitive shared with DR. §2 responsibility + §3 boundary updated.
- **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator
  role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART).
- **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is
  agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
- **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence**
  (excluded bulk needs its own backup decision).
- **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is
  crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority;
  controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing.
  **S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id".
  **§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open.