# Architecture Part 3 — The Host Agent
> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong — flag it.
## 1. Purpose & scope
The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.
It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.
The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded — the agent's data model is per-host,
N-guests, never "the guest").
## 2. Responsibilities (and explicit non-responsibilities)
Owns:
1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API** — the per-guest authorization gate the controller calls.
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.
Explicitly does **not**:
- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
- Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.
## 3. Process model & host integration
- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work + a **narrow `sudoers` allowlist** for true host ops. Per Phase 3 (B3) the boundary is settled: the entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered. Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.
## 4. Control model — reconcile + signed destructive ops
Two channels, split by **reversibility**, not by transport.
**(a) Desired-state reconciliation — steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
reconciliation for free.
**(b) Signed one-shot jobs — operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.
**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
by verb**:
- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub**; otherwise a compromised hub could relabel a data-bearing guest as scratch and walk straight through the gate.
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
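
The gate above reduces to a small decision function. The following is a minimal sketch, assuming the agent can classify every resource by a provenance it derived itself; the type and constant names (`Provenance`, `Action`, `NeedsOperatorSignature`) are illustrative and not taken from the agent's code. Treating every overwrite as signed, even of a scratch guest, is a conservative assumption of this sketch.

```go
package main

// Provenance records how the agent itself came to know a resource.
// It is derived from the agent's own journal, never from hub input.
type Provenance int

const (
	ProvUnknown         Provenance = iota // default, including live customer guests
	ProvSameTransaction                   // created earlier in the current journaled transaction
	ProvAgentScratch                      // tagged ephemeral/scratch by the agent itself
)

// Action is the kind of change a job or a desired-state delta asks for.
type Action int

const (
	ActCreate Action = iota
	ActStart
	ActRestart
	ActDestroy
	ActOverwrite // restore-overwrite, storage wipe or detach
)

// NeedsOperatorSignature reports whether a change must carry a valid operator
// signature, regardless of the channel it arrived on.
func NeedsOperatorSignature(a Action, p Provenance) bool {
	switch a {
	case ActCreate, ActStart, ActRestart:
		return false // non-destructive: the reconciler may act on its own
	case ActDestroy:
		// Unsigned destroy only where the agent proved the provenance itself.
		return p != ProvSameTransaction && p != ProvAgentScratch
	default:
		return true // overwrite and anything unrecognized fail closed
	}
}
```

The useful property is that the hub never supplies the `Provenance` argument, so relabeling a guest in desired state cannot change the outcome.
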
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.
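
The nonce, expiry, and target binding can be illustrated with a small verifier. This is a sketch under stated assumptions: the doc does not name a signature scheme, so Ed25519 is chosen here only for illustration, and the `Job` fields, JSON encoding, and in-memory nonce set are hypothetical (a real agent would persist consumed nonces across restarts).

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
)

// Job is the signed body. Field names are illustrative, not a wire format.
type Job struct {
	Op      string `json:"op"`       // e.g. "restore"
	HostID  string `json:"host_id"`  // target binding: host
	GuestID string `json:"guest_id"` // target binding: guest
	Nonce   string `json:"nonce"`    // unique per job, for anti-replay
	Expiry  int64  `json:"expiry"`   // unix seconds; invalid at or after this
}

// Envelope is the opaque blob the hub queues: payload bytes plus signature.
type Envelope struct {
	Payload []byte
	Sig     []byte
}

// NewOperatorKey makes a throwaway key pair for demos and tests.
func NewOperatorKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil) // nil selects crypto/rand
}

// Sign is the operator-side half; it never runs on the hub.
func Sign(priv ed25519.PrivateKey, j Job) (Envelope, error) {
	payload, err := json.Marshal(j)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{Payload: payload, Sig: ed25519.Sign(priv, payload)}, nil
}

// Verifier is the agent-side half: public key, own host id, consumed nonces.
type Verifier struct {
	pub    ed25519.PublicKey
	hostID string
	seen   map[string]bool // in-memory here; a real agent would persist this
}

func NewVerifier(pub ed25519.PublicKey, hostID string) *Verifier {
	return &Verifier{pub: pub, hostID: hostID, seen: map[string]bool{}}
}

// Verify checks the signature before parsing anything, then the target
// binding, the expiry, and finally the nonce.
func (v *Verifier) Verify(env Envelope, nowUnix int64) (Job, error) {
	if !ed25519.Verify(v.pub, env.Payload, env.Sig) {
		return Job{}, errors.New("signature invalid")
	}
	var j Job
	if err := json.Unmarshal(env.Payload, &j); err != nil {
		return Job{}, err
	}
	if j.HostID != v.hostID || j.GuestID == "" {
		return Job{}, errors.New("job is not bound to this host and a guest")
	}
	if nowUnix >= j.Expiry {
		return Job{}, errors.New("job expired")
	}
	if j.Nonce == "" || v.seen[j.Nonce] {
		return Job{}, errors.New("nonce missing or already used")
	}
	v.seen[j.Nonce] = true
	return j, nil
}
```

The agent then acts only on the fields of the returned `Job`, so nothing outside the signed payload can redirect the operation to another guest.
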
## 5. Hub ↔ agent protocol (host domain)
**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:
- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).
**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. This is the worst
failure mode for a managed service, so it gets first-class treatment hub-side.
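
A hub-side check for the dead-man's-switch might look like the sketch below. The "expected check-in window" is not defined in this doc, so the sketch assumes it is a number of missed poll intervals; `Checkin`, `Overdue`, and `missedPolls` are hypothetical names.

```go
package main

import "sort"

// Checkin is the hub's last-seen record for one agent.
type Checkin struct {
	HostID       string
	LastSeenUnix int64 // time of the last successful poll, unix seconds
	IntervalSec  int64 // poll interval the hub handed this agent
}

// Overdue returns the hosts whose last check-in is older than
// missedPolls poll intervals, sorted by host id for stable alerting.
func Overdue(agents []Checkin, nowUnix, missedPolls int64) []string {
	var late []string
	for _, a := range agents {
		if nowUnix-a.LastSeenUnix > a.IntervalSec*missedPolls {
			late = append(late, a.HostID)
		}
	}
	sort.Strings(late)
	return late
}
```

Because the window scales with each agent's own interval, changing the poll interval in the downlink config does not produce false alerts.
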
**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.
## 6. Agent ↔ controller local API
The controller (in its LXC) reaches the agent (on the host) over the local bridge.
- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage`: mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot`: snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback`: roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup`: request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/due`: whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status`: read-only status for the controller's UI.
Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
token → guest and authorizes per guest, so a compromised controller's blast radius is
**self-scoped and bounded** to its own guest.
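
One way to get the self-scoping property is to never accept a guest id from the caller at all and to derive it from the token instead. The sketch below assumes a `Bearer` authorization header and SHA-256 token hashes kept in memory; both are illustrative choices, and `Gate`, `Register`, `Authenticate`, and `Allowed` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"errors"
	"strings"
)

var ErrUnauthenticated = errors.New("unknown or missing local token")

// Gate maps per-guest local tokens to guest ids. Only token hashes are kept.
type Gate struct {
	byGuest map[string][32]byte // guest id -> SHA-256 of that guest's token
}

func NewGate() *Gate { return &Gate{byGuest: map[string][32]byte{}} }

// Register records the token minted for a guest at controller deploy.
func (g *Gate) Register(guestID, token string) {
	g.byGuest[guestID] = sha256.Sum256([]byte(token))
}

// Authenticate resolves an Authorization header to the caller's own guest id.
// Handlers take the guest from this result only, never from the request.
func (g *Gate) Authenticate(authHeader string) (string, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) || len(authHeader) == len(prefix) {
		return "", ErrUnauthenticated
	}
	sum := sha256.Sum256([]byte(strings.TrimPrefix(authHeader, prefix)))
	for guestID, want := range g.byGuest {
		if subtle.ConstantTimeCompare(sum[:], want[:]) == 1 {
			return guestID, nil
		}
	}
	return "", ErrUnauthenticated
}

// surface is the whole local API from this section; anything else is refused.
var surface = map[string]bool{
	"GET /storage":             true,
	"POST /snapshot":           true,
	"POST /rollback":           true,
	"POST /backup":             true,
	"GET /backup/due":          true,
	"GET /backup/status":       true,
	"GET /restore-test/status": true,
}

// Allowed reports whether method+path is part of the local API surface.
func Allowed(method, path string) bool { return surface[method+" "+path] }
```

With this shape there is no request parameter through which a controller could name another guest, which is what bounds a compromised controller to itself.
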
## 7. Storage manifest & reconciliation
The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
`settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
improvement. Held in the hub, reconciled by the agent.
Per target:
| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |
Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).
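
The reconciliation rule can be sketched as a pure planning step over the manifest. This is one reading of the paragraph above, under the assumption that an `attached` target that is no longer physically present is what gets reported as disconnected; `Target` omits `creds` and `policy` for brevity, and `Step`, `Plan`, and the verb strings are hypothetical.

```go
package main

// Target is one manifest entry (creds and policy omitted for brevity).
type Target struct {
	Type      string // "local-dir" | "usb" | "nfs" | "cifs" | "pbs"
	DurableID string // UUID, server:export, or repo+fingerprint
	Class     string // "fast" | "slow"
	Role      string // "primary" | "vzdump-target" | "pbs-offsite" | "bulk-data"
	State     string // "attached" | "disconnected" | "decommissioned"
}

// Step is one action the reconciler should take for a target.
type Step struct {
	Verb      string // "mount" or "report-disconnected"
	DurableID string
}

// Plan compares the manifest with what the host currently sees.
// present: durable ids the host can reach; mounted: durable ids already mounted.
func Plan(manifest []Target, present, mounted map[string]bool) []Step {
	var steps []Step
	for _, t := range manifest {
		if t.State != "attached" {
			continue // decommissioned and already-disconnected targets are left alone
		}
		switch {
		case !present[t.DurableID]:
			steps = append(steps, Step{"report-disconnected", t.DurableID})
		case !mounted[t.DurableID]:
			steps = append(steps, Step{"mount", t.DurableID})
		}
	}
	return steps
}
```

Keeping the plan free of side effects means the same function can serve both the periodic reconcile loop and a faster storage-watchdog tick.
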
**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.
A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
(verified, `phase3-findings.md` B2). Proven recipe: attach
`-mpN <storage>:<size>,mp=/mnt/bulk,backup=0`, then
`docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol>` (or a
compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The
**DR consequence** of excluding bulk is covered in §8.
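
The placement rule and the mount-point recipe can be captured in two small helpers. This is a sketch only: the kind names `hot` and `bulk` follow the prose above, the option string follows the quoted recipe, and `CheckPlacement`, `BulkMountOption`, and the example storage name `local-lvm` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckPlacement enforces "hot volumes go to a fast target"; bulk volumes
// may live on either class.
func CheckPlacement(kind, class string) error {
	if class != "fast" && class != "slow" {
		return fmt.Errorf("unknown target class %q", class)
	}
	switch kind {
	case "hot":
		if class != "fast" {
			return fmt.Errorf("hot volume must be placed on a fast target, got %q", class)
		}
		return nil
	case "bulk":
		return nil
	default:
		return fmt.Errorf("unknown volume kind %q", kind)
	}
}

// BulkMountOption renders the value for the -mpN option from the recipe:
// <storage>:<size>,mp=<path>,backup=0. Size is in GiB.
func BulkMountOption(storage string, sizeGiB int, guestPath string) (string, error) {
	if storage == "" || sizeGiB <= 0 || !strings.HasPrefix(guestPath, "/") {
		return "", fmt.Errorf("invalid bulk mount spec")
	}
	return fmt.Sprintf("%s:%d,mp=%s,backup=0", storage, sizeGiB, guestPath), nil
}
```

For example, `BulkMountOption("local-lvm", 50, "/mnt/bulk")` yields `local-lvm:50,mp=/mnt/bulk,backup=0`, which always carries `backup=0` so the exclusion cannot be forgotten at a call site.
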
**Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
`IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
`StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
reconnect, not a host concern).
## 8. Backup/restore orchestration
Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).
- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze
(`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup
is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack →
`POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is
crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5).
Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return.
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
`bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
`policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
*same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
never a silent loss.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
*and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** this closes out the "tested restore is the critical gap" theme. The
agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
restore-tested by the box (the operator lacks the key) — so this lives in the agent by
necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
as the lighter check.
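
The "poll the task, never trust the POST return" rule from the quiescing bullet can be sketched as a small wait loop. The fetch function is injected so the sketch stays independent of any HTTP client; `TaskStatus`, `StatusFunc`, and `WaitTask` are hypothetical names, and treating every exit string other than `OK` as a failure is a simplifying assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrTimeout = errors.New("task did not finish within the poll budget")

// TaskStatus holds the two task fields the agent cares about.
type TaskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // meaningful once stopped; "OK" on success
}

// StatusFunc fetches the current status of one Proxmox task by its UPID.
type StatusFunc func(upid string) (TaskStatus, error)

// WaitTask polls until the task has stopped, then judges it by exit status.
// The POST that started the task only proves it was accepted, not that it worked.
func WaitTask(fetch StatusFunc, upid string, interval time.Duration, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		st, err := fetch(upid)
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus == "OK" {
				return nil
			}
			return fmt.Errorf("task %s failed: %s", upid, st.ExitStatus)
		}
		time.Sleep(interval)
	}
	return ErrTimeout
}
```

A bounded poll count keeps a wedged task from blocking the per-guest queue forever; the journal (§10) can then decide whether to resume or roll back.
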
### 8a. PBS recovery-code escrow (zero-knowledge offsite-key recovery)
The DR substrate is the PBS offsite tier, and it is client-side encrypted (zero-knowledge): if the
box dies, restoring the offsite backups requires the **PBS client encryption key `K`**, which died
with the box. The escrow is how `K` comes back **without** Felhom ever being able to read customer
data. Design (decisions, with the rationale that pins them):
- **Live key unencrypted on the box** (`0600`, root): the agent backs up *and* runs restore-tests
unattended — no passphrase prompt on the management path. The privilege concentration this
implies is the whole argument for §3 root-minimization + a small auditable agent.
- **Wrap mechanism — PBS-native, not custom crypto.** At enrollment the agent generates a
high-entropy **recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`**
using PBS's own key passphrase KDF (`proxmox-backup-client key` family). *Decision: lean on PBS's
documented, battle-tested key+passphrase path; do not roll a bespoke AEAD wrap.* Host/customer
binding is provided at the hub-storage layer (blob keyed by host-id), not by custom crypto.
- **Agent-side generation.** `R` is generated **on the box** (it already holds `K` and does the
wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction.
- **Escrow = the `R`-wrapped blob → hub.** The hub stores opaque ciphertext bound to the
host/customer. Without `R` it is undecryptable; the operator cannot read customer data. (Hub-side
storage schema for the blob is a slice-10 / doc-05 item.)
- **Recovery code custody.** `R` is shown to the customer **once** at enrollment (printed/displayed)
and **never stored by Felhom in recoverable form**. Format: a grouped/word-list code (≥128-bit
entropy) — it is transcribed off paper by a non-technical household, so raw base32 invites typos.
- **Consumption (slice 10, host-loss).** New box re-enrolls in restore mode → hub ships the escrow
blob → customer enters `R` → box unwraps `K` → PBS restores proceed.
- **Optional belt-and-suspenders (product decision, default OFF).** A PBS **paperkey** (the raw key,
for a safe) gives the customer a recovery path that survives *both* box loss *and* recovery-code
loss, at the cost of a higher-value secret (raw key on paper, no second factor). Default is
hub-escrow + `R` only; offer the paperkey as an opt-in "advanced" path.
**Properties stated for honesty (these go to the customer at enrollment):**
- **Irreducible residual:** losing `R` *and* the box (and, if not opted in, having no paperkey) =
the offsite backups are **unrecoverable, by anyone, including Felhom.** This is the cost of
genuine zero-knowledge and must be communicated, not buried.
- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows the customer a
new code) but does **not** re-encrypt existing PBS data — that data stays keyed by `K`. Changing
`K` itself is a separate, heavier operation (new key → new backups; old backups still need old
`K`) and is out of scope for routine recovery-code rotation.
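
Generating `R` on the box could look like the sketch below. It is not the decided format: the doc leaves the choice between a grouped code and a word list open, and this sketch swaps in a grouped code over the Crockford base32 alphabet (no `I`, `L`, `O`, `U`) with 160 bits of entropy, plus an input normalizer that forgives the common transcription slips. All names are hypothetical, and the wrapping of `K` itself is left to PBS as described above.

```go
package main

import (
	"crypto/rand"
	"errors"
	"strings"
)

// Crockford base32 alphabet: digits plus letters without I, L, O, U.
const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// NewRecoveryCode draws 160 bits from crypto/rand and renders them as
// eight hyphen-separated groups of four symbols.
func NewRecoveryCode() (string, error) {
	raw := make([]byte, 20)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return encode(raw), nil
}

// encode emits 5 bits per symbol, most significant bits first.
func encode(raw []byte) string {
	var sb strings.Builder
	var acc uint64 // bit accumulator; the low `bits` bits are not yet emitted
	bits, n := 0, 0
	for _, b := range raw {
		acc = acc<<8 | uint64(b)
		bits += 8
		for bits >= 5 {
			bits -= 5
			if n > 0 && n%4 == 0 {
				sb.WriteByte('-')
			}
			sb.WriteByte(alphabet[(acc>>uint(bits))&31])
			n++
		}
	}
	return sb.String()
}

// Normalize cleans up a code typed from paper: case, separators, and the
// O/0 and I/L/1 confusions are forgiven; anything else is an error.
func Normalize(input string) (string, error) {
	var sb strings.Builder
	for _, r := range strings.ToUpper(input) {
		switch r {
		case '-', ' ':
			continue
		case 'O':
			r = '0'
		case 'I', 'L':
			r = '1'
		}
		if !strings.ContainsRune(alphabet, r) {
			return "", errors.New("recovery code contains an invalid character")
		}
		sb.WriteRune(r)
	}
	if sb.Len() != 32 {
		return "", errors.New("recovery code has the wrong length")
	}
	return sb.String(), nil
}
```

The normalized 32-symbol string, not the displayed form, would be what is fed to the passphrase KDF, so spacing and case never affect whether the customer can unlock the escrow.
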
## 9. Provisioning & DR flows
**Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3, empirically:
a token `vzrestore` of a keyctl archive produced a guest that kept `features:
nesting=1,keyctl=1,unprivileged:1`), so the agent provisions **by restoring a golden base
image**, never by `pct create` on the per-customer path.
**Golden base image.** A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`,
overlayfs — is built once as `root@pam` **at enrollment** (when the agent legitimately holds root
to mint its Proxmox token) and refreshed on a maintenance cadence. This is the one place
`keyctl`/root provisioning lives — off the per-customer path. Refresh cadence + fleet versioning
remain an operational open item (§13).
**Unified bring-up primitive (shared *front half*, NOT shared identity policy).** Provisioning
and DR-restore share one token-covered front-half code path:
> restore an archive → **reset identity** → size the guest (CPU/mem config + `pct resize`
> rootfs, token-covered) → attach storage mounts per the manifest

It runs as a **journaled reconcile job**; a mid-flight failure is rolled back by compensation
(destroy the just-restored guest, which §4 allows unsigned under same-transaction provenance).
The two flows diverge in the *archive* and the *back half*, **and in identity policy** (below).
**Identity reset is scenario-specific — this is a correctness boundary, not a detail.** "Reset
identity" is shorthand for two different operations:
- **Provision (golden base) → fresh identity, everything.** A provisioned guest is new: regenerate
MAC, hostname, **`/etc/machine-id`** (a duplicate breaks journald/DHCP/systemd), **SSH host
keys**, and it receives a **fresh** controller identity (host-id, local token, hub channel),
**fresh restic repo identity**, and a fresh tunnel association — all minted in the back half.
- **Guest-loss DR (customer backup) → preserve continuity identity, reset only what would
collide.** The restored guest must *continue* the customer's world: **keep** the restic repo
identity (resetting it orphans the existing backup chain — a silent data-continuity bug), the
tunnel/DNS association, and the hub host/customer binding. Reset only collision-prone host-local
identity (`machine-id`, SSH host keys, hostname as needed). **MAC is reset only when a source
guest may still be live** (e.g. partial loss, or the restore-*test* which boots link-down for
exactly this reason); in a true total guest-loss the original is gone, so the MAC can be kept to
preserve DHCP reservations. The agent decides MAC handling from the scenario, not a fixed rule.
The exact reset set is being pinned empirically by the slice-7 bring-up spike (live, link-up,
which the slice-6 restore-test never did — it boots link-down precisely because identity reset is
slice 7).
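
The scenario split can be written down as a policy function that returns which identity items to reset. This is a sketch of the policy as stated here, not the final field list, which the slice-7 spike is still pinning; `Scenario`, `ResetSet`, and `IdentityPolicy` are hypothetical names, and tying the hostname to `sourceMayBeLive` is a placeholder for "as needed".

```go
package main

// Scenario selects which identity policy applies to a bring-up.
type Scenario int

const (
	Provision   Scenario = iota // restore of the golden base image
	GuestLossDR                 // restore of a customer backup
)

// ResetSet lists which identity items get regenerated (true) or kept (false).
type ResetSet struct {
	MachineID          bool
	SSHHostKeys        bool
	Hostname           bool
	MAC                bool
	ControllerIdentity bool // host-id, local token, hub channel
	ResticRepo         bool
	TunnelAssociation  bool
}

// IdentityPolicy returns the reset set for a scenario. sourceMayBeLive is
// true for partial loss or a restore-test, where the original guest may
// still be on the network.
func IdentityPolicy(s Scenario, sourceMayBeLive bool) ResetSet {
	if s == Provision {
		// A provisioned guest is new: everything is fresh.
		return ResetSet{true, true, true, true, true, true, true}
	}
	// DR: preserve continuity identity, reset only what would collide.
	return ResetSet{
		MachineID:   true,
		SSHHostKeys: true,
		Hostname:    sourceMayBeLive, // placeholder for "as needed"
		MAC:         sourceMayBeLive, // keep the MAC on total loss for DHCP reservations
	}
}
```

Leaving `ResticRepo` at its zero value in the DR branch is deliberate: forgetting to set a field then errs on the side of preserving the backup chain.
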
**Guest loss (slice 7).** Agent restores G from the fastest surviving tier (snapshot → local →
PBS) and applies the **DR identity policy** above so the restored guest rejoins cleanly. The
customer backup already contains the controller + data, so there is **no controller deploy** in
this path — bring up + reattach external storage and it is whole. This is fully in slice 7.
### Slice mapping (what is built where — keep this current)
| Capability | Slice | Status |
|---|---|---|
| Golden base image build (root@pam, at enrollment) | **7** | spike → build |
| Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | spike → spec → implement |
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | in scope |
| PBS recovery-code escrow **creation** (§8a) | **7** | designed (§8a); implement |
| Provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8** | deferred — needs the controller-deploy path + agent↔controller local API (§6) |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
**Host/hardware loss (design intent — slice 10).** Re-enroll the new host in **restore mode**;
the hub — the durable source of truth that survives box death — hands the new agent the existing
identity, PBS namespace, tunnel token, storage manifest, a restore directive, and the **escrow
blob** (§8a) for the customer to unlock with their recovery code. Tunnel is reused from the hub
record, so DNS stays intact. This depends on hub desired-state serving (slice 10) and is not
buildable until then; recorded here so the front-half built in slice 7 lands ready for it.
## 10. Concurrency, crash-safety, idempotency
- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
ops on the same guest). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).
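
Per-guest serialization and run-once keys fit together in one small executor. The sketch below is in-memory only, so it does not cover the "across restarts" part, which would need the journal; recording a failed result under the key (so a failed job is not silently retried) is an assumption, and `Executor` and `Do` are hypothetical names.

```go
package main

import "sync"

// Executor serializes mutations per guest and remembers finished job keys.
type Executor struct {
	mu     sync.Mutex
	guests map[string]*sync.Mutex // one lock per guest id
	done   map[string]error       // idempotency key -> recorded result
}

func NewExecutor() *Executor {
	return &Executor{guests: map[string]*sync.Mutex{}, done: map[string]error{}}
}

// guestLock returns the lock for a guest, creating it on first use.
func (e *Executor) guestLock(id string) *sync.Mutex {
	e.mu.Lock()
	defer e.mu.Unlock()
	l, ok := e.guests[id]
	if !ok {
		l = &sync.Mutex{}
		e.guests[id] = l
	}
	return l
}

// Do runs op while holding the guest's lock, so operations on one guest never
// overlap while other guests proceed in parallel. A non-empty key makes the
// operation run-once: later calls return the recorded result without running op.
func (e *Executor) Do(guestID, key string, op func() error) error {
	l := e.guestLock(guestID)
	l.Lock()
	defer l.Unlock()

	if key != "" {
		e.mu.Lock()
		prev, ran := e.done[key]
		e.mu.Unlock()
		if ran {
			return prev
		}
	}
	err := op()
	if key != "" {
		e.mu.Lock()
		e.done[key] = err
		e.mu.Unlock()
	}
	return err
}
```

Reconcile steps would call `Do` with an empty key, while one-shot jobs pass the idempotency key from their signed payload.
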
## 11. Self-update
- **Agent (the hard case — a host service, no snapshot-rollback).** **A/B layout:** download →
verify signature → stage as the inactive slot → flip a `current → good|new` symlink → restart.
**Revert authority lives outside the swapped binary** (`Restart=always` alone just
crash-loops a bad binary), so a **separate health-gate** (a systemd oneshot `ExecStartPost`
probe, or a tiny supervisor unit) flips `current` back to last-good and restarts on a failed
health window. The new version is **committed as "good" only after a clean health window**.
Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
so the **agent updates the controller**: snapshot-before-update (free rollback, because the
controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
on failure. This resolves the Part-2 `selfupdate/` open: the controller is **agent-managed**,
not self-updating; the controller's old self-update path is removed.
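
The agent's A/B flip and the health-gate decision can be sketched separately from any systemd wiring. This is a minimal sketch, assuming slots are directories and `current` is a symlink on the same filesystem; `FlipCurrent`, `Slots`, and `AfterHealthWindow` are hypothetical names, and the gate logic would run in the external health-gate, not in the binary being swapped.

```go
package main

import "os"

// FlipCurrent repoints the `current` symlink at slotDir. The new link is
// created under a temporary name and renamed over the old one, so readers
// see either the old target or the new one, never a missing link.
func FlipCurrent(currentLink, slotDir string) error {
	tmp := currentLink + ".tmp"
	_ = os.Remove(tmp) // clear a leftover from an interrupted earlier flip
	if err := os.Symlink(slotDir, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, currentLink)
}

// Slots tracks which slot is live and which one last proved itself.
type Slots struct {
	Current  string // slot the `current` symlink points at
	LastGood string // slot that last passed a clean health window
}

// AfterHealthWindow is the health-gate decision. It returns the new state and
// whether the gate must flip `current` back and restart the agent.
func AfterHealthWindow(s Slots, healthy bool) (Slots, bool) {
	if healthy {
		s.LastGood = s.Current // commit the new version as good
		return s, false
	}
	if s.Current == s.LastGood {
		return s, false // nothing older to revert to; this needs an alert instead
	}
	s.Current = s.LastGood
	return s, true
}
```

When the second return value is true, the gate would call `FlipCurrent` with the last-good slot and then restart the service.
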
## 12. Secrets at rest on the host
The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here — which is the whole
argument for §3's root-minimization and a small, auditable agent.
## 13. Open items / what this unblocks
Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, the
**provision/DR slice boundary** (7 front-half + guest-loss DR + escrow creation; 8 provisioning
back-half; 10 host-loss DR + escrow consumption — §9 table), the **PBS recovery-code escrow
design** (§8a), and the **root-vs-API boundary** (Phase 3, B3).
Still open:
- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
- **Golden base image** refresh cadence + fleet versioning — operational, non-blocking (§9).
- **Identity-reset set** (live, link-up) — pinned empirically by the slice-7 bring-up spike; the
scenario-specific policy is settled in §9, the exact field list is the spike's deliverable.
- **Hub-side escrow storage + restore-mode serving** — the blob's hub schema and the restore-mode
desired-state handover are slice-10 / doc-05 (§8a, §9 host-loss).
This doc hands the implementation three contracts it was waiting on:
1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§8) → the destination for the app-data-backup package extracted in the Part-2 refactor.
---
## Changelog
### Slice-7 scope + escrow design (2026-06-09)
- §9 rewritten: the bring-up primitive is a **shared front half only**; identity-reset policy is
**scenario-specific** (provision = fresh everything; guest-loss DR = preserve restic/tunnel/hub
continuity identity, reset only collision-prone host-local identity). Added the **slice 7/8/10
mapping table** (front half + guest-loss DR + escrow creation in 7; provisioning back-half in 8;
host-loss DR + escrow consumption in 10).
- NEW §8a, **PBS recovery-code escrow**: live key unencrypted on box for unattended ops; agent
generates recovery code `R`; PBS-native passphrase-wrap of `K` under `R` escrowed to the hub
(zero-knowledge); consumption is slice 10. Irreducible-residual + rotation≠key-rotation stated.
- §13 updated accordingly.
### Design-review + Phase-3 fold-in (2026-06-08)
- **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image**
(token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified
restore primitive shared with DR. §2 responsibility + §3 boundary updated.
- **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator
role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART).
- **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is
agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
- **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence**
(excluded bulk needs its own backup decision).
- **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is
crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority;
controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing.
**S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id".
**§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open.