hub: opaque PBS recovery-code escrow storage (v0.8.0) + doc 03 §8a posture model

Slice-7 close-out (hub half). PUT /api/v1/hosts/{host_id}/escrow (per-host key)
stores the agent's OPAQUE R-wrapped blob verbatim against the host; the hub never
decrypts it (no recovery code, no decrypt path). host_escrow table + Save/GetHostEscrow.
Tests: verbatim store, rotation last-write-wins, 401/403/400 auth+body, wire contract.

doc 03 §8a rewritten into the key-custody posture model: separation principle,
topology matrix, default + anti-lockout ladder, SSH-vs-key, breach/legal, integrity
caveat. Corrected: hub opaque storage is slice 7 (this task); serving is slice 10.
Slice table + §13 updated.

No secrets committed (R/K never appear; spike findings + docs use placeholders).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 07:46:33 +02:00
parent fe7d0850a5
commit 7eb3772000
6 changed files with 372 additions and 72 deletions
+76 -35
@@ -192,44 +192,84 @@ per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a ba
necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
as the lighter check.
### 8a. PBS recovery-code escrow (zero-knowledge offsite-key recovery)
### 8a. PBS recovery-code escrow + the key-custody posture model (zero-knowledge offsite-key recovery)
The DR substrate is the PBS offsite tier, and it is client-side encrypted (zero-knowledge): if the
box dies, restoring the offsite backups requires the **PBS client encryption key `K`**, which died
with the box. The escrow is how `K` comes back **without** Felhom ever being able to read customer
data. Design (decisions, with the rationale that pins them):
The DR substrate is the PBS offsite tier, client-side encrypted (zero-knowledge): if the box dies,
restoring the offsite backups requires the **PBS client encryption key `K`**, which died with the
box. The escrow is how `K` comes back **without** Felhom ever being able to read customer data.
**Status: implemented** — escrow *creation* (agent v0.9.0, `internal/escrow`) + hub *opaque storage*
(hub v0.8.0, `PUT /api/v1/hosts/{host_id}/escrow`). Validated end-to-end on a throwaway in
`documentation/tests/slice7-escrow-spike-findings.md`. Restore-mode *serving/consumption* is slice 10.
#### The separation principle (the rule that governs every posture)
Reading customer data needs **BOTH** the encrypted chunks **AND** a usable key. **Zero-knowledge
holds for exactly as long as Felhom never holds both at once.** Every posture below is just a
choice about where the data and the key live; the principle decides who can read.
#### Topology matrix (data location × key custody → who can read)
| Data location | Key custody | Who can read | Notes |
|---|---|---|---|
| **Felhom storage** | customer-only key | **only the customer** | **the DEFAULT** — genuine zero-knowledge |
| **Felhom storage** | Felhom also holds a key | **Felhom can read** | the one dangerous cell — explicit, informed opt-in only; never default, never silent |
| Customer's own offsite | customer key | only the customer | self-hosted data; Felhom holds neither data nor key |
| Customer's own offsite | Felhom holds a key | only the customer | safe by separation (key and data never co-located at Felhom) |
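
The matrix reduces to a single predicate: Felhom can read only when it holds both the data and a usable key. A minimal Go sketch of that rule follows; the `Posture` type and its field names are hypothetical illustrations, not identifiers from the hub or agent code.

```go
package main

import "fmt"

// Posture is a hypothetical type used only to illustrate the matrix above.
type Posture struct {
	DataAtFelhom bool // encrypted chunks live on Felhom storage
	KeyAtFelhom  bool // Felhom holds a usable key
}

// FelhomCanRead encodes the separation principle:
// reading customer data needs BOTH the chunks AND a key.
func (p Posture) FelhomCanRead() bool {
	return p.DataAtFelhom && p.KeyAtFelhom
}

func main() {
	// Enumerate the four cells of the topology matrix.
	for _, p := range []Posture{
		{DataAtFelhom: true, KeyAtFelhom: false}, // the default
		{DataAtFelhom: true, KeyAtFelhom: true},  // the one dangerous cell
		{DataAtFelhom: false, KeyAtFelhom: false},
		{DataAtFelhom: false, KeyAtFelhom: true},
	} {
		fmt.Printf("data@Felhom=%-5t key@Felhom=%-5t FelhomCanRead=%t\n",
			p.DataAtFelhom, p.KeyAtFelhom, p.FelhomCanRead())
	}
}
```

Only the second cell evaluates to true, which is why that posture requires explicit opt-in.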
#### The escrow mechanism (decisions + the rationale that pins them)
- **Live key unencrypted on the box** (`0600`, root): the agent backs up *and* runs restore-tests
unattended — no passphrase prompt on the management path. The privilege concentration this
implies is the whole argument for §3 root-minimization + a small auditable agent.
- **Wrap mechanism — PBS-native, not custom crypto.** At enrollment the agent generates a
high-entropy **recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`**
using PBS's own key passphrase KDF (`proxmox-backup-client key` family). *Decision: lean on PBS's
documented, battle-tested key+passphrase path; do not roll a bespoke AEAD wrap.* Host/customer
binding is provided at the hub-storage layer (blob keyed by host-id), not by custom crypto.
unattended — no passphrase prompt on the management path. The privilege concentration this implies
is the whole argument for §3 root-minimization + a small auditable agent.
- **Wrap — PBS-native, not custom crypto.** At enrollment the agent generates a high-entropy
**recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`** via PBS's own
key passphrase KDF (`proxmox-backup-client key change-passphrase --kdf scrypt`; no bespoke AEAD).
The spike pinned two implementation constraints: that command is **TTY-only** (drive it over a
pty), and the pty **echoes the passphrase** (discard the pty output so `R` can't leak) — F-A1/F-A2.
- **Agent-side generation.** `R` is generated **on the box** (it already holds `K` and does the
wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction.
- **Escrow = the `R`-wrapped blob → hub.** The hub stores opaque ciphertext bound to the
host/customer. Without `R` it is undecryptable; the operator cannot read customer data. (Hub-side
storage schema for the blob is a slice-10 / doc-05 item.)
- **Recovery code custody.** `R` is shown to the customer **once** at enrollment (printed/displayed)
and **never stored by Felhom in recoverable form**. Format: a grouped/word-list code (≥128-bit
entropy) — it is transcribed off paper by a non-technical household, so raw base32 invites typos.
- **Consumption (slice 10, host-loss).** New box re-enrolls in restore mode → hub ships the escrow
blob → customer enters `R` → box unwraps `K` → PBS restores proceed.
- **Optional belt-and-suspenders (product decision, default OFF).** A PBS **paperkey** (the raw key,
for a safe) gives the customer a recovery path that survives *both* box loss *and* recovery-code
loss, at the cost of a higher-value secret (raw key on paper, no second factor). Default is
hub-escrow + `R` only; offer the paperkey as an opt-in "advanced" path.
wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction. `R` is
≥128 bits, **word-list form** (EFF large wordlist, 10 words ≈ 129 bits) for off-paper transcription.
- **Self-verify before shipping.** Creation unwraps a copy of the blob with `R` and checks the key
fingerprint matches — "an escrow you haven't recovered isn't an escrow."
- **Escrow = the `R`-wrapped blob → hub (opaque storage, slice 7).** The hub stores the ciphertext
bytes against the host record and **never decrypts them** (it has no `R`; there is no decrypt
path). Per-host-key authed; rotation is last-write-wins. **Restore-mode serving is slice 10.**
- **Recovery code custody.** `R` is surfaced to the customer **exactly once** at enrollment
(printed/displayed) and **never stored by Felhom in any recoverable form**.
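
The "10 words ≈ 129 bits" figure for `R` can be checked with a short Go sketch. It assumes `R` is built from words drawn uniformly from the 7776-entry EFF large wordlist. The function names are hypothetical and the wordlist itself is not embedded, so the sketch only draws indices; it is not the agent's actual `internal/escrow` code.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math"
	"math/big"
)

// wordlistSize is the EFF large wordlist length (6^5 = 7776 entries).
const wordlistSize = 7776

// EntropyBits returns the entropy of `words` words drawn uniformly
// and independently from a list of `listSize` entries.
func EntropyBits(words, listSize int) float64 {
	return float64(words) * math.Log2(float64(listSize))
}

// DrawIndices picks n uniform wordlist indices from the OS CSPRNG.
// Mapping indices to words is omitted (assumption: list loaded elsewhere).
func DrawIndices(n, listSize int) ([]int, error) {
	bound := big.NewInt(int64(listSize))
	out := make([]int, n)
	for i := range out {
		v, err := rand.Int(rand.Reader, bound) // uniform in [0, listSize)
		if err != nil {
			return nil, err
		}
		out[i] = int(v.Int64())
	}
	return out, nil
}

func main() {
	fmt.Printf("10 words: %.1f bits\n", EntropyBits(10, wordlistSize))
	fmt.Printf(" 9 words: %.1f bits\n", EntropyBits(9, wordlistSize))

	idx, err := DrawIndices(10, wordlistSize)
	if err != nil {
		panic(err)
	}
	// Never log a real recovery code; only the count is printed here.
	fmt.Println("drew", len(idx), "indices")
}
```

Nine words fall short of the 128-bit floor, so ten is the minimum length that meets it with this list.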
**Properties stated for honesty (these go to the customer at enrollment):**
#### Default posture + the anti-lockout ladder (opt-in, increasing trust)
**Default:** *Felhom storage + customer-only key*, and **`R` is delivered durably (printed) always**
— note this is distinct from a raw-key paperkey: `R` is a safe two-factor *passphrase* (useless
without the hub's blob); the raw key is the footgun. The ladder trades resilience for trust:
- **(b) `R`-wrapped offline copy** — the same two-factor blob, for the customer to print/store. **No
extra trust**; resilience if the hub ever vanishes (still needs `R`). *Implemented (opt-in).*
- **(a) raw paperkey** — `proxmox-backup-client key paperkey` of the unwrapped key, for a safe.
Covers **losing `R`**, but it is **single-factor and unrevocable**. *Implemented (opt-in, loud
caveat).*
- **Felhom-holds-a-key** — maximum convenience, but **gives up zero-knowledge** (the dangerous
matrix cell). **Not implemented** — it needs a separate Felhom-side secure key store + explicit
opt-in UX, built only when a customer asks.
#### SSH-for-support is a SEPARATE grant — deliberately not coupled to key custody
Support access (active / consented / observable — customer-toggleable, commands shown) is **not**
the same as a standing / passive / invisible decryption capability. The transparency features prove
*controlled* support access **without Felhom holding a key**. Conflating the two is exactly the
mistake the separation principle prevents.
#### Why zero-knowledge stays the default (breach + legal)
Holding data **and** a key makes a single hub breach an **all-customer data leak**, and makes Felhom
**compellable** — a court can order what Felhom *can* produce. Genuine zero-knowledge means *"we
can't be forced to hand over what we can't read."* This is core to the sovereignty pitch, not a
nicety.
#### Honesty properties (stated to the customer at enrollment)
- **Irreducible residual:** losing `R` *and* the box (and, if not opted in, having no paperkey) =
the offsite backups are **unrecoverable, by anyone, including Felhom.** This is the cost of
genuine zero-knowledge and must be communicated, not buried.
- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows the customer a
new code) but does **not** re-encrypt existing PBS data — that data stays keyed by `K`. Changing
`K` itself is a separate, heavier operation (new key → new backups; old backups still need old
`K`) and is out of scope for routine recovery-code rotation.
the offsite backups are **unrecoverable, by anyone, including Felhom.** This is the cost of genuine
zero-knowledge, and it is communicated, not buried.
- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows a new code) but
does **not** re-encrypt existing PBS data — that stays keyed by `K`. Changing `K` itself is a
separate, heavier op (new key → new backups; old backups still need old `K`), out of scope for
routine recovery-code rotation.
- **Integrity caveat (self-hosted-data postures):** moving data to the customer's own offsite
**loses Felhom's backup guarantees** — no PBS verify / monitoring on storage we can't reach. An
honest signup-time tradeoff, not a hidden one.
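
The hub-side storage behaviour this section relies on (bytes kept verbatim per host, rotation as last-write-wins, no decrypt path) can be sketched as an in-memory store. This is an illustrative sketch only: `EscrowStore`, `Save`, and `Get` are hypothetical names, and the real hub persists to its `host_escrow` table behind per-host-key auth, neither of which is modelled here.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// EscrowStore is a hypothetical in-memory stand-in for the hub's escrow table.
// It only ever handles opaque bytes; there is deliberately no decrypt method.
type EscrowStore struct {
	mu    sync.Mutex
	blobs map[string][]byte // host ID -> R-wrapped blob, stored verbatim
}

func NewEscrowStore() *EscrowStore {
	return &EscrowStore{blobs: make(map[string][]byte)}
}

// Save stores a private copy of the blob; a later Save for the same
// host replaces the earlier one (last-write-wins rotation).
func (s *EscrowStore) Save(hostID string, blob []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blobs[hostID] = append([]byte(nil), blob...)
}

// Get returns a copy of the stored blob and whether one exists.
func (s *EscrowStore) Get(hostID string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.blobs[hostID]
	if !ok {
		return nil, false
	}
	return append([]byte(nil), b...), true
}

func main() {
	store := NewEscrowStore()
	store.Save("host-1", []byte("blob wrapped under old R"))
	store.Save("host-1", []byte("blob wrapped under new R")) // rotation of R

	got, _ := store.Get("host-1")
	fmt.Println(bytes.Equal(got, []byte("blob wrapped under new R")))
}
```

Rotating `R` changes only which wrapped blob the hub holds; `K`, and therefore the existing PBS data, is untouched.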
## 9. Provisioning & DR flows
@@ -295,7 +335,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit); golden archived at enrollment |
| Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | **implemented** (agent v0.8.0, `internal/reconcile/bringup.go`) |
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
| PBS recovery-code escrow **creation** (§8a) | **7** | designed (§8a); implement |
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
| Provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8** | deferred — needs the controller-deploy path + agent↔controller local API (§6) |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
@@ -359,8 +399,9 @@ Still open:
- **Golden base image** refresh cadence + fleet versioning — operational, non-blocking (§9).
- **Identity-reset set** (live, link-up) — pinned empirically by the slice-7 bring-up spike; the
scenario-specific policy is settled in §9, the exact field list is the spike's deliverable.
- **Hub-side escrow storage + restore-mode serving** — the blob's hub schema and the restore-mode
desired-state handover are slice-10 / doc-05 (§8a, §9 host-loss).
- **Escrow restore-mode serving / consumption** — handing the opaque blob back to a re-enrolling
box and unwrapping `K` with `R` is slice-10 / doc-05 (§8a, §9 host-loss). *Escrow creation + hub
opaque storage are done (slice 7).*
This doc hands the implementation three contracts it was waiting on: