hub: opaque PBS recovery-code escrow storage (v0.8.0) + doc 03 §8a posture model

Slice-7 close-out (hub half). PUT /api/v1/hosts/{host_id}/escrow (per-host key)
stores the agent's OPAQUE R-wrapped blob verbatim against the host; the hub never
decrypts it (no recovery code, no decrypt path). host_escrow table + Save/GetHostEscrow.
Tests: verbatim store, rotation last-write-wins, 401/403/400 auth+body, wire contract.

doc 03 §8a rewritten into the key-custody posture model: separation principle,
topology matrix, default + anti-lockout ladder, SSH-vs-key, breach/legal, integrity
caveat. Corrected: hub opaque storage is slice 7 (this task); serving is slice 10.
Slice table + §13 updated.

No secrets committed (R/K never appear; spike findings + docs use placeholders).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -192,44 +192,84 @@ per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a ba
 necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
 as the lighter check.

-### 8a. PBS recovery-code escrow (zero-knowledge offsite-key recovery)
+### 8a. PBS recovery-code escrow + the key-custody posture model (zero-knowledge offsite-key recovery)

-The DR substrate is the PBS offsite tier, and it is client-side encrypted (zero-knowledge): if the
-box dies, restoring the offsite backups requires the **PBS client encryption key `K`**, which died
-with the box. The escrow is how `K` comes back **without** Felhom ever being able to read customer
-data. Design (decisions, with the rationale that pins them):
+The DR substrate is the PBS offsite tier, client-side encrypted (zero-knowledge): if the box dies,
+restoring the offsite backups requires the **PBS client encryption key `K`**, which died with the
+box. The escrow is how `K` comes back **without** Felhom ever being able to read customer data.
+**Status: implemented** — escrow *creation* (agent v0.9.0, `internal/escrow`) + hub *opaque storage*
+(hub v0.8.0, `PUT /api/v1/hosts/{host_id}/escrow`). Validated end-to-end on a throwaway in
+`documentation/tests/slice7-escrow-spike-findings.md`. Restore-mode *serving/consumption* is slice 10.
+
+#### The separation principle (the rule that governs every posture)
+Reading customer data needs **BOTH** the encrypted chunks **AND** a usable key. **Zero-knowledge
+holds for exactly as long as Felhom never holds both at once.** Every posture below is just a
+choice about where the data and the key live; the principle decides who can read.
+
+#### Topology matrix (data location × key custody → who can read)
+| Data location | Key custody | Who can read | Notes |
+|---|---|---|---|
+| **Felhom storage** | customer-only key | **only the customer** | **the DEFAULT** — genuine zero-knowledge |
+| **Felhom storage** | Felhom also holds a key | **Felhom can read** | the one dangerous cell — explicit, informed opt-in only; never default, never silent |
+| Customer's own offsite | customer key | only the customer | self-hosted data; key XOR data |
+| Customer's own offsite | Felhom holds a key | only the customer | safe by separation (key and data never co-located at Felhom) |
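
The matrix above reduces to a single predicate: Felhom can read only when it holds both the data and a key. A minimal Python sketch of that rule follows; the enum and function names are hypothetical and exist only for illustration, they are not taken from the hub or agent code.

```python
from enum import Enum
from itertools import product


class DataLocation(Enum):
    FELHOM_STORAGE = "Felhom storage"
    CUSTOMER_OFFSITE = "Customer's own offsite"


def felhom_can_read(data: DataLocation, felhom_holds_key: bool) -> bool:
    """Separation principle: reading needs BOTH the chunks AND a usable key."""
    felhom_holds_data = data is DataLocation.FELHOM_STORAGE
    return felhom_holds_data and felhom_holds_key


def who_can_read(data: DataLocation, felhom_holds_key: bool) -> str:
    # The customer holds a key in every posture, so the customer can always read.
    if felhom_can_read(data, felhom_holds_key):
        return "Felhom can read"
    return "only the customer"


if __name__ == "__main__":
    # Enumerate the four cells of the topology matrix.
    for data, held in product(DataLocation, (False, True)):
        print(f"{data.value:24} felhom_key={held!s:5} -> {who_can_read(data, held)}")
```

Only one of the four combinations makes `felhom_can_read` true, which is the "dangerous cell" the table calls out.
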
+
+#### The escrow mechanism (decisions + the rationale that pins them)
 - **Live key unencrypted on the box** (`0600`, root): the agent backs up *and* runs restore-tests
-unattended — no passphrase prompt on the management path. The privilege concentration this
-implies is the whole argument for §3 root-minimization + a small auditable agent.
-- **Wrap mechanism — PBS-native, not custom crypto.** At enrollment the agent generates a
-high-entropy **recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`**
-using PBS's own key passphrase KDF (`proxmox-backup-client key` family). *Decision: lean on PBS's
-documented, battle-tested key+passphrase path; do not roll a bespoke AEAD wrap.* Host/customer
-binding is provided at the hub-storage layer (blob keyed by host-id), not by custom crypto.
+unattended — no passphrase prompt on the management path. The privilege concentration this implies
+is the whole argument for §3 root-minimization + a small auditable agent.
+- **Wrap — PBS-native, not custom crypto.** At enrollment the agent generates a high-entropy
+**recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`** via PBS's own
+key passphrase KDF (`proxmox-backup-client key change-passphrase --kdf scrypt`; no bespoke AEAD).
+The spike pinned two implementation constraints: that command is **TTY-only** (drive it over a
+pty), and the pty **echoes the passphrase** (discard the pty output so `R` can't leak) — F-A1/F-A2.
 - **Agent-side generation.** `R` is generated **on the box** (it already holds `K` and does the
-wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction.
-- **Escrow = the `R`-wrapped blob → hub.** The hub stores opaque ciphertext bound to the
-host/customer. Without `R` it is undecryptable; the operator cannot read customer data. (Hub-side
-storage schema for the blob is a slice-10 / doc-05 item.)
-- **Recovery code custody.** `R` is shown to the customer **once** at enrollment (printed/displayed)
-and **never stored by Felhom in recoverable form**. Format: a grouped/word-list code (≥128-bit
-entropy) — it is transcribed off paper by a non-technical household, so raw base32 invites typos.
-- **Consumption (slice 10, host-loss).** New box re-enrolls in restore mode → hub ships the escrow
-blob → customer enters `R` → box unwraps `K` → PBS restores proceed.
-- **Optional belt-and-suspenders (product decision, default OFF).** A PBS **paperkey** (the raw key,
-for a safe) gives the customer a recovery path that survives *both* box loss *and* recovery-code
-loss, at the cost of a higher-value secret (raw key on paper, no second factor). Default is
-hub-escrow + `R` only; offer the paperkey as an opt-in "advanced" path.
+wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction. `R` is
+≥128 bits, **word-list form** (EFF large wordlist, 10 words ≈ 129 bits) for off-paper transcription.
+- **Self-verify before shipping.** Creation unwraps a copy of the blob with `R` and checks the key
+fingerprint matches — "an escrow you haven't recovered isn't an escrow."
+- **Escrow = the `R`-wrapped blob → hub (opaque storage, slice 7).** The hub stores the ciphertext
+bytes against the host record and **never decrypts them** (it has no `R`; there is no decrypt
+path). Per-host-key authed; rotation is last-write-wins. **Restore-mode serving is slice 10.**
+- **Recovery code custody.** `R` is surfaced to the customer **exactly once** at enrollment
+(printed/displayed) and **never stored by Felhom in any recoverable form**.
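
The "10 words ≈ 129 bits" figure in the agent-side generation bullet follows from the EFF large wordlist having 7776 entries (five dice per word), so each word contributes log2(7776) ≈ 12.92 bits. The sketch below checks that arithmetic and shows one way a word-list code could be drawn with a CSPRNG. It is an illustration under stated assumptions, not the agent's `internal/escrow` code: the function names are hypothetical, the wordlist is passed in by the caller, and the demo list is a synthetic placeholder rather than the real EFF list.

```python
import math
import secrets

EFF_LARGE_WORDLIST_SIZE = 7776  # 6**5: five dice rolls select one word


def entropy_bits(num_words: int, wordlist_size: int = EFF_LARGE_WORDLIST_SIZE) -> float:
    """Entropy of a code made of uniformly chosen, independent words."""
    return num_words * math.log2(wordlist_size)


def min_words_for(bits: int, wordlist_size: int = EFF_LARGE_WORDLIST_SIZE) -> int:
    """Smallest word count whose entropy reaches the requested number of bits."""
    return math.ceil(bits / math.log2(wordlist_size))


def generate_recovery_code(wordlist: list, num_words: int = 10) -> str:
    """Draw words with the OS CSPRNG; space-separated for off-paper transcription."""
    if len(set(wordlist)) != len(wordlist):
        raise ValueError("wordlist must not contain duplicates")
    return " ".join(secrets.choice(wordlist) for _ in range(num_words))


if __name__ == "__main__":
    # Placeholder list with the same size as the EFF list (assumption: the real
    # list would be loaded from a vetted file shipped with the agent).
    demo_wordlist = [f"word{i:04d}" for i in range(EFF_LARGE_WORDLIST_SIZE)]
    print(min_words_for(128), round(entropy_bits(10), 2))
    print(generate_recovery_code(demo_wordlist))
```

Nine words would give about 116 bits, which is why ten is the smallest count that clears the ≥128-bit floor.
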

-**Properties stated for honesty (these go to the customer at enrollment):**
+#### Default posture + the anti-lockout ladder (opt-in, increasing trust)
+**Default:** *Felhom storage + customer-only key*, and **`R` is delivered durably (printed) always**
+— note this is distinct from a raw-key paperkey: `R` is a safe two-factor *passphrase* (useless
+without the hub's blob); the raw key is the footgun. The ladder trades resilience for trust:
+- **(b) `R`-wrapped offline copy** — the same two-factor blob, for the customer to print/store. **No
+extra trust**; resilience if the hub ever vanishes (still needs `R`). *Implemented (opt-in).*
+- **(a) raw paperkey** — `proxmox-backup-client key paperkey` of the unwrapped key, for a safe.
+Covers **losing `R`**, but it is **single-factor and unrevocable**. *Implemented (opt-in, loud
+caveat).*
+- **Felhom-holds-a-key** — maximum convenience, but **gives up zero-knowledge** (the dangerous
+matrix cell). **Not implemented** — it needs a separate Felhom-side secure key store + explicit
+opt-in UX, built only when a customer asks.
+
+#### SSH-for-support is a SEPARATE grant — deliberately not coupled to key custody
+Support access (active / consented / observable — customer-toggleable, commands shown) is **not**
+the same as a standing / passive / invisible decryption capability. The transparency features prove
+*controlled* support access **without Felhom holding a key**. Conflating the two is exactly the
+mistake the separation principle prevents.
+
+#### Why zero-knowledge stays the default (breach + legal)
+Holding data **and** a key makes a single hub breach an **all-customer data leak**, and makes Felhom
+**compellable** — a court can order what Felhom *can* produce. Genuine zero-knowledge means *"we
+can't be forced to hand over what we can't read."* This is core to the sovereignty pitch, not a
+nicety.

+#### Honesty properties (stated to the customer at enrollment)
 - **Irreducible residual:** losing `R` *and* the box (and, if not opted in, having no paperkey) =
-the offsite backups are **unrecoverable, by anyone, including Felhom.** This is the cost of
-genuine zero-knowledge and must be communicated, not buried.
-- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows the customer a
-new code) but does **not** re-encrypt existing PBS data — that data stays keyed by `K`. Changing
-`K` itself is a separate, heavier operation (new key → new backups; old backups still need old
-`K`) and is out of scope for routine recovery-code rotation.
+the offsite backups are **unrecoverable, by anyone, including Felhom.** The cost of genuine
+zero-knowledge — communicated, not buried.
+- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows a new code) but
+does **not** re-encrypt existing PBS data — that stays keyed by `K`. Changing `K` itself is a
+separate, heavier op (new key → new backups; old backups still need old `K`), out of scope for
+routine recovery-code rotation.
+- **Integrity caveat (self-hosted-data postures):** moving data to the customer's own offsite
+**loses Felhom's backup guarantees** — no PBS verify / monitoring on storage we can't reach. An
+honest signup-time tradeoff, not a hidden one.
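
The anti-lockout ladder and the irreducible-residual property together define which combinations of holdings still yield `K`. A small sketch of that decision, assuming hypothetical string labels for each holding (`box`, `R`, `hub_blob`, `offline_blob`, `paperkey`); it models the stated postures only and is not code from the agent or hub.

```python
def can_recover_k(have: set) -> bool:
    """Return True if the given holdings suffice to obtain the PBS key K."""
    if "box" in have:
        # The live key sits unencrypted on the box; nothing else is needed.
        return True
    if "paperkey" in have:
        # Ladder (a): the raw key on paper is single-factor by design.
        return True
    # Default escrow and ladder (b) are two-factor: a wrapped blob AND R.
    has_wrapped_blob = "hub_blob" in have or "offline_blob" in have
    return has_wrapped_blob and "R" in have


if __name__ == "__main__":
    scenarios = {
        "box lost, R kept, hub up": {"R", "hub_blob"},
        "box lost, R kept, hub gone, offline copy": {"R", "offline_blob"},
        "box lost, R lost, paperkey in a safe": {"hub_blob", "paperkey"},
        "box lost, R lost, no paperkey": {"hub_blob", "offline_blob"},
    }
    for label, holdings in scenarios.items():
        print(f"{label:45} -> recoverable={can_recover_k(holdings)}")
```

The last scenario is the irreducible residual: the blob alone, in any number of copies, recovers nothing.
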

 ## 9. Provisioning & DR flows

@@ -295,7 +335,7 @@ this path — bring up + reattach external storage and it is whole. This is full
 | Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit); golden archived at enrollment |
 | Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | **implemented** (agent v0.8.0, `internal/reconcile/bringup.go`) |
 | **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
-| PBS recovery-code escrow **creation** (§8a) | **7** | designed (§8a); implement |
+| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
 | Provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8** | deferred — needs the controller-deploy path + agent↔controller local API (§6) |
 | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
 | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
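
The hub half recorded in the table above (opaque storage, per-host-key auth, last-write-wins rotation, 401/403/400 on bad auth or body) can be modelled in a few lines. The sketch below is an in-memory stand-in, not the hub's handler or its `host_escrow` table. The class and method names are hypothetical, and two details are assumptions because the commit does not spell them out: which failure maps to 401 versus 403, and the 204 success status.

```python
import hmac
from dataclasses import dataclass, field


@dataclass
class EscrowStore:
    """In-memory model of opaque per-host escrow storage (illustrative only)."""
    host_keys: dict                      # host_id -> per-host API key
    blobs: dict = field(default_factory=dict)

    def put_escrow(self, host_id: str, presented_key: str, body: bytes) -> int:
        """Return an HTTP-style status code for PUT /hosts/{host_id}/escrow."""
        if not presented_key:
            return 401                   # assumption: no credential presented
        expected = self.host_keys.get(host_id)
        if expected is None or not hmac.compare_digest(
            expected.encode(), presented_key.encode()
        ):
            return 403                   # assumption: key is not this host's key
        if not body:
            return 400                   # empty body is rejected
        # Stored verbatim and never parsed: the store has no R and no decrypt path.
        self.blobs[host_id] = bytes(body)  # last write wins on rotation
        return 204                       # assumption: success status

    def get_escrow(self, host_id: str):
        return self.blobs.get(host_id)


if __name__ == "__main__":
    store = EscrowStore(host_keys={"host-a": "key-a", "host-b": "key-b"})
    print(store.put_escrow("host-a", "key-a", b"opaque-blob-v1"))
    print(store.put_escrow("host-a", "key-a", b"opaque-blob-v2"))
    print(store.get_escrow("host-a"))
```

The store treats the body purely as bytes, which is the property the commit's "verbatim store" test pins down.
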
@@ -359,8 +399,9 @@ Still open:
 - **Golden base image** refresh cadence + fleet versioning — operational, non-blocking (§9).
 - **Identity-reset set** (live, link-up) — pinned empirically by the slice-7 bring-up spike; the
 scenario-specific policy is settled in §9, the exact field list is the spike's deliverable.
-- **Hub-side escrow storage + restore-mode serving** — the blob's hub schema and the restore-mode
-desired-state handover are slice-10 / doc-05 (§8a, §9 host-loss).
+- **Escrow restore-mode serving / consumption** — handing the opaque blob back to a re-enrolling
+box and unwrapping `K` with `R` is slice-10 / doc-05 (§8a, §9 host-loss). *Escrow creation + hub
+opaque storage are done (slice 7).*

 This doc hands the implementation three contracts it was waiting on:
