felhom.eu/documentation/tests/slice10-escrow-consumption-spike-findings.md
admin f9af3243b9 docs: slice 10C escrow-consumption spike findings (GO)
Validated escrow consumption end-to-end on a genuinely key-less box against
the real felhom-spike datastore: recover K from (blob,R) via the real
escrow.Unwrap, restore REAL data (spike-lxc rootfs, 2.5G) with the recovered
key only, wrong-R fails closed (no plausible-but-wrong key), live K
byte-unchanged. Redacted (no R/K/secret). GO to spec 10C + build 10D.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 17:10:31 +02:00

# Slice 10 (10C core) — escrow CONSUMPTION: recover K + restore REAL data on a key-less box — Findings
**Host:** box1 = `demo-felhom` (192.168.0.162); PBS datastore `felhom-spike` on DooPlex (192.168.0.180). PVE 9.2.2 / Debian 13, `proxmox-backup-client` 4.2.0.
**Date:** 2026-06-10. **Driver:** SPIKE — validate escrow **consumption** (recover `K` from `(blob, R)` on a fresh, key-less box) and the **real-data** restore that proves it, before specing the hub Down-channel (10A) and DR orchestration (10D). This is the half the creation self-verify does **not** cover: a genuinely key-less box, real-data recovery (not just a fingerprint), and a load-bearing `R`.
> **REDACTED by policy.** No `K` value, no recovery code `R` value, no token secret appears here — only command *shapes*, blob *size/format*, fingerprint *match* (the truncated PBS-printed prefix, never key bytes), and `R` *structure* (10 EFF words / ~129 bits, not the value). `sha256(K-file)` was recorded **out-of-band** for the byte-unchanged proof and is not pasted here. The live `K` was operated on only as a **copy** and is **byte-unchanged**. The recovered `K` was **shredded** at teardown.
The spike drove the **real agent code** (`internal/escrow` `Create` / `Unwrap` / `KeyFingerprint`) via a throwaway harness (`cmd/escrow-spike`, **not committed**, removed at teardown) — so what passed here is exactly the code path 10C will wrap, not a re-implementation.
---
## 0. Setup (real datastore + real key + real encrypted backup)
- **Real `K`:** the live PBS client encryption key for storage `felhom-pbs` at `/etc/pve/priv/storage/felhom-pbs.enc` (`kdf=none` — stored unencrypted so the agent backs up + restore-tests unattended). Fingerprint `01:36:e9:…` (the PBS-printed prefix).
- **Real encrypted backup:** `ct/9001/2026-06-09T15:01:37Z` in `felhom-spike` — the `spike-lxc` container, `root.pxar` 2.5 GB, `crypt-mode=encrypt`, manifest key fingerprint `01:36:e9:…`, verify state `ok`. A genuinely encrypted real-guest backup (from the slice-6/8B path).
- **box1 has no default client key** (`/root/.config/proxmox-backup/encryption-key.json` absent) — so nothing could silently mask the key-less test.
## 1. The validated consumption sequence (the 10C contract)
Consumption reverses creation (slice 7): the blob is the PBS key file re-keyed `kdf=none → scrypt` under `R`; recovery re-keys `scrypt → none` under `R`, yielding the raw `K`.
- **Unwrap** (`escrow.Unwrap(blob, R)`): `proxmox-backup-client key change-passphrase <blob> --kdf none` (one prompt, fed `R` via a pty). The in-place result **is** the recovered raw key file.
- **Restore with the recovered key only:** `proxmox-backup-client restore <snap> <archive> <dest> --keyfile <recovered.key> --repository <repo>` (PBS auth via `PBS_PASSWORD`/`PBS_FINGERPRINT`).
The pty driving (F-A1/F-A2 from slice 7) carried over unchanged and works **headless** (no controlling TTY) — the harness ran over a non-interactive SSH session exactly as the daemon does.
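The restore shape above can be sketched as a small argv/env builder. This is an illustrative Python sketch, not the agent's Go code: `build_restore_command` and `build_restore_env` are hypothetical names, and only the flags and environment variables quoted above are used. The result would be handed to something like `subprocess.run(argv, env=env, check=True)`; nothing is executed here.

```python
import os


def build_restore_command(snapshot, archive, dest, keyfile, repository):
    """Return the argv for a restore that uses only the recovered key file."""
    return [
        "proxmox-backup-client", "restore",
        snapshot, archive, dest,
        "--keyfile", keyfile,
        "--repository", repository,
    ]


def build_restore_env(password, fingerprint, base_env=None):
    """Return a copy of the environment with PBS auth set.

    The secret travels in the environment, never on the command line,
    so it does not show up in process listings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PBS_PASSWORD"] = password
    env["PBS_FINGERPRINT"] = fingerprint
    return env
```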
---
## 2. Phase results
### S0 — pre-flight (create from the REAL K; K untouched) — **PASS**
`escrow.Create` on box1 from the live `K`: **self-verify PASSED** (Create unwraps a copy with `R` and matches the fingerprint), result fingerprint `01:36:e9:…` (== live `K`), blob **383 bytes** `kdf=scrypt`, `R` = **10 words / ~129 bits**. `sha256(K-file)` **identical before and after** Create → the live key was operated on only as a copy. `R` was written **only** to a `0600` file, never to stdout/log.
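The "~129 bits" figure for `R` can be checked with a few lines of arithmetic. The sketch below assumes the 10 words are drawn uniformly and independently from the EFF large wordlist (7776 words, one per five-dice roll); the source only says "10 EFF words", so the exact list is an assumption.

```python
import math

# Assumption: EFF large wordlist, 6**5 = 7776 entries.
EFF_LARGE_WORDLIST_SIZE = 6 ** 5


def passphrase_entropy_bits(word_count, wordlist_size=EFF_LARGE_WORDLIST_SIZE):
    """Entropy in bits of `word_count` words chosen uniformly at random."""
    return word_count * math.log2(wordlist_size)


bits = passphrase_entropy_bits(10)
print(f"{bits:.2f} bits")  # 129.25 bits
```

Each word contributes about 12.92 bits, so ten words land just above the 128-bit mark.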
### S1 — genuinely key-less fresh box; K is absent — **PASS**
A fenced fresh box = an isolated `HOME`/`XDG_CONFIG_HOME` under `/root/escrow-spike/freshbox` with **no encryption key present**. A restore of `ct/9001` there **with no key** failed cleanly:
```
Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a
```
No output file was produced. **This makes S3 meaningful:** without `K`, the real data is unrecoverable — so any later success is attributable to the recovered key, not a pre-existing one. The fresh box was then handed **only** the blob + `R` (nothing else from box1).
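A restore step could turn that error into a structured signal. The sketch below is a hypothetical helper (`required_key_fingerprint` is not part of the agent) that extracts the fingerprint prefix from the message text observed in this spike on `proxmox-backup-client` 4.2.0; the wording may differ in other versions.

```python
import re

# Pattern built from the exact message seen in S1; treat it as version-specific.
_MISSING_KEY = re.compile(
    r"missing key - manifest was created with key "
    r"((?:[0-9a-f]{2}:)*[0-9a-f]{2})"
)


def required_key_fingerprint(stderr_text):
    """Return the fingerprint prefix named in a 'missing key' error, else None."""
    match = _MISSING_KEY.search(stderr_text)
    return match.group(1) if match else None
```

Returning `None` for any other error keeps "key missing" distinct from unrelated restore failures.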
### S2 — consume: recover K from (blob, R) — **PASS**
`escrow.Unwrap(freshbox/blob, freshbox/R)` → **OK**. The blob went `kdf=scrypt` → `kdf=none` (raw key), and `KeyFingerprint(recovered)` = `01:36:e9:…`, a **bit-for-bit match** to the live `K` fingerprint. `K` genuinely came from `R` (the box had none).
### S3 — LOAD-BEARING: restore REAL data with the recovered K only — **PASS**
Using **only** the recovered key on the key-less box:
- **Config blob** (`pct.conf.blob`) decrypted to the real guest config — `hostname: spike-lxc`, `ostype: debian`, `rootfs: local-lvm:vm-9001-disk-0,size=10G`, `cores: 2`, `memory: 2048`.
- **Full `root.pxar`** (2.5 GB encrypted) restored in **~19 s**, exit 0. The recovered rootfs is **intact**: `/etc/hostname` = `spike-lxc`, `/etc/os-release` = `Debian GNU/Linux 13 (trixie)`, 143 `/etc` entries, `/bin/bash` present + executable, 2.5 G on disk.
This, not the fingerprint, is the proof that the recovered `K` decrypts **real customer data** end-to-end. It contrasts directly with the S1 key-absent failure: same snapshot, same box, and the **only** difference is the recovered key.
### S4 — negative: R is load-bearing — **PASS**
`escrow.Unwrap(blob, WRONG-R)` **failed cleanly**: `unwrap: FAILED: escrow: unwrap: exit status 255` (nonzero exit). The blob was **left unchanged** (`kdf` still `scrypt`) — **no plausible-but-wrong raw key was emitted**. Using the still-wrapped blob as a restore keyfile failed too (`Error: no password input mechanism available`, no output). A wrong `R` yields *nothing usable*, never silent garbage.
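The fail-closed behaviour observed here can be expressed as a wrapper contract. This is a minimal Python sketch, not the real `escrow.Unwrap`: `rekey` is a hypothetical callable standing in for the pty-driven `change-passphrase` call and is assumed to return its exit status. The blob is small (383 bytes in S0), so comparing raw bytes before and after is cheap.

```python
from pathlib import Path


class UnwrapError(Exception):
    """Raised when the blob could not be unwrapped."""


def unwrap_fail_closed(blob_path, rekey):
    """Run `rekey` on the blob in place; treat any nonzero exit as a hard failure.

    On failure, also confirm the blob was left untouched, so a caller never
    mistakes a half-rewritten file for a recovered key.
    """
    before = Path(blob_path).read_bytes()
    exit_code = rekey(blob_path)
    if exit_code != 0:
        if Path(blob_path).read_bytes() != before:
            raise UnwrapError("unwrap failed and the blob was modified")
        raise UnwrapError(f"unwrap failed closed: exit status {exit_code}")
    return blob_path
```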
---
## 3. Findings / gotchas (feeds 10C/10D specs)
- **F-C1 — the "missing key" failure is explicit and keyed.** A key-less restore fails with `missing key - manifest was created with key <fp-prefix>` and produces no partial output. 10D's restore step can detect a missing/wrong key deterministically (no silent empty restore) and surface *which* key fingerprint is required.
- **F-C2 — the recovered key is the raw `kdf=none` key, ready to use as-is.** Unwrap leaves a normal PBS key file; `--keyfile <recovered>` restores immediately. **The fresh box needs no key-install ceremony beyond placing the unwrapped file where the restore reads it** (`--keyfile`, or `$XDG_CONFIG_HOME/proxmox-backup/encryption-key.json` for the default path). This is the only "install" step.
- **F-C3 — wrong `R` is fail-closed at the KDF layer.** scrypt passphrase failure aborts `change-passphrase` with a nonzero exit and leaves the blob untouched; there is no code path that emits a wrong-but-structurally-valid key. 10C does not need an extra "did we get the right key?" guard *to avoid garbage* — but it SHOULD still fingerprint-check (F-C4) to fail fast and loudly.
- **F-C4 — fingerprint-after-unwrap is the cheap correctness gate.** `KeyFingerprint(recovered)` vs the expected fingerprint (which the hub knows: it's in `storage.cfg`/the manifest) confirms the right key before a multi-GB restore. 10C should do this immediately after Unwrap.
- **F-C5 — pty driving is headless-safe.** The slice-7 pty mechanism worked over a non-interactive SSH session with no controlling TTY — same as the daemon. No regression; nothing new needed for consumption.
- **F-C6 — `K` is never mutated.** Create copies before wrapping; `sha256(K-file)` was identical before, mid-spike, and after. The consumption path only ever reads/writes the *fresh-box copy*. Safe to run against a live box's escrow without risk to its running key.
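The F-C4 gate is simple enough to sketch. The Python below is illustrative only: `fingerprint_matches` is a hypothetical helper, and it assumes both fingerprints arrive in the same form (both the truncated PBS-printed prefix, or both full length) as colon-separated hex.

```python
import hmac


def normalize_fingerprint(fp):
    """Lower-case the fingerprint and drop colon separators and whitespace."""
    return fp.strip().lower().replace(":", "")


def fingerprint_matches(recovered_fp, expected_fp):
    """Return True only if both fingerprints are non-empty and identical."""
    got = normalize_fingerprint(recovered_fp)
    want = normalize_fingerprint(expected_fp)
    if not want or len(got) != len(want):
        return False
    # Constant-time comparison; cheap insurance even for a non-secret value.
    return hmac.compare_digest(got, want)
```

Rejecting empty and length-mismatched inputs means a missing expected fingerprint can never pass the gate by accident.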
## 4. What a fresh box needs before consumption (input to the 10D / Down-channel design)
Recovery on a re-enrolling box needs exactly four inputs, three of which are **not** the escrow secret:
1. the **opaque blob** — from the hub Down-channel (10A) (the hub stores it; cannot open it);
2. the **recovery code `R`** — from the customer, by hand (two-factor; the hub never holds it);
3. **PBS connection + auth** — repo (`<user>@<realm>!<token>@<server>:<datastore>`), the token secret, and the server fingerprint — these come from the **restore directive / identity** the hub serves (10D);
4. the **expected key fingerprint** — to gate F-C4 — also hub-served (it is in the storage manifest).
The blob (1) and `R` (2) together produce `K`; inputs (1), (3), and (4) all come from the hub, and only `R` comes from the customer. **This cleanly separates the two factors:** the hub serves everything *except* `R`, so a hub compromise alone still cannot decrypt (zero-knowledge holds end-to-end through consumption).
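One way to make the four inputs and their sources explicit is a small typed record. This is a hypothetical Python sketch for the 10D design discussion: the type and field names are invented, and the repository parser assumes the `<user>@<realm>!<token>@<server>:<datastore>` shape quoted in item 3 (no bracketed IPv6 hosts).

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class Repository:
    auth_id: str    # "<user>@<realm>!<token>"
    server: str     # host, optionally with ":port"
    datastore: str


def parse_repository(repo: str) -> Repository:
    """Split the repo string on its last '@' and its last ':'."""
    auth_id, at, rest = repo.rpartition("@")
    server, colon, datastore = rest.rpartition(":")
    if not (at and colon and auth_id and server and datastore):
        raise ValueError("expected <auth-id>@<server>:<datastore>")
    return Repository(auth_id, server, datastore)


def _from(source: str, secret: bool = False):
    """Tag a field with its source; secrets are kept out of repr()."""
    return field(repr=not secret, metadata={"source": source})


@dataclass(frozen=True)
class RecoveryInputs:
    blob: bytes = _from("hub")                      # input 1, opaque to the hub
    recovery_code: str = _from("customer", True)    # input 2, R
    repository: Repository = _from("hub")           # input 3
    token_secret: str = _from("hub", True)          # input 3
    server_fingerprint: str = _from("hub")          # input 3
    expected_key_fingerprint: str = _from("hub")    # input 4


def inputs_from(source: str) -> list[str]:
    """List the field names supplied by the given source."""
    return [f.name for f in fields(RecoveryInputs) if f.metadata["source"] == source]
```

`inputs_from("customer")` yields only `recovery_code`, which mirrors the two-factor split: everything else is hub-served.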
## 5. GO / NO-GO
**GO** to spec **10C** (escrow consumption) and to build **10D** (DR orchestration) around it. The crypto + real-data consumption is proven end-to-end on a genuinely key-less box with the real datastore: recover-from-`(blob,R)` works, the recovered key decrypts **real** customer data, a wrong `R` fails closed, and the live `K` is never touched. 10C is a thin wrapper over the proven `escrow.Unwrap` + a fingerprint gate (F-C4) + the existing PBS restore path; the remaining work is the *plumbing* (10A Down-channel to deliver the blob; 10D to deliver inputs (3)+(4) and orchestrate identity/namespace/tunnel restore + the operator-signed restore-overwrite gate, 10B), **not** the crypto.
## 6. Teardown
- **Shredded** (`shred -u`): the recovered key, the fresh-box blob/`R` copies, the wrong-`R` test blob + wrong-`R` file, box1's blob + `R`.
- **Destroyed:** the fenced fresh-box dir incl. the 2.5 G restored rootfs (`/root/escrow-spike`) and the spike harness binary on box1.
- **Live `K` byte-unchanged:** `sha256(/etc/pve/priv/storage/felhom-pbs.enc)` identical to the S0 baseline at teardown. No secret committed to git.
- **No secret ever resided on the build server (180):** `R`, `K`, and the blob were generated/written/shredded **only on box1 (162)**; 180 only compiled the non-secret harness and served datastore *ciphertext*. The throwaway harness binary/source on 180 (`/tmp/escrow-spike*`, `cmd/escrow-spike/`, never committed, build-only checkout that does not feed git) was **removed** at teardown. (DooPlex — 180 + PBS `:8007` + `gitea.dooplex.hu` — had a transient outage right at teardown, minutes after it served the S3 restore; box1/162 is a separate machine and was unaffected. 180 cleanup completed once it returned.)
- **`felhom-spike` left as found:** the spike used **only read-only** datastore ops (`snapshot list`, `restore`), with **no create/delete/prune/forget**, so no test snapshot could be orphaned or removed regardless of the final re-list (which could not run because 180 was unreachable; S1–S4 had already confirmed `ct/9001` intact and verified `ok`).
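The byte-unchanged proof used at S0 and again at teardown amounts to a baseline digest plus a later comparison. A minimal Python sketch, with hypothetical helper names; in line with the redaction policy the error message deliberately omits both digests.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_byte_unchanged(path, baseline_hex):
    """Raise if the file's digest no longer matches the recorded baseline."""
    if sha256_of_file(path) != baseline_hex:
        raise RuntimeError(f"{path} changed since the baseline was recorded")
```

The baseline would be taken before `Create` and kept out-of-band, then re-checked mid-spike and at teardown.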
## Out of scope (validated only the crypto + real-data consumption)
- Hub **Down-channel** serving the blob/restore-directive back to a re-enrolling box → **10A**.
- **Identity / tunnel / PBS-namespace** restore + re-enrollment **authorization** → **10D**.
- Operator-signed **restore-overwrite** gating → **10B**.