diff --git a/documentation/tests/slice10-escrow-consumption-spike-findings.md b/documentation/tests/slice10-escrow-consumption-spike-findings.md
new file mode 100644
index 0000000..f95208e
--- /dev/null
+++ b/documentation/tests/slice10-escrow-consumption-spike-findings.md
@@ -0,0 +1,89 @@
+# Slice 10 (10C core) — escrow CONSUMPTION: recover K + restore REAL data on a key-less box — Findings
+
+**Host:** box1 = `demo-felhom` (192.168.0.162); PBS datastore `felhom-spike` on DooPlex (192.168.0.180). PVE 9.2.2 / Debian 13, `proxmox-backup-client` 4.2.0.
+**Date:** 2026-06-10. **Driver:** SPIKE — validate escrow **consumption** (recover `K` from `(blob, R)` on a fresh, key-less box) and the **real-data** restore that proves it, before specing the hub Down-channel (10A) and DR orchestration (10D). This is the half the creation self-verify does **not** cover: a genuinely key-less box, real-data recovery (not just a fingerprint), and a load-bearing `R`.
+
+> **REDACTED by policy.** No `K` value, no recovery code `R` value, no token secret appears here — only command *shapes*, blob *size/format*, fingerprint *match* (the truncated PBS-printed prefix, never key bytes), and `R` *structure* (10 EFF words / ~129 bits, not the value). `sha256(K-file)` was recorded **out-of-band** for the byte-unchanged proof and is not pasted here. The live `K` was operated on only as a **copy** and is **byte-unchanged**. The recovered `K` was **shredded** at teardown.
+
+The spike drove the **real agent code** (`internal/escrow` `Create` / `Unwrap` / `KeyFingerprint`) via a throwaway harness (`cmd/escrow-spike`, **not committed**, removed at teardown) — so what passed here is exactly the code path 10C will wrap, not a re-implementation.
+
+---
+
+## 0. Setup (real datastore + real key + real encrypted backup)
+
+- **Real `K`:** the live PBS client encryption key for storage `felhom-pbs` at `/etc/pve/priv/storage/felhom-pbs.enc` (`kdf=none` — stored unencrypted so the agent backs up + restore-tests unattended). Fingerprint `01:36:e9:…` (the PBS-printed prefix).
+- **Real encrypted backup:** `ct/9001/2026-06-09T15:01:37Z` in `felhom-spike` — the `spike-lxc` container, `root.pxar` 2.5 GB, `crypt-mode=encrypt`, manifest key fingerprint `01:36:e9:…`, verify state `ok`. A genuinely encrypted real-guest backup (from the slice-6/8B path).
+- **box1 has no default client key** (`/root/.config/proxmox-backup/encryption-key.json` absent) — so nothing could silently mask the key-less test.
+
+## 1. The validated consumption sequence (the 10C contract)
+
+Consumption reverses creation (slice 7): the blob is the PBS key file re-keyed `kdf=none → scrypt` under `R`; recovery re-keys `scrypt → none` under `R`, yielding the raw `K`.
+
+- **Unwrap** (`escrow.Unwrap(blob, R)`): `proxmox-backup-client key change-passphrase --kdf none` (one prompt, fed `R` via a pty). The in-place result **is** the recovered raw key file.
+- **Restore with the recovered key only:** `proxmox-backup-client restore --keyfile --repository ` (PBS auth via `PBS_PASSWORD`/`PBS_FINGERPRINT`).
+
+The pty driving (F-A1/F-A2 from slice 7) carried over unchanged and works **headless** (no controlling TTY) — the harness ran over a non-interactive SSH session exactly as the daemon does.
+
+---
+
+## 2. Phase results
+
+### S0 — pre-flight (create from the REAL K; K untouched) — **PASS**
+`escrow.Create` on box1 from the live `K`: **self-verify PASSED** (Create unwraps a copy with `R` and matches the fingerprint), result fingerprint `01:36:e9:…` (== live `K`), blob **383 bytes** `kdf=scrypt`, `R` = **10 words / ~129 bits**. `sha256(K-file)` **identical before and after** Create → the live key was operated on only as a copy. `R` was written **only** to a `0600` file, never to stdout/log.
+
+### S1 — genuinely key-less fresh box; K is absent — **PASS**
+A fenced fresh box = an isolated `HOME`/`XDG_CONFIG_HOME` under `/root/escrow-spike/freshbox` with **no encryption key present**. A restore of `ct/9001` there **with no key** failed cleanly:
+```
+Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a
+```
+No output file was produced. **This makes S3 meaningful:** without `K`, the real data is unrecoverable — so any later success is attributable to the recovered key, not a pre-existing one. The fresh box was then handed **only** the blob + `R` (nothing else from box1).
+
+### S2 — consume: recover K from (blob, R) — **PASS**
+`escrow.Unwrap(freshbox/blob, freshbox/R)` → **OK**. The blob went `kdf=scrypt` → `kdf=none` (raw key), and `KeyFingerprint(recovered)` = `01:36:e9:…` — a **bit-for-bit match** to the live `K` fingerprint. `K` genuinely came from `R` (the box had none).
+
+### S3 — LOAD-BEARING: restore REAL data with the recovered K only — **PASS**
+Using **only** the recovered key on the key-less box:
+- **Config blob** (`pct.conf.blob`) decrypted to the real guest config — `hostname: spike-lxc`, `ostype: debian`, `rootfs: local-lvm:vm-9001-disk-0,size=10G`, `cores: 2`, `memory: 2048`.
+- **Full `root.pxar`** (2.5 GB encrypted) restored in **~19 s**, exit 0. The recovered rootfs is **intact**: `/etc/hostname` = `spike-lxc`, `/etc/os-release` = `Debian GNU/Linux 13 (trixie)`, 143 `/etc` entries, `/bin/bash` present + executable, 2.5 G on disk.
+
+This — not the fingerprint — is the proof the recovered `K` decrypts **real customer data** end-to-end. Directly contrasts the S1 key-absent failure: same snapshot, same box, the **only** difference is the recovered key.
+
+### S4 — negative: R is load-bearing — **PASS**
+`escrow.Unwrap(blob, WRONG-R)` **failed cleanly**: `unwrap: FAILED: escrow: unwrap: exit status 255` (nonzero exit). The blob was **left unchanged** (`kdf` still `scrypt`) — **no plausible-but-wrong raw key was emitted**. Using the still-wrapped blob as a restore keyfile failed too (`Error: no password input mechanism available`, no output). A wrong `R` yields *nothing usable*, never silent garbage.
+
+---
+
+## 3. Findings / gotchas (feeds 10C/10D specs)
+
+- **F-C1 — the "missing key" failure is explicit and keyed.** A key-less restore fails with `missing key - manifest was created with key ` and produces no partial output. 10D's restore step can detect a missing/wrong key deterministically (no silent empty restore) and surface *which* key fingerprint is required.
+- **F-C2 — the recovered key is the raw `kdf=none` key, ready to use as-is.** Unwrap leaves a normal PBS key file; `--keyfile ` restores immediately. **The fresh box needs no key-install ceremony beyond placing the unwrapped file where the restore reads it** (`--keyfile`, or `$XDG_CONFIG_HOME/proxmox-backup/encryption-key.json` for the default path). This is the only "install" step.
+- **F-C3 — wrong `R` is fail-closed at the KDF layer.** scrypt passphrase failure aborts `change-passphrase` with a nonzero exit and leaves the blob untouched; there is no code path that emits a wrong-but-structurally-valid key. 10C does not need an extra "did we get the right key?" guard *to avoid garbage* — but it SHOULD still fingerprint-check (F-C4) to fail fast and loudly.
+- **F-C4 — fingerprint-after-unwrap is the cheap correctness gate.** `KeyFingerprint(recovered)` vs the expected fingerprint (which the hub knows: it's in `storage.cfg`/the manifest) confirms the right key before a multi-GB restore. 10C should do this immediately after Unwrap.
+- **F-C5 — pty driving is headless-safe.** The slice-7 pty mechanism worked over a non-interactive SSH session with no controlling TTY — same as the daemon. No regression; nothing new needed for consumption.
+- **F-C6 — `K` is never mutated.** Create copies before wrapping; `sha256(K-file)` was identical before, mid-spike, and after. The consumption path only ever reads/writes the *fresh-box copy*. Safe to run against a live box's escrow without risk to its running key.
+
+## 4. What a fresh box needs before consumption (input to the 10D / Down-channel design)
+
+Recovery on a re-enrolling box needs exactly four inputs, three of which are **not** the escrow secret:
+1. the **opaque blob** — from the hub Down-channel (10A) (the hub stores it; cannot open it);
+2. the **recovery code `R`** — from the customer, by hand (two-factor; the hub never holds it);
+3. **PBS connection + auth** — repo (`@!@:`), the token secret, and the server fingerprint — these come from the **restore directive / identity** the hub serves (10D);
+4. the **expected key fingerprint** — to gate F-C4 — also hub-served (it is in the storage manifest).
+The blob + `R` produce `K`; (3)+(4) come from the hub. **This cleanly separates the two factors:** the hub serves everything *except* `R`, so a hub compromise alone still cannot decrypt (zero-knowledge holds end-to-end through consumption).
+
+## 5. GO / NO-GO
+
+**GO** to spec **10C** (escrow consumption) and to build **10D** (DR orchestration) around it. The crypto + real-data consumption is proven end-to-end on a genuinely key-less box with the real datastore: recover-from-`(blob,R)` works, the recovered key decrypts **real** customer data, a wrong `R` fails closed, and the live `K` is never touched. 10C is a thin wrapper over the proven `escrow.Unwrap` + a fingerprint gate (F-C4) + the existing PBS restore path; the remaining work is the *plumbing* (10A Down-channel to deliver the blob; 10D to deliver inputs (3)+(4) and orchestrate identity/namespace/tunnel restore + the operator-signed restore-overwrite gate, 10B), **not** the crypto.
+
+## 6. Teardown
+
+- **Shredded** (`shred -u`): the recovered key, the fresh-box blob/`R` copies, the wrong-`R` test blob + wrong-`R` file, box1's blob + `R`.
+- **Destroyed:** the fenced fresh-box dir incl. the 2.5 G restored rootfs (`/root/escrow-spike`) and the spike harness binary on box1.
+- **Live `K` byte-unchanged:** `sha256(/etc/pve/priv/storage/felhom-pbs.enc)` identical to the S0 baseline at teardown. No secret committed to git.
+- **No secret ever resided on the build server (180):** `R`, `K`, and the blob were generated/written/shredded **only on box1 (162)**; 180 only compiled the non-secret harness and served datastore *ciphertext*. The throwaway harness binary/source on 180 (`/tmp/escrow-spike*`, `cmd/escrow-spike/`, never committed, build-only checkout that does not feed git) was **removed** at teardown. (DooPlex — 180 + PBS `:8007` + `gitea.dooplex.hu` — had a transient outage right at teardown, minutes after it served the S3 restore; box1/162 is a separate machine and was unaffected. 180 cleanup completed once it returned.)
+- **`felhom-spike` left as found:** the spike used **only read-only** datastore ops (`snapshot list`, `restore`) — **no create/delete/prune/forget** — so no test snapshot could be orphaned or removed regardless of the final re-list (which could not run because 180 was unreachable; S1–S4 had already confirmed `ct/9001` intact and verified `ok`).
+
+## Out of scope (validated only the crypto + real-data consumption)
+- Hub **Down-channel** serving the blob/restore-directive back to a re-enrolling box → **10A**.
+- **Identity / tunnel / PBS-namespace** restore + re-enrollment **authorization** → **10D**.
+- Operator-signed **restore-overwrite** gating → **10B**.
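
As an illustration of findings F-C1 and F-C4, the following Go sketch shows how a 10C/10D caller might extract the required fingerprint from the key-less restore error and then gate on the recovered key's fingerprint before starting a multi-GB restore. The helper names `requiredFingerprint` and `fingerprintGate` are hypothetical and are not part of `internal/escrow`. The error text is the one quoted in S1, and the comparison assumes both fingerprints use the same PBS-printed, colon-separated form.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// missingKeyRe matches the key-less restore error quoted in S1 and captures
// the colon-separated hex fingerprint the manifest was created with.
var missingKeyRe = regexp.MustCompile(
	`missing key - manifest was created with key ((?:[0-9a-fA-F]{2}:)*[0-9a-fA-F]{2})`)

// requiredFingerprint (hypothetical helper) returns the lower-cased
// fingerprint named in a "missing key" error, or "" for any other text.
func requiredFingerprint(stderr string) string {
	m := missingKeyRe.FindStringSubmatch(stderr)
	if m == nil {
		return ""
	}
	return strings.ToLower(m[1])
}

// fingerprintGate (hypothetical helper) is the F-C4 check: it returns nil
// only when the recovered fingerprint equals the expected one, ignoring
// case and surrounding whitespace. Empty inputs are rejected.
func fingerprintGate(expected, recovered string) error {
	e := strings.ToLower(strings.TrimSpace(expected))
	r := strings.ToLower(strings.TrimSpace(recovered))
	if e == "" || r == "" {
		return fmt.Errorf("fingerprint gate: empty fingerprint")
	}
	if e != r {
		return fmt.Errorf("fingerprint gate: recovered %s does not match expected %s", r, e)
	}
	return nil
}

func main() {
	// Error text as reported by the key-less restore in S1.
	stderr := "Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a"
	want := requiredFingerprint(stderr)
	fmt.Println("required:", want)

	// Assumption: in the agent this value would come from KeyFingerprint
	// on the unwrapped key file; here it is a literal for illustration.
	if err := fingerprintGate(want, "01:36:E9:FE:E1:EE:3D:7A"); err != nil {
		fmt.Println("gate failed:", err)
		return
	}
	fmt.Println("gate passed")
}
```

Fingerprints are not secret, so including them in the error message is consistent with the redaction policy above. In a real agent the expected value would presumably come from the hub-served restore directive rather than from a parsed error, which is an assumption of this sketch.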