docs: slice 10C escrow-consumption spike findings (GO)
Validated escrow consumption end-to-end on a genuinely key-less box against the real felhom-spike datastore: recover K from (blob,R) via the real escrow.Unwrap, restore REAL data (spike-lxc rootfs, 2.5G) with the recovered key only, wrong-R fails closed (no plausible-but-wrong key), live K byte-unchanged. Redacted (no R/K/secret). GO to spec 10C + build 10D.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,89 @@
# Slice 10 (10C core) — escrow CONSUMPTION: recover K + restore REAL data on a key-less box — Findings
**Host:** box1 = `demo-felhom` (192.168.0.162); PBS datastore `felhom-spike` on DooPlex (192.168.0.180). PVE 9.2.2 / Debian 13, `proxmox-backup-client` 4.2.0.
**Date:** 2026-06-10. **Driver:** SPIKE: validate escrow **consumption** (recover `K` from `(blob, R)` on a fresh, key-less box) and the **real-data** restore that proves it, before speccing the hub Down-channel (10A) and DR orchestration (10D). This is the half the creation self-verify does **not** cover: a genuinely key-less box, real-data recovery (not just a fingerprint), and a load-bearing `R`.
> **REDACTED by policy.** No `K` value, no recovery code `R` value, no token secret appears here; only command *shapes*, blob *size/format*, fingerprint *match* (the truncated PBS-printed prefix, never key bytes), and `R` *structure* (10 EFF words / ~129 bits, not the value). `sha256(K-file)` was recorded **out-of-band** for the byte-unchanged proof and is not pasted here. The live `K` was operated on only as a **copy** and is **byte-unchanged**. The recovered `K` was **shredded** at teardown.

The spike drove the **real agent code** (`internal/escrow` `Create` / `Unwrap` / `KeyFingerprint`) via a throwaway harness (`cmd/escrow-spike`, **not committed**, removed at teardown), so what passed here is exactly the code path 10C will wrap, not a re-implementation.

---
## 0. Setup (real datastore + real key + real encrypted backup)
- **Real `K`:** the live PBS client encryption key for storage `felhom-pbs` at `/etc/pve/priv/storage/felhom-pbs.enc` (`kdf=none` — stored unencrypted so the agent backs up + restore-tests unattended). Fingerprint `01:36:e9:…` (the PBS-printed prefix).
- **Real encrypted backup:** `ct/9001/2026-06-09T15:01:37Z` in `felhom-spike` — the `spike-lxc` container, `root.pxar` 2.5 GB, `crypt-mode=encrypt`, manifest key fingerprint `01:36:e9:…`, verify state `ok`. A genuinely encrypted real-guest backup (from the slice-6/8B path).
- **box1 has no default client key** (`/root/.config/proxmox-backup/encryption-key.json` absent) — so nothing could silently mask the key-less test.
## 1. The validated consumption sequence (the 10C contract)
Consumption reverses creation (slice 7): the blob is the PBS key file re-keyed `kdf=none → scrypt` under `R`; recovery re-keys `scrypt → none` under `R`, yielding the raw `K`.
- **Unwrap** (`escrow.Unwrap(blob, R)`): `proxmox-backup-client key change-passphrase <blob> --kdf none` (one prompt, fed `R` via a pty). The in-place result **is** the recovered raw key file.
- **Restore with the recovered key only:** `proxmox-backup-client restore <snap> <archive> <dest> --keyfile <recovered.key> --repository <repo>` (PBS auth via `PBS_PASSWORD`/`PBS_FINGERPRINT`).

The pty driving (F-A1/F-A2 from slice 7) carried over unchanged and works **headless** (no controlling TTY): the harness ran over a non-interactive SSH session exactly as the daemon does.

---
## 2. Phase results
### S0 — pre-flight (create from the REAL K; K untouched) — **PASS**
`escrow.Create` on box1 from the live `K`: **self-verify PASSED** (Create unwraps a copy with `R` and matches the fingerprint), result fingerprint `01:36:e9:…` (== live `K`), blob **383 bytes** `kdf=scrypt`, `R` = **10 words / ~129 bits**. `sha256(K-file)` **identical before and after** Create → the live key was operated on only as a copy. `R` was written **only** to a `0600` file, never to stdout/log.
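The ~129-bit figure for `R` can be checked with a few lines of arithmetic. This is an illustrative Python sketch, not harness code, and it assumes the words are drawn independently and uniformly from the EFF large wordlist (7776 entries); the findings only say "EFF words", so the list size is an assumption.

```python
import math

# Assumption: R is drawn uniformly from the EFF large wordlist,
# which has 6**5 = 7776 entries (one word per five dice rolls).
EFF_LARGE_WORDLIST_SIZE = 6 ** 5


def passphrase_entropy_bits(word_count: int,
                            wordlist_size: int = EFF_LARGE_WORDLIST_SIZE) -> float:
    """Entropy in bits of word_count words chosen independently and uniformly."""
    return word_count * math.log2(wordlist_size)


if __name__ == "__main__":
    # Ten words, as reported for R in S0.
    print(f"{passphrase_entropy_bits(10):.2f} bits")
```

Under that assumption each word contributes about 12.9 bits, so ten words land just above 129 bits, consistent with the figure reported above.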
### S1 — genuinely key-less fresh box; K is absent — **PASS**
A fenced fresh box = an isolated `HOME`/`XDG_CONFIG_HOME` under `/root/escrow-spike/freshbox` with **no encryption key present**. A restore of `ct/9001` there **with no key** failed cleanly:
```
Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a
```
No output file was produced. **This makes S3 meaningful:** without `K`, the real data is unrecoverable — so any later success is attributable to the recovered key, not a pre-existing one. The fresh box was then handed **only** the blob + `R` (nothing else from box1).
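Because the failure names the key it wants, a caller could extract the required fingerprint prefix from the error text. A minimal Python sketch follows; the function name and regex are hypothetical, and it assumes the message shape observed above (client 4.2.0) is stable, which this spike did not verify across versions.

```python
import re
from typing import Optional

# Matches the message shape observed in S1. The pattern is illustrative:
# it assumes lowercase hex byte pairs separated by colons.
_MISSING_KEY_RE = re.compile(
    r"missing key - manifest was created with key ((?:[0-9a-f]{2}:)*[0-9a-f]{2})"
)


def required_key_fingerprint(stderr_text: str) -> Optional[str]:
    """Return the fingerprint prefix named in a missing-key error, or None."""
    match = _MISSING_KEY_RE.search(stderr_text)
    return match.group(1) if match else None
```

Returning `None` for any other text keeps the caller from treating an unrelated failure as a missing-key condition.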
### S2 — consume: recover K from (blob, R) — **PASS**
`escrow.Unwrap(freshbox/blob, freshbox/R)` → **OK**. The blob went `kdf=scrypt` → `kdf=none` (raw key), and `KeyFingerprint(recovered)` = `01:36:e9:…` — a **bit-for-bit match** to the live `K` fingerprint. `K` genuinely came from `R` (the box had none).
### S3 — LOAD-BEARING: restore REAL data with the recovered K only — **PASS**
Using **only** the recovered key on the key-less box:
- **Config blob** (`pct.conf.blob`) decrypted to the real guest config — `hostname: spike-lxc`, `ostype: debian`, `rootfs: local-lvm:vm-9001-disk-0,size=10G`, `cores: 2`, `memory: 2048`.
- **Full `root.pxar`** (2.5 GB encrypted) restored in **~19 s**, exit 0. The recovered rootfs is **intact**: `/etc/hostname` = `spike-lxc`, `/etc/os-release` = `Debian GNU/Linux 13 (trixie)`, 143 `/etc` entries, `/bin/bash` present + executable, 2.5 G on disk.

This, not the fingerprint, is the proof that the recovered `K` decrypts **real customer data** end-to-end. It directly contrasts with the S1 key-absent failure: same snapshot, same box; the **only** difference is the recovered key.
### S4 — negative: R is load-bearing — **PASS**
`escrow.Unwrap(blob, WRONG-R)` **failed cleanly**: `unwrap: FAILED: escrow: unwrap: exit status 255` (nonzero exit). The blob was **left unchanged** (`kdf` still `scrypt`); **no plausible-but-wrong raw key was emitted**. Using the still-wrapped blob as a restore keyfile failed too (`Error: no password input mechanism available`, no output). A wrong `R` yields *nothing usable*, never silent garbage.

---
## 3. Findings / gotchas (feeds 10C/10D specs)
- **F-C1 — the "missing key" failure is explicit and keyed.** A key-less restore fails with `missing key - manifest was created with key <fp-prefix>` and produces no partial output. 10D's restore step can detect a missing/wrong key deterministically (no silent empty restore) and surface *which* key fingerprint is required.
- **F-C2 — the recovered key is the raw `kdf=none` key, ready to use as-is.** Unwrap leaves a normal PBS key file; `--keyfile <recovered>` restores immediately. **The fresh box needs no key-install ceremony beyond placing the unwrapped file where the restore reads it** (`--keyfile`, or `$XDG_CONFIG_HOME/proxmox-backup/encryption-key.json` for the default path). This is the only "install" step.
- **F-C3 — wrong `R` is fail-closed at the KDF layer.** scrypt passphrase failure aborts `change-passphrase` with a nonzero exit and leaves the blob untouched; there is no code path that emits a wrong-but-structurally-valid key. 10C does not need an extra "did we get the right key?" guard *to avoid garbage* — but it SHOULD still fingerprint-check (F-C4) to fail fast and loudly.
- **F-C4 — fingerprint-after-unwrap is the cheap correctness gate.** `KeyFingerprint(recovered)` vs the expected fingerprint (which the hub knows: it's in `storage.cfg`/the manifest) confirms the right key before a multi-GB restore. 10C should do this immediately after Unwrap.
- **F-C5 — pty driving is headless-safe.** The slice-7 pty mechanism worked over a non-interactive SSH session with no controlling TTY — same as the daemon. No regression; nothing new needed for consumption.
- **F-C6 — `K` is never mutated.** Create copies before wrapping; `sha256(K-file)` was identical before, mid-spike, and after. The consumption path only ever reads/writes the *fresh-box copy*. Safe to run against a live box's escrow without risk to its running key.
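The byte-unchanged proof behind F-C6 reduces to hashing the key file before and after and comparing against a baseline kept out-of-band. A minimal Python sketch of that check, with hypothetical function names; it is not the harness code, and the error message deliberately carries no digest so nothing hash-related lands in a log.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_unchanged(path: Path, baseline_hex: str) -> None:
    """Raise if the file no longer hashes to the recorded baseline."""
    if sha256_file(path) != baseline_hex:
        # No digest in the message: the baseline stays out-of-band.
        raise RuntimeError(f"{path} changed since the baseline was recorded")
```

A caller would record `sha256_file(key_path)` once before the spike and call `assert_unchanged` at each checkpoint (before, mid-spike, teardown).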
## 4. What a fresh box needs before consumption (input to the 10D / Down-channel design)
Recovery on a re-enrolling box needs exactly four inputs, three of which are **not** the escrow secret:
1. the **opaque blob** — from the hub Down-channel (10A) (the hub stores it; cannot open it);
2. the **recovery code `R`** — from the customer, by hand (two-factor; the hub never holds it);
3. **PBS connection + auth** — repo (`<user>@<realm>!<token>@<server>:<datastore>`), the token secret, and the server fingerprint — these come from the **restore directive / identity** the hub serves (10D);
4. the **expected key fingerprint**, for the F-C4 gate; also hub-served (per F-C4 it is in `storage.cfg`/the manifest).
The blob + `R` produce `K`; (3)+(4) come from the hub. **This cleanly separates the two factors:** the hub serves everything *except* `R`, so a hub compromise alone still cannot decrypt (zero-knowledge holds end-to-end through consumption).
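The factor separation can be written down as data and checked mechanically. The Python sketch below is illustrative only: the type, field, and input names are hypothetical and do not come from the agent or the hub.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HUB = "hub"
    CUSTOMER = "customer"


@dataclass(frozen=True)
class RecoveryInput:
    name: str
    source: Source
    is_escrow_secret: bool


# The four inputs listed above; names are hypothetical labels.
RECOVERY_INPUTS = (
    RecoveryInput("opaque_blob", Source.HUB, False),
    RecoveryInput("recovery_code_R", Source.CUSTOMER, True),
    RecoveryInput("pbs_connection_and_auth", Source.HUB, False),
    RecoveryInput("expected_key_fingerprint", Source.HUB, False),
)

# K is produced from the blob plus R, so both are needed to recover it.
KEY_MATERIAL = frozenset({"opaque_blob", "recovery_code_R"})


def hub_alone_can_recover_key(inputs=RECOVERY_INPUTS) -> bool:
    """True only if the hub serves everything needed to produce K."""
    hub_served = {item.name for item in inputs if item.source is Source.HUB}
    return KEY_MATERIAL <= hub_served
```

With the provenance above, `hub_alone_can_recover_key()` is false; it only becomes true if `R` is ever reclassified as hub-served, which is the design regression the two-factor split is meant to rule out.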
## 5. GO / NO-GO
**GO** to spec **10C** (escrow consumption) and to build **10D** (DR orchestration) around it. The crypto + real-data consumption is proven end-to-end on a genuinely key-less box with the real datastore: recover-from-`(blob,R)` works, the recovered key decrypts **real** customer data, a wrong `R` fails closed, and the live `K` is never touched. 10C is a thin wrapper over the proven `escrow.Unwrap` + a fingerprint gate (F-C4) + the existing PBS restore path; the remaining work is the *plumbing* (10A Down-channel to deliver the blob; 10D to deliver inputs (3)+(4) and orchestrate identity/namespace/tunnel restore + the operator-signed restore-overwrite gate, 10B), **not** the crypto.
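The "thin wrapper" ordering (unwrap, then the F-C4 fingerprint gate, then restore) can be sketched with the three steps injected as callables. This is a hedged Python illustration, not the 10C design: `recover_and_restore`, `RecoveryError`, and the callable parameters are hypothetical stand-ins for `escrow.Unwrap`, `KeyFingerprint`, and the existing PBS restore path.

```python
from typing import Callable


class RecoveryError(Exception):
    """Raised when recovery must stop before any restore is attempted."""


def normalize_fingerprint(fingerprint: str) -> str:
    """Trim and lowercase so formatting differences do not cause a mismatch."""
    return fingerprint.strip().lower()


def recover_and_restore(
    unwrap: Callable[[], str],              # stand-in for escrow.Unwrap(blob, R)
    fingerprint_of: Callable[[str], str],   # stand-in for KeyFingerprint
    restore: Callable[[str], None],         # stand-in for the PBS restore step
    expected_fingerprint: str,
) -> None:
    """Unwrap, gate on the fingerprint, then restore; stop at the first failure."""
    try:
        key_path = unwrap()
    except Exception as exc:
        raise RecoveryError("unwrap failed; no key was produced") from exc

    actual = normalize_fingerprint(fingerprint_of(key_path))
    if actual != normalize_fingerprint(expected_fingerprint):
        raise RecoveryError("recovered key fingerprint does not match expected")

    # Only reached with a key that passed the gate.
    restore(key_path)
```

The point of the shape is that `restore` is unreachable unless both earlier steps succeed, so a wrong `R` or a wrong blob costs seconds rather than a multi-GB transfer. Whether the expected fingerprint is compared in full or as the PBS-printed prefix is left open here.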
## 6. Teardown
- **Shredded** (`shred -u`): the recovered key, the fresh-box blob/`R` copies, the wrong-`R` test blob + wrong-`R` file, box1's blob + `R`.
- **Destroyed:** the fenced fresh-box dir incl. the 2.5 G restored rootfs (`/root/escrow-spike`) and the spike harness binary on box1.
- **Live `K` byte-unchanged:** `sha256(/etc/pve/priv/storage/felhom-pbs.enc)` identical to the S0 baseline at teardown. No secret committed to git.
- **No secret ever resided on the build server (180):** `R`, `K`, and the blob were generated/written/shredded **only on box1 (162)**; 180 only compiled the non-secret harness and served datastore *ciphertext*. The throwaway harness binary/source on 180 (`/tmp/escrow-spike*`, `cmd/escrow-spike/`, never committed, build-only checkout that does not feed git) was **removed** at teardown. (DooPlex — 180 + PBS `:8007` + `gitea.dooplex.hu` — had a transient outage right at teardown, minutes after it served the S3 restore; box1/162 is a separate machine and was unaffected. 180 cleanup completed once it returned.)
- **`felhom-spike` left as found:** the spike used **only read-only** datastore ops (`snapshot list`, `restore`) — **no create/delete/prune/forget** — so no test snapshot could be orphaned or removed regardless of the final re-list (which could not run because 180 was unreachable; S1–S4 had already confirmed `ct/9001` intact and verified `ok`).
## Out of scope (validated only the crypto + real-data consumption)
- Hub **Down-channel** serving the blob/restore-directive back to a re-enrolling box → **10A**.
- **Identity / tunnel / PBS-namespace** restore + re-enrollment **authorization** → **10D**.
- Operator-signed **restore-overwrite** gating → **10B**.