felhom.eu/documentation/tests/slice10-escrow-consumption-spike-findings.md
admin f9af3243b9 docs: slice 10C escrow-consumption spike findings (GO)
Validated escrow consumption end-to-end on a genuinely key-less box against
the real felhom-spike datastore: recover K from (blob,R) via the real
escrow.Unwrap, restore REAL data (spike-lxc rootfs, 2.5G) with the recovered
key only, wrong-R fails closed (no plausible-but-wrong key), live K
byte-unchanged. Redacted (no R/K/secret). GO to spec 10C + build 10D.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 17:10:31 +02:00


Slice 10 (10C core) — escrow CONSUMPTION: recover K + restore REAL data on a key-less box — Findings

Host: box1 = demo-felhom (192.168.0.162); PBS datastore felhom-spike on DooPlex (192.168.0.180). PVE 9.2.2 / Debian 13, proxmox-backup-client 4.2.0. Date: 2026-06-10. Driver: SPIKE to validate escrow consumption (recover K from (blob, R) on a fresh, key-less box) and the real-data restore that proves it, before specifying the hub Down-channel (10A) and DR orchestration (10D). This is the half the creation self-verify does not cover: a genuinely key-less box, real-data recovery (not just a fingerprint), and a load-bearing R.

REDACTED by policy. No K value, no recovery code R value, no token secret appears here — only command shapes, blob size/format, fingerprint match (the truncated PBS-printed prefix, never key bytes), and R structure (10 EFF words / ~129 bits, not the value). sha256(K-file) was recorded out-of-band for the byte-unchanged proof and is not pasted here. The live K was operated on only as a copy and is byte-unchanged. The recovered K was shredded at teardown.
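The "~129 bits" figure for R can be reproduced with a short calculation. This is a minimal sketch that assumes the words are drawn uniformly and independently from the EFF large wordlist (7776 words); the list size is an assumption inferred from the stated entropy, and `passphraseBits` is a hypothetical helper, not agent code.

```go
package main

import (
	"fmt"
	"math"
)

// passphraseBits returns the entropy, in bits, of a passphrase made of
// `words` words drawn uniformly and independently from a list of `listSize` words.
func passphraseBits(words, listSize int) float64 {
	return float64(words) * math.Log2(float64(listSize))
}

func main() {
	// Assumption: EFF large wordlist, 7776 = 6^5 entries (five dice rolls per word).
	fmt.Printf("%.1f bits\n", passphraseBits(10, 7776)) // prints "129.2 bits"
}
```

Each word contributes about 12.9 bits, so ten words land just above the 128-bit mark.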

The spike drove the real agent code (internal/escrow Create / Unwrap / KeyFingerprint) via a throwaway harness (cmd/escrow-spike, not committed, removed at teardown) — so what passed here is exactly the code path 10C will wrap, not a re-implementation.


0. Setup (real datastore + real key + real encrypted backup)

  • Real K: the live PBS client encryption key for storage felhom-pbs at /etc/pve/priv/storage/felhom-pbs.enc (kdf=none — stored unencrypted so the agent backs up + restore-tests unattended). Fingerprint 01:36:e9:… (the PBS-printed prefix).
  • Real encrypted backup: ct/9001/2026-06-09T15:01:37Z in felhom-spike — the spike-lxc container, root.pxar 2.5 GB, crypt-mode=encrypt, manifest key fingerprint 01:36:e9:…, verify state ok. A genuinely encrypted real-guest backup (from the slice-6/8B path).
  • box1 has no default client key (/root/.config/proxmox-backup/encryption-key.json absent) — so nothing could silently mask the key-less test.

1. The validated consumption sequence (the 10C contract)

Consumption reverses creation (slice 7): the blob is the PBS key file re-keyed kdf=none → scrypt under R; recovery re-keys scrypt → none under R, yielding the raw K.

  • Unwrap (escrow.Unwrap(blob, R)): proxmox-backup-client key change-passphrase <blob> --kdf none (one prompt, fed R via a pty). The in-place result is the recovered raw key file.
  • Restore with the recovered key only: proxmox-backup-client restore <snap> <archive> <dest> --keyfile <recovered.key> --repository <repo> (PBS auth via PBS_PASSWORD/PBS_FINGERPRINT).

The pty driving (F-A1/F-A2 from slice 7) carried over unchanged and works headless (no controlling TTY) — the harness ran over a non-interactive SSH session exactly as the daemon does.
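The two command shapes above can be written down as argument vectors. The sketch below only assembles argv from the shapes recorded in this section; `unwrapArgs` and `restoreArgs` are hypothetical helper names, not the real internal/escrow API, and the paths in `main` are placeholders. R is deliberately absent from argv because it is fed through the pty prompt, and PBS auth travels in the environment (PBS_PASSWORD/PBS_FINGERPRINT), not on the command line.

```go
package main

import (
	"fmt"
	"strings"
)

// unwrapArgs mirrors the documented unwrap shape:
//   proxmox-backup-client key change-passphrase <blob> --kdf none
// The passphrase R is NOT an argument; it is answered on the single prompt.
func unwrapArgs(blobPath string) []string {
	return []string{"proxmox-backup-client", "key", "change-passphrase", blobPath, "--kdf", "none"}
}

// restoreArgs mirrors the documented restore shape:
//   proxmox-backup-client restore <snap> <archive> <dest> --keyfile <key> --repository <repo>
func restoreArgs(snapshot, archive, dest, keyfile, repo string) []string {
	return []string{
		"proxmox-backup-client", "restore", snapshot, archive, dest,
		"--keyfile", keyfile,
		"--repository", repo,
	}
}

func main() {
	// Placeholder paths and repository; nothing here is a secret.
	fmt.Println(strings.Join(unwrapArgs("/path/to/blob"), " "))
	fmt.Println(strings.Join(restoreArgs(
		"ct/9001/2026-06-09T15:01:37Z", "root.pxar", "/path/to/dest",
		"/path/to/recovered.key", "<repo>"), " "))
}
```

Keeping the builders free of secrets means the resulting argv can be logged safely.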


2. Phase results

S0 — pre-flight (create from the REAL K; K untouched) — PASS

escrow.Create on box1 from the live K: self-verify PASSED (Create unwraps a copy with R and matches the fingerprint), result fingerprint 01:36:e9:… (== live K), blob 383 bytes kdf=scrypt, R = 10 words / ~129 bits. sha256(K-file) identical before and after Create → the live key was operated on only as a copy. R was written only to a 0600 file, never to stdout/log.
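The byte-unchanged check is a plain before/after digest comparison. The following is a minimal sketch of that idea, not the harness that was actually run: `fileSHA256`, `unchangedAcross`, and `demo` are hypothetical names, and `op` stands in for whatever operation touches the key (for example a Create call). The demo runs against a throwaway temp file, never a real key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA256 returns the hex SHA-256 digest of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// unchangedAcross hashes path, runs op, hashes again, and reports whether
// the two digests are equal. op is a hypothetical hook, not agent API.
func unchangedAcross(path string, op func() error) (bool, error) {
	before, err := fileSHA256(path)
	if err != nil {
		return false, err
	}
	if err := op(); err != nil {
		return false, err
	}
	after, err := fileSHA256(path)
	if err != nil {
		return false, err
	}
	return before == after, nil
}

// demo exercises unchangedAcross on a temp file: once with an op that only
// reads the file, once with an op that appends a byte to it.
func demo() (afterRead, afterAppend bool, err error) {
	f, err := os.CreateTemp("", "k-demo-*")
	if err != nil {
		return false, false, err
	}
	path := f.Name()
	defer os.Remove(path)
	if _, err = f.WriteString("not a real key"); err != nil {
		f.Close()
		return false, false, err
	}
	if err = f.Close(); err != nil {
		return false, false, err
	}
	afterRead, err = unchangedAcross(path, func() error {
		_, e := os.ReadFile(path)
		return e
	})
	if err != nil {
		return false, false, err
	}
	afterAppend, err = unchangedAcross(path, func() error {
		g, e := os.OpenFile(path, os.O_APPEND|os.O_WRONLY, 0o600)
		if e != nil {
			return e
		}
		defer g.Close()
		_, e = g.WriteString("x")
		return e
	})
	return afterRead, afterAppend, err
}

func main() {
	ro, ap, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("unchanged after read-only op:", ro)
	fmt.Println("unchanged after append:", ap)
}
```

Only the comparison result needs to be logged; the digest itself can stay out-of-band, as it did in this spike.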

S1 — genuinely key-less fresh box; K is absent — PASS

A fenced fresh box = an isolated HOME/XDG_CONFIG_HOME under /root/escrow-spike/freshbox with no encryption key present. A restore of ct/9001 there with no key failed cleanly:

Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a

No output file was produced. This makes S3 meaningful: without K, the real data is unrecoverable — so any later success is attributable to the recovered key, not a pre-existing one. The fresh box was then handed only the blob + R (nothing else from box1).
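Because the failure names the key it needs, a caller could pull the fingerprint prefix out of the error text. This is a sketch under the assumption that the wording matches what this spike observed with proxmox-backup-client 4.2.0; other versions may phrase it differently, and `requiredKeyFingerprint` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// missingKeyRE matches the error text observed in S1. The wording is taken
// from this spike's output only; treat it as an assumption for other versions.
var missingKeyRE = regexp.MustCompile(
	`missing key - manifest was created with key ((?:[0-9a-f]{2}:)*[0-9a-f]{2})`)

// requiredKeyFingerprint extracts the colon-separated fingerprint prefix from
// a key-less restore failure. ok is false if the text is not that failure.
func requiredKeyFingerprint(stderr string) (fp string, ok bool) {
	m := missingKeyRE.FindStringSubmatch(stderr)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	fp, ok := requiredKeyFingerprint(
		"Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a")
	fmt.Println(fp, ok) // prints "01:36:e9:fe:e1:ee:3d:7a true"
}
```

Matching on message text is brittle, so a production path would likely pair this with the exit status rather than rely on the string alone.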

S2 — consume: recover K from (blob, R) — PASS

escrow.Unwrap(freshbox/blob, freshbox/R) → OK. The blob went kdf=scrypt → kdf=none (raw key), and KeyFingerprint(recovered) = 01:36:e9:…, a bit-for-bit match to the live K fingerprint. K genuinely came from R (the box had none).

S3 — LOAD-BEARING: restore REAL data with the recovered K only — PASS

Using only the recovered key on the key-less box:

  • Config blob (pct.conf.blob) decrypted to the real guest config — hostname: spike-lxc, ostype: debian, rootfs: local-lvm:vm-9001-disk-0,size=10G, cores: 2, memory: 2048.
  • Full root.pxar (2.5 GB encrypted) restored in ~19 s, exit 0. The recovered rootfs is intact: /etc/hostname = spike-lxc, /etc/os-release = Debian GNU/Linux 13 (trixie), 143 /etc entries, /bin/bash present + executable, 2.5 G on disk.

This restore, not the fingerprint match, is the proof that the recovered K decrypts real customer data end-to-end. It contrasts directly with the S1 key-absent failure: same snapshot, same box, and the only difference is the recovered key.

S4 — negative: R is load-bearing — PASS

escrow.Unwrap(blob, WRONG-R) failed cleanly: unwrap: FAILED: escrow: unwrap: exit status 255 (nonzero exit). The blob was left unchanged (kdf still scrypt) — no plausible-but-wrong raw key was emitted. Using the still-wrapped blob as a restore keyfile failed too (Error: no password input mechanism available, no output). A wrong R yields nothing usable, never silent garbage.


3. Findings / gotchas (feeds 10C/10D specs)

  • F-C1 — the "missing key" failure is explicit and keyed. A key-less restore fails with missing key - manifest was created with key <fp-prefix> and produces no partial output. 10D's restore step can detect a missing/wrong key deterministically (no silent empty restore) and surface which key fingerprint is required.
  • F-C2 — the recovered key is the raw kdf=none key, ready to use as-is. Unwrap leaves a normal PBS key file; --keyfile <recovered> restores immediately. The fresh box needs no key-install ceremony beyond placing the unwrapped file where the restore reads it (--keyfile, or $XDG_CONFIG_HOME/proxmox-backup/encryption-key.json for the default path). This is the only "install" step.
  • F-C3 — wrong R is fail-closed at the KDF layer. scrypt passphrase failure aborts change-passphrase with a nonzero exit and leaves the blob untouched; there is no code path that emits a wrong-but-structurally-valid key. 10C does not need an extra "did we get the right key?" guard to avoid garbage — but it SHOULD still fingerprint-check (F-C4) to fail fast and loudly.
  • F-C4 — fingerprint-after-unwrap is the cheap correctness gate. KeyFingerprint(recovered) vs the expected fingerprint (which the hub knows: it's in storage.cfg/the manifest) confirms the right key before a multi-GB restore. 10C should do this immediately after Unwrap.
  • F-C5 — pty driving is headless-safe. The slice-7 pty mechanism worked over a non-interactive SSH session with no controlling TTY — same as the daemon. No regression; nothing new needed for consumption.
  • F-C6 — K is never mutated. Create copies before wrapping; sha256(K-file) was identical before, mid-spike, and after. The consumption path only ever reads/writes the fresh-box copy. Safe to run against a live box's escrow without risk to its running key.
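The fingerprint gate from F-C4 could look roughly like the sketch below. It assumes both fingerprints arrive in the same printed form (colon-separated hex) and compares them after trimming and lower-casing; `gateFingerprint`, `normalizeFP`, and the error values are hypothetical names, and obtaining the recovered fingerprint (via KeyFingerprint) is left out because its signature is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical error values for the gate.
var (
	ErrEmptyFingerprint    = errors.New("fingerprint gate: empty fingerprint")
	ErrFingerprintMismatch = errors.New("fingerprint gate: recovered key does not match expected")
)

// normalizeFP trims whitespace and lower-cases a printed fingerprint.
func normalizeFP(fp string) string {
	return strings.ToLower(strings.TrimSpace(fp))
}

// gateFingerprint returns nil only when both fingerprints are non-empty and
// equal after normalization. An empty expected value is rejected so that a
// missing hub field cannot silently pass the gate.
func gateFingerprint(recovered, expected string) error {
	r, e := normalizeFP(recovered), normalizeFP(expected)
	if r == "" || e == "" {
		return ErrEmptyFingerprint
	}
	if r != e {
		return ErrFingerprintMismatch
	}
	return nil
}

func main() {
	// Illustrative values only; these are not real key fingerprints.
	fmt.Println(gateFingerprint("AA:BB:CC:DD", "aa:bb:cc:dd")) // <nil>
	fmt.Println(gateFingerprint("aa:bb:cc:dd", "aa:bb:cc:ee"))
}
```

Running this check immediately after the unwrap keeps a wrong blob/R pairing from reaching the multi-GB restore step.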

4. What a fresh box needs before consumption (input to the 10D / Down-channel design)

Recovery on a re-enrolling box needs exactly four inputs, three of which are not the escrow secret:

  1. the opaque blob — from the hub Down-channel (10A) (the hub stores it; cannot open it);
  2. the recovery code R — from the customer, by hand (two-factor; the hub never holds it);
  3. PBS connection + auth — repo (<user>@<realm>!<token>@<server>:<datastore>), the token secret, and the server fingerprint — these come from the restore directive / identity the hub serves (10D);
  4. the expected key fingerprint (to gate F-C4), also hub-served (it is in the storage manifest).

The blob + R produce K; (3)+(4) come from the hub. This cleanly separates the two factors: the hub serves everything except R, so a hub compromise alone still cannot decrypt (zero-knowledge holds end-to-end through consumption).
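One way to keep the two factors visibly separate in code is to group the inputs by origin. The sketch below is illustrative only: every type and field name is hypothetical and none of it is the 10C/10D interface, which is not yet specified.

```go
package main

import (
	"fmt"
	"strings"
)

// HubServed holds everything the hub can deliver. None of it decrypts alone.
type HubServed struct {
	Blob                   []byte // input 1: opaque escrow blob
	Repository             string // input 3: PBS repository string
	TokenSecret            string // input 3: PBS token secret
	ServerFingerprint      string // input 3: PBS server fingerprint
	ExpectedKeyFingerprint string // input 4: gate value for F-C4
}

// CustomerSupplied holds the one factor the hub never sees.
type CustomerSupplied struct {
	RecoveryCode string // input 2: R, entered by hand
}

// RecoveryInputs is the full set a fresh box needs before consumption.
type RecoveryInputs struct {
	Hub      HubServed
	Customer CustomerSupplied
}

// Validate reports which of the four inputs are missing, if any.
func (in RecoveryInputs) Validate() error {
	var missing []string
	if len(in.Hub.Blob) == 0 {
		missing = append(missing, "blob")
	}
	if in.Customer.RecoveryCode == "" {
		missing = append(missing, "recovery code R")
	}
	if in.Hub.Repository == "" || in.Hub.TokenSecret == "" || in.Hub.ServerFingerprint == "" {
		missing = append(missing, "PBS connection + auth")
	}
	if in.Hub.ExpectedKeyFingerprint == "" {
		missing = append(missing, "expected key fingerprint")
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing recovery inputs: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Placeholder values; R is intentionally left empty to show the failure.
	in := RecoveryInputs{Hub: HubServed{
		Blob:                   []byte("opaque"),
		Repository:             "<repo>",
		TokenSecret:            "<redacted>",
		ServerFingerprint:      "<server-fp>",
		ExpectedKeyFingerprint: "<key-fp>",
	}}
	fmt.Println(in.Validate())
}
```

With this split, a value of type `HubServed` can be serialized over the Down-channel without any risk of R riding along.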

5. GO / NO-GO

GO to spec 10C (escrow consumption) and to build 10D (DR orchestration) around it. The crypto + real-data consumption is proven end-to-end on a genuinely key-less box with the real datastore: recover-from-(blob,R) works, the recovered key decrypts real customer data, a wrong R fails closed, and the live K is never touched. 10C is a thin wrapper over the proven escrow.Unwrap + a fingerprint gate (F-C4) + the existing PBS restore path; the remaining work is the plumbing (10A Down-channel to deliver the blob; 10D to deliver inputs (3)+(4) and orchestrate identity/namespace/tunnel restore + the operator-signed restore-overwrite gate, 10B), not the crypto.

6. Teardown

  • Shredded (shred -u): the recovered key, the fresh-box blob/R copies, the wrong-R test blob + wrong-R file, box1's blob + R.
  • Destroyed: the fenced fresh-box dir incl. the 2.5 G restored rootfs (/root/escrow-spike) and the spike harness binary on box1.
  • Live K byte-unchanged: sha256(/etc/pve/priv/storage/felhom-pbs.enc) identical to the S0 baseline at teardown. No secret committed to git.
  • No secret ever resided on the build server (180): R, K, and the blob were generated/written/shredded only on box1 (162); 180 only compiled the non-secret harness and served datastore ciphertext. The throwaway harness binary/source on 180 (/tmp/escrow-spike*, cmd/escrow-spike/, never committed, build-only checkout that does not feed git) was removed at teardown. (DooPlex — 180 + PBS :8007 + gitea.dooplex.hu — had a transient outage right at teardown, minutes after it served the S3 restore; box1/162 is a separate machine and was unaffected. 180 cleanup completed once it returned.)
  • felhom-spike left as found: the spike used only read-only datastore ops (snapshot list, restore), with no create/delete/prune/forget, so no test snapshot could be orphaned or removed regardless of the final re-list (which could not run because 180 was unreachable; S1 through S4 had already confirmed ct/9001 intact and verified ok).

Out of scope (validated only the crypto + real-data consumption)

  • Hub Down-channel serving the blob/restore-directive back to a re-enrolling box → 10A.
  • Identity / tunnel / PBS-namespace restore + re-enrollment authorization → 10D.
  • Operator-signed restore-overwrite gating → 10B.