docs: slice 10D core spike findings (identity-escrow + tunnel re-establishment) — GO
Validated both unvalidated 10D mechanisms: (1) identity-bundle escrow round-trip via age scrypt+AEAD (recover on a secret-less box, wrong-R fails closed), (2) Cloudflare tunnel re-establishment — running the recovered token on a new box routes the hostname there immediately (no DNS change); the old connector is a hot standby, superseded in routing but not auto-retired -> 10D must rotate the tunnel/PBS token + retire the stale connector for host-loss security. Redacted; secrets shredded; live demo untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -4,28 +4,44 @@
---

# REPORT — Slice 10C (docs only): escrow consumption productionized (2026-06-10)
# REPORT — Slice 10D core SPIKE: identity-escrow round-trip + tunnel re-establishment (2026-06-10)

## Type

Documentation update for **slice 10C** (implementation is **agent-only**: `felhom-agent` v0.17.0 —
`escrow.Consume`). **No hub code change** — 10C reads a restore directive it is given; 10D wires the
hub side (serving the blob + expected fingerprint + PBS connection, prompting for R).
SPIKE runbook (CC-executed on the demo). Validated the two unvalidated mechanisms under the 10D DR
capstone **before** speccing the orchestration. Deliverable: the redacted findings doc
[`documentation/tests/slice10d-identity-restore-spike-findings.md`](documentation/tests/slice10d-identity-restore-spike-findings.md).
Handled crown jewels (R + identity/tunnel tokens) — staged `0600`, by reference, **shredded** at teardown; no secret committed.

## What changed (doc 03 — host-agent)
## Results — GO to spec 10D

- **§8a**: escrow **consumption** is now a real, tested path (`escrow.Consume` = **Unwrap →
fingerprint-gate → install**), replacing the throwaway spike harness. The spike findings are baked
in: F-C2 (install the raw key where the restore reads it), **F-C3** (wrong R fails closed), **F-C4**
(fingerprint-gate *before* any multi-GB restore), **F-C6** (blob read-only/retryable, `K` never
mutated). **Zero-knowledge holds end-to-end**: the hub serves the blob + expected fingerprint + PBS
connection; **R comes from the customer by hand, never the hub** — a hub compromise alone cannot
decrypt.
- **§9 slice table**: **10C done**. **10D** (DR capstone — re-enroll in restore mode, serve the
directive, consume, restore guests + identity, reuse the 10B gate for restore-overwrite, the
re-enrollment-auth fork) is the last piece of slice 10.
**S1 — identity-escrow round-trip (age):** the identity bundle `{tunnel_token, pbs_token}` wraps under
an EFF-wordlist `R` via **age (scrypt + ChaCha20-Poly1305 AEAD)**, recovers **byte-identical** on a
secret-less fresh box given only blob + R, and a **wrong R fails closed** (no plaintext). Mirrors the
proven K-escrow → 10D reuses the 10C `Consume` shape for the identity bundle.

## Pending
**S2 — tunnel re-establishment:** running the recovered Cloudflare tunnel token's connector on a NEW
box → the customer's hostname routes to it **immediately, no DNS change** (the CNAME→tunnel is stable;
only the connector moves). With both connectors up, 14/14 requests served from NEW; stopping NEW fell
back to OLD (6/6) — **the old connector is a hot standby, superseded in routing but NOT auto-retired.**

- Live validation runs against the demo (agent v0.17.0): create escrow → `Consume` → restore real
data with the consumed key; wrong R → clean failure, nothing installed; live `K` byte-unchanged.
**Load-bearing consequence for 10D:** routing failover is automatic, but the old box's connector + the
(same) tunnel token stay valid → **10D must rotate the tunnel/PBS tokens and/or delete the stale
connector after re-establishment** (host-LOSS security). That needs an **Account Cloudflare-Tunnel
-scoped** hub credential (broader than the current WAF-only zone token) — feeds the design-review S4
CF-token-placement decision. Also: a remotely-managed tunnel uses its **dashboard ingress** (cloudflared
ignores local config), so the new box must run the tunnel's expected origin (the restore orchestration
brings it up).

## Safety / teardown

Per operator instruction the test used a **new** `dr-spike.demo-felhom.eu` subdomain on the demo's own
(idle — guests down) tunnel; the live `*.demo-felhom.eu` wildcard + all other records were **untouched**,
the tunnel's remote config was **never modified** (the zone API token lacks `cfd_tunnel` permission), and
the throwaway subdomain + both connectors + all secrets were removed/shredded at teardown. The demo
returns to exactly its prior state.

## Out of scope (→ 10D spec)

Recovery-mode toggle + re-enroll handshake + cred rotation; identity-escrow creation wired into
provisioning; the restore orchestration (consume → pull → `RestoreLXC` → bring up origin → re-establish).