docs: slice 10D core spike findings (identity-escrow + tunnel re-establishment) — GO
Validated both unvalidated 10D mechanisms: (1) identity-bundle escrow round-trip via age scrypt+AEAD (recover on a secret-less box, wrong-R fails closed), (2) Cloudflare tunnel re-establishment — running the recovered token on a new box routes the hostname there immediately (no DNS change); the old connector is a hot standby, superseded in routing but not auto-retired -> 10D must rotate the tunnel/PBS token + retire the stale connector for host-loss security. Redacted; secrets shredded; live demo untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -4,28 +4,44 @@

 ---

-# REPORT — Slice 10C (docs only): escrow consumption productionized (2026-06-10)
+# REPORT — Slice 10D core SPIKE: identity-escrow round-trip + tunnel re-establishment (2026-06-10)

 ## Type

-Documentation update for **slice 10C** (implementation is **agent-only**: `felhom-agent` v0.17.0 —
-`escrow.Consume`). **No hub code change** — 10C reads a restore directive it is given; 10D wires the
-hub side (serving the blob + expected fingerprint + PBS connection, prompting for R).
+SPIKE runbook (CC-executed on the demo). Validated the two unvalidated mechanisms under the 10D DR
+capstone **before** speccing the orchestration. Deliverable: the redacted findings doc
+[`documentation/tests/slice10d-identity-restore-spike-findings.md`](documentation/tests/slice10d-identity-restore-spike-findings.md).
+Handled crown jewels (R + identity/tunnel tokens) — staged `0600`, by reference, **shredded** at teardown; no secret committed.

-## What changed (doc 03 — host-agent)
+## Results — GO to spec 10D

-- **§8a**: escrow **consumption** is now a real, tested path (`escrow.Consume` = **Unwrap →
-  fingerprint-gate → install**), replacing the throwaway spike harness. The spike findings are baked
-  in: F-C2 (install the raw key where the restore reads it), **F-C3** (wrong R fails closed), **F-C4**
-  (fingerprint-gate *before* any multi-GB restore), **F-C6** (blob read-only/retryable, `K` never
-  mutated). **Zero-knowledge holds end-to-end**: the hub serves the blob + expected fingerprint + PBS
-  connection; **R comes from the customer by hand, never the hub** — a hub compromise alone cannot
-  decrypt.
-- **§9 slice table**: **10C done**. **10D** (DR capstone — re-enroll in restore mode, serve the
-  directive, consume, restore guests + identity, reuse the 10B gate for restore-overwrite, the
-  re-enrollment-auth fork) is the last piece of slice 10.
+**S1 — identity-escrow round-trip (age):** the identity bundle `{tunnel_token, pbs_token}` wraps under
+an EFF-wordlist `R` via **age (scrypt + ChaCha20-Poly1305 AEAD)**, recovers **byte-identical** on a
+secret-less fresh box given only blob + R, and a **wrong R fails closed** (no plaintext). Mirrors the
+proven K-escrow → 10D reuses the 10C `Consume` shape for the identity bundle.

-## Pending
+**S2 — tunnel re-establishment:** running the recovered Cloudflare tunnel token's connector on a NEW
+box → the customer's hostname routes to it **immediately, no DNS change** (the CNAME→tunnel is stable;
+only the connector moves). With both connectors up, 14/14 requests served from NEW; stopping NEW fell
+back to OLD (6/6) — **the old connector is a hot standby, superseded in routing but NOT auto-retired.**

-- Live validation runs against the demo (agent v0.17.0): create escrow → `Consume` → restore real
-  data with the consumed key; wrong R → clean failure, nothing installed; live `K` byte-unchanged.
+**Load-bearing consequence for 10D:** routing failover is automatic, but the old box's connector + the
+(same) tunnel token stay valid → **10D must rotate the tunnel/PBS tokens and/or delete the stale
+connector after re-establishment** (host-LOSS security). That needs an **Account Cloudflare-Tunnel
+-scoped** hub credential (broader than the current WAF-only zone token) — feeds the design-review S4
+CF-token-placement decision. Also: a remotely-managed tunnel uses its **dashboard ingress** (cloudflared
+ignores local config), so the new box must run the tunnel's expected origin (the restore orchestration
+brings it up).
+
+## Safety / teardown
+
+Per operator instruction the test used a **new** `dr-spike.demo-felhom.eu` subdomain on the demo's own
+(idle — guests down) tunnel; the live `*.demo-felhom.eu` wildcard + all other records were **untouched**,
+the tunnel's remote config was **never modified** (the zone API token lacks `cfd_tunnel` permission), and
+the throwaway subdomain + both connectors + all secrets were removed/shredded at teardown. The demo
+returns to exactly its prior state.
+
+## Out of scope (→ 10D spec)
+
+Recovery-mode toggle + re-enroll handshake + cred rotation; identity-escrow creation wired into
+provisioning; the restore orchestration (consume → pull → `RestoreLXC` → bring up origin → re-establish).
@@ -0,0 +1,129 @@
# Slice 10D core — identity-escrow round-trip + tunnel re-establishment: Findings

**Hosts:** "box1"/OLD = `demo-felhom` (192.168.0.162); "NEW box" = the build server (192.168.0.180).
Cloudflare zone `demo-felhom.eu` (per operator instruction — see the zone note in §2), tunnel
**`demo-minipc`** (`8b4edf48-…`). `cloudflared` 2026.6.0; `age` (filippo.io/age) scrypt+ChaCha20-Poly1305.
**Date:** 2026-06-10. **Driver:** SPIKE — validate the two unvalidated mechanisms under the 10D DR
capstone (identity-escrow round-trip + tunnel re-establishment) BEFORE speccing the orchestration.

> **REDACTED by policy.** No recovery code `R`, no Cloudflare **tunnel token**, no **API token**, no
> tunnel **connector secret**, no identity-bundle token values appear here — only mechanism, command
> *shapes*, and routing *behaviour*. Tunnel/zone/connector *identifiers* (non-secret) are shown. R +
> all tokens were staged to `0600` files, referenced by path, and **shredded at teardown**.

---

## 1. Phase S1 — identity-escrow round-trip (age over R) — **PASS**

The identity bundle `{tunnel_token, pbs_token}` is wrapped under a recovery code `R` and recovered on
a secret-less box — the K-escrow mechanism (slice 7/10C), applied to the identity bundle.

- **Crypto:** `age` with a **scrypt** passphrase recipient + **ChaCha20-Poly1305 AEAD** (the blob
  header is `age-encryption.org/v1` / `-> scrypt …`). No hand-rolled crypto — a vetted
  passphrase-AEAD, equivalent to `age -p`. `R` = a 10-word EFF-wordlist code (the slice-7 generator).
- **Wrap → recover:** `wrap(bundle, R) → identity.blob`; on a **fenced, secret-less fresh box** handed
  ONLY the blob + R, `unwrap(blob, R)` recovered the bundle **byte-identical** to the original
  (`sha256` match) — `tunnel_token` + `pbs_token` intact.
- **Negative (wrong R):** `unwrap(blob, WRONG-R)` **failed closed** — `incorrect passphrase`, **no
  plaintext emitted** (no file written). Identical fail-closed behaviour to the K-escrow's wrong-R.

**F-D1 — the identity bundle escrows exactly like `K`.** Same two-factor, zero-knowledge shape: the
blob is opaque without `R`; `R` is the only out-of-band secret. 10D can reuse the **10C `Consume`
pattern** (Unwrap → install) for the identity bundle, with `age` (or the PBS-key path) as the AEAD.
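
The Unwrap → install shape and its fail-closed property can be sketched as follows. This is a minimal illustration, not the agent's `escrow.Consume`: the names `consume`, `fake_wrap`, `fake_unwrap` and `UnwrapError` are hypothetical, and the stand-in wrap/unwrap pair performs no real encryption (the spike used `age` with a scrypt recipient). What it shows is the ordering: the unwrap must succeed before any file is created, so a wrong `R` leaves nothing on disk.

```python
import hashlib
import os
import tempfile
from typing import Callable


class UnwrapError(Exception):
    """The blob could not be opened with the supplied recovery code."""


def fake_wrap(bundle: bytes, r: str) -> bytes:
    # Stand-in for the age scrypt wrap; NOT real encryption, illustration only.
    return hashlib.sha256(r.encode()).hexdigest().encode() + b":" + bundle


def fake_unwrap(blob: bytes, r: str) -> bytes:
    # Stand-in for the age scrypt unwrap: rejects a wrong R before returning anything.
    tag, _, body = blob.partition(b":")
    if tag != hashlib.sha256(r.encode()).hexdigest().encode():
        raise UnwrapError("incorrect passphrase")
    return body


def consume(blob: bytes, r: str, dest: str,
            unwrap: Callable[[bytes, str], bytes] = fake_unwrap) -> str:
    """Unwrap, then install. Returns the sha256 of the installed bundle."""
    plaintext = unwrap(blob, r)  # a wrong R raises here, before any file exists
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")  # created 0600
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(plaintext)
        os.replace(tmp, dest)  # atomic rename: dest is either absent or complete
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return hashlib.sha256(plaintext).hexdigest()
```

The returned digest can be compared against the sha256 of the original bundle, which is how the spike established the byte-identical round trip. In a real implementation the `unwrap` argument would be backed by the vetted passphrase-AEAD, never by the stand-in above.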

## 2. Phase S2 — tunnel re-establishment on a NEW box — **PASS (with a security caveat)**

**Zone note:** the operator directed the test to the `demo-felhom.eu` zone (the `sajatfelhom.hu`
throwaway zone resolved IPv6-only and was unreachable from the demo host). A **new** test subdomain
`dr-spike.demo-felhom.eu` was added (the live `*.demo-felhom.eu` wildcard + all other records were
**untouched**) and removed at teardown. The tunnel used was the demo's own `demo-minipc` — its live
connector was **down** (demo guests stopped), so no live traffic was displaced; this made it a
faithful "host X is back" test (X = the demo).

### Setup
- A `CNAME dr-spike.demo-felhom.eu → <tunnel-id>.cfargotunnel.com` (proxied) was created with the
  **zone-scoped** API token (DNS:Edit). The same token **lacked** Account `Cloudflare Tunnel:Edit`
  (`cfd_tunnel` → auth error) — so the tunnel's ingress config could not be set via the API.
- OLD box (162) + NEW box (180): each ran `cloudflared` with the **recovered tunnel token**, plus a
  distinguishable HTTPS origin ("OLD box" / "NEW box") behind the hostname the remote ingress expects.

### Results
- **Routing to the connector works.** `dr-spike.demo-felhom.eu` → Cloudflare edge → the tunnel →
  **the running connector** (the cloudflared log shows `dest=https://dr-spike.demo-felhom.eu/`
  arriving at the connector). The DNS CNAME → tunnel is **stable**; only the *connector* moves — **no
  DNS change is needed to move a hostname to a new box.**
- **New box takes over routing immediately.** With BOTH connectors up (OLD 162 + NEW 180), **14/14**
  requests served from **NEW**; **0** from OLD. Cloudflare routes to the most-recently-established
  connector.
- **Old connector is a HOT STANDBY, not auto-retired.** The OLD connector stayed **active +
  registered** (no unregister/lost events) while serving 0 traffic. On **stopping NEW**, traffic
  **fell back to OLD (6/6)** within seconds — so OLD was a live failover the whole time.
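
The 14/14 and 6/6 figures are tallies of which origin answered each request. A minimal sketch of such a tally, assuming each origin returns a distinguishable body such as "NEW box" or "OLD box": `tally_origins` and the `fetch` callable are hypothetical names, and in the spike the request itself went through the real Cloudflare edge rather than a stub.

```python
from collections import Counter
from typing import Callable


def tally_origins(fetch: Callable[[], str], n: int) -> Counter:
    """Issue n requests through fetch() and count responses by origin label."""
    return Counter(fetch().strip() for _ in range(n))


# Stub standing in for an HTTPS request to the test hostname.
def always_new() -> str:
    return "NEW box\n"


counts = tally_origins(always_new, 14)
```

Running the same tally again after stopping one connector is enough to see whether traffic fell back to the other box.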

**F-D2 — a tunnel TOKEN carries the credentials, but a remotely-managed tunnel ignores local
ingress.** The base64 token decodes to `{AccountTag, TunnelID, TunnelSecret}` → a cloudflared
credentials file. BUT for a **remotely-managed** tunnel (dashboard/API config), cloudflared uses the
**REMOTE** ingress (here `originService=https://traefik`) and **ignores any local `config.yml`**. So
on DR the new box's connector serves the tunnel's **hub/dashboard-owned** ingress → the **origin
service (traefik/the app) must be running on the new box** (the restore orchestration brings it up;
the 502 here was only the missing origin, not a routing failure). Alternatively DR uses a
**locally-managed** tunnel (credentials file + local config) for full local ingress control.
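
The token-to-credentials mapping in F-D2 can be sketched as below. This is an illustration under one stated assumption: that the decoded token is a JSON object using the short keys `a`, `t` and `s` for the account tag, tunnel ID and tunnel secret. The findings only name the three resulting fields, so treat the key names as unverified, and the function name `token_to_credentials` as hypothetical. Only synthetic values should ever be passed through something like this outside a fenced box.

```python
import base64
import json


def token_to_credentials(token: str) -> dict:
    """Decode a base64 tunnel token into the three credentials-file fields."""
    padded = token + "=" * (-len(token) % 4)  # tolerate stripped base64 padding
    raw = json.loads(base64.b64decode(padded))
    # ASSUMPTION: short keys a/t/s; not confirmed by the spike findings.
    return {
        "AccountTag": raw["a"],
        "TunnelID": raw["t"],
        "TunnelSecret": raw["s"],
    }


# Synthetic token built from placeholder values, never a real secret.
fake_token = base64.b64encode(
    json.dumps({"a": "acct-tag", "t": "tunnel-id", "s": "c2VjcmV0"}).encode()
).decode()
creds = token_to_credentials(fake_token)
```

Because the token is self-sufficient in this way, anyone holding it can start a connector, which is why F-D4 treats rotation as mandatory after host loss.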

**F-D3 — identity continuity is automatic on running the recovered token.** Recovered tunnel token →
run cloudflared on the new box → the customer's hostname routes to the new box, **no DNS edit, no
operator routing step.** This is the "the host is back as host X" mechanism.

**F-D4 / the load-bearing DR consequence — the OLD connector + token stay valid → 10D MUST rotate.**
Routing needs **no** explicit old-connector retirement (newest wins, old is standby). BUT the old
box's connector remains **registered and authenticated with the SAME tunnel token**, and the leaked
token still grants tunnel access. In **host-LOSS DR** (the old box is gone/untrusted/compromised),
that is a security gap: a recovered old box (or a leaked token) can silently **re-register and
co-serve** the customer's hostname. **10D must, after re-establishment, ROTATE the tunnel token (and
PBS token) and/or explicitly delete the stale connector** (`cleanup_connections` / the connector
DELETE API) — this needs an **Account `Cloudflare Tunnel:Edit` token**, which the geo-restriction
zone token does NOT have (the hub's CF credential placement, design-review S4, must cover tunnel +
connector management for DR, not just WAF).

### Gotchas (test-environment, not DR)
- **Split-horizon DNS:** the LAN pi-hole resolves `*.demo-felhom.eu → 192.168.0.162`, masking the
  Cloudflare edge from internal hosts. Tested via the real edge with `curl --resolve <host>:443:<CF-IP>`
  (CF IP from `dig @1.1.1.1`).
- **Origin TLS:** the remote ingress origin was `https://traefik`; the spike pointed `traefik →
  127.0.0.1` (`/etc/hosts`) at a self-signed HTTPS responder, which the remote config accepted (its
  `originRequest.noTLSVerify` is set for the internal traefik). On DR the new box must present the
  real origin.

## 3. GO / NO-GO

**GO** to spec **10D**. Both unvalidated mechanisms are proven:
1. The **identity bundle escrows + recovers exactly like `K`** (age scrypt+AEAD; wrong-R fails closed)
   → reuse the 10C `Consume` shape.
2. **Tunnel re-establishment is automatic**: run the recovered token on the new box → the customer's
   hostname routes there (no DNS step). The old connector is a hot standby, superseded in routing.

**The 10D spec MUST include (consequences of this spike):**
- **Identity-escrow CREATION at provisioning** (extend slice-7 escrow to also emit the identity blob:
  `{tunnel_token, pbs_token, …}` wrapped under the SAME `R`, or a sibling blob) — so DR has it.
- **Restore-mode consumption** of the identity blob (10C `Consume` pattern; `R` by hand) + install
  the tunnel/PBS tokens.
- **The new box must run the tunnel's expected origin** (restore orchestration brings up traefik/apps
  before/with the connector), OR DR uses a locally-managed tunnel config.
- **Cred ROTATION after re-establishment** (rotate tunnel + PBS tokens; delete the stale connector) —
  the security capstone for host-LOSS DR. Requires an **Account Cloudflare-Tunnel-scoped** credential
  on the hub (broader than the current WAF-only zone token).
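
The ordering implied by these requirements (consume, bring up the origin, re-establish, and only then rotate) can be sketched as a fail-stop step runner. This is a hypothetical outline to make the sequencing concrete, not the 10D orchestration, which is still to be specced. The step names and the `run_dr` function are illustrative, and the step bodies are no-op placeholders.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def run_dr(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; stop at the first failure so later steps never run."""
    done: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return done, f"{name}: {exc}"
        done.append(name)
    return done, None


def noop() -> None:
    # Placeholder for a real step (consume, restore, start connector, ...).
    pass


def failing_origin() -> None:
    raise RuntimeError("origin not reachable")


plan: List[Step] = [
    ("consume identity blob", noop),
    ("bring up origin", failing_origin),
    ("re-establish tunnel", noop),
    ("rotate tokens + delete stale connector", noop),
]
completed, error = run_dr(plan)
```

With the origin step failing, the run stops before re-establishment and rotation, so credentials are never rotated for a box that is not actually serving.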

## 4. Teardown (verify the live demo is untouched)
- **Connectors stopped + removed** on both boxes (cloudflared + the HTTPS/responder units); `cloudflared`
  binaries removed; `/etc/hosts` `traefik` entries removed.
- **DNS:** the throwaway `dr-spike.demo-felhom.eu` record **deleted**; the live `*.demo-felhom.eu`
  wildcard + all other records **untouched**; the `sajatfelhom.hu` test record (created then abandoned
  on the zone-switch) **deleted**.
- **Tunnel:** its **remote config was never modified** (the API token lacked `cfd_tunnel` permission) —
  so `demo-minipc` returns to exactly its prior state (no spike connectors; the demo's own connector
  reclaims it when the demo guest restarts).
- **Secrets shredded:** `R`, the identity bundle/blob, the tunnel token, the API token, the cloudflared
  credentials file (`AccountTag/TunnelID/TunnelSecret`), the throwaway `age` harness. No secret committed.

## Out of scope (note; don't build — → 10D spec)
- The recovery-mode toggle + re-enroll handshake + **cred rotation**.
- Identity-escrow **creation wired into provisioning** (slice-7 escrow extension).
- The **restore orchestration** (consume → pull → `RestoreLXC` → bring up origin → re-establish under identity).