felhom.eu/documentation/tests/slice10d-identity-restore-spike-findings.md
admin a22b87e6e3 docs: slice 10D core spike findings (identity-escrow + tunnel re-establishment) — GO
Validated both unvalidated 10D mechanisms: (1) identity-bundle escrow round-trip
via age scrypt+AEAD (recover on a secret-less box, wrong-R fails closed), (2)
Cloudflare tunnel re-establishment — running the recovered token on a new box
routes the hostname there immediately (no DNS change); the old connector is a
hot standby, superseded in routing but not auto-retired -> 10D must rotate the
tunnel/PBS token + retire the stale connector for host-loss security. Redacted;
secrets shredded; live demo untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 23:17:53 +02:00


Slice 10D core — identity-escrow round-trip + tunnel re-establishment: Findings

  • Hosts: "box1"/OLD = demo-felhom (192.168.0.162); "NEW box" = the build server (192.168.0.180).
  • Cloudflare: zone demo-felhom.eu (per operator instruction; see the zone note in §2), tunnel demo-minipc (8b4edf48-…).
  • Tools: cloudflared 2026.6.0; age (filippo.io/age) scrypt+ChaCha20-Poly1305.
  • Date: 2026-06-10.
  • Driver: SPIKE. Validate the two unvalidated mechanisms under the 10D DR capstone (identity-escrow round-trip + tunnel re-establishment) BEFORE speccing the orchestration.

REDACTED by policy. No recovery code R, no Cloudflare tunnel token, no API token, no tunnel connector secret, no identity-bundle token values appear here — only mechanism, command shapes, and routing behaviour. Tunnel/zone/connector identifiers (non-secret) are shown. R + all tokens were staged to 0600 files, referenced by path, and shredded at teardown.


1. Phase S1 — identity-escrow round-trip (age over R) — PASS

The identity bundle {tunnel_token, pbs_token} is wrapped under a recovery code R and recovered on a secret-less box — the K-escrow mechanism (slice 7/10C), applied to the identity bundle.

  • Crypto: age with a scrypt passphrase recipient + ChaCha20-Poly1305 AEAD (the blob header is age-encryption.org/v1 / -> scrypt …). No hand-rolled crypto — a vetted passphrase-AEAD, equivalent to age -p. R = a 10-word EFF-wordlist code (the slice-7 generator).
  • Wrap → recover: wrap(bundle, R) → identity.blob; on a fenced, secret-less fresh box handed ONLY the blob + R, unwrap(blob, R) recovered the bundle byte-identical to the original (sha256 match) — tunnel_token + pbs_token intact.
  • Negative (wrong R): unwrap(blob, WRONG-R) failed closed (incorrect passphrase); no plaintext was emitted and no file was written. This is the same fail-closed behaviour as the K-escrow's wrong-R case.

F-D1 — the identity bundle escrows exactly like K. Same two-factor, zero-knowledge shape: the blob is opaque without R; R is the only out-of-band secret. 10D can reuse the 10C Consume pattern (Unwrap → install) for the identity bundle, with age (or the PBS-key path) as the AEAD.
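The S1 checks can be sketched as a small harness. This is a minimal sketch, not the spike's actual harness: `wrap` and `unwrap` are hypothetical adapters around whatever passphrase-AEAD is used (age in the spike), and `check_round_trip` is an illustrative name.

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare the original and recovered bundle."""
    return hashlib.sha256(data).hexdigest()


def check_round_trip(
    wrap: Callable[[bytes, str], bytes],
    unwrap: Callable[[bytes, str], bytes],
    bundle: bytes,
    r: str,
    wrong_r: str,
) -> dict:
    """Run the two S1 assertions against hypothetical wrap/unwrap adapters."""
    blob = wrap(bundle, r)

    # Positive case: the recovered bundle must be byte-identical (sha256 match).
    recovered = unwrap(blob, r)
    identical = sha256_hex(recovered) == sha256_hex(bundle)

    # Negative case: a wrong R must raise; returning any bytes is a failure.
    try:
        unwrap(blob, wrong_r)
        fails_closed = False
    except Exception:
        fails_closed = True

    return {"identical": identical, "fails_closed": fails_closed}
```

Passing the adapters in as callables keeps the checks independent of the AEAD choice, so the same harness would apply if the PBS-key path replaced age.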

2. Phase S2 — tunnel re-establishment on a NEW box — PASS (with a security caveat)

Zone note: the operator directed the test to the demo-felhom.eu zone (the sajatfelhom.hu throwaway zone resolved IPv6-only and was unreachable from the demo host). A new test subdomain dr-spike.demo-felhom.eu was added (the live *.demo-felhom.eu wildcard + all other records were untouched) and removed at teardown. The tunnel used was the demo's own demo-minipc — its live connector was down (demo guests stopped), so no live traffic was displaced; this made it a faithful "host X is back" test (X = the demo).

Setup

  • A CNAME dr-spike.demo-felhom.eu → <tunnel-id>.cfargotunnel.com (proxied) was created with the zone-scoped API token (DNS:Edit). The same token lacked Account Cloudflare Tunnel:Edit (cfd_tunnel → auth error) — so the tunnel's ingress config could not be set via the API.
  • OLD box (162) + NEW box (180): each ran cloudflared with the recovered tunnel token, plus a distinguishable HTTPS origin ("OLD box" / "NEW box") behind the hostname the remote ingress expects.

Results

  • Routing to the connector works. dr-spike.demo-felhom.eu → Cloudflare edge → the tunnel → the running connector (the cloudflared log shows dest=https://dr-spike.demo-felhom.eu/ arriving at the connector). The DNS CNAME → tunnel is stable; only the connector moves — no DNS change is needed to move a hostname to a new box.
  • New box takes over routing immediately. With BOTH connectors up (OLD 162 + NEW 180), 14/14 requests were served from NEW and 0 from OLD. In this test Cloudflare routed to the most recently established connector; this is observed behaviour from a small sample, not a documented guarantee.
  • Old connector is a HOT STANDBY, not auto-retired. The OLD connector stayed active + registered (no unregister/lost events) while serving 0 traffic. On stopping NEW, traffic fell back to OLD (6/6) within seconds — so OLD was a live failover the whole time.
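The 14/14 and 6/6 counts come from tallying which origin label answered each request. A minimal sketch of that tally follows; `fetch` is a hypothetical callable that performs one request through the edge and returns the response body ("OLD box" or "NEW box"), and `tally_origins` is an illustrative name.

```python
from collections import Counter
from typing import Callable


def tally_origins(fetch: Callable[[], str], attempts: int) -> Counter:
    """Call fetch() `attempts` times and count which origin label answered."""
    counts: Counter = Counter()
    for _ in range(attempts):
        try:
            label = fetch().strip()
        except Exception:
            # Count failed requests separately instead of aborting the run.
            label = "error"
        counts[label] += 1
    return counts
```

Running it once with both connectors up and again after stopping NEW would reproduce the takeover and fallback observations as two counters.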

F-D2 — a tunnel TOKEN carries the credentials, but a remotely-managed tunnel ignores local ingress. The base64 token decodes to {AccountTag, TunnelID, TunnelSecret} → a cloudflared credentials file. BUT for a remotely-managed tunnel (dashboard/API config), cloudflared uses the REMOTE ingress (here originService=https://traefik) and ignores any local config.yml. So on DR the new box's connector serves the tunnel's hub/dashboard-owned ingress → the origin service (traefik/the app) must be running on the new box (the restore orchestration brings it up; the 502 here was only the missing origin, not a routing failure). Alternatively DR uses a locally-managed tunnel (credentials file + local config) for full local ingress control.
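The token-to-credentials mapping in F-D2 can be sketched as follows. The short JSON keys `a`, `t`, and `s` are an assumption about the token's internal layout (the findings name only the three long fields), so check them against the cloudflared version in use; `token_to_credentials` is an illustrative name.

```python
import base64
import json


def token_to_credentials(token: str) -> dict:
    """Decode a tunnel token into a credentials-file shaped dict.

    Assumption: the token is base64 of a JSON object with short keys
    a (AccountTag), t (TunnelID), s (TunnelSecret).
    """
    # Restore any stripped base64 padding before decoding.
    padded = token + "=" * (-len(token) % 4)
    raw = json.loads(base64.b64decode(padded))
    return {
        "AccountTag": raw["a"],
        "TunnelID": raw["t"],
        "TunnelSecret": raw["s"],
    }
```

The result is secret material: write it with 0600 permissions and shred it at teardown, as the spike did with its credentials file.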

F-D3 — identity continuity is automatic on running the recovered token. Recovered tunnel token → run cloudflared on the new box → the customer's hostname routes to the new box, no DNS edit, no operator routing step. This is the "the host is back as host X" mechanism.

F-D4 / the load-bearing DR consequence: the OLD connector + token stay valid, so 10D MUST rotate. Routing needs no explicit old-connector retirement (newest wins, old is standby). BUT the old box's connector remains registered and authenticated with the SAME tunnel token, and a leaked copy of that token still grants tunnel access. In host-LOSS DR (the old box is gone/untrusted/compromised), that is a security gap: a recovered old box (or anyone holding a leaked token) can silently re-register and co-serve the customer's hostname. After re-establishment, 10D must ROTATE the tunnel token (and PBS token) and/or explicitly delete the stale connector (cleanup_connections / the connector DELETE API). This needs an Account Cloudflare Tunnel:Edit token, which the geo-restriction zone token does NOT have; the hub's CF credential placement (design-review S4) must cover tunnel + connector management for DR, not just WAF.
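The rotation step could be planned as three API calls. This is a sketch only, and was not exercised in the spike (the token lacked the permission): the endpoint paths and the `tunnel_secret` field are assumptions based on the public Cloudflare API v4 and must be verified before use. `PlannedRequest` and `plan_rotation` are hypothetical names, and PBS token rotation is not covered.

```python
import base64
import secrets
from dataclasses import dataclass, field

API = "https://api.cloudflare.com/client/v4"


@dataclass
class PlannedRequest:
    method: str
    url: str
    body: dict = field(default_factory=dict)


def plan_rotation(account_id: str, tunnel_id: str) -> list[PlannedRequest]:
    """Build (not send) the requests for rotating a tunnel's secret.

    Assumed endpoints; each needs an Account Cloudflare Tunnel:Edit token.
    """
    base = f"{API}/accounts/{account_id}/cfd_tunnel/{tunnel_id}"
    # Fresh 32-byte secret, base64-encoded.
    new_secret = base64.b64encode(secrets.token_bytes(32)).decode()
    return [
        # 1. Replace the tunnel secret so the old token stops authenticating.
        PlannedRequest("PATCH", base, {"tunnel_secret": new_secret}),
        # 2. Drop existing connections, including the stale OLD connector.
        PlannedRequest("DELETE", f"{base}/connections"),
        # 3. Fetch the new token to install on the NEW box.
        PlannedRequest("GET", f"{base}/token"),
    ]
```

Because step 2 also drops the NEW box's connection, the orchestration would need to restart its connector with the token from step 3.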

Gotchas (test-environment, not DR)

  • Split-horizon DNS: the LAN pi-hole resolves *.demo-felhom.eu → 192.168.0.162, masking the Cloudflare edge from internal hosts. Tested via the real edge with curl --resolve <host>:443:<CF-IP> (CF IP from dig @1.1.1.1).
  • Origin TLS: the remote ingress origin was https://traefik; the spike pointed traefik → 127.0.0.1 (/etc/hosts) at a self-signed HTTPS responder, which the remote config accepted (its originRequest.noTLSVerify is set for the internal traefik). On DR the new box must present the real origin.
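The split-horizon workaround amounts to pinning the hostname to a Cloudflare edge IP for a single request. A minimal sketch of building that curl invocation follows; `curl_edge_command` and `fetch_via_edge` are illustrative names, and the edge IP is whatever `dig @1.1.1.1` returns for the hostname.

```python
import subprocess


def curl_edge_command(host: str, edge_ip: str, path: str = "/") -> list[str]:
    """Build a curl argv that bypasses local DNS for `host`."""
    return [
        "curl", "-sS", "--max-time", "10",
        # Pin host:443 to the edge IP; SNI and the Host header stay `host`.
        "--resolve", f"{host}:443:{edge_ip}",
        f"https://{host}{path}",
    ]


def fetch_via_edge(host: str, edge_ip: str) -> str:
    """Run the request and return the response body (raises on curl failure)."""
    result = subprocess.run(
        curl_edge_command(host, edge_ip),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Pinning per request avoids touching /etc/hosts or the pi-hole, so the LAN resolver stays as it was for everything else.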

3. GO / NO-GO

GO to spec 10D. Both unvalidated mechanisms are proven:

  1. The identity bundle escrows + recovers exactly like K (age scrypt+AEAD; wrong-R fails closed) → reuse the 10C Consume shape.
  2. Tunnel re-establishment is automatic: run the recovered token on the new box → the customer's hostname routes there (no DNS step). The old connector is a hot standby, superseded in routing.

The 10D spec MUST include (consequences of this spike):

  • Identity-escrow CREATION at provisioning (extend slice-7 escrow to also emit the identity blob: {tunnel_token, pbs_token, …} wrapped under the SAME R, or a sibling blob) — so DR has it.
  • Restore-mode consumption of the identity blob (10C Consume pattern; R by hand) + install the tunnel/PBS tokens.
  • The new box must run the tunnel's expected origin (restore orchestration brings up traefik/apps before/with the connector), OR DR uses a locally-managed tunnel config.
  • Cred ROTATION after re-establishment (rotate tunnel + PBS tokens; delete the stale connector) — the security capstone for host-LOSS DR. Requires an Account Cloudflare-Tunnel-scoped credential on the hub (broader than the current WAF-only zone token).

4. Teardown (verify the live demo is untouched)

  • Connectors stopped + removed on both boxes (cloudflared + the HTTPS/responder units); cloudflared binaries removed; /etc/hosts traefik entries removed.
  • DNS: the throwaway dr-spike.demo-felhom.eu record deleted; the live *.demo-felhom.eu wildcard + all other records untouched; the sajatfelhom.hu test record (created then abandoned on the zone-switch) deleted.
  • Tunnel: its remote config was never modified (the API token lacked cfd_tunnel permission) — so demo-minipc returns to exactly its prior state (no spike connectors; the demo's own connector reclaims it when the demo guest restarts).
  • Secrets shredded: R, the identity bundle/blob, the tunnel token, the API token, the cloudflared credentials file (AccountTag/TunnelID/TunnelSecret), the throwaway age harness. No secret committed.

Out of scope (note only; don't build here, deferred to the 10D spec)

  • The recovery-mode toggle + re-enroll handshake + cred rotation.
  • Identity-escrow creation wired into provisioning (slice-7 escrow extension).
  • The restore orchestration (consume → pull → RestoreLXC → bring up origin → re-establish under identity).