Validated both unvalidated 10D mechanisms: (1) identity-bundle escrow round-trip via age scrypt+AEAD (recover on a secret-less box, wrong-R fails closed), (2) Cloudflare tunnel re-establishment — running the recovered token on a new box routes the hostname there immediately (no DNS change); the old connector is a hot standby, superseded in routing but not auto-retired -> 10D must rotate the tunnel/PBS token + retire the stale connector for host-loss security. Redacted; secrets shredded; live demo untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Slice 10D core — identity-escrow round-trip + tunnel re-establishment: Findings
Hosts: "box1"/OLD = demo-felhom (192.168.0.162); "NEW box" = the build server (192.168.0.180).
Cloudflare zone demo-felhom.eu (per operator instruction — see the zone note in §2), tunnel
demo-minipc (8b4edf48-…). cloudflared 2026.6.0; age (filippo.io/age) scrypt+ChaCha20-Poly1305.
Date: 2026-06-10. Driver: SPIKE — validate the two unvalidated mechanisms under the 10D DR
capstone (identity-escrow round-trip + tunnel re-establishment) BEFORE speccing the orchestration.
REDACTED by policy. No recovery code R, no Cloudflare tunnel token, no API token, no tunnel connector secret, and no identity-bundle token values appear here: only mechanism, command shapes, and routing behaviour. Tunnel/zone/connector identifiers (non-secret) are shown. R + all tokens were staged to 0600 files, referenced by path, and shredded at teardown.
1. Phase S1 — identity-escrow round-trip (age over R) — PASS
The identity bundle {tunnel_token, pbs_token} is wrapped under a recovery code R and recovered on
a secret-less box — the K-escrow mechanism (slice 7/10C), applied to the identity bundle.
- Crypto: age with a scrypt passphrase recipient + ChaCha20-Poly1305 AEAD (the blob header is age-encryption.org/v1 / -> scrypt …). No hand-rolled crypto: a vetted passphrase-AEAD, equivalent to age -p. R = a 10-word EFF-wordlist code (the slice-7 generator).
- Wrap → recover: wrap(bundle, R) → identity.blob; on a fenced, secret-less fresh box handed ONLY the blob + R, unwrap(blob, R) recovered the bundle byte-identical to the original (sha256 match), with tunnel_token + pbs_token intact.
- Negative (wrong R): unwrap(blob, WRONG-R) failed closed (incorrect passphrase), no plaintext emitted (no file written). Identical fail-closed behaviour to the K-escrow's wrong-R.
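The positive and negative checks above can be sketched as a small harness. This is a minimal sketch, not the spike's actual harness: `check_round_trip` and the `wrap`/`unwrap` callables are hypothetical names, and the callables are assumed to delegate to age (scrypt recipient) rather than implement any crypto themselves.

```python
import hashlib
from typing import Callable

# Hypothetical signatures (assumptions, not the spike's API):
#   wrap(plaintext, passphrase) -> blob
#   unwrap(blob, passphrase)    -> plaintext, raising on a wrong passphrase
Wrap = Callable[[bytes, str], bytes]
Unwrap = Callable[[bytes, str], bytes]


def check_round_trip(wrap: Wrap, unwrap: Unwrap, bundle: bytes,
                     r: str, wrong_r: str) -> dict:
    """Run the S1 wrap -> recover check and the wrong-R negative check."""
    blob = wrap(bundle, r)

    # Positive: the recovered bundle must be byte-identical (sha256 match).
    recovered = unwrap(blob, r)
    identical = (hashlib.sha256(recovered).digest()
                 == hashlib.sha256(bundle).digest())

    # Negative: a wrong R must raise and must not hand back any plaintext.
    leaked = None
    try:
        leaked = unwrap(blob, wrong_r)
        raised = False
    except Exception:
        raised = True

    return {
        "blob_is_opaque": bundle not in blob,
        "round_trip_identical": identical,
        "wrong_r_failed_closed": raised and leaked is None,
    }
```

All three values should be True for a PASS; any False means the escrow did not round-trip or did not fail closed.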
F-D1 — the identity bundle escrows exactly like K. Same two-factor, zero-knowledge shape: the
blob is opaque without R; R is the only out-of-band secret. 10D can reuse the 10C Consume
pattern (Unwrap → install) for the identity bundle, with age (or the PBS-key path) as the AEAD.
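The Unwrap → install shape can be sketched as follows. This is an illustration under assumptions, not the 10C implementation: `consume`, `install_secret`, the `unwrap` callable, and the one-file-per-token layout are all hypothetical. The point it shows is ordering: unwrap first, so a wrong R raises before any file is created, then install each token into a 0600 file.

```python
import os
from typing import Callable, Mapping

BUNDLE_FIELDS = ("tunnel_token", "pbs_token")  # field names from the bundle


def install_secret(path: str, value: bytes) -> None:
    """Create `path` with mode 0600 and write the secret; never overwrite."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(value)


def consume(blob: bytes, r: str,
            unwrap: Callable[[bytes, str], Mapping[str, bytes]],
            dest_dir: str) -> list:
    """Unwrap the identity blob, then install its tokens under dest_dir."""
    bundle = unwrap(blob, r)  # assumed to raise on a wrong R (fail closed)

    # Validate before writing so a malformed bundle leaves no partial install.
    missing = [name for name in BUNDLE_FIELDS if name not in bundle]
    if missing:
        raise ValueError(f"identity bundle is missing fields: {missing}")

    written = []
    for name in BUNDLE_FIELDS:
        path = os.path.join(dest_dir, name)
        install_secret(path, bundle[name])
        written.append(path)
    return written
```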
2. Phase S2 — tunnel re-establishment on a NEW box — PASS (with a security caveat)
Zone note: the operator directed the test to the demo-felhom.eu zone (the sajatfelhom.hu
throwaway zone resolved IPv6-only and was unreachable from the demo host). A new test subdomain
dr-spike.demo-felhom.eu was added (the live *.demo-felhom.eu wildcard + all other records were
untouched) and removed at teardown. The tunnel used was the demo's own demo-minipc — its live
connector was down (demo guests stopped), so no live traffic was displaced; this made it a
faithful "host X is back" test (X = the demo).
Setup
- A CNAME dr-spike.demo-felhom.eu → <tunnel-id>.cfargotunnel.com (proxied) was created with the zone-scoped API token (DNS:Edit). The same token lacked Account Cloudflare Tunnel:Edit (cfd_tunnel → auth error), so the tunnel's ingress config could not be set via the API.
- OLD box (162) + NEW box (180): each ran cloudflared with the recovered tunnel token, plus a distinguishable HTTPS origin ("OLD box" / "NEW box") behind the hostname the remote ingress expects.
Results
- Routing to the connector works. dr-spike.demo-felhom.eu → Cloudflare edge → the tunnel → the running connector (the cloudflared log shows dest=https://dr-spike.demo-felhom.eu/ arriving at the connector). The DNS CNAME → tunnel is stable; only the connector moves, so no DNS change is needed to move a hostname to a new box.
- New box takes over routing immediately. With BOTH connectors up (OLD 162 + NEW 180), 14/14 requests were served from NEW and 0 from OLD. In this test, Cloudflare routed to the most recently established connector.
- Old connector is a HOT STANDBY, not auto-retired. The OLD connector stayed active + registered (no unregister/lost events) while serving 0 traffic. On stopping NEW, traffic fell back to OLD (6/6) within seconds — so OLD was a live failover the whole time.
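The request counts above come from telling the two origins apart by their response body. A minimal sketch of that tally, assuming each origin returns a body containing its marker string ("OLD box" or "NEW box"); `tally_origins` is a hypothetical helper, not the spike's script:

```python
from collections import Counter
from typing import Iterable


def tally_origins(bodies: Iterable[str],
                  markers=("OLD box", "NEW box")) -> Counter:
    """Count which origin served each response, by its marker string."""
    counts = Counter({marker: 0 for marker in markers})
    for body in bodies:
        hits = [marker for marker in markers if marker in body]
        # Anything ambiguous or unmarked (e.g. an error page) is "unknown".
        counts[hits[0] if len(hits) == 1 else "unknown"] += 1
    return counts
```

Feeding it the bodies from a run with both connectors up, then again after stopping NEW, gives the two splits reported in the results.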
F-D2 — a tunnel TOKEN carries the credentials, but a remotely-managed tunnel ignores local
ingress. The base64 token decodes to {AccountTag, TunnelID, TunnelSecret} → a cloudflared
credentials file. BUT for a remotely-managed tunnel (dashboard/API config), cloudflared uses the
REMOTE ingress (here originService=https://traefik) and ignores any local config.yml. So
on DR the new box's connector serves the tunnel's hub/dashboard-owned ingress → the origin
service (traefik/the app) must be running on the new box (the restore orchestration brings it up;
the 502 here was only the missing origin, not a routing failure). Alternatively DR uses a
locally-managed tunnel (credentials file + local config) for full local ingress control.
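The token → credentials mapping can be sketched as below. This is a sketch under an assumption: the short JSON keys `a`, `t`, and `s` are how the token is believed to encode AccountTag, TunnelID, and TunnelSecret, and should be confirmed against the cloudflared version in use. `token_to_credentials` is a hypothetical helper, and a real token must never be printed or logged.

```python
import base64
import json


def token_to_credentials(token: str) -> dict:
    """Decode a tunnel token into the credentials-file fields."""
    # Re-add any stripped base64 padding before decoding.
    padded = token + "=" * (-len(token) % 4)
    raw = json.loads(base64.b64decode(padded))
    return {
        "AccountTag": raw["a"],    # assumed short key for the account tag
        "TunnelID": raw["t"],      # assumed short key for the tunnel ID
        "TunnelSecret": raw["s"],  # assumed short key for the tunnel secret
    }
```

Serialising the returned dict as JSON gives the credentials-file shape a locally-managed tunnel would use alongside a local config.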
F-D3 — identity continuity is automatic on running the recovered token. Recovered tunnel token → run cloudflared on the new box → the customer's hostname routes to the new box, no DNS edit, no operator routing step. This is the "the host is back as host X" mechanism.
F-D4 / the load-bearing DR consequence — the OLD connector + token stay valid → 10D MUST rotate.
Routing needs no explicit old-connector retirement (newest wins, old is standby). BUT the old
box's connector remains registered and authenticated with the SAME tunnel token, and a leaked copy
of that token still grants tunnel access. In host-LOSS DR (the old box is gone/untrusted/compromised),
that is a security gap: a recovered old box (or a leaked token) can silently re-register and
co-serve the customer's hostname. 10D must, after re-establishment, ROTATE the tunnel token (and
PBS token) and/or explicitly delete the stale connector (cleanup_connections / the connector
DELETE API) — this needs an Account Cloudflare Tunnel:Edit token, which the geo-restriction
zone token does NOT have (the hub's CF credential placement, design-review S4, must cover tunnel +
connector management for DR, not just WAF).
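The stale-connector cleanup call might look like the sketch below. It only builds the request and sends nothing. The endpoint path is an assumption based on the cfd_tunnel connections resource of the Cloudflare v4 API and must be checked against the current API reference; `cleanup_connections_request` is a hypothetical helper, and the token passed in must carry Account Cloudflare Tunnel:Edit.

```python
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def cleanup_connections_request(account_id: str, tunnel_id: str,
                                api_token: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE that drops a tunnel's connections."""
    # Assumed path: /accounts/{account}/cfd_tunnel/{tunnel}/connections
    url = f"{API_BASE}/accounts/{account_id}/cfd_tunnel/{tunnel_id}/connections"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {api_token}"},
    )
```

Cleanup alone does not close the gap: a connector still holding the old token can re-register, so the token rotation has to happen as part of the same step.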
Gotchas (test-environment, not DR)
- Split-horizon DNS: the LAN pi-hole resolves *.demo-felhom.eu → 192.168.0.162, masking the Cloudflare edge from internal hosts. Tested via the real edge with curl --resolve <host>:443:<CF-IP> (CF IP from dig @1.1.1.1).
- Origin TLS: the remote ingress origin was https://traefik; the spike pointed traefik → 127.0.0.1 (/etc/hosts) at a self-signed HTTPS responder, which the remote config accepted (its originRequest.noTLSVerify is set for the internal traefik). On DR the new box must present the real origin.
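The split-horizon workaround can be expressed as two command shapes. The helpers below are hypothetical and only build argv lists, assuming an IPv4 edge address; the flags themselves (`curl --resolve host:port:addr`, `dig @server`) are the ones named in the gotcha above.

```python
def edge_lookup_cmd(host: str, resolver: str = "1.1.1.1") -> list:
    """argv for resolving the public (Cloudflare edge) address of `host`."""
    return ["dig", "+short", f"@{resolver}", host, "A"]


def edge_probe_cmd(host: str, cf_ip: str, port: int = 443) -> list:
    """argv for probing `host` via the edge, bypassing the LAN resolver."""
    return [
        "curl", "--silent", "--show-error",
        "--resolve", f"{host}:{port}:{cf_ip}",  # pin host:port to the edge IP
        f"https://{host}/",
    ]
```

Passing the result to `subprocess.run(..., capture_output=True)` from an internal host then exercises the real edge path instead of the pi-hole answer.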
3. GO / NO-GO
GO to spec 10D. Both unvalidated mechanisms are proven:
- The identity bundle escrows + recovers exactly like K (age scrypt+AEAD; wrong-R fails closed) → reuse the 10C Consume shape.
- Tunnel re-establishment is automatic: run the recovered token on the new box → the customer's hostname routes there (no DNS step). The old connector is a hot standby, superseded in routing.
The 10D spec MUST include (consequences of this spike):
- Identity-escrow CREATION at provisioning (extend slice-7 escrow to also emit the identity blob: {tunnel_token, pbs_token, …} wrapped under the SAME R, or a sibling blob), so DR has it.
- Restore-mode consumption of the identity blob (10C Consume pattern; R by hand) + install the tunnel/PBS tokens.
- The new box must run the tunnel's expected origin (restore orchestration brings up traefik/apps before/with the connector), OR DR uses a locally-managed tunnel config.
- Cred ROTATION after re-establishment (rotate tunnel + PBS tokens; delete the stale connector) — the security capstone for host-LOSS DR. Requires an Account Cloudflare-Tunnel-scoped credential on the hub (broader than the current WAF-only zone token).
4. Teardown (verify the live demo is untouched)
- Connectors stopped + removed on both boxes (cloudflared + the HTTPS/responder units); cloudflared binaries removed; /etc/hosts traefik entries removed.
- DNS: the throwaway dr-spike.demo-felhom.eu record deleted; the live *.demo-felhom.eu wildcard + all other records untouched; the sajatfelhom.hu test record (created then abandoned on the zone-switch) deleted.
- Tunnel: its remote config was never modified (the API token lacked cfd_tunnel permission), so demo-minipc returns to exactly its prior state (no spike connectors; the demo's own connector reclaims it when the demo guest restarts).
- Secrets shredded: R, the identity bundle/blob, the tunnel token, the API token, the cloudflared credentials file (AccountTag/TunnelID/TunnelSecret), the throwaway age harness. No secret committed.
Out of scope (note; don't build — → 10D spec)
- The recovery-mode toggle + re-enroll handshake + cred rotation.
- Identity-escrow creation wired into provisioning (slice-7 escrow extension).
- The restore orchestration (consume → pull → RestoreLXC → bring up origin → re-establish under identity).