slice 10D (hub): DR capstone — recovery mode + re-enroll + directive serving (hub v0.11.0)
Recovery-mode toggle (global key, bounded auto-expiry) gates re-enroll + restore-directive serving.
Re-enroll rotates the agent<->hub credential to the new box (old key revoked); returns the opaque
escrow blobs + non-secret directive. Store gains recovery_mode_until + identity_blob + directive_json.
Hub holds no usable secret + no Cloudflare write-power (operator-side rotation). Doc 03 §9: slice 10 CLOSED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -4,44 +4,46 @@
 
 ---
 
-# REPORT — Slice 10D core SPIKE: identity-escrow round-trip + tunnel re-establishment (2026-06-10)
+# REPORT — Slice 10D (hub half): DR capstone — recovery mode + re-enroll + directive serving (hub v0.11.0) (2026-06-10)
 
 ## Type
 
-SPIKE runbook (CC-executed on the demo). Validated the two unvalidated mechanisms under the 10D DR
-capstone **before** speccing the orchestration. Deliverable: the redacted findings doc
-[`documentation/tests/slice10d-identity-restore-spike-findings.md`](documentation/tests/slice10d-identity-restore-spike-findings.md).
-Handled crown jewels (R + identity/tunnel tokens) — staged `0600`, by reference, **shredded** at teardown; no secret committed.
+TASK (CC-implemented). The hub half of the slice-10 DR capstone (closes slice 10). Pairs with
+`felhom-agent` v0.18.0 (identity escrow + restore-mode consumption).
 
-## Results — GO to spec 10D
+## What changed (hub)
 
-**S1 — identity-escrow round-trip (age):** the identity bundle `{tunnel_token, pbs_token}` wraps under
-an EFF-wordlist `R` via **age (scrypt + ChaCha20-Poly1305 AEAD)**, recovers **byte-identical** on a
-secret-less fresh box given only blob + R, and a **wrong R fails closed** (no plaintext). Mirrors the
-proven K-escrow → 10D reuses the 10C `Consume` shape for the identity bundle.
+The hub ORCHESTRATES recovery but holds **no usable secret and no Cloudflare write-power** — a
+compromised hub can at most hand out **opaque** blobs (they need `R`, which the hub never has) + rotate
+its own per-host credential. It cannot hijack a customer's tunnel (the destructive rotation is the
+operator's job).
 
-**S2 — tunnel re-establishment:** running the recovered Cloudflare tunnel token's connector on a NEW
-box → the customer's hostname routes to it **immediately, no DNS change** (the CNAME→tunnel is stable;
-only the connector moves). With both connectors up, 14/14 requests served from NEW; stopping NEW fell
-back to OLD (6/6) — **the old connector is a hot standby, superseded in routing but NOT auto-retired.**
+### API
+- **`PUT/DELETE /admin/hosts/{id}/recovery-mode`** (global key) — arm/disable recovery mode with a
+bounded TTL (clamped [60s, 4h], default 30m → **auto-expires**). Directive + re-enroll are served
+ONLY while active.
+- **`POST /hosts/{id}/re-enroll`** — gated ONLY on recovery mode (the lost box has no old key). Rotates
+the host's API key to the new box's key (**old box revoked**) + returns the directive + opaque blobs.
+- **`GET /hosts/{id}/restore-directive`** (re-enrolled key, recovery-gated) — re-fetch.
+- The slice-7 escrow upload now also accepts the **identity blob** + **non-secret directive** (additive).
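
The recovery-mode semantics in the API list above (TTL clamped to [60s, 4h], default 30m, automatic expiry) can be sketched as follows. This is a minimal illustration in Python, not the hub's code: `clamp_ttl`, `RecoveryGate`, and its methods are hypothetical names, and the in-memory dict only stands in for the persisted `hosts.recovery_mode_until` value.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

MIN_TTL = timedelta(seconds=60)
MAX_TTL = timedelta(hours=4)
DEFAULT_TTL = timedelta(minutes=30)


def clamp_ttl(requested: Optional[timedelta]) -> timedelta:
    """Use the default when no TTL is given, otherwise clamp to [60s, 4h]."""
    if requested is None:
        return DEFAULT_TTL
    return max(MIN_TTL, min(requested, MAX_TTL))


class RecoveryGate:
    """Hypothetical in-memory stand-in for hosts.recovery_mode_until."""

    def __init__(self) -> None:
        self._until: Dict[str, datetime] = {}

    def arm(self, host_id: str, now: datetime,
            ttl: Optional[timedelta] = None) -> datetime:
        """Arm recovery mode and return the deadline after which it lapses."""
        until = now + clamp_ttl(ttl)
        self._until[host_id] = until
        return until

    def clear(self, host_id: str) -> None:
        """Disable recovery mode immediately."""
        self._until.pop(host_id, None)

    def active(self, host_id: str, now: datetime) -> bool:
        """True only while the deadline is in the future (auto-expiry)."""
        until = self._until.get(host_id)
        return until is not None and now < until


if __name__ == "__main__":
    now = datetime(2026, 6, 10, 12, 0, tzinfo=timezone.utc)
    gate = RecoveryGate()
    gate.arm("host-1", now)                                       # default 30m
    print(gate.active("host-1", now + timedelta(minutes=29)))     # True
    print(gate.active("host-1", now + timedelta(minutes=30)))     # False
```

Because expiry is evaluated on every check rather than by a background timer, a forgotten recovery window closes on its own, which matches the "bounded auto-expiry" wording of the commit message.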
 
-**Load-bearing consequence for 10D:** routing failover is automatic, but the old box's connector + the
-(same) tunnel token stay valid → **10D must rotate the tunnel/PBS tokens and/or delete the stale
-connector after re-establishment** (host-LOSS security). That needs an **Account Cloudflare-Tunnel
--scoped** hub credential (broader than the current WAF-only zone token) — feeds the design-review S4
-CF-token-placement decision. Also: a remotely-managed tunnel uses its **dashboard ingress** (cloudflared
-ignores local config), so the new box must run the tunnel's expected origin (the restore orchestration
-brings it up).
+### Store
+- `hosts.recovery_mode_until`; `host_escrow.identity_blob` + `directive_json`. Methods:
+`SetRecoveryMode`/`ClearRecoveryMode`, `RotateHostAPIKey`, `SaveHostDRBundle`/`GetHostDRBundle`.
 
-## Safety / teardown
+## Tests (green)
+- re-enroll refused without recovery mode (403); recovery-arm is global-key-only; re-enroll **rotates +
+revokes** (old key→401, new key→200); directive served only in recovery mode + **expires**; clear
+disables re-enroll.
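
The behaviours these tests pin down can be modelled end to end with a small in-memory sketch. It is an illustration under assumptions, not the hub's handlers: the `Hub` and `Host` classes and their methods are hypothetical, TTL clamping is left out, and the 401 for a wrong global key and the 403 for a directive fetch outside recovery mode are assumed status codes (the report states only the 403 on re-enroll and the 401/200 after rotation).

```python
import hmac
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Host:
    api_key: str
    recovery_until: float = 0.0       # unix seconds; 0.0 means not armed
    identity_blob: bytes = b""        # opaque to the hub (needs R to open)
    directive: dict = field(default_factory=dict)


class Hub:
    """Hypothetical in-memory model of the recovery-gated endpoints."""

    def __init__(self, global_key: str) -> None:
        self.global_key = global_key
        self.hosts: Dict[str, Host] = {}

    def arm_recovery(self, key: str, host_id: str, now: float,
                     ttl: float = 1800.0) -> int:
        # Keys are assumed ASCII so compare_digest accepts them as str.
        if not hmac.compare_digest(key, self.global_key):
            return 401                # assumed status for a non-global key
        self.hosts[host_id].recovery_until = now + ttl
        return 200

    def clear_recovery(self, key: str, host_id: str) -> int:
        if not hmac.compare_digest(key, self.global_key):
            return 401
        self.hosts[host_id].recovery_until = 0.0
        return 200

    def re_enroll(self, host_id: str, new_key: str,
                  now: float) -> Tuple[int, Optional[dict]]:
        host = self.hosts[host_id]
        if now >= host.recovery_until:
            return 403, None          # gated only on recovery mode
        host.api_key = new_key        # rotation: the old box's key stops working
        return 200, {"identity_blob": host.identity_blob,
                     "directive": host.directive}

    def restore_directive(self, host_id: str, key: str,
                          now: float) -> Tuple[int, Optional[dict]]:
        host = self.hosts[host_id]
        if not hmac.compare_digest(key, host.api_key):
            return 401, None
        if now >= host.recovery_until:
            return 403, None          # assumed status outside recovery mode
        return 200, host.directive
```

Note that `re_enroll` takes no existing credential at all: as the API notes say, the lost box has no old key, so the recovery window armed with the global key is the only gate.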
 
-Per operator instruction the test used a **new** `dr-spike.demo-felhom.eu` subdomain on the demo's own
-(idle — guests down) tunnel; the live `*.demo-felhom.eu` wildcard + all other records were **untouched**,
-the tunnel's remote config was **never modified** (the zone API token lacks `cfd_tunnel` permission), and
-the throwaway subdomain + both connectors + all secrets were removed/shredded at teardown. The demo
-returns to exactly its prior state.
+## Docs
+- Doc 03 §9 (10D done → **SLICE 10 CLOSED**) + the host-loss DR flow with the **operator-side rotation**
+model (hub orchestrates + read-only verifies; the operator deletes the stale connector + rotates the
+tunnel/PBS token from a trusted environment).
 
-## Out of scope (→ 10D spec)
+## Deferred (non-blocking, per the locked model)
+- The Config DR/Recovery **web UI** (functional today via the recovery-mode admin API) + a small
+operator rotation CLI. **No Cloudflare write-credential is in the hub by design.**
 
-Recovery-mode toggle + re-enroll handshake + cred rotation; identity-escrow creation wired into
-provisioning; the restore orchestration (consume → pull → `RestoreLXC` → bring up origin → re-establish).
+## Pending
+- Build + deploy hub v0.11.0 + agent v0.18.0; run the operator-in-the-loop DR drill (throwaway identity).