slice 10D (hub): DR capstone — recovery mode + re-enroll + directive serving (hub v0.11.0)

Recovery-mode toggle (global key, bounded auto-expiry) gates re-enroll +
restore-directive serving. Re-enroll rotates the agent<->hub credential to the
new box (old key revoked); returns the opaque escrow blobs + non-secret
directive. Store gains recovery_mode_until + identity_blob + directive_json.
Hub holds no usable secret + no Cloudflare write-power (operator-side rotation).
Doc 03 §9: slice 10 CLOSED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 09:48:38 +02:00
parent a22b87e6e3
commit 3457415117
7 changed files with 533 additions and 34 deletions
+26 -1
@@ -423,7 +423,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Hub desired-state serving** (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending) | **10A** | **implemented** (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical. |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | **implemented** (agent v0.16.0: `cmd/felhom-opsign` offline signing CLI + `internal/signedjobs` runner/WipeExecutor + `internal/storage` durable-device resolution; hub v0.10.0: `DELETE /hosts/{id}/jobs/{job_id}` completion). Verify → durable nonce-burn → execute → clear; pinned-key (multi-key rotation, trusted path), host + **durable-id** anti-retarget, 8C re-inspect. Closes the 8C data-bearing-wipe gap. Other destructive executors (guest_destroy, decommission, restore-overwrite → 10D) reuse the same gate+runner machinery. |
| **PBS escrow consumption** (recover `K` on a new box) | **10C** | **implemented** (agent v0.17.0: `escrow.Consume` = Unwrap → fingerprint-gate → atomic install; spike-proven crypto + real-data restore productionized; `--selftest=escrow-consume`). Zero-knowledge holds (hub serves all but R). Spike findings: `documentation/tests/slice10-escrow-consumption-spike-findings.md`. The four inputs are sourced from the hub directive in 10D. |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here) | **10D** | deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / tunnel token / restore directive; consume + restore + re-establish under identity; **operator-side** cred rotation | **10D** | **implemented — SLICE 10 CLOSED** (agent v0.18.0: identity escrow via `age` + `Consume`/identity-consume + restore-mode orchestration; hub v0.11.0: recovery-mode toggle + auto-expiry + re-enroll credential rotation + directive serving). Locked rotation model: **hub holds no Cloudflare write-power**; the operator deletes the stale connector + rotates the tunnel/PBS token from a trusted environment. Both 10D mechanisms spike-validated. Deferred (non-blocking): the DR web-UI page + a small operator rotation CLI. |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
**Host/hardware loss (design intent — slice 10).** Re-enroll the new host in **restore mode**;
@@ -501,6 +501,31 @@ This doc hands the implementation three contracts it was waiting on:
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
### Slice-10D implemented — DR capstone; SLICE 10 CLOSED (2026-06-10)
- The host/hardware-loss DR flow is wired end-to-end, grounded by both 10-series spikes. **Rotation
model (locked): the hub holds no Cloudflare write-power** — it orchestrates recovery (recovery-mode
toggle, directive serving, re-enroll + its OWN agent↔hub credential rotation) and at most read-only
*verifies* connector state; the **destructive tunnel/PBS rotation + stale-connector delete is the
operator's step from a trusted environment** (same spirit as 10B — the operator authorizes/executes
the dangerous op). A compromised hub can only hand out opaque blobs + rotate its own per-host cred.
- **10D.1 identity escrow:** `{tunnel_token, pbs_token}` wrapped under the SAME `R` via `age` (scrypt +
ChaCha20-Poly1305) — a second opaque blob; the K-escrow + 10C `Consume` are untouched. The hub
stores both ciphertext blobs + the **non-secret** directive (pbs repo/ns, expected key fingerprint,
tunnel id). **No usable secret in the hub.**
- **10D.2 recovery mode + re-enroll:** operator-armed **recovery-mode toggle** with bounded
**auto-expiry** gates directive serving + re-enroll. The re-enroll handshake rotates the agent↔hub
credential to the new box's key (**old box's hub access revoked**, hub-internal). Re-enroll auth =
recovery-mode toggle + **R** (zero-knowledge for data *and* identity) + **out-of-band phone
validation** (operator protocol) + auto-expiry + rotation.
- **10D.3 restore mode (agent):** receive directive (10A) → prompt for **R** by hand → `Consume`
(K-escrow → K installed, fingerprint-gated; identity-escrow → tunnel/pbs tokens) → restore guests
from PBS (restore-overwrite gated by **10B** on a non-blank target) → re-establish the tunnel (run
the recovered connector + reconstitute the dashboard-expected origin) → host routes as host X. The
destructive cred rotation is then the operator's step.
- §9 slice table: **10D done → SLICE 10 CLOSED**. Status: implemented (agent v0.18.0; hub v0.11.0).
Deferred (non-blocking): the hub Config DR/Recovery **web UI** (functional via the recovery-mode
admin API today) + a small operator rotation CLI (the rotation is a documented operator procedure).
### Slice-10C implemented — escrow consumption (productionized) (2026-06-10)
- §8a: escrow **consumption** is now a real, tested path (`escrow.Consume`): **Unwrap → fingerprint-
gate → install**. The throwaway 10C spike harness is gone; the spike's findings are baked in (F-C2