slice 10A: hub desired-state serving + signed-jobs queue (Down channel) (hub v0.9.0)

Serve operator intent to authenticated hosts: PUT /admin/hosts/{id}/desired-state
(global key) bumps desired_generation; GET /hosts/{id}/desired-state + /jobs are
per-host self-scoped; the host-report envelope now carries the real generation +
has_signed_ops. New signed_jobs table + store methods. Desired-state stored/served
opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state)
byte-identical with felhom-agent; doc 03 §4/§9 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 19:03:14 +02:00
parent f9af3243b9
commit e54f882e70
8 changed files with 669 additions and 30 deletions
+22 -2
@@ -420,8 +420,10 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
| **Hub desired-state serving** (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending) | **10A** | **implemented** (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical. |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | deferred — 10A lays the queue/flag/serving + the gate marks pending; 10B verifies the signature (role-scoped, action-bound, idempotent — `internal/authz`/`internal/reconcile` gate already built) and runs the executor (e.g. the decommission). |
| **PBS escrow consumption** (recover `K` on a new box) | **10C** | **spike validated** (2026-06-10, `documentation/tests/slice10-escrow-consumption-spike-findings.md` — recover-from-`(blob,R)` on a key-less box + real-data restore proven, GO). Productionizing the consumption path is 10C; exercised by host-loss DR (10D). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here) | **10D** | deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
**Host/hardware loss (design intent — slice 10).** Re-enroll the new host in **restore mode**;
@@ -499,6 +501,24 @@ This doc hands the implementation three contracts it was waiting on:
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
### Slice-10A implemented — hub desired-state serving (the "Down" channel) (2026-06-10)
- §4: the **control loop is live**. The report IS the heartbeat; its response — the **control
envelope** — is the Down channel. The envelope is a cheap change-notification: `desired_generation`
(version) + `has_signed_ops` (flag) + `poll_interval_seconds`. The agent **caches** the desired-state
+ its generation and re-fetches the heavy state (`GET /hosts/{id}/desired-state`, self-scoped) **only
when the generation advances**. The engine reconciles **benign** deltas; an explicit **destructive**
delta (a guest `decommission`) is classified Destructive → the gate refuses it as **`pending_signature`**
(no signer in 10A → never executed). **Signed-job execution is 10B**; the `restore_directive` field
is carried in desired-state now but **consumed in 10D**.
- §9 slice table: **10A done** (hub serves desired-state + bumps generation + signed-jobs queue/flag;
agent activates the envelope + a hub-backed `CachingProvider` feeding the engine). 10B/10C/10D pending.
- Wire: the envelope's now-active fields + the `desired-state` response are a cross-repo contract —
`control-envelope.golden.json` + `desired-state.golden.json`, **byte-identical** agent↔hub. Status:
implemented (hub v0.9.0; agent v0.15.0). **Out of 10A (deliberate):** the hub stores/serves
desired-state **opaquely** (the agent owns the schema); signed-op **execution** + verification are 10B;
**restore-mode/re-enroll** consumption (a new box's first directive) is 10D — 10A serves only
already-authenticated hosts.
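
The generation-gated re-fetch and the 10A gate described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `Cache`, `Fetcher`, `Delta`, and `Gate` names are hypothetical, `Fetcher` stands in for the `GET /hosts/{id}/desired-state` call, and the `"reconcile"` label is an assumed name for the benign path. Only the three envelope field names, the `decommission` delta, and the `pending_signature` status come from the text.

```go
package main

// ControlEnvelope carries the three change-notification fields named above.
// The struct layout is an assumption; only the JSON field names are from the doc.
type ControlEnvelope struct {
	DesiredGeneration   int64 `json:"desired_generation"`
	HasSignedOps        bool  `json:"has_signed_ops"`
	PollIntervalSeconds int   `json:"poll_interval_seconds"`
}

// Fetcher is a hypothetical stand-in for GET /hosts/{id}/desired-state.
// The body stays opaque bytes here, mirroring the hub's opaque storage.
type Fetcher func() ([]byte, error)

// Cache holds the last fetched desired-state and the generation it belongs to.
type Cache struct {
	generation int64
	state      []byte
}

// Sync re-fetches the heavy desired-state only when the envelope's generation
// is ahead of the cached one. It reports whether a fetch happened.
func (c *Cache) Sync(env ControlEnvelope, fetch Fetcher) (bool, error) {
	if env.DesiredGeneration <= c.generation {
		return false, nil // nothing new: keep serving the cached state
	}
	body, err := fetch()
	if err != nil {
		// Leave the cached state and generation untouched so the next
		// heartbeat sees the same advance and retries the fetch.
		return false, err
	}
	c.state = body
	c.generation = env.DesiredGeneration
	return true, nil
}

// Delta is a hypothetical reconciled change; Kind "decommission" is the
// destructive case named in the doc, any other value is treated as benign.
type Delta struct {
	Kind  string
	Guest string
}

// Gate mirrors the 10A behaviour: there is no signer yet, so a destructive
// delta is always held as pending_signature and never executed.
func Gate(d Delta) string {
	if d.Kind == "decommission" {
		return "pending_signature"
	}
	return "reconcile" // assumed label for the benign path
}
```

Calling `Sync` once per report response keeps the heartbeat cheap: an unchanged `desired_generation` costs no extra request, and a failed fetch leaves the cache as it was so the next heartbeat tries again.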
### Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)
- §6: added **`GET /host/metrics`** — host-wide health (cpu%/mem/load/uptime/**`cpu_temp_c`**) +
per-storage capacity for the customer's monitoring view. **Reuses the slice-4 collector** (no