slice 10B: signed-op job completion (DELETE clear-job) (hub v0.10.0)

Add DELETE /hosts/{id}/jobs/{job_id} (per-host self-scoped, idempotent) so the
agent clears a job after executing or terminally rejecting it. The hub stores
the operator-signed blobs opaquely (no signing key — cannot forge or open);
the agent verifies + executes. Doc 03 §4/§6/§9 updated (operator-signed path
live; 8C wipe completes; 10B done).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 20:14:32 +02:00
parent 8c54775b6f
commit 0c843286a2
6 changed files with 134 additions and 35 deletions
+21 -1
@@ -421,7 +421,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
| **Hub desired-state serving** (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending) | **10A** | **implemented** (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical. |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | deferred — 10A lays the queue/flag/serving + the gate marks pending; 10B verifies the signature (role-scoped, action-bound, idempotent — `internal/authz`/`internal/reconcile` gate already built) and runs the executor (e.g. the decommission). |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | **implemented** (agent v0.16.0: `cmd/felhom-opsign` offline signing CLI + `internal/signedjobs` runner/WipeExecutor + `internal/storage` durable-device resolution; hub v0.10.0: `DELETE /hosts/{id}/jobs/{job_id}` completion). Verify → durable nonce-burn → execute → clear; pinned-key (multi-key rotation, trusted path), host + **durable-id** anti-retarget, 8C re-inspect. Closes the 8C data-bearing-wipe gap. Other destructive executors (guest_destroy, decommission, restore-overwrite → 10D) reuse the same gate+runner machinery. |
| **PBS escrow consumption** (recover `K` on a new box) | **10C** | **spike validated** (2026-06-10, `documentation/tests/slice10-escrow-consumption-spike-findings.md` — recover-from-`(blob,R)` on a key-less box + real-data restore proven, GO). Productionizing the consumption path is 10C; exercised by host-loss DR (10D). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here) | **10D** | deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
@@ -501,6 +501,26 @@ This doc hands the implementation three contracts it was waiting on:
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
### Slice-10B implemented — operator-signed destructive completion (offline key + signing CLI) (2026-06-10)
- §4: the **operator-signed path is LIVE**. Gate → pending op (the agent surfaces the bound intent:
op + target + params on **durable** ids) → the operator **signs OFFLINE** (`cmd/felhom-opsign`,
`ssh-keygen -Y sign` — hardware-ready) → uploads to the hub jobs queue → the agent fetches +
**verifies** (pinned-key SSHSIG, namespace, allow-list by key MATERIAL, crypto over raw bytes,
host target, time window, **durable nonce-burn**) → **executes**, journaled. Order: verify → burn
nonce → execute → report. Pinning is via a **trusted path** (provision/agent config, NEVER
hub-alone), **multi-key** for rotation (KeyID selects, role-scoped). **Key floor: the signing key
is not in the hub and not in the agent.** Resource-level anti-retarget: params bind a **durable
device id** (wwn/serial), and execution re-resolves + **re-inspects (8C)** — "wipe device X" wipes
exactly X, never whatever is at `/dev/sdb` now.
- §6: the **8C data-bearing wipe now COMPLETES** via 10B — `POST /disks/format` of a data-bearing
device still refuses `pending_signature`, but now surfaces the bound `storage_wipe` op (durable id
+ host) to sign; the signed job is executed by the agent's signed-jobs runner (re-resolve +
re-inspect → `mkfs`). A path-only, vanished, or no-longer-data-bearing target is refused **even
with a valid signature**.
- §9 slice table: **10B done**. 10C (escrow consumption — spike validated) / 10D (DR capstone,
reuses this gate for restore-overwrite) pending. Status: implemented (agent v0.16.0; hub v0.10.0).
The signed blob is opaque on the jobs wire (no golden change).
### Slice-10A implemented — hub desired-state serving (the "Down" channel) (2026-06-10)
- §4: the **control loop is live**. The report IS the heartbeat; its response — the **control
envelope** — is the Down channel. The envelope is a cheap change-notification: `desired_generation`