slice 10B: signed-op job completion (DELETE clear-job) (hub v0.10.0)
Add DELETE /hosts/{id}/jobs/{job_id} (per-host self-scoped, idempotent) so the
agent clears a job after executing or terminally rejecting it. The hub stores
the operator-signed blobs opaquely (no signing key — cannot forge or open);
the agent verifies + executes. Doc 03 §4/§6/§9 updated (operator-signed path
live; 8C wipe completes; 10B done).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -4,51 +4,41 @@

---

# REPORT — Slice 10A (hub half): desired-state serving — the "Down" channel (hub v0.9.0) (2026-06-10)
# REPORT — Slice 10B (hub half): signed-op job completion (hub v0.10.0) (2026-06-10)

## Type

TASK (CC-implemented). The hub half of slice 10A. Pairs with `felhom-agent` v0.15.0.
TASK (CC-implemented). The hub half of slice 10B. Pairs with `felhom-agent` v0.16.0 (the signing CLI
+ verify-and-execute machinery + the storage-wipe consumer).

## What changed (hub)

The hub now **serves operator intent** down to already-authenticated hosts; the control envelope stops
returning placeholders and carries the host's real generation + signed-jobs flag.
Small by design — the hub stores + serves the operator-signed blobs **opaquely** (it holds no signing
key, can neither forge nor open them; the agent verifies + executes). 10B adds the **completion** path.

### Store (`internal/store`)
- New `signed_jobs` table (per-host **opaque** signed-op blob queue). New methods: `SetHostDesired`
(set desired-state + **atomically bump `desired_generation`**), `EnqueueSignedJob` / `GetSignedJobs`
/ `CountSignedJobs`. The `hosts` table's previously-inert `desired_json` / `desired_generation`
columns are now live.

### API (`internal/api`)
- **`PUT /api/v1/admin/hosts/{id}/desired-state`** (global key) — set + bump generation; body stored +
served **opaquely** (validated only as well-formed JSON — the agent owns the schema).
- **`GET /api/v1/hosts/{id}/desired-state`** (per-host key, **self-scoped**) — `{generation,
desired_state}`; host A's key cannot read host B (403); global key may read any.
- **`GET /api/v1/hosts/{id}/jobs`** (per-host key, self-scoped) — serves the host's pending opaque
signed-op blobs, oldest first (verify+execute is 10B).
- **`POST /api/v1/admin/hosts/{id}/jobs`** (global key) — enqueue a pre-signed opaque blob (the hub
holds no signing key).
- The host-report **control envelope** now reports the real `desired_generation` + `has_signed_ops`,
degrading safely to defaults on a store error.
### Store + API
- **`DELETE /api/v1/hosts/{host_id}/jobs/{job_id}`** (per-host key, **self-scoped**; global key may
clear any) — the agent calls it after executing OR terminally rejecting a job. Idempotent. Store:
`DeleteSignedJob`.
- Reused unchanged from 10A: `POST /admin/hosts/{id}/jobs` (operator enqueue), `GET /hosts/{id}/jobs`
(agent fetch), `has_signed_ops` envelope flag. The signed blob stays opaque on the wire (a base64
`{op_blob_b64, sig_armored}` envelope) — **no jobs-wire golden change**.

## Tests (green)
- admin-set bumps the generation + serves the latest body; global-key-only (per-host 403, malformed
400, unknown host 404); `GET /desired-state` self-scoped (A→B 403, global any, no-token 401);
envelope carries generation + `has_signed_ops` flips on enqueue; `GET /jobs` self-scoped oldest-first;
cross-repo golden round-trip (set → fetched back unchanged), **byte-identical** with felhom-agent.
- `DELETE …/jobs/{id}` self-scoped (host A cannot clear host B's job → 403) + idempotent.

## Docs
- Doc 03 §4 (control loop live: heartbeat → envelope generation/jobs → fetch-on-change → reconcile
benign / gate destructive) + §9 slice table (**10A done**; 10B signed-op execution / 10C escrow
consumption / 10D DR capstone pending; the `restore_directive` field exists now, consumed in 10D).
- Doc 03 §4 (the operator-signed path is LIVE: gate → pending op → offline signature → verify
(pinned key / nonce-burn / expiry / host + durable-id anti-retarget) → execute; key floor: not in
the hub, not in the agent), §6 (the 8C data-bearing wipe now completes via 10B), §9 slice table
(**10B done**; 10C escrow-consumption spike-validated, 10D DR capstone pending).

## Deferred / out of scope
- Signed-op **execution** + signature verification → **10B** (10A only serves the queue + flag).
- **Restore-mode / re-enroll** consumption (a new box's first directive) → **10D**; 10A serves
already-authenticated hosts only. Rich desired-state editing UX → doc-05 (10A's admin-set is minimal).
## Security framing (why the hub stays minimal)
The hub is deliberately a dumb queue here: it cannot forge a signed op (no key) and the agent never
trusts a queued blob until the pinned-key verify passes. A **compromised hub queuing a forged blob is
rejected** by the agent (tested in felhom-agent). That is the whole point of the offline-key design.

## Pending
- Build + deploy hub v0.9.0 (+ agent v0.15.0) and live-validate against the demo host (admin-set
benign+destructive → generation bump → agent fetch → reconcile/gate; self-scope refusal).
- Build + deploy hub v0.10.0 (+ agent v0.16.0) and live-validate the full loop on the demo: a
data-bearing wipe → `pending_signature` → offline-signed → queued → agent verifies + wipes the
device; replay + non-pinned-key rejected.

@@ -421,7 +421,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
| **Hub desired-state serving** (the "Down" channel) — store + serve per-host desired-state, bump `desired_generation`, signed-jobs queue + `has_signed_ops`; agent activates the envelope + a hub-backed provider (benign reconciled, destructive gated pending) | **10A** | **implemented** (hub v0.9.0: `PUT /admin/hosts/{id}/desired-state` bumps the generation, `GET /hosts/{id}/desired-state` + `/jobs` self-scoped, `signed_jobs` queue; agent v0.15.0: `ControlEnvelope` fields live, `Client.FetchDesiredState`, `internal/desired` Syncer + `reconcile.CachingProvider` feeding the engine — an explicit guest `decommission` is the destructive delta, gated `pending_signature`). Serves to already-authenticated hosts only; desired-state stored opaquely (agent owns the schema). Cross-repo golden (envelope + desired-state) byte-identical. |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | deferred — 10A lays the queue/flag/serving + the gate marks pending; 10B verifies the signature (role-scoped, action-bound, idempotent — `internal/authz`/`internal/reconcile` gate already built) and runs the executor (e.g. the decommission). |
| **Signed-op execution** (verify + run the gated destructive op) | **10B** | **implemented** (agent v0.16.0: `cmd/felhom-opsign` offline signing CLI + `internal/signedjobs` runner/WipeExecutor + `internal/storage` durable-device resolution; hub v0.10.0: `DELETE /hosts/{id}/jobs/{job_id}` completion). Verify → durable nonce-burn → execute → clear; pinned-key (multi-key rotation, trusted path), host + **durable-id** anti-retarget, 8C re-inspect. Closes the 8C data-bearing-wipe gap. Other destructive executors (guest_destroy, decommission, restore-overwrite → 10D) reuse the same gate+runner machinery. |
| **PBS escrow consumption** (recover `K` on a new box) | **10C** | **spike validated** (2026-06-10, `documentation/tests/slice10-escrow-consumption-spike-findings.md` — recover-from-`(blob,R)` on a key-less box + real-data restore proven, GO). Productionizing the consumption path is 10C; exercised by host-loss DR (10D). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive (the `restore_directive` field exists in 10A's desired-state, consumed here) | **10D** | deferred — the DR capstone; consumes 10A serving + 10C escrow consumption + re-enrollment authorization |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
@@ -501,6 +501,26 @@ This doc hands the implementation three contracts it was waiting on:

## Changelog — design-review + Phase-3 fold-in (2026-06-08)

### Slice-10B implemented — operator-signed destructive completion (offline key + signing CLI) (2026-06-10)
- §4: the **operator-signed path is LIVE**. gate → pending op (the agent surfaces the bound intent:
op + target + params on **durable** ids) → the operator **signs OFFLINE** (`cmd/felhom-opsign`,
`ssh-keygen -Y sign` — hardware-ready) → uploads to the hub jobs queue → the agent fetches +
**verifies** (pinned-key SSHSIG, namespace, allow-list by key MATERIAL, crypto over raw bytes,
host target, time window, **durable nonce-burn**) → **executes**, journaled. Order: verify → burn
nonce → execute → report. Pinning is via a **trusted path** (provision/agent config, NEVER
hub-alone), **multi-key** for rotation (KeyID selects, role-scoped). **Key floor: the signing key
is not in the hub and not in the agent.** Resource-level anti-retarget: params bind a **durable
device id** (wwn/serial), and execution re-resolves + **re-inspects (8C)** — "wipe device X" wipes
exactly X, never whatever is at `/dev/sdb` now.
- §6: the **8C data-bearing wipe now COMPLETES** via 10B — `POST /disks/format` of a data-bearing
device still refuses `pending_signature`, but now surfaces the bound `storage_wipe` op (durable id
+ host) to sign; the signed job is executed by the agent's signed-jobs runner (re-resolve +
re-inspect → `mkfs`). A path-only, vanished, or no-longer-data-bearing target is refused **even
with a valid signature**.
- §9 slice table: **10B done**. 10C (escrow consumption — spike validated) / 10D (DR capstone,
reuses this gate for restore-overwrite) pending. Status: implemented (agent v0.16.0; hub v0.10.0).
The signed blob is opaque on the jobs wire (no golden change).

### Slice-10A implemented — hub desired-state serving (the "Down" channel) (2026-06-10)
- §4: the **control loop is live**. The report IS the heartbeat; its response — the **control
envelope** — is the Down channel. The envelope is a cheap change-notification: `desired_generation`

@@ -1,5 +1,24 @@
# Felhom Hub — Changelog

## v0.10.0 — slice 10B: signed-op job completion (clear-job) (2026-06-10)

The hub half of slice 10B is small by design — the hub stores + serves the operator-signed blobs
**opaquely** (it holds no signing key and can neither forge nor open them; the agent verifies +
executes). 10B adds the missing **completion** path so a processed job leaves the queue.

### Added
- **`DELETE /api/v1/hosts/{host_id}/jobs/{job_id}`** (per-host key, **self-scoped**; the global key
may clear any) — the agent calls it after executing OR terminally rejecting a job. Idempotent
(clearing an absent job is a clean 200). Store: `DeleteSignedJob`.

### Unchanged (already in 10A, reused by 10B)
- `POST /admin/hosts/{id}/jobs` (operator enqueues the signed blob), `GET /hosts/{id}/jobs` (the
agent fetches), and the `has_signed_ops` envelope flag. The signed blob stays opaque on the wire
(a base64 `{op_blob_b64, sig_armored}` envelope the agent parses) — **no jobs-wire golden change**.

### Tests
- `DELETE …/jobs/{id}` is self-scoped (host A cannot clear host B's job → 403) and idempotent.

## v0.9.0 — slice 10A: desired-state serving + signed-jobs queue (the "Down" channel) (2026-06-10)

The hub half of slice 10A: the hub now **serves operator intent** down to already-authenticated

@@ -244,6 +244,33 @@ func TestGetJobs_SelfScopedAndServesBlobs(t *testing.T) {
	}
}

// DELETE /hosts/{id}/jobs/{job_id} clears a processed job (slice 10B), self-scoped + idempotent.
func TestDeleteJob_SelfScopedAndIdempotent(t *testing.T) {
	h, st, _ := newTestHandler(t)
	seedHost(t, st, "h1", "c1", "HKEY1")
	seedHost(t, st, "h2", "c2", "HKEY2")
	st.EnqueueSignedJob("h1", "jobA", []byte("blob"))

	// h2 cannot clear h1's job (self-scope).
	if rr := do(h, http.MethodDelete, "/hosts/h1/jobs/jobA", "HKEY2", ""); rr.Code != http.StatusForbidden {
		t.Errorf("h2 clearing h1's job = %d, want 403", rr.Code)
	}
	if n, _ := st.CountSignedJobs("h1"); n != 1 {
		t.Errorf("job was removed by an unauthorized delete (depth=%d)", n)
	}
	// h1 clears its own job → 200, queue empties.
	if rr := do(h, http.MethodDelete, "/hosts/h1/jobs/jobA", "HKEY1", ""); rr.Code != http.StatusOK {
		t.Fatalf("h1 clearing own job = %d, want 200", rr.Code)
	}
	if n, _ := st.CountSignedJobs("h1"); n != 0 {
		t.Errorf("queue depth after delete = %d, want 0", n)
	}
	// Idempotent: deleting an absent job is a clean 200.
	if rr := do(h, http.MethodDelete, "/hosts/h1/jobs/jobA", "HKEY1", ""); rr.Code != http.StatusOK {
		t.Errorf("idempotent delete = %d, want 200", rr.Code)
	}
}

// The admin enqueue-job endpoint (global key only) seeds the queue, reflected in has_signed_ops.
func TestAdminEnqueueJob_GlobalKeyOnly(t *testing.T) {
	h, st, _ := newTestHandler(t)

@@ -136,6 +136,14 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	case r.Method == http.MethodGet && strings.HasPrefix(path, "/hosts/") && strings.HasSuffix(path, "/jobs"):
		hostID := strings.TrimSuffix(strings.TrimPrefix(path, "/hosts/"), "/jobs")
		h.handleGetJobs(w, r, hostID)
	// Job completion (slice 10B) — per-host-key, self-scoped: DELETE /hosts/{id}/jobs/{job_id}.
	case r.Method == http.MethodDelete && strings.HasPrefix(path, "/hosts/") && strings.Contains(path, "/jobs/"):
		rest := strings.TrimPrefix(path, "/hosts/")
		if i := strings.Index(rest, "/jobs/"); i > 0 {
			h.handleDeleteJob(w, r, rest[:i], rest[i+len("/jobs/"):])
		} else {
			http.NotFound(w, r)
		}
	// Admin-set (slice 10A) — global/operator key only; bumps the generation.
	case r.Method == http.MethodPut && strings.HasPrefix(path, "/admin/hosts/") && strings.HasSuffix(path, "/desired-state"):
		hostID := strings.TrimSuffix(strings.TrimPrefix(path, "/admin/hosts/"), "/desired-state")
@@ -739,6 +747,33 @@ func (h *Handler) handleGetJobs(w http.ResponseWriter, r *http.Request, pathHost
	json.NewEncoder(w).Encode(map[string]interface{}{"jobs": out})
}

// handleDeleteJob clears a processed job from a host's queue (slice 10B). Per-host key,
// SELF-SCOPED (a host clears only its own jobs; the global key may clear any). Idempotent.
func (h *Handler) handleDeleteJob(w http.ResponseWriter, r *http.Request, pathHostID, jobID string) {
	authHostID, _, isGlobal, ok := h.checkAuthHost(r)
	if !ok {
		http.Error(w, "Unauthorized", http.StatusUnauthorized)
		return
	}
	if pathHostID == "" || jobID == "" {
		http.Error(w, "Missing host_id or job_id", http.StatusBadRequest)
		return
	}
	if !isGlobal && authHostID != pathHostID {
		http.Error(w, "Forbidden: host_id mismatch", http.StatusForbidden)
		return
	}
	if err := h.store.DeleteSignedJob(pathHostID, jobID); err != nil {
		h.logger.Printf("[ERROR] delete job %s for %s: %v", jobID, pathHostID, err)
		http.Error(w, "Internal error", http.StatusInternalServerError)
		return
	}
	h.logger.Printf("[INFO] host %s cleared signed-op job %s (executed or rejected)", pathHostID, jobID)
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ok"}`))
}

// handleAdminSetDesiredState sets a host's desired-state (slice 10A). GLOBAL/operator key ONLY —
// a per-host key cannot author its own intent. The body is the desired-state JSON (opaque to the
// hub: it stores + serves bytes, never validates/interprets the schema — the agent/CLI owns it).

@@ -1517,6 +1517,14 @@ func (s *Store) CountSignedJobs(hostID string) (int, error) {
	return n, err
}

// DeleteSignedJob removes a processed job from a host's queue (slice 10B completion). The agent
// calls it after executing OR terminally rejecting a job. Idempotent (deleting an absent job is a
// no-op, returns nil) — a retried completion must not error.
func (s *Store) DeleteSignedJob(hostID, jobID string) error {
	_, err := s.db.Exec(`DELETE FROM signed_jobs WHERE host_id = ? AND job_id = ?`, hostID, jobID)
	return err
}

// SaveHostReport inserts a host_reports row and bumps the host's reality columns
// (agent_version/last_report_at/updated_at) — never the inert intent columns.
func (s *Store) SaveHostReport(hostID, customerID string, reportJSON []byte, d HostReportDenorm) error {
