slice 9 docs + wire-contract: host.cpu_temp_c golden + doc 03 GET /host/metrics

Update the cross-repo host-report golden byte-identical with felhom-agent
(host.cpu_temp_c). Document GET /host/metrics in doc 03 section 6 and define
slice 9 in the section 9 roadmap. No hub code change / no version bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 16:16:38 +02:00
parent 5dc363771b
commit 4590fc0ee0
4 changed files with 67 additions and 20 deletions
@@ -117,6 +117,19 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
- **Host metrics (slice 9):** `GET /host/metrics` returns **host-wide** health for the customer's
monitoring view: cpu%/mem/load/uptime, **CPU/chassis temperature** (`cpu_temp_c`, nullable —
"n/a" when the hardware exposes no sensor), and per-storage capacity (total/used/fraction,
thin-pool fill, disk SMART temp+wear). It **reuses the slice-4 collector** (no duplicate
collection) and serves a **fresh** collect (current cpu%/temp, not the 15-min hub snapshot).
Unlike the rest of the surface this is **host-wide, not per-guest** (the box, not the caller's
guest) — correct for "see my box's health" — but still **token-authed** via the per-guest token.
**Assumption: one customer per host** (the home-server model); if a host ever served multiple
customers, host-wide CPU/mem would leak cross-customer load → revisit then. The de-privileged
controller (slice 8C) sees only its own cgroup, so it cannot read host health itself; this
re-serves the agent's existing host + storage observation to the customer. **Status:
implemented** (agent v0.14.0 `internal/localapi` + `internal/hub/cputemp.go`; controller v0.39.0
`internal/web/agent_host_metrics_handler.go` + the monitoring page's host-health card).
- **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
`POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
/disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
@@ -406,6 +419,7 @@ this path — bring up + reattach external storage and it is whole. This is full
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
@@ -485,6 +499,18 @@ This doc hands the implementation three contracts it was waiting on:
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
### Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)
- §6: added **`GET /host/metrics`** — host-wide health (cpu%/mem/load/uptime/**`cpu_temp_c`**) +
per-storage capacity for the customer's monitoring view. **Reuses the slice-4 collector** (no
duplicate collection); host-wide, **token-authed**, **fresh** (not the 15-min hub snapshot).
- §9 slice table: **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills it).
Noted the **one-customer-per-host** assumption (host-wide CPU/mem would leak cross-customer load on
a multi-customer host) and the out-of-scope items (multi-tenant filtering; time-series history).
- The one new collector is **CPU/chassis temp** (`internal/hub/cputemp.go`, sysfs hwmon/thermal-zone,
**graceful-null**), added to the **shared `HostMetrics`** → the hub report gains `cpu_temp_c` too
(operator freebie) → **cross-repo host-report golden updated** byte-identical. Status: implemented
(agent v0.14.0; controller v0.39.0).
### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
**reframed the principle** — a controller may do non-data-destructive storage setup self-serve;