slice 9 docs + wire-contract: host.cpu_temp_c golden + doc 03 GET /host/metrics

Update the cross-repo host-report golden byte-identical with felhom-agent
(host.cpu_temp_c). Document GET /host/metrics in doc 03 section 6 and define
slice 9 in the section 9 roadmap. No hub code change / no version bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 16:16:38 +02:00
parent 5dc363771b
commit 4590fc0ee0
4 changed files with 67 additions and 20 deletions
+28 -19
@@ -4,32 +4,41 @@
---
# REPORT — Slice 8B.2 docs: quiesce downtime optimization (resume at `snapshotted`) (2026-06-10)
# REPORT — Slice 9 (hub + docs): host metrics to the controller — `cpu_temp_c` wire field + docs (2026-06-10)
## Type
Documentation update for **slice 8B.2** (implementation: `felhom-agent` v0.13.0 + `felhom-controller`
v0.38.0; no hub change).
Cross-repo wire-contract + documentation update for **slice 9** (implementation: `felhom-agent`
v0.14.0 + `felhom-controller` v0.39.0). **No hub code change, no hub version bump.**
## What changed (hub)
- **Cross-repo host-report golden** (`hub/internal/api/testdata/host-report.golden.json`) gained
**`host.cpu_temp_c: 47`**, kept **byte-identical** with
`felhom-agent/internal/hub/testdata/host-report.golden.json` (the duplicated-contract discipline;
manual diff confirmed identical). No code change: the full `report_json` already persists the field
verbatim, and the hub's host parse-struct ignores the extra key — the golden-contract test
(`host_test.go`) still passes. CPU temp on the operator dashboard is an optional later freebie.
- `hub/CHANGELOG.md` records the contract update (no version bump).
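The "parse-struct ignores the extra key" point above can be illustrated with a minimal sketch. It assumes the hub decodes the report with Go's standard `encoding/json` defaults (not confirmed by this report); the `hostReport` type, the `parseHost` helper, and the `hostname` field are hypothetical stand-ins, and only the `host.cpu_temp_c` key comes from the golden.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostReport is a hypothetical stand-in for the hub's host parse-struct.
// It deliberately declares no field for host.cpu_temp_c.
type hostReport struct {
	Host struct {
		Hostname string `json:"hostname"` // illustrative field, not taken from the golden
	} `json:"host"`
}

// parseHost decodes a report with encoding/json defaults, which skip
// object keys that have no matching struct field instead of failing.
func parseHost(raw []byte) (hostReport, error) {
	var r hostReport
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	// The new key is present on the wire but unknown to the struct.
	raw := []byte(`{"host":{"hostname":"pve-demo","cpu_temp_c":47}}`)
	r, err := parseHost(raw)
	fmt.Println(r.Host.Hostname, err)
}
```

If the decoder were configured with `DisallowUnknownFields`, the same input would be rejected, so this behaviour depends on the defaults being in use.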
## What changed (doc 03 — host-agent)
- **§8** — the **8B.2 downtime optimization is now implemented** (was a fast-follow note): in snapshot
mode the agent watches the vzdump task log for the snapshot marker (`create storage snapshot`,
validated PVE 9.2.2) and emits a **`snapshotted`** phase on `/backup/status`; the controller
**resumes its app at `snapshotted`** (not `done`), cutting app downtime from *whole-backup* to
*until-snapshot* with **no loss of app-consistency** (the snapshot froze the app-stopped state).
Noted the snapshot-capable-storage dependency + the stop-mode **fallback to resume-at-`done`**, and
that the controller keeps tracking to `done`/`failed` after early resume.
- **§9 slice table** — the 8B row notes 8B.2 implemented.
- **§6** — added **`GET /host/metrics`** to the local-API surface: host-wide health
(cpu%/mem/load/uptime/`cpu_temp_c`) + per-storage capacity for the customer's monitoring view.
Reuses the slice-4 collector (no duplicate collection); **host-wide, token-authed, fresh** (not the
15-min hub snapshot); noted the **one-customer-per-host** assumption.
- **§9 slice table** — **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills
it), incl. the assumption + out-of-scope items (multi-tenant filtering, time-series history). Added
a slice-9 entry to the doc changelog.
## Live validation (cross-repo, on the demo)
## Why (the slice 9 thesis)
A provisioned controller + postgres stack: `quiescing` → `snapshotted — resuming app early` →
`backup done`. **App downtime ≈ 3s** (resume at snapshot) vs **≈ 23s** if it had waited for `done`
(~87% cut). The snapshot backup restored **clean** (`database system was shut down`, no WAL replay) —
the early resume preserved app-consistency. See the agent + controller REPORTs.
The de-privileged controller (slice 8C) sees only its own cgroup — it can't read the host. Slice 9
re-serves the agent's existing host + storage observation to the customer, plus the one new collector
(CPU/chassis temp, graceful-null). On-ethos for a data-sovereignty product: the customer sees their
own box's health.
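One way a consumer might handle the graceful-null temperature is sketched below. It assumes "graceful-null" means `cpu_temp_c` is `null` (or absent) when the host exposes no sensor, and that the key sits at the top level of the `GET /host/metrics` payload; both are assumptions, and `hostMetrics` and `describeTemp` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostMetrics is a hypothetical subset of the GET /host/metrics payload.
// Only the cpu_temp_c key name comes from the report; its placement is assumed.
type hostMetrics struct {
	// A pointer distinguishes "no reading" (nil) from a real 0 degree reading.
	CPUTempC *float64 `json:"cpu_temp_c"`
}

// describeTemp renders the temperature, falling back to a placeholder
// when the value is JSON null or the key is missing.
func describeTemp(raw []byte) (string, error) {
	var m hostMetrics
	if err := json.Unmarshal(raw, &m); err != nil {
		return "", err
	}
	if m.CPUTempC == nil {
		return "temperature unavailable", nil
	}
	return fmt.Sprintf("%.0f C", *m.CPUTempC), nil
}

func main() {
	for _, raw := range []string{`{"cpu_temp_c":47}`, `{"cpu_temp_c":null}`, `{}`} {
		s, err := describeTemp([]byte(raw))
		fmt.Println(s, err)
	}
}
```

A plain `float64` would silently turn a missing sensor into 0, which is why the sketch uses a pointer.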
## Deferred
## Deferred / not built
Snapshot-capable storage required for the win; stop/downgraded storage falls back to resume-at-`done`
(8B). No hub change → no deploy. No secrets committed.
Multi-tenant host-metric filtering (one-customer-per-host assumed); historical/time-series metric
storage (this is a live snapshot view). No secrets committed.