slice 9 docs + wire-contract: host.cpu_temp_c golden + doc 03 GET /host/metrics
Update the cross-repo host-report golden byte-identical with felhom-agent (host.cpu_temp_c). Document GET /host/metrics in doc 03 section 6 and define slice 9 in the section 9 roadmap. No hub code change / no version bump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -4,32 +4,41 @@
|
||||
|
||||
---
|
||||
|
||||
# REPORT — Slice 8B.2 docs: quiesce downtime optimization (resume at `snapshotted`) (2026-06-10)
|
||||
# REPORT — Slice 9 (hub + docs): host metrics to the controller — `cpu_temp_c` wire field + docs (2026-06-10)
|
||||
|
||||
## Type
|
||||
|
||||
Documentation update for **slice 8B.2** (implementation: `felhom-agent` v0.13.0 + `felhom-controller`
|
||||
v0.38.0; no hub change).
|
||||
Cross-repo wire-contract + documentation update for **slice 9** (implementation: `felhom-agent`
|
||||
v0.14.0 + `felhom-controller` v0.39.0). **No hub code change, no hub version bump.**
|
||||
|
||||
## What changed (hub)
|
||||
|
||||
- **Cross-repo host-report golden** (`hub/internal/api/testdata/host-report.golden.json`) gained
|
||||
**`host.cpu_temp_c: 47`**, kept **byte-identical** with
|
||||
`felhom-agent/internal/hub/testdata/host-report.golden.json` (the duplicated-contract discipline;
|
||||
manual diff confirmed identical). No code change: the full `report_json` already persists the field
|
||||
verbatim, and the hub's host parse-struct ignores the extra key — the golden-contract test
|
||||
(`host_test.go`) still passes. CPU temp on the operator dashboard is an optional later freebie.
|
||||
- `hub/CHANGELOG.md` records the contract update (no version bump).
|
||||
|
||||
## What changed (doc 03 — host-agent)
|
||||
|
||||
- **§8** — the **8B.2 downtime optimization is now implemented** (was a fast-follow note): in snapshot
|
||||
mode the agent watches the vzdump task log for the snapshot marker (`create storage snapshot`,
|
||||
validated PVE 9.2.2) and emits a **`snapshotted`** phase on `/backup/status`; the controller
|
||||
**resumes its app at `snapshotted`** (not `done`), cutting app downtime from *whole-backup* to
|
||||
*until-snapshot* with **no loss of app-consistency** (the snapshot froze the app-stopped state).
|
||||
Noted the snapshot-capable-storage dependency + the stop-mode **fallback to resume-at-`done`**, and
|
||||
that the controller keeps tracking to `done`/`failed` after early resume.
|
||||
- **§9 slice table** — the 8B row notes 8B.2 implemented.
|
||||
- **§6** — added **`GET /host/metrics`** to the local-API surface: host-wide health
|
||||
(cpu%/mem/load/uptime/`cpu_temp_c`) + per-storage capacity for the customer's monitoring view.
|
||||
Reuses the slice-4 collector (no duplicate collection); **host-wide, token-authed, fresh** (not the
|
||||
15-min hub snapshot); noted the **one-customer-per-host** assumption.
|
||||
- **§9 slice table** — **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills
|
||||
it), incl. the assumption + out-of-scope items (multi-tenant filtering, time-series history). Added
|
||||
a slice-9 entry to the doc changelog.
|
||||
|
||||
## Live validation (cross-repo, on the demo)
|
||||
## Why (the slice 9 thesis)
|
||||
|
||||
A provisioned controller + postgres stack: `quiescing` → `snapshotted — resuming app early` →
|
||||
`backup done`. **App downtime ≈ 3s** (resume at snapshot) vs **≈ 23s** if it had waited for `done`
|
||||
(~87% cut). The snapshot backup restored **clean** (`database system was shut down`, no WAL replay) —
|
||||
the early resume preserved app-consistency. See the agent + controller REPORTs.
|
||||
The de-privileged controller (slice 8C) sees only its own cgroup — it can't read the host. Slice 9
|
||||
re-serves the agent's existing host + storage observation to the customer, plus the one new collector
|
||||
(CPU/chassis temp, graceful-null). On-ethos for a data-sovereignty product: the customer sees their
|
||||
own box's health.
|
||||
|
||||
## Deferred
|
||||
## Deferred / not built
|
||||
|
||||
Snapshot-capable storage required for the win; stop/downgraded storage falls back to resume-at-`done`
|
||||
(8B). No hub change → no deploy. No secrets committed.
|
||||
Multi-tenant host-metric filtering (one-customer-per-host assumed); historical/time-series metric
|
||||
storage (this is a live snapshot view). No secrets committed.
|
||||
|
||||
Reference in New Issue
Block a user