diff --git a/REPORT.md b/REPORT.md
index de81fa7..2f5f8ec 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -4,32 +4,41 @@
 
 ---
 
-# REPORT — Slice 8B.2 docs: quiesce downtime optimization (resume at `snapshotted`) (2026-06-10)
+# REPORT — Slice 9 (hub + docs): host metrics to the controller — `cpu_temp_c` wire field + docs (2026-06-10)
 
 ## Type
 
-Documentation update for **slice 8B.2** (implementation: `felhom-agent` v0.13.0 + `felhom-controller`
-v0.38.0; no hub change).
+Cross-repo wire-contract + documentation update for **slice 9** (implementation: `felhom-agent`
+v0.14.0 + `felhom-controller` v0.39.0). **No hub code change, no hub version bump.**
+
+## What changed (hub)
+
+- **Cross-repo host-report golden** (`hub/internal/api/testdata/host-report.golden.json`) gained
+  **`host.cpu_temp_c: 47`**, kept **byte-identical** with
+  `felhom-agent/internal/hub/testdata/host-report.golden.json` (the duplicated-contract discipline;
+  manual diff confirmed identical). No code change: the full `report_json` already persists the field
+  verbatim, and the hub's host parse-struct ignores the extra key — the golden-contract test
+  (`host_test.go`) still passes. CPU temp on the operator dashboard is an optional later freebie.
+- `hub/CHANGELOG.md` records the contract update (no version bump).
 
 ## What changed (doc 03 — host-agent)
 
-- **§8** — the **8B.2 downtime optimization is now implemented** (was a fast-follow note): in snapshot
-  mode the agent watches the vzdump task log for the snapshot marker (`create storage snapshot`,
-  validated PVE 9.2.2) and emits a **`snapshotted`** phase on `/backup/status`; the controller
-  **resumes its app at `snapshotted`** (not `done`), cutting app downtime from *whole-backup* to
-  *until-snapshot* with **no loss of app-consistency** (the snapshot froze the app-stopped state).
-  Noted the snapshot-capable-storage dependency + the stop-mode **fallback to resume-at-`done`**, and
-  that the controller keeps tracking to `done`/`failed` after early resume.
-- **§9 slice table** — the 8B row notes 8B.2 implemented.
+- **§6** — added **`GET /host/metrics`** to the local-API surface: host-wide health
+  (cpu%/mem/load/uptime/`cpu_temp_c`) + per-storage capacity for the customer's monitoring view.
+  Reuses the slice-4 collector (no duplicate collection); **host-wide, token-authed, fresh** (not the
+  15-min hub snapshot); noted the **one-customer-per-host** assumption.
+- **§9 slice table** — **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills
+  it), incl. the assumption + out-of-scope items (multi-tenant filtering, time-series history). Added
+  a slice-9 entry to the doc changelog.
 
-## Live validation (cross-repo, on the demo)
+## Why (the slice 9 thesis)
 
-A provisioned controller + postgres stack: `quiescing` → `snapshotted — resuming app early` →
-`backup done`. **App downtime ≈ 3s** (resume at snapshot) vs **≈ 23s** if it had waited for `done`
-(~87% cut). The snapshot backup restored **clean** (`database system was shut down`, no WAL replay) —
-the early resume preserved app-consistency. See the agent + controller REPORTs.
+The de-privileged controller (slice 8C) sees only its own cgroup — it can't read the host. Slice 9
+re-serves the agent's existing host + storage observation to the customer, plus the one new collector
+(CPU/chassis temp, graceful-null). On-ethos for a data-sovereignty product: the customer sees their
+own box's health.
 
-## Deferred
+## Deferred / not built
 
-Snapshot-capable storage required for the win; stop/downgraded storage falls back to resume-at-`done`
-(8B). No hub change → no deploy. No secrets committed.
+Multi-tenant host-metric filtering (one-customer-per-host assumed); historical/time-series metric
+storage (this is a live snapshot view). No secrets committed.
diff --git a/documentation/architecture/03-host-agent.md b/documentation/architecture/03-host-agent.md
index dd0ff14..a4cac23 100644
--- a/documentation/architecture/03-host-agent.md
+++ b/documentation/architecture/03-host-agent.md
@@ -117,6 +117,19 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
   - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
   - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
   - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
+  - **Host metrics (slice 9):** `GET /host/metrics` — **host-wide** health for the customer's
+    monitoring view: cpu%/mem/load/uptime, **CPU/chassis temperature** (`cpu_temp_c`, nullable —
+    "n/a" when the hardware exposes no sensor), and per-storage capacity (total/used/fraction,
+    thin-pool fill, disk SMART temp+wear). It **reuses the slice-4 collector** (no duplicate
+    collection) and serves a **fresh** collect (current cpu%/temp, not the 15-min hub snapshot).
+    Unlike the rest of the surface this is **host-wide, not per-guest** (the box, not the caller's
+    guest) — correct for "see my box's health" — but still **token-authed** via the per-guest token.
+    **Assumption: one customer per host** (the home-server model); if a host ever served multiple
+    customers, host-wide CPU/mem would leak cross-customer load → revisit then. The de-privileged
+    controller (slice 8C) sees only its own cgroup, so it cannot read host health itself; this
+    re-serves the agent's existing host + storage observation to the customer. **Status:
+    implemented** (agent v0.14.0 `internal/localapi` + `internal/hub/cputemp.go`; controller v0.39.0
+    `internal/web/agent_host_metrics_handler.go` + the monitoring page's host-health card).
   - **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
     `POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve),
     `POST /disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
@@ -406,6 +419,7 @@ this path — bring up + reattach external storage and it is whole. This is full
 | **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
 | **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
 | **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
+| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
 | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
 | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
 | Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
@@ -485,6 +499,18 @@ This doc hands the implementation three contracts it was waiting on:
 
 ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
 
+### Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)
+- §6: added **`GET /host/metrics`** — host-wide health (cpu%/mem/load/uptime/**`cpu_temp_c`**) +
+  per-storage capacity for the customer's monitoring view. **Reuses the slice-4 collector** (no
+  duplicate collection); host-wide, **token-authed**, **fresh** (not the 15-min hub snapshot).
+- §9 slice table: **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills it).
+  Noted the **one-customer-per-host** assumption (host-wide CPU/mem would leak cross-customer load on
+  a multi-customer host) and the out-of-scope items (multi-tenant filtering; time-series history).
+- The one new collector is **CPU/chassis temp** (`internal/hub/cputemp.go`, sysfs hwmon/thermal-zone,
+  **graceful-null**), added to the **shared `HostMetrics`** → the hub report gains `cpu_temp_c` too
+  (operator freebie) → **cross-repo host-report golden updated** byte-identical. Status: implemented
+  (agent v0.14.0; controller v0.39.0).
+
 ### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
 - §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
   **reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
diff --git a/hub/CHANGELOG.md b/hub/CHANGELOG.md
index 14e19ea..925549d 100644
--- a/hub/CHANGELOG.md
+++ b/hub/CHANGELOG.md
@@ -1,5 +1,16 @@
 # Felhom Hub — Changelog
 
+## (no version bump) — slice 9 cross-repo wire-contract: `host.cpu_temp_c` (2026-06-10)
+
+Slice 9 adds a nullable **`cpu_temp_c`** field to the shared `HostMetrics` wire struct (the agent's
+new CPU/chassis-temperature collector). The agent's host-report carries it too, so the hub's
+**cross-repo host-report golden** (`internal/api/testdata/host-report.golden.json`) was updated to
+stay **byte-identical** with `felhom-agent/internal/hub/testdata/host-report.golden.json` (the
+duplicated-contract discipline; manual diff confirmed identical). **No hub code change** — the full
+report_json already persists the field verbatim, and the hub does not surface CPU temp on the
+operator dashboard yet (an optional later freebie). The golden-contract test (`host_test.go`) still
+passes (the host parse-struct ignores the extra key).
+
 ## v0.8.0 — opaque PBS recovery-code escrow storage (slice 7, doc 03 §8a) (2026-06-10)
 
 Hub half of slice-7 close-out: store the agent's **opaque** `R`-wrapped PBS-key escrow blob. The
diff --git a/hub/internal/api/testdata/host-report.golden.json b/hub/internal/api/testdata/host-report.golden.json
index 695f894..7309266 100644
--- a/hub/internal/api/testdata/host-report.golden.json
+++ b/hub/internal/api/testdata/host-report.golden.json
@@ -12,7 +12,8 @@
     "disk_used_bytes": 30000000000,
     "disk_percent": 19.7,
     "loadavg": ["0.10", "0.20", "0.15"],
-    "uptime_seconds": 86400
+    "uptime_seconds": 86400,
+    "cpu_temp_c": 47
   },
   "guests": [
     {