slice 9 docs + wire-contract: host.cpu_temp_c golden + doc 03 GET /host/metrics

Update the cross-repo host-report golden byte-identical with felhom-agent
(host.cpu_temp_c). Document GET /host/metrics in doc 03 section 6 and define
slice 9 in the section 9 roadmap. No hub code change / no version bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -4,32 +4,41 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
# REPORT — Slice 8B.2 docs: quiesce downtime optimization (resume at `snapshotted`) (2026-06-10)
|
# REPORT — Slice 9 (hub + docs): host metrics to the controller — `cpu_temp_c` wire field + docs (2026-06-10)
|
||||||
|
|
||||||
## Type
|
## Type
|
||||||
|
|
||||||
Documentation update for **slice 8B.2** (implementation: `felhom-agent` v0.13.0 + `felhom-controller`
|
Cross-repo wire-contract + documentation update for **slice 9** (implementation: `felhom-agent`
|
||||||
v0.38.0; no hub change).
|
v0.14.0 + `felhom-controller` v0.39.0). **No hub code change, no hub version bump.**
|
||||||
|
|
||||||
|
## What changed (hub)
|
||||||
|
|
||||||
|
- **Cross-repo host-report golden** (`hub/internal/api/testdata/host-report.golden.json`) gained
|
||||||
|
**`host.cpu_temp_c: 47`**, kept **byte-identical** with
|
||||||
|
`felhom-agent/internal/hub/testdata/host-report.golden.json` (the duplicated-contract discipline;
|
||||||
|
manual diff confirmed identical). No code change: the full `report_json` already persists the field
|
||||||
|
verbatim, and the hub's host parse-struct ignores the extra key — the golden-contract test
|
||||||
|
(`host_test.go`) still passes. CPU temp on the operator dashboard is an optional later freebie.
|
||||||
|
- `hub/CHANGELOG.md` records the contract update (no version bump).
|
||||||
|
|
||||||
## What changed (doc 03 — host-agent)
|
## What changed (doc 03 — host-agent)
|
||||||
|
|
||||||
- **§8** — the **8B.2 downtime optimization is now implemented** (was a fast-follow note): in snapshot
|
- **§6** — added **`GET /host/metrics`** to the local-API surface: host-wide health
|
||||||
mode the agent watches the vzdump task log for the snapshot marker (`create storage snapshot`,
|
(cpu%/mem/load/uptime/`cpu_temp_c`) + per-storage capacity for the customer's monitoring view.
|
||||||
validated PVE 9.2.2) and emits a **`snapshotted`** phase on `/backup/status`; the controller
|
Reuses the slice-4 collector (no duplicate collection); **host-wide, token-authed, fresh** (not the
|
||||||
**resumes its app at `snapshotted`** (not `done`), cutting app downtime from *whole-backup* to
|
15-min hub snapshot); noted the **one-customer-per-host** assumption.
|
||||||
*until-snapshot* with **no loss of app-consistency** (the snapshot froze the app-stopped state).
|
- **§9 slice table** — **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills
|
||||||
Noted the snapshot-capable-storage dependency + the stop-mode **fallback to resume-at-`done`**, and
|
it), incl. the assumption + out-of-scope items (multi-tenant filtering, time-series history). Added
|
||||||
that the controller keeps tracking to `done`/`failed` after early resume.
|
a slice-9 entry to the doc changelog.
|
||||||
- **§9 slice table** — the 8B row notes 8B.2 implemented.
|
|
||||||
|
|
||||||
## Live validation (cross-repo, on the demo)
|
## Why (the slice 9 thesis)
|
||||||
|
|
||||||
A provisioned controller + postgres stack: `quiescing` → `snapshotted — resuming app early` →
|
The de-privileged controller (slice 8C) sees only its own cgroup — it can't read the host. Slice 9
|
||||||
`backup done`. **App downtime ≈ 3s** (resume at snapshot) vs **≈ 23s** if it had waited for `done`
|
re-serves the agent's existing host + storage observation to the customer, plus the one new collector
|
||||||
(~87% cut). The snapshot backup restored **clean** (`database system was shut down`, no WAL replay) —
|
(CPU/chassis temp, graceful-null). On-ethos for a data-sovereignty product: the customer sees their
|
||||||
the early resume preserved app-consistency. See the agent + controller REPORTs.
|
own box's health.
|
||||||
|
|
||||||
## Deferred
|
## Deferred / not built
|
||||||
|
|
||||||
Snapshot-capable storage required for the win; stop/downgraded storage falls back to resume-at-`done`
|
Multi-tenant host-metric filtering (one-customer-per-host assumed); historical/time-series metric
|
||||||
(8B). No hub change → no deploy. No secrets committed.
|
storage (this is a live snapshot view). No secrets committed.
|
||||||
|
|||||||
@@ -117,6 +117,19 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
|
|||||||
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
||||||
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
|
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
|
||||||
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
||||||
|
- **Host metrics (slice 9):** `GET /host/metrics` — **host-wide** health for the customer's
|
||||||
|
monitoring view: cpu%/mem/load/uptime, **CPU/chassis temperature** (`cpu_temp_c`, nullable —
|
||||||
|
"n/a" when the hardware exposes no sensor), and per-storage capacity (total/used/fraction,
|
||||||
|
thin-pool fill, disk SMART temp+wear). It **reuses the slice-4 collector** (no duplicate
|
||||||
|
collection) and serves a **fresh** collect (current cpu%/temp, not the 15-min hub snapshot).
|
||||||
|
Unlike the rest of the surface this is **host-wide, not per-guest** (the box, not the caller's
|
||||||
|
guest) — correct for "see my box's health" — but still **token-authed** via the per-guest token.
|
||||||
|
**Assumption: one customer per host** (the home-server model); if a host ever served multiple
|
||||||
|
customers, host-wide CPU/mem would leak cross-customer load → revisit then. The de-privileged
|
||||||
|
controller (slice 8C) sees only its own cgroup, so it cannot read host health itself; this
|
||||||
|
re-serves the agent's existing host + storage observation to the customer. **Status:
|
||||||
|
implemented** (agent v0.14.0 `internal/localapi` + `internal/hub/cputemp.go`; controller v0.39.0
|
||||||
|
`internal/web/agent_host_metrics_handler.go` + the monitoring page's host-health card).
|
||||||
- **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
|
- **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
|
||||||
`POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
|
`POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
|
||||||
/disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
|
/disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
|
||||||
@@ -406,6 +419,7 @@ this path — bring up + reattach external storage and it is whole. This is full
|
|||||||
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
|
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
|
||||||
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
|
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. **8B.2 downtime optimization (resume at `snapshotted`) implemented** (agent v0.13.0 + controller v0.38.0 — §8). |
|
||||||
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
|
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
|
||||||
|
| **Host metrics to the controller** (`GET /host/metrics` — the customer host-health view) | **9** | **implemented** (agent v0.14.0: `GET /host/metrics` reuses the slice-4 collector + a new CPU/chassis-temp collector `internal/hub/cputemp.go`, graceful-null; the shared `HostMetrics` gains `cpu_temp_c` so the hub report carries it too — cross-repo golden updated; controller v0.39.0: agentapi `HostMetrics()` + a thin `/api/host-metrics` proxy + the monitoring page's host-health card). **Host-wide, token-authed, fresh** (not the 15-min hub snapshot). **Assumption: one customer per host** (the home-server model) — host-wide CPU/mem would leak cross-customer load on a multi-customer host; revisit then. Out of scope: multi-tenant metric filtering; historical/time-series storage (this is a live snapshot). |
|
||||||
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
|
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
|
||||||
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
|
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
|
||||||
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
|
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
|
||||||
@@ -485,6 +499,18 @@ This doc hands the implementation three contracts it was waiting on:
|
|||||||
|
|
||||||
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
|
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
|
||||||
|
|
||||||
|
### Slice-9 implemented — host metrics to the controller (customer host-health view) (2026-06-10)
|
||||||
|
- §6: added **`GET /host/metrics`** — host-wide health (cpu%/mem/load/uptime/**`cpu_temp_c`**) +
|
||||||
|
per-storage capacity for the customer's monitoring view. **Reuses the slice-4 collector** (no
|
||||||
|
duplicate collection); host-wide, **token-authed**, **fresh** (not the 15-min hub snapshot).
|
||||||
|
- §9 slice table: **defined + marked slice 9** (the roadmap previously jumped 8→10; this fills it).
|
||||||
|
Noted the **one-customer-per-host** assumption (host-wide CPU/mem would leak cross-customer load on
|
||||||
|
a multi-customer host) and the out-of-scope items (multi-tenant filtering; time-series history).
|
||||||
|
- The one new collector is **CPU/chassis temp** (`internal/hub/cputemp.go`, sysfs hwmon/thermal-zone,
|
||||||
|
**graceful-null**), added to the **shared `HostMetrics`** → the hub report gains `cpu_temp_c` too
|
||||||
|
(operator freebie) → **cross-repo host-report golden updated** byte-identical. Status: implemented
|
||||||
|
(agent v0.14.0; controller v0.39.0).
|
||||||
|
|
||||||
### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
|
### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
|
||||||
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
|
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
|
||||||
**reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
|
**reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
|
||||||
|
|||||||
@@ -1,5 +1,16 @@
|
|||||||
# Felhom Hub — Changelog
|
# Felhom Hub — Changelog
|
||||||
|
|
||||||
|
## (no version bump) — slice 9 cross-repo wire-contract: `host.cpu_temp_c` (2026-06-10)
|
||||||
|
|
||||||
|
Slice 9 adds a nullable **`cpu_temp_c`** field to the shared `HostMetrics` wire struct (the agent's
|
||||||
|
new CPU/chassis-temperature collector). The agent's host-report carries it too, so the hub's
|
||||||
|
**cross-repo host-report golden** (`internal/api/testdata/host-report.golden.json`) was updated to
|
||||||
|
stay **byte-identical** with `felhom-agent/internal/hub/testdata/host-report.golden.json` (the
|
||||||
|
duplicated-contract discipline; manual diff confirmed identical). **No hub code change** — the full
|
||||||
|
report_json already persists the field verbatim, and the hub does not surface CPU temp on the
|
||||||
|
operator dashboard yet (an optional later freebie). The golden-contract test (`host_test.go`) still
|
||||||
|
passes (the host parse-struct ignores the extra key).
|
||||||
|
|
||||||
## v0.8.0 — opaque PBS recovery-code escrow storage (slice 7, doc 03 §8a) (2026-06-10)
|
## v0.8.0 — opaque PBS recovery-code escrow storage (slice 7, doc 03 §8a) (2026-06-10)
|
||||||
|
|
||||||
Hub half of slice-7 close-out: store the agent's **opaque** `R`-wrapped PBS-key escrow blob. The
|
Hub half of slice-7 close-out: store the agent's **opaque** `R`-wrapped PBS-key escrow blob. The
|
||||||
|
|||||||
+2
-1
@@ -12,7 +12,8 @@
|
|||||||
"disk_used_bytes": 30000000000,
|
"disk_used_bytes": 30000000000,
|
||||||
"disk_percent": 19.7,
|
"disk_percent": 19.7,
|
||||||
"loadavg": ["0.10", "0.20", "0.15"],
|
"loadavg": ["0.10", "0.20", "0.15"],
|
||||||
"uptime_seconds": 86400
|
"uptime_seconds": 86400,
|
||||||
|
"cpu_temp_c": 47
|
||||||
},
|
},
|
||||||
"guests": [
|
"guests": [
|
||||||
{
|
{
|
||||||
|
|||||||
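The changelog states that the golden-contract test still passes because the hub's host parse-struct ignores the extra key. That matches the default behaviour of Go's `encoding/json`, which silently drops object keys with no matching struct field. The sketch below illustrates this on a fragment of the golden's host object. The struct names and field selection are hypothetical, not the hub's actual types, and the pointer-typed field is just one way a consumer could later opt in to the nullable value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostLegacy is a hypothetical stand-in for a parse-struct written before
// cpu_temp_c existed: it declares no field for the new key.
type hostLegacy struct {
	Loadavg       []string `json:"loadavg"`
	UptimeSeconds int64    `json:"uptime_seconds"`
}

// hostWithTemp is a hypothetical struct that opts in to the new field.
// The pointer distinguishes "no sensor" (null or absent) from a reading.
type hostWithTemp struct {
	UptimeSeconds int64    `json:"uptime_seconds"`
	CPUTempC      *float64 `json:"cpu_temp_c"`
}

// decodeLegacy parses with the old shape; unknown keys are ignored.
func decodeLegacy(raw []byte) (hostLegacy, error) {
	var h hostLegacy
	err := json.Unmarshal(raw, &h)
	return h, err
}

// decodeWithTemp parses with the new shape; null leaves CPUTempC nil.
func decodeWithTemp(raw []byte) (hostWithTemp, error) {
	var h hostWithTemp
	err := json.Unmarshal(raw, &h)
	return h, err
}

func main() {
	raw := []byte(`{"loadavg": ["0.10", "0.20", "0.15"], "uptime_seconds": 86400, "cpu_temp_c": 47}`)

	legacy, err := decodeLegacy(raw)
	fmt.Println(legacy.UptimeSeconds, err)

	withTemp, err := decodeWithTemp(raw)
	if err == nil && withTemp.CPUTempC != nil {
		fmt.Println(*withTemp.CPUTempC)
	}
}
```

A decoder configured with `DisallowUnknownFields` would reject the new key instead, so this tolerance holds only for the default unmarshalling path.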