slice 9: host-health view on the monitoring page (v0.39.0)
Add agentapi HostMetrics() + a thin /api/host-metrics proxy to the agent's new GET /host/metrics, and a 'Szerver allapota (gazdagep)' ("Server status (host machine)") card on the monitoring page rendering host CPU%/load/mem/CPU-temp(n/a)/uptime + per-storage capacity bars (thin-pool fill, disk temp/wear). Polls every 8s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -1,43 +1,40 @@
-# REPORT — slice 8B.2 (controller half): resume at `snapshotted` (v0.38.0) (2026-06-10)
+# REPORT — slice 9 (controller half): host-health view (v0.39.0) (2026-06-10)
 
-> Overwrite-latest report. Cumulative history: [CHANGELOG.md](CHANGELOG.md). Implements the
-> controller half of `TASK — Slice 8B.2`. Pairs with `felhom-agent` v0.13.0. No hub change.
+> Overwrite-latest report. Cumulative history: [CHANGELOG.md](CHANGELOG.md).
 
-## Outcome
+## What was implemented
 
-The quiesce loop (8B) kept the app stopped for the **whole backup**. In snapshot mode the app only
-needs to be stopped until the **storage snapshot** is taken; after that vzdump reads from the
-snapshot. The controller now resumes its app at the agent's **`snapshotted`** phase instead of
-`done` — app downtime drops from *whole-backup* to *until-snapshot*, with no loss of app-consistency.
-**Measured live: ~3s vs ~23s (~87% cut)**, restore still clean.
+The customer-facing half of **slice 9**. Pairs with `felhom-agent` v0.14.0. The de-privileged
+controller (slice 8C) sees only its own cgroup, so it can't read the host. The monitoring page now
+shows the **real Proxmox box**, proxied from the agent's new `GET /host/metrics`.
 
-## What landed (`internal/quiesce`)
+### `internal/agentapi` — client method
+- **`Client.HostMetrics(ctx)`** — calls the agent's `GET /host/metrics` over the leaf-pinned,
+per-guest-token channel (same client as the 8C disk proxy). New mirror structs `HostMetrics` (with
+nullable `CPUTempC`), `StorageTarget`, `ThinPoolFill`, `SmartSummary` (a **subset** — only the
+fields the UI renders; unknown wire keys ignored).
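Review note: a minimal sketch of how a mirror struct with a nullable `CPUTempC` could decode, assuming plain `encoding/json`. Only `cpu_temp_c` is a wire key named in this report; every other JSON key, the `StorageTarget` fields, and the helper `formatCPUTemp` are hypothetical stand-ins, not the agent's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostMetrics mirrors only the fields a UI would render.
// Assumption: apart from cpu_temp_c, every JSON key below is hypothetical.
type HostMetrics struct {
	CPUPercent float64         `json:"cpu_percent"`
	CPUTempC   *float64        `json:"cpu_temp_c"` // pointer: JSON null decodes to nil
	Storage    []StorageTarget `json:"storage"`
}

// StorageTarget is a hypothetical subset of one storage entry.
type StorageTarget struct {
	Name       string `json:"name"`
	UsedBytes  uint64 `json:"used_bytes"`
	TotalBytes uint64 `json:"total_bytes"`
}

// decodeHostMetrics parses the payload. encoding/json skips keys that have
// no matching struct field, so extra wire keys are ignored.
func decodeHostMetrics(raw []byte) (HostMetrics, error) {
	var m HostMetrics
	err := json.Unmarshal(raw, &m)
	return m, err
}

// formatCPUTemp renders the temperature, or "n/a" when the host reports none.
func formatCPUTemp(t *float64) string {
	if t == nil {
		return "n/a"
	}
	return fmt.Sprintf("%.1f C", *t)
}

func main() {
	raw := []byte(`{"cpu_percent":12.5,"cpu_temp_c":null,"extra_key":true,
		"storage":[{"name":"local-lvm","used_bytes":50,"total_bytes":200}]}`)
	m, err := decodeHostMetrics(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(formatCPUTemp(m.CPUTempC), m.Storage[0].Name)
}
```

Using a pointer for the temperature keeps "sensor absent" distinct from a genuine reading of 0, which is what lets the card fall back to "n/a".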
 
-- The status-poll loop **resumes (`StartStack` + clears the marker) at `snapshotted`**, then **keeps
-polling to `done`/`failed`** — so a new backup isn't started until this one truly finishes and a
-post-snapshot failure is still observed (the backup isn't "successful" until `done`; the early
-resume does not mark it done).
-- **Fallback:** if `snapshotted` never arrives (stop/downgraded storage), it resumes at `done`
-exactly as 8B. The agent only emits `snapshotted` when the actual mode is snapshot.
-- **Crash-safety unchanged:** marker written before stop; guaranteed unquiesce (deferred); startup
-`Recover()`. A failure *after* `snapshotted` is harmless — the app is already up.
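Review note: the resume-at-`snapshotted` behaviour described in the removed 8B.2 bullets can be sketched as a small state walk. This is an illustration under assumptions, not the `internal/quiesce` code: `trackBackup`, the `resume` callback (standing in for `StartStack` plus clearing the marker) and the `running` phase are hypothetical names; only `snapshotted`, `done` and `failed` come from the report.

```go
package main

import "fmt"

// Phase is a backup phase reported by the agent. "snapshotted", "done" and
// "failed" come from the report; "running" is a hypothetical placeholder.
type Phase string

const (
	PhaseRunning     Phase = "running"
	PhaseSnapshotted Phase = "snapshotted"
	PhaseDone        Phase = "done"
	PhaseFailed      Phase = "failed"
)

// trackBackup walks the observed phases. It calls resume at most once: at
// "snapshotted" when that phase arrives, otherwise when tracking ends.
// It reports success only when "done" is seen.
func trackBackup(phases []Phase, resume func()) (ok bool) {
	resumed := false
	resumeOnce := func() {
		if !resumed {
			resumed = true
			resume()
		}
	}
	defer resumeOnce() // guaranteed unquiesce on every exit path

	for _, p := range phases {
		switch p {
		case PhaseSnapshotted:
			resumeOnce() // early resume; keep tracking the backup
		case PhaseDone:
			return true
		case PhaseFailed:
			return false
		}
	}
	return false // phases ran out without a terminal state
}

func main() {
	resumes := 0
	ok := trackBackup(
		[]Phase{PhaseRunning, PhaseSnapshotted, PhaseRunning, PhaseDone},
		func() { resumes++ },
	)
	fmt.Println(ok, resumes)
}
```

The deferred call covers the stop-mode fallback (no `snapshotted`, resume when `done` arrives) and the failure paths with the same single-resume guarantee.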
+### `internal/web` — proxy + UI
+- **`ServeHostMetricsAPI`** (`agent_host_metrics_handler.go`) — a thin read-only proxy:
+`GET /api/host-metrics` → agent `GET /host/metrics`. Returns the `{ok,data,error}` envelope; 503
+when the local API is not configured (unprovisioned guest), 502 on an agent error. Wired in
+`main.go` behind `RequireAuth` (GET-only → no CSRF wrapper).
+- **Monitoring view** (`templates/monitoring.html`): a new **"Szerver állapota (gazdagép)"** card at
+the top renders the host block (CPU% + load, memory used/total, **CPU temp** or **"n/a"** when
+null, uptime) + per-storage capacity bars (used/total, thin-pool fill, disk temp/wear), reusing
+the existing `system-bar`/`storage-item` styling. Polls `/api/host-metrics` every **8 s** while the
+page is open (a live snapshot, distinct from the controller's own 60 s charts); yellow "nem
+elérhető" banner when the agent is unreachable.
 
-## Tests
+## Tests (green)
+- `agentapi/host_metrics_test.go`: decodes host + storage (thin-pool, SMART temp + NVMe wear), USB
+drive's null SMART, and a null `cpu_temp_c` → nil pointer.
+- `go build ./...` + `go test ./internal/agentapi ./internal/web` green.
 
-`go build ./...` + `go test ./...` green. quiesce: resume at `snapshotted` (RESUME event before
-`done`, marker cleared, then tracked to `done`); stop-mode fallback (resume at `done`, no
-`snapshotted`); fail-after-`snapshotted` (single resume, app stays up); the 8B crash-safety tests
-stay green.
+## Versioning / docs
+- Version `0.38.0 → 0.39.0` (set at build via ldflags); `CHANGELOG.md` + `controller/README.md`
+(Monitoring → "Host (Proxmox box) Health" section) updated.
 
-## Live validation (demo-felhom)
-
-A provisioned controller v0.38.0 with a postgres stack, short quiesce poll: timeline —
-`quiescing [pgtest]` 12:58:45 → `snapshotted — resuming app early` 12:58:48 → `backup done` 12:59:08.
-**App downtime ≈ 3s** (vs ≈ 23s to `done`). The snapshot backup restored to a scratch guest came up
-**clean** (`database system was shut down at 12:58:45`, no WAL replay) — the early resume preserved
-app-consistency. The controller kept tracking to `done` after resuming (no overlapping backup).
-
-## Deferred / dependency
-
-Snapshot-capable storage (lvm-thin/ZFS) required for the win; stop/downgraded storage falls back to
-resume-at-`done` (8B). No consistency-contract or crash-safety change. No secrets committed.
+## Pending
+- **Build + deploy** controller v0.39.0 to the demo nodes and live-validate the monitoring page
+against the real N100 (cross-check vs `pvesh`/`free`/`df`).