feat(hub): host-domain ingest — tables + /host-report + per-host auth + host dead-man's-switch (v0.7.0, slice 3)

Purely additive; the controller path (reports/customer_configs/checkAuthCustomer/
existing checkers) is untouched. Cutover remains slice 10.

- store: new hosts/guests/host_reports tables (full schema incl. columns INERT
  until slice 10, so no later ALTER); GetHostByAPIKey/GetHost/ListHosts/UpsertHost/
  SaveHostReport/UpsertGuestFromReport (preserves inert cols)/GetHostStaleness/
  GuestID; Prune also prunes host_reports.
- api: checkAuthHost (sibling of checkAuthCustomer); POST /host-report (per-host
  Bearer, 4MiB, denorm + guest upsert, control envelope); POST /admin/hosts
  (PROVISIONAL global-key host mint); host_* event types registered.
- monitor: HostStalenessChecker sibling over host_reports (host_stale/down/
  recovered), wired on the existing 60s ticker; controller checkers unchanged.
- tests (hermetic): store intent/inert-column preservation, auth, ingest
  (envelope+denorm, mismatch/unknown/blocked/oversize), admin mint round-trip,
  host staleness transitions.

CHANGELOG v0.7.0. Contract matches the agent host-report spec field-for-field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-08 16:36:16 +02:00
parent 0d832def7b
commit 7c0c75457f
12 changed files with 1204 additions and 38 deletions
+91
View File
@@ -0,0 +1,91 @@
# felhom.eu — task reports
> One section per task, **appended** (newest last) — not overwritten. Cumulative
> hub history lives in [hub/CHANGELOG.md](hub/CHANGELOG.md).
---
## Hub slice 3 — host-domain ingest (v0.7.0) — 2026-06-08
Purely **additive** host-domain ingest in `hub/`: new tables, the agent's
`/host-report` heartbeat endpoint, per-host Bearer auth, a provisional host mint, and a
host-domain dead-man's-switch. The existing controller path is **untouched**; the schema/
auth cutover remains **slice 10**. Pushed to `main`; build/vet/test green locally and on
the build server.
### New tables (`store.go migrate()`, idempotent — `// v0.7.0: host-domain`)
- **`hosts`** — one per customer agent. Reality columns (`agent_version`, `last_report_at`)
+ operator-intent columns **INERT until slice 10** (`desired_json`, `desired_generation`,
`dr_record_json`).
- **`guests`** — one per controller LXC, PK `guest_id = "<host_id>/<vmid>"` (hub-derived).
Reality columns (`display_name`, `status`, `controller_version`, `vmid`, `last_seen_at`)
+ **INERT** `api_key`, `desired_spec_json`.
- **`host_reports`** — the report stream + denormalized columns (cpu/mem/disk %, guest
counts, cloudflared status); pruned by `Prune(maxDays)` alongside `reports`.
> Inert columns exist **now** so slice 10 needs no `ALTER`; nothing reads/writes them this
> slice. Migration is additive-only (no `DROP`, no edits to `reports`/`customer_configs`)
> and idempotent.
### New store methods
`GetHostByAPIKey`, `GetHost`, `ListHosts`, `UpsertHost` (updates only identity + `updated_at`
on conflict), `SaveHostReport` (inserts a report row + bumps reality columns only),
`UpsertGuestFromReport` (updates reality columns only — **preserves** `api_key`/
`desired_spec_json`), `GetHostStaleness` (skips never-reported hosts), `GuestID`.
Structs: `Host`, `Guest`, `HostReportDenorm`, `HostStaleRow`.
### Auth (added; existing path unchanged)
`checkAuthHost(r)``(hostID, customerID, isGlobal, ok)`: global key → trust `body.host_id`;
per-host key → bound identity; failure → not-ok. `checkAuthCustomer` is byte-for-byte unchanged.
### Endpoints
- **`POST /api/v1/host-report`** (the heartbeat): per-host auth; 4 MiB body; computes denorm
(`guest_running` counts only `status=="running"`); `SaveHostReport` + per-guest
`UpsertGuestFromReport` (a guest upsert failure is logged, not fatal — liveness); returns the
control envelope `{status:"ok", poll_interval_seconds:900, blocked, desired_generation:0,
has_signed_ops:false}`. `blocked` reflects `customer_configs.status`; the other two are
reserved placeholders (slice 4). Global-key bootstrap requires the host to already exist
(else 400); per-host key requires `body.host_id == hostID` (else 403).
- **`POST /api/v1/admin/hosts`** — **PROVISIONAL**, global-key only. Mints `host_id` (legible
`<customer>-<hex>`) + a random `api_key` (`configgen.RandomHex(32)`); 201 `{host_id, api_key}`.
Flagged in code as the slice-3 bootstrap to be removed/locked at enrollment (slices 78).
### Host dead-man's-switch
`monitor.HostStalenessChecker` (`host_staleness.go`) — a **sibling** of the controller
`StalenessChecker`, keyed on host↔`host_reports`, emitting `host_stale`/`host_down`/
`host_recovered` (30m / 60m), attributed to the host's customer (so the existing per-customer
notification UX picks them up). Registered in `allowedEventTypes`; wired in `main.go` on the
existing 60s ticker. The controller staleness/deadline checkers are untouched and keep running.
### Contract
The `/host-report` JSON matches the agent spec §4 field-for-field (host_id, reported_at,
agent_version, host{…}, guests[{vmid,name,status,controller_version,spec}], cloudflared{status},
and the empty storage_targets/backups/restore_tests/pbs_snapshots/audit_tail — accepted
empty/absent). The envelope matches agent spec §5.
### Test matrix (new, hermetic — temp SQLite, no live data)
- **store**: upsert/lookup; a report-path update **preserves** `desired_json`/`desired_generation`;
guest upsert **preserves** `api_key`/`desired_spec_json` while updating reality; `GuestID`;
staleness skips never-reported.
- **auth**: `checkAuthHost` global / per-host / unknown.
- **ingest**: valid → 200 + envelope + denorm (`guest_running` = 1 of 2); host_id mismatch → 403;
unknown host under global key → 400; blocked customer → `blocked:true`; oversize body → 400.
- **admin mint**: non-global → 403; unknown customer → 400; success → 201 + minted key
round-trips through `/host-report`.
- **host staleness**: seed emits no events; ok→stale→down→recovered transitions.
### Untouched / deferred (explicit)
- **Controller path unchanged**: `/api/v1/report`, `reports`, `customer_configs`,
`checkAuthCustomer`, existing staleness + deadline checkers — additions only, all still green.
- **Not built** (per scope): desired-state serving, `signed_ops`, geo→hub, DR-record migration,
dashboard re-design. The cutover (drop `reports``guest_reports`, merge checkers, tighten the
provisional admin/global-key auth) remains **slice 10**.
### Versioning / deploy
Hub version is the `main.Version` ldflags var (`build.sh <VER>`), default `"dev"`; recorded
**v0.7.0** in `hub/CHANGELOG.md`. The image build + ArgoCD deploy are **not** part of this task
(no deploy performed).
### Repo state
Branch: `main`. Verified `go build/vet/test ./...` green in `hub/` locally (go1.26) and on the
build server (go1.26).