# felhom.eu — task reports
> One section per task, **appended** (newest last) — not overwritten. Cumulative
> hub history lives in [hub/CHANGELOG.md](hub/CHANGELOG.md).
---
## Hub slice 3 — host-domain ingest (v0.7.0) — 2026-06-08
Purely **additive** host-domain ingest in `hub/`: new tables, the agent's
`/host-report` heartbeat endpoint, per-host Bearer auth, a provisional host mint, and a
host-domain dead-man's-switch. The existing controller path is **untouched**; the schema/
auth cutover remains **slice 10**. Pushed to `main`; build/vet/test green locally and on
the build server.
### New tables (`store.go migrate()`, idempotent — `// v0.7.0: host-domain`)
- **`hosts`**: one per customer agent. Reality columns (`agent_version`, `last_report_at`)
  plus operator-intent columns that stay **INERT until slice 10** (`desired_json`,
  `desired_generation`, `dr_record_json`).
- **`guests`**: one per controller LXC, PK `guest_id = "<host_id>/<vmid>"` (hub-derived).
  Reality columns (`display_name`, `status`, `controller_version`, `vmid`, `last_seen_at`)
  plus **INERT** `api_key` and `desired_spec_json`.
- **`host_reports`**: the report stream plus denormalized columns (cpu/mem/disk %, guest
  counts, cloudflared status); pruned by `Prune(maxDays)` alongside `reports`.
> Inert columns exist **now** so slice 10 needs no `ALTER`; nothing reads/writes them this
> slice. Migration is additive-only (no `DROP`, no edits to `reports`/`customer_configs`)
> and idempotent.
### New store methods
`GetHostByAPIKey`, `GetHost`, `ListHosts`, `UpsertHost` (updates only identity + `updated_at`
on conflict), `SaveHostReport` (inserts a report row + bumps reality columns only),
`UpsertGuestFromReport` (updates reality columns only — **preserves** `api_key`/
`desired_spec_json`), `GetHostStaleness` (skips never-reported hosts), `GuestID`.
Structs: `Host`, `Guest`, `HostReportDenorm`, `HostStaleRow`.
### Auth (added; existing path unchanged)
`checkAuthHost(r)` → `(hostID, customerID, isGlobal, ok)`: global key → trust `body.host_id`;
per-host key → bound identity; failure → not-ok. `checkAuthCustomer` is byte-for-byte unchanged.
### Endpoints
- **`POST /api/v1/host-report`** (the heartbeat): per-host auth; 4 MiB body; computes denorm
(`guest_running` counts only `status=="running"`); `SaveHostReport` + per-guest
`UpsertGuestFromReport` (a guest upsert failure is logged, not fatal — liveness); returns the
control envelope `{status:"ok", poll_interval_seconds:900, blocked, desired_generation:0,
has_signed_ops:false}`. `blocked` reflects `customer_configs.status`; `desired_generation` and
`has_signed_ops` are reserved placeholders (slice 4). Global-key bootstrap requires the host to
already exist (else 400); a per-host key requires `body.host_id == hostID` (else 403).
- **`POST /api/v1/admin/hosts`** — **PROVISIONAL**, global-key only. Mints `host_id` (legible
`<customer>-<hex>`) + a random `api_key` (`configgen.RandomHex(32)`); 201 `{host_id, api_key}`.
Flagged in code as the slice-3 bootstrap to be removed/locked at enrollment (slices 7-8).
### Host dead-man's-switch
`monitor.HostStalenessChecker` (`host_staleness.go`) — a **sibling** of the controller
`StalenessChecker`, keyed on host↔`host_reports`, emitting `host_stale`/`host_down`/
`host_recovered` (30m / 60m), attributed to the host's customer (so the existing per-customer
notification UX picks them up). Registered in `allowedEventTypes`; wired in `main.go` on the
existing 60s ticker. The controller staleness/deadline checkers are untouched and keep running.
### Contract
The `/host-report` JSON matches the agent spec §4 field-for-field (host_id, reported_at,
agent_version, host{…}, guests[{vmid,name,status,controller_version,spec}], cloudflared{status},
and the currently empty storage_targets/backups/restore_tests/pbs_snapshots/audit_tail fields, which
are accepted when empty or absent). The envelope matches agent spec §5.
### Test matrix (new, hermetic — temp SQLite, no live data)
- **store**: upsert/lookup; a report-path update **preserves** `desired_json`/`desired_generation`;
guest upsert **preserves** `api_key`/`desired_spec_json` while updating reality; `GuestID`;
staleness skips never-reported.
- **auth**: `checkAuthHost` global / per-host / unknown.
- **ingest**: valid → 200 + envelope + denorm (`guest_running` = 1 of 2); host_id mismatch → 403;
unknown host under global key → 400; blocked customer → `blocked:true`; oversize body → 400.
- **admin mint**: non-global → 403; unknown customer → 400; success → 201 + minted key
round-trips through `/host-report`.
- **host staleness**: the first (seeding) check emits no events; ok→stale→down→recovered transitions.
### Untouched / deferred (explicit)
- **Controller path unchanged**: `/api/v1/report`, `reports`, `customer_configs`,
`checkAuthCustomer`, existing staleness + deadline checkers — additions only, all still green.
- **Not built** (per scope): desired-state serving, `signed_ops`, geo→hub, DR-record migration,
dashboard redesign. The cutover (drop `reports` → `guest_reports`, merge checkers, tighten the
provisional admin/global-key auth) remains **slice 10**.
### Versioning / deploy
Hub version is the `main.Version` ldflags var (`build.sh <VER>`), default `"dev"`; recorded
**v0.7.0** in `hub/CHANGELOG.md`. The image build + ArgoCD deploy are **not** part of this task
(no deploy performed).
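
A minimal sketch of the ldflags mechanism. The exact flags `build.sh` passes are an assumption, but `-X main.Version=<VER>` is the standard Go linker way to set a string variable in package `main` at build time.

```go
package main

import "fmt"

// Version defaults to "dev" and is meant to be overridden at link time.
// Assumed invocation (build.sh may wrap it differently):
//
//	go build -ldflags "-X main.Version=v0.7.1" .
var Version = "dev"

func main() {
	fmt.Println("hub version:", Version)
}
```

Without the flag the binary reports `dev`, which is why the recorded version lives in `hub/CHANGELOG.md` rather than in a constant.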
### Repo state
Branch: `main`. Verified `go build/vet/test ./...` green in `hub/` locally (go1.26) and on the
build server (go1.26).

---
## Hub slice-3 follow-ups (v0.7.1) — 2026-06-08
Validation follow-ups (hub half). Pushed to `main`; build/vet/test green locally (go1.26) and on
the build server.
### §3 — `/host-report` rejects oversize with 413 (not silent truncation)
`handleHostReport` now reads `maxHostReportBytes+1` (const `4 << 20`, defined near
`defaultHostPollSeconds`) and returns **`413 Payload too large`** when exceeded, instead of relying
on `LimitReader` truncation (which could accept a truncated-but-valid JSON as a partial report,
dropping guests from the mirror). **Scope-frozen:** the controller `handleReport` 1 MiB read is
**unchanged** (diff touches only the host path); the small divergence is acceptable until cutover.
`TestHandleHostReport_OversizeRejected` now asserts 413.
### §4 — cross-repo contract golden fixture (hub half)
- `hub/internal/api/testdata/host-report.golden.json` — a **byte-identical copy** of felhom-agent's
golden (verified by md5).
- `TestHostReport_GoldenContract` — mints a host, POSTs the golden through the **real**
`handleHostReport`, asserts 200 + denorm (`guest_total=2`, `guest_running=1`,
`cloudflared_status="active"`) + both guests upserted. Proves `hostReportPayload` still extracts
the contract from the real wire shape.

**Caveat (called out):** the two golden files are a *duplicated* contract with no shared source of
truth. JSON can't hold a comment, so the mandatory "keep byte-identical" marker lives in each test
file's doc comment. When slices 5/6 add real `storage_targets`/`backups` fields, promote this to a
shared Go types module (the proper fix); this fixture is the bridge.
### Versioning / scope
Recorded **v0.7.1** in `hub/CHANGELOG.md`. The hub version is the `main.Version` ldflags var
(`build.sh <VER>`, default `"dev"`); there is no in-repo version constant to bump (the task pointed
to `web/version.go`, but that is the controller image's `VersionChecker` and unrelated). The image
tag is applied at build/deploy time (ArgoCD), not in this task. No deploy performed.
### Untouched (confirmed)
Controller path (`handleReport`/`reports`/`customer_configs`/`checkAuthCustomer`/existing checkers)
unchanged. The agent's Proxmox client timeout was a "confirm" item: it is already bounded (30s
default), so no change was made.
### Repo state
Branch: `main`. Verified `go build/vet/test ./...` green in `hub/` locally (go1.26) and on the build server (go1.26).