felhom.eu — task reports

One section per task, appended (newest last) — not overwritten. Cumulative hub history lives in hub/CHANGELOG.md.


Hub slice 3 — host-domain ingest (v0.7.0) — 2026-06-08

Purely additive host-domain ingest in hub/: new tables, the agent's /host-report heartbeat endpoint, per-host Bearer auth, a provisional host mint, and a host-domain dead-man's-switch. The existing controller path is untouched; the schema/auth cutover remains slice 10. Pushed to main; build/vet/test green locally and on the build server.

New tables (store.go migrate(), idempotent — // v0.7.0: host-domain)

  • hosts: one per customer agent. Reality columns (agent_version, last_report_at) + operator-intent columns INERT until slice 10 (desired_json, desired_generation, dr_record_json).
  • guests: one per controller LXC, PK guest_id = "<host_id>/<vmid>" (hub-derived). Reality columns (display_name, status, controller_version, vmid, last_seen_at) + INERT api_key, desired_spec_json.
  • host_reports: the report stream + denormalized columns (cpu/mem/disk %, guest counts, cloudflared status); pruned by Prune(maxDays) alongside reports.

Inert columns exist now so slice 10 needs no ALTER; nothing reads/writes them this slice. Migration is additive-only (no DROP, no edits to reports/customer_configs) and idempotent.

New store methods

GetHostByAPIKey, GetHost, ListHosts, UpsertHost (updates only identity + updated_at on conflict), SaveHostReport (inserts a report row + bumps reality columns only), UpsertGuestFromReport (updates reality columns only; preserves api_key/desired_spec_json), GetHostStaleness (skips never-reported hosts), GuestID. Structs: Host, Guest, HostReportDenorm, HostStaleRow.
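The "reality columns only" rule for guest upserts can be illustrated with a minimal in-memory sketch. The struct fields, the int type for vmid, and the memStore type are assumptions made for illustration; the real store is SQLite-backed and its signatures are not shown in this report.

```go
package main

import "fmt"

// Guest is a trimmed stand-in for a guests row; field names are illustrative.
type Guest struct {
	GuestID         string
	DisplayName     string // reality column
	Status          string // reality column
	APIKey          string // inert until slice 10
	DesiredSpecJSON string // inert until slice 10
}

// GuestID derives the hub-side key "<host_id>/<vmid>".
// Treating vmid as an int is an assumption.
func GuestID(hostID string, vmid int) string {
	return fmt.Sprintf("%s/%d", hostID, vmid)
}

// memStore is a hypothetical in-memory stand-in for the SQLite store.
type memStore map[string]Guest

// UpsertGuestFromReport overwrites reality fields only; the inert fields
// keep whatever value the row already had.
func (s memStore) UpsertGuestFromReport(hostID string, vmid int, name, status string) Guest {
	id := GuestID(hostID, vmid)
	g := s[id] // zero value when the guest is new
	g.GuestID, g.DisplayName, g.Status = id, name, status
	s[id] = g
	return g
}

func main() {
	s := memStore{"acme-1a2b/101": {GuestID: "acme-1a2b/101", APIKey: "keep-me"}}
	g := s.UpsertGuestFromReport("acme-1a2b", 101, "web", "running")
	fmt.Println(g.Status, g.APIKey) // running keep-me
}
```

In SQL terms the same effect would come from an upsert whose conflict clause lists only the reality columns, so the inert columns are never mentioned on the report path.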

Auth (added; existing path unchanged)

checkAuthHost(r) → (hostID, customerID, isGlobal, ok): global key → trust body.host_id; per-host key → bound identity; failure → not-ok. checkAuthCustomer is byte-for-byte unchanged.

Endpoints

  • POST /api/v1/host-report (the heartbeat): per-host auth; 4 MiB body; computes denorm (guest_running counts only status=="running"); SaveHostReport + per-guest UpsertGuestFromReport (a guest upsert failure is logged, not fatal — liveness); returns the control envelope {status:"ok", poll_interval_seconds:900, blocked, desired_generation:0, has_signed_ops:false}. blocked reflects customer_configs.status; the other two are reserved placeholders (slice 4). Global-key bootstrap requires the host to already exist (else 400); per-host key requires body.host_id == hostID (else 403).
  • POST /api/v1/admin/hosts: PROVISIONAL, global-key only. Mints host_id (legible <customer>-<hex>) + a random api_key (configgen.RandomHex(32)); 201 {host_id, api_key}. Flagged in code as the slice-3 bootstrap to be removed/locked at enrollment (slices 7/8).

Host dead-man's-switch

monitor.HostStalenessChecker (host_staleness.go) is a sibling of the controller StalenessChecker, keyed on host↔host_reports. It emits host_stale/host_down/host_recovered (30m / 60m), attributed to the host's customer (so the existing per-customer notification UX picks them up). Registered in allowedEventTypes; wired in main.go on the existing 60s ticker. The controller staleness/deadline checkers are untouched and keep running.

Contract

The /host-report JSON matches the agent spec §4 field-for-field (host_id, reported_at, agent_version, host{…}, guests[{vmid,name,status,controller_version,spec}], cloudflared{status}, and the empty storage_targets/backups/restore_tests/pbs_snapshots/audit_tail — accepted empty/absent). The envelope matches agent spec §5.

Test matrix (new, hermetic — temp SQLite, no live data)

  • store: upsert/lookup; a report-path update preserves desired_json/desired_generation; guest upsert preserves api_key/desired_spec_json while updating reality; GuestID; staleness skips never-reported.
  • auth: checkAuthHost global / per-host / unknown.
  • ingest: valid → 200 + envelope + denorm (guest_running = 1 of 2); host_id mismatch → 403; unknown host under global key → 400; blocked customer → blocked:true; oversize body → 400.
  • admin mint: non-global → 403; unknown customer → 400; success → 201 + minted key round-trips through /host-report.
  • host staleness: seed emits no events; ok→stale→down→recovered transitions.

Untouched / deferred (explicit)

  • Controller path unchanged: /api/v1/report, reports, customer_configs, checkAuthCustomer, existing staleness + deadline checkers — additions only, all still green.
  • Not built (per scope): desired-state serving, signed_ops, geo→hub, DR-record migration, dashboard re-design. The cutover (drop reports → guest_reports, merge checkers, tighten the provisional admin/global-key auth) remains slice 10.

Versioning / deploy

Hub version is the main.Version ldflags var (build.sh <VER>), default "dev"; recorded v0.7.0 in hub/CHANGELOG.md. The image build + ArgoCD deploy are not part of this task (no deploy performed).
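For readers unfamiliar with the pattern, a link-time version variable looks roughly like this. The variable name and default match the description above; how build.sh passes the value is not shown in this report, so the build command in the comment is an assumption.

```go
package main

import "fmt"

// Version is overridden at link time, for example with
//   go build -ldflags "-X main.Version=v0.7.0"
// (assumed invocation). Without the flag it keeps its default.
var Version = "dev"

func main() {
	fmt.Println("hub version:", Version)
}
```

Because the value is injected by the linker, there is no source file to edit when the version changes.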

Repo state

Branch: main. Verified go build/vet/test ./... green in hub/ locally (go1.26) and on the build server (go1.26).


Hub slice-3 follow-ups (v0.7.1) — 2026-06-08

Validation follow-ups (hub half). Pushed to main; build/vet/test green locally (go1.26) and on the build server.

§3 — /host-report rejects oversize with 413 (not silent truncation)

handleHostReport now reads maxHostReportBytes+1 (const 4 << 20, defined near defaultHostPollSeconds) and returns 413 Payload too large when exceeded, instead of relying on LimitReader truncation (which could accept a truncated-but-valid JSON as a partial report, dropping guests from the mirror). Scope-frozen: the controller handleReport 1 MiB read is unchanged (diff touches only the host path); the small divergence is acceptable until cutover. TestHandleHostReport_OversizeRejected now asserts 413.

§4 — cross-repo contract golden fixture (hub half)

  • hub/internal/api/testdata/host-report.golden.json — a byte-identical copy of felhom-agent's golden (verified by md5).
  • TestHostReport_GoldenContract — mints a host, POSTs the golden through the real handleHostReport, asserts 200 + denorm (guest_total=2, guest_running=1, cloudflared_status="active") + both guests upserted. Proves hostReportPayload still extracts the contract from the real wire shape.
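A byte-identity check of the kind used to verify the two copies might look like this. It assumes both files' bytes are available in one place (for example a CI job that checks out both repos, which is hypothetical); the function names are illustrative.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// md5Hex returns the hex digest, handy for a readable drift message.
func md5Hex(b []byte) string {
	sum := md5.Sum(b)
	return hex.EncodeToString(sum[:])
}

// assertByteIdentical fails on any difference, including trailing newlines.
func assertByteIdentical(hubCopy, agentCopy []byte) error {
	if !bytes.Equal(hubCopy, agentCopy) {
		return fmt.Errorf("golden drift: hub md5 %s != agent md5 %s",
			md5Hex(hubCopy), md5Hex(agentCopy))
	}
	return nil
}

func main() {
	hub := []byte(`{"host_id":"example"}` + "\n")
	agent := []byte(`{"host_id":"example"}`)
	fmt.Println(assertByteIdentical(hub, hub))
	fmt.Println(assertByteIdentical(hub, agent))
}
```

MD5 is adequate here because the goal is detecting accidental drift, not resisting tampering.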

Caveat (called out): the two golden files are a duplicated contract with no shared source of truth. JSON can't hold a comment, so the mandatory "keep byte-identical" marker lives in each test file's doc comment. When slices 5/6 add real storage_targets/backups fields, promote this to a shared Go types module (the proper fix); this fixture is the bridge.

Versioning / scope

Recorded v0.7.1 in hub/CHANGELOG.md. The hub version is the main.Version ldflags var (build.sh <VER>, default "dev") — there is no in-repo version constant to bump (the task's pointer to web/version.go is the controller-image VersionChecker, unrelated); the image tag is applied at build/deploy (ArgoCD), not in this task. No deploy performed.

Untouched (confirmed)

Controller path (handleReport/reports/customer_configs/checkAuthCustomer/existing checkers) unchanged. The agent's proxmox client timeout was a "confirm" item — already bounded (30s default), no change.

Repo state

Branch: main. Verified go build/vet/test ./... green in hub/ locally (go1.26) and on the build server (go1.26).