REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)

Lockstep two-repo change with felhom-agent v0.19.0. Fixes the onboarding 401 found last session: a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's host hub key, which the hub's customer-scoped /api/v1/report rejects → the controller could never report ONLINE. Now, on first boot, the controller pulls its full controller.yaml from the hub (using the bootstrap's retrieval passphrase, which yields the customer-scoped key) and merges in the per-guest local_api block. Validated live end-to-end on the demo (guest 9201).

What changed (internal/bootstrap, cmd/controller/main.go)

  • Contract v1 → v2 (felhom.bootstrap/v2): BootstrapCustomer keeps only id; BootstrapHub drops api_key/host_id, adds retrieval_password; local_api unchanged. Non-v2 → setup mode.
  • MaybeIngest(configPath, cfg, logger, pull PullFunc): pull is injected (decision (b): keeps bootstrap free of the heavy internal/report package; main.go wires report.PullConfig). Flow: idempotent (configured → return, no pull) → parse+validate v2 → pull with bounded retry (1 + 3 backoff attempts, transient ErrPullTransient only; auth/not-found fail fast) → merge local_api at the YAML-map level (decision (c): preserves every hub-emitted field) → write 0600 atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
  • New sentinel ErrPullTransient; main.go's adapter maps report.ErrHubUnreachable → transient, passes auth/not-found through as permanent. Removed configFromBootstrap (the host-key path).

Cross-repo contract checksum-diff (rendered bootstrap.json field set)

The agent's v2 renderer output was ingested by the controller's json.Unmarshal with every field populated, an exact match:

| level | fields (agent emits == controller ingests) |
|---|---|
| top | schema, customer, hub, local_api |
| customer | id |
| hub | url, retrieval_password |
| local_api | endpoint, fingerprint, token |

(Automated round-trip via a throwaway test in each package; removed after verifying.)

Tests — non-hollow (internal/bootstrap), all green

  • Pull+merge: stub pull returns a hub yaml with hub.api_key: CUSTKEY_FROM_HUB, customer.domain, and an unmodeled assets.source_url. Asserts the written controller.yaml carries the customer key + identity + the preserved unmodeled assets field AND the bootstrap's local_api.{endpoint, fingerprint, token}, and contains no host key/id.
  • Idempotency: preset cfg.Customer.ID → asserts pull never invoked, file untouched.
  • Transient retry: stub returns ErrPullTransient always → asserts exactly 1+len(delays) calls, then setup mode, no file (backoff shrunk to ~1ms via the overridable pullRetryDelays).
  • Permanent no-retry: stub returns a plain (auth-style) error → asserts a single call.
  • Schema reject (non-v2), missing-required, malformed/absent → setup mode, no pull.

go build ./... && go test ./... green.

Live validation (demo Proxmox felhom-pve, guest 9201, golden baked :0.40.0)

Golden re-baked: local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst (baked image confirmed gitea.dooplex.hu/admin/felhom-controller:0.40.0). Provisioned fresh as demo-felhom via agent v0.19.0 --selftest=provision -customer-id demo-felhom -hub-password <passphrase> (passphrase read from the hub customer_configs and transported base64 to avoid UTF-8 mangling; stored out-of-band), then pct reboot + systemctl restart felhom-agent (the local-API token workaround, Finding #1).

  • Bootstrap (v2) on the guest: hub keys = [url, retrieval_password] (no host key), customer keys = [id] only, 0600. ✓
  • Pull+merge worked — the merged /opt/docker/felhom-controller/controller.yaml (secrets redacted) carries from the hub pull: hub.api_key: 4b11c0c3… (the customer-scoped key, matches the hub's customer_configs row), hub.enabled: true, customer.{id: demo-felhom, domain: demo-felhom.eu, name, email}, assets.source_url, git (catalog repo), infrastructure.cf_* (Cloudflare config); and merged from the bootstrap: local_api.{endpoint: 192.168.0.162:8443, fingerprint: 60b5974d…, token}. No host_id, no agent host key.
  • Hub ONLINE at v0.40.0: the log shows "[report] Hub report pushed successfully (3090 bytes)" and "Startup hub report sent", with no 401. Hub reports row for demo-felhom: controller_version=0.40.0, received_at=2026-06-11 11:32:00 (fresh → online). 0 deployed apps (expected for a fresh guest). ✓
  • local_api survived the merge: GET /api/host-metrics returned {ok:true}, cpu_temp_c=49 (real), 4 storage targets; GET /api/disks returned {ok:true}, felhom-usb data_bearing:true. ✓
  • 8C invariant intact: agent-direct POST /disks/format on data-bearing /dev/sdb1 returned HTTP 403 {formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe, durable_id:byid:wwn-…, …}} with "operator signature required (pending_signature)". Disk untouched (/dev/sdb1 ext4 8G, still mounted). ✓

What broke / what's missing

  • Bootstrap log line absent in docker logs (observability nit, reproduced from last session's seed-log). MaybeIngest's "[INFO] bootstrap: pulled config … coming up configured" line does not surface in docker logs, even though setupLogger writes to stdout and the pull demonstrably ran (customer key present, hub report OK, catalog repo configured). The first captured line is a later async local-api WARN, so the early synchronous bootstrap log appears to be swallowed before docker attaches. Worth a follow-up (flush/sequence the logger before MaybeIngest, or log the pull result post-startup).
  • Finding #1 still open (separate spec): the local-API channel 401s until systemctl restart felhom-agent after provisioning a live-daemon host (the running daemon didn't reload the freshly minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
  • Operational gotcha (mine, fixed): kubectl cp's "tar: removing leading '/'" warning polluted a captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with tail -1 and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported byte-exact (base64), not through the Windows shell.
  • Minor: guest 9201's hostname is felhom-golden (no -hostname passed); cosmetic, customer.id is correct.

Versions / artifacts

  • Controller v0.40.0 (CHANGELOG updated). Pushed to main: commit 6a594f9 (code) — this REPORT in the follow-up commit.
  • Lockstep agent v0.19.0 (commit e5a1819). New golden: local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst.
  • No secrets committed (passphrase, customer key, CF tokens, local-api token — all out-of-band/redacted).