Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)
Lockstep two-repo change with felhom-agent v0.19.0. Fixes the onboarding 401 found last session:
a freshly provisioned guest used to seed a "configured" `controller.yaml` from the agent's host hub
key. The hub's customer-scoped `/api/v1/report` endpoint rejects that key, so the controller could never report
ONLINE. Now, on first boot, the controller pulls its full `controller.yaml` from the hub (using the
bootstrap's retrieval passphrase, which yields the customer-scoped key) and merges in the
per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).
What changed (internal/bootstrap, cmd/controller/main.go)
- Contract v1 → v2 (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub` drops `api_key`/`host_id` and adds `retrieval_password`; `local_api` is unchanged. Non-v2 → setup mode. `MaybeIngest(configPath, cfg, logger, pull PullFunc)`: `pull` is injected (decision (b): keeps `bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow: idempotent (configured → return, no pull) → parse+validate v2 → pull with bounded retry (1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → merge `local_api` at the YAML-map level (decision (c): preserves every hub-emitted field) → atomic write with mode 0600 → reload. Fail-safe and never-crash: a hub outage at first boot falls back to setup mode.
- New sentinel `ErrPullTransient`; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient and passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).
Cross-repo contract checksum-diff (rendered bootstrap.json field set)
The agent's v2 renderer output was ingested by the controller's `json.Unmarshal` with every field
populated, an exact match:
| level | fields (agent emits == controller ingests) |
|---|---|
| top | schema, customer, hub, local_api |
| customer | id |
| hub | url, retrieval_password |
| local_api | endpoint, fingerprint, token |
(Automated round-trip via a throwaway test in each package; removed after verifying.)
Tests — non-hollow (internal/bootstrap), all green
- Pull+merge: stub `pull` returns a hub YAML with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`, and an unmodeled `assets.source_url`. Asserts the written `controller.yaml` carries **the customer key + identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint,fingerprint,token}`, and contains no host key/id.
- Idempotency: preset `cfg.Customer.ID` → asserts `pull` is never invoked and the file is untouched.
- Transient retry: stub always returns `ErrPullTransient` → asserts exactly `1+len(delays)` calls, then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
- Permanent no-retry: stub returns a plain (auth-style) error → asserts a single call.
- Schema reject (non-v2), missing-required, malformed/absent → setup mode, no pull.
`go build ./... && go test ./...` green.
Live validation (demo Proxmox felhom-pve, guest 9201, golden baked :0.40.0)
Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed as
`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as demo-felhom via agent
v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password <passphrase>`. The passphrase was read
from the hub's `customer_configs`, transported as base64 to avoid UTF-8 mangling, and stored out-of-band.
Then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).
- Bootstrap (v2) on the guest: `hub` keys = `[url, retrieval_password]` (no host key), `customer` keys = `[id]` only, mode 0600. ✓
- Pull+merge worked: the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted) carries from the hub pull: `hub.api_key: 4b11c0c3…` (the customer-scoped key, matches the hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain: demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*` (Cloudflare config). Merged from the bootstrap: `local_api.{endpoint: 192.168.0.162:8443, fingerprint: 60b5974d…, token}`. No `host_id`, no agent host key. ✓
- Hub ONLINE at v0.40.0: `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub report sent`, no 401. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`, `received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest, expected). ✓
- `local_api` survived the merge: `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real), 4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓
- 8C invariant intact: agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → HTTP 403 `{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe, durable_id:byid:wwn-…, …}}`, "operator signature required (pending_signature)". Disk untouched (`/dev/sdb1` ext4 8G, still mounted). ✓
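A client calling the agent's format endpoint can recognise this refusal by decoding the JSON body. The field names below are copied from the 403 response quoted above. The `formatRefusal` and `pendingOp` type names and the `isSignatureRefusal` helper are hypothetical. Fields elided with `…` in the report are not modeled, and the sample `durable_id` value is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// pendingOp mirrors the pending_op object in the 403 body (known fields only).
type pendingOp struct {
	Op        string `json:"op"`
	DurableID string `json:"durable_id"`
}

// formatRefusal mirrors the visible fields of the POST /disks/format response.
type formatRefusal struct {
	Formatted   bool       `json:"formatted"`
	DataBearing bool       `json:"data_bearing"`
	Reason      string     `json:"reason"`
	PendingOp   *pendingOp `json:"pending_op"`
}

// isSignatureRefusal reports whether the agent refused to format a data-bearing
// disk and parked a storage_wipe operation (awaiting an operator signature).
func isSignatureRefusal(status int, body []byte) (bool, error) {
	if status != http.StatusForbidden {
		return false, nil
	}
	var r formatRefusal
	if err := json.Unmarshal(body, &r); err != nil {
		return false, err
	}
	return !r.Formatted && r.DataBearing && r.PendingOp != nil && r.PendingOp.Op == "storage_wipe", nil
}

func main() {
	// Sample body shaped like the report's response; durable_id is illustrative.
	body := []byte(`{"formatted":false,"data_bearing":true,"reason":"device is mounted",
		"pending_op":{"op":"storage_wipe","durable_id":"byid:wwn-example"}}`)
	ok, err := isSignatureRefusal(http.StatusForbidden, body)
	fmt.Println(ok, err) // true <nil>
}
```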
What broke / what's missing
- Bootstrap log line absent in `docker logs` (observability nit, reproduced from last session's seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface in `docker logs`, even though `setupLogger` writes to stdout and the pull demonstrably ran (customer key present, hub report OK, catalog repo configured). The first captured line is a later async local-api WARN, so the early synchronous bootstrap log appears to be lost before docker attaches. Worth a follow-up: flush/sequence the logger before `MaybeIngest`, or log the pull result post-startup.
- Finding #1 still open (separate spec): the local-API channel returns 401 until `systemctl restart felhom-agent` is run after provisioning on a host whose agent daemon is already running (the running daemon didn't reload the freshly minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
- Operational gotcha (mine, fixed): `kubectl cp`'s "tar: removing leading '/'" warning polluted a captured base64 passphrase on the first attempt, producing a 2-char garbage passphrase. I re-extracted it with `tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported byte-exact (base64), not through the Windows shell.
- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed). This is cosmetic; `customer.id` is correct.
Versions / artifacts
- Controller v0.40.0 (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code); this REPORT is in the follow-up commit.
- Lockstep agent v0.19.0 (commit `e5a1819`). New golden: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
- No secrets committed (passphrase, customer key, CF tokens, local-api token: all out-of-band/redacted).