# REPORT - controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)

Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session: a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's **host** hub key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report ONLINE. Now, on first boot, the controller **pulls** its full controller.yaml from the hub (using the bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).

## What changed (`internal/bootstrap`, `cmd/controller/main.go`)

- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub` drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode.
- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`**: `pull` is injected (decision (b): keeps `bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow: idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry (1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → **merge** `local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field) → write 0600 atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient, passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).
## Cross-repo contract checksum-diff (rendered bootstrap.json field set)

The agent's v2 renderer output was ingested by the controller's `json.Unmarshal`: **every field populated**, exact match:

| level | fields (agent emits == controller ingests) |
|---|---|
| top | `schema, customer, hub, local_api` |
| customer | `id` |
| hub | `url, retrieval_password` |
| local_api | `endpoint, fingerprint, token` |

(Automated round-trip via a throwaway test in each package; removed after verifying.)

## Tests: non-hollow (`internal/bootstrap`), all green

- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`, and an unmodeled `assets.source_url`. Asserts the written controller.yaml carries **the customer key + identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint, fingerprint, token}`, and contains **no host key/id**.
- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched.
- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls, then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call.
- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull.

`go build ./... && go test ./...` green.

## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`)

Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed `gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password ` (passphrase read from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**), then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).

- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer` keys = `[id]` only, 0600. ✓
- **Pull+merge worked:** the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted) carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain: demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*` (Cloudflare config); and **merged from the bootstrap**: `local_api.{endpoint: 192.168.0.162:8443, fingerprint: 60b5974d…, token}`. **No `host_id`, no agent host key.** ✓
- **Hub ONLINE at v0.40.0:** `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`, `received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest, as expected). ✓
- **`local_api` survived the merge:** `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real), 4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓
- **8C invariant intact:** agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403** `{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe, durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched (`/dev/sdb1 ext4 8G`, still mounted). ✓

## What broke / what's missing

- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer key present, hub report OK, catalog repo configured).
  The first captured line is a later async local-api WARN; the early synchronous bootstrap log is being swallowed before docker attaches. Worth a follow-up (flush/sequence the logger before `MaybeIngest`, or log the pull result post-startup).
- **Finding #1 still open (separate spec):** the local-API channel 401s until `systemctl restart felhom-agent` after provisioning a live-daemon host (the running daemon didn't reload the freshly minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with `tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported byte-exact (base64), not through the Windows shell.
- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is correct.

## Versions / artifacts

- Controller **v0.40.0** (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code); this REPORT lands in the follow-up commit.
- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
- No secrets committed (passphrase, customer key, CF tokens, local-api token: all out-of-band/redacted).