# REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)

Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session:
a freshly provisioned guest used to seed a "configured" `controller.yaml` from the agent's **host** hub
key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report
ONLINE. Now, on first boot, the controller **pulls** its full `controller.yaml` from the hub (using the
bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the
per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).

## What changed (`internal/bootstrap`, `cmd/controller/main.go`)

- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub`
  drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode.
- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`**: `pull` is injected (decision (b): keeps
  `bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow:
  idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry
  (1 initial attempt + 3 backoff retries, transient `ErrPullTransient` only; auth/not-found fail fast)
  → **merge** `local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field)
  → write 0600 atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient,
  passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).

## Cross-repo contract checksum-diff (rendered bootstrap.json field set)

The agent's v2 renderer output was ingested by the controller's `json.Unmarshal`, with **every field
populated**, exact match:

| level | fields (agent emits == controller ingests) |
|---|---|
| top | `schema, customer, hub, local_api` |
| customer | `id` |
| hub | `url, retrieval_password` |
| local_api | `endpoint, fingerprint, token` |

(Automated round-trip via a throwaway test in each package; removed after verifying.)

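For reference, the field set in the table maps onto Go types roughly as below. This is a sketch under stated assumptions: `BootstrapCustomer` and `BootstrapHub` are named in this report, while `Bootstrap`, `BootstrapLocalAPI`, `parseBootstrap`, the strict-decoding choice (`DisallowUnknownFields`), and which fields count as required are hypothetical and may differ from the real package, which uses `json.Unmarshal`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

const schemaV2 = "felhom.bootstrap/v2"

type BootstrapCustomer struct {
	ID string `json:"id"`
}

type BootstrapHub struct {
	URL               string `json:"url"`
	RetrievalPassword string `json:"retrieval_password"`
}

// BootstrapLocalAPI and Bootstrap are hypothetical names for this sketch.
type BootstrapLocalAPI struct {
	Endpoint    string `json:"endpoint"`
	Fingerprint string `json:"fingerprint"`
	Token       string `json:"token"`
}

type Bootstrap struct {
	Schema   string            `json:"schema"`
	Customer BootstrapCustomer `json:"customer"`
	Hub      BootstrapHub      `json:"hub"`
	LocalAPI BootstrapLocalAPI `json:"local_api"`
}

// parseBootstrap decodes and validates a v2 document. Strict decoding is an
// assumption: it turns a leftover v1 field such as hub.api_key into an error.
func parseBootstrap(raw []byte) (*Bootstrap, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var b Bootstrap
	if err := dec.Decode(&b); err != nil {
		return nil, fmt.Errorf("bootstrap: decode: %w", err)
	}
	if b.Schema != schemaV2 {
		return nil, fmt.Errorf("bootstrap: unsupported schema %q", b.Schema)
	}
	// The required-field set below is assumed for illustration.
	if b.Customer.ID == "" || b.Hub.URL == "" || b.Hub.RetrievalPassword == "" {
		return nil, errors.New("bootstrap: missing required field")
	}
	return &b, nil
}

func main() {
	// Placeholder values only; no real endpoints or secrets.
	raw := []byte(`{
	  "schema": "felhom.bootstrap/v2",
	  "customer": {"id": "demo"},
	  "hub": {"url": "https://hub.example", "retrieval_password": "placeholder"},
	  "local_api": {"endpoint": "192.0.2.1:8443", "fingerprint": "aa", "token": "tt"}
	}`)
	b, err := parseBootstrap(raw)
	fmt.Println(b != nil, err)
}
```

In this sketch any error from `parseBootstrap` would correspond to the "setup mode, no pull" outcome described in the tests.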
## Tests — non-hollow (`internal/bootstrap`), all green

- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`,
  and an unmodeled `assets.source_url`. Asserts the written `controller.yaml` carries **the customer key
  + identity + the preserved unmodeled assets field** AND the bootstrap's
  `local_api.{endpoint,fingerprint,token}`, and contains **no host key/id**.
- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched.
- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls,
  then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call.
- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull.

`go build ./... && go test ./...` green.

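The property the pull+merge test checks (unmodeled hub fields survive, `local_api` comes from the bootstrap) follows from merging at the map level rather than through typed structs. A minimal sketch, assuming the pulled YAML has already been decoded into a `map[string]any` by a YAML library; `mergeLocalAPI` is a hypothetical helper name, not the real function:

```go
package main

import "fmt"

// mergeLocalAPI (hypothetical) returns a copy of the pulled config with the
// top-level "local_api" key set from the bootstrap. Every other key is
// carried over untouched, including ones the controller has no struct for.
// The copy is shallow: nested maps are shared with the input.
func mergeLocalAPI(pulled, localAPI map[string]any) map[string]any {
	out := make(map[string]any, len(pulled)+1)
	for k, v := range pulled {
		out[k] = v
	}
	out["local_api"] = localAPI
	return out
}

func main() {
	// Stand-in for the decoded hub response; values are placeholders.
	pulled := map[string]any{
		"hub":      map[string]any{"api_key": "CUSTKEY_FROM_HUB", "enabled": true},
		"customer": map[string]any{"id": "demo", "domain": "demo.example"},
		"assets":   map[string]any{"source_url": "https://assets.example"},
	}
	localAPI := map[string]any{
		"endpoint":    "192.0.2.1:8443",
		"fingerprint": "aa",
		"token":       "tt",
	}
	merged := mergeLocalAPI(pulled, localAPI)
	fmt.Println(merged)
}
```

Round-tripping through typed structs instead would silently drop any field the struct does not model, which is the failure mode decision (c) avoids.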
## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`)

Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed
`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent
v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password <passphrase>` (passphrase read
from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**),
then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).

- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer`
  keys = `[id]` only, 0600. ✓
- **Pull+merge worked:** the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted)
  carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the
  hub's `customer_configs` row), `hub.enabled: true`,
  `customer.{id: demo-felhom, domain: demo-felhom.eu, name, email}`, `assets.source_url`, `git`
  (catalog repo), `infrastructure.cf_*` (Cloudflare config); and **merged from the bootstrap**:
  `local_api.{endpoint: 192.168.0.162:8443, fingerprint: 60b5974d…, token}`.
  **No `host_id`, no agent host key.** ✓
- **Hub ONLINE at v0.40.0:** `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub
  report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`,
  `received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest, as expected). ✓
- **`local_api` survived the merge:** `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real),
  4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓
- **8C invariant intact:** agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403**
  `{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe,
  durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched
  (`/dev/sdb1 ext4 8G`, still mounted). ✓

## What broke / what's missing

- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's
  seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface
  in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer
  key present, hub report OK, catalog repo configured). The first captured line is a later async
  local-api WARN, which suggests the early synchronous bootstrap log is being swallowed before docker
  attaches. Worth a follow-up (flush/sequence the logger before `MaybeIngest`, or log the pull result
  post-startup).
- **Finding #1 still open (separate spec):** the local-API channel 401s until
  `systemctl restart felhom-agent` after provisioning a live-daemon host (the running daemon didn't
  reload the freshly minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a
  captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with
  `tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported
  byte-exact (base64), not through the Windows shell.
- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is
  correct.

## Versions / artifacts

- Controller **v0.40.0** (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code); this REPORT
  lands in the follow-up commit.
- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden:
  `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
- No secrets committed: passphrase, customer key, CF tokens, and local-api token are all
  out-of-band/redacted.