docs(v0.41.0): README base-infra bring-up section + REPORT (live-validated)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 15:17:09 +02:00
parent abbd9488c6
commit f1780100ee
2 changed files with 62 additions and 87 deletions
+47 -86
@@ -1,95 +1,56 @@
# REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)
# REPORT — v0.41.0: first-boot base-infrastructure bring-up + self-heal (+ Section-G mount fix)
Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session:
a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's **host** hub
key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report
ONLINE. Now, on first boot, the controller **pulls** its full controller.yaml from the hub (using the
bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the
per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).
**Repo:** `felhom-controller` · **Version:** 0.41.0 · **Date:** 2026-06-11
**Pushed commit:** `abbd948` (controller) · paired with `felhom-agent` v0.20.0 (`1799fcd`) + golden rebake.
## What changed (`internal/bootstrap`, `cmd/controller/main.go`)
- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub`
drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode.
- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`** — `pull` injected (decision (b): keeps
`bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow:
idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry
(1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → **merge**
`local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field) → write 0600
atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient,
passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).
## What shipped
## Cross-repo contract checksum-diff (rendered bootstrap.json field set)
The agent's v2 renderer output was ingested by the controller's `json.Unmarshal` — **every field
populated**, exact match:
A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not
running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox
bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop
only *detected* the gap. The controller now stands up its own base infrastructure.
| level | fields (agent emits == controller ingests) |
|---|---|
| top | `schema, customer, hub, local_api` |
| customer | `id` |
| hub | `url, retrieval_password` |
| local_api | `endpoint, fingerprint, token` |
- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from
`scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS
token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`).
**Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`,
`gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge).
- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public`
network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. **Single-flight**
(`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks;
never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes).
- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls
`EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight +
idempotency).
- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so
a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks`
host bind — without it the guest daemon resolved every relative bind source on the guest filesystem
(empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed
`cat: read error: Is a directory` before, `hello-from-controller` after).
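As a rough illustration of the single-flight, idempotent and non-fatal properties listed for `EnsureBaseStack`, the sketch below uses `sync.Mutex.TryLock` (Go 1.18+). The `baseStack` type, its fields and the `deploy` callback are hypothetical stand-ins for `stacks.Manager`; the real method also creates the `traefik-public` network and renders files, and continuing to the next stack after a failed deploy is this sketch's choice, not something the report states.

```go
package main

import (
	"fmt"
	"sync"
)

// baseStack is a hypothetical stand-in for stacks.Manager.
type baseStack struct {
	mu      sync.Mutex
	running map[string]bool         // stack name -> already running
	deploy  func(name string) error // assumed deploy hook
}

// ensure reports false when another run already holds the lock, so the
// first-boot goroutine and the health tick cannot overlap. Running stacks
// are skipped, and deploy errors are logged instead of propagated.
func (b *baseStack) ensure() bool {
	if !b.mu.TryLock() {
		return false
	}
	defer b.mu.Unlock()
	for _, name := range []string{"traefik", "cloudflared", "filebrowser"} {
		if b.running[name] {
			continue // idempotent: nothing to do for a running stack
		}
		if err := b.deploy(name); err != nil {
			fmt.Println("deploy failed (non-fatal):", name, err)
			continue
		}
		b.running[name] = true
	}
	return true
}

func main() {
	var deployed []string
	b := &baseStack{
		running: map[string]bool{"traefik": true},
		deploy: func(name string) error {
			deployed = append(deployed, name)
			return nil
		},
	}

	b.mu.Lock() // simulate a run already in flight
	ran := b.ensure()
	b.mu.Unlock()
	fmt.Println("while locked, ran:", ran)

	ran = b.ensure()
	fmt.Println("after unlock, ran:", ran)
	fmt.Println("deployed:", deployed)
}
```

Because a second caller returns immediately instead of queueing, calling this on every health tick stays cheap when the stack is healthy.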
(Automated round-trip via a throwaway test in each package; removed after verifying.)
## Tests (non-hollow)
## Tests — non-hollow (`internal/bootstrap`), all green
- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`,
and an unmodeled `assets.source_url`. Asserts the written controller.yaml carries **the customer key
+ identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint,
fingerprint,token}`, and contains **no host key/id**.
- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched.
- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls,
then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call.
- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull.
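The property the pull+merge test checks (unmodeled hub fields survive, `local_api` comes from the bootstrap) follows from merging at the map level rather than through a typed struct. A minimal sketch, assuming the hub YAML has already been decoded into a `map[string]any`; the `mergeLocalAPI` helper and the sample values are illustrative only, and YAML decoding and the atomic 0600 write are left out:

```go
package main

import "fmt"

// mergeLocalAPI returns a shallow copy of the hub document with the
// local_api key set from the bootstrap. Every other top-level key is
// carried over untouched, including ones no Go struct models.
func mergeLocalAPI(hubDoc, localAPI map[string]any) map[string]any {
	out := make(map[string]any, len(hubDoc)+1)
	for k, v := range hubDoc {
		out[k] = v
	}
	out["local_api"] = localAPI
	return out
}

func main() {
	// Placeholder values, not real configuration.
	hubDoc := map[string]any{
		"hub":    map[string]any{"api_key": "CUSTKEY_FROM_HUB"},
		"assets": map[string]any{"source_url": "https://example.invalid/assets"},
	}
	localAPI := map[string]any{
		"endpoint":    "192.0.2.10:8443",
		"fingerprint": "placeholder",
		"token":       "placeholder",
	}

	merged := mergeLocalAPI(hubDoc, localAPI)
	_, hasAssets := merged["assets"]
	_, hasLocal := merged["local_api"]
	fmt.Println("assets preserved:", hasAssets)
	fmt.Println("local_api merged:", hasLocal)
}
```

A struct round-trip would silently drop any field the struct does not declare, which is the failure mode the map-level merge avoids.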
`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**,
both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**.
`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held.
`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one.
`go build ./... && go test ./...` green.
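The `EffectiveProtected` rule tested above can be pictured as a simple filter. The signature below is an assumption (the real function in `internal/monitor` may take a config struct instead), and treating a whitespace-only token as unset is this sketch's choice:

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveProtected returns the protected container names that should
// actually be running. cloudflared is dropped when no tunnel token is
// configured, so a LAN-only node is not reported as failing.
func effectiveProtected(base []string, tunnelToken string) []string {
	out := make([]string, 0, len(base))
	for _, name := range base {
		if name == "cloudflared" && strings.TrimSpace(tunnelToken) == "" {
			continue
		}
		out = append(out, name)
	}
	return out
}

func main() {
	base := []string{"traefik", "cloudflared", "filebrowser"}
	fmt.Println(effectiveProtected(base, ""))
	fmt.Println(effectiveProtected(base, "some-token"))
}
```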
## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)
## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`)
Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed
`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent
v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password <passphrase>` (passphrase read
from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**),
then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).
- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`,
`cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered
config (the bind resolves on the shared host path).
- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu`
(cf_api_token + email present); `.env` and `acme.json` are 0600.
- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC).
- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID);
CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact).
- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).
- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer`
keys = `[id]` only, 0600. ✓
- **Pull+merge worked** — the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted)
carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the
hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain:
demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*`
(Cloudflare config); and **merged from the bootstrap**: `local_api.{endpoint: 192.168.0.162:8443,
fingerprint: 60b5974d…, token}`. **No `host_id`, no agent host key.**
- **Hub ONLINE at v0.40.0** — `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub
report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`,
`received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest — expected). ✓
- **`local_api` survived the merge** — `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real),
4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓
- **8C invariant intact** — agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403**
`{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe,
durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched
(`/dev/sdb1 ext4 8G`, still mounted). ✓
## What broke / what's missing
- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's
seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface
in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer
key present, hub report OK, catalog repo configured). The first captured line is a later async
local-api WARN — the early synchronous bootstrap log is being swallowed before docker attaches.
Worth a follow-up (flush/sequence the logger before MaybeIngest, or log the pull result post-startup).
- **Finding #1 still open (separate spec):** the local-API channel 401s until `systemctl restart
felhom-agent` after provisioning a live-daemon host (the running daemon didn't reload the freshly
minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a
captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with
`tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported
byte-exact (base64), not through the Windows shell.
- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is
correct.
## Versions / artifacts
- Controller **v0.40.0** (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code) — this REPORT
in the follow-up commit.
- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden:
`local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
- No secrets committed (passphrase, customer key, CF tokens, local-api token — all out-of-band/redacted).
## Notes / follow-ups
- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd
basic-auth hash. Routing for filebrowser/controller works without it.