docs(v0.41.0): README base-infra bring-up section + REPORT (live-validated)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 15:17:09 +02:00
parent abbd9488c6
commit f1780100ee
2 changed files with 62 additions and 87 deletions
+47 -86
@@ -1,95 +1,56 @@
-# REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)
-Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session:
-a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's **host** hub
-key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report
-ONLINE. Now, on first boot, the controller **pulls** its full controller.yaml from the hub (using the
-bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the
-per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).
+# REPORT — v0.41.0: first-boot base-infrastructure bring-up + self-heal (+ Section-G mount fix)
+**Repo:** `felhom-controller` · **Version:** 0.41.0 · **Date:** 2026-06-11
+**Pushed commit:** `abbd948` (controller) · paired with `felhom-agent` v0.20.0 (`1799fcd`) + golden rebake.

-## What changed (`internal/bootstrap`, `cmd/controller/main.go`)
-- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub`
-drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode.
-- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`** — `pull` injected (decision (b): keeps
-`bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow:
-idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry
-(1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → **merge**
-`local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field) → write 0600
-atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
-- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient,
-passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).
+## What shipped

-## Cross-repo contract checksum-diff (rendered bootstrap.json field set)
-The agent's v2 renderer output was ingested by the controller's `json.Unmarshal` — **every field
-populated**, exact match:
-
-| level | fields (agent emits == controller ingests) |
-|---|---|
-| top | `schema, customer, hub, local_api` |
-| customer | `id` |
-| hub | `url, retrieval_password` |
-| local_api | `endpoint, fingerprint, token` |
+A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not
+running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox
+bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop
+only *detected* the gap. The controller now stands up its own base infrastructure.
+- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from
+`scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS
+token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`).
+**Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`,
+`gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge).
+- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public`
+network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. **Single-flight**
+(`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks;
+never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes).
+- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls
+`EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight +
+idempotency).
+- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so
+a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
+- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks`
+host bind — without it the guest daemon resolved every relative bind source on the guest filesystem
+(empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed
+`cat: read error: Is a directory` before, `hello-from-controller` after).

-(Automated round-trip via a throwaway test in each package; removed after verifying.)
+## Tests (non-hollow)

-## Tests — non-hollow (`internal/bootstrap`), all green
-- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`,
-and an unmodeled `assets.source_url`. Asserts the written controller.yaml carries **the customer key
-+ identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint,
-fingerprint,token}`, and contains **no host key/id**.
-- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched.
-- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls,
-then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
-- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call.
-- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull.
+`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**,
+both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**.
+`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held.
+`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one.

-`go build ./... && go test ./...` green.
+## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)

-## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`)
-Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed
-`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent
-v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password <passphrase>` (passphrase read
-from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**),
-then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).
+- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`,
+`cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
+- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
+- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered
+config (the bind resolves on the shared host path).
+- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu`
+(cf_api_token + email present); `.env` and `acme.json` are 0600.
+- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC).
+- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID);
+CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
+- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact).
+- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).

-- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer`
-keys = `[id]` only, 0600. ✓
-- **Pull+merge worked** — the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted)
-carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the
-hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain:
-demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*`
-(Cloudflare config); and **merged from the bootstrap**: `local_api.{endpoint: 192.168.0.162:8443,
-fingerprint: 60b5974d…, token}`. **No `host_id`, no agent host key.**
-- **Hub ONLINE at v0.40.0** — `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub
-report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`,
-`received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest — expected). ✓
-- **`local_api` survived the merge** — `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real),
-4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓
-- **8C invariant intact** — agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403**
-`{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe,
-durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched
-(`/dev/sdb1 ext4 8G`, still mounted). ✓
-## What broke / what's missing
-- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's
-seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface
-in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer
-key present, hub report OK, catalog repo configured). The first captured line is a later async
-local-api WARN — the early synchronous bootstrap log is being swallowed before docker attaches.
-Worth a follow-up (flush/sequence the logger before MaybeIngest, or log the pull result post-startup).
-- **Finding #1 still open (separate spec):** the local-API channel 401s until `systemctl restart
-felhom-agent` after provisioning a live-daemon host (the running daemon didn't reload the freshly
-minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
-- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a
-captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with
-`tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported
-byte-exact (base64), not through the Windows shell.
-- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is
-correct.
-## Versions / artifacts
-- Controller **v0.40.0** (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code) — this REPORT
-in the follow-up commit.
-- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden:
-`local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
-- No secrets committed (passphrase, customer key, CF tokens, local-api token — all out-of-band/redacted).
+## Notes / follow-ups
+- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd
+basic-auth hash. Routing for filebrowser/controller works without it.
+15 -1
@@ -92,7 +92,8 @@ A single, lightweight Go container that replaces Portainer + scattered systemd s
|--------|------|----------------|
| **Config** | `internal/config/` | YAML loader, validation, `FELHOM_*` env overrides |
| **Settings** | `internal/settings/` | Runtime-mutable `settings.json` (passwords, backup prefs, storage paths, notifications) |
| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow; **base-infra bring-up** (`EnsureBaseStack` in `infra.go`) |
| **Infra** | `internal/infra/` | Pure renderers (embedded `text/template`) for the base-infra stacks (traefik/cloudflared/filebrowser); **pinned image tags as the single source of truth** (web filebrowser sync delegates here) |
| **Crypto** | `internal/crypto/` | AES-256-GCM encryption for sensitive app.yaml values (passwords, secrets), key management |
| **Sync** | `internal/sync/` | Git-based app catalog sync (clone/pull, content-hash copy) |
| **AppBackup** | `internal/appbackup/` | Self-contained app-data backup primitives: DB dump discovery/execution (`DiscoverDatabases`, `DumpOne`), Docker-volume/app-data discovery (`StackDataProvider`, `DiscoverAppData`), keep-side path helpers (`AppDBDumpPath`, `AppVolumeDumpPath`, `AppDataDir`). No dependency on restic/cross-drive/drive-mount. Imported directly by `appexport` and `storage`. |
@@ -195,6 +196,19 @@ The `/apps/{slug}` page renders hero section, screenshots, setup guide, and opti
**Orphan detection**: Deployed stacks with no matching catalog template are marked as orphaned with an "Elavult" badge and can be safely deleted.
#### Base-infrastructure bring-up (`stacks/infra.go` + `internal/infra/`, v0.41.0)
The controller stands up its own base stack — **traefik** (reverse proxy), **cloudflared** (external tunnel), **filebrowser** — instead of relying on the bare-metal `scripts/docker-setup.sh` (which a Proxmox-provisioned guest never runs). `internal/infra` renders the compose + config files from `controller.yaml` via embedded `text/template`s (lifted from `docker-setup.sh`); image tags are **pinned constants there** (`TraefikImage`/`CloudflaredImage`/`FileBrowserImage`) and the web FileBrowser sync path delegates to the same renderers, so the pinned versions can never diverge.
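
As a rough illustration of the "pinned constant feeds a pure renderer" idea, the sketch below renders a compose file from a template. Only the constant name `TraefikImage` and its tag come from this document; the template body, `composeParams`, `RenderTraefikCompose`, and `hasLatestTag` are hypothetical, and an inline string stands in for the `//go:embed` file so the example stays self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// TraefikImage is the pinned tag named in the text above.
const TraefikImage = "traefik:v3.6.7"

// composeTmpl is a hypothetical stand-in for an embedded template file.
// The real package embeds its templates; this body is illustrative only.
const composeTmpl = `services:
  traefik:
    image: {{ .Image }}
    container_name: traefik
    networks:
      - traefik-public
    volumes:
      - ./traefik.yml:/etc/traefik/traefik.yml:ro
networks:
  traefik-public:
    external: true
`

// composeParams holds the values substituted into the template (assumed shape).
type composeParams struct {
	Image string
}

// RenderTraefikCompose is a pure function: no I/O, just template in, text out.
func RenderTraefikCompose() (string, error) {
	t, err := template.New("traefik-compose").Parse(composeTmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, composeParams{Image: TraefikImage}); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return buf.String(), nil
}

// hasLatestTag reports whether any unpinned ":latest" reference survived rendering.
func hasLatestTag(rendered string) bool {
	return strings.Contains(rendered, ":latest")
}

func main() {
	out, err := RenderTraefikCompose()
	if err != nil {
		panic(err)
	}
	if hasLatestTag(out) {
		panic("unpinned image tag in rendered compose")
	}
	fmt.Print(out)
}
```

Because every caller goes through the same renderer and the same constant, there is only one place where a tag can change.
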
`Manager.EnsureBaseStack()` creates the `traefik-public` network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. It is:
- **single-flight** (a `TryLock` guard — it's called from both first boot and every health tick, so overlapping runs must not race on the same stack dir),
- **idempotent** (skips a stack whose container is already running; never overwrites an existing filebrowser compose, preserving the storage mounts `SyncFileBrowserMounts` manages),
- **non-fatal** (logs, never crashes the controller).
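
The three properties above can be sketched as follows. This is a minimal illustration, not the controller's code: the `Manager` fields, the `bool` return value, and the `deploy` callback are assumptions made so the example runs on its own. `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a hypothetical stand-in for stacks.Manager.
type Manager struct {
	baseMu  sync.Mutex              // single-flight guard
	running map[string]bool         // stand-in for "is this stack's container running?"
	deploy  func(name string) error // stand-in for render + compose up
}

// baseStacks lists the base stacks in the deployment order given in the text.
var baseStacks = []string{"traefik", "cloudflared", "filebrowser"}

// EnsureBaseStack reports whether this call actually ran (false means another
// run already held the lock). The return value is an assumption of this sketch.
func (m *Manager) EnsureBaseStack() bool {
	// Single-flight: if a run is already in progress, return instead of queueing.
	if !m.baseMu.TryLock() {
		return false
	}
	defer m.baseMu.Unlock()

	for _, name := range baseStacks {
		// Idempotent: a stack that is already running is left alone.
		if m.running[name] {
			continue
		}
		// Non-fatal: log the failure and move on to the next stack.
		if err := m.deploy(name); err != nil {
			fmt.Printf("base stack %s: deploy failed: %v\n", name, err)
			continue
		}
		m.running[name] = true
	}
	return true
}

func main() {
	var deployed []string
	m := &Manager{
		running: map[string]bool{"traefik": true},
		deploy: func(name string) error {
			deployed = append(deployed, name)
			return nil
		},
	}
	fmt.Println(m.EnsureBaseStack(), deployed)

	// Simulate an overlapping run by holding the lock.
	m.baseMu.Lock()
	fmt.Println(m.EnsureBaseStack())
	m.baseMu.Unlock()
}
```

`TryLock` rather than `Lock` is what makes the health tick cheap: an overlapping call returns immediately instead of waiting for a deploy it would then repeat.
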
cloudflared is only deployed when a tunnel token is configured. **Triggers**: a first-boot goroutine (after stack init) and an unconditional call on every `system-health` tick (self-heal — cheap when healthy thanks to the idempotency). `monitor.EffectiveProtected` mirrors the cloudflared condition so a LAN-only node (no tunnel token) doesn't report a perpetual "protected container not running" FAIL.
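
The cloudflared condition might look roughly like the filter below. The function name comes from this document, but the signature (a protected-name slice plus the tunnel token) is an assumption for illustration; the real `monitor.EffectiveProtected` may take different arguments.

```go
package main

import "fmt"

// EffectiveProtected returns the protected container names that should
// actually be enforced. Hypothetical signature: cloudflared is dropped when
// no tunnel token is configured, since the stack is intentionally not deployed.
func EffectiveProtected(protected []string, tunnelToken string) []string {
	out := make([]string, 0, len(protected))
	for _, name := range protected {
		if name == "cloudflared" && tunnelToken == "" {
			continue // LAN-only node: nothing to protect
		}
		out = append(out, name)
	}
	return out
}

func main() {
	base := []string{"traefik", "cloudflared", "filebrowser"}
	fmt.Println(EffectiveProtected(base, ""))              // LAN-only node
	fmt.Println(EffectiveProtected(base, "example-token")) // tunnel configured
}
```

Keeping the deploy condition and the health condition in step is the point: a stack that is deliberately skipped should never be counted as missing.
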
> **Mount prerequisite (Section-G):** the controller writes these stacks under `/opt/docker/stacks` *inside its container*, but `docker compose up` runs on the **guest** Docker daemon. The golden's controller-bootstrap (`felhom-agent` `build-golden.sh`) therefore bind-mounts that path **same-path** (`-v /opt/docker/stacks:/opt/docker/stacks`), so the daemon resolves every relative bind source to the files the controller actually wrote. Without the mount, the daemon resolves those sources on the guest filesystem, where they are empty directories, and all bind-mounted stacks (base infra and customer apps) silently break.
#### Missing Field Injection (`deploy.go`)
When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felhom.yml`), existing deployed apps need the new field in their `app.yaml`. The controller handles this automatically: