Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
REPORT — v0.41.0–0.41.2: base-infra bring-up + controller routing (+ Section-G mount fix)
Repo: felhom-controller · Version: 0.41.2 · Date: 2026-06-11
Pushed commits: abbd948 (0.41.0) → 91736eb (0.41.1) → 2bed7ce (0.41.2) · paired with
felhom-agent v0.20.0 (79ba2f1) + golden rebake (controller 0.41.2).
Follow-up after live diagnosis (v0.41.1 + v0.41.2)
The base-infra bring-up stood up traefik/cloudflared/filebrowser, but nothing routed the controller
itself — felhom.<domain> 404'd (controller on bridge only, no traefik labels, empty dynamic/).
A live inside-out diagnostic confirmed the tunnel chain was already healthy (token tunnel-id matches
the DNS tunnel 8b4edf48…; CF ingress *.<domain> → https://traefik; traefik routes files.* 200) —
the only gap was the controller wiring. filebrowser self-registers via Docker labels + network
membership in its compose; the controller can't (started by the golden bootstrap before traefik-public
exists, and the v2 bootstrap carries no domain), so the wiring must happen post-pull.
- v0.41.1: `infra.RenderControllerRoute(domain)` (traefik file-provider route `Host(felhom.<domain>)` → `http://felhom-controller:8080`, websecure) + `wireController` in `EnsureBaseStack`: write `dynamic/controller.yml` (write-if-changed) and `docker network connect traefik-public felhom-controller`.
- v0.41.2: two fixes found in live validation. (1) `containerOnNetwork` misread the absent-key `<nil>` as "already attached" → the auto-connect was skipped (traefik 502'd); fixed by listing+matching network names. (2) Removed a dead `CrossDrive*` block in `dashboard.html` (a slice-8C de-privileging leftover) that `gt <nil> 0`-500'd the entire dashboard; this only surfaced once routing made it reachable.
Live result: felhom.demo-felhom.eu → HTTP 200 serving <title>Vezérlőpult — Felhom.eu</title>;
files.* still 200; 0 dashboard template errors; auto-connect proven (fresh bridge-only container →
[infra] connected felhom-controller to traefik-public).
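The `containerOnNetwork` bug from v0.41.2 can be illustrated with two small pure functions. Both are hypothetical reconstructions: `naiveAttached` is a guess at the shape of the failing check (any non-empty inspect output counts as attached), and `onNetwork` shows the "list names, then match" approach. The input string is assumed to be a whitespace-separated list of network names, for example as produced by a `docker inspect` format template that ranges over `.NetworkSettings.Networks`.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveAttached mirrors the failure mode described above: the literal
// "<nil>" printed for an absent map key is non-empty, so it passes.
func naiveAttached(inspectOut string) bool {
	return strings.TrimSpace(inspectOut) != ""
}

// onNetwork reports whether want appears, as an exact name, in a
// whitespace-separated list of network names.
func onNetwork(namesOut, want string) bool {
	for _, name := range strings.Fields(namesOut) {
		if name == want {
			return true
		}
	}
	return false
}

func main() {
	// A bridge-only container: the naive check wrongly says "attached".
	fmt.Println(naiveAttached("<nil>"))
	fmt.Println(onNetwork("bridge", "traefik-public"))
	fmt.Println(onNetwork("bridge traefik-public", "traefik-public"))
}
```

Matching on exact names avoids depending on how a template engine prints a missing key.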
What shipped (v0.41.0 base)
A freshly-onboarded controller came up ONLINE on the hub but Health = FAIL: protected containers not
running — traefik, cloudflared, filebrowser: nothing ever deployed the base stack on a Proxmox
bootstrap (it was only ever created by the bare-metal scripts/docker-setup.sh), and the health loop
only detected the gap. The controller now stands up its own base infrastructure.
- `internal/infra` (new): pure renderers (`//go:embed` text/templates lifted verbatim from `scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`). Pinned images as the single source of truth: `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge).
- `stacks.Manager.EnsureBaseStack` (`internal/stacks/infra.go`): creates the `traefik-public` network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. Single-flight (`TryLock`, since it is fired from both first boot and every health tick), idempotent (skips running stacks; never overwrites an existing filebrowser compose), non-fatal (logs, never crashes).
- Triggers (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls `EnsureBaseStack` unconditionally on every `system-health` tick (decoupled; safe via single-flight + idempotency).
- `monitor.EffectiveProtected`: cloudflared counts as protected only when a tunnel token is set, so a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
- Section-G mount fix (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks` host bind. Without it the guest daemon resolved every relative bind source on the guest filesystem (empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed `cat: read error: Is a directory` before, `hello-from-controller` after).
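The single-flight, idempotent, non-fatal behaviour of `EnsureBaseStack` could look roughly like this. It is a sketch under assumptions: the `Manager` fields `isRunning` and `deploy` are hypothetical stand-ins for the real stack checks and compose deploys, and network creation is omitted. `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// baseOrder is the deploy order named in the report.
var baseOrder = []string{"traefik", "cloudflared", "filebrowser"}

// Manager is a stand-in for stacks.Manager; its fields are hypothetical.
type Manager struct {
	mu        sync.Mutex
	isRunning func(name string) bool
	deploy    func(name string) error
}

// EnsureBaseStack returns false when another run already holds the lock.
func (m *Manager) EnsureBaseStack() bool {
	if !m.mu.TryLock() {
		return false // single-flight: a concurrent run is in progress
	}
	defer m.mu.Unlock()

	for _, name := range baseOrder {
		if m.isRunning(name) {
			continue // idempotent: leave running stacks alone
		}
		if err := m.deploy(name); err != nil {
			// non-fatal: log and carry on with the next stack
			log.Printf("[infra] deploy %s failed: %v", name, err)
		}
	}
	return true
}

func main() {
	running := map[string]bool{"traefik": true}
	m := &Manager{
		isRunning: func(n string) bool { return running[n] },
		deploy: func(n string) error {
			running[n] = true
			fmt.Println("deployed", n)
			return nil
		},
	}
	m.EnsureBaseStack() // deploys the two missing stacks
	m.EnsureBaseStack() // no-op: everything is running
}
```

Because a skipped run simply returns, the first-boot goroutine and the health tick can both call it without coordinating with each other.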
Tests (non-hollow)
go build/vet/test clean. internal/infra: customer params appear in output, no :latest survives,
both ACME branches render (DNS-01 with CF token / HTTP-01 without), .env is 0600, rendered YAML parses.
internal/stacks: EnsureBaseStack single-flight short-circuits while the lock is held.
internal/monitor: EffectiveProtected drops cloudflared without a token, keeps it with one.
Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)
- 4 containers running, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
- Health = OK (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
- Section-G holds end-to-end: `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered config (the bind resolves on the shared host path).
- Templates rendered the right branch: DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu` (cf_api_token + email present); `.env` and `acme.json` are 0600.
- cloudflared registered tunnel connections (bud01 + vie05 edges, QUIC).
- Hostname fixed (3A): controller `os.Hostname()` = `demo-felhom` (was the Docker container ID); CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
- Regression: controller↔agent local-api channel up (disks/host-metrics proxy intact).
- Self-heal: `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).
Notes / follow-ups
- The traefik dashboard route (`dynamic/dashboard.yml`) is deferred: it needs a generated htpasswd basic-auth hash. Routing for filebrowser/controller works without it.