2026-06-11 15:52:41 +02:00


REPORT: v0.41.0 → 0.41.2, base-infra bring-up + controller routing (+ Section-G mount fix)

Repo: felhom-controller · Version: 0.41.2 · Date: 2026-06-11 · Pushed commits: abbd948 (0.41.0) → 91736eb (0.41.1) → 2bed7ce (0.41.2) · paired with felhom-agent v0.20.0 (79ba2f1) + golden rebake (controller 0.41.2).

Follow-up after live diagnosis (v0.41.1 + v0.41.2)

The base-infra bring-up stood up traefik/cloudflared/filebrowser, but nothing routed the controller itself: felhom.<domain> returned 404 (controller on the bridge network only, no traefik labels, empty dynamic/). A live inside-out diagnostic confirmed the tunnel chain was already healthy (token tunnel-id matches the DNS tunnel 8b4edf48…; CF ingress *.<domain> → https://traefik; traefik routes files.* with 200), so the only gap was the controller wiring. filebrowser self-registers via Docker labels + network membership in its compose; the controller can't (it is started by the golden bootstrap before traefik-public exists, and the v2 bootstrap carries no domain), so the wiring must happen post-pull.

  • v0.41.1: infra.RenderControllerRoute(domain) (traefik file-provider route Host(felhom.<domain>) → http://felhom-controller:8080, websecure) + wireController in EnsureBaseStack: write dynamic/controller.yml (write-if-changed) and docker network connect traefik-public felhom-controller.
  • v0.41.2: two fixes found in live validation. (1) containerOnNetwork misread the absent-key <nil> as "already attached", so the auto-connect was skipped and traefik returned 502; fixed by listing and matching network names. (2) Removed a dead CrossDrive* block in dashboard.html (a slice-8C de-privileging leftover) whose gt <nil> 0 comparison made the entire dashboard return 500; this only surfaced once routing made the dashboard reachable.

Live result: felhom.demo-felhom.eu → HTTP 200 serving <title>Vezérlőpult — Felhom.eu</title>; files.* still 200; 0 dashboard template errors; auto-connect proven (fresh bridge-only container → [infra] connected felhom-controller to traefik-public).
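The containerOnNetwork fix from v0.41.2 can be illustrated with a small sketch. It assumes the check consumes text from a docker inspect format template; the function names attachedNaive and attachedByName are hypothetical, and the actual controller code may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// attachedNaive mimics the failure mode: any non-empty answer is treated as
// "attached", so the literal "<nil>" printed for an absent key passes.
func attachedNaive(indexOutput string) bool {
	return strings.TrimSpace(indexOutput) != ""
}

// attachedByName takes a whitespace-separated list of network names (for
// example from an inspect template that ranges over the Networks map keys;
// that template is an assumption here) and requires an exact name match.
func attachedByName(networkList, want string) bool {
	for _, name := range strings.Fields(networkList) {
		if name == want {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(attachedNaive("<nil>\n"))                                   // wrongly reports attached
	fmt.Println(attachedByName("bridge", "traefik-public"))                 // bridge-only container
	fmt.Println(attachedByName("bridge traefik-public", "traefik-public")) // after network connect
}
```

Matching exact names also avoids false positives from networks that merely share a prefix.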

What shipped (v0.41.0 base)

A freshly-onboarded controller came up ONLINE on the hub but with Health = FAIL (protected containers not running: traefik, cloudflared, filebrowser). Nothing ever deployed the base stack on a Proxmox bootstrap (it was only ever created by the bare-metal scripts/docker-setup.sh), and the health loop only detected the gap without repairing it. The controller now stands up its own base infrastructure.

  • internal/infra (new) — pure renderers (//go:embed text/templates lifted verbatim from scripts/docker-setup.sh) for traefik (traefik.yml + compose + a 0600 .env carrying the CF DNS token only when set), cloudflared (compose; TUNNEL_TOKEN), filebrowser (compose + config.yaml). Pinned images as the single source of truth: traefik:v3.6.7, cloudflare/cloudflared:2026.6.0, gtstef/filebrowser:1.3.3-stable. The web FileBrowser sync path delegates here (pins can't diverge).
  • stacks.Manager.EnsureBaseStack (internal/stacks/infra.go) — creates the traefik-public network, then deploys traefik → cloudflared → filebrowser under ${stacks_dir}/<name>. Single-flight (TryLock — fired from both first boot and every health tick), idempotent (skips running stacks; never overwrites an existing filebrowser compose), non-fatal (logs, never crashes).
  • Triggers (cmd/controller/main.go): first-boot goroutine after stack init; self-heal calls EnsureBaseStack unconditionally on every system-health tick (decoupled — safe via single-flight + idempotency).
  • monitor.EffectiveProtected — cloudflared counts as protected only when a tunnel token is set, so a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
  • Section-G mount fix (in felhom-agent build-golden.sh): same-path -v /opt/docker/stacks:/opt/docker/stacks host bind — without it the guest daemon resolved every relative bind source on the guest filesystem (empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed cat: read error: Is a directory before, hello-from-controller after).
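The single-flight behaviour of EnsureBaseStack described above can be sketched with sync.Mutex.TryLock (available since Go 1.18). This is only an illustration under assumptions: the manager type and its deploy hook are hypothetical stand-ins for the real stacks.Manager.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for stacks.Manager.
type manager struct {
	mu     sync.Mutex
	deploy func() error // assumed hook that brings the base stack up
}

// ensureBaseStack runs deploy at most once at a time. A caller that arrives
// while another run holds the lock returns immediately with false.
func (m *manager) ensureBaseStack() (ran bool) {
	if !m.mu.TryLock() {
		return false // another run is in flight; skip instead of queueing
	}
	defer m.mu.Unlock()
	if err := m.deploy(); err != nil {
		fmt.Println("[infra] base stack:", err) // non-fatal: log and carry on
	}
	return true
}

func main() {
	started := make(chan struct{})
	release := make(chan struct{})
	m := &manager{deploy: func() error {
		close(started) // signal that the first run holds the lock
		<-release      // simulate a slow deploy
		return nil
	}}

	done := make(chan bool, 1)
	go func() { done <- m.ensureBaseStack() }()

	<-started
	fmt.Println("overlapping call ran:", m.ensureBaseStack())
	close(release)
	fmt.Println("first call ran:", <-done)
}
```

Skipping rather than blocking is what lets the first-boot goroutine and the health tick both call the function without piling up deploys.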

Tests (non-hollow)

go build/vet/test clean. internal/infra: customer params appear in output, no :latest survives, both ACME branches render (DNS-01 with CF token / HTTP-01 without), .env is 0600, rendered YAML parses. internal/stacks: EnsureBaseStack single-flight short-circuits while the lock is held. internal/monitor: EffectiveProtected drops cloudflared without a token, keeps it with one.
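The EffectiveProtected behaviour asserted by the monitor tests can be sketched as a simple filter. The signature below is an assumption for illustration; the real monitor.EffectiveProtected may take its inputs from configuration rather than parameters.

```go
package main

import "fmt"

// effectiveProtected returns the protected container names that should
// actually be enforced. cloudflared is dropped when no tunnel token is set,
// so a LAN-only node is not flagged for a stack it intentionally skips.
func effectiveProtected(base []string, tunnelToken string) []string {
	out := make([]string, 0, len(base))
	for _, name := range base {
		if name == "cloudflared" && tunnelToken == "" {
			continue
		}
		out = append(out, name)
	}
	return out
}

func main() {
	base := []string{"traefik", "cloudflared", "filebrowser"}
	fmt.Println(effectiveProtected(base, ""))          // LAN-only node
	fmt.Println(effectiveProtected(base, "tok-12345")) // tunnel configured
}
```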

Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)

  • 4 containers running, all pinned/baked images: felhom-controller:0.41.0 (healthy), traefik:v3.6.7, cloudflare/cloudflared:2026.6.0, gtstef/filebrowser:1.3.3-stable (healthy); traefik-public network present.
  • Health = OK (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
  • Section-G holds end-to-end: docker exec traefik cat /etc/traefik/traefik.yml is the full rendered config (the bind resolves on the shared host path).
  • Templates rendered the right branch: DNS-01 / provider: cloudflare / email: admin@felhom.eu (cf_api_token + email present); .env and acme.json are 0600.
  • cloudflared registered tunnel connections (bud01 + vie05 edges, QUIC).
  • Hostname fixed (3A): controller os.Hostname() = demo-felhom (was the Docker container ID); CT/LXC hostname = demo-felhom (3B, was felhom-golden).
  • Regression: controller↔agent local-api channel up (disks/host-metrics proxy intact).
  • Self-heal: docker rm -f traefik → redeployed by the next system-health tick (idempotent no-op when healthy).

Notes / follow-ups

  • The traefik dashboard route (dynamic/dashboard.yml) is deferred — it needs a generated htpasswd basic-auth hash. Routing for filebrowser/controller works without it.