# REPORT — v0.41.0 → 0.41.2: base-infra bring-up + controller routing (+ Section-G mount fix)
**Repo:** `felhom-controller` · **Version:** 0.41.2 · **Date:** 2026-06-11
**Pushed commits:** `abbd948` (0.41.0) → `91736eb` (0.41.1) → `2bed7ce` (0.41.2) · paired with
`felhom-agent` v0.20.0 (`79ba2f1`) + golden rebake (controller 0.41.2).
## Follow-up after live diagnosis (v0.41.1 + v0.41.2)
The base-infra bring-up stood up traefik/cloudflared/filebrowser, but **nothing routed the controller
itself** — `felhom.<domain>` 404'd (controller on `bridge` only, no traefik labels, empty `dynamic/`).
A live inside-out diagnostic confirmed the **tunnel chain was already healthy** (token tunnel-id matches
the DNS tunnel `8b4edf48…`; CF ingress `*.<domain> → https://traefik`; traefik routes `files.*` 200) —
the **only** gap was the controller wiring. filebrowser self-registers via Docker labels + network
membership in its compose; the controller can't (started by the golden bootstrap before `traefik-public`
exists, and the v2 bootstrap carries no domain), so the wiring must happen post-pull.
- **v0.41.1** — `infra.RenderControllerRoute(domain)` (traefik file-provider route `Host(felhom.<domain>)
→ http://felhom-controller:8080`, websecure) + `wireController` in `EnsureBaseStack`: write
`dynamic/controller.yml` (write-if-changed) and `docker network connect traefik-public felhom-controller`.
- **v0.41.2** — two fixes found in live validation: (1) `containerOnNetwork` misread the absent-key
`<nil>` as "already attached" → the auto-connect was skipped (traefik 502'd); fixed by listing+matching
network names. (2) Removed a **dead `CrossDrive*` block** in `dashboard.html` (a slice-8C de-privileging
leftover) whose `gt <nil> 0` comparison **500'd the entire dashboard**; the bug only surfaced once routing made the dashboard reachable.
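
The `<nil>` misread in fix (1) can be illustrated with a small sketch. The real `containerOnNetwork` is not shown in this report, so both helpers below are hypothetical: `naiveOnNetwork` assumes the old check treated any non-empty inspect output as "attached", and `onNetwork` assumes the fixed check receives a newline-separated list of network names (for example from `docker inspect -f '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' felhom-controller`) and matches whole names.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveOnNetwork mirrors the assumed buggy shape: any non-empty output
// counts as "attached", so the "<nil>" printed for an absent map key passes.
func naiveOnNetwork(indexOutput string) bool {
	return strings.TrimSpace(indexOutput) != ""
}

// onNetwork reports whether want appears as a whole entry in a
// newline-separated list of network names.
func onNetwork(networkList, want string) bool {
	for _, name := range strings.Split(networkList, "\n") {
		if strings.TrimSpace(name) == want {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(naiveOnNetwork("<nil>"))                                   // true (wrong: not attached)
	fmt.Println(onNetwork("bridge\n", "traefik-public"))                   // false
	fmt.Println(onNetwork("bridge\ntraefik-public\n", "traefik-public"))   // true
}
```

With name matching, a `bridge`-only container is reported as not attached, so the auto-connect runs instead of being skipped.
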
**Live result:** `felhom.demo-felhom.eu → HTTP 200` serving `<title>Vezérlőpult — Felhom.eu</title>`;
`files.*` still 200; 0 dashboard template errors; auto-connect proven (fresh `bridge`-only container →
`[infra] connected felhom-controller to traefik-public`).
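
As a rough illustration of the v0.41.1 wiring, the sketch below renders a Traefik file-provider route and writes it only when the content changed. The template text, the router/service names, and the helpers `renderControllerRoute`, `writeIfChanged`, and `roundTrip` are assumptions for illustration, not the project's embedded template or `infra.RenderControllerRoute` itself; TLS options are left out because the report does not state them.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"text/template"
)

// routeTmpl is an illustrative route. It uses interpreted strings because the
// Host rule needs backticks, which a Go raw string literal cannot contain.
const routeTmpl = "http:\n" +
	"  routers:\n" +
	"    felhom-controller:\n" +
	"      rule: \"Host(`felhom.{{.Domain}}`)\"\n" +
	"      entryPoints:\n" +
	"        - websecure\n" +
	"      service: felhom-controller\n" +
	"  services:\n" +
	"    felhom-controller:\n" +
	"      loadBalancer:\n" +
	"        servers:\n" +
	"          - url: \"http://felhom-controller:8080\"\n"

var route = template.Must(template.New("controller").Parse(routeTmpl))

// renderControllerRoute fills the domain into the route template.
func renderControllerRoute(domain string) ([]byte, error) {
	var buf bytes.Buffer
	if err := route.Execute(&buf, struct{ Domain string }{domain}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// writeIfChanged writes data only when the file is missing or differs,
// and reports whether a write happened.
func writeIfChanged(path string, data []byte, perm os.FileMode) (bool, error) {
	if old, err := os.ReadFile(path); err == nil && bytes.Equal(old, data) {
		return false, nil
	}
	if err := os.WriteFile(path, data, perm); err != nil {
		return false, err
	}
	return true, nil
}

// roundTrip renders once and writes twice into a temp dir, returning
// whether each call actually wrote the file.
func roundTrip(domain string) (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "dynamic")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	out, err := renderControllerRoute(domain)
	if err != nil {
		return false, false, err
	}
	path := filepath.Join(dir, "controller.yml")
	if first, err = writeIfChanged(path, out, 0o644); err != nil {
		return false, false, err
	}
	second, err = writeIfChanged(path, out, 0o644)
	return first, second, err
}

func main() {
	first, second, err := roundTrip("example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second) // true false
}
```

Skipping the second write keeps repeated `EnsureBaseStack` runs from touching `dynamic/controller.yml` when nothing changed, which presumably avoids needless file-provider reloads.
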
## What shipped (v0.41.0 base)
A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not
running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox
bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop
only *detected* the gap. The controller now stands up its own base infrastructure.
- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from
`scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS
token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`).
**Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`,
`gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge).
- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public`
network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. **Single-flight**
(`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks;
never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes).
- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls
`EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight +
idempotency).
- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so
a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks`
host bind — without it the guest daemon resolved every relative bind source on the guest filesystem
(empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed
`cat: read error: Is a directory` before, `hello-from-controller` after).
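
The single-flight and idempotency properties described for `EnsureBaseStack` can be sketched as follows. The `manager` type, its fields, and the `ensureBaseStack` method are hypothetical stand-ins; the real method renders files and deploys stacks, which this sketch replaces with bookkeeping. `sync.Mutex.TryLock` is available from Go 1.18.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for stacks.Manager.
type manager struct {
	mu      sync.Mutex
	running map[string]bool // stack name -> already running
	deploys []string        // what this sketch "deployed", in order
}

// ensureBaseStack brings up missing stacks in a fixed order. It returns
// false when another run holds the lock (single-flight) and has no error
// return, mirroring the non-fatal behaviour (failures would only be logged).
func (m *manager) ensureBaseStack() bool {
	if !m.mu.TryLock() {
		return false // another bring-up is in progress; skip this trigger
	}
	defer m.mu.Unlock()
	for _, name := range []string{"traefik", "cloudflared", "filebrowser"} {
		if m.running[name] {
			continue // idempotent: leave running stacks alone
		}
		// Real code would render the stack files and start the stack here.
		m.deploys = append(m.deploys, name)
		m.running[name] = true
	}
	return true
}

func main() {
	m := &manager{running: map[string]bool{"traefik": true}}

	m.mu.Lock() // simulate a bring-up already in flight
	skipped := !m.ensureBaseStack()
	m.mu.Unlock()
	fmt.Println(skipped) // true

	m.ensureBaseStack()
	fmt.Println(m.deploys) // [cloudflared filebrowser]

	m.ensureBaseStack() // second run is a no-op
	fmt.Println(len(m.deploys)) // 2
}
```

Because a contended call returns immediately instead of blocking, the first-boot goroutine and the health tick can both fire the same function without queueing up duplicate deployments.
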
## Tests (non-hollow)
`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**,
both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**.
`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held.
`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one.
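
A minimal sketch of the `EffectiveProtected` behaviour the last test exercises. The signature below is an assumption (the report only names `monitor.EffectiveProtected`); the rule it encodes, dropping cloudflared when no tunnel token is set, comes from the report.

```go
package main

import "fmt"

// effectiveProtected returns the containers the health check should require.
// cloudflared is kept only when a tunnel token is configured.
// Hypothetical signature; only the rule is taken from the report.
func effectiveProtected(base []string, tunnelToken string) []string {
	out := make([]string, 0, len(base))
	for _, name := range base {
		if name == "cloudflared" && tunnelToken == "" {
			continue // LAN-only node: the stack is intentionally skipped
		}
		out = append(out, name)
	}
	return out
}

func main() {
	base := []string{"traefik", "cloudflared", "filebrowser"}
	fmt.Println(effectiveProtected(base, ""))      // [traefik filebrowser]
	fmt.Println(effectiveProtected(base, "token")) // [traefik cloudflared filebrowser]
}
```
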
## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)
- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`,
`cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered
config (the bind resolves on the shared host path).
- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu`
(cf_api_token + email present); `.env` and `acme.json` are 0600.
- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC).
- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID);
CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact).
- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).
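
The "all pinned/baked images" observation above, like the "no `:latest` survives" test, comes down to a check on image references. The helper below is a hypothetical sketch of such a check, not the project's test code: it treats a reference as pinned when it carries a digest or an explicit tag other than `latest`.

```go
package main

import (
	"fmt"
	"strings"
)

// isPinned reports whether an image reference has a digest or an explicit
// tag other than "latest". Only the last path segment is inspected, so a
// registry port such as localhost:5000 is not mistaken for a tag.
func isPinned(image string) bool {
	if strings.Contains(image, "@sha256:") {
		return true
	}
	last := image[strings.LastIndex(image, "/")+1:]
	i := strings.LastIndex(last, ":")
	if i < 0 {
		return false // no tag: Docker would resolve this to :latest
	}
	tag := last[i+1:]
	return tag != "" && tag != "latest"
}

func main() {
	for _, img := range []string{
		"traefik:v3.6.7",
		"cloudflare/cloudflared:2026.6.0",
		"gtstef/filebrowser:1.3.3-stable",
		"gtstef/filebrowser:latest",
		"localhost:5000/traefik",
	} {
		fmt.Println(img, isPinned(img))
	}
}
```
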
## Notes / follow-ups
- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd
basic-auth hash. Routing for filebrowser/controller works without it.