diff --git a/REPORT.md b/REPORT.md index 77a5d87..f3a699a 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,79 +1,46 @@ -# REPORT — v0.41.0–0.41.2: base-infra bring-up + controller routing (+ Section-G mount fix) +# REPORT — v0.42.1: real Let's Encrypt wildcard cert (proactive wildcard issuance) -**Repo:** `felhom-controller` · **Version:** 0.41.2 · **Date:** 2026-06-11 -**Pushed commits:** `abbd948` (0.41.0) → `91736eb` (0.41.1) → `2bed7ce` (0.41.2) · paired with -`felhom-agent` v0.20.0 (`79ba2f1`) + golden rebake (controller 0.41.2). +**Repo:** `felhom-controller` · **Version:** 0.42.1 · **Date:** 2026-06-11 +**Pushed commits:** `84c3e84` (v0.42.0, superseded) → `e61e7dd` (v0.42.1) · paired with `felhom-agent` +v0.21.0 (split-horizon LAN resolver — depends on this real cert) + golden rebake (controller 0.42.1). -## Follow-up after live diagnosis (v0.41.1 + v0.41.2) +## What shipped -The base-infra bring-up stood up traefik/cloudflared/filebrowser, but **nothing routed the controller -itself** — `felhom.` 404'd (controller on `bridge` only, no traefik labels, empty `dynamic/`). -A live inside-out diagnostic confirmed the **tunnel chain was already healthy** (token tunnel-id matches -the DNS tunnel `8b4edf48…`; CF ingress `*. → https://traefik`; traefik routes `files.*` 200) — -the **only** gap was the controller wiring. filebrowser self-registers via Docker labels + network -membership in its compose; the controller can't (started by the golden bootstrap before `traefik-public` -exists, and the v2 bootstrap carries no domain), so the wiring must happen post-pull. +The base-infra traefik obtained **no** real cert (acme.json empty) — both routers relied on the +websecure entrypoint-default `certResolver`, which does **not** trigger proactive DNS-01 issuance, so +everything ran on traefik's self-signed default (masked externally only by the tunnel's `noTLSVerify`). +This blocked LAN-direct access (a LAN client performs its TLS handshake directly with traefik and needs the real cert). 
-- **v0.41.1** — `infra.RenderControllerRoute(domain)` (traefik file-provider route `Host(felhom.) - → http://felhom-controller:8080`, websecure) + `wireController` in `EnsureBaseStack`: write - `dynamic/controller.yml` (write-if-changed) and `docker network connect traefik-public felhom-controller`. -- **v0.41.2** — two fixes found in live validation: (1) `containerOnNetwork` misread the absent-key - `` as "already attached" → the auto-connect was skipped (traefik 502'd); fixed by listing+matching - network names. (2) Removed a **dead `CrossDrive*` block** in `dashboard.html` (a slice-8C de-privileging - leftover) that `gt 0`-**500'd the entire dashboard** — only surfaced once routing made it reachable. +- **`infra.RenderControllerRoute(domain, wildcardTLS)`** — the always-present controller route is now + the **wildcard-issuance anchor**: when DNS-01 ACME is configured (`CFAPIToken && Email`) it carries + router-level `tls.certResolver: letsencrypt` + `tls.domains: [{main: "*.", sans: [""]}]`, + so traefik **proactively obtains `*.` + apex at startup** via Cloudflare DNS-01. Every other + router (filebrowser, future apps) then serves that one wildcard by SNI match — **no per-app + certresolver labels**, real cert ready before the first client connects. `stacks.wireController` passes + `wildcardTLS = (CFAPIToken != "" && Email != "")`. +- **Key empirical finding (staging on 9201):** traefik v3 issues a cert from a **router-level** + `tls.domains` but **NOT** from the entrypoint-level `http.tls.domains` (acme.json stayed 0 bytes with + the latter). v0.42.0's entrypoint-domains attempt + `TraefikData.Domain` was reverted. -**Live result:** `felhom.demo-felhom.eu → HTTP 200` serving `Vezérlőpult — Felhom.eu`; -`files.*` still 200; 0 dashboard template errors; auto-connect proven (fresh `bridge`-only container → -`[infra] connected felhom-controller to traefik-public`). 
+## Validation (staging → prod on guest 9201) +- **CF token pre-check:** active, scoped to `demo-felhom.eu`, DNS read OK (DNS:Edit confirmed by the run). +- **Staging (Fake LE):** with the router-level wildcard, acme.json went 0 → populated; `felhom.*`, + `files.*`, and arbitrary `anything.demo-felhom.eu` all presented `*.demo-felhom.eu` (issuer `(STAGING)`) + — one wildcard, zero per-app labels. +- **Prod switch:** wiped acme.json, clean first-boot render (no `caServer`, no entrypoint domains) → + PROD wildcard issued (acme.json ~16 KB). +- **GATE (from dooplex, real LAN host, direct to guest IP):** `felhom.demo-felhom.eu` + `files.demo-felhom.eu` + → **`200 ssl_verify=0`**; issuer **`C=US, O=Let's Encrypt, CN=YR1`** (real prod LE); subject + `CN=*.demo-felhom.eu`, SAN `*.demo-felhom.eu, demo-felhom.eu`. Dashboard title `Vezérlőpult — Felhom.eu`. -## What shipped (v0.41.0 base) +## Live deploy +- Built + pushed `0.42.1`; deployed to 9201 (clean first-boot: wipe acme.json → controller renders prod + traefik.yml + wildcard route → traefik obtains the prod wildcard). +- **Golden rebaked** with controller 0.42.1 → `local:backup/vzdump-lxc-9100-2026_06_11-18_10_11.tar.zst` + (fresh provisions get a real wildcard cert on first boot). -A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not -running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox -bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop -only *detected* the gap. The controller now stands up its own base infrastructure. - -- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from - `scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS - token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`). 
- **Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`, - `gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge). -- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public` - network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/`. **Single-flight** - (`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks; - never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes). -- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls - `EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight + - idempotency). -- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so - a LAN-only node doesn't report FAIL forever for a stack it intentionally skips. -- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks` - host bind — without it the guest daemon resolved every relative bind source on the guest filesystem - (empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed - `cat: read error: Is a directory` before, `hello-from-controller` after). - -## Tests (non-hollow) - -`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**, -both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**. -`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held. -`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one. 
- -## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden) - -- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`, - `cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present. -- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0. -- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered - config (the bind resolves on the shared host path). -- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu` - (cf_api_token + email present); `.env` and `acme.json` are 0600. -- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC). -- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID); - CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`). -- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact). -- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy). - -## Notes / follow-ups -- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd - basic-auth hash. Routing for filebrowser/controller works without it. +## Notes +- The traefik **dashboard route** (`dynamic/dashboard.yml`) remains deferred (needs a generated + basic-auth hash) — routing/cert for filebrowser + controller work without it. +- The HTTP-01 path (ACME email but no CF token) can't issue wildcards, so it falls back to a plain TLS route + (self-signed). Felhom production always uses Cloudflare DNS-01, so the wildcard path is the norm. 
diff --git a/controller/README.md b/controller/README.md index 2c19b02..fd6e27e 100644 --- a/controller/README.md +++ b/controller/README.md @@ -209,6 +209,8 @@ cloudflared is only deployed when a tunnel token is configured. **Triggers**: a > **Mount prerequisite (Section-G):** the controller writes these stacks under `/opt/docker/stacks` *inside its container*, but `docker compose up` runs on the **guest** Docker daemon. The golden's controller-bootstrap (`felhom-agent` `build-golden.sh`) therefore bind-mounts that path **same-path** (`-v /opt/docker/stacks:/opt/docker/stacks`) so the daemon resolves every relative bind source — without it, all bind-mounted stacks (base infra and customer apps) silently break. +**Controller routing + the wildcard cert anchor (`wireController` → `RenderControllerRoute`, v0.41.1 / v0.42.1).** filebrowser self-registers with traefik via Docker labels + `traefik-public` membership baked into its compose; the controller can't (it's started by the golden bootstrap *before* `traefik-public` exists, and the v2 `bootstrap.json` carries no domain — that comes from the hub pull). So `EnsureBaseStack` wires the controller **post-pull**: it runs `docker network connect traefik-public felhom-controller` and writes a traefik file-provider route `dynamic/controller.yml` (`Host(felhom.) → http://felhom-controller:8080`, write-if-changed). When DNS-01 ACME is configured, that route is **also the wildcard-cert anchor**: its router-level `tls.domains: *.` makes traefik **proactively obtain the wildcard `*.` + apex via Cloudflare DNS-01 at startup** (an entrypoint-level `http.tls.domains` does *not* trigger issuance in traefik v3 — only a router-level `tls.domains` does). Every other router then serves that one real wildcard cert by SNI — no per-app `certresolver` labels. This is what lets a LAN client reach the box directly at `*.` with the real cert (the `felhom-agent` split-horizon resolver depends on it). 
+ #### Missing Field Injection (`deploy.go`) When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felhom.yml`), existing deployed apps need the new field in their `app.yaml`. The controller handles this automatically: