docs(v0.42.1): REPORT (real wildcard cert) + README controller-route/wildcard-anchor
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -1,79 +1,46 @@
-# REPORT — v0.41.0–0.41.2: base-infra bring-up + controller routing (+ Section-G mount fix)
+# REPORT — v0.42.1: real Let's Encrypt wildcard cert (wildcard proactive issuance)

-**Repo:** `felhom-controller` · **Version:** 0.41.2 · **Date:** 2026-06-11
-**Pushed commits:** `abbd948` (0.41.0) → `91736eb` (0.41.1) → `2bed7ce` (0.41.2) · paired with
-`felhom-agent` v0.20.0 (`79ba2f1`) + golden rebake (controller 0.41.2).
+**Repo:** `felhom-controller` · **Version:** 0.42.1 · **Date:** 2026-06-11
+**Pushed commits:** `84c3e84` (v0.42.0, superseded) → `e61e7dd` (v0.42.1) · paired with `felhom-agent`
+v0.21.0 (split-horizon LAN resolver — depends on this real cert) + golden rebake (controller 0.42.1).

-## Follow-up after live diagnosis (v0.41.1 + v0.41.2)
+## What shipped

-The base-infra bring-up stood up traefik/cloudflared/filebrowser, but **nothing routed the controller
-itself** — `felhom.<domain>` 404'd (controller on `bridge` only, no traefik labels, empty `dynamic/`).
-A live inside-out diagnostic confirmed the **tunnel chain was already healthy** (token tunnel-id matches
-the DNS tunnel `8b4edf48…`; CF ingress `*.<domain> → https://traefik`; traefik routes `files.*` 200) —
-the **only** gap was the controller wiring. filebrowser self-registers via Docker labels + network
-membership in its compose; the controller can't (started by the golden bootstrap before `traefik-public`
-exists, and the v2 bootstrap carries no domain), so the wiring must happen post-pull.
+The base-infra traefik obtained **no** real cert (acme.json empty) — both routers relied on the
+websecure entrypoint-default `certResolver`, which does **not** trigger proactive DNS-01 issuance, so
+everything ran on traefik's self-signed default (masked externally only by the tunnel's `noTLSVerify`).
+This blocked LAN-direct (a LAN client TLS-handshakes straight to traefik and needs the real cert).

-- **v0.41.1** — `infra.RenderControllerRoute(domain)` (traefik file-provider route `Host(felhom.<domain>)
-→ http://felhom-controller:8080`, websecure) + `wireController` in `EnsureBaseStack`: write
-`dynamic/controller.yml` (write-if-changed) and `docker network connect traefik-public felhom-controller`.
-- **v0.41.2** — two fixes found in live validation: (1) `containerOnNetwork` misread the absent-key
-`<nil>` as "already attached" → the auto-connect was skipped (traefik 502'd); fixed by listing+matching
-network names. (2) Removed a **dead `CrossDrive*` block** in `dashboard.html` (a slice-8C de-privileging
-leftover) that `gt <nil> 0`-**500'd the entire dashboard** — only surfaced once routing made it reachable.
+- **`infra.RenderControllerRoute(domain, wildcardTLS)`** — the always-present controller route is now
+the **wildcard-issuance anchor**: when DNS-01 ACME is configured (`CFAPIToken && Email`) it carries
+router-level `tls.certResolver: letsencrypt` + `tls.domains: [{main: "*.<domain>", sans: ["<domain>"]}]`,
+so traefik **proactively obtains `*.<domain>` + apex at startup** via Cloudflare DNS-01. Every other
+router (filebrowser, future apps) then serves that one wildcard by SNI match — **no per-app
+certresolver labels**, real cert ready before the first client connects. `stacks.wireController` passes
+`wildcardTLS = (CFAPIToken != "" && Email != "")`.
+- **Key empirical finding (staging on 9201):** traefik v3 issues a cert from a **router-level**
+`tls.domains` but **NOT** from the entrypoint-level `http.tls.domains` (acme.json stayed 0 bytes with
+the latter). v0.42.0's entrypoint-domains attempt + `TraefikData.Domain` was reverted.

-**Live result:** `felhom.demo-felhom.eu → HTTP 200` serving `<title>Vezérlőpult — Felhom.eu</title>`;
-`files.*` still 200; 0 dashboard template errors; auto-connect proven (fresh `bridge`-only container →
-`[infra] connected felhom-controller to traefik-public`).
+## Validation (staging → prod on guest 9201)
+- **CF token pre-check:** active, scoped to `demo-felhom.eu`, DNS read OK (DNS:Edit confirmed by the run).
+- **Staging (Fake LE):** with the router-level wildcard, acme.json went 0 → populated; `felhom.*`,
+`files.*`, and arbitrary `anything.demo-felhom.eu` all presented `*.demo-felhom.eu` (issuer `(STAGING)`)
+— one wildcard, zero per-app labels.
+- **Prod switch:** wiped acme.json, clean first-boot render (no `caServer`, no entrypoint domains) →
+PROD wildcard issued (acme.json ~16 KB).
+- **GATE (from dooplex, real LAN host, direct to guest IP):** `felhom.demo-felhom.eu` + `files.demo-felhom.eu`
+→ **`200 ssl_verify=0`**; issuer **`C=US, O=Let's Encrypt, CN=YR1`** (real prod LE); subject
+`CN=*.demo-felhom.eu`, SAN `*.demo-felhom.eu, demo-felhom.eu`. Dashboard title `Vezérlőpult — Felhom.eu`.

-## What shipped (v0.41.0 base)
+## Live deploy
+- Built + pushed `0.42.1`; deployed to 9201 (clean first-boot: wipe acme.json → controller renders prod
+traefik.yml + wildcard route → traefik obtains the prod wildcard).
+- **Golden rebaked** with controller 0.42.1 → `local:backup/vzdump-lxc-9100-2026_06_11-18_10_11.tar.zst`
+(fresh provisions get a real wildcard cert on first boot).

-A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not
-running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox
-bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop
-only *detected* the gap. The controller now stands up its own base infrastructure.
-
-- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from
-`scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS
-token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`).
-**Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`,
-`gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge).
-- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public`
-network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. **Single-flight**
-(`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks;
-never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes).
-- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls
-`EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight +
-idempotency).
-- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so
-a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
-- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks`
-host bind — without it the guest daemon resolved every relative bind source on the guest filesystem
-(empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed
-`cat: read error: Is a directory` before, `hello-from-controller` after).
-
-## Tests (non-hollow)
-
-`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**,
-both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**.
-`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held.
-`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one.
-
-## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)
-
-- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`,
-`cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
-- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
-- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered
-config (the bind resolves on the shared host path).
-- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu`
-(cf_api_token + email present); `.env` and `acme.json` are 0600.
-- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC).
-- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID);
-CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
-- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact).
-- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).
-
-## Notes / follow-ups
-- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd
-basic-auth hash. Routing for filebrowser/controller works without it.
+## Notes
+- The traefik **dashboard route** (`dynamic/dashboard.yml`) remains deferred (needs a generated
+basic-auth hash) — routing/cert for filebrowser + controller work without it.
+- HTTP-01 path (ACME email but no CF token) can't issue wildcards → falls back to a plain TLS route
+(self-signed). The felhom production always uses Cloudflare DNS-01, so the wildcard path is the norm.
@@ -209,6 +209,8 @@ cloudflared is only deployed when a tunnel token is configured. **Triggers**: a

 > **Mount prerequisite (Section-G):** the controller writes these stacks under `/opt/docker/stacks` *inside its container*, but `docker compose up` runs on the **guest** Docker daemon. The golden's controller-bootstrap (`felhom-agent` `build-golden.sh`) therefore bind-mounts that path **same-path** (`-v /opt/docker/stacks:/opt/docker/stacks`) so the daemon resolves every relative bind source — without it, all bind-mounted stacks (base infra and customer apps) silently break.

+**Controller routing + the wildcard cert anchor (`wireController` → `RenderControllerRoute`, v0.41.1 / v0.42.1).** filebrowser self-registers with traefik via Docker labels + `traefik-public` membership baked into its compose; the controller can't (it's started by the golden bootstrap *before* `traefik-public` exists, and the v2 `bootstrap.json` carries no domain — that comes from the hub pull). So `EnsureBaseStack` wires the controller **post-pull**: it `docker network connect traefik-public felhom-controller` and writes a traefik file-provider route `dynamic/controller.yml` (`Host(felhom.<domain>) → http://felhom-controller:8080`, write-if-changed). When DNS-01 ACME is configured, that route is **also the wildcard-cert anchor**: its router-level `tls.domains: *.<domain>` makes traefik **proactively obtain the wildcard `*.<domain>` + apex via Cloudflare DNS-01 at startup** (an entrypoint-level `http.tls.domains` does *not* trigger issuance in traefik v3 — only a router-level `tls.domains` does). Every other router then serves that one real wildcard cert by SNI — no per-app `certresolver` labels. This is what lets a LAN client reach the box directly at `*.<domain>` with the real cert (the `felhom-agent` split-horizon resolver depends on it).
+
 #### Missing Field Injection (`deploy.go`)

 When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felhom.yml`), existing deployed apps need the new field in their `app.yaml`. The controller handles this automatically: