f1780100ee
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# REPORT — v0.41.0: first-boot base-infrastructure bring-up + self-heal (+ Section-G mount fix)
**Repo:** `felhom-controller` · **Version:** 0.41.0 · **Date:** 2026-06-11
**Pushed commit:** `abbd948` (controller) · paired with `felhom-agent` v0.20.0 (`1799fcd`) + golden rebake.
## What shipped

A freshly onboarded controller came up ONLINE on the hub but reported **Health = FAIL: protected
containers not running (traefik, cloudflared, filebrowser)**. Nothing ever deployed the base stack on a
Proxmox bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health
loop only *detected* the gap. The controller now stands up its own base infrastructure.
- **`internal/infra`** (new): pure renderers (`//go:embed` `text/template`s lifted verbatim from
  `scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS
  token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`).
  **Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`,
  `gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here, so the pins can't diverge.
- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`): creates the `traefik-public`
  network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/<name>`. **Single-flight**
  (`TryLock`, because it is fired from both first boot and every health tick), **idempotent** (skips
  running stacks; never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes).
- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls
  `EnsureBaseStack` unconditionally on every `system-health` tick (decoupled; safe via single-flight +
  idempotency).
- **`monitor.EffectiveProtected`**: cloudflared counts as protected only when a tunnel token is set, so
  a LAN-only node doesn't report FAIL forever for a stack it intentionally skips.
- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks`
  host bind. Without it the guest daemon resolved every relative bind source on the guest filesystem
  (empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed
  `cat: read error: Is a directory` before, `hello-from-controller` after).
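The single-flight, idempotent, non-fatal behaviour described for `EnsureBaseStack` can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/stacks/infra.go`: the `Deployer` interface, the `fakeDeployer` type, the simplified `Manager` struct and the boolean return value are all hypothetical, and network creation and compose rendering are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Deployer is a hypothetical stand-in for whatever actually checks and
// starts a compose stack in the controller.
type Deployer interface {
	IsRunning(name string) bool
	Deploy(name string) error
}

// Manager is a simplified, hypothetical version of stacks.Manager.
type Manager struct {
	mu  sync.Mutex // held for the duration of one EnsureBaseStack run
	dep Deployer
}

// baseStacks lists the base stacks in deployment order.
var baseStacks = []string{"traefik", "cloudflared", "filebrowser"}

// EnsureBaseStack deploys every base stack that is not running.
// It returns false when another run already holds the lock.
func (m *Manager) EnsureBaseStack() bool {
	if !m.mu.TryLock() { // single-flight: overlapping callers return at once
		return false
	}
	defer m.mu.Unlock()

	for _, name := range baseStacks {
		if m.dep.IsRunning(name) { // idempotent: running stacks are left alone
			continue
		}
		if err := m.dep.Deploy(name); err != nil {
			// non-fatal: report the error and continue with the next stack
			fmt.Printf("ensure base stack: %s: %v\n", name, err)
		}
	}
	return true
}

// fakeDeployer records deployments in memory for the demo below.
type fakeDeployer struct {
	running  map[string]bool
	deployed []string
}

func (f *fakeDeployer) IsRunning(name string) bool { return f.running[name] }

func (f *fakeDeployer) Deploy(name string) error {
	f.running[name] = true
	f.deployed = append(f.deployed, name)
	return nil
}

func main() {
	dep := &fakeDeployer{running: map[string]bool{"filebrowser": true}}
	m := &Manager{dep: dep}

	m.EnsureBaseStack()       // first boot: deploys the two missing stacks
	m.EnsureBaseStack()       // later health tick: everything runs, so a no-op
	fmt.Println(dep.deployed) // [traefik cloudflared]
}
```

Because a second caller simply returns while the lock is held, the first-boot goroutine and the health tick can both call the function without coordinating with each other.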
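The token-dependent protected set could look something like the sketch below. The function name, signature and the idea of passing the token as a plain string are assumptions made for illustration; the real `monitor.EffectiveProtected` may be shaped differently.

```go
package main

import "fmt"

// effectiveProtected is a hypothetical sketch: it returns the protected
// container names, dropping cloudflared when no tunnel token is configured.
func effectiveProtected(protected []string, tunnelToken string) []string {
	out := make([]string, 0, len(protected))
	for _, name := range protected {
		if name == "cloudflared" && tunnelToken == "" {
			continue // LAN-only node: this stack is intentionally skipped
		}
		out = append(out, name)
	}
	return out
}

func main() {
	base := []string{"traefik", "cloudflared", "filebrowser"}
	fmt.Println(effectiveProtected(base, ""))              // [traefik filebrowser]
	fmt.Println(effectiveProtected(base, "example-token")) // [traefik cloudflared filebrowser]
}
```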
## Tests (non-hollow)

`go build/vet/test` clean. `internal/infra`: customer params appear in output, **no `:latest` survives**,
both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**.
`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held.
`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one.
## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden)

- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`,
  `cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present.
- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0.
- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered
  config (the bind resolves on the shared host path).
- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu`
  (cf_api_token + email present); `.env` and `acme.json` are 0600.
- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC).
- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID);
  CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`).
- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact).
- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy).
## Notes / follow-ups

- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred: it needs a generated htpasswd
  basic-auth hash. Routing for filebrowser/controller works without it.