diff --git a/REPORT.md b/REPORT.md index 23b9747..746c733 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,95 +1,56 @@ -# REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11) +# REPORT — v0.41.0: first-boot base-infrastructure bring-up + self-heal (+ Section-G mount fix) -Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session: -a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's **host** hub -key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report -ONLINE. Now, on first boot, the controller **pulls** its full controller.yaml from the hub (using the -bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the -per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201). +**Repo:** `felhom-controller` · **Version:** 0.41.0 · **Date:** 2026-06-11 +**Pushed commit:** `abbd948` (controller) · paired with `felhom-agent` v0.20.0 (`1799fcd`) + golden rebake. -## What changed (`internal/bootstrap`, `cmd/controller/main.go`) -- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub` - drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode. -- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`** — `pull` injected (decision (b): keeps - `bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`). Flow: - idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry - (1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → **merge** - `local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field) → write 0600 - atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode). 
-- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient, - passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path). +## What shipped -## Cross-repo contract checksum-diff (rendered bootstrap.json field set) -The agent's v2 renderer output was ingested by the controller's `json.Unmarshal` — **every field -populated**, exact match: +A freshly-onboarded controller came up ONLINE on the hub but **Health = FAIL: protected containers not +running — traefik, cloudflared, filebrowser**: nothing ever deployed the base stack on a Proxmox +bootstrap (it was only ever created by the bare-metal `scripts/docker-setup.sh`), and the health loop +only *detected* the gap. The controller now stands up its own base infrastructure. -| level | fields (agent emits == controller ingests) | -|---|---| -| top | `schema, customer, hub, local_api` | -| customer | `id` | -| hub | `url, retrieval_password` | -| local_api | `endpoint, fingerprint, token` | +- **`internal/infra`** (new) — pure renderers (`//go:embed` `text/template`s lifted verbatim from + `scripts/docker-setup.sh`) for traefik (`traefik.yml` + compose + a 0600 `.env` carrying the CF DNS + token only when set), cloudflared (compose; `TUNNEL_TOKEN`), filebrowser (compose + `config.yaml`). + **Pinned images as the single source of truth:** `traefik:v3.6.7`, `cloudflare/cloudflared:2026.6.0`, + `gtstef/filebrowser:1.3.3-stable`. The web FileBrowser sync path delegates here (pins can't diverge). +- **`stacks.Manager.EnsureBaseStack`** (`internal/stacks/infra.go`) — creates the `traefik-public` + network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/`. **Single-flight** + (`TryLock` — fired from both first boot and every health tick), **idempotent** (skips running stacks; + never overwrites an existing filebrowser compose), **non-fatal** (logs, never crashes). 
+- **Triggers** (`cmd/controller/main.go`): first-boot goroutine after stack init; self-heal calls + `EnsureBaseStack` unconditionally on every `system-health` tick (decoupled — safe via single-flight + + idempotency). +- **`monitor.EffectiveProtected`** — cloudflared counts as protected only when a tunnel token is set, so + a LAN-only node doesn't report FAIL forever for a stack it intentionally skips. +- **Section-G mount fix** (in `felhom-agent` `build-golden.sh`): same-path `-v /opt/docker/stacks:/opt/docker/stacks` + host bind — without it the guest daemon resolved every relative bind source on the guest filesystem + (empty dirs), breaking all bind-mounted stacks. Empirically proven on guest 9201 (probe printed + `cat: read error: Is a directory` before, `hello-from-controller` after). -(Automated round-trip via a throwaway test in each package; removed after verifying.) +## Tests (non-hollow) -## Tests — non-hollow (`internal/bootstrap`), all green -- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`, - and an unmodeled `assets.source_url`. Asserts the written controller.yaml carries **the customer key - + identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint, - fingerprint,token}`, and contains **no host key/id**. -- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched. -- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls, - then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`). -- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call. -- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull. +`go build/vet/test` clean. 
`internal/infra`: customer params appear in output, **no `:latest` survives**, +both ACME branches render (DNS-01 with CF token / HTTP-01 without), `.env` is 0600, **rendered YAML parses**. +`internal/stacks`: `EnsureBaseStack` single-flight short-circuits while the lock is held. +`internal/monitor`: `EffectiveProtected` drops cloudflared without a token, keeps it with one. -`go build ./... && go test ./...` green. +## Live validation (demo guest 9201, destroyed + re-provisioned from the rebaked golden) -## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`) -Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed -`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent -v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password ` (passphrase read -from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**), -then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1). +- **4 containers running**, all pinned/baked images: `felhom-controller:0.41.0` (healthy), `traefik:v3.6.7`, + `cloudflare/cloudflared:2026.6.0`, `gtstef/filebrowser:1.3.3-stable` (healthy); `traefik-public` network present. +- **Health = OK** (no "protected container not running"); hub report pushed successfully → demo-felhom ONLINE v0.41.0. +- **Section-G holds end-to-end:** `docker exec traefik cat /etc/traefik/traefik.yml` is the full rendered + config (the bind resolves on the shared host path). +- **Templates rendered the right branch:** DNS-01 / `provider: cloudflare` / `email: admin@felhom.eu` + (cf_api_token + email present); `.env` and `acme.json` are 0600. +- **cloudflared registered tunnel connections** (bud01 + vie05 edges, QUIC). 
+- **Hostname fixed (3A):** controller `os.Hostname()` = `demo-felhom` (was the Docker container ID); + CT/LXC hostname = `demo-felhom` (3B, was `felhom-golden`). +- **Regression:** controller↔agent **local-api channel up** (disks/host-metrics proxy intact). +- **Self-heal:** `docker rm -f traefik` → redeployed by the next `system-health` tick (idempotent no-op when healthy). -- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer` - keys = `[id]` only, 0600. ✓ -- **Pull+merge worked** — the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted) - carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the - hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain: - demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*` - (Cloudflare config); and **merged from the bootstrap**: `local_api.{endpoint: 192.168.0.162:8443, - fingerprint: 60b5974d…, token}`. **No `host_id`, no agent host key.** ✓ -- **Hub ONLINE at v0.40.0** — `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub - report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`, - `received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest — expected). ✓ -- **`local_api` survived the merge** — `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real), - 4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`. ✓ -- **8C invariant intact** — agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403** - `{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe, - durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched - (`/dev/sdb1 ext4 8G`, still mounted). 
✓ - -## What broke / what's missing -- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's - seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface - in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer - key present, hub report OK, catalog repo configured). The first captured line is a later async - local-api WARN — the early synchronous bootstrap log is being swallowed before docker attaches. - Worth a follow-up (flush/sequence the logger before MaybeIngest, or log the pull result post-startup). -- **Finding #1 still open (separate spec):** the local-API channel 401s until `systemctl restart - felhom-agent` after provisioning a live-daemon host (the running daemon didn't reload the freshly - minted token). Reproduced (startup WARN at 11:31:55); workaround applied. -- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a - captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with - `tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported - byte-exact (base64), not through the Windows shell. -- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is - correct. - -## Versions / artifacts -- Controller **v0.40.0** (CHANGELOG updated). Pushed to `main`: commit `6a594f9` (code) — this REPORT - in the follow-up commit. -- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden: - `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`. -- No secrets committed (passphrase, customer key, CF tokens, local-api token — all out-of-band/redacted). +## Notes / follow-ups +- The traefik **dashboard route** (`dynamic/dashboard.yml`) is deferred — it needs a generated htpasswd + basic-auth hash. Routing for filebrowser/controller works without it. 
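The single-flight, idempotent, non-fatal contract that the report describes for `EnsureBaseStack` can be sketched as below. This is a minimal illustration, not the shipped `internal/stacks/infra.go`: the `Manager` fields, the `deploy` helper, and the boolean return value are hypothetical stand-ins, and the network creation and template rendering steps are omitted. Only `sync.Mutex.TryLock` (Go 1.18+) is a real API here.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a hypothetical stand-in for stacks.Manager.
type Manager struct {
	mu          sync.Mutex      // single-flight guard
	running     map[string]bool // stand-in for "is this stack's container running?"
	tunnelToken string          // empty on a LAN-only node
	deployed    []string        // records deploys, for illustration only
}

// EnsureBaseStack brings up missing base stacks in order.
// It returns false when another run already holds the lock.
func (m *Manager) EnsureBaseStack() bool {
	// Single-flight: first boot and every health tick may call this
	// concurrently, so an overlapping call backs off instead of waiting.
	if !m.mu.TryLock() {
		return false
	}
	defer m.mu.Unlock()

	for _, name := range []string{"traefik", "cloudflared", "filebrowser"} {
		// cloudflared is only deployed when a tunnel token is configured.
		if name == "cloudflared" && m.tunnelToken == "" {
			continue
		}
		// Idempotent: a running stack is left alone.
		if m.running[name] {
			continue
		}
		// Non-fatal: log the failure and move on to the next stack.
		if err := m.deploy(name); err != nil {
			fmt.Printf("infra: deploy %s failed: %v\n", name, err)
		}
	}
	return true
}

// deploy is a hypothetical placeholder for rendering the stack files
// and running compose; here it only records the call.
func (m *Manager) deploy(name string) error {
	m.deployed = append(m.deployed, name)
	m.running[name] = true
	return nil
}

func main() {
	// LAN-only node with filebrowser already up: only traefik is deployed.
	m := &Manager{running: map[string]bool{"filebrowser": true}}
	m.EnsureBaseStack()
	fmt.Println(m.deployed)
}
```

Because a healthy node hits only the `continue` branches, calling this unconditionally on every health tick stays cheap, which is what makes the decoupled self-heal trigger safe.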
diff --git a/controller/README.md b/controller/README.md index ec54844..2c19b02 100644 --- a/controller/README.md +++ b/controller/README.md @@ -92,7 +92,8 @@ A single, lightweight Go container that replaces Portainer + scattered systemd s |--------|------|----------------| | **Config** | `internal/config/` | YAML loader, validation, `FELHOM_*` env overrides | | **Settings** | `internal/settings/` | Runtime-mutable `settings.json` (passwords, backup prefs, storage paths, notifications) | -| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow | +| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow; **base-infra bring-up** (`infra.go` — `EnsureBaseStack`) | +| **Infra** | `internal/infra/` | Pure renderers (embedded `text/template`) for the base-infra stacks (traefik/cloudflared/filebrowser); **pinned image tags as the single source of truth** (web filebrowser sync delegates here) | | **Crypto** | `internal/crypto/` | AES-256-GCM encryption for sensitive app.yaml values (passwords, secrets), key management | | **Sync** | `internal/sync/` | Git-based app catalog sync (clone/pull, content-hash copy) | | **AppBackup** | `internal/appbackup/` | Self-contained app-data backup primitives: DB dump discovery/execution (`DiscoverDatabases`, `DumpOne`), Docker-volume/app-data discovery (`StackDataProvider`, `DiscoverAppData`), keep-side path helpers (`AppDBDumpPath`, `AppVolumeDumpPath`, `AppDataDir`). No dependency on restic/cross-drive/drive-mount. Imported directly by `appexport` and `storage`. | @@ -195,6 +196,19 @@ The `/apps/{slug}` page renders hero section, screenshots, setup guide, and opti **Orphan detection**: Deployed stacks with no matching catalog template are marked as orphaned with an "Elavult" badge and can be safely deleted. 
+#### Base-infrastructure bring-up (`stacks/infra.go` + `internal/infra/`, v0.41.0) + +The controller stands up its own base stack — **traefik** (reverse proxy), **cloudflared** (external tunnel), **filebrowser** — instead of relying on the bare-metal `scripts/docker-setup.sh` (which a Proxmox-provisioned guest never runs). `internal/infra` renders the compose + config files from `controller.yaml` via embedded `text/template`s (lifted from `docker-setup.sh`); image tags are **pinned constants there** (`TraefikImage`/`CloudflaredImage`/`FileBrowserImage`) and the web FileBrowser sync path delegates to the same renderers, so the pinned versions can never diverge. + +`Manager.EnsureBaseStack()` creates the `traefik-public` network, then deploys traefik → cloudflared → filebrowser under `${stacks_dir}/`. It is: +- **single-flight** (a `TryLock` guard — it's called from both first boot and every health tick, so overlapping runs must not race on the same stack dir), +- **idempotent** (skips a stack whose container is already running; never overwrites an existing filebrowser compose, preserving the storage mounts `SyncFileBrowserMounts` manages), +- **non-fatal** (logs, never crashes the controller). + +cloudflared is only deployed when a tunnel token is configured. **Triggers**: a first-boot goroutine (after stack init) and an unconditional call on every `system-health` tick (self-heal — cheap when healthy thanks to the idempotency). `monitor.EffectiveProtected` mirrors the cloudflared condition so a LAN-only node (no tunnel token) doesn't report a perpetual "protected container not running" FAIL. + +> **Mount prerequisite (Section-G):** the controller writes these stacks under `/opt/docker/stacks` *inside its container*, but `docker compose up` runs on the **guest** Docker daemon. 
The golden's controller-bootstrap (`felhom-agent` `build-golden.sh`) therefore bind-mounts that path **same-path** (`-v /opt/docker/stacks:/opt/docker/stacks`) so the daemon resolves every relative bind source to the files the controller actually wrote. Without it, the daemon resolves those sources on the guest filesystem (empty dirs), and all bind-mounted stacks (base infra and customer apps) silently break. + #### Missing Field Injection (`deploy.go`) When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felhom.yml`), existing deployed apps need the new field in their `app.yaml`. The controller handles this automatically: