diff --git a/DIAGNOSTIC.md b/DIAGNOSTIC.md
new file mode 100644
index 0000000..8062b3f
--- /dev/null
+++ b/DIAGNOSTIC.md
@@ -0,0 +1,192 @@
+# DIAGNOSTIC: base-infra bring-up, why fresh guests are health=FAIL
+
+**Scope:** read-only diagnosis. No code changes, no deploys, no state changes on guest 9201 or any guest.
+**Subject:** demo guest **9201** (LXC), controller **v0.40.0**, online on the hub, Health = **FAIL: protected containers not running (traefik, cloudflared, filebrowser)**.
+**Date:** 2026-06-11. Evidence is live repo `file:line` + live 9201 output (secrets redacted).
+
+---
+
+## TL;DR: the one-line cause
+
+**Nothing deploys the base/protected stack on a Proxmox bootstrap.** The traefik / cloudflared / filebrowser stacks were *only ever* created by the **bare-metal `scripts/docker-setup.sh`** (heredoc-generated compose files). The Proxmox golden→bootstrap path never runs that script, the controller has **no first-boot/reconcile/self-heal deploy** for the protected stacks, and the health loop only **detects** them missing. So on a provisioned guest there is no `/opt/docker/stacks`, no `traefik-public` network, and no infra containers. Only `felhom-controller` itself runs, and health is permanently FAIL.
+
+**Live confirmation (9201):**
+```
+pct status 9201    → running
+docker ps          → felhom-controller … Up (healthy)   [ONLY container]
+docker network ls  → bridge, host, none                 [NO traefik-public]
+ls /opt/docker/    → No such file or directory          [no stacks dir at all]
+docker logs felhom-controller | grep health
+  → [monitor] Health check: status=fail   (every cycle)
+  → [stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)
+```
+
+---
+
+## A. Who is *supposed* to deploy the base stack, and why it never fires
+
+**A1: Every compose-up / DeployStack caller.** `DeployStack` has exactly **one** caller:
+- [controller/internal/api/router.go:350](controller/internal/api/router.go#L350): `r.stackMgr.DeployStack(deployReq)`, driven by the UI deploy form (`POST /api/stacks/{name}/deploy`).
+
+A repo-wide grep for `EnsureBaseStack` / `deployProtected` / `BaseStack` / first-boot reconcile returns **nothing**. There is no programmatic deploy of the protected stacks anywhere. `runComposeDeploy` ([stacks/deploy.go:337](controller/internal/stacks/deploy.go#L337)) is reached only through `DeployStack`. The `deploy.go:24-25` "base" marking referenced in the brief is about *backup ordering* (filebrowser/traefik are flagged so backup skips them), not about deploying them.
+
+**A2: The Proxmox startup sequence, with the gap.** [controller/cmd/controller/main.go:56-711](controller/cmd/controller/main.go#L56):
+```
+71   config.LoadPermissive
+93   bootstrap.MaybeIngest(...)   ← comes up CONFIGURED (pulls yaml from hub, merges local_api)
+101  setup.NeedsSetup(cfg) → false   (configured → skip wizard)
+111  probeLocalAPI(...)   (agent channel)
+137  stacks.NewManager → 144 ScanStacks   (scans catalog; 0 deployed)
+170  startQuiesceLoop / schedulers / web server
+```
+After `MaybeIngest` the controller is fully configured (has CF tokens, domain, protected list) **but never deploys the protected stacks**. There is no `Ensure…`/`deploy` step between "configured" and "serving". This is the missing trigger.
+
+**A3: The health loop only reports, never self-heals.** [controller/internal/monitor/healthcheck.go:159-162](controller/internal/monitor/healthcheck.go#L159):
+```go
+missingProtected := checkProtectedContainers(cfg.Stacks.Protected)
+for _, name := range missingProtected {
+    report.Issues = append(report.Issues, fmt.Sprintf("Protected container not running: %s", name))
+}
+```
+→ `status="fail"` (healthcheck.go:177-181).
+The scheduler runs this every 5 min ([main.go:259](controller/cmd/controller/main.go#L259)) and pushes the FAIL to the hub. **No code path attempts to start or deploy the missing containers.** That is exactly the FAIL 9201 shows.
+
+---
+
+## B. Where the base-stack compose + config come from (decides what to bake)
+
+**B4: On 9201, nothing.** `/opt/docker/` does not exist (so no `/opt/docker/stacks`, no `traefik`, no `cloudflared`, no `filebrowser` dirs). The controller runs from a Docker **named volume** (`felhom-controller-data`), config at `/var/lib/docker/volumes/felhom-controller-data/_data/controller.yaml`. The 52 "available" stacks are the **catalog cache** (git-synced app templates), none deployed.
+
+**B5: Provenance of each infra app's compose + static config:** generated by **`scripts/docker-setup.sh`** heredocs, the **bare-metal** installer. They are **not** in the controller image, **not** in the app-catalog (`templates/` there = the 52 user apps), and **not** hub/asset-served.
+- `install_traefik`: [scripts/docker-setup.sh:876](scripts/docker-setup.sh#L876); writes `${TRAEFIK_DIR}/docker-compose.yml` (:1015), `traefik.yml` static config (:936), dynamic config, `acme.json`, certs. `TRAEFIK_DIR=/opt/docker/traefik` (:155, **not** under `stacks/`).
+- `install_cloudflared`: [scripts/docker-setup.sh:1067](scripts/docker-setup.sh#L1067); writes `${CLOUDFLARED_DIR}/docker-compose.yml` (:1083). `CLOUDFLARED_DIR=/opt/docker/cloudflared` (:157, **not** under `stacks/`).
+- `install_filebrowser`: [scripts/docker-setup.sh:1263](scripts/docker-setup.sh#L1263); writes `${FILEBROWSER_DIR}/docker-compose.yml` (:1295) + `config.yaml`. `FILEBROWSER_DIR=/opt/docker/stacks/filebrowser` (:156).
+
+  Note the health check matches **container names** (`traefik`/`cloudflared`/`filebrowser`), so traefik+cloudflared living outside `stacks/` is fine for detection.
+
+**Partial exception: filebrowser only.** The controller *can* regenerate filebrowser's compose+config: `generateFileBrowserCompose` ([web/handlers.go:1383](controller/internal/web/handlers.go#L1383)) + `generateFileBrowserConfig`, driven by `syncFileBrowserMounts` ([web/handlers.go:1295](controller/internal/web/handlers.go#L1295)). **But it refuses to create it the first time**: it early-returns if the compose is absent:
+```go
+// web/handlers.go:1304
+if _, err := os.Stat(composePath); os.IsNotExist(err) {
+    s.logger.Printf("[WARN] ... FileBrowser stack not found at %s — skipping mount sync", composePath)
+    return
+}
+```
+There is **no traefik or cloudflared generator in the controller at all** (grep: zero hits for `traefik.yml`, cloudflared compose, or any `EnsureInfra`/`deployInfra`).
+
+**B6: Offline-capable?** On 9201 `assets.source_url: https://felhom.eu`, **`assets.sync_enabled: false`**, and assets are UI assets (logos), *not* infra compose. The controller needs **no hub fetch to deploy compose** *in principle* (config is already local post-`MaybeIngest`), but today there is simply **no template to deploy** on a provisioned guest. First-boot deploy becomes offline-possible only once the templates are baked/embedded **and** a generator exists.
+
+---
+
+## C. Image provenance + bake feasibility (answers Viktor's question)
+
+**C7: Image refs (from the docker-setup.sh heredocs):**
+
+| App | Image | Pinned? | Registry |
+|---|---|---|---|
+| traefik | `traefik:v3.6.7` (:1019) | ✅ pinned | Docker Hub (public) |
+| cloudflared | `cloudflare/cloudflared:latest` (:1094) | ❌ `:latest` | Docker Hub (public) |
+| filebrowser | `gtstef/filebrowser:latest` (:1302, and controller's generator handlers.go:1397) | ❌ `:latest` | Docker Hub (public; note: `gtstef/`, **not** the official `filebrowser/filebrowser`) |
+
+**C8: Registry pull at first boot?** All three are **public Docker Hub** pulls → **no gitea private-registry credential needed** in the guest (good: none must ever be there). Without baking, first boot needs outbound Docker Hub access.
+
+**C9: Can `build-golden.sh` bake them?** **Yes, by the same mechanism.** It already bakes the controller image with a plain pull-into-the-golden's-Docker: [felhom-agent/configs/build-golden.sh:71](../felhom-agent/configs/build-golden.sh#L71) `docker pull "$CONTROLLER_IMAGE"`, then logs out/removes the cred (:72) so nothing is baked but the image. Adding three more `docker pull` lines for traefik/cloudflared/filebrowser bakes them identically, and these are **public**, so they don't even need the build-time `docker login` the controller image uses.
+- **Blocker / must-fix:** pin the two `:latest` tags before baking. A baked `:latest` drifts (the baked digest ≠ whatever `:latest` later resolves to), and any first-boot fallback pull would re-resolve `:latest` non-reproducibly. Pin to digests or explicit versions.
+- **Baking the compose templates:** feasible but **not free**. It requires porting docker-setup.sh's traefik/cloudflared heredoc generators (static `traefik.yml`, ACME/cert-resolver block, dynamic config, the cloudflared compose) into the controller as Go templates rendered from `controller.yaml`. The controller today has only the **filebrowser** generator. This is the real work item; the image bake is trivial by comparison.
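
The pinning blocker in C9 lends itself to a mechanical pre-bake check. The following is a rough sketch only: `isPinned` is a hypothetical helper that does not exist in the repo, and it uses a simplified reading of image references (digest, or a tag after the last path segment) rather than a full reference parser.

```go
package main

import (
	"fmt"
	"strings"
)

// isPinned reports whether an image reference carries a digest or an explicit
// tag other than "latest". Hypothetical helper with simplified parsing.
func isPinned(ref string) bool {
	if strings.Contains(ref, "@sha256:") {
		return true // digest-pinned
	}
	// Only a colon after the last slash separates a tag; an earlier colon
	// belongs to a registry host:port.
	name := ref[strings.LastIndex(ref, "/")+1:]
	i := strings.LastIndex(name, ":")
	if i < 0 {
		return false // no tag at all: Docker resolves this to :latest
	}
	tag := name[i+1:]
	return tag != "" && tag != "latest"
}

func main() {
	// Image refs from the C7 table.
	refs := []string{
		"traefik:v3.6.7",
		"cloudflare/cloudflared:latest",
		"gtstef/filebrowser:latest",
	}
	for _, ref := range refs {
		fmt.Printf("%-32s pinned=%v\n", ref, isPinned(ref))
	}
}
```

A check of this kind could gate the golden build so that an unpinned ref fails fast instead of being baked; whether that gate belongs in `build-golden.sh` or in CI is left open here.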
+
+**C10: Running-container bake (the hard line):** **No infra app is safe to bake as a *running* container.** Each is per-customer-parameterized with secrets injected at run:
+- cloudflared run env `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` ([docker-setup.sh:1099](scripts/docker-setup.sh#L1099)): per-customer tunnel token → **must NOT** be baked running.
+- traefik consumes the per-customer CF API token + ACME email + domain (see D) → **must NOT** be baked running.
+- filebrowser binds per-customer storage paths + domain.
+
+**Verdict:** bake **images** (pinned) ✅, optionally bake **renderable templates** ✅ (after porting the generators), bake **running containers with secrets** ❌.
+
+---
+
+## D. Customer-specific parameters first-boot must inject (all present on 9201)
+
+Confirmed in 9201's merged `controller.yaml` (values redacted):
+
+| Infra app | Needs | Config key (present on 9201) | Compose env wiring |
+|---|---|---|---|
+| cloudflared | tunnel token | `infrastructure.cf_tunnel_token: ` ✅ | `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` (docker-setup.sh:1099) |
+| traefik | CF API token (DNS-01) | `infrastructure.cf_api_token: ` ✅ | `CF_DNS_API_TOKEN` → ACME `dnsChallenge: provider: cloudflare` (docker-setup.sh:899-916) |
+| traefik | ACME email | `customer.email: admin@felhom.eu` ✅ | `acme: email: ${ACME_EMAIL}` (docker-setup.sh:907) |
+| traefik / all | base domain | `customer.domain: demo-felhom.eu` ✅ | `Host(\`traefik.${BASE_DOMAIN}\`)`, websecure routing |
+| filebrowser | storage paths + domain | `settings.GetStoragePaths()` + `customer.domain` ✅ | volume mounts + `files.${domain}` (handlers.go:1310-1349) |
+
+Every customer parameter the base stack needs is **already in the local config** after `MaybeIngest`. Nothing additional must be fetched to render them.
+
+---
+
+## E. Hostname / CT-name (diagnose now, fix later)
+
+**E12: Reported hostname is the Docker container ID.** [controller/internal/report/builder.go:75](controller/internal/report/builder.go#L75) `Hostname: staticInfo.Hostname` ← `os.Hostname()`. The controller runs inside Docker, and the golden bootstrap `docker run` sets **no `--hostname`** ([felhom-agent/configs/build-golden.sh:94](../felhom-agent/configs/build-golden.sh#L94)) → `os.Hostname()` returns the container ID.
+- Live: `docker inspect felhom-controller --format '{{.Config.Hostname}}'` → **`3dff0fe73b5c`** (the value reported to the hub).
+- **Insertion point:** the bootstrap unit's `docker run` (build-golden.sh:94). It already reads `/etc/felhom-bootstrap/bootstrap.json`; add `--hostname ` parsed from that file. The id is present: `bootstrap.json` carries `customer.id` (the pull target), per [controller/internal/bootstrap/bootstrap.go:66-68](controller/internal/bootstrap/bootstrap.go#L66) (`BootstrapCustomer.ID`). Feasible with a small `grep`/`jq` in the baked `felhom-controller-bootstrap.sh` heredoc.
+
+**E13: Proxmox CT/LXC hostname is `felhom-golden`.** The golden is created `--hostname felhom-golden` ([build-golden.sh:38](../felhom-agent/configs/build-golden.sh#L38)); `/etc/hostname` is removed at minimize (:146) but the **PVE container-config hostname is not reset on restore**, so the guest inherits `felhom-golden`.
+- Live: `grep hostname /etc/pve/lxc/9201.conf` → **`hostname: felhom-golden`**; `pct exec 9201 -- hostname` → **`felhom-golden`**.
+- **The mechanism to fix it already exists in the agent:** [felhom-agent/internal/reconcile/bringup.go:303-304](../felhom-agent/internal/reconcile/bringup.go#L303) sets `params["hostname"] = spec.Hostname` (via `SetConfig` / `pct set`) when `Mode==ModeProvision && Hostname!=""`. The provision path passes `Hostname: a.hostname` ([felhom-agent/cmd/felhom-agent/main.go:1041](../felhom-agent/cmd/felhom-agent/main.go#L1041)) from a `-hostname` flag.
+- **Why 9201 still shows `felhom-golden`:** it was provisioned **without** a `-hostname` value → `spec.Hostname==""` → the `SetConfig` hostname step is skipped → the golden's name persists. **Fix = wire the provision back-half to pass `Hostname=` (sanitized) into `BringUpSpec`.** No new mechanism needed.
+
+> These are two **independent** layers: E13 fixes the Proxmox CT name + LXC hostname; E12 fixes what the *controller* reports to the hub (the Docker container's `os.Hostname()`). Fixing only one leaves the other wrong.
+
+---
+
+## F. Recommended insertion point for first-boot base-stack bring-up
+
+**Recommendation: option (a), in which the controller deploys its own base stack on first configured boot and self-heals it when missing.**
+
+Place an `EnsureBaseInfra()` step in [cmd/controller/main.go](controller/cmd/controller/main.go) **after** `stackMgr.ScanStacks()` (line ~144), once Docker is confirmed reachable. Additionally invoke it from the 5-min `system-health` job whenever `checkProtectedContainers` reports a protected container missing, which turns healthcheck.go's detection into a reconcile trigger.
+
+**Why (a):**
+- The full config (CF tunnel token, CF API token, domain, email, storage paths) is **already local** after `MaybeIngest` (Section D), so no secret needs to enter the golden.
+- The controller already **owns stack deployment** (`stacks.Manager`, `docker compose` via the mounted socket) and already has the **filebrowser generator**; the same pattern extends to traefik/cloudflared.
+- The health loop already **detects** the missing protected set; making it reconcile is the natural, idempotent, self-healing design (survives a wiped/half-deployed guest).
+- Keeps customer secrets out of the golden and out of the agent's bootstrap payload.
+
+**Why not the others:**
+- (b) golden bootstrap-unit step → would have to render per-customer traefik/cloudflared config in shell and risks putting/handling secrets in the unit; duplicates logic the controller is better placed to own.
+- (c) headless reuse of the setup wizard's deploy path → the wizard **never deployed** the base stack either (it only writes `controller.yaml`, [setup/handlers.go:398-514](controller/internal/setup/handlers.go#L398)); there is no deploy path to reuse.
+
+**Prerequisites / ordering constraints for (a):**
+1. **Port the traefik + cloudflared compose/config generators into the controller** (Go templates from `controller.yaml`). This is the main build item; filebrowser's generator already exists but must **drop its "skip if absent" early-return** ([web/handlers.go:1304](controller/internal/web/handlers.go#L1304)) so it can create on first boot.
+2. **Bake the three infra images (pinned) into the golden** (build-golden.sh) so first-boot deploy is offline-capable; pin the two `:latest` tags.
+3. **Create the `traefik-public` docker network** + the stack dirs as part of bring-up (absent on 9201 today).
+4. Run only when configured (post-`MaybeIngest`, `NeedsSetup==false`) and after Docker is reachable; make it idempotent (no-op when the protected containers are already up).
+
+---
+
+## G. Additional gap surfaced (flag: needs validation before the spec)
+
+**The bootstrap `docker run` does not bind-mount the stacks dir or `/opt/docker` from the LXC host.** It mounts only ([build-golden.sh:94-99](../felhom-agent/configs/build-golden.sh#L94)):
+```
+-v /etc/felhom-bootstrap:/etc/felhom-bootstrap:ro
+-v felhom-controller-data:/opt/docker/felhom-controller   (named volume)
+-v /var/run/docker.sock:/var/run/docker.sock
+```
+So `paths.stacks_dir = /opt/docker/stacks` exists **only inside the controller container**, while `docker compose up` (invoked by the controller over the shared socket) is executed by the **host LXC's** Docker daemon.
+Compose files are read by the in-container CLI, but **bind-mount sources** in those compose files (e.g. traefik's `./traefik.yml:/etc/traefik/...`, filebrowser's `./config.yaml`, app `HDD_PATH` mounts) are resolved by the **daemon on the host filesystem**, where `/opt/docker/stacks/...` does **not** exist. On bare metal this worked because `/opt/docker/stacks` was a shared host bind-mount into the controller.
+
+This is a **path-namespace mismatch that affects ALL stack deploys** (every catalog app, not just base infra), so it sits squarely in the blast radius of "stand up the base stack." It is inferred from the mount topology + how the controller shells `docker compose` with `cmd.Dir=stackDir`; it was **not** live-exercised here (no deploy attempted, per the read-only rule). **Recommend the bring-up spec validate this explicitly** and, if confirmed, add a host bind-mount (e.g. `-v /opt/docker/stacks:/opt/docker/stacks`) to the bootstrap `docker run` so container and daemon agree on the path.
+
+---
+
+## Evidence index (live repo file:line)
+
+- No base-stack deploy caller: [api/router.go:350](controller/internal/api/router.go#L350) is the sole `DeployStack` caller; startup [cmd/controller/main.go:56-711](controller/cmd/controller/main.go#L56).
+- Detect-only health: [monitor/healthcheck.go:159-181](controller/internal/monitor/healthcheck.go#L159).
+- Infra compose source (bare-metal only): [scripts/docker-setup.sh:876](scripts/docker-setup.sh#L876) / [:1067](scripts/docker-setup.sh#L1067) / [:1263](scripts/docker-setup.sh#L1263).
+- Filebrowser generator + "skip if absent": [web/handlers.go:1295-1383](controller/internal/web/handlers.go#L1295).
+- Protected list written to yaml, no deploy: [setup/handlers.go:475-481](controller/internal/setup/handlers.go#L475).
+- Bootstrap pull/merge (configured-on-first-boot): [bootstrap/bootstrap.go:100-162](controller/internal/bootstrap/bootstrap.go#L100); customer.id field [:66](controller/internal/bootstrap/bootstrap.go#L66).
+- Reported hostname = os.Hostname: [report/builder.go:75](controller/internal/report/builder.go#L75).
+- Golden bake + bootstrap `docker run` (no `--hostname`, mounts): [felhom-agent/configs/build-golden.sh:38,71,94-99,146](../felhom-agent/configs/build-golden.sh#L94).
+- Agent hostname-set mechanism: [felhom-agent/internal/reconcile/bringup.go:303-304](../felhom-agent/internal/reconcile/bringup.go#L303); provision wiring [felhom-agent/cmd/felhom-agent/main.go:1039-1041](../felhom-agent/cmd/felhom-agent/main.go#L1039).
+
+### Live 9201 output (secrets redacted)
+- `pct status 9201` → running; `docker ps` → only `felhom-controller … Up (healthy)`.
+- `docker network ls` → `bridge / host / none` (no `traefik-public`).
+- `ls /opt/docker/` → `No such file or directory`.
+- `docker inspect felhom-controller --format '{{.Config.Hostname}}'` → `3dff0fe73b5c`.
+- `pct exec 9201 -- hostname` → `felhom-golden`; `/etc/pve/lxc/9201.conf` → `hostname: felhom-golden`.
+- `docker logs felhom-controller` → repeating `[monitor] Health check: status=fail`; `[stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)`.
+- Merged `controller.yaml` keys present: `infrastructure.cf_tunnel_token`, `infrastructure.cf_api_token`, `customer.domain=demo-felhom.eu`, `customer.email=admin@felhom.eu`, `stacks.protected=[traefik,cloudflared,felhom-controller,filebrowser]`, `assets.sync_enabled=false`, `paths.stacks_dir=/opt/docker/stacks`.