DIAGNOSTIC — base-infra bring-up: why fresh guests are health=FAIL

Scope: read-only diagnosis. No code changes, no deploys, no state changes on guest 9201 or any guest. Subject: demo guest 9201 (LXC), controller v0.40.0, online on the hub, Health = FAIL: protected containers not running — traefik, cloudflared, filebrowser. Date: 2026-06-11. Evidence is live repo file:line + live 9201 output (secrets redacted).


TL;DR — the one-line cause

Nothing deploys the base/protected stack on a Proxmox bootstrap. The traefik / cloudflared / filebrowser stacks were only ever created by the bare-metal scripts/docker-setup.sh (heredoc-generated compose files). The Proxmox golden→bootstrap path never runs that script, the controller has no first-boot/reconcile/self-heal deploy for the protected stacks, and the health loop only detects them missing. So on a provisioned guest there is no /opt/docker/stacks, no traefik-public network, and no infra containers — only felhom-controller itself runs, and health is permanently FAIL.

Live confirmation (9201):

pct status 9201            → running
docker ps                  → felhom-controller … Up (healthy)   [ONLY container]
docker network ls          → bridge, host, none                 [NO traefik-public]
ls /opt/docker/            → No such file or directory           [no stacks dir at all]
docker logs felhom-controller | grep health
   → [monitor] Health check: status=fail   (every cycle)
   → [stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)

A. Who is supposed to deploy the base stack, and why it never fires

A1 — Every compose-up / DeployStack caller. DeployStack has exactly one caller:

A repo-wide grep for EnsureBaseStack / deployProtected / BaseStack / first-boot reconcile returns nothing. There is no programmatic deploy of the protected stacks anywhere. runComposeDeploy (stacks/deploy.go:337) is reached only through DeployStack. The deploy.go:24-25 "base" marking referenced in the brief is about backup ordering (filebrowser/traefik are flagged so backup skips them), not about deploying them.

A2 — The Proxmox startup sequence, with the gap. controller/cmd/controller/main.go:56-711:

71   config.LoadPermissive
93   bootstrap.MaybeIngest(...)          ← comes up CONFIGURED (pulls yaml from hub, merges local_api)
101  setup.NeedsSetup(cfg) → false        (configured → skip wizard)
111  probeLocalAPI(...)                   (agent channel)
137  stacks.NewManager → 144 ScanStacks   (scans catalog; 0 deployed)
170  startQuiesceLoop / schedulers / web server

After MaybeIngest the controller is fully configured (has CF tokens, domain, protected list) but never deploys the protected stacks. There is no Ensure…/deploy step between "configured" and "serving". This is the missing trigger.

A3 — The health loop only reports, never self-heals. controller/internal/monitor/healthcheck.go:159-162:

missingProtected := checkProtectedContainers(cfg.Stacks.Protected)
for _, name := range missingProtected {
    report.Issues = append(report.Issues, fmt.Sprintf("Protected container not running: %s", name))
}

The result is status="fail" (healthcheck.go:177-181). The scheduler runs this check every 5 min (main.go:259) and pushes the FAIL to the hub. No code path attempts to start or deploy the missing containers. That is exactly the FAIL 9201 shows.
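
The detection step above amounts to a set difference between the configured protected list and the running container names. The sketch below illustrates that logic only; `missingProtected` is a hypothetical helper, not the controller's actual `checkProtectedContainers`, and the `running` slice stands in for whatever the Docker query returns.

```go
package main

import (
	"fmt"
	"sort"
)

// missingProtected returns every protected name that has no running
// container, sorted so repeated reports are stable. Hypothetical helper:
// it only illustrates the detection logic described above.
func missingProtected(protected, running []string) []string {
	up := make(map[string]bool, len(running))
	for _, name := range running {
		up[name] = true
	}
	var missing []string
	for _, name := range protected {
		if !up[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// stacks.protected as found in 9201's merged controller.yaml.
	protected := []string{"traefik", "cloudflared", "felhom-controller", "filebrowser"}
	// docker ps on 9201 shows only the controller itself.
	running := []string{"felhom-controller"}

	for _, name := range missingProtected(protected, running) {
		fmt.Printf("Protected container not running: %s\n", name)
	}
}
```

Because the function returns the missing set rather than just a boolean, the same result could feed a reconcile step instead of only a report line.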


B. Where the base-stack compose + config come from (decides what to bake)

B4 — On 9201: nothing. /opt/docker/ does not exist (so no /opt/docker/stacks, no traefik, no cloudflared, no filebrowser dirs). The controller runs from a Docker named volume (felhom-controller-data), config at /var/lib/docker/volumes/felhom-controller-data/_data/controller.yaml. The 52 "available" stacks are the catalog cache (git-synced app templates), none deployed.

B5 — Provenance of each infra app's compose + static config: generated by scripts/docker-setup.sh heredocs — the bare-metal installer. They are not in the controller image, not in the app-catalog (templates/ there = the 52 user apps), and not hub/asset-served.

  • install_traefik (scripts/docker-setup.sh:876): writes ${TRAEFIK_DIR}/docker-compose.yml (:1015), traefik.yml static config (:936), dynamic config, acme.json, certs. TRAEFIK_DIR=/opt/docker/traefik (:155 — not under stacks/).

  • install_cloudflared (scripts/docker-setup.sh:1067): writes ${CLOUDFLARED_DIR}/docker-compose.yml (:1083). CLOUDFLARED_DIR=/opt/docker/cloudflared (:157 — not under stacks/).

  • install_filebrowser (scripts/docker-setup.sh:1263): writes ${FILEBROWSER_DIR}/docker-compose.yml (:1295) + config.yaml. FILEBROWSER_DIR=/opt/docker/stacks/filebrowser (:156).

    Note the health check matches container names (traefik/cloudflared/filebrowser), so traefik+cloudflared living outside stacks/ is fine for detection.

Partial exception — filebrowser only. The controller can regenerate filebrowser's compose+config: generateFileBrowserCompose (web/handlers.go:1383) + generateFileBrowserConfig, driven by syncFileBrowserMounts (web/handlers.go:1295). But it refuses to create it the first time — it early-returns if the compose is absent:

// web/handlers.go:1304
if _, err := os.Stat(composePath); os.IsNotExist(err) {
    s.logger.Printf("[WARN] ... FileBrowser stack not found at %s — skipping mount sync"); return
}

There is no traefik or cloudflared generator in the controller at all (grep: zero hits for traefik.yml, cloudflared compose, or any EnsureInfra/deployInfra).

B6 — Offline-capable? On 9201 assets.source_url: https://felhom.eu, assets.sync_enabled: false — and assets are UI assets (logos), not infra compose. The controller needs no hub fetch to deploy compose in principle (config is already local post-MaybeIngest), but today there is simply no template to deploy on a provisioned guest. First-boot deploy becomes offline-possible only once the templates are baked/embedded and a generator exists.


C. Image provenance + bake feasibility (answers Viktor's question)

C7 — Image refs (from the docker-setup.sh heredocs):

| App | Image | Pinned? | Registry |
|---|---|---|---|
| traefik | traefik:v3.6.7 (:1019) | pinned | Docker Hub (public) |
| cloudflared | cloudflare/cloudflared:latest (:1094) | :latest | Docker Hub (public) |
| filebrowser | gtstef/filebrowser:latest (:1302, and controller's generator handlers.go:1397) | :latest | Docker Hub (public — note: gtstef/, not the official filebrowser/filebrowser) |

C8 — Registry pull at first boot? All three are public Docker Hub pulls → no gitea private-registry credential needed in the guest (good — none must ever be there). Without baking, first boot needs outbound Docker Hub access.

C9 — Can build-golden.sh bake them? Yes — same mechanism. It already bakes the controller image with a plain pull-into-the-golden's-Docker: felhom-agent/configs/build-golden.sh:71 docker pull "$CONTROLLER_IMAGE", then logs out/removes the cred (:72) so nothing is baked but the image. Adding three more docker pull lines for traefik/cloudflared/filebrowser bakes them identically — and these are public, so they don't even need the build-time docker login the controller image uses.

  • Blocker / must-fix: pin the two :latest tags before baking. A baked :latest drifts (the baked digest ≠ whatever :latest later resolves to), and any first-boot fallback pull would re-resolve :latest non-reproducibly. Pin to digests or explicit versions.
  • Baking the compose templates: feasible but not free — it requires porting docker-setup.sh's traefik/cloudflared heredoc generators (static traefik.yml, ACME/cert-resolver block, dynamic config, the cloudflared compose) into the controller as Go templates rendered from controller.yaml. The controller today has only the filebrowser generator. This is the real work item; the image bake is trivial by comparison.
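
The pinning requirement can be checked mechanically before a bake. The sketch below is one way to classify an image reference, under the assumption that "pinned" means either a digest or an explicit tag other than latest; `isPinned` is a hypothetical helper and not part of build-golden.sh or the controller.

```go
package main

import (
	"fmt"
	"strings"
)

// isPinned reports whether an image reference names an explicit version:
// a digest (name@sha256:...) or a tag other than "latest".
func isPinned(ref string) bool {
	if strings.Contains(ref, "@sha256:") {
		return true
	}
	// A tag separator is a ':' that appears after the last '/', so a
	// registry port such as host:5000/name is not mistaken for a tag.
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon <= slash {
		return false // no tag at all: Docker resolves this as :latest
	}
	tag := ref[colon+1:]
	return tag != "" && tag != "latest"
}

func main() {
	// The three image refs from the docker-setup.sh heredocs.
	for _, ref := range []string{
		"traefik:v3.6.7",
		"cloudflare/cloudflared:latest",
		"gtstef/filebrowser:latest",
	} {
		fmt.Printf("%-32s pinned=%v\n", ref, isPinned(ref))
	}
}
```

A check like this could gate the bake so an unpinned reference fails the golden build instead of drifting silently.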

C10 — Running-container bake (the hard line): No infra app is safe to bake as a running container. Each is per-customer-parameterized with secrets injected at run:

  • cloudflared run env TUNNEL_TOKEN=${CF_TUNNEL_TOKEN} (docker-setup.sh:1099) — per-customer tunnel token → must NOT be baked running.
  • traefik consumes the per-customer CF API token + ACME email + domain (see D) → must NOT be baked running.
  • filebrowser binds per-customer storage paths + domain.

Verdict: bake images (pinned): yes. Bake renderable templates: optional, after porting the generators. Bake running containers with secrets: no.


D. Customer-specific parameters first-boot must inject — all present on 9201

Confirmed in 9201's merged controller.yaml (values redacted):

| Infra app | Needs | Config key (present on 9201) | Compose env wiring |
|---|---|---|---|
| cloudflared | tunnel token | infrastructure.cf_tunnel_token: <REDACTED> | TUNNEL_TOKEN=${CF_TUNNEL_TOKEN} (docker-setup.sh:1099) |
| traefik | CF API token (DNS-01) | infrastructure.cf_api_token: <REDACTED> | CF_DNS_API_TOKEN → ACME dnsChallenge: provider: cloudflare (docker-setup.sh:899-916) |
| traefik | ACME email | customer.email: admin@felhom.eu | acme: email: ${ACME_EMAIL} (docker-setup.sh:907) |
| traefik / all | base domain | customer.domain: demo-felhom.eu | Host(`traefik.${BASE_DOMAIN}`), websecure routing |
| filebrowser | storage paths + domain | settings.GetStoragePaths() + customer.domain | volume mounts + files.${domain} (handlers.go:1310-1349) |

Every customer parameter the base stack needs is already in the local config after MaybeIngest. Nothing additional must be fetched to render them.


E. Hostname / CT-name (diagnose now, fix later)

E12 — Reported hostname is the Docker container ID. controller/internal/report/builder.go:75 Hostname: staticInfo.Hostname ← os.Hostname(). The controller runs inside Docker, and the golden bootstrap docker run sets no --hostname (felhom-agent/configs/build-golden.sh:94) → os.Hostname() returns the container ID.

  • Live: docker inspect felhom-controller --format '{{.Config.Hostname}}' → 3dff0fe73b5c (the value reported to the hub).
  • Insertion point: the bootstrap unit's docker run (build-golden.sh:94). It already reads /etc/felhom-bootstrap/bootstrap.json; add --hostname <customer-id> parsed from that file. The id is present — bootstrap.json carries customer.id (the pull target), per controller/internal/bootstrap/bootstrap.go:66-68 (BootstrapCustomer.ID). Feasible with a small grep/jq in the baked felhom-controller-bootstrap.sh heredoc.

E13 — Proxmox CT/LXC hostname is felhom-golden. The golden is created --hostname felhom-golden (build-golden.sh:38); /etc/hostname is removed at minimize (:146) but the PVE container-config hostname is not reset on restore, so the guest inherits felhom-golden.

  • Live: grep hostname /etc/pve/lxc/9201.conf → hostname: felhom-golden; pct exec 9201 -- hostname → felhom-golden.
  • The mechanism to fix it already exists in the agent: felhom-agent/internal/reconcile/bringup.go:303-304 sets params["hostname"] = spec.Hostname (via SetConfig / pct set) when Mode==ModeProvision && Hostname!="". The provision path passes Hostname: a.hostname (felhom-agent/cmd/felhom-agent/main.go:1041) from a -hostname flag.
  • Why 9201 still shows felhom-golden: it was provisioned without a -hostname value → spec.Hostname=="" → the SetConfig hostname step is skipped → the golden's name persists. Fix = wire the provision back-half to pass Hostname=<customer-id> (sanitized) into BringUpSpec. No new mechanism needed.

These are two independent layers: E13 fixes the Proxmox CT name + LXC hostname; E12 fixes what the controller reports to the hub (the Docker container's os.Hostname()). Fixing only one leaves the other wrong.


Recommendation: option (a) — the controller deploys its own base stack on first configured boot, and self-heals it when missing.

Place an EnsureBaseInfra() step in cmd/controller/main.go after stackMgr.ScanStacks() (line ~144) and Docker is confirmed reachable, and additionally invoke it from the 5-min system-health job when checkProtectedContainers reports any protected container missing (turn healthcheck.go's detection into a reconcile trigger).

Why (a):

  • The full config (CF tunnel token, CF API token, domain, email, storage paths) is already local after MaybeIngest (Section D) — no secret needs to enter the golden.
  • The controller already owns stack deployment (stacks.Manager, docker compose via the mounted socket) and already has the filebrowser generator — extend the same pattern to traefik/cloudflared.
  • The health loop already detects the missing protected set; making it reconcile is the natural, idempotent, self-healing design (survives a wiped/half-deployed guest).
  • Keeps customer secrets out of the golden and out of the agent's bootstrap payload.

Why not the others:

  • (b) golden bootstrap-unit step → would have to render per-customer traefik/cloudflared config in shell and risks putting/handling secrets in the unit; duplicates logic the controller is better placed to own.
  • (c) headless reuse of the setup wizard's deploy path → the wizard never deployed the base stack either (it only writes controller.yaml, setup/handlers.go:398-514); there is no deploy path to reuse.

Prerequisites / ordering constraints for (a):

  1. Port the traefik + cloudflared compose/config generators into the controller (Go templates from controller.yaml). This is the main build item; filebrowser's generator already exists but must drop its "skip if absent" early-return (web/handlers.go:1304) so it can create on first boot.
  2. Bake the three infra images (pinned) into the golden (build-golden.sh) so first-boot deploy is offline-capable; pin the two :latest tags.
  3. Create the traefik-public docker network + the stack dirs as part of bring-up (absent on 9201 today).
  4. Run only when configured (post-MaybeIngest, NeedsSetup==false) and after Docker is reachable; make it idempotent (no-op when the protected containers are already up).
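
The four prerequisites combine into a small idempotent reconcile routine. The sketch below shows only the control flow, under stated assumptions: the `Runtime` interface and all of its method names are hypothetical seams over Docker, not existing controller APIs, and `fakeRuntime` is an in-memory stand-in so the flow can run without a daemon.

```go
package main

import "fmt"

// Runtime is a hypothetical seam over Docker and the stack manager.
type Runtime interface {
	Running(name string) (bool, error)
	NetworkExists(name string) (bool, error)
	CreateNetwork(name string) error
	Deploy(stack string) error
}

// EnsureBaseInfra deploys whichever base stacks are not running and
// returns the names it deployed. It does nothing when the controller is
// not configured or when every base container is already up.
func EnsureBaseInfra(rt Runtime, configured bool, base []string) ([]string, error) {
	if !configured {
		return nil, nil // setup not finished: nothing to render from
	}
	var missing []string
	for _, name := range base {
		up, err := rt.Running(name)
		if err != nil {
			return nil, fmt.Errorf("check %s: %w", name, err)
		}
		if !up {
			missing = append(missing, name)
		}
	}
	if len(missing) == 0 {
		return nil, nil // idempotent no-op
	}
	exists, err := rt.NetworkExists("traefik-public")
	if err != nil {
		return nil, fmt.Errorf("check network: %w", err)
	}
	if !exists {
		if err := rt.CreateNetwork("traefik-public"); err != nil {
			return nil, fmt.Errorf("create network: %w", err)
		}
	}
	for _, name := range missing {
		if err := rt.Deploy(name); err != nil {
			return nil, fmt.Errorf("deploy %s: %w", name, err)
		}
	}
	return missing, nil
}

// fakeRuntime is an in-memory Runtime used only for this demonstration.
type fakeRuntime struct {
	running  map[string]bool
	networks map[string]bool
}

func (f *fakeRuntime) Running(name string) (bool, error)       { return f.running[name], nil }
func (f *fakeRuntime) NetworkExists(name string) (bool, error) { return f.networks[name], nil }
func (f *fakeRuntime) CreateNetwork(name string) error         { f.networks[name] = true; return nil }
func (f *fakeRuntime) Deploy(stack string) error               { f.running[stack] = true; return nil }

func main() {
	// Starting state mirrors 9201: only the controller runs, no networks.
	rt := &fakeRuntime{
		running:  map[string]bool{"felhom-controller": true},
		networks: map[string]bool{},
	}
	base := []string{"traefik", "cloudflared", "filebrowser"}

	first, err := EnsureBaseInfra(rt, true, base)
	fmt.Println("first run deployed:", first, "err:", err)
	second, err := EnsureBaseInfra(rt, true, base)
	fmt.Println("second run deployed:", second, "err:", err)
}
```

The base list deliberately omits felhom-controller, which is in stacks.protected but cannot deploy itself. The same function could be called once after ScanStacks and again from the health job, since the second call costs only the running checks.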

G. Additional gap surfaced (flag — needs validation before the spec)

The bootstrap docker run does not bind-mount the stacks dir or /opt/docker from the LXC host. It mounts only (build-golden.sh:94-99):

-v /etc/felhom-bootstrap:/etc/felhom-bootstrap:ro
-v felhom-controller-data:/opt/docker/felhom-controller     (named volume)
-v /var/run/docker.sock:/var/run/docker.sock

So paths.stacks_dir = /opt/docker/stacks exists only inside the controller container, while docker compose up (invoked by the controller over the shared socket) is executed by the host LXC's Docker daemon. Compose files are read by the in-container CLI, but bind-mount sources in those compose files (e.g. traefik's ./traefik.yml:/etc/traefik/..., filebrowser's ./config.yaml, app HDD_PATH mounts) are resolved by the daemon on the host filesystem, where /opt/docker/stacks/... does not exist. On bare metal this worked because /opt/docker/stacks was a shared host bind-mount into the controller.

This is a path-namespace mismatch that affects ALL stack deploys (every catalog app, not just base infra), so it sits squarely in the blast radius of "stand up the base stack." It is inferred from the mount topology + how the controller shells docker compose with cmd.Dir=stackDir; it was not live-exercised here (no deploy attempted, per the read-only rule). Recommend the bring-up spec validate this explicitly and, if confirmed, add a host bind-mount (e.g. -v /opt/docker/stacks:/opt/docker/stacks) to the bootstrap docker run so container and daemon agree on the path.
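
One way to validate this without deploying anything is to inspect the controller container's mounts and ask whether a given stack path is backed by a bind mount whose host source equals its container destination. The sketch below assumes the mount list has already been read (for example from the Mounts section of docker inspect, whose entries carry Type, Source and Destination); `Mount` and `hostVisible` are hypothetical names, and the volume source path is elided as in the mount listing above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Mount mirrors the fields of one docker inspect mount entry that matter here.
type Mount struct {
	Type        string // "bind" or "volume"
	Source      string // host path for bind mounts
	Destination string // path inside the controller container
}

// hostVisible reports whether p, a path inside the controller container,
// lies under a bind mount whose host source equals its destination, so the
// daemon resolves the same path on the host.
func hostVisible(p string, mounts []Mount) bool {
	p = filepath.Clean(p)
	for _, m := range mounts {
		if m.Type != "bind" {
			continue // named volumes live elsewhere on the host
		}
		dst := filepath.Clean(m.Destination)
		if filepath.Clean(m.Source) != dst {
			continue // bind-mounted, but under a different host path
		}
		if p == dst || strings.HasPrefix(p, dst+"/") {
			return true
		}
	}
	return false
}

func main() {
	// The three mounts from the bootstrap docker run (build-golden.sh:94-99).
	current := []Mount{
		{Type: "bind", Source: "/etc/felhom-bootstrap", Destination: "/etc/felhom-bootstrap"},
		{Type: "volume", Source: "(volume path elided)", Destination: "/opt/docker/felhom-controller"},
		{Type: "bind", Source: "/var/run/docker.sock", Destination: "/var/run/docker.sock"},
	}
	stack := "/opt/docker/stacks/traefik"
	fmt.Println("today:", hostVisible(stack, current))

	// With the proposed -v /opt/docker/stacks:/opt/docker/stacks added.
	fixed := append(current, Mount{Type: "bind", Source: "/opt/docker/stacks", Destination: "/opt/docker/stacks"})
	fmt.Println("with bind mount:", hostVisible(stack, fixed))
}
```

A check of this kind stays within the read-only rule because it only reads container metadata. It does not replace exercising a real deploy in the bring-up spec.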


Evidence index (live repo file:line)

Live 9201 output (secrets redacted)

  • pct status 9201 → running; docker ps → only felhom-controller … Up (healthy).
  • docker network ls → bridge / host / none (no traefik-public).
  • ls /opt/docker/ → No such file or directory.
  • docker inspect felhom-controller {{.Config.Hostname}} → 3dff0fe73b5c.
  • pct exec 9201 -- hostname → felhom-golden; /etc/pve/lxc/9201.conf → hostname: felhom-golden.
  • docker logs felhom-controller → repeating [monitor] Health check: status=fail; [stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available).
  • Merged controller.yaml keys present: infrastructure.cf_tunnel_token, infrastructure.cf_api_token, customer.domain=demo-felhom.eu, customer.email=admin@felhom.eu, stacks.protected=[traefik,cloudflared,felhom-controller,filebrowser], assets.sync_enabled=false, paths.stacks_dir=/opt/docker/stacks.