DIAGNOSTIC — base-infra bring-up: why fresh guests are health=FAIL
Scope: read-only diagnosis. No code changes, no deploys, no state changes on guest 9201 or any guest.
Subject: demo guest 9201 (LXC), controller v0.40.0, online on the hub, Health = FAIL: protected containers not running — traefik, cloudflared, filebrowser.
Date: 2026-06-11. Evidence is live repo file:line + live 9201 output (secrets redacted).
TL;DR — the one-line cause
Nothing deploys the base/protected stack on a Proxmox bootstrap. The traefik / cloudflared / filebrowser stacks were only ever created by the bare-metal scripts/docker-setup.sh (heredoc-generated compose files). The Proxmox golden→bootstrap path never runs that script, the controller has no first-boot/reconcile/self-heal deploy for the protected stacks, and the health loop only detects them missing. So on a provisioned guest there is no /opt/docker/stacks, no traefik-public network, and no infra containers — only felhom-controller itself runs, and health is permanently FAIL.
Live confirmation (9201):
pct status 9201 → running
docker ps → felhom-controller … Up (healthy) [ONLY container]
docker network ls → bridge, host, none [NO traefik-public]
ls /opt/docker/ → No such file or directory [no stacks dir at all]
docker logs felhom-controller | grep health
→ [monitor] Health check: status=fail (every cycle)
→ [stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)
A. Who is supposed to deploy the base stack, and why it never fires
A1 — Every compose-up / DeployStack caller. DeployStack has exactly one caller:
- controller/internal/api/router.go:350: `r.stackMgr.DeployStack(deployReq)`, driven by the UI deploy form (POST /api/stacks/{name}/deploy).
A repo-wide grep for EnsureBaseStack / deployProtected / BaseStack / first-boot reconcile returns nothing. There is no programmatic deploy of the protected stacks anywhere. runComposeDeploy (stacks/deploy.go:337) is reached only through DeployStack. The deploy.go:24-25 "base" marking referenced in the brief is about backup ordering (filebrowser/traefik are flagged so backup skips them), not about deploying them.
A2 — The Proxmox startup sequence, with the gap. controller/cmd/controller/main.go:56-711:
71 config.LoadPermissive
93 bootstrap.MaybeIngest(...) ← comes up CONFIGURED (pulls yaml from hub, merges local_api)
101 setup.NeedsSetup(cfg) → false (configured → skip wizard)
111 probeLocalAPI(...) (agent channel)
137 stacks.NewManager → 144 ScanStacks (scans catalog; 0 deployed)
170 startQuiesceLoop / schedulers / web server
After MaybeIngest the controller is fully configured (has CF tokens, domain, protected list) but never deploys the protected stacks. There is no Ensure…/deploy step between "configured" and "serving". This is the missing trigger.
A3 — The health loop only reports, never self-heals. controller/internal/monitor/healthcheck.go:159-162:
missingProtected := checkProtectedContainers(cfg.Stacks.Protected)
for _, name := range missingProtected {
report.Issues = append(report.Issues, fmt.Sprintf("Protected container not running: %s", name))
}
→ status="fail" (healthcheck.go:177-181). The scheduler runs this every 5 min (main.go:259) and pushes the FAIL to the hub. No code path attempts to start or deploy the missing containers. That is exactly the FAIL 9201 shows.
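The difference between detecting and reconciling can be shown in a minimal Go sketch. The names `missingProtected` and `reconcileMissing` are hypothetical and are not the repo's `checkProtectedContainers`; the running set is passed in as a plain slice instead of being queried from Docker, and the deploy step is an injected function.

```go
package main

import (
	"fmt"
	"sort"
)

// missingProtected returns the protected names absent from the running set,
// sorted for stable output. Hypothetical stand-in for the detection half.
func missingProtected(protected, running []string) []string {
	up := make(map[string]bool, len(running))
	for _, name := range running {
		up[name] = true
	}
	var missing []string
	for _, name := range protected {
		if !up[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

// reconcileMissing calls deploy for every missing name and returns the names
// whose deploy failed. This is the step the current health loop lacks.
func reconcileMissing(missing []string, deploy func(name string) error) []string {
	var failed []string
	for _, name := range missing {
		if err := deploy(name); err != nil {
			failed = append(failed, name)
		}
	}
	return failed
}

func main() {
	protected := []string{"traefik", "cloudflared", "felhom-controller", "filebrowser"}
	running := []string{"felhom-controller"} // the only container on 9201

	missing := missingProtected(protected, running)
	fmt.Println(missing) // [cloudflared filebrowser traefik]

	failed := reconcileMissing(missing, func(name string) error {
		fmt.Println("would deploy", name) // placeholder for a real deploy call
		return nil
	})
	fmt.Println(len(failed)) // 0
}
```

Today the controller stops after the equivalent of `missingProtected` and only appends issues to the report.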
B. Where the base-stack compose + config come from (decides what to bake)
B4 — On 9201: nothing. /opt/docker/ does not exist (so no /opt/docker/stacks, no traefik, no cloudflared, no filebrowser dirs). The controller runs from a Docker named volume (felhom-controller-data), config at /var/lib/docker/volumes/felhom-controller-data/_data/controller.yaml. The 52 "available" stacks are the catalog cache (git-synced app templates), none deployed.
B5 — Provenance of each infra app's compose + static config: generated by scripts/docker-setup.sh heredocs — the bare-metal installer. They are not in the controller image, not in the app-catalog (templates/ there = the 52 user apps), and not hub/asset-served.
- `install_traefik` (scripts/docker-setup.sh:876): writes `${TRAEFIK_DIR}/docker-compose.yml` (:1015), the `traefik.yml` static config (:936), dynamic config, `acme.json`, and certs. `TRAEFIK_DIR=/opt/docker/traefik` (:155, not under `stacks/`).
- `install_cloudflared` (scripts/docker-setup.sh:1067): writes `${CLOUDFLARED_DIR}/docker-compose.yml` (:1083). `CLOUDFLARED_DIR=/opt/docker/cloudflared` (:157, not under `stacks/`).
- `install_filebrowser` (scripts/docker-setup.sh:1263): writes `${FILEBROWSER_DIR}/docker-compose.yml` (:1295) + `config.yaml`. `FILEBROWSER_DIR=/opt/docker/stacks/filebrowser` (:156).

Note the health check matches container names (`traefik`/`cloudflared`/`filebrowser`), so traefik+cloudflared living outside `stacks/` is fine for detection.
Partial exception — filebrowser only. The controller can regenerate filebrowser's compose+config: generateFileBrowserCompose (web/handlers.go:1383) + generateFileBrowserConfig, driven by syncFileBrowserMounts (web/handlers.go:1295). But it refuses to create it the first time — it early-returns if the compose is absent:
// web/handlers.go:1304
if _, err := os.Stat(composePath); os.IsNotExist(err) {
s.logger.Printf("[WARN] ... FileBrowser stack not found at %s — skipping mount sync"); return
}
There is no traefik or cloudflared generator in the controller at all (grep: zero hits for traefik.yml, cloudflared compose, or any EnsureInfra/deployInfra).
B6 — Offline-capable? On 9201 assets.source_url: https://felhom.eu, assets.sync_enabled: false — and assets are UI assets (logos), not infra compose. The controller needs no hub fetch to deploy compose in principle (config is already local post-MaybeIngest), but today there is simply no template to deploy on a provisioned guest. First-boot deploy becomes offline-possible only once the templates are baked/embedded and a generator exists.
C. Image provenance + bake feasibility (answers Viktor's question)
C7 — Image refs (from the docker-setup.sh heredocs):
| App | Image | Pinned? | Registry |
|---|---|---|---|
| traefik | `traefik:v3.6.7` (:1019) | ✅ pinned | Docker Hub (public) |
| cloudflared | `cloudflare/cloudflared:latest` (:1094) | ❌ `:latest` | Docker Hub (public) |
| filebrowser | `gtstef/filebrowser:latest` (:1302, and controller's generator handlers.go:1397) | ❌ `:latest` | Docker Hub (public; note: `gtstef/`, not the official `filebrowser/filebrowser`) |
C8 — Registry pull at first boot? All three are public Docker Hub pulls → no gitea private-registry credential needed in the guest (good — none must ever be there). Without baking, first boot needs outbound Docker Hub access.
C9 — Can build-golden.sh bake them? Yes — same mechanism. It already bakes the controller image with a plain pull-into-the-golden's-Docker: felhom-agent/configs/build-golden.sh:71 docker pull "$CONTROLLER_IMAGE", then logs out/removes the cred (:72) so nothing is baked but the image. Adding three more docker pull lines for traefik/cloudflared/filebrowser bakes them identically — and these are public, so they don't even need the build-time docker login the controller image uses.
- Blocker / must-fix: pin the two `:latest` tags before baking. A baked `:latest` drifts (the baked digest ≠ whatever `:latest` later resolves to), and any first-boot fallback pull would re-resolve `:latest` non-reproducibly. Pin to digests or explicit versions.
- Baking the compose templates: feasible but not free. It requires porting docker-setup.sh's traefik/cloudflared heredoc generators (static `traefik.yml`, ACME/cert-resolver block, dynamic config, the cloudflared compose) into the controller as Go templates rendered from `controller.yaml`. The controller today has only the filebrowser generator. This is the real work item; the image bake is trivial by comparison.
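The pinning rule can be checked mechanically before a bake. The sketch below is an illustration only: `isPinned` is a hypothetical helper, and it treats a reference as pinned when it carries a `sha256` digest or an explicit tag other than `latest`.

```go
package main

import (
	"fmt"
	"strings"
)

// isPinned reports whether an image reference carries a digest or an explicit
// tag other than "latest". Hypothetical helper for a pre-bake check.
func isPinned(ref string) bool {
	if strings.Contains(ref, "@sha256:") {
		return true
	}
	// Look for the tag only after the last '/', so a registry port
	// (host:5000/image) is not mistaken for a tag.
	name := ref[strings.LastIndex(ref, "/")+1:]
	i := strings.LastIndex(name, ":")
	if i < 0 {
		return false // no tag: Docker resolves this as :latest
	}
	tag := name[i+1:]
	return tag != "" && tag != "latest"
}

func main() {
	refs := []string{
		"traefik:v3.6.7",
		"cloudflare/cloudflared:latest",
		"gtstef/filebrowser:latest",
	}
	for _, ref := range refs {
		fmt.Printf("%s pinned=%t\n", ref, isPinned(ref))
	}
}
```

Run against the three refs from C7, this flags cloudflared and filebrowser and passes traefik.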
C10 — Running-container bake (the hard line): No infra app is safe to bake as a running container. Each is per-customer-parameterized with secrets injected at run:
- cloudflared: run env `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` (docker-setup.sh:1099), a per-customer tunnel token → must NOT be baked running.
- traefik consumes the per-customer CF API token + ACME email + domain (see D) → must NOT be baked running.
- filebrowser binds per-customer storage paths + domain.
Verdict: bake images (pinned) ✅, optionally bake renderable templates ✅ (after porting the generators), bake running containers with secrets ❌.
D. Customer-specific parameters first-boot must inject — all present on 9201
Confirmed in 9201's merged controller.yaml (values redacted):
| Infra app | Needs | Config key (present on 9201) | Compose env wiring |
|---|---|---|---|
| cloudflared | tunnel token | `infrastructure.cf_tunnel_token: <REDACTED>` ✅ | `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` (docker-setup.sh:1099) |
| traefik | CF API token (DNS-01) | `infrastructure.cf_api_token: <REDACTED>` ✅ | `CF_DNS_API_TOKEN` → ACME `dnsChallenge: provider: cloudflare` (docker-setup.sh:899-916) |
| traefik | ACME email | `customer.email: admin@felhom.eu` ✅ | `acme: email: ${ACME_EMAIL}` (docker-setup.sh:907) |
| traefik / all | base domain | `customer.domain: demo-felhom.eu` ✅ | ``Host(`traefik.${BASE_DOMAIN}`)``, websecure routing |
| filebrowser | storage paths + domain | `settings.GetStoragePaths()` + `customer.domain` ✅ | volume mounts + `files.${domain}` (handlers.go:1310-1349) |
Every customer parameter the base stack needs is already in the local config after MaybeIngest. Nothing additional must be fetched to render them.
E. Hostname / CT-name (diagnose now, fix later)
E12 — Reported hostname is the Docker container ID. controller/internal/report/builder.go:75 Hostname: staticInfo.Hostname ← os.Hostname(). The controller runs inside Docker, and the golden bootstrap docker run sets no --hostname (felhom-agent/configs/build-golden.sh:94) → os.Hostname() returns the container ID.
- Live: `docker inspect felhom-controller --format '{{.Config.Hostname}}'` → `3dff0fe73b5c` (the value reported to the hub).
- Insertion point: the bootstrap unit's `docker run` (build-golden.sh:94). It already reads `/etc/felhom-bootstrap/bootstrap.json`; add `--hostname <customer-id>` parsed from that file. The id is present: `bootstrap.json` carries `customer.id` (the pull target), per controller/internal/bootstrap/bootstrap.go:66-68 (`BootstrapCustomer.ID`). Feasible with a small `grep`/`jq` in the baked `felhom-controller-bootstrap.sh` heredoc.
E13 — Proxmox CT/LXC hostname is felhom-golden. The golden is created --hostname felhom-golden (build-golden.sh:38); /etc/hostname is removed at minimize (:146) but the PVE container-config hostname is not reset on restore, so the guest inherits felhom-golden.
- Live: `grep hostname /etc/pve/lxc/9201.conf` → `hostname: felhom-golden`; `pct exec 9201 -- hostname` → `felhom-golden`.
- The mechanism to fix it already exists in the agent: felhom-agent/internal/reconcile/bringup.go:303-304 sets `params["hostname"] = spec.Hostname` (via `SetConfig` / `pct set`) when `Mode==ModeProvision && Hostname!=""`. The provision path passes `Hostname: a.hostname` (felhom-agent/cmd/felhom-agent/main.go:1041) from a `-hostname` flag.
- Why 9201 still shows `felhom-golden`: it was provisioned without a `-hostname` value → `spec.Hostname==""` → the `SetConfig` hostname step is skipped → the golden's name persists. Fix = wire the provision back-half to pass `Hostname=<customer-id>` (sanitized) into `BringUpSpec`. No new mechanism needed.

These are two independent layers: E13 fixes the Proxmox CT name + LXC hostname; E12 fixes what the controller reports to the hub (the Docker container's `os.Hostname()`). Fixing only one leaves the other wrong.
F. Recommended insertion point for first-boot base-stack bring-up
Recommendation: option (a) — the controller deploys its own base stack on first configured boot, and self-heals it when missing.
Place an EnsureBaseInfra() step in cmd/controller/main.go after stackMgr.ScanStacks() (line ~144), once Docker is confirmed reachable, and additionally invoke it from the 5-min system-health job when checkProtectedContainers reports any protected container missing (turn healthcheck.go's detection into a reconcile trigger).
Why (a):
- The full config (CF tunnel token, CF API token, domain, email, storage paths) is already local after `MaybeIngest` (Section D), so no secret needs to enter the golden.
- The controller already owns stack deployment (`stacks.Manager`, `docker compose` via the mounted socket) and already has the filebrowser generator; extend the same pattern to traefik/cloudflared.
- The health loop already detects the missing protected set; making it reconcile is the natural, idempotent, self-healing design (survives a wiped/half-deployed guest).
- Keeps customer secrets out of the golden and out of the agent's bootstrap payload.
Why not the others:
- (b) golden bootstrap-unit step → would have to render per-customer traefik/cloudflared config in shell and risks putting/handling secrets in the unit; duplicates logic the controller is better placed to own.
- (c) headless reuse of the setup wizard's deploy path → the wizard never deployed the base stack either (it only writes `controller.yaml`, setup/handlers.go:398-514); there is no deploy path to reuse.
Prerequisites / ordering constraints for (a):
- Port the traefik + cloudflared compose/config generators into the controller (Go templates from `controller.yaml`). This is the main build item; filebrowser's generator already exists but must drop its "skip if absent" early-return (web/handlers.go:1304) so it can create on first boot.
- Bake the three infra images (pinned) into the golden (build-golden.sh) so first-boot deploy is offline-capable; pin the two `:latest` tags.
- Create the `traefik-public` docker network + the stack dirs as part of bring-up (absent on 9201 today).
- Run only when configured (post-`MaybeIngest`, `NeedsSetup==false`) and after Docker is reachable; make it idempotent (no-op when the protected containers are already up).
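The ordering constraints above can be sketched as one idempotent function. This is an illustration under stated assumptions, not the controller's code: the `infraRuntime` interface, `ensureBaseInfra`, and the in-memory `fakeRuntime` are all hypothetical, and a real version would call Docker and the ported generators.

```go
package main

import "fmt"

// infraRuntime abstracts the Docker operations the sketch needs. Hypothetical.
type infraRuntime interface {
	NetworkExists(name string) (bool, error)
	CreateNetwork(name string) error
	ContainerRunning(name string) (bool, error)
	DeployStack(name string) error
}

// ensureBaseInfra creates the shared network if absent and deploys each stack
// whose container is not running. It returns the stacks it deployed and is a
// no-op when unconfigured or when everything is already up.
func ensureBaseInfra(rt infraRuntime, configured bool, network string, stacks []string) ([]string, error) {
	if !configured {
		return nil, nil // setup wizard still pending: nothing to render from
	}
	exists, err := rt.NetworkExists(network)
	if err != nil {
		return nil, fmt.Errorf("check network %s: %w", network, err)
	}
	if !exists {
		if err := rt.CreateNetwork(network); err != nil {
			return nil, fmt.Errorf("create network %s: %w", network, err)
		}
	}
	var deployed []string
	for _, name := range stacks {
		running, err := rt.ContainerRunning(name)
		if err != nil {
			return deployed, fmt.Errorf("check container %s: %w", name, err)
		}
		if running {
			continue
		}
		if err := rt.DeployStack(name); err != nil {
			return deployed, fmt.Errorf("deploy %s: %w", name, err)
		}
		deployed = append(deployed, name)
	}
	return deployed, nil
}

// fakeRuntime is an in-memory stand-in used only to exercise the sketch.
type fakeRuntime struct {
	networks map[string]bool
	running  map[string]bool
}

func (f *fakeRuntime) NetworkExists(n string) (bool, error)    { return f.networks[n], nil }
func (f *fakeRuntime) CreateNetwork(n string) error            { f.networks[n] = true; return nil }
func (f *fakeRuntime) ContainerRunning(n string) (bool, error) { return f.running[n], nil }
func (f *fakeRuntime) DeployStack(n string) error              { f.running[n] = true; return nil }

func main() {
	rt := &fakeRuntime{networks: map[string]bool{}, running: map[string]bool{"felhom-controller": true}}
	stacks := []string{"traefik", "cloudflared", "filebrowser"}

	first, _ := ensureBaseInfra(rt, true, "traefik-public", stacks)
	second, _ := ensureBaseInfra(rt, true, "traefik-public", stacks)
	fmt.Println(first, second) // [traefik cloudflared filebrowser] []
}
```

Because the second call deploys nothing, the same function can be called both at startup and from the health job without extra guards.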
G. Additional gap surfaced (flag — needs validation before the spec)
The bootstrap docker run does not bind-mount the stacks dir or /opt/docker from the LXC host. It mounts only (build-golden.sh:94-99):
-v /etc/felhom-bootstrap:/etc/felhom-bootstrap:ro
-v felhom-controller-data:/opt/docker/felhom-controller (named volume)
-v /var/run/docker.sock:/var/run/docker.sock
So paths.stacks_dir = /opt/docker/stacks exists only inside the controller container, while docker compose up (invoked by the controller over the shared socket) is executed by the host LXC's Docker daemon. Compose files are read by the in-container CLI, but bind-mount sources in those compose files (e.g. traefik's ./traefik.yml:/etc/traefik/..., filebrowser's ./config.yaml, app HDD_PATH mounts) are resolved by the daemon on the host filesystem, where /opt/docker/stacks/... does not exist. On bare metal this worked because /opt/docker/stacks was a shared host bind-mount into the controller.
This is a path-namespace mismatch that affects ALL stack deploys (every catalog app, not just base infra), so it sits squarely in the blast radius of "stand up the base stack." It is inferred from the mount topology + how the controller shells docker compose with cmd.Dir=stackDir; it was not live-exercised here (no deploy attempted, per the read-only rule). Recommend the bring-up spec validate this explicitly and, if confirmed, add a host bind-mount (e.g. -v /opt/docker/stacks:/opt/docker/stacks) to the bootstrap docker run so container and daemon agree on the path.
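The mismatch can be expressed as a check over the controller container's mount list. The sketch is a hedged illustration of the inference, not a validated test: the `mount` type borrows the `Type`, `Source`, and `Destination` field names that `docker inspect` reports, `daemonCanResolve` is a hypothetical helper, and the mount values are taken from the `docker run` flags listed above, with the named volume's host path written as reported in B4.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// mount mirrors the fields of interest from a container's mount list.
type mount struct {
	Type        string // "bind" or "volume"
	Source      string // path on the host
	Destination string // path inside the container
}

// daemonCanResolve reports whether a path seen inside the controller container
// names the same location on the host, i.e. it lies under a bind mount whose
// source and destination are identical. Hypothetical helper.
func daemonCanResolve(p string, mounts []mount) bool {
	p = path.Clean(p)
	for _, m := range mounts {
		if m.Type != "bind" {
			continue
		}
		dst := path.Clean(m.Destination)
		if path.Clean(m.Source) != dst {
			continue
		}
		if dst == "/" || p == dst || strings.HasPrefix(p, dst+"/") {
			return true
		}
	}
	return false
}

func main() {
	current := []mount{
		{Type: "bind", Source: "/etc/felhom-bootstrap", Destination: "/etc/felhom-bootstrap"},
		{Type: "volume", Source: "/var/lib/docker/volumes/felhom-controller-data/_data", Destination: "/opt/docker/felhom-controller"},
		{Type: "bind", Source: "/var/run/docker.sock", Destination: "/var/run/docker.sock"},
	}
	fmt.Println(daemonCanResolve("/opt/docker/stacks/traefik", current)) // false

	// With the suggested extra bind mount, both sides agree on the path.
	fixed := append(current, mount{Type: "bind", Source: "/opt/docker/stacks", Destination: "/opt/docker/stacks"})
	fmt.Println(daemonCanResolve("/opt/docker/stacks/traefik", fixed)) // true
}
```

A check of this shape could run at startup and log a warning before any deploy is attempted.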
Evidence index (live repo file:line)
- No base-stack deploy caller: api/router.go:350 is the sole `DeployStack` caller; startup cmd/controller/main.go:56-711.
- Detect-only health: monitor/healthcheck.go:159-181.
- Infra compose source (bare-metal only): scripts/docker-setup.sh:876 / :1067 / :1263.
- Filebrowser generator + "skip if absent": web/handlers.go:1295-1383.
- Protected list written to yaml, no deploy: setup/handlers.go:475-481.
- Bootstrap pull/merge (configured-on-first-boot): bootstrap/bootstrap.go:100-162; customer.id field :66.
- Reported hostname = os.Hostname: report/builder.go:75.
- Golden bake + bootstrap `docker run` (no `--hostname`, mounts): felhom-agent/configs/build-golden.sh:38,71,94-99,146.
- Agent hostname-set mechanism: felhom-agent/internal/reconcile/bringup.go:303-304; provision wiring felhom-agent/cmd/felhom-agent/main.go:1039-1041.
Live 9201 output (secrets redacted)
- `pct status 9201` → running; `docker ps` → only `felhom-controller … Up (healthy)`.
- `docker network ls` → `bridge / host / none` (no `traefik-public`).
- `ls /opt/docker/` → `No such file or directory`.
- `docker inspect felhom-controller {{.Config.Hostname}}` → `3dff0fe73b5c`.
- `pct exec 9201 -- hostname` → `felhom-golden`; `/etc/pve/lxc/9201.conf` → `hostname: felhom-golden`.
- `docker logs felhom-controller` → repeating `[monitor] Health check: status=fail`; `[stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)`.
- Merged `controller.yaml` keys present: `infrastructure.cf_tunnel_token`, `infrastructure.cf_api_token`, `customer.domain=demo-felhom.eu`, `customer.email=admin@felhom.eu`, `stacks.protected=[traefik,cloudflared,felhom-controller,filebrowser]`, `assets.sync_enabled=false`, `paths.stacks_dir=/opt/docker/stacks`.