# DIAGNOSTIC — base-infra bring-up: why fresh guests are health=FAIL
**Scope:** read-only diagnosis. No code changes, no deploys, no state changes on guest 9201 or any guest.
**Subject:** demo guest **9201** (LXC), controller **v0.40.0**, online on the hub, Health = **FAIL: protected containers not running — traefik, cloudflared, filebrowser**.
**Date:** 2026-06-11. Evidence is live repo `file:line` + live 9201 output (secrets redacted).
---
## TL;DR — the one-line cause
**Nothing deploys the base/protected stack on a Proxmox bootstrap.** The traefik / cloudflared / filebrowser stacks were *only ever* created by the **bare-metal `scripts/docker-setup.sh`** (heredoc-generated compose files). The Proxmox golden→bootstrap path never runs that script, the controller has **no first-boot/reconcile/self-heal deploy** for the protected stacks, and the health loop only **detects** them missing. So on a provisioned guest there is no `/opt/docker/stacks`, no `traefik-public` network, and no infra containers — only `felhom-controller` itself runs, and health is permanently FAIL.
**Live confirmation (9201):**
```
pct status 9201        → running
docker ps              → felhom-controller … Up (healthy)   [ONLY container]
docker network ls      → bridge, host, none                 [NO traefik-public]
ls /opt/docker/        → No such file or directory          [no stacks dir at all]
docker logs felhom-controller | grep health
                       → [monitor] Health check: status=fail   (every cycle)
                       → [stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)
```
---
## A. Who is *supposed* to deploy the base stack, and why it never fires
**A1 — Every compose-up / DeployStack caller.** `DeployStack` has exactly **one** caller:
- [controller/internal/api/router.go:350](controller/internal/api/router.go#L350) — `r.stackMgr.DeployStack(deployReq)`, driven by the UI deploy form (`POST /api/stacks/{name}/deploy`).
A repo-wide grep for `EnsureBaseStack` / `deployProtected` / `BaseStack` / first-boot reconcile returns **nothing**. There is no programmatic deploy of the protected stacks anywhere. `runComposeDeploy` ([stacks/deploy.go:337](controller/internal/stacks/deploy.go#L337)) is reached only through `DeployStack`. The `deploy.go:24-25` "base" marking referenced in the brief is about *backup ordering* (filebrowser/traefik are flagged so backup skips them), not about deploying them.
**A2 — The Proxmox startup sequence, with the gap.** [controller/cmd/controller/main.go:56-711](controller/cmd/controller/main.go#L56):
```
 71  config.LoadPermissive
 93  bootstrap.MaybeIngest(...)            ← comes up CONFIGURED (pulls yaml from hub, merges local_api)
101  setup.NeedsSetup(cfg) → false         (configured → skip wizard)
111  probeLocalAPI(...)                    (agent channel)
137  stacks.NewManager → 144 ScanStacks    (scans catalog; 0 deployed)
170  startQuiesceLoop / schedulers / web server
```
After `MaybeIngest` the controller is fully configured (has CF tokens, domain, protected list) **but never deploys the protected stacks**. There is no `Ensure…`/`deploy` step between "configured" and "serving". This is the missing trigger.
**A3 — The health loop only reports, never self-heals.** [controller/internal/monitor/healthcheck.go:159-162](controller/internal/monitor/healthcheck.go#L159):
```go
missingProtected := checkProtectedContainers(cfg.Stacks.Protected)
for _, name := range missingProtected {
	report.Issues = append(report.Issues, fmt.Sprintf("Protected container not running: %s", name))
}
```
→ `status="fail"` (healthcheck.go:177-181). The scheduler runs this every 5 min ([main.go:259](controller/cmd/controller/main.go#L259)) and pushes the FAIL to the hub. **No code path attempts to start or deploy the missing containers.** That is exactly the FAIL 9201 shows.
---
## B. Where the base-stack compose + config come from (decides what to bake)
**B4 — On 9201: nothing.** `/opt/docker/` does not exist (so no `/opt/docker/stacks`, no `traefik`, no `cloudflared`, no `filebrowser` dirs). The controller runs from a Docker **named volume** (`felhom-controller-data`), config at `/var/lib/docker/volumes/felhom-controller-data/_data/controller.yaml`. The 52 "available" stacks are the **catalog cache** (git-synced app templates), none deployed.
**B5 — Provenance of each infra app's compose + static config:** generated by **`scripts/docker-setup.sh`** heredocs — the **bare-metal** installer. They are **not** in the controller image, **not** in the app-catalog (`templates/` there = the 52 user apps), and **not** hub/asset-served.
- `install_traefik` — [scripts/docker-setup.sh:876](scripts/docker-setup.sh#L876); writes `${TRAEFIK_DIR}/docker-compose.yml` (:1015), `traefik.yml` static config (:936), dynamic config, `acme.json`, certs. `TRAEFIK_DIR=/opt/docker/traefik` (:155 — **not** under `stacks/`).
- `install_cloudflared` — [scripts/docker-setup.sh:1067](scripts/docker-setup.sh#L1067); writes `${CLOUDFLARED_DIR}/docker-compose.yml` (:1083). `CLOUDFLARED_DIR=/opt/docker/cloudflared` (:157 — **not** under `stacks/`).
- `install_filebrowser` — [scripts/docker-setup.sh:1263](scripts/docker-setup.sh#L1263); writes `${FILEBROWSER_DIR}/docker-compose.yml` (:1295) + `config.yaml`. `FILEBROWSER_DIR=/opt/docker/stacks/filebrowser` (:156).
Note the health check matches **container names** (`traefik`/`cloudflared`/`filebrowser`), so traefik+cloudflared living outside `stacks/` is fine for detection.
**Partial exception — filebrowser only.** The controller *can* regenerate filebrowser's compose+config: `generateFileBrowserCompose` ([web/handlers.go:1383](controller/internal/web/handlers.go#L1383)) + `generateFileBrowserConfig`, driven by `syncFileBrowserMounts` ([web/handlers.go:1295](controller/internal/web/handlers.go#L1295)). **But it refuses to create it the first time** — it early-returns if the compose is absent:
```go
// web/handlers.go:1304
if _, err := os.Stat(composePath); os.IsNotExist(err) {
s.logger.Printf("[WARN] ... FileBrowser stack not found at %s — skipping mount sync"); return
}
```
There is **no traefik or cloudflared generator in the controller at all** (grep: zero hits for `traefik.yml`, cloudflared compose, or any `EnsureInfra`/`deployInfra`).
**B6 — Offline-capable?** On 9201 `assets.source_url: https://felhom.eu`, **`assets.sync_enabled: false`** — and assets are UI assets (logos), *not* infra compose. The controller needs **no hub fetch to deploy compose** *in principle* (config is already local post-`MaybeIngest`), but today there is simply **no template to deploy** on a provisioned guest. First-boot deploy becomes offline-possible only once the templates are baked/embedded **and** a generator exists.
---
## C. Image provenance + bake feasibility (answers Viktor's question)
**C7 — Image refs (from the docker-setup.sh heredocs):**
| App | Image | Pinned? | Registry |
|---|---|---|---|
| traefik | `traefik:v3.6.7` (:1019) | ✅ pinned | Docker Hub (public) |
| cloudflared | `cloudflare/cloudflared:latest` (:1094) | ❌ `:latest` | Docker Hub (public) |
| filebrowser | `gtstef/filebrowser:latest` (:1302, and controller's generator handlers.go:1397) | ❌ `:latest` | Docker Hub (public; note: `gtstef/`, **not** the official `filebrowser/filebrowser`) |
**C8 — Registry pull at first boot?** All three are **public Docker Hub** pulls → **no gitea private-registry credential needed** in the guest (good — none must ever be there). Without baking, first boot needs outbound Docker Hub access.
**C9 — Can `build-golden.sh` bake them?** **Yes — same mechanism.** It already bakes the controller image with a plain pull-into-the-golden's-Docker: [felhom-agent/configs/build-golden.sh:71](../felhom-agent/configs/build-golden.sh#L71) `docker pull "$CONTROLLER_IMAGE"`, then logs out/removes the cred (:72) so nothing is baked but the image. Adding three more `docker pull` lines for traefik/cloudflared/filebrowser bakes them identically — and these are **public**, so they don't even need the build-time `docker login` the controller image uses.
- **Blocker / must-fix:** pin the two `:latest` tags before baking. A baked `:latest` drifts (the baked digest ≠ whatever `:latest` later resolves to), and any first-boot fallback pull would re-resolve `:latest` non-reproducibly. Pin to digests or explicit versions.
- **Baking the compose templates:** feasible but **not free** — it requires porting docker-setup.sh's traefik/cloudflared heredoc generators (static `traefik.yml`, ACME/cert-resolver block, dynamic config, the cloudflared compose) into the controller as Go templates rendered from `controller.yaml`. The controller today has only the **filebrowser** generator. This is the real work item; the image bake is trivial by comparison.
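
The pinning requirement above could be enforced mechanically before a bake. The following is a minimal sketch, not code from the repo: `isPinned` is a hypothetical helper and a deliberately simplified check (it is not a full OCI reference parser). It treats a digest reference or any explicit tag other than `latest` as pinned.

```go
package main

import (
	"fmt"
	"strings"
)

// isPinned reports whether an image reference carries a digest or an
// explicit tag other than "latest". Hypothetical helper for illustration;
// it is a simplified check, not a complete reference parser.
func isPinned(ref string) bool {
	// A digest reference (name@sha256:...) always resolves to the same image.
	if strings.Contains(ref, "@sha256:") {
		return true
	}
	// Look for a tag only after the last "/" so that a registry port such as
	// "registry.local:5000/app" is not mistaken for a tag.
	name := ref[strings.LastIndex(ref, "/")+1:]
	i := strings.LastIndex(name, ":")
	if i < 0 {
		return false // no tag at all: Docker resolves this to :latest
	}
	tag := name[i+1:]
	return tag != "" && tag != "latest"
}

func main() {
	// Image refs as listed in the C7 table.
	refs := []string{
		"traefik:v3.6.7",
		"cloudflare/cloudflared:latest",
		"gtstef/filebrowser:latest",
	}
	for _, ref := range refs {
		fmt.Printf("%-32s pinned=%v\n", ref, isPinned(ref))
	}
}
```

A check like this could run at the top of the golden build and abort when any infra image is unpinned, so a `:latest` tag cannot slip back in unnoticed.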
**C10 — Running-container bake (the hard line):** **No infra app is safe to bake as a *running* container.** Each is per-customer-parameterized with secrets injected at run:
- cloudflared run env `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` ([docker-setup.sh:1099](scripts/docker-setup.sh#L1099)) — per-customer tunnel token → **must NOT** be baked running.
- traefik consumes the per-customer CF API token + ACME email + domain (see D) → **must NOT** be baked running.
- filebrowser binds per-customer storage paths + domain.
**Verdict:** bake **images** (pinned) ✅, optionally bake **renderable templates** ✅ (after porting the generators), bake **running containers with secrets** ❌.
---
## D. Customer-specific parameters first-boot must inject — all present on 9201
Confirmed in 9201's merged `controller.yaml` (values redacted):
| Infra app | Needs | Config key (present on 9201) | Compose env wiring |
|---|---|---|---|
| cloudflared | tunnel token | `infrastructure.cf_tunnel_token: <REDACTED>` ✅ | `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` (docker-setup.sh:1099) |
| traefik | CF API token (DNS-01) | `infrastructure.cf_api_token: <REDACTED>` ✅ | `CF_DNS_API_TOKEN` → ACME `dnsChallenge: provider: cloudflare` (docker-setup.sh:899-916) |
| traefik | ACME email | `customer.email: admin@felhom.eu` ✅ | `acme: email: ${ACME_EMAIL}` (docker-setup.sh:907) |
| traefik / all | base domain | `customer.domain: demo-felhom.eu` ✅ | `` Host(`traefik.${BASE_DOMAIN}`) ``, websecure routing |
| filebrowser | storage paths + domain | `settings.GetStoragePaths()` + `customer.domain` ✅ | volume mounts + `files.${domain}` (handlers.go:1310-1349) |
Every customer parameter the base stack needs is **already in the local config** after `MaybeIngest`. Nothing additional must be fetched to render them.
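
To make the "render from local config" idea concrete, here is a hedged sketch of a cloudflared generator using Go's `text/template`. Only the image name, the container name `cloudflared`, and the `TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}` wiring come from this report; the `infraParams` struct, the `renderCompose`/`renderEnv` helpers, the `command` line, and the overall compose layout are assumptions for illustration and are not copied from `docker-setup.sh`.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// infraParams is a hypothetical subset of controller.yaml.
type infraParams struct {
	Image         string // pinned image reference (placeholder below)
	Network       string // shared proxy network, e.g. traefik-public
	CFTunnelToken string // infrastructure.cf_tunnel_token
}

// Assumed compose layout; the real heredoc may differ. The token stays a
// ${...} reference so the secret lives in the env file, not the compose file.
const cloudflaredCompose = `services:
  cloudflared:
    image: {{ .Image }}
    container_name: cloudflared
    restart: unless-stopped
    command: tunnel --no-autoupdate run
    environment:
      - TUNNEL_TOKEN=${CF_TUNNEL_TOKEN}
    networks:
      - {{ .Network }}
networks:
  {{ .Network }}:
    external: true
`

// renderCompose fills the template with the non-secret parameters.
func renderCompose(p infraParams) (string, error) {
	tmpl, err := template.New("cloudflared").Parse(cloudflaredCompose)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, p); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return buf.String(), nil
}

// renderEnv produces the .env content that compose interpolates.
func renderEnv(p infraParams) string {
	return "CF_TUNNEL_TOKEN=" + p.CFTunnelToken + "\n"
}

func main() {
	p := infraParams{
		Image:         "cloudflare/cloudflared:<pinned-tag>", // placeholder, not a real tag
		Network:       "traefik-public",
		CFTunnelToken: "<REDACTED>",
	}
	out, err := renderCompose(p)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out) // the env file is written separately and never logged
}
```

Keeping the token in a separate env file mirrors the existing `${CF_TUNNEL_TOKEN}` interpolation and keeps the rendered compose file free of secrets.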
---
## E. Hostname / CT-name (diagnose now, fix later)
**E12 — Reported hostname is the Docker container ID.** [controller/internal/report/builder.go:75](controller/internal/report/builder.go#L75) `Hostname: staticInfo.Hostname` ← `os.Hostname()`. The controller runs inside Docker, and the golden bootstrap `docker run` sets **no `--hostname`** ([felhom-agent/configs/build-golden.sh:94](../felhom-agent/configs/build-golden.sh#L94)) → `os.Hostname()` returns the container ID.
- Live: `docker inspect felhom-controller --format '{{.Config.Hostname}}'` → **`3dff0fe73b5c`** (the value reported to the hub).
- **Insertion point:** the bootstrap unit's `docker run` (build-golden.sh:94). It already reads `/etc/felhom-bootstrap/bootstrap.json`; add `--hostname <customer-id>` parsed from that file. The id is present — `bootstrap.json` carries `customer.id` (the pull target), per [controller/internal/bootstrap/bootstrap.go:66-68](controller/internal/bootstrap/bootstrap.go#L66) (`BootstrapCustomer.ID`). Feasible with a small `grep`/`jq` in the baked `felhom-controller-bootstrap.sh` heredoc.
**E13 — Proxmox CT/LXC hostname is `felhom-golden`.** The golden is created `--hostname felhom-golden` ([build-golden.sh:38](../felhom-agent/configs/build-golden.sh#L38)); `/etc/hostname` is removed at minimize (:146) but the **PVE container-config hostname is not reset on restore**, so the guest inherits `felhom-golden`.
- Live: `grep hostname /etc/pve/lxc/9201.conf` → **`hostname: felhom-golden`**; `pct exec 9201 -- hostname` → **`felhom-golden`**.
- **The mechanism to fix it already exists in the agent:** [felhom-agent/internal/reconcile/bringup.go:303-304](../felhom-agent/internal/reconcile/bringup.go#L303) sets `params["hostname"] = spec.Hostname` (via `SetConfig` / `pct set`) when `Mode==ModeProvision && Hostname!=""`. The provision path passes `Hostname: a.hostname` ([felhom-agent/cmd/felhom-agent/main.go:1041](../felhom-agent/cmd/felhom-agent/main.go#L1041)) from a `-hostname` flag.
- **Why 9201 still shows `felhom-golden`:** it was provisioned **without** a `-hostname` value → `spec.Hostname==""` → the `SetConfig` hostname step is skipped → the golden's name persists. **Fix = wire the provision back-half to pass `Hostname=<customer-id>` (sanitized) into `BringUpSpec`.** No new mechanism needed.
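
Both hostname fixes need the customer id turned into a valid hostname. The sketch below shows one way that "(sanitized)" step might look. It assumes `bootstrap.json` exposes the id under the JSON keys `customer` and `id` (inferred from the `customer.id` reference above, not verified against the file), and `sanitizeHostname` / `hostnameFromBootstrap` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// bootstrapFile mirrors only the field this report cites (customer.id).
// The JSON key names are an assumption; other fields are ignored.
type bootstrapFile struct {
	Customer struct {
		ID string `json:"id"`
	} `json:"customer"`
}

// sanitizeHostname lowercases the id, collapses every run of characters
// outside [a-z0-9] into a single "-", trims dashes at both ends, and caps
// the result at 63 characters (the DNS label limit).
func sanitizeHostname(id string) string {
	var b strings.Builder
	lastDash := true // true at start so a leading dash is never written
	for _, r := range strings.ToLower(id) {
		switch {
		case (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9'):
			b.WriteRune(r)
			lastDash = false
		case !lastDash:
			b.WriteByte('-')
			lastDash = true
		}
	}
	s := strings.TrimRight(b.String(), "-")
	if len(s) > 63 {
		s = strings.TrimRight(s[:63], "-")
	}
	return s
}

// hostnameFromBootstrap extracts customer.id and returns a usable hostname.
func hostnameFromBootstrap(raw []byte) (string, error) {
	var bf bootstrapFile
	if err := json.Unmarshal(raw, &bf); err != nil {
		return "", fmt.Errorf("parse bootstrap.json: %w", err)
	}
	h := sanitizeHostname(bf.Customer.ID)
	if h == "" {
		return "", fmt.Errorf("customer.id is empty or has no usable characters")
	}
	return h, nil
}

func main() {
	raw := []byte(`{"customer": {"id": "Demo_Felhom 01"}}`) // made-up sample id
	h, err := hostnameFromBootstrap(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(h) // prints "demo-felhom-01"
}
```

Using the same sanitizer for the agent's `BringUpSpec.Hostname` and for the controller container's `--hostname` would keep the two layers consistent.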
> These are two **independent** layers: E13 fixes the Proxmox CT name + LXC hostname; E12 fixes what the *controller* reports to the hub (the Docker container's `os.Hostname()`). Fixing only one leaves the other wrong.
---
## F. Recommended insertion point for first-boot base-stack bring-up
**Recommendation: option (a) — the controller deploys its own base stack on first configured boot, and self-heals it when missing.**
Place an `EnsureBaseInfra()` step in [cmd/controller/main.go](controller/cmd/controller/main.go) **after** `stackMgr.ScanStacks()` (line ~144), once Docker is confirmed reachable, and additionally invoke it from the 5-min `system-health` job when `checkProtectedContainers` reports any protected container missing (turn healthcheck.go's detection into a reconcile trigger).
**Why (a):**
- The full config (CF tunnel token, CF API token, domain, email, storage paths) is **already local** after `MaybeIngest` (Section D) — no secret needs to enter the golden.
- The controller already **owns stack deployment** (`stacks.Manager`, `docker compose` via the mounted socket) and already has the **filebrowser generator** — extend the same pattern to traefik/cloudflared.
- The health loop already **detects** the missing protected set; making it reconcile is the natural, idempotent, self-healing design (survives a wiped/half-deployed guest).
- Keeps customer secrets out of the golden and out of the agent's bootstrap payload.
**Why not the others:**
- (b) golden bootstrap-unit step → would have to render per-customer traefik/cloudflared config in shell and risks putting/handling secrets in the unit; duplicates logic the controller is better placed to own.
- (c) headless reuse of the setup wizard's deploy path → the wizard **never deployed** the base stack either (it only writes `controller.yaml`, [setup/handlers.go:398-514](controller/internal/setup/handlers.go#L398)); there is no deploy path to reuse.
**Prerequisites / ordering constraints for (a):**
1. **Port the traefik + cloudflared compose/config generators into the controller** (Go templates from `controller.yaml`). This is the main build item; filebrowser's generator already exists but must **drop its "skip if absent" early-return** ([web/handlers.go:1304](controller/internal/web/handlers.go#L1304)) so it can create on first boot.
2. **Bake the three infra images (pinned) into the golden** (build-golden.sh) so first-boot deploy is offline-capable; pin the two `:latest` tags.
3. **Create the `traefik-public` docker network** + the stack dirs as part of bring-up (absent on 9201 today).
4. Run only when configured (post-`MaybeIngest`, `NeedsSetup==false`) and after Docker is reachable; make it idempotent (no-op when the protected containers are already up).
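
The prerequisites above can be sketched as a small idempotent reconcile step. This is an illustration under stated assumptions, not the controller's API: the `containerRuntime` interface, `ensureBaseInfra`, and `fakeRuntime` are hypothetical stand-ins for whatever `stacks.Manager` and the Docker client actually expose. Only the protected list, the `traefik-public` network name, and the 9201 starting state come from this report.

```go
package main

import "fmt"

// containerRuntime is a hypothetical abstraction over the Docker
// operations this step needs.
type containerRuntime interface {
	RunningContainers() (map[string]bool, error)
	EnsureNetwork(name string) error
	Deploy(stack string) error
}

// ensureBaseInfra deploys every protected stack whose container is not
// running and returns the names it deployed. When nothing is missing it
// does nothing, so it is safe to call at startup and from the health job.
func ensureBaseInfra(rt containerRuntime, protected []string) ([]string, error) {
	running, err := rt.RunningContainers()
	if err != nil {
		return nil, fmt.Errorf("list containers: %w", err)
	}
	var missing []string
	for _, name := range protected {
		if !running[name] {
			missing = append(missing, name)
		}
	}
	if len(missing) == 0 {
		return nil, nil // idempotent no-op
	}
	// The shared network must exist before any stack that joins it.
	if err := rt.EnsureNetwork("traefik-public"); err != nil {
		return nil, fmt.Errorf("ensure network: %w", err)
	}
	for _, name := range missing {
		if err := rt.Deploy(name); err != nil {
			return nil, fmt.Errorf("deploy %s: %w", name, err)
		}
	}
	return missing, nil
}

// fakeRuntime is an in-memory stand-in so the sketch runs without Docker.
type fakeRuntime struct {
	running  map[string]bool
	networks map[string]bool
}

func (f *fakeRuntime) RunningContainers() (map[string]bool, error) { return f.running, nil }
func (f *fakeRuntime) EnsureNetwork(name string) error             { f.networks[name] = true; return nil }
func (f *fakeRuntime) Deploy(stack string) error                   { f.running[stack] = true; return nil }

func main() {
	// Starting state as observed on 9201: only the controller is running.
	rt := &fakeRuntime{
		running:  map[string]bool{"felhom-controller": true},
		networks: map[string]bool{},
	}
	protected := []string{"traefik", "cloudflared", "felhom-controller", "filebrowser"}

	first, _ := ensureBaseInfra(rt, protected)
	fmt.Println("first pass deployed:", first)
	second, _ := ensureBaseInfra(rt, protected)
	fmt.Println("second pass deployed:", second)
}
```

Because the function derives its work from the observed state rather than from a "first boot" flag, the same call covers initial bring-up and later self-healing.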
---
## G. Additional gap surfaced (flag — needs validation before the spec)
**The bootstrap `docker run` does not bind-mount the stacks dir or `/opt/docker` from the LXC host.** It mounts only ([build-golden.sh:94-99](../felhom-agent/configs/build-golden.sh#L94)):
```
-v /etc/felhom-bootstrap:/etc/felhom-bootstrap:ro
-v felhom-controller-data:/opt/docker/felhom-controller   (named volume)
-v /var/run/docker.sock:/var/run/docker.sock
```
So `paths.stacks_dir = /opt/docker/stacks` exists **only inside the controller container**, while `docker compose up` (invoked by the controller over the shared socket) is executed by the **host LXC's** Docker daemon. Compose files are read by the in-container CLI, but **bind-mount sources** in those compose files (e.g. traefik's `./traefik.yml:/etc/traefik/...`, filebrowser's `./config.yaml`, app `HDD_PATH` mounts) are resolved by the **daemon on the host filesystem**, where `/opt/docker/stacks/...` does **not** exist. On bare metal this worked because `/opt/docker/stacks` was a shared host bind-mount into the controller.
If confirmed, this is a **path-namespace mismatch that affects ALL stack deploys** (every catalog app, not just base infra), so it sits squarely in the blast radius of "stand up the base stack." It is inferred from the mount topology and from how the controller shells out to `docker compose` with `cmd.Dir=stackDir`; it was **not** live-exercised here (no deploy attempted, per the read-only rule). **Recommend the bring-up spec validate this explicitly** and, if confirmed, add a host bind-mount (e.g. `-v /opt/docker/stacks:/opt/docker/stacks`) to the bootstrap `docker run` so container and daemon agree on the path.
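
The suspected mismatch can be reasoned about without deploying anything. The sketch below is a model of the mount topology only, under the assumption that a path agrees between container and daemon exactly when it sits under a bind mount whose host and container paths are identical. `bindMount`, `resolveBindSource`, and `visibleOnHost` are hypothetical names, and the filebrowser `./config.yaml` source is the example cited above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// bindMount models one "-v host:container" entry on the controller's
// docker run. For a named volume, Host holds the volume name.
type bindMount struct{ Host, Container string }

// resolveBindSource mimics how a relative bind source in a compose file
// is made absolute against the project directory.
func resolveBindSource(projectDir, source string) string {
	if path.IsAbs(source) {
		return path.Clean(source)
	}
	return path.Join(projectDir, source)
}

// visibleOnHost reports whether a path seen inside the controller names
// the same location for the host daemon, i.e. whether it lies under a
// bind mount with identical host and container paths.
func visibleOnHost(p string, mounts []bindMount) bool {
	for _, m := range mounts {
		if m.Host != m.Container {
			continue // named volumes and remapped paths do not line up
		}
		if p == m.Container || strings.HasPrefix(p, m.Container+"/") {
			return true
		}
	}
	return false
}

func main() {
	// Mounts of the bootstrap docker run as listed above.
	today := []bindMount{
		{"/etc/felhom-bootstrap", "/etc/felhom-bootstrap"},
		{"felhom-controller-data", "/opt/docker/felhom-controller"},
		{"/var/run/docker.sock", "/var/run/docker.sock"},
	}
	// Same list plus the suggested host bind-mount.
	fixed := append(append([]bindMount{}, today...),
		bindMount{"/opt/docker/stacks", "/opt/docker/stacks"})

	src := resolveBindSource("/opt/docker/stacks/filebrowser", "./config.yaml")
	fmt.Println(src, "visible today:", visibleOnHost(src, today))
	fmt.Println(src, "visible with fix:", visibleOnHost(src, fixed))
}
```

This only models the topology; the live validation recommended above is still needed, since the actual deploy behaviour was not exercised.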
---
## Evidence index (live repo file:line)
- No base-stack deploy caller: [api/router.go:350](controller/internal/api/router.go#L350) is the sole `DeployStack` caller; startup [cmd/controller/main.go:56-711](controller/cmd/controller/main.go#L56).
- Detect-only health: [monitor/healthcheck.go:159-181](controller/internal/monitor/healthcheck.go#L159).
- Infra compose source (bare-metal only): [scripts/docker-setup.sh:876](scripts/docker-setup.sh#L876) / [:1067](scripts/docker-setup.sh#L1067) / [:1263](scripts/docker-setup.sh#L1263).
- Filebrowser generator + "skip if absent": [web/handlers.go:1295-1383](controller/internal/web/handlers.go#L1295).
- Protected list written to yaml, no deploy: [setup/handlers.go:475-481](controller/internal/setup/handlers.go#L475).
- Bootstrap pull/merge (configured-on-first-boot): [bootstrap/bootstrap.go:100-162](controller/internal/bootstrap/bootstrap.go#L100); customer.id field [:66](controller/internal/bootstrap/bootstrap.go#L66).
- Reported hostname = os.Hostname: [report/builder.go:75](controller/internal/report/builder.go#L75).
- Golden bake + bootstrap `docker run` (no `--hostname`, mounts): [felhom-agent/configs/build-golden.sh:38,71,94-99,146](../felhom-agent/configs/build-golden.sh#L94).
- Agent hostname-set mechanism: [felhom-agent/internal/reconcile/bringup.go:303-304](../felhom-agent/internal/reconcile/bringup.go#L303); provision wiring [felhom-agent/cmd/felhom-agent/main.go:1039-1041](../felhom-agent/cmd/felhom-agent/main.go#L1039).
### Live 9201 output (secrets redacted)
- `pct status 9201` → running; `docker ps` → only `felhom-controller … Up (healthy)`.
- `docker network ls` → `bridge / host / none` (no `traefik-public`).
- `ls /opt/docker/` → `No such file or directory`.
- `docker inspect felhom-controller --format '{{.Config.Hostname}}'` → `3dff0fe73b5c`.
- `pct exec 9201 -- hostname` → `felhom-golden`; `/etc/pve/lxc/9201.conf` → `hostname: felhom-golden`.
- `docker logs felhom-controller` → repeating `[monitor] Health check: status=fail`; `[stacks] ScanStacks complete: 52 stacks found (0 deployed, 52 available)`.
- Merged `controller.yaml` keys present: `infrastructure.cf_tunnel_token`, `infrastructure.cf_api_token`, `customer.domain=demo-felhom.eu`, `customer.email=admin@felhom.eu`, `stacks.protected=[traefik,cloudflared,felhom-controller,filebrowser]`, `assets.sync_enabled=false`, `paths.stacks_dir=/opt/docker/stacks`.