# REPORT: Live demo validation of felhom-controller v0.39.0 + 8C orphan-template cleanup (2026-06-11)

Two things this session: (1) **provisioned a fresh customer guest from a v0.39.0 golden on the demo Proxmox host and walked the full controller flow**, reporting what works vs breaks against the live guest; (2) a small source-hygiene code change: deleting five dead 8C orphan templates (v0.39.1). The code change is **source-only** (ships in the next golden); the running demo guest stays the v0.39.0 golden provisioned below.

---

## Premise correction (surfaced to Viktor, decision taken)

The session brief said "controller v0.39.0 is built and baked into the current golden." That was **false on the baked half**:

- v0.39.0 **source** committed (`d8d1e17`, slice 9). ✓
- v0.39.0 **image** built & pushed to the registry, but under tag **`0.39.0`** (no `v` prefix, unlike every prior build; same digest as `latest`). The `v0.39.0` alias was **never pushed**.
- v0.39.0 **golden** did **not exist**: the newest golden baked **v0.38.0** (others v0.37.0/.36/.35).

Per the plan's Step-1.6 gate, this was surfaced. **Viktor chose: re-bake the v0.39.0 golden, then run the full session.** Done (Step 1b below).
---

## Step 1: Discovery (observed)

| Item | Value |
|---|---|
| Live agent config path | `/root/.config/felhom-agent/agent.json` (NOT the `/etc/...` default; confirmed from `systemctl cat felhom-agent` `ExecStart`) |
| Agent version (deployed) | **v0.18.0** (serves `GET /host/metrics`, `/disks`, `/disks/format`) |
| `local_api` | `enable: true`, `listen_addr: 192.168.0.162:8443` ✓ |
| `backup.restore_storage` | `local-lvm` ✓ |
| `hub` | `url: https://hub.felhom.eu`, `host_id: demo-felhom-01`, `api_key` set (stored out-of-band) ✓ |
| Free VMID chosen | **9200** (in use: 9001 spike, 9999 selftest-scratch) |
| Golden volid (re-baked) | `local:backup/vzdump-lxc-9100-2026_06_11-11_55_03.tar.zst` (baked image `gitea.dooplex.hu/admin/felhom-controller:0.39.0`, verified by extracting `/etc/felhom-controller-image` from the archive) |
| Hub row key | Controller report keys on `customer.id`; the existing **`demo-felhom` customer row exists** (stale, last reports Feb 2026 up to v0.28.8). Agent host-report keys on `host_id=demo-felhom-01` (separate row). |
| Daemon contention | **None at mint time**: provision ran with the daemon up, no token-store/leaf lock. (But see the post-provision 401 finding under Step 2.) |

### Step 1b: re-bake the v0.39.0 golden

- The supplied Gitea token (read-only) **lacks container-registry scope**: the build LXC's `docker pull` failed `401 unauthorized` (verified: the token gets HTTP 401 from the registry token endpoint). Re-used the **build server's existing registry credential** (package-scoped, the same one prior bakes used; **stored out-of-band**) purely inside the build LXC. `build-golden.sh` logs out + removes `/root/.docker/config.json` before archiving, so **no credential is baked** into the golden; the build logfile contains no secret.
- New golden built (VMID 9100 build LXC → archived → build guest destroyed). Baked image confirmed `:0.39.0`.
- **Follow-up for Viktor:** push the `v0.39.0` tag alias for naming convention, and bump the `build-golden.sh:33` default off the stale `:v0.35.0`.

---

## Step 2: Provision guest 9200 as `demo-felhom`

`felhom-agent --selftest=provision -config /root/.config/felhom-agent/agent.json -archive -vmid 9200 -customer-id demo-felhom -customer-domain demo-felhom.eu -customer-name "Felhom Demo" -hostname felhom-demo` → `selftest=provision OK — guest 9200 provisioned + bootstrap-mounted (KEPT)`.

**`pct reboot 9200` was mandatory** (the bootstrap oneshot is gated on the mount existing at boot, and provision attaches the mount *after* the front-half boot). After one reboot the controller deployed:

| Assertion | Result |
|---|---|
| Controller running | `gitea.dooplex.hu/admin/felhom-controller:0.39.0 Up (healthy)`; startup line `felhom-controller 0.39.0 starting (customer: demo-felhom, domain: demo-felhom.eu)` |
| Not setup mode | ✓: no `setup mode` line; came up **configured** |
| Bootstrap-seeded config | ✓: `controller.yaml` written **0600 root:root** at boot, with customer id + hub + `local_api.{endpoint,fingerprint,token}` |
| De-privileged container | ✓: `Privileged=false`; mounts = **exactly 3**: `/etc/felhom-bootstrap` (ro), `felhom-controller-data` volume (rw), `/var/run/docker.sock` (rw). No `/dev`, `/etc/fstab`, `/mnt` rshared, `/sys`, `/run/udev`. LXC `unprivileged=1`, features `nesting,keyctl` only, single `mp9` bootstrap mount, no device passthrough/hookscript. |
| No registry pull | ✓: bootstrap unit did `docker run` (no pull); image shows `created 20 hours ago` (golden bake time) |
| bootstrap.json identity | ✓: `/etc/felhom-bootstrap/bootstrap.json` `600 root:root`, `customer.id=demo-felhom`, hub creds, `local_api.{endpoint=192.168.0.162:8443, fingerprint=60b5974d…, token}` (token/api_key out-of-band) |

### FINDING 2a: local-API channel 401 until the agent daemon is restarted

At controller startup: `local-api: GET /storage failed (agentapi: GET /storage: HTTP 401) — channel not verified`.

Root cause: provision minted the per-guest token and durably recorded its hash (`/var/lib/felhom-agent/local-tokens.log` carries `{"v":9200,...}`), but the **already-running daemon loaded its in-memory token map at its own startup, before the mint**, so it rejected the controller's token. **`systemctl restart felhom-agent` cleared it** (the daemon reloads the durable store on restart; minted hash persists). After restart, all agent-channel calls succeed.

**Recommendation:** provision should signal the running daemon to reload the token store (or the local-API should consult the durable store per-request), OR the provisioning runbook must include a post-provision `systemctl restart felhom-agent`. As-is, a guest provisioned while the daemon runs has a dead local-API channel until the daemon restarts.

### FINDING 2b: bootstrap seed-log line absent (cosmetic)

The seed functionally worked (controller.yaml 0600 written at boot, configured, not setup mode), but the explicit `[INFO] bootstrap: seeded … coming up configured` line (`bootstrap/bootstrap.go:111`) did **not** appear in `docker logs`. Functionally correct; logging-only discrepancy worth a glance.

---

## Step 3: Full flow (validated from inside the container via `docker exec … curl localhost:8080`)

The bootstrap `docker run` publishes no port (bridge-only), so host-`localhost` is refused; all checks ran inside the container (host=localhost passes the catch-all gate).

1. **UI renders**: all HTTP 200 with Hungarian markers: `/` **Vezérlőpult** (open dashboard; no web password set, logged `no password configured — dashboard is open`), `/stacks` (Alkalmazások), `/monitoring` **Rendszermonitor** + the slice-9 **"Szerver állapota (gazdagép)"** host card, `/backups` (Biztonsági mentés), `/settings` (Beállítások).
   - **Host-gate routing proven live:** `Host: felhom.demo-felhom.eu` → **200**; `Host: 1.2.3.4` → **404** (catch-all). External browser access (`felhom.demo-felhom.eu`) needs the **Cloudflare tunnel**, which is **unconfigured on a bare bootstrap** (`cf_tunnel_token` empty). This is the same slice-10 onboarding gap as below; not separately testable without the tunnel/DNS.
2. **`/api/host-metrics` (slice 9)**: `{ok:true}` HTTP 200. **Cross-checked against the host:** memory_total **exact** (16537989120 = pvesh), memory_used ~match (2.49 vs 2.43 GB), loadavg same ballpark, uptime match (324255 vs 324283), disk match (93.9 GB / 10.5% vs `df` 94G/12%), felhom-usb match (915.8 GiB / 0.87%, SMART 31°C). **`cpu_temp_c` is a real value**: read 46 live, matching sysfs `x86_pkg_temp=44`/coretemp max 46 (`sensors` is **not installed**; the agent sources temp from coretemp/thermal, not the `sensors` CLI). An earlier reading of 59 was a genuine transient load peak during the golden-build+provision.
3. **`/api/disks` (8C proxy)**: `{ok:true}` HTTP 200, 4 devices with data-bearing flags; **felhom-usb flagged `data_bearing:true` (reason "device is mounted", `/dev/sdb1`)**.

### BREAK 3.2: storage auto-discovery (by-design consequence of 8C de-privileging)

The 1TB HDD does **NOT** appear under Settings → Storage Paths. The de-privileged container sees **no host storage mounts** (`df` inside shows only the bootstrap bind, its own disk, udev; `/mnt` empty; logs: "no storage paths registered", "stat /mnt/sys_drive: no such file or directory"). This is correct for the 8C model (mounts limited to bootstrap+data+docker.sock).
The HDD is instead surfaced via the **agent host-metrics storage view** (felhom-usb appears there with capacity + SMART). The legacy local `discoverHDDPaths` path is effectively **vestigial** in v0.39.0; worth retiring or repurposing onto the agent-sourced storage list.

### BREAK 3.3: deploy→run→remove an app is not exercisable on a bare bootstrap

`/api/stacks` → `{ok:true,data:[]}` (empty catalog). Catalog templates come from a synced `catalog-cache/templates`, populated by git/assets sync, which is **disabled on a bootstrap-seeded controller** (manual mode, no repo URL; the seed sets only identity/hub/local-api). No apps → nothing to deploy (`POST /api/stacks/filebrowser/deploy` → 400 "invalid request body"; even with a body the slug isn't in the catalog). Catalog configuration is **slice-10 (hub desired-state)** territory.

### BREAK 3.5: hub reporting stays DOWN (HTTP 401), host-key vs customer-key gap

The controller's startup push and all 3 retries got **HTTP 401** from the hub (`[report] Push failed: HTTP 401`), reproduced directly (`POST https://hub.felhom.eu/api/v1/report` with the baked key → `Unauthorized` / 401).

**Root cause (code-traced + DB-confirmed):**

- The hub's `POST /api/v1/report` authenticates a **customer-scoped** key: `checkAuthCustomer` → `GetCustomerConfigByAPIKey` against the `customer_configs` table, then enforces `authCustomerID == payload.CustomerID` (`hub/internal/api/handler.go:74-92, 208-234`).
- Provision baked the **agent's HOST key** (`hub.api_key`, keyed on `host_id=demo-felhom-01` in the separate `hosts` table) into the guest's bootstrap. Host keys and customer keys are **distinct tables / code paths**.
- Hub DB confirms: a `demo-felhom` customer row exists (with its **own** dashboard-generated api_key), a `demo-felhom-01` host row exists, and the baked key appears **once** (as the host key). So the controller presents the host key → `GetCustomerConfigByAPIKey` returns nil → **401**.

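The key mismatch traced above can be modelled in a few lines of Go. This is a minimal sketch, not the hub's actual code: `keyStore`, `authorizeReport`, the two error values and the placeholder keys are hypothetical names introduced for illustration, and the two maps merely stand in for the `customer_configs` and `hosts` tables.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errUnknownKey       = errors.New("api key not found among customer keys")
	errCustomerMismatch = errors.New("api key belongs to a different customer")
)

// keyStore is a hypothetical in-memory stand-in for the hub's two key tables.
type keyStore struct {
	customerKeys map[string]string // api_key -> customer id (models customer_configs)
	hostKeys     map[string]string // api_key -> host id (models hosts)
}

// authorizeReport models a customer-scoped check: the key must resolve among
// the customer keys, and its customer id must match the one in the payload.
// Host keys are deliberately never consulted here.
func (s keyStore) authorizeReport(apiKey, payloadCustomerID string) error {
	authCustomerID, ok := s.customerKeys[apiKey]
	if !ok {
		return errUnknownKey
	}
	if authCustomerID != payloadCustomerID {
		return errCustomerMismatch
	}
	return nil
}

func main() {
	// Placeholder keys; the real ones are stored out-of-band.
	store := keyStore{
		customerKeys: map[string]string{"example-customer-key": "demo-felhom"},
		hostKeys:     map[string]string{"example-host-key": "demo-felhom-01"},
	}

	// The bootstrap carried the host key, so the customer lookup misses.
	err := store.authorizeReport("example-host-key", "demo-felhom")
	_, isHostKey := store.hostKeys["example-host-key"]
	fmt.Println(err, "| known as host key:", isHostKey)

	// The customer's own key passes for its own customer id.
	fmt.Println(store.authorizeReport("example-customer-key", "demo-felhom"))
}
```

In this model a valid host key fails the same way an unknown key does, which matches the observed 401: the key is genuine but lives in the table the report endpoint never reads.
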

The `demo-felhom` hub row therefore **stays DOWN/stale**: the freshly-provisioned controller can never report ONLINE until the bootstrap carries the **customer-scoped** api_key (or provision creates/fetches the customer config key, or the hub accepts host keys for customer reports). This is a **cross-component provisioning gap (slice-10 onboarding)**, not a controller bug. *(Reported value differs from the brief's "v0.34.0": the DB shows last reports up to v0.28.8; immaterial, the row is stale either way.)*

---

## Step 4: Disk proxy (API-level only; NO destructive op), INVARIANT PROVEN

1. **List**: `/api/disks` `{ok:true}` with per-device data-bearing flags (above).
2. **Data-bearing format → refusal**: hit the agent **directly** (dodges the controller's CSRF): `POST https://192.168.0.162:8443/disks/format` with the guest's `local_api.token`, body `{"vmid":9200,"device":"/dev/sdb1","fstype":"ext4"}` (felhom-usb: mounted, data-bearing). Result: **HTTP 403** with body `{ "formatted": false, "data_bearing": true, "reason": "device is mounted", "pending_op": { "op":"storage_wipe", "host_scope":"demo-felhom-01", "durable_id":"byid:wwn-0x5000039ddb108568-part1", "fstype":"ext4" }, "error": "device is data-bearing — format requires an operator signature (pending_signature)" }`

   The agent **inspected the device itself** (`data_bearing:true`, reason "device is mounted"), ignored any caller claim, refused with `pending_signature`, and surfaced the durable-id-bound op to sign. **The disk was untouched** (post-test: `/dev/sdb1 ext4 915.8G, 8G used`, still mounted). No operator signature was ever passed. The controller maps this 403→409 (`agentapi.ErrFormatRefused`, already unit-tested). ✓

---

## Step 5 / 6: Orphan-template cleanup + gate + push

- **Deleted** (re-confirmed unreferenced first: `grep -rn` over `internal/` matched only the templates' own `{{define}}` lines): `internal/web/templates/{storage_init, storage_attach, migrate, migrate_drive, restore}.html`.
Embed is a glob; 14 templates remain.
- **Noted, not deleted** (dead-but-harmless): `NotifyCrossDriveCompleted`/`NotifyCrossDriveFailed` (`notify/notifier.go:353,359`, no callers) + a vestigial `crossdrive_failed` notification toggle (`web/handlers.go:937`) + restic config fields/comments. Flagged for a future dedicated cleanup.
- **Version:** v0.39.0 → **v0.39.1** (CHANGELOG entry added; version is ldflags-injected, applied at the next build). Source-only; no re-bake this session.
- **Gate:** `go build ./...` **OK**; `go test ./...` **green** (agentapi, bootstrap, quiesce).
- **Commit:** see CHANGELOG / git log; pushed to `main`. No working UI feature lost (the deleted pages were already unreachable; removed routes).

---

## What broke / what's missing (the headline)

| # | Item | Severity | Nature |
|---|---|---|---|
| 2a | Local-API channel 401 until `felhom-agent` restart after provisioning a live-daemon host | **Medium** | Provision doesn't make the running daemon reload its token store. Workaround: restart the daemon (done). Needs a provision→daemon reload signal or per-request store lookup. |
| 3.5 | Hub report 401: bootstrap bakes the **host** api_key, but `/api/v1/report` needs the **customer** api_key | **Medium/High** | Cross-component provisioning gap (slice-10 onboarding). Controller stays DOWN on the hub until fixed. |
| 3.3 | No app catalog on a bare bootstrap (git/assets sync disabled); deploy not exercisable | Expected (slice-10) | Catalog/desired-state comes from the hub later. |
| 3.2 | Legacy Storage-Paths auto-discovery finds nothing (de-privileged container has no host mounts) | Expected (8C) | HDD is correctly surfaced via agent host-metrics instead; retire/repurpose the legacy path. |
| 3.1 | External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel (`cf_tunnel_token` empty) | Expected (slice-10) | Host-gate routing itself verified live (200 vs 404). |
| 2b | Bootstrap seed-log line absent | Cosmetic | Functionally correct; logging-only. |
| infra | Supplied Gitea token lacks registry/package scope; build-golden default tag stale (`:v0.35.0`) | Low | Used the build-server credential for the re-bake; flagged both for Viktor. |

**Worked cleanly:** golden re-bake, provision + reboot deploy, configured-not-setup bootstrap, 0600 bootstrap.json/controller.yaml, container de-privileging (Privileged=false, 3 mounts), no-registry-pull, all 5 UI pages + slice-9 host card, host-metrics (cross-checked, real cpu_temp), `/api/disks`, and the **8C data-bearing format-refusal invariant (403, disk untouched)**.
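For item 2a, the "per-request store lookup" option could take roughly the following shape. This is a sketch under stated assumptions, not the agent's implementation: `tokenCache`, `durableLoader`, `hashToken` and `lookup` are hypothetical names, SHA-256 is an assumed hash, and the durable store is abstracted behind a callback rather than parsed from `local-tokens.log`, whose full record format is not shown in this report.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// durableLoader is a hypothetical hook returning every token hash currently
// recorded in the durable store, mapped to the VMID it was minted for.
type durableLoader func() (map[string]int, error)

// tokenCache keeps the in-memory map the daemon loads at startup.
type tokenCache struct {
	mu     sync.Mutex
	byHash map[string]int
	load   durableLoader
}

// hashToken uses SHA-256 purely as an assumption for this sketch.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// lookup returns the VMID bound to token. On a cache miss it reloads the
// durable store once before rejecting, so a token minted after the daemon
// started is still accepted without a restart.
func (c *tokenCache) lookup(token string) (int, bool) {
	h := hashToken(token)
	c.mu.Lock()
	defer c.mu.Unlock()
	if vmid, ok := c.byHash[h]; ok {
		return vmid, true
	}
	fresh, err := c.load()
	if err != nil {
		return 0, false // fail closed if the durable store is unreadable
	}
	c.byHash = fresh
	vmid, ok := c.byHash[h]
	return vmid, ok
}

func main() {
	durable := map[string]int{} // stands in for the on-disk token log
	cache := &tokenCache{
		byHash: map[string]int{}, // daemon started before any mint
		load: func() (map[string]int, error) {
			snapshot := make(map[string]int, len(durable))
			for h, v := range durable {
				snapshot[h] = v
			}
			return snapshot, nil
		},
	}

	// Provision mints a token for guest 9200 while the daemon is running.
	durable[hashToken("example-token")] = 9200

	vmid, ok := cache.lookup("example-token")
	fmt.Println(vmid, ok)

	_, ok = cache.lookup("wrong-token")
	fmt.Println(ok)
}
```

Reloading on every miss means each invalid token costs a read of the durable store, so a real implementation would probably rate-limit the reload or prefer the explicit provision→daemon reload signal.
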