REPORT — Live demo validation of felhom-controller v0.39.0 + 8C orphan-template cleanup (2026-06-11)
Two things this session: (1) provisioned a fresh customer guest from a v0.39.0 golden on the demo Proxmox host and walked the full controller flow, reporting what works vs breaks against the live guest; (2) a small source-hygiene code change — deleting five dead 8C orphan templates (v0.39.1).
The code change is source-only (ships in the next golden); the running demo guest stays the v0.39.0 golden provisioned below.
Premise correction (surfaced to Viktor, decision taken)
The session brief said "controller v0.39.0 is built and baked into the current golden." That was false on the baked half:
- v0.39.0 source committed (d8d1e17, slice 9). ✓
- v0.39.0 image built & pushed to the registry, but under tag 0.39.0 (no "v" prefix, unlike every prior build; same digest as latest). The v0.39.0 alias was never pushed.
- v0.39.0 golden did not exist: the newest golden baked v0.38.0 (others v0.37.0/.36/.35).
Per the plan's Step-1.6 gate, this was surfaced. Viktor chose: re-bake the v0.39.0 golden, then run the full session. Done (Step 1b below).
Step 1 — Discovery (observed)
| Item | Value |
|---|---|
| Live agent config path | /root/.config/felhom-agent/agent.json (NOT the /etc/... default — confirmed from systemctl cat felhom-agent ExecStart) |
| Agent version (deployed) | v0.18.0 (serves GET /host/metrics, /disks, /disks/format) |
| local_api | enable: true, listen_addr: 192.168.0.162:8443 ✓ |
| backup.restore_storage | local-lvm ✓ |
| hub | url: https://hub.felhom.eu, host_id: demo-felhom-01, api_key set (stored out-of-band) ✓ |
| Free VMID chosen | 9200 (in use: 9001 spike, 9999 selftest-scratch) |
| Golden volid (re-baked) | local:backup/vzdump-lxc-9100-2026_06_11-11_55_03.tar.zst (baked image gitea.dooplex.hu/admin/felhom-controller:0.39.0, verified by extracting /etc/felhom-controller-image from the archive) |
| Hub row key | Controller report keys on customer.id; the existing demo-felhom customer row exists (stale, last reports Feb 2026 up to v0.28.8). Agent host-report keys on host_id=demo-felhom-01 (separate row). |
| Daemon contention | None at mint time — provision ran with the daemon up, no token-store/leaf lock. (But see the post-provision 401 finding under Step 2.) |
Step 1b — re-bake the v0.39.0 golden
- The supplied Gitea token (read-only) lacks container-registry scope: the build LXC's docker pull failed with 401 unauthorized (verified: the token gets HTTP 401 from the registry token endpoint). Re-used the build server's existing registry credential (package-scoped, the same one prior bakes used; stored out-of-band) purely inside the build LXC. build-golden.sh logs out and removes /root/.docker/config.json before archiving, so no credential is baked into the golden; the build logfile contains no secret.
- New golden built (VMID 9100 build LXC → archived → build guest destroyed). Baked image confirmed :0.39.0.
- Follow-up for Viktor: push the v0.39.0 tag alias for naming convention, and bump the build-golden.sh:33 default off the stale :v0.35.0.
Step 2 — Provision guest 9200 as demo-felhom
felhom-agent --selftest=provision -config /root/.config/felhom-agent/agent.json -archive <golden> -vmid 9200 -customer-id demo-felhom -customer-domain demo-felhom.eu -customer-name "Felhom Demo" -hostname felhom-demo → selftest=provision OK — guest 9200 provisioned + bootstrap-mounted (KEPT).
pct reboot 9200 was mandatory (the bootstrap oneshot is gated on the mount existing at boot, and
provision attaches the mount after the front-half boot). After one reboot the controller deployed:
| Assertion | Result |
|---|---|
| Controller running | gitea.dooplex.hu/admin/felhom-controller:0.39.0 Up (healthy) — startup line felhom-controller 0.39.0 starting (customer: demo-felhom, domain: demo-felhom.eu) |
| Not setup mode | ✓ — no setup mode line; came up configured |
| Bootstrap-seeded config | ✓ — controller.yaml written 0600 root:root at boot, with customer id + hub + local_api.{endpoint,fingerprint,token} |
| De-privileged container | ✓ — Privileged=false; mounts = exactly 3: /etc/felhom-bootstrap (ro), felhom-controller-data volume (rw), /var/run/docker.sock (rw). No /dev, /etc/fstab, /mnt rshared, /sys, /run/udev. LXC unprivileged=1, features nesting,keyctl only, single mp9 bootstrap mount, no device passthrough/hookscript. |
| No registry pull | ✓ — bootstrap unit did docker run (no pull); image shows created 20 hours ago (golden bake time) |
| bootstrap.json identity | ✓ — /etc/felhom-bootstrap/bootstrap.json 600 root:root, customer.id=demo-felhom, hub creds, local_api.{endpoint=192.168.0.162:8443, fingerprint=60b5974d…, token} (token/api_key out-of-band) |
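The "exactly 3 mounts" assertion above can be expressed as a small allow-list check. The following is a minimal sketch, not the project's code: the `mount` type and `checkMounts` helper are hypothetical, and the mount list is assumed to have been extracted already (for example from `docker inspect` output). Bind mounts are identified by their path and the named volume by its name, as in the table.

```go
package main

import (
	"fmt"
	"sort"
)

// mount is a hypothetical, simplified view of one container mount:
// Ref is the bind path or the volume name, RW is true for read-write.
type mount struct {
	Ref string
	RW  bool
}

// checkMounts compares the observed mounts against the three mounts the
// report lists for the de-privileged container and returns every deviation.
func checkMounts(got []mount) []string {
	want := map[string]bool{ // ref -> expected read-write flag
		"/etc/felhom-bootstrap":  false,
		"felhom-controller-data": true,
		"/var/run/docker.sock":   true,
	}
	var problems []string
	seen := map[string]bool{}
	for _, m := range got {
		rw, ok := want[m.Ref]
		switch {
		case !ok:
			problems = append(problems, "unexpected mount: "+m.Ref)
		case rw != m.RW:
			problems = append(problems, "wrong access mode: "+m.Ref)
		}
		seen[m.Ref] = true
	}
	for ref := range want {
		if !seen[ref] {
			problems = append(problems, "missing mount: "+ref)
		}
	}
	sort.Strings(problems) // stable output regardless of map iteration order
	return problems
}

func main() {
	observed := []mount{
		{"/etc/felhom-bootstrap", false},
		{"felhom-controller-data", true},
		{"/var/run/docker.sock", true},
	}
	fmt.Println(len(checkMounts(observed)) == 0)
	fmt.Println(checkMounts(append(observed, mount{"/dev", true})))
}
```

An allow-list catches a reintroduced /dev or /sys mount even if nobody thought to put it on a deny-list.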
FINDING 2a — local-API channel 401 until the agent daemon is restarted
At controller startup: local-api: GET /storage failed (agentapi: GET /storage: HTTP 401) — channel not verified. Root cause: provision minted the per-guest token and durably recorded its hash
(/var/lib/felhom-agent/local-tokens.log carries {"v":9200,...}), but the already-running daemon
loaded its in-memory token map at its own startup, before the mint — so it rejected the controller's
token. systemctl restart felhom-agent cleared it (the daemon reloads the durable store on
restart; minted hash persists). After restart, all agent-channel calls succeed.
Recommendation: provision should signal the running daemon to reload the token store (or the
local-API should consult the durable store per-request), OR the provisioning runbook must include a
post-provision systemctl restart felhom-agent. As-is, a guest provisioned while the daemon runs has a
dead local-API channel until the daemon restarts.
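The "consult the durable store per-request" option could look roughly like the sketch below: keep the in-memory map, but re-read the log once when a token misses. This is an illustration under assumptions, not the agent's implementation. Only the "v" key is visible in the report, so the "h" hash field, the SHA-256 hashing, and every type and function name here are hypothetical.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// tokenRecord is one JSON line of the durable log. Only the "v" (VMID) key is
// visible in the report; the "h" hash field is a hypothetical placeholder.
type tokenRecord struct {
	V int    `json:"v"`
	H string `json:"h"`
}

type tokenStore struct {
	mu     sync.Mutex
	path   string
	hashes map[string]int // token hash -> VMID
}

func hashToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// reload replaces the in-memory map with the log's current contents.
// Callers must hold s.mu.
func (s *tokenStore) reload() error {
	f, err := os.Open(s.path)
	if err != nil {
		return err
	}
	defer f.Close()
	fresh := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var r tokenRecord
		if json.Unmarshal(sc.Bytes(), &r) == nil && r.H != "" {
			fresh[r.H] = r.V
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	s.hashes = fresh
	return nil
}

// lookup resolves a token to its VMID, re-reading the log once on a miss.
func (s *tokenStore) lookup(tok string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	h := hashToken(tok)
	if v, ok := s.hashes[h]; ok {
		return v, true
	}
	if err := s.reload(); err != nil {
		return 0, false
	}
	v, ok := s.hashes[h]
	return v, ok
}

// runScenario mints a token after the store is already "running" and reports
// whether the minted token and an unknown token are accepted.
func runScenario() (vmid int, minted, stranger bool) {
	dir, err := os.MkdirTemp("", "tokens")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	s := &tokenStore{path: filepath.Join(dir, "local-tokens.log")}

	// Provision appends a record while the daemon's map is still empty.
	f, err := os.OpenFile(s.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		panic(err)
	}
	rec := tokenRecord{V: 9200, H: hashToken("guest-token")}
	if err := json.NewEncoder(f).Encode(rec); err != nil {
		panic(err)
	}
	f.Close()

	vmid, minted = s.lookup("guest-token")
	_, stranger = s.lookup("unknown-token")
	return vmid, minted, stranger
}

func main() {
	fmt.Println(runScenario())
}
```

One trade-off of reload-on-miss is that every request with a bad token triggers a disk read, so a real implementation would likely rate-limit reloads or prefer an explicit reload signal from provision.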
FINDING 2b — bootstrap seed-log line absent (cosmetic)
The seed functionally worked (controller.yaml 0600 written at boot, configured, not setup mode), but
the explicit [INFO] bootstrap: seeded … coming up configured line (bootstrap/bootstrap.go:111) did
not appear in docker logs. Functionally correct; logging-only discrepancy worth a glance.
Step 3 — Full flow (validated from inside the container via docker exec … curl localhost:8080)
The bootstrap docker run publishes no port (bridge-only), so host-localhost is refused — all checks
ran inside the container (host=localhost passes the catch-all gate).
- UI renders: all HTTP 200 with Hungarian markers:
  - / (Vezérlőpult; open dashboard: no web password set, logged "no password configured — dashboard is open")
  - /stacks (Alkalmazások)
  - /monitoring (Rendszermonitor, plus the slice-9 "Szerver állapota (gazdagép)" host card)
  - /backups (Biztonsági mentés)
  - /settings (Beállítások)
- Host-gate routing proven live: Host: felhom.demo-felhom.eu → 200; Host: 1.2.3.4 → 404 (catch-all). External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel, which is unconfigured on a bare bootstrap (cf_tunnel_token empty); same slice-10 onboarding gap as below, not separately testable without the tunnel/DNS.
- /api/host-metrics (slice 9): {ok:true} HTTP 200. Cross-checked against the host: memory_total exact (16537989120 = pvesh), memory_used ~match (2.49 vs 2.43 GB), loadavg same ballpark, uptime match (324255 vs 324283), disk match (93.9 GB / 10.5% vs df 94G/12%), felhom-usb match (915.8 GiB / 0.87%, SMART 31°C). cpu_temp_c is a real value: read 46 live, matching sysfs x86_pkg_temp=44 / coretemp max 46 (sensors is not installed; the agent sources temp from coretemp/thermal, not the sensors CLI). An earlier reading of 59 was a genuine transient load peak during the golden-build+provision.
- /api/disks (8C proxy): {ok:true} HTTP 200, 4 devices with data-bearing flags; felhom-usb flagged data_bearing:true (reason "device is mounted", /dev/sdb1).
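The host-gate behaviour observed above (known Host header → 200, anything else → 404) can be sketched as an HTTP middleware. This is an assumption-laden illustration: `hostGate` and `statusFor` are hypothetical names, and the allowed set simply mirrors the hosts seen to pass in this session; how the controller actually derives that set is not shown in the report.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// hostGate wraps next and answers 404 for any Host header not in allowed.
func hostGate(allowed map[string]bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host := strings.ToLower(r.Host)
		if h, _, err := net.SplitHostPort(host); err == nil {
			host = h // drop the port when one is present
		}
		if !allowed[host] {
			http.NotFound(w, r) // catch-all for unknown hosts
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory request with the given Host header and
// returns the status code the gated handler produces.
func statusFor(host string) int {
	// Assumed allow-list: the two hosts the report saw passing the gate.
	allowed := map[string]bool{"localhost": true, "felhom.demo-felhom.eu": true}
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Host = host
	rec := httptest.NewRecorder()
	hostGate(allowed, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	for _, h := range []string{"felhom.demo-felhom.eu", "localhost:8080", "1.2.3.4"} {
		fmt.Println(h, statusFor(h))
	}
}
```

Using `httptest` keeps the check in-process, which matches the constraint that the bootstrap container publishes no port.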
BREAK 3.2 — storage auto-discovery (by-design consequence of 8C de-privileging)
The 1TB HDD does NOT appear under Settings → Storage Paths. The de-privileged container sees no
host storage mounts (df inside shows only the bootstrap bind, its own disk, udev; /mnt empty;
logs: "no storage paths registered", "stat /mnt/sys_drive: no such file or directory"). This is correct
for the 8C model (mounts limited to bootstrap+data+docker.sock). The HDD is instead surfaced via the
agent host-metrics storage view (felhom-usb appears there with capacity + SMART). The legacy local
discoverHDDPaths path is effectively vestigial in v0.39.0 — worth retiring or repurposing onto the
agent-sourced storage list.
BREAK 3.3 — deploy→run→remove an app: not exercisable on a bare bootstrap
/api/stacks → {ok:true,data:[]} (empty catalog). Catalog templates come from a synced
catalog-cache/templates, populated by git/assets sync — disabled on a bootstrap-seeded controller
(manual mode, no repo URL; the seed sets only identity/hub/local-api). No apps → nothing to deploy
(POST /api/stacks/filebrowser/deploy → 400 "invalid request body"; even with a body the slug isn't in
the catalog). Catalog configuration is slice-10 (hub desired-state) territory.
BREAK 3.5 — hub reporting stays DOWN (HTTP 401): host-key vs customer-key gap
The controller's startup push and all 3 retries got HTTP 401 from the hub
([report] Push failed: HTTP 401), reproduced directly (POST https://hub.felhom.eu/api/v1/report
with the baked key → Unauthorized / 401). Root cause (code-traced + DB-confirmed):
- The hub's POST /api/v1/report authenticates a customer-scoped key: checkAuthCustomer → GetCustomerConfigByAPIKey against the customer_configs table, then enforces authCustomerID == payload.CustomerID (hub/internal/api/handler.go:74-92, 208-234).
- Provision baked the agent's HOST key (hub.api_key, keyed on host_id=demo-felhom-01 in the separate hosts table) into the guest's bootstrap. Host keys and customer keys are distinct tables / code paths.
- Hub DB confirms: a demo-felhom customer row exists (with its own dashboard-generated api_key), a demo-felhom-01 host row exists, and the baked key appears once (as the host key). So the controller presents the host key → GetCustomerConfigByAPIKey returns nil → 401.
The demo-felhom hub row therefore stays DOWN/stale — the freshly-provisioned controller can never
report ONLINE until the bootstrap carries the customer-scoped api_key (or provision creates/fetches
the customer config key, or the hub accepts host keys for customer reports). This is a cross-component
provisioning gap (slice-10 onboarding), not a controller bug. (Reported value differs from the
brief's "v0.34.0" — the DB shows last reports up to v0.28.8; immaterial, the row is stale either way.)
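The key mismatch can be reproduced in miniature with two in-memory maps standing in for the customer_configs and hosts tables. The sketch below is illustrative only: the types, placeholder keys, and `authorizeReport` are hypothetical, and the 403 returned for a customer-ID mismatch is an assumption, since the report does not say which status the hub uses in that case.

```go
package main

import "fmt"

type customerConfig struct{ CustomerID string }

// hub models the two separate key tables with in-memory maps.
type hub struct {
	customerConfigs map[string]customerConfig // api_key -> customer config
	hosts           map[string]string         // api_key -> host_id
}

// getCustomerConfigByAPIKey looks only in the customer table, so a host key
// can never match here.
func (h *hub) getCustomerConfigByAPIKey(key string) *customerConfig {
	if c, ok := h.customerConfigs[key]; ok {
		return &c
	}
	return nil
}

// authorizeReport returns the HTTP status for a report push.
func (h *hub) authorizeReport(apiKey, payloadCustomerID string) int {
	cfg := h.getCustomerConfigByAPIKey(apiKey)
	if cfg == nil {
		return 401 // unknown customer key, including any host key
	}
	if cfg.CustomerID != payloadCustomerID {
		return 403 // assumed status; the report does not state it
	}
	return 200
}

// demoHub mirrors the DB state described above with placeholder keys.
func demoHub() *hub {
	return &hub{
		customerConfigs: map[string]customerConfig{
			"customer-key-placeholder": {CustomerID: "demo-felhom"},
		},
		hosts: map[string]string{
			"host-key-placeholder": "demo-felhom-01",
		},
	}
}

func main() {
	h := demoHub()
	fmt.Println(h.authorizeReport("host-key-placeholder", "demo-felhom"))
	fmt.Println(h.authorizeReport("customer-key-placeholder", "demo-felhom"))
}
```

The hosts map is never consulted on the report path, which is the gap: a valid host key is indistinguishable from an unknown key there.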
Step 4 — Disk proxy (API-level only; NO destructive op) — INVARIANT PROVEN
- List: /api/disks {ok:true} with per-device data-bearing flags (above).
- Data-bearing format → refusal: hit the agent directly (dodges the controller's CSRF): POST https://192.168.0.162:8443/disks/format with the guest's local_api.token, body {"vmid":9200,"device":"/dev/sdb1","fstype":"ext4"} (felhom-usb: mounted, data-bearing). Result: HTTP 403 with body
  { "formatted": false, "data_bearing": true, "reason": "device is mounted", "pending_op": { "op":"storage_wipe", "host_scope":"demo-felhom-01", "durable_id":"byid:wwn-0x5000039ddb108568-part1", "fstype":"ext4" }, "error": "device is data-bearing — format requires an operator signature (pending_signature)" }
  The agent inspected the device itself (data_bearing:true, reason "device is mounted"), ignored any caller claim, refused with pending_signature, and surfaced the durable-id-bound op to sign. The disk was untouched (post-test: /dev/sdb1 ext4 915.8G, 8G used, still mounted). No operator signature was ever passed. The controller maps this 403→409 (agentapi.ErrFormatRefused, already unit-tested). ✓
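The invariant exercised here has two halves: the agent decides from its own inspection rather than from the caller's claim, and the controller translates the agent's 403 into a 409. A minimal sketch of both follows, covering only the unsigned path. All names except ErrFormatRefused are hypothetical; the 200 for a non-data-bearing device and the 502 fallback are assumptions not stated in the report.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// device is what the agent learns by inspecting the block device itself.
type device struct {
	Path    string
	Mounted bool
}

// agentFormatUnsigned models a format request that carries no operator
// signature. callerClaimsEmpty is accepted but deliberately never consulted.
func agentFormatUnsigned(inspected device, callerClaimsEmpty bool) (int, string) {
	if inspected.Mounted {
		return http.StatusForbidden, "device is mounted"
	}
	return http.StatusOK, "" // assumed outcome for a non-data-bearing device
}

// ErrFormatRefused stands in for the sentinel the report names.
var ErrFormatRefused = errors.New("format refused: operator signature required")

// fromAgentStatus turns the agent's HTTP status into a Go error.
func fromAgentStatus(status int) error {
	switch status {
	case http.StatusOK:
		return nil
	case http.StatusForbidden:
		return ErrFormatRefused
	default:
		return fmt.Errorf("agent returned HTTP %d", status)
	}
}

// controllerStatus picks the status the controller would return to its caller.
func controllerStatus(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrFormatRefused):
		return http.StatusConflict // the 403→409 mapping
	default:
		return http.StatusBadGateway // assumed fallback
	}
}

func main() {
	usb := device{Path: "/dev/sdb1", Mounted: true}
	status, reason := agentFormatUnsigned(usb, true) // caller lies: "it's empty"
	fmt.Println(status, reason, controllerStatus(fromAgentStatus(status)))
}
```

Keeping the refusal decision on the agent side means a compromised or buggy controller cannot talk the agent into wiping a mounted disk.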
Step 5 / 6 — Orphan-template cleanup + gate + push
- Deleted (re-confirmed unreferenced first: grep -rn over internal/ matched only the templates' own {{define}} lines): internal/web/templates/{storage_init, storage_attach, migrate, migrate_drive, restore}.html. Embed is a glob; 14 templates remain.
- Noted, not deleted (dead-but-harmless): NotifyCrossDriveCompleted / NotifyCrossDriveFailed (notify/notifier.go:353,359, no callers) + a vestigial crossdrive_failed notification toggle (web/handlers.go:937) + restic config fields/comments. Flagged for a future dedicated cleanup.
- Version: v0.39.0 → v0.39.1 (CHANGELOG entry added; version is ldflags-injected, applied at the next build). Source-only: no re-bake this session.
- Gate: go build ./... OK; go test ./... green (agentapi, bootstrap, quiesce).
- Commit: 6e77bea (the template deletion + this report), pushed to main (d8d1e17..6e77bea). No working UI feature lost (the deleted pages were already unreachable because their routes had been removed).
What broke / what's missing (the headline)
| # | Item | Severity | Nature |
|---|---|---|---|
| 2a | Local-API channel 401 until felhom-agent restart after provisioning a live-daemon host | Medium | Provision doesn't make the running daemon reload its token store. Workaround: restart the daemon (done). Needs a provision→daemon reload signal or per-request store lookup. |
| 3.5 | Hub report 401: bootstrap bakes the host api_key, but /api/v1/report needs the customer api_key | Medium/High | Cross-component provisioning gap (slice-10 onboarding). Controller stays DOWN on the hub until fixed. |
| 3.3 | No app catalog on a bare bootstrap (git/assets sync disabled) — deploy not exercisable | Expected (slice-10) | Catalog/desired-state comes from the hub later. |
| 3.2 | Legacy Storage-Paths auto-discovery finds nothing (de-privileged container has no host mounts) | Expected (8C) | HDD is correctly surfaced via agent host-metrics instead; retire/repurpose the legacy path. |
| 3.1 | External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel (cf_tunnel_token empty) | Expected (slice-10) | Host-gate routing itself verified live (200 vs 404). |
| 2b | Bootstrap seed-log line absent | Cosmetic | Functionally correct; logging-only. |
| infra | Supplied Gitea token lacks registry/package scope; build-golden default tag stale (:v0.35.0) | Low | Used the build-server credential for the re-bake; flagged both for Viktor. |
Worked cleanly: golden re-bake, provision + reboot deploy, configured-not-setup bootstrap, 0600
bootstrap.json/controller.yaml, container de-privileging (Privileged=false, 3 mounts), no-registry-pull,
all 5 UI pages + slice-9 host card, host-metrics (cross-checked, real cpu_temp), /api/disks, and the
8C data-bearing format-refusal invariant (403, disk untouched).