admin 6e77bea4d3 v0.39.1: 8C orphan-template cleanup (delete 5 dead templates)
Remove five orphaned HTML templates left behind when slice 8C retired the
disk/storage/restore web handlers (storage_handlers.go, handler_restore.go and
the /api/storage/* + /api/restore/* routes): storage_init, storage_attach,
migrate, migrate_drive, restore. Zero .go references, zero cross-template
references, no route, no nav entry; embed is a glob so deletion is safe (14
templates remain, build + tests green). No behaviour change; the deleted pages
were already unreachable.

Also ships the live demo validation (v0.39.0) writeup in REPORT.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 12:24:13 +02:00


REPORT — Live demo validation of felhom-controller v0.39.0 + 8C orphan-template cleanup (2026-06-11)

Two things this session: (1) provisioned a fresh customer guest from a v0.39.0 golden on the demo Proxmox host and walked the full controller flow, reporting what works vs breaks against the live guest; (2) a small source-hygiene code change — deleting five dead 8C orphan templates (v0.39.1).

The code change is source-only (ships in the next golden); the running demo guest stays the v0.39.0 golden provisioned below.


Premise correction (surfaced to Viktor, decision taken)

The session brief said "controller v0.39.0 is built and baked into the current golden." That was false on the baked half:

  • v0.39.0 source committed (d8d1e17, slice 9). ✓
  • v0.39.0 image built & pushed to the registry — but under tag 0.39.0 (no v prefix, unlike every prior build; same digest as latest). The v0.39.0 alias was never pushed.
  • v0.39.0 golden did not exist — the newest golden baked v0.38.0 (others v0.37.0/.36/.35).

Per the plan's Step-1.6 gate, this was surfaced. Viktor chose: re-bake the v0.39.0 golden, then run the full session. Done (Step 1b below).


Step 1 — Discovery (observed)

| Item | Value |
|---|---|
| Live agent config path | /root/.config/felhom-agent/agent.json (NOT the /etc/... default; confirmed from systemctl cat felhom-agent ExecStart) |
| Agent version (deployed) | v0.18.0 (serves GET /host/metrics, /disks, /disks/format) |
| local_api | enable: true, listen_addr: 192.168.0.162:8443 |
| backup.restore_storage | local-lvm |
| hub | url: https://hub.felhom.eu, host_id: demo-felhom-01, api_key set (stored out-of-band) ✓ |
| Free VMID chosen | 9200 (in use: 9001 spike, 9999 selftest-scratch) |
| Golden volid (re-baked) | local:backup/vzdump-lxc-9100-2026_06_11-11_55_03.tar.zst (baked image gitea.dooplex.hu/admin/felhom-controller:0.39.0, verified by extracting /etc/felhom-controller-image from the archive) |
| Hub row key | Controller report keys on customer.id; the existing demo-felhom customer row exists (stale, last reports Feb 2026 up to v0.28.8). Agent host-report keys on host_id=demo-felhom-01 (separate row). |
| Daemon contention | None at mint time: provision ran with the daemon up, no token-store/leaf lock. (But see the post-provision 401 finding under Step 2.) |

Step 1b — re-bake the v0.39.0 golden

  • The supplied Gitea token (read-only) lacks container-registry scope — the build LXC's docker pull failed 401 unauthorized (verified: the token gets HTTP 401 from the registry token endpoint). Re-used the build server's existing registry credential (package-scoped, the same one prior bakes used; stored out-of-band) purely inside the build LXC. build-golden.sh logs out + removes /root/.docker/config.json before archiving, so no credential is baked into the golden; the build logfile contains no secret.
  • New golden built (VMID 9100 build LXC → archived → build guest destroyed). Baked image confirmed :0.39.0.
  • Follow-up for Viktor: push the v0.39.0 tag alias for naming convention, and bump the build-golden.sh:33 default off the stale :v0.35.0.

Step 2 — Provision guest 9200 as demo-felhom

felhom-agent --selftest=provision -config /root/.config/felhom-agent/agent.json -archive <golden> -vmid 9200 -customer-id demo-felhom -customer-domain demo-felhom.eu -customer-name "Felhom Demo" -hostname felhom-demo

→ selftest=provision OK — guest 9200 provisioned + bootstrap-mounted (KEPT).

pct reboot 9200 was mandatory (the bootstrap oneshot is gated on the mount existing at boot, and provision attaches the mount after the front-half boot). After one reboot the controller deployed:

| Assertion | Result |
|---|---|
| Controller running | gitea.dooplex.hu/admin/felhom-controller:0.39.0 Up (healthy); startup line: felhom-controller 0.39.0 starting (customer: demo-felhom, domain: demo-felhom.eu) |
| Not setup mode | ✓ no setup mode line; came up configured |
| Bootstrap-seeded config | ✓ controller.yaml written 0600 root:root at boot, with customer id + hub + local_api.{endpoint,fingerprint,token} |
| De-privileged container | ✓ Privileged=false; mounts = exactly 3: /etc/felhom-bootstrap (ro), felhom-controller-data volume (rw), /var/run/docker.sock (rw). No /dev, /etc/fstab, /mnt rshared, /sys, /run/udev. LXC unprivileged=1, features nesting,keyctl only, single mp9 bootstrap mount, no device passthrough/hookscript. |
| No registry pull | ✓ bootstrap unit did docker run (no pull); image shows created 20 hours ago (golden bake time) |
| bootstrap.json identity | ✓ /etc/felhom-bootstrap/bootstrap.json 600 root:root, customer.id=demo-felhom, hub creds, local_api.{endpoint=192.168.0.162:8443, fingerprint=60b5974d…, token} (token/api_key out-of-band) |

FINDING 2a — local-API channel 401 until the agent daemon is restarted

At controller startup: local-api: GET /storage failed (agentapi: GET /storage: HTTP 401) — channel not verified. Root cause: provision minted the per-guest token and durably recorded its hash (/var/lib/felhom-agent/local-tokens.log carries {"v":9200,...}), but the already-running daemon loaded its in-memory token map at its own startup, before the mint — so it rejected the controller's token. systemctl restart felhom-agent cleared it (the daemon reloads the durable store on restart; minted hash persists). After restart, all agent-channel calls succeed. Recommendation: provision should signal the running daemon to reload the token store (or the local-API should consult the durable store per-request), OR the provisioning runbook must include a post-provision systemctl restart felhom-agent. As-is, a guest provisioned while the daemon runs has a dead local-API channel until the daemon restarts.

FINDING 2b — bootstrap seed-log line absent (cosmetic)

The seed functionally worked (controller.yaml 0600 written at boot, configured, not setup mode), but the explicit [INFO] bootstrap: seeded … coming up configured line (bootstrap/bootstrap.go:111) did not appear in docker logs. Functionally correct; logging-only discrepancy worth a glance.


Step 3 — Full flow (validated from inside the container via docker exec … curl localhost:8080)

The bootstrap docker run publishes no port (bridge-only), so host-localhost is refused — all checks ran inside the container (host=localhost passes the catch-all gate).

  1. UI renders — all HTTP 200 with Hungarian markers: / Vezérlőpult (open dashboard — no web password set, logged no password configured — dashboard is open), /stacks (Alkalmazások), /monitoring Rendszermonitor + the slice-9 "Szerver állapota (gazdagép)" host card, /backups (Biztonsági mentés), /settings (Beállítások).
    • Host-gate routing proven live: Host: felhom.demo-felhom.eu → 200; Host: 1.2.3.4 → 404 (catch-all). External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel, which is unconfigured on a bare bootstrap (cf_tunnel_token empty). This is the same slice-10 onboarding gap as the breaks below, and it is not separately testable without the tunnel/DNS.
  2. /api/host-metrics (slice 9) → {ok:true}, HTTP 200. Cross-checked against the host: memory_total exact (16537989120 = pvesh), memory_used ~match (2.49 vs 2.43 GB), loadavg same ballpark, uptime match (324255 vs 324283), disk match (93.9 GB / 10.5% vs df 94G/12%), felhom-usb match (915.8 GiB / 0.87%, SMART 31°C). cpu_temp_c is a real value: it read 46 live, matching sysfs x86_pkg_temp=44/coretemp max 46 (sensors is not installed; the agent sources temp from coretemp/thermal, not the sensors CLI). An earlier reading of 59 was a genuine transient load peak during the golden-build+provision.
  3. /api/disks (8C proxy) → {ok:true}, HTTP 200, 4 devices with data-bearing flags; felhom-usb flagged data_bearing:true (reason "device is mounted", /dev/sdb1).

BREAK 3.2 — storage auto-discovery (by-design consequence of 8C de-privileging)

The 1TB HDD does NOT appear under Settings → Storage Paths. The de-privileged container sees no host storage mounts (df inside shows only the bootstrap bind, its own disk, udev; /mnt empty; logs: "no storage paths registered", "stat /mnt/sys_drive: no such file or directory"). This is correct for the 8C model (mounts limited to bootstrap+data+docker.sock). The HDD is instead surfaced via the agent host-metrics storage view (felhom-usb appears there with capacity + SMART). The legacy local discoverHDDPaths path is effectively vestigial in v0.39.0 — worth retiring or repurposing onto the agent-sourced storage list.

BREAK 3.3 — deploy→run→remove an app: not exercisable on a bare bootstrap

/api/stacks → {ok:true,data:[]} (empty catalog). Catalog templates come from a synced catalog-cache/templates directory, populated by git/assets sync, which is disabled on a bootstrap-seeded controller (manual mode, no repo URL; the seed sets only identity/hub/local-api). No apps means nothing to deploy: POST /api/stacks/filebrowser/deploy → 400 "invalid request body", and even with a body the slug isn't in the catalog. Catalog configuration is slice-10 (hub desired-state) territory.

BREAK 3.5 — hub reporting stays DOWN (HTTP 401): host-key vs customer-key gap

The controller's startup push and all 3 retries got HTTP 401 from the hub ([report] Push failed: HTTP 401), reproduced directly (POST https://hub.felhom.eu/api/v1/report with the baked key → Unauthorized / 401). Root cause (code-traced + DB-confirmed):

  • The hub's POST /api/v1/report authenticates a customer-scoped key: checkAuthCustomer → GetCustomerConfigByAPIKey looks the key up in the customer_configs table, then the handler enforces authCustomerID == payload.CustomerID (hub/internal/api/handler.go:74-92, 208-234).
  • Provision baked the agent's HOST key (hub.api_key, keyed on host_id=demo-felhom-01 in the separate hosts table) into the guest's bootstrap. Host keys and customer keys are distinct tables / code paths.
  • Hub DB confirms: a demo-felhom customer row exists (with its own dashboard-generated api_key), a demo-felhom-01 host row exists, and the baked key appears once (as the host key). So the controller presents the host key → GetCustomerConfigByAPIKey returns nil → 401.

The demo-felhom hub row therefore stays DOWN/stale — the freshly-provisioned controller can never report ONLINE until the bootstrap carries the customer-scoped api_key (or provision creates/fetches the customer config key, or the hub accepts host keys for customer reports). This is a cross-component provisioning gap (slice-10 onboarding), not a controller bug. (Reported value differs from the brief's "v0.34.0" — the DB shows last reports up to v0.28.8; immaterial, the row is stale either way.)
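The key mismatch can be modelled in a few lines. The sketch below is an illustration of the described lookup, not the hub's code: the `Hub` type, its maps and the 403 returned on a customer-id mismatch are assumptions (the report only states that the ids must match and that an unknown key yields 401).

```go
package main

import (
	"fmt"
	"net/http"
)

// CustomerConfig stands in for a row of the customer_configs table.
type CustomerConfig struct{ CustomerID string }

// Hub models the two separate key tables described in the report.
type Hub struct {
	customerKeys map[string]*CustomerConfig // api_key -> customer config
	hostKeys     map[string]string          // api_key -> host_id
}

// getCustomerConfigByAPIKey consults only the customer table,
// so a host key always comes back nil.
func (h *Hub) getCustomerConfigByAPIKey(key string) *CustomerConfig {
	return h.customerKeys[key]
}

// authorizeReport returns the HTTP status a report push would receive.
func (h *Hub) authorizeReport(key, payloadCustomerID string) int {
	cfg := h.getCustomerConfigByAPIKey(key)
	if cfg == nil {
		return http.StatusUnauthorized // the 401 seen in the controller logs
	}
	if cfg.CustomerID != payloadCustomerID {
		return http.StatusForbidden // assumed status for an id mismatch
	}
	return http.StatusOK
}

func main() {
	hub := &Hub{
		customerKeys: map[string]*CustomerConfig{"customer-key": {CustomerID: "demo-felhom"}},
		hostKeys:     map[string]string{"host-key": "demo-felhom-01"},
	}
	// What provision baked into the bootstrap: the host key.
	fmt.Println(hub.authorizeReport("host-key", "demo-felhom"))
	// What the report endpoint expects: the customer-scoped key.
	fmt.Println(hub.authorizeReport("customer-key", "demo-felhom"))
}
```

Any of the three fixes listed above changes one input of this model: the key the bootstrap carries, the table the key is registered in, or the table the lookup consults.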


Step 4 — Disk proxy (API-level only; NO destructive op) — INVARIANT PROVEN

  1. List: /api/disks → {ok:true} with per-device data-bearing flags (above).

  2. Data-bearing format → refusal — hit the agent directly (dodges the controller's CSRF): POST https://192.168.0.162:8443/disks/format with the guest's local_api.token, body {"vmid":9200,"device":"/dev/sdb1","fstype":"ext4"} (felhom-usb — mounted, data-bearing). Result:

    HTTP 403
    { "formatted": false, "data_bearing": true, "reason": "device is mounted", "pending_op": { "op":"storage_wipe", "host_scope":"demo-felhom-01", "durable_id":"byid:wwn-0x5000039ddb108568-part1", "fstype":"ext4" }, "error": "device is data-bearing — format requires an operator signature (pending_signature)" }

    The agent inspected the device itself (data_bearing:true, reason "device is mounted"), ignored any caller claim, refused with pending_signature, and surfaced the durable-id-bound op to sign. The disk was untouched (post-test: /dev/sdb1 ext4 915.8G, 8G used, still mounted). No operator signature was ever passed. The controller maps this 403→409 (agentapi.ErrFormatRefused, already unit-tested). ✓


Step 5 / 6 — Orphan-template cleanup + gate + push

  • Deleted (re-confirmed unreferenced first — grep -rn over internal/ matched only the templates' own {{define}} lines): internal/web/templates/{storage_init, storage_attach, migrate, migrate_drive, restore}.html. Embed is a glob; 14 templates remain.
  • Noted, not deleted (dead-but-harmless): NotifyCrossDriveCompleted/NotifyCrossDriveFailed (notify/notifier.go:353,359, no callers) + a vestigial crossdrive_failed notification toggle (web/handlers.go:937) + restic config fields/comments. Flagged for a future dedicated cleanup.
  • Version: v0.39.0 → v0.39.1 (CHANGELOG entry added; version is ldflags-injected, applied at the next build). Source-only — no re-bake this session.
  • Gate: go build ./... OK; go test ./... green (agentapi, bootstrap, quiesce).
  • Commit: see CHANGELOG / git log — pushed to main. No working UI feature lost (the deleted pages were already unreachable — removed routes).

What broke / what's missing (the headline)

| # | Item | Severity | Nature |
|---|---|---|---|
| 2a | Local-API channel 401 until felhom-agent restart after provisioning a live-daemon host | Medium | Provision doesn't make the running daemon reload its token store. Workaround: restart the daemon (done). Needs a provision→daemon reload signal or per-request store lookup. |
| 3.5 | Hub report 401: bootstrap bakes the host api_key, but /api/v1/report needs the customer api_key | Medium/High | Cross-component provisioning gap (slice-10 onboarding). Controller stays DOWN on the hub until fixed. |
| 3.3 | No app catalog on a bare bootstrap (git/assets sync disabled); deploy not exercisable | Expected (slice-10) | Catalog/desired-state comes from the hub later. |
| 3.2 | Legacy Storage-Paths auto-discovery finds nothing (de-privileged container has no host mounts) | Expected (8C) | HDD is correctly surfaced via agent host-metrics instead; retire/repurpose the legacy path. |
| 3.1 | External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel (cf_tunnel_token empty) | Expected (slice-10) | Host-gate routing itself verified live (200 vs 404). |
| 2b | Bootstrap seed-log line absent | Cosmetic | Functionally correct; logging-only. |
| infra | Supplied Gitea token lacks registry/package scope; build-golden default tag stale (:v0.35.0) | Low | Used the build-server credential for the re-bake; flagged both for Viktor. |

Worked cleanly: golden re-bake, provision + reboot deploy, configured-not-setup bootstrap, 0600 bootstrap.json/controller.yaml, container de-privileging (Privileged=false, 3 mounts), no-registry-pull, all 5 UI pages + slice-9 host card, host-metrics (cross-checked, real cpu_temp), /api/disks, and the 8C data-bearing format-refusal invariant (403, disk untouched).