# REPORT — Live demo validation of felhom-controller v0.39.0 + 8C orphan-template cleanup (2026-06-11)
Two things this session: (1) **provisioned a fresh customer guest from a v0.39.0 golden on the demo
Proxmox host and walked the full controller flow**, reporting what works vs breaks against the live
guest; (2) a small source-hygiene code change — deleting five dead 8C orphan templates (v0.39.1).
The code change is **source-only** (ships in the next golden); the running demo guest stays the
v0.39.0 golden provisioned below.
---
## Premise correction (surfaced to Viktor, decision taken)
The session brief said "controller v0.39.0 is built and baked into the current golden." That was
**false on the baked half**:
- v0.39.0 **source** committed (`d8d1e17`, slice 9). ✓
- v0.39.0 **image** built & pushed to the registry — but under tag **`0.39.0`** (no `v` prefix,
unlike every prior build; same digest as `latest`). The `v0.39.0` alias was **never pushed**.
- v0.39.0 **golden** did **not exist**: the newest golden had **v0.38.0** baked in (the others: v0.37.0/.36/.35).
Per the plan's Step-1.6 gate, this was surfaced. **Viktor chose: re-bake the v0.39.0 golden, then run
the full session.** Done (Step 1b below).
---
## Step 1 — Discovery (observed)
| Item | Value |
|---|---|
| Live agent config path | `/root/.config/felhom-agent/agent.json` (NOT the `/etc/...` default — confirmed from `systemctl cat felhom-agent` `ExecStart`) |
| Agent version (deployed) | **v0.18.0** (serves `GET /host/metrics`, `/disks`, `/disks/format`) |
| `local_api` | `enable: true`, `listen_addr: 192.168.0.162:8443` ✓ |
| `backup.restore_storage` | `local-lvm` ✓ |
| `hub` | `url: https://hub.felhom.eu`, `host_id: demo-felhom-01`, `api_key` set (stored out-of-band) ✓ |
| Free VMID chosen | **9200** (in use: 9001 spike, 9999 selftest-scratch) |
| Golden volid (re-baked) | `local:backup/vzdump-lxc-9100-2026_06_11-11_55_03.tar.zst` (baked image `gitea.dooplex.hu/admin/felhom-controller:0.39.0`, verified by extracting `/etc/felhom-controller-image` from the archive) |
| Hub row key | Controller reports are keyed on `customer.id`; a **`demo-felhom` customer row already exists** (stale; last reports Feb 2026, up to v0.28.8). Agent host reports are keyed on `host_id=demo-felhom-01` (a separate row). |
| Daemon contention | **None at mint time** — provision ran with the daemon up, no token-store/leaf lock. (But see the post-provision 401 finding under Step 2.) |
### Step 1b — re-bake the v0.39.0 golden
- The supplied Gitea token (read-only) **lacks container-registry scope**: the build LXC's
`docker pull` failed with `401 unauthorized` (verified: the token gets HTTP 401 from the registry token
endpoint). Re-used the **build server's existing registry credential** (package-scoped, the same one
prior bakes used; **stored out-of-band**) purely inside the build LXC. `build-golden.sh` logs out +
removes `/root/.docker/config.json` before archiving, so **no credential is baked** into the golden;
the build logfile contains no secret.
- New golden built (VMID 9100 build LXC → archived → build guest destroyed). Baked image confirmed
`:0.39.0`.
- **Follow-up for Viktor:** push the `v0.39.0` tag alias for naming convention, and bump the
`build-golden.sh:33` default off the stale `:v0.35.0`.
---
## Step 2 — Provision guest 9200 as `demo-felhom`
`felhom-agent --selftest=provision -config /root/.config/felhom-agent/agent.json -archive <golden>
-vmid 9200 -customer-id demo-felhom -customer-domain demo-felhom.eu -customer-name "Felhom Demo"
-hostname felhom-demo` → `selftest=provision OK — guest 9200 provisioned + bootstrap-mounted (KEPT)`.
**`pct reboot 9200` was mandatory** (the bootstrap oneshot is gated on the mount existing at boot, and
provision attaches the mount *after* the front-half boot). After one reboot the controller deployed:
| Assertion | Result |
|---|---|
| Controller running | `gitea.dooplex.hu/admin/felhom-controller:0.39.0 Up (healthy)` — startup line `felhom-controller 0.39.0 starting (customer: demo-felhom, domain: demo-felhom.eu)` |
| Not setup mode | ✓ — no `setup mode` line; came up **configured** |
| Bootstrap-seeded config | ✓ — `controller.yaml` written **0600 root:root** at boot, with customer id + hub + `local_api.{endpoint,fingerprint,token}` |
| De-privileged container | ✓ — `Privileged=false`; mounts = **exactly 3**: `/etc/felhom-bootstrap` (ro), `felhom-controller-data` volume (rw), `/var/run/docker.sock` (rw). No `/dev`, `/etc/fstab`, `/mnt` rshared, `/sys`, `/run/udev`. LXC `unprivileged=1`, features `nesting,keyctl` only, single `mp9` bootstrap mount, no device passthrough/hookscript. |
| No registry pull | ✓ — bootstrap unit did `docker run` (no pull); image shows `created 20 hours ago` (golden bake time) |
| bootstrap.json identity | ✓ — `/etc/felhom-bootstrap/bootstrap.json` `600 root:root`, `customer.id=demo-felhom`, hub creds, `local_api.{endpoint=192.168.0.162:8443, fingerprint=60b5974d…, token}` (token/api_key out-of-band) |
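
The de-privileging row above can be re-checked mechanically from `docker inspect` output. The following is a minimal sketch, not the project's tooling: `check_deprivileged` and the `EXPECTED` table are hypothetical names, and it relies only on the standard `HostConfig.Privileged` and `Mounts[]` fields of the inspect JSON. Bind mounts are assumed to be identified by the container-side paths listed in the table; the data volume is matched by its volume name because its mount point is not stated in this report.

```python
import json

# Expected mounts from the assertion table: key -> read-write flag.
# Bind mounts are keyed by container path (an assumption); the named
# volume is keyed by its volume name.
EXPECTED = {
    "/etc/felhom-bootstrap": False,
    "felhom-controller-data": True,
    "/var/run/docker.sock": True,
}

def check_deprivileged(inspect_json: str) -> list[str]:
    """Return a list of violations; an empty list means the container passes."""
    container = json.loads(inspect_json)[0]  # `docker inspect` prints a JSON array
    problems = []
    if container["HostConfig"]["Privileged"]:
        problems.append("container is privileged")
    seen = {}
    for m in container["Mounts"]:
        key = m["Name"] if m["Type"] == "volume" else m["Destination"]
        seen[key] = m["RW"]
    for key in sorted(seen.keys() - EXPECTED.keys()):
        problems.append(f"unexpected mount: {key}")
    for key, rw in EXPECTED.items():
        if key not in seen:
            problems.append(f"missing mount: {key}")
        elif seen[key] != rw:
            problems.append(f"wrong mode on {key}")
    return problems
```

Fed the output of `docker inspect` for the controller container, an empty result corresponds to the "`Privileged=false`, exactly 3 mounts" assertion; any extra mount such as `/dev` would be listed as a violation.
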
### FINDING 2a — local-API channel 401 until the agent daemon is restarted
At controller startup: `local-api: GET /storage failed (agentapi: GET /storage: HTTP 401) — channel
not verified`. Root cause: provision minted the per-guest token and durably recorded its hash
(`/var/lib/felhom-agent/local-tokens.log` carries `{"v":9200,...}`), but the **already-running daemon
loaded its in-memory token map at its own startup, before the mint** — so it rejected the controller's
token. **`systemctl restart felhom-agent` cleared it** (the daemon reloads the durable store on
restart; minted hash persists). After restart, all agent-channel calls succeed.
**Recommendation:** provision should signal the running daemon to reload the token store (or the
local-API should consult the durable store per-request), OR the provisioning runbook must include a
post-provision `systemctl restart felhom-agent`. As-is, a guest provisioned while the daemon runs has a
dead local-API channel until the daemon restarts.
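
One of the options in the recommendation, consulting the durable store when the in-memory map misses, could look roughly like the sketch below. It is illustrative only: `TokenStore` is a hypothetical name, and the record layout (`v` for the VMID, `h` for a SHA-256 token hash) is assumed beyond the `v` field, because the report elides the rest of each log line.

```python
import hashlib
import json

class TokenStore:
    """In-memory token map backed by an append-only JSON-lines log."""

    def __init__(self, path: str):
        self.path = path
        self.hashes: dict[str, int] = {}  # token hash -> VMID
        self.reload()

    def reload(self) -> None:
        """Rebuild the in-memory map from the durable log."""
        hashes = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    hashes[rec["h"]] = rec["v"]  # "h" is an assumed field name
        self.hashes = hashes

    def vmid_for(self, token: str):
        """Return the VMID bound to the token, or None if it is unknown."""
        digest = hashlib.sha256(token.encode()).hexdigest()
        if digest not in self.hashes:
            # A miss may only mean the token was minted after startup:
            # re-read the durable log once before rejecting.
            self.reload()
        return self.hashes.get(digest)
```

Reloading on every miss is the simplest variant; a real daemon would probably rate-limit it so that a stream of invalid tokens cannot force repeated file reads.
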
### FINDING 2b — bootstrap seed-log line absent (cosmetic)
The seed functionally worked (controller.yaml 0600 written at boot, configured, not setup mode), but
the explicit `[INFO] bootstrap: seeded … coming up configured` line (`bootstrap/bootstrap.go:111`) did
**not** appear in `docker logs`. Functionally correct; logging-only discrepancy worth a glance.
---
## Step 3 — Full flow (validated from inside the container via `docker exec … curl localhost:8080`)
The bootstrap `docker run` publishes no port (bridge-only), so connections to `localhost` from the host
are refused; all checks therefore ran inside the container (where `Host: localhost` passes the catch-all gate).
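
The host-gate behaviour relied on in this step can be pictured with a small sketch. It is an illustration under assumptions rather than the controller's code: `gate_status`, the allowed-host set, and the port-stripping rule are hypothetical. Only the outcomes come from this report: `localhost` and the customer hostname pass, and any other `Host` value falls through to the catch-all 404.

```python
def gate_status(host_header: str, customer_domain: str) -> int:
    """Return 200 if the Host header passes the gate, else the catch-all 404."""
    # Strip an optional ":port" suffix and normalise case (IPv6 literals not handled).
    host = host_header.split(":", 1)[0].lower()
    # Assumed rule: the UI hostname is "felhom." + the customer domain.
    allowed = {"localhost", f"felhom.{customer_domain}"}
    return 200 if host in allowed else 404
```
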
1. **UI renders** — all HTTP 200 with Hungarian markers: `/` **Vezérlőpult** (open dashboard — no
web password set, logged `no password configured — dashboard is open`), `/stacks` (Alkalmazások),
`/monitoring` **Rendszermonitor** + the slice-9 **"Szerver állapota (gazdagép)"** host card,
`/backups` (Biztonsági mentés), `/settings` (Beállítások).
- **Host-gate routing proven live:** `Host: felhom.demo-felhom.eu` → **200**; `Host: 1.2.3.4` →
**404** (catch-all). External browser access (`felhom.demo-felhom.eu`) needs the **Cloudflare
tunnel**, which is **unconfigured on a bare bootstrap** (`cf_tunnel_token` empty) — same slice-10
onboarding gap as below; not separately testable without the tunnel/DNS.
2. **`/api/host-metrics` (slice 9)** — `{ok:true}` HTTP 200. **Cross-checked against the host:**
memory_total **exact** (16537989120 = pvesh), memory_used ~match (2.49 vs 2.43 GB), loadavg same
ballpark, uptime match (324255 vs 324283), disk match (93.9 GB / 10.5% vs `df` 94G/12%),
felhom-usb match (915.8 GiB / 0.87%, SMART 31°C). **`cpu_temp_c` is a real value** — read 46 live,
matching sysfs `x86_pkg_temp=44`/coretemp max 46 (`sensors` is **not installed**; the agent sources
temp from coretemp/thermal, not the `sensors` CLI). An earlier reading of 59 was a genuine transient
load peak during the golden-build+provision.
3. **`/api/disks` (8C proxy)** — `{ok:true}` HTTP 200, 4 devices with data-bearing flags;
**felhom-usb flagged `data_bearing:true` (reason "device is mounted", `/dev/sdb1`)**.
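
The cross-check in item 2 amounts to comparing each controller-reported value with the host's own reading under a tolerance. A minimal sketch using the figures quoted above; the helper name `within` and the 5 % and 60 s tolerances are assumptions chosen for illustration, not thresholds used in the session.

```python
def within(reported: float, reference: float, rel: float = 0.0, abs_tol: float = 0.0) -> bool:
    """True if reported is within a relative or an absolute tolerance of reference."""
    diff = abs(reported - reference)
    return diff <= abs_tol or diff <= rel * abs(reference)

checks = {
    "memory_total (bytes)": within(16537989120, 16537989120),  # exact match
    "memory_used (GB)": within(2.49, 2.43, rel=0.05),          # about 2.5 % apart
    "uptime (s)": within(324255, 324283, abs_tol=60),          # 28 s apart
    "cpu_temp_c": within(46, 46),                              # equals coretemp max
}
```

A small absolute tolerance suits counters such as uptime, which keep moving between the two reads, while a relative tolerance suits memory usage, which fluctuates with load.
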
### BREAK 3.2 — storage auto-discovery (by-design consequence of 8C de-privileging)
The 1TB HDD does **NOT** appear under Settings → Storage Paths. The de-privileged container sees **no
host storage mounts** (`df` inside shows only the bootstrap bind, its own disk, udev; `/mnt` empty;
logs: "no storage paths registered", "stat /mnt/sys_drive: no such file or directory"). This is correct
for the 8C model (mounts limited to bootstrap+data+docker.sock). The HDD is instead surfaced via the
**agent host-metrics storage view** (felhom-usb appears there with capacity + SMART). The legacy local
`discoverHDDPaths` path is effectively **vestigial** in v0.39.0 — worth retiring or repurposing onto the
agent-sourced storage list.
### BREAK 3.3 — deploy→run→remove an app: not exercisable on a bare bootstrap
`/api/stacks` → `{ok:true,data:[]}` (empty catalog). Catalog templates come from a synced
`catalog-cache/templates`, populated by git/assets sync — **disabled on a bootstrap-seeded controller**
(manual mode, no repo URL; the seed sets only identity/hub/local-api). No apps → nothing to deploy
(`POST /api/stacks/filebrowser/deploy` → 400 "invalid request body"; even with a body the slug isn't in
the catalog). Catalog configuration is **slice-10 (hub desired-state)** territory.
### BREAK 3.5 — hub reporting stays DOWN (HTTP 401): host-key vs customer-key gap
The controller's startup push and all 3 retries got **HTTP 401** from the hub
(`[report] Push failed: HTTP 401`), reproduced directly (`POST https://hub.felhom.eu/api/v1/report`
with the baked key → `Unauthorized` / 401). **Root cause (code-traced + DB-confirmed):**
- The hub's `POST /api/v1/report` authenticates a **customer-scoped** key: `checkAuthCustomer` →
`GetCustomerConfigByAPIKey` against the `customer_configs` table, then enforces
`authCustomerID == payload.CustomerID` (`hub/internal/api/handler.go:74-92, 208-234`).
- Provision baked the **agent's HOST key** (`hub.api_key`, keyed on `host_id=demo-felhom-01` in the
separate `hosts` table) into the guest's bootstrap. Host keys and customer keys are **distinct tables
/ code paths**.
- Hub DB confirms: a `demo-felhom` customer row exists (with its **own** dashboard-generated api_key),
a `demo-felhom-01` host row exists, and the baked key appears **once** (as the host key). So the
controller presents the host key → `GetCustomerConfigByAPIKey` returns nil → **401**.
The `demo-felhom` hub row therefore **stays DOWN/stale** — the freshly-provisioned controller can never
report ONLINE until the bootstrap carries the **customer-scoped** api_key (or provision creates/fetches
the customer config key, or the hub accepts host keys for customer reports). This is a **cross-component
provisioning gap (slice-10 onboarding)**, not a controller bug. *(Reported value differs from the
brief's "v0.34.0" — the DB shows last reports up to v0.28.8; immaterial, the row is stale either way.)*
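
The 401 follows directly from the two-table layout. The sketch below models that lookup in miniature. It is not the hub's handler: the dictionaries stand in for the `customer_configs` and `hosts` tables, the key strings are placeholders, `authorize_report` is a hypothetical name for the `checkAuthCustomer` path described above, and the status returned on a customer-ID mismatch is an assumption.

```python
# Placeholder keys; the real ones are stored out-of-band.
CUSTOMER_CONFIGS = {"customer-key-placeholder": "demo-felhom"}  # api_key -> customer id
HOSTS = {"host-key-placeholder": "demo-felhom-01"}              # api_key -> host id

def authorize_report(api_key: str, payload_customer_id: str) -> int:
    """Return the HTTP status the report endpoint would answer with."""
    # Only the customer table is consulted; HOSTS is never looked at.
    auth_customer_id = CUSTOMER_CONFIGS.get(api_key)
    if auth_customer_id is None:
        return 401  # unknown key, which includes a perfectly valid *host* key
    if auth_customer_id != payload_customer_id:
        return 403  # assumed status; the report does not say what the hub returns here
    return 200
```

Presenting the host key with `customer_id="demo-felhom"` yields 401 in this model, while the customer's own key yields 200, which is the gap a fixed bootstrap would have to close.
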
---
## Step 4 — Disk proxy (API-level only; NO destructive op) — INVARIANT PROVEN
1. **List** → `/api/disks` `{ok:true}` with per-device data-bearing flags (above).
2. **Data-bearing format → refusal** — hit the agent **directly** (dodges the controller's CSRF):
`POST https://192.168.0.162:8443/disks/format` with the guest's `local_api.token`, body
`{"vmid":9200,"device":"/dev/sdb1","fstype":"ext4"}` (felhom-usb — mounted, data-bearing). Result:
**HTTP 403** → `{ "formatted": false, "data_bearing": true, "reason": "device is mounted",
"pending_op": { "op":"storage_wipe", "host_scope":"demo-felhom-01",
"durable_id":"byid:wwn-0x5000039ddb108568-part1", "fstype":"ext4" },
"error": "device is data-bearing — format requires an operator signature (pending_signature)" }`
The agent **inspected the device itself** (`data_bearing:true`, reason "device is mounted"),
ignored any caller claim, refused with `pending_signature`, and surfaced the durable-id-bound op to
sign. **The disk was untouched** (post-test: `/dev/sdb1 ext4 915.8G, 8G used`, still mounted). No
operator signature was ever passed. The controller maps this 403→409 (`agentapi.ErrFormatRefused`,
already unit-tested). ✓
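
The invariant exercised here is that the agent decides from its own inspection, never from the caller's claim. It can be summarised in a few lines. This is a hedged sketch, not the agent's implementation: `decide_format`, `controller_status`, and the `caller_says_empty` parameter are hypothetical, the success body is assumed, and the signed path is left unimplemented because the session never exercised it.

```python
def decide_format(inspected_data_bearing: bool, reason: str,
                  caller_says_empty: bool, operator_signature=None):
    """Return (http_status, body) for a format request."""
    # caller_says_empty is deliberately ignored: only the agent's own
    # inspection of the device counts.
    if inspected_data_bearing:
        if operator_signature is None:
            return 403, {
                "formatted": False,
                "data_bearing": True,
                "reason": reason,
                "error": "format requires an operator signature (pending_signature)",
            }
        # Signature verification is outside the scope of this sketch.
        raise NotImplementedError("signed format path not modelled")
    return 200, {"formatted": True, "data_bearing": False}  # assumed success shape

def controller_status(agent_status: int) -> int:
    """Map the agent's refusal (403) to the controller's 409, as the report describes."""
    return 409 if agent_status == 403 else agent_status
```

For the tested case (`/dev/sdb1`, mounted), `decide_format(True, "device is mounted", caller_says_empty=True)` returns the 403 refusal regardless of the caller's claim, and the controller would surface it as 409.
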
---
## Step 5 / 6 — Orphan-template cleanup + gate + push
- **Deleted** (re-confirmed unreferenced first — `grep -rn` over `internal/` matched only the
templates' own `{{define}}` lines): `internal/web/templates/{storage_init, storage_attach, migrate,
migrate_drive, restore}.html`. Embed is a glob; 14 templates remain.
- **Noted, not deleted** (dead-but-harmless): `NotifyCrossDriveCompleted`/`NotifyCrossDriveFailed`
(`notify/notifier.go:353,359`, no callers) + a vestigial `crossdrive_failed` notification toggle
(`web/handlers.go:937`) + restic config fields/comments. Flagged for a future dedicated cleanup.
- **Version:** v0.39.0 → **v0.39.1** (CHANGELOG entry added; version is ldflags-injected, applied at
the next build). Source-only — no re-bake this session.
- **Gate:** `go build ./...` **OK**; `go test ./...` **green** (agentapi, bootstrap, quiesce).
- **Commit:** `6e77bea` (the template deletion + this report), pushed to `main` (`d8d1e17..6e77bea`).
No working UI feature lost (the deleted pages were already unreachable — removed routes).
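
The "unreferenced" check described above (a `grep -rn` whose only hits are each template's own `{{define}}` line) can be approximated as follows. This is a sketch under assumptions: `find_orphans` is a hypothetical helper, templates are assumed to be referenced by their quoted name (the file name, with or without `.html`), and a per-line regex stands in for grep.

```python
import re
from pathlib import Path

def find_orphans(root: str, template_dir: str) -> list[str]:
    """Return template names whose only mentions are their own {{define}} lines."""
    root_path = Path(root)
    names = sorted(p.stem for p in (root_path / template_dir).glob("*.html"))
    files = [p for p in root_path.rglob("*") if p.is_file()]
    orphans = []
    for name in names:
        # Match the quoted name only, so "migrate" does not match "migrate_drive".
        pattern = re.compile('"' + re.escape(name) + r'(\.html)?"')
        referenced = False
        for path in files:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for line in text.splitlines():
                if pattern.search(line) and "{{define" not in line:
                    referenced = True
        if not referenced:
            orphans.append(name)
    return orphans
```

Matching the quoted name errs on the safe side for deletion: any stray mention outside a `{{define}}` line keeps a template off the orphan list.
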
---
## What broke / what's missing (the headline)
| # | Item | Severity | Nature |
|---|---|---|---|
| 2a | Local-API channel 401 until `felhom-agent` is restarted after provisioning on a host whose daemon is already running | **Medium** | Provision doesn't make the running daemon reload its token store. Workaround: restart the daemon (done). Needs a provision→daemon reload signal or per-request store lookup. |
| 3.5 | Hub report 401 — bootstrap bakes the **host** api_key, but `/api/v1/report` needs the **customer** api_key | **Medium/High** | Cross-component provisioning gap (slice-10 onboarding). Controller stays DOWN on the hub until fixed. |
| 3.3 | No app catalog on a bare bootstrap (git/assets sync disabled) — deploy not exercisable | Expected (slice-10) | Catalog/desired-state comes from the hub later. |
| 3.2 | Legacy Storage-Paths auto-discovery finds nothing (de-privileged container has no host mounts) | Expected (8C) | HDD is correctly surfaced via agent host-metrics instead; retire/repurpose the legacy path. |
| 3.1 | External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel (`cf_tunnel_token` empty) | Expected (slice-10) | Host-gate routing itself verified live (200 vs 404). |
| 2b | Bootstrap seed-log line absent | Cosmetic | Functionally correct; logging-only. |
| infra | Supplied Gitea token lacks registry/package scope; build-golden default tag stale (`:v0.35.0`) | Low | Used the build-server credential for the re-bake; flagged both for Viktor. |
**Worked cleanly:** golden re-bake, provision + reboot deploy, configured-not-setup bootstrap, 0600
bootstrap.json/controller.yaml, container de-privileging (Privileged=false, 3 mounts), no-registry-pull,
all 5 UI pages + slice-9 host card, host-metrics (cross-checked, real cpu_temp), `/api/disks`, and the
**8C data-bearing format-refusal invariant (403, disk untouched)**.