diff --git a/REPORT.md b/REPORT.md
index 21d13f2..23b9747 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -1,203 +1,95 @@
-# REPORT — Live demo validation of felhom-controller v0.39.0 + 8C orphan-template cleanup (2026-06-11)
+# REPORT — controller v0.40.0: bootstrap pull+merge onboarding (live-validated) (2026-06-11)
-Two things this session: (1) **provisioned a fresh customer guest from a v0.39.0 golden on the demo
-Proxmox host and walked the full controller flow**, reporting what works vs breaks against the live
-guest; (2) a small source-hygiene code change — deleting five dead 8C orphan templates (v0.39.1).
+Lockstep two-repo change with `felhom-agent` v0.19.0. Fixes the onboarding **401** found last session:
+a freshly provisioned guest used to seed a "configured" controller.yaml from the agent's **host** hub
+key, which the hub's customer-scoped `/api/v1/report` rejects → the controller could never report
+ONLINE. Now, on first boot, the controller **pulls** its full controller.yaml from the hub (using the
+bootstrap's retrieval passphrase, which yields the **customer-scoped** key) and **merges in** the
+per-guest `local_api` block. Validated live end-to-end on the demo (guest 9201).
-The code change is **source-only** (ships in the next golden); the running demo guest stays the
-v0.39.0 golden provisioned below.
+## What changed (`internal/bootstrap`, `cmd/controller/main.go`)
+- **Contract v1 → v2** (`felhom.bootstrap/v2`): `BootstrapCustomer` keeps only `id`; `BootstrapHub`
+ drops `api_key`/`host_id`, adds **`retrieval_password`**; `local_api` unchanged. Non-v2 → setup mode.
+- **`MaybeIngest(configPath, cfg, logger, pull PullFunc)`** — `pull` injected (decision (b): keeps
+ `bootstrap` free of the heavy `internal/report` package; `main.go` wires `report.PullConfig`).
Flow:
+ idempotent (configured → return, **no pull**) → parse+validate v2 → **pull** with bounded retry
+ (1 + 3 backoff attempts, transient `ErrPullTransient` only; auth/not-found fail fast) → **merge**
+ `local_api` at the YAML-**map** level (decision (c): preserves every hub-emitted field) → write 0600
+ atomic → reload. Fail-safe + never-crash (hub outage at first boot → setup mode).
+- New sentinel **`ErrPullTransient`**; `main.go`'s adapter maps `report.ErrHubUnreachable` → transient,
+ passes auth/not-found through as permanent. Removed `configFromBootstrap` (the host-key path).
----
+## Cross-repo contract checksum-diff (rendered bootstrap.json field set)
+The agent's v2 renderer output was ingested by the controller's `json.Unmarshal` — **every field
+populated**, exact match:
-## Premise correction (surfaced to Viktor, decision taken)
-
-The session brief said "controller v0.39.0 is built and baked into the current golden." That was
-**false on the baked half**:
-- v0.39.0 **source** committed (`d8d1e17`, slice 9). ✓
-- v0.39.0 **image** built & pushed to the registry — but under tag **`0.39.0`** (no `v` prefix,
- unlike every prior build; same digest as `latest`). The `v0.39.0` alias was **never pushed**.
-- v0.39.0 **golden** did **not exist** — the newest golden baked **v0.38.0** (others v0.37.0/.36/.35).
-Per the plan's Step-1.6 gate, this was surfaced. **Viktor chose: re-bake the v0.39.0 golden, then run
-the full session.** Done (Step 1b below).
-
----
-
-## Step 1 — Discovery (observed)
-
-| Item | Value |
+| level | fields (agent emits == controller ingests) |
 |---|---|
-| Live agent config path | `/root/.config/felhom-agent/agent.json` (NOT the `/etc/...` default — confirmed from `systemctl cat felhom-agent` `ExecStart`) |
-| Agent version (deployed) | **v0.18.0** (serves `GET /host/metrics`, `/disks`, `/disks/format`) |
-| `local_api` | `enable: true`, `listen_addr: 192.168.0.162:8443` ✓ |
-| `backup.restore_storage` | `local-lvm` ✓ |
-| `hub` | `url: https://hub.felhom.eu`, `host_id: demo-felhom-01`, `api_key` set (stored out-of-band) ✓ |
-| Free VMID chosen | **9200** (in use: 9001 spike, 9999 selftest-scratch) |
-| Golden volid (re-baked) | `local:backup/vzdump-lxc-9100-2026_06_11-11_55_03.tar.zst` (baked image `gitea.dooplex.hu/admin/felhom-controller:0.39.0`, verified by extracting `/etc/felhom-controller-image` from the archive) |
-| Hub row key | Controller report keys on `customer.id`; the existing **`demo-felhom` customer row exists** (stale, last reports Feb 2026 up to v0.28.8). Agent host-report keys on `host_id=demo-felhom-01` (separate row). |
-| Daemon contention | **None at mint time** — provision ran with the daemon up, no token-store/leaf lock. (But see the post-provision 401 finding under Step 2.) |
+| top | `schema, customer, hub, local_api` |
+| customer | `id` |
+| hub | `url, retrieval_password` |
+| local_api | `endpoint, fingerprint, token` |
-### Step 1b — re-bake the v0.39.0 golden
-- The supplied Gitea token (read-only) **lacks container-registry scope** — the build LXC's
- `docker pull` failed `401 unauthorized` (verified: the token gets HTTP 401 from the registry token
- endpoint). Re-used the **build server's existing registry credential** (package-scoped, the same one
- prior bakes used; **stored out-of-band**) purely inside the build LXC.
`build-golden.sh` logs out +
- removes `/root/.docker/config.json` before archiving, so **no credential is baked** into the golden;
- the build logfile contains no secret.
-- New golden built (VMID 9100 build LXC → archived → build guest destroyed). Baked image confirmed
- `:0.39.0`.
-- **Follow-up for Viktor:** push the `v0.39.0` tag alias for naming convention, and bump the
- `build-golden.sh:33` default off the stale `:v0.35.0`.
+(Automated round-trip via a throwaway test in each package; removed after verifying.)
----
+## Tests — non-hollow (`internal/bootstrap`), all green
+- **Pull+merge:** stub `pull` returns a hub yaml with `hub.api_key: CUSTKEY_FROM_HUB`, `customer.domain`,
+ and an unmodeled `assets.source_url`. Asserts the written controller.yaml carries **the customer key +
+ identity + the preserved unmodeled assets field** AND the bootstrap's `local_api.{endpoint,
+ fingerprint,token}`, and contains **no host key/id**.
+- **Idempotency:** preset `cfg.Customer.ID` → asserts `pull` **never invoked**, file untouched.
+- **Transient retry:** stub returns `ErrPullTransient` always → asserts exactly `1+len(delays)` calls,
+ then setup mode, no file (backoff shrunk to ~1ms via the overridable `pullRetryDelays`).
+- **Permanent no-retry:** stub returns a plain (auth-style) error → asserts a single call.
+- **Schema reject** (non-v2), **missing-required**, **malformed/absent** → setup mode, no pull.
-## Step 2 — Provision guest 9200 as `demo-felhom`
+`go build ./... && go test ./...` green.
-`felhom-agent --selftest=provision -config /root/.config/felhom-agent/agent.json -archive
--vmid 9200 -customer-id demo-felhom -customer-domain demo-felhom.eu -customer-name "Felhom Demo"
--hostname felhom-demo` → `selftest=provision OK — guest 9200 provisioned + bootstrap-mounted (KEPT)`.
+## Live validation (demo Proxmox `felhom-pve`, guest 9201, golden baked `:0.40.0`)
+Golden re-baked: `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst` (baked image confirmed
+`gitea.dooplex.hu/admin/felhom-controller:0.40.0`). Provisioned fresh as `demo-felhom` via agent
+v0.19.0 `--selftest=provision -customer-id demo-felhom -hub-password ` (passphrase read
+from the hub `customer_configs` and transported base64 to avoid UTF-8 mangling; **stored out-of-band**),
+then `pct reboot` + `systemctl restart felhom-agent` (the local-API token workaround, Finding #1).
-**`pct reboot 9200` was mandatory** (the bootstrap oneshot is gated on the mount existing at boot, and
-provision attaches the mount *after* the front-half boot). After one reboot the controller deployed:
+- **Bootstrap (v2) on the guest:** `hub` keys = `[url, retrieval_password]` (no host key), `customer`
+ keys = `[id]` only, 0600. ✓
+- **Pull+merge worked** — the merged `/opt/docker/felhom-controller/controller.yaml` (secrets redacted)
+ carries **from the hub pull**: `hub.api_key: 4b11c0c3…` (the **customer-scoped** key, matches the
+ hub's `customer_configs` row), `hub.enabled: true`, `customer.{id: demo-felhom, domain:
+ demo-felhom.eu, name, email}`, `assets.source_url`, `git` (catalog repo), `infrastructure.cf_*`
+ (Cloudflare config); and **merged from the bootstrap**: `local_api.{endpoint: 192.168.0.162:8443,
+ fingerprint: 60b5974d…, token}`. **No `host_id`, no agent host key.** ✓
+- **Hub ONLINE at v0.40.0** — `[report] Hub report pushed successfully (3090 bytes)` + `Startup hub
+ report sent`, **no 401**. Hub `reports` row for `demo-felhom`: `controller_version=0.40.0`,
+ `received_at=2026-06-11 11:32:00` (fresh → online). 0 deployed apps (fresh guest — expected). ✓
+- **`local_api` survived the merge** — `GET /api/host-metrics` → `{ok:true}`, `cpu_temp_c=49` (real),
+ 4 storage targets; `GET /api/disks` → `{ok:true}`, felhom-usb `data_bearing:true`.
✓
+- **8C invariant intact** — agent-direct `POST /disks/format` on data-bearing `/dev/sdb1` → **HTTP 403**
+ `{formatted:false, data_bearing:true, reason:"device is mounted", pending_op:{op:storage_wipe,
+ durable_id:byid:wwn-…, …}}` "operator signature required (pending_signature)". Disk untouched
+ (`/dev/sdb1 ext4 8G`, still mounted). ✓
-| Assertion | Result |
-|---|---|
-| Controller running | `gitea.dooplex.hu/admin/felhom-controller:0.39.0 Up (healthy)` — startup line `felhom-controller 0.39.0 starting (customer: demo-felhom, domain: demo-felhom.eu)` |
-| Not setup mode | ✓ — no `setup mode` line; came up **configured** |
-| Bootstrap-seeded config | ✓ — `controller.yaml` written **0600 root:root** at boot, with customer id + hub + `local_api.{endpoint,fingerprint,token}` |
-| De-privileged container | ✓ — `Privileged=false`; mounts = **exactly 3**: `/etc/felhom-bootstrap` (ro), `felhom-controller-data` volume (rw), `/var/run/docker.sock` (rw). No `/dev`, `/etc/fstab`, `/mnt` rshared, `/sys`, `/run/udev`. LXC `unprivileged=1`, features `nesting,keyctl` only, single `mp9` bootstrap mount, no device passthrough/hookscript. |
-| No registry pull | ✓ — bootstrap unit did `docker run` (no pull); image shows `created 20 hours ago` (golden bake time) |
-| bootstrap.json identity | ✓ — `/etc/felhom-bootstrap/bootstrap.json` `600 root:root`, `customer.id=demo-felhom`, hub creds, `local_api.{endpoint=192.168.0.162:8443, fingerprint=60b5974d…, token}` (token/api_key out-of-band) |
+## What broke / what's missing
+- **Bootstrap log line absent in `docker logs`** (observability nit, reproduced from last session's
+ seed-log). `MaybeIngest`'s `[INFO] bootstrap: pulled config … coming up configured` does not surface
+ in `docker logs` even though `setupLogger` writes to stdout and the pull demonstrably ran (customer
+ key present, hub report OK, catalog repo configured).
The first captured line is a later async
+ local-api WARN — the early synchronous bootstrap log is being swallowed before docker attaches.
+ Worth a follow-up (flush/sequence the logger before MaybeIngest, or log the pull result post-startup).
+- **Finding #1 still open (separate spec):** the local-API channel 401s until `systemctl restart
+ felhom-agent` after provisioning a live-daemon host (the running daemon didn't reload the freshly
+ minted token). Reproduced (startup WARN at 11:31:55); workaround applied.
+- **Operational gotcha (mine, fixed):** `kubectl cp`'s "tar: removing leading '/'" warning polluted a
+ captured base64 passphrase on the first attempt → a 2-char garbage passphrase → re-extracted with
+ `tail -1` and re-provisioned cleanly. The UTF-8 (Hungarian) passphrase must be transported
+ byte-exact (base64), not through the Windows shell.
+- Minor: guest 9201's hostname is `felhom-golden` (no `-hostname` passed); cosmetic, `customer.id` is
+ correct.
-### FINDING 2a — local-API channel 401 until the agent daemon is restarted
-At controller startup: `local-api: GET /storage failed (agentapi: GET /storage: HTTP 401) — channel
-not verified`. Root cause: provision minted the per-guest token and durably recorded its hash
-(`/var/lib/felhom-agent/local-tokens.log` carries `{"v":9200,...}`), but the **already-running daemon
-loaded its in-memory token map at its own startup, before the mint** — so it rejected the controller's
-token. **`systemctl restart felhom-agent` cleared it** (the daemon reloads the durable store on
-restart; minted hash persists). After restart, all agent-channel calls succeed.
-**Recommendation:** provision should signal the running daemon to reload the token store (or the
-local-API should consult the durable store per-request), OR the provisioning runbook must include a
-post-provision `systemctl restart felhom-agent`. As-is, a guest provisioned while the daemon runs has a
-dead local-API channel until the daemon restarts.
-
-### FINDING 2b — bootstrap seed-log line absent (cosmetic)
-The seed functionally worked (controller.yaml 0600 written at boot, configured, not setup mode), but
-the explicit `[INFO] bootstrap: seeded … coming up configured` line (`bootstrap/bootstrap.go:111`) did
-**not** appear in `docker logs`. Functionally correct; logging-only discrepancy worth a glance.
-
----
-
-## Step 3 — Full flow (validated from inside the container via `docker exec … curl localhost:8080`)
-
-The bootstrap `docker run` publishes no port (bridge-only), so host-`localhost` is refused — all checks
-ran inside the container (host=localhost passes the catch-all gate).
-
-1. **UI renders** — all HTTP 200 with Hungarian markers: `/` **Vezérlőpult** (open dashboard — no
- web password set, logged `no password configured — dashboard is open`), `/stacks` (Alkalmazások),
- `/monitoring` **Rendszermonitor** + the slice-9 **"Szerver állapota (gazdagép)"** host card,
- `/backups` (Biztonsági mentés), `/settings` (Beállítások).
- - **Host-gate routing proven live:** `Host: felhom.demo-felhom.eu` → **200**; `Host: 1.2.3.4` →
- **404** (catch-all). External browser access (`felhom.demo-felhom.eu`) needs the **Cloudflare
- tunnel**, which is **unconfigured on a bare bootstrap** (`cf_tunnel_token` empty) — same slice-10
- onboarding gap as below; not separately testable without the tunnel/DNS.
-2. **`/api/host-metrics` (slice 9)** — `{ok:true}` HTTP 200. **Cross-checked against the host:**
- memory_total **exact** (16537989120 = pvesh), memory_used ~match (2.49 vs 2.43 GB), loadavg same
- ballpark, uptime match (324255 vs 324283), disk match (93.9 GB / 10.5% vs `df` 94G/12%),
- felhom-usb match (915.8 GiB / 0.87%, SMART 31°C). **`cpu_temp_c` is a real value** — read 46 live,
- matching sysfs `x86_pkg_temp=44`/coretemp max 46 (`sensors` is **not installed**; the agent sources
- temp from coretemp/thermal, not the `sensors` CLI).
An earlier reading of 59 was a genuine transient
- load peak during the golden-build+provision.
-3. **`/api/disks` (8C proxy)** — `{ok:true}` HTTP 200, 4 devices with data-bearing flags;
- **felhom-usb flagged `data_bearing:true` (reason "device is mounted", `/dev/sdb1`)**.
-
-### BREAK 3.2 — storage auto-discovery (by-design consequence of 8C de-privileging)
-The 1TB HDD does **NOT** appear under Settings → Storage Paths. The de-privileged container sees **no
-host storage mounts** (`df` inside shows only the bootstrap bind, its own disk, udev; `/mnt` empty;
-logs: "no storage paths registered", "stat /mnt/sys_drive: no such file or directory"). This is correct
-for the 8C model (mounts limited to bootstrap+data+docker.sock). The HDD is instead surfaced via the
-**agent host-metrics storage view** (felhom-usb appears there with capacity + SMART). The legacy local
-`discoverHDDPaths` path is effectively **vestigial** in v0.39.0 — worth retiring or repurposing onto the
-agent-sourced storage list.
-
-### BREAK 3.3 — deploy→run→remove an app: not exercisable on a bare bootstrap
-`/api/stacks` → `{ok:true,data:[]}` (empty catalog). Catalog templates come from a synced
-`catalog-cache/templates`, populated by git/assets sync — **disabled on a bootstrap-seeded controller**
-(manual mode, no repo URL; the seed sets only identity/hub/local-api). No apps → nothing to deploy
-(`POST /api/stacks/filebrowser/deploy` → 400 "invalid request body"; even with a body the slug isn't in
-the catalog). Catalog configuration is **slice-10 (hub desired-state)** territory.
-
-### BREAK 3.5 — hub reporting stays DOWN (HTTP 401): host-key vs customer-key gap
-The controller's startup push and all 3 retries got **HTTP 401** from the hub
-(`[report] Push failed: HTTP 401`), reproduced directly (`POST https://hub.felhom.eu/api/v1/report`
-with the baked key → `Unauthorized` / 401).
**Root cause (code-traced + DB-confirmed):**
-- The hub's `POST /api/v1/report` authenticates a **customer-scoped** key — `checkAuthCustomer` →
- `GetCustomerConfigByAPIKey` against the `customer_configs` table, then enforces
- `authCustomerID == payload.CustomerID` (`hub/internal/api/handler.go:74-92, 208-234`).
-- Provision baked the **agent's HOST key** (`hub.api_key`, keyed on `host_id=demo-felhom-01` in the
- separate `hosts` table) into the guest's bootstrap. Host keys and customer keys are **distinct tables
- / code paths**.
-- Hub DB confirms: a `demo-felhom` customer row exists (with its **own** dashboard-generated api_key),
- a `demo-felhom-01` host row exists, and the baked key appears **once** (as the host key). So the
- controller presents the host key → `GetCustomerConfigByAPIKey` returns nil → **401**.
-
-The `demo-felhom` hub row therefore **stays DOWN/stale** — the freshly-provisioned controller can never
-report ONLINE until the bootstrap carries the **customer-scoped** api_key (or provision creates/fetches
-the customer config key, or the hub accepts host keys for customer reports). This is a **cross-component
-provisioning gap (slice-10 onboarding)**, not a controller bug. *(Reported value differs from the
-brief's "v0.34.0" — the DB shows last reports up to v0.28.8; immaterial, the row is stale either way.)*
-
----
-
-## Step 4 — Disk proxy (API-level only; NO destructive op) — INVARIANT PROVEN
-
-1. **List** — `/api/disks` `{ok:true}` with per-device data-bearing flags (above).
-2. **Data-bearing format → refusal** — hit the agent **directly** (dodges the controller's CSRF):
- `POST https://192.168.0.162:8443/disks/format` with the guest's `local_api.token`, body
- `{"vmid":9200,"device":"/dev/sdb1","fstype":"ext4"}` (felhom-usb — mounted, data-bearing).
Result:
- - **HTTP 403** — `{ "formatted": false, "data_bearing": true, "reason": "device is mounted",
- "pending_op": { "op":"storage_wipe", "host_scope":"demo-felhom-01",
- "durable_id":"byid:wwn-0x5000039ddb108568-part1", "fstype":"ext4" },
- "error": "device is data-bearing — format requires an operator signature (pending_signature)" }`
- - The agent **inspected the device itself** (`data_bearing:true`, reason "device is mounted"),
- ignored any caller claim, refused with `pending_signature`, and surfaced the durable-id-bound op to
- sign. **The disk was untouched** (post-test: `/dev/sdb1 ext4 915.8G, 8G used`, still mounted). No
- operator signature was ever passed. The controller maps this 403→409 (`agentapi.ErrFormatRefused`,
- already unit-tested). ✓
-
----
-
-## Step 5 / 6 — Orphan-template cleanup + gate + push
-
-- **Deleted** (re-confirmed unreferenced first — `grep -rn` over `internal/` matched only the
- templates' own `{{define}}` lines): `internal/web/templates/{storage_init, storage_attach, migrate,
- migrate_drive, restore}.html`. Embed is a glob; 14 templates remain.
-- **Noted, not deleted** (dead-but-harmless): `NotifyCrossDriveCompleted`/`NotifyCrossDriveFailed`
- (`notify/notifier.go:353,359`, no callers) + a vestigial `crossdrive_failed` notification toggle
- (`web/handlers.go:937`) + restic config fields/comments. Flagged for a future dedicated cleanup.
-- **Version:** v0.39.0 → **v0.39.1** (CHANGELOG entry added; version is ldflags-injected, applied at
- the next build). Source-only — no re-bake this session.
-- **Gate:** `go build ./...` **OK**; `go test ./...` **green** (agentapi, bootstrap, quiesce).
-- **Commit:** `6e77bea` (the template deletion + this report), pushed to `main` (`d8d1e17..6e77bea`).
- No working UI feature lost (the deleted pages were already unreachable — removed routes).
-
----
-
-## What broke / what's missing (the headline)
-
-| # | Item | Severity | Nature |
-|---|---|---|---|
-| 2a | Local-API channel 401 until `felhom-agent` restart after provisioning a live-daemon host | **Medium** | Provision doesn't make the running daemon reload its token store. Workaround: restart the daemon (done). Needs a provision→daemon reload signal or per-request store lookup. |
-| 3.5 | Hub report 401 — bootstrap bakes the **host** api_key, but `/api/v1/report` needs the **customer** api_key | **Medium/High** | Cross-component provisioning gap (slice-10 onboarding). Controller stays DOWN on the hub until fixed. |
-| 3.3 | No app catalog on a bare bootstrap (git/assets sync disabled) — deploy not exercisable | Expected (slice-10) | Catalog/desired-state comes from the hub later. |
-| 3.2 | Legacy Storage-Paths auto-discovery finds nothing (de-privileged container has no host mounts) | Expected (8C) | HDD is correctly surfaced via agent host-metrics instead; retire/repurpose the legacy path. |
-| 3.1 | External browser access (felhom.demo-felhom.eu) needs the Cloudflare tunnel (`cf_tunnel_token` empty) | Expected (slice-10) | Host-gate routing itself verified live (200 vs 404). |
-| 2b | Bootstrap seed-log line absent | Cosmetic | Functionally correct; logging-only. |
-| infra | Supplied Gitea token lacks registry/package scope; build-golden default tag stale (`:v0.35.0`) | Low | Used the build-server credential for the re-bake; flagged both for Viktor. |
-**Worked cleanly:** golden re-bake, provision + reboot deploy, configured-not-setup bootstrap, 0600
-bootstrap.json/controller.yaml, container de-privileging (Privileged=false, 3 mounts), no-registry-pull,
-all 5 UI pages + slice-9 host card, host-metrics (cross-checked, real cpu_temp), `/api/disks`, and the
-**8C data-bearing format-refusal invariant (403, disk untouched)**.
+## Versions / artifacts
+- Controller **v0.40.0** (CHANGELOG updated).
Pushed to `main`: commit `6a594f9` (code) — this REPORT
+ in the follow-up commit.
+- Lockstep agent **v0.19.0** (commit `e5a1819`). New golden:
+ `local:backup/vzdump-lxc-9100-2026_06_11-13_26_45.tar.zst`.
+- No secrets committed (passphrase, customer key, CF tokens, local-api token — all out-of-band/redacted).