# `05-hub-architecture.md` — critical review (grounded against felhom-hub v0.6.3 source + Parts 01–04) Method: every claim about the existing hub was checked against `felhom.eu/hub/` source; every cross-doc claim against Parts 01/03/04. Citations are `file:line`. Severity: **blocking** (wrong / breaks an assumption) · **should-fix** (real gap or contradiction, low blast) · **minor**. The two highest-value catches (doc assumes something the code contradicts) are **S1** and **S2**. --- ## Ranked summary | # | What | Where (doc → code) | Severity | |---|---|---|---| | S1 | §9/§11 name the **wrong infra-backup table as current** — `infra_backup_versions` is the live/primary one; `infra_backups` is the deprecated write-only mirror | 05 §9/§11 → `store.go:198-217,541-578` | should-fix (code-contradiction) | | S2 | §7 treats the CF token as **geo-only**; it is **dual-purpose (DNS-01/ACME + WAF)** and is injected into the generated `controller.yaml` | 05 §7 → `config_form.html:76-80`, `controller.yaml.default:26`, `configgen.go:28-37`, `configs.go:1041` | should-fix (code-contradiction / unverified assumption) | | S3 | §6 leans on the existing **"Push"**, but that is a hub→box **inbound** POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → `configs.go:569-570,1148-1150`; Part 1 §4/§5/§11; Part 3 §5 | should-fix | | S4 | Part 1 §6 calls app inventory **"declarative"**; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) | | S5 | §9 hands "guest inventory + snapshots" **from the prunable report mirror**; DR soundness actually rests on durable sources | 05 §9/§3 → `store.go:809-816` | should-fix (DR robustness) | | S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is **event-based** and controller-emitted | 05 §4 → `deadline.go:31-86` | should-fix (mechanism) | | M1 | "60s staleness checker" conflates the 60s **cadence** with the 30m/1h **threshold** | 05 §4 → `main.go:207-217,99-102`, `staleness.go:33-37` | minor | | M2 | §2 `customer_configs` field list omits `api_key` — the very field the per-reporter plan retires | 05 §2 → `store.go:102-112` | minor | | M3 | §11 `reports`→`guest_reports` "rename" is really drop+create under the locked clean cutover | 05 §11 → `store.go:55-119` | minor | | M4 | Pre-existing weak authz on infra-backup GET / `/notify` (any valid key, not customer-scoped) | handler.go:407,536,568,596 | minor | No **blocking** findings — the data model and the two-driver framing are sound, and the LOCKED clean cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc drives work. --- ## Highest-value: doc assumes something the code contradicts ### S1 — `infra_backups` vs `infra_backup_versions` is inverted (should-fix, code-contradiction) 05 §9: *"`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record."* §11 repeats: *"RETIRE `infra_backup_versions`; repurpose `infra_backups`."* The code is the other way round: - `infra_backup_versions` (added v0.7.0, `store.go:198-211`) is the **live/primary** table. **Every read path hits it**: `GetInfraBackup` (`store.go:565-578`), `GetInfraBackupByID` (`store.go:581-593`), `GetInfraBackupMeta` (`store.go:604`), `ListInfraBackupVersions` (`store.go:640`), and the recovery endpoint (`handler.go:670-686`). - `infra_backups` (original single-row, `store.go:96-100`) is **deprecated**. It is now **written only as a legacy mirror** ("for backward compatibility during rollback window", `store.go:552-558`) and is **never read** except as the one-time migration *source* (`store.go:214-217`). So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11 literally would point the DR record at the wrong table. **Fix:** take §11's own alternative — *fold the slim DR record onto `hosts`* and **drop both** infra-backup tables. If a standalone table is kept, base it on `infra_backup_versions` (the one with the data/readers), and correct the "which is current" framing. ### S2 — the CF API token is **not** geo-only; it is the ACME token too, and ships into `controller.yaml` (should-fix, code-contradiction) 05 §7: *"The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than pushing the token down to the controller… The controller no longer calls the CF API."* Grounding confirms the hub **does** hold the token and **does** have a remove-all path: `config_json → infrastructure.cf_api_token` (`configs.go:714-715,1041-1042,1089-1096`) → `cfClient.RemoveGeoRules(cfToken, cfg.Domain, …)` in `handleGeoDisable` (`configs.go:1112`), route `/customers/{id}/geo/disable` (`server.go:201-205`). ✓ The §7 framing of geo-enforcement-moves-to-hub is also consistent with Part 1 §5/§7 and Part 3 §2/§46. **But the doc's assumption that the token is *for geo* is contradicted by the code:** the same `cf_api_token` is **dual-purpose** — - the config-form hint says **"Zone DNS:Edit (ACME), Zone WAF:Edit (geo)"** (`config_form.html:80`), - `controller.yaml.default:26` documents it as the **"Cloudflare API token (DNS-01 challenge)"**, - and it is **deep-merged into the generated `controller.yaml`** via `configgen.Generate` (config_json overrides, `configgen.go:28-37`), i.e. **today it is shipped down to the box** and served at `/config/{id}` and `/recovery/{id}`. Consequences §7 must address: 1. **"Token off the controller" is incomplete** if the box still does DNS-01/ACME. In the CF-Tunnel model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot — but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token cannot fully come off the box. 2. **`configgen` must stop emitting `cf_api_token` into `controller.yaml`** (or relocate it to a hub-only field). As written, the generated config still carries it. --- ## Should-fix ### S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids 05 §6: *"the operator edits a customer's desired state (building on the existing config-form + Push/Pull/Diff)."* The form + diff/pull/push handlers exist — `handlePushConfig` (`configs.go:569`), `handlePullConfig` (`configs.go:952`), `handleConfigDiff` (`configs.go:861`), routes at `server.go:209-229`. ✓ So the UI base is real. The wrinkle: **"Push" today is a hub→controller outbound POST** (`handlePushConfig` "sends the generated YAML config to the controller", `configs.go:569-570`), as is the geo-disable notify (`notifyControllerGeoDisable` → `POST controllerURL/api/geo/settings`, `configs.go:1148-1153`). Both are the hub **connecting into the box** — explicitly disallowed by the box-initiated model (Part 1 §4 "the hub never initiates inbound"; §5 row `agent↔hub`/`controller↔hub` = outbound poll; Part 3 §5 "The hub never connects inbound"). 05's own §5 already resolves this (desired state is **pulled** on poll with a `desired_generation`). So the doc is internally consistent in *mechanism* but loose in *wording*: **make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller poll," not a reuse of the inbound push transport.** The form/diff UX carries over; the transport inverts. (Same applies to the geo-disable controller-notify path.) ### S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1) Part 1 §6 lists the durable record as including a **"declarative app inventory"** that survives box loss — wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is explicit the opposite way: *"Apps never enter the reconcile loop… the hub only mirrors the resulting inventory… the app/customer layer is mirrored,"* and 05 §9 restores apps **from the PBS guest snapshot**, not by re-deploying a declared inventory. These are reconcilable (the mirror *is* durable last-known truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path. **Fix (align the older doc to the locked model):** in Part 1 §6 change "declarative app inventory" → "mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.) ### S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources 05 §9 hands the recovering agent *"identity, tunnel token, storage manifest, PBS namespace, guest inventory + snapshots."* §3 places "guest inventory" and "latest PBS snapshot pointers" in `host_reports` — the bottom-up mirror. But reports are **pruned** (`Prune` deletes rows older than `maxDays`, `store.go:809-816`; the doc keeps this), so after a long pre-DR outage the last `host_report` can be gone or stale. The actually-durable DR inputs are: desired-state on `hosts`/`guests` (§5), the slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and **PBS's own snapshot enumeration** (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored inventory/pointers are convenience, not the source of record. **Fix:** state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), **not** from prunable `host_reports`, so recovery doesn't degrade when the last report has aged out. This also keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained. (Note: the `hosts` row legitimately mixes top-down intent columns with a few box-reported columns — repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split stays legible at the column level.) ### S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes) 05 §4: *"The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target."* The existing checker is **event-based**, not field-based: `CheckBackupDeadlines` looks for `backup_completed` / `backup_failed` (and `db_dump_*`) **events** since Budapest midnight and emits `expected_backup_missed` if neither is present (`deadline.go:31-86`). Two changes the doc should make explicit: 1. **Mechanism:** either keep it event-based (someone emits `backup_completed`) or genuinely move it to a `host_reports.last_backup_per_target` field check — the doc says the latter but the impl is the former. 2. **Emitter:** today the **controller** emits backup events; in the de-privileged model the **agent** owns backup/PBS (Part 3 §8), so the agent must now emit `backup_completed`/`backup_failed` (or the host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes silent after the controller stops doing backups. --- ## Minor - **M1 — "60s staleness checker" (§4).** 60s is the **check cadence** (`main.go:207-217`, `ticker := time.NewTicker(60 * time.Second)`); the **staleness threshold** is 30m (default, `main.go:99-102`) with down at 2× = 60m (`staleness.go:33-37`; CLAUDE.md "OK <30m, DOWN >1h"). The event-transition mechanism (`node_stale`/`node_down`/`node_recovered`) is described correctly (`staleness.go:155-185`). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)." - **M2 — `customer_configs` fields (§2).** The list ("identity, domain, email, retrieval_password, status, config_json") omits **`api_key`** (`store.go:108`) — the field §2's per-reporter plan actually retires. Worth noting `customer_configs.api_key` becomes unused once auth resolves via `hosts.api_key` / `guests.api_key`. - **M3 — rename under clean cutover (§11).** `migrate()` is all `CREATE TABLE IF NOT EXISTS` + idempotent `ALTER` (`store.go:55-119,146-149`). §11's claim "grounded in store.go's idempotent style" is accurate. But a `reports`→`guest_reports` **rename** isn't part of that style; under the LOCKED clean cutover (demo re-enrolls fresh, §2) it is really **drop `reports` + create `guest_reports`** with no data migration. Name it as such to avoid implying an in-place rename + backfill. - **M4 — pre-existing weak authz.** `handleInfraBackupGet`/`Versions` and `handleNotify`/ `handleSavePreferences`/`handleInfraBackupPush` use `checkAuth` (global **or any** customer key, `handler.go:63-66`), not customer-scoped `checkAuthCustomer`. Most retire with the infra-backup push (§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables. --- ## Confirmed accurate (grounding that holds — so the rest of the doc can be trusted) - **§10 KEEP list** matches the schema exactly: `customer_configs`, `events`, `customer_notifications`, `notification_log`, `app_telemetry`, `app_log_issues` all present (`store.go:74-189,102-135`). The asset manager exists (`handler.go:57,834-867`). ✓ - **§10 two-tier notifications** (operator English / customer Hungarian, Resend, cooldowns) match `notify/dispatcher.go`: `processOperator` (1h cooldown, `FormatOperatorEmail`, gated by `operatorOn`, `dispatcher.go:91-114`) + `processCustomer` (prefs-driven, default 6h, `FormatCustomerEmail`, `dispatcher.go:116-158`); wired in `main.go:134`. ✓ - **§8 enrollment / §11 configgen** — deep-merge + Hungarian passphrase base is real: `configgen.deepMerge` (`configgen.go:76-91`), programmatic overrides + `hub.api_key = cfg.APIKey` (`configgen.go:40-47`), retrieval-password gate (`handler.go:709-753`). The evolution to agent-first + per-guest keys + key-pinning is a clean extension. ✓ - **§2 auth extension** (Bearer → reporter → customer) is clean against today's `checkAuthCustomer` (global key, else `GetCustomerConfigByAPIKey`, `handler.go:72-90`, `store.go:913-935`); adding host/guest key lookups slots straight in. ✓ - **§11 "idempotent style"** is accurate (`store.go:55-119`). New tables/columns (`hosts`, `guests`, `host_reports`, `signed_ops`, `desired_generation`, `desired_spec_json`) follow the existing `CREATE IF NOT EXISTS` / `ALTER … ` pattern cleanly. - **§9 escrow/custody** is consistent with Part 1 §8 (three-tier custody, zero-knowledge default, recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the irreducible residual; operator-managed tier avoids it" matches Part 1 §8 verbatim in spirit. ✓ - **§4 dead-man's-switch** (host-report recency = primary liveness) is consistent with Part 3 §5 ("the heartbeat *is* the liveness signal… first-class treatment hub-side"). ✓ - **§5/§6 signed-op + desired-state** are consistent with Part 4 and Part 3 §4: hub holds **no** signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6 "the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only. 05 §6's `signed_ops` columns are a consistent superset of Part 4 §2.1's blob `{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}` (05 adds hub-side lifecycle states `delivered`/`rejected` — fine). The local-CLI hand-off (`felhom-sign --pending`) matches Part 4 §5–6's `Signer`-on-the-workstation model. ✓ ## Two-driver soundness (axis 3) — holds No place in 05 has the hub **drive** box/customer-owned state. Desired-state (§5) is all infrastructure intent (guests, storage *policy*, versions, identity, tunnel) — top-down and legitimate. Apps are explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect → assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the `hosts` row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage reality because the reality columns (`durable_id` attached, snapshot pointers, app inventory) are mirror-only and never serve as desired state. ## DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification Retiring the controller's infra-backup push is safe **given** that DR reads from durable sources, not the prunable mirror (S5). What the old push carried — `deployed_stacks` + `disk_layout.mounts` (`store.go:768-795`, surfaced by `handleRecovery`, `handler.go:620-705`) — is reconstructible: storage layout/`durable_id`s from the storage manifest (desired-state, durable) + host-report mirror; app inventory from the guest **inside the PBS snapshot** (so it need not be separately stored); snapshot list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing DR-essential is lost by removing the push **provided** §9 is amended per S5 to name durable sources and not lean on `host_reports` retention.