Files
felhom.eu/documentation/architecture/_hub-review.md
T

17 KiB
Raw Blame History

05-hub-architecture.md — critical review (grounded against felhom-hub v0.6.3 source + Parts 0104)

Method: every claim about the existing hub was checked against felhom.eu/hub/ source; every cross-doc claim against Parts 01/03/04. Citations are file:line. Severity: blocking (wrong / breaks an assumption) · should-fix (real gap or contradiction, low blast) · minor.

The two highest-value catches (doc assumes something the code contradicts) are S1 and S2.


Ranked summary

# What Where (doc → code) Severity
S1 §9/§11 name the wrong infra-backup table as currentinfra_backup_versions is the live/primary one; infra_backups is the deprecated write-only mirror 05 §9/§11 → store.go:198-217,541-578 should-fix (code-contradiction)
S2 §7 treats the CF token as geo-only; it is dual-purpose (DNS-01/ACME + WAF) and is injected into the generated controller.yaml 05 §7 → config_form.html:76-80, controller.yaml.default:26, configgen.go:28-37, configs.go:1041 should-fix (code-contradiction / unverified assumption)
S3 §6 leans on the existing "Push", but that is a hub→box inbound POST — forbidden by the box-initiated model; transport must invert to poll 05 §6 → configs.go:569-570,1148-1150; Part 1 §4/§5/§11; Part 3 §5 should-fix
S4 Part 1 §6 calls app inventory "declarative"; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS Part 1 §6 ↔ 05 §1/§9 should-fix (cross-doc)
S5 §9 hands "guest inventory + snapshots" from the prunable report mirror; DR soundness actually rests on durable sources 05 §9/§3 → store.go:809-816 should-fix (DR robustness)
S6 §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is event-based and controller-emitted 05 §4 → deadline.go:31-86 should-fix (mechanism)
M1 "60s staleness checker" conflates the 60s cadence with the 30m/1h threshold 05 §4 → main.go:207-217,99-102, staleness.go:33-37 minor
M2 §2 customer_configs field list omits api_key — the very field the per-reporter plan retires 05 §2 → store.go:102-112 minor
M3 §11 reportsguest_reports "rename" is really drop+create under the locked clean cutover 05 §11 → store.go:55-119 minor
M4 Pre-existing weak authz on infra-backup GET / /notify (any valid key, not customer-scoped) handler.go:407,536,568,596 minor

No blocking findings — the data model and the two-driver framing are sound, and the LOCKED clean cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc drives work.


Highest-value: doc assumes something the code contradicts

S1 — infra_backups vs infra_backup_versions is inverted (should-fix, code-contradiction)

05 §9: "infra_backup_versions retires; infra_backups is repurposed into the slim DR record." §11 repeats: "RETIRE infra_backup_versions; repurpose infra_backups."

The code is the other way round:

  • infra_backup_versions (added v0.7.0, store.go:198-211) is the live/primary table. Every read path hits it: GetInfraBackup (store.go:565-578), GetInfraBackupByID (store.go:581-593), GetInfraBackupMeta (store.go:604), ListInfraBackupVersions (store.go:640), and the recovery endpoint (handler.go:670-686).
  • infra_backups (original single-row, store.go:96-100) is deprecated. It is now written only as a legacy mirror ("for backward compatibility during rollback window", store.go:552-558) and is never read except as the one-time migration source (store.go:214-217).

So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11 literally would point the DR record at the wrong table. Fix: take §11's own alternative — fold the slim DR record onto hosts and drop both infra-backup tables. If a standalone table is kept, base it on infra_backup_versions (the one with the data/readers), and correct the "which is current" framing.

S2 — the CF API token is not geo-only; it is the ACME token too, and ships into controller.yaml (should-fix, code-contradiction)

05 §7: "The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than pushing the token down to the controller… The controller no longer calls the CF API."

Grounding confirms the hub does hold the token and does have a remove-all path: config_json → infrastructure.cf_api_token (configs.go:714-715,1041-1042,1089-1096) → cfClient.RemoveGeoRules(cfToken, cfg.Domain, …) in handleGeoDisable (configs.go:1112), route /customers/{id}/geo/disable (server.go:201-205). ✓ The §7 framing of geo-enforcement-moves-to-hub is also consistent with Part 1 §5/§7 and Part 3 §2/§46.

But the doc's assumption that the token is for geo is contradicted by the code: the same cf_api_token is dual-purpose

  • the config-form hint says "Zone DNS:Edit (ACME), Zone WAF:Edit (geo)" (config_form.html:80),
  • controller.yaml.default:26 documents it as the "Cloudflare API token (DNS-01 challenge)",
  • and it is deep-merged into the generated controller.yaml via configgen.Generate (config_json overrides, configgen.go:28-37), i.e. today it is shipped down to the box and served at /config/{id} and /recovery/{id}.

Consequences §7 must address:

  1. "Token off the controller" is incomplete if the box still does DNS-01/ACME. In the CF-Tunnel model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot — but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token cannot fully come off the box.
  2. configgen must stop emitting cf_api_token into controller.yaml (or relocate it to a hub-only field). As written, the generated config still carries it.

Should-fix

S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids

05 §6: "the operator edits a customer's desired state (building on the existing config-form + Push/Pull/Diff)." The form + diff/pull/push handlers exist — handlePushConfig (configs.go:569), handlePullConfig (configs.go:952), handleConfigDiff (configs.go:861), routes at server.go:209-229. ✓ So the UI base is real.

The wrinkle: "Push" today is a hub→controller outbound POST (handlePushConfig "sends the generated YAML config to the controller", configs.go:569-570), as is the geo-disable notify (notifyControllerGeoDisablePOST controllerURL/api/geo/settings, configs.go:1148-1153). Both are the hub connecting into the box — explicitly disallowed by the box-initiated model (Part 1 §4 "the hub never initiates inbound"; §5 row agent↔hub/controller↔hub = outbound poll; Part 3 §5 "The hub never connects inbound"). 05's own §5 already resolves this (desired state is pulled on poll with a desired_generation). So the doc is internally consistent in mechanism but loose in wording: make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller poll," not a reuse of the inbound push transport. The form/diff UX carries over; the transport inverts. (Same applies to the geo-disable controller-notify path.)

S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)

Part 1 §6 lists the durable record as including a "declarative app inventory" that survives box loss — wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is explicit the opposite way: "Apps never enter the reconcile loop… the hub only mirrors the resulting inventory… the app/customer layer is mirrored," and 05 §9 restores apps from the PBS guest snapshot, not by re-deploying a declared inventory. These are reconcilable (the mirror is durable last-known truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path. Fix (align the older doc to the locked model): in Part 1 §6 change "declarative app inventory" → "mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)

S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources

05 §9 hands the recovering agent "identity, tunnel token, storage manifest, PBS namespace, guest inventory + snapshots." §3 places "guest inventory" and "latest PBS snapshot pointers" in host_reports — the bottom-up mirror. But reports are pruned (Prune deletes rows older than maxDays, store.go:809-816; the doc keeps this), so after a long pre-DR outage the last host_report can be gone or stale. The actually-durable DR inputs are: desired-state on hosts/guests (§5), the slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and PBS's own snapshot enumeration (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored inventory/pointers are convenience, not the source of record. Fix: state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), not from prunable host_reports, so recovery doesn't degrade when the last report has aged out. This also keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained. (Note: the hosts row legitimately mixes top-down intent columns with a few box-reported columns — repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split stays legible at the column level.)

S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes)

05 §4: "The existing backup-deadline checker maps onto host_reports' last-backup-per-target." The existing checker is event-based, not field-based: CheckBackupDeadlines looks for backup_completed / backup_failed (and db_dump_*) events since Budapest midnight and emits expected_backup_missed if neither is present (deadline.go:31-86). Two changes the doc should make explicit:

  1. Mechanism: either keep it event-based (someone emits backup_completed) or genuinely move it to a host_reports.last_backup_per_target field check — the doc says the latter but the impl is the former.
  2. Emitter: today the controller emits backup events; in the de-privileged model the agent owns backup/PBS (Part 3 §8), so the agent must now emit backup_completed/backup_failed (or the host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes silent after the controller stops doing backups.

Minor

  • M1 — "60s staleness checker" (§4). 60s is the check cadence (main.go:207-217, ticker := time.NewTicker(60 * time.Second)); the staleness threshold is 30m (default, main.go:99-102) with down at 2× = 60m (staleness.go:33-37; CLAUDE.md "OK <30m, DOWN >1h"). The event-transition mechanism (node_stale/node_down/node_recovered) is described correctly (staleness.go:155-185). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)."
  • M2 — customer_configs fields (§2). The list ("identity, domain, email, retrieval_password, status, config_json") omits api_key (store.go:108) — the field §2's per-reporter plan actually retires. Worth noting customer_configs.api_key becomes unused once auth resolves via hosts.api_key / guests.api_key.
  • M3 — rename under clean cutover (§11). migrate() is all CREATE TABLE IF NOT EXISTS + idempotent ALTER (store.go:55-119,146-149). §11's claim "grounded in store.go's idempotent style" is accurate. But a reportsguest_reports rename isn't part of that style; under the LOCKED clean cutover (demo re-enrolls fresh, §2) it is really drop reports + create guest_reports with no data migration. Name it as such to avoid implying an in-place rename + backfill.
  • M4 — pre-existing weak authz. handleInfraBackupGet/Versions and handleNotify/ handleSavePreferences/handleInfraBackupPush use checkAuth (global or any customer key, handler.go:63-66), not customer-scoped checkAuthCustomer. Most retire with the infra-backup push (§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.

Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)

  • §10 KEEP list matches the schema exactly: customer_configs, events, customer_notifications, notification_log, app_telemetry, app_log_issues all present (store.go:74-189,102-135). The asset manager exists (handler.go:57,834-867). ✓
  • §10 two-tier notifications (operator English / customer Hungarian, Resend, cooldowns) match notify/dispatcher.go: processOperator (1h cooldown, FormatOperatorEmail, gated by operatorOn, dispatcher.go:91-114) + processCustomer (prefs-driven, default 6h, FormatCustomerEmail, dispatcher.go:116-158); wired in main.go:134. ✓
  • §8 enrollment / §11 configgen — deep-merge + Hungarian passphrase base is real: configgen.deepMerge (configgen.go:76-91), programmatic overrides + hub.api_key = cfg.APIKey (configgen.go:40-47), retrieval-password gate (handler.go:709-753). The evolution to agent-first + per-guest keys + key-pinning is a clean extension. ✓
  • §2 auth extension (Bearer → reporter → customer) is clean against today's checkAuthCustomer (global key, else GetCustomerConfigByAPIKey, handler.go:72-90, store.go:913-935); adding host/guest key lookups slots straight in. ✓
  • §11 "idempotent style" is accurate (store.go:55-119). New tables/columns (hosts, guests, host_reports, signed_ops, desired_generation, desired_spec_json) follow the existing CREATE IF NOT EXISTS / ALTER … pattern cleanly.
  • §9 escrow/custody is consistent with Part 1 §8 (three-tier custody, zero-knowledge default, recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the irreducible residual; operator-managed tier avoids it" matches Part 1 §8 verbatim in spirit. ✓
  • §4 dead-man's-switch (host-report recency = primary liveness) is consistent with Part 3 §5 ("the heartbeat is the liveness signal… first-class treatment hub-side"). ✓
  • §5/§6 signed-op + desired-state are consistent with Part 4 and Part 3 §4: hub holds no signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6 "the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only. 05 §6's signed_ops columns are a consistent superset of Part 4 §2.1's blob {op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id} (05 adds hub-side lifecycle states delivered/rejected — fine). The local-CLI hand-off (felhom-sign --pending) matches Part 4 §56's Signer-on-the-workstation model. ✓

Two-driver soundness (axis 3) — holds

No place in 05 has the hub drive box/customer-owned state. Desired-state (§5) is all infrastructure intent (guests, storage policy, versions, identity, tunnel) — top-down and legitimate. Apps are explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect → assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the hosts row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage reality because the reality columns (durable_id attached, snapshot pointers, app inventory) are mirror-only and never serve as desired state.

DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification

Retiring the controller's infra-backup push is safe given that DR reads from durable sources, not the prunable mirror (S5). What the old push carried — deployed_stacks + disk_layout.mounts (store.go:768-795, surfaced by handleRecovery, handler.go:620-705) — is reconstructible: storage layout/durable_ids from the storage manifest (desired-state, durable) + host-report mirror; app inventory from the guest inside the PBS snapshot (so it need not be separately stored); snapshot list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing DR-essential is lost by removing the push provided §9 is amended per S5 to name durable sources and not lean on host_reports retention.