17 KiB
05-hub-architecture.md — critical review (grounded against felhom-hub v0.6.3 source + Parts 01–04)
Method: every claim about the existing hub was checked against felhom.eu/hub/ source; every
cross-doc claim against Parts 01/03/04. Citations are file:line. Severity: blocking (wrong /
breaks an assumption) · should-fix (real gap or contradiction, low blast) · minor.
The two highest-value catches (doc assumes something the code contradicts) are S1 and S2.
Ranked summary
| # | What | Where (doc → code) | Severity |
|---|---|---|---|
| S1 | §9/§11 name the wrong infra-backup table as current — infra_backup_versions is the live/primary one; infra_backups is the deprecated write-only mirror |
05 §9/§11 → store.go:198-217,541-578 |
should-fix (code-contradiction) |
| S2 | §7 treats the CF token as geo-only; it is dual-purpose (DNS-01/ACME + WAF) and is injected into the generated controller.yaml |
05 §7 → config_form.html:76-80, controller.yaml.default:26, configgen.go:28-37, configs.go:1041 |
should-fix (code-contradiction / unverified assumption) |
| S3 | §6 leans on the existing "Push", but that is a hub→box inbound POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → configs.go:569-570,1148-1150; Part 1 §4/§5/§11; Part 3 §5 |
should-fix |
| S4 | Part 1 §6 calls app inventory "declarative"; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) |
| S5 | §9 hands "guest inventory + snapshots" from the prunable report mirror; DR soundness actually rests on durable sources | 05 §9/§3 → store.go:809-816 |
should-fix (DR robustness) |
| S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is event-based and controller-emitted | 05 §4 → deadline.go:31-86 |
should-fix (mechanism) |
| M1 | "60s staleness checker" conflates the 60s cadence with the 30m/1h threshold | 05 §4 → main.go:207-217,99-102, staleness.go:33-37 |
minor |
| M2 | §2 customer_configs field list omits api_key — the very field the per-reporter plan retires |
05 §2 → store.go:102-112 |
minor |
| M3 | §11 reports→guest_reports "rename" is really drop+create under the locked clean cutover |
05 §11 → store.go:55-119 |
minor |
| M4 | Pre-existing weak authz on infra-backup GET / /notify (any valid key, not customer-scoped) |
handler.go:407,536,568,596 | minor |
No blocking findings — the data model and the two-driver framing are sound, and the LOCKED clean cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc drives work.
Highest-value: doc assumes something the code contradicts
S1 — infra_backups vs infra_backup_versions is inverted (should-fix, code-contradiction)
05 §9: "infra_backup_versions retires; infra_backups is repurposed into the slim DR record."
§11 repeats: "RETIRE infra_backup_versions; repurpose infra_backups."
The code is the other way round:
infra_backup_versions(added v0.7.0,store.go:198-211) is the live/primary table. Every read path hits it:GetInfraBackup(store.go:565-578),GetInfraBackupByID(store.go:581-593),GetInfraBackupMeta(store.go:604),ListInfraBackupVersions(store.go:640), and the recovery endpoint (handler.go:670-686).infra_backups(original single-row,store.go:96-100) is deprecated. It is now written only as a legacy mirror ("for backward compatibility during rollback window",store.go:552-558) and is never read except as the one-time migration source (store.go:214-217).
So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean
cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11
literally would point the DR record at the wrong table.
Fix: take §11's own alternative — fold the slim DR record onto hosts and drop both
infra-backup tables. If a standalone table is kept, base it on infra_backup_versions (the one with the
data/readers), and correct the "which is current" framing.
S2 — the CF API token is not geo-only; it is the ACME token too, and ships into controller.yaml (should-fix, code-contradiction)
05 §7: "The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than pushing the token down to the controller… The controller no longer calls the CF API."
Grounding confirms the hub does hold the token and does have a remove-all path:
config_json → infrastructure.cf_api_token (configs.go:714-715,1041-1042,1089-1096) →
cfClient.RemoveGeoRules(cfToken, cfg.Domain, …) in handleGeoDisable (configs.go:1112), route
/customers/{id}/geo/disable (server.go:201-205). ✓ The §7 framing of geo-enforcement-moves-to-hub
is also consistent with Part 1 §5/§7 and Part 3 §2/§46.
But the doc's assumption that the token is for geo is contradicted by the code: the same
cf_api_token is dual-purpose —
- the config-form hint says "Zone DNS:Edit (ACME), Zone WAF:Edit (geo)" (
config_form.html:80), controller.yaml.default:26documents it as the "Cloudflare API token (DNS-01 challenge)",- and it is deep-merged into the generated
controller.yamlviaconfiggen.Generate(config_json overrides,configgen.go:28-37), i.e. today it is shipped down to the box and served at/config/{id}and/recovery/{id}.
Consequences §7 must address:
- "Token off the controller" is incomplete if the box still does DNS-01/ACME. In the CF-Tunnel model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot — but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token cannot fully come off the box.
configgenmust stop emittingcf_api_tokenintocontroller.yaml(or relocate it to a hub-only field). As written, the generated config still carries it.
Should-fix
S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids
05 §6: "the operator edits a customer's desired state (building on the existing config-form +
Push/Pull/Diff)." The form + diff/pull/push handlers exist — handlePushConfig (configs.go:569),
handlePullConfig (configs.go:952), handleConfigDiff (configs.go:861), routes at
server.go:209-229. ✓ So the UI base is real.
The wrinkle: "Push" today is a hub→controller outbound POST (handlePushConfig "sends the generated
YAML config to the controller", configs.go:569-570), as is the geo-disable notify
(notifyControllerGeoDisable → POST controllerURL/api/geo/settings, configs.go:1148-1153). Both are
the hub connecting into the box — explicitly disallowed by the box-initiated model (Part 1 §4
"the hub never initiates inbound"; §5 row agent↔hub/controller↔hub = outbound poll; Part 3 §5 "The
hub never connects inbound"). 05's own §5 already resolves this (desired state is pulled on poll
with a desired_generation). So the doc is internally consistent in mechanism but loose in wording:
make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller
poll," not a reuse of the inbound push transport. The form/diff UX carries over; the transport inverts.
(Same applies to the geo-disable controller-notify path.)
S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)
Part 1 §6 lists the durable record as including a "declarative app inventory" that survives box loss — wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is explicit the opposite way: "Apps never enter the reconcile loop… the hub only mirrors the resulting inventory… the app/customer layer is mirrored," and 05 §9 restores apps from the PBS guest snapshot, not by re-deploying a declared inventory. These are reconcilable (the mirror is durable last-known truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path. Fix (align the older doc to the locked model): in Part 1 §6 change "declarative app inventory" → "mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)
S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources
05 §9 hands the recovering agent "identity, tunnel token, storage manifest, PBS namespace, guest
inventory + snapshots." §3 places "guest inventory" and "latest PBS snapshot pointers" in
host_reports — the bottom-up mirror. But reports are pruned (Prune deletes rows older than
maxDays, store.go:809-816; the doc keeps this), so after a long pre-DR outage the last host_report
can be gone or stale. The actually-durable DR inputs are: desired-state on hosts/guests (§5), the
slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and PBS's own snapshot
enumeration (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored
inventory/pointers are convenience, not the source of record.
Fix: state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), not
from prunable host_reports, so recovery doesn't degrade when the last report has aged out. This also
keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained.
(Note: the hosts row legitimately mixes top-down intent columns with a few box-reported columns —
repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split
stays legible at the column level.)
S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes)
05 §4: "The existing backup-deadline checker maps onto host_reports' last-backup-per-target." The
existing checker is event-based, not field-based: CheckBackupDeadlines looks for
backup_completed / backup_failed (and db_dump_*) events since Budapest midnight and emits
expected_backup_missed if neither is present (deadline.go:31-86). Two changes the doc should make
explicit:
- Mechanism: either keep it event-based (someone emits
backup_completed) or genuinely move it to ahost_reports.last_backup_per_targetfield check — the doc says the latter but the impl is the former. - Emitter: today the controller emits backup events; in the de-privileged model the agent
owns backup/PBS (Part 3 §8), so the agent must now emit
backup_completed/backup_failed(or the host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes silent after the controller stops doing backups.
Minor
- M1 — "60s staleness checker" (§4). 60s is the check cadence (
main.go:207-217,ticker := time.NewTicker(60 * time.Second)); the staleness threshold is 30m (default,main.go:99-102) with down at 2× = 60m (staleness.go:33-37; CLAUDE.md "OK <30m, DOWN >1h"). The event-transition mechanism (node_stale/node_down/node_recovered) is described correctly (staleness.go:155-185). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)." - M2 —
customer_configsfields (§2). The list ("identity, domain, email, retrieval_password, status, config_json") omitsapi_key(store.go:108) — the field §2's per-reporter plan actually retires. Worth notingcustomer_configs.api_keybecomes unused once auth resolves viahosts.api_key/guests.api_key. - M3 — rename under clean cutover (§11).
migrate()is allCREATE TABLE IF NOT EXISTS+ idempotentALTER(store.go:55-119,146-149). §11's claim "grounded in store.go's idempotent style" is accurate. But areports→guest_reportsrename isn't part of that style; under the LOCKED clean cutover (demo re-enrolls fresh, §2) it is really dropreports+ createguest_reportswith no data migration. Name it as such to avoid implying an in-place rename + backfill. - M4 — pre-existing weak authz.
handleInfraBackupGet/VersionsandhandleNotify/handleSavePreferences/handleInfraBackupPushusecheckAuth(global or any customer key,handler.go:63-66), not customer-scopedcheckAuthCustomer. Most retire with the infra-backup push (§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.
Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)
- §10 KEEP list matches the schema exactly:
customer_configs,events,customer_notifications,notification_log,app_telemetry,app_log_issuesall present (store.go:74-189,102-135). The asset manager exists (handler.go:57,834-867). ✓ - §10 two-tier notifications (operator English / customer Hungarian, Resend, cooldowns) match
notify/dispatcher.go:processOperator(1h cooldown,FormatOperatorEmail, gated byoperatorOn,dispatcher.go:91-114) +processCustomer(prefs-driven, default 6h,FormatCustomerEmail,dispatcher.go:116-158); wired inmain.go:134. ✓ - §8 enrollment / §11 configgen — deep-merge + Hungarian passphrase base is real:
configgen.deepMerge(configgen.go:76-91), programmatic overrides +hub.api_key = cfg.APIKey(configgen.go:40-47), retrieval-password gate (handler.go:709-753). The evolution to agent-first + per-guest keys + key-pinning is a clean extension. ✓ - §2 auth extension (Bearer → reporter → customer) is clean against today's
checkAuthCustomer(global key, elseGetCustomerConfigByAPIKey,handler.go:72-90,store.go:913-935); adding host/guest key lookups slots straight in. ✓ - §11 "idempotent style" is accurate (
store.go:55-119). New tables/columns (hosts,guests,host_reports,signed_ops,desired_generation,desired_spec_json) follow the existingCREATE IF NOT EXISTS/ALTER …pattern cleanly. - §9 escrow/custody is consistent with Part 1 §8 (three-tier custody, zero-knowledge default, recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the irreducible residual; operator-managed tier avoids it" matches Part 1 §8 verbatim in spirit. ✓
- §4 dead-man's-switch (host-report recency = primary liveness) is consistent with Part 3 §5 ("the heartbeat is the liveness signal… first-class treatment hub-side"). ✓
- §5/§6 signed-op + desired-state are consistent with Part 4 and Part 3 §4:
hub holds no signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing
key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6
"the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only.
05 §6's
signed_opscolumns are a consistent superset of Part 4 §2.1's blob{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}(05 adds hub-side lifecycle statesdelivered/rejected— fine). The local-CLI hand-off (felhom-sign --pending) matches Part 4 §5–6'sSigner-on-the-workstation model. ✓
Two-driver soundness (axis 3) — holds
No place in 05 has the hub drive box/customer-owned state. Desired-state (§5) is all infrastructure
intent (guests, storage policy, versions, identity, tunnel) — top-down and legitimate. Apps are
explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect →
assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the
hosts row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow
key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage
reality because the reality columns (durable_id attached, snapshot pointers, app inventory) are
mirror-only and never serve as desired state.
DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification
Retiring the controller's infra-backup push is safe given that DR reads from durable sources, not
the prunable mirror (S5). What the old push carried — deployed_stacks + disk_layout.mounts
(store.go:768-795, surfaced by handleRecovery, handler.go:620-705) — is reconstructible:
storage layout/durable_ids from the storage manifest (desired-state, durable) + host-report mirror;
app inventory from the guest inside the PBS snapshot (so it need not be separately stored); snapshot
list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is
explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing
DR-essential is lost by removing the push provided §9 is amended per S5 to name durable sources and
not lean on host_reports retention.