Files
felhom.eu/documentation/architecture/_hub-review.md
T

222 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `05-hub-architecture.md` — critical review (grounded against felhom-hub v0.6.3 source + Parts 0104)
Method: every claim about the existing hub was checked against `felhom.eu/hub/` source; every
cross-doc claim against Parts 01/03/04. Citations are `file:line`. Severity: **blocking** (wrong /
breaks an assumption) · **should-fix** (real gap or contradiction, low blast) · **minor**.
The two highest-value catches (doc assumes something the code contradicts) are **S1** and **S2**.
---
## Ranked summary
| # | What | Where (doc → code) | Severity |
|---|---|---|---|
| S1 | §9/§11 name the **wrong infra-backup table as current**`infra_backup_versions` is the live/primary one; `infra_backups` is the deprecated write-only mirror | 05 §9/§11 → `store.go:198-217,541-578` | should-fix (code-contradiction) |
| S2 | §7 treats the CF token as **geo-only**; it is **dual-purpose (DNS-01/ACME + WAF)** and is injected into the generated `controller.yaml` | 05 §7 → `config_form.html:76-80`, `controller.yaml.default:26`, `configgen.go:28-37`, `configs.go:1041` | should-fix (code-contradiction / unverified assumption) |
| S3 | §6 leans on the existing **"Push"**, but that is a hub→box **inbound** POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → `configs.go:569-570,1148-1150`; Part 1 §4/§5/§11; Part 3 §5 | should-fix |
| S4 | Part 1 §6 calls app inventory **"declarative"**; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) |
| S5 | §9 hands "guest inventory + snapshots" **from the prunable report mirror**; DR soundness actually rests on durable sources | 05 §9/§3 → `store.go:809-816` | should-fix (DR robustness) |
| S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is **event-based** and controller-emitted | 05 §4 → `deadline.go:31-86` | should-fix (mechanism) |
| M1 | "60s staleness checker" conflates the 60s **cadence** with the 30m/1h **threshold** | 05 §4 → `main.go:207-217,99-102`, `staleness.go:33-37` | minor |
| M2 | §2 `customer_configs` field list omits `api_key` — the very field the per-reporter plan retires | 05 §2 → `store.go:102-112` | minor |
| M3 | §11 `reports``guest_reports` "rename" is really drop+create under the locked clean cutover | 05 §11 → `store.go:55-119` | minor |
| M4 | Pre-existing weak authz on infra-backup GET / `/notify` (any valid key, not customer-scoped) | handler.go:407,536,568,596 | minor |
No **blocking** findings — the data model and the two-driver framing are sound, and the LOCKED clean
cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc
drives work.
---
## Highest-value: doc assumes something the code contradicts
### S1 — `infra_backups` vs `infra_backup_versions` is inverted (should-fix, code-contradiction)
05 §9: *"`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record."*
§11 repeats: *"RETIRE `infra_backup_versions`; repurpose `infra_backups`."*
The code is the other way round:
- `infra_backup_versions` (added v0.7.0, `store.go:198-211`) is the **live/primary** table. **Every read
path hits it**: `GetInfraBackup` (`store.go:565-578`), `GetInfraBackupByID` (`store.go:581-593`),
`GetInfraBackupMeta` (`store.go:604`), `ListInfraBackupVersions` (`store.go:640`), and the recovery
endpoint (`handler.go:670-686`).
- `infra_backups` (original single-row, `store.go:96-100`) is **deprecated**. It is now **written only
as a legacy mirror** ("for backward compatibility during rollback window", `store.go:552-558`) and is
**never read** except as the one-time migration *source* (`store.go:214-217`).
So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean
cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11
literally would point the DR record at the wrong table.
**Fix:** take §11's own alternative — *fold the slim DR record onto `hosts`* and **drop both**
infra-backup tables. If a standalone table is kept, base it on `infra_backup_versions` (the one with the
data/readers), and correct the "which is current" framing.
### S2 — the CF API token is **not** geo-only; it is the ACME token too, and ships into `controller.yaml` (should-fix, code-contradiction)
05 §7: *"The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than
pushing the token down to the controller… The controller no longer calls the CF API."*
Grounding confirms the hub **does** hold the token and **does** have a remove-all path:
`config_json → infrastructure.cf_api_token` (`configs.go:714-715,1041-1042,1089-1096`) →
`cfClient.RemoveGeoRules(cfToken, cfg.Domain, …)` in `handleGeoDisable` (`configs.go:1112`), route
`/customers/{id}/geo/disable` (`server.go:201-205`). ✓ The §7 framing of geo-enforcement-moves-to-hub
is also consistent with Part 1 §5/§7 and Part 3 §2/§46.
**But the doc's assumption that the token is *for geo* is contradicted by the code:** the same
`cf_api_token` is **dual-purpose**
- the config-form hint says **"Zone DNS:Edit (ACME), Zone WAF:Edit (geo)"** (`config_form.html:80`),
- `controller.yaml.default:26` documents it as the **"Cloudflare API token (DNS-01 challenge)"**,
- and it is **deep-merged into the generated `controller.yaml`** via `configgen.Generate` (config_json
overrides, `configgen.go:28-37`), i.e. **today it is shipped down to the box** and served at
`/config/{id}` and `/recovery/{id}`.
Consequences §7 must address:
1. **"Token off the controller" is incomplete** if the box still does DNS-01/ACME. In the CF-Tunnel
model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot —
but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token
cannot fully come off the box.
2. **`configgen` must stop emitting `cf_api_token` into `controller.yaml`** (or relocate it to a
hub-only field). As written, the generated config still carries it.
---
## Should-fix
### S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids
05 §6: *"the operator edits a customer's desired state (building on the existing config-form +
Push/Pull/Diff)."* The form + diff/pull/push handlers exist — `handlePushConfig` (`configs.go:569`),
`handlePullConfig` (`configs.go:952`), `handleConfigDiff` (`configs.go:861`), routes at
`server.go:209-229`. ✓ So the UI base is real.
The wrinkle: **"Push" today is a hub→controller outbound POST** (`handlePushConfig` "sends the generated
YAML config to the controller", `configs.go:569-570`), as is the geo-disable notify
(`notifyControllerGeoDisable``POST controllerURL/api/geo/settings`, `configs.go:1148-1153`). Both are
the hub **connecting into the box** — explicitly disallowed by the box-initiated model (Part 1 §4
"the hub never initiates inbound"; §5 row `agent↔hub`/`controller↔hub` = outbound poll; Part 3 §5 "The
hub never connects inbound"). 05's own §5 already resolves this (desired state is **pulled** on poll
with a `desired_generation`). So the doc is internally consistent in *mechanism* but loose in *wording*:
**make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller
poll," not a reuse of the inbound push transport.** The form/diff UX carries over; the transport inverts.
(Same applies to the geo-disable controller-notify path.)
### S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)
Part 1 §6 lists the durable record as including a **"declarative app inventory"** that survives box loss
— wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is
explicit the opposite way: *"Apps never enter the reconcile loop… the hub only mirrors the resulting
inventory… the app/customer layer is mirrored,"* and 05 §9 restores apps **from the PBS guest snapshot**,
not by re-deploying a declared inventory. These are reconcilable (the mirror *is* durable last-known
truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path.
**Fix (align the older doc to the locked model):** in Part 1 §6 change "declarative app inventory" →
"mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not
re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)
### S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources
05 §9 hands the recovering agent *"identity, tunnel token, storage manifest, PBS namespace, guest
inventory + snapshots."* §3 places "guest inventory" and "latest PBS snapshot pointers" in
`host_reports` — the bottom-up mirror. But reports are **pruned** (`Prune` deletes rows older than
`maxDays`, `store.go:809-816`; the doc keeps this), so after a long pre-DR outage the last `host_report`
can be gone or stale. The actually-durable DR inputs are: desired-state on `hosts`/`guests` (§5), the
slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and **PBS's own snapshot
enumeration** (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored
inventory/pointers are convenience, not the source of record.
**Fix:** state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), **not**
from prunable `host_reports`, so recovery doesn't degrade when the last report has aged out. This also
keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained.
(Note: the `hosts` row legitimately mixes top-down intent columns with a few box-reported columns —
repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split
stays legible at the column level.)
### S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes)
05 §4: *"The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target."* The
existing checker is **event-based**, not field-based: `CheckBackupDeadlines` looks for
`backup_completed` / `backup_failed` (and `db_dump_*`) **events** since Budapest midnight and emits
`expected_backup_missed` if neither is present (`deadline.go:31-86`). Two changes the doc should make
explicit:
1. **Mechanism:** either keep it event-based (someone emits `backup_completed`) or genuinely move it to
a `host_reports.last_backup_per_target` field check — the doc says the latter but the impl is the
former.
2. **Emitter:** today the **controller** emits backup events; in the de-privileged model the **agent**
owns backup/PBS (Part 3 §8), so the agent must now emit `backup_completed`/`backup_failed` (or the
host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes
silent after the controller stops doing backups.
---
## Minor
- **M1 — "60s staleness checker" (§4).** 60s is the **check cadence** (`main.go:207-217`,
`ticker := time.NewTicker(60 * time.Second)`); the **staleness threshold** is 30m (default,
`main.go:99-102`) with down at 2× = 60m (`staleness.go:33-37`; CLAUDE.md "OK <30m, DOWN >1h"). The
event-transition mechanism (`node_stale`/`node_down`/`node_recovered`) is described correctly
(`staleness.go:155-185`). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)."
- **M2 — `customer_configs` fields (§2).** The list ("identity, domain, email, retrieval_password,
status, config_json") omits **`api_key`** (`store.go:108`) — the field §2's per-reporter plan
actually retires. Worth noting `customer_configs.api_key` becomes unused once auth resolves via
`hosts.api_key` / `guests.api_key`.
- **M3 — rename under clean cutover (§11).** `migrate()` is all `CREATE TABLE IF NOT EXISTS` +
idempotent `ALTER` (`store.go:55-119,146-149`). §11's claim "grounded in store.go's idempotent style"
is accurate. But a `reports``guest_reports` **rename** isn't part of that style; under the LOCKED
clean cutover (demo re-enrolls fresh, §2) it is really **drop `reports` + create `guest_reports`**
with no data migration. Name it as such to avoid implying an in-place rename + backfill.
- **M4 — pre-existing weak authz.** `handleInfraBackupGet`/`Versions` and `handleNotify`/
`handleSavePreferences`/`handleInfraBackupPush` use `checkAuth` (global **or any** customer key,
`handler.go:63-66`), not customer-scoped `checkAuthCustomer`. Most retire with the infra-backup push
(§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved
host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.
---
## Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)
- **§10 KEEP list** matches the schema exactly: `customer_configs`, `events`, `customer_notifications`,
`notification_log`, `app_telemetry`, `app_log_issues` all present (`store.go:74-189,102-135`). The
asset manager exists (`handler.go:57,834-867`). ✓
- **§10 two-tier notifications** (operator English / customer Hungarian, Resend, cooldowns) match
`notify/dispatcher.go`: `processOperator` (1h cooldown, `FormatOperatorEmail`, gated by `operatorOn`,
`dispatcher.go:91-114`) + `processCustomer` (prefs-driven, default 6h, `FormatCustomerEmail`,
`dispatcher.go:116-158`); wired in `main.go:134`. ✓
- **§8 enrollment / §11 configgen** — deep-merge + Hungarian passphrase base is real:
`configgen.deepMerge` (`configgen.go:76-91`), programmatic overrides + `hub.api_key = cfg.APIKey`
(`configgen.go:40-47`), retrieval-password gate (`handler.go:709-753`). The evolution to agent-first +
per-guest keys + key-pinning is a clean extension. ✓
- **§2 auth extension** (Bearer → reporter → customer) is clean against today's
`checkAuthCustomer` (global key, else `GetCustomerConfigByAPIKey`, `handler.go:72-90`,
`store.go:913-935`); adding host/guest key lookups slots straight in. ✓
- **§11 "idempotent style"** is accurate (`store.go:55-119`). New tables/columns (`hosts`, `guests`,
`host_reports`, `signed_ops`, `desired_generation`, `desired_spec_json`) follow the existing
`CREATE IF NOT EXISTS` / `ALTER … ` pattern cleanly.
- **§9 escrow/custody** is consistent with Part 1 §8 (three-tier custody, zero-knowledge default,
recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for
backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the
irreducible residual; operator-managed tier avoids it" matches Part 1 §8 verbatim in spirit. ✓
- **§4 dead-man's-switch** (host-report recency = primary liveness) is consistent with Part 3 §5
("the heartbeat *is* the liveness signal… first-class treatment hub-side"). ✓
- **§5/§6 signed-op + desired-state** are consistent with Part 4 and Part 3 §4:
hub holds **no** signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing
key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6
"the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only.
05 §6's `signed_ops` columns are a consistent superset of Part 4 §2.1's blob
`{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}` (05 adds hub-side
lifecycle states `delivered`/`rejected` — fine). The local-CLI hand-off (`felhom-sign --pending`)
matches Part 4 §56's `Signer`-on-the-workstation model. ✓
## Two-driver soundness (axis 3) — holds
No place in 05 has the hub **drive** box/customer-owned state. Desired-state (§5) is all infrastructure
intent (guests, storage *policy*, versions, identity, tunnel) — top-down and legitimate. Apps are
explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect →
assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the
`hosts` row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow
key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage
reality because the reality columns (`durable_id` attached, snapshot pointers, app inventory) are
mirror-only and never serve as desired state.
## DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification
Retiring the controller's infra-backup push is safe **given** that DR reads from durable sources, not
the prunable mirror (S5). What the old push carried — `deployed_stacks` + `disk_layout.mounts`
(`store.go:768-795`, surfaced by `handleRecovery`, `handler.go:620-705`) — is reconstructible:
storage layout/`durable_id`s from the storage manifest (desired-state, durable) + host-report mirror;
app inventory from the guest **inside the PBS snapshot** (so it need not be separately stored); snapshot
list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is
explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing
DR-essential is lost by removing the push **provided** §9 is amended per S5 to name durable sources and
not lean on `host_reports` retention.