# Architecture Part 5 — The Hub
> Status: design draft (decision content). To be validated by Claude Code against the **actual
> felhom-hub source** (`felhom.eu` repo, `hub/`) + Parts 01–04, then placed at
> `docs/architecture/05-hub-architecture.md`.
>
> The hub is **not** greenfield — it's a mature service (felhom-hub v0.6.3, Go + SQLite on k3s,
> `hub.felhom.eu`). This doc is the **deltas** to evolve it for the Proxmox model, plus the new
> data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).
## 1. Source-of-truth model — two drivers, two directions
The single most important framing, and the one that governs everything below: the hub is **not** a
monolithic source of truth. State flows in two directions with opposite drivers.
- **Operator-driven *intent* — hub authors, agent reconciles (top-down).** Which guests should
exist and their spec, storage *policy* (a target's role/class/backup schedule), controller +
golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges
toward them. Here the hub *is* the source of truth.
- **Box/customer-driven *reality* — box authors, pushes up, hub mirrors (bottom-up).** Which USB
drive is *physically* attached (and its `durable_id`), what apps are deployed and where, the
customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The
customer or the physical world drives these; the box reports them; the hub stays an up-to-date
**mirror** but is **never** the driver.

They meet at a **handshake**, not a tug-of-war. Storage is the clearest case: the customer plugs in
a drive → the agent *detects* it and reports `durable_id X attached` (reality) → the operator
assigns `role=bulk, class=slow, backup=weekly` (policy, intent) → the agent reconciles that policy
*onto the detected drive*. **Apps never enter the reconcile loop** — app deployment is the
controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the
resulting inventory. **Reconciliation applies to infrastructure; the app/customer layer is mirrored.**
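
The storage handshake above can be sketched as a pure planning step: reality (detected drives) is joined with intent (operator policy) on `durable_id`. This is a minimal illustration, not the agent's actual code; the type and function names (`DetectedDrive`, `StoragePolicy`, `planStorage`) are hypothetical.

```go
package main

import "fmt"

// DetectedDrive is reality: reported bottom-up by the agent.
type DetectedDrive struct {
	DurableID string
	Attached  bool
}

// StoragePolicy is intent: authored top-down by the operator in the hub.
type StoragePolicy struct {
	DurableID string
	Role      string // e.g. "bulk"
	Class     string // e.g. "slow"
	Backup    string // e.g. "weekly"
}

// planStorage pairs each attached drive with its policy, if one exists.
// Attached drives without a policy are only mirrored (returned as unassigned);
// policies whose drive is not attached are simply not applied yet.
func planStorage(drives []DetectedDrive, policies []StoragePolicy) (apply []StoragePolicy, unassigned []string) {
	byID := make(map[string]StoragePolicy, len(policies))
	for _, p := range policies {
		byID[p.DurableID] = p
	}
	for _, d := range drives {
		if !d.Attached {
			continue
		}
		if p, ok := byID[d.DurableID]; ok {
			apply = append(apply, p)
		} else {
			unassigned = append(unassigned, d.DurableID)
		}
	}
	return apply, unassigned
}

func main() {
	drives := []DetectedDrive{{DurableID: "X", Attached: true}, {DurableID: "Y", Attached: true}}
	policies := []StoragePolicy{{DurableID: "X", Role: "bulk", Class: "slow", Backup: "weekly"}}
	apply, unassigned := planStorage(drives, policies)
	fmt.Println("reconcile:", apply, "awaiting operator policy:", unassigned)
}
```

Neither side overrides the other: a drive with no policy stays visible in the mirror, and a policy with no drive waits until the drive is reported.
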
## 2. Data model (Part 1 decision (b): customer-anchored)
A customer's deployment is one **Host** (its agent) plus one-or-more **Guests** (its controllers).
1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the
`hosts` table is the seam it would use).
- **`customer_configs`** (existing) — the Customer anchor: identity, domain, email,
`retrieval_password`, status, config_json. Unchanged role.
- **`hosts`** (new) — `host_id PK, customer_id, api_key` (the agent's hub key), `agent_version`,
desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host
**`desired_generation`** counter, the slim DR record (§9), timestamps.
- **`guests`** (new) — `guest_id PK, customer_id, host_id, api_key` (the controller's hub key),
`display_name, controller_version`, per-guest **`desired_spec_json`** (CPU/mem/disk, versions),
timestamps.

**Per-reporter keys:** today's per-customer `customer_configs.api_key` becomes per-reporter:
`hosts.api_key` (agent) and `guests.api_key` (controller). The hub resolves a presented Bearer key →
host or guest → customer; `customer_configs.api_key` goes unused once auth resolves via the new keys.

**Clean cutover:** no dual-model support; the demo re-enrolls fresh into `host + guests`.
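
The key resolution (Bearer key → host or guest → customer) might look roughly like the following. It is a sketch under assumptions: the in-memory `KeyIndex` stands in for lookups against `hosts.api_key` and `guests.api_key`, and the `Reporter` type and `Resolve` method are hypothetical names, not taken from the hub source.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Reporter identifies who presented a key and which customer it belongs to.
type Reporter struct {
	Kind       string // "host" (agent) or "guest" (controller)
	ID         string
	CustomerID string
}

var ErrUnknownKey = errors.New("unknown api key")

// KeyIndex is a stand-in for the hosts.api_key and guests.api_key lookups.
type KeyIndex struct {
	hosts  map[string]Reporter
	guests map[string]Reporter
}

// Resolve maps an Authorization header value to the reporter behind it.
func (k KeyIndex) Resolve(authHeader string) (Reporter, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return Reporter{}, ErrUnknownKey
	}
	key := strings.TrimPrefix(authHeader, prefix)
	if key == "" {
		return Reporter{}, ErrUnknownKey
	}
	if r, ok := k.hosts[key]; ok {
		return r, nil
	}
	if r, ok := k.guests[key]; ok {
		return r, nil
	}
	return Reporter{}, ErrUnknownKey
}

func main() {
	idx := KeyIndex{
		hosts:  map[string]Reporter{"host-key": {Kind: "host", ID: "h1", CustomerID: "c1"}},
		guests: map[string]Reporter{"guest-key": {Kind: "guest", ID: "g1", CustomerID: "c1"}},
	}
	r, err := idx.Resolve("Bearer guest-key")
	fmt.Println(r, err)
}
```

Because every key resolves to exactly one reporter and one customer, downstream handlers can scope their queries to that customer without consulting `customer_configs.api_key`.
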
## 3. Report ingest — two domains
The single controller report splits. The de-privileged controller no longer sees host disks/storage/
backup, so its report **slims** (it loses System/Storage/Backup, keeps app-domain).
- **`POST /api/v1/host-report`** (new, agent) → **`host_reports`**: host CPU/RAM/disk, per-guest
up/down + spec, storage-target status (attached drives + `durable_id` + reachability), last backup
+ restore-test per target, latest PBS snapshot pointers, `cloudflared` health, agent + controller
versions. Denormalized columns for the dashboard; full `report_json`. Index `(host_id, received_at
DESC)` + `(customer_id, received_at DESC)`.
- **`POST /api/v1/report`** (existing, slimmed controller) → the renamed **`guest_reports`**: it
gains `guest_id` + `host_id`; its `cpu/memory` denorm now means *guest-level*; `backup_last_snapshot`
goes quiet (backup status lives in `host_reports`). App telemetry / log issues stay.

These two streams are the bottom-up mirror of §1: they keep the hub current without a separate push.
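
As an illustration of "denormalized columns plus full `report_json`", the sketch below decodes a host report and derives a few dashboard columns. The JSON field names and the `HostReport`, `HostRow`, and `ingestHostReport` names are assumptions made for the example; the real report schema is defined by the agent (Part 3). In practice the host would be taken from the resolved API key rather than trusted from the body.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// GuestStatus is the per-guest up/down entry carried in a host report.
type GuestStatus struct {
	GuestID string `json:"guest_id"`
	Up      bool   `json:"up"`
}

// StorageTarget describes one attached drive as the agent sees it.
type StorageTarget struct {
	DurableID string `json:"durable_id"`
	Reachable bool   `json:"reachable"`
}

// HostReport is a hypothetical subset of the host-report body.
type HostReport struct {
	HostID         string          `json:"host_id"`
	AgentVersion   string          `json:"agent_version"`
	Guests         []GuestStatus   `json:"guests"`
	StorageTargets []StorageTarget `json:"storage_targets"`
}

// HostRow holds denormalized dashboard columns plus the raw body.
type HostRow struct {
	HostID             string
	GuestsUp           int
	GuestsTotal        int
	TargetsUnreachable int
	ReportJSON         string
}

// ingestHostReport decodes the body and computes the denormalized columns.
func ingestHostReport(body []byte) (HostRow, error) {
	var r HostReport
	if err := json.Unmarshal(body, &r); err != nil {
		return HostRow{}, fmt.Errorf("decode host report: %w", err)
	}
	if r.HostID == "" {
		return HostRow{}, errors.New("host report missing host_id")
	}
	row := HostRow{HostID: r.HostID, GuestsTotal: len(r.Guests), ReportJSON: string(body)}
	for _, g := range r.Guests {
		if g.Up {
			row.GuestsUp++
		}
	}
	for _, t := range r.StorageTargets {
		if !t.Reachable {
			row.TargetsUnreachable++
		}
	}
	return row, nil
}

func main() {
	body := []byte(`{"host_id":"h1","agent_version":"0.1.0",
		"guests":[{"guest_id":"g1","up":true},{"guest_id":"g2","up":false}],
		"storage_targets":[{"durable_id":"X","reachable":true}]}`)
	row, err := ingestHostReport(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d/%d guests up, %d targets unreachable\n",
		row.HostID, row.GuestsUp, row.GuestsTotal, row.TargetsUnreachable)
}
```
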
## 4. Liveness / dead-man's-switch
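Evolves the existing staleness checker, which runs on a 60s **cadence** with 30m/1h **thresholds**
(OK below 30m, stale from 30m, down at 2× the stale threshold, i.e. above 1h). Today it keys on
controller-report recency and emits `node_stale`/`down`/`recovered`. The changes: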
- **Primary = host-report recency → `host_stale` / `host_down`.** The agent heartbeat is the box's
liveness signal; a silent agent = the box is gone (the critical alert).
- **Guest up/down comes from the host report's per-guest status** — authoritative, every poll, faster
than waiting for a guest report to go stale.
- **Guest-report recency = secondary** app-level signal.
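
The threshold logic for the primary signal can be sketched as below, using the 30m/1h values quoted above. The function name `classifyAge` and the `"ok"` label are illustrative assumptions; the existing checker's structure may differ.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter = 30 * time.Minute // OK below this age
	downAfter  = 2 * staleAfter   // down above this age (1h)
)

// classifyAge maps the age of the latest host report to a liveness state.
func classifyAge(age time.Duration) string {
	switch {
	case age > downAfter:
		return "host_down"
	case age >= staleAfter:
		return "host_stale"
	default:
		return "ok"
	}
}

func main() {
	lastReport := time.Now().Add(-45 * time.Minute)
	fmt.Println(classifyAge(time.Since(lastReport)))
}
```

A checker running on the 60s cadence would call this once per host with the age of that host's newest `host_reports` row and emit an event only when the state changes.
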

**Backup-deadline checker:** today it is *event-based*: it scans for `backup_completed`/`backup_failed`
events since local midnight and alerts if none. Two changes: (1) **mechanism** — move it to a field
check on `host_reports`' last-backup-per-target (cleaner now that backup state arrives in the host
report); (2) **emitter** — the de-privileged controller no longer runs backups, so the **agent** is the
source of the last-backup status (Part 3 §8). Without re-homing the source, the deadline check would go
silent after the controller stops backing up.
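
A field-based deadline check could look like the sketch below. It assumes the host report carries a Unix timestamp and a success flag per target (`TargetBackup` is a hypothetical shape) and keeps today's "since local midnight" window; a per-target schedule such as `backup=weekly` would need a wider window.

```go
package main

import (
	"fmt"
	"time"
)

// TargetBackup is the last-backup field for one storage target.
type TargetBackup struct {
	DurableID      string
	LastBackupUnix int64 // 0 means no backup has ever been reported
	LastBackupOK   bool
}

// overdueTargets returns the targets with no successful backup at or after sinceUnix.
func overdueTargets(targets []TargetBackup, sinceUnix int64) []string {
	var overdue []string
	for _, t := range targets {
		if !t.LastBackupOK || t.LastBackupUnix < sinceUnix {
			overdue = append(overdue, t.DurableID)
		}
	}
	return overdue
}

func main() {
	now := time.Now()
	midnight := time.Date(now.Year(), now.Month(), now.Day(), 0, 0, 0, 0, now.Location())
	targets := []TargetBackup{
		{DurableID: "X", LastBackupUnix: now.Unix(), LastBackupOK: true},
		{DurableID: "Y", LastBackupUnix: midnight.Add(-48 * time.Hour).Unix(), LastBackupOK: true},
	}
	fmt.Println("overdue:", overdueTargets(targets, midnight.Unix()))
}
```
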
## 5. Desired-state serving
The operator's **intent** (§1 top-down) lives as JSON on `hosts`/`guests` (storage manifest +
policies + golden version on the host; per-guest spec + versions on the guest) with a per-host
`desired_generation`. The agent pulls its host's desired state on poll (with the generation, so it
reconciles only on change and reports which generation it has converged to).
- **Benign convergence** (create a guest, attach storage per policy, bump a version, adjust a
non-destructive policy) → the agent reconciles freely.
- **Destructive convergence** (guest removal = destroy, storage detach/wipe, data-losing resize) →
the agent requires a **matching signed op** (§6) before executing that delta; absent/invalid → it
refuses and reports `pending_signature`.
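
The generation check and the benign/destructive split can be sketched together. The delta kinds, the `verified` set (standing in for ops that already passed the Part 4 verification pipeline), and all names here are assumptions for illustration.

```go
package main

import "fmt"

// Delta is one difference between current and desired state.
type Delta struct {
	Kind   string // e.g. "guest_create", "guest_destroy"
	Target string
}

// destructiveKinds lists delta kinds that require a matching signed op.
var destructiveKinds = map[string]bool{
	"guest_destroy":  true,
	"storage_detach": true,
	"storage_wipe":   true,
	"resize_shrink":  true,
}

// Decision records what the agent does with one delta.
type Decision struct {
	Delta  Delta
	Action string // "apply" or "pending_signature"
}

// planReconcile returns nothing when the agent has already converged to the
// desired generation. Otherwise benign deltas are applied freely, while
// destructive ones are applied only if a verified signed op covers them.
func planReconcile(convergedGen, desiredGen int64, deltas []Delta, verified map[Delta]bool) []Decision {
	if desiredGen == convergedGen {
		return nil
	}
	out := make([]Decision, 0, len(deltas))
	for _, d := range deltas {
		action := "apply"
		if destructiveKinds[d.Kind] && !verified[d] {
			action = "pending_signature"
		}
		out = append(out, Decision{Delta: d, Action: action})
	}
	return out
}

func main() {
	deltas := []Delta{
		{Kind: "guest_create", Target: "g2"},
		{Kind: "guest_destroy", Target: "g1"},
	}
	for _, dec := range planReconcile(3, 4, deltas, nil) {
		fmt.Printf("%s %s -> %s\n", dec.Delta.Kind, dec.Delta.Target, dec.Action)
	}
}
```
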

**Geo is *not* in the agent's desired state**: it's customer→hub→Cloudflare (§7); the agent never
touches WAF.
## 6. Authorization — signed-op queue + editing flow
Implements Part 4's gate on the hub side. The hub holds **no signing key**.
- **`signed_ops`** (new): `op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical
JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed /
failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result`.
- **Editing flow:** the operator edits a customer's desired state, reusing the existing config-form +
diff UX. Note the **transport inverts**: today's "Push" is a hub→box *inbound* POST (forbidden by the
box-initiated model); here "publish" means **write to desired state, delivered on the next agent/
controller poll**. The form and diff carry over; the push transport does not. The hub diffs vs current
and **classifies each delta** (B1 rule):
  - **benign** → published straight to desired state;
  - **destructive** → the hub generates the canonical op blob and routes it through signing.
- **Signing hand-off (Part 4 option (b)):** a local operator CLI (`felhom-sign --pending`) fetches
the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the
signature back into `signed_ops`. The hub never sees the key.
- The agent polls `signed_ops` for its host alongside desired state, verifies (Part 4 pipeline),
executes, and reports status → the hub logs to the existing **`events`** audit trail.
- **Classification lives in both places, with different jobs:** the hub classifies at *edit time*
for UX (prompt to sign); the **agent's classification is the authoritative guard** (a compromised
hub could skip the prompt, but the agent still enforces the signature).
- A **pending-ops view** per customer shows the lifecycle (awaiting signature → awaiting agent →
executed).
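
The `signed_ops.status` lifecycle can be guarded by a small transition table. The forward path follows the column definition above; which states may move to `expired` or `rejected` is not specified here, so the edges below are an assumption (expiry from any non-terminal state, rejection only after delivery).

```go
package main

import "fmt"

// Status is a signed_ops.status value.
type Status string

const (
	PendingSignature Status = "pending_signature"
	Signed           Status = "signed"
	Delivered        Status = "delivered"
	Executed         Status = "executed"
	Failed           Status = "failed"
	Expired          Status = "expired"
	Rejected         Status = "rejected"
)

// allowed maps each non-terminal status to the statuses it may move to.
// Terminal statuses have no entry and therefore no outgoing transitions.
var allowed = map[Status][]Status{
	PendingSignature: {Signed, Expired},
	Signed:           {Delivered, Expired},
	Delivered:        {Executed, Failed, Rejected, Expired},
}

// canTransition reports whether an op may move from one status to another.
func canTransition(from, to Status) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(PendingSignature, Signed))   // forward step
	fmt.Println(canTransition(PendingSignature, Executed)) // skips signing
}
```

Checking the transition on every status update keeps a buggy or replayed agent report from moving an op backwards, for example from `executed` to `delivered`.
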
## 7. Geo enforcement (Part-2 S4)
The hub already holds the CF API token and already has a remove-all path
(`internal/web/configs.go` `handleGeoDisable` → `cloudflare.RemoveGeoRules`). **But the token is
dual-purpose today** — DNS-01/ACME *and* WAF/geo — and `configgen.Generate` deep-merges it (via
`config_json`) into the generated `controller.yaml`, so it currently ships **down to the box**. Two
things follow:
- **ACME assumption (must be stated, not skipped):** in the Cloudflare-Tunnel-default model the edge
terminates TLS, so the box needs no public certificate and the **DNS-01/ACME use of the token goes
away**. Granting that, the token comes fully off the box and lives hub-only. (If any box still does
DNS-01, the token cannot fully come off — so this assumption is load-bearing.)
- **`configgen` must stop emitting `cf_api_token`** into `controller.yaml` (drop it from the merge /
relocate it to a hub-only field).

The delta: the **customer sets geo in the controller UI → the controller reports the geo desired-state
up → the hub reconciles it into the Cloudflare WAF** (rather than the box calling the CF API). The hub
keeps the remove-all override for self-lockout. The controller no longer calls the CF API.
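
One way to keep `cf_api_token` out of the generated `controller.yaml` is to strip it from the merged config before emitting. The sketch below works on a generic decoded map and removes the key at any depth; it does not reflect how `configgen.Generate` is actually structured, and `stripKey` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stripKey removes every occurrence of key from nested maps and slices in place.
func stripKey(v any, key string) {
	switch node := v.(type) {
	case map[string]any:
		delete(node, key)
		for _, child := range node {
			stripKey(child, key)
		}
	case []any:
		for _, child := range node {
			stripKey(child, key)
		}
	}
}

func main() {
	// Stand-in for the deep-merged config_json; the layout is illustrative only.
	raw := []byte(`{"domain":"example.org","cloudflare":{"cf_api_token":"secret","zone":"example.org"}}`)
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Println("error:", err)
		return
	}
	stripKey(cfg, "cf_api_token")
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out))
}
```

Relocating the token to a hub-only field is the cleaner long-term fix; a strip step like this is a defensive backstop in case the token still appears in a customer's `config_json`.
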
## 8. Enrollment (evolution of the existing retrieval-password/config-gen flow)
Today: `GET /config/{id}` with an `X-Retrieval-Password` (Hungarian passphrase) returns a deep-merged
`controller.yaml`. New:
- Enrollment mints the **agent identity first** (the agent then provisions controllers), pins the
**operator signing public keys** (Part 4 — operational + cold recovery) onto the agent, and the
agent mints each controller's bootstrap (its hub guest key + local-API token).
- A **restore-mode** re-enrollment (§9) hands an existing identity to a fresh agent.

The existing `configgen` deep-merge + Hungarian-passphrase machinery is the base; it grows the
agent-first + key-pinning + restore-mode steps.
## 9. DR model
The headline: the **old heavy infra-backup push retires** — not because the hub authors everything
(§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams,
and (b) the actual app **data + configs live inside the PBS guest snapshot**. So a separate
config+secrets+restic-password infra-backup blob is redundant.

What remains:
- the **report streams** keep the hub's mirror current (storage layout + `durable_id`s, app inventory,
snapshot pointers) — but this mirror is **convenience, not the DR source of record** (reports are
pruned by age);
- the agent **escrows the recovery-code-wrapped PBS key** to the hub (the one artifact only the box
can produce — zero-knowledge: the hub stores it, cannot open it);
- a **slim DR record** on the `hosts` row (PBS namespace + repo fingerprint + the wrapped escrow key).
These last two are *box-reported* columns on an otherwise operator-intent row — labelled as such so
the §1 two-driver split stays legible per column.

Both existing infra-backup tables retire: `infra_backup_versions` (the current/live one, all readers
hit it) **and** `infra_backups` (the deprecated legacy mirror). The slim DR record folds onto `hosts`
instead. The **controller's infra-backup push is removed** (it's de-privileged).

**Recovery (host loss):** the new agent re-enrolls in **restore mode**; the hub hands it the durable
record — and DR reads from the **durable sources, not the prunable report mirror**: operator intent
(desired-state on `hosts`/`guests` — identity, tunnel token, storage manifest), the slim DR record
(PBS namespace + repo fingerprint), the **wrapped escrow key**, and **PBS's own snapshot enumeration**
(the agent lists snapshots once it has the namespace + unwrapped key). Guest inventory + app data come
from **inside the PBS guest snapshots**, not from a retained `host_report`, so recovery doesn't degrade
when the last report has aged out. The **customer provides their recovery code at the agent**, which
unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets
identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium
operator-managed custody tier avoids it, at the cost of the operator holding the key). The old
controller-targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.
## 10. What persists from today (unchanged or lightly adapted)
The Customer record (`customer_configs`); config generation/retrieval (`configgen`); the two-tier
notification system (operator English / customer Hungarian, Resend, cooldowns); `events` + audit;
`app_telemetry` / `app_log_issues`; customer lifecycle actions (block/unblock, trigger-update,
delete); the asset manager; and the dashboard — adapted to render the **host + guests** view per
customer instead of a single controller.
## 11. Schema deltas (grounded in store.go's idempotent style; clean cutover)
- **NEW:** `hosts`, `guests`, `host_reports`, `signed_ops`.
- **DROP `reports` + CREATE `guest_reports`** (under the clean cutover this is drop+create with no data
migration, not an in-place rename); `guest_reports` adds `guest_id`, `host_id`; `cpu/memory` mean
guest-level; `backup_last_snapshot` goes quiet.
- **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`; the
slim DR record (PBS namespace + repo fingerprint + wrapped escrow key) onto `hosts`.
- **DROP both** `infra_backup_versions` (current/live) **and** `infra_backups` (legacy mirror) — the DR
record replaces them on `hosts`.
- **KEEP** `customer_configs`, `events`, `customer_notifications`, `notification_log`,
`app_telemetry`, `app_log_issues`.
- **Authz cleanup the cutover enables:** several endpoints today use global-or-any-customer-key auth
rather than customer-scoped (the infra-backup GETs, `/notify`). Most retire with the infra-backup
push; any that carry over should scope to the resolved host/guest → customer under §2.
## 12. Open items
- Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the
key custody/rotation tooling is Part 4's.
- Multi-tenant resource fairness (deferred shared-host case).
- Hub-side desired-state **editing UX** specifics (form/diff wiring) — to be grounded against
`hub/internal/web/configs.go` at implementation.
- Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).