Architecture Part 5 — The Hub
Status: design draft (decision content). To be validated by Claude Code against the actual felhom-hub source (felhom.eu repo, hub/) + Parts 01–04, then placed at docs/architecture/05-hub-architecture.md.
The hub is not greenfield: it is a mature service (felhom-hub v0.6.3, Go + SQLite on k3s, hub.felhom.eu). This doc describes the deltas needed to evolve it for the Proxmox model, plus the new data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).
1. Source-of-truth model — two drivers, two directions
The single most important framing, and the one that governs everything below: the hub is not a monolithic source of truth. State flows in two directions with opposite drivers.
- Operator-driven intent — hub authors, agent reconciles (top-down). Which guests should exist and their spec, storage policy (a target's role/class/backup schedule), controller + golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges toward them. Here the hub is the source of truth.
- Box/customer-driven reality: box authors, pushes up, hub mirrors (bottom-up). Which USB drive is physically attached (and its durable_id), what apps are deployed and where, the customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The customer or the physical world drives these; the box reports them; the hub stays an up-to-date mirror but is never the driver.
They meet at a handshake, not a tug-of-war. Storage is the clearest case: the customer plugs in
a drive → the agent detects it and reports durable_id X attached (reality) → the operator
assigns role=bulk, class=slow, backup=weekly (policy, intent) → the agent reconciles that policy
onto the detected drive. Apps never enter the reconcile loop — app deployment is the
controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the
resulting inventory. Reconciliation applies to infrastructure; the app/customer layer is mirrored.
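The storage handshake can be sketched as a join on durable_id between what the box reports and what the operator has assigned. This is a minimal illustration, not the agent's actual code: the type names, the `handshake` function, and the three outcome buckets are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// DetectedDrive is box-reported reality (bottom-up). Hypothetical type.
type DetectedDrive struct {
	DurableID string
}

// StoragePolicy is operator intent authored in the hub (top-down). Hypothetical type.
type StoragePolicy struct {
	Role, Class, Backup string
}

// Plan is the result of joining reality with intent on durable_id.
type Plan struct {
	Apply      map[string]StoragePolicy // attached drive with a policy: reconcile it
	Unassigned []string                 // attached but no policy yet: report and wait for the operator
	Absent     []string                 // policy exists but drive is not attached: report only
}

// handshake never invents a drive and never invents a policy;
// it only pairs the two where both sides agree on a durable_id.
func handshake(detected []DetectedDrive, policies map[string]StoragePolicy) Plan {
	plan := Plan{Apply: map[string]StoragePolicy{}}
	seen := map[string]bool{}
	for _, d := range detected {
		seen[d.DurableID] = true
		if p, ok := policies[d.DurableID]; ok {
			plan.Apply[d.DurableID] = p
		} else {
			plan.Unassigned = append(plan.Unassigned, d.DurableID)
		}
	}
	for id := range policies {
		if !seen[id] {
			plan.Absent = append(plan.Absent, id)
		}
	}
	// Map iteration order is random, so sort for stable reporting.
	sort.Strings(plan.Unassigned)
	sort.Strings(plan.Absent)
	return plan
}

func main() {
	detected := []DetectedDrive{{DurableID: "X"}, {DurableID: "Y"}}
	policies := map[string]StoragePolicy{
		"X": {Role: "bulk", Class: "slow", Backup: "weekly"},
		"Z": {Role: "backup", Class: "slow", Backup: "daily"},
	}
	plan := handshake(detected, policies)
	fmt.Println(plan.Apply, plan.Unassigned, plan.Absent)
}
```

Drive Y shows the waiting half of the handshake: it is reported upward but nothing is reconciled onto it until the operator assigns a policy.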
2. Data model (Part 1 decision (b): customer-anchored)
A customer's deployment is one Host (its agent) plus one-or-more Guests (its controllers).
1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the
hosts table is the seam it would use).
- customer_configs (existing): the Customer anchor (identity, domain, email, retrieval_password, status, config_json). Unchanged role.
- hosts (new): host_id PK, customer_id, api_key (the agent's hub key), agent_version, desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host desired_generation counter, the slim DR record (§9), timestamps.
- guests (new): guest_id PK, customer_id, host_id, api_key (the controller's hub key), display_name, controller_version, per-guest desired_spec_json (CPU/mem/disk, versions), timestamps.
Per-reporter keys: today's per-customer customer_configs.api_key becomes per-reporter —
hosts.api_key (agent) and guests.api_key (controller). The hub resolves a presented Bearer key →
host or guest → customer; customer_configs.api_key goes unused once auth resolves via the new keys.
Clean cutover: no dual-model support; the demo re-enrolls fresh into host + guests.
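A minimal sketch of the key resolution, assuming in-memory maps in place of the real lookups on hosts.api_key and guests.api_key. The `Reporter` and `KeyIndex` types and the `Resolve` method are hypothetical names for illustration; a real implementation would query SQLite and would likely compare keys in constant time.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Reporter is whoever presented the key, already resolved to its customer.
type Reporter struct {
	Kind       string // "host" (agent) or "guest" (controller)
	ID         string // host_id or guest_id
	CustomerID string
}

var ErrUnauthorized = errors.New("unauthorized")

// KeyIndex stands in for the hosts.api_key and guests.api_key columns.
type KeyIndex struct {
	Hosts  map[string]Reporter
	Guests map[string]Reporter
}

// Resolve maps an Authorization header value to host or guest, then customer.
// customer_configs.api_key is deliberately not consulted.
func (k KeyIndex) Resolve(authHeader string) (Reporter, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return Reporter{}, ErrUnauthorized
	}
	key := strings.TrimPrefix(authHeader, prefix)
	if key == "" {
		return Reporter{}, ErrUnauthorized
	}
	if r, ok := k.Hosts[key]; ok {
		return r, nil
	}
	if r, ok := k.Guests[key]; ok {
		return r, nil
	}
	return Reporter{}, ErrUnauthorized
}

func main() {
	idx := KeyIndex{
		Hosts:  map[string]Reporter{"agent-key": {Kind: "host", ID: "h1", CustomerID: "c1"}},
		Guests: map[string]Reporter{"ctrl-key": {Kind: "guest", ID: "g1", CustomerID: "c1"}},
	}
	r, err := idx.Resolve("Bearer ctrl-key")
	fmt.Println(r, err)
}
```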
3. Report ingest — two domains
The single controller report splits. The de-privileged controller no longer sees host disks/storage/backup, so its report slims (it loses System/Storage/Backup, keeps app-domain).
- POST /api/v1/host-report (new, agent) → host_reports: host CPU/RAM/disk, per-guest up/down + spec, storage-target status (attached drives + durable_id + reachability), last backup + restore-test per target, latest PBS snapshot pointers, cloudflared health, agent + controller versions. Denormalized columns for the dashboard; full report_json. Index (host_id, received_at DESC) + (customer_id, received_at DESC).
- POST /api/v1/report (existing, slimmed controller) → guest_reports (the successor to reports; see §11): it gains guest_id + host_id; its cpu/memory denorm now means guest-level; backup_last_snapshot goes quiet (backup status lives in host_reports). App telemetry / log issues stay.
These two streams are the bottom-up mirror of §1 — they keep the hub current without a separate push.
4. Liveness / dead-man's-switch
Evolves the existing staleness checker (60s cadence, 30m/1h thresholds — OK <30m, down at
2× = >1h; today: controller-report recency → node_stale/down/recovered):
- Primary = host-report recency → host_stale/host_down. The agent heartbeat is the box's liveness signal; a silent agent = the box is gone (the critical alert).
- Guest up/down comes from the host report's per-guest status: authoritative, every poll, faster than waiting for a guest report to go stale.
- Guest-report recency = secondary app-level signal.
Backup-deadline checker: today it is event-based — it scans for backup_completed/backup_failed
events since local midnight and alerts if none. Two changes: (1) mechanism — move it to a field
check on host_reports' last-backup-per-target (cleaner now that backup state arrives in the host
report); (2) emitter — the de-privileged controller no longer runs backups, so the agent is the
source of the last-backup status (Part 3 §8). Without re-homing the source, the deadline check would go
silent after the controller stops backing up.
5. Desired-state serving
The operator's intent (§1 top-down) lives as JSON on hosts/guests (storage manifest +
policies + golden version on the host; per-guest spec + versions on the guest) with a per-host
desired_generation. The agent pulls its host's desired state on poll (with the generation, so it
reconciles only on change and reports which generation it has converged to).
- Benign convergence (create a guest, attach storage per policy, bump a version, adjust a non-destructive policy) → the agent reconciles freely.
- Destructive convergence (guest removal = destroy, storage detach/wipe, data-losing resize) → the agent requires a matching signed op (§6) before executing that delta; absent/invalid → it refuses and reports pending_signature.
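The benign/destructive split and the generation check can be sketched as two small functions on the agent side. The delta kind names and function names are hypothetical, and `signedOpValid` stands in for the outcome of the Part 4 verification pipeline, which is not reproduced here.

```go
package main

import "fmt"

type DeltaKind string

const (
	GuestCreate   DeltaKind = "guest_create"
	StorageAttach DeltaKind = "storage_attach"
	VersionBump   DeltaKind = "version_bump"
	GuestDestroy  DeltaKind = "guest_destroy"
	StorageDetach DeltaKind = "storage_detach"
	StorageWipe   DeltaKind = "storage_wipe"
	ResizeShrink  DeltaKind = "resize_data_losing"
)

// destructive lists the delta kinds that need a signed op before execution.
var destructive = map[DeltaKind]bool{
	GuestDestroy:  true,
	StorageDetach: true,
	StorageWipe:   true,
	ResizeShrink:  true,
}

type Delta struct {
	Kind   DeltaKind
	Target string
}

// needsReconcile is true only when the hub's desired_generation has moved
// past the generation the agent last converged to.
func needsReconcile(desiredGen, convergedGen int64) bool {
	return desiredGen != convergedGen
}

// decide returns "apply" for benign deltas and for destructive deltas with a
// verified signed op; otherwise the agent refuses and reports pending_signature.
func decide(d Delta, signedOpValid bool) string {
	if !destructive[d.Kind] || signedOpValid {
		return "apply"
	}
	return "pending_signature"
}

func main() {
	if needsReconcile(7, 6) {
		fmt.Println(decide(Delta{Kind: GuestCreate, Target: "g2"}, false))
		fmt.Println(decide(Delta{Kind: GuestDestroy, Target: "g1"}, false))
		fmt.Println(decide(Delta{Kind: GuestDestroy, Target: "g1"}, true))
	}
}
```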
Geo is not in the agent's desired state — it's customer→hub→Cloudflare (§7); the agent never touches WAF.
6. Authorization — signed-op queue + editing flow
Implements Part 4's gate on the hub side. The hub holds no signing key.
- signed_ops (new): op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed / failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result.
- Editing flow: the operator edits a customer's desired state, reusing the existing config-form + diff UX. Note the transport inverts: today's "Push" is a hub→box inbound POST (forbidden by the box-initiated model); here "publish" means write to desired state, delivered on the next agent/controller poll. The form and diff carry over; the push transport does not. The hub diffs vs current and classifies each delta (B1 rule):
  - benign → published straight to desired state;
  - destructive → the hub generates the canonical op blob and routes it through signing.
- Signing hand-off (Part 4 option (b)): a local operator CLI (felhom-sign --pending) fetches the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the signature back into signed_ops. The hub never sees the key.
- The agent polls signed_ops for its host alongside desired state, verifies (Part 4 pipeline), executes, and reports status → the hub logs to the existing events audit trail.
- Classification lives in both places, with different jobs: the hub classifies at edit time for UX (prompt to sign); the agent's classification is the authoritative guard (a compromised hub could skip the prompt, but the agent still enforces the signature).
- A pending-ops view per customer shows the lifecycle (awaiting signature → awaiting agent → executed).
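The signed_ops status column implies a small state machine, sketched below as a transition table. The status names come from the list above; which states may move to expired or rejected is an assumption made for this example (here: anything not yet terminal can expire, and only a delivered op can be rejected by the agent), and the type and function names are hypothetical.

```go
package main

import "fmt"

type OpStatus string

const (
	PendingSignature OpStatus = "pending_signature"
	Signed           OpStatus = "signed"
	Delivered        OpStatus = "delivered"
	Executed         OpStatus = "executed"
	Failed           OpStatus = "failed"
	Expired          OpStatus = "expired"
	Rejected         OpStatus = "rejected"
)

// allowed lists the legal next statuses; terminal statuses have no entry.
var allowed = map[OpStatus][]OpStatus{
	PendingSignature: {Signed, Expired},
	Signed:           {Delivered, Expired},
	Delivered:        {Executed, Failed, Rejected, Expired},
}

// canTransition guards status updates so an op cannot skip signing
// or be revived after reaching a terminal status.
func canTransition(from, to OpStatus) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(PendingSignature, Signed))
	fmt.Println(canTransition(PendingSignature, Delivered))
	fmt.Println(canTransition(Executed, Signed))
}
```

The pending-ops view maps onto the same table: "awaiting signature" is pending_signature, "awaiting agent" covers signed and delivered, and the rest are terminal.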
7. Geo enforcement (Part-2 S4)
The hub already holds the CF API token and already has a remove-all path
(internal/web/configs.go handleGeoDisable → cloudflare.RemoveGeoRules). But the token is
dual-purpose today — DNS-01/ACME and WAF/geo — and configgen.Generate deep-merges it (via
config_json) into the generated controller.yaml, so it currently ships down to the box. Two
things follow:
- ACME assumption (must be stated, not skipped): in the Cloudflare-Tunnel-default model the edge terminates TLS, so the box needs no public certificate and the DNS-01/ACME use of the token goes away. Granting that, the token comes fully off the box and lives hub-only. (If any box still does DNS-01, the token cannot fully come off — so this assumption is load-bearing.)
- configgen must stop emitting cf_api_token into controller.yaml (drop it from the merge / relocate it to a hub-only field).
The delta: the customer sets geo in the controller UI → the controller reports the geo desired-state up → the hub reconciles it into the Cloudflare WAF (rather than the box calling the CF API). The hub keeps the remove-all override for self-lockout. The controller no longer calls the CF API.
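One way to keep the token off the box is to strip it from the merged configuration just before rendering. The sketch below is not configgen's actual code: it assumes the merged config is a nested map and removes the cf_api_token key wherever it appears, since this document does not say where in config_json the key sits. The `stripKey` function is a hypothetical helper.

```go
package main

import "fmt"

// stripKey removes every occurrence of key from a nested config map,
// descending into nested maps and into maps held inside lists.
func stripKey(cfg map[string]any, key string) {
	delete(cfg, key)
	for _, v := range cfg {
		switch t := v.(type) {
		case map[string]any:
			stripKey(t, key)
		case []any:
			for _, item := range t {
				if m, ok := item.(map[string]any); ok {
					stripKey(m, key)
				}
			}
		}
	}
}

func main() {
	// Stand-in for the deep-merged config that would become controller.yaml.
	merged := map[string]any{
		"domain":       "example.test",
		"cf_api_token": "secret",
		"cloudflare": map[string]any{
			"zone":         "z1",
			"cf_api_token": "secret",
		},
	}
	stripKey(merged, "cf_api_token")
	fmt.Println(merged)
}
```

Stripping at render time is a backstop; relocating the token to a hub-only field, as the bullet above suggests, removes it from the merge input altogether.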
8. Enrollment (evolution of the existing retrieval-password/config-gen flow)
Today: GET /config/{id} with an X-Retrieval-Password (Hungarian passphrase) returns a deep-merged
controller.yaml. New:
- Enrollment mints the agent identity first (the agent then provisions controllers), pins the operator signing public keys (Part 4 — operational + cold recovery) onto the agent, and the agent mints each controller's bootstrap (its hub guest key + local-API token).
- A restore-mode re-enrollment (§9) hands an existing identity to a fresh agent.
The existing configgen deep-merge + Hungarian-passphrase machinery is the base; it grows the
agent-first + key-pinning + restore-mode steps.
9. DR model
The headline: the old heavy infra-backup push retires — not because the hub authors everything (§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams, and (b) the actual app data + configs live inside the PBS guest snapshot. So a separate config+secrets+restic-password infra-backup blob is redundant.
What remains:
- the report streams keep the hub's mirror current (storage layout + durable_ids, app inventory, snapshot pointers), but this mirror is convenience, not the DR source of record (reports are pruned by age);
- the agent escrows the recovery-code-wrapped PBS key to the hub (the one artifact only the box can produce; zero-knowledge: the hub stores it, cannot open it);
- a slim DR record on the hosts row (PBS namespace + repo fingerprint + the wrapped escrow key).
These last two are box-reported columns on an otherwise operator-intent row, labelled as such so the §1 two-driver split stays legible per column.
Both existing infra-backup tables retire — infra_backup_versions (the current/live one, all readers
hit it) and infra_backups (the deprecated legacy mirror). The slim DR record folds onto hosts
instead. The controller's infra-backup push is removed (it's de-privileged).
Recovery (host loss): the new agent re-enrolls in restore mode; the hub hands it the durable
record — and DR reads from the durable sources, not the prunable report mirror: operator intent
(desired-state on hosts/guests — identity, tunnel token, storage manifest), the slim DR record
(PBS namespace + repo fingerprint), the wrapped escrow key, and PBS's own snapshot enumeration
(the agent lists snapshots once it has the namespace + unwrapped key). Guest inventory + app data come
from inside the PBS guest snapshots, not from a retained host_report, so recovery doesn't degrade
when the last report has aged out. The customer provides their recovery code at the agent, which
unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets
identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium
operator-managed custody tier avoids it, at the cost of the operator holding the key). The old
controller-targeted GET /recovery/{id} is replaced by this agent restore-mode flow.
10. What persists from today (unchanged or lightly adapted)
The Customer record (customer_configs); config generation/retrieval (configgen); the two-tier
notification system (operator English / customer Hungarian, Resend, cooldowns); events + audit;
app_telemetry / app_log_issues; customer lifecycle actions (block/unblock, trigger-update,
delete); the asset manager; and the dashboard — adapted to render the host + guests view per
customer instead of a single controller.
11. Schema deltas (grounded in store.go's idempotent style; clean cutover)
- NEW: hosts, guests, host_reports, signed_ops.
- DROP reports + CREATE guest_reports (under the clean cutover this is drop+create with no data migration, not an in-place rename); guest_reports adds guest_id, host_id; cpu/memory mean guest-level; backup_last_snapshot goes quiet.
- ADD desired-state JSON + desired_generation to hosts; desired_spec_json to guests; the slim DR record (PBS namespace + repo fingerprint + wrapped escrow key) onto hosts.
- DROP both infra_backup_versions (current/live) and infra_backups (legacy mirror): the DR record replaces them on hosts.
- KEEP customer_configs, events, customer_notifications, notification_log, app_telemetry, app_log_issues.
- Authz cleanup the cutover enables: several endpoints today use global-or-any-customer-key auth rather than customer-scoped (the infra-backup GETs, /notify). Most retire with the infra-backup push; any that carry over should scope to the resolved host/guest → customer under §2.
12. Open items
- Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the key custody/rotation tooling is Part 4's.
- Multi-tenant resource fairness (deferred shared-host case).
- Hub-side desired-state editing UX specifics (form/diff wiring), to be grounded against hub/internal/web/configs.go at implementation.
- Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).