felhom-agent/docs/architecture/05-hub-architecture.md
2026-06-08 12:40:56 +02:00


Architecture Part 5 — The Hub

Status: design draft (decision content). To be validated by Claude Code against the actual felhom-hub source (felhom.eu repo, hub/) + Parts 01-04, then placed at docs/architecture/05-hub-architecture.md.

The hub is not greenfield — it's a mature service (felhom-hub v0.6.3, Go + SQLite on k3s, hub.felhom.eu). This doc is the deltas to evolve it for the Proxmox model, plus the new data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).

1. Source-of-truth model — two drivers, two directions

The single most important framing, and the one that governs everything below: the hub is not a monolithic source of truth. State flows in two directions with opposite drivers.

  • Operator-driven intent — hub authors, agent reconciles (top-down). Which guests should exist and their spec, storage policy (a target's role/class/backup schedule), controller + golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges toward them. Here the hub is the source of truth.
  • Box/customer-driven reality — box authors, pushes up, hub mirrors (bottom-up). Which USB drive is physically attached (and its durable_id), what apps are deployed and where, the customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The customer or the physical world drives these; the box reports them; the hub stays an up-to-date mirror but is never the driver.

They meet at a handshake, not a tug-of-war. Storage is the clearest case: the customer plugs in a drive → the agent detects it and reports durable_id X attached (reality) → the operator assigns role=bulk, class=slow, backup=weekly (policy, intent) → the agent reconciles that policy onto the detected drive. Apps never enter the reconcile loop — app deployment is the controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the resulting inventory. Reconciliation applies to infrastructure; the app/customer layer is mirrored.

2. Data model (Part 1 decision (b): customer-anchored)

A customer's deployment is one Host (its agent) plus one-or-more Guests (its controllers). 1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the hosts table is the seam it would use).

  • customer_configs (existing) — the Customer anchor: identity, domain, email, retrieval_password, status, config_json. Unchanged role.
  • hosts (new) — host_id PK, customer_id, api_key (the agent's hub key), agent_version, desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host desired_generation counter, the slim DR record (§9), timestamps.
  • guests (new) — guest_id PK, customer_id, host_id, api_key (the controller's hub key), display_name, controller_version, per-guest desired_spec_json (CPU/mem/disk, versions), timestamps.

Per-reporter keys: today's per-customer api_key becomes per-reporter — hosts.api_key (agent) and guests.api_key (controller). The hub resolves a presented Bearer key → host or guest → customer. Clean cutover: no dual-model support; the demo re-enrolls fresh into host + guests.
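The key resolution described above (Bearer key → host or guest → customer) can be sketched as follows. This is a minimal illustration, not the hub's actual code: `KeyIndex`, `Reporter`, and `Resolve` are hypothetical names, and the in-memory maps stand in for lookups against `hosts.api_key` and `guests.api_key`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ReporterKind says which kind of reporter presented the key.
type ReporterKind string

const (
	ReporterHost  ReporterKind = "host"  // the agent
	ReporterGuest ReporterKind = "guest" // a controller
)

// Reporter is the identity resolved from a presented key.
type Reporter struct {
	Kind       ReporterKind
	ID         string // host_id or guest_id
	HostID     string
	CustomerID string
}

// KeyIndex is a hypothetical stand-in for the hosts and guests tables,
// keyed by api_key.
type KeyIndex struct {
	Hosts  map[string]Reporter
	Guests map[string]Reporter
}

var (
	ErrNoBearer   = errors.New("missing bearer token")
	ErrUnknownKey = errors.New("unknown api key")
)

// Resolve maps an Authorization header value to the reporter behind it.
func (k *KeyIndex) Resolve(authHeader string) (Reporter, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) || len(authHeader) == len(prefix) {
		return Reporter{}, ErrNoBearer
	}
	key := strings.TrimPrefix(authHeader, prefix)
	if r, ok := k.Hosts[key]; ok {
		return r, nil
	}
	if r, ok := k.Guests[key]; ok {
		return r, nil
	}
	return Reporter{}, ErrUnknownKey
}

func main() {
	idx := &KeyIndex{
		Hosts:  map[string]Reporter{"host-key-1": {Kind: ReporterHost, ID: "h1", HostID: "h1", CustomerID: "c1"}},
		Guests: map[string]Reporter{"guest-key-1": {Kind: ReporterGuest, ID: "g1", HostID: "h1", CustomerID: "c1"}},
	}
	r, err := idx.Resolve("Bearer guest-key-1")
	fmt.Println(r.Kind, r.ID, r.CustomerID, err)
}
```

Because the clean cutover drops the per-customer key, a key that matches neither table is simply rejected; the sketch has no fallback path.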

3. Report ingest — two domains

The single controller report splits. The de-privileged controller no longer sees host disks/storage/backup, so its report slims (it loses System/Storage/Backup, keeps app-domain).

  • POST /api/v1/host-report (new, agent) → host_reports: host CPU/RAM/disk, per-guest up/down + spec, storage-target status (attached drives + durable_id + reachability), last backup + restore-test per target, latest PBS snapshot pointers, cloudflared health, agent + controller versions. Denormalized columns for the dashboard; full report_json. Index (host_id, received_at DESC) + (customer_id, received_at DESC).
  • POST /api/v1/report (existing, slimmed controller) → the renamed guest_reports: it gains guest_id + host_id; its cpu/memory denorm now means guest-level; backup_last_snapshot goes quiet (backup status lives in host_reports). App telemetry / log issues stay.

These two streams are the bottom-up mirror of §1 — they keep the hub current without a separate push.

4. Liveness / dead-man's-switch

Evolves the existing 60s staleness checker (today: controller-report recency → node_stale/down/recovered):

  • Primary = host-report recency → host_stale / host_down. The agent heartbeat is the box's liveness signal; a silent agent = the box is gone (the critical alert).
  • Guest up/down comes from the host report's per-guest status — authoritative, every poll, faster than waiting for a guest report to go stale.
  • Guest-report recency = secondary app-level signal.

The existing backup-deadline checker maps onto host_reports' last-backup-per-target.

5. Desired-state serving

The operator's intent (§1 top-down) lives as JSON on hosts/guests (storage manifest + policies + golden version on the host; per-guest spec + versions on the guest) with a per-host desired_generation. The agent pulls its host's desired state on poll (with the generation, so it reconciles only on change and reports which generation it has converged to).

  • Benign convergence (create a guest, attach storage per policy, bump a version, adjust a non-destructive policy) → the agent reconciles freely.
  • Destructive convergence (guest removal = destroy, storage detach/wipe, data-losing resize) → the agent requires a matching signed op (§6) before executing that delta; absent/invalid → it refuses and reports pending_signature.

Geo is not in the agent's desired state — it's customer→hub→Cloudflare (§7); the agent never touches WAF.
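A minimal sketch of the agent-side decision described in this section, assuming a hypothetical `Delta` type and a caller-supplied check for a valid signed op. The delta kind names and the fail-closed handling of unknown kinds are illustrative assumptions, not the agent's actual rule set.

```go
package main

import "fmt"

type DeltaKind string

const (
	GuestCreate   DeltaKind = "guest_create"
	GuestDestroy  DeltaKind = "guest_destroy"
	StorageAttach DeltaKind = "storage_attach"
	StorageDetach DeltaKind = "storage_detach"
	StorageWipe   DeltaKind = "storage_wipe"
	VersionBump   DeltaKind = "version_bump"
	PolicyAdjust  DeltaKind = "policy_adjust"
	DiskResize    DeltaKind = "disk_resize"
)

// Delta is one difference between desired and actual state.
type Delta struct {
	Kind       DeltaKind
	Target     string
	DataLosing bool // only consulted for policy adjustments and resizes
}

// IsDestructive follows the benign/destructive split in the text.
// Unknown kinds are treated as destructive (an assumption: fail closed).
func IsDestructive(d Delta) bool {
	switch d.Kind {
	case GuestCreate, StorageAttach, VersionBump:
		return false
	case PolicyAdjust, DiskResize:
		return d.DataLosing
	case GuestDestroy, StorageDetach, StorageWipe:
		return true
	default:
		return true
	}
}

// Decision is what the agent does with one delta.
type Decision struct {
	Delta  Delta
	Status string // "apply" or "pending_signature"
}

// Plan returns nothing when the agent has already converged to the desired
// generation; otherwise it gates each destructive delta on a signed op.
func Plan(desiredGen, convergedGen uint64, deltas []Delta, hasValidSignedOp func(Delta) bool) []Decision {
	if desiredGen == convergedGen {
		return nil
	}
	out := make([]Decision, 0, len(deltas))
	for _, d := range deltas {
		status := "apply"
		if IsDestructive(d) && !hasValidSignedOp(d) {
			status = "pending_signature"
		}
		out = append(out, Decision{Delta: d, Status: status})
	}
	return out
}

func main() {
	deltas := []Delta{
		{Kind: GuestCreate, Target: "guest-2"},
		{Kind: GuestDestroy, Target: "guest-1"},
	}
	unsigned := func(Delta) bool { return false }
	for _, dec := range Plan(8, 7, deltas, unsigned) {
		fmt.Println(dec.Delta.Kind, dec.Delta.Target, dec.Status)
	}
}
```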

6. Authorization — signed-op queue + editing flow

Implements Part 4's gate on the hub side. The hub holds no signing key.

  • signed_ops (new): op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed / failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result.
  • Editing flow: the operator edits a customer's desired state (building on the existing config-form + Push/Pull/Diff). The hub diffs vs current and classifies each delta (B1 rule):
    • benign → published straight to desired state;
    • destructive → the hub generates the canonical op blob and routes it through signing.
  • Signing hand-off (Part 4 option (b)): a local operator CLI (felhom-sign --pending) fetches the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the signature back into signed_ops. The hub never sees the key.
  • The agent polls signed_ops for its host alongside desired state, verifies (Part 4 pipeline), executes, and reports status → the hub logs to the existing events audit trail.
  • Classification lives in both places, with different jobs: the hub classifies at edit time for UX (prompt to sign); the agent's classification is the authoritative guard (a compromised hub could skip the prompt, but the agent still enforces the signature).
  • A pending-ops view per customer shows the lifecycle (awaiting signature → awaiting agent → executed).
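The signed_ops status lifecycle and the pending-ops view labels can be sketched as a small transition table. The status names come from the list above; which states may move to `expired` or `rejected` is not specified in this document, so the table below is an assumption, as are the function names.

```go
package main

import "fmt"

type OpStatus string

const (
	PendingSignature OpStatus = "pending_signature"
	Signed           OpStatus = "signed"
	Delivered        OpStatus = "delivered"
	Executed         OpStatus = "executed"
	Failed           OpStatus = "failed"
	Expired          OpStatus = "expired"
	Rejected         OpStatus = "rejected"
)

// transitions is an assumed reading of the lifecycle: any non-terminal
// state may expire, and the agent may reject an op it cannot verify.
var transitions = map[OpStatus][]OpStatus{
	PendingSignature: {Signed, Expired},
	Signed:           {Delivered, Expired, Rejected},
	Delivered:        {Executed, Failed, Expired, Rejected},
}

// CanTransition reports whether the hub should accept a status change.
func CanTransition(from, to OpStatus) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsTerminal is true for states with no outgoing transitions.
func IsTerminal(s OpStatus) bool {
	return len(transitions[s]) == 0
}

// ViewLabel maps a status onto the per-customer pending-ops view.
func ViewLabel(s OpStatus) string {
	switch s {
	case PendingSignature:
		return "awaiting signature"
	case Signed, Delivered:
		return "awaiting agent"
	default:
		return string(s)
	}
}

func main() {
	fmt.Println(CanTransition(PendingSignature, Signed), CanTransition(PendingSignature, Executed))
	fmt.Println(ViewLabel(Delivered), IsTerminal(Executed))
}
```

Rejecting a jump such as pending_signature → executed on the hub is bookkeeping only; the agent's own signature check remains the authoritative guard.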

7. Geo enforcement (Part-2 S4)

The hub already holds the CF API token (the config form notes Zone WAF:Edit) and already has a remove-all path (internal/cloudflare/unblock.go). The delta: the customer sets geo in the controller UI → the controller reports the geo desired-state up → the hub reconciles it into the Cloudflare WAF (rather than pushing the token down to the controller). The hub keeps the remove-all override for self-lockout. The controller no longer calls the CF API.

8. Enrollment (evolution of the existing retrieval-password/config-gen flow)

Today: GET /config/{id} with an X-Retrieval-Password (Hungarian passphrase) returns a deep-merged controller.yaml. New:

  • Enrollment mints the agent identity first (the agent then provisions controllers), pins the operator signing public keys (Part 4 — operational + cold recovery) onto the agent, and the agent mints each controller's bootstrap (its hub guest key + local-API token).
  • A restore-mode re-enrollment (§9) hands an existing identity to a fresh agent.

The existing configgen deep-merge + Hungarian-passphrase machinery is the base; it grows the agent-first + key-pinning + restore-mode steps.

9. DR model

The headline: the old heavy infra-backup push retires — not because the hub authors everything (§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams, and (b) the actual app data + configs live inside the PBS guest snapshot. So a separate config+secrets+restic-password infra-backup blob is redundant.

What remains:

  • the report streams keep the hub's mirror current (storage layout + durable_ids, app inventory, snapshot pointers);
  • the agent escrows the recovery-code-wrapped PBS key to the hub (the one artifact only the box can produce — zero-knowledge: the hub stores it, cannot open it);
  • a slim DR record on the hosts row (PBS namespace + repo fingerprint + the wrapped escrow key).

infra_backup_versions retires; infra_backups is repurposed into the slim DR record (or folded onto hosts). The controller's infra-backup push is removed (it's de-privileged).

Recovery (host loss): the new agent re-enrolls in restore mode; the hub hands it the durable record (identity, tunnel token, storage manifest, PBS namespace, guest inventory + snapshots) plus the wrapped escrow key. The customer provides their recovery code at the agent, which unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium operator-managed custody tier avoids it, at the cost of the operator holding the key). The old controller-targeted GET /recovery/{id} is replaced by this agent restore-mode flow.

10. What persists from today (unchanged or lightly adapted)

The Customer record (customer_configs); config generation/retrieval (configgen); the two-tier notification system (operator English / customer Hungarian, Resend, cooldowns); events + audit; app_telemetry / app_log_issues; customer lifecycle actions (block/unblock, trigger-update, delete); the asset manager; and the dashboard — adapted to render the host + guests view per customer instead of a single controller.

11. Schema deltas (grounded in store.go's idempotent style; clean cutover)

  • NEW: hosts, guests, host_reports, signed_ops.
  • RENAME reports → guest_reports; add guest_id, host_id; reinterpret cpu/memory as guest-level; backup_last_snapshot goes quiet.
  • ADD desired-state JSON + desired_generation to hosts; desired_spec_json to guests.
  • RETIRE infra_backup_versions; repurpose infra_backups → slim DR record (or fold onto hosts).
  • KEEP customer_configs, events, customer_notifications, notification_log, app_telemetry, app_log_issues.

12. Open items

  • Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the key custody/rotation tooling is Part 4's.
  • Multi-tenant resource fairness (deferred shared-host case).
  • Hub-side desired-state editing UX specifics (form/diff wiring) — to be grounded against hub/internal/web/configs.go at implementation.
  • Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).