hub-architecture
This commit is contained in:
@@ -0,0 +1,190 @@
|
||||
# Architecture Part 5 — The Hub
|
||||
|
||||
> Status: design draft (decision content). To be validated by Claude Code against the **actual
|
||||
> felhom-hub source** (`felhom.eu` repo, `hub/`) + Parts 01–04, then placed at
|
||||
> `docs/architecture/05-hub-architecture.md`.
|
||||
>
|
||||
> The hub is **not** greenfield — it's a mature service (felhom-hub v0.6.3, Go + SQLite on k3s,
|
||||
> `hub.felhom.eu`). This doc is the **deltas** to evolve it for the Proxmox model, plus the new
|
||||
> data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).
|
||||
|
||||
## 1. Source-of-truth model — two drivers, two directions
|
||||
|
||||
The single most important framing, and the one that governs everything below: the hub is **not** a
|
||||
monolithic source of truth. State flows in two directions with opposite drivers.
|
||||
|
||||
- **Operator-driven *intent* — hub authors, agent reconciles (top-down).** Which guests should
|
||||
exist and their spec, storage *policy* (a target's role/class/backup schedule), controller +
|
||||
golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges
|
||||
toward them. Here the hub *is* the source of truth.
|
||||
- **Box/customer-driven *reality* — box authors, pushes up, hub mirrors (bottom-up).** Which USB
|
||||
drive is *physically* attached (and its `durable_id`), what apps are deployed and where, the
|
||||
customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The
|
||||
customer or the physical world drives these; the box reports them; the hub stays an up-to-date
|
||||
**mirror** but is **never** the driver.
|
||||
|
||||
They meet at a **handshake**, not a tug-of-war. Storage is the clearest case: the customer plugs in
|
||||
a drive → the agent *detects* it and reports `durable_id X attached` (reality) → the operator
|
||||
assigns `role=bulk, class=slow, backup=weekly` (policy, intent) → the agent reconciles that policy
|
||||
*onto the detected drive*. **Apps never enter the reconcile loop** — app deployment is the
|
||||
controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the
|
||||
resulting inventory. **Reconciliation applies to infrastructure; the app/customer layer is mirrored.**
|
||||
|
||||
## 2. Data model (Part 1 decision (b): customer-anchored)
|
||||
|
||||
A customer's deployment is one **Host** (its agent) plus one-or-more **Guests** (its controllers).
|
||||
1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the
|
||||
`hosts` table is the seam it would use).
|
||||
|
||||
- **`customer_configs`** (existing) — the Customer anchor: identity, domain, email,
|
||||
`retrieval_password`, status, config_json. Unchanged role.
|
||||
- **`hosts`** (new) — `host_id PK, customer_id, api_key` (the agent's hub key), `agent_version`,
|
||||
desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host
|
||||
**`desired_generation`** counter, the slim DR record (§9), timestamps.
|
||||
- **`guests`** (new) — `guest_id PK, customer_id, host_id, api_key` (the controller's hub key),
|
||||
`display_name, controller_version`, per-guest **`desired_spec_json`** (CPU/mem/disk, versions),
|
||||
timestamps.
|
||||
|
||||
**Per-reporter keys:** today's per-customer `api_key` becomes per-reporter — `hosts.api_key` (agent)
|
||||
and `guests.api_key` (controller). The hub resolves a presented Bearer key → host or guest → customer.
|
||||
**Clean cutover:** no dual-model support; the demo re-enrolls fresh into `host + guests`.
|
||||
|
||||
## 3. Report ingest — two domains
|
||||
|
||||
The single controller report splits. The de-privileged controller no longer sees host disks/storage/
|
||||
backup, so its report **slims** (it loses System/Storage/Backup, keeps app-domain).
|
||||
|
||||
- **`POST /api/v1/host-report`** (new, agent) → **`host_reports`**: host CPU/RAM/disk, per-guest
|
||||
up/down + spec, storage-target status (attached drives + `durable_id` + reachability), last backup
|
||||
+ restore-test per target, latest PBS snapshot pointers, `cloudflared` health, agent + controller
|
||||
versions. Denormalized columns for the dashboard; full `report_json`. Index `(host_id, received_at
|
||||
DESC)` + `(customer_id, received_at DESC)`.
|
||||
- **`POST /api/v1/report`** (existing, slimmed controller) → the renamed **`guest_reports`**: it
|
||||
gains `guest_id` + `host_id`; its `cpu/memory` denorm now means *guest-level*; `backup_last_snapshot`
|
||||
goes quiet (backup status lives in `host_reports`). App telemetry / log issues stay.
|
||||
|
||||
These two streams are the bottom-up mirror of §1 — they keep the hub current without a separate push.
|
||||
|
||||
## 4. Liveness / dead-man's-switch
|
||||
|
||||
Evolves the existing 60s staleness checker (today: controller-report recency → `node_stale`/`down`/
|
||||
`recovered`):
|
||||
|
||||
- **Primary = host-report recency → `host_stale` / `host_down`.** The agent heartbeat is the box's
|
||||
liveness signal; a silent agent = the box is gone (the critical alert).
|
||||
- **Guest up/down comes from the host report's per-guest status** — authoritative, every poll, faster
|
||||
than waiting for a guest report to go stale.
|
||||
- **Guest-report recency = secondary** app-level signal.
|
||||
|
||||
The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target.
|
||||
|
||||
## 5. Desired-state serving
|
||||
|
||||
The operator's **intent** (§1 top-down) lives as JSON on `hosts`/`guests` (storage manifest +
|
||||
policies + golden version on the host; per-guest spec + versions on the guest) with a per-host
|
||||
`desired_generation`. The agent pulls its host's desired state on poll (with the generation, so it
|
||||
reconciles only on change and reports which generation it has converged to).
|
||||
|
||||
- **Benign convergence** (create a guest, attach storage per policy, bump a version, adjust a
|
||||
non-destructive policy) → the agent reconciles freely.
|
||||
- **Destructive convergence** (guest removal = destroy, storage detach/wipe, data-losing resize) →
|
||||
the agent requires a **matching signed op** (§6) before executing that delta; absent/invalid → it
|
||||
refuses and reports `pending_signature`.
|
||||
|
||||
**Geo is *not* in the agent's desired state** — it's customer→hub→Cloudflare (§7); the agent never
|
||||
touches WAF.
|
||||
|
||||
## 6. Authorization — signed-op queue + editing flow
|
||||
|
||||
Implements Part 4's gate on the hub side. The hub holds **no signing key**.
|
||||
|
||||
- **`signed_ops`** (new): `op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical
|
||||
JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed /
|
||||
failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result`.
|
||||
- **Editing flow:** the operator edits a customer's desired state (building on the existing config-
|
||||
form + Push/Pull/Diff). The hub diffs vs current and **classifies each delta** (B1 rule):
|
||||
- **benign** → published straight to desired state;
|
||||
- **destructive** → the hub generates the canonical op blob and routes it through signing.
|
||||
- **Signing hand-off (Part 4 option (b)):** a local operator CLI (`felhom-sign --pending`) fetches
|
||||
the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the
|
||||
signature back into `signed_ops`. The hub never sees the key.
|
||||
- The agent polls `signed_ops` for its host alongside desired state, verifies (Part 4 pipeline),
|
||||
executes, and reports status → the hub logs to the existing **`events`** audit trail.
|
||||
- **Classification lives in both places, with different jobs:** the hub classifies at *edit time*
|
||||
for UX (prompt to sign); the **agent's classification is the authoritative guard** (a compromised
|
||||
hub could skip the prompt, but the agent still enforces the signature).
|
||||
- A **pending-ops view** per customer shows the lifecycle (awaiting signature → awaiting agent →
|
||||
executed).
|
||||
|
||||
## 7. Geo enforcement (Part-2 S4)
|
||||
|
||||
The hub already holds the CF API token (the config form notes Zone WAF:Edit) and already has a
|
||||
remove-all path (`internal/cloudflare/unblock.go`). The delta: the **customer sets geo in the
|
||||
controller UI → the controller reports the geo desired-state up → the hub reconciles it into the
|
||||
Cloudflare WAF** (rather than pushing the token down to the controller). The hub keeps the
|
||||
remove-all override for self-lockout. The controller no longer calls the CF API.
|
||||
|
||||
## 8. Enrollment (evolution of the existing retrieval-password/config-gen flow)
|
||||
|
||||
Today: `GET /config/{id}` with an `X-Retrieval-Password` (Hungarian passphrase) returns a deep-merged
|
||||
`controller.yaml`. New:
|
||||
|
||||
- Enrollment mints the **agent identity first** (the agent then provisions controllers), pins the
|
||||
**operator signing public keys** (Part 4 — operational + cold recovery) onto the agent, and the
|
||||
agent mints each controller's bootstrap (its hub guest key + local-API token).
|
||||
- A **restore-mode** re-enrollment (§9) hands an existing identity to a fresh agent.
|
||||
|
||||
The existing `configgen` deep-merge + Hungarian-passphrase machinery is the base; it grows the
|
||||
agent-first + key-pinning + restore-mode steps.
|
||||
|
||||
## 9. DR model
|
||||
|
||||
The headline: the **old heavy infra-backup push retires** — not because the hub authors everything
|
||||
(§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams,
|
||||
and (b) the actual app **data + configs live inside the PBS guest snapshot**. So a separate
|
||||
config+secrets+restic-password infra-backup blob is redundant.
|
||||
|
||||
What remains:
|
||||
- the **report streams** keep the hub's mirror current (storage layout + `durable_id`s, app inventory,
|
||||
snapshot pointers);
|
||||
- the agent **escrows the recovery-code-wrapped PBS key** to the hub (the one artifact only the box
|
||||
can produce — zero-knowledge: the hub stores it, cannot open it);
|
||||
- a **slim DR record** on the `hosts` row (PBS namespace + repo fingerprint + the wrapped escrow key).
|
||||
|
||||
`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record (or folded
|
||||
onto `hosts`). The **controller's infra-backup push is removed** (it's de-privileged).
|
||||
|
||||
**Recovery (host loss):** the new agent re-enrolls in **restore mode**; the hub hands it the durable
|
||||
record (identity, tunnel token, storage manifest, PBS namespace, guest inventory + snapshots) **plus
|
||||
the wrapped escrow key**. The **customer provides their recovery code at the agent**, which unwraps
|
||||
the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets identity,
|
||||
reuses the tunnel. The customer recovery code is the irreducible residual (the premium operator-
|
||||
managed custody tier avoids it, at the cost of the operator holding the key). The old controller-
|
||||
targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.
|
||||
|
||||
## 10. What persists from today (unchanged or lightly adapted)
|
||||
|
||||
The Customer record (`customer_configs`); config generation/retrieval (`configgen`); the two-tier
|
||||
notification system (operator English / customer Hungarian, Resend, cooldowns); `events` + audit;
|
||||
`app_telemetry` / `app_log_issues`; customer lifecycle actions (block/unblock, trigger-update,
|
||||
delete); the asset manager; and the dashboard — adapted to render the **host + guests** view per
|
||||
customer instead of a single controller.
|
||||
|
||||
## 11. Schema deltas (grounded in store.go's idempotent style; clean cutover)
|
||||
|
||||
- **NEW:** `hosts`, `guests`, `host_reports`, `signed_ops`.
|
||||
- **RENAME** `reports` → `guest_reports`; add `guest_id`, `host_id`; reinterpret `cpu/memory` as
|
||||
guest-level; `backup_last_snapshot` goes quiet.
|
||||
- **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`.
|
||||
- **RETIRE** `infra_backup_versions`; **repurpose** `infra_backups` → slim DR record (or fold onto
|
||||
`hosts`).
|
||||
- **KEEP** `customer_configs`, `events`, `customer_notifications`, `notification_log`,
|
||||
`app_telemetry`, `app_log_issues`.
|
||||
|
||||
## 12. Open items
|
||||
- Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the
|
||||
key custody/rotation tooling is Part 4's.
|
||||
- Multi-tenant resource fairness (deferred shared-host case).
|
||||
- Hub-side desired-state **editing UX** specifics (form/diff wiring) — to be grounded against
|
||||
`hub/internal/web/configs.go` at implementation.
|
||||
- Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).
|
||||
@@ -0,0 +1,221 @@
|
||||
# `05-hub-architecture.md` — critical review (grounded against felhom-hub v0.6.3 source + Parts 01–04)
|
||||
|
||||
Method: every claim about the existing hub was checked against `felhom.eu/hub/` source; every
|
||||
cross-doc claim against Parts 01/03/04. Citations are `file:line`. Severity: **blocking** (wrong /
|
||||
breaks an assumption) · **should-fix** (real gap or contradiction, low blast) · **minor**.
|
||||
|
||||
The two highest-value catches (doc assumes something the code contradicts) are **S1** and **S2**.
|
||||
|
||||
---
|
||||
|
||||
## Ranked summary
|
||||
|
||||
| # | What | Where (doc → code) | Severity |
|
||||
|---|---|---|---|
|
||||
| S1 | §9/§11 name the **wrong infra-backup table as current** — `infra_backup_versions` is the live/primary one; `infra_backups` is the deprecated write-only mirror | 05 §9/§11 → `store.go:198-217,541-578` | should-fix (code-contradiction) |
|
||||
| S2 | §7 treats the CF token as **geo-only**; it is **dual-purpose (DNS-01/ACME + WAF)** and is injected into the generated `controller.yaml` | 05 §7 → `config_form.html:76-80`, `controller.yaml.default:26`, `configgen.go:28-37`, `configs.go:1041` | should-fix (code-contradiction / unverified assumption) |
|
||||
| S3 | §6 leans on the existing **"Push"**, but that is a hub→box **inbound** POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → `configs.go:569-570,1148-1150`; Part 1 §4/§5/§11; Part 3 §5 | should-fix |
|
||||
| S4 | Part 1 §6 calls app inventory **"declarative"**; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) |
|
||||
| S5 | §9 hands "guest inventory + snapshots" **from the prunable report mirror**; DR soundness actually rests on durable sources | 05 §9/§3 → `store.go:809-816` | should-fix (DR robustness) |
|
||||
| S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is **event-based** and controller-emitted | 05 §4 → `deadline.go:31-86` | should-fix (mechanism) |
|
||||
| M1 | "60s staleness checker" conflates the 60s **cadence** with the 30m/1h **threshold** | 05 §4 → `main.go:207-217,99-102`, `staleness.go:33-37` | minor |
|
||||
| M2 | §2 `customer_configs` field list omits `api_key` — the very field the per-reporter plan retires | 05 §2 → `store.go:102-112` | minor |
|
||||
| M3 | §11 `reports`→`guest_reports` "rename" is really drop+create under the locked clean cutover | 05 §11 → `store.go:55-119` | minor |
|
||||
| M4 | Pre-existing weak authz on infra-backup GET / `/notify` (any valid key, not customer-scoped) | handler.go:407,536,568,596 | minor |
|
||||
|
||||
No **blocking** findings — the data model and the two-driver framing are sound, and the LOCKED clean
|
||||
cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc
|
||||
drives work.
|
||||
|
||||
---
|
||||
|
||||
## Highest-value: doc assumes something the code contradicts
|
||||
|
||||
### S1 — `infra_backups` vs `infra_backup_versions` is inverted (should-fix, code-contradiction)
|
||||
05 §9: *"`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record."*
|
||||
§11 repeats: *"RETIRE `infra_backup_versions`; repurpose `infra_backups`."*
|
||||
|
||||
The code is the other way round:
|
||||
- `infra_backup_versions` (added v0.7.0, `store.go:198-211`) is the **live/primary** table. **Every read
|
||||
path hits it**: `GetInfraBackup` (`store.go:565-578`), `GetInfraBackupByID` (`store.go:581-593`),
|
||||
`GetInfraBackupMeta` (`store.go:604`), `ListInfraBackupVersions` (`store.go:640`), and the recovery
|
||||
endpoint (`handler.go:670-686`).
|
||||
- `infra_backups` (original single-row, `store.go:96-100`) is **deprecated**. It is now **written only
|
||||
as a legacy mirror** ("for backward compatibility during rollback window", `store.go:552-558`) and is
|
||||
**never read** except as the one-time migration *source* (`store.go:214-217`).
|
||||
|
||||
So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean
|
||||
cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11
|
||||
literally would point the DR record at the wrong table.
|
||||
**Fix:** take §11's own alternative — *fold the slim DR record onto `hosts`* and **drop both**
|
||||
infra-backup tables. If a standalone table is kept, base it on `infra_backup_versions` (the one with the
|
||||
data/readers), and correct the "which is current" framing.
|
||||
|
||||
### S2 — the CF API token is **not** geo-only; it is the ACME token too, and ships into `controller.yaml` (should-fix, code-contradiction)
|
||||
05 §7: *"The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than
|
||||
pushing the token down to the controller… The controller no longer calls the CF API."*
|
||||
|
||||
Grounding confirms the hub **does** hold the token and **does** have a remove-all path:
|
||||
`config_json → infrastructure.cf_api_token` (`configs.go:714-715,1041-1042,1089-1096`) →
|
||||
`cfClient.RemoveGeoRules(cfToken, cfg.Domain, …)` in `handleGeoDisable` (`configs.go:1112`), route
|
||||
`/customers/{id}/geo/disable` (`server.go:201-205`). ✓ The §7 framing of geo-enforcement-moves-to-hub
|
||||
is also consistent with Part 1 §5/§7 and Part 3 §2/§46.
|
||||
|
||||
**But the doc's assumption that the token is *for geo* is contradicted by the code:** the same
|
||||
`cf_api_token` is **dual-purpose** —
|
||||
- the config-form hint says **"Zone DNS:Edit (ACME), Zone WAF:Edit (geo)"** (`config_form.html:80`),
|
||||
- `controller.yaml.default:26` documents it as the **"Cloudflare API token (DNS-01 challenge)"**,
|
||||
- and it is **deep-merged into the generated `controller.yaml`** via `configgen.Generate` (config_json
|
||||
overrides, `configgen.go:28-37`), i.e. **today it is shipped down to the box** and served at
|
||||
`/config/{id}` and `/recovery/{id}`.
|
||||
|
||||
Consequences §7 must address:
|
||||
1. **"Token off the controller" is incomplete** if the box still does DNS-01/ACME. In the CF-Tunnel
|
||||
model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot —
|
||||
but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token
|
||||
cannot fully come off the box.
|
||||
2. **`configgen` must stop emitting `cf_api_token` into `controller.yaml`** (or relocate it to a
|
||||
hub-only field). As written, the generated config still carries it.
|
||||
|
||||
---
|
||||
|
||||
## Should-fix
|
||||
|
||||
### S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids
|
||||
05 §6: *"the operator edits a customer's desired state (building on the existing config-form +
|
||||
Push/Pull/Diff)."* The form + diff/pull/push handlers exist — `handlePushConfig` (`configs.go:569`),
|
||||
`handlePullConfig` (`configs.go:952`), `handleConfigDiff` (`configs.go:861`), routes at
|
||||
`server.go:209-229`. ✓ So the UI base is real.
|
||||
|
||||
The wrinkle: **"Push" today is a hub→controller outbound POST** (`handlePushConfig` "sends the generated
|
||||
YAML config to the controller", `configs.go:569-570`), as is the geo-disable notify
|
||||
(`notifyControllerGeoDisable` → `POST controllerURL/api/geo/settings`, `configs.go:1148-1153`). Both are
|
||||
the hub **connecting into the box** — explicitly disallowed by the box-initiated model (Part 1 §4
|
||||
"the hub never initiates inbound"; §5 row `agent↔hub`/`controller↔hub` = outbound poll; Part 3 §5 "The
|
||||
hub never connects inbound"). 05's own §5 already resolves this (desired state is **pulled** on poll
|
||||
with a `desired_generation`). So the doc is internally consistent in *mechanism* but loose in *wording*:
|
||||
**make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller
|
||||
poll," not a reuse of the inbound push transport.** The form/diff UX carries over; the transport inverts.
|
||||
(Same applies to the geo-disable controller-notify path.)
|
||||
|
||||
### S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)
|
||||
Part 1 §6 lists the durable record as including a **"declarative app inventory"** that survives box loss
|
||||
— wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is
|
||||
explicit the opposite way: *"Apps never enter the reconcile loop… the hub only mirrors the resulting
|
||||
inventory… the app/customer layer is mirrored,"* and 05 §9 restores apps **from the PBS guest snapshot**,
|
||||
not by re-deploying a declared inventory. These are reconcilable (the mirror *is* durable last-known
|
||||
truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path.
|
||||
**Fix (align the older doc to the locked model):** in Part 1 §6 change "declarative app inventory" →
|
||||
"mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not
|
||||
re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)
|
||||
|
||||
### S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources
|
||||
05 §9 hands the recovering agent *"identity, tunnel token, storage manifest, PBS namespace, guest
|
||||
inventory + snapshots."* §3 places "guest inventory" and "latest PBS snapshot pointers" in
|
||||
`host_reports` — the bottom-up mirror. But reports are **pruned** (`Prune` deletes rows older than
|
||||
`maxDays`, `store.go:809-816`; the doc keeps this), so after a long pre-DR outage the last `host_report`
|
||||
can be gone or stale. The actually-durable DR inputs are: desired-state on `hosts`/`guests` (§5), the
|
||||
slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and **PBS's own snapshot
|
||||
enumeration** (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored
|
||||
inventory/pointers are convenience, not the source of record.
|
||||
**Fix:** state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), **not**
|
||||
from prunable `host_reports`, so recovery doesn't degrade when the last report has aged out. This also
|
||||
keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained.
|
||||
(Note: the `hosts` row legitimately mixes top-down intent columns with a few box-reported columns —
|
||||
repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split
|
||||
stays legible at the column level.)
|
||||
|
||||
### S6 — backup-deadline checker: doc says field-based, code is event-based (and re-emitter changes)
|
||||
05 §4: *"The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target."* The
|
||||
existing checker is **event-based**, not field-based: `CheckBackupDeadlines` looks for
|
||||
`backup_completed` / `backup_failed` (and `db_dump_*`) **events** since Budapest midnight and emits
|
||||
`expected_backup_missed` if neither is present (`deadline.go:31-86`). Two changes the doc should make
|
||||
explicit:
|
||||
1. **Mechanism:** either keep it event-based (someone emits `backup_completed`) or genuinely move it to
|
||||
a `host_reports.last_backup_per_target` field check — the doc says the latter but the impl is the
|
||||
former.
|
||||
2. **Emitter:** today the **controller** emits backup events; in the de-privileged model the **agent**
|
||||
owns backup/PBS (Part 3 §8), so the agent must now emit `backup_completed`/`backup_failed` (or the
|
||||
host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes
|
||||
silent after the controller stops doing backups.
|
||||
|
||||
---
|
||||
|
||||
## Minor
|
||||
|
||||
- **M1 — "60s staleness checker" (§4).** 60s is the **check cadence** (`main.go:207-217`,
|
||||
`ticker := time.NewTicker(60 * time.Second)`); the **staleness threshold** is 30m (default,
|
||||
`main.go:99-102`) with down at 2× = 60m (`staleness.go:33-37`; CLAUDE.md "OK <30m, DOWN >1h"). The
|
||||
event-transition mechanism (`node_stale`/`node_down`/`node_recovered`) is described correctly
|
||||
(`staleness.go:155-185`). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)."
|
||||
- **M2 — `customer_configs` fields (§2).** The list ("identity, domain, email, retrieval_password,
|
||||
status, config_json") omits **`api_key`** (`store.go:108`) — the field §2's per-reporter plan
|
||||
actually retires. Worth noting `customer_configs.api_key` becomes unused once auth resolves via
|
||||
`hosts.api_key` / `guests.api_key`.
|
||||
- **M3 — rename under clean cutover (§11).** `migrate()` is all `CREATE TABLE IF NOT EXISTS` +
|
||||
idempotent `ALTER` (`store.go:55-119,146-149`). §11's claim "grounded in store.go's idempotent style"
|
||||
is accurate. But a `reports`→`guest_reports` **rename** isn't part of that style; under the LOCKED
|
||||
clean cutover (demo re-enrolls fresh, §2) it is really **drop `reports` + create `guest_reports`**
|
||||
with no data migration. Name it as such to avoid implying an in-place rename + backfill.
|
||||
- **M4 — pre-existing weak authz.** `handleInfraBackupGet`/`Versions` and `handleNotify`/
|
||||
`handleSavePreferences`/`handleInfraBackupPush` use `checkAuth` (global **or any** customer key,
|
||||
`handler.go:63-66`), not customer-scoped `checkAuthCustomer`. Most retire with the infra-backup push
|
||||
(§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved
|
||||
host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.
|
||||
|
||||
---
|
||||
|
||||
## Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)
|
||||
|
||||
- **§10 KEEP list** matches the schema exactly: `customer_configs`, `events`, `customer_notifications`,
|
||||
`notification_log`, `app_telemetry`, `app_log_issues` all present (`store.go:74-189,102-135`). The
|
||||
asset manager exists (`handler.go:57,834-867`). ✓
|
||||
- **§10 two-tier notifications** (operator English / customer Hungarian, Resend, cooldowns) match
|
||||
`notify/dispatcher.go`: `processOperator` (1h cooldown, `FormatOperatorEmail`, gated by `operatorOn`,
|
||||
`dispatcher.go:91-114`) + `processCustomer` (prefs-driven, default 6h, `FormatCustomerEmail`,
|
||||
`dispatcher.go:116-158`); wired in `main.go:134`. ✓
|
||||
- **§8 enrollment / §11 configgen** — deep-merge + Hungarian passphrase base is real:
|
||||
`configgen.deepMerge` (`configgen.go:76-91`), programmatic overrides + `hub.api_key = cfg.APIKey`
|
||||
(`configgen.go:40-47`), retrieval-password gate (`handler.go:709-753`). The evolution to agent-first +
|
||||
per-guest keys + key-pinning is a clean extension. ✓
|
||||
- **§2 auth extension** (Bearer → reporter → customer) is clean against today's
|
||||
`checkAuthCustomer` (global key, else `GetCustomerConfigByAPIKey`, `handler.go:72-90`,
|
||||
`store.go:913-935`); adding host/guest key lookups slots straight in. ✓
|
||||
- **§11 "idempotent style"** is accurate (`store.go:55-119`). New tables/columns (`hosts`, `guests`,
|
||||
`host_reports`, `signed_ops`, `desired_generation`, `desired_spec_json`) follow the existing
|
||||
`CREATE IF NOT EXISTS` / `ALTER … ` pattern cleanly.
|
||||
- **§9 escrow/custody** is consistent with Part 1 §8 (three-tier custody, zero-knowledge default,
|
||||
recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for
|
||||
backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the
|
||||
irreducible residual; operator-managed tier avoids it" matches Part 1 §8 verbatim in spirit. ✓
|
||||
- **§4 dead-man's-switch** (host-report recency = primary liveness) is consistent with Part 3 §5
|
||||
("the heartbeat *is* the liveness signal… first-class treatment hub-side"). ✓
|
||||
- **§5/§6 signed-op + desired-state** are consistent with Part 4 and Part 3 §4:
|
||||
hub holds **no** signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing
|
||||
key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6
|
||||
"the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only.
|
||||
05 §6's `signed_ops` columns are a consistent superset of Part 4 §2.1's blob
|
||||
`{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}` (05 adds hub-side
|
||||
lifecycle states `delivered`/`rejected` — fine). The local-CLI hand-off (`felhom-sign --pending`)
|
||||
matches Part 4 §5–6's `Signer`-on-the-workstation model. ✓
|
||||
|
||||
## Two-driver soundness (axis 3) — holds
|
||||
No place in 05 has the hub **drive** box/customer-owned state. Desired-state (§5) is all infrastructure
|
||||
intent (guests, storage *policy*, versions, identity, tunnel) — top-down and legitimate. Apps are
|
||||
explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect →
|
||||
assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the
|
||||
`hosts` row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow
|
||||
key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage
|
||||
reality because the reality columns (`durable_id` attached, snapshot pointers, app inventory) are
|
||||
mirror-only and never serve as desired state.
|
||||
|
||||
## DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification
|
||||
Retiring the controller's infra-backup push is safe **given** that DR reads from durable sources, not
|
||||
the prunable mirror (S5). What the old push carried — `deployed_stacks` + `disk_layout.mounts`
|
||||
(`store.go:768-795`, surfaced by `handleRecovery`, `handler.go:620-705`) — is reconstructible:
|
||||
storage layout/`durable_id`s from the storage manifest (desired-state, durable) + host-report mirror;
|
||||
app inventory from the guest **inside the PBS snapshot** (so it need not be separately stored); snapshot
|
||||
list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is
|
||||
explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing
|
||||
DR-essential is lost by removing the push **provided** §9 is amended per S5 to name durable sources and
|
||||
not lean on `host_reports` retention.
|
||||
Reference in New Issue
Block a user