updated hub arch
@@ -45,8 +45,9 @@ A customer's deployment is one **Host** (its agent) plus one-or-more **Guests**
 `display_name, controller_version`, per-guest **`desired_spec_json`** (CPU/mem/disk, versions),
 timestamps.
 
-**Per-reporter keys:** today's per-customer `api_key` becomes per-reporter — `hosts.api_key` (agent)
-and `guests.api_key` (controller). The hub resolves a presented Bearer key → host or guest → customer.
+**Per-reporter keys:** today's per-customer `customer_configs.api_key` becomes per-reporter —
+`hosts.api_key` (agent) and `guests.api_key` (controller). The hub resolves a presented Bearer key →
+host or guest → customer; `customer_configs.api_key` goes unused once auth resolves via the new keys.
 **Clean cutover:** no dual-model support; the demo re-enrolls fresh into `host + guests`.
 
 ## 3. Report ingest — two domains
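
Review note on the per-reporter keys above: the "Bearer key → host or guest → customer" resolution could look roughly like the sketch below. The `Reporter` and `KeyIndex` types and their fields are hypothetical, and the in-memory maps only stand in for lookups on `hosts.api_key` and `guests.api_key`; nothing here is taken from the hub's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Reporter identifies who presented a key. All field names are hypothetical.
type Reporter struct {
	Kind       string // "host" (agent) or "guest" (controller)
	ID         string
	CustomerID string
}

// KeyIndex is an in-memory stand-in for lookups on hosts.api_key and guests.api_key.
type KeyIndex struct {
	hosts  map[string]Reporter
	guests map[string]Reporter
}

// ErrUnknownKey is returned for a missing, malformed, or unrecognised key.
var ErrUnknownKey = errors.New("unknown or missing bearer key")

// Resolve maps an Authorization header value to a host or guest, and through it to a customer.
func (k KeyIndex) Resolve(authHeader string) (Reporter, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return Reporter{}, ErrUnknownKey
	}
	key := strings.TrimPrefix(authHeader, prefix)
	if key == "" {
		return Reporter{}, ErrUnknownKey
	}
	if r, found := k.hosts[key]; found {
		return r, nil
	}
	if r, found := k.guests[key]; found {
		return r, nil
	}
	return Reporter{}, ErrUnknownKey
}

func main() {
	idx := KeyIndex{
		hosts:  map[string]Reporter{"host-key": {Kind: "host", ID: "host-1", CustomerID: "cust-1"}},
		guests: map[string]Reporter{"guest-key": {Kind: "guest", ID: "guest-1", CustomerID: "cust-1"}},
	}
	r, err := idx.Resolve("Bearer guest-key")
	fmt.Println(r.Kind, r.CustomerID, err)
}
```

Because the customer is derived from the resolved host or guest, no code path in this sketch reads `customer_configs.api_key`, which matches the diff's statement that the old column goes unused.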
@@ -67,8 +68,8 @@ These two streams are the bottom-up mirror of §1 — they keep the hub current
 
 ## 4. Liveness / dead-man's-switch
 
-Evolves the existing 60s staleness checker (today: controller-report recency → `node_stale`/`down`/
-`recovered`):
+Evolves the existing staleness checker (60s **cadence**, 30m/1h **thresholds** — OK <30m, down at
+2× = >1h; today: controller-report recency → `node_stale`/`down`/`recovered`):
 
 - **Primary = host-report recency → `host_stale` / `host_down`.** The agent heartbeat is the box's
 liveness signal; a silent agent = the box is gone (the critical alert).
@@ -76,7 +77,12 @@ Evolves the existing 60s staleness checker (today: controller-report recency →
 than waiting for a guest report to go stale.
 - **Guest-report recency = secondary** app-level signal.
 
-The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target.
+**Backup-deadline checker:** today it is *event-based* — it scans for `backup_completed`/`backup_failed`
+events since local midnight and alerts if none. Two changes: (1) **mechanism** — move it to a field
+check on `host_reports`' last-backup-per-target (cleaner now that backup state arrives in the host
+report); (2) **emitter** — the de-privileged controller no longer runs backups, so the **agent** is the
+source of the last-backup status (Part 3 §8). Without re-homing the source, the deadline check would go
+silent after the controller stops backing up.
 
 ## 5. Desired-state serving
 
@@ -101,8 +107,11 @@ Implements Part 4's gate on the hub side. The hub holds **no signing key**.
 - **`signed_ops`** (new): `op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical
 JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed /
 failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result`.
-- **Editing flow:** the operator edits a customer's desired state (building on the existing config-
-form + Push/Pull/Diff). The hub diffs vs current and **classifies each delta** (B1 rule):
+- **Editing flow:** the operator edits a customer's desired state, reusing the existing config-form +
+diff UX. Note the **transport inverts**: today's "Push" is a hub→box *inbound* POST (forbidden by the
+box-initiated model); here "publish" means **write to desired state, delivered on the next agent/
+controller poll**. The form and diff carry over; the push transport does not. The hub diffs vs current
+and **classifies each delta** (B1 rule):
 - **benign** → published straight to desired state;
 - **destructive** → the hub generates the canonical op blob and routes it through signing.
 - **Signing hand-off (Part 4 option (b)):** a local operator CLI (`felhom-sign --pending`) fetches
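
Review note on the editing flow: the benign/destructive split can be sketched as a routing step. The B1 rule itself is not reproduced in this diff, so the sketch takes its outcome as an input flag; the `Delta` and `SignedOp` types are hypothetical and carry only the fields needed to show the routing.

```go
package main

import "fmt"

// Delta is one difference between desired and current state.
// Destructive holds the outcome of the B1 rule, which is not modelled here.
type Delta struct {
	Field       string
	Destructive bool
}

// SignedOp mirrors two of the signed_ops columns named in the diff.
type SignedOp struct {
	OpType string
	Status string
}

// route sends benign deltas straight to desired state and turns destructive
// ones into ops that wait for an operator signature.
func route(deltas []Delta) (publish []Delta, pending []SignedOp) {
	for _, d := range deltas {
		if d.Destructive {
			pending = append(pending, SignedOp{OpType: d.Field, Status: "pending_signature"})
			continue
		}
		publish = append(publish, d)
	}
	return publish, pending
}

func main() {
	publish, pending := route([]Delta{
		{Field: "display_name"},
		{Field: "disk_shrink", Destructive: true}, // example name only
	})
	fmt.Println("published:", len(publish), "awaiting signature:", len(pending))
}
```

Nothing in `route` contacts the box: both outputs are written hub-side and picked up on the next poll, which is the transport inversion the new text describes.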
@@ -118,11 +127,22 @@ Implements Part 4's gate on the hub side. The hub holds **no signing key**.
 
 ## 7. Geo enforcement (Part-2 S4)
 
-The hub already holds the CF API token (the config form notes Zone WAF:Edit) and already has a
-remove-all path (`internal/cloudflare/unblock.go`). The delta: the **customer sets geo in the
-controller UI → the controller reports the geo desired-state up → the hub reconciles it into the
-Cloudflare WAF** (rather than pushing the token down to the controller). The hub keeps the
-remove-all override for self-lockout. The controller no longer calls the CF API.
+The hub already holds the CF API token and already has a remove-all path
+(`internal/web/configs.go` `handleGeoDisable` → `cloudflare.RemoveGeoRules`). **But the token is
+dual-purpose today** — DNS-01/ACME *and* WAF/geo — and `configgen.Generate` deep-merges it (via
+`config_json`) into the generated `controller.yaml`, so it currently ships **down to the box**. Two
+things follow:
+
+- **ACME assumption (must be stated, not skipped):** in the Cloudflare-Tunnel-default model the edge
+terminates TLS, so the box needs no public certificate and the **DNS-01/ACME use of the token goes
+away**. Granting that, the token comes fully off the box and lives hub-only. (If any box still does
+DNS-01, the token cannot fully come off — so this assumption is load-bearing.)
+- **`configgen` must stop emitting `cf_api_token`** into `controller.yaml` (drop it from the merge /
+relocate it to a hub-only field).
+
+The delta: the **customer sets geo in the controller UI → the controller reports the geo desired-state
+up → the hub reconciles it into the Cloudflare WAF** (rather than the box calling the CF API). The hub
+keeps the remove-all override for self-lockout. The controller no longer calls the CF API.
 
 ## 8. Enrollment (evolution of the existing retrieval-password/config-gen flow)
 
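
Review note on the `configgen` bullet in §7: one way to guarantee `cf_api_token` never reaches `controller.yaml` is to strip it from the merged tree just before emission, wherever the deep merge placed it. The sketch below assumes the merged config is a decoded `map[string]any`; the `stripKey` helper and the example nesting are hypothetical and not taken from `configgen`.

```go
package main

import "fmt"

// stripKey removes every occurrence of key from a decoded JSON/YAML tree, in place.
func stripKey(node any, key string) {
	switch v := node.(type) {
	case map[string]any:
		delete(v, key)
		for _, child := range v {
			stripKey(child, key)
		}
	case []any:
		for _, child := range v {
			stripKey(child, key)
		}
	}
}

func main() {
	// Example shape only; the real position of the token in config_json is not shown in the diff.
	merged := map[string]any{
		"display_name": "demo",
		"cloudflare": map[string]any{
			"cf_api_token": "secret",
			"zone":         "example.com",
		},
	}
	stripKey(merged, "cf_api_token")
	fmt.Println(merged)
}
```

Stripping at emission time is a backstop; relocating the token to a hub-only field, as the bullet suggests, removes it from the merge input altogether.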
@@ -146,21 +166,29 @@ config+secrets+restic-password infra-backup blob is redundant.
 
 What remains:
 - the **report streams** keep the hub's mirror current (storage layout + `durable_id`s, app inventory,
-snapshot pointers);
+snapshot pointers) — but this mirror is **convenience, not the DR source of record** (reports are
+pruned by age);
 - the agent **escrows the recovery-code-wrapped PBS key** to the hub (the one artifact only the box
 can produce — zero-knowledge: the hub stores it, cannot open it);
 - a **slim DR record** on the `hosts` row (PBS namespace + repo fingerprint + the wrapped escrow key).
+These last two are *box-reported* columns on an otherwise operator-intent row — labelled as such so
+the §1 two-driver split stays legible per column.
 
-`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record (or folded
-onto `hosts`). The **controller's infra-backup push is removed** (it's de-privileged).
+Both existing infra-backup tables retire — `infra_backup_versions` (the current/live one, all readers
+hit it) **and** `infra_backups` (the deprecated legacy mirror). The slim DR record folds onto `hosts`
+instead. The **controller's infra-backup push is removed** (it's de-privileged).
 
 **Recovery (host loss):** the new agent re-enrolls in **restore mode**; the hub hands it the durable
-record (identity, tunnel token, storage manifest, PBS namespace, guest inventory + snapshots) **plus
-the wrapped escrow key**. The **customer provides their recovery code at the agent**, which unwraps
-the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets identity,
-reuses the tunnel. The customer recovery code is the irreducible residual (the premium operator-
-managed custody tier avoids it, at the cost of the operator holding the key). The old controller-
-targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.
+record — and DR reads from the **durable sources, not the prunable report mirror**: operator intent
+(desired-state on `hosts`/`guests` — identity, tunnel token, storage manifest), the slim DR record
+(PBS namespace + repo fingerprint), the **wrapped escrow key**, and **PBS's own snapshot enumeration**
+(the agent lists snapshots once it has the namespace + unwrapped key). Guest inventory + app data come
+from **inside the PBS guest snapshots**, not from a retained `host_report`, so recovery doesn't degrade
+when the last report has aged out. The **customer provides their recovery code at the agent**, which
+unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets
+identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium
+operator-managed custody tier avoids it, at the cost of the operator holding the key). The old
+controller-targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.
 
 ## 10. What persists from today (unchanged or lightly adapted)
 
@@ -173,13 +201,18 @@ customer instead of a single controller.
 ## 11. Schema deltas (grounded in store.go's idempotent style; clean cutover)
 
 - **NEW:** `hosts`, `guests`, `host_reports`, `signed_ops`.
-- **RENAME** `reports` → `guest_reports`; add `guest_id`, `host_id`; reinterpret `cpu/memory` as
+- **DROP `reports` + CREATE `guest_reports`** (under the clean cutover this is drop+create with no data
+migration, not an in-place rename); `guest_reports` adds `guest_id`, `host_id`; `cpu/memory` mean
 guest-level; `backup_last_snapshot` goes quiet.
-- **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`.
-- **RETIRE** `infra_backup_versions`; **repurpose** `infra_backups` → slim DR record (or fold onto
-`hosts`).
+- **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`; the
+slim DR record (PBS namespace + repo fingerprint + wrapped escrow key) onto `hosts`.
+- **DROP both** `infra_backup_versions` (current/live) **and** `infra_backups` (legacy mirror) — the DR
+record replaces them on `hosts`.
 - **KEEP** `customer_configs`, `events`, `customer_notifications`, `notification_log`,
 `app_telemetry`, `app_log_issues`.
+- **Authz cleanup the cutover enables:** several endpoints today use global-or-any-customer-key auth
+rather than customer-scoped (the infra-backup GETs, `/notify`). Most retire with the infra-backup
+push; any that carry over should scope to the resolved host/guest → customer under §2.
 
 ## 12. Open items
 - Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the