moved documentation to felhom.eu
@@ -0,0 +1,224 @@
# Felhom Controller Architecture — Part 1: Topology & Trust

**Status:** draft (decisions from the topology/trust design sessions).
**Platform facts** referenced here live in `docs/proxmox-platform.md`; this document
records *Felhom's decisions*, not Proxmox behaviour.

---
## 1. Model at a glance

Three components. **Control is always box-initiated** — the hub never connects *into* a
customer box.

```
   operator side                     customer box (per Proxmox host)
┌───────────────────┐         ┌─────────────────────────────────────────────┐
│        HUB        │         │  Proxmox host                               │
│ (dooplex.hu, k3s) │         │  ┌──────────────┐                           │
│  - report sink    │◀──poll──┤  │  HOST AGENT  │  operator-tier            │
│  - signed jobs    │ signed  │  │  (Proxmox    │  • all Proxmox ops        │
│  - dashboard      │  jobs   │  │   token)     │  • provision / restore    │
│  - customer record│         │  └──────┬───────┘  • storage mgmt           │
│  - PBS namespace  │         │         │ local constrained API             │
└─────────▲─────────┘         │  ┌──────▼─────────────────────────────────┐ │
          │                   │  │ customer LXC (one per customer)        │ │
          │  direct, app-     │  │  ┌──────────────┐   Docker:            │ │
          └───────────────────┼──┤  │  IN-GUEST    │   [app] [app] ...    │ │
             domain reports   │  │  │  CONTROLLER  │   (Docker containers)│ │
                              │  │  │ (Docker-only)│                      │ │
                              │  │  └──────────────┘                      │ │
                              │  └────────────────────────────────────────┘ │
                              └─────────────────────────────────────────────┘
  PBS (offsite)  ◀── outbound, client-side-encrypted backups ── customer box
  end-users / customer  ◀── Cloudflare Tunnel ── apps + controller UI
```

---
## 2. The customer node

- One **Proxmox host** per box (PVE 9.2, Debian 13, LVM-thin).
- **Default workload topology:** one **customer LXC**, Docker inside it, each app a Docker
  container/stack. Apps are isolated at the Docker layer (separate containers, networks,
  volumes, cgroup limits); they share one LXC/kernel/Docker daemon.
- **Escape hatch:** promote an individual app to its own guest (LXC or VM) only for a
  specific reason — a non-Linux/Windows app, a genuinely untrusted or exposed app needing
  hard isolation, or a resource hog needing guarantees.
- **Multi-tenant:** one customer per host is the home default; multiple customer LXCs on
  one host (a company environment) is **not precluded** — the agent manages a *set* of
  guests. The only multi-tenant-specific work deferred to "if it becomes real" is resource
  fairness (per-guest disk/RAM/CPU quotas).

---
## 3. Components & responsibilities

| | **Hub** | **Host agent** | **In-guest controller** |
|---|---|---|---|
| Runs on | dooplex.hu (k3s) | the Proxmox host | the customer LXC |
| Tier | operator backend | operator (high-privilege) | customer-facing (app) |
| Holds | customer records, signed-job source, PBS namespaces, escrowed keys | the **only** Proxmox API token; per-host operator identity | **no Proxmox creds**; its own hub API key + a local-API token to the agent |
| Does | reporting sink, dashboard, job queue, source of durable truth | all Proxmox ops (provision, restore, snapshot, backup, storage mgmt, LXC lifecycle); polls hub for signed jobs; exposes a constrained local API to the controller; **per-guest authorization gate** | Docker/app lifecycle, catalog deploy, customer UI, app-level (data-layer) backup; reports app-domain to the hub directly |
| Never does | initiate a connection *into* a box | — | touch the Proxmox API directly |

**Key separation:** the controller manages Docker; the agent manages Proxmox. The controller's
only path to guest-level operations (snapshot-before-deploy, "grow my RAM") is a constrained
**local API call to the agent**, which the agent authorizes (scoped to that controller's own
guest) and executes with its operator-tier token. This consolidates all Proxmox access and
all per-guest authorization in one auditable place and leaves the guest with zero Proxmox
credentials.
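
The per-guest authorization gate in the key-separation paragraph can be sketched as below. This is a minimal illustration, not the agent's actual API: the `Request` and `Agent` types, the token strings, the operation names, and the VMIDs are all hypothetical. It assumes the agent keeps a lookup from each controller's local-API token to the one guest that controller lives in, plus an allow-list of operations exposed on the local API. Anything outside the caller's own guest is refused before the operator-tier Proxmox token is ever used.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical local-API call from an in-guest controller.
type Request struct {
	Token string // local-API token injected by the agent at provision time
	Op    string // requested guest-level operation, e.g. "snapshot"
	VMID  int    // guest the caller wants to act on
}

// Agent maps each local-API token to the single guest it may act on.
type Agent struct {
	guestByToken map[string]int
	allowedOps   map[string]bool
}

var (
	ErrUnknownToken = errors.New("unknown local-API token")
	ErrOpNotAllowed = errors.New("operation is not exposed on the local API")
	ErrForeignGuest = errors.New("request targets a guest the caller does not own")
)

// Authorize is the gate: known caller, allow-listed operation, own guest only.
func (a *Agent) Authorize(r Request) error {
	own, ok := a.guestByToken[r.Token]
	if !ok {
		return ErrUnknownToken
	}
	if !a.allowedOps[r.Op] {
		return ErrOpNotAllowed
	}
	if r.VMID != own {
		return ErrForeignGuest
	}
	return nil
}

// newDemoAgent builds an agent managing two guests (tokens and VMIDs are made up).
func newDemoAgent() *Agent {
	return &Agent{
		guestByToken: map[string]int{"tok-a": 101, "tok-b": 102},
		allowedOps:   map[string]bool{"snapshot": true, "rollback": true, "resize": true},
	}
}

func main() {
	a := newDemoAgent()
	fmt.Println(a.Authorize(Request{Token: "tok-a", Op: "snapshot", VMID: 101}))
	fmt.Println(a.Authorize(Request{Token: "tok-a", Op: "snapshot", VMID: 102}))
	fmt.Println(a.Authorize(Request{Token: "tok-a", Op: "destroy", VMID: 101}))
}
```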

---

## 4. Control plane — box-initiated

- CGNAT does **not** force this: the Cloudflare Tunnel already makes a box reachable through
  Cloudflare's edge. We *choose* box-initiated control for the smallest attack surface — the
  box exposes no control endpoint at all.
- The agent and the controller **poll** the hub; the hub never initiates inbound.
- Operator actions are delivered as **signed jobs**: the agent verifies an operator signature
  before executing, so a compromised hub database alone cannot forge commands.
- All operator-initiated actions are recorded in a **customer-visible audit log**.
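
The signed-job check in the list above can be sketched as follows. The document does not name a signature scheme, so Ed25519 (Go's standard `crypto/ed25519`) is an assumption here, and `SignedJob`, `SignJob`, `VerifyJob`, and `newDemoKeys` are hypothetical names. The point illustrated is that the agent trusts only the operator's public key: a job row altered or invented inside the hub database fails verification because the private key is not stored there. Replay protection (a nonce or expiry inside the payload) is not shown.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// SignedJob is a hypothetical envelope: the job payload plus a detached signature.
type SignedJob struct {
	Payload   []byte
	Signature []byte
}

var ErrBadSignature = errors.New("job signature does not verify")

// SignJob runs on the operator side, with a key kept outside the hub database.
func SignJob(priv ed25519.PrivateKey, payload []byte) SignedJob {
	return SignedJob{Payload: payload, Signature: ed25519.Sign(priv, payload)}
}

// VerifyJob runs on the agent before anything is executed.
func VerifyJob(operatorPub ed25519.PublicKey, job SignedJob) error {
	if !ed25519.Verify(operatorPub, job.Payload, job.Signature) {
		return ErrBadSignature
	}
	return nil
}

// newDemoKeys generates a throwaway operator key pair for the example.
func newDemoKeys() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := newDemoKeys()

	job := SignJob(priv, []byte(`{"op":"restore","vmid":101}`))
	fmt.Println("genuine job:", VerifyJob(pub, job))

	// Simulate a tampered row in the hub database: payload changed, signature kept.
	job.Payload = []byte(`{"op":"destroy","vmid":101}`)
	fmt.Println("tampered job:", VerifyJob(pub, job))
}
```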

---

## 5. Trust boundaries

| Boundary | What crosses | Mechanism | Blast radius if breached |
|---|---|---|---|
| end-user ↔ apps | app traffic | Cloudflare Tunnel → Traefik (Host routing) | that app |
| customer ↔ controller UI | management UI | Cloudflare Tunnel; UI auth (bcrypt) | the customer's own box |
| controller ↔ agent | snapshot/resize/backup requests | local constrained RPC; agent authorizes per-guest | the controller's own guest only |
| agent ↔ hub | reports + signed jobs | outbound poll; signed jobs | one box; signed jobs limit forgery |
| controller ↔ hub | app-domain reports/jobs (incl. geo desired-state) | outbound, own API key | app-domain of one customer |
| box ↔ PBS | encrypted backups | outbound; per-customer namespace; client-side encryption | ciphertext only (operator can't read) |
| guest ↔ Proxmox host | **(none direct)** | the guest holds no Proxmox creds; all via the agent | — |
| hub ↔ Cloudflare API | geo-restriction WAF (enforcement) | the **hub** holds the CF API token; reconciles geo desired-state → WAF | the customer's zone/WAF |

---
## 6. Enrollment & identity

- **Physical presence at provisioning** (on-site install, or pre-imaged-and-delivered).
  This removes any zero-touch remote-enrollment problem.
- A **one-time retrieval code** mints durable identity. Single-use (burned on the successful
  config fetch) plus a short *pre-use* TTL; one-click regenerate for the only real failure
  case (fetch fails before anything is persisted). After the fetch, the code is irrelevant —
  everything downstream runs on durable credentials, so retries don't need it.
- **Order:** the agent enrolls first (and, running as root at setup, mints its own scoped
  operator-tier Proxmox token), then provisions the customer LXC from the golden template and
  deploys the controller into it — injecting the controller's hub API key and its local-API
  token. The controller is the agent's product, never the other way around.
- The **hub customer record is the durable source of truth**, and it survives box loss:
  identity, domain, **Cloudflare tunnel token**, **PBS namespace**, **storage manifest**, a
  **mirrored app inventory** (bottom-up reality, not operator-declared intent — apps themselves
  restore from the PBS guest snapshot, never re-deployed from this record; see `05` §1/§9), and the
  **escrowed (zero-knowledge) backup key**. This is what makes hardware replacement possible.
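
The retrieval-code rules above (single-use, burned only on a successful config fetch, short pre-use TTL) can be sketched as below. The `RetrievalCode` type, the `Redeem` method, the example code value, and the 15-minute TTL are all hypothetical. One assumption worth flagging: this sketch leaves the code valid when the fetch fails, so a retry inside the TTL would still work; the document instead describes one-click regeneration for that case, and either policy fits the same structure.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"time"
)

// RetrievalCode is a hypothetical hub-side record for one enrollment code.
type RetrievalCode struct {
	Value     string
	ExpiresAt time.Time // pre-use TTL: only matters until the first successful fetch
	Used      bool
}

var (
	ErrCodeMismatch   = errors.New("retrieval code does not match")
	ErrCodeUsed       = errors.New("retrieval code already used")
	ErrCodeExpired    = errors.New("retrieval code expired before use")
	ErrDeliveryFailed = errors.New("config fetch failed")
)

// Redeem validates the presented code and burns it only if deliver succeeds.
func (c *RetrievalCode) Redeem(presented string, now time.Time, deliver func() error) error {
	if subtle.ConstantTimeCompare([]byte(presented), []byte(c.Value)) != 1 {
		return ErrCodeMismatch
	}
	if c.Used {
		return ErrCodeUsed
	}
	if now.After(c.ExpiresAt) {
		return ErrCodeExpired
	}
	if err := deliver(); err != nil {
		return err // nothing was persisted, so the code is not burned
	}
	c.Used = true
	return nil
}

// demoCode returns a fresh code and a fixed "now" (both values are made up).
func demoCode() (*RetrievalCode, time.Time) {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	return &RetrievalCode{Value: "K7Q2-9FXM", ExpiresAt: now.Add(15 * time.Minute)}, now
}

func main() {
	c, now := demoCode()
	fetchOK := func() error { return nil }
	fmt.Println("first fetch:", c.Redeem(c.Value, now, fetchOK))
	fmt.Println("second fetch:", c.Redeem(c.Value, now, fetchOK))
}
```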

---

## 7. Networking

- **Cloudflare Tunnel** provides inbound access to apps and the controller UI (the CGNAT
  solution). Tunnel token lives in the hub record → **reused on new hardware during DR**, so
  DNS/routing stay intact through an outage.
- **Outbound only** for control/report/backup (poll to hub, push to PBS). No inbound control
  endpoint exists in the chosen model.
- **Tunnel placement: host** (resolved, Part 3 §3/§5). `cloudflared` runs on the Proxmox host
  as its own **agent-managed systemd service** — not inside the guest — so the data path
  survives control-plane death by construction. Geo-restriction WAF is **hub-enforced** (the
  hub holds the CF API token; the controller only reports geo desired-state).

---
## 8. Storage & backup

**Tiers** (escalating failure scope):

| Layer | Mechanism | Survives | Note |
|---|---|---|---|
| Snapshot | LVM-thin snapshot (transient) | *logical* loss only | whole-LXC rollback; **not a backup** |
| Local — second storage | vzdump to `dir`/`nfs`/`cifs` | primary-disk failure (USB) / box death (NAS) | first *real* backup tier |
| Offsite — PBS | dedup'd, incremental, encrypted | site loss | the DR substrate; paid tier |

- **Storage manifest** (hub-held, agent-reconciled): per target → type, durable identity
  (UUID / `server:/export` / repo+fingerprint), **class** (fast/slow + rough IOPS, set once
  at attach), role, encrypted credentials, schedule/retention. The agent creates the Proxmox
  storages, continuously checks presence/reachability, and reports per-target status (a
  disconnected target → actionable notification).
- **App data placement is per-volume, not per-app:** `.felhom.yml` classifies each volume
  **hot** (DB/config/cache → fast storage, enforced) vs **bulk** (media/files → may be slow).
  A photo app's DB stays on SSD while its blobs go to the USB.
- **Backup scoping:** hot data (LXC rootfs) rides the guest `vzdump` → tiers + PBS. Bulk data
  on external mount points is **excluded** from the guest vzdump (per-mount `backup` flag) and
  gets its own per-volume policy (file-level to a tier, slower cadence — or explicitly *not*
  backed up for re-downloadable content, with the customer informed).
- **Tiers double as the DR restore-source priority:** restore from the fastest *surviving*
  source (local if still attachable, PBS on true site loss).
- **Key custody (zero-knowledge default):** three tiers the customer chooses —
  *customer-only* / *zero-knowledge escrow (default)* / *operator-managed*. Default escrows
  the **PBS passphrase-protected keyfile** in the hub, wrapped under a **customer recovery
  code** the operator can't open; DR needs the customer's code. Access-notification is an
  audit signal, never the primary guard. (Don't build bespoke crypto — use PBS's native
  keyfile passphrase.)
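
The restore-source rule in the list above ("fastest surviving source") reduces to a small selection function. The sketch below is illustrative only: the `Source` type, its `Rank` field, and the target names are hypothetical, and it assumes the agent has already probed each backup target for reachability. Snapshots are left out on purpose, since the tier table marks them as not a backup.

```go
package main

import (
	"fmt"
	"sort"
)

// Source is a hypothetical view of one backup target during disaster recovery.
type Source struct {
	Name      string
	Rank      int  // lower rank restores faster (local before offsite)
	Available bool // result of the agent's presence/reachability check
}

// PickRestoreSource returns the fastest source that is still available.
// The boolean is false when no backup tier survived.
func PickRestoreSource(sources []Source) (Source, bool) {
	ordered := append([]Source(nil), sources...) // do not reorder the caller's slice
	sort.SliceStable(ordered, func(i, j int) bool { return ordered[i].Rank < ordered[j].Rank })
	for _, s := range ordered {
		if s.Available {
			return s, true
		}
	}
	return Source{}, false
}

// demoSources models one local tier and the offsite PBS tier (names are made up).
func demoSources(localAlive bool) []Source {
	return []Source{
		{Name: "pbs-offsite", Rank: 2, Available: true},
		{Name: "local-usb", Rank: 1, Available: localAlive},
	}
}

func main() {
	for _, alive := range []bool{true, false} {
		src, ok := PickRestoreSource(demoSources(alive))
		fmt.Println("local alive:", alive, "restore from:", src.Name, "found:", ok)
	}
}
```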

---

## 9. Disaster recovery

- **Guest-loss (host + agent alive):** the agent restores the guest from the fastest
  surviving tier, **resets identity** (MAC/hostname — see `proxmox-platform.md`), boots it,
  controller returns. Validated mechanics: Phase 2.
- **Host / hardware-loss (agent gone):** re-provision (§6) in **restore mode** — the hub,
  knowing the customer has PBS backups, hands the freshly-enrolled agent the existing identity
  + PBS namespace + a restore directive instead of a clean-provision directive. The agent
  restores from PBS; the controller returns on the same domain (tunnel reused from the hub
  record). DR = provisioning + a restore mode, not a separate mechanism.
- **Snapshot-before-deploy:** controller asks the agent to snapshot, deploys, runs its
  post-deploy health check, asks the agent to roll back on failure. (Transient snapshot, §8.)
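
The snapshot-before-deploy sequence in the last bullet can be sketched from the controller's side as below. The `AgentClient` interface, its method names, and `DeployWithSnapshot` are hypothetical stand-ins for the constrained local API, which this document does not specify. Deleting the snapshot after a healthy deploy is an assumption drawn from the word "transient"; the real retention rule may differ. The `fakeAgent` type only records calls so the control flow can be exercised without a Proxmox host.

```go
package main

import (
	"errors"
	"fmt"
)

// AgentClient is a hypothetical view of the agent's constrained local API.
type AgentClient interface {
	Snapshot(name string) error
	Rollback(name string) error
	DeleteSnapshot(name string) error
}

var ErrUnhealthy = errors.New("post-deploy health check failed")

// DeployWithSnapshot snapshots first, then deploys and health-checks,
// and asks the agent to roll back if either step fails.
func DeployWithSnapshot(agent AgentClient, snap string, deploy, healthy func() error) error {
	if err := agent.Snapshot(snap); err != nil {
		return fmt.Errorf("snapshot before deploy: %w", err)
	}
	err := deploy()
	if err == nil {
		err = healthy()
	}
	if err != nil {
		if rbErr := agent.Rollback(snap); rbErr != nil {
			return fmt.Errorf("deploy failed (%v); rollback also failed: %w", err, rbErr)
		}
		return fmt.Errorf("deploy rolled back: %w", err)
	}
	// Assumption: a transient snapshot is dropped once the deploy is healthy.
	return agent.DeleteSnapshot(snap)
}

// fakeAgent records the calls it receives instead of touching Proxmox.
type fakeAgent struct{ calls []string }

func (f *fakeAgent) Snapshot(string) error       { f.calls = append(f.calls, "snapshot"); return nil }
func (f *fakeAgent) Rollback(string) error       { f.calls = append(f.calls, "rollback"); return nil }
func (f *fakeAgent) DeleteSnapshot(string) error { f.calls = append(f.calls, "delete"); return nil }

func main() {
	ok := func() error { return nil }
	bad := func() error { return ErrUnhealthy }

	a := &fakeAgent{}
	fmt.Println(DeployWithSnapshot(a, "pre-deploy", ok, ok), a.calls)

	b := &fakeAgent{}
	fmt.Println(DeployWithSnapshot(b, "pre-deploy", ok, bad), b.calls)
}
```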

---

## 10. How this embodies the product values

- **Zero-knowledge offsite** — the operator holds the offsite backup but cannot read it.
- **Box-initiated control + signed jobs** — no standing operator backdoor; a hub compromise
  alone can't forge commands.
- **Customer-visible audit log** — every operator action is visible to the customer.
- **Never hold data hostage** — subscriptions cover ongoing labour (monitoring, offsite,
  support, new deployments); the customer's data and deployed apps remain recoverable by the
  customer (recovery code), with nothing locked behind the operator.

---
## 11. Open sub-decisions (carried into later parts)

- **RTO/RPO targets** → drive the backup + offsite-replication schedule (§8).
- Offboarding / decommission (scenario 6) — not yet designed; must honour "never hold data
  hostage" in credential revocation + data hand-off.
- Multi-tenant resource fairness — deferred until multi-tenant is real (§2).

---
## Appendix — relationship to the spike

- **Phase 0** → §2: LXC-default for the workload; overhead numbers.
- **Phase 1** → §3/§5: validated the privilege boundary (create/allocate is operator-tier).
  The guest-side scoped-backup-token it proved possible is **not** used — we chose the
  agent-mediated path — but it confirmed restore = operator-tier, which shapes the agent.
- **Phase 2** → §8/§9: backup→restore round-trip; identity reset on restore.

---
## Changelog — design-review + Phase-3 fold-in (2026-06-08)

- §5 trust boundaries: **added `hub ↔ Cloudflare API`** row (hub holds the CF token, enforces
  geo→WAF); controller↔hub row notes it carries geo desired-state (S4).
- §7 networking: **tunnel placement resolved → host** (agent-managed systemd service); geo is
  hub-enforced (S4/S5).
- §11 open items: removed the now-resolved **tunnel placement** and **self-update flow** entries
  (S5; self-update designed in 03 §11).
- §6 durable record: **"declarative app inventory" → "mirrored app inventory"** — aligns the wording
  with the locked two-driver model (`05` §1: apps are bottom-up mirror, never operator-declared;
  `05` §9: apps restore from the PBS guest snapshot, not re-deployed from this record).
@@ -0,0 +1,374 @@
# Felhom Controller Architecture — Part 2: Controller Module Map

**Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.
**Subject:** the v0.33 controller in `deploy-felhom-compose/controller/` (110 `.go` files,
~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
[../proxmox-platform.md](../proxmox-platform.md).

> This is a **planning map, not the port.** No controller code was changed. Source
> citations use `controller/internal/...:line` (a different repo, so links are not
> clickable). Classifications reflect the **target model**: the in-guest controller is
> **Docker-only and holds no Proxmox credentials**; everything host/disk/Proxmox moves to
> a new **host agent** (out of scope here); the controller reaches the agent through a
> constrained **local API**.

## Classification scheme
**KEEP** (host-agnostic, ~unchanged) · **PORT** (survives, needs rework) ·
**DELETE (→agent)** (responsibility moves to the host agent) ·
**DELETE (obsolete)** (no longer needed) · **MODIFY** (stays, materially changes) ·
**NEW** (no v0.33 equivalent).
Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-target with a keep/port target).

---
## 0. Executive summary

- The **app domain is largely intact and portable**: stack lifecycle (`stacks/`), catalog
  git-sync (`sync/`), app-to-app integrations (`integrations/`), `.fab` export/import
  (`appexport/`), the scheduler, crypto, asset sync, the hub report/notify *channels*, and
  most of the web UI **KEEP/PORT cleanly**.
- The **disk/storage/host half deletes wholesale to the agent**: all of `storage/`,
  `monitor/watchdog.go`, the restic/cross-drive/disk-layout/drive-mount parts of `backup/`,
  `report/infra_backup*`+`infra_pull`, and the host-physical parts of `system/`.
- The **setup wizard (`setup/`) is obsolete** — the agent provisions the controller.
- **The single biggest hazard is `backup/`**: the keep side (DB dumps, Docker-volume
  archive, per-app restore — needed by `appexport/` and the backup UI) and the delete side
  (restic, cross-drive, drive-mount) are **interleaved inside the same files**
  (`backup.go`, `restore.go`, `paths.go`), not cleanly file-separated. Extracting the
  app-data-backup subset into a clean retained package is the critical refactor.
- **Intent-vs-reality corrections** (vs the task's provisional split): `monitor/pinger.go`
  is already **dead** (legacy Healthchecks.io, "deprecated… now handled by Hub" per
  `main.go`) → DELETE(obsolete), not keep. `backup.go`/`restore.go`/`paths.go` do **not**
  split on file boundaries — they split *within* the file. `settings/` is **not** pure app
  domain — it stores disk/disconnect/decommission state. `system/` is genuinely
  mixed-per-function, not per-file.

---
## 1. v0.33 module inventory (package → purpose, key deps)

| Package | Purpose | Key internal deps |
|---|---|---|
| `cmd/controller/main.go` | Entry point; wires all subsystems; 6 adapters break import cycles; branches into setup mode | imports **every** package |
| `api/` | REST API (`router.go`) + geo endpoints (`geo.go`) | stacks, backup, metrics, notify, selfupdate, sync, system, assets, integrations, cloudflare, config, settings |
| `appexport/` | `.fab` app export/import (config+DB+volumes, AES-256-CTR+scrypt) | **backup** (DB dump), (provider iface → stacks) |
| `assets/` | Download/cache app assets from Hub API | — (HTTP only) |
| `backup/` | DB dumps, Docker-volume archive, **restic**, **cross-drive rsync**, per-app restore, **drive mount**, disk-layout, infra-backup metadata | config, monitor, settings, system, util |
| `cloudflare/` | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries) — **enforcement → hub** (S4) | settings |
| `config/` | `controller.yaml` schema + load | — |
| `crypto/` | AES-256-GCM for app.yaml secrets | — |
| `integrations/` | App-to-app (OnlyOffice→FileBrowser/Nextcloud) via docker exec / config patch | stacks, crypto, settings |
| `metrics/` | SQLite time-series: system + container metrics, log scan | system |
| `monitor/` | App health (`healthcheck`,`pinger`) + **storage/USB watchdog** | config, notify, settings, system |
| `notify/` | Hub event push (direct, own API key) | settings |
| `recovery/` | Generate `recovery-info.txt` (DR guide) | — |
| `report/` | Build+push hub report; **infra-backup payload**; **recovery pull** | backup, config, metrics, monitor, scheduler, settings, stacks, system |
| `scheduler/` | Cron/interval jobs, Budapest TZ | — |
| `selftest/` | Startup checks (docker/dirs/catalog/hub/**restic repos**/mountpoint) | backup, config, settings, system |
| `selfupdate/` | Self-update: pull image, edit compose, `up -d` | config |
| `settings/` | `settings.json` persistent state: **storage paths/disconnect/decommission**, cross-drive cfg, notif prefs, geo, integration state, DB-validation cache | — |
| `setup/` | **First-run wizard** (scan drives, hub-restore, manual config) | backup, config, report, settings, web |
| `stacks/` | Docker Compose lifecycle, deploy + memory validation, metadata (`.felhom.yml`), HDD-data delete | config, crypto, system |
| `storage/` | **Physical disk** scan/format/attach/mount/migrate/fstab/safety | backup, settings, util |
| `sync/` | Catalog git-sync (pull templates) | config |
| `system/` | Resource info: mem/cpu/load (guest) + **temp/disk-model/USB/mount topology (host)** | — |
| `util/` | String helper | — |
| `web/` | Hungarian dashboard: pages, auth, deploy, backup UI, **storage/disk UI**, DR restore UI, export UI, debug | appexport, backup, config, crypto, integrations, monitor, notify, scheduler, selfupdate, settings, stacks, storage, system |

---
## 2. Classification table (per package/file)

### `cmd/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `cmd/controller/main.go` | **MODIFY** | Wiring stays, but drop the setup-mode branch, the storage/watchdog/drive-migrator/restic/cross-drive/infra-backup wiring, and add the **agent local-API client**. 6 adapters shrink. | hazard |

### `api/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `api/router.go` | **PORT/MODIFY** | Keep stacks/deploy/integrations/metrics/sync/assets/selfupdate routes; **remove `/api/storage/*` (disk)**; backup routes become **agent-coordinated guest-backup** requests; `config/apply` (hub-pushes-yaml) changes since the **agent** now injects config at provision. | needs-rework |
| `api/geo.go` | **PORT/MODIFY** | Keep the customer-facing geo **preference** endpoints (set/get global + per-app); **drop the Cloudflare-sync trigger** — enforcement → hub (S4). The controller reports geo desired-state up instead of calling the CF API. | needs-rework |

### `appexport/` — KEEP/PORT (Docker-volume + DB level, no disk ops)
| File | Class | Reason | Risk |
|---|---|---|---|
| `crypto.go` | **KEEP** | Self-contained AES-256-CTR+HMAC+scrypt for `.fab`. | clean |
| `manifest.go`, `provider.go` | **KEEP** | Bundle metadata; provider interface (impl in main). | clean |
| `export.go` | **PORT** | Docker-volume `tar`, DB dump via `backup.DumpOne`, config copy. Depends on the **retained** app-data-backup subset of `backup/`; HDD-mount enumeration reworked to **per-volume placement**. | needs-rework |
| `restore.go` | **PORT** | `docker volume create`/`tar xf`, DB import, compose up. Same per-volume rework. | needs-rework |
| `estimate.go` | **PORT** | `du`/`df` on mounts → per-volume sizing. | clean |

### `assets/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `syncer.go` | **KEEP** | Hub API download + checksum cache; already a direct hub channel. | clean |

### `backup/` — THE SPLIT (delete side interleaved with keep side; see §3)
| File | Class | Reason | Risk |
|---|---|---|---|
| `dbdump.go` | **KEEP** | Pure `docker exec pg_dump`/`mariadb-dump` — app/DB data layer; the retained per-app backup. | clean |
| `appdata.go` | **PORT** | App-data discovery (stacks/volumes/DB containers, `du`). "HDD mount" concept → per-volume. | needs-rework |
| `backup.go` (1478 L) | **MODIFY (split)** | Mixes **keep** (`RunDBDumps`, `DumpAppVolumes(Safe)`, app restore) with **delete→agent** (`RunBackup`/`backupDrive`/restic snapshot/prune/check on per-drive repos). Must be torn in two. | hazard |
| `restore.go` (442 L) | **MODIFY (split)** | `RestoreApp` restic path → agent; Docker-volume + Tier-2 rsync restore (app layer) → keep. | hazard |
| `restore_app_linux.go`/`_other.go` | **PORT** | Per-app restore: compose pull/up, rsync app data, DB-dump restore. App layer; depends on backup location that changes. | needs-rework |
| `paths.go` | **MODIFY (split)** | `AppDBDumpPath`/`AppVolumeDumpPath` keep; `Primary/SecondaryResticRepoPath`, `InfraBackupDir` → agent. | needs-rework |
| `restic.go` | **DELETE (→agent)** | restic repos on drives = infra backup tier; agent does vzdump/PBS. | hazard |
| `crossdrive.go` | **DELETE (→agent)** | Tier-2 cross-drive rsync to secondary storage = storage-tier (agent + storage manifest). | hazard |
| `restore_drives_linux.go`/`_other.go` | **DELETE (→agent)** | `lsblk`/`blkid`/`mount`/fstab — pure host disk. | hazard |
| `disk_layout.go` | **DELETE (→agent)** | Disk topology for DR → agent. | clean |
| `local_infra.go` | **DELETE (→agent)** | Per-drive infra-backup metadata → agent. | clean |
| `restore_scan.go` | **DELETE (→agent)** | Scans drives to build a DR restore plan = agent-tier DR. | needs-rework |

### `cloudflare/` — DELETE (→hub): CF-API enforcement moves to the hub (S4)
| File | Class | Reason | Risk |
|---|---|---|---|
| `client.go`,`zone.go`,`waf.go`,`geosync.go`,`countries.go` | **DELETE (→hub)** | The **hub** holds the CF API token and reconciles geo desired-state → WAF (doc 01 §5, doc 03 §2). The controller no longer calls the Cloudflare API — it reports geo desired-state up. The customer-facing geo *preference UI/data* stays (see `api/geo.go`). | needs-rework |

### `config/`, `crypto/`, `util/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `config/config.go` | **MODIFY** | Drop `BackupConfig` (restic/retention), storage-drive keys, and `InfrastructureConfig.cf_api_token` (→hub, S4); keep customer/paths/web/git/stacks/monitoring/hub/assets/system; **add agent local-API endpoint+token**. | needs-rework |
| `crypto/crypto.go` | **KEEP** | App.yaml secret encryption. | clean |
| `util/strings.go` | **KEEP** | Trivial helper. | clean |

### `integrations/` — all KEEP (pure app-domain)
| File | Class | Reason | Risk |
|---|---|---|---|
| `integrations.go`,`lifecycle.go`,`manager.go`,`onlyoffice_filebrowser.go`,`onlyoffice_nextcloud.go` | **KEEP** | App-to-app via `docker exec` / compose-config patch; no host ops. | clean |

### `metrics/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `store.go`,`logscanner.go`,`telemetry.go`,`types.go` | **KEEP** | SQLite store, `docker logs` scan, container telemetry — app-domain. | clean |
| `collector.go` | **PORT** | Container metrics (`docker stats`) keep; host metrics via `system.GetInfo` (temp, physical disk) become **agent-provided or dropped**. | needs-rework |
| `sysinfo.go`/`sysinfo_other.go` | **MODIFY** | Reads `/host/etc`, `/proc/cpuinfo`, uptime — host static info; in-guest some is meaningful, hardware identity via agent. | needs-rework |

### `monitor/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `healthcheck.go` | **PORT (split)** | Keep guest health (mem/cpu/docker/protected-containers); host health (temp, **physical disk**, storage-path mount status) becomes **agent-fed**. | needs-rework |
| `pinger.go` | **DELETE (obsolete)** | Legacy Healthchecks.io; `main.go` itself marks it "deprecated… now handled by Hub". *(Corrects the task's KEEP/PORT guess.)* | clean |
| `watchdog.go` (902 L) | **DELETE (→agent)** | Storage/USB disconnect monitoring: `umount -l`, `mount -T /host-fstab`, UUID probing, restic-lock cleanup — pure host storage. | hazard |

### `notify/`, `recovery/`, `scheduler/`, `selftest/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `notify/notifier.go` | **KEEP/MODIFY** | Direct hub event channel (own API key) — keep; prune infra event types that move to the agent (`storage_disconnected`, `crossdrive_*`, `disaster_recovery_*`). | clean |
| `recovery/info.go` | **DELETE (obsolete)** | Generates a DR text guide (OS install, docker-setup.sh, hub restore UI); DR is now agent+hub provisioning. | clean |
| `scheduler/scheduler.go` | **KEEP** | Generic cron/interval, Budapest TZ. | clean |
| `selftest/selftest.go` | **PORT** | Keep docker/dirs/catalog/hub checks; drop restic-repo + system-data **mountpoint** checks (→agent). | needs-rework |

### `report/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `pusher.go` | **KEEP** | Direct hub push (`/api/v1/report`, Bearer). | clean |
| `telemetry.go` | **KEEP** | Per-app telemetry section. | clean |
| `builder.go` (326 L) | **MODIFY** | Keep containers/telemetry/stacks/geo/app-health; drop/relocate host system info, physical storage, **restic backup status incl. restic password**. | hazard |
| `types.go` | **MODIFY** | Schema: drop infra fields (`restic password`, physical storage), keep app-domain. | needs-rework |
| `infra_backup.go`/`_linux.go`/`_other.go` | **DELETE (→agent)** | Builds infra-backup payload (disk layout, restic/enc passwords) for hub. | hazard |
| `infra_pull.go` | **DELETE (→agent)** | Pulls recovery config + infra backup from hub (setup-wizard DR). | needs-rework |

### `selfupdate/` — controller is agent-managed (doc 03 §11)
| File | Class | Reason | Risk |
|---|---|---|---|
| `version.go` | **KEEP** | Semver parse / version string (still used for reporting). | clean |
| `state.go` | **DELETE (obsolete)** | Self-update audit state — the agent owns controller updates now (doc 03 §11). | clean |
| `updater.go` | **DELETE (→agent)** | Resolved (doc 03 §11): the controller is **agent-managed** — the agent snapshots → redeploys → health-gates → rolls back the controller. The controller's old self-update path (image pull + compose edit) is **removed**. | clean |

### `settings/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `settings/settings.go` (1101 L) | **MODIFY (split)** | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive *intent*. The **storage-path registry** (`StoragePath` with `Disconnected`/`DisconnectedAt`/`StoppedStacks`/decommission) is disk-management state → reshape to **per-volume placement** fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. (UUID is *not* a persisted field — runtime-derived from fstab.) | hazard |

### `setup/` — all DELETE (obsolete); the agent provisions the controller
| File | Class | Reason | Risk |
|---|---|---|---|
| `handlers.go`,`setup.go`,`csrf.go`,`network.go` | **DELETE (obsolete)** | First-run wizard (hub-restore, manual config, LAN-IP detection). | needs-rework |
| `scanner.go` | **DELETE (→agent)** | Drive scan (`lsblk`+temp mounts) for backup discovery — host op; its capability informs the agent. | clean |

### `stacks/` — core app domain (KEEP/PORT)
| File | Class | Reason | Risk |
|---|---|---|---|
| `manager.go` (1074 L) | **KEEP/PORT** | Docker Compose orchestration, scan/state/start/stop/logs — the heart. Minor port. | clean |
| `deploy.go` | **PORT** | Memory validation (`system.GetMemoryMB` — **guest** mem, fine in LXC), secret gen, encrypted app.yaml. **Add snapshot-before-deploy → agent** hook. | needs-rework |
| `healthprobe.go` | **KEEP** | TCP/HTTP app probes. | clean |
| `metadata.go` | **PORT** | `.felhom.yml` parse. **Add per-volume hot/bulk classification** (doc 01 §8). | needs-rework |
| `delete.go` | **PORT** | Stack delete + HDD-data `os.RemoveAll` on bind mounts → per-volume cleanup. | needs-rework |
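
The per-volume hot/bulk classification that the `metadata.go` row adds could feed a placement rule like the sketch below. The `.felhom.yml` schema is not shown in this audit, so the `Volume`, `Target`, and `Place` names are hypothetical and the parsing step is omitted. The sketch encodes what doc 01 §8 states (hot volumes must land on fast storage, enforced) and adds one assumption of its own: bulk volumes prefer a slow target when one is attached and fall back to fast storage otherwise.

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeClass mirrors the hot/bulk split described in doc 01 §8.
type VolumeClass string

const (
	Hot  VolumeClass = "hot"  // DB/config/cache
	Bulk VolumeClass = "bulk" // media/files
)

// Volume is a hypothetical parsed entry from an app's metadata.
type Volume struct {
	Name  string
	Class VolumeClass
}

// Target is a hypothetical storage target from the agent's storage manifest.
type Target struct {
	Name string
	Fast bool
}

var (
	ErrNoFastTarget = errors.New("hot volume requires a fast target, none attached")
	ErrNoTarget     = errors.New("no storage target attached")
	ErrUnknownClass = errors.New("unknown volume class")
)

// Place picks a target for one volume: hot is enforced onto fast storage,
// bulk prefers slow storage (assumption) and falls back to fast.
func Place(v Volume, targets []Target) (Target, error) {
	var fast, slow *Target
	for i := range targets {
		t := &targets[i]
		if t.Fast && fast == nil {
			fast = t
		}
		if !t.Fast && slow == nil {
			slow = t
		}
	}
	switch v.Class {
	case Hot:
		if fast == nil {
			return Target{}, ErrNoFastTarget
		}
		return *fast, nil
	case Bulk:
		if slow != nil {
			return *slow, nil
		}
		if fast != nil {
			return *fast, nil
		}
		return Target{}, ErrNoTarget
	default:
		return Target{}, ErrUnknownClass
	}
}

// demoTargets models an internal SSD plus a USB disk (names are made up).
func demoTargets() []Target {
	return []Target{{Name: "ssd", Fast: true}, {Name: "usb", Fast: false}}
}

func main() {
	vols := []Volume{{Name: "photos-db", Class: Hot}, {Name: "photos-blobs", Class: Bulk}}
	for _, v := range vols {
		t, err := Place(v, demoTargets())
		fmt.Println(v.Name, "->", t.Name, err)
	}
}
```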
|
||||
|
||||
### `storage/` — entire package DELETE (→agent)
|
||||
| File | Class | Reason | Risk |
|
||||
|---|---|---|---|
|
||||
| `scan*`,`format*`,`attach*`,`migrate*`,`migrate_drive*`,`safety*` | **DELETE (→agent)** | Physical disk: `lsblk`/`sfdisk`/`wipefs`/`mkfs.ext4`/`partprobe`/`mount`/`umount`/fstab/`blkid`/drive-rsync. The agent owns all of this (doc 01 §3, §8). | hazard |
|
||||
|
||||
### `sync/`
|
||||
| File | Class | Reason | Risk |
|
||||
|---|---|---|---|
|
||||
| `sync/sync.go` | **KEEP** | Catalog git-sync (clone/fetch/reset, copy compose+`.felhom.yml`, never overwrite app.yaml). | clean |
|
||||
|
||||
### `system/` — split per-function (not per-file)

| File | Class | Reason | Risk |
|---|---|---|---|
| `cpu_linux.go`/`cpu_other.go` | **KEEP** | `/proc/stat` works inside an LXC. | clean |
| `info.go`/`info_other.go` | **KEEP** | Structs/stubs. | clean |
| `info_linux.go` | **MODIFY (split)** | Keep mem (`/proc/meminfo`)/load/statfs (guest); **temp via `/host/sys`, hwmon → agent**. | needs-rework |
| `mounts_linux.go`/`mounts_other.go` | **DELETE (→agent)** mostly | Mount-point detection, USB, disk model, fstab, probe — host/disk. Guest-meaningful `statfs` disk-usage is the only keep-candidate → fold into the kept `info`. | hazard |

### `web/` — split by UI surface

| File | Class | Reason | Risk |
|---|---|---|---|
| `auth.go`,`csrf.go`,`logbuffer.go`,`embed.go`,`templates.go` | **KEEP** | Session/CSRF, log ring buffer, embeds/logo. | clean |
| `funcmap.go` | **KEEP/PORT** | Template helpers; a few backup/state labels track the backup rework. | clean |
| `server.go` (559 L) | **MODIFY** | Routing/wiring; remove storage/DR-restore/watchdog wiring; keep app/deploy/backup/settings/export/debug. | needs-rework |
| `handlers.go` (1883 L) | **PORT/MODIFY** | Core pages keep; the embedded **storage-path management** (add/remove/label/schedulable, storage bars, FileBrowser mount sync) → per-volume / agent-fed. | hazard |
| `handler_export.go` | **KEEP/PORT** | `.fab` UI. | clean |
| `handler_debug.go` (823 L) | **PORT** | Drop storage-simulate/infra-push/DR debug; keep the rest. | needs-rework |
| `alerts.go` | **PORT/MODIFY** | Storage-disconnect alert now sourced from **agent** status; backup/update alerts keep. | needs-rework |
| `handler_restore.go` | **DELETE (→agent) / MODIFY** | DR restore-mode UI; DR is agent-tier — replace with an agent-status view or remove. | needs-rework |
| `storage_handlers.go` (1600 L) | **DELETE (→agent)** | Format/attach/mount/disconnect/migrate-drive/decommission disk UI. Any survivor is a **thin client calling the agent API** (e.g. per-volume placement requests). | hazard |
| `templates/` (HTML, non-Go) | **PORT** | Remove disk-wizard + DR pages; keep app/deploy/backup/settings pages. | needs-rework |

### `scripts/`

| File | Class | Reason | Risk |
|---|---|---|---|
| `scripts/hashpass.go` | **KEEP** | Standalone bcrypt helper. | clean |

---

## 3. Coupling hazards (delete-targets depended on by keep/port)

1. **`backup/` is half-deleted but split *inside files*, not across them.** `backup.go`
   contains both `RunDBDumps`/`DumpAppVolumesSafe`/app-restore (keep) and
   `RunBackup`/`backupDrive` + restic (delete→agent); `restore.go` and `paths.go` are
   likewise mixed. **Keep/port consumers reach into this same package:**
   - `appexport/export.go:295` → `backup.DiscoverDatabases`/`DumpOne` (DB dump is app-layer — must survive)
   - `report/builder.go:buildBackupReport` → backup status (MODIFY)
   - `web/handlers.go` (backups page, `buildAppBackupRows`), `web/funcmap.go`, `web/alerts.go`, `web/handler_restore.go`, `web/handler_debug.go`
   - `selftest/selftest.go:217` → `checkResticRepos` (restic path — delete)
   - `main.go` scheduler chain `RunFullBackup` (DB→volume→restic→infra-push) interleaves both sides.

   **Action:** extract the app-data-backup subset (DB dump, volume archive, per-app
   restore) into a clean retained package *before* deleting the restic/cross-drive code,
   or every keep consumer breaks.

2. **`backup/crossdrive.go` (delete→agent) is wired as `crossDriveRunner` into**
   `main.go`, `api/router.go`, `web/server.go`, and surfaced by `report/builder.go` and the
   backups page. Removing it requires reworking the backup UI/report to the agent's
   guest-backup status.

3. **`storage/` (delete→agent) depended on by keep/port UI:** `web/storage_handlers.go`
   (delete) and `web/server.go`/`web/handlers.go` (port) — the latter renders storage
   labels/bars and runs **FileBrowser mount sync** off the storage-path registry.
   `storage/migrate*.go` also imports `backup` (also being split). Untangle the per-volume
   placement UI from the disk-management UI.

4. **`monitor/watchdog.go` (delete→agent) depended on by** `web/alerts.go` (port),
   `web/server.go`, `web/handler_debug.go`, `main.go`. The disconnect **alert** must instead
   consume agent-reported storage status.

5. **`system/` mixed-per-function, consumed by both sides.** Keep consumers —
   `stacks/deploy.go` (`GetMemoryMB`, guest), `metrics/collector.go` (container) — must not
   drag in the host-disk/temp/USB code that goes to the agent (`mounts_linux.go`,
   `info_linux.go` temp). Also consumed by `report/builder.go` (MODIFY), `monitor/healthcheck.go`
   (PORT), `selftest`, `crossdrive` (delete). **Split `system/` cleanly into guest-info vs
   host-info first.**

6. **`settings/StoragePath` carries disk state into an app-domain store.** Disk fields
   (`Disconnected`,`DisconnectedAt`,`StoppedStacks`, decommission — UUID is *not* persisted, it's runtime-derived from fstab via `system.ParseFstabUUID`/`watchdog.go`) are written by
   `watchdog.go`/`storage_handlers.go`/`crossdrive.go` (all delete) but the same struct is
   read by `stacks`/`web` for labels and **placement** (keep). Reshape `StoragePath` to a
   placement record fed by the agent manifest.

7. **`report/builder.go` imports almost everything** (backup, monitor, scheduler, stacks,
   system, metrics, settings, config). Its MODIFY must land *after* the backup and system
   splits, or it pulls deleted code along.

8. **`backup/paths.go` shared both ways** — `appexport` + `selftest` + the kept DB-dump
   flow use the app-dump path helpers; the same file holds the restic/secondary helpers
   that leave.

9. **DR/provisioning chain is cross-cut:** `setup/` (obsolete) → `report/infra_pull` +
   `recovery/info` + `backup.MountDrivesFromLayout` + `backup.ReadLocalInfraBackup`. All
   obsolete/→agent, but `main.go`'s setup branch and `web/handler_restore.go` reference
   them; remove together.

---

## 4. Moves to the host agent (consolidated — feeds the future agent design)

> Reporting only; **not** designing the agent here.

- **All physical-disk management** — `storage/` in full: scan/classify, format
  (`wipefs`/`sfdisk`/`mkfs.ext4`/`partprobe`), attach (raw mount + bind + fstab), per-app
  and full-drive migration (rsync), safety checks (system-disk detection).
- **Storage/USB watchdog** — `monitor/watchdog.go`: disconnect/reconnect detection,
  `umount -l`, `mount -T /host-fstab`, UUID-by-id probing, safe-disconnect, restic-lock
  cleanup.
- **Infra/disk backup tier** — `backup/restic.go`, `crossdrive.go`,
  `restore_drives_*`, `disk_layout.go`, `local_infra.go`, `restore_scan.go`, plus the
  restic-snapshot half of `backup.go`, the restic-restore half of `restore.go`, and the
  restic/secondary path helpers in `paths.go`. (Maps to the agent's `vzdump`→tiers→PBS in
  doc 01 §8.)
- **Infra-backup payload + recovery pull** — `report/infra_backup*`, `report/infra_pull`.
- **Host-physical telemetry** — `system/mounts_linux.go` (mount topology, USB, disk
  model), the temp/hwmon parts of `system/info_linux.go`, and the host-hardware parts of
  `metrics/sysinfo.go`.
- **Drive scanning for provisioning/DR** — `setup/scanner.go`.
- **Self-restore-test execution** — the agent performs the restore-to-scratch-guest; the
  controller only orchestrates/validates (see §5).

---

## 5. New components to build (no v0.33 equivalent)

1. **Agent local-API client** — the controller's only path to guest-level Proxmox
   operations (doc 01 §3, §5): `snapshot-before-deploy` + rollback, "grow my RAM", request
   guest backup/restore, read the storage manifest / mount placement, query per-target
   storage status. Replaces the deleted direct host/disk code with constrained RPC. The
   controller holds **no Proxmox creds** — only a local-API token.
2. **Per-volume storage placement** (doc 01 §8) — `.felhom.yml` `hot`/`bulk` volume
   classification (extend `stacks/metadata.go`), enforcement at deploy (extend
   `stacks/deploy.go`), and a placement record in `settings`. Replaces the per-app
   HDD-path + cross-drive model. A `bulk` volume must be realized as a `backup=0` mount point,
   **never** a rootfs Docker named volume (validated recipe: `phase3-findings.md` B2 / doc 03 §7).
3. **Self-restore-test status display** (read-only) — the **agent owns orchestration** (it
   holds the PBS key and creates the scratch guest — operator-tier, doc 03 §8); the controller
   only surfaces `GET /restore-test/status` in its UI. (Round-trip validated: Phase 2,
   [../proxmox-platform.md](../proxmox-platform.md) §4.)
4. **Snapshot-before-deploy/rollback flow** in the deploy path — wraps the existing
   compose deploy with agent snapshot → health check → agent rollback-on-failure
   (doc 01 §9). New behaviour on top of `stacks/deploy.go` + `stacks/healthprobe.go`.
5. **Agent-provisioning bootstrap receiver** — the controller accepts its injected hub API
   key + local-API token from the agent at provision time (doc 01 §6), replacing the
   deleted `setup/` wizard.

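
The snapshot-before-deploy flow in item 4 can be sketched in Go as a wrapper around the existing deploy step. This is a minimal illustration, not the planned implementation: `AgentClient`, `DeployWithSnapshot`, `ErrUnhealthy`, and `recordingAgent` are hypothetical names, and the real client would call the agent's `POST /snapshot` and `POST /rollback` over HTTPS with the local-API token.

```go
package main

import (
	"errors"
	"fmt"
)

// AgentClient is a hypothetical slice of the agent local API
// (POST /snapshot, POST /rollback), scoped to the caller's own guest.
type AgentClient interface {
	Snapshot(name string) error
	Rollback(name string) error
}

// ErrUnhealthy is what a failed post-deploy probe returns in this sketch.
var ErrUnhealthy = errors.New("health probe failed")

// DeployWithSnapshot runs agent snapshot -> deploy -> health check, and asks
// the agent to roll the guest back if the deploy or the probe fails.
func DeployWithSnapshot(agent AgentClient, snap string, deploy, healthy func() error) error {
	if err := agent.Snapshot(snap); err != nil {
		// Without a restore point the deploy does not start.
		return fmt.Errorf("snapshot before deploy: %w", err)
	}
	err := deploy()
	if err == nil {
		err = healthy()
	}
	if err == nil {
		return nil
	}
	if rbErr := agent.Rollback(snap); rbErr != nil {
		// Surface both failures; the guest may need operator attention.
		return errors.Join(err, fmt.Errorf("rollback to %q: %w", snap, rbErr))
	}
	return fmt.Errorf("deploy failed, rolled back to %q: %w", snap, err)
}

// recordingAgent is a test double that records the calls it receives.
type recordingAgent struct{ calls []string }

func (r *recordingAgent) Snapshot(name string) error {
	r.calls = append(r.calls, "snapshot:"+name)
	return nil
}

func (r *recordingAgent) Rollback(name string) error {
	r.calls = append(r.calls, "rollback:"+name)
	return nil
}
```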

---

## 6. Open / blocked items

- **Geo — resolved (S4):** CF-API **enforcement moves to the hub** (it holds the CF token and
  reconciles geo → WAF); the controller keeps the geo **preference UI/data** and reports
  desired-state up. Tunnel placement is settled (host, agent-managed, doc 03 §3/§5). The
  `cloudflare/` package + `api/geo.go`'s CF-sync are DELETE-from-controller → hub.
- **Self-update — resolved (doc 03 §11):** the controller is agent-managed; its self-update
  path is removed.
- **`settings`/`stacks` per-volume reshape** — depends on the storage-manifest contract
  between hub ↔ agent ↔ controller (doc 01 §8), not yet specified.
- **Backup UI/report surface** — depends on the agent's guest-backup status API shape
  (what the controller can see about vzdump/PBS state) — undefined.
- **Notification event taxonomy** — which infra events (`storage_disconnected`,
  `crossdrive_*`, `disaster_recovery_*`) the **agent** emits vs the controller, once those
  responsibilities move.

---

## Changelog — design-review + Phase-3 fold-in (2026-06-08)

- **M1:** removed `UUID` from the `settings.StoragePath` field lists (§ settings, hazard #6) —
  it is runtime-derived from fstab, not persisted.
- **S4 (geo):** `cloudflare/` reclassified **PORT(blocked) → DELETE(→hub)** (CF-API enforcement
  moves to the hub); `api/geo.go` → **PORT/MODIFY** (keep geo *preference* endpoints, drop the
  CF-sync trigger); `config/config.go` also drops `cf_api_token`. §6 + §1 updated.
- **S5:** cloudflare/geo no longer "blocked on tunnel placement" (resolved).
- **S6:** §5(3) self-restore-test → **status-display only**; the agent owns orchestration.
- **Self-update resolved (03 §11):** `updater.go` → **DELETE(→agent)**, `state.go` →
  DELETE(obsolete), `version.go` KEEP; §6 + §5(2) updated (bulk = `backup=0` mountpoint recipe).
@@ -0,0 +1,299 @@
# Architecture Part 3 — The Host Agent

> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong — flag it.

## 1. Purpose & scope

The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.

It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.

The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded — the agent's data model is per-host,
N-guests, never "the guest").

## 2. Responsibilities (and explicit non-responsibilities)

Owns:

1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API** — the per-guest authorization gate the controller calls.
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.

Explicitly does **not**:

- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
- Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.

## 3. Process model & host integration

- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work, plus a **narrow `sudoers` allowlist** for true host ops. The entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered. Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.

## 4. Control model — reconcile + signed destructive ops

Two channels, split by **reversibility**, not by transport.

**(a) Desired-state reconciliation — steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
reconciliation for free.

**(b) Signed one-shot jobs — operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.

**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
by verb**:

- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)

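
The gate rules above can be read as a small decision function. The following Go sketch is one way to encode them, under stated assumptions: the `Provenance`, `Action`, and `Change` types and `NeedsOperatorSignature` are hypothetical names, and the provenance value is assumed to come from the agent's own journal, never from the hub payload. The zero value is deliberately the data-bearing case, so anything the agent cannot prove it created is treated as customer data.

```go
package main

// Provenance is agent-internal: it is looked up in the agent's own journal
// and is never parsed from hub-supplied desired state.
type Provenance int

const (
	// ProvDataBearing is the zero value: unknown provenance counts as
	// customer data and stays behind the signature gate.
	ProvDataBearing Provenance = iota
	// ProvSameTransaction marks a resource created earlier in the
	// current journaled transaction (compensating rollback).
	ProvSameTransaction
	// ProvScratch marks a resource the agent itself tagged ephemeral,
	// for example a restore-test scratch guest.
	ProvScratch
)

// Action is the kind of change the reconciler or a job wants to make.
type Action int

const (
	ActCreate Action = iota
	ActStart
	ActRestart
	ActDestroy
	ActOverwrite
)

// Change is one planned mutation together with the target's provenance.
type Change struct {
	Action     Action
	Provenance Provenance
}

// NeedsOperatorSignature reports whether the agent must refuse the change
// unless it carries a valid operator signature. It applies equally to
// one-shot jobs and to desired-state deltas.
func NeedsOperatorSignature(c Change) bool {
	switch c.Action {
	case ActCreate, ActStart, ActRestart:
		// Non-destructive: the reconciler may act on its own.
		return false
	}
	// Destructive: allowed unsigned only for resources whose provenance
	// the agent established itself.
	return c.Provenance == ProvDataBearing
}
```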
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.

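
A sketch of the agent-side check for a signed one-shot job, in Go. The design does not name a signature scheme or a payload format; Ed25519 and a JSON payload are assumptions made here for illustration, as are the `Job` fields, `Verifier`, and `RoundTrip`. The in-memory nonce set is also a simplification: to survive restarts it would have to live in the operation journal.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"time"
)

// Job is a hypothetical signed payload: the target binding (host + guest)
// plus the anti-replay fields (nonce + expiry).
type Job struct {
	Op      string    `json:"op"`
	Host    string    `json:"host"`
	Guest   int       `json:"guest"`
	Nonce   string    `json:"nonce"`
	Expires time.Time `json:"expires"`
}

// Verifier holds the operator's public key and this host's identity.
type Verifier struct {
	Key  ed25519.PublicKey
	Host string
	seen map[string]bool // nonces already accepted (in-memory in this sketch)
}

// Verify checks the signature first, then host binding, expiry, and replay.
// The caller still has to confirm that j.Guest is a guest on this host.
func (v *Verifier) Verify(payload, sig []byte, now time.Time) (Job, error) {
	if !ed25519.Verify(v.Key, payload, sig) {
		return Job{}, errors.New("signature does not verify")
	}
	var j Job
	if err := json.Unmarshal(payload, &j); err != nil {
		return Job{}, err
	}
	if j.Host != v.Host {
		return Job{}, errors.New("job is bound to a different host")
	}
	if !now.Before(j.Expires) {
		return Job{}, errors.New("job has expired")
	}
	if v.seen[j.Nonce] {
		return Job{}, errors.New("nonce already used")
	}
	if v.seen == nil {
		v.seen = make(map[string]bool)
	}
	v.seen[j.Nonce] = true
	return j, nil
}

// RoundTrip signs a job with a throwaway key and submits it twice; the
// second submission is the replay case and must be rejected.
func RoundTrip() (first, replay error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return err, err
	}
	payload, err := json.Marshal(Job{
		Op: "restore", Host: "box-1", Guest: 101,
		Nonce: "n-1", Expires: time.Now().Add(time.Hour),
	})
	if err != nil {
		return err, err
	}
	sig := ed25519.Sign(priv, payload)
	v := &Verifier{Key: pub, Host: "box-1"}
	_, first = v.Verify(payload, sig, time.Now())
	_, replay = v.Verify(payload, sig, time.Now())
	return first, replay
}
```

The signature is verified over the raw payload bytes before anything is parsed, so the hub can queue the blob without being able to alter the target or the expiry.
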
## 5. Hub ↔ agent protocol (host domain)

**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:

- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. This is the worst
failure mode for a managed service, so it gets first-class treatment hub-side.

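
Hub-side, the dead-man's-switch reduces to comparing each box's last heartbeat against its expected check-in window. A minimal Go sketch follows; the function name `Overdue`, the `grace` parameter, and the box ids are hypothetical, and the design does not specify how the window is computed.

```go
package main

import (
	"sort"
	"time"
)

// Overdue returns the boxes whose last heartbeat is older than the poll
// interval plus a grace period, sorted for stable alert output.
func Overdue(lastSeen map[string]time.Time, now time.Time, interval, grace time.Duration) []string {
	var late []string
	for box, seen := range lastSeen {
		if now.Sub(seen) > interval+grace {
			late = append(late, box)
		}
	}
	sort.Strings(late)
	return late
}

// exampleOverdue evaluates two boxes against a 5-minute poll interval with
// a 10-minute grace period; only the box silent for 45 minutes is flagged.
func exampleOverdue() []string {
	now := time.Date(2026, 6, 8, 12, 0, 0, 0, time.UTC)
	lastSeen := map[string]time.Time{
		"box-a": now.Add(-2 * time.Minute),
		"box-b": now.Add(-45 * time.Minute),
	}
	return Overdue(lastSeen, now, 5*time.Minute, 10*time.Minute)
}
```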
**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.

## 6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.

Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).

A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
token → guest and authorizes per guest, so a compromised controller's blast radius is
**self-scoped and bounded** to its own guest.

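
The token → guest mapping can be sketched as HTTP middleware in Go. This is an illustration under assumptions, not the agent's actual code: the design says only "per-guest local token", so the `Authorization: Bearer` header, the `Gate` type, `CallerGuest`, and the token values are hypothetical. The point it shows is that handlers learn the guest id from the authenticated token alone and never from the request path or body.

```go
package main

import (
	"context"
	"crypto/subtle"
	"net/http"
	"net/http/httptest"
	"strings"
)

type guestKey struct{}

// Gate maps each minted local token to the one guest (VMID) it may act on.
type Gate struct {
	Tokens map[string]int
}

// Wrap rejects requests without a known token and stores the caller's
// guest id in the request context for the wrapped handler.
func (g *Gate) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		presented, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		vmid, found := 0, false
		if ok && presented != "" {
			for tok, id := range g.Tokens {
				// Constant-time comparison avoids leaking token prefixes.
				if subtle.ConstantTimeCompare([]byte(tok), []byte(presented)) == 1 {
					vmid, found = id, true
				}
			}
		}
		if !found {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		ctx := context.WithValue(r.Context(), guestKey{}, vmid)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// CallerGuest returns the guest bound to the request's token.
func CallerGuest(r *http.Request) int {
	id, _ := r.Context().Value(guestKey{}).(int)
	return id
}

// exampleSnapshot issues POST /snapshot with the given token against an
// in-memory handler and reports the status and the guest that was targeted.
func exampleSnapshot(token string) (status, snapshotted int) {
	gate := &Gate{Tokens: map[string]int{"tok-101": 101, "tok-102": 102}}
	h := gate.Wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		snapshotted = CallerGuest(r) // the only guest this request may touch
		w.WriteHeader(http.StatusAccepted)
	}))
	req := httptest.NewRequest(http.MethodPost, "/snapshot", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, snapshotted
}
```

Because no endpoint takes a guest id as input, there is nothing for a compromised controller to tamper with in order to reach a neighbouring guest.
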
## 7. Storage manifest & reconciliation

The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
`settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
improvement. Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |

Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).

**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.

A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
(verified, `phase3-findings.md` B2). Proven recipe: attach
`-mpN <storage>:<size>,mp=/mnt/bulk,backup=0`, then
`docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol>` (or a
compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The
**DR consequence** of excluding bulk is covered in §8.

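
The two placement rules (hot volumes on a `fast` target, bulk volumes never as rootfs named volumes) can be expressed as a deploy-time check. The Go sketch below is illustrative only: the `Volume` struct and `CheckPlacement` are hypothetical, and the real `.felhom.yml` schema for volume classification is not yet specified.

```go
package main

import "fmt"

// Volume is a hypothetical, already-resolved view of one app volume:
// its declared kind and where the deploy step intends to put it.
type Volume struct {
	Name          string
	Kind          string // "hot" or "bulk", as declared in .felhom.yml
	TargetClass   string // class of the chosen storage target: "fast" or "slow"
	RootfsNamed   bool   // true if realized as a Docker named volume in rootfs
}

// CheckPlacement returns an error if the planned placement breaks a rule.
func CheckPlacement(v Volume) error {
	switch v.Kind {
	case "hot":
		if v.TargetClass != "fast" {
			return fmt.Errorf("volume %q: hot volumes must be placed on a fast target, got %q", v.Name, v.TargetClass)
		}
	case "bulk":
		if v.RootfsNamed {
			// vzdump always captures rootfs, so this would defeat backup=0.
			return fmt.Errorf("volume %q: bulk volumes must use a backup=0 mount point or an external bind mount", v.Name)
		}
	default:
		return fmt.Errorf("volume %q: unknown kind %q", v.Name, v.Kind)
	}
	return nil
}
```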
**Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
`IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
`StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
reconnect, not a host concern).

## 8. Backup/restore orchestration

Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).

- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze
  (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup
  is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack →
  `POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is
  crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5).
  Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return.
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
  `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
  `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
  host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
  *same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
  never a silent loss.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
  *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
  cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
  read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** this closes the "tested restore is the critical gap" theme. The
  agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
  health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
  restore-tested by the box (the operator lacks the key) — so this lives in the agent by
  necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
  as the lighter check.

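
The controller-driven quiesce sequence from the first bullet can be sketched in Go as follows. The interfaces and names (`BackupAgent`, `Stack`, `RunDueBackup`, the `"ok"`/`"failed"`/`"running"` status strings, and the `trace` test double) are assumptions for illustration; the status API shape is listed as undefined in Part 2 §6. The sketch bounds the polling and always unquiesces, including on failure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BackupAgent is a hypothetical view of GET /backup/due, POST /backup,
// and GET /backup/status on the agent local API.
type BackupAgent interface {
	BackupDue() (bool, error)
	StartBackup() error
	BackupStatus() (string, error)
}

// Stack is whatever can pause and resume the app stack's writes.
type Stack interface {
	Quiesce() error
	Unquiesce() error
}

// RunDueBackup quiesces, requests the backup, polls until it finishes or
// the polling budget runs out, and unquiesces on every exit path.
func RunDueBackup(a BackupAgent, s Stack, poll time.Duration, maxPolls int) (err error) {
	due, err := a.BackupDue()
	if err != nil || !due {
		return err
	}
	if err = s.Quiesce(); err != nil {
		return fmt.Errorf("quiesce: %w", err)
	}
	defer func() {
		if uerr := s.Unquiesce(); uerr != nil && err == nil {
			err = fmt.Errorf("unquiesce: %w", uerr)
		}
	}()
	if err = a.StartBackup(); err != nil {
		return fmt.Errorf("request backup: %w", err)
	}
	for i := 0; i < maxPolls; i++ {
		state, serr := a.BackupStatus()
		if serr != nil {
			return serr
		}
		switch state {
		case "ok":
			return nil
		case "failed":
			return errors.New("agent reported backup failure")
		}
		time.Sleep(poll)
	}
	return errors.New("backup still running after the polling budget")
}

// trace is a test double for both interfaces that records call order.
type trace struct {
	states []string // statuses returned by successive BackupStatus calls
	calls  []string
}

func (t *trace) BackupDue() (bool, error) { t.calls = append(t.calls, "due"); return true, nil }
func (t *trace) StartBackup() error       { t.calls = append(t.calls, "backup"); return nil }
func (t *trace) Quiesce() error           { t.calls = append(t.calls, "quiesce"); return nil }
func (t *trace) Unquiesce() error         { t.calls = append(t.calls, "unquiesce"); return nil }

func (t *trace) BackupStatus() (string, error) {
	if len(t.states) == 0 {
		return "running", nil
	}
	s := t.states[0]
	t.states = t.states[1:]
	return s, nil
}
```

Holding the stack quiesced for the whole backup is the simplest reading of the sequence in the text; whether the stack can resume earlier depends on agent behaviour that is not specified here.
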
## 9. Provisioning & DR flows

**Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3), so the agent
provisions **by restoring a golden base image**, never by `pct create` on the per-customer path:

- A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`, overlayfs — is
  built once as `root@pam` **at enrollment** (when the agent legitimately holds root to mint its
  Proxmox token) and refreshed on a maintenance cadence. This is the one place `keyctl`/root
  provisioning lives — off the per-customer path.
- To provision guest G: restore the golden archive → new VMID (token-covered: `VM.Allocate` +
  `Datastore.AllocateSpace`; `keyctl` preserved) → reset identity (MAC/hostname) → size the guest
  (CPU/mem config + `pct resize` rootfs, token-covered) → attach storage mounts per the manifest
  → deploy the controller → hand it bootstrap config. A mid-flight failure is journaled and
  compensating-rolled-back (destroy the just-restored guest — allowed without a signature per §4,
  same-transaction provenance).

**Unified bring-up primitive.** Provisioning and DR-restore share the same token-covered front
half — *restore an archive → reset identity* — and differ only in the archive and the back half:
provisioning restores the **golden base** then deploys a fresh controller; DR-restore restores
the **customer's backup** (already containing controller + data), brings it up, and reattaches
external storage. One code path, exercised by every restore-test (§8).

**Guest loss.** Agent restores G from the fastest surviving tier and resets identity
(MAC/hostname) so the restored guest rejoins cleanly — this *is* the unified restore primitive
above (customer-backup archive, DR back half).

**Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub — the durable
source of truth that survives box death — hands the new agent the existing identity, PBS
namespace, tunnel token, storage manifest, and a restore directive. Tunnel is reused from
the hub record, so DNS stays intact.

## 10. Concurrency, crash-safety, idempotency

- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
  work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
  ops on the same guest). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
  self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
  journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
  half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).

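
A minimal Go sketch of per-guest serialization combined with run-once idempotency keys. It swaps in a mutex per guest for the work queue the text describes, so it serializes mutations on one guest but does not guarantee FIFO ordering, and it keeps completed keys in memory, whereas run-once across restarts would need them in the operation journal. `Serializer`, `Do`, and `DoOnce` are hypothetical names.

```go
package main

import "sync"

// Serializer runs at most one mutation per guest at a time; different
// guests proceed in parallel.
type Serializer struct {
	mu    sync.Mutex
	locks map[int]*sync.Mutex // one lock per VMID
	done  map[string]error    // result of each completed idempotency key
}

// NewSerializer returns an empty serializer.
func NewSerializer() *Serializer {
	return &Serializer{locks: map[int]*sync.Mutex{}, done: map[string]error{}}
}

// guestLock returns the lock for vmid, creating it on first use.
func (s *Serializer) guestLock(vmid int) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	l, ok := s.locks[vmid]
	if !ok {
		l = &sync.Mutex{}
		s.locks[vmid] = l
	}
	return l
}

// Do runs op while holding the guest's lock (reconcile and local-API calls).
func (s *Serializer) Do(vmid int, op func() error) error {
	l := s.guestLock(vmid)
	l.Lock()
	defer l.Unlock()
	return op()
}

// DoOnce is Do for one-shot jobs: a key that already ran is not run again
// and its recorded result is returned instead.
func (s *Serializer) DoOnce(key string, vmid int, op func() error) error {
	return s.Do(vmid, func() error {
		s.mu.Lock()
		prev, ran := s.done[key]
		s.mu.Unlock()
		if ran {
			return prev
		}
		err := op()
		s.mu.Lock()
		s.done[key] = err
		s.mu.Unlock()
		return err
	})
}
```

A retried delivery of the same signed job then becomes a no-op: the second `DoOnce` call with the same key returns the first call's result without touching the guest.
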
## 11. Self-update
|
||||
|
||||
- **Agent (the hard case — a host service, no snapshot-rollback).** **A/B layout:** download →
|
||||
verify signature → stage as the inactive slot → flip a `current → good|new` symlink → restart.
|
||||
**Revert authority lives outside the swapped binary** — `Restart=always` alone just
|
||||
crash-loops a bad binary — so a **separate health-gate** (a systemd oneshot `ExecStartPost`
|
||||
probe, or a tiny supervisor unit) flips `current` back to last-good and restarts on a failed
|
||||
health window. The new version is **committed as "good" only after a clean health window**.
|
||||
Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
|
||||
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
|
||||
so the **agent updates the controller**: snapshot-before-update (free rollback, because the
|
||||
controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
|
||||
on failure. This resolves the Part-2 `selfupdate/` open: the controller is **agent-managed**,
|
||||
not self-updating; the controller's old self-update path is removed.
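
The A/B flip with an external revert authority can be sketched as follows. This is an assumption-laden illustration, not the agent's updater: the slot names, the `flip`/`Activate` helpers, and the `healthy` callback (standing in for the separate health-gate) are all hypothetical. It relies on `rename(2)` replacing a symlink atomically, which holds on POSIX filesystems.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// flip atomically repoints dir/current at slot: it creates a temporary symlink
// and renames it over the old one, so readers never see a missing link.
func flip(dir, slot string) error {
	tmp := filepath.Join(dir, "current.tmp")
	_ = os.Remove(tmp) // clear a leftover from an earlier crash
	if err := os.Symlink(slot, tmp); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "current"))
}

// Activate switches to newSlot, then asks the health gate whether the new
// version survived its health window. On failure it flips back to lastGood.
// It returns the slot that ends up active.
func Activate(dir, lastGood, newSlot string, healthy func() bool) (string, error) {
	if err := flip(dir, newSlot); err != nil {
		return lastGood, err
	}
	if healthy() {
		return newSlot, nil // commit: newSlot becomes the next last-good
	}
	if err := flip(dir, lastGood); err != nil {
		return newSlot, err
	}
	return lastGood, nil
}

// demo runs an update whose health window fails and reports the outcome.
func demo() (active, target string, err error) {
	dir, err := os.MkdirTemp("", "agent-slots")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	if err = flip(dir, "slot-a"); err != nil {
		return "", "", err
	}
	// The health gate reports failure, so the flip to slot-b must be reverted.
	active, err = Activate(dir, "slot-a", "slot-b", func() bool { return false })
	if err != nil {
		return "", "", err
	}
	target, err = os.Readlink(filepath.Join(dir, "current"))
	return active, target, err
}

func main() {
	fmt.Println(demo())
}
```

In the design above the caller of `Activate` would be the health-gate unit rather than the updated binary itself, which is what keeps the revert path alive when the new binary cannot start.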

## 12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here — which is the whole
argument for §3's root-minimization and a small, auditable agent.

## 13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, and
the **root-vs-API boundary** (Phase 3, B3).

Still open:

- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
- **Golden base image** refresh cadence + fleet versioning — who triggers a rebuild, how the per-host image version is tracked (operational detail, not blocking; §9).

This doc hands the implementation three contracts it was waiting on:

1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.

---

## Changelog — design-review + Phase-3 fold-in (2026-06-08)

- **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image**
  (token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified
  restore primitive shared with DR. §2 responsibility + §3 boundary updated.
- **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator
  role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART).
- **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is
  agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
- **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence**
  (excluded bulk needs its own backup decision).
- **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is
  crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority;
  controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing.
  **S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id".
  **§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open.
@@ -0,0 +1,154 @@
# Architecture Part 4 — Control-plane authorization (operator signing)

> Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
> To be reviewed by Claude Code against that spike + `03` §4, then placed at
> `docs/architecture/04-control-plane-authorization.md`.
>
> Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
> This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."

## 1. Purpose & scope

`03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
primitive, the keys, rotation, the three components' roles, and the operator workflow. The
*policy* (what needs a signature) lives in `03` §4; this is the *how*.

**Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
signed ops are rare and deliberate.

## 2. Primitive — SSH signatures (SSHSIG)

Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
`SSHSIG` format), verified by the agent in Go (`golang.org/x/crypto/ssh`) — `pem.Decode` →
`ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no hand-rolled crypto.

**Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
different signed-data), so it would force a verifier change on every box at hardware-adoption time.
SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).

### 2.1 The signed object — canonical op blob
The signature covers an op blob (Phase 4 §2):

```
{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
```

- **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
  insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
  **verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
  parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
  side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
- `nonce` ≥128-bit random; `issued_at`/`expires_at` a short window (minutes); `key_id` identifies
  the signing key (rotation/audit).
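
One way the signer side could produce that canonical form with the Go standard library is sketched below. It is an illustration under assumptions, not the Phase-4 code: `Canonicalize` is a hypothetical helper and the op name `guest_destroy` is made up. It leans on two documented `encoding/json` behaviors: maps are encoded with sorted keys, and `Encoder.Encode` emits compact output.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Canonicalize re-encodes a JSON document with object keys sorted at every
// level and no insignificant whitespace. Hypothetical helper for the signer.
func Canonicalize(in []byte) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(in))
	dec.UseNumber() // keep numbers as written instead of going through float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // do not rewrite <, >, & as \u escapes
	// Objects were decoded into maps, which encoding/json emits in sorted key order.
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	return bytes.TrimRight(buf.Bytes(), "\n"), nil // Encode appends a newline
}

func main() {
	in := []byte(`{ "op": "guest_destroy", "key_id": "k1",
		"target": {"host_id": "h1", "guest_id": "101"} }`)
	out, err := Canonicalize(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints {"key_id":"k1","op":"guest_destroy","target":{"guest_id":"101","host_id":"h1"}}
}
```

Because the verifier checks the signature over the received bytes and parses those same bytes, only the signing CLI would need a helper like this.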

### 2.2 Domain separation — the namespace
The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
caller-supplied. A signature minted for any other namespace must not verify (proven). This stops a
signature made for one purpose being reused for another.

### 2.3 Verify pipeline (order is load-bearing)
`namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
recorded last**, only after everything else passes, so an invalid signature can never consume a
nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
- **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
  A cannot be replayed at box B);
- **time window** — `now ∈ [issued_at, expires_at]`;
- **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
  and expiry-pruned; a non-persistent store reopens the replay window after a restart).

The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
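
The ordering can be sketched as below. This is not `VerifySignedOp`: the `Verifier` shape is hypothetical, the `Crypto` callback stands in for the real SSHSIG check, the nonce store is an in-memory map rather than the persistent store the design requires, and RFC 3339 timestamps plus optional `guest_id` are assumptions about the blob encoding.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// namespace is fixed in the verifier and never taken from the caller (§2.2).
const namespace = "felhom-op-v1"

// Op holds the blob fields the pipeline inspects (RFC 3339 times assumed).
type Op struct {
	Target struct {
		HostID  string `json:"host_id"`
		GuestID string `json:"guest_id"`
	} `json:"target"`
	Nonce     string    `json:"nonce"`
	IssuedAt  time.Time `json:"issued_at"`
	ExpiresAt time.Time `json:"expires_at"`
}

// Verifier is a hypothetical shape for the agent-side state.
type Verifier struct {
	HostID  string
	Guests  map[string]bool      // guest ids that exist on this box
	Allowed map[string]bool      // pinned signer public keys
	Seen    map[string]time.Time // nonce -> expiry; persistent in a real agent
	// Crypto stands in for the SSHSIG signature check over the raw blob bytes.
	Crypto func(signerKey string, blob, sig []byte) bool
}

// Verify runs the layers in the documented order and records the nonce last.
func (v *Verifier) Verify(ns, signerKey string, blob, sig []byte, now time.Time) error {
	if ns != namespace {
		return errors.New("namespace")
	}
	if !v.Allowed[signerKey] {
		return errors.New("allow-list")
	}
	if !v.Crypto(signerKey, blob, sig) {
		return errors.New("signature")
	}
	var op Op // parse the same bytes that were just verified
	if err := json.Unmarshal(blob, &op); err != nil {
		return err
	}
	guestOK := op.Target.GuestID == "" || v.Guests[op.Target.GuestID]
	if op.Target.HostID != v.HostID || !guestOK {
		return errors.New("target binding")
	}
	if now.Before(op.IssuedAt) || now.After(op.ExpiresAt) {
		return errors.New("time window")
	}
	if _, replay := v.Seen[op.Nonce]; replay {
		return errors.New("nonce")
	}
	v.Seen[op.Nonce] = op.ExpiresAt // only a fully valid op consumes a nonce
	return nil
}

// demo feeds one blob through four attempts and returns each outcome.
func demo() []string {
	blob := []byte(`{"target":{"host_id":"h1","guest_id":"101"},"nonce":"n1",` +
		`"issued_at":"2026-06-08T10:00:00Z","expires_at":"2026-06-08T10:05:00Z"}`)
	now := time.Date(2026, 6, 8, 10, 1, 0, 0, time.UTC)
	sigOK := false
	v := &Verifier{
		HostID: "h1", Guests: map[string]bool{"101": true},
		Allowed: map[string]bool{"op-key": true}, Seen: map[string]time.Time{},
		Crypto: func(string, []byte, []byte) bool { return sigOK },
	}
	var out []string
	try := func(ns string) {
		if err := v.Verify(ns, "op-key", blob, nil, now); err != nil {
			out = append(out, err.Error())
			return
		}
		out = append(out, "ok")
	}
	try(namespace) // bad signature: rejected, nonce untouched
	sigOK = true
	try(namespace)  // accepted, nonce recorded
	try(namespace)  // replay of the same blob
	try("other-ns") // wrong namespace fails before anything else
	return out
}

func main() {
	fmt.Println(demo())
}
```

The first two attempts show why the order matters: the forged attempt fails at the signature layer without touching the nonce store, so the genuine op that follows is still accepted.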

## 3. The keys — two-key model, software now

Both software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no box
change (§7).

- **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
  the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
  for destructive ops — rare, so its exposure is low.
- **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
  printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
  if that key is lost or compromised.

Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
instructions.

**Allowed-signers is a set** → single signer today; **quorum (N-of-M) for the highest-blast ops is
just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
now.

## 4. Rotation & compromise recovery

The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.

- **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
  the agent accepts it because it's signed by the trusted current key (key-signs-key).
- **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
  it because the recovery key is pinned and authorized for rotation. The compromised key is removed
  from the allowed set in the same signed op.
- **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
  way initial enrollment did).
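
The role split between the two pinned keys can be expressed as a small authorization check. The sketch assumes the signature itself has already passed the §2.3 pipeline; `Pinned`, `MayAuthorize`, and `Rotate` are hypothetical names, and replacing the old operational key on every rotation is an assumption for the planned-rotation case.

```go
package main

import "fmt"

// Role says what a pinned key is allowed to authorize.
type Role int

const (
	Operational Role = iota // destructive ops and planned rotation
	Recovery                // key-rotation instructions only
)

// Pinned maps a pinned public key to its role (the allowed-signers set).
type Pinned map[string]Role

// MayAuthorize reports whether signerKey is pinned and its role covers the op.
// isRotation marks a "re-pin the operational key" instruction.
func (p Pinned) MayAuthorize(signerKey string, isRotation bool) bool {
	role, ok := p[signerKey]
	if !ok {
		return false
	}
	return role == Operational || isRotation
}

// Rotate applies a re-pin whose signature was already verified: it drops
// oldKey and pins newKey as operational in one step.
func (p Pinned) Rotate(signerKey, oldKey, newKey string) bool {
	if !p.MayAuthorize(signerKey, true) {
		return false
	}
	delete(p, oldKey)
	p[newKey] = Operational
	return true
}

func main() {
	p := Pinned{"op-old": Operational, "cold": Recovery}
	fmt.Println(p.MayAuthorize("cold", false))       // recovery key on an ordinary op
	fmt.Println(p.Rotate("cold", "op-old", "op-new")) // compromise recovery
	fmt.Println(p.MayAuthorize("op-old", false))     // removed key
	fmt.Println(p.MayAuthorize("op-new", false))     // newly pinned key
}
```

A key the hub invents is never in `Pinned`, so a re-pin it queues fails at the first lookup.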

## 5. Component roles

- **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
  (`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
  later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
  key (passphrase-protected); can reach the cold recovery key when rotation is needed.
- **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
  status in the operator UI. Holds **no** private key and cannot sign — a compromised hub can only
  queue blobs the agent rejects. (Matches `03` §4 / box-initiated poll.)
- **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
  verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
  customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
  (a compromised hub could issue *and* suppress notice — the signature is the control).
- **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
  physical-presence provisioning step (the trust root is established on-site, not via the hub).
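
The `Signer` seam on the workstation might look like the sketch below. Only the interface shape, `Sign(blob) → signature`, comes from the text above. The `fileKeySigner` backend signs with raw Ed25519 purely to keep the example inside the standard library; that is a stand-in, since the real backend would emit an armored SSHSIG. `SignOp` and `demo` are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Signer is the seam between the signing CLI and whatever holds the key.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// fileKeySigner stands in for today's file-key backend. Stand-in only: it
// produces a raw Ed25519 signature, not the SSHSIG format the design uses.
type fileKeySigner struct{ priv ed25519.PrivateKey }

func (s fileKeySigner) Sign(blob []byte) ([]byte, error) {
	return ed25519.Sign(s.priv, blob), nil
}

// SignOp is all the CLI needs: it never learns which backend is behind s, so a
// hardware-backed Signer could replace the file key without touching callers.
func SignOp(s Signer, blob []byte) ([]byte, error) {
	return s.Sign(blob)
}

// demo signs a blob through the interface and checks the result verifies.
func demo() bool {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false
	}
	blob := []byte(`{"op":"example"}`)
	sig, err := SignOp(fileKeySigner{priv: priv}, blob)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, blob, sig)
}

func main() {
	fmt.Println(demo())
}
```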

## 6. Operator workflow

- **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
  overhead.
- **A destructive op** (rare): the operator runs the signing CLI on their workstation — which builds
  the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub
  queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the
  desk. **Never** a site visit.

## 7. Hardware readiness (Viktor's "build the foundation now")

Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later is a **no-op on the
boxes** — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a
spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the
`Signer` backend at it, and updates the allowed-signers entry; nothing on the boxes changes.

Two honest notes:
- **Confirm with a real device at adoption.** §5 was validated to spec, not against live hardware —
  a 5-minute real-key round-trip should confirm it (no surprise expected; signer/library/device all
  follow the same spec).
- **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
  crypto-only today (correct for software keys); enforcing the flag is a small later option once
  hardware is in use.

## 8. Open items
- **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
  allowed-signers-set foundation supports it.
- **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
  detail.
- **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.

## 9. What this unblocks
Closes the `03` §4 "undesigned signing path." Hands the implementation: the **canonical blob spec**
(§2.1) + the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's verify path, the
**`Signer` interface** for the operator CLI, and the **allowed-signers pinning** step for enrollment.
The hub's signed-job queue + pending-op UI carry into the hub architecture doc.
@@ -0,0 +1,223 @@
# Architecture Part 5 — The Hub

> Status: design draft (decision content). To be validated by Claude Code against the **actual
> felhom-hub source** (`felhom.eu` repo, `hub/`) + Parts 01–04, then placed at
> `docs/architecture/05-hub-architecture.md`.
>
> The hub is **not** greenfield — it's a mature service (felhom-hub v0.6.3, Go + SQLite on k3s,
> `hub.felhom.eu`). This doc is the **deltas** to evolve it for the Proxmox model, plus the new
> data model. Builds on Part 1 (trust/enrollment), Part 3 (the agent + reconcile), Part 4 (signing).

## 1. Source-of-truth model — two drivers, two directions

The single most important framing, and the one that governs everything below: the hub is **not** a
monolithic source of truth. State flows in two directions with opposite drivers.

- **Operator-driven *intent* — hub authors, agent reconciles (top-down).** Which guests should
  exist and their spec, storage *policy* (a target's role/class/backup schedule), controller +
  golden-image versions, identity, tunnel. The operator sets these in the hub; the agent converges
  toward them. Here the hub *is* the source of truth.
- **Box/customer-driven *reality* — box authors, pushes up, hub mirrors (bottom-up).** Which USB
  drive is *physically* attached (and its `durable_id`), what apps are deployed and where, the
  customer's controller configs/settings, host/guest health, latest PBS snapshot pointers. The
  customer or the physical world drives these; the box reports them; the hub stays an up-to-date
  **mirror** but is **never** the driver.

They meet at a **handshake**, not a tug-of-war. Storage is the clearest case: the customer plugs in
a drive → the agent *detects* it and reports `durable_id X attached` (reality) → the operator
assigns `role=bulk, class=slow, backup=weekly` (policy, intent) → the agent reconciles that policy
*onto the detected drive*. **Apps never enter the reconcile loop** — app deployment is the
controller's domain (customer- or operator-driven, inside the guest); the hub only mirrors the
resulting inventory. **Reconciliation applies to infrastructure; the app/customer layer is mirrored.**

## 2. Data model (Part 1 decision (b): customer-anchored)

A customer's deployment is one **Host** (its agent) plus one-or-more **Guests** (its controllers).
1 customer = 1 host + N guests; the shared-host multi-tenant case is deferred (not precluded — the
`hosts` table is the seam it would use).

- **`customer_configs`** (existing) — the Customer anchor: identity, domain, email,
  `retrieval_password`, status, config_json. Unchanged role.
- **`hosts`** (new) — `host_id PK, customer_id, api_key` (the agent's hub key), `agent_version`,
  desired-state intent (storage manifest + policies + golden-image version, as JSON), a per-host
  **`desired_generation`** counter, the slim DR record (§9), timestamps.
- **`guests`** (new) — `guest_id PK, customer_id, host_id, api_key` (the controller's hub key),
  `display_name, controller_version`, per-guest **`desired_spec_json`** (CPU/mem/disk, versions),
  timestamps.

**Per-reporter keys:** today's per-customer `customer_configs.api_key` becomes per-reporter —
`hosts.api_key` (agent) and `guests.api_key` (controller). The hub resolves a presented Bearer key →
host or guest → customer; `customer_configs.api_key` goes unused once auth resolves via the new keys.
**Clean cutover:** no dual-model support; the demo re-enrolls fresh into `host + guests`.
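
The Bearer-key resolution could be sketched like this. The `Reporter` and `Keys` types are hypothetical and use in-memory maps in place of lookups on `hosts.api_key` and `guests.api_key`; a production version would more likely look keys up by hash rather than store them in plain form.

```go
package main

import (
	"fmt"
	"strings"
)

// Reporter identifies who sent a report and which customer it belongs to.
type Reporter struct {
	Kind       string // "host" (agent) or "guest" (controller)
	ID         string
	CustomerID string
}

// Keys is an in-memory stand-in for the hosts.api_key and guests.api_key
// lookups (hypothetical; the hub would query its store).
type Keys struct {
	Hosts  map[string]Reporter
	Guests map[string]Reporter
}

// Resolve maps an Authorization header to host or guest, and so to a customer.
func (k Keys) Resolve(authHeader string) (Reporter, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return Reporter{}, false
	}
	key := strings.TrimPrefix(authHeader, prefix)
	if r, ok := k.Hosts[key]; ok {
		return r, true
	}
	r, ok := k.Guests[key]
	return r, ok
}

func main() {
	k := Keys{
		Hosts:  map[string]Reporter{"hk-1": {Kind: "host", ID: "h1", CustomerID: "c1"}},
		Guests: map[string]Reporter{"gk-1": {Kind: "guest", ID: "101", CustomerID: "c1"}},
	}
	fmt.Println(k.Resolve("Bearer gk-1"))
	fmt.Println(k.Resolve("Bearer unknown"))
}
```

Returning the reporter kind alongside the customer lets each endpoint reject the wrong caller, for example a guest key presented to the host-report endpoint.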

## 3. Report ingest — two domains

The single controller report splits. The de-privileged controller no longer sees host disks/storage/
backup, so its report **slims** (it loses System/Storage/Backup, keeps app-domain).

- **`POST /api/v1/host-report`** (new, agent) → **`host_reports`**: host CPU/RAM/disk, per-guest
  up/down + spec, storage-target status (attached drives + `durable_id` + reachability), last backup
  + restore-test per target, latest PBS snapshot pointers, `cloudflared` health, agent + controller
  versions. Denormalized columns for the dashboard; full `report_json`. Index `(host_id, received_at
  DESC)` + `(customer_id, received_at DESC)`.
- **`POST /api/v1/report`** (existing, slimmed controller) → the renamed **`guest_reports`**: it
  gains `guest_id` + `host_id`; its `cpu/memory` denorm now means *guest-level*; `backup_last_snapshot`
  goes quiet (backup status lives in `host_reports`). App telemetry / log issues stay.

These two streams are the bottom-up mirror of §1 — they keep the hub current without a separate push.

## 4. Liveness / dead-man's-switch

Evolves the existing staleness checker (60s **cadence**, 30m/1h **thresholds** — OK <30m, down at
2× = >1h; today: controller-report recency → `node_stale`/`down`/`recovered`):

- **Primary = host-report recency → `host_stale` / `host_down`.** The agent heartbeat is the box's
  liveness signal; a silent agent = the box is gone (the critical alert).
- **Guest up/down comes from the host report's per-guest status** — authoritative, every poll, faster
  than waiting for a guest report to go stale.
- **Guest-report recency = secondary** app-level signal.
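
The thresholds above reduce to a small classification on the age of the last host report. A minimal sketch, assuming the stale band is inclusive at 30 minutes and the down band starts strictly after one hour; `Classify` and `classifyMinutes` are illustrative names, and the returned strings reuse the event names from the list above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleAfter = 30 * time.Minute
	downAfter  = 2 * staleAfter // "down at 2x", i.e. more than one hour
)

// Classify maps the age of the latest host report to a liveness state.
func Classify(sinceLastReport time.Duration) string {
	switch {
	case sinceLastReport > downAfter:
		return "host_down"
	case sinceLastReport >= staleAfter:
		return "host_stale"
	default:
		return "ok"
	}
}

// classifyMinutes is a convenience wrapper for whole-minute ages.
func classifyMinutes(m int) string {
	return Classify(time.Duration(m) * time.Minute)
}

func main() {
	for _, m := range []int{10, 45, 61} {
		fmt.Println(m, classifyMinutes(m))
	}
}
```

A checker running on the 60-second cadence would call this once per host and emit an event only when the state changes.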

**Backup-deadline checker:** today it is *event-based* — it scans for `backup_completed`/`backup_failed`
events since local midnight and alerts if none. Two changes: (1) **mechanism** — move it to a field
check on `host_reports`' last-backup-per-target (cleaner now that backup state arrives in the host
report); (2) **emitter** — the de-privileged controller no longer runs backups, so the **agent** is the
source of the last-backup status (Part 3 §8). Without re-homing the source, the deadline check would go
silent after the controller stops backing up.

## 5. Desired-state serving

The operator's **intent** (§1 top-down) lives as JSON on `hosts`/`guests` (storage manifest +
policies + golden version on the host; per-guest spec + versions on the guest) with a per-host
`desired_generation`. The agent pulls its host's desired state on poll (with the generation, so it
reconciles only on change and reports which generation it has converged to).

- **Benign convergence** (create a guest, attach storage per policy, bump a version, adjust a
  non-destructive policy) → the agent reconciles freely.
- **Destructive convergence** (guest removal = destroy, storage detach/wipe, data-losing resize) →
  the agent requires a **matching signed op** (§6) before executing that delta; absent/invalid → it
  refuses and reports `pending_signature`.
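
A sketch of how the agent-side gate could treat one poll. It assumes the deltas have already been computed and classified by the agent, that `signed` holds the delta ids whose signed op passed Part 4 verification, and that the converged generation does not advance while any delta is blocked. That last rule and all names (`Delta`, `Reconcile`, the `applied` status) are assumptions; only `pending_signature` comes from the text.

```go
package main

import "fmt"

// Delta is one difference between desired and actual state.
type Delta struct {
	ID          string
	Destructive bool // as classified by the agent itself
}

// Reconcile applies benign deltas freely and destructive ones only when a
// matching signed op exists. It returns the generation the agent can report as
// converged plus a per-delta status.
func Reconcile(converged, desired int, deltas []Delta, signed map[string]bool) (int, map[string]string) {
	status := map[string]string{}
	if desired == converged {
		return converged, status // nothing changed since the last poll
	}
	blocked := false
	for _, d := range deltas {
		if d.Destructive && !signed[d.ID] {
			status[d.ID] = "pending_signature" // refuse and report
			blocked = true
			continue
		}
		status[d.ID] = "applied" // the real agent would perform the op here
	}
	if blocked {
		return converged, status // assumption: stay on the old generation
	}
	return desired, status
}

func main() {
	deltas := []Delta{
		{ID: "attach-usb", Destructive: false},
		{ID: "destroy-102", Destructive: true},
	}
	fmt.Println(Reconcile(3, 4, deltas, map[string]bool{}))
	fmt.Println(Reconcile(3, 4, deltas, map[string]bool{"destroy-102": true}))
}
```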

**Geo is *not* in the agent's desired state** — it's customer→hub→Cloudflare (§7); the agent never
touches WAF.

## 6. Authorization — signed-op queue + editing flow

Implements Part 4's gate on the hub side. The hub holds **no signing key**.

- **`signed_ops`** (new): `op_id, customer_id, host_id, target_guest, op_type, op_blob (canonical
  JSON), signature (armored SSHSIG), status (pending_signature → signed → delivered → executed /
  failed / expired / rejected), nonce, issued_at, expires_at, executed_at, result`.
- **Editing flow:** the operator edits a customer's desired state, reusing the existing config-form +
  diff UX. Note the **transport inverts**: today's "Push" is a hub→box *inbound* POST (forbidden by the
  box-initiated model); here "publish" means **write to desired state, delivered on the next agent/
  controller poll**. The form and diff carry over; the push transport does not. The hub diffs vs current
  and **classifies each delta** (B1 rule):
  - **benign** → published straight to desired state;
  - **destructive** → the hub generates the canonical op blob and routes it through signing.
- **Signing hand-off (Part 4 option (b)):** a local operator CLI (`felhom-sign --pending`) fetches
  the pending blob from the hub, signs it on the workstation with the dedicated key, and posts the
  signature back into `signed_ops`. The hub never sees the key.
- The agent polls `signed_ops` for its host alongside desired state, verifies (Part 4 pipeline),
  executes, and reports status → the hub logs to the existing **`events`** audit trail.
- **Classification lives in both places, with different jobs:** the hub classifies at *edit time*
  for UX (prompt to sign); the **agent's classification is the authoritative guard** (a compromised
  hub could skip the prompt, but the agent still enforces the signature).
- A **pending-ops view** per customer shows the lifecycle (awaiting signature → awaiting agent →
  executed).

## 7. Geo enforcement (Part-2 S4)

The hub already holds the CF API token and already has a remove-all path
(`internal/web/configs.go` `handleGeoDisable` → `cloudflare.RemoveGeoRules`). **But the token is
dual-purpose today** — DNS-01/ACME *and* WAF/geo — and `configgen.Generate` deep-merges it (via
`config_json`) into the generated `controller.yaml`, so it currently ships **down to the box**. Two
things follow:

- **ACME assumption (must be stated, not skipped):** in the Cloudflare-Tunnel-default model the edge
  terminates TLS, so the box needs no public certificate and the **DNS-01/ACME use of the token goes
  away**. Granting that, the token comes fully off the box and lives hub-only. (If any box still does
  DNS-01, the token cannot fully come off — so this assumption is load-bearing.)
- **`configgen` must stop emitting `cf_api_token`** into `controller.yaml` (drop it from the merge /
  relocate it to a hub-only field).

The delta: the **customer sets geo in the controller UI → the controller reports the geo desired-state
up → the hub reconciles it into the Cloudflare WAF** (rather than the box calling the CF API). The hub
keeps the remove-all override for self-lockout. The controller no longer calls the CF API.

## 8. Enrollment (evolution of the existing retrieval-password/config-gen flow)

Today: `GET /config/{id}` with an `X-Retrieval-Password` (Hungarian passphrase) returns a deep-merged
`controller.yaml`. New:

- Enrollment mints the **agent identity first** (the agent then provisions controllers), pins the
  **operator signing public keys** (Part 4 — operational + cold recovery) onto the agent, and the
  agent mints each controller's bootstrap (its hub guest key + local-API token).
- A **restore-mode** re-enrollment (§9) hands an existing identity to a fresh agent.

The existing `configgen` deep-merge + Hungarian-passphrase machinery is the base; it grows the
agent-first + key-pinning + restore-mode steps.

## 9. DR model

The headline: the **old heavy infra-backup push retires** — not because the hub authors everything
(§1 says it doesn't), but because (a) the box-driven mirror already arrives via the §3 report streams,
and (b) the actual app **data + configs live inside the PBS guest snapshot**. So a separate
config+secrets+restic-password infra-backup blob is redundant.

What remains:
- the **report streams** keep the hub's mirror current (storage layout + `durable_id`s, app inventory,
  snapshot pointers) — but this mirror is **convenience, not the DR source of record** (reports are
  pruned by age);
- the agent **escrows the recovery-code-wrapped PBS key** to the hub (the one artifact only the box
  can produce — zero-knowledge: the hub stores it, cannot open it);
- a **slim DR record** on the `hosts` row (PBS namespace + repo fingerprint + the wrapped escrow key).
  These last two are *box-reported* columns on an otherwise operator-intent row — labelled as such so
  the §1 two-driver split stays legible per column.

Both existing infra-backup tables retire — `infra_backup_versions` (the current/live one, all readers
hit it) **and** `infra_backups` (the deprecated legacy mirror). The slim DR record folds onto `hosts`
instead. The **controller's infra-backup push is removed** (it's de-privileged).

**Recovery (host loss):** the new agent re-enrolls in **restore mode**; the hub hands it the durable
record — and DR reads from the **durable sources, not the prunable report mirror**: operator intent
(desired-state on `hosts`/`guests` — identity, tunnel token, storage manifest), the slim DR record
(PBS namespace + repo fingerprint), the **wrapped escrow key**, and **PBS's own snapshot enumeration**
(the agent lists snapshots once it has the namespace + unwrapped key). Guest inventory + app data come
from **inside the PBS guest snapshots**, not from a retained `host_report`, so recovery doesn't degrade
when the last report has aged out. The **customer provides their recovery code at the agent**, which
unwraps the PBS key locally (never sent to the hub); the agent restores guests from PBS, resets
identity, reuses the tunnel. The customer recovery code is the irreducible residual (the premium
operator-managed custody tier avoids it, at the cost of the operator holding the key). The old
controller-targeted `GET /recovery/{id}` is replaced by this agent restore-mode flow.

## 10. What persists from today (unchanged or lightly adapted)

The Customer record (`customer_configs`); config generation/retrieval (`configgen`); the two-tier
notification system (operator English / customer Hungarian, Resend, cooldowns); `events` + audit;
`app_telemetry` / `app_log_issues`; customer lifecycle actions (block/unblock, trigger-update,
delete); the asset manager; and the dashboard — adapted to render the **host + guests** view per
customer instead of a single controller.

## 11. Schema deltas (grounded in store.go's idempotent style; clean cutover)

- **NEW:** `hosts`, `guests`, `host_reports`, `signed_ops`.
- **DROP `reports` + CREATE `guest_reports`** (under the clean cutover this is drop+create with no data
  migration, not an in-place rename); `guest_reports` adds `guest_id`, `host_id`; `cpu/memory` mean
  guest-level; `backup_last_snapshot` goes quiet.
- **ADD** desired-state JSON + `desired_generation` to `hosts`; `desired_spec_json` to `guests`; the
  slim DR record (PBS namespace + repo fingerprint + wrapped escrow key) onto `hosts`.
- **DROP both** `infra_backup_versions` (current/live) **and** `infra_backups` (legacy mirror) — the DR
  record replaces them on `hosts`.
- **KEEP** `customer_configs`, `events`, `customer_notifications`, `notification_log`,
  `app_telemetry`, `app_log_issues`.
- **Authz cleanup the cutover enables:** several endpoints today use global-or-any-customer-key auth
  rather than customer-scoped (the infra-backup GETs, `/notify`). Most retire with the infra-backup
  push; any that carry over should scope to the resolved host/guest → customer under §2.

## 12. Open items
- Operator signing-key operational mechanics (Part 4 §8) — the hub-side pending-op UI is here; the
  key custody/rotation tooling is Part 4's.
- Multi-tenant resource fairness (deferred shared-host case).
- Hub-side desired-state **editing UX** specifics (form/diff wiring) — to be grounded against
  `hub/internal/web/configs.go` at implementation.
- Golden-image refresh cadence / fleet versioning (carried from Part 3 §13).
@@ -0,0 +1,260 @@
# Critical design review — Proxmox re-platform doc set

> ✅ **RESOLVED (2026-06-08).** All findings folded into 01/02/03 + `proxmox-platform.md`
> (Phase-3 spike run for B2/B3 → `tests/phase3-findings.md`). **Folded:** B1 (03 §4), B2
> (03 §7/§8 + platform §4.7), B3 (03 §2/§3 + platform §3.6), S1 (03 §6/§8), S2 (03 §10/§11),
> S3 (03 §7), S4 (01 §5/§7 + 02 + 03 §2), S5 (01 §7/§11 + 02 §6), S6 (02 §5), M1 (02 §3),
> M2 (03 §7), M3 (03 §10), §6-residual (03 §6). Plus the two Phase-3 design updates:
> provision-by-restore (03 §9) and the settled root-vs-API boundary (03 §3). **Deferred/none:**
> no finding was deferred; the pre-existing open items (operator signing-key mechanics,
> multi-tenant fairness, hub-side desired-state UX, golden-image refresh cadence) remain
> flagged in 03 §13. This artifact can be deleted once confirmed.

Working artifact. Review pass over `01-topology-and-trust.md`, `02-controller-module-map.md`,
`03-host-agent.md`, `proxmox-platform.md`, and the Phase 0 / Phase 1-2 findings, grounded
against the v0.33 source (`deploy-felhom-compose/controller/`). Every finding cites a
file+line or a doc section. Severity: **blocking** / **should-fix** / **minor**.

Two findings are self-corrections of my own earlier work (`02` and `proxmox-platform.md`) —
flagged as such.

---

## Ranked summary

| # | Severity | Finding | Where |
|---|---|---|---|
| B1 | **blocking** | Reversibility gate contradicts the self-heal reconcile loop — crashed-guest healing can require a signature-gated destroy → reconcile stalls | `03` §4 vs §4(a) |
| B2 | **blocking** | vzdump bulk-exclusion only works for **volume** mount points; Docker **named volumes live in the LXC rootfs and ARE captured** → naive placement silently backs up the 1 TB media drive. Unvalidated by spike. | `03` §7 vs `proxmox-platform.md` §4.3 + pct manpage |
| B3 | **blocking** | Agent's Proxmox role is called "the minimal role from Phase 1" — but that role is the *narrow self-backup* role that Phase 1 proved is **denied** create/allocate/restore. The agent's operator-tier role is undefined. | `03` §2/§3 vs `phase1-2` §1.3-1.4, `01` appendix |
| S1 | should-fix | Quiescing for agent/hub-scheduled backups has **no agent→controller channel** — the local API is controller→agent only | `03` §6, §8 |
| S2 | should-fix | Agent self-update revert authority unspecified — if the new binary won't boot, nothing outside it can flip back | `03` §11 |
| S3 | should-fix | Storage manifest drops fields `settings.StoragePath` carries today (Label, Schedulable/default, StoppedStacks, MigratedTo) with no re-homing stated | `03` §7 vs `settings.go:90-103` |
| S4 | should-fix | Geo-restriction WAF ownership + Cloudflare **API token** placement unspecified after tunnel placement was locked; zone-wide token in a guest is a blast-radius concern | `03` (absent), `01` §3, `config.go` InfrastructureConfig |
| S5 | should-fix | Cross-doc staleness: `01` §11 still lists tunnel placement OPEN; `02` §6 lists geo "blocked on tunnel placement" — both resolved by `03` §13 | `01` §11, `02` §6 vs `03` §13 |
|
||||
| S6 | should-fix (self-correct) | `02` put self-restore-test **orchestration** in the controller; `03` correctly makes it agent-owned (controller only reads status) | `02` §5(3) vs `03` §6/§8 |
|
||||
| M1 | minor (self-correct) | `02` §3 lists `UUID` as a `settings.StoragePath` field — it isn't; UUID is derived from fstab at runtime | `02` §3 vs `settings.go:91-103` |
|
||||
| M2 | minor | `03` §7 says the manifest "absorbs the disk-state fields StoragePath carries today" incl. UUID — UUID isn't persisted today, so the manifest *adds* it (an improvement, not absorption) | `03` §7 |
|
||||
| M3 | minor | controller-update is not in `03` §10's journaled-ops list, though it's a multi-step async op | `03` §10 vs §11 |
|
||||
|
||||
**Values check: clean.** No DR/key-custody/offboarding path leaves a customer locked out.
|
||||
Zero-knowledge DR (`03` §8, `01` §8) correctly makes the customer recovery code the
|
||||
irreducible residual; the operator cannot read data and the box can still restore-test.
|
||||
No hostage path found.
|
||||
|
||||
**Locked premises:** reviewed for soundness/consistency only; not relitigated.
|
||||
|
||||
---
|
||||
|
||||
## Blocking findings
|
||||
|
||||
### B1 — The reversibility gate stalls the self-healing reconcile loop
|
||||
**Where:** `03` §4(a) vs the gate in §4.
|
||||
**What:** §4(a) lists "redeploy of a crashed controller" as benign convergence that "falls
|
||||
out of reconciliation for free." The gate then lists **guest destroy** among the
|
||||
irreversible ops that require an operator signature "*regardless of whether they arrive as a
|
||||
job or as a desired-state delta*." These collide: if healing a wedged guest requires
|
||||
destroy+recreate (corrupt rootfs, failed in-place restart, half-built guest from an
|
||||
interrupted provision), the reconciler hits a signature-gated op and **cannot proceed
|
||||
without an operator** — the loop either stalls or silently gives up, defeating "self-healing
|
||||
… tolerant of missed polls."
|
||||
**Why it matters:** This is the security-critical control model. A fuzzy benign/destructive
|
||||
line is unimplementable: either the reconciler can destroy (and a compromised hub's desired
|
||||
state can wipe guests — the exact threat §4 exists to stop), or it can't (and self-heal is a
|
||||
fiction for the crashed-guest case).
|
||||
**Grounding:** `03` §4 self-describes the gate as "security-critical"; §9/§10 already rely on
|
||||
the reconciler rolling back "a half-built guest" — which *is* a destroy of a customer-id-bound
|
||||
resource, contradicting the blanket "guest destroy needs a signature."
|
||||
**Suggested fix (crisp, implementable rule):** Scope the reconciler's destructive verbs by
|
||||
*provenance and data-bearing-ness*, not by verb:
|
||||
- The reconciler MAY, without a signature: (a) create/start/restart; (b) destroy resources it
|
||||
**created earlier in the same journaled transaction** (compensating rollback, §10); (c)
|
||||
destroy resources **tagged ephemeral/scratch** (restore-test scratch guests, §8).
|
||||
- Destroying or overwriting any resource that **holds the only/primary copy of customer data**
|
||||
always needs an operator signature.
|
||||
- **Healing a crashed controller is non-destructive by construction:** the controller is
|
||||
reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the
|
||||
LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. State
|
||||
this explicitly so the two clauses stop colliding. (The v0.33 self-heal precedent is already
|
||||
in-place restart: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
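
The provenance rule above can be sketched as a single default-deny predicate. This is a minimal illustration only: the `Resource` type, its fields, and `NeedsSignature` are hypothetical names, not taken from the agent source.

```go
package main

// Resource describes a guest or volume the reconciler may act on.
// All names here are illustrative assumptions, not agent code.
type Resource struct {
	ID               string
	CreatedInTx      string // journal transaction that created it; "" if pre-existing
	Ephemeral        bool   // tagged scratch, e.g. a restore-test guest
	HoldsPrimaryData bool   // holds the only/primary copy of customer data
}

// NeedsSignature reports whether destroying r while running journal
// transaction txID requires an operator signature.
func NeedsSignature(r Resource, txID string) bool {
	if r.HoldsPrimaryData {
		return true // data-bearing resources are always gated
	}
	if r.Ephemeral {
		return false // scratch resources may be cleaned up freely
	}
	if r.CreatedInTx != "" && r.CreatedInTx == txID {
		return false // compensating rollback of our own half-built work
	}
	return true // default deny: anything else needs an operator
}
```

The data-bearing check is evaluated first on purpose, so a resource that is mis-tagged as ephemeral but holds primary data still stays behind the signature gate.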
|
||||
|
||||
### B2 — vzdump bulk-exclusion: the rootfs-Docker-volume trap
|
||||
**Where:** `03` §7 ("Bulk external mounts are excluded from the guest's vzdump (a per-mount
|
||||
backup flag)").
|
||||
**What:** Two grounded problems:
|
||||
1. The flag is real but narrow. The pct manpage (verified): `backup=<boolean>` —
|
||||
*"Whether to include the mount point in backups (**only used for volume mount points**)."*
|
||||
It does **not** apply to bind mounts / device mounts (those are handled separately).
|
||||
2. The trap: `proxmox-platform.md` §4.3 (validated in `phase1-2` §2.2) proved that **Docker
|
||||
named volumes live inside the LXC rootfs and ARE captured by vzdump** — a sentinel in
|
||||
`pgdata` survived. The default Felhom app uses Docker named volumes. So unless bulk data is
|
||||
deliberately placed on a **dedicated Proxmox volume mount point** (backup=0) or a bind
|
||||
mount, a "bulk" volume will be an ordinary named volume in rootfs and will be **silently
|
||||
swept into the whole-guest image** — exactly the 1 TB-media-in-every-backup outcome §7 says
|
||||
it prevents.
|
||||
**Why it matters:** Backup size/cost and RPO blow up silently; the failure is invisible until
|
||||
a media drive fills the vzdump target. This is load-bearing for the §8 tier model.
|
||||
**Grounding:** pct manpage (fetched 2026); `proxmox-platform.md` §4.3; `phase1-2` §2.2.
|
||||
Not covered by any spike — `proxmox-platform.md` §6 "not yet validated" should gain this row.
|
||||
**Suggested fix:** Make the placement contract explicit: a `bulk` volume **must** be realized
|
||||
as a dedicated LXC mount point (volume mountpoint with `backup=0`, or an external bind mount),
|
||||
**never** a Docker named volume in rootfs. The per-volume placement component (`02` §5(2))
|
||||
must enforce this at deploy. Add a Phase-3 spike: create an LXC with a `backup=0` volume
|
||||
mountpoint + a bind mount, vzdump it, confirm both are excluded and the rootfs+`backup=1`
|
||||
volume are included.
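
A deploy-time check for this placement contract might look like the sketch below. It assumes the value of an `mpN` entry as it appears in a pct config (for example `local-lvm:vm-101-disk-1,mp=/mnt/media,backup=0`), and it assumes that a volume mount point without an explicit `backup=1` is left out of vzdump, which is exactly what the proposed spike should confirm. `ClassifyMount` and `BulkPlacementOK` are hypothetical helper names.

```go
package main

import "strings"

// MountKind distinguishes the three LXC mount point source types.
type MountKind int

const (
	VolumeMount MountKind = iota // storage-backed volume, e.g. "local-lvm:vm-101-disk-1"
	BindMount                    // host directory, e.g. "/mnt/host/media"
	DeviceMount                  // block device, e.g. "/dev/sdb1"
)

// ClassifyMount inspects the value of one mpN entry and reports its kind
// and whether vzdump is expected to include it.
// Assumption: an absent backup flag on a volume mount point means "excluded".
func ClassifyMount(value string) (kind MountKind, backedUp bool) {
	parts := strings.Split(value, ",")
	src := strings.TrimPrefix(parts[0], "volume=")
	switch {
	case strings.HasPrefix(src, "/dev/"):
		kind = DeviceMount
	case strings.HasPrefix(src, "/"):
		kind = BindMount
	default:
		kind = VolumeMount
	}
	if kind != VolumeMount {
		return kind, false // the backup flag is only used for volume mount points
	}
	for _, p := range parts[1:] {
		if p == "backup=1" {
			backedUp = true
		}
	}
	return kind, backedUp
}

// BulkPlacementOK reports whether a mount point is an acceptable home for
// a bulk volume, i.e. it will not be swept into the whole-guest image.
func BulkPlacementOK(value string) bool {
	_, backedUp := ClassifyMount(value)
	return !backedUp
}
```

Note that this only validates the LXC mount point itself; the placement component would still have to refuse Docker named volumes for bulk data, since those live in the rootfs regardless of any flag.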
|
||||
|
||||
### B3 — The agent's Proxmox role is mis-grounded as "the Phase-1 minimal role"
|
||||
**Where:** `03` §2 ("scoped Proxmox API token (minimal role from Phase 1)"), §3 ("the
|
||||
Phase-1 minimal role is the API floor").
|
||||
**What:** Phase 1's minimal role (`FelhomSelfBackup` = `VM.Audit, VM.Snapshot, VM.Backup,
|
||||
Datastore.AllocateSpace, Datastore.Audit`) is the **narrow self-backup** role scoped to one
|
||||
guest, and Phase 1 explicitly proved it is **denied (403)** on create/allocate
|
||||
(`phase1-2` §1.3 call #7) — i.e. exactly the operator-tier ops the agent's whole job consists
|
||||
of (provision, restore, storage allocation). Worse, `01` appendix states that guest-side role
|
||||
"**is not used** — we chose the agent-mediated path." So `03` cites, as the agent's role
|
||||
floor, a role that (a) the architecture discarded and (b) is provably insufficient for the
|
||||
agent.
|
||||
**Why it matters:** The agent's actual operator-tier role is **undefined**. Provisioning,
|
||||
restore, and storage management cannot be built or hardened against an undefined privilege
|
||||
set, and §3's root-minimization argument ("the Phase-1 minimal role is the API floor")
|
||||
collapses because that floor can't create a guest.
|
||||
**Grounding:** `phase1-2` §1.3 (create CT = 403), §1.4 (role = self-backup only); `01`
|
||||
appendix ("not used … confirmed restore = operator-tier"); `proxmox-platform.md` §3.4.
|
||||
**Suggested fix:** Replace the Phase-1 reference with a **new agent operator role** to be
|
||||
defined and least-privilege-tested in a Phase-3 spike — minimally `VM.Allocate`, `VM.Config.*`,
|
||||
`VM.PowerMgmt`, `VM.Snapshot(.Rollback)`, `VM.Backup`, `VM.Audit`, `Datastore.Allocate(Space)`,
|
||||
`Datastore.Audit`, plus whatever storage-attach needs (see S4/root-boundary below). Keep §3's
|
||||
"API token, not root, where the API suffices" principle — that part is sound — but stop
|
||||
calling it the Phase-1 role.
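
To make the gap concrete, the following sketch diffs a candidate operator privilege set against the Phase-1 `FelhomSelfBackup` set quoted above. The candidate list is only the starting point named in the suggested fix, with `VM.Config.*` left out because the wildcard stands for several concrete privileges; `CandidateAgentPrivs` and `MissingPrivs` are hypothetical names.

```go
package main

import "sort"

// CandidateAgentPrivs is the starting list from the suggested fix.
// VM.Config.* is omitted here because it expands to several privileges;
// the Phase-3 spike is expected to settle the real set.
var CandidateAgentPrivs = []string{
	"VM.Allocate", "VM.PowerMgmt", "VM.Snapshot", "VM.Snapshot.Rollback",
	"VM.Backup", "VM.Audit",
	"Datastore.Allocate", "Datastore.AllocateSpace", "Datastore.Audit",
}

// PhaseOneSelfBackupPrivs is the narrow self-backup role from Phase 1.
var PhaseOneSelfBackupPrivs = []string{
	"VM.Audit", "VM.Snapshot", "VM.Backup",
	"Datastore.AllocateSpace", "Datastore.Audit",
}

// MissingPrivs returns, sorted, every required privilege not in granted.
func MissingPrivs(required, granted []string) []string {
	have := make(map[string]bool, len(granted))
	for _, g := range granted {
		have[g] = true
	}
	var missing []string
	for _, r := range required {
		if !have[r] {
			missing = append(missing, r)
		}
	}
	sort.Strings(missing)
	return missing
}
```

Calling `MissingPrivs(CandidateAgentPrivs, PhaseOneSelfBackupPrivs)` lists the create, power, rollback, and storage-allocate privileges the self-backup role lacks, which is the same shortfall the Phase-1 403 demonstrated.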
|
||||
|
||||
---
|
||||
|
||||
## Should-fix findings
|
||||
|
||||
### S1 — No agent→controller channel for backup quiescing
|
||||
**Where:** `03` §6 (local API is controller→agent only) vs §8 ("the controller stops the app
|
||||
stack … before a guest vzdump where app-consistency matters").
|
||||
**What:** App-consistent LXC backup requires the controller to quiesce (no fsfreeze for LXC —
|
||||
`proxmox-platform.md` §4.2, `phase1-2` §2.1). But the §6 surface is entirely controller→agent;
|
||||
the box-initiated model forbids the hub calling in, and there is no agent→controller call
|
||||
defined. For a **hub/agent-scheduled** backup (schedule lives in the manifest `policy`, §7),
|
||||
the agent has no way to tell the controller "quiesce now."
|
||||
**Why it matters:** Either scheduled backups silently fall back to crash-consistent (relying
|
||||
on WAL recovery, which `phase1-2` §3 warns is unvalidated under write load), or the feature
|
||||
can't be built as drawn.
|
||||
**Suggested fix:** Make backups **controller-driven for app-consistency**: the controller
|
||||
learns due/policy via its own hub channel (or a `GET /backup/due` on the local API), quiesces,
|
||||
calls the existing `POST /backup`, then unquiesces on completion. Document that agent-initiated
|
||||
vzdump is crash-consistent only. (No inbound-to-guest channel needed — preserves §3/§5.)
|
||||
|
||||
### S2 — Agent self-update revert authority unspecified
|
||||
**Where:** `03` §11 ("a watchdog reverts to last-good if the new binary fails to come up
|
||||
healthy").
|
||||
**What:** The agent is a single host systemd service with `Restart=always` (§3). If the new
|
||||
binary crashes on startup, systemd just restarts the **same bad binary** in a loop. "Revert
|
||||
to last-good" cannot be done *by* the thing that won't boot. §11 doesn't name the actor.
|
||||
**Why it matters:** A bad self-update can brick the crown-jewel host agent — the one component
|
||||
that recovers everything else — with no automatic recovery, requiring break-glass.
|
||||
**Suggested fix:** Put revert authority **outside** the swapped binary: e.g. an A/B symlink
|
||||
(`current → good|new`) guarded by a separate systemd oneshot health-gate (`ExecStartPost` probe; on
|
||||
failure flip the symlink back and restart), or a tiny supervisor unit. Boot-into-last-good +
|
||||
explicit "commit" after a clean health window is the robust pattern. Add agent-update to the
|
||||
§10 journal so an interrupted swap is resumable.
|
||||
|
||||
### S3 — Manifest schema omits live `StoragePath` fields without re-homing them
|
||||
**Where:** `03` §7 table vs `settings.go:90-103`.
|
||||
**What:** Today's `StoragePath` carries `Label`, `IsDefault`, `Schedulable`, `StoppedStacks`,
|
||||
`Decommissioned`/`DecommissionedAt`/`MigratedTo`. The manifest covers state (attached/
|
||||
disconnected/decommissioned) and durable_id, but drops: **Label** (human name, e.g. "Külső
|
||||
HDD 1TB" — UI), **Schedulable/IsDefault** (default placement target for new apps),
|
||||
**StoppedStacks** (which apps to restart on reconnect — app-domain), **MigratedTo** (decommission
|
||||
target pointer).
|
||||
**Why it matters:** `02` named this manifest as the contract that the `settings.StoragePath`
|
||||
reshape depends on. Silently dropped fields become lost behavior (no default-drive choice, no
|
||||
restart-after-reconnect list, no friendly labels).
|
||||
**Suggested fix:** Either add Label + a placement-default marker to the manifest, or explicitly
|
||||
state which fields re-home to the controller's `settings` (StoppedStacks and Label are
|
||||
plausibly controller-side; default/schedulable placement must live wherever placement decisions
|
||||
are made). Make the split explicit so neither side assumes the other owns it.
|
||||
|
||||
### S4 — Geo-WAF ownership + Cloudflare API token placement unspecified
|
||||
**Where:** `03` covers `cloudflared` (tunnel) health but is silent on geo-restriction WAF; `02`
|
||||
§6 had `cloudflare/`+`geo` "blocked on tunnel placement"; `01` §3 lists the controller's creds
|
||||
as "hub API key + local-API token" only.
|
||||
**What:** Now that tunnel placement is locked (host), the **geo-restriction WAF** management
|
||||
(`cloudflare/` package: zone/waf/geosync) still has no home. It requires a Cloudflare **API
|
||||
token** (`config.go` InfrastructureConfig.cf_api_token) with zone-wide WAF edit rights. If geo
|
||||
stays in the controller (app-domain, per `02`), a **zone-wide Cloudflare token sits inside the
|
||||
customer guest** — a real blast-radius concern (compromise → edit/disable WAF for the whole
|
||||
zone, potentially other customers on the same zone).
|
||||
**Why it matters:** Trust-boundary gap. `01` §5's boundary table has no row for controller↔
|
||||
Cloudflare-API. Unspecified ownership blocks the `02` geo classification from being unblocked.
|
||||
**Suggested fix:** Decide geo-WAF ownership explicitly and add it to `01` §5. Options: (a) move
|
||||
WAF management to the **agent/hub** (operator-tier, token off the customer box); (b) keep it in
|
||||
the controller but scope the CF token per-zone/per-customer if the account model allows. Note
|
||||
this is now *unblocked* by the tunnel decision and should leave `02` §6's "blocked" state.
|
||||
|
||||
### S5 — Cross-doc staleness on the now-locked tunnel placement
|
||||
**Where:** `01` §11 ("Cloudflare Tunnel placement: host vs guest (§7)") and `02` §6
|
||||
("`cloudflare/` + `api/geo.go` — blocked on tunnel placement") vs `03` §13 ("Resolved here:
|
||||
tunnel placement (host, agent-managed)") and the LOCKED list.
|
||||
**What:** `01` and `02` still present as OPEN/blocked a decision `03` and the locked set have
|
||||
resolved.
|
||||
**Why it matters:** A dev reading `01`/`02` would treat a settled decision as open (or a
|
||||
classification as blocked when only geo-ownership, S4, actually remains).
|
||||
**Suggested fix:** When folding this review in: update `01` §7/§11 to record tunnel=host
|
||||
(agent-managed systemd service); update `02` §6 to reduce the cloudflare item from "blocked on
|
||||
tunnel placement" to the narrower "blocked on geo-WAF ownership (S4)."
|
||||
|
||||
### S6 — (self-correction) self-restore-test orchestration belongs to the agent, not the controller
|
||||
**Where:** `02` §5(3) said "Self-restore-test orchestration — *controller* asks the agent to
|
||||
restore to scratch guest, validates, reports." `03` §8 makes the **agent** drive it
|
||||
autonomously; §6 gives the controller only `GET /restore-test/status` (read-only).
|
||||
**What:** `03` is right and `02` overreached. Zero-knowledge means only the box/agent holds the
|
||||
PBS key (`03` §8); creating a scratch guest is operator-tier (create/allocate — `phase1-2`
|
||||
§1.3 #7); the controller cannot do either. The controller's only piece is surfacing status.
|
||||
**Why it matters:** Keeps the NEW-component list honest — this is not a controller component to
|
||||
build beyond a status read.
|
||||
**Suggested fix:** Amend `02` §5(3) to "self-restore-test **status display** (read-only); the
|
||||
agent owns orchestration."
|
||||
|
||||
---
|
||||
|
||||
## Minor findings
|
||||
|
||||
- **M1 (self-correction):** `02` §3 lists `UUID` among `settings.StoragePath` fields. It is
|
||||
**not** there (`settings.go:91-103`: Path, Label, IsDefault, Schedulable, AddedAt,
|
||||
Disconnected/At, StoppedStacks, Decommissioned/At, MigratedTo). UUID is derived at runtime
|
||||
from fstab / `/host-dev/disk/by-uuid` by `system.ParseFstabUUID` and `watchdog.go`. The
|
||||
classification (settings = MODIFY/split) is unaffected; the field list was wrong.
|
||||
- **M2:** Consequently `03` §7's "absorbs the disk-state fields `settings.StoragePath` carries
|
||||
today" overstates: `durable_id`/UUID is *not* carried today, so the manifest **adds** durable
|
||||
identity (a genuine improvement — today the controller re-derives UUID from fstab each boot,
|
||||
which is fragile). Reword "absorbs" → "absorbs + adds durable_id."
|
||||
- **M3:** `03` §10 journals "provision, restore" but not **controller-update** (§11), which is
|
||||
also a multi-step async op (snapshot→pull→redeploy→health→rollback). Add it so an agent crash
|
||||
mid-controller-update is resume-or-rollback like the others.
|
||||
|
||||
---
|
||||
|
||||
## Verified-correct (no action) — grounding that held up
|
||||
|
||||
- LXC flags `nesting=1,keyctl=1` + overlayfs (`03` §9) match `proxmox-platform.md` §2.3 /
|
||||
`phase0` §3. ✓
|
||||
- async `task exitstatus`, not POST return (`03` §8) matches `proxmox-platform.md` §3.5. ✓
|
||||
- stop-mode backup not requiring `VM.PowerMgmt` (`03` §8 "per Phase 1") matches
|
||||
`proxmox-platform.md` §3.4. ✓ (applies to the agent role too.)
|
||||
- running-LXC snapshot on LVM-thin (`03` §6/§8/§11) matches `proxmox-platform.md` §4.5 /
|
||||
`phase1-2` §1.6. ✓
|
||||
- `monitor/pinger.go` deprecation (`02` DELETE-obsolete) confirmed in `main.go:168,175`
|
||||
("legacy, will be removed" / "no longer used — monitoring is now handled by the Hub"). ✓
|
||||
- backup keep/delete **intra-file tear** (`02` hazard) confirmed: `backup.go` holds both
|
||||
`RunDBDumps`/`DumpAppVolumes(Safe)` (keep) and `RunBackup`/`RunFullBackup` (restic, delete);
|
||||
`restore.go` holds `RestoreApp` (restic) + `RestoreAppFromTier2` (app). The §7-8 backup
|
||||
contract gives the extracted app-data-backup package a coherent destination. ✓
|
||||
- Control-plane-not-data-plane (`03` §2/§43): apps keep serving if the agent dies — consistent
|
||||
with Docker-in-LXC running independently (`phase0` §3). ✓
|
||||
- §6 per-guest local-API authorization (token→guest map): sound; a leaked token acts only on
|
||||
its own guest. Residual: a compromised controller can `POST /rollback` its **own** guest
|
||||
(blast radius = self) — acceptable per design; worth a one-line note that rollback is
|
||||
self-scoped and bounded.
|
||||
@@ -0,0 +1,221 @@
|
||||
# `05-hub-architecture.md` — critical review (grounded against felhom-hub v0.6.3 source + Parts 01–04)
|
||||
|
||||
Method: every claim about the existing hub was checked against `felhom.eu/hub/` source; every
|
||||
cross-doc claim against Parts 01/03/04. Citations are `file:line`. Severity: **blocking** (wrong /
|
||||
breaks an assumption) · **should-fix** (real gap or contradiction, low blast) · **minor**.
|
||||
|
||||
The two highest-value catches (doc assumes something the code contradicts) are **S1** and **S2**.
|
||||
|
||||
---
|
||||
|
||||
## Ranked summary
|
||||
|
||||
| # | What | Where (doc → code) | Severity |
|
||||
|---|---|---|---|
|
||||
| S1 | §9/§11 name the **wrong infra-backup table as current** — `infra_backup_versions` is the live/primary one; `infra_backups` is the deprecated write-only mirror | 05 §9/§11 → `store.go:198-217,541-578` | should-fix (code-contradiction) |
|
||||
| S2 | §7 treats the CF token as **geo-only**; it is **dual-purpose (DNS-01/ACME + WAF)** and is injected into the generated `controller.yaml` | 05 §7 → `config_form.html:76-80`, `controller.yaml.default:26`, `configgen.go:28-37`, `configs.go:1041` | should-fix (code-contradiction / unverified assumption) |
|
||||
| S3 | §6 leans on the existing **"Push"**, but that is a hub→box **inbound** POST — forbidden by the box-initiated model; transport must invert to poll | 05 §6 → `configs.go:569-570,1148-1150`; Part 1 §4/§5/§11; Part 3 §5 | should-fix |
|
||||
| S4 | Part 1 §6 calls app inventory **"declarative"**; 05 §1 (LOCKED) says apps are mirrored, never declared/reconciled, restored from PBS | Part 1 §6 ↔ 05 §1/§9 | should-fix (cross-doc) |
|
||||
| S5 | §9 hands "guest inventory + snapshots" **from the prunable report mirror**; DR soundness actually rests on durable sources | 05 §9/§3 → `store.go:809-816` | should-fix (DR robustness) |
|
||||
| S6 | §4 says backup-deadline checker "maps onto host_reports' last-backup field"; today it is **event-based** and controller-emitted | 05 §4 → `deadline.go:31-86` | should-fix (mechanism) |
|
||||
| M1 | "60s staleness checker" conflates the 60s **cadence** with the 30m/1h **threshold** | 05 §4 → `main.go:207-217,99-102`, `staleness.go:33-37` | minor |
|
||||
| M2 | §2 `customer_configs` field list omits `api_key` — the very field the per-reporter plan retires | 05 §2 → `store.go:102-112` | minor |
|
||||
| M3 | §11 `reports`→`guest_reports` "rename" is really drop+create under the locked clean cutover | 05 §11 → `store.go:55-119` | minor |
|
||||
| M4 | Pre-existing weak authz on infra-backup GET / `/notify` (any valid key, not customer-scoped) | handler.go:407,536,568,596 | minor |
|
||||
|
||||
No **blocking** findings — the data model and the two-driver framing are sound, and the LOCKED clean
|
||||
cutover absorbs most schema risk. The items below are gaps/contradictions worth fixing before the doc
|
||||
drives work.
|
||||
|
||||
---
|
||||
|
||||
## Highest-value: doc assumes something the code contradicts
|
||||
|
||||
### S1 — `infra_backups` vs `infra_backup_versions` is inverted (should-fix, code-contradiction)
|
||||
05 §9: *"`infra_backup_versions` retires; `infra_backups` is repurposed into the slim DR record."*
|
||||
§11 repeats: *"RETIRE `infra_backup_versions`; repurpose `infra_backups`."*
|
||||
|
||||
The code is the other way round:
|
||||
- `infra_backup_versions` (added v0.7.0, `store.go:198-211`) is the **live/primary** table. **Every read
|
||||
path hits it**: `GetInfraBackup` (`store.go:565-578`), `GetInfraBackupByID` (`store.go:581-593`),
|
||||
`GetInfraBackupMeta` (`store.go:604`), `ListInfraBackupVersions` (`store.go:640`), and the recovery
|
||||
endpoint (`handler.go:670-686`).
|
||||
- `infra_backups` (original single-row, `store.go:96-100`) is **deprecated**. It is now **written only
|
||||
as a legacy mirror** ("for backward compatibility during rollback window", `store.go:552-558`) and is
|
||||
**never read** except as the one-time migration *source* (`store.go:214-217`).
|
||||
|
||||
So the doc proposes retiring the current table and repurposing the dead one. Under the LOCKED clean
|
||||
cutover both are discarded anyway, so blast radius is low — but an implementer following §9/§11
|
||||
literally would point the DR record at the wrong table.
|
||||
**Fix:** take §11's own alternative — *fold the slim DR record onto `hosts`* and **drop both**
|
||||
infra-backup tables. If a standalone table is kept, base it on `infra_backup_versions` (the one with the
|
||||
data/readers), and correct the "which is current" framing.
|
||||
|
||||
### S2 — the CF API token is **not** geo-only; it is the ACME token too, and ships into `controller.yaml` (should-fix, code-contradiction)
|
||||
05 §7: *"The hub already holds the CF API token (the config form notes Zone WAF:Edit)… rather than
|
||||
pushing the token down to the controller… The controller no longer calls the CF API."*
|
||||
|
||||
Grounding confirms the hub **does** hold the token and **does** have a remove-all path:
|
||||
`config_json → infrastructure.cf_api_token` (`configs.go:714-715,1041-1042,1089-1096`) →
|
||||
`cfClient.RemoveGeoRules(cfToken, cfg.Domain, …)` in `handleGeoDisable` (`configs.go:1112`), route
|
||||
`/customers/{id}/geo/disable` (`server.go:201-205`). ✓ The §7 framing of geo-enforcement-moves-to-hub
|
||||
is also consistent with Part 1 §5/§7 and Part 3 §2/§46.
|
||||
|
||||
**But the doc's assumption that the token is *for geo* is contradicted by the code:** the same
|
||||
`cf_api_token` is **dual-purpose** —
|
||||
- the config-form hint says **"Zone DNS:Edit (ACME), Zone WAF:Edit (geo)"** (`config_form.html:80`),
|
||||
- `controller.yaml.default:26` documents it as the **"Cloudflare API token (DNS-01 challenge)"**,
|
||||
- and it is **deep-merged into the generated `controller.yaml`** via `configgen.Generate` (config_json
|
||||
overrides, `configgen.go:28-37`), i.e. **today it is shipped down to the box** and served at
|
||||
`/config/{id}` and `/recovery/{id}`.
|
||||
|
||||
Consequences §7 must address:
|
||||
1. **"Token off the controller" is incomplete** if the box still does DNS-01/ACME. In the CF-Tunnel
|
||||
model the box may no longer need a public cert at all (edge-terminated), making the ACME use moot —
|
||||
but that is an assumption the doc must state, not skip. Either confirm ACME is gone, or the CF token
|
||||
cannot fully come off the box.
|
||||
2. **`configgen` must stop emitting `cf_api_token` into `controller.yaml`** (or relocate it to a
|
||||
hub-only field). As written, the generated config still carries it.
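
Consequence 2 could be handled by filtering the token out before generation. The sketch below assumes the config is held as a nested `map[string]any` with the `infrastructure.cf_api_token` path named above; `StripCFToken` is a hypothetical helper, not part of `configgen`.

```go
package main

// StripCFToken returns a copy of cfg in which infrastructure.cf_api_token
// has been removed. The input map is left untouched so the hub can keep
// using the token for its own Cloudflare calls.
func StripCFToken(cfg map[string]any) map[string]any {
	out := make(map[string]any, len(cfg))
	for k, v := range cfg {
		out[k] = v
	}
	infra, ok := cfg["infrastructure"].(map[string]any)
	if !ok {
		return out // no infrastructure section, nothing to strip
	}
	clean := make(map[string]any, len(infra))
	for k, v := range infra {
		if k != "cf_api_token" {
			clean[k] = v
		}
	}
	out["infrastructure"] = clean
	return out
}
```

Applying this to the override map before it is merged into the generated `controller.yaml` would keep the token on the hub while leaving the rest of the infrastructure section intact.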
|
||||
|
||||
---
|
||||
|
||||
## Should-fix
|
||||
|
||||
### S3 — §6 "Push" is an inbound-to-box mechanism the new model forbids
|
||||
05 §6: *"the operator edits a customer's desired state (building on the existing config-form +
|
||||
Push/Pull/Diff)."* The form + diff/pull/push handlers exist — `handlePushConfig` (`configs.go:569`),
|
||||
`handlePullConfig` (`configs.go:952`), `handleConfigDiff` (`configs.go:861`), routes at
|
||||
`server.go:209-229`. ✓ So the UI base is real.
|
||||
|
||||
The wrinkle: **"Push" today is a hub→controller POST, outbound from the hub and inbound to the box** (`handlePushConfig` "sends the generated
|
||||
YAML config to the controller", `configs.go:569-570`), as is the geo-disable notify
|
||||
(`notifyControllerGeoDisable` → `POST controllerURL/api/geo/settings`, `configs.go:1148-1153`). Both are
|
||||
the hub **connecting into the box** — explicitly disallowed by the box-initiated model (Part 1 §4
|
||||
"the hub never initiates inbound"; §5 row `agent↔hub`/`controller↔hub` = outbound poll; Part 3 §5 "The
|
||||
hub never connects inbound"). 05's own §5 already resolves this (desired state is **pulled** on poll
|
||||
with a `desired_generation`). So the doc is internally consistent in *mechanism* but loose in *wording*:
|
||||
**make §6 explicit that "Push" becomes "publish to desired state, delivered on the next agent/controller
|
||||
poll," not a reuse of the inbound push transport.** The form/diff UX carries over; the transport inverts.
|
||||
(Same applies to the geo-disable controller-notify path.)
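
The inverted transport can be sketched as an in-memory store where "Push" only bumps a generation and the box picks the change up on its next poll. `DesiredStore`, `Publish`, and `Poll` are hypothetical names; only the `desired_generation` idea comes from 05 §5.

```go
package main

import "sync"

// DesiredStore holds the latest desired state per reporter (host or guest)
// together with a monotonically increasing generation counter.
type DesiredStore struct {
	mu   sync.Mutex
	gen  map[string]int64
	spec map[string]string
}

// NewDesiredStore returns an empty store.
func NewDesiredStore() *DesiredStore {
	return &DesiredStore{gen: map[string]int64{}, spec: map[string]string{}}
}

// Publish records a new desired state and returns its generation.
// It makes no network call; delivery happens when the box polls.
func (s *DesiredStore) Publish(reporterID, spec string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.gen[reporterID]++
	s.spec[reporterID] = spec
	return s.gen[reporterID]
}

// Poll is what a box-initiated request would hit. It returns the desired
// state only when the box's known generation differs from the current one.
func (s *DesiredStore) Poll(reporterID string, knownGen int64) (spec string, gen int64, changed bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	gen = s.gen[reporterID]
	if gen == knownGen {
		return "", gen, false
	}
	return s.spec[reporterID], gen, true
}
```

With this shape the operator-facing form and diff can stay as they are, and the "Push" button simply calls `Publish`.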
|
||||
|
||||
### S4 — "declarative app inventory" (Part 1 §6) vs "apps are mirrored, never reconciled" (05 §1)
|
||||
Part 1 §6 lists the durable record as including a **"declarative app inventory"** that survives box loss
|
||||
— wording that implies an operator-authored, re-deployable spec. 05 §1 (LOCKED two-driver model) is
|
||||
explicit the opposite way: *"Apps never enter the reconcile loop… the hub only mirrors the resulting
|
||||
inventory… the app/customer layer is mirrored,"* and 05 §9 restores apps **from the PBS guest snapshot**,
|
||||
not by re-deploying a declared inventory. These are reconcilable (the mirror *is* durable last-known
|
||||
truth) but the word "declarative" contradicts the locked framing and the §9 restore-from-snapshot path.
|
||||
**Fix (align the older doc to the locked model):** in Part 1 §6 change "declarative app inventory" →
|
||||
"mirrored / last-reported app inventory," and note apps are recovered from the guest snapshot, not
|
||||
re-declared. (Flagging an internal inconsistency, not relitigating the locked premise.)
|
||||
|
||||
### S5 — §9 reads DR inputs from a prunable mirror; soundness rests on durable sources
|
||||
05 §9 hands the recovering agent *"identity, tunnel token, storage manifest, PBS namespace, guest
|
||||
inventory + snapshots."* §3 places "guest inventory" and "latest PBS snapshot pointers" in
|
||||
`host_reports` — the bottom-up mirror. But reports are **pruned** (`Prune` deletes rows older than
|
||||
`maxDays`, `store.go:809-816`; the doc keeps this), so after a long pre-DR outage the last `host_report`
|
||||
can be gone or stale. The actually-durable DR inputs are: desired-state on `hosts`/`guests` (§5), the
|
||||
slim DR record (PBS namespace + repo fingerprint + wrapped escrow key, §9/§11), and **PBS's own snapshot
|
||||
enumeration** (the agent lists snapshots once it has the namespace + unwrapped key). The mirrored
|
||||
inventory/pointers are convenience, not the source of record.
|
||||
**Fix:** state in §9 that DR reads from the durable sources (desired-state + DR record + PBS), **not**
|
||||
from prunable `host_reports`, so recovery doesn't degrade when the last report has aged out. This also
|
||||
keeps §1's two-driver discipline clean: DR must not depend on bottom-up mirror rows being retained.
|
||||
(Note: the `hosts` row legitimately mixes top-down intent columns with a few box-reported columns —
|
||||
repo fingerprint, wrapped escrow key. That is fine; just label them as box-reported so the §1 split
|
||||
stays legible at the column level.)
|
||||
|
||||
### S6 — backup-deadline checker: doc says field-based, code is event-based (and the emitter changes)
|
||||
05 §4: *"The existing backup-deadline checker maps onto `host_reports`' last-backup-per-target."* The
|
||||
existing checker is **event-based**, not field-based: `CheckBackupDeadlines` looks for
|
||||
`backup_completed` / `backup_failed` (and `db_dump_*`) **events** since Budapest midnight and emits
|
||||
`expected_backup_missed` if neither is present (`deadline.go:31-86`). Two changes the doc should make
|
||||
explicit:
|
||||
1. **Mechanism:** either keep it event-based (someone emits `backup_completed`) or genuinely move it to
|
||||
a `host_reports.last_backup_per_target` field check — the doc says the latter but the impl is the
|
||||
former.
|
||||
2. **Emitter:** today the **controller** emits backup events; in the de-privileged model the **agent**
|
||||
owns backup/PBS (Part 3 §8), so the agent must now emit `backup_completed`/`backup_failed` (or the
|
||||
host report carries last-backup-per-target). Without re-homing the emitter, the deadline check goes
|
||||
silent after the controller stops doing backups.
|
||||
|
||||
---

## Minor

- **M1 — "60s staleness checker" (§4).** 60s is the **check cadence** (`main.go:207-217`,
  `ticker := time.NewTicker(60 * time.Second)`); the **staleness threshold** is 30m (default,
  `main.go:99-102`) with down at 2× = 60m (`staleness.go:33-37`; CLAUDE.md "OK <30m, DOWN >1h"). The
  event-transition mechanism (`node_stale`/`node_down`/`node_recovered`) is described correctly
  (`staleness.go:155-185`). Reword to "the staleness checker (60s cadence, 30m/1h thresholds)."
- **M2 — `customer_configs` fields (§2).** The list ("identity, domain, email, retrieval_password,
  status, config_json") omits **`api_key`** (`store.go:108`) — the field §2's per-reporter plan
  actually retires. Worth noting `customer_configs.api_key` becomes unused once auth resolves via
  `hosts.api_key` / `guests.api_key`.
- **M3 — rename under clean cutover (§11).** `migrate()` is all `CREATE TABLE IF NOT EXISTS` +
  idempotent `ALTER` (`store.go:55-119,146-149`). §11's claim "grounded in store.go's idempotent style"
  is accurate. But a `reports`→`guest_reports` **rename** isn't part of that style; under the LOCKED
  clean cutover (demo re-enrolls fresh, §2) it is really **drop `reports` + create `guest_reports`**
  with no data migration. Name it as such to avoid implying an in-place rename + backfill.
- **M4 — pre-existing weak authz.** `handleInfraBackupGet`/`Versions` and `handleNotify`/
  `handleSavePreferences`/`handleInfraBackupPush` use `checkAuth` (global **or any** customer key,
  `handler.go:63-66`), not customer-scoped `checkAuthCustomer`. Most retire with the infra-backup push
  (§9); for any that carry over, the per-reporter model (§2) should scope them to the resolved
  host/guest→customer. Not a regression the doc introduces — a cleanup the cutover enables.

---

## Confirmed accurate (grounding that holds — so the rest of the doc can be trusted)

- **§10 KEEP list** matches the schema exactly: `customer_configs`, `events`, `customer_notifications`,
  `notification_log`, `app_telemetry`, `app_log_issues` all present (`store.go:74-189,102-135`). The
  asset manager exists (`handler.go:57,834-867`). ✓
- **§10 two-tier notifications** (operator English / customer Hungarian, Resend, cooldowns) match
  `notify/dispatcher.go`: `processOperator` (1h cooldown, `FormatOperatorEmail`, gated by `operatorOn`,
  `dispatcher.go:91-114`) + `processCustomer` (prefs-driven, default 6h, `FormatCustomerEmail`,
  `dispatcher.go:116-158`); wired in `main.go:134`. ✓
- **§8 enrollment / §11 configgen** — deep-merge + Hungarian passphrase base is real:
  `configgen.deepMerge` (`configgen.go:76-91`), programmatic overrides + `hub.api_key = cfg.APIKey`
  (`configgen.go:40-47`), retrieval-password gate (`handler.go:709-753`). The evolution to agent-first +
  per-guest keys + key-pinning is a clean extension. ✓
- **§2 auth extension** (Bearer → reporter → customer) is clean against today's
  `checkAuthCustomer` (global key, else `GetCustomerConfigByAPIKey`, `handler.go:72-90`,
  `store.go:913-935`); adding host/guest key lookups slots straight in. ✓
- **§11 "idempotent style"** is accurate (`store.go:55-119`). New tables/columns (`hosts`, `guests`,
  `host_reports`, `signed_ops`, `desired_generation`, `desired_spec_json`) follow the existing
  `CREATE TABLE IF NOT EXISTS` / `ALTER …` pattern cleanly. ✓
- **§9 escrow/custody** is consistent with Part 1 §8 (three-tier custody, zero-knowledge default,
  recovery-code-wrapped PBS keyfile, operator can't open) and Part 3 §8 (live PBS key on the box for
  backup + restore-test; hub holds only the wrapped escrow). The "customer recovery code is the
  irreducible residual; operator-managed tier avoids it" matches Part 1 §8 in spirit. ✓
- **§4 dead-man's-switch** (host-report recency = primary liveness) is consistent with Part 3 §5
  ("the heartbeat *is* the liveness signal… first-class treatment hub-side"). ✓
- **§5/§6 signed-op + desired-state** are consistent with Part 4 and Part 3 §4:
  hub holds **no** signing key and queues opaque blobs (Part 4 §5; 05 §6 "The hub holds no signing
  key"); agent runs the verify pipeline and is the authoritative guard (Part 4 §2.3, Part 3 §4; 05 §6
  "the agent's classification is the authoritative guard"); hub classifies at edit-time for UX only.
  05 §6's `signed_ops` columns are a consistent superset of Part 4 §2.1's blob
  `{op, target:{host_id,guest_id}, params, nonce, issued_at, expires_at, key_id}` (05 adds hub-side
  lifecycle states `delivered`/`rejected` — fine). The local-CLI hand-off (`felhom-sign --pending`)
  matches Part 4 §5–6's `Signer`-on-the-workstation model. ✓

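
The §2 auth extension confirmed above (Bearer → reporter → customer) amounts to a lookup chain. The following is a minimal Go sketch under stated assumptions: `keyStore`, `Reporter`, and `resolve` are hypothetical names, the in-memory maps stand in for `hosts.api_key` / `guests.api_key` lookups, and trying host keys before guest keys is an arbitrary choice. It is not the code behind `checkAuthCustomer`.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"strings"
)

// Reporter identifies who presented the key and which customer it maps to.
type Reporter struct {
	Kind       string // "global", "host", or "guest"
	ID         string
	CustomerID string
}

// keyStore is a hypothetical in-memory stand-in for the key lookups.
type keyStore struct {
	globalKey string
	hosts     map[string]Reporter
	guests    map[string]Reporter
}

var errUnauthorized = errors.New("unauthorized")

// resolve maps an Authorization header to a reporter: the global key first,
// then host keys, then guest keys.
func (s keyStore) resolve(header string) (Reporter, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return Reporter{}, errUnauthorized
	}
	key := strings.TrimPrefix(header, prefix)
	if key == "" {
		return Reporter{}, errUnauthorized
	}
	// Constant-time comparison for the single shared secret.
	if s.globalKey != "" && subtle.ConstantTimeCompare([]byte(key), []byte(s.globalKey)) == 1 {
		return Reporter{Kind: "global"}, nil
	}
	if r, ok := s.hosts[key]; ok {
		return r, nil
	}
	if r, ok := s.guests[key]; ok {
		return r, nil
	}
	return Reporter{}, errUnauthorized
}

// demoStore returns made-up keys for illustration only.
func demoStore() keyStore {
	return keyStore{
		globalKey: "operator-key",
		hosts:     map[string]Reporter{"host-key": {Kind: "host", ID: "host-1", CustomerID: "cust-1"}},
		guests:    map[string]Reporter{"guest-key": {Kind: "guest", ID: "guest-7", CustomerID: "cust-1"}},
	}
}

func main() {
	s := demoStore()
	for _, h := range []string{"Bearer host-key", "Bearer guest-key", "Bearer nope"} {
		r, err := s.resolve(h)
		fmt.Printf("%q -> %+v, err=%v\n", h, r, err)
	}
}
```

Because every non-global key resolves to exactly one reporter and therefore one customer, handlers that receive the resolved `Reporter` can scope their work to `CustomerID` without a separate customer-key check.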
## Two-driver soundness (axis 3) — holds
No place in 05 has the hub **drive** box/customer-owned state. Desired-state (§5) is all infrastructure
intent (guests, storage *policy*, versions, identity, tunnel) — top-down and legitimate. Apps are
explicitly excluded from reconcile (§1, §5) and mirrored only. Storage is the handshake (detect →
assign policy → reconcile policy onto the detected drive), matching Part 3 §7. The one nuance (S5): the
`hosts` row holds both top-down intent and a few box-reported columns (repo fingerprint, wrapped escrow
key) — acceptable, just label provenance per column. Reconcile (§5) never collides with app/storage
reality because the reality columns (`durable_id` attached, snapshot pointers, app inventory) are
mirror-only and never serve as desired state.

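
One way to make "label provenance per column" concrete is to separate the two kinds of fields in the type that models a `hosts` row. The following Go sketch is illustrative only: every type, field, and function name is hypothetical, and comparing generations is an assumed stand-in for however reconcile actually decides there is work to do.

```go
package main

import "fmt"

// DesiredHost holds top-down intent written on the hub side.
type DesiredHost struct {
	Generation    int
	StoragePolicy string
}

// ReportedHost holds box-reported values that the hub only mirrors.
type ReportedHost struct {
	RepoFingerprint  string
	WrappedEscrowKey string
}

// HostRow is a hypothetical shape for a hosts row with provenance explicit.
type HostRow struct {
	Desired  DesiredHost
	Reported ReportedHost
}

// needsReconcile reports whether the box still has desired state to apply.
// It reads only Desired; nothing in Reported can influence the result.
func needsReconcile(row HostRow, appliedGeneration int) bool {
	return row.Desired.Generation > appliedGeneration
}

func main() {
	row := HostRow{
		Desired:  DesiredHost{Generation: 4, StoragePolicy: "mirror"},
		Reported: ReportedHost{RepoFingerprint: "aa:bb"},
	}
	fmt.Println(needsReconcile(row, 3))

	// A new box report changes the mirror but not the reconcile decision.
	row.Reported.RepoFingerprint = "cc:dd"
	fmt.Println(needsReconcile(row, 3))
}
```

Keeping the mirrored fields in their own struct means a reconcile function that accepts only `DesiredHost` cannot accidentally treat reported values as intent.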
## DR completeness (axis 4) — safe to retire the heavy push, with S5's clarification
Retiring the controller's infra-backup push is safe **given** that DR reads from durable sources, not
the prunable mirror (S5). What the old push carried — `deployed_stacks` + `disk_layout.mounts`
(`store.go:768-795`, surfaced by `handleRecovery`, `handler.go:620-705`) — is reconstructible:
storage layout/`durable_id`s from the storage manifest (desired-state, durable) + host-report mirror;
app inventory from the guest **inside the PBS snapshot** (so it need not be separately stored); snapshot
list from PBS itself. The one artifact only the box can produce — the recovery-code-wrapped PBS key — is
explicitly escrowed (§9), zero-knowledge, consistent with Part 1 §8 / Part 3 §8. So nothing
DR-essential is lost by removing the push **provided** §9 is amended per S5 to name durable sources and
not lean on `host_reports` retention.
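
The dependency rule in this section can be sketched as a function whose required inputs are the durable sources and whose mirror input is optional. This is a hedged Go illustration, not a proposed implementation: all type and function names are hypothetical, and the assumption that the PBS snapshot list is ordered oldest to newest exists only to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// DurableSources are the inputs DR is allowed to depend on.
type DurableSources struct {
	DurableIDs       []string // storage manifest, from desired-state
	WrappedEscrowKey []byte   // recovery-code-wrapped PBS key, from the DR record
	PBSSnapshots     []string // snapshot list read from PBS, assumed oldest first
}

// MirrorHints are optional pointers from host_reports; they may be pruned.
type MirrorHints struct {
	LastSnapshot string
}

// RecoveryPlan names what to restore and onto which storage.
type RecoveryPlan struct {
	DurableIDs []string
	Snapshot   string
}

// planRecovery succeeds with durable sources alone. A mirror hint is used
// only when it names a snapshot that PBS still lists.
func planRecovery(src DurableSources, hints *MirrorHints) (RecoveryPlan, error) {
	if len(src.WrappedEscrowKey) == 0 {
		return RecoveryPlan{}, errors.New("missing escrowed PBS key")
	}
	if len(src.PBSSnapshots) == 0 {
		return RecoveryPlan{}, errors.New("no snapshots listed by PBS")
	}
	snapshot := src.PBSSnapshots[len(src.PBSSnapshots)-1]
	if hints != nil {
		for _, s := range src.PBSSnapshots {
			if s == hints.LastSnapshot {
				snapshot = s
				break
			}
		}
	}
	return RecoveryPlan{DurableIDs: src.DurableIDs, Snapshot: snapshot}, nil
}

func main() {
	src := DurableSources{
		DurableIDs:       []string{"disk-a"},
		WrappedEscrowKey: []byte("wrapped"),
		PBSSnapshots:     []string{"snap-1", "snap-2"},
	}
	plan, err := planRecovery(src, nil) // mirror rows already pruned
	fmt.Println(plan, err)
}
```

Passing `nil` for the hints models the case S5 worries about, where the last host report has aged out; the plan is still produced.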