From bb0a9e720563c398000899d55b04dc9b1aab5f65 Mon Sep 17 00:00:00 2001
From: kisfenyo
Date: Mon, 8 Jun 2026 09:15:16 +0200
Subject: [PATCH] refresh

---
 docs/architecture/01-topology-and-trust.md | 24 +++-
 docs/architecture/02-controller-module-map.md | 57 +++++---
 docs/architecture/03-host-agent.md | 132 ++++++++++++++----
 docs/architecture/_design-review.md | 10 ++
 4 files changed, 165 insertions(+), 58 deletions(-)

diff --git a/docs/architecture/01-topology-and-trust.md b/docs/architecture/01-topology-and-trust.md
index 9bcd9c0..85a5abf 100644
--- a/docs/architecture/01-topology-and-trust.md
+++ b/docs/architecture/01-topology-and-trust.md
@@ -91,9 +91,10 @@ credentials.
 | customer ↔ controller UI | management UI | Cloudflare Tunnel; UI auth (bcrypt) | the customer's own box |
 | controller ↔ agent | snapshot/resize/backup requests | local constrained RPC; agent authorizes per-guest | the controller's own guest only |
 | agent ↔ hub | reports + signed jobs | outbound poll; signed jobs | one box; signed jobs limit forgery |
-| controller ↔ hub | app-domain reports/jobs | outbound, own API key | app-domain of one customer |
+| controller ↔ hub | app-domain reports/jobs (incl. geo desired-state) | outbound, own API key | app-domain of one customer |
 | box ↔ PBS | encrypted backups | outbound; per-customer namespace; client-side encryption | ciphertext only (operator can't read) |
 | guest ↔ Proxmox host | **(none direct)** | the guest holds no Proxmox creds; all via the agent | — |
+| hub ↔ Cloudflare API | geo-restriction WAF (enforcement) | the **hub** holds the CF API token; reconciles geo desired-state → WAF | the customer's zone/WAF |

 ---

@@ -123,8 +124,10 @@ credentials.
 DNS/routing stay intact through an outage.
 - **Outbound only** for control/report/backup (poll to hub, push to PBS). No inbound control endpoint exists in the chosen model.
-- **OPEN:** Cloudflare Tunnel placement — host vs guest (`cloudflared` on the Proxmox host
- routing to guest services, or inside the customer LXC). To resolve in a later part.
+- **Tunnel placement: host** (resolved, Part 3 §3/§5). `cloudflared` runs on the Proxmox host
+ as its own **agent-managed systemd service** — not inside the guest — so the data path
+ survives control-plane death by construction. Geo-restriction WAF is **hub-enforced** (the
+ hub holds the CF API token; the controller only reports geo desired-state).

 ---

@@ -190,9 +193,7 @@ credentials.

 ## 11. Open sub-decisions (carried into later parts)

-- Cloudflare Tunnel placement: host vs guest (§7).
 - **RTO/RPO targets** → drive the backup + offsite-replication schedule (§8).
-- Self-update flow (scenario 5) — not yet designed.
 - Offboarding / decommission (scenario 6) — not yet designed; must honour "never hold data hostage" in credential revocation + data hand-off.
 - Multi-tenant resource fairness — deferred until multi-tenant is real (§2).
@@ -205,4 +206,15 @@ credentials.
 - **Phase 1** → §3/§5: validated the privilege boundary (create/allocate is operator-tier). The guest-side scoped-backup-token it proved possible is **not** used — we chose the agent-mediated path — but it confirmed restore = operator-tier, which shapes the agent.
-- **Phase 2** → §8/§9: backup→restore round-trip; identity reset on restore.
\ No newline at end of file
+- **Phase 2** → §8/§9: backup→restore round-trip; identity reset on restore.
+
+---
+
+## Changelog — design-review + Phase-3 fold-in (2026-06-08)
+
+- §5 trust boundaries: **added `hub ↔ Cloudflare API`** row (hub holds the CF token, enforces
+ geo→WAF); controller↔hub row notes it carries geo desired-state (S4).
+- §7 networking: **tunnel placement resolved → host** (agent-managed systemd service); geo is
+ hub-enforced (S4/S5).
+- §11 open items: removed the now-resolved **tunnel placement** and **self-update flow** entries
+ (S5; self-update designed in 03 §11).
\ No newline at end of file
diff --git a/docs/architecture/02-controller-module-map.md b/docs/architecture/02-controller-module-map.md
index e7433b6..b1e538b 100644
--- a/docs/architecture/02-controller-module-map.md
+++ b/docs/architecture/02-controller-module-map.md
@@ -54,7 +54,7 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 | `appexport/` | `.fab` app export/import (config+DB+volumes, AES-256-CTR+scrypt) | **backup** (DB dump), (provider iface → stacks) |
 | `assets/` | Download/cache app assets from Hub API | — (HTTP only) |
 | `backup/` | DB dumps, Docker-volume archive, **restic**, **cross-drive rsync**, per-app restore, **drive mount**, disk-layout, infra-backup metadata | config, monitor, settings, system, util |
-| `cloudflare/` | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries) | settings |
+| `cloudflare/` | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries) — **enforcement → hub** (S4) | settings |
 | `config/` | `controller.yaml` schema + load | — |
 | `crypto/` | AES-256-GCM for app.yaml secrets | — |
 | `integrations/` | App-to-app (OnlyOffice→FileBrowser/Nextcloud) via docker exec / config patch | stacks, crypto, settings |
@@ -88,7 +88,7 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 | File | Class | Reason | Risk |
 |---|---|---|---|
 | `api/router.go` | **PORT/MODIFY** | Keep stacks/deploy/integrations/metrics/sync/assets/selfupdate routes; **remove `/api/storage/*` (disk)**; backup routes become **agent-coordinated guest-backup** requests; `config/apply` (hub-pushes-yaml) changes since the **agent** now injects config at provision. | needs-rework |
-| `api/geo.go` | **PORT (blocked)** | Geo is app-domain, but gated on the tunnel-placement decision (doc 01 §7/§11).
| blocked |
+| `api/geo.go` | **PORT/MODIFY** | Keep the customer-facing geo **preference** endpoints (set/get global + per-app); **drop the Cloudflare-sync trigger** — enforcement → hub (S4). The controller reports geo desired-state up instead of calling the CF API. | needs-rework |

 ### `appexport/` — KEEP/PORT (Docker-volume + DB level, no disk ops)
 | File | Class | Reason | Risk |
@@ -120,15 +120,15 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 | `local_infra.go` | **DELETE (→agent)** | Per-drive infra-backup metadata → agent. | clean |
 | `restore_scan.go` | **DELETE (→agent)** | Scans drives to build a DR restore plan = agent-tier DR. | needs-rework |

-### `cloudflare/` — BLOCKED on tunnel-placement (doc 01 §7/§11)
+### `cloudflare/` — DELETE (→hub): CF-API enforcement moves to the hub (S4)
 | File | Class | Reason | Risk |
 |---|---|---|---|
-| `client.go`,`zone.go`,`waf.go`,`geosync.go`,`countries.go` | **PORT (blocked)** | Geo-restriction WAF is app-domain and could stay in the controller, but it shares the Cloudflare account/zone with the **tunnel**, whose host-vs-guest placement is undecided. Classify provisionally PORT; do not force. | blocked |
+| `client.go`,`zone.go`,`waf.go`,`geosync.go`,`countries.go` | **DELETE (→hub)** | The **hub** holds the CF API token and reconciles geo desired-state → WAF (doc 01 §5, doc 03 §2). The controller no longer calls the Cloudflare API — it reports geo desired-state up. The customer-facing geo *preference UI/data* stays (see `api/geo.go`). | needs-rework |

 ### `config/`, `crypto/`, `util/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
-| `config/config.go` | **MODIFY** | Drop `BackupConfig` (restic/retention) and storage-drive keys; keep customer/paths/web/git/stacks/monitoring/hub/assets/system; **add agent local-API endpoint+token**. Self-update section gated (open).
| needs-rework |
+| `config/config.go` | **MODIFY** | Drop `BackupConfig` (restic/retention), storage-drive keys, and `InfrastructureConfig.cf_api_token` (→hub, S4); keep customer/paths/web/git/stacks/monitoring/hub/assets/system; **add agent local-API endpoint+token**. | needs-rework |
 | `crypto/crypto.go` | **KEEP** | App.yaml secret encryption. | clean |
 | `util/strings.go` | **KEEP** | Trivial helper. | clean |

@@ -169,16 +169,17 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 | `infra_backup.go`/`_linux.go`/`_other.go` | **DELETE (→agent)** | Builds infra-backup payload (disk layout, restic/enc passwords) for hub. | hazard |
 | `infra_pull.go` | **DELETE (→agent)** | Pulls recovery config + infra backup from hub (setup-wizard DR). | needs-rework |

-### `selfupdate/` — OPEN (doc 01 §11: "self-update flow not yet designed")
+### `selfupdate/` — controller is agent-managed (doc 03 §11)
 | File | Class | Reason | Risk |
 |---|---|---|---|
-| `version.go`,`state.go` | **KEEP** | Semver parse; update audit state. | clean |
-| `updater.go` | **PORT (open)** | Pulls image + edits `docker-compose.yml` + `compose up -d`. In the agent model the controller is the **agent's product** (doc 01 §3) — self-update may move under the agent. Flag as open. | blocked |
+| `version.go` | **KEEP** | Semver parse / version string (still used for reporting). | clean |
+| `state.go` | **DELETE (obsolete)** | Self-update audit state — the agent owns controller updates now (doc 03 §11). | clean |
+| `updater.go` | **DELETE (→agent)** | Resolved (doc 03 §11): the controller is **agent-managed** — the agent snapshots → redeploys → health-gates → rolls back the controller. The controller's old self-update path (image pull + compose edit) is **removed**.
| clean |

 ### `settings/`
 | File | Class | Reason | Risk |
 |---|---|---|---|
-| `settings/settings.go` (1101 L) | **MODIFY (split)** | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive *intent*. The **storage-path registry** (`StoragePath` with `Disconnected`/`DisconnectedAt`/`StoppedStacks`/decommission/UUID) is disk-management state → reshape to **per-volume placement** fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. | hazard |
+| `settings/settings.go` (1101 L) | **MODIFY (split)** | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive *intent*. The **storage-path registry** (`StoragePath` with `Disconnected`/`DisconnectedAt`/`StoppedStacks`/decommission) is disk-management state → reshape to **per-volume placement** fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. (UUID is *not* a persisted field — runtime-derived from fstab.) | hazard |

 ### `setup/` — all DELETE (obsolete); the agent provisions the controller
 | File | Class | Reason | Risk |
@@ -272,7 +273,7 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 host-info first.**

 6. **`settings/StoragePath` carries disk state into an app-domain store.** Disk fields
- (`Disconnected`,`DisconnectedAt`,`StoppedStacks`, decommission, UUID) are written by
+ (`Disconnected`,`DisconnectedAt`,`StoppedStacks`, decommission — UUID is *not* persisted, it's runtime-derived from fstab via `system.ParseFstabUUID`/`watchdog.go`) are written by
 `watchdog.go`/`storage_handlers.go`/`crossdrive.go` (all delete) but the same struct is read by `stacks`/`web` for labels and **placement** (keep). Reshape `StoragePath` to a placement record fed by the agent manifest.
@@ -327,10 +328,11 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 2.
**Per-volume storage placement** (doc 01 §8) — `.felhom.yml` `hot`/`bulk` volume classification (extend `stacks/metadata.go`), enforcement at deploy (extend `stacks/deploy.go`), and a placement record in `settings`. Replaces the per-app
- HDD-path + cross-drive model.
-3. **Self-restore-test orchestration** — controller asks the agent to restore the latest
- guest backup to a scratch guest, runs its post-restore health probes, reports the
- verdict to the hub. (Backed by the validated Phase 2 round-trip in
+ HDD-path + cross-drive model. A `bulk` volume must be realized as a `backup=0` mount point,
+ **never** a rootfs Docker named volume (validated recipe: `phase3-findings.md` B2 / doc 03 §7).
+3. **Self-restore-test status display** (read-only) — the **agent owns orchestration** (it
+ holds the PBS key and creates the scratch guest — operator-tier, doc 03 §8); the controller
+ only surfaces `GET /restore-test/status` in its UI. (Round-trip validated: Phase 2,
 [../proxmox-platform.md](../proxmox-platform.md) §4.)
 4. **Snapshot-before-deploy/rollback flow** in the deploy path — wraps the existing compose deploy with agent snapshot → health check → agent rollback-on-failure
@@ -343,13 +345,12 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe

 ## 6. Open / blocked items

-- **`cloudflare/` + `api/geo.go` — blocked on tunnel placement** (doc 01 §7, §11: host vs
- guest `cloudflared`). Geo-WAF is app-domain and likely PORT, but it shares the
- Cloudflare account/zone with the tunnel; do not finalize until placement is decided.
-- **`selfupdate/updater.go` — open** (doc 01 §11: self-update flow undesigned). Because the
- controller is "the agent's product" (doc 01 §3), self-update may move under the agent
- (snapshot → swap → health-gate → rollback) rather than the controller editing its own
- compose file. Provisionally PORT.
+- **Geo — resolved (S4):** CF-API **enforcement moves to the hub** (it holds the CF token and
+ reconciles geo → WAF); the controller keeps the geo **preference UI/data** and reports
+ desired-state up. Tunnel placement is settled (host, agent-managed, doc 03 §3/§5). The
+ `cloudflare/` package + `api/geo.go`'s CF-sync are DELETE-from-controller → hub.
+- **Self-update — resolved (doc 03 §11):** the controller is agent-managed; its self-update
+ path is removed.
 - **`settings`/`stacks` per-volume reshape** — depends on the storage-manifest contract between hub ↔ agent ↔ controller (doc 01 §8), not yet specified.
 - **Backup UI/report surface** — depends on the agent's guest-backup status API shape
@@ -357,3 +358,17 @@ Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-targe
 - **Notification event taxonomy** — which infra events (`storage_disconnected`, `crossdrive_*`, `disaster_recovery_*`) the **agent** emits vs the controller, once those responsibilities move.
+
+---
+
+## Changelog — design-review + Phase-3 fold-in (2026-06-08)
+
+- **M1:** removed `UUID` from the `settings.StoragePath` field lists (§ settings, hazard #6) —
+ it is runtime-derived from fstab, not persisted.
+- **S4 (geo):** `cloudflare/` reclassified **PORT(blocked) → DELETE(→hub)** (CF-API enforcement
+ moves to the hub); `api/geo.go` → **PORT/MODIFY** (keep geo *preference* endpoints, drop the
+ CF-sync trigger); `config/config.go` also drops `cf_api_token`. §6 + §1 updated.
+- **S5:** cloudflare/geo no longer "blocked on tunnel placement" (resolved).
+- **S6:** §5(3) self-restore-test → **status-display only**; the agent owns orchestration.
+- **Self-update resolved (03 §11):** `updater.go` → **DELETE(→agent)**, `state.go` →
+ DELETE(obsolete), `version.go` KEEP; §6 + §5(2) updated (bulk = `backup=0` mountpoint recipe).
diff --git a/docs/architecture/03-host-agent.md b/docs/architecture/03-host-agent.md
index e8309dc..644864d 100644
--- a/docs/architecture/03-host-agent.md
+++ b/docs/architecture/03-host-agent.md
@@ -29,11 +29,11 @@ N-guests, never "the guest").

 Owns:

-1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (minimal role from Phase 1) for everything the API covers; raw host ops only where unavoidable.
+1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
 2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
 3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
 4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
-5. **Provisioning** — build a guest, deploy the controller into it, hand it its bootstrap config.
+5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
 6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
 7. **Local API** — the per-guest authorization gate the controller calls.
 8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.
@@ -43,11 +43,12 @@ Explicitly does **not**:
 - Serve application traffic or sit in the data path.
**Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
 - Hold or proxy customer application data.
 - Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
+- Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.

 ## 3. Process model & host integration

 - **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
-- **Root-minimized.** Default to a **non-root** service user with the scoped Proxmox token for API-covered work + a **narrow `sudoers` allowlist** for the handful of true host ops (USB mount-by-UUID, systemd mount units). Full root on the crown-jewel host is what a compromise most wants; avoid it where the API or a scoped sudoers entry suffices. *(Open: confirm during build which ops genuinely need host root vs. are API-covered — the Phase-1 minimal role is the API floor.)*
+- **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work + a **narrow `sudoers` allowlist** for true host ops. Per Phase 3 (B3) the boundary is settled: the entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered.
Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
 - **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.

 ## 4. Control model — reconcile + signed destructive ops
@@ -71,10 +72,12 @@ snapshot S"), not a procedure; the agent owns the *how*.

 **The reversibility gate (security-critical).** "Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
-*desired state* for destructive changes. So:
+*desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
+by verb**:

-- **Irreversible/destructive operations** — guest destroy, storage detach/wipe, restore-overwrite, decommission — require a valid **operator signature**, *regardless of whether they arrive as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
-- **Benign convergence** — deploy a guest, attach storage, adjust a non-destructive policy, bump a controller version — runs on normal hub API auth, no signature.
+- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8).
The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
+- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
+- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)

 Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
@@ -111,15 +114,22 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
+ - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.

 Note what is *absent*: nothing here lets a controller touch another guest, the host, storage attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).

+A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
+token → guest and authorizes per guest, so a compromised controller's blast radius is
+**self-scoped and bounded** to its own guest.
+
 ## 7. Storage manifest & reconciliation

-The manifest is the load-bearing contract (it absorbs the disk-state fields that
-`settings.StoragePath` carries today — see Part 2). Held in the hub, reconciled by the agent.
+The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
+`settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
+re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
+improvement. Held in the hub, reconciled by the agent.

 Per target:

@@ -138,10 +148,20 @@ allowlist), each Proxmox storage entry matches, and `disconnected` targets are s
 the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).

 **Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
-**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`. **Bulk
-external mounts are excluded from the guest's vzdump** (a per-mount backup flag) and carry
-their own per-volume policy (file-level to a tier, or explicitly *not* backed up for
-re-downloadable media). This is what keeps a 1 TB media drive out of the whole-guest image.
+**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.
+
+A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
+bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
+(verified, `phase3-findings.md` B2). Proven recipe: attach
+`-mpN :,mp=/mnt/bulk,backup=0`, then
+`docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk ` (or a
+compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy.
The
+**DR consequence** of excluding bulk is covered in §8.
+
+**Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
+`IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
+`StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
+reconnect, not a host concern).

 ## 8. Backup/restore orchestration

@@ -149,9 +169,18 @@ Tiers double as backup *and* restore-source priority (fastest surviving source f
 per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) → **local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).

-- **Quiescing:** the controller stops the app stack (volume-consistent) before a guest
- vzdump where app-consistency matters; stop-mode/snapshot-mode per Phase 1. Every Proxmox
- op is async → the agent polls `task exitstatus`, never trusts the POST return.
+- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze
+ (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup
+ is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack →
+ `POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is
+ crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5).
+ Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return.
+- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
+ `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
+ `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
+ host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
+ *same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
+ never a silent loss.
 - **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
@@ -165,15 +194,31 @@ per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a ba

 ## 9. Provisioning & DR flows

-**Provisioning (reconcile-driven).** Desired state says "this customer should have guest G
-with controller C." The agent: enrolls (mints its scoped Proxmox token as root at setup) →
-creates the LXC (unprivileged, `nesting=1,keyctl=1`, overlayfs — Phase 0) → deploys the
-controller → hands it the bootstrap config (identity, hub API key, local-API token, mount
-map). If any step fails, reconciliation retries; a half-built guest is journaled (§10) and
-rolled back, never orphaned.
+**Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
+the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
+scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3), so the agent
+provisions **by restoring a golden base image**, never by `pct create` on the per-customer path:
+
+- A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`, overlayfs — is
+ built once as `root@pam` **at enrollment** (when the agent legitimately holds root to mint its
+ Proxmox token) and refreshed on a maintenance cadence. This is the one place `keyctl`/root
+ provisioning lives — off the per-customer path.
+- To provision guest G: restore the golden archive → new VMID (token-covered: `VM.Allocate` +
+ `Datastore.AllocateSpace`; `keyctl` preserved) → reset identity (MAC/hostname) → size the guest
+ (CPU/mem config + `pct resize` rootfs, token-covered) → attach storage mounts per the manifest
+ → deploy the controller → hand it bootstrap config.
A mid-flight failure is journaled and
+ compensating-rolled-back (destroy the just-restored guest — allowed without a signature per §4,
+ same-transaction provenance).
+
+**Unified bring-up primitive.** Provisioning and DR-restore share the same token-covered front
+half — *restore an archive → reset identity* — and differ only in the archive and the back half:
+provisioning restores the **golden base** then deploys a fresh controller; DR-restore restores
+the **customer's backup** (already containing controller + data), brings it up, and reattaches
+external storage. One code path, exercised by every restore-test (§8).

 **Guest loss.** Agent restores G from the fastest surviving tier and resets identity
-(MAC/hostname) so the restored guest rejoins cleanly.
+(MAC/hostname) so the restored guest rejoins cleanly — this *is* the unified restore primitive
+above (customer-backup archive, DR back half).

 **Host/hardware loss.** Re-enroll the new host in **restore mode**; the hub — the durable source of truth that survives box death — hands the new agent the existing identity, PBS
@@ -185,17 +230,21 @@ the hub record, so DNS stays intact.
 - **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting ops on the same guest). Independent guests proceed in parallel.
-- **Operation journaling.** Multi-step async ops (provision, restore) are journaled with
- their in-flight Proxmox task ids. On agent restart, the journal is replayed:
- resume-or-rollback, so a crash mid-restore never leaves a corrupt or half-built guest.
+- **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
+ self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
+ journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
+ half-built guest.
 - **Idempotency keys** on one-shot jobs (run-once across retries and restarts).

 ## 11. Self-update

-- **Agent (the hard case — a host service, no snapshot-rollback).** Atomic binary swap:
- download → verify signature → atomic rename → restart; **keep last-known-good**; a watchdog
- reverts to last-good if the new binary fails to come up healthy. Triggered by a hub signed
- job within the update window; manual always allowed.
+- **Agent (the hard case — a host service, no snapshot-rollback).** **A/B layout:** download →
+ verify signature → stage as the inactive slot → flip a `current → good|new` symlink → restart.
+ **Revert authority lives outside the swapped binary** — `Restart=always` alone just
+ crash-loops a bad binary — so a **separate health-gate** (a systemd oneshot `ExecStartPost`
+ probe, or a tiny supervisor unit) flips `current` back to last-good and restarts on a failed
+ health window. The new version is **committed as "good" only after a clean health window**.
+ Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
 - **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle, so the **agent updates the controller**: snapshot-before-update (free rollback, because the controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
@@ -214,16 +263,37 @@ argument for §3's root-minimization and a small, auditable agent.

 Resolved here: tunnel placement (host, agent-managed, own systemd service), the reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
-ownership, the local-API surface, and the storage-manifest schema.
+ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, and
+the **root-vs-API boundary** (Phase 3, B3).

 Still open:

 - Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub"). - Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc. +- **Golden base image** refresh cadence + fleet versioning — who triggers a rebuild, how the per-host image version is tracked (operational detail, not blocking; §9). This doc hands the implementation three contracts it was waiting on: 1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2); 2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2); -3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor. \ No newline at end of file +3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor. + +--- + +## Changelog — design-review + Phase-3 fold-in (2026-06-08) + +- **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image** + (token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified + restore primitive shared with DR. §2 responsibility + §3 boundary updated. +- **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator + role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART). +- **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is + agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place). +- **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence** + (excluded bulk needs its own backup decision). 
+- **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is + crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority; + controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing. + **S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id". + **§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open. \ No newline at end of file diff --git a/docs/architecture/_design-review.md b/docs/architecture/_design-review.md index a25d24e..91cd684 100644 --- a/docs/architecture/_design-review.md +++ b/docs/architecture/_design-review.md @@ -1,5 +1,15 @@ # Critical design review — Proxmox re-platform doc set +> ✅ **RESOLVED (2026-06-08).** All findings folded into 01/02/03 + `proxmox-platform.md` +> (Phase-3 spike run for B2/B3 → `tests/phase3-findings.md`). **Folded:** B1 (03 §4), B2 +> (03 §7/§8 + platform §4.7), B3 (03 §2/§3 + platform §3.6), S1 (03 §6/§8), S2 (03 §10/§11), +> S3 (03 §7), S4 (01 §5/§7 + 02 + 03 §2), S5 (01 §7/§11 + 02 §6), S6 (02 §5), M1 (02 §3), +> M2 (03 §7), M3 (03 §10), §6-residual (03 §6). Plus the two Phase-3 design updates: +> provision-by-restore (03 §9) and the settled root-vs-API boundary (03 §3). **Deferred/none:** +> no finding was deferred; the pre-existing open items (operator signing-key mechanics, +> multi-tenant fairness, hub-side desired-state UX, golden-image refresh cadence) remain +> flagged in 03 §13. This artifact can be deleted once confirmed. + Working artifact. Review pass over `01-topology-and-trust.md`, `02-controller-module-map.md`, `03-host-agent.md`, `proxmox-platform.md`, and the Phase 0 / Phase 1-2 findings, grounded against the v0.33 source (`deploy-felhom-compose/controller/`). Every finding cites a