From e436b61368656da8f78febc42aafdb70f709495d Mon Sep 17 00:00:00 2001
From: kisfenyo
Date: Wed, 10 Jun 2026 10:02:11 +0200
Subject: [PATCH] =?UTF-8?q?doc=2003:=20slice=208A=20implemented=20?=
 =?UTF-8?q?=E2=80=94=20=C2=A76a=20local-API=20impl,=20=C2=A79=20back-half?=
 =?UTF-8?q?=20row,=20=C2=A713=20(2026-06-10)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

§6a (new): the local-API implementation — stable leaf-SHA-256 pin,
token->guest self-scoping (cross-guest 403), bootstrap.json contract +
controller ingestion (c), baked-controller deploy (no registry cred in
guest), firewall narrowing. §9 slice table: back-half = slice 8A
implemented (8B quiesce / 8C de-priv split out); build-golden.sh bakes
the controller. §13 + doc changelog.

Co-Authored-By: Claude Opus 4.8
---
 documentation/architecture/03-host-agent.md | 60 ++++++++++++++++++---
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/documentation/architecture/03-host-agent.md b/documentation/architecture/03-host-agent.md
index 18cfe7e..122e579 100644
--- a/documentation/architecture/03-host-agent.md
+++ b/documentation/architecture/03-host-agent.md
@@ -124,6 +124,34 @@ A controller can only `POST /rollback` (or snapshot/backup) **its own** guest
 token → guest and authorizes per guest, so a compromised controller's blast radius is
 **self-scoped and bounded** to its own guest.
 
+### 6a. Implementation (slice 8A — implemented)
+
+**Status: implemented** (agent v0.10.0 `internal/localapi`; controller v0.35.0 `internal/bootstrap`
++ `internal/agentapi`). Grounded by `documentation/tests/slice8a-channel-deploy-spike-findings.md`
+(commit `4a81a96`). The 7 endpoints above are live; `GET /backup/due` is **thin** in 8A (the
+quiesce-on-due consumer is 8B), the rest wrap the existing slice-5/6/7 machinery.
+
+- **Transport / pin.** The agent serves a **persisted self-signed leaf** bound to the host bridge IP
+  on a fixed port (default `:8443`). The controller pins the **leaf-cert SHA-256** (decision:
+  consistency with the agent's Proxmox/PBS cert pinning), carried in its bootstrap. The leaf is
+  generated **once and persisted**, so its fingerprint is stable across agent restarts (a fresh cert
+  each boot would invalidate every already-issued bootstrap pin). Defense-in-depth: the listener
+  binds the **bridge IP** (not `0.0.0.0`) and a host firewall rule narrows the port to the guest
+  bridge subnet (`configs/felhom-localapi-firewall.example`) — the **per-guest token stays the gate**.
+- **Token custody.** The per-guest token is minted by the back-half (§9), persisted as a **SHA-256
+  hash** only (the plaintext exists transiently at mint→write-to-mount, then is discarded), in a
+  durable last-write-wins map. **Self-scoping** is enforced by the token→guest map alone: the VMID is
+  resolved from the token, never from a caller-supplied id; an explicit `vmid` that disagrees is
+  refused (**403**) and the Proxmox op is never issued for the other guest. Absent/unknown token → 401.
+- **The bootstrap contract `(c)`.** The agent emits a stable `bootstrap.json`
+  (`schema: felhom.bootstrap/v1`: customer identity, hub, and the local-API `{endpoint, fingerprint,
+  token}`) into a read-only config mount; the controller **ingests it on first run and seeds its own
+  `controller.yaml`, skipping setup mode** (idempotent — never clobbers an existing config; fail-safe
+  — a malformed/absent bootstrap stays in setup). The agent emits the contract; the controller owns
+  the translation — they stay decoupled (no shared config schema). **No registry credential ever
+  enters a guest**: the controller image is **baked into the golden** (§9), so deploy does no
+  `docker login`/`pull`.
+
 ## 7. Storage manifest & reconciliation
 
 The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
@@ -307,7 +335,7 @@ identity" is shorthand for two different operations:
   NOT auto-regenerate host keys after a restore, so the golden carries the regeneration, keeping
   the agent host-side-only). It then receives a **fresh** controller identity (host-id, local
   token, hub channel), **fresh restic repo identity**, and a fresh tunnel association — all minted
-  in the back half (slice 8).
+  in the back half (slice 8A — implemented).
 - **Guest-loss DR (customer backup) → preserve continuity identity, reset only what would
   collide.** The restored guest must *continue* the customer's world: **keep** the restic repo
   identity (resetting it orphans the existing backup chain — a silent data-continuity bug), the
@@ -332,11 +360,13 @@ this path — bring up + reattach external storage and it is whole. This is full
 
 | Capability | Slice | Status |
 |---|---|---|
-| Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit); golden archived at enrollment |
+| Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit; **now also bakes the controller image + a controller-bootstrap unit**, slice 8A); golden archived at enrollment |
 | Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | **implemented** (agent v0.8.0, `internal/reconcile/bringup.go`) |
 | **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
 | PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
-| Provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8** | deferred — needs the controller-deploy path + agent↔controller local API (§6) |
+| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
+| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | deferred — `/backup/due` is thin in 8A; the controller quiesce-then-`POST /backup` loop is 8B |
+| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred |
 | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
 | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
 | Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
@@ -386,10 +416,13 @@ argument for §3's root-minimization and a small, auditable agent.
 
 Resolved here: tunnel placement (host, agent-managed, own systemd service), the
 reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
-ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, the
-**provision/DR slice boundary** (7 front-half + guest-loss DR + escrow creation; 8 provisioning
-back-half; 10 host-loss DR + escrow consumption — §9 table), the **PBS recovery-code escrow
-design** (§8a), and the **root-vs-API boundary** (Phase 3, B3).
+ownership, the local-API surface (**implemented, slice 8A — §6a**), the storage-manifest schema,
+**provision-by-restore**, the **provision/DR slice boundary** (7 front-half + guest-loss DR +
+escrow creation; **8A provisioning back-half + local API — implemented**; 8B quiesced backup; 8C
+controller de-privileging; 10 host-loss DR + escrow consumption — §9 table), the **PBS
+recovery-code escrow design** (§8a), and the **root-vs-API boundary** (Phase 3, B3 — the slice-8A
+back-half's host-side `chown`/`pct set` bind-mount is a deliberate, narrow addition OUTSIDE the
+API token, in `internal/provision`, not the 3-exception `proxmox.Privileged` fence).
 
 Still open:
 
@@ -413,6 +446,19 @@ This doc hands the implementation three contracts it was waiting on:
 
 ## Changelog — design-review + Phase-3 fold-in (2026-06-08)
 
+### Slice-8A implemented: local API + provisioning back-half (2026-06-10)
+- NEW §6a: the **local-API implementation** (agent v0.10.0 `internal/localapi`; controller v0.35.0
+  `internal/bootstrap` + `internal/agentapi`) — persisted self-signed leaf with a **stable
+  leaf-SHA-256 pin**, the **token→guest self-scoping** (explicit cross-guest id → 403, op never
+  issued), the stable **`bootstrap.json` contract + controller ingestion `(c)`** (seed
+  `controller.yaml`, skip setup; idempotent + fail-safe), and the **baked-controller deploy** (no
+  registry credential in any guest). Firewall narrowing = defense-in-depth; the token stays the gate.
+- §9: the provisioning **back half** row is now **slice 8A — implemented** (split from the old "8");
+  `build-golden.sh` now **bakes the controller + a bootstrap unit**; quiesced backup → 8B, controller
+  de-privileging → 8C. The host-side `chown`/`pct set` bind-mount is a deliberate narrow surface in
+  `internal/provision` (NOT the 3-exception `proxmox.Privileged` fence). Validated live end-to-end.
+- §13 updated accordingly.
+
 ### Slice-7 scope + escrow design (2026-06-09)
 - §9 rewritten: the bring-up primitive is a **shared front half only** — identity-reset policy is
   **scenario-specific** (provision = fresh everything; guest-loss DR = preserve restic/tunnel/hub