From 4a81a966782e5b423e3a7445cf4575493e3643d2 Mon Sep 17 00:00:00 2001
From: kisfenyo
Date: Wed, 10 Jun 2026 08:57:48 +0200
Subject: [PATCH] slice 8A spike: agent<->controller channel + controller deploy plumbing findings

Doc-only spike (no hub code change). Validated on demo-felhom (guest 8200, torn down):
(1) guest->host HTTPS over vmbr0 with fingerprint-pin + bearer + self-scoping (200/401/403,
wrong-pin TLS fail, no firewall rule needed); (2) config-mount + golden-baked bootstrap unit
deploys+runs the controller (docker login/pull/run v0.34.0) with no pct exec. Verdict: GO to 8A spec.

Co-Authored-By: Claude Opus 4.8
---
 REPORT.md                                     |  76 ++++-----
 .../slice8a-channel-deploy-spike-findings.md  | 159 ++++++++++++++++++
 2 files changed, 196 insertions(+), 39 deletions(-)
 create mode 100644 documentation/tests/slice8a-channel-deploy-spike-findings.md

diff --git a/REPORT.md b/REPORT.md
index a0176db..c5beaf9 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -4,51 +4,49 @@

 ---

-# REPORT — Slice 7 close-out: PBS escrow — hub opaque storage + doc 03 §8a (v0.8.0) (2026-06-10)
+# REPORT — Slice 8 Phase A spike: agent↔controller channel + controller deploy plumbing (2026-06-10)

-## Outcome
+## Type

-The `felhom.eu` half of `TASK — Slice 7 close-out: PBS recovery-code escrow`. The agent
-(felhom-agent v0.9.0) creates an **opaque** `R`-wrapped copy of the PBS key in the zero-knowledge
-default; this slice adds the **hub opaque storage** for that blob and rewrites **doc 03 §8a** into a
-full key-custody posture model. The wrap→recover→restore round-trip was proven on a throwaway first
-(`documentation/tests/slice7-escrow-spike-findings.md`).
+**SPIKE** (CC-executed on the demo). Doc-only deliverable — **no hub/code change, no version bump,
+no deploy**. Probes the two unvalidated foundations of slice 8 *before* speccing the local API
+(doc §6) and the provisioning back-half. Findings:
+[documentation/tests/slice8a-channel-deploy-spike-findings.md](documentation/tests/slice8a-channel-deploy-spike-findings.md).

-## What landed (hub v0.8.0)
+## What was proven on `demo-felhom`

-- **`PUT /api/v1/hosts/{host_id}/escrow`** (`internal/api/handler.go`) — per-host-key authed (a host
-  writes only its own escrow; global operator key also accepted). Decodes the base64 blob and stores
-  the **opaque bytes verbatim** against the host. The hub **never decrypts** — there is no decrypt
-  path; it has no recovery code. Rotation is last-write-wins.
-- **`host_escrow`** table + `SaveHostEscrow`/`GetHostEscrow` (`internal/store`). Blob is ciphertext.
-- **Contract:** `escrowUploadRequest` mirrors the agent's emit struct (`blob_b64`, `key_fingerprint`,
-  `posture`, `created_at`); a key-set test in each repo guards drift.
-- **Tests:** stores the blob byte-identical; rotation last-write-wins; 401 (absent/wrong key), 403
-  (host writing another host's escrow), 400 (bad base64); contract key-set. `go test ./...` green.
+Spike guest **8200** was produced by the **real slice-7 bring-up job** (`felhom-agent v0.9.0`,
+`-mode provision`) from the golden archive — a golden, link-up, Docker-29.5.3 guest in 8s, fresh MAC.
+Torn down at the end; demo left as found (only pre-existing 9001/9999 remain; golden archive intact).

-## Documentation (doc 03 §8a)
+### 1. The channel (guest → host HTTPS over `vmbr0`, fingerprint-pinned) — **PASS**
+A throwaway self-signed HTTPS stub on `192.168.0.162:8443`, hit from **inside guest 8200**:
+- correct pin + guest-8200 token → **200**; no token → **401**; **other-guest** token → **403**
+  (self-scoping holds); **wrong pin → hard TLS failure** (curl exit 90 — the pin gates the handshake).
+- **No firewall rule needed** (PVE firewall off; guest and host share the `vmbr0` /24, direct route).
+- Security note: the local-API binds the host **LAN IP** → reachable by anything on the LAN; **auth
+  is the only gate** (it held). Both pin forms captured (SPKI + leaf-cert SHA-256) for the 8A choice.

-Rewrote §8a into the **key-custody posture model**: the **separation principle** (reading data needs
-both chunks *and* a key; zero-knowledge holds while Felhom never holds both), the **topology matrix**
-(data location × key custody → who can read; the one dangerous cell flagged), the **default**
-(Felhom storage + customer-only key; `R` printed durably), the **anti-lockout ladder** ((b) wrapped
-offline copy → (a) raw paperkey → Felhom-holds-a-key), **SSH-for-support is a separate grant** (not
-coupled to key custody), **why zero-knowledge stays default** (breach + legal compellability), and
-the **integrity caveat** for self-hosted-data postures. Corrected the storage-slice note: hub opaque
-storage is **slice 7** (this task); only restore-mode **serving** is slice 10. §9 slice table + §13
-updated.
+### 2. The deploy plumbing (no `pct exec` — config mount + golden-baked unit) — **PASS**
+The F3 principle end-to-end: agent stays host-side, populates a **read-only config mount**
+(`/etc/felhom-bootstrap`, bind-mount hotplugged live); a **golden-baked oneshot** reads it →
+`docker login` (token via `--password-stdin`) → `docker pull …/felhom-controller:v0.34.0` →
+`docker run`. The controller came up **Up (healthy)**; an in-guest process read the bootstrap token
+from the mount and reached the host `/storage` → **200**. **No `pct exec` used.**

-## Live validation
+## Gotchas carried into 8A / the back-half
+1. **Unprivileged-LXC uid mapping** — the agent must `chown 100000:100000` files it writes into the
+   mount (else the guest reads them as `nobody`; the secret config is inaccessible).
+2. **Registry-cred scope** — the bootstrap currently carries the shared `admin` pull token; production
+   wants a narrow, read-only, ideally per-guest/short-lived registry token (mount is the right channel).
+3. **Controller config contract** — `bootstrap.json` ≠ the controller's `controller.yaml`; the
+   controller boots to *setup mode* until 8A emits the real config format/path (or the unit translates).
+4. **Pin form** (SPKI vs leaf-cert SHA-256) and **LAN exposure** narrowing — 8A/back-half decisions.

-After the v0.8.0 deploy, the demo agent's `--selftest=escrow-create -upload` PUT the opaque blob and
-the hub stored it against the host; the stored bytes are **ciphertext** (not the key). The recovery
-code `R` is never sent to or stored by the hub. *(No `R`/`K` value appears in any committed file.)*
+## Verdict — **GO** to spec 8A (local-API server + the 7 §6 endpoints) and the provisioning back-half.

-## Deferred / security
-
-Restore-mode serving + consumption → slice 10. The hub holds ciphertext only — possessing the blob
-does not let Felhom read customer data (separation principle). No secrets committed.
-
-## Deploy (GitOps)
-
-Build+push `felhom-hub:v0.8.0` → bump `manifests/hub.yaml` → commit → sync the `felhom` ArgoCD app.
+## Secret handling (held)
+Test local-API tokens + the registry pull cred kept in `0600` host files, referenced by location,
+never logged/committed; the stub never logged the `Authorization` header; `docker login` via
+`--password-stdin`. No real per-guest token or registry cred in git. All scratch shredded on teardown.
+No throwaway registry token was minted (the existing `gitea-creds` admin cred was used by reference).
diff --git a/documentation/tests/slice8a-channel-deploy-spike-findings.md b/documentation/tests/slice8a-channel-deploy-spike-findings.md
new file mode 100644
index 0000000..2fbbb3d
--- /dev/null
+++ b/documentation/tests/slice8a-channel-deploy-spike-findings.md
@@ -0,0 +1,159 @@
+# Slice 8 Phase A — agent↔controller channel + controller deploy plumbing: Findings
+
+**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge `vmbr0`,
+LAN DHCP (router 192.168.0.1). The host's **`vmbr0` IP = 192.168.0.162** (its LAN address — the
+guest reaches the agent here).
+**Date:** 2026-06-10. **Driver:** SPIKE-RUNBOOK (root@pam for the throwaway stub + guest plumbing;
+the real bring-up job — `felhom-agent v0.9.0` — to provision the spike guest).
+**VMID:** spike guest `8200` (torn down). Fixed port **8443**.
+
+> This document presents **data, observations, and design consequences**. It de-risks and feeds the
+> **8A spec** (the real local-API server + the 7 §6 endpoints) and the **provisioning back-half**
+> (deploy + per-guest token mint + bootstrap). The test local-API token and the registry pull
+> credential are **secrets** — referenced by location, **redacted** here.
+
+---
+
+## 0. Setup / provenance
+
+| Component | Value |
+|---|---|
+| Host `vmbr0` IP : port | `192.168.0.162:8443` (nothing else bound there pre-spike) |
+| Controller image | `gitea.dooplex.hu/admin/felhom-controller:v0.34.0` (registry has 44 tags; latest is v0.34.0) |
+| Registry pull cred | Gitea token — k8s `secret/gitea-creds` (user `admin`), **by reference** (never echoed/committed) |
+| Spike guest 8200 | provisioned by the **real bring-up job** from golden `local:backup/vzdump-lxc-9100-2026_06_09-21_32_58.tar.zst` (`-mode provision -keep`) |
+| Guest 8200 facts | DHCP IP `192.168.0.145`, fresh MAC `BC:24:11:59:F2:DD`, `features: nesting=1,keyctl=1`, **Docker 29.5.3 active** |
+
+The bring-up job confirmed re-usable as the spike's guest factory: `Pass:true`, `Verified:"boot+running"`,
+8s, fresh MAC — the slice-7 primitive delivered a golden, link-up, Docker-ready guest unchanged.
+
+---
+
+## 1. The channel (guest → host HTTPS over the bridge, fingerprint-pinned) — **PASS**
+
+Throwaway HTTPS stub on `192.168.0.162:8443` (self-signed; `GET /storage`; the stub never logs the
+`Authorization` header). Two tokens: one scoped to guest 8200, one scoped to a *different* guest.
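
The stub's certificate is identified by two handles: a SHA-256 over the whole leaf certificate, and a SHA-256 over its SubjectPublicKeyInfo (the form curl takes after `sha256//`). The following is a minimal Go sketch, assuming only the standard library, of how both could be derived from the same certificate. The helper names (`leafPin`, `spkiPin`, `selfSigned`) and the throwaway certificate are illustrative; it is not the spike's stub or the controller's real pinning code.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"math/big"
	"strings"
	"time"
)

// leafPin returns SHA-256 over the full DER certificate, rendered as
// colon-separated upper-case hex (the "leaf-cert SHA-256" form).
func leafPin(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.Raw)
	h := strings.ToUpper(hex.EncodeToString(sum[:]))
	parts := make([]string, 0, len(sum))
	for i := 0; i < len(h); i += 2 {
		parts = append(parts, h[i:i+2])
	}
	return strings.Join(parts, ":")
}

// spkiPin returns base64(SHA-256(SubjectPublicKeyInfo)), the value that
// follows "sha256//" in curl's --pinnedpubkey option.
func spkiPin(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return base64.StdEncoding.EncodeToString(sum[:])
}

// selfSigned builds a throwaway certificate for this demonstration only
// (hypothetical subject name; unrelated to the spike's certificate).
func selfSigned() (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "local-api-stub.example"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := selfSigned()
	if err != nil {
		panic(err)
	}
	fmt.Println("leaf-cert SHA-256:", leafPin(cert))
	fmt.Println("SPKI pin: sha256//" + spkiPin(cert))
}
```

The leaf-cert form changes whenever the certificate is reissued, while the SPKI form survives reissue as long as the key pair is kept; that difference may be relevant when 8A chooses between them.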
+
+| Cert handle | Value (public; not secret) |
+|---|---|
+| Leaf-cert SHA-256 | `CC:7B:03:DC:0F:FA:AC:94:C8:79:35:50:03:3F:FC:CF:CB:2B:49:AE:A7:8A:7D:7C:C7:49:80:9E:3D:EB:92:BC` |
+| SPKI pubkey SHA-256 (curl `--pinnedpubkey sha256//`) | `uSSmg6cuEJj9CF7hiBdQ5OEJKOs0NszXJXjRNBwq8DM=` |
+
+From **inside guest 8200** (`curl -k --pinnedpubkey sha256//`, token read from a file — value
+never on the command line):
+
+| # | Case | Expected | Result |
+|---|---|---|---|
+| T1 | correct pin + guest-8200 token | 200 | **HTTP 200** (`{"storage":"ok","guest":8200}`) |
+| T2 | correct pin + **no** token | 401 | **HTTP 401** |
+| T3 | correct pin + **other-guest** token | 403 | **HTTP 403** (self-scoping holds) |
+| T4 | **wrong** pin + valid token | TLS failure | **HTTP 000, curl exit 90** (`CURLE_SSL_PINNEDPUBKEYNOTMATCH`) — the pin gates the handshake before any request is sent |
+
+**Reachability / firewall:** **no rule needed.** PVE firewall is **off** by default on this demo
+(no `cluster.fw` / `host.fw` / `8200.fw`; host `iptables INPUT policy ACCEPT`, `nft` empty). Guest
+and host share the `vmbr0` L2 segment (192.168.0.0/24); the guest's route to the host is direct
+(`192.168.0.162 dev eth0 src 192.168.0.145`).
+
+**Security observation (design consequence):** the local-API binds the host's **LAN IP**, so it is
+reachable by *anything on the LAN*, not just guests on the bridge — network isolation does **not**
+gate it. The pin + bearer + self-scoping are the *only* gate, and at the plumbing level they held
+airtight. The back-half should still consider narrowing exposure (bind to the bridge subnet and/or a
+PVE firewall ACCEPT limited to the guest subnet → DROP otherwise) as defence-in-depth.
+
+**Pin form:** curl validated the **SPKI** (`--pinnedpubkey`). The agent's existing convention is
+**leaf-cert SHA-256** pinning. Both fingerprints are captured above; **8A picks one** for the Go
+controller's pin (leaf-cert SHA-256 is the lower-friction match to the agent's PVE-cert pinning).
+
+---
+
+## 2. The deploy plumbing (no `pct exec` — host-side mount + golden-baked unit) — **PASS**
+
+Validates the F3 principle end-to-end: the **agent stays host-side**, populates a config mount; a
+**golden-baked oneshot** does the guest-side work.
+
+**Config mount (host-side, agent-simulated):** a host dir bind-mounted **read-only** at
+`/etc/felhom-bootstrap` (`pct set 8200 -mp0 ,mp=/etc/felhom-bootstrap,ro=1`) carrying
+`bootstrap.json` = `{ hub_url, host_id, local_api_endpoint, local_api_pin_spki_sha256,
+local_api_token, registry{host,username,token}, controller_image }`.
+
+- **Hotplugged live** — the bind mount appeared inside the running guest with **no restart**.
+- **GOTCHA (unprivileged uid mapping):** host files must be `chown 100000:100000` so they appear as
+  `root:root 0600` inside the guest (host uid 0 maps to guest `nobody`, leaving the secret config
+  unreadable otherwise). The provisioning back-half's mount-populate step **must chown to the
+  container's mapped root**. Verified: after the chown, the guest saw `bootstrap.json` as
+  `-rw------- root root`.
+
+**Golden-baked bootstrap unit** (`felhom-controller-bootstrap.service`, oneshot, `RemainAfterExit`,
+`ConditionPathExists=/etc/felhom-bootstrap/bootstrap.json`, `After=docker.service
+network-online.target`) → `/usr/local/sbin/felhom-controller-bootstrap.sh`:
+`docker login` (token piped via `--password-stdin`, never echoed) → `docker pull` → `docker run`.
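
The bootstrap flow described above (read `bootstrap.json`, then login, pull, run) can be sketched in Go as a pure planning step. This is an illustration only: the spike's unit runs a shell script whose contents are not reproduced here. The string field types, the `Step` and `plan` names, the sample values, and the `docker run` flags and container name are all assumptions. The point shown is that the registry token travels on stdin and never appears in argv.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Bootstrap mirrors the bootstrap.json keys listed in the findings.
// All field types are assumed to be strings for this sketch.
type Bootstrap struct {
	HubURL           string `json:"hub_url"`
	HostID           string `json:"host_id"`
	LocalAPIEndpoint string `json:"local_api_endpoint"`
	LocalAPIPinSPKI  string `json:"local_api_pin_spki_sha256"`
	LocalAPIToken    string `json:"local_api_token"`
	Registry         struct {
		Host     string `json:"host"`
		Username string `json:"username"`
		Token    string `json:"token"`
	} `json:"registry"`
	ControllerImage string `json:"controller_image"`
}

// Step is one command the oneshot would run. Stdin carries any secret so
// that it is never placed on the command line.
type Step struct {
	Argv  []string
	Stdin string
}

// plan turns bootstrap.json into the login -> pull -> run sequence.
func plan(raw []byte) ([]Step, error) {
	var b Bootstrap
	if err := json.Unmarshal(raw, &b); err != nil {
		return nil, fmt.Errorf("parse bootstrap.json: %w", err)
	}
	if b.Registry.Host == "" || b.Registry.Token == "" || b.ControllerImage == "" {
		return nil, fmt.Errorf("bootstrap.json: registry host, registry token and controller_image are required")
	}
	return []Step{
		{
			Argv:  []string{"docker", "login", "--username", b.Registry.Username, "--password-stdin", b.Registry.Host},
			Stdin: b.Registry.Token,
		},
		{Argv: []string{"docker", "pull", b.ControllerImage}},
		// Assumption: the container name and flags are placeholders; the
		// findings do not record the spike's actual `docker run` flags.
		{Argv: []string{"docker", "run", "--detach", "--name", "felhom-controller", b.ControllerImage}},
	}, nil
}

func main() {
	// Hypothetical sample values, not the spike's real registry or token.
	raw := []byte(`{"registry":{"host":"registry.example","username":"puller","token":"REDACTED"},
		"controller_image":"registry.example/felhom-controller:v0.34.0"}`)
	steps, err := plan(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range steps {
		fmt.Println(strings.Join(s.Argv, " "))
	}
}
```

Keeping the planning separate from execution makes the secret-handling property checkable without Docker: a test can assert that no argv element equals the token.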
+
+| Step | Result |
+|---|---|
+| `docker login gitea.dooplex.hu` (admin + pull token, from the mount) | **Login Succeeded** |
+| `docker pull …/felhom-controller:v0.34.0` (guest→registry) | **Downloaded** (digest `sha256:463733a1…`) — registry creds + guest egress both work |
+| Unit fired + finished | `active` (RemainAfterExit); journal clean; **no `pct exec`** used |
+| Controller container | **Up (healthy)**, real `v0.34.0` |
+| **Tie-to-S1:** in-guest process reads the bootstrap token from the mount → host `/storage` | **HTTP 200** |
+
+**Controller boot (informational):** the container came up in **setup mode** (`[INFO]
+felhom-controller v0.34.0 — setup mode`, setup wizard on :8080/:8081) because it looks for
+`/opt/docker/felhom-controller/controller.yaml` and the spike mounted `/config/bootstrap.json`. The
+container *running and healthy* is the spike's success criterion; **full self-configuration is an 8A
+concern** (see gotcha 3).
+
+---
+
+## 3. Gotchas (carry into 8A / the back-half)
+
+1. **Unprivileged-LXC uid mapping for the config mount** — the agent must `chown 100000:100000` (the
+   container's mapped root) the files it writes into the mount, or the guest reads them as `nobody`
+   and the secret config is inaccessible. (Bind mount itself hotplugs fine, no restart.)
+2. **Registry-cred distribution** — the bootstrap currently carries the **shared `admin` pull token**
+   into every guest's mount. For production this should be a **narrow, read-only, ideally per-guest /
+   short-lived registry token** (the mount is the right delivery channel; the cred's *scope* is the
+   issue). Treat as a back-half decision.
+3. **Controller config contract mismatch** — `bootstrap.json` (this spike's shape/path) ≠ the
+   controller's expected `controller.yaml` at `/opt/docker/felhom-controller/`. 8A must either (a)
+   emit the controller's real config format at the path it reads, or (b) have the bootstrap unit
+   translate `bootstrap.json` → `controller.yaml`. Until then the controller boots to *setup mode*.
+4. **Pin form** — SPKI (validated by curl) vs leaf-cert SHA-256 (agent convention). 8A picks one for
+   the Go controller; both fingerprints captured in §1.
+5. **LAN exposure** — §1's security observation: the local-API is on the host LAN IP, gated by
+   auth only. Consider bridge-bind / firewall narrowing in the back-half.
+
+---
+
+## 4. Verdict — **GO** to spec 8A + the provisioning back-half
+
+Both unvalidated foundations are proven at the plumbing level:
+
+- **Channel (doc §6 transport):** guest→host over `vmbr0` works with **no firewall rule** on this
+  demo; **fingerprint-pinning gates** the handshake (wrong pin = hard TLS failure); **bearer +
+  self-scoping** behave (200 / 401 / 403). → 8A can spec the real local-API server + the 7 §6
+  endpoints with confidence in the transport.
+- **Deploy:** the **config-mount + golden-baked bootstrap unit** cleanly deploys *and* configures the
+  controller **without `pct exec`** (F3 principle holds); `docker login`+`pull` from the guest with a
+  Gitea pull token works; the controller runs healthy and an in-guest process reaches the host
+  endpoint with its bootstrap token. → the provisioning back-half can adopt this mechanism (mount +
+  baked unit + per-guest token mint), addressing gotchas 1–3.
+
+---
+
+## Out of scope (noted, not built here)
+
+- The **real local-API server** + the 7 §6 endpoints, the per-guest token→guest map and self-scoping
+  *enforcement* → **8A spec**.
+- The **provisioning back-half** proper (agent mints the per-guest token, writes the bootstrap mount,
+  the controller-bootstrap unit as a permanent golden-recipe addition + the config-format alignment
+  of gotcha 3) → **8A spec**, informed by this spike.
+- **Quiesced app-consistent backup** (stack-stop contract) → **8B**.
+- **Controller de-privileging** (retire the disk-*execution* subsystem; bind `GET /storage`; new
+  customer disk-management endpoints behind the slice-4 data-bearing classifier) → **8C**.
+
+## Secret handling (held)
+
+The test local-API tokens and the registry pull credential were kept in `0600` files on the host,
+referenced by location, **never** logged or committed; the stub never logged the `Authorization`
+header; `docker login` used `--password-stdin`. No real per-guest token or registry cred appears in
+git. Only public cert fingerprints are recorded above.