felhom.eu/documentation/tests/slice8a-channel-deploy-spike-findings.md
admin 4a81a96678 slice 8A spike: agent<->controller channel + controller deploy plumbing findings
Doc-only spike (no hub code change). Validated on demo-felhom (guest 8200,
torn down): (1) guest->host HTTPS over vmbr0 with fingerprint-pin + bearer +
self-scoping (200/401/403, wrong-pin TLS fail, no firewall rule needed);
(2) config-mount + golden-baked bootstrap unit deploys+runs the controller
(docker login/pull/run v0.34.0) with no pct exec. Verdict: GO to 8A spec.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:57:48 +02:00

# Slice 8 Phase A — agent↔controller channel + controller deploy plumbing: Findings
**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge `vmbr0`,
LAN DHCP (router 192.168.0.1). The host's **`vmbr0` IP = 192.168.0.162** (its LAN address — the
guest reaches the agent here).
**Date:** 2026-06-10. **Driver:** SPIKE-RUNBOOK (root@pam for the throwaway stub + guest plumbing;
the real bring-up job — `felhom-agent v0.9.0` — to provision the spike guest).
**VMID:** spike guest `8200` (torn down). Fixed port **8443**.
> This document presents **data, observations, and design consequences**. It de-risks and feeds the
> **8A spec** (the real local-API server + the 7 §6 endpoints) and the **provisioning back-half**
> (deploy + per-guest token mint + bootstrap). The test local-API token and the registry pull
> credential are **secrets** — referenced by location, **redacted** here.
---
## 0. Setup / provenance
| Component | Value |
|---|---|
| Host `vmbr0` IP : port | `192.168.0.162:8443` (nothing else bound there pre-spike) |
| Controller image | `gitea.dooplex.hu/admin/felhom-controller:v0.34.0` (registry has 44 tags; latest is v0.34.0) |
| Registry pull cred | Gitea token — k8s `secret/gitea-creds` (user `admin`), **by reference** (never echoed/committed) |
| Spike guest 8200 | provisioned by the **real bring-up job** from golden `local:backup/vzdump-lxc-9100-2026_06_09-21_32_58.tar.zst` (`-mode provision -keep`) |
| Guest 8200 facts | DHCP IP `192.168.0.145`, fresh MAC `BC:24:11:59:F2:DD`, `features: nesting=1,keyctl=1`, **Docker 29.5.3 active** |
The bring-up job was confirmed re-usable as the spike's guest factory: `Pass:true`, `Verified:"boot+running"`,
8s, fresh MAC — the slice-7 primitive delivered a golden, link-up, Docker-ready guest unchanged.
---
## 1. The channel (guest → host HTTPS over the bridge, fingerprint-pinned) — **PASS**
Throwaway HTTPS stub on `192.168.0.162:8443` (self-signed; `GET /storage`; the stub never logs the
`Authorization` header). Two tokens: one scoped to guest 8200, one scoped to a *different* guest.
| Cert handle | Value (public; not secret) |
|---|---|
| Leaf-cert SHA-256 | `CC:7B:03:DC:0F:FA:AC:94:C8:79:35:50:03:3F:FC:CF:CB:2B:49:AE:A7:8A:7D:7C:C7:49:80:9E:3D:EB:92:BC` |
| SPKI pubkey SHA-256 (curl `--pinnedpubkey sha256//`) | `uSSmg6cuEJj9CF7hiBdQ5OEJKOs0NszXJXjRNBwq8DM=` |
From **inside guest 8200** (`curl -k --pinnedpubkey sha256//<spki>`, token read from a file — value
never on the command line):
| # | Case | Expected | Result |
|---|---|---|---|
| T1 | correct pin + guest-8200 token | 200 | **HTTP 200** (`{"storage":"ok","guest":8200}`) |
| T2 | correct pin + **no** token | 401 | **HTTP 401** |
| T3 | correct pin + **other-guest** token | 403 | **HTTP 403** (self-scoping holds) |
| T4 | **wrong** pin + valid token | TLS failure | **HTTP 000, curl exit 90** (`CURLE_SSL_PINNEDPUBKEYNOTMATCH`) — the pin gates the handshake before any request is sent |
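
The 200 / 401 / 403 behaviour in T1 to T3 can be sketched as a small decision function. This is a minimal illustration, not the stub's actual code: the `authorize` name, the token-to-VMID map and the way the target guest is passed in are all assumptions, since the spike does not record how the stub resolves which guest a request concerns.

```python
import hmac
from http import HTTPStatus
from typing import Mapping, Optional


def authorize(auth_header: Optional[str], requested_vmid: int,
              token_to_vmid: Mapping[str, int]) -> HTTPStatus:
    """Return 200, 401 or 403 for a request concerning `requested_vmid`.

    Hypothetical helper: `token_to_vmid` stands in for the per-guest
    token map that 8A is expected to specify.
    """
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return HTTPStatus.UNAUTHORIZED  # T2: no token presented
    presented = auth_header[len(prefix):].encode()

    owner = None
    for token, vmid in token_to_vmid.items():
        # Constant-time comparison so timing does not leak token contents.
        if hmac.compare_digest(token.encode(), presented):
            owner = vmid
    if owner is None:
        return HTTPStatus.UNAUTHORIZED  # unknown token
    if owner != requested_vmid:
        return HTTPStatus.FORBIDDEN     # T3: valid token, wrong guest
    return HTTPStatus.OK                # T1: token scoped to this guest
```

T4 never reaches this function: the pin mismatch aborts the TLS handshake on the client side before any request is sent.
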
**Reachability / firewall:** **no rule needed.** PVE firewall is **off** by default on this demo
(no `cluster.fw` / `host.fw` / `8200.fw`; host `iptables INPUT policy ACCEPT`, `nft` empty). Guest
and host share the `vmbr0` L2 segment (192.168.0.0/24); the guest's route to the host is direct
(`192.168.0.162 dev eth0 src 192.168.0.145`).
**Security observation (design consequence):** the local-API binds the host's **LAN IP**, so it is
reachable by *anything on the LAN*, not just guests on the bridge — network isolation does **not**
gate it. The pin + bearer + self-scoping are the *only* gate, and at the plumbing level they held
airtight. The back-half should still consider narrowing exposure (bind to the bridge subnet and/or a
PVE firewall ACCEPT limited to the guest subnet → DROP otherwise) as defence-in-depth.
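
As a rough illustration of the narrowing idea, the local-API could also reject peers outside an allow-list before checking the bearer token. The sketch below is an assumption about one possible shape, not a decided design; `peer_allowed` is a hypothetical name. Note that on this demo guests and LAN clients share 192.168.0.0/24, so a subnet-wide rule would not separate them; the example therefore allow-lists the single guest address.

```python
import ipaddress
from typing import Iterable


def peer_allowed(peer_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """True if `peer_ip` falls inside any of the allowed networks.

    Hypothetical pre-auth check; it complements the pin + bearer +
    self-scoping gate and does not replace it.
    """
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Example with the spike guest's DHCP address as a /32 entry.
ALLOWED = ["192.168.0.145/32"]
print(peer_allowed("192.168.0.145", ALLOWED))
print(peer_allowed("192.168.0.50", ALLOWED))
```

Because the guest address here comes from DHCP, a real deployment would need a stable address or a dedicated guest subnet for such a rule to be maintainable.
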
**Pin form:** curl validated the **SPKI** (`--pinnedpubkey`). The agent's existing convention is
**leaf-cert SHA-256** pinning. Both fingerprints are captured above; **8A picks one** for the Go
controller's pin (leaf-cert SHA-256 is the lower-friction match to the agent's PVE-cert pinning).
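
The two pin forms differ in what is hashed and how the digest is encoded. The sketch below shows both encodings and a leaf-cert comparison, assuming the DER bytes are already in hand. The function names are illustrative, and extracting the SPKI from a certificate is left out because the Python standard library has no call for it.

```python
import base64
import hashlib
import hmac


def leaf_fingerprint(cert_der: bytes) -> str:
    """SHA-256 of the whole DER certificate, as colon-separated upper-case hex."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def spki_pin(spki_der: bytes) -> str:
    """SHA-256 of the DER SubjectPublicKeyInfo, base64-encoded.

    This is the value curl expects after `sha256//` in --pinnedpubkey.
    """
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def leaf_matches(cert_der: bytes, pinned: str) -> bool:
    """Compare a presented certificate against a pinned leaf fingerprint."""
    want = pinned.replace(":", "").lower().encode()
    got = hashlib.sha256(cert_der).hexdigest().encode()
    return hmac.compare_digest(got, want)
```

On a live connection, `ssl.SSLSocket.getpeercert(binary_form=True)` returns the peer's leaf certificate as DER, which can be passed to `leaf_matches` directly. A leaf-cert pin changes whenever the certificate is reissued, while an SPKI pin survives reissue as long as the key pair is kept, which may matter when 8A picks one.
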
---
## 2. The deploy plumbing (no `pct exec` — host-side mount + golden-baked unit) — **PASS**
Validates the F3 principle end-to-end: the **agent stays host-side**, populates a config mount; a
**golden-baked oneshot** does the guest-side work.
**Config mount (host-side, agent-simulated):** a host dir bind-mounted **read-only** at
`/etc/felhom-bootstrap` (`pct set 8200 -mp0 <hostdir>,mp=/etc/felhom-bootstrap,ro=1`) carrying
`bootstrap.json` = `{ hub_url, host_id, local_api_endpoint, local_api_pin_spki_sha256,
local_api_token, registry{host,username,token}, controller_image }`.
- **Hotplugged live** — the bind mount appeared inside the running guest with **no restart**.
- **GOTCHA (unprivileged uid mapping):** host files must be `chown 100000:100000` so they appear as
`root:root 0600` inside the guest (host uid 0 maps to guest `nobody`, leaving the secret config
unreadable otherwise). The provisioning back-half's mount-populate step **must chown to the
container's mapped root**. Verified: after the chown, the guest saw `bootstrap.json` as
`-rw------- root root`.
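
A minimal sketch of the mount-populate step with the uid-mapping fix applied. It assumes the default unprivileged mapping, where guest id 0 is host id 100000; a custom `lxc.idmap` would change the base. `write_bootstrap` and `mapped_host_id` are hypothetical names, not the agent's API.

```python
import json
import os
from typing import Optional

SUBID_BASE = 100000  # assumption: default unprivileged-LXC id offset


def mapped_host_id(guest_id: int, base: int = SUBID_BASE) -> int:
    """Host-side uid/gid that appears as `guest_id` inside the container."""
    return base + guest_id


def write_bootstrap(path: str, config: dict,
                    chown_to: Optional[int] = None) -> None:
    """Write `config` as JSON with mode 0600, optionally chowning it.

    Passing chown_to=mapped_host_id(0) makes the file root:root in the
    guest; the chown itself needs root on the host.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Tighten the mode before any secret is written, in case the
        # file already existed with looser permissions.
        os.fchmod(handle.fileno(), 0o600)
        json.dump(config, handle)
    if chown_to is not None:
        os.chown(path, chown_to, chown_to)
```

On the host, the agent-side call would look like `write_bootstrap(hostdir + "/bootstrap.json", cfg, chown_to=mapped_host_id(0))`, where `hostdir` is the directory passed to `pct set ... -mp0`.
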
**Golden-baked bootstrap unit** (`felhom-controller-bootstrap.service`, oneshot, `RemainAfterExit`,
`ConditionPathExists=/etc/felhom-bootstrap/bootstrap.json`, `After=docker.service
network-online.target`) → `/usr/local/sbin/felhom-controller-bootstrap.sh`:
`docker login` (token piped via `--password-stdin`, never echoed) → `docker pull` → `docker run`.
| Step | Result |
|---|---|
| `docker login gitea.dooplex.hu` (admin + pull token, from the mount) | **Login Succeeded** |
| `docker pull …/felhom-controller:v0.34.0` (guest→registry) | **Downloaded** (digest `sha256:463733a1…`) — registry creds + guest egress both work |
| Unit fired + finished | `active` (RemainAfterExit); journal clean; **no `pct exec`** used |
| Controller container | **Up (healthy)**, real `v0.34.0` |
| **Tie-to-S1:** in-guest process reads the bootstrap token from the mount → host `/storage` | **HTTP 200** |
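
The login → pull → run sequence can be sketched as follows. The real bootstrap is a shell script whose contents are not reproduced in this document, so this Python rendering is only an illustration: the container name and the `docker run` flags are assumptions, and only the `registry` and `controller_image` keys come from the `bootstrap.json` shape described above.

```python
import json
import subprocess
from typing import List, Optional, Tuple

Step = Tuple[List[str], Optional[str]]  # (argv, data to feed on stdin)


def load_config(path: str) -> dict:
    """Read bootstrap.json from the read-only config mount."""
    with open(path) as handle:
        return json.load(handle)


def build_steps(cfg: dict) -> List[Step]:
    """Build the three docker invocations without putting the token in argv."""
    reg = cfg["registry"]
    image = cfg["controller_image"]
    login = ["docker", "login", "--username", reg["username"],
             "--password-stdin", reg["host"]]
    pull = ["docker", "pull", image]
    # Assumption: detached run with a fixed name; the real mounts and
    # ports are not stated in the spike notes.
    run = ["docker", "run", "-d", "--name", "felhom-controller", image]
    return [(login, reg["token"]), (pull, None), (run, None)]


def run_steps(steps: List[Step]) -> None:
    """Execute each step in order, stopping at the first failure."""
    for argv, stdin_data in steps:
        subprocess.run(argv, input=stdin_data, text=True, check=True)
```

Keeping the token out of `argv` means it never shows up in the process list or the unit's journal, which is the same property the spike relied on with `--password-stdin`.
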
**Controller boot (informational):** the container came up in **setup mode** (`[INFO]
felhom-controller v0.34.0 — setup mode`, setup wizard on :8080/:8081) because it looks for
`/opt/docker/felhom-controller/controller.yaml` and the spike mounted `/config/bootstrap.json`. The
container *running and healthy* is the spike's success criterion; **full self-configuration is an 8A
concern** (see gotcha 3).
---
## 3. Gotchas (carry into 8A / the back-half)
1. **Unprivileged-LXC uid mapping for the config mount** — the agent must `chown 100000:100000` (the
container's mapped root) the files it writes into the mount, or the guest reads them as `nobody`
and the secret config is inaccessible. (Bind mount itself hotplugs fine, no restart.)
2. **Registry-cred distribution** — the bootstrap currently carries the **shared `admin` pull token**
into every guest's mount. For production this should be a **narrow, read-only, ideally per-guest /
short-lived registry token** (the mount is the right delivery channel; the cred's *scope* is the
issue). Treat as a back-half decision.
3. **Controller config contract mismatch:** `bootstrap.json` (this spike's shape/path) ≠ the
controller's expected `controller.yaml` at `/opt/docker/felhom-controller/`. 8A must either (a)
emit the controller's real config format at the path it reads, or (b) have the bootstrap unit
translate `bootstrap.json` → `controller.yaml`. Until then the controller boots to *setup mode*.
4. **Pin form** — SPKI (validated by curl) vs leaf-cert SHA-256 (agent convention). 8A picks one for
the Go controller; both fingerprints captured in §1.
5. **LAN exposure** — §1's security observation: the local-API is on the host LAN IP, gated by
auth only. Consider bridge-bind / firewall narrowing in the back-half.
---
## 4. Verdict — **GO** to spec 8A + the provisioning back-half
Both unvalidated foundations are proven at the plumbing level:
- **Channel (doc §6 transport):** guest→host over `vmbr0` works with **no firewall rule** on this
demo; **fingerprint-pinning gates** the handshake (wrong pin = hard TLS failure); **bearer +
self-scoping** behave (200 / 401 / 403). → 8A can spec the real local-API server + the 7 §6
endpoints with confidence in the transport.
- **Deploy:** the **config-mount + golden-baked bootstrap unit** cleanly deploys *and* runs the
controller **without `pct exec`** (F3 principle holds); `docker login`+`pull` from the guest with a
Gitea pull token works; the controller runs healthy and an in-guest process reaches the host
endpoint with its bootstrap token. → the provisioning back-half can adopt this mechanism (mount +
baked unit + per-guest token mint), addressing gotchas 1–3.
---
## Out of scope (noted, not built here)
- The **real local-API server** + the 7 §6 endpoints, the per-guest token→guest map and self-scoping
*enforcement* → **8A spec**.
- The **provisioning back-half** proper (agent mints the per-guest token, writes the bootstrap mount,
the controller-bootstrap unit as a permanent golden-recipe addition + the config-format alignment
of gotcha 3) → **8A spec**, informed by this spike.
- **Quiesced app-consistent backup** (stack-stop contract) → **8B**.
- **Controller de-privileging** (retire the disk-*execution* subsystem; bind `GET /storage`; new
customer disk-management endpoints behind the slice-4 data-bearing classifier) → **8C**.
## Secret handling (held)
The test local-API tokens and the registry pull credential were kept in `0600` files on the host,
referenced by location, **never** logged or committed; the stub never logged the `Authorization`
header; `docker login` used `--password-stdin`. No real per-guest token or registry cred appears in
git. Only public cert fingerprints are recorded above.