diff --git a/REPORT.md b/REPORT.md index c5beaf9..1cc2f05 100644 --- a/REPORT.md +++ b/REPORT.md @@ -4,49 +4,42 @@ --- -# REPORT — Slice 8 Phase A spike: agent↔controller channel + controller deploy plumbing (2026-06-10) +# REPORT — Slice 8C docs: controller de-privileging + disk classifier (slice 8 CLOSED) (2026-06-10) ## Type -**SPIKE** (CC-executed on the demo). Doc-only deliverable — **no hub/code change, no version bump, -no deploy**. Probes the two unvalidated foundations of slice 8 *before* speccing the local API -(doc §6) and the provisioning back-half. Findings: -[documentation/tests/slice8a-channel-deploy-spike-findings.md](documentation/tests/slice8a-channel-deploy-spike-findings.md). +Documentation update for **slice 8C** (the implementation is in `felhom-agent` v0.12.0 + +`felhom-controller` v0.37.0; no hub change). Slice 8 is now **CLOSED**. -## What was proven on `demo-felhom` +## What changed (doc 03 — host-agent) -Spike guest **8200** was produced by the **real slice-7 bring-up job** (`felhom-agent v0.9.0`, -`-mode provision`) from the golden archive — a golden, link-up, Docker-29.5.3 guest in 8s, fresh MAC. -Torn down at the end; demo left as found (only pre-existing 9001/9999 remain; golden archive intact). +- **§6** — added the disk-management endpoints (`GET /disks`, `POST /disks/{assign,eject,format}`) + and **reframed the principle**: a controller may do *non-data-destructive* storage setup self-serve + (list / assign / eject / format-blank); **anything that can lose customer data stays + operator-signed (§4)**, with the **classifier (agent-internal device inspection)** as the enforcer. + The 8C invariant: the agent decides data-bearing-ness by inspecting the device itself, never the + caller's claim; a data-bearing format → `ClassStorageWipe` → gate → `pending_signature` (signed + completion is slice 10). Marked **implemented**. +- **§4** — added: data-bearing-ness is **agent-internal evidence, never the caller's claim** + (mirrors the agent-internal scratch-provenance rule); destructive completion → slice 10. +- **§9 slice table** — **8C implemented → slice 8 CLOSED**: agent v0.12.0 (`/disks` + classifier + gate + `mkfs`); controller v0.37.0 (~12.3k LOC disk-execution retired, `backup.Manager` split to + app-data, disk mgmt rewired to the agent, container de-privileged). §13 + doc changelog updated. -### 1. The channel (guest → host HTTPS over `vmbr0`, fingerprint-pinned) — **PASS** -A throwaway self-signed HTTPS stub on `192.168.0.162:8443`, hit from **inside guest 8200**: -- correct pin + guest-8200 token → **200**; no token → **401**; **other-guest** token → **403** - (self-scoping holds); **wrong pin → hard TLS failure** (curl exit 90 — the pin gates the handshake). -- **No firewall rule needed** (PVE firewall off; guest and host share the `vmbr0` /24, direct route). -- Security note: the local-API binds the host **LAN IP** → reachable by anything on the LAN; **auth - is the only gate** (it held). Both pin forms captured (SPKI + leaf-cert SHA-256) for the 8A choice. +## What changed (doc 02 — controller module map) -### 2. The deploy plumbing (no `pct exec` — config mount + golden-baked unit) — **PASS** -The F3 principle end-to-end: agent stays host-side, populates a **read-only config mount** -(`/etc/felhom-bootstrap`, bind-mount hotplugged live); a **golden-baked oneshot** reads it → -`docker login` (token via `--password-stdin`) → `docker pull …/felhom-controller:v0.34.0` → -`docker run`. The controller came up **Up (healthy)**; an in-guest process read the bootstrap token -from the mount and reached the host `/storage` → **200**. **No `pct exec` used.** +- Added an **EXECUTED** banner: the map's target state is realized — the disk subsystem is deleted, + `backup.Manager` split, disk mgmt rewired to the agent, the container de-privileged. The in-guest + controller is now Docker-only with no disk/Proxmox privileges. -## Gotchas carried into 8A / the back-half -1. **Unprivileged-LXC uid mapping** — the agent must `chown 100000:100000` files it writes into the - mount (else the guest reads them as `nobody`; the secret config is inaccessible). -2. **Registry-cred scope** — the bootstrap currently carries the shared `admin` pull token; production - wants a narrow, read-only, ideally per-guest/short-lived registry token (mount is the right channel). -3. **Controller config contract** — `bootstrap.json` ≠ the controller's `controller.yaml`; the - controller boots to *setup mode* until 8A emits the real config format/path (or the unit translates). -4. **Pin form** (SPKI vs leaf-cert SHA-256) and **LAN exposure** narrowing — 8A/back-half decisions. +## Live validation (cross-repo, on the demo) -## Verdict — **GO** to spec 8A (local-API server + the 7 §6 endpoints) and the provisioning back-half. +A provisioned **de-privileged** controller v0.37.0 (`Privileged=false`; mounts only bootstrap + data ++ docker.sock) drove the agent disk API: `GET /disks` returned data-bearing flags, and a +**data-bearing format was refused** (`pending_signature`, nothing formatted) — the security +centerpiece, proven live. See the agent + controller REPORTs. -## Secret handling (held) -Test local-API tokens + the registry pull cred kept in `0600` host files, referenced by location, -never logged/committed; the stub never logged the `Authorization` header; `docker login` via -`--password-stdin`. No real per-guest token or registry cred in git. All scratch shredded on teardown. -No throwaway registry token was minted (the existing `gitea-creds` admin cred was used by reference). +## Deferred + +The operator-signed completion of a data-bearing wipe/format → **slice 10**. No hub change → no +deploy. No secrets committed. diff --git a/documentation/architecture/02-controller-module-map.md b/documentation/architecture/02-controller-module-map.md index 85d5f0a..a344db1 100644 --- a/documentation/architecture/02-controller-module-map.md +++ b/documentation/architecture/02-controller-module-map.md @@ -1,5 +1,13 @@ # Felhom Controller Architecture — Part 2: Controller Module Map +> **EXECUTED (slice 8C, 2026-06-10 — controller v0.37.0).** This map's target state is now realized: +> the disk-execution subsystem (`storage/*`, restic, cross-drive, drive-restore, `disk_layout`, +> `local_infra`, `infra_backup`, `setup/scanner`, `monitor/watchdog`+`pinger`, the storage UI) is +> **deleted** (~12.3k LOC); `backup.Manager` is **split to app-data only**; disk management is +> **rewired to the host agent's local API** (`web/agent_disk_handlers.go` → agent `/disks`); and the +> container is **de-privileged** (no `privileged`, `/dev`, `/etc/fstab`, rshared). The in-guest +> controller is now **Docker-only with no disk/Proxmox privileges**, as designed. See doc 03 §6/§9. + **Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source. **Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files, ~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and diff --git a/documentation/architecture/03-host-agent.md b/documentation/architecture/03-host-agent.md index af6f6a7..418cd36 100644 --- a/documentation/architecture/03-host-agent.md +++ b/documentation/architecture/03-host-agent.md @@ -77,6 +77,7 @@ by verb**: - **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate. - **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs). +- **Data-bearing-ness is agent-internal evidence, never a caller's claim (slice 8C).** For a customer-driven storage op (`POST /disks/format`, §6) the agent **inspects the actual device** (filesystem signature / partition table / partitions / mount, conservative — ambiguous → data-bearing) to decide the class. A blank device → benign self-serve `mkfs`; a data-bearing device → `ClassStorageWipe` → this gate → `pending_signature`. The **destructive completion of a data-bearing wipe is slice 10** (the operator-signed path); 8C refuses it. This mirrors the provenance rule above: just as the scratch tag is agent-internal (never hub-sourced), data-bearing-ness is agent-observed (never controller-asserted) — a compromised controller cannot relabel a data-bearing drive "blank" to walk the gate. - **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.) Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be @@ -116,9 +117,26 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive). - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8). - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI. + - **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**), + `POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST + /disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller + warns which apps lose that storage — benign), `POST /disks/format` (see the reframed principle + below). The controller is Docker-only (de-privileged, slice 8C); **execution is the agent's**. -Note what is *absent*: nothing here lets a controller touch another guest, the host, storage -attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4). +**The principle (reframed for 8C):** a controller may do **non-data-destructive** storage setup +**self-serve** (list, assign, eject, format a *blank* drive); **anything that can lose customer data +stays operator-signed (§4)**. The enforcer is the **classifier**: for `POST /disks/format` the agent +**inspects the actual device itself** (filesystem signature / partition table / partitions / mount — +agent-internal evidence, NEVER the caller's claim) and classifies conservatively (ambiguous → +data-bearing). A blank device → benign → `mkfs`. A data-bearing device → `ClassStorageWipe` → +destructive → the §4 gate → refused **`pending_signature`** (the operator-signed completion is slice +10). So a compromised controller asserting "this drive is blank" **cannot** wipe a data-bearing +drive — the 8C analog of self-scoping. **Status: implemented** (agent v0.12.0 `internal/localapi` + +`internal/storage`; controller v0.37.0 `internal/web/agent_disk_handlers.go`). + +Note what is *absent*: nothing here lets a controller touch **another guest**, the **host** beyond +this narrow disk surface, or **restore-overwrite**; and within the disk surface, **data-destructive** +power stays operator-signed (§4). Destructive/cross-guest power stays operator-signed. A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps token → guest and authorizes per guest, so a compromised controller's blast radius is @@ -381,7 +399,7 @@ this path — bring up + reattach external storage and it is whole. This is full | PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) | | **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. | | **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. | -| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred | +| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. | | **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) | | PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR | | Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) | @@ -461,6 +479,18 @@ This doc hands the implementation three contracts it was waiting on: ## Changelog — design-review + Phase-3 fold-in (2026-06-08) +### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10) +- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and + **reframed the principle** — a controller may do non-data-destructive storage setup self-serve; + **anything that can lose customer data stays operator-signed (§4)**, with the **classifier + (agent-internal device inspection)** as the enforcer. The 8C invariant: the agent decides + data-bearing-ness by **inspecting the device itself**, never the caller's claim; a data-bearing + format → `ClassStorageWipe` → gate → `pending_signature` (signed completion is slice 10). +- §9 slice table: **8C implemented — slice 8 CLOSED** (agent v0.12.0 `/disks` + classifier gate + + `mkfs`; controller v0.37.0 retired ~12.3k LOC of disk-execution + de-privileged + rewired to the + agent). The controller-side re-platform milestone: the in-guest controller is now Docker-only with + no disk/Proxmox privileges. + ### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10) - §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented** (controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`