doc 03 §6/§4/§9 + doc 02: slice 8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
§6: disk-management endpoints + reframed principle (non-data-destructive self-serve; data-destructive stays operator-signed; classifier = agent-internal device inspection).
§4: data-bearing-ness is agent-internal, never caller-claimed.
§9: 8C implemented, slice 8 CLOSED.
doc 02: EXECUTED banner.
Validated live (data-bearing format refused; de-privileged controller).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -4,49 +4,42 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
# REPORT — Slice 8 Phase A spike: agent↔controller channel + controller deploy plumbing (2026-06-10)
|
# REPORT — Slice 8C docs: controller de-privileging + disk classifier (slice 8 CLOSED) (2026-06-10)
|
||||||
|
|
||||||
## Type
|
## Type
|
||||||
|
|
||||||
**SPIKE** (CC-executed on the demo). Doc-only deliverable — **no hub/code change, no version bump,
|
Documentation update for **slice 8C** (the implementation is in `felhom-agent` v0.12.0 +
|
||||||
no deploy**. Probes the two unvalidated foundations of slice 8 *before* speccing the local API
|
`felhom-controller` v0.37.0; no hub change). Slice 8 is now **CLOSED**.
|
||||||
(doc §6) and the provisioning back-half. Findings:
|
|
||||||
[documentation/tests/slice8a-channel-deploy-spike-findings.md](documentation/tests/slice8a-channel-deploy-spike-findings.md).
|
|
||||||
|
|
||||||
## What was proven on `demo-felhom`
|
## What changed (doc 03 — host-agent)
|
||||||
|
|
||||||
Spike guest **8200** was produced by the **real slice-7 bring-up job** (`felhom-agent v0.9.0`,
|
- **§6** — added the disk-management endpoints (`GET /disks`, `POST /disks/{assign,eject,format}`)
|
||||||
`-mode provision`) from the golden archive — a golden, link-up, Docker-29.5.3 guest in 8s, fresh MAC.
|
and **reframed the principle**: a controller may do *non-data-destructive* storage setup self-serve
|
||||||
Torn down at the end; demo left as found (only pre-existing 9001/9999 remain; golden archive intact).
|
(list / assign / eject / format-blank); **anything that can lose customer data stays
|
||||||
|
operator-signed (§4)**, with the **classifier (agent-internal device inspection)** as the enforcer.
|
||||||
|
The 8C invariant: the agent decides data-bearing-ness by inspecting the device itself, never the
|
||||||
|
caller's claim; a data-bearing format → `ClassStorageWipe` → gate → `pending_signature` (signed
|
||||||
|
completion is slice 10). Marked **implemented**.
|
||||||
|
- **§4** — added: data-bearing-ness is **agent-internal evidence, never the caller's claim**
|
||||||
|
(mirrors the agent-internal scratch-provenance rule); destructive completion → slice 10.
|
||||||
|
- **§9 slice table** — **8C implemented → slice 8 CLOSED**: agent v0.12.0 (`/disks` + classifier
|
||||||
|
gate + `mkfs`); controller v0.37.0 (~12.3k LOC disk-execution retired, `backup.Manager` split to
|
||||||
|
app-data, disk mgmt rewired to the agent, container de-privileged). §13 + doc changelog updated.
|
||||||
|
|
||||||
### 1. The channel (guest → host HTTPS over `vmbr0`, fingerprint-pinned) — **PASS**
|
## What changed (doc 02 — controller module map)
|
||||||
A throwaway self-signed HTTPS stub on `192.168.0.162:8443`, hit from **inside guest 8200**:
|
|
||||||
- correct pin + guest-8200 token → **200**; no token → **401**; **other-guest** token → **403**
|
|
||||||
(self-scoping holds); **wrong pin → hard TLS failure** (curl exit 90 — the pin gates the handshake).
|
|
||||||
- **No firewall rule needed** (PVE firewall off; guest and host share the `vmbr0` /24, direct route).
|
|
||||||
- Security note: the local-API binds the host **LAN IP** → reachable by anything on the LAN; **auth
|
|
||||||
is the only gate** (it held). Both pin forms captured (SPKI + leaf-cert SHA-256) for the 8A choice.
|
|
||||||
|
|
||||||
### 2. The deploy plumbing (no `pct exec` — config mount + golden-baked unit) — **PASS**
|
- Added an **EXECUTED** banner: the map's target state is realized — the disk subsystem is deleted,
|
||||||
The F3 principle end-to-end: agent stays host-side, populates a **read-only config mount**
|
`backup.Manager` split, disk mgmt rewired to the agent, the container de-privileged. The in-guest
|
||||||
(`/etc/felhom-bootstrap`, bind-mount hotplugged live); a **golden-baked oneshot** reads it →
|
controller is now Docker-only with no disk/Proxmox privileges.
|
||||||
`docker login` (token via `--password-stdin`) → `docker pull …/felhom-controller:v0.34.0` →
|
|
||||||
`docker run`. The controller came up **Up (healthy)**; an in-guest process read the bootstrap token
|
|
||||||
from the mount and reached the host `/storage` → **200**. **No `pct exec` used.**
|
|
||||||
|
|
||||||
## Gotchas carried into 8A / the back-half
|
## Live validation (cross-repo, on the demo)
|
||||||
1. **Unprivileged-LXC uid mapping** — the agent must `chown 100000:100000` files it writes into the
|
|
||||||
mount (else the guest reads them as `nobody`; the secret config is inaccessible).
|
|
||||||
2. **Registry-cred scope** — the bootstrap currently carries the shared `admin` pull token; production
|
|
||||||
wants a narrow, read-only, ideally per-guest/short-lived registry token (mount is the right channel).
|
|
||||||
3. **Controller config contract** — `bootstrap.json` ≠ the controller's `controller.yaml`; the
|
|
||||||
controller boots to *setup mode* until 8A emits the real config format/path (or the unit translates).
|
|
||||||
4. **Pin form** (SPKI vs leaf-cert SHA-256) and **LAN exposure** narrowing — 8A/back-half decisions.
|
|
||||||
|
|
||||||
## Verdict — **GO** to spec 8A (local-API server + the 7 §6 endpoints) and the provisioning back-half.
|
A provisioned **de-privileged** controller v0.37.0 (`Privileged=false`; mounts only bootstrap + data
|
||||||
|
+ docker.sock) drove the agent disk API: `GET /disks` returned data-bearing flags, and a
|
||||||
|
**data-bearing format was refused** (`pending_signature`, nothing formatted) — the security
|
||||||
|
centerpiece, proven live. See the agent + controller REPORTs.
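
The `Privileged=false` and mount-set check described above can be automated against `docker inspect` output. The following is a minimal sketch, not the project's validation code: `checkDeprivileged` and the allow-list paths (`/etc/felhom-bootstrap`, `/data`, `/var/run/docker.sock`) are assumptions for illustration; only the `HostConfig.Privileged` and `Mounts[].Destination` fields are taken from Docker's inspect format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container mirrors the two parts of `docker inspect` output this check needs.
type container struct {
	HostConfig struct {
		Privileged bool `json:"Privileged"`
	} `json:"HostConfig"`
	Mounts []struct {
		Destination string `json:"Destination"`
	} `json:"Mounts"`
}

// checkDeprivileged (hypothetical helper) returns an error if the inspected
// container is privileged or mounts any destination outside the allow-list.
func checkDeprivileged(inspectJSON []byte, allowed map[string]bool) error {
	var cs []container
	if err := json.Unmarshal(inspectJSON, &cs); err != nil {
		return fmt.Errorf("parse inspect output: %w", err)
	}
	if len(cs) != 1 {
		return fmt.Errorf("expected one container, got %d", len(cs))
	}
	if cs[0].HostConfig.Privileged {
		return fmt.Errorf("container is privileged")
	}
	for _, m := range cs[0].Mounts {
		if !allowed[m.Destination] {
			return fmt.Errorf("unexpected mount: %s", m.Destination)
		}
	}
	return nil
}

func main() {
	// Assumed in-container paths for "bootstrap + data + docker.sock".
	allowed := map[string]bool{
		"/etc/felhom-bootstrap": true,
		"/data":                 true,
		"/var/run/docker.sock":  true,
	}
	sample := []byte(`[{"HostConfig":{"Privileged":false},
		"Mounts":[{"Destination":"/data"},{"Destination":"/var/run/docker.sock"}]}]`)
	fmt.Println(checkDeprivileged(sample, allowed))

	leaky := []byte(`[{"HostConfig":{"Privileged":false},"Mounts":[{"Destination":"/dev"}]}]`)
	fmt.Println(checkDeprivileged(leaky, allowed))
}
```

In practice the JSON would come from `docker inspect <container>` run inside the guest; the inline samples here only exercise the two outcomes.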
|
||||||
|
|
||||||
## Secret handling (held)
|
## Deferred
|
||||||
Test local-API tokens + the registry pull cred kept in `0600` host files, referenced by location,
|
|
||||||
never logged/committed; the stub never logged the `Authorization` header; `docker login` via
|
The operator-signed completion of a data-bearing wipe/format → **slice 10**. No hub change → no
|
||||||
`--password-stdin`. No real per-guest token or registry cred in git. All scratch shredded on teardown.
|
deploy. No secrets committed.
|
||||||
No throwaway registry token was minted (the existing `gitea-creds` admin cred was used by reference).
|
|
||||||
|
|||||||
@@ -1,5 +1,13 @@
|
|||||||
# Felhom Controller Architecture — Part 2: Controller Module Map
|
# Felhom Controller Architecture — Part 2: Controller Module Map
|
||||||
|
|
||||||
|
> **EXECUTED (slice 8C, 2026-06-10 — controller v0.37.0).** This map's target state is now realized:
|
||||||
|
> the disk-execution subsystem (`storage/*`, restic, cross-drive, drive-restore, `disk_layout`,
|
||||||
|
> `local_infra`, `infra_backup`, `setup/scanner`, `monitor/watchdog`+`pinger`, the storage UI) is
|
||||||
|
> **deleted** (~12.3k LOC); `backup.Manager` is **split to app-data only**; disk management is
|
||||||
|
> **rewired to the host agent's local API** (`web/agent_disk_handlers.go` → agent `/disks`); and the
|
||||||
|
> container is **de-privileged** (no `privileged` flag, no `/dev` or `/etc/fstab` mounts, no `rshared` propagation). The in-guest
|
||||||
|
> controller is now **Docker-only with no disk/Proxmox privileges**, as designed. See doc 03 §6/§9.
|
||||||
|
|
||||||
**Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.
|
**Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.
|
||||||
**Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files,
|
**Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files,
|
||||||
~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
|
~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
|
||||||
|
|||||||
@@ -77,6 +77,7 @@ by verb**:
|
|||||||
|
|
||||||
- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
|
- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
|
||||||
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
|
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
|
||||||
|
- **Data-bearing-ness is agent-internal evidence, never a caller's claim (slice 8C).** For a customer-driven storage op (`POST /disks/format`, §6) the agent **inspects the actual device** (filesystem signature / partition table / partitions / mount, conservative — ambiguous → data-bearing) to decide the class. A blank device → benign self-serve `mkfs`; a data-bearing device → `ClassStorageWipe` → this gate → `pending_signature`. The **destructive completion of a data-bearing wipe is slice 10** (the operator-signed path); 8C refuses it. This mirrors the provenance rule above: just as the scratch tag is agent-internal (never hub-sourced), data-bearing-ness is agent-observed (never controller-asserted) — a compromised controller cannot relabel a data-bearing drive "blank" to walk the gate.
|
||||||
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
|
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
|
||||||
|
|
||||||
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
|
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
|
||||||
@@ -116,9 +117,26 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
|
|||||||
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
||||||
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
|
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
|
||||||
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
||||||
|
- **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
|
||||||
|
`POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
|
||||||
|
/disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
|
||||||
|
warns which apps lose that storage — benign), `POST /disks/format` (see the reframed principle
|
||||||
|
below). The controller is Docker-only (de-privileged, slice 8C); **execution is the agent's**.
|
||||||
|
|
||||||
Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
|
**The principle (reframed for 8C):** a controller may do **non-data-destructive** storage setup
|
||||||
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
|
**self-serve** (list, assign, eject, format a *blank* drive); **anything that can lose customer data
|
||||||
|
stays operator-signed (§4)**. The enforcer is the **classifier**: for `POST /disks/format` the agent
|
||||||
|
**inspects the actual device itself** (filesystem signature / partition table / partitions / mount —
|
||||||
|
agent-internal evidence, NEVER the caller's claim) and classifies conservatively (ambiguous →
|
||||||
|
data-bearing). A blank device → benign → `mkfs`. A data-bearing device → `ClassStorageWipe` →
|
||||||
|
destructive → the §4 gate → refused **`pending_signature`** (the operator-signed completion is slice
|
||||||
|
10). So a compromised controller asserting "this drive is blank" **cannot** wipe a data-bearing
|
||||||
|
drive — the 8C analog of self-scoping. **Status: implemented** (agent v0.12.0 `internal/localapi` +
|
||||||
|
`internal/storage`; controller v0.37.0 `internal/web/agent_disk_handlers.go`).
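
The conservative classification described above can be sketched as a pure function over observed device evidence. This is an illustration under stated assumptions, not the agent's `internal/storage` code: the `Evidence` struct, its field names, `ClassBenign`, and the `Decide` helper are hypothetical; `ClassStorageWipe`, `pending_signature`, `mkfs`, and the list of signals come from the text.

```go
package main

import "fmt"

// Evidence is what the agent observes on the device itself (hypothetical
// representation of the signals named in the text).
type Evidence struct {
	FSSignature    bool // a filesystem signature was found
	PartitionTable bool // a partition table was found
	Partitions     int  // number of partitions present
	Mounted        bool // the device or one of its partitions is mounted
	ProbeFailed    bool // inspection errored or was incomplete (ambiguous)
}

type Class string

const (
	ClassBenign      Class = "benign" // assumed name for the blank-device class
	ClassStorageWipe Class = "ClassStorageWipe"
)

// Classify is conservative: any positive signal, or any ambiguity, makes the
// device data-bearing. Only a clean probe with no signal at all is blank.
func Classify(e Evidence) Class {
	if e.ProbeFailed || e.FSSignature || e.PartitionTable || e.Partitions > 0 || e.Mounted {
		return ClassStorageWipe
	}
	return ClassBenign
}

// Decide maps the class to the outcome: a self-serve mkfs for a blank device,
// a refusal awaiting an operator signature otherwise.
func Decide(e Evidence) string {
	if Classify(e) == ClassStorageWipe {
		return "pending_signature"
	}
	return "mkfs"
}

func main() {
	fmt.Println(Decide(Evidence{}))                  // prints "mkfs"
	fmt.Println(Decide(Evidence{FSSignature: true})) // prints "pending_signature"
	fmt.Println(Decide(Evidence{ProbeFailed: true})) // prints "pending_signature"
}
```

Note that `Classify` takes no input from the request: the caller has no parameter through which to assert that a drive is blank.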
|
||||||
|
|
||||||
|
Note what is *absent*: nothing here lets a controller touch **another guest**, the **host** beyond
|
||||||
|
this narrow disk surface, or **restore-overwrite**; and within the disk surface, **data-destructive**
|
||||||
|
power stays operator-signed (§4), as does any cross-guest action.
|
||||||
|
|
||||||
A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
|
A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
|
||||||
token → guest and authorizes per guest, so a compromised controller's blast radius is
|
token → guest and authorizes per guest, so a compromised controller's blast radius is
|
||||||
@@ -381,7 +399,7 @@ this path — bring up + reattach external storage and it is whole. This is full
|
|||||||
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
|
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
|
||||||
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
|
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
|
||||||
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. |
|
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. |
|
||||||
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred |
|
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
|
||||||
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
|
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
|
||||||
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
|
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
|
||||||
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
|
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
|
||||||
@@ -461,6 +479,18 @@ This doc hands the implementation three contracts it was waiting on:
|
|||||||
|
|
||||||
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
|
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
|
||||||
|
|
||||||
|
### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
|
||||||
|
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
|
||||||
|
**reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
|
||||||
|
**anything that can lose customer data stays operator-signed (§4)**, with the **classifier
|
||||||
|
(agent-internal device inspection)** as the enforcer. The 8C invariant: the agent decides
|
||||||
|
data-bearing-ness by **inspecting the device itself**, never the caller's claim; a data-bearing
|
||||||
|
format → `ClassStorageWipe` → gate → `pending_signature` (signed completion is slice 10).
|
||||||
|
- §9 slice table: **8C implemented — slice 8 CLOSED** (agent v0.12.0 `/disks` + classifier gate +
|
||||||
|
`mkfs`; controller v0.37.0 retired ~12.3k LOC of disk-execution + de-privileged + rewired to the
|
||||||
|
agent). The controller-side re-platform milestone: the in-guest controller is now Docker-only with
|
||||||
|
no disk/Proxmox privileges.
|
||||||
|
|
||||||
### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)
|
### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)
|
||||||
- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented**
|
- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented**
|
||||||
(controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`
|
(controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`
|
||||||
|
|||||||