doc 03 §6/§4/§9 + doc 02: slice 8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
§6: disk-management endpoints + reframed principle (non-data-destructive self-serve; data-destructive stays operator-signed; classifier = agent-internal device inspection). §4: data-bearing-ness is agent-internal, never caller-claimed. §9: 8C implemented, slice 8 CLOSED. doc 02: EXECUTED banner. Validated live (data-bearing format refused; de-privileged controller). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -1,5 +1,13 @@
|
||||
# Felhom Controller Architecture — Part 2: Controller Module Map
|
||||
|
||||
> **EXECUTED (slice 8C, 2026-06-10 — controller v0.37.0).** This map's target state is now realized:
|
||||
> the disk-execution subsystem (`storage/*`, restic, cross-drive, drive-restore, `disk_layout`,
|
||||
> `local_infra`, `infra_backup`, `setup/scanner`, `monitor/watchdog`+`pinger`, the storage UI) is
|
||||
> **deleted** (~12.3k LOC); `backup.Manager` is **split to app-data only**; disk management is
|
||||
> **rewired to the host agent's local API** (`web/agent_disk_handlers.go` → agent `/disks`); and the
|
||||
> container is **de-privileged** (no `privileged`, `/dev`, `/etc/fstab`, rshared). The in-guest
|
||||
> controller is now **Docker-only with no disk/Proxmox privileges**, as designed. See doc 03 §6/§9.
|
||||
|
||||
**Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.
|
||||
**Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files,
|
||||
~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
|
||||
|
||||
@@ -77,6 +77,7 @@ by verb**:
|
||||
|
||||
- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
|
||||
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
|
||||
- **Data-bearing-ness is agent-internal evidence, never a caller's claim (slice 8C).** For a customer-driven storage op (`POST /disks/format`, §6) the agent **inspects the actual device** (filesystem signature / partition table / partitions / mount, conservative — ambiguous → data-bearing) to decide the class. A blank device → benign self-serve `mkfs`; a data-bearing device → `ClassStorageWipe` → this gate → `pending_signature`. The **destructive completion of a data-bearing wipe is slice 10** (the operator-signed path); 8C refuses it. This mirrors the provenance rule above: just as the scratch tag is agent-internal (never hub-sourced), data-bearing-ness is agent-observed (never controller-asserted) — a compromised controller cannot relabel a data-bearing drive "blank" to walk the gate.
|
||||
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)
|
||||
|
||||
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
|
||||
@@ -116,9 +117,26 @@ The controller (in its LXC) reaches the agent (on the host) over the local bridg
|
||||
- `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
|
||||
- `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
|
||||
- `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.
|
||||
- **Disk management (slice 8C):** `GET /disks` (host drives + a **data-bearing flag**),
|
||||
`POST /disks/assign` (attach a drive as a mount — benign, additive, self-serve), `POST
|
||||
/disks/eject` (safe-unmount, **data preserved**, returns the dependent guests so the controller
|
||||
warns which apps lose that storage — benign), `POST /disks/format` (see the reframed principle
|
||||
below). The controller is Docker-only (de-privileged, slice 8C); **execution is the agent's**.
|
||||
|
||||
Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
|
||||
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).
|
||||
**The principle (reframed for 8C):** a controller may do **non-data-destructive** storage setup
|
||||
**self-serve** (list, assign, eject, format a *blank* drive); **anything that can lose customer data
|
||||
stays operator-signed (§4)**. The enforcer is the **classifier**: for `POST /disks/format` the agent
|
||||
**inspects the actual device itself** (filesystem signature / partition table / partitions / mount —
|
||||
agent-internal evidence, NEVER the caller's claim) and classifies conservatively (ambiguous →
|
||||
data-bearing). A blank device → benign → `mkfs`. A data-bearing device → `ClassStorageWipe` →
|
||||
destructive → the §4 gate → refused **`pending_signature`** (the operator-signed completion is slice
|
||||
10). So a compromised controller asserting "this drive is blank" **cannot** wipe a data-bearing
|
||||
drive — the 8C analog of self-scoping. **Status: implemented** (agent v0.12.0 `internal/localapi` +
|
||||
`internal/storage`; controller v0.37.0 `internal/web/agent_disk_handlers.go`).
|
||||
|
||||
Note what is *absent*: nothing here lets a controller touch **another guest**, the **host** beyond
|
||||
this narrow disk surface, or **restore-overwrite**; and within the disk surface, **data-destructive**
|
||||
power stays operator-signed (§4). Destructive/cross-guest power stays operator-signed.
|
||||
|
||||
A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
|
||||
token → guest and authorizes per guest, so a compromised controller's blast radius is
|
||||
@@ -381,7 +399,7 @@ this path — bring up + reattach external storage and it is whole. This is full
|
||||
| PBS recovery-code escrow **creation** + **hub opaque storage** (§8a) | **7** | **implemented** (agent v0.9.0 `internal/escrow`; hub v0.8.0 `PUT /hosts/{id}/escrow`) |
|
||||
| **Local API** server (§6) + provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8A** | **implemented** (agent v0.10.0 `internal/localapi` + `internal/provision`; controller v0.35.0 `internal/bootstrap` + `internal/agentapi`). The controller image is **baked into the golden** (no registry cred in any guest); the back-half mints the token, writes a 0600 `bootstrap.json` to a `chown 100000:100000` config mount, and `pct set`-attaches it read-only; the golden's baked unit deploys the controller, which ingests the bootstrap, comes up configured, and reaches the agent over the bridge (leaf-pin + token). Validated live end-to-end on the demo. |
|
||||
| **Quiesced app-consistent backup** (`/backup/due`-driven stack-stop) | **8B** | **implemented** (agent v0.11.0 `/backup/due` cadence + `/backup/status` phases; controller v0.36.0 `internal/quiesce` — stop stacks → backup → restart, with crash-safety marker/guaranteed-unquiesce/max-bound/crash-recovery). Validated live incl. the postgres clean-vs-crash-recovery restore contrast. Downtime optimization (snapshot mode) → 8B.2. |
|
||||
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | deferred |
|
||||
| **Controller de-privileging** (retire the disk-execution subsystem; new customer disk endpoints behind the slice-4 data-bearing classifier) | **8C** | **implemented — slice 8 CLOSED** (agent v0.12.0: `/disks` endpoints + the data-bearing classifier gate + `mkfs`; controller v0.37.0: ~12.3k LOC of disk-execution retired — storage/restic/cross-drive/migrate/watchdog/scanner/infra-backup — `backup.Manager` split to app-data only, disk mgmt rewired to the agent, container de-privileged). The data-bearing format refusal (§6) is the security centerpiece. |
|
||||
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
|
||||
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
|
||||
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |
|
||||
@@ -461,6 +479,18 @@ This doc hands the implementation three contracts it was waiting on:
|
||||
|
||||
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
|
||||
|
||||
### Slice-8C implemented — controller de-privileged, slice 8 CLOSED (2026-06-10)
|
||||
- §6: added the **disk-management endpoints** (`/disks`, `/disks/assign|eject|format`) and
|
||||
**reframed the principle** — a controller may do non-data-destructive storage setup self-serve;
|
||||
**anything that can lose customer data stays operator-signed (§4)**, with the **classifier
|
||||
(agent-internal device inspection)** as the enforcer. The 8C invariant: the agent decides
|
||||
data-bearing-ness by **inspecting the device itself**, never the caller's claim; a data-bearing
|
||||
format → `ClassStorageWipe` → gate → `pending_signature` (signed completion is slice 10).
|
||||
- §9 slice table: **8C implemented — slice 8 CLOSED** (agent v0.12.0 `/disks` + classifier gate +
|
||||
`mkfs`; controller v0.37.0 retired ~12.3k LOC of disk-execution + de-privileged + rewired to the
|
||||
agent). The controller-side re-platform milestone: the in-guest controller is now Docker-only with
|
||||
no disk/Proxmox privileges.
|
||||
|
||||
### Slice-8B implemented: app-consistent backup (quiesce / stack-stop) (2026-06-10)
|
||||
- §8: the **controller-driven quiesce** (stop app stacks → `POST /backup` → restart) is **implemented**
|
||||
(controller v0.36.0 `internal/quiesce` + agent v0.11.0 `/backup/due` cadence + `/backup/status`
|
||||
|
||||
Reference in New Issue
Block a user