admin 1af21a6cac v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer
The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-
  ness, NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:56:20 +02:00


# Changelog
All notable changes to **felhom-agent** are recorded here. Update on every code
change that gets pushed.
## v0.4.0 — slice 4 Phase B: reversibility gate + signed-op consuming layer (2026-06-08)
The security core of slice 4: hub-supplied intent stops being trusted for destructive
change. Layered in front of the per-guest queue's executor — **every** mutation now
passes the gate. Reuses `internal/authz` for all crypto (untouched surface). Inert
this slice: no destructive deltas are served until slice 10, so the destructive path is
classified, gated, and adversarially tested but not wired to live execution.
### Added
- **Classifier (`classify.go`, doc 03 §4)** — benign vs destructive by **provenance +
data-bearing-ness, NOT by verb**. The `OpClass` vocabulary (seeded by the committed
slice-2 `op_blob.json`: `guest_destroy`) is the agent-side contract slice 10 matches.
Destroy/overwrite of customer data is destructive UNLESS **agent-internal**
provenance (same-journaled-transaction create → compensating rollback, or
agent-tagged scratch) makes it benign. `Provenance` is journal-recorded and **never
populated from the hub** (its zero value is the only thing an external intent may
carry). Unknown op class fails safe → destructive.
- **Reversibility gate (`gate.go`)** — `Gate.Authorize(intent, signed)`: benign →
allowed unsigned; destructive → requires a verified, role-authorized, action-bound
operator signature, else refused **`pending_signature`**, never executed. Every
decision is written to an `AuditSink` (audit is a signal, never the guard).
- **Signed-op consuming layer over `authz`** — verifies via `authz.Verifier.Verify`
(the locked pipeline, untouched), then enforces on the `VerifiedOp`:
  - **Role-scoping (doc 04 §4)** — recovery key authorizes key-rotation re-pins ONLY;
    operational key authorizes ordinary destructive ops + planned rotation.
  - **Op-to-action binding** — verified `op` + host + guest + `params` must match the
    gated action (a signature for guest X / op A can't authorize guest Y / op B);
    params compared semantically (key-order/whitespace independent).
- **Signed-job orchestration (`job.go`)** — `RunSignedJob`: idempotency dedupe (the
op nonce as the journal key — a redelivered completed op is skipped, not re-run),
gate authorization, then journal-wrapped execution via an injected
`DestructiveExecutor` (nil this slice — authorized destructive ops are inert, no
executor wired until 6/7).
- **Crash-recovery consumer (`recover.go`, Note 1 / doc 03 §10)** — `Engine.Recover`
consumes the journal's `InFlight()` at startup: an op that crashed AFTER the Proxmox
POST and BEFORE its terminal record (`OpTaskRunning`, nonce already consumed) is NOT
covered by idempotency dedupe — only this resume-or-rollback resolves it (re-read the
task via the new `TaskStatusOnce`, record the real outcome; a no-task-id op is
abandoned fail-safe). Landed together with the signed-op executor, as Note 1 required.
- **Daemon wiring** — `runDaemon` builds the verifier from `config.Authz.Signers` (a
bad key / missing nonce-store path is a fatal misconfig; **no signers = nil verifier**,
the common slice-4 state), constructs the gate (+ `SlogAudit`), runs `Recover` before
issuing any mutation, and routes every reconcile action through the gate.
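
The classifier and gate rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `classify.go` or `gate.go`: every type and function name here is hypothetical, only the `guest_destroy` op class comes from this changelog, and treating non-data-bearing destroys as benign is an assumption.

```go
package main

import "fmt"

// Provenance is hypothetical. Its zero value is all an external intent may carry.
type Provenance int

const (
	ProvNone          Provenance = iota // hub-supplied or unknown origin
	ProvSameTxnCreate                   // created in the same journaled transaction
	ProvAgentScratch                    // agent-tagged scratch data
)

type Class string

const (
	Benign      Class = "benign"
	Destructive Class = "destructive"
)

// Intent is a hypothetical gated action. Prov is filled from the journal only.
type Intent struct {
	OpClass     string
	DataBearing bool
	Prov        Provenance
}

// classify decides by provenance and data-bearing-ness, not by verb.
func classify(in Intent) Class {
	switch in.OpClass {
	case "guest_start", "guest_stop": // hypothetical benign op classes
		return Benign
	case "guest_destroy":
		if !in.DataBearing || in.Prov != ProvNone {
			return Benign // assumption: agent-internal provenance makes it benign
		}
		return Destructive
	default:
		return Destructive // unknown op class fails safe
	}
}

// authorize mirrors the gate outcome: benign passes unsigned, destructive
// needs a verified signature, otherwise it is parked and never executed.
func authorize(in Intent, signatureOK bool) string {
	if classify(in) == Benign {
		return "allowed"
	}
	if signatureOK {
		return "allowed_signed"
	}
	return "pending_signature"
}

func main() {
	hub := Intent{OpClass: "guest_destroy", DataBearing: true} // zero provenance
	fmt.Println(classify(hub), authorize(hub, false))

	rollback := Intent{OpClass: "guest_destroy", DataBearing: true, Prov: ProvSameTxnCreate}
	fmt.Println(classify(rollback), authorize(rollback, false))

	fmt.Println(classify(Intent{OpClass: "something_new"}))
}
```

The real gate additionally enforces role-scoping and op-to-action binding on the verified op; those checks are left out of this sketch.
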
### Changed
- **Memory comparison canonicalized (Note 2)** — `desiredMemoryMiB` makes the
desired↔actual memory compare in the same MiB unit that is then written, so a
non-MiB-aligned `MemoryBytes` converges in one pass instead of re-issuing SetConfig
forever (the numeric cousin of the description-newline normalization). Test proves
convergence. Slice 10 should still serve MiB-aligned specs at the source.
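
To illustrate why comparing in MiB converges, here is a small sketch. The name `desiredMemoryMiB` is taken from the entry above, but its signature and the truncating division are assumptions.

```go
package main

import "fmt"

const mib = 1 << 20

// desiredMemoryMiB converts a byte count to whole MiB.
// Truncation is an assumption; the real helper may round differently.
func desiredMemoryMiB(memoryBytes int64) int64 { return memoryBytes / mib }

// driftMiB compares desired and actual in the unit that gets written.
func driftMiB(desiredBytes, actualMiB int64) bool {
	return desiredMemoryMiB(desiredBytes) != actualMiB
}

// driftBytes is the naive comparison that never converges for unaligned input.
func driftBytes(desiredBytes, actualMiB int64) bool {
	return desiredBytes != actualMiB*mib
}

func main() {
	desired := int64(512*mib + 300)      // not MiB-aligned
	written := desiredMemoryMiB(desired) // what a SetConfig would write
	fmt.Println(written, driftMiB(desired, written), driftBytes(desired, written))
}
```

After one write the MiB comparison reports no drift, while the byte comparison would keep reporting drift on every pass.
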
### Tests (the security proof — each independently rejected)
- **Adversarial matrix** via the REAL `authz.Verifier` with in-test-minted SSHSIGs
(framing replicated in reconcile's test binary; production authz untouched, no signing
added to the verify-only package): unsigned destructive **job** → pending_signature;
unsigned destructive **desired-state delta** → pending_signature (distrusts hub
desired state, not just jobs); forged/unknown signer → `ErrUnknownSigner`; expired →
`ErrExpired`; **replayed nonce across an agent restart** (durable `FileNonceStore`) →
`ErrReplay`; wrong host → `ErrTarget`; wrong guest / wrong op / wrong params →
binding_mismatch; **recovery key on ordinary destructive** → role_denied;
**hub-supplied "scratch" tag ignored** → still destructive → refused; **valid + role +
target + fresh nonce → accepted**, and a second presentation → `ErrReplay` (nonce
consumed).
- Classifier (benign/destructive/provenance/key-rotation/fail-safe), role-scoping,
params binding, crash-recovery (resume OK / fail / still-running / no-task rollback /
unreadable / one-shot key applied on resume), signed-job idempotency (execute once,
dedupe redelivery, refused-not-executed, no-executor-inert, executor-error).
- Full module **race-clean** (`go test -race`) + vet clean on the Linux build server.
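
The semantic `params` comparison covered by the binding tests could look something like the sketch below. `paramsEqual` is a hypothetical name, and decoding both sides into generic values is one possible approach, not necessarily the one the agent uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// paramsEqual reports whether two JSON documents carry the same values,
// ignoring object key order and insignificant whitespace.
func paramsEqual(a, b []byte) (bool, error) {
	var va, vb any
	if err := json.Unmarshal(a, &va); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, err
	}
	// Objects decode to map[string]any, so DeepEqual ignores key order.
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	signed := []byte(`{"vmid":9999,"purge":true}`)
	gated := []byte(`{ "purge": true, "vmid": 9999 }`)
	other := []byte(`{"vmid":9998,"purge":true}`)

	same, _ := paramsEqual(signed, gated)
	diff, _ := paramsEqual(signed, other)
	fmt.Println(same, diff)
}
```

Note that this only applies to the binding check after verification; the signature itself is verified over the raw received bytes.
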
## v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08)
The agent-side control core's structural half. **Checkpoint marker:** `-rc1` is the
Phase-A push; awaiting validation before Phase B (the reversibility gate + signed-op
consuming layer) lands the final **v0.4.0**. Runs LIVE but UNFED: with no desired-state
provider until slice 10, the live engine computes an empty action set and performs
**zero mutations**.
### Added
- **`internal/reconcile`** package — the engine, the per-guest serializer, the
desired-state model, the normalization layer, and the durable op journal:
  - **Per-guest serializer (`Queue`, doc 03 §10)** — the single choke point ALL
    mutation sources funnel through. Same-vmid jobs run strictly one-at-a-time in
    submit order; independent vmids run in parallel. Each vmid is a cond-var FIFO lane
    (unbounded, non-blocking, order-preserving); graceful drain on `Close`.
  - **Desired-state model + `DesiredProvider` seam** — `DesiredGuest` (per-field
    optional: run-state / `*hub.GuestSpec` / `*description`), `DesiredState`. The only
    live provider is **`EmptyProvider`** (slice 4 has no source); `StaticProvider`
    feeds fixtures. The seam is where slice 10's hub-serving plugs in — no hub/local
    source invented here.
  - **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized*
    desired-vs-actual so Proxmox round-trip quirks don't read as drift. `description`'s
    trailing newline is the first registered case; the registry takes more (boolean
    coercion, list ordering) as discovered. `normDesc` **promoted** out of
    `cmd/felhom-agent/main.go` to **`reconcile.NormDescription`**; the `--selftest=task`
    description round-trip now uses that shared helper (one source of truth for the quirk).
  - **Plan engine (`Plan`, pure function)** — computes the minimal **benign** action set
    (`Start`/`Stop`/`SetConfig`) for guests present in both desired and actual, with
    normalized comparison, deterministic vmid ordering, config-before-run-state. Skips
    provision (desired-absent-in-actual, slice 7) and destroy (actual-absent-in-desired,
    gated, slice 10); never writes a config it couldn't first read (`SpecKnown`). Disk
    (rootfs grow) intentionally not reconciled here.
  - **Reconcile engine (`Engine`)** — reads desired+actual, plans, dispatches each action
    onto the shared queue. Every Proxmox op handled per the mutate.go contract: non-empty
    UPID → `WaitTask` + assert `exitstatus`; empty UPID → clean **synchronous** success
    (slice-4 proven). Per-action failures are counted, not fatal (other guests still
    converge).
  - **Operation journal (`Journal`)** — durable fsync'd append-only JSONL mirroring
    `authz.FileNonceStore`: records each op's lifecycle (started → task_running →
    succeeded/failed) with its Proxmox task id (crash mid-op is detected and re-checkable
    on restart via `InFlight()`), plus an **idempotency-key store** (`AlreadyApplied`) so
    a one-shot op never re-runs across retries/restarts. Reconcile actions carry no
    idempotency key (convergent — must re-run on real drift).
- **Daemon wiring (`runDaemon`)** — reconcile runs alongside the hub loop on the poll
cadence, **sharing the per-guest queue**. Journal path is a `journal.log` sibling of the
nonce store. The daemon runs cleanly with **no desired state and no signers** (reconcile
is a logged live no-op; a journal-open failure degrades to journal-less, never crashes).
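
A per-guest serializer of the kind described above might be sketched like this. It is a simplified stand-in for the real `Queue`: the method set, the `bool` return from `Submit`, and the one-goroutine-per-lane design are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// lane is one vmid's FIFO; a single worker drains it in submit order.
type lane struct {
	mu     sync.Mutex
	cond   *sync.Cond
	jobs   []func()
	closed bool
}

// Queue is a hypothetical per-guest serializer.
type Queue struct {
	mu     sync.Mutex
	lanes  map[int]*lane
	closed bool
	wg     sync.WaitGroup
}

func NewQueue() *Queue { return &Queue{lanes: make(map[int]*lane)} }

// Submit enqueues job on the vmid's lane without waiting for it to run.
// It returns false once the queue has been closed.
func (q *Queue) Submit(vmid int, job func()) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed {
		return false
	}
	l, ok := q.lanes[vmid]
	if !ok {
		l = &lane{}
		l.cond = sync.NewCond(&l.mu)
		q.lanes[vmid] = l
		q.wg.Add(1)
		go q.run(l)
	}
	l.mu.Lock()
	l.jobs = append(l.jobs, job)
	l.mu.Unlock()
	l.cond.Signal()
	return true
}

func (q *Queue) run(l *lane) {
	defer q.wg.Done()
	for {
		l.mu.Lock()
		for len(l.jobs) == 0 && !l.closed {
			l.cond.Wait()
		}
		if len(l.jobs) == 0 { // closed and fully drained
			l.mu.Unlock()
			return
		}
		job := l.jobs[0]
		l.jobs = l.jobs[1:]
		l.mu.Unlock()
		job() // run outside the lock so Submit never blocks on a job
	}
}

// Close stops accepting work, lets pending jobs drain, then waits for workers.
func (q *Queue) Close() {
	q.mu.Lock()
	q.closed = true
	for _, l := range q.lanes {
		l.mu.Lock()
		l.closed = true
		l.mu.Unlock()
		l.cond.Signal()
	}
	q.mu.Unlock()
	q.wg.Wait()
}

func main() {
	q := NewQueue()
	var order []int // only touched by vmid 100's single worker
	for i := 1; i <= 3; i++ {
		i := i
		q.Submit(100, func() { order = append(order, i) })
	}
	q.Submit(200, func() {}) // independent lane, may run in parallel
	q.Close()
	fmt.Println(order)
}
```

Because each lane has exactly one worker, same-vmid jobs cannot overlap, while jobs for different vmids proceed independently.
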
### Tests
- Serializer: same-guest serialized (max-concurrency 1, submit order preserved) and
different-guests parallel (cross-waiting jobs both complete — would deadlock if not);
error propagation; drain-pending-on-close; submit-after-close.
- Normalization: description round-trip; unknown-field identity; extensibility seam
(synthetic boolean-coercion + list-ordering normalizers).
- Plan: run-state start/stop, spec drift (cores/memory), disk-not-reconciled,
description-newline-not-drift, unmanaged fields, spec-unknown skips config keeps
run-state, desired-absent skipped, combined ordering, empty-desired no-op, deterministic
vmid order.
- Engine: empty-provider zero mutations; async start (WaitTask); synchronous SetConfig
(no WaitTask); WaitTask failure + POST error counted failed; list error = pass failure.
- Journal: lifecycle latest-wins; in-flight survives restart; idempotency dedupe across
restart; failed key not applied; torn-trailing-line skipped.
- Full module **race-clean** (`go test -race`) on the Linux build server; vet clean.
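
The journal behaviours tested above (latest-wins, in-flight detection, skipping a torn line) can be illustrated with a small replay sketch. The record layout and field names are assumptions; only the lifecycle states come from this changelog.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// record is a hypothetical journal line.
type record struct {
	OpID   string `json:"op_id"`
	State  string `json:"state"` // started | task_running | succeeded | failed
	TaskID string `json:"task_id,omitempty"`
}

// inFlight replays a JSONL journal. The latest record per op wins, and ops
// whose latest state is non-terminal are returned sorted by op id.
func inFlight(journal string) []record {
	latest := map[string]record{}
	sc := bufio.NewScanner(strings.NewReader(journal))
	for sc.Scan() {
		var r record
		if err := json.Unmarshal(sc.Bytes(), &r); err != nil || r.OpID == "" {
			continue // torn or malformed line (e.g. crash mid-append): skip it
		}
		latest[r.OpID] = r
	}
	var out []record
	for _, r := range latest {
		if r.State == "started" || r.State == "task_running" {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].OpID < out[j].OpID })
	return out
}

func main() {
	journal := `{"op_id":"a","state":"started"}
{"op_id":"a","state":"succeeded"}
{"op_id":"b","state":"started"}
{"op_id":"b","state":"task_running","task_id":"task-b"}
{"op_id":"c","sta`
	for _, r := range inFlight(journal) {
		fmt.Println(r.OpID, r.State, r.TaskID)
	}
}
```

Here op `a` is terminal, op `b` is still in flight with a task id that can be re-checked, and the truncated last line is ignored.
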
### Not in this phase (Phase B)
- The benign/destructive classifier, the reversibility gate, and the signed-op consuming
layer over `internal/authz` (doc 03 §4 / doc 04) — added next, in front of the queue's
executor, landing **v0.4.0**.
## v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)
The gate before slice 4: prove `SetConfig` works live under the scoped token before
reconcile is built on it. **Self-gated live run PASSED** on `demo-felhom`/guest 9999.
### Added
- **Reversible `SetConfig` step appended to `--selftest=task`** (`cmd/felhom-agent/main.go`,
`selftestSetConfig`): read `GuestConfig` → write a `description` marker
(`felhom-selftest <RFC3339>`) → verify it landed → restore the original value (or
`delete` the key if it was absent) → verify the restore. Handles PVE's dual-mode
`SetConfig` return per the `mutate.go` contract: empty UPID = synchronous success
(printed `synchronous`); non-empty UPID = `WaitTask` + assert `exitstatus=OK`.
The existing snapshot → rollback → delete-snapshot steps are unchanged. First live
exercise of the **`VM.Config.*`** privilege cluster.
- **`normDesc` / `extraString` helpers** — `extraString` decodes a string-valued key
from `GuestConfig.Extra` (raw JSON); `normDesc` strips the trailing newline PVE
appends to `description` on read, so a written value round-trips equal.
### Finding (live)
- The LXC `description` write returned **synchronous (empty UPID)** — PVE applied it
inline, no task. The agent's dual-mode `SetConfig` modeling is correct: the
empty-string path is real and must not be treated as an error.
- PVE **appends a trailing `\n` to `description`** on read (stored URL-encoded as
`%0A`). A naive exact-match reconcile would see perpetual drift — slice-4 reconcile
must normalize `description` comparisons (hence `normDesc`).
### Ops
- Standing operator token (`felhom-agent@pve!agent`, privsep) **rotated** during this
run (the prior secret was not retrievable); role + both user/token ACL rows
re-confirmed at `/`. New secret stored out-of-band, **not persisted to the repo**.
Guest 9999 left pristine (stopped, no `description`, no leftover snapshot). Version → 0.3.2.
## Docs + live validation — no version bump (2026-06-08)
### Changed
- **Reflowed `CLAUDE.md`** — removed hard mid-paragraph line wraps (prose, list items, blockquotes now single-line, soft-wrapped); code blocks and tables untouched; rendered output unchanged.
- **Unified the REPORT/CHANGELOG convention** in `CLAUDE.md`: `CHANGELOG.md` is the cumulative log (newest on top); `REPORT.md` is overwritten with the most-recent implementation/validation only. Added an explicit **no-secrets** rule (never write tokens/passwords/keys into committed files; reference them as stored out-of-band).
### Added
- **`REPORT.md`** rewritten for the live `--selftest=task` validation on the demo host (`demo-felhom`): snapshot → rollback → delete-snapshot on guest 9999, each polled to `exitstatus=OK` under the `felhom-agent@pve!agent` privsep token (UPIDs name the token actor — privsep path genuinely exercised); 16-privilege `FelhomAgent` role + both user & token ACLs confirmed; `--selftest=read` clean. Closes the slice-1 "mutating ops unit-tested only" gap; `WaitTask` async foundation validated live → **slice 4 unblocked**. (Token secret stored out-of-band, not in the repo.)
## v0.3.1 — slice-3 validation follow-ups (2026-06-08)
### Changed
- **Collector keeps the known run-status on a `GuestConfig` failure** (`internal/hub/collect.go`):
previously a per-guest config-read error forced `status="unknown"`; now the run-status from
`ListLXC` is preserved (only the `spec` is dropped). An empty status is still normalized to
`unknown` (wire value is always `running|stopped|unknown`). Test renamed to
`TestCollect_GuestConfigFailureKeepsStatusOmitsSpec` and asserts the preserved `running` + nil spec.
- **`--selftest` usage** error string now reads `(want read|task|hub)`.
### Added
- **Cross-repo contract fixture** `internal/hub/testdata/host-report.golden.json` +
`TestHostReport_ContractMatchesGolden` — compares the marshaled `HostReport` field-name sets
(top level + `host` + `guests[0]`) against the golden, failing on any json-tag drift. The file is
**kept byte-identical** with felhom-hub's copy (duplicated contract until a shared types module;
revisit when slices 5/6 populate the empty collections). Version → 0.3.1.
## v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)
The agent's first daemon: a periodic read-only host-report POSTed to the hub (the
heartbeat). No Proxmox mutations, no desired-state/signed-op consumption, no
storage/backup collection yet — those are slices 4/5/6.
### Added
- **`internal/hub`** package:
  - **`HostReport`** wire contract (`report.go`) shared field-for-field with the hub
    ingest: host metrics, guests (`vmid` + spec), `cloudflared` status, and the
    `storage_targets`/`backups`/`restore_tests`/`pbs_snapshots`/`audit_tail`
    collections **defined but emitted empty** (typed `[]`, slices 5/6 fill them).
  - **`Collector`** (`collect.go`) builds the report from a read-only `proxmoxReader`
    (adapted to the real `internal/proxmox` surface — node held by the client, value
    returns, `proxmox.Guest`) + a `CloudflaredProber`. Partial-failure policy: a
    failed `NodeStatus` is a hard error (skip the POST); a failed per-guest
    `GuestConfig` degrades that guest to `status="unknown"` (spec omitted) but still
    sends; a cloudflared probe failure → `"unknown"`, never fatal.
  - **`CloudflaredProber`** + `SystemctlProber` (`systemctl is-active cloudflared`;
    read-only — NOT a Privileged/root op; tunnel management is a later slice).
  - **`Client`** (`client.go`): `POST /api/v1/host-report` with
    `Authorization: Bearer <key>`, standard TLS (system roots or optional `ca_file`;
    verification always on). Typed `*TransportError` / `*HTTPError`; the bearer token
    never appears in any error.
  - **`Loop`** (`loop.go`): the daemon — immediate first report then tick; adopts the
    hub's `poll_interval_seconds` clamped to [60,3600]; resilient (a collect/report
    error is logged and the loop continues); clean shutdown on context cancel.
  - **`ControlEnvelope`**: only `poll_interval_seconds` is acted on; `blocked` /
    `desired_generation` / `has_signed_ops` are parsed-but-ignored (logged at most)
    pending reconcile (slice 4).
- **Config**: `HubConfig` (url/host_id/api_key/poll_seconds/timeout_seconds/ca_file),
`FELHOM_AGENT_HUB_*` env overlay, `HubConfig.Validate()` (mode-aware — proxmox-only
`--selftest=read|task` still runs without hub config), `WithDefaults()`, and
`Redacted()` now also blanks the hub key. `configs/agent.example.json` gains `hub`
(and `authz`) blocks.
- **`cmd/felhom-agent`**: the no-`--selftest` mode is now the **daemon** (poll loop);
added **`--selftest=hub`** (one collect+report, prints the report + envelope).
Version 0.2.0 → 0.3.0.
### Tests
- Report serialization (field names; empty collections are `[]` not `null`; spec
omitted when unknown); client (Bearer header, non-2xx→`*HTTPError`,
transport→`*TransportError`, **token never in error**); collector (host mapping,
guest spec, per-guest failure degrades-but-still-reports, NodeStatus hard error,
cloudflared error→unknown); loop (immediate first report, continuation after an
injected error, interval adoption + clamp); config (hub validate/redact/env).
### Notes
- `internal/proxmox` and `internal/authz` were **not touched** — no new proxmox
surface was needed (`ListLXC` already exposes status/maxmem/maxdisk; `GuestConfig`
exposes cores). The task's `proxmoxReader` sketch (node-arg/pointer/`LXC`) was
adapted to the real exports as instructed.
- **Defined-but-empty** this slice: `storage_targets`, `backups`, `restore_tests`,
`pbs_snapshots`, `audit_tail` (slices 5/6). **Parsed-but-ignored**: the envelope's
`blocked`/`desired_generation`/`has_signed_ops` (slice 4).
## v0.2.0 — `authz` signed-op verifier (slice 2) (2026-06-08)
Production form of the Phase-4 signing primitive: a key-type-agnostic SSHSIG
verifier for operator-signed destructive ops, with the full anti-replay/
authorization pipeline and a durable, crash-safe nonce store. What slice 4
(reconcile) will call to gate destructive desired-state deltas. No hub, no signing
CLI, no reconcile loop.
### Added
- **`internal/authz`: `Verifier`**: `New(signers, store, hostID)` + `Verify(blob,
sigArmored) (*VerifiedOp, error)`. Runs the LOCKED pipeline (order is
load-bearing): parse armor → namespace → parse pubkey → allow-list (by key
**material**, `pub.Marshal()` equality, not key_id) → crypto verify (over the
**raw received bytes**, never re-canonicalized) → parse blob → target → time
window → **nonce recorded LAST**. Each post-crypto stage rejects even with a
valid signature.
- **SSHSIG framing** (`sshsig.go`) via `golang.org/x/crypto/ssh` — `pem.Decode` →
strip 6-byte magic → `ssh.Unmarshal` → `ssh.ParsePublicKey` → recompute signed
data with the named hash → `pub.Verify` (dispatches on key algorithm). No
hand-rolled crypto. Key-type-agnostic: ed25519 / **sk-ssh-ed25519 (FIDO2)** /
rsa / ecdsa via the one path.
- **Fixed namespace** `felhom-op-v1` (package constant, never caller-supplied).
- **`OpBlob`** (corrected `host_id`/`guest_id` json tags) + **`VerifiedOp`** (op,
host/guest, params, key_id, matched signer). key_id is advisory/audit only —
never an authz input.
- **Typed errors**: `ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature,
ErrTarget, ErrExpired, ErrNotYetValid, ErrReplay` (errors.Is-friendly).
- **`NonceStore`** + two impls: `MemoryNonceStore` (tests) and **`FileNonceStore`**
— durable, crash-safe (fsync'd append log, replayed into an index on open,
periodic compaction, expiry-only pruning). A nonce is fsync'd to disk before
`SeenOrRecord` returns false; replay protection survives restart; I/O failure
fails safe (reports seen=true). Target generalization: host_id matched strictly,
guest_id surfaced for the caller to route.
- **Config**: `AuthzConfig` (nonce-store path + pinned operator `signers` tagged
`operational`/`recovery` with a key_id, as authorized_keys lines).
- **Version 0.2.0.**
### Tests
- Real OpenSSH interop via a committed `ssh-keygen -Y sign` vector (hermetic CI);
per-stage rejection (each with an otherwise-valid sig); the headline
**invalid-sig-does-not-burn-the-nonce** invariant; replay; **persistence across
restart**; synthetic **sk-ssh-ed25519** through the unchanged path; byte-exactness
(a re-serialized blob fails crypto — not re-canonicalized).
### Notes / corrections to the Phase-4 reference
- §7's `Target` lacked json tags (`host_id`/`guest_id`) — fixed.
- The doc paired "Go 1.24.4 / x/crypto v0.52.0", but v0.52.0 declares `go 1.25.0`
and does **not** build on Go 1.24. Resolved by upgrading the build server to
go1.26.0 (backward-compatible; felhom-controller/hub unaffected); the module is
`go 1.25.0` on x/crypto v0.52.0.
- Free function → constructed `Verifier`; returns the full `VerifiedOp`; typed
errors; clock-skew tolerance added; durable nonce store is the net-new work.
- **Shared-contract dependency flagged** (not built): the hub and the `felhom-sign`
CLI must emit byte-identical canonical JSON or signatures won't verify; a shared
canonicalizer both import would be the right home.
## v0.1.0 — Scaffold + `proxmox` interaction layer (slice 1) (2026-06-08)
First slice: stand up the host-agent project and its foundation — the typed
Proxmox interaction layer every other module will call. No reconcile loop, hub
client, signing, or storage/backup orchestration yet (later slices).
### Added
- **Project scaffold**: module `gitea.dooplex.hu/admin/felhom-agent`, binary
`felhom-agent` (`cmd/felhom-agent/`), Go 1.24, zero external dependencies
(pure stdlib). `--version` flag; `version` var overridable via
`-ldflags "-X main.version=<v>"`.
- **`internal/proxmox` — API backend (`Client`)**: hand-rolled REST client over
`https://<host>:8006/api2/json` with `PVEAPIToken` auth. Typed read ops
(`Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`,
`ListStorage`, `NodeStorage`, `StorageContent`) and async mutating ops
returning a UPID (`RestoreLXC` — the primary create path, `Vzdump`, `Snapshot`,
`Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`).
- **`WaitTask`**: polls `GET /nodes/{node}/tasks/{upid}/status` until stopped, then
asserts `exitstatus == "OK"` (authorization can surface at task execution, not
the POST — phase1-2 §1.3). Exponential backoff (1s→5s cap), context
cancellation + timeout. `*APIError` parses the offending privilege from a 403;
`*TaskError` parses it from a failed task exitstatus + log tail.
- **`internal/proxmox` — fenced root-CLI backend (`Privileged`)**: limited to the
three proven OS-root exceptions only — `CreateGoldenLXC` (keyctl `pct create`),
`MountUSBByUUID`, `SMART`, `Sensors`; each cites why it can't be the API. Fence
is structural (Client never shells out, Privileged never makes an HTTP call) and
asserted in tests.
- **TLS trust**: SHA-256 leaf-cert pinning (the host serves a self-signed cert) or
a CA file; an explicitly-named `insecure_skip_verify` that is off by default. No
blanket verification disable.
- **`internal/config`**: JSON config file + `FELHOM_AGENT_*` env overrides; the
token secret is never logged (`Redacted()`).
- **`internal/log`**: slog setup (text, stderr, configurable level).
- **`cmd/felhom-agent --selftest`**: read-only health report against a live host
(version/nodes/status/guests/storage); `--selftest=task --vmid N` exercises
`WaitTask` on a reversible snapshot→rollback→delete op (gated; default selftest
mutates nothing).
- **Tests**: unit tests with a mock HTTP transport + mock runner (UPID parse,
`WaitTask` running→OK / failed-403 / timeout / ctx-cancel, 403→privilege error,
response decoding against shapes captured live from `demo-felhom`, config
redaction, and the API-vs-root routing fence).
### Notes
- Types are grounded in the spike findings
(`felhom.eu/documentation/proxmox-platform.md`, `tests/phase{0,1-2,3}-findings.md`)
and the exact JSON shapes captured live from `demo-felhom` (PVE 9.2.2).
- Verified: `go build/vet/test` green on Go 1.24.4 (build server) and a live
read-only `--selftest` against the demo host with TLS fingerprint pinning.
- The 16-privilege `FelhomAgent` role + privsep token (role on **both** user and
token) is provisioned out-of-band; the agent only consumes the token.