The security core of slice 4: hub-supplied intent is no longer trusted for destructive change. The gate fronts the per-guest queue's executor, so every mutation passes it. Reuses internal/authz for all crypto (surface untouched).
- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-ness, NOT by verb. Destroy/overwrite of customer data is destructive unless agent-internal provenance (same-journaled-txn create, or agent-tagged scratch) makes it benign, and that provenance is journal-recorded, NEVER hub-sourced. Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a verified, role-scoped, action-bound operator signature, else pending_signature and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline untouched): role-scoping (doc 04 §4: recovery = rotation only, operational = ordinary destructive + planned rotation) + op-to-action binding (op + host + guest + params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped execution via an injected DestructiveExecutor (nil this slice, so inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at startup (resume-or-rollback). This covers an op that crashed after the POST and before its terminal record, which idempotency dedupe alone cannot. Added TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.
Adversarial matrix proven against the REAL authz.Verifier with in-test-minted SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no signing added to the verify-only package): unsigned job + unsigned desired-state delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong host -> typed authz rejections; wrong guest/op/params -> binding_mismatch; recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag ignored -> refused; valid+role+target+fresh nonce -> accepted then replay rejected.
Full module race-clean + vet-clean on the Linux build server.
Inert this slice: no destructive deltas served until slice 10; the destructive path is classified, gated, and tested but not wired to live execution.
CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Changelog
All notable changes to felhom-agent are recorded here. Update on every code change that gets pushed.
v0.4.0 — slice 4 Phase B: reversibility gate + signed-op consuming layer (2026-06-08)
The security core of slice 4: hub-supplied intent stops being trusted for destructive
change. Layered in front of the per-guest queue's executor — every mutation now
passes the gate. Reuses internal/authz for all crypto (untouched surface). Inert
this slice: no destructive deltas are served until slice 10, so the destructive path is
classified, gated, and adversarially tested but not wired to live execution.
Added
- Classifier (classify.go, doc 03 §4): benign vs destructive by provenance + data-bearing-ness, NOT by verb. The OpClass vocabulary (seeded by the committed slice-2 op_blob.json: guest_destroy) is the agent-side contract slice 10 matches. Destroy/overwrite of customer data is destructive UNLESS agent-internal provenance (same-journaled-transaction create → compensating rollback, or agent-tagged scratch) makes it benign. Provenance is journal-recorded and never populated from the hub (its zero value is the only thing an external intent may carry). Unknown op class fails safe → destructive.
- Reversibility gate (gate.go): Gate.Authorize(intent, signed) allows a benign intent unsigned; a destructive intent requires a verified, role-authorized, action-bound operator signature, else it is refused with pending_signature and never executed. Every decision is written to an AuditSink (audit is a signal, never the guard).
- Signed-op consuming layer over authz: verifies via authz.Verifier.Verify (the locked pipeline, untouched), then enforces on the VerifiedOp:
  - Role-scoping (doc 04 §4): the recovery key authorizes key-rotation re-pins ONLY; the operational key authorizes ordinary destructive ops + planned rotation.
  - Op-to-action binding: the verified op + host + guest + params must match the gated action (a signature for guest X / op A can't authorize guest Y / op B); params are compared semantically (key-order/whitespace independent).
- Signed-job orchestration (job.go): RunSignedJob performs idempotency dedupe (the op nonce is the journal key, so a redelivered completed op is skipped, not re-run), then gate authorization, then journal-wrapped execution via an injected DestructiveExecutor (nil this slice: authorized destructive ops are inert, no executor wired until 6/7).
- Crash-recovery consumer (recover.go, Note 1 / doc 03 §10): Engine.Recover consumes the journal's InFlight() at startup. An op that crashed AFTER the Proxmox POST and BEFORE its terminal record (OpTaskRunning, nonce already consumed) is NOT covered by idempotency dedupe; only this resume-or-rollback resolves it (re-read the task via the new TaskStatusOnce, record the real outcome; a no-task-id op is abandoned fail-safe). Landed together with the signed-op executor, as Note 1 required.
- Daemon wiring: runDaemon builds the verifier from config.Authz.Signers (a bad key / missing nonce-store path is a fatal misconfig; no signers = nil verifier, the common slice-4 state), constructs the gate (+SlogAudit), runs Recover before issuing any mutation, and routes every reconcile action through the gate.
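The classifier and gate rules above can be sketched in a few lines of Go. This is a minimal illustration, not the contents of classify.go or gate.go: the Intent fields, the Provenance constants, the guest_start/guest_stop class names, and the authorize signature are all hypothetical (only guest_destroy and pending_signature appear in this changelog), and signatureVerified stands in for the whole verify + role-scoping + binding chain.

```go
package main

import "fmt"

// Provenance is agent-internal. An intent built from hub input can only ever
// carry the zero value, ProvNone. (Names are hypothetical.)
type Provenance int

const (
	ProvNone         Provenance = iota // external intent: nothing proven
	ProvSameTxn                        // created inside the same journaled transaction
	ProvAgentScratch                   // tagged as scratch by the agent itself
)

type Decision string

const (
	Allowed          Decision = "allowed"
	PendingSignature Decision = "pending_signature"
)

// Intent is a hypothetical shape for a gated action.
type Intent struct {
	OpClass     string
	DataBearing bool
	Prov        Provenance
}

// isDestructive classifies by provenance and data-bearing-ness, not by verb.
// Any op class the switch does not know fails safe to destructive.
func isDestructive(in Intent) bool {
	switch in.OpClass {
	case "guest_start", "guest_stop": // assumed benign classes
		return false
	case "guest_destroy":
		if in.Prov == ProvSameTxn || in.Prov == ProvAgentScratch {
			return false // compensating rollback or agent scratch
		}
		return in.DataBearing
	default:
		return true
	}
}

// authorize lets benign intents through unsigned; a destructive intent needs
// a verified signature or it is parked as pending_signature.
func authorize(in Intent, signatureVerified bool) Decision {
	if !isDestructive(in) || signatureVerified {
		return Allowed
	}
	return PendingSignature
}

func main() {
	hubIntent := Intent{OpClass: "guest_destroy", DataBearing: true}
	fmt.Println(authorize(hubIntent, false)) // unsigned destructive intent
	rollback := Intent{OpClass: "guest_destroy", DataBearing: true, Prov: ProvSameTxn}
	fmt.Println(authorize(rollback, false)) // agent-internal provenance
}
```

Because Prov is never decoded from hub input in this sketch, a hub that claims "scratch" has no field to put the claim in, which mirrors the zero-value rule described above.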
Changed
- Memory comparison canonicalized (Note 2): desiredMemoryMiB makes the desired↔actual memory compare in the same MiB unit that is then written, so a non-MiB-aligned MemoryBytes converges in one pass instead of re-issuing SetConfig forever (the numeric cousin of the description-newline normalization). Test proves convergence. Slice 10 should still serve MiB-aligned specs at the source.
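A small sketch of why comparing in MiB converges. The function name desiredMemoryMiB comes from the entry above, but its signature and its rounding direction (truncation here) are assumptions, as is the memoryDrifted helper.

```go
package main

import "fmt"

const bytesPerMiB = 1 << 20

// desiredMemoryMiB converts a desired byte count into the MiB value that
// would be written. Truncation is an assumption made for this sketch.
func desiredMemoryMiB(memoryBytes int64) int64 {
	return memoryBytes / bytesPerMiB
}

// memoryDrifted compares desired and actual in the same unit that is written.
func memoryDrifted(desiredBytes, actualMiB int64) bool {
	return desiredMemoryMiB(desiredBytes) != actualMiB
}

func main() {
	desired := int64(512*bytesPerMiB + 1) // not MiB-aligned
	actualMiB := desiredMemoryMiB(desired) // what one SetConfig pass writes

	// A byte-level comparison keeps reporting drift after the write.
	fmt.Println("byte compare drifts:", desired != actualMiB*bytesPerMiB)
	// The canonical MiB comparison settles after a single pass.
	fmt.Println("MiB compare drifts: ", memoryDrifted(desired, actualMiB))
}
```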
Tests (the security proof — each independently rejected)
- Adversarial matrix via the REAL authz.Verifier with in-test-minted SSHSIGs (framing replicated in reconcile's test binary; production authz untouched, no signing added to the verify-only package): unsigned destructive job → pending_signature; unsigned destructive desired-state delta → pending_signature (distrusts hub desired state, not just jobs); forged/unknown signer → ErrUnknownSigner; expired → ErrExpired; replayed nonce across an agent restart (durable FileNonceStore) → ErrReplay; wrong host → ErrTarget; wrong guest / wrong op / wrong params → binding_mismatch; recovery key on ordinary destructive → role_denied; hub-supplied "scratch" tag ignored → still destructive → refused; valid + role + target + fresh nonce → accepted, and a second presentation → ErrReplay (nonce consumed).
- Classifier (benign/destructive/provenance/key-rotation/fail-safe), role-scoping, params binding, crash-recovery (resume OK / fail / still-running / no-task rollback / unreadable / one-shot key applied on resume), signed-job idempotency (execute once, dedupe redelivery, refused-not-executed, no-executor-inert, executor-error).
- Full module race-clean (go test -race) + vet clean on the Linux build server.
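The wrong-params cases in the matrix depend on params being compared semantically rather than byte-for-byte. One way to get key-order and whitespace independence with only the standard library is to decode and re-encode both sides, since encoding/json writes map keys in sorted order. The helper names below are hypothetical and this is not necessarily how the agent does it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// canonicalJSON decodes raw and re-encodes it. Re-encoding drops whitespace
// and sorts object keys; UseNumber keeps number literals as written.
func canonicalJSON(raw []byte) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

// paramsEqual reports whether two params documents carry the same content.
func paramsEqual(a, b []byte) (bool, error) {
	ca, err := canonicalJSON(a)
	if err != nil {
		return false, err
	}
	cb, err := canonicalJSON(b)
	if err != nil {
		return false, err
	}
	return bytes.Equal(ca, cb), nil
}

func main() {
	signed := []byte(`{"vmid":9999,"purge":true}`)
	gated := []byte(`{ "purge": true, "vmid": 9999 }`)
	same, err := paramsEqual(signed, gated)
	fmt.Println(same, err)
}
```

Note that this comparison is only for binding the verified op to the gated action; the signature itself is still checked over the raw received bytes.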
v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08)
The agent-side control core's structural half. Checkpoint marker — -rc1 is the
Phase-A push; awaiting validation before Phase B (the reversibility gate + signed-op
consuming layer) lands the final v0.4.0. Runs LIVE but UNFED: with no desired-state
provider until slice 10, the live engine computes an empty action set and performs
zero mutations.
Added
- internal/reconcile package: the engine, the per-guest serializer, the desired-state model, the normalization layer, and the durable op journal:
  - Per-guest serializer (Queue, doc 03 §10): the single choke point ALL mutation sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order; independent vmids run in parallel. Each vmid is a cond-var FIFO lane (unbounded, non-blocking, order-preserving); graceful drain on Close.
  - Desired-state model + DesiredProvider seam: DesiredGuest (per-field optional: run-state / *hub.GuestSpec / *description), DesiredState. The only live provider is EmptyProvider (slice 4 has no source); StaticProvider feeds fixtures. The seam is where slice 10's hub-serving plugs in; no hub/local source is invented here.
  - Normalization layer (FieldNormalizers): reconcile compares normalized desired-vs-actual so Proxmox round-trip quirks don't read as drift. description's trailing newline is the first registered case; the registry takes more (boolean coercion, list ordering) as discovered. normDesc promoted out of cmd/felhom-agent/main.go to reconcile.NormDescription; the --selftest=task description round-trip now uses that shared helper (one source of truth for the quirk).
  - Plan engine (Plan, pure function): computes the minimal benign action set (Start/Stop/SetConfig) for guests present in both desired and actual, with normalized comparison, deterministic vmid ordering, config-before-run-state. Skips provision (desired-absent-in-actual, slice 7) and destroy (actual-absent-in-desired, gated, slice 10); never writes a config it couldn't first read (SpecKnown). Disk (rootfs grow) intentionally not reconciled here.
  - Reconcile engine (Engine): reads desired+actual, plans, dispatches each action onto the shared queue. Every Proxmox op is handled per the mutate.go contract: non-empty UPID → WaitTask + assert exitstatus; empty UPID → clean synchronous success (slice-4 proven). Per-action failures are counted, not fatal (other guests still converge).
  - Operation journal (Journal): durable fsync'd append-only JSONL mirroring authz.FileNonceStore. Records each op's lifecycle (started → task_running → succeeded/failed) with its Proxmox task id (a crash mid-op is detected and re-checkable on restart via InFlight()), plus an idempotency-key store (AlreadyApplied) so a one-shot op never re-runs across retries/restarts. Reconcile actions carry no idempotency key (convergent, so they must re-run on real drift).
- Daemon wiring (runDaemon): reconcile runs alongside the hub loop on the poll cadence, sharing the per-guest queue. Journal path is a journal.log sibling of the nonce store. The daemon runs cleanly with no desired state and no signers (reconcile is a logged live no-op; a journal-open failure degrades to journal-less, never crashes).
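The per-guest serializer described above (one cond-var FIFO lane per vmid) can be sketched as follows. This is an illustrative reduction, not the package's Queue: Close and graceful drain are left out (so lane goroutines never exit), errors are not propagated, and every name other than Queue is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lane is an unbounded FIFO of jobs for one vmid, drained by one goroutine.
type lane struct {
	mu   sync.Mutex
	cond *sync.Cond
	jobs []func()
}

func newLane() *lane {
	l := &lane{}
	l.cond = sync.NewCond(&l.mu)
	go l.run()
	return l
}

// run executes jobs strictly one at a time, in submit order.
func (l *lane) run() {
	for {
		l.mu.Lock()
		for len(l.jobs) == 0 {
			l.cond.Wait()
		}
		job := l.jobs[0]
		l.jobs = l.jobs[1:]
		l.mu.Unlock()
		job() // run outside the lock so Submit never blocks on a job
	}
}

// Queue routes each job to the lane of its vmid; lanes run in parallel.
type Queue struct {
	mu    sync.Mutex
	lanes map[int]*lane
}

func NewQueue() *Queue { return &Queue{lanes: make(map[int]*lane)} }

// Submit enqueues without blocking on earlier jobs.
func (q *Queue) Submit(vmid int, job func()) {
	q.mu.Lock()
	l, ok := q.lanes[vmid]
	if !ok {
		l = newLane()
		q.lanes[vmid] = l
	}
	q.mu.Unlock()

	l.mu.Lock()
	l.jobs = append(l.jobs, job)
	l.mu.Unlock()
	l.cond.Signal()
}

// runInOrder submits n jobs for one vmid and returns their execution order.
func runInOrder(vmid, n int) []int {
	q := NewQueue()
	var wg sync.WaitGroup
	var order []int // touched only by the single lane goroutine
	for i := 0; i < n; i++ {
		i := i
		wg.Add(1)
		q.Submit(vmid, func() {
			defer wg.Done()
			order = append(order, i)
		})
	}
	wg.Wait()
	return order
}

func main() {
	fmt.Println(runInOrder(9999, 5))
}
```

Giving each vmid its own goroutine is what lets cross-waiting jobs on different guests both finish, while a single drain loop per lane keeps same-guest concurrency at one.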
Tests
- Serializer: same-guest serialized (max-concurrency 1, submit order preserved) and different-guests parallel (cross-waiting jobs both complete — would deadlock if not); error propagation; drain-pending-on-close; submit-after-close.
- Normalization: description round-trip; unknown-field identity; extensibility seam (synthetic boolean-coercion + list-ordering normalizers).
- Plan: run-state start/stop, spec drift (cores/memory), disk-not-reconciled, description-newline-not-drift, unmanaged fields, spec-unknown skips config keeps run-state, desired-absent skipped, combined ordering, empty-desired no-op, deterministic vmid order.
- Engine: empty-provider zero mutations; async start (WaitTask); synchronous SetConfig (no WaitTask); WaitTask failure + POST error counted failed; list error = pass failure.
- Journal: lifecycle latest-wins; in-flight survives restart; idempotency dedupe across restart; failed key not applied; torn-trailing-line skipped.
- Full module race-clean (go test -race) on the Linux build server; vet clean.
Not in this phase (Phase B)
- The benign/destructive classifier, the reversibility gate, and the signed-op consuming layer over internal/authz (doc 03 §4 / doc 04): added next, in front of the queue's executor, landing v0.4.0.
v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)
The gate before slice 4: prove SetConfig works live under the scoped token before
reconcile is built on it. Self-gated live run PASSED on demo-felhom/guest 9999.
Added
- Reversible SetConfig step appended to --selftest=task (cmd/felhom-agent/main.go, selftestSetConfig): read GuestConfig → write a description marker (felhom-selftest <RFC3339>) → verify it landed → restore the original value (or delete the key if it was absent) → verify the restore. Handles PVE's dual-mode SetConfig return per the mutate.go contract: empty UPID = synchronous success (printed synchronous); non-empty UPID = WaitTask + assert exitstatus=OK. The existing snapshot → rollback → delete-snapshot steps are unchanged. First live exercise of the VM.Config.* privilege cluster.
- normDesc / extraString helpers: extraString decodes a string-valued key from GuestConfig.Extra (raw JSON); normDesc strips the trailing newline PVE appends to description on read, so a written value round-trips equal.
Finding (live)
- The LXC description write returned synchronous (empty UPID): PVE applied it inline, with no task. The agent's dual-mode SetConfig modeling is correct: the empty-string path is real and must not be treated as an error.
- PVE appends a trailing \n to description on read (stored URL-encoded as %0A). A naive exact-match reconcile would see perpetual drift, so slice-4 reconcile must normalize description comparisons (hence normDesc).
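A minimal sketch of the normalization this finding calls for. The name normDesc is from this entry; the body is an assumption (a single trailing newline is trimmed) and the descriptionDrifted helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normDesc strips one trailing newline, the quirk PVE adds on read.
// (Assumed implementation; the real helper may differ.)
func normDesc(s string) string {
	return strings.TrimSuffix(s, "\n")
}

// descriptionDrifted compares the normalized forms of both sides.
func descriptionDrifted(desired, actual string) bool {
	return normDesc(desired) != normDesc(actual)
}

func main() {
	written := "felhom-selftest marker"
	readBack := written + "\n" // what the API returns after the write

	fmt.Println("exact match drifts:", written != readBack)
	fmt.Println("normalized drifts: ", descriptionDrifted(written, readBack))
}
```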
Ops
- Standing operator token (felhom-agent@pve!agent, privsep) rotated during this run (the prior secret was not retrievable); role + both user/token ACL rows re-confirmed at /. New secret stored out-of-band, not persisted to the repo. Guest 9999 left pristine (stopped, no description, no leftover snapshot). Version → 0.3.2.
Docs + live validation — no version bump (2026-06-08)
Changed
- Reflowed CLAUDE.md: removed hard mid-paragraph line wraps (prose, list items, blockquotes now single-line, soft-wrapped); code blocks and tables untouched; rendered output unchanged.
- Unified the REPORT/CHANGELOG convention in CLAUDE.md: CHANGELOG.md is the cumulative log (newest on top); REPORT.md is overwritten with the most-recent implementation/validation only. Added an explicit no-secrets rule (never write tokens/passwords/keys into committed files; reference them as stored out-of-band).
Added
- REPORT.md rewritten for the live --selftest=task validation on the demo host (demo-felhom): snapshot → rollback → delete-snapshot on guest 9999, each polled to exitstatus=OK under the felhom-agent@pve!agent privsep token (UPIDs name the token actor, so the privsep path was genuinely exercised); 16-privilege FelhomAgent role + both user & token ACLs confirmed; --selftest=read clean. Closes the slice-1 "mutating ops unit-tested only" gap; WaitTask async foundation validated live → slice 4 unblocked. (Token secret stored out-of-band, not in the repo.)
v0.3.1 — slice-3 validation follow-ups (2026-06-08)
Changed
- Collector keeps the known run-status on a GuestConfig failure (internal/hub/collect.go): previously a per-guest config-read error forced status="unknown"; now the run-status from ListLXC is preserved (only the spec is dropped). An empty status is still normalized to unknown (wire value is always running|stopped|unknown). Test renamed to TestCollect_GuestConfigFailureKeepsStatusOmitsSpec and asserts the preserved running + nil spec.
- --selftest usage error string now reads (want read|task|hub).
Added
- Cross-repo contract fixture internal/hub/testdata/host-report.golden.json + TestHostReport_ContractMatchesGolden: compares the marshaled HostReport field-name sets (top level + host + guests[0]) against the golden, failing on any json-tag drift. The file is kept byte-identical with felhom-hub's copy (duplicated contract until a shared types module; revisit when slices 5/6 populate the empty collections). Version → 0.3.1.
v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)
The agent's first daemon: a periodic read-only host-report POSTed to the hub (the heartbeat). No Proxmox mutations, no desired-state/signed-op consumption, no storage/backup collection yet — those are slices 4/5/6.
Added
- internal/hub package:
  - HostReport wire contract (report.go) shared field-for-field with the hub ingest: host metrics, guests (vmid + spec), cloudflared status, and the storage_targets / backups / restore_tests / pbs_snapshots / audit_tail collections defined but emitted empty (typed [], slices 5/6 fill them).
  - Collector (collect.go) builds the report from a read-only proxmoxReader (adapted to the real internal/proxmox surface: node held by the client, value returns, proxmox.Guest) + a CloudflaredProber. Partial-failure policy: a failed NodeStatus is a hard error (skip the POST); a failed per-guest GuestConfig degrades that guest to status="unknown" (spec omitted) but still sends; a cloudflared probe failure → "unknown", never fatal.
  - CloudflaredProber + SystemctlProber (systemctl is-active cloudflared; read-only, NOT a Privileged/root op; tunnel management is a later slice).
  - Client (client.go): POST /api/v1/host-report with Authorization: Bearer <key>, standard TLS (system roots or optional ca_file; verification always on). Typed *TransportError / *HTTPError; the bearer token never appears in any error.
  - Loop (loop.go): the daemon. Immediate first report, then tick; adopts the hub's poll_interval_seconds clamped to [60,3600]; resilient (a collect/report error is logged and the loop continues); clean shutdown on context cancel.
  - ControlEnvelope: only poll_interval_seconds is acted on; blocked / desired_generation / has_signed_ops are parsed-but-ignored (logged at most) pending reconcile (slice 4).
- Config: HubConfig (url/host_id/api_key/poll_seconds/timeout_seconds/ca_file), FELHOM_AGENT_HUB_* env overlay, HubConfig.Validate() (mode-aware: proxmox-only --selftest=read|task still runs without hub config), WithDefaults(), and Redacted() now also blanks the hub key. configs/agent.example.json gains hub (and authz) blocks.
- cmd/felhom-agent: the mode without --selftest is now the daemon (poll loop); added --selftest=hub (one collect+report, prints the report + envelope). Version 0.2.0 → 0.3.0.
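The interval adoption in the loop reduces to a clamp. A sketch, assuming the bounds [60,3600] stated above; the function name is hypothetical, and how an absent or zero value from the hub is treated is not covered by this entry, so the sketch simply clamps it like any other value.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minPollSeconds = 60
	maxPollSeconds = 3600
)

// clampPollSeconds bounds a hub-supplied poll interval to [60,3600] seconds.
func clampPollSeconds(hubSeconds int) int {
	if hubSeconds < minPollSeconds {
		return minPollSeconds
	}
	if hubSeconds > maxPollSeconds {
		return maxPollSeconds
	}
	return hubSeconds
}

func main() {
	for _, s := range []int{5, 300, 86400} {
		d := time.Duration(clampPollSeconds(s)) * time.Second
		fmt.Printf("hub asked for %ds, loop ticks every %s\n", s, d)
	}
}
```

Clamping on the agent side means a misconfigured or hostile hub can neither spin the loop nor silence the heartbeat for more than an hour.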
Tests
- Report serialization (field names; empty collections are [] not null; spec omitted when unknown); client (Bearer header, non-2xx → *HTTPError, transport → *TransportError, token never in error); collector (host mapping, guest spec, per-guest failure degrades-but-still-reports, NodeStatus hard error, cloudflared error → unknown); loop (immediate first report, continuation after an injected error, interval adoption + clamp); config (hub validate/redact/env).
Notes
- internal/proxmox and internal/authz were not touched: no new proxmox surface was needed (ListLXC already exposes status/maxmem/maxdisk; GuestConfig exposes cores). The task's proxmoxReader sketch (node-arg/pointer/LXC) was adapted to the real exports as instructed.
- Defined-but-empty this slice: storage_targets, backups, restore_tests, pbs_snapshots, audit_tail (slices 5/6). Parsed-but-ignored: the envelope's blocked / desired_generation / has_signed_ops (slice 4).
v0.2.0 — authz signed-op verifier (slice 2) (2026-06-08)
Production form of the Phase-4 signing primitive: a key-type-agnostic SSHSIG verifier for operator-signed destructive ops, with the full anti-replay/authorization pipeline and a durable, crash-safe nonce store. This is what slice 4 (reconcile) will call to gate destructive desired-state deltas. No hub, no signing CLI, no reconcile loop.
Added
- internal/authz, Verifier: New(signers, store, hostID) + Verify(blob, sigArmored) (*VerifiedOp, error). Runs the LOCKED pipeline (order is load-bearing): parse armor → namespace → parse pubkey → allow-list (by key material, pub.Marshal() equality, not key_id) → crypto verify (over the raw received bytes, never re-canonicalized) → parse blob → target → time window → nonce recorded LAST. Each post-crypto stage rejects even with a valid signature.
- SSHSIG framing (sshsig.go) via golang.org/x/crypto/ssh: pem.Decode → strip 6-byte magic → ssh.Unmarshal → ssh.ParsePublicKey → recompute signed data with the named hash → pub.Verify (dispatches on key algorithm). No hand-rolled crypto. Key-type-agnostic: ed25519 / sk-ssh-ed25519 (FIDO2) / rsa / ecdsa via the one path.
- Fixed namespace felhom-op-v1 (package constant, never caller-supplied).
- OpBlob (corrected host_id / guest_id json tags) + VerifiedOp (op, host/guest, params, key_id, matched signer). key_id is advisory/audit only, never an authz input.
- Typed errors: ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget, ErrExpired, ErrNotYetValid, ErrReplay (errors.Is-friendly).
- NonceStore + two impls: MemoryNonceStore (tests) and FileNonceStore, which is durable and crash-safe (fsync'd append log, replayed into an index on open, periodic compaction, expiry-only pruning). A nonce is fsync'd to disk before SeenOrRecord returns false; replay protection survives restart; I/O failure fails safe (reports seen=true).
- Target generalization: host_id matched strictly, guest_id surfaced for the caller to route.
- Config: AuthzConfig (nonce-store path + pinned operator signers tagged operational/recovery with a key_id, as authorized_keys lines).
- Version 0.2.0.
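Why the nonce is recorded LAST can be shown with an in-memory store. This sketch is not the package's MemoryNonceStore: the SeenOrRecord signature here (nonce only, no expiry) is an assumption, the verify function collapses every earlier pipeline stage into one boolean, and the error variables are local stand-ins for the typed errors listed above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errBadSignature = errors.New("bad signature")
	errReplay       = errors.New("replay")
)

// memoryNonceStore is a mutex-guarded set of consumed nonces.
type memoryNonceStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemoryNonceStore() *memoryNonceStore {
	return &memoryNonceStore{seen: make(map[string]struct{})}
}

// SeenOrRecord reports whether nonce was already consumed and records it
// if it was not, as one atomic step.
func (s *memoryNonceStore) SeenOrRecord(nonce string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[nonce]; ok {
		return true
	}
	s.seen[nonce] = struct{}{}
	return false
}

// verify runs the signature check first and touches the store only at the
// end, so a forged request cannot burn a legitimate operator's nonce.
func verify(store *memoryNonceStore, signatureOK bool, nonce string) error {
	if !signatureOK {
		return errBadSignature
	}
	// Target and time-window checks would sit here, still before the nonce.
	if store.SeenOrRecord(nonce) {
		return errReplay
	}
	return nil
}

func main() {
	store := newMemoryNonceStore()
	fmt.Println(verify(store, false, "n-1")) // forged: rejected, nonce untouched
	fmt.Println(verify(store, true, "n-1"))  // genuine: accepted
	fmt.Println(verify(store, true, "n-1"))  // second presentation: replay
}
```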
Tests
- Real OpenSSH interop via a committed ssh-keygen -Y sign vector (hermetic CI); per-stage rejection (each with an otherwise-valid sig); the headline invalid-sig-does-not-burn-the-nonce invariant; replay; persistence across restart; synthetic sk-ssh-ed25519 through the unchanged path; byte-exactness (a re-serialized blob fails crypto, since it is not re-canonicalized).
Notes / corrections to the Phase-4 reference
- §7's Target lacked json tags (host_id / guest_id): fixed.
- The doc paired "Go 1.24.4 / x/crypto v0.52.0", but v0.52.0 declares go 1.25.0 and does not build on Go 1.24. Resolved by upgrading the build server to go1.26.0 (backward-compatible; felhom-controller/hub unaffected); the module is go 1.25.0 on x/crypto v0.52.0.
- Free function → constructed Verifier; returns the full VerifiedOp; typed errors; clock-skew tolerance added; durable nonce store is the net-new work.
- Shared-contract dependency flagged (not built): the hub and the felhom-sign CLI must emit byte-identical canonical JSON or signatures won't verify; a shared canonicalizer both import would be the right home.
v0.1.0 — Scaffold + proxmox interaction layer (slice 1) (2026-06-08)
First slice: stand up the host-agent project and its foundation — the typed Proxmox interaction layer every other module will call. No reconcile loop, hub client, signing, or storage/backup orchestration yet (later slices).
Added
- Project scaffold: module gitea.dooplex.hu/admin/felhom-agent, binary felhom-agent (cmd/felhom-agent/), Go 1.24, zero external dependencies (pure stdlib). --version flag; version var overridable via -ldflags "-X main.version=<v>".
- internal/proxmox, API backend (Client): hand-rolled REST client over https://<host>:8006/api2/json with PVEAPIToken auth. Typed read ops (Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent) and async mutating ops returning a UPID (RestoreLXC, the primary create path; Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop).
- WaitTask: polls GET /nodes/{node}/tasks/{upid}/status until stopped, then asserts exitstatus == "OK" (authorization can surface at task execution, not the POST; phase1-2 §1.3). Exponential backoff (1s→5s cap), context cancellation + timeout.
- *APIError parses the offending privilege from a 403; *TaskError parses it from a failed task exitstatus + log tail.
- internal/proxmox, fenced root-CLI backend (Privileged): limited to the three proven OS-root exceptions only: CreateGoldenLXC (keyctl pct create), MountUSBByUUID, SMART, Sensors; each cites why it can't be the API. Fence is structural (Client never shells out, Privileged never makes an HTTP call) and asserted in tests.
- TLS trust: SHA-256 leaf-cert pinning (the host serves a self-signed cert) or a CA file; an explicitly-named insecure_skip_verify that is off by default. No blanket verification disable.
- internal/config: JSON config file + FELHOM_AGENT_* env overrides; the token secret is never logged (Redacted()).
- internal/log: slog setup (text, stderr, configurable level).
- cmd/felhom-agent --selftest: read-only health report against a live host (version/nodes/status/guests/storage); --selftest=task --vmid N exercises WaitTask on a reversible snapshot→rollback→delete op (gated; default selftest mutates nothing).
- Tests: unit tests with a mock HTTP transport + mock runner (UPID parse, WaitTask running→OK / failed-403 / timeout / ctx-cancel, 403→privilege error, response decoding against shapes captured live from demo-felhom, config redaction, and the API-vs-root routing fence).
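The WaitTask behavior listed above (poll until stopped, assert the exit status, capped exponential backoff, context cancellation) can be sketched without any Proxmox client. Everything here is illustrative: the status callback replaces the real HTTP poll, the doubling factor is an assumption (the entry only states the 1s start and 5s cap), and the demo uses millisecond delays so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nextDelay doubles the delay up to limit (doubling is an assumption).
func nextDelay(d, limit time.Duration) time.Duration {
	d *= 2
	if d > limit {
		return limit
	}
	return d
}

// waitTask polls status until the task has stopped, then requires
// exitstatus "OK". It gives up when ctx is cancelled or times out.
func waitTask(ctx context.Context, base, limit time.Duration,
	status func() (stopped bool, exitstatus string, err error)) error {
	delay := base
	for {
		stopped, exit, err := status()
		if err != nil {
			return err
		}
		if stopped {
			if exit != "OK" {
				return fmt.Errorf("task failed: exitstatus=%q", exit)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay = nextDelay(delay, limit)
	}
}

// demoWait fakes a task that reports running for runningPolls polls and
// then stops with the given exit status.
func demoWait(runningPolls int, exit string) error {
	polls := 0
	status := func() (bool, string, error) {
		polls++
		if polls <= runningPolls {
			return false, "", nil
		}
		return true, exit, nil
	}
	return waitTask(context.Background(), time.Millisecond, 5*time.Millisecond, status)
}

func main() {
	fmt.Println(demoWait(3, "OK"))
	fmt.Println(demoWait(1, "permission check failed"))
}
```

Checking the exit status after the task stops matters because, as noted above, an authorization failure can surface at task execution rather than at the POST.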
Notes
- Types are grounded in the spike findings (felhom.eu/documentation/proxmox-platform.md, tests/phase{0,1-2,3}-findings.md) and the exact JSON shapes captured live from demo-felhom (PVE 9.2.2).
- Verified: go build/vet/test green on Go 1.24.4 (build server) and a live read-only --selftest against the demo host with TLS fingerprint pinning.
- The 16-privilege FelhomAgent role + privsep token (role on both user and token) is provisioned out-of-band; the agent only consumes the token.