admin 05c450147c v0.4.0-rc1: slice 4 Phase A — reconcile engine (structural, runs live unfed)
New internal/reconcile package: the agent-side control core's structural half.

- Per-guest serializer Queue (doc 03 §10): the single choke point all mutation
  sources funnel through; same-vmid serial in submit order, different vmids
  parallel (cond-var FIFO lanes).
- Desired-state model + DesiredProvider seam; EmptyProvider is the only live
  source at slice 4 (no hub serving until slice 10) so the live engine computes
  an empty action set and performs zero mutations.
- Normalization layer (FieldNormalizers): normalized desired-vs-actual so
  Proxmox round-trip quirks don't read as drift. normDesc promoted out of
  main.go to reconcile.NormDescription; selftest uses the shared helper.
- Plan (pure diff): minimal benign action set (Start/Stop/SetConfig) for guests
  in both desired and actual; provision/destroy out of scope here.
- Engine: dispatches onto the shared queue; honors the dual-mode SetConfig
  contract (UPID -> WaitTask; empty UPID -> synchronous success).
- Durable op journal + idempotency store (mirrors authz.FileNonceStore):
  in-flight task ids for crash detection + AlreadyApplied dedupe across restart.
- Wired into runDaemon alongside the hub loop, sharing the queue; runs cleanly
  with no desired state and no signers.

Full module race-clean and vet-clean on the Linux build server.

CHECKPOINT: Phase A only. Awaiting validation before Phase B (the reversibility
gate + signed-op consuming layer, landing v0.4.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:21:55 +02:00


Changelog

All notable changes to felhom-agent are recorded here. Update on every code change that gets pushed.

v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08)

The agent-side control core's structural half. The -rc1 checkpoint marker denotes the Phase-A push; Phase B (the reversibility gate + signed-op consuming layer) awaits validation and lands the final v0.4.0. Runs LIVE but UNFED: with no desired-state provider until slice 10, the live engine computes an empty action set and performs zero mutations.

Added

  • internal/reconcile package — the engine, the per-guest serializer, the desired-state model, the normalization layer, and the durable op journal:
    • Per-guest serializer (Queue, doc 03 §10) — the single choke point ALL mutation sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order; independent vmids run in parallel. Each vmid is a cond-var FIFO lane (unbounded, non-blocking, order-preserving); graceful drain on Close.
    • Desired-state model + DesiredProvider seam: DesiredGuest (per-field optional: run-state / *hub.GuestSpec / *description), DesiredState. The only live provider is EmptyProvider (slice 4 has no source); StaticProvider feeds fixtures. The seam is where slice 10's hub-serving plugs in; no hub/local source is invented here.
    • Normalization layer (FieldNormalizers) — reconcile compares normalized desired-vs-actual so Proxmox round-trip quirks don't read as drift. description's trailing newline is the first registered case; the registry takes more (boolean coercion, list ordering) as discovered. normDesc promoted out of cmd/felhom-agent/main.go to reconcile.NormDescription; the --selftest=task description round-trip now uses that shared helper (one source of truth for the quirk).
    • Plan engine (Plan, pure function) — computes the minimal benign action set (Start/Stop/SetConfig) for guests present in both desired and actual, with normalized comparison, deterministic vmid ordering, config-before-run-state. Skips provision (desired-absent-in-actual, slice 7) and destroy (actual-absent-in-desired, gated, slice 10); never writes a config it couldn't first read (SpecKnown). Disk (rootfs grow) intentionally not reconciled here.
    • Reconcile engine (Engine) — reads desired+actual, plans, dispatches each action onto the shared queue. Every Proxmox op handled per the mutate.go contract: non-empty UPID → WaitTask + assert exitstatus; empty UPID → clean synchronous success (slice-4 proven). Per-action failures are counted, not fatal (other guests still converge).
    • Operation journal (Journal) — durable fsync'd append-only JSONL mirroring authz.FileNonceStore: records each op's lifecycle (started → task_running → succeeded/failed) with its Proxmox task id (crash mid-op is detected and re-checkable on restart via InFlight()), plus an idempotency-key store (AlreadyApplied) so a one-shot op never re-runs across retries/restarts. Reconcile actions carry no idempotency key (convergent — must re-run on real drift).
  • Daemon wiring (runDaemon) — reconcile runs alongside the hub loop on the poll cadence, sharing the per-guest queue. Journal path is a journal.log sibling of the nonce store. The daemon runs cleanly with no desired state and no signers (reconcile is a logged live no-op; a journal-open failure degrades to journal-less, never crashes).
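
The per-guest serializer described above can be pictured with a small sketch. This is not the internal/reconcile code: the names Queue, Submit, Close, lane, and ErrClosed as used here, and their signatures, are assumptions for illustration. It shows one cond-var FIFO lane per vmid, non-blocking submit, and drain on close.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrClosed is returned by Submit after Close (hypothetical name).
var ErrClosed = errors.New("queue closed")

type pending struct {
	job  func() error
	done chan error // buffered, so the worker never blocks on delivery
}

// lane is one guest's FIFO: a slice guarded by a mutex, plus a condition
// variable the worker sleeps on while the lane is empty.
type lane struct {
	mu     sync.Mutex
	cond   *sync.Cond
	jobs   []pending
	closed bool
}

func (l *lane) run(wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		l.mu.Lock()
		for len(l.jobs) == 0 && !l.closed {
			l.cond.Wait()
		}
		if len(l.jobs) == 0 { // closed and fully drained
			l.mu.Unlock()
			return
		}
		p := l.jobs[0]
		l.jobs = l.jobs[1:]
		l.mu.Unlock()
		p.done <- p.job() // runs outside the lock, one job at a time per lane
	}
}

// Queue serializes jobs per vmid; different vmids run in parallel.
type Queue struct {
	mu     sync.Mutex
	lanes  map[int]*lane
	closed bool
	wg     sync.WaitGroup
}

func NewQueue() *Queue { return &Queue{lanes: make(map[int]*lane)} }

// Submit enqueues without blocking and returns a channel for the job's result.
func (q *Queue) Submit(vmid int, job func() error) (<-chan error, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed {
		return nil, ErrClosed
	}
	l, ok := q.lanes[vmid]
	if !ok {
		l = &lane{}
		l.cond = sync.NewCond(&l.mu)
		q.lanes[vmid] = l
		q.wg.Add(1)
		go l.run(&q.wg)
	}
	done := make(chan error, 1)
	l.mu.Lock()
	l.jobs = append(l.jobs, pending{job: job, done: done})
	l.mu.Unlock()
	l.cond.Signal()
	return done, nil
}

// Close rejects new work, lets every lane drain its pending jobs, then returns.
func (q *Queue) Close() {
	q.mu.Lock()
	q.closed = true
	for _, l := range q.lanes {
		l.mu.Lock()
		l.closed = true
		l.mu.Unlock()
		l.cond.Broadcast()
	}
	q.mu.Unlock()
	q.wg.Wait()
}

func main() {
	q := NewQueue()
	var order []int
	for i := 0; i < 3; i++ {
		i := i
		if _, err := q.Submit(100, func() error { order = append(order, i); return nil }); err != nil {
			fmt.Println("submit:", err)
		}
	}
	q.Close()
	fmt.Println(order) // [0 1 2]
}
```

Holding the queue mutex across the append is what fixes submit order in this sketch; the job itself runs with no lock held, so a slow guest never stalls another guest's lane.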

Tests

  • Serializer: same-guest serialized (max-concurrency 1, submit order preserved) and different-guests parallel (cross-waiting jobs both complete — would deadlock if not); error propagation; drain-pending-on-close; submit-after-close.
  • Normalization: description round-trip; unknown-field identity; extensibility seam (synthetic boolean-coercion + list-ordering normalizers).
  • Plan: run-state start/stop, spec drift (cores/memory), disk-not-reconciled, description-newline-not-drift, unmanaged fields, spec-unknown skips config keeps run-state, desired-absent skipped, combined ordering, empty-desired no-op, deterministic vmid order.
  • Engine: empty-provider zero mutations; async start (WaitTask); synchronous SetConfig (no WaitTask); WaitTask failure + POST error counted failed; list error = pass failure.
  • Journal: lifecycle latest-wins; in-flight survives restart; idempotency dedupe across restart; failed key not applied; torn-trailing-line skipped.
  • Full module race-clean (go test -race) on the Linux build server; vet clean.
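
The plan behaviours these tests cover (guests in both sets only, config before run-state, SpecKnown gating, description newline not counted as drift, deterministic vmid order) could look roughly like the pure function below. The struct shapes and field names are assumptions, not the package's real DesiredGuest or action types, and trimming exactly one trailing newline is a guess at what NormDescription does.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Desired uses pointers so "field not managed" (nil) differs from a zero value.
type Desired struct {
	Running     *bool
	Cores       *int
	Description *string
}

// Actual is what was read back from the host; SpecKnown is false when the
// config read failed.
type Actual struct {
	Running     bool
	SpecKnown   bool
	Cores       int
	Description string
}

type Action struct {
	VMID int
	Kind string // "SetConfig", "Start" or "Stop"
}

// normDescription drops the trailing newline appended on read (assumed rule).
func normDescription(s string) string { return strings.TrimSuffix(s, "\n") }

// Plan is a pure diff: no I/O, same input always yields the same action list.
func Plan(desired map[int]Desired, actual map[int]Actual) []Action {
	vmids := make([]int, 0, len(desired))
	for id := range desired {
		if _, ok := actual[id]; ok { // provision and destroy are out of scope
			vmids = append(vmids, id)
		}
	}
	sort.Ints(vmids)

	var out []Action
	for _, id := range vmids {
		d, a := desired[id], actual[id]
		if a.SpecKnown { // never write a config that could not be read first
			drift := (d.Cores != nil && *d.Cores != a.Cores) ||
				(d.Description != nil &&
					normDescription(*d.Description) != normDescription(a.Description))
			if drift {
				out = append(out, Action{VMID: id, Kind: "SetConfig"})
			}
		}
		if d.Running != nil && *d.Running != a.Running {
			kind := "Stop"
			if *d.Running {
				kind = "Start"
			}
			out = append(out, Action{VMID: id, Kind: kind})
		}
	}
	return out
}

func main() {
	on, cores, desc := true, 2, "web"
	desired := map[int]Desired{101: {Running: &on, Cores: &cores, Description: &desc}}
	actual := map[int]Actual{101: {Running: false, SpecKnown: true, Cores: 1, Description: "web\n"}}
	fmt.Println(Plan(desired, actual)) // [{101 SetConfig} {101 Start}]
}
```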

Not in this phase (Phase B)

  • The benign/destructive classifier, the reversibility gate, and the signed-op consuming layer over internal/authz (doc 03 §4 / doc 04) — added next, in front of the queue's executor, landing v0.4.0.

v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)

The gate before slice 4: prove SetConfig works live under the scoped token before reconcile is built on it. Self-gated live run PASSED on demo-felhom/guest 9999.

Added

  • Reversible SetConfig step appended to --selftest=task (cmd/felhom-agent/main.go, selftestSetConfig): read GuestConfig → write a description marker (felhom-selftest <RFC3339>) → verify it landed → restore the original value (or delete the key if it was absent) → verify the restore. Handles PVE's dual-mode SetConfig return per the mutate.go contract: empty UPID = synchronous success (printed synchronous); non-empty UPID = WaitTask + assert exitstatus=OK. The existing snapshot → rollback → delete-snapshot steps are unchanged. First live exercise of the VM.Config.* privilege cluster.
  • normDesc / extraString helpers: extraString decodes a string-valued key from GuestConfig.Extra (raw JSON); normDesc strips the trailing newline PVE appends to description on read, so a written value round-trips equal.

Finding (live)

  • The LXC description write returned synchronous (empty UPID) — PVE applied it inline, no task. The agent's dual-mode SetConfig modeling is correct: the empty-string path is real and must not be treated as an error.
  • PVE appends a trailing \n to description on read (stored URL-encoded as %0A). A naive exact-match reconcile would see perpetual drift — slice-4 reconcile must normalize description comparisons (hence normDesc).

Ops

  • Standing operator token (felhom-agent@pve!agent, privsep) rotated during this run (the prior secret was not retrievable); role + both user/token ACL rows re-confirmed at /. New secret stored out-of-band, not persisted to the repo. Guest 9999 left pristine (stopped, no description, no leftover snapshot). Version → 0.3.2.

Docs + live validation — no version bump (2026-06-08)

Changed

  • Reflowed CLAUDE.md — removed hard mid-paragraph line wraps (prose, list items, blockquotes now single-line, soft-wrapped); code blocks and tables untouched; rendered output unchanged.
  • Unified the REPORT/CHANGELOG convention in CLAUDE.md: CHANGELOG.md is the cumulative log (newest on top); REPORT.md is overwritten with the most-recent implementation/validation only. Added an explicit no-secrets rule (never write tokens/passwords/keys into committed files; reference them as stored out-of-band).

Added

  • REPORT.md rewritten for the live --selftest=task validation on the demo host (demo-felhom): snapshot → rollback → delete-snapshot on guest 9999, each polled to exitstatus=OK under the felhom-agent@pve!agent privsep token (UPIDs name the token actor — privsep path genuinely exercised); 16-privilege FelhomAgent role + both user & token ACLs confirmed; --selftest=read clean. Closes the slice-1 "mutating ops unit-tested only" gap; WaitTask async foundation validated live → slice 4 unblocked. (Token secret stored out-of-band, not in the repo.)

v0.3.1 — slice-3 validation follow-ups (2026-06-08)

Changed

  • Collector keeps the known run-status on a GuestConfig failure (internal/hub/collect.go): previously a per-guest config-read error forced status="unknown"; now the run-status from ListLXC is preserved (only the spec is dropped). An empty status is still normalized to unknown (wire value is always running|stopped|unknown). Test renamed to TestCollect_GuestConfigFailureKeepsStatusOmitsSpec and asserts the preserved running + nil spec.
  • --selftest usage error string now reads (want read|task|hub).

Added

  • Cross-repo contract fixture internal/hub/testdata/host-report.golden.json + TestHostReport_ContractMatchesGolden — compares the marshaled HostReport field-name sets (top level + host + guests[0]) against the golden, failing on any json-tag drift. The file is kept byte-identical with felhom-hub's copy (duplicated contract until a shared types module; revisit when slices 5/6 populate the empty collections). Version → 0.3.1.
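
A field-name-set comparison of the kind the contract test performs might be sketched as follows. The helper names fieldNames and diffFields are hypothetical, and this version only inspects the top level of a JSON object, whereas the entry above also checks host and guests[0].

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// fieldNames returns the sorted top-level keys of a JSON object.
func fieldNames(doc []byte) ([]string, error) {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(doc, &m); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(m))
	for k := range m {
		names = append(names, k)
	}
	sort.Strings(names)
	return names, nil
}

// diffFields reports keys the golden has that got lacks (missing) and keys
// got has that the golden lacks (extra). Values are deliberately ignored.
func diffFields(got, golden []byte) (missing, extra []string, err error) {
	g, err := fieldNames(got)
	if err != nil {
		return nil, nil, fmt.Errorf("got: %w", err)
	}
	w, err := fieldNames(golden)
	if err != nil {
		return nil, nil, fmt.Errorf("golden: %w", err)
	}
	have := make(map[string]bool, len(g))
	for _, n := range g {
		have[n] = true
	}
	want := make(map[string]bool, len(w))
	for _, n := range w {
		want[n] = true
		if !have[n] {
			missing = append(missing, n)
		}
	}
	for _, n := range g {
		if !want[n] {
			extra = append(extra, n)
		}
	}
	return missing, extra, nil
}

func main() {
	got := []byte(`{"host_id":"h","guests":[]}`)
	golden := []byte(`{"guests":[],"host_id":"x","backups":[]}`)
	missing, extra, err := diffFields(got, golden)
	fmt.Println(missing, extra, err)
}
```

Comparing name sets rather than whole documents keeps the test stable when sample values change while still failing on any renamed or dropped json tag.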

v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)

The agent's first daemon: a periodic read-only host-report POSTed to the hub (the heartbeat). No Proxmox mutations, no desired-state/signed-op consumption, no storage/backup collection yet — those are slices 4/5/6.

Added

  • internal/hub package:
    • HostReport wire contract (report.go) shared field-for-field with the hub ingest: host metrics, guests (vmid + spec), cloudflared status, and the storage_targets/backups/restore_tests/pbs_snapshots/audit_tail collections defined but emitted empty (typed [], slices 5/6 fill them).
    • Collector (collect.go) builds the report from a read-only proxmoxReader (adapted to the real internal/proxmox surface — node held by the client, value returns, proxmox.Guest) + a CloudflaredProber. Partial-failure policy: a failed NodeStatus is a hard error (skip the POST); a failed per-guest GuestConfig degrades that guest to status="unknown" (spec omitted) but still sends; a cloudflared probe failure → "unknown", never fatal.
    • CloudflaredProber + SystemctlProber (systemctl is-active cloudflared; read-only — NOT a Privileged/root op; tunnel management is a later slice).
    • Client (client.go): POST /api/v1/host-report with Authorization: Bearer <key>, standard TLS (system roots or optional ca_file; verification always on). Typed *TransportError / *HTTPError; the bearer token never appears in any error.
    • Loop (loop.go): the daemon — immediate first report then tick; adopts the hub's poll_interval_seconds clamped to [60,3600]; resilient (a collect/report error is logged and the loop continues); clean shutdown on context cancel.
    • ControlEnvelope: only poll_interval_seconds is acted on; blocked / desired_generation / has_signed_ops are parsed-but-ignored (logged at most) pending reconcile (slice 4).
  • Config: HubConfig (url/host_id/api_key/poll_seconds/timeout_seconds/ca_file), FELHOM_AGENT_HUB_* env overlay, HubConfig.Validate() (mode-aware — proxmox-only --selftest=read|task still runs without hub config), WithDefaults(), and Redacted() now also blanks the hub key. configs/agent.example.json gains hub (and authz) blocks.
  • cmd/felhom-agent: running without --selftest is now the daemon (poll loop); added --selftest=hub (one collect+report, prints the report + envelope). Version 0.2.0 → 0.3.0.
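
The interval adoption in the loop can be reduced to a clamp. In this sketch the function name adoptInterval is hypothetical, and keeping the current interval when the hub sends zero or a negative value is an assumption; the changelog only states the [60,3600] clamp.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minPoll = 60 * time.Second
	maxPoll = 3600 * time.Second
)

// adoptInterval returns the interval to use after a hub response.
func adoptInterval(current time.Duration, hubSeconds int) time.Duration {
	if hubSeconds <= 0 {
		return current // assumption: an absent or zero value means "no change"
	}
	d := time.Duration(hubSeconds) * time.Second
	if d < minPoll {
		return minPoll
	}
	if d > maxPoll {
		return maxPoll
	}
	return d
}

func main() {
	cur := 300 * time.Second
	for _, s := range []int{0, 10, 120, 99999} {
		fmt.Println(s, adoptInterval(cur, s))
	}
}
```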

Tests

  • Report serialization (field names; empty collections are [] not null; spec omitted when unknown); client (Bearer header, non-2xx→*HTTPError, transport→*TransportError, token never in error); collector (host mapping, guest spec, per-guest failure degrades-but-still-reports, NodeStatus hard error, cloudflared error→unknown); loop (immediate first report, continuation after an injected error, interval adoption + clamp); config (hub validate/redact/env).
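
The serialization properties tested here follow from how encoding/json treats nil slices and nil pointers. The sketch below uses a trimmed-down, hypothetical report type (the real HostReport has more fields, and the element type of backups is not stated in this changelog) to show why collections must be initialised to empty slices and why an unknown spec is a nil pointer with omitempty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GuestSpec field names are illustrative.
type GuestSpec struct {
	Cores int `json:"cores"`
}

type Guest struct {
	VMID   int        `json:"vmid"`
	Status string     `json:"status"`
	Spec   *GuestSpec `json:"spec,omitempty"` // nil pointer: key is omitted
}

type Report struct {
	Guests  []Guest  `json:"guests"`
	Backups []string `json:"backups"` // placeholder element type
}

// newReport initialises every collection to a non-nil empty slice, because a
// nil slice marshals as null rather than [].
func newReport() Report {
	return Report{Guests: []Guest{}, Backups: []string{}}
}

// encode returns the compact JSON form, or "" on a marshal error.
func encode(r Report) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(Report{}))    // {"guests":null,"backups":null}
	fmt.Println(encode(newReport())) // {"guests":[],"backups":[]}

	r := newReport()
	r.Guests = append(r.Guests, Guest{VMID: 101, Status: "unknown"})
	fmt.Println(encode(r)) // {"guests":[{"vmid":101,"status":"unknown"}],"backups":[]}
}
```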

Notes

  • internal/proxmox and internal/authz were not touched — no new proxmox surface was needed (ListLXC already exposes status/maxmem/maxdisk; GuestConfig exposes cores). The task's proxmoxReader sketch (node-arg/pointer/LXC) was adapted to the real exports as instructed.
  • Defined-but-empty this slice: storage_targets, backups, restore_tests, pbs_snapshots, audit_tail (slices 5/6). Parsed-but-ignored: the envelope's blocked/desired_generation/has_signed_ops (slice 4).

v0.2.0 — authz signed-op verifier (slice 2) (2026-06-08)

Production form of the Phase-4 signing primitive: a key-type-agnostic SSHSIG verifier for operator-signed destructive ops, with the full anti-replay/authorization pipeline and a durable, crash-safe nonce store. This is what slice 4 (reconcile) will call to gate destructive desired-state deltas. No hub, no signing CLI, no reconcile loop.

Added

  • internal/authz Verifier: New(signers, store, hostID) + Verify(blob, sigArmored) (*VerifiedOp, error). Runs the LOCKED pipeline (order is load-bearing): parse armor → namespace → parse pubkey → allow-list (by key material, pub.Marshal() equality, not key_id) → crypto verify (over the raw received bytes, never re-canonicalized) → parse blob → target → time window → nonce recorded LAST. Each post-crypto stage rejects even with a valid signature.
  • SSHSIG framing (sshsig.go) via golang.org/x/crypto/ssh: pem.Decode → strip 6-byte magic → ssh.Unmarshal → ssh.ParsePublicKey → recompute signed data with the named hash → pub.Verify (dispatches on key algorithm). No hand-rolled crypto. Key-type-agnostic: ed25519 / sk-ssh-ed25519 (FIDO2) / rsa / ecdsa via the one path.
  • Fixed namespace felhom-op-v1 (package constant, never caller-supplied).
  • OpBlob (corrected host_id/guest_id json tags) + VerifiedOp (op, host/guest, params, key_id, matched signer). key_id is advisory/audit only — never an authz input.
  • Typed errors: ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget, ErrExpired, ErrNotYetValid, ErrReplay (errors.Is-friendly).
  • NonceStore + two impls: MemoryNonceStore (tests) and FileNonceStore — durable, crash-safe (fsync'd append log, replayed into an index on open, periodic compaction, expiry-only pruning). A nonce is fsync'd to disk before SeenOrRecord returns false; replay protection survives restart; I/O failure fails safe (reports seen=true). Target generalization: host_id matched strictly, guest_id surfaced for the caller to route.
  • Config: AuthzConfig (nonce-store path + pinned operator signers tagged operational/recovery with a key_id, as authorized_keys lines).
  • Version 0.2.0.
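
Why the stage order matters, and in particular why the nonce is recorded last, can be shown with a reduced pipeline. This sketch is not the authz package: it swaps SSHSIG for a bare crypto/ed25519 signature from the standard library, drops the armor, namespace, and allow-list stages, and uses an in-memory nonce set. Only host_id is a JSON key named in this changelog; op, nonce, not_before, and expires are assumed names.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

var (
	ErrMalformed    = errors.New("malformed")
	ErrBadSignature = errors.New("bad signature")
	ErrTarget       = errors.New("wrong target")
	ErrExpired      = errors.New("expired")
	ErrNotYetValid  = errors.New("not yet valid")
	ErrReplay       = errors.New("replay")
)

type opBlob struct {
	Op        string `json:"op"`
	HostID    string `json:"host_id"`
	Nonce     string `json:"nonce"`
	NotBefore int64  `json:"not_before"` // unix seconds (assumed encoding)
	Expires   int64  `json:"expires"`
}

type verifier struct {
	pub    ed25519.PublicKey
	hostID string
	seen   map[string]bool
}

func (v *verifier) Verify(blob, sig []byte) (*opBlob, error) {
	// 1. Crypto over the raw received bytes, before any parsing.
	if !ed25519.Verify(v.pub, blob, sig) {
		return nil, ErrBadSignature
	}
	// 2. Parse only what has been authenticated.
	var op opBlob
	if err := json.Unmarshal(blob, &op); err != nil {
		return nil, ErrMalformed
	}
	// 3. Target, then 4. time window.
	if op.HostID != v.hostID {
		return nil, ErrTarget
	}
	now := time.Now().Unix()
	if now < op.NotBefore {
		return nil, ErrNotYetValid
	}
	if now > op.Expires {
		return nil, ErrExpired
	}
	// 5. Nonce last, so a rejected op never consumes a nonce.
	if v.seen[op.Nonce] {
		return nil, ErrReplay
	}
	v.seen[op.Nonce] = true
	return &op, nil
}

// newDemo returns a verifier plus a signer that emits blobs valid for hostID.
func newDemo(hostID string) (*verifier, func(nonce string) ([]byte, []byte)) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	v := &verifier{pub: pub, hostID: hostID, seen: map[string]bool{}}
	sign := func(nonce string) ([]byte, []byte) {
		now := time.Now().Unix()
		blob, _ := json.Marshal(opBlob{Op: "destroy", HostID: hostID, Nonce: nonce,
			NotBefore: now - 60, Expires: now + 60})
		return blob, ed25519.Sign(priv, blob)
	}
	return v, sign
}

func main() {
	v, sign := newDemo("demo-host")
	blob, sig := sign("n1")
	_, err := v.Verify(blob, sig)
	fmt.Println("first:", err)
	_, err = v.Verify(blob, sig)
	fmt.Println("again:", err)
}
```

Had the nonce been recorded before the signature check, anyone able to send garbage with a guessed nonce could pre-burn it and block the genuine operation.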

Tests

  • Real OpenSSH interop via a committed ssh-keygen -Y sign vector (hermetic CI); per-stage rejection (each with an otherwise-valid sig); the headline invalid-sig-does-not-burn-the-nonce invariant; replay; persistence across restart; synthetic sk-ssh-ed25519 through the unchanged path; byte-exactness (a re-serialized blob fails crypto — not re-canonicalized).
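
The persistence-across-restart property rests on the append log being synced before a nonce is reported as fresh. The sketch below is a reduced stand-in for FileNonceStore under stated assumptions: one newline-free nonce per line, no expiry, no compaction, no torn-tail handling, and a SeenOrRecord signature simplified to a single string argument.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

type fileNonceStore struct {
	f    *os.File
	seen map[string]bool
}

// openNonceStore replays the existing log into an in-memory index.
func openNonceStore(path string) (*fileNonceStore, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	s := &fileNonceStore{f: f, seen: map[string]bool{}}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if line := sc.Text(); line != "" {
			s.seen[line] = true
		}
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, err
	}
	return s, nil
}

// SeenOrRecord returns false only after the nonce is durably on disk.
// Any I/O failure reports seen=true, so the caller rejects the op.
func (s *fileNonceStore) SeenOrRecord(nonce string) bool {
	if s.seen[nonce] {
		return true
	}
	if _, err := s.f.WriteString(nonce + "\n"); err != nil {
		return true // fail safe
	}
	if err := s.f.Sync(); err != nil {
		return true // fail safe
	}
	s.seen[nonce] = true
	return false
}

// demoRestart records a nonce, reopens the store, and checks it again.
func demoRestart() (first, afterRestart bool, err error) {
	dir, err := os.MkdirTemp("", "nonce-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nonces.log")

	s, err := openNonceStore(path)
	if err != nil {
		return false, false, err
	}
	first = s.SeenOrRecord("n-1")
	s.f.Close() // simulate process exit

	s2, err := openNonceStore(path)
	if err != nil {
		return first, false, err
	}
	defer s2.f.Close()
	afterRestart = s2.SeenOrRecord("n-1")
	return first, afterRestart, nil
}

func main() {
	fmt.Println(demoRestart()) // false true <nil>
}
```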

Notes / corrections to the Phase-4 reference

  • §7's Target lacked json tags (host_id/guest_id) — fixed.
  • The doc paired "Go 1.24.4 / x/crypto v0.52.0", but v0.52.0 declares go 1.25.0 and does not build on Go 1.24. Resolved by upgrading the build server to go1.26.0 (backward-compatible; felhom-controller/hub unaffected); the module is go 1.25.0 on x/crypto v0.52.0.
  • Free function → constructed Verifier; returns the full VerifiedOp; typed errors; clock-skew tolerance added; durable nonce store is the net-new work.
  • Shared-contract dependency flagged (not built): the hub and the felhom-sign CLI must emit byte-identical canonical JSON or signatures won't verify; a shared canonicalizer both import would be the right home.

v0.1.0 — Scaffold + proxmox interaction layer (slice 1) (2026-06-08)

First slice: stand up the host-agent project and its foundation — the typed Proxmox interaction layer every other module will call. No reconcile loop, hub client, signing, or storage/backup orchestration yet (later slices).

Added

  • Project scaffold: module gitea.dooplex.hu/admin/felhom-agent, binary felhom-agent (cmd/felhom-agent/), Go 1.24, zero external dependencies (pure stdlib). --version flag; version var overridable via -ldflags "-X main.version=<v>".
  • internal/proxmox — API backend (Client): hand-rolled REST client over https://<host>:8006/api2/json with PVEAPIToken auth. Typed read ops (Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent) and async mutating ops returning a UPID (RestoreLXC — the primary create path, Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop).
  • WaitTask: polls GET /nodes/{node}/tasks/{upid}/status until stopped, then asserts exitstatus == "OK" (authorization can surface at task execution, not the POST — phase1-2 §1.3). Exponential backoff (1s→5s cap), context cancellation + timeout. *APIError parses the offending privilege from a 403; *TaskError parses it from a failed task exitstatus + log tail.
  • internal/proxmox — fenced root-CLI backend (Privileged): limited to the three proven OS-root exceptions only — CreateGoldenLXC (keyctl pct create), MountUSBByUUID, SMART, Sensors; each cites why it can't be the API. Fence is structural (Client never shells out, Privileged never makes an HTTP call) and asserted in tests.
  • TLS trust: SHA-256 leaf-cert pinning (the host serves a self-signed cert) or a CA file; an explicitly-named insecure_skip_verify that is off by default. No blanket verification disable.
  • internal/config: JSON config file + FELHOM_AGENT_* env overrides; the token secret is never logged (Redacted()).
  • internal/log: slog setup (text, stderr, configurable level).
  • cmd/felhom-agent --selftest: read-only health report against a live host (version/nodes/status/guests/storage); --selftest=task --vmid N exercises WaitTask on a reversible snapshot→rollback→delete op (gated; default selftest mutates nothing).
  • Tests: unit tests with a mock HTTP transport + mock runner (UPID parse, WaitTask running→OK / failed-403 / timeout / ctx-cancel, 403→privilege error, response decoding against shapes captured live from demo-felhom, config redaction, and the API-vs-root routing fence).
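
The WaitTask behaviour listed above (poll until stopped, assert the exit status, back off from 1s to a 5s cap, honour context cancellation) might be sketched like this. The doubling factor is an assumption, since the changelog gives only the endpoints, and taskStatus, waitTask, and the poll callback are hypothetical stand-ins for the real client call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	initialDelay = 1 * time.Second
	maxDelay     = 5 * time.Second
)

// nextDelay doubles the delay up to the cap (assumed growth factor).
func nextDelay(d, max time.Duration) time.Duration {
	d *= 2
	if d > max {
		return max
	}
	return d
}

// backoffSchedule lists the first n waits: 1s, 2s, 4s, 5s, 5s, ...
func backoffSchedule(n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	d := initialDelay
	for i := 0; i < n; i++ {
		out = append(out, d)
		d = nextDelay(d, maxDelay)
	}
	return out
}

type taskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // meaningful only once stopped
}

// waitTask polls until the task stops, then requires exitstatus "OK".
func waitTask(ctx context.Context, poll func(context.Context) (taskStatus, error),
	initial, max time.Duration) error {
	delay := initial
	for {
		st, err := poll(ctx)
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task failed: exitstatus %q", st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay = nextDelay(delay, max)
	}
}

// demoWait replays a non-empty list of canned statuses with millisecond
// delays so the sketch finishes quickly.
func demoWait(statuses []taskStatus) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	i := 0
	poll := func(context.Context) (taskStatus, error) {
		st := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		return st, nil
	}
	return waitTask(ctx, poll, time.Millisecond, 5*time.Millisecond)
}

func main() {
	fmt.Println(backoffSchedule(5)) // [1s 2s 4s 5s 5s]
	fmt.Println(demoWait([]taskStatus{{"running", ""}, {"stopped", "OK"}}))
}
```

Checking the exit status after the task stops is the important part: as the entry notes, an authorization failure can surface at task execution rather than at the POST.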

Notes

  • Types are grounded in the spike findings (felhom.eu/documentation/proxmox-platform.md, tests/phase{0,1-2,3}-findings.md) and the exact JSON shapes captured live from demo-felhom (PVE 9.2.2).
  • Verified: go build/vet/test green on Go 1.24.4 (build server) and a live read-only --selftest against the demo host with TLS fingerprint pinning.
  • The 16-privilege FelhomAgent role + privsep token (role on both user and token) is provisioned out-of-band; the agent only consumes the token.
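
For readers unfamiliar with leaf-certificate pinning against a self-signed host, one common way to do it with Go's standard library is sketched below. It is not taken from the agent's source: matchesPin and pinnedTLSConfig are hypothetical names, and the pin format (hex, optional colons) is assumed. Note that this pattern sets InsecureSkipVerify to turn off chain validation and replaces it with a mandatory fingerprint check in VerifyConnection, which is different from the opt-in insecure_skip_verify setting mentioned in this slice.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// matchesPin reports whether SHA-256(rawCert) equals the hex pin.
// Colons are tolerated and hex case does not matter.
func matchesPin(rawCert []byte, pin string) bool {
	want, err := hex.DecodeString(strings.ReplaceAll(pin, ":", ""))
	if err != nil || len(want) != sha256.Size {
		return false
	}
	got := sha256.Sum256(rawCert)
	return subtle.ConstantTimeCompare(got[:], want) == 1
}

// pinnedTLSConfig trusts exactly one leaf certificate, identified by the
// SHA-256 of its DER encoding.
func pinnedTLSConfig(pin string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// Chain validation cannot succeed for a self-signed cert, so it is
		// replaced (not removed) by the pin check below.
		InsecureSkipVerify: true,
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("tls pin: server sent no certificate")
			}
			if !matchesPin(cs.PeerCertificates[0].Raw, pin) {
				return errors.New("tls pin: leaf certificate fingerprint mismatch")
			}
			return nil
		},
	}
}

func main() {
	// SHA-256 of the three bytes "abc", used here as a stand-in for a cert.
	pin := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(matchesPin([]byte("abc"), pin)) // true
	fmt.Println(matchesPin([]byte("abd"), pin)) // false
	_ = pinnedTLSConfig(pin)
}
```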