felhom-agent/CHANGELOG.md
admin f0fee7e193 feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)
internal/authz: production form of the Phase-4 SSHSIG signing primitive.

- Verifier.New/Verify with the LOCKED pipeline (namespace → allow-list by key
  material → crypto over RAW bytes → target → time → nonce LAST); each post-crypto
  stage rejects even with a valid sig; an invalid sig never burns a nonce.
- SSHSIG framing via x/crypto/ssh (no hand-rolled crypto); key-type-agnostic
  (ed25519 / sk-ssh-ed25519 / rsa / ecdsa via pub.Verify). Fixed namespace
  felhom-op-v1. Typed errors. OpBlob (fixed host_id/guest_id tags) + VerifiedOp.
- NonceStore: MemoryNonceStore + durable crash-safe FileNonceStore (fsync'd append
  log, replay-on-open, compaction, expiry-only pruning; survives restart).
- config.AuthzConfig (nonce path + pinned operational/recovery signer keys).
- Tests (14): real ssh-keygen fixture, per-stage rejection, nonce-not-burned,
  replay, persistence-across-restart, synthetic sk, byte-exactness.

Dep: golang.org/x/crypto v0.52.0 (declares go 1.25 — the Phase-4 doc's "Go 1.24.4 /
x/crypto v0.52.0" pairing doesn't build; build server upgraded to go1.26.0,
backward-compatible). Version 0.1.0 -> 0.2.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:23:02 +02:00


Changelog

All notable changes to felhom-agent are recorded here. Update on every code change that gets pushed.

v0.2.0 — authz signed-op verifier (slice 2) (2026-06-08)

Production form of the Phase-4 signing primitive: a key-type-agnostic SSHSIG verifier for operator-signed destructive ops, with the full anti-replay/authorization pipeline and a durable, crash-safe nonce store. This is what slice 4 (reconcile) will call to gate destructive desired-state deltas. No hub, no signing CLI, no reconcile loop.
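
The stage ordering of the verifier can be illustrated with a minimal sketch. It is not the internal/authz code: it swaps SSHSIG framing for plain stdlib crypto/ed25519, takes the public key as a separate argument instead of extracting it from an armored signature, skips the armor and namespace stages, and uses an in-memory map where the real verifier uses a NonceStore. The error names follow this changelog; the blob fields op, nonce, not_before and expires, and all function names, are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Error names follow the changelog; the definitions are illustrative.
var (
	ErrMalformed     = errors.New("authz: malformed blob")
	ErrUnknownSigner = errors.New("authz: unknown signer")
	ErrBadSignature  = errors.New("authz: bad signature")
	ErrTarget        = errors.New("authz: wrong target host")
	ErrExpired       = errors.New("authz: expired")
	ErrNotYetValid   = errors.New("authz: not yet valid")
	ErrReplay        = errors.New("authz: nonce replayed")
)

// opBlob is a hypothetical wire shape; only host_id/guest_id come from the changelog.
type opBlob struct {
	Op        string `json:"op"`
	HostID    string `json:"host_id"`
	GuestID   string `json:"guest_id"`
	Nonce     string `json:"nonce"`
	NotBefore int64  `json:"not_before"` // unix seconds (assumed field)
	Expires   int64  `json:"expires"`    // unix seconds (assumed field)
}

type verifier struct {
	signers []ed25519.PublicKey // pinned signer keys
	hostID  string
	seen    map[string]bool // stand-in for a NonceStore
	now     func() time.Time
}

func (v *verifier) verify(raw, sig []byte, pub ed25519.PublicKey) (*opBlob, error) {
	allowed := false
	for _, s := range v.signers { // allow-list by key material, not by key_id
		if bytes.Equal(s, pub) {
			allowed = true
		}
	}
	if !allowed {
		return nil, ErrUnknownSigner
	}
	if !ed25519.Verify(pub, raw, sig) { // crypto over the raw received bytes
		return nil, ErrBadSignature
	}
	var b opBlob
	if err := json.Unmarshal(raw, &b); err != nil { // parse only after crypto
		return nil, fmt.Errorf("%w: %v", ErrMalformed, err)
	}
	if b.HostID != v.hostID {
		return nil, ErrTarget
	}
	now := v.now().Unix()
	if now < b.NotBefore {
		return nil, ErrNotYetValid
	}
	if now >= b.Expires {
		return nil, ErrExpired
	}
	if v.seen[b.Nonce] { // nonce is checked and recorded last
		return nil, ErrReplay
	}
	v.seen[b.Nonce] = true
	return &b, nil
}

// demoPipeline runs a tampered signature, then a valid one, then a replay.
func demoPipeline() [3]error {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	fixed := time.Unix(1780000000, 0)
	v := &verifier{signers: []ed25519.PublicKey{pub}, hostID: "host-a",
		seen: map[string]bool{}, now: func() time.Time { return fixed }}
	raw := []byte(`{"op":"destroy","host_id":"host-a","guest_id":"101",` +
		`"nonce":"n1","not_before":1779999900,"expires":1780000300}`)
	sig := ed25519.Sign(priv, raw)
	bad := append([]byte(nil), sig...)
	bad[0] ^= 0xff // corrupt the signature

	var out [3]error
	_, out[0] = v.verify(raw, bad, pub) // rejected before the nonce stage
	_, out[1] = v.verify(raw, sig, pub) // accepted: the nonce was not burned
	_, out[2] = v.verify(raw, sig, pub) // rejected as a replay
	return out
}

func main() {
	r := demoPipeline()
	fmt.Println("tampered sig:", r[0])
	fmt.Println("valid sig:   ", r[1])
	fmt.Println("second use:  ", r[2])
}
```

Because the nonce is only touched after every other check has passed, the tampered attempt leaves the nonce unused and the later valid request still succeeds.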

Added

  • internal/authz Verifier: New(signers, store, hostID) + Verify(blob, sigArmored) (*VerifiedOp, error). Runs the LOCKED pipeline (order is load-bearing): parse armor → namespace → parse pubkey → allow-list (by key material, pub.Marshal() equality, not key_id) → crypto verify (over the raw received bytes, never re-canonicalized) → parse blob → target → time window → nonce recorded LAST. Each post-crypto stage rejects even with a valid signature.
  • SSHSIG framing (sshsig.go) via golang.org/x/crypto/ssh: pem.Decode → strip the 6-byte magic → ssh.Unmarshal → ssh.ParsePublicKey → recompute the signed data with the named hash → pub.Verify (dispatches on key algorithm). No hand-rolled crypto. Key-type-agnostic: ed25519 / sk-ssh-ed25519 (FIDO2) / rsa / ecdsa via the one path.
  • Fixed namespace felhom-op-v1 (package constant, never caller-supplied).
  • OpBlob (corrected host_id/guest_id json tags) + VerifiedOp (op, host/guest, params, key_id, matched signer). key_id is advisory/audit only — never an authz input.
  • Typed errors: ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget, ErrExpired, ErrNotYetValid, ErrReplay (errors.Is-friendly).
  • NonceStore + two impls: MemoryNonceStore (tests) and FileNonceStore — durable, crash-safe (fsync'd append log, replayed into an index on open, periodic compaction, expiry-only pruning). A nonce is fsync'd to disk before SeenOrRecord returns false; replay protection survives restart; I/O failure fails safe (reports seen=true).
  • Target generalization: host_id matched strictly, guest_id surfaced for the caller to route.
  • Config: AuthzConfig (nonce-store path + pinned operator signers tagged operational/recovery with a key_id, as authorized_keys lines).
  • Version 0.2.0.
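
The durable nonce store described in the list above can be sketched as an fsync'd append log that is replayed into an in-memory index on open. This is an assumption-laden illustration, not the FileNonceStore implementation: the line format, the SeenOrRecord signature, and the names fileNonceStore and openNonceStore are hypothetical, and compaction, expiry pruning, and torn-tail repair are omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// fileNonceStore keeps an in-memory index backed by an append-only log.
type fileNonceStore struct {
	mu   sync.Mutex
	f    *os.File
	seen map[string]int64 // nonce -> expiry (unix seconds)
}

// openNonceStore opens (or creates) the log and replays it into the index.
func openNonceStore(path string) (*fileNonceStore, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	s := &fileNonceStore{f: f, seen: map[string]int64{}}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var exp int64
		var nonce string
		// Assumed line format: "<expiry> <nonce>", nonce without whitespace.
		if _, err := fmt.Sscanf(sc.Text(), "%d %s", &exp, &nonce); err != nil {
			continue // unparseable line; a real store would repair a torn tail
		}
		s.seen[nonce] = exp
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, err
	}
	return s, nil
}

// SeenOrRecord reports whether the nonce was already recorded. It returns
// false only after the new record has been written and fsync'd; any I/O
// failure is reported as seen=true so the caller rejects the operation.
func (s *fileNonceStore) SeenOrRecord(nonce string, expiry int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[nonce]; ok {
		return true
	}
	if _, err := fmt.Fprintf(s.f, "%d %s\n", expiry, nonce); err != nil {
		return true // fail safe
	}
	if err := s.f.Sync(); err != nil {
		return true // fail safe
	}
	s.seen[nonce] = expiry
	return false
}

// demoNonceStore records a nonce, reopens the log, and checks it is still seen.
func demoNonceStore() [4]bool {
	dir, err := os.MkdirTemp("", "nonce-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nonces.log")

	var out [4]bool
	s, err := openNonceStore(path)
	if err != nil {
		panic(err)
	}
	out[0] = s.SeenOrRecord("n1", 1780000300) // first use
	out[1] = s.SeenOrRecord("n1", 1780000300) // replay in the same process
	s.f.Close()

	s, err = openNonceStore(path) // simulate a restart
	if err != nil {
		panic(err)
	}
	out[2] = s.SeenOrRecord("n1", 1780000300) // replay after restart
	out[3] = s.SeenOrRecord("n2", 1780000300) // fresh nonce
	s.f.Close()
	return out
}

func main() {
	fmt.Println(demoNonceStore()) // [false true true false]
}
```

Reporting seen=true on I/O failure trades availability for safety: a destructive op is refused whenever its nonce cannot be made durable.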

Tests

  • Real OpenSSH interop via a committed ssh-keygen -Y sign vector (hermetic CI); per-stage rejection (each with an otherwise-valid sig); the headline invalid-sig-does-not-burn-the-nonce invariant; replay; persistence across restart; synthetic sk-ssh-ed25519 through the unchanged path; byte-exactness (a re-serialized blob fails crypto — not re-canonicalized).
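
The byte-exactness property can be reproduced in isolation. The sketch below uses stdlib crypto/ed25519 rather than SSHSIG, and the blob contents and the function name byteExact are made up for illustration. It signs a JSON blob, round-trips it through encoding/json (which emits map keys in sorted order), and shows that the signature verifies only over the original bytes.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"fmt"
)

// byteExact reports whether the signature verifies over the raw bytes and
// over a semantically identical but re-serialized copy.
func byteExact() (rawOK, reserializedOK bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	raw := []byte(`{"op":"destroy","host_id":"host-a","guest_id":"101"}`)
	sig := ed25519.Sign(priv, raw)

	// Decode and re-encode: same data, different key order, different bytes.
	var m map[string]any
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	again, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	return ed25519.Verify(pub, raw, sig), ed25519.Verify(pub, again, sig)
}

func main() {
	rawOK, reOK := byteExact()
	fmt.Println("raw bytes verify:      ", rawOK) // true
	fmt.Println("re-serialized verifies:", reOK)  // false
}
```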

Notes / corrections to the Phase-4 reference

  • §7's Target lacked json tags (host_id/guest_id) — fixed.
  • The doc paired "Go 1.24.4 / x/crypto v0.52.0", but v0.52.0 declares go 1.25.0 and does not build on Go 1.24. Resolved by upgrading the build server to go1.26.0 (backward-compatible; felhom-controller/hub unaffected); the module is go 1.25.0 on x/crypto v0.52.0.
  • Shape changes from the reference: the free function became a constructed Verifier that returns the full VerifiedOp; typed errors and clock-skew tolerance were added; the durable nonce store is the net-new work.
  • Shared-contract dependency flagged (not built): the hub and the felhom-sign CLI must emit byte-identical canonical JSON or signatures won't verify; a shared canonicalizer both import would be the right home.

v0.1.0 — Scaffold + proxmox interaction layer (slice 1) (2026-06-08)

First slice: stand up the host-agent project and its foundation — the typed Proxmox interaction layer every other module will call. No reconcile loop, hub client, signing, or storage/backup orchestration yet (later slices).

Added

  • Project scaffold: module gitea.dooplex.hu/admin/felhom-agent, binary felhom-agent (cmd/felhom-agent/), Go 1.24, zero external dependencies (pure stdlib). --version flag; version var overridable via -ldflags "-X main.version=<v>".
  • internal/proxmox — API backend (Client): hand-rolled REST client over https://<host>:8006/api2/json with PVEAPIToken auth. Typed read ops (Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent) and async mutating ops returning a UPID (RestoreLXC — the primary create path, Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop).
  • WaitTask: polls GET /nodes/{node}/tasks/{upid}/status until stopped, then asserts exitstatus == "OK" (authorization can surface at task execution, not the POST — phase1-2 §1.3). Exponential backoff (1s→5s cap), context cancellation + timeout. *APIError parses the offending privilege from a 403; *TaskError parses it from a failed task exitstatus + log tail.
  • internal/proxmox — fenced root-CLI backend (Privileged): limited to the proven OS-root exceptions only — CreateGoldenLXC (keyctl pct create), MountUSBByUUID, SMART, Sensors; each cites why it can't be the API. Fence is structural (Client never shells out, Privileged never makes an HTTP call) and asserted in tests.
  • TLS trust: SHA-256 leaf-cert pinning (the host serves a self-signed cert) or a CA file; an explicitly-named insecure_skip_verify that is off by default. No blanket verification disable.
  • internal/config: JSON config file + FELHOM_AGENT_* env overrides; the token secret is never logged (Redacted()).
  • internal/log: slog setup (text, stderr, configurable level).
  • cmd/felhom-agent --selftest: read-only health report against a live host (version/nodes/status/guests/storage); --selftest=task --vmid N exercises WaitTask on a reversible snapshot→rollback→delete op (gated; default selftest mutates nothing).
  • Tests: unit tests with a mock HTTP transport + mock runner (UPID parse, WaitTask running→OK / failed-403 / timeout / ctx-cancel, 403→privilege error, response decoding against shapes captured live from demo-felhom, config redaction, and the API-vs-root routing fence).
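
The WaitTask polling behaviour listed above (poll until stopped, assert exitstatus == "OK", exponential backoff with a cap, context cancellation) can be sketched as follows. This is not the internal/proxmox code: the poll callback stands in for the HTTP call, and taskStatus, waitTask, and the demo helpers are hypothetical names. The delays are parameters so the demo runs in milliseconds; the changelog's values would be 1s and 5s.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// taskStatus mirrors the two fields the changelog mentions (assumed shape).
type taskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // "OK" on success
}

// waitTask polls until the task stops, doubling the delay up to maxDelay.
func waitTask(ctx context.Context, poll func(context.Context) (taskStatus, error),
	initial, maxDelay time.Duration) error {
	delay := initial
	for {
		st, err := poll(ctx)
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task failed: %s", st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done(): // cancellation or timeout wins over the next poll
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// demoWait reports "running" twice, then "stopped"/"OK"; it returns the poll count.
func demoWait() (int, error) {
	calls := 0
	poll := func(context.Context) (taskStatus, error) {
		calls++
		if calls < 3 {
			return taskStatus{Status: "running"}, nil
		}
		return taskStatus{Status: "stopped", ExitStatus: "OK"}, nil
	}
	err := waitTask(context.Background(), poll, time.Millisecond, 5*time.Millisecond)
	return calls, err
}

// demoTimeout polls a task that never stops under a short deadline.
func demoTimeout() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	poll := func(context.Context) (taskStatus, error) {
		return taskStatus{Status: "running"}, nil
	}
	err := waitTask(ctx, poll, time.Millisecond, 5*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	n, err := demoWait()
	fmt.Println("polls:", n, "err:", err)
	fmt.Println("timed out:", demoTimeout())
}
```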

Notes

  • Types are grounded in the spike findings (felhom.eu/documentation/proxmox-platform.md, tests/phase{0,1-2,3}-findings.md) and the exact JSON shapes captured live from demo-felhom (PVE 9.2.2).
  • Verified: go build/vet/test green on Go 1.24.4 (build server) and a live read-only --selftest against the demo host with TLS fingerprint pinning.
  • The 16-privilege FelhomAgent role + privsep token (role on both user and token) is provisioned out-of-band; the agent only consumes the token.