6 Commits

Author SHA1 Message Date
admin 1af21a6cac v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer
The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-
  ness, NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:56:20 +02:00
admin 05c450147c v0.4.0-rc1: slice 4 Phase A — reconcile engine (structural, runs live unfed)
New internal/reconcile package: the agent-side control core's structural half.

- Per-guest serializer Queue (doc 03 §10): the single choke point all mutation
  sources funnel through; same-vmid serial in submit order, different vmids
  parallel (cond-var FIFO lanes).
- Desired-state model + DesiredProvider seam; EmptyProvider is the only live
  source at slice 4 (no hub serving until slice 10) so the live engine computes
  an empty action set and performs zero mutations.
- Normalization layer (FieldNormalizers): normalized desired-vs-actual so
  Proxmox round-trip quirks don't read as drift. normDesc promoted out of
  main.go to reconcile.NormDescription; selftest uses the shared helper.
- Plan (pure diff): minimal benign action set (Start/Stop/SetConfig) for guests
  in both desired and actual; provision/destroy out of scope here.
- Engine: dispatches onto the shared queue; honors the dual-mode SetConfig
  contract (UPID -> WaitTask; empty UPID -> synchronous success).
- Durable op journal + idempotency store (mirrors authz.FileNonceStore):
  in-flight task ids for crash detection + AlreadyApplied dedupe across restart.
- Wired into runDaemon alongside the hub loop, sharing the queue; runs cleanly
  with no desired state and no signers.

Full module race-clean and vet-clean on the Linux build server.

CHECKPOINT: Phase A only. Awaiting validation before Phase B (the reversibility
gate + signed-op consuming layer, landing v0.4.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:21:55 +02:00
admin e68a7af4d3 fix(agent): slice-3 follow-ups — keep run-status on config fail, selftest usage, contract golden (v0.3.1)
- collect: a per-guest GuestConfig failure preserves the ListLXC run-status (only
  spec dropped); empty status normalized to "unknown". Test asserts preserved
  "running" + nil spec.
- main: --selftest usage error now reads (want read|task|hub).
- contract: testdata/host-report.golden.json + TestHostReport_ContractMatchesGolden
  (field-name key-set check vs golden; byte-identical with the hub copy).
- version 0.3.0 -> 0.3.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 18:29:05 +02:00
admin ab77fa3544 feat(hub): host-report client + collector + first daemon loop (slice 3, v0.3.0)
internal/hub: the agent's first daemon — a periodic read-only host-report POSTed to
the hub (the heartbeat; no separate ping).

- HostReport wire contract (shared field-for-field with the hub ingest): host
  metrics, guests (vmid + spec), cloudflared status; storage/backups/restore-tests/
  pbs/audit collections DEFINED but emitted empty (slices 5/6 fill).
- Collector over a read-only proxmoxReader (adapted to the real proxmox surface;
  no proxmox changes) + a CloudflaredProber. Partial-failure: NodeStatus fail = hard
  (skip POST); per-guest GuestConfig fail = status "unknown", still report.
- Client: Bearer-auth POST, standard TLS (system roots / optional ca_file), typed
  TransportError/HTTPError, token never in errors.
- Loop: immediate first report, adopt hub poll_interval (clamp [60,3600]), resilient
  to collect/report errors, clean ctx-cancel shutdown.
- ControlEnvelope: only poll_interval_seconds acted on; blocked/desired_generation/
  has_signed_ops parsed-but-ignored (slice 4).
- config: HubConfig + FELHOM_AGENT_HUB_* overlay + mode-aware HubConfig.Validate +
  WithDefaults + hub-key redaction; example config updated.
- main: no-selftest mode is now the daemon; added --selftest=hub. Version -> 0.3.0.

Tests: report serialization, client (incl. token-redaction), collector partial-
failure, loop continuation+interval adoption, config. internal/proxmox + internal/
authz untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:20:09 +02:00
admin f0fee7e193 feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)
internal/authz: production form of the Phase-4 SSHSIG signing primitive.

- Verifier.New/Verify with the LOCKED pipeline (namespace → allow-list by key
  material → crypto over RAW bytes → target → time → nonce LAST); each post-crypto
  stage rejects even with a valid sig; an invalid sig never burns a nonce.
- SSHSIG framing via x/crypto/ssh (no hand-rolled crypto); key-type-agnostic
  (ed25519 / sk-ssh-ed25519 / rsa / ecdsa via pub.Verify). Fixed namespace
  felhom-op-v1. Typed errors. OpBlob (fixed host_id/guest_id tags) + VerifiedOp.
- NonceStore: MemoryNonceStore + durable crash-safe FileNonceStore (fsync'd append
  log, replay-on-open, compaction, expiry-only pruning; survives restart).
- config.AuthzConfig (nonce path + pinned operational/recovery signer keys).
- Tests (14): real ssh-keygen fixture, per-stage rejection, nonce-not-burned,
  replay, persistence-across-restart, synthetic sk, byte-exactness.

Dep: golang.org/x/crypto v0.52.0 (declares go 1.25 — the Phase-4 doc's "Go 1.24.4 /
x/crypto v0.52.0" pairing doesn't build; build server upgraded to go1.26.0,
backward-compatible). Version 0.1.0 -> 0.2.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:23:02 +02:00
admin a042316d6d feat(agent): scaffold + proxmox interaction layer (slice 1)
Stand up the felhom-agent project (module gitea.dooplex.hu/admin/felhom-agent,
binary felhom-agent) and the internal/proxmox package: the typed library every
other agent module calls to talk to Proxmox.

- API-first Client (hand-rolled REST over net/http, PVEAPIToken auth) with typed
  read ops (version/nodes/status/lxc/config/storage) and async mutating ops
  (restore/vzdump/snapshot/rollback/delete-snapshot/setconfig/start/stop), each
  returning a UPID. WaitTask polls task status until stopped and asserts
  exitstatus OK (authz can surface at task exec, not the POST — phase1-2 §1.3).
- Fenced Privileged (root-CLI) backend for the THREE proven exceptions only
  (keyctl pct create, USB mount/fstab, SMART/sensors); each cites why it can't be
  the API. Fence is structural (Client never shells out, Privileged never HTTPs)
  and asserted in routing_test.go.
- TLS: SHA-256 leaf-cert pinning or CA file; insecure mode explicit + off by
  default. No blanket verification disable.
- 403 -> privilege-named APIError; failed task -> privilege-named TaskError.
- JSON config + env overrides (token never logged); slog logging.
- cmd/felhom-agent --selftest (read-only health report) + gated --selftest=task
  (reversible snapshot/rollback/delete exercise of WaitTask). No daemon loop yet.
- Types grounded in the spike findings and exact JSON shapes captured live from
  demo-felhom (PVE 9.2.2). Unit tests use a mock transport + runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:34:32 +02:00