v0.4.0-rc1: slice 4 Phase A — reconcile engine (structural, runs live unfed)

New internal/reconcile package: the agent-side control core's structural half.

- Per-guest serializer Queue (doc 03 §10): the single choke point all mutation
  sources funnel through; same-vmid serial in submit order, different vmids
  parallel (cond-var FIFO lanes).
- Desired-state model + DesiredProvider seam; EmptyProvider is the only live
  source at slice 4 (no hub serving until slice 10), so the live engine computes
  an empty action set and performs zero mutations.
- Normalization layer (FieldNormalizers): compares normalized desired vs actual
  so Proxmox round-trip quirks don't read as drift. normDesc promoted out of
  main.go to reconcile.NormDescription; the selftest uses the shared helper.
- Plan (pure diff): minimal benign action set (Start/Stop/SetConfig) for guests
  in both desired and actual; provision/destroy out of scope here.
- Engine: dispatches onto the shared queue; honors the dual-mode SetConfig
  contract (UPID -> WaitTask; empty UPID -> synchronous success).
- Durable op journal + idempotency store (mirrors authz.FileNonceStore):
  in-flight task ids for crash detection + AlreadyApplied dedupe across restart.
- Wired into runDaemon alongside the hub loop, sharing the queue; runs cleanly
  with no desired state and no signers.

Full module race-clean and vet-clean on the Linux build server.

CHECKPOINT: Phase A only. Awaiting validation before Phase B (the reversibility
gate + signed-op consuming layer, landing v0.4.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:21:55 +02:00
parent 605ce25f58
commit 05c450147c
16 changed files with 1904 additions and 78 deletions
@@ -3,6 +3,75 @@
All notable changes to **felhom-agent** are recorded here. Update on every code
change that gets pushed.
## v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08)
The agent-side control core's structural half. **Checkpoint marker:** `-rc1` is the
Phase-A push; awaiting validation before Phase B (the reversibility gate + signed-op
consuming layer) lands the final **v0.4.0**. Runs LIVE but UNFED: with no desired-state
source until slice 10 (only `EmptyProvider` is live), the live engine computes an empty
action set and performs **zero mutations**.
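Why the unfed engine performs zero mutations follows from the plan step being a pure diff over the desired set. The Go sketch below is a minimal, hypothetical model of that diff: `DesiredGuest`, `ActualGuest`, `Action`, and `normDescription` (assumed to trim the trailing newline, standing in for `reconcile.NormDescription`) are illustrative names and shapes, not the package's API. It shows normalized comparison, deterministic vmid order, config before run-state, and the `SpecKnown` guard.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// DesiredGuest is a hypothetical, trimmed-down desired spec: nil means
// "not managed", so an unset field never produces an action.
type DesiredGuest struct {
	Running     *bool
	Cores       *int
	Description *string
}

// ActualGuest is what was read back from Proxmox (hypothetical shape).
type ActualGuest struct {
	Running     bool
	Cores       int
	Description string
	SpecKnown   bool // false when the config could not be read
}

type Action struct {
	VMID int
	Kind string // "SetConfig", "Start" or "Stop"
}

// normDescription stands in for reconcile.NormDescription; assumed here to
// drop trailing newlines added by a Proxmox round-trip.
func normDescription(s string) string { return strings.TrimRight(s, "\n") }

// Plan is a pure diff: no I/O, and the same inputs always yield the same actions.
func Plan(desired map[int]DesiredGuest, actual map[int]ActualGuest) []Action {
	vmids := make([]int, 0, len(desired))
	for id := range desired {
		vmids = append(vmids, id)
	}
	sort.Ints(vmids) // deterministic vmid order despite random map iteration

	var actions []Action
	// Only desired vmids are visited, so actual-only guests (destroy) are skipped.
	for _, id := range vmids {
		d := desired[id]
		a, ok := actual[id]
		if !ok {
			continue // desired but absent in actual: provisioning is out of scope
		}
		// Config first, and only if it was read (never write blind).
		if a.SpecKnown {
			coresDrift := d.Cores != nil && *d.Cores != a.Cores
			descDrift := d.Description != nil &&
				normDescription(*d.Description) != normDescription(a.Description)
			if coresDrift || descDrift {
				actions = append(actions, Action{id, "SetConfig"})
			}
		}
		if d.Running != nil && *d.Running != a.Running {
			kind := "Stop"
			if *d.Running {
				kind = "Start"
			}
			actions = append(actions, Action{id, kind})
		}
	}
	return actions
}

func main() {
	actual := map[int]ActualGuest{
		101: {Running: false, Cores: 2, Description: "web\n", SpecKnown: true},
	}
	fmt.Println(len(Plan(nil, actual))) // 0: unfed, nothing to do

	on, cores, desc := true, 4, "web"
	desired := map[int]DesiredGuest{101: {Running: &on, Cores: &cores, Description: &desc}}
	fmt.Println(Plan(desired, actual)) // [{101 SetConfig} {101 Start}]
}
```

The description differs only by the trailing newline, so it does not count as drift; the cores change and the run-state change do, in that order.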
### Added
- **`internal/reconcile`** package — the engine, the per-guest serializer, the
desired-state model, the normalization layer, and the durable op journal:
- **Per-guest serializer (`Queue`, doc 03 §10)** — the single choke point ALL
mutation sources funnel through. Same-vmid jobs run strictly one-at-a-time in
submit order; independent vmids run in parallel. Each vmid is a cond-var FIFO lane
(unbounded, non-blocking, order-preserving); graceful drain on `Close`.
- **Desired-state model + `DesiredProvider` seam** — `DesiredGuest` (per-field
optional: run-state / `*hub.GuestSpec` / `*description`), `DesiredState`. The only
live provider is **`EmptyProvider`** (slice 4 has no source); `StaticProvider`
feeds fixtures. The seam is where slice 10's hub-serving plugs in — no hub/local
source invented here.
- **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized*
desired-vs-actual so Proxmox round-trip quirks don't read as drift. `description`'s
trailing newline is the first registered case; the registry accepts more cases (boolean
coercion, list ordering) as they are discovered. `normDesc` **promoted** out of
`cmd/felhom-agent/main.go` to **`reconcile.NormDescription`**; the `--selftest=task`
description round-trip now uses that shared helper (one source of truth for the quirk).
- **Plan engine (`Plan`, pure function)** — computes the minimal **benign** action set
(`Start`/`Stop`/`SetConfig`) for guests present in both desired and actual, with
normalized comparison, deterministic vmid ordering, config-before-run-state. Skips
provision (desired-absent-in-actual, slice 7) and destroy (actual-absent-in-desired,
gated, slice 10); never writes a config it couldn't first read (`SpecKnown`). Disk
(rootfs grow) is intentionally not reconciled here.
- **Reconcile engine (`Engine`)** — reads desired+actual, plans, dispatches each action
onto the shared queue. Every Proxmox op is handled per the `mutate.go` contract: non-empty
UPID → `WaitTask` + assert `exitstatus`; empty UPID → clean **synchronous** success
(slice-4 proven). Per-action failures are counted, not fatal (other guests still
converge).
- **Operation journal (`Journal`)** — durable fsync'd append-only JSONL mirroring
`authz.FileNonceStore`: records each op's lifecycle (started → task_running →
succeeded/failed) with its Proxmox task id (crash mid-op is detected and re-checkable
on restart via `InFlight()`), plus an **idempotency-key store** (`AlreadyApplied`) so
a one-shot op never re-runs across retries/restarts. Reconcile actions carry no
idempotency key (convergent — must re-run on real drift).
- **Daemon wiring (`runDaemon`)** — reconcile runs alongside the hub loop on the poll
cadence, **sharing the per-guest queue**. Journal path is a `journal.log` sibling of the
nonce store. The daemon runs cleanly with **no desired state and no signers** (reconcile
is a logged live no-op; a journal-open failure degrades to journal-less, never crashes).
### Tests
- Serializer: same-guest serialized (max-concurrency 1, submit order preserved) and
different-guests parallel (cross-waiting jobs both complete — would deadlock if not);
error propagation; drain-pending-on-close; submit-after-close.
- Normalization: description round-trip; unknown-field identity; extensibility seam
(synthetic boolean-coercion + list-ordering normalizers).
- Plan: run-state start/stop, spec drift (cores/memory), disk-not-reconciled,
description-newline-not-drift, unmanaged fields, spec-unknown skips config but keeps
run-state, desired-absent skipped, combined ordering, empty-desired no-op, deterministic
vmid order.
- Engine: empty-provider zero mutations; async start (WaitTask); synchronous SetConfig
(no WaitTask); WaitTask failure and POST error each counted as failed; a list error fails the pass.
- Journal: lifecycle latest-wins; in-flight survives restart; idempotency dedupe across
restart; failed key not applied; torn-trailing-line skipped.
- Full module **race-clean** (`go test -race`) on the Linux build server; vet clean.
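The journal behaviours exercised above (latest record wins, a torn trailing line is skipped, an unfinished op stays in flight across restart, only a succeeded key counts as applied) can be sketched as a replay over JSONL. The record fields, the state strings, and the `replay`/`view` names are assumptions for illustration; the durable append side (write, then fsync) is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// entry is a hypothetical journal record; the real field names may differ.
type entry struct {
	Op    string `json:"op"`
	State string `json:"state"` // started, task_running, succeeded or failed
	UPID  string `json:"upid,omitempty"`
	Key   string `json:"key,omitempty"` // idempotency key; empty for reconcile actions
}

// view is the state rebuilt on startup: the latest record per op.
type view struct{ latest map[string]entry }

// replay rebuilds state from JSONL, the latest record per op winning. A
// malformed final line (no newline) is a write torn by a crash and is
// skipped; a malformed line anywhere else is reported as corruption.
func replay(data string) (*view, error) {
	v := &view{latest: make(map[string]entry)}
	lines := strings.Split(data, "\n")
	for i, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue
		}
		var e entry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			if i == len(lines)-1 {
				break // torn tail
			}
			return nil, fmt.Errorf("journal line %d: %w", i+1, err)
		}
		if e.Key == "" {
			e.Key = v.latest[e.Op].Key // keep the key recorded at start
		}
		v.latest[e.Op] = e
	}
	return v, nil
}

// InFlight lists ops that never reached a final state, so their Proxmox
// task can be re-checked after a restart.
func (v *view) InFlight() []entry {
	var out []entry
	for _, e := range v.latest {
		if e.State == "started" || e.State == "task_running" {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Op < out[j].Op })
	return out
}

// AlreadyApplied reports whether an op with this key succeeded. Keyless
// (convergent) ops never dedupe, and a failed key is not applied.
func (v *view) AlreadyApplied(key string) bool {
	if key == "" {
		return false
	}
	for _, e := range v.latest {
		if e.Key == key && e.State == "succeeded" {
			return true
		}
	}
	return false
}

func main() {
	const sample = `{"op":"a","state":"started","key":"k1"}
{"op":"a","state":"succeeded"}
{"op":"b","state":"task_running","upid":"UPID:fake"}
{"op":"c","state":"started","key":"k2"}
{"op":"c","state":"failed"}
{"op":"d","sta`
	v, err := replay(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(v.InFlight()), v.InFlight()[0].UPID)         // 1 UPID:fake
	fmt.Println(v.AlreadyApplied("k1"), v.AlreadyApplied("k2")) // true false
}
```

Carrying the key forward from an op's first record lets later lifecycle lines stay small while `AlreadyApplied` still sees the key.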
### Not in this phase (Phase B)
- The benign/destructive classifier, the reversibility gate, and the signed-op consuming
layer over `internal/authz` (doc 03 §4 / doc 04) — added next, in front of the queue's
executor, landing **v0.4.0**.
## v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)
The gate before slice 4: prove `SetConfig` works live under the scoped token before