v0.4.0-rc1: slice 4 Phase A — reconcile engine (structural, runs live unfed)

New internal/reconcile package: the agent-side control core's structural half.

- Per-guest serializer Queue (doc 03 §10): the single choke point all mutation
  sources funnel through; same-vmid serial in submit order, different vmids
  parallel (cond-var FIFO lanes).
- Desired-state model + DesiredProvider seam; EmptyProvider is the only live
  source at slice 4 (no hub serving until slice 10) so the live engine computes
  an empty action set and performs zero mutations.
- Normalization layer (FieldNormalizers): normalized desired-vs-actual so
  Proxmox round-trip quirks don't read as drift. normDesc promoted out of
  main.go to reconcile.NormDescription; selftest uses the shared helper.
- Plan (pure diff): minimal benign action set (Start/Stop/SetConfig) for guests
  in both desired and actual; provision/destroy out of scope here.
- Engine: dispatches onto the shared queue; honors the dual-mode SetConfig
  contract (UPID -> WaitTask; empty UPID -> synchronous success).
- Durable op journal + idempotency store (mirrors authz.FileNonceStore):
  in-flight task ids for crash detection + AlreadyApplied dedupe across restart.
- Wired into runDaemon alongside the hub loop, sharing the queue; runs cleanly
  with no desired state and no signers.

Full module race-clean and vet-clean on the Linux build server.

CHECKPOINT: Phase A only. Awaiting validation before Phase B (the reversibility
gate + signed-op consuming layer, landing v0.4.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:21:55 +02:00
parent 605ce25f58
commit 05c450147c
16 changed files with 1904 additions and 78 deletions
+66 -60
@@ -1,76 +1,82 @@
# REPORT — `SetConfig` selftest extension, live self-gate (2026-06-08)
# REPORT — Slice 4 Phase A: reconcile engine (structural) (2026-06-08)
> Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
> Overwrite-latest report (most recent significant work only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
## Outcome
**`SetConfig` PASSED live under the scoped operator token.** The slice-4 pre-check is
satisfied — `--selftest=task -vmid 9999` now exercises a reversible `SetConfig`
write+revert end-to-end and reached `=== selftest=task OK ===` (exit 0). Reconcile
(slice 4) can be built on `SetConfig` with confidence.
**Phase A of slice 4 is implemented, tested, and pushed as the checkpoint marker
`v0.4.0-rc1`.** This is the structural half of the agent-side control core: the
reconcile engine, the per-guest serializer (doc 03 §10), the desired-state model + its
provider seam, the field-normalization layer, the plan/diff engine, and the durable
operation journal + idempotency store — all adversarially fixture-tested.
## What was implemented
**Per the task, I have STOPPED at the checkpoint and am awaiting the validation pass
before starting Phase B** (the benign/destructive classifier, the reversibility gate,
and the signed-op consuming layer over `internal/authz`). Phase B is the security core
and earns isolated review.
A reversible `SetConfig` step appended to the existing `runSelftestTask` flow
(`cmd/felhom-agent/main.go`, `selftestSetConfig`), keeping the prior
snapshot → rollback → delete-snapshot steps intact. Against guest 9999:
## What runs (and what deliberately doesn't)
1. `GuestConfig` — capture the original `description` (was **absent**).
2. `SetConfig description="felhom-selftest <RFC3339>"` — dual-mode return handled per
the `mutate.go` contract (empty UPID = synchronous; UPID = `WaitTask`+assert OK).
3. `GuestConfig` again — confirm the marker landed.
4. **Restore** — original was absent, so `SetConfig delete=description`; confirm cleared.
The engine **runs live but unfed**. At slice 4 there is no desired-state source (hub
serving is slice 10; provisioning is slice 7), so the only production `DesiredProvider`
is `EmptyProvider` → the live engine reads state, computes an **empty action set**, and
performs **zero mutations** every tick. That is the correct, expected slice-4 behavior;
the first live convergence arrives when slice 10 serves desired state into the seam.
Output matches the existing format:
```
[ ok ] setconfig synchronous exitstatus=OK
[ ok ] verify-write description verified == marker
[ ok ] setconfig-revert synchronous exitstatus=OK
[ ok ] verify-revert description restored to original
```
The wired action set is **benign-on-existing-guest only**: `Start`, `Stop`, `SetConfig`.
Provisioning and the destructive set are out of scope for Phase A (the destructive set
is classified and gated in Phase B but not wired to live execution — nothing serves
destructive deltas yet).
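A minimal sketch of how a benign-only plan could be computed, under simplifying assumptions. `Desired`, `Actual`, `Action`, `norm`, and `plan` are hypothetical stand-ins, not the package's real types or signatures. The sketch only considers guests present in both maps, compares descriptions after trimming trailing newlines, emits config changes before run-state changes, and walks vmids in sorted order:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Desired pins only the fields a source manages; nil means "unmanaged".
// Hypothetical stand-in for the real desired-state model.
type Desired struct {
	Running     *bool
	Description *string
}

// Actual is the observed state of an existing guest (hypothetical shape).
type Actual struct {
	Running     bool
	Description string
}

// Action is one benign mutation: "SetConfig", "Start", or "Stop".
type Action struct {
	VMID  int
	Kind  string
	Value string // new description for SetConfig, empty otherwise
}

// norm strips trailing newlines so a read-back quirk does not read as drift.
func norm(s string) string { return strings.TrimRight(s, "\n") }

// plan returns the benign actions for guests present in both maps.
func plan(desired map[int]Desired, actual map[int]Actual) []Action {
	vmids := make([]int, 0, len(desired))
	for id := range desired {
		if _, ok := actual[id]; ok { // a missing guest would need provisioning: skipped
			vmids = append(vmids, id)
		}
	}
	sort.Ints(vmids) // deterministic vmid order

	var out []Action
	for _, id := range vmids {
		d, a := desired[id], actual[id]
		// Config first, run-state second.
		if d.Description != nil && norm(*d.Description) != norm(a.Description) {
			out = append(out, Action{VMID: id, Kind: "SetConfig", Value: *d.Description})
		}
		if d.Running != nil && *d.Running != a.Running {
			kind := "Stop"
			if *d.Running {
				kind = "Start"
			}
			out = append(out, Action{VMID: id, Kind: kind})
		}
	}
	return out
}

func main() {
	on, desc := true, "web frontend"
	actual := map[int]Actual{9999: {Running: false, Description: "old\n"}}

	// An empty desired set (the EmptyProvider case) yields no actions.
	fmt.Println(len(plan(map[int]Desired{}, actual)))

	desired := map[int]Desired{
		9999: {Running: &on, Description: &desc},
		1234: {Running: &on}, // not in actual: skipped
	}
	for _, a := range plan(desired, actual) {
		fmt.Printf("%d %s %q\n", a.VMID, a.Kind, a.Value)
	}
}
```

Because the loop iterates over desired guests only, a guest that exists but is not desired never produces an action, which keeps destroy out of this path by construction.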
## Key finding — synchronous, not async
## Package `internal/reconcile`
**The LXC `description` write came back synchronous (empty UPID).** PVE applied it
inline with no task object; the agent printed `synchronous exitstatus=OK` on the
empty-string path. This confirms the agent's **dual-mode `SetConfig` modeling matches
Proxmox reality**: for `description`, the empty-UPID branch is the live path, and
treating `""` as success (not an error) is correct. This was the **first live exercise
of the `VM.Config.*` privilege cluster** (previously only the snapshot/rollback/backup
privileges had been run live).
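The dual-mode handling described above can be sketched as follows. This is an illustration only: `TaskWaiter`, `settle`, and `fakeWaiter` are hypothetical names, the real `WaitTask` signature may differ, and `"UPID:example"` is a placeholder rather than a real task id.

```go
package main

import "fmt"

// TaskWaiter is a hypothetical seam over the agent's task polling.
type TaskWaiter interface {
	WaitTask(upid string) (exitStatus string, err error)
}

// settle interprets a SetConfig return value under the dual-mode contract:
// an empty UPID means Proxmox applied the change inline; a non-empty UPID
// names a task that must finish with exit status "OK".
func settle(upid string, w TaskWaiter) (mode string, err error) {
	if upid == "" {
		return "synchronous", nil // success, not an error
	}
	status, err := w.WaitTask(upid)
	if err != nil {
		return "", fmt.Errorf("wait for task %s: %w", upid, err)
	}
	if status != "OK" {
		return "", fmt.Errorf("task %s finished with exit status %q", upid, status)
	}
	return "task", nil
}

// fakeWaiter returns a canned exit status, standing in for a live node.
type fakeWaiter struct{ status string }

func (f fakeWaiter) WaitTask(string) (string, error) { return f.status, nil }

func main() {
	fmt.Println(settle("", fakeWaiter{}))
	fmt.Println(settle("UPID:example", fakeWaiter{status: "OK"}))
	fmt.Println(settle("UPID:example", fakeWaiter{status: "command failed"}))
}
```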
- **`Queue` (per-guest serializer, doc 03 §10)** — the single choke point all mutation
sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order;
independent vmids run in parallel. Each vmid is an unbounded cond-var FIFO lane
(non-blocking, order-preserving submission); `Close` drains pending jobs gracefully.
- **Desired-state model + `DesiredProvider`** — `DesiredGuest` makes each field
individually optional (run-state / `*hub.GuestSpec` / `*description`) so a source pins
only what it manages. `EmptyProvider` (live, slice 4) and `StaticProvider` (fixtures).
- **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized*
desired-vs-actual. `description`'s trailing newline (the slice-4-proven quirk) is the
first registered normalizer; the registry takes more as discovered. `normDesc` was
**promoted** out of `main.go` to `reconcile.NormDescription`, and the `--selftest=task`
round-trip now uses that shared helper — one source of truth.
- **`Plan` (pure diff engine)** — minimal benign action set for guests in both desired
and actual: normalized comparison, deterministic vmid order, config-before-run-state.
Skips provision (slice 7) and destroy (gated, slice 10); never writes a config it
couldn't first read; disk grow deferred.
- **`Engine`** — reads desired+actual, plans, dispatches onto the shared queue. Honors
the mutate.go dual-mode contract: non-empty UPID → `WaitTask`+assert; empty UPID →
clean synchronous success. Per-action failures counted, never fatal.
- **`Journal`** — durable fsync'd JSONL (mirrors `authz.FileNonceStore`): op lifecycle
with the Proxmox task id (crash mid-op detected + re-checkable via `InFlight()`), plus
an idempotency-key store so a one-shot op never double-runs across retries/restarts.
Reconcile actions carry no idempotency key (convergent — must re-run on real drift).
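As an illustration of the per-guest serializer described in the list above, here is a minimal sketch of cond-var FIFO lanes keyed by vmid. It is not the package's implementation: the `Queue`, `lane`, `Submit`, and `Close` shapes are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// lane is one vmid's unbounded FIFO, guarded by its own mutex and cond var.
type lane struct {
	mu     sync.Mutex
	cond   *sync.Cond
	jobs   []func()
	closed bool
}

// Queue runs same-vmid jobs serially in submit order and different vmids in parallel.
type Queue struct {
	mu     sync.Mutex
	lanes  map[int]*lane
	wg     sync.WaitGroup
	closed bool
}

func NewQueue() *Queue { return &Queue{lanes: map[int]*lane{}} }

// Submit enqueues without blocking on job execution; it reports false after Close.
func (q *Queue) Submit(vmid int, job func()) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed {
		return false
	}
	l, ok := q.lanes[vmid]
	if !ok {
		l = &lane{}
		l.cond = sync.NewCond(&l.mu)
		q.lanes[vmid] = l
		q.wg.Add(1)
		go q.run(l) // one worker per lane
	}
	l.mu.Lock()
	l.jobs = append(l.jobs, job)
	l.mu.Unlock()
	l.cond.Signal()
	return true
}

// run executes a lane's jobs one at a time until the lane is closed and drained.
func (q *Queue) run(l *lane) {
	defer q.wg.Done()
	for {
		l.mu.Lock()
		for len(l.jobs) == 0 && !l.closed {
			l.cond.Wait()
		}
		if len(l.jobs) == 0 { // closed and nothing left
			l.mu.Unlock()
			return
		}
		job := l.jobs[0]
		l.jobs = l.jobs[1:]
		l.mu.Unlock()
		job() // run outside the lock so Submit never waits on a job
	}
}

// Close rejects new work, lets every lane drain, then returns.
func (q *Queue) Close() {
	q.mu.Lock()
	q.closed = true
	for _, l := range q.lanes {
		l.mu.Lock()
		l.closed = true
		l.mu.Unlock()
		l.cond.Broadcast()
	}
	q.mu.Unlock()
	q.wg.Wait()
}

func main() {
	q := NewQueue()
	var mu sync.Mutex
	order := map[int][]int{}
	for i := 0; i < 3; i++ {
		for _, vmid := range []int{9001, 9999} {
			i, vmid := i, vmid
			q.Submit(vmid, func() {
				mu.Lock()
				order[vmid] = append(order[vmid], i)
				mu.Unlock()
			})
		}
	}
	q.Close()
	fmt.Println(order[9001], order[9999])
}
```

In this sketch a lane's worker goroutine lives until `Close`; a production queue might instead retire idle lanes.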
## Second finding — `description` trailing-newline normalization
## Daemon wiring
PVE **appends a trailing `\n` to `description` on read** (stored URL-encoded as
`%0A...`). The first live run surfaced this as a (false) verify mismatch:
`got="...Z\n"` vs `want="...Z"`. The write had genuinely landed — only my exact-match
check was too strict. Fixed with `normDesc` (strip trailing newline) at every
comparison point, and the run went green. **This is load-bearing intel for slice 4:**
a reconcile that compares desired vs actual `description` verbatim will detect
perpetual drift; it must normalize the trailing newline.
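A minimal sketch of the normalized comparison, assuming a simple field-to-function registry. `normDescription`, `fieldNormalizers`, and `drifted` are illustrative names rather than the actual helpers, and the marker string is a made-up example value:

```go
package main

import (
	"fmt"
	"strings"
)

// normDescription strips trailing newlines, the quirk PVE adds on read.
func normDescription(s string) string { return strings.TrimRight(s, "\n") }

// fieldNormalizers maps a config field to its normalizer (hypothetical shape).
// Fields without an entry are compared verbatim.
var fieldNormalizers = map[string]func(string) string{
	"description": normDescription,
}

// drifted reports whether desired and actual differ after normalization.
func drifted(field, desired, actual string) bool {
	if n, ok := fieldNormalizers[field]; ok {
		desired, actual = n(desired), n(actual)
	}
	return desired != actual
}

func main() {
	want := "felhom-selftest example-marker"
	got := want + "\n" // what a read-back would return

	fmt.Println(want == got)                       // verbatim comparison sees a mismatch
	fmt.Println(drifted("description", want, got)) // normalized comparison does not
}
```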
`runDaemon` now runs reconcile alongside the hub loop on the poll cadence, sharing the
per-guest queue. The journal lives at a `journal.log` sibling of the nonce store. The
daemon runs cleanly with **no desired state and no signers** — reconcile is a logged
no-op; a journal-open failure degrades to journal-less rather than crashing.
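To make the journal's role concrete, here is a hedged sketch of an fsync'd JSONL journal that is replayed on open. The record layout, the `Open`/`Begin`/`Done` signatures, and the `demo` helper are assumptions for illustration; only the ideas (durable append, in-flight task ids, idempotency keys that survive a restart) come from the report.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// entry is one JSONL record. Field names are illustrative assumptions.
type entry struct {
	Op     string `json:"op"`
	State  string `json:"state"` // "begin" or "done"
	TaskID string `json:"task_id,omitempty"`
	Key    string `json:"key,omitempty"` // idempotency key, empty if none
}

// Journal replays its file on open, then appends with an fsync per record.
// Not safe for concurrent use; a real store would add locking.
type Journal struct {
	f        *os.File
	inFlight map[string]string // op -> task id, begun but not done
	applied  map[string]bool   // idempotency keys of completed ops
}

func Open(path string) (*Journal, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	j := &Journal{f: f, inFlight: map[string]string{}, applied: map[string]bool{}}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e entry
		if json.Unmarshal(sc.Bytes(), &e) == nil { // skip lines that fail to parse
			j.apply(e)
		}
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, err
	}
	return j, nil
}

func (j *Journal) apply(e entry) {
	switch e.State {
	case "begin":
		j.inFlight[e.Op] = e.TaskID
	case "done":
		delete(j.inFlight, e.Op)
		if e.Key != "" {
			j.applied[e.Key] = true
		}
	}
}

// record appends one line and syncs it before updating in-memory state.
func (j *Journal) record(e entry) error {
	b, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err = j.f.Write(append(b, '\n')); err == nil {
		err = j.f.Sync()
	}
	if err != nil {
		return err
	}
	j.apply(e)
	return nil
}

func (j *Journal) Begin(op, taskID string) error { return j.record(entry{Op: op, State: "begin", TaskID: taskID}) }
func (j *Journal) Done(op, key string) error     { return j.record(entry{Op: op, State: "done", Key: key}) }
func (j *Journal) AlreadyApplied(key string) bool { return j.applied[key] }
func (j *Journal) InFlight() map[string]string    { return j.inFlight }

// demo leaves one op unfinished, "restarts" by reopening, and reports what survived.
func demo() (int, bool, error) {
	tmp, err := os.CreateTemp("", "journal-*.log")
	if err != nil {
		return 0, false, err
	}
	tmp.Close()
	defer os.Remove(tmp.Name())

	j, err := Open(tmp.Name())
	if err != nil {
		return 0, false, err
	}
	for _, e := range []error{
		j.Begin("op-1", "UPID:example"), // never finished: simulated crash
		j.Begin("op-2", ""),
		j.Done("op-2", "key-2"),
		j.f.Close(),
	} {
		if e != nil {
			return 0, false, e
		}
	}
	j2, err := Open(tmp.Name()) // state is rebuilt from disk
	if err != nil {
		return 0, false, err
	}
	defer j2.f.Close()
	return len(j2.InFlight()), j2.AlreadyApplied("key-2"), nil
}

func main() {
	fmt.Println(demo())
}
```

A caller that wants the journal-less degradation described above could treat an `Open` error as non-fatal and continue with a nil journal.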
## Live run environment
## Verification
- Built **v0.3.2** on the build server (192.168.0.180, go1.26), pointed at
`demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.2).
- Pinned leaf-cert SHA-256 fingerprint re-verified — still
`BA:7C:99:7D:45:D0…` (matches the agent's pin).
- `--selftest=read` clean first (PVE 9.2.2, node online, guests 9001+9999 visible,
storages listed), then the gated `--selftest=task -vmid 9999`.
- Task UPIDs name the token actor (`…:vzsnapshot:9999:felhom-agent@pve!agent:` etc.) —
privsep token path genuinely exercised, no privilege drift.
- Full module **race-clean** (`go test -race -count=1 ./...`) and `go vet` clean on the
Linux build server (go1.26); all unit tests pass both locally and on the build server.
- Adversarial fixture coverage: serializer concurrency/ordering, normalization +
extensibility seam, the full plan matrix (drift / no-false-drift / unmanaged /
spec-unknown / scope skips / ordering / empty-desired), engine sync-vs-async +
failure counting, and journal persistence + idempotency dedupe **across a simulated
restart**.
- No live Proxmox needed (the engine is unfed); the live exercise is deferred — there is
nothing to converge until a desired-state source exists.
## Post-state
## Next (after validation)
Guest **9999** left pristine: **stopped**, `description` **absent**, only `current`
remains (no leftover `felhom-selftest` snapshot).
## Credentials
The standing operator token (`felhom-agent@pve!agent`, privsep) was **rotated** during
this run — the prior secret was not retrievable (PVE reveals a token secret only once
at creation), so a fresh secret was minted via `root@felhom-pve` and the `FelhomAgent`
role re-confirmed on **both** the user and the token ACL at `/` (privsep intersection
gotcha). The agent consumed the new secret **through `FELHOM_AGENT_PROXMOX_TOKEN`; it was
not persisted to the repo**, and the on-disk demo config carries only a placeholder. The
new secret is **stored out-of-band**.
Phase B: the classifier (benign vs destructive by provenance + data-bearing-ness, not by
verb), the reversibility gate in front of the queue's executor, and the signed-op
consuming layer over `internal/authz` with role-scoping + op-to-action binding + the
adversarial rejection matrix — landing **v0.4.0**. I will not start it until the Phase-A
validation passes.