v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-
  ness, NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:56:20 +02:00
parent 05c450147c
commit 1af21a6cac
18 changed files with 1640 additions and 80 deletions
+77 -64
@@ -1,82 +1,95 @@
# REPORT — Slice 4 Phase A: reconcile engine (structural) (2026-06-08)
# REPORT — Slice 4: reconcile engine + the reversibility gate (v0.4.0) (2026-06-08)
> Overwrite-latest report (most recent significant work only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
## Outcome
**Phase A of slice 4 is implemented, tested, and pushed as the checkpoint marker
`v0.4.0-rc1`.** This is the structural half of the agent-side control core: the
reconcile engine, the per-guest serializer (doc 03 §10), the desired-state model + its
provider seam, the field-normalization layer, the plan/diff engine, and the durable
operation journal + idempotency store — all adversarially fixture-tested.
**Slice 4 is complete and pushed as `v0.4.0`.** Both phases landed:
**Per the task, I have STOPPED at the checkpoint and am awaiting the validation pass
before starting Phase B** (the benign/destructive classifier, the reversibility gate,
and the signed-op consuming layer over `internal/authz`). Phase B is the security core
and earns isolated review.
- **Phase A** (structural, pushed earlier as `v0.4.0-rc1`): the reconcile engine, the
per-guest serializer (doc 03 §10), the desired-state model + provider seam, the
field-normalization layer, the plan/diff engine, and the durable op journal +
idempotency store. Runs **live but unfed**: `EmptyProvider` → zero mutations until
slice 10 serves desired state.
- **Phase B** (this push, the security core): the benign/destructive **classifier**,
the **reversibility gate**, and the **signed-op consuming layer** over `internal/authz`
— with role-scoping, op-to-action binding, idempotency/journaling, audit, and the
crash-recovery consumer. The gate sits in front of the per-guest queue's executor, so
**every mutation passes it**.
## What runs (and what deliberately doesn't)
The whole module is **race-clean and vet-clean** on the Linux build server; 62 reconcile
tests pass (the adversarial matrix runs against the real `authz.Verifier`).
The engine **runs live but unfed**. At slice 4 there is no desired-state source (hub
serving is slice 10; provisioning is slice 7), so the only production `DesiredProvider`
is `EmptyProvider` → the live engine reads state, computes an **empty action set**, and
performs **zero mutations** every tick. That is the correct, expected slice-4 behavior;
the first live convergence arrives when slice 10 serves desired state into the seam.
## The security model (Phase B)
The wired action set is **benign-on-existing-guest only**: `Start`, `Stop`, `SetConfig`.
Provisioning and the destructive set are out of scope for Phase A (the destructive set
is classified and gated in Phase B but not wired to live execution — nothing serves
destructive deltas yet).
Hub-supplied intent is no longer trusted for destructive change — **by provenance +
data-bearing-ness, not by verb** (doc 03 §4):
## Package `internal/reconcile`
- **Benign** (unsigned): start/stop/restart/create, and destroying a resource the agent
created in the **same journaled transaction** (compensating rollback) or **tagged
scratch**. That scratch/same-txn provenance is **agent-internal, journal-recorded, and
never accepted from the hub** — a compromised hub cannot relabel a data-bearing guest
as scratch to walk the gate.
- **Destructive** (signature required): destroy/overwrite of the only/primary copy of
customer data — **regardless of whether it arrives as a job or a desired-state delta**.
Absent/invalid signature → refused **`pending_signature`**, never executed.
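
The classification rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `internal/reconcile`: the names `Class`, `Provenance`, `Op`, `Classify`, and `Gate`, and the op-kind strings, are hypothetical. The point it tries to show is that provenance comes from the agent's own journal lookup, that a hub-supplied scratch claim is carried but never consulted, and that an unknown op class falls through to destructive.

```go
package main

import "fmt"

// Class is the classifier's verdict on an operation.
type Class int

const (
	Benign Class = iota
	Destructive
)

// Provenance is what the agent's own journal says about the target.
// It is an input looked up locally, never a field of the hub request.
type Provenance int

const (
	ProvenanceNone    Provenance = iota // nothing agent-internal on record
	ProvenanceSameTxn                   // created in the same journaled transaction
	ProvenanceScratch                   // tagged scratch by the agent itself
)

// Op is a hypothetical operation description (field names are assumptions).
type Op struct {
	Kind       string // "start", "stop", "restart", "create", "destroy", "overwrite"
	HubScratch bool   // hub-supplied scratch claim: carried, deliberately ignored
}

// Classify decides benign vs destructive. Destroy/overwrite is benign only
// when journal-recorded provenance says the agent created the target itself.
func Classify(op Op, journaled Provenance) Class {
	switch op.Kind {
	case "start", "stop", "restart", "create":
		return Benign
	case "destroy", "overwrite":
		if journaled == ProvenanceSameTxn || journaled == ProvenanceScratch {
			return Benign
		}
		return Destructive
	default:
		// Unknown op class fails safe.
		return Destructive
	}
}

// Gate maps a class plus a signature check result to a decision string.
func Gate(c Class, signatureVerified bool) string {
	if c == Benign || signatureVerified {
		return "allowed"
	}
	return "pending_signature"
}

func main() {
	// A hub that labels a data-bearing guest as scratch gains nothing.
	op := Op{Kind: "destroy", HubScratch: true}
	c := Classify(op, ProvenanceNone)
	fmt.Println(c == Destructive, Gate(c, false))
}
```

`Classify` never reads `op.HubScratch`; the only way to reach the benign branch for a destroy is through the `journaled` argument, which in this sketch the caller derives from local state.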
- **`Queue` (per-guest serializer, doc 03 §10)** — the single choke point all mutation
sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order;
independent vmids run in parallel. Each vmid is an unbounded cond-var FIFO lane
(non-blocking, order-preserving submission); `Close` drains pending jobs gracefully.
- **Desired-state model + `DesiredProvider`** — `DesiredGuest` makes each field
individually optional (run-state / `*hub.GuestSpec` / `*description`) so a source pins
only what it manages. `EmptyProvider` (live, slice 4) and `StaticProvider` (fixtures).
- **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized*
desired-vs-actual. `description`'s trailing newline (the slice-4-proven quirk) is the
first registered normalizer; the registry takes more as discovered. `normDesc` was
**promoted** out of `main.go` to `reconcile.NormDescription`, and the `--selftest=task`
round-trip now uses that shared helper — one source of truth.
- **`Plan` (pure diff engine)** — minimal benign action set for guests in both desired
and actual: normalized comparison, deterministic vmid order, config-before-run-state.
Skips provision (slice 7) and destroy (gated, slice 10); never writes a config it
couldn't first read; disk grow deferred.
- **`Engine`** — reads desired+actual, plans, dispatches onto the shared queue. Honors
the mutate.go dual-mode contract: non-empty UPID → `WaitTask`+assert; empty UPID →
clean synchronous success. Per-action failures counted, never fatal.
- **`Journal`** — durable fsync'd JSONL (mirrors `authz.FileNonceStore`): op lifecycle
with the Proxmox task id (crash mid-op detected + re-checkable via `InFlight()`), plus
an idempotency-key store so a one-shot op never double-runs across retries/restarts.
Reconcile actions carry no idempotency key (convergent — must re-run on real drift).
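
As an illustration of the per-guest serializer described in the `Queue` bullet, here is a small sketch of one cond-var FIFO lane per vmid. It is an assumption-laden approximation, not the actual `Queue`: the type and method names (`lane`, `NewQueue`, `Submit`) and the `bool` return from `Submit` are hypothetical. It aims to show the three stated properties: same-vmid jobs run one at a time in submit order, different vmids run in parallel, and `Close` drains pending jobs.

```go
package main

import (
	"fmt"
	"sync"
)

// lane is one unbounded FIFO served by a single worker goroutine.
type lane struct {
	mu     sync.Mutex
	cond   *sync.Cond
	jobs   []func()
	closed bool
}

// run executes jobs in order until the lane is closed and drained.
func (l *lane) run() {
	for {
		l.mu.Lock()
		for len(l.jobs) == 0 && !l.closed {
			l.cond.Wait()
		}
		if len(l.jobs) == 0 { // closed and nothing left to drain
			l.mu.Unlock()
			return
		}
		job := l.jobs[0]
		l.jobs = l.jobs[1:]
		l.mu.Unlock()
		job() // run outside the lock so Submit never waits on a job
	}
}

// Queue routes each vmid to its own lane.
type Queue struct {
	mu     sync.Mutex
	lanes  map[int]*lane
	closed bool
	wg     sync.WaitGroup
}

func NewQueue() *Queue { return &Queue{lanes: make(map[int]*lane)} }

// Submit enqueues without blocking on execution; false means the queue is closed.
func (q *Queue) Submit(vmid int, job func()) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed {
		return false
	}
	l, ok := q.lanes[vmid]
	if !ok {
		l = &lane{}
		l.cond = sync.NewCond(&l.mu)
		q.lanes[vmid] = l
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			l.run()
		}()
	}
	l.mu.Lock()
	l.jobs = append(l.jobs, job)
	l.mu.Unlock()
	l.cond.Signal()
	return true
}

// Close stops accepting jobs and waits for every lane to drain.
func (q *Queue) Close() {
	q.mu.Lock()
	q.closed = true
	for _, l := range q.lanes {
		l.mu.Lock()
		l.closed = true
		l.mu.Unlock()
		l.cond.Signal()
	}
	q.mu.Unlock()
	q.wg.Wait()
}

func main() {
	q := NewQueue()
	var order []int // touched only by the vmid-101 worker
	for i := 0; i < 5; i++ {
		q.Submit(101, func() { order = append(order, i) })
	}
	q.Submit(202, func() {}) // independent lane, may run concurrently
	q.Close()
	fmt.Println(order)
}
```

One limitation of this sketch: a job must not call `Close` on the queue it runs in, since `Close` waits for that job's own worker to finish.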
The signed-op consuming layer calls `authz.Verifier.Verify` (the locked
namespace→allow-list→crypto→target→time→nonce pipeline, untouched) and then enforces
the slice-4 policy on the `VerifiedOp`: **role-scoping** (recovery key = key-rotation
only; operational key = ordinary destructive + planned rotation, doc 04 §4) and
**op-to-action binding** (the verified op + host + guest + params must name the exact
gated action). Idempotency keys the journal by the op nonce; every decision is audited
(a signal, never the guard).
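
The two policy checks applied after verification can be sketched like this. It is a hedged illustration only: `VerifiedOp` here is a stand-in struct with assumed fields (the real `authz.VerifiedOp` is not shown in this report), the op names `rotate_key` and `destroy_guest` are invented, and the order of the two checks is an assumption. The error strings reuse the `role_denied` and `binding_mismatch` outcomes named in the adversarial matrix.

```go
package main

import (
	"errors"
	"fmt"
	"maps"
)

type Role string

const (
	RoleRecovery    Role = "recovery"
	RoleOperational Role = "operational"
)

// VerifiedOp stands in for the verifier's output (fields are assumptions).
type VerifiedOp struct {
	Role   Role
	Op     string
	Host   string
	Guest  int
	Params map[string]string
}

// Action is the concrete mutation the gate is about to let through.
type Action struct {
	Op     string
	Host   string
	Guest  int
	Params map[string]string
}

var (
	ErrRoleDenied      = errors.New("role_denied")
	ErrBindingMismatch = errors.New("binding_mismatch")
)

// roleAllows encodes: recovery key = key rotation only; operational key =
// ordinary destructive ops plus planned rotation. Unknown roles get nothing.
func roleAllows(r Role, op string) bool {
	switch r {
	case RoleRecovery:
		return op == "rotate_key"
	case RoleOperational:
		return true
	default:
		return false
	}
}

// Authorize runs after signature verification has already succeeded.
func Authorize(v VerifiedOp, a Action) error {
	if !roleAllows(v.Role, v.Op) {
		return ErrRoleDenied
	}
	// The signed op must name exactly the gated action.
	if v.Op != a.Op || v.Host != a.Host || v.Guest != a.Guest ||
		!maps.Equal(v.Params, a.Params) {
		return ErrBindingMismatch
	}
	return nil
}

func main() {
	action := Action{Op: "destroy_guest", Host: "pve1", Guest: 101}
	signed := VerifiedOp{Role: RoleRecovery, Op: "destroy_guest", Host: "pve1", Guest: 101}
	fmt.Println(Authorize(signed, action)) // recovery key on an ordinary destructive op
}
```

Binding on the full tuple rather than on the op name alone is what keeps a signature for one guest from being presented against another.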
## Daemon wiring
## Inert by design (slice-4 scope)
`runDaemon` now runs reconcile alongside the hub loop on the poll cadence, sharing the
per-guest queue. The journal lives at a `journal.log` sibling of the nonce store. The
daemon runs cleanly with **no desired state and no signers** — reconcile is a logged
no-op; a journal-open failure degrades to journal-less rather than crashing.
There is **no live destructive execution** this slice: nothing serves destructive deltas
until slice 10, and the guest-destroy/storage-wipe/restore-overwrite executors land in
slices 6/7. So the destructive path is fully **classified, gated, and adversarially tested**,
but `RunSignedJob`'s executor is nil in production — an authorized destructive op is
journaled as authorized-but-not-executed. Reconcile itself only produces the benign
Start/Stop/SetConfig set, all allowed through the gate unsigned.
## Adversarial proof (each case independently rejected)
Run against the **real** `authz.Verifier` with in-test-minted SSHSIGs (the ~40-line
framing is replicated in reconcile's test binary — production `authz` is untouched and
gains no signing capability; live minting is required because the verifier's clock is
not cross-package injectable):
unsigned destructive **job** → pending_signature · unsigned destructive **desired-state
delta** → pending_signature (distrusts hub desired state, not just jobs) · forged /
unknown signer → `ErrUnknownSigner` · expired → `ErrExpired` · **replayed nonce across an
agent restart** (durable `FileNonceStore`) → `ErrReplay` · wrong host → `ErrTarget` ·
wrong guest / wrong op / wrong params → binding_mismatch · **recovery key on ordinary
destructive** → role_denied · **hub-supplied "scratch" tag** on a data-bearing guest →
ignored, still destructive → refused · **valid + correct role + correct target + fresh
nonce → accepted**, and a second presentation → `ErrReplay`.
## The two forward-looking notes
- **Note 1 (carried in)** — the `InFlight()` **resume-or-rollback** startup consumer
(`Engine.Recover`) landed **together with** the signed-op executor, as required. An op
that crashed after the Proxmox POST but before its terminal record (`OpTaskRunning`,
nonce already consumed) is not covered by idempotency dedupe — only this consumer
resolves it (re-read the task via the new `TaskStatusOnce`, record the real outcome; a
no-task-id op is abandoned fail-safe). Wired into daemon startup and tested.
- **Note 2 (addressed)** — the memory comparison is canonicalized (`desiredMemoryMiB`):
desired and actual compare in the same MiB unit that is then written, so a
non-MiB-aligned `MemoryBytes` converges in one pass rather than re-issuing SetConfig
every cycle. A test proves convergence. Recommendation stands that slice 10 serve
MiB-aligned specs at the source.
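
Note 2's canonicalization can be illustrated with a short sketch. Only the name `desiredMemoryMiB` comes from this report; its signature, the floor rounding, and the helper `memoryDrift` are assumptions made for the example. The idea shown is that comparing in the unit that is actually written removes the sub-MiB remainder that a raw byte comparison would keep reporting as drift.

```go
package main

import "fmt"

const MiB = 1 << 20

// desiredMemoryMiB canonicalizes a byte count to whole MiB.
// Rounding down is an assumption of this sketch.
func desiredMemoryMiB(memoryBytes int64) int64 {
	return memoryBytes / MiB
}

// memoryDrift compares desired and actual in the same MiB unit
// that a SetConfig would write.
func memoryDrift(desiredBytes, actualMiB int64) bool {
	return desiredMemoryMiB(desiredBytes) != actualMiB
}

func main() {
	desired := int64(2<<30 + 1) // 2 GiB plus one stray byte, not MiB-aligned
	actual := int64(1024)       // guest currently has 1024 MiB

	fmt.Println("drift before:", memoryDrift(desired, actual))

	// One SetConfig writes the canonical value...
	actual = desiredMemoryMiB(desired)

	// ...and the next pass sees no drift.
	fmt.Println("drift after:", memoryDrift(desired, actual))

	// A raw byte comparison would never settle for this input.
	fmt.Println("naive byte drift:", actual*MiB != desired)
}
```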
## Verification
- Full module **race-clean** (`go test -race -count=1 ./...`) and `go vet` clean on the
Linux build server (go1.26); all unit tests green locally and there.
- Adversarial fixture coverage: serializer concurrency/ordering, normalization +
extensibility seam, the full plan matrix (drift / no-false-drift / unmanaged /
spec-unknown / scope skips / ordering / empty-desired), engine sync-vs-async +
failure counting, and journal persistence + idempotency dedupe **across a simulated
restart**.
- No live Proxmox needed (the engine is unfed); the live exercise is deferred — there is
nothing to converge until a desired-state source exists.
- `go test -race -count=1 ./...` and `go vet ./...` clean on the Linux build server
(go1.26); all tests green locally and there.
- No live Proxmox needed — Phase A is unfed and Phase B's destructive path is inert this
slice. The gate's crypto path is proven end-to-end against the real verifier.
## Next (after validation)
## Conventions
Phase B: the classifier (benign vs destructive by provenance + data-bearing-ness, not by
verb), the reversibility gate in front of the queue's executor, and the signed-op
consuming layer over `internal/authz` with role-scoping + op-to-action binding + the
adversarial rejection matrix — landing **v0.4.0**. I will not start it until the Phase-A
validation passes.
Version → **v0.4.0**. CHANGELOG has a per-phase entry (newest on top). No secrets in any
committed file. Pushed to `main`. Per the task, I stop at this checkpoint and await the
validation pass.