v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearingness,
  NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision is audited (the audit is a signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.
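
A minimal sketch of the classify-then-gate rule from the list above. Every
identifier here (Op, classify, gate, the field names) is hypothetical and chosen
for illustration; it is not the reconcile package's actual API, and it collapses
all signature checks into a single boolean.

```go
package main

import "fmt"

// Class is the classifier's verdict (hypothetical type).
type Class int

const (
	Benign Class = iota
	Destructive
)

// Op describes a mutation as the classifier sees it (hypothetical fields).
type Op struct {
	Known           bool // op class is recognized
	TouchesData     bool // destroys or overwrites customer data
	JournalInternal bool // agent-internal provenance, recorded in the journal
	HubScratchTag   bool // hub-supplied tag: deliberately never consulted
}

// classify fails safe: an unknown class is destructive, and only
// journal-recorded provenance can make a data-bearing op benign.
func classify(op Op) Class {
	if !op.Known {
		return Destructive
	}
	if !op.TouchesData || op.JournalInternal {
		return Benign
	}
	return Destructive
}

// Decision is the gate's outcome (hypothetical type).
type Decision string

const (
	Allowed          Decision = "allowed"
	PendingSignature Decision = "pending_signature"
)

// gate lets benign ops through unsigned. A destructive op needs sigOK, which
// stands for "verified, role-scoped, and bound to this action".
func gate(op Op, sigOK bool) Decision {
	if classify(op) == Benign || sigOK {
		return Allowed
	}
	return PendingSignature
}

func main() {
	hubTagged := Op{Known: true, TouchesData: true, HubScratchTag: true}
	fmt.Println(gate(hubTagged, false)) // hub tag is ignored, so the op waits
	fmt.Println(gate(hubTagged, true))  // same op with a valid signature
}
```

The point of the sketch is that HubScratchTag never appears in classify: only
the journal-side flag can downgrade a data-bearing op to benign.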

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.
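
The binding_mismatch rows above can be pictured as a field-by-field comparison
between what the signature commits to and the action being gated. This is a
sketch under assumptions: the Binding type, its fields, and matches are
hypothetical names, and params are modeled as a flat string map. In the matrix
a wrong host is already rejected by authz before this comparison would run.

```go
package main

import "fmt"

// Binding is a hypothetical stand-in for the fields a signed op commits to.
type Binding struct {
	Op, Host, Guest string
	Params          map[string]string
}

// matches reports whether the signed op is bound to exactly this action:
// op, host, guest, and every parameter must be identical.
func matches(signed, action Binding) bool {
	if signed.Op != action.Op || signed.Host != action.Host || signed.Guest != action.Guest {
		return false
	}
	if len(signed.Params) != len(action.Params) {
		return false
	}
	for k, v := range signed.Params {
		if got, ok := action.Params[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	signed := Binding{Op: "destroy", Host: "pve1", Guest: "101",
		Params: map[string]string{"purge": "1"}}
	action := signed
	action.Guest = "102" // same signature presented for a different guest
	fmt.Println(matches(signed, signed), matches(signed, action))
}
```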

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:56:20 +02:00
parent 05c450147c
commit 1af21a6cac
18 changed files with 1640 additions and 80 deletions
@@ -0,0 +1,97 @@
package reconcile

import (
	"context"
	"time"
)
// Recover consumes the journal's in-flight set at startup: resume-or-rollback for any
// op that was mid-execution when the agent crashed (doc 03 §10). This MUST run before
// the engine begins issuing new mutations.
//
// Why it is load-bearing for signed destructive ops (and why it lands with the gate):
// the idempotency-key store dedupes a COMPLETED op, but an op that crashed AFTER the
// Proxmox POST and BEFORE its terminal record (OpTaskRunning) is not covered by that —
// its nonce is already consumed, so a redelivery is rejected as a replay, yet it never
// reached a terminal state. Only this startup consumer can resolve it: re-check the
// Proxmox task and record the real outcome.
//
// Resolution per in-flight entry:
//   - has a task id (OpTaskRunning): re-read the task status once. Stopped → record the
//     real terminal state (OK → succeeded, else failed). Still running → leave it
//     in-flight (a later Recover or the task's own completion resolves it). Unreadable →
//     leave it (cannot safely decide).
//   - no task id (OpStarted only): the Proxmox POST was never confirmed, so the op is
//     treated as never having taken effect: record failed (fail-safe, the documented
//     FileNonceStore direction). A convergent reconcile op is simply re-issued next pass;
//     a one-shot op did NOT mark its idempotency key applied, so it is not falsely deduped.
func (e *Engine) Recover(ctx context.Context) RecoverResult {
	var res RecoverResult
	if e.journal == nil {
		return res
	}
	for _, entry := range e.journal.InFlight() {
		res.Examined++
		if entry.UPID == "" {
			// POST never confirmed → abandon (fail-safe).
			e.append(terminal(entry, OpFailed))
			res.RolledBack++
			e.logger.Warn("recover: in-flight op had no task id; marked failed (fail-safe)",
				"op_id", entry.OpID, "vmid", entry.VMID, "kind", entry.Kind)
			continue
		}
		st, err := e.api.TaskStatusOnce(ctx, entry.UPID)
		if err != nil {
			res.Unresolved++
			e.logger.Warn("recover: cannot read in-flight task status; left in-flight",
				"op_id", entry.OpID, "upid", entry.UPID, "err", err)
			continue
		}
		if st.Running() {
			res.StillRunning++
			e.logger.Info("recover: in-flight task still running; left in-flight",
				"op_id", entry.OpID, "upid", entry.UPID)
			continue
		}
		// Stopped: record the real outcome.
		if st.OK() {
			e.append(terminal(entry, OpSucceeded))
			res.Resumed++
			e.logger.Info("recover: in-flight task completed OK; marked succeeded",
				"op_id", entry.OpID, "upid", entry.UPID)
		} else {
			e.append(terminal(entry, OpFailed))
			res.Failed++
			e.logger.Warn("recover: in-flight task ended non-OK; marked failed",
				"op_id", entry.OpID, "upid", entry.UPID, "exitstatus", st.ExitStatus)
		}
	}
	if res.Examined > 0 {
		e.logger.Info("recover: in-flight journal reconciled", "result", res)
	}
	return res
}
// RecoverResult summarizes a startup recovery pass.
type RecoverResult struct {
	Examined     int
	Resumed      int // task found completed OK and recorded succeeded
	Failed       int // task found ended non-OK and recorded failed
	RolledBack   int // no task id → abandoned (fail-safe)
	StillRunning int // task still executing → left in-flight
	Unresolved   int // task status unreadable → left in-flight
}
// terminal builds a terminal journal record preserving the op's identity, with the
// idempotency key carried through so a SUCCEEDED one-shot op marks its key applied.
func terminal(e JournalEntry, state OpState) JournalEntry {
	return JournalEntry{
		OpID:     e.OpID,
		VMID:     e.VMID,
		Kind:     e.Kind,
		UPID:     e.UPID,
		State:    state,
		IdempKey: e.IdempKey,
		At:       time.Now().UTC(),
	}
}