v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer
The security core of slice 4: hub-supplied intent is no longer trusted for destructive change. The gate fronts the per-guest queue's executor, so every mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-ness, NOT by verb. Destroy/overwrite of customer data is destructive unless agent-internal provenance (same-journaled-txn create, or agent-tagged scratch) makes it benign, and that provenance is journal-recorded, NEVER hub-sourced. Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a verified, role-scoped, action-bound operator signature, else pending_signature and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline untouched): role-scoping (doc 04 §4: recovery = rotation only, operational = ordinary destructive + planned rotation) + op-to-action binding (op+host+guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped execution via an injected DestructiveExecutor (nil this slice, so inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at startup (resume-or-rollback). This covers an op that crashed after the POST and before its terminal record, which idempotency dedupe alone cannot. Added TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.
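The classifier and gate rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this commit: every name here (`Op`, `Provenance`, `classify`, `gate`, `sigVerified`) is hypothetical, and the single boolean `sigVerified` stands in for the full "verified, role-scoped, action-bound" check that the real gate delegates to authz.Verifier.Verify and the consuming layer.

```go
package main

import "fmt"

// All names in this file are hypothetical; they only illustrate the decision
// rules described in the commit message.

type Class int

const (
	Benign Class = iota
	Destructive
)

// Provenance is what the agent's own journal records about the target.
// It is never populated from hub-supplied fields.
type Provenance int

const (
	ProvNone          Provenance = iota // customer data, or origin unknown
	ProvSameTxnCreate                   // created in the same journaled transaction
	ProvAgentScratch                    // scratch tagged by the agent itself
)

type Op struct {
	Known       bool       // false for an op class the classifier does not recognize
	DataBearing bool       // the op destroys or overwrites data
	Journal     Provenance // journal-recorded provenance
	HubScratch  bool       // hub's claim that the target is scratch (ignored)
}

// classify decides by provenance and data-bearing-ness, not by verb.
func classify(op Op) Class {
	if !op.Known {
		return Destructive // unknown op class fails safe
	}
	if !op.DataBearing {
		return Benign
	}
	if op.Journal == ProvSameTxnCreate || op.Journal == ProvAgentScratch {
		return Benign
	}
	return Destructive // op.HubScratch is deliberately not consulted
}

type Decision string

const (
	Allowed          Decision = "allowed"
	PendingSignature Decision = "pending_signature"
)

// gate lets benign ops through unsigned; a destructive op without a verified
// signature is parked as pending_signature and never executed.
func gate(op Op, sigVerified bool) Decision {
	if classify(op) == Benign || sigVerified {
		return Allowed
	}
	return PendingSignature
}

func main() {
	hubTagged := Op{Known: true, DataBearing: true, HubScratch: true}
	agentScratch := Op{Known: true, DataBearing: true, Journal: ProvAgentScratch}
	fmt.Println(gate(hubTagged, false))    // pending_signature
	fmt.Println(gate(agentScratch, false)) // allowed
	fmt.Println(gate(hubTagged, true))     // allowed
}
```

Keeping the hub's scratch claim as a field that `classify` never reads makes the "journal-recorded, never hub-sourced" rule visible in the sketch: the only way to reach `Benign` for a data-bearing op is through `Journal`.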

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no signing added to the verify-only package):

- unsigned job + unsigned desired-state delta -> pending_signature
- unknown signer / expired / replay-across-restart / wrong host -> typed authz rejections
- wrong guest/op/params -> binding_mismatch
- recovery key on ordinary destructive -> role_denied
- hub-supplied scratch tag ignored -> refused
- valid + role + target + fresh nonce -> accepted, then replay rejected

Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
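
The role_denied and binding_mismatch rows of the matrix follow from the role-scoping and op-to-action binding rules. A minimal sketch of those two checks is given here, assuming hypothetical names throughout (`Role`, `Action`, `checkSigned`, the `isRotation` flag, and the example op, host, and parameter values); the order of the checks and the representation of params are also assumptions, and signature verification itself is out of scope because the commit keeps that inside authz.

```go
package main

import "fmt"

// Hypothetical types; the commit does not show the real ones.

type Role string

const (
	RoleRecovery    Role = "recovery"    // rotation only
	RoleOperational Role = "operational" // ordinary destructive + planned rotation
)

// Action is the tuple a signature is bound to: op + host + guest + params.
type Action struct {
	Op     string
	Host   string
	Guest  int
	Params map[string]string
}

// paramsEqual compares two parameter maps key by key (nil equals empty).
func paramsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// roleAllows applies the role scoping quoted in the commit message.
func roleAllows(r Role, isRotation bool) bool {
	switch r {
	case RoleRecovery:
		return isRotation
	case RoleOperational:
		return true
	default:
		return false // unknown role is denied
	}
}

// checkSigned runs after the signature has already been verified. It returns
// "role_denied", "binding_mismatch", or "accepted".
func checkSigned(r Role, isRotation bool, signed, gated Action) string {
	if !roleAllows(r, isRotation) {
		return "role_denied"
	}
	if signed.Op != gated.Op || signed.Host != gated.Host ||
		signed.Guest != gated.Guest || !paramsEqual(signed.Params, gated.Params) {
		return "binding_mismatch"
	}
	return "accepted"
}

func main() {
	gated := Action{Op: "disk_destroy", Host: "pve1", Guest: 101,
		Params: map[string]string{"disk": "scsi1"}}
	wrongGuest := gated
	wrongGuest.Guest = 102

	fmt.Println(checkSigned(RoleOperational, false, gated, gated))      // accepted
	fmt.Println(checkSigned(RoleOperational, false, wrongGuest, gated)) // binding_mismatch
	fmt.Println(checkSigned(RoleRecovery, false, gated, gated))         // role_denied
}
```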
@@ -0,0 +1,97 @@
package reconcile

import (
	"context"
	"time"
)

// Recover consumes the journal's in-flight set at startup: resume-or-rollback for any
// op that was mid-execution when the agent crashed (doc 03 §10). This MUST run before
// the engine begins issuing new mutations.
//
// Why it is load-bearing for signed destructive ops (and why it lands with the gate):
// the idempotency-key store dedupes a COMPLETED op, but an op that crashed AFTER the
// Proxmox POST and BEFORE its terminal record (OpTaskRunning) is not covered by that:
// its nonce is already consumed, so a redelivery is rejected as a replay, yet it never
// reached a terminal state. Only this startup consumer can resolve it: re-check the
// Proxmox task and record the real outcome.
//
// Resolution per in-flight entry:
//   - has a task id (OpTaskRunning): re-read the task status once. Stopped → record the
//     real terminal state (OK → succeeded, else failed). Still running → leave it
//     in-flight (a later Recover or the task's own completion resolves it). Unreadable →
//     leave it (cannot safely decide).
//   - no task id (OpStarted only): the Proxmox POST was never confirmed, so the op
//     never took effect; record failed (fail-safe, the documented FileNonceStore
//     direction). A convergent reconcile op is simply re-issued next pass; a one-shot
//     op did NOT mark its idempotency key applied, so it is not falsely deduped.
func (e *Engine) Recover(ctx context.Context) RecoverResult {
	var res RecoverResult
	if e.journal == nil {
		return res
	}
	for _, entry := range e.journal.InFlight() {
		res.Examined++
		if entry.UPID == "" {
			// POST never confirmed → abandon (fail-safe).
			e.append(terminal(entry, OpFailed))
			res.RolledBack++
			e.logger.Warn("recover: in-flight op had no task id; marked failed (fail-safe)",
				"op_id", entry.OpID, "vmid", entry.VMID, "kind", entry.Kind)
			continue
		}
		st, err := e.api.TaskStatusOnce(ctx, entry.UPID)
		if err != nil {
			res.Unresolved++
			e.logger.Warn("recover: cannot read in-flight task status; left in-flight",
				"op_id", entry.OpID, "upid", entry.UPID, "err", err)
			continue
		}
		if st.Running() {
			res.StillRunning++
			e.logger.Info("recover: in-flight task still running; left in-flight",
				"op_id", entry.OpID, "upid", entry.UPID)
			continue
		}
		// Stopped: record the real outcome.
		if st.OK() {
			e.append(terminal(entry, OpSucceeded))
			res.Resumed++
			e.logger.Info("recover: in-flight task completed OK; marked succeeded",
				"op_id", entry.OpID, "upid", entry.UPID)
		} else {
			e.append(terminal(entry, OpFailed))
			res.Failed++
			e.logger.Warn("recover: in-flight task ended non-OK; marked failed",
				"op_id", entry.OpID, "upid", entry.UPID, "exitstatus", st.ExitStatus)
		}
	}
	if res.Examined > 0 {
		e.logger.Info("recover: in-flight journal reconciled", "result", res)
	}
	return res
}

// RecoverResult summarizes a startup recovery pass.
type RecoverResult struct {
	Examined     int
	Resumed      int // task found completed OK and recorded succeeded
	Failed       int // task found ended non-OK and recorded failed
	RolledBack   int // no task id → abandoned (fail-safe)
	StillRunning int // task still executing → left in-flight
	Unresolved   int // task status unreadable → left in-flight
}

// terminal builds a terminal journal record preserving the op's identity, with the
// idempotency key carried through so a SUCCEEDED one-shot op marks its key applied.
func terminal(e JournalEntry, state OpState) JournalEntry {
	return JournalEntry{
		OpID:     e.OpID,
		VMID:     e.VMID,
		Kind:     e.Kind,
		UPID:     e.UPID,
		State:    state,
		IdempKey: e.IdempKey,
		At:       time.Now().UTC(),
	}
}
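
The per-entry resolution that Recover performs can be exercised in isolation as a small decision table. The sketch below is a standalone model, not part of the reconcile package: `entry`, `status`, `resolve`, and `fakeLookup` are hypothetical stand-ins for JournalEntry, the TaskStatusOnce result, and the GuestAPI seam, and the `running` / `exit == "OK"` fields are an assumed shape for what `st.Running()` and `st.OK()` report.

```go
package main

import (
	"errors"
	"fmt"
)

// entry models only the journal fields the decision depends on.
type entry struct {
	OpID string
	UPID string // empty when the POST was never confirmed
}

// status is an assumed shape for a one-shot task status read.
type status struct {
	running bool
	exit    string // "OK" on success once stopped
}

// fakeLookup returns a status reader backed by a map; a missing UPID is
// reported as an error, which models an unreadable task.
func fakeLookup(tasks map[string]status) func(string) (status, error) {
	return func(upid string) (status, error) {
		st, ok := tasks[upid]
		if !ok {
			return status{}, errors.New("task status unreadable")
		}
		return st, nil
	}
}

// resolve mirrors the branch order of Recover and names the counter that
// would be incremented for the entry.
func resolve(e entry, lookup func(string) (status, error)) string {
	if e.UPID == "" {
		return "RolledBack" // recorded failed, fail-safe
	}
	st, err := lookup(e.UPID)
	if err != nil {
		return "Unresolved" // left in-flight
	}
	if st.running {
		return "StillRunning" // left in-flight
	}
	if st.exit == "OK" {
		return "Resumed" // recorded succeeded
	}
	return "Failed" // recorded failed
}

func main() {
	lookup := fakeLookup(map[string]status{
		"upid-a": {running: false, exit: "OK"},
		"upid-b": {running: false, exit: "some error"},
		"upid-c": {running: true},
	})
	for _, e := range []entry{
		{OpID: "1"},
		{OpID: "2", UPID: "upid-a"},
		{OpID: "3", UPID: "upid-b"},
		{OpID: "4", UPID: "upid-c"},
		{OpID: "5", UPID: "upid-missing"},
	} {
		fmt.Println(e.OpID, resolve(e, lookup))
	}
}
```

Only the "RolledBack", "Resumed", and "Failed" outcomes append a terminal record; the other two leave the entry in the in-flight set, so a later pass sees it again.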