v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance +
  data-bearingness, NOT by verb. Destroy/overwrite of customer data is
  destructive unless agent-internal provenance (same-journaled-txn create, or
  agent-tagged scratch) makes it benign, and that provenance is
  journal-recorded, NEVER hub-sourced. Unknown op class fails safe to
  destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else
  pending_signature and never executed. Every decision audited (signal, never
  the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4: recovery = rotation only,
  operational = ordinary destructive + planned rotation) + op-to-action
  binding (op+host+guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice, so inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set
  at startup (resume-or-rollback). This covers an op that crashed after the
  POST and before its terminal record, which idempotency dedupe alone cannot.
  Added TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package):

- unsigned job + unsigned desired-state delta -> pending_signature
- unknown signer / expired / replay-across-restart / wrong host -> typed
  authz rejections
- wrong guest/op/params -> binding_mismatch
- recovery key on ordinary destructive -> role_denied
- hub-supplied scratch tag ignored -> refused
- valid+role+target+fresh nonce -> accepted, then replay rejected

Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
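
The check order described in the message (benign, then signature present, then verification, then role, then binding) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this commit: every type here is a hypothetical stand-in for the reconcile/authz types, the op names "guest_destroy" and "key_rotation" are invented, the destructive flag is assumed to come from an agent-internal classifier, and params binding is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins, NOT the reconcile/authz definitions.
type role string

const (
	roleOperational role = "operational"
	roleRecovery    role = "recovery"
)

const opKeyRotation = "key_rotation" // invented op name for this sketch

// verifiedOp stands in for what a verifier returns after authenticating a signed op.
type verifiedOp struct {
	Op, HostID, GuestID string
	Role                role
}

// intent is the action the agent is about to perform.
type intent struct {
	Op, HostID, GuestID string
	Destructive         bool // assumed to be classified from agent-internal provenance
}

// verifyFunc stands in for the crypto verifier; nil means no signers are pinned.
type verifyFunc func() (*verifiedOp, error)

// authorize applies the checks in order and returns (allowed, reason).
func authorize(in intent, signed bool, verify verifyFunc) (bool, string) {
	if !in.Destructive {
		return true, "benign"
	}
	if !signed || verify == nil {
		return false, "pending_signature"
	}
	vop, err := verify()
	if err != nil {
		return false, "rejected"
	}
	// Operational keys cover every op; recovery keys cover key rotation only.
	roleOK := vop.Role == roleOperational ||
		(vop.Role == roleRecovery && in.Op == opKeyRotation)
	if !roleOK {
		return false, "role_denied"
	}
	if vop.Op != in.Op || vop.HostID != in.HostID || vop.GuestID != in.GuestID {
		return false, "binding_mismatch"
	}
	return true, "signed"
}

// accept returns a verifier that always authenticates v.
func accept(v verifiedOp) verifyFunc {
	return func() (*verifiedOp, error) { return &v, nil }
}

// reject returns a verifier that always fails, as a replayed nonce would.
func reject() verifyFunc {
	return func() (*verifiedOp, error) { return nil, errors.New("replay") }
}

func main() {
	destroy := intent{Op: "guest_destroy", HostID: "h1", GuestID: "101", Destructive: true}
	match := verifiedOp{Op: "guest_destroy", HostID: "h1", GuestID: "101", Role: roleOperational}
	otherGuest := match
	otherGuest.GuestID = "102"
	recovery := match
	recovery.Role = roleRecovery

	show := func(name string, signed bool, v verifyFunc) {
		ok, reason := authorize(destroy, signed, v)
		fmt.Printf("%-14s allowed=%-5t reason=%s\n", name, ok, reason)
	}
	show("unsigned", false, accept(match))
	show("replayed", true, reject())
	show("recovery key", true, accept(recovery))
	show("wrong guest", true, accept(otherGuest))
	show("valid", true, accept(match))
}
```

Only the last case is allowed; each earlier case stops at the first check it fails, which is why a recovery-key signature over the wrong guest would report role_denied rather than binding_mismatch.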
2026-06-08 23:56:20 +02:00
parent 05c450147c
commit 1af21a6cac
18 changed files with 1640 additions and 80 deletions
@@ -0,0 +1,291 @@
+package reconcile
+
+import (
+	"encoding/json"
+	"log/slog"
+	"reflect"
+	"strconv"
+	"time"
+
+	"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
+)
+
+// SourceKind records where an intent came from — audit/debug ONLY. Classification
+// does NOT depend on it: a destructive desired-state delta and a destructive one-shot
+// job are gated identically (the agent distrusts hub desired state for destructive
+// change, not just jobs — doc 03 §4).
+type SourceKind string
+
+const (
+	SourceDesiredDelta SourceKind = "desired_delta"
+	SourceOneShotJob   SourceKind = "one_shot_job"
+)
+
+// Intent is an intended mutation presented to the gate. For benign reconcile actions
+// the engine builds one per planned Action; destructive intents (jobs / deltas) carry
+// their op class + canonical params for binding.
+type Intent struct {
+	Class   OpClass
+	HostID  string
+	GuestID string // blob-style guest id ("" = host-scoped); matches OpBlob.target.guest_id
+	VMID    int    // numeric, for queue routing (0 = host-scoped)
+	// ParamsJSON is the canonical params (matching the signed blob's `params`) used for
+	// op-to-action binding on destructive ops. Nil for benign actions (not bound).
+	ParamsJSON json.RawMessage
+	// Provenance is AGENT-INTERNAL only (never hub-sourced) — see classify.go.
+	Provenance Provenance
+	Source     SourceKind
+}
+
+// SignedOp is the opaque operator-signed blob+signature pair the hub queues (doc 04
+// §5). The agent never trusts it until authz.Verifier.Verify passes.
+type SignedOp struct {
+	Blob []byte // the canonical OpBlob JSON bytes (verified over RAW bytes)
+	Sig  []byte // the armored SSHSIG
+}
+
+// RefuseReason is a stable, machine-readable gate refusal reason.
+type RefuseReason string
+
+const (
+	ReasonBenign           RefuseReason = "benign"            // allowed, no signature needed
+	ReasonSigned           RefuseReason = "signed"            // allowed by a verified op
+	ReasonPendingSignature RefuseReason = "pending_signature" // destructive, signature missing or no verifier pinned
+	ReasonRejected         RefuseReason = "rejected"          // signature failed authz verification
+	ReasonRoleDenied       RefuseReason = "role_denied"       // signer role not authorized for this op class
+	ReasonBindingMismatch  RefuseReason = "binding_mismatch"  // signature is for a different action
+)
+
+// Decision is the gate verdict.
+type Decision struct {
+	Allowed     bool
+	Disposition Disposition
+	Reason      RefuseReason
+	// Verified is the authenticated op when a signature authorized the action.
+	Verified *authz.VerifiedOp
+	// Err is the underlying authz rejection (errors.Is-friendly: ErrUnknownSigner,
+	// ErrExpired, ErrReplay, …) when Reason == ReasonRejected.
+	Err error
+}
+
+// OpVerifier is the crypto verifier seam — *authz.Verifier in production; a fake in
+// gate unit tests. The gate never re-implements any crypto; it only consumes the
+// verdict and enforces the policy layer on top (role-scoping + op-to-action binding).
+type OpVerifier interface {
+	Verify(blob, sigArmored []byte) (*authz.VerifiedOp, error)
+}
+
+// AuditSink records every gate decision to the customer-visible audit log. Audit is a
+// SIGNAL, never the guard (doc 03 §4 / doc 04 §5): a compromised hub could suppress a
+// notice, which is exactly why the signature — not the audit — is the control.
+type AuditSink interface {
+	Record(rec AuditRecord)
+}
+
+// AuditRecord is one audited gate decision.
+type AuditRecord struct {
+	Time        time.Time
+	Class       OpClass
+	HostID      string
+	GuestID     string
+	Source      SourceKind
+	Disposition Disposition
+	Allowed     bool
+	Reason      RefuseReason
+	KeyID       string // matched signer's key id, when signed
+	Nonce       string // the op nonce, when signed
+}
+
+// Gate is the reversibility gate: it sits in front of the per-guest queue's executor
+// so EVERY mutation passes it. Benign intents are allowed unsigned; destructive
+// intents require a verified, role-authorized, action-bound operator signature, else
+// they are refused with pending_signature and never executed.
+type Gate struct {
+	verifier OpVerifier // may be nil (no signers pinned) → destructive is always pending_signature
+	hostID   string
+	audit    AuditSink
+	logger   *slog.Logger
+}
+
+// NewGate builds a gate. verifier may be nil when no signers are configured (the
+// common slice-4 state); no signature can then be verified, so every destructive
+// intent is refused with pending_signature. A nil audit or logger defaults to a no-op.
+func NewGate(verifier OpVerifier, hostID string, audit AuditSink, logger *slog.Logger) *Gate {
+	if audit == nil {
+		audit = noopAudit{}
+	}
+	if logger == nil {
+		logger = slog.New(slog.NewTextHandler(discard{}, nil))
+	}
+	return &Gate{verifier: verifier, hostID: hostID, audit: audit, logger: logger}
+}
+
+// Authorize classifies the intent and, for destructive intents, runs the full
+// consuming-layer policy over the verifier verdict. It writes the decision to the
+// audit log and returns it. It NEVER executes anything — the caller dispatches an
+// Allowed decision onto the queue.
+func (g *Gate) Authorize(intent Intent, signed *SignedOp) Decision {
+	disp := Classify(intent.Class, intent.Provenance)
+
+	// Benign: allowed without a signature.
+	if disp == Benign {
+		d := Decision{Allowed: true, Disposition: Benign, Reason: ReasonBenign}
+		g.record(intent, d)
+		return d
+	}
+
+	// Destructive from here: a verified, role-authorized, action-bound signature is
+	// mandatory. Missing signature OR no pinned verifier → pending_signature (refuse).
+	if signed == nil || g.verifier == nil {
+		d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonPendingSignature}
+		g.record(intent, d)
+		return d
+	}
+
+	// Crypto + namespace + allow-list + target + time + nonce — the LOCKED authz
+	// pipeline. The nonce is consumed (recorded) only if this passes.
+	vop, err := g.verifier.Verify(signed.Blob, signed.Sig)
+	if err != nil {
+		d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRejected, Err: err}
+		g.record(intent, d)
+		return d
+	}
+
+	// Role-scoping (the slice-4 job per verifier.go): the signer's pinned role must be
+	// authorized for THIS op class.
+	if !roleAuthorizes(vop.Signer.Role, intent.Class) {
+		d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRoleDenied, Verified: vop}
+		g.record(intent, d)
+		return d
+	}
+
+	// Op-to-action binding: the verified op must name THIS exact action (op + target +
+	// params) — a signature for "restore guest X" cannot authorize destroying guest Y.
+	if !g.bindsToAction(vop, intent) {
+		d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonBindingMismatch, Verified: vop}
+		g.record(intent, d)
+		return d
+	}
+
+	d := Decision{Allowed: true, Disposition: Destructive, Reason: ReasonSigned, Verified: vop}
+	g.record(intent, d)
+	return d
+}
+
+// roleAuthorizes enforces the doc 04 §4 two-key role model: the cold recovery key
+// authorizes ONLY key-rotation re-pins; the operational key authorizes ordinary
+// destructive ops AND planned key-rotation.
+func roleAuthorizes(role authz.KeyRole, class OpClass) bool {
+	if class == ClassKeyRotation {
+		return role == authz.RoleOperational || role == authz.RoleRecovery
+	}
+	return role == authz.RoleOperational
+}
+
+// bindsToAction checks the verified op names this exact action: host (already checked
+// by the verifier, re-asserted here), guest, op class, and params. This is the binding
+// BEYOND the verifier's target check (doc 04 §2.3 binds host; this binds the full
+// action).
+func (g *Gate) bindsToAction(vop *authz.VerifiedOp, intent Intent) bool {
+	if vop.HostID != g.hostID || vop.HostID != intent.HostID {
+		return false
+	}
+	if vop.GuestID != intent.GuestID {
+		return false
+	}
+	if vop.Op != string(intent.Class) {
+		return false
+	}
+	return paramsEqual(vop.Params, intent.ParamsJSON)
+}
+
+// paramsEqual compares two JSON param objects semantically (key order / whitespace
+// independent). Absent params on both sides ({} or empty) compare equal.
+func paramsEqual(a, b json.RawMessage) bool {
+	ax, aok := decodeParams(a)
+	bx, bok := decodeParams(b)
+	if !aok || !bok {
+		return false
+	}
+	return reflect.DeepEqual(ax, bx)
+}
+
+func decodeParams(p json.RawMessage) (any, bool) {
+	if len(p) == 0 {
+		return map[string]any{}, true // absent == empty object
+	}
+	var v any
+	if err := json.Unmarshal(p, &v); err != nil {
+		return nil, false
+	}
+	if v == nil {
+		return map[string]any{}, true // explicit null == empty
+	}
+	return v, true
+}
+
+func (g *Gate) record(intent Intent, d Decision) {
+	rec := AuditRecord{
+		Time:        time.Now().UTC(),
+		Class:       intent.Class,
+		HostID:      intent.HostID,
+		GuestID:     intent.GuestID,
+		Source:      intent.Source,
+		Disposition: d.Disposition,
+		Allowed:     d.Allowed,
+		Reason:      d.Reason,
+	}
+	if d.Verified != nil {
+		rec.KeyID = d.Verified.Signer.KeyID
+		rec.Nonce = d.Verified.Nonce
+	}
+	g.audit.Record(rec)
+	g.logger.Info("gate decision",
+		"class", intent.Class, "guest", intent.GuestID, "source", intent.Source,
+		"disposition", d.Disposition, "allowed", d.Allowed, "reason", d.Reason)
+}
+
+// intentForAction builds the gate Intent for a benign reconcile action. The provenance
+// is the zero value (no agent-internal destroy evidence) and the source is the
+// desired-state delta — reconcile never fabricates scratch/same-txn provenance.
+func intentForAction(hostID string, act Action) Intent {
+	return Intent{
+		Class:      classOfAction(act.Kind),
+		HostID:     hostID,
+		GuestID:    strconv.Itoa(act.VMID),
+		VMID:       act.VMID,
+		Provenance: Provenance{}, // benign actions need none; never hub-sourced
+		Source:     SourceDesiredDelta,
+	}
+}
+
+// noopAudit drops audit records (used when no sink is configured).
+type noopAudit struct{}
+
+func (noopAudit) Record(AuditRecord) {}
+
+// SlogAudit is a minimal AuditSink that emits records to a logger. The durable,
+// customer-visible audit log + its inclusion in the host-report (HostReport.AuditTail)
+// is a later-slice concern; this keeps the signal flowing now without inventing that
+// wire schema.
+type SlogAudit struct{ Logger *slog.Logger }
+
+// Record logs the audit entry at info level.
+func (s SlogAudit) Record(rec AuditRecord) {
+	if s.Logger == nil {
+		return
+	}
+	s.Logger.Info("audit: gate decision",
+		"class", rec.Class, "host", rec.HostID, "guest", rec.GuestID, "source", rec.Source,
+		"disposition", rec.Disposition, "allowed", rec.Allowed, "reason", rec.Reason,
+		"key_id", rec.KeyID, "nonce", auditNonce(rec.Nonce))
+}
+
+// auditNonce shortens a nonce for the log (full nonce is high-cardinality; a prefix is
+// enough to correlate without bloating logs).
+func auditNonce(n string) string {
+	if len(n) <= 8 {
+		return n
+	}
+	return n[:8] + "…"
+}