diff --git a/CHANGELOG.md b/CHANGELOG.md index 91cfb8a..c9803a0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,75 @@ All notable changes to **felhom-agent** are recorded here. Update on every code change that gets pushed. +## v0.4.0 — slice 4 Phase B: reversibility gate + signed-op consuming layer (2026-06-08) + +The security core of slice 4: hub-supplied intent stops being trusted for destructive +change. Layered in front of the per-guest queue's executor — **every** mutation now +passes the gate. Reuses `internal/authz` for all crypto (untouched surface). Inert +this slice: no destructive deltas are served until slice 10, so the destructive path is +classified, gated, and adversarially tested but not wired to live execution. + +### Added +- **Classifier (`classify.go`, doc 03 §4)** — benign vs destructive by **provenance + + data-bearing-ness, NOT by verb**. The `OpClass` vocabulary (seeded by the committed + slice-2 `op_blob.json`: `guest_destroy`) is the agent-side contract slice 10 matches. + Destroy/overwrite of customer data is destructive UNLESS **agent-internal** + provenance (same-journaled-transaction create → compensating rollback, or + agent-tagged scratch) makes it benign. `Provenance` is journal-recorded and **never + populated from the hub** (its zero value is the only thing an external intent may + carry). Unknown op class fails safe → destructive. +- **Reversibility gate (`gate.go`)** — `Gate.Authorize(intent, signed)`: benign → + allowed unsigned; destructive → requires a verified, role-authorized, action-bound + operator signature, else refused **`pending_signature`**, never executed. Every + decision is written to an `AuditSink` (audit is a signal, never the guard). +- **Signed-op consuming layer over `authz`** — verifies via `authz.Verifier.Verify` + (the locked pipeline, untouched), then enforces on the `VerifiedOp`: + - **Role-scoping (doc 04 §4)** — recovery key authorizes key-rotation re-pins ONLY; + operational key authorizes ordinary destructive ops + planned rotation. + - **Op-to-action binding** — verified `op` + host + guest + `params` must match the + gated action (a signature for guest X / op A can't authorize guest Y / op B); + params compared semantically (key-order/whitespace independent). +- **Signed-job orchestration (`job.go`)** — `RunSignedJob`: idempotency dedupe (the + op nonce as the journal key — a redelivered completed op is skipped, not re-run), + gate authorization, then journal-wrapped execution via an injected + `DestructiveExecutor` (nil this slice — authorized destructive ops are inert, no + executor wired until 6/7). +- **Crash-recovery consumer (`recover.go`, Note 1 / doc 03 §10)** — `Engine.Recover` + consumes the journal's `InFlight()` at startup: an op that crashed AFTER the Proxmox + POST and BEFORE its terminal record (`OpTaskRunning`, nonce already consumed) is NOT + covered by idempotency dedupe — only this resume-or-rollback resolves it (re-read the + task via the new `TaskStatusOnce`, record the real outcome; a no-task-id op is + abandoned fail-safe). Landed together with the signed-op executor, as Note 1 required. +- **Daemon wiring** — `runDaemon` builds the verifier from `config.Authz.Signers` (a + bad key / missing nonce-store path is a fatal misconfig; **no signers = nil verifier**, + the common slice-4 state), constructs the gate (+ `SlogAudit`), runs `Recover` before + issuing any mutation, and routes every reconcile action through the gate. + +### Changed +- **Memory comparison canonicalized (Note 2)** — `desiredMemoryMiB` makes the + desired↔actual memory compare in the same MiB unit that is then written, so a + non-MiB-aligned `MemoryBytes` converges in one pass instead of re-issuing SetConfig + forever (the numeric cousin of the description-newline normalization). Test proves + convergence. Slice 10 should still serve MiB-aligned specs at the source. + +### Tests (the security proof — each independently rejected) +- **Adversarial matrix** via the REAL `authz.Verifier` with in-test-minted SSHSIGs + (framing replicated in reconcile's test binary; production authz untouched, no signing + added to the verify-only package): unsigned destructive **job** → pending_signature; + unsigned destructive **desired-state delta** → pending_signature (distrusts hub + desired state, not just jobs); forged/unknown signer → `ErrUnknownSigner`; expired → + `ErrExpired`; **replayed nonce across an agent restart** (durable `FileNonceStore`) → + `ErrReplay`; wrong host → `ErrTarget`; wrong guest / wrong op / wrong params → + binding_mismatch; **recovery key on ordinary destructive** → role_denied; + **hub-supplied "scratch" tag ignored** → still destructive → refused; **valid + role + + target + fresh nonce → accepted**, and a second presentation → `ErrReplay` (nonce + consumed). +- Classifier (benign/destructive/provenance/key-rotation/fail-safe), role-scoping, + params binding, crash-recovery (resume OK / fail / still-running / no-task rollback / + unreadable / one-shot key applied on resume), signed-job idempotency (execute once, + dedupe redelivery, refused-not-executed, no-executor-inert, executor-error). +- Full module **race-clean** (`go test -race`) + vet clean on the Linux build server. + ## v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08) The agent-side control core's structural half. **Checkpoint marker** — `-rc1` is the diff --git a/CLAUDE.md b/CLAUDE.md index 268877a..de67d46 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -15,7 +15,7 @@ - Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`). - **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks. - `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host). -- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version="`; `--version` flag. **Current: v0.4.0-rc1** (slice-4 Phase A checkpoint; v0.4.0 lands with Phase B). Bump on meaningful changes + add a CHANGELOG entry. +- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version="`; `--version` flag. **Current: v0.4.0** (slice 4 complete: reconcile engine + reversibility gate). Bump on meaningful changes + add a CHANGELOG entry. ## Layout @@ -48,8 +48,9 @@ Built in slices, all on `main`: - **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4). - **v0.3.1** — slice-3 validation follow-ups. - **v0.3.2** — slice-4 pre-check: reversible `SetConfig` step added to `--selftest=task`; passed live on guest 9999. Findings: LXC `description` write is **synchronous** (empty UPID — dual-mode modeling confirmed); PVE appends a trailing `\n` to `description` on read (reconcile must normalize). First live `VM.Config.*` exercise. -- **v0.4.0-rc1** — slice-4 **Phase A** (structural): `internal/reconcile` — engine, per-guest serializer (§10), desired-state model + `DesiredProvider` seam, normalization layer (`NormDescription` promoted out of main.go), plan/diff engine (benign Start/Stop/SetConfig set), durable op journal + idempotency store. Wired into `runDaemon` sharing the queue. Runs **live but unfed** (EmptyProvider → zero mutations until slice 10). Checkpoint: awaiting validation before Phase B. -- **Next: slice 4 Phase B** — the benign/destructive classifier, the reversibility gate, and the signed-op consuming layer over `internal/authz` (doc 03 §4 / doc 04), in front of the queue's executor → lands **v0.4.0**. +- **v0.4.0-rc1** — slice-4 **Phase A** (structural): `internal/reconcile` — engine, per-guest serializer (§10), desired-state model + `DesiredProvider` seam, normalization layer (`NormDescription` promoted out of main.go), plan/diff engine (benign Start/Stop/SetConfig set), durable op journal + idempotency store. Wired into `runDaemon` sharing the queue. Runs **live but unfed** (EmptyProvider → zero mutations until slice 10). +- **v0.4.0** — slice-4 **Phase B** (security core): the benign/destructive **classifier** (provenance + data-bearing, not by verb; scratch/same-txn provenance is agent-internal, never hub-sourced), the **reversibility gate** (destructive → `pending_signature` unless a verified, role-scoped, action-bound operator signature), the **signed-op consuming layer** over `internal/authz` (role-scoping per doc 04 §4, op-to-action binding, idempotency-by-nonce, audit), and the **crash-recovery consumer** (`Recover` over `InFlight()`, resume-or-rollback). The gate fronts the queue's executor (every mutation passes it). **Inert this slice** — no destructive deltas served until slice 10; the destructive path is classified, gated, and adversarially tested but not wired to live execution. `authz` surface untouched. +- **Next: slice 5/6 (storage manifest, backup/restore)** — the slices that fill the host-report's empty storage/backup collections and add the destructive executors the gate already guards. ## Demo host (for live tests) diff --git a/REPORT.md b/REPORT.md index 0cdcca1..2562299 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,82 +1,95 @@ -# REPORT — Slice 4 Phase A: reconcile engine (structural) (2026-06-08) +# REPORT — Slice 4: reconcile engine + the reversibility gate (v0.4.0) (2026-06-08) > Overwrite-latest report (most recent significant work only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md). ## Outcome -**Phase A of slice 4 is implemented, tested, and pushed as the checkpoint marker -`v0.4.0-rc1`.** This is the structural half of the agent-side control core: the -reconcile engine, the per-guest serializer (doc 03 §10), the desired-state model + its -provider seam, the field-normalization layer, the plan/diff engine, and the durable -operation journal + idempotency store — all adversarially fixture-tested. +**Slice 4 is complete and pushed as `v0.4.0`.** Both phases landed: -**Per the task, I have STOPPED at the checkpoint and am awaiting the validation pass -before starting Phase B** (the benign/destructive classifier, the reversibility gate, -and the signed-op consuming layer over `internal/authz`). Phase B is the security core -and earns isolated review. +- **Phase A** (structural, pushed earlier as `v0.4.0-rc1`): the reconcile engine, the + per-guest serializer (doc 03 §10), the desired-state model + provider seam, the + field-normalization layer, the plan/diff engine, and the durable op journal + + idempotency store. Runs **live but unfed** — `EmptyProvider` → zero mutations until + slice 10 serves desired state. +- **Phase B** (this push, the security core): the benign/destructive **classifier**, + the **reversibility gate**, and the **signed-op consuming layer** over `internal/authz` + — with role-scoping, op-to-action binding, idempotency/journaling, audit, and the + crash-recovery consumer. The gate sits in front of the per-guest queue's executor, so + **every mutation passes it**. -## What runs (and what deliberately doesn't) +The whole module is **race-clean and vet-clean** on the Linux build server; 62 reconcile +tests pass (the adversarial matrix runs against the real `authz.Verifier`). -The engine **runs live but unfed**. At slice 4 there is no desired-state source (hub -serving is slice 10; provisioning is slice 7), so the only production `DesiredProvider` -is `EmptyProvider` → the live engine reads state, computes an **empty action set**, and -performs **zero mutations** every tick. That is the correct, expected slice-4 behavior; -the first live convergence arrives when slice 10 serves desired state into the seam. +## The security model (Phase B) -The wired action set is **benign-on-existing-guest only**: `Start`, `Stop`, `SetConfig`. -Provisioning and the destructive set are out of scope for Phase A (the destructive set -is classified and gated in Phase B but not wired to live execution — nothing serves -destructive deltas yet). +Hub-supplied intent is no longer trusted for destructive change — **by provenance + +data-bearing-ness, not by verb** (doc 03 §4): -## Package `internal/reconcile` +- **Benign** (unsigned): start/stop/restart/create, and destroying a resource the agent + created in the **same journaled transaction** (compensating rollback) or **tagged + scratch**. That scratch/same-txn provenance is **agent-internal, journal-recorded, and + never accepted from the hub** — a compromised hub cannot relabel a data-bearing guest + as scratch to walk the gate. +- **Destructive** (signature required): destroy/overwrite of the only/primary copy of + customer data — **regardless of whether it arrives as a job or a desired-state delta**. + Absent/invalid signature → refused **`pending_signature`**, never executed. -- **`Queue` (per-guest serializer, doc 03 §10)** — the single choke point all mutation - sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order; - independent vmids run in parallel. Each vmid is an unbounded cond-var FIFO lane - (non-blocking, order-preserving submission); `Close` drains pending jobs gracefully. -- **Desired-state model + `DesiredProvider`** — `DesiredGuest` makes each field - individually optional (run-state / `*hub.GuestSpec` / `*description`) so a source pins - only what it manages. `EmptyProvider` (live, slice 4) and `StaticProvider` (fixtures). -- **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized* - desired-vs-actual. `description`'s trailing newline (the slice-4-proven quirk) is the - first registered normalizer; the registry takes more as discovered. `normDesc` was - **promoted** out of `main.go` to `reconcile.NormDescription`, and the `--selftest=task` - round-trip now uses that shared helper — one source of truth. -- **`Plan` (pure diff engine)** — minimal benign action set for guests in both desired - and actual: normalized comparison, deterministic vmid order, config-before-run-state. - Skips provision (slice 7) and destroy (gated, slice 10); never writes a config it - couldn't first read; disk grow deferred. -- **`Engine`** — reads desired+actual, plans, dispatches onto the shared queue. Honors - the mutate.go dual-mode contract: non-empty UPID → `WaitTask`+assert; empty UPID → - clean synchronous success. Per-action failures counted, never fatal. -- **`Journal`** — durable fsync'd JSONL (mirrors `authz.FileNonceStore`): op lifecycle - with the Proxmox task id (crash mid-op detected + re-checkable via `InFlight()`), plus - an idempotency-key store so a one-shot op never double-runs across retries/restarts. - Reconcile actions carry no idempotency key (convergent — must re-run on real drift). +The signed-op consuming layer calls `authz.Verifier.Verify` (the locked +namespace→allow-list→crypto→target→time→nonce pipeline, untouched) and then enforces +the slice-4 policy on the `VerifiedOp`: **role-scoping** (recovery key = key-rotation +only; operational key = ordinary destructive + planned rotation, doc 04 §4) and +**op-to-action binding** (the verified op + host + guest + params must name the exact +gated action). Idempotency keys the journal by the op nonce; every decision is audited +(a signal, never the guard). -## Daemon wiring +## Inert by design (slice-4 scope) -`runDaemon` now runs reconcile alongside the hub loop on the poll cadence, sharing the -per-guest queue. The journal lives at a `journal.log` sibling of the nonce store. The -daemon runs cleanly with **no desired state and no signers** — reconcile is a logged -no-op; a journal-open failure degrades to journal-less rather than crashing. +There is **no live destructive execution** this slice: nothing serves destructive deltas +until slice 10, and the guest-destroy/storage-wipe/restore-overwrite executors land in +6/7. So the destructive path is fully **classified, gated, and adversarially tested**, +but `RunSignedJob`'s executor is nil in production — an authorized destructive op is +journaled as authorized-but-not-executed. Reconcile itself only produces the benign +Start/Stop/SetConfig set, all allowed through the gate unsigned. + +## Adversarial proof (each case independently rejected) + +Run against the **real** `authz.Verifier` with in-test-minted SSHSIGs (the ~40-line +framing is replicated in reconcile's test binary — production `authz` is untouched and +gains no signing capability; live minting is required because the verifier's clock is +not cross-package injectable): + +unsigned destructive **job** → pending_signature · unsigned destructive **desired-state +delta** → pending_signature (distrusts hub desired state, not just jobs) · forged / +unknown signer → `ErrUnknownSigner` · expired → `ErrExpired` · **replayed nonce across an +agent restart** (durable `FileNonceStore`) → `ErrReplay` · wrong host → `ErrTarget` · +wrong guest / wrong op / wrong params → binding_mismatch · **recovery key on ordinary +destructive** → role_denied · **hub-supplied "scratch" tag** on a data-bearing guest → +ignored, still destructive → refused · **valid + correct role + correct target + fresh +nonce → accepted**, and a second presentation → `ErrReplay`. + +## The two forward-looking notes + +- **Note 1 (carried in)** — the `InFlight()` **resume-or-rollback** startup consumer + (`Engine.Recover`) landed **together with** the signed-op executor, as required. An op + that crashed after the Proxmox POST but before its terminal record (`OpTaskRunning`, + nonce already consumed) is not covered by idempotency dedupe — only this consumer + resolves it (re-read the task via the new `TaskStatusOnce`, record the real outcome; a + no-task-id op is abandoned fail-safe). Wired into daemon startup and tested. +- **Note 2 (addressed)** — the memory comparison is canonicalized (`desiredMemoryMiB`): + desired and actual compare in the same MiB unit that is then written, so a + non-MiB-aligned `MemoryBytes` converges in one pass rather than re-issuing SetConfig + every cycle. A test proves convergence. Recommendation stands that slice 10 serve + MiB-aligned specs at the source. ## Verification -- Full module **race-clean** (`go test -race -count=1 ./...`) and `go vet` clean on the - Linux build server (go1.26); all unit tests green locally and there. -- Adversarial fixture coverage: serializer concurrency/ordering, normalization + - extensibility seam, the full plan matrix (drift / no-false-drift / unmanaged / - spec-unknown / scope skips / ordering / empty-desired), engine sync-vs-async + - failure counting, and journal persistence + idempotency dedupe **across a simulated - restart**. -- No live Proxmox needed (the engine is unfed); the live exercise is deferred — there is - nothing to converge until a desired-state source exists. +- `go test -race -count=1 ./...` and `go vet ./...` clean on the Linux build server + (go1.26); all tests green locally and there. +- No live Proxmox needed — Phase A is unfed and Phase B's destructive path is inert this + slice. The gate's crypto path is proven end-to-end against the real verifier. -## Next (after validation) +## Conventions -Phase B: the classifier (benign vs destructive by provenance + data-bearing-ness, not by -verb), the reversibility gate in front of the queue's executor, and the signed-op -consuming layer over `internal/authz` with role-scoping + op-to-action binding + the -adversarial rejection matrix — landing **v0.4.0**. I will not start it until the Phase-A -validation passes. +Version → **v0.4.0**. CHANGELOG has a per-phase entry (newest on top). No secrets in any +committed file. Pushed to `main`. Per the task, I stop at this checkpoint and await the +validation pass. diff --git a/cmd/felhom-agent/main.go b/cmd/felhom-agent/main.go index 710d375..7487013 100644 --- a/cmd/felhom-agent/main.go +++ b/cmd/felhom-agent/main.go @@ -19,6 +19,7 @@ import ( "syscall" "time" + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" "gitea.dooplex.hu/admin/felhom-agent/internal/config" "gitea.dooplex.hu/admin/felhom-agent/internal/hub" applog "gitea.dooplex.hu/admin/felhom-agent/internal/log" @@ -28,7 +29,7 @@ import ( // version is the agent version. Overridable at build time with // -ldflags "-X main.version="; defaults to the in-repo CHANGELOG version. -var version = "0.4.0-rc1" +var version = "0.4.0" func main() { var ( @@ -137,14 +138,35 @@ func runDaemon(cfg config.Config, logger *slog.Logger) int { defer journal.Close() } } + // The reversibility gate (slice 4 Phase B) sits in front of every mutation. With + // no signers pinned (the common slice-4 state) the verifier is nil: benign actions + // pass, destructive intents are refused pending_signature — and reconcile only + // produces benign actions, so nothing is gated away. A misconfigured signer key is + // a security misconfig and is fatal. + verifier, nonceStore, err := buildVerifier(cfg, logger) + if err != nil { + fmt.Fprintln(os.Stderr, "daemon: authz verifier:", err) + return 2 + } + if nonceStore != nil { + defer nonceStore.Close() + } + gate := reconcile.NewGate(verifier, cfg.Hub.HostID, reconcile.SlogAudit{Logger: logger}, logger) + engine := reconcile.NewEngine(reconcile.EngineOptions{ API: px, Queue: queue, Journal: journal, Provider: reconcile.EmptyProvider{}, // slice 4: no live desired-state source + Gate: gate, + HostID: cfg.Hub.HostID, Logger: logger, }) + // Crash recovery (doc 03 §10): resolve any op that was in flight when the agent + // last died BEFORE issuing new mutations. With an empty journal this is a no-op. + engine.Recover(ctx) + // Run reconcile and the hub loop concurrently; either returning ends the daemon. errc := make(chan error, 2) go func() { errc <- engine.Run(ctx, interval) }() @@ -170,6 +192,35 @@ func reconcileJournalPath(cfg config.Config) string { return "/var/lib/felhom-agent/journal.log" } +// buildVerifier constructs the operator-signed-op verifier from config. With no signers +// pinned it returns (nil, nil, nil) — the gate then refuses any destructive intent as +// pending_signature (and reconcile serves none). With signers it requires a durable +// nonce-store path (anti-replay must survive restarts) and parses every pinned key — +// a bad key or missing path is a fatal security misconfig. +func buildVerifier(cfg config.Config, logger *slog.Logger) (reconcile.OpVerifier, *authz.FileNonceStore, error) { + if len(cfg.Authz.Signers) == 0 { + logger.Info("daemon: no operator signers pinned; destructive ops will be refused pending_signature") + return nil, nil, nil + } + if cfg.Authz.NonceStorePath == "" { + return nil, nil, fmt.Errorf("authz.nonce_store_path is required when signers are configured") + } + signers := make([]authz.AllowedSigner, 0, len(cfg.Authz.Signers)) + for _, s := range cfg.Authz.Signers { + as, err := authz.NewAllowedSigner(s.KeyID, authz.KeyRole(s.Role), s.PublicKey) + if err != nil { + return nil, nil, fmt.Errorf("pinned signer %q: %w", s.KeyID, err) + } + signers = append(signers, as) + } + store, err := authz.OpenFileNonceStore(cfg.Authz.NonceStorePath) + if err != nil { + return nil, nil, fmt.Errorf("nonce store: %w", err) + } + logger.Info("daemon: operator signers pinned", "count", len(signers)) + return authz.New(signers, store, cfg.Hub.HostID), store, nil +} + // runSelftestHub validates hub config, does ONE collect + report, and prints the // report it would send plus the envelope it got back. func runSelftestHub(ctx context.Context, cfg config.Config, logger *slog.Logger) int { diff --git a/internal/reconcile/classify.go b/internal/reconcile/classify.go new file mode 100644 index 0000000..8ff9f36 --- /dev/null +++ b/internal/reconcile/classify.go @@ -0,0 +1,106 @@ +package reconcile + +// The benign/destructive classifier (doc 03 §4). The gate decides whether an intended +// action needs an operator signature by **provenance + data-bearing-ness, NOT by +// verb**. The op CLASS encodes the semantic intent (a "detach storage" is its own +// destructive class, not just another PUT) so classification never turns on the HTTP +// method. + +// OpClass is the semantic class of an intended action. This vocabulary is the +// agent-side contract: the signed-op `op` field (doc 04 §2.1) and slice-10's hub / +// operator CLI match these exact strings. The committed slice-2 fixture +// (op="guest_destroy") seeds it. +type OpClass string + +const ( + // Benign-on-existing-guest set — wired to live execution this slice. + ClassStart OpClass = "start" + ClassStop OpClass = "stop" + ClassSetConfig OpClass = "set_config" // benign sizing/description changes only + + // Benign by construction — classified now, executors land in later slices. + ClassCreate OpClass = "create" // provision a NEW guest (restore-to-new, slice 7) + ClassRestart OpClass = "restart" // heal a crashed controller in-place (§4) + + // Destructive set — destroying/overwriting the only/primary copy of customer + // data. Classified and gated now; NOT wired to live execution this slice (nothing + // serves destructive deltas until slice 10). + ClassGuestDestroy OpClass = "guest_destroy" + ClassStorageWipe OpClass = "storage_wipe" // storage detach/wipe + ClassRestoreOverwrite OpClass = "restore_overwrite" // restore OVER an existing guest + ClassDecommission OpClass = "decommission" + + // Key-rotation re-pin (doc 04 §4) — destructive-class, role-scoped: the cold + // recovery key authorizes ONLY this; the operational key authorizes this + ordinary + // destructive ops. + ClassKeyRotation OpClass = "key_rotation" +) + +// Disposition is the classifier verdict. +type Disposition string + +const ( + // Benign — the reconciler/executor MAY act without an operator signature. + Benign Disposition = "benign" + // Destructive — an operator signature bound to the action is REQUIRED. + Destructive Disposition = "destructive" +) + +// Provenance is AGENT-INTERNAL evidence that an otherwise-destructive action is +// actually safe (doc 03 §4). It is recorded in the operation journal by the agent's +// own bookkeeping and is **NEVER populated from the hub or any external input** — else +// a compromised hub could relabel a data-bearing guest as scratch to walk the gate. +// The zero value (no internal evidence) is the only value an externally-sourced intent +// may carry. +type Provenance struct { + // SameTxnCreated: the agent created this resource earlier in the SAME journaled + // transaction, so destroying it is a compensating rollback (§10), not data loss. + SameTxnCreated bool + // AgentTaggedScratch: the agent tagged this resource ephemeral/scratch (e.g. a + // restore-test scratch guest, §8). Journal-recorded provenance only. + AgentTaggedScratch bool +} + +// internalEvidence reports whether agent-internal provenance makes a destroy benign. +func (p Provenance) internalEvidence() bool { + return p.SameTxnCreated || p.AgentTaggedScratch +} + +// Classify returns the disposition for an op class given agent-internal provenance. +// +// Rules (doc 03 §4): +// - create/start/stop/restart/benign-set_config → always Benign. +// - destroy/overwrite of a data-bearing resource → Destructive, UNLESS agent-internal +// provenance (same-transaction create, or agent-tagged scratch) makes it benign. +// - key-rotation → always Destructive (signed); role-scoping picks the allowed key. +// - an UNKNOWN class fails safe → Destructive (require a signature). +func Classify(class OpClass, prov Provenance) Disposition { + switch class { + case ClassStart, ClassStop, ClassSetConfig, ClassCreate, ClassRestart: + return Benign + case ClassGuestDestroy, ClassStorageWipe, ClassRestoreOverwrite, ClassDecommission: + if prov.internalEvidence() { + return Benign // compensating rollback / scratch teardown + } + return Destructive + case ClassKeyRotation: + return Destructive + default: + return Destructive // fail safe: an unrecognized op is treated as destructive + } +} + +// classOfAction maps a benign reconcile ActionKind to its OpClass, so every reconcile +// mutation is classified and passed through the gate like any other intent. +func classOfAction(k ActionKind) OpClass { + switch k { + case ActionStart: + return ClassStart + case ActionStop: + return ClassStop + case ActionSetConfig: + return ClassSetConfig + default: + return OpClass(k) + } +} diff --git a/internal/reconcile/classify_test.go b/internal/reconcile/classify_test.go new file mode 100644 index 0000000..e2bbf5f --- /dev/null +++ b/internal/reconcile/classify_test.go @@ -0,0 +1,70 @@ +package reconcile + +import ( + "testing" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" +) + +func TestClassify_BenignClasses(t *testing.T) { + for _, c := range []OpClass{ClassStart, ClassStop, ClassSetConfig, ClassCreate, ClassRestart} { + if got := Classify(c, Provenance{}); got != Benign { + t.Errorf("Classify(%s) = %s, want benign", c, got) + } + } +} + +func TestClassify_DestructiveClassesNeedSignature(t *testing.T) { + for _, c := range []OpClass{ClassGuestDestroy, ClassStorageWipe, ClassRestoreOverwrite, ClassDecommission, ClassKeyRotation} { + if got := Classify(c, Provenance{}); got != Destructive { + t.Errorf("Classify(%s) = %s, want destructive", c, got) + } + } +} + +func TestClassify_InternalProvenanceMakesDestroyBenign(t *testing.T) { + // Same-transaction create → compensating rollback is benign (§10). + if got := Classify(ClassGuestDestroy, Provenance{SameTxnCreated: true}); got != Benign { + t.Errorf("same-txn destroy = %s, want benign", got) + } + // Agent-tagged scratch teardown is benign (§8). + if got := Classify(ClassGuestDestroy, Provenance{AgentTaggedScratch: true}); got != Benign { + t.Errorf("scratch destroy = %s, want benign", got) + } +} + +func TestClassify_KeyRotationAlwaysDestructive(t *testing.T) { + // Even with internal provenance, key-rotation stays signed (role-scoping decides + // which key) — provenance flags don't apply to it. + if got := Classify(ClassKeyRotation, Provenance{SameTxnCreated: true, AgentTaggedScratch: true}); got != Destructive { + t.Errorf("key_rotation = %s, want destructive", got) + } +} + +func TestClassify_UnknownClassFailsSafe(t *testing.T) { + if got := Classify(OpClass("totally_unknown_op"), Provenance{}); got != Destructive { + t.Errorf("unknown class = %s, want destructive (fail-safe)", got) + } +} + +func TestRoleAuthorizes(t *testing.T) { + op := authz.RoleOperational + rec := authz.RoleRecovery + cases := []struct { + role authz.KeyRole + class OpClass + want bool + }{ + {op, ClassGuestDestroy, true}, // operational does ordinary destructive + {op, ClassDecommission, true}, // + {op, ClassKeyRotation, true}, // operational does planned rotation + {rec, ClassGuestDestroy, false}, // recovery may NOT do ordinary destructive + {rec, ClassStorageWipe, false}, // + {rec, ClassKeyRotation, true}, // recovery authorizes ONLY rotation + } + for _, c := range cases { + if got := roleAuthorizes(c.role, c.class); got != c.want { + t.Errorf("roleAuthorizes(%s, %s) = %v, want %v", c.role, c.class, got, c.want) + } + } +} diff --git a/internal/reconcile/engine.go b/internal/reconcile/engine.go index a355312..b2e09c9 100644 --- a/internal/reconcile/engine.go +++ b/internal/reconcile/engine.go @@ -27,19 +27,23 @@ type Engine struct { journal *Journal provider DesiredProvider norm FieldNormalizers + gate *Gate + hostID string logger *slog.Logger opSeq uint64 // atomic; makes each op id unique per attempt } // EngineOptions configures a new Engine. Norm defaults to DefaultNormalizers, Logger -// to a discard logger. +// to a discard logger, Gate to a no-verifier gate (benign-allow, destructive-pending). type EngineOptions struct { API GuestAPI Queue *Queue Journal *Journal Provider DesiredProvider Norm FieldNormalizers + Gate *Gate + HostID string Logger *slog.Logger } @@ -58,12 +62,20 @@ func NewEngine(opts EngineOptions) *Engine { if provider == nil { provider = EmptyProvider{} } + gate := opts.Gate + if gate == nil { + // No verifier configured: benign actions pass, destructive are pending. This is + // the common slice-4 daemon state (no signers pinned, no desired state). + gate = NewGate(nil, opts.HostID, nil, logger) + } return &Engine{ api: opts.API, queue: opts.Queue, journal: opts.Journal, provider: provider, norm: norm, + gate: gate, + hostID: opts.HostID, logger: logger, } } @@ -97,23 +109,39 @@ func (e *Engine) Reconcile(ctx context.Context) (Result, error) { return res, nil } - // Dispatch all actions onto the shared per-guest queue, then await each. Same-vmid - // actions serialize in submit order; different vmids run concurrently. - chans := make([]<-chan error, len(actions)) + // Every mutation passes the reversibility gate before the queue (doc 03 §4). + // Reconcile only produces benign actions, so each is allowed unsigned — but the + // gate is genuinely in the path: a destructive class here would be refused + // (pending_signature) and never dispatched. A gate refusal counts as a failed + // action (it should not happen for the benign reconcile set). + type dispatched struct { + act Action + ch <-chan error + } + var sent []dispatched for i := range actions { act := actions[i] - chans[i] = e.queue.Submit(act.VMID, func() error { return e.execute(ctx, act) }) + dec := e.gate.Authorize(intentForAction(e.hostID, act), nil) + if !dec.Allowed { + res.Failed++ + res.Errors = append(res.Errors, fmt.Errorf("reconcile: gate refused %s vmid %d: %s", + act.Kind, act.VMID, dec.Reason)) + e.logger.Error("reconcile: gate refused a benign action (unexpected)", + "vmid", act.VMID, "kind", act.Kind, "reason", dec.Reason) + continue + } + sent = append(sent, dispatched{act: act, ch: e.queue.Submit(act.VMID, func() error { return e.execute(ctx, act) })}) } - for i, ch := range chans { - if err := <-ch; err != nil { + for _, d := range sent { + if err := <-d.ch; err != nil { res.Failed++ res.Errors = append(res.Errors, err) e.logger.Error("reconcile: action failed", - "vmid", actions[i].VMID, "kind", actions[i].Kind, "err", err) + "vmid", d.act.VMID, "kind", d.act.Kind, "err", err) } else { res.Executed++ e.logger.Info("reconcile: action applied", - "vmid", actions[i].VMID, "kind", actions[i].Kind, "reason", actions[i].Reason) + "vmid", d.act.VMID, "kind", d.act.Kind, "reason", d.act.Reason) } } return res, nil @@ -227,8 +255,13 @@ func (e *Engine) reconcileOnce(ctx context.Context) { // nextOpID builds a per-attempt unique op id (kind-vmid-seq) for journal correlation. func (e *Engine) nextOpID(act Action) string { - n := atomic.AddUint64(&e.opSeq, 1) - return string(act.Kind) + "-" + strconv.Itoa(act.VMID) + "-" + strconv.FormatUint(n, 10) + return string(act.Kind) + "-" + strconv.Itoa(act.VMID) + "-" + nextSeq(&e.opSeq) +} + +// nextSeq atomically increments a counter and returns it as a string — the unique +// suffix that distinguishes journal op ids across attempts. +func nextSeq(p *uint64) string { + return strconv.FormatUint(atomic.AddUint64(p, 1), 10) } // append journals a lifecycle record, logging (never failing the op on) a journal I/O diff --git a/internal/reconcile/engine_test.go b/internal/reconcile/engine_test.go index d4dc852..f028ffa 100644 --- a/internal/reconcile/engine_test.go +++ b/internal/reconcile/engine_test.go @@ -23,6 +23,8 @@ type fakeAPI struct { // waitFunc maps a UPID to a (status, err); default = OK. Mirrors the real client, // which errors on a non-OK exitstatus. waitFunc func(upid string) (proxmox.TaskStatus, error) + // statusFunc backs TaskStatusOnce (crash recovery); default = stopped/OK. + statusFunc func(upid string) (proxmox.TaskStatus, error) starts []int stops []int @@ -31,6 +33,13 @@ type fakeAPI struct { listErr error } +func (f *fakeAPI) TaskStatusOnce(_ context.Context, upid string) (proxmox.TaskStatus, error) { + if f.statusFunc != nil { + return f.statusFunc(upid) + } + return proxmox.TaskStatus{UPID: upid, Status: "stopped", ExitStatus: "OK"}, nil +} + type setCall struct { vmid int params map[string]string diff --git a/internal/reconcile/gate.go b/internal/reconcile/gate.go new file mode 100644 index 0000000..de214b5 --- /dev/null +++ b/internal/reconcile/gate.go @@ -0,0 +1,291 @@ +package reconcile + +import ( + "encoding/json" + "log/slog" + "reflect" + "strconv" + "time" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" +) + +// SourceKind records where an intent came from — audit/debug ONLY. Classification +// does NOT depend on it: a destructive desired-state delta and a destructive one-shot +// job are gated identically (the agent distrusts hub desired state for destructive +// change, not just jobs — doc 03 §4). +type SourceKind string + +const ( + SourceDesiredDelta SourceKind = "desired_delta" + SourceOneShotJob SourceKind = "one_shot_job" +) + +// Intent is an intended mutation presented to the gate. For benign reconcile actions +// the engine builds one per planned Action; destructive intents (jobs / deltas) carry +// their op class + canonical params for binding. +type Intent struct { + Class OpClass + HostID string + GuestID string // blob-style guest id ("" = host-scoped); matches OpBlob.target.guest_id + VMID int // numeric, for queue routing (0 = host-scoped) + // ParamsJSON is the canonical params (matching the signed blob's `params`) used for + // op-to-action binding on destructive ops. Nil for benign actions (not bound). + ParamsJSON json.RawMessage + // Provenance is AGENT-INTERNAL only (never hub-sourced) — see classify.go. + Provenance Provenance + Source SourceKind +} + +// SignedOp is the opaque operator-signed blob+signature pair the hub queues (doc 04 +// §5). The agent never trusts it until authz.Verifier.Verify passes. +type SignedOp struct { + Blob []byte // the canonical OpBlob JSON bytes (verified over RAW bytes) + Sig []byte // the armored SSHSIG +} + +// RefuseReason is a stable, machine-readable gate refusal reason. +type RefuseReason string + +const ( + ReasonBenign RefuseReason = "benign" // allowed, no signature needed + ReasonSigned RefuseReason = "signed" // allowed by a verified op + ReasonPendingSignature RefuseReason = "pending_signature" // destructive, no/again-needed signature + ReasonRejected RefuseReason = "rejected" // signature failed authz verification + ReasonRoleDenied RefuseReason = "role_denied" // signer role not authorized for this op class + ReasonBindingMismatch RefuseReason = "binding_mismatch" // signature is for a different action +) + +// Decision is the gate verdict. +type Decision struct { + Allowed bool + Disposition Disposition + Reason RefuseReason + // Verified is the authenticated op when a signature authorized the action. + Verified *authz.VerifiedOp + // Err is the underlying authz rejection (errors.Is-friendly: ErrUnknownSigner, + // ErrExpired, ErrReplay, …) when Reason == ReasonRejected. + Err error +} + +// OpVerifier is the crypto verifier seam — *authz.Verifier in production; a fake in +// gate unit tests. The gate never re-implements any crypto; it only consumes the +// verdict and enforces the policy layer on top (role-scoping + op-to-action binding). +type OpVerifier interface { + Verify(blob, sigArmored []byte) (*authz.VerifiedOp, error) +} + +// AuditSink records every gate decision to the customer-visible audit log. Audit is a +// SIGNAL, never the guard (doc 03 §4 / doc 04 §5): a compromised hub could suppress a +// notice, which is exactly why the signature — not the audit — is the control. +type AuditSink interface { + Record(rec AuditRecord) +} + +// AuditRecord is one audited gate decision. +type AuditRecord struct { + Time time.Time + Class OpClass + HostID string + GuestID string + Source SourceKind + Disposition Disposition + Allowed bool + Reason RefuseReason + KeyID string // matched signer's key id, when signed + Nonce string // the op nonce, when signed +} + +// Gate is the reversibility gate: it sits in front of the per-guest queue's executor +// so EVERY mutation passes it. Benign intents are allowed unsigned; destructive +// intents require a verified, role-authorized, action-bound operator signature, else +// they are refused with pending_signature and never executed. +type Gate struct { + verifier OpVerifier // may be nil (no signers pinned) → destructive is always pending_signature + hostID string + audit AuditSink + logger *slog.Logger +} + +// NewGate builds a gate. verifier may be nil when no signers are configured (the +// common slice-4 state) — then there is nothing destructive to authorize and any +// destructive intent is refused pending_signature. audit/logger default to no-ops. +func NewGate(verifier OpVerifier, hostID string, audit AuditSink, logger *slog.Logger) *Gate { + if audit == nil { + audit = noopAudit{} + } + if logger == nil { + logger = slog.New(slog.NewTextHandler(discard{}, nil)) + } + return &Gate{verifier: verifier, hostID: hostID, audit: audit, logger: logger} +} + +// Authorize classifies the intent and, for destructive intents, runs the full +// consuming-layer policy over the verifier verdict. It writes the decision to the +// audit log and returns it. It NEVER executes anything — the caller dispatches an +// Allowed decision onto the queue. +func (g *Gate) Authorize(intent Intent, signed *SignedOp) Decision { + disp := Classify(intent.Class, intent.Provenance) + + // Benign: allowed without a signature. + if disp == Benign { + d := Decision{Allowed: true, Disposition: Benign, Reason: ReasonBenign} + g.record(intent, d) + return d + } + + // Destructive from here: a verified, role-authorized, action-bound signature is + // mandatory. Missing signature OR no pinned verifier → pending_signature (refuse). + if signed == nil || g.verifier == nil { + d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonPendingSignature} + g.record(intent, d) + return d + } + + // Crypto + namespace + allow-list + target + time + nonce — the LOCKED authz + // pipeline. The nonce is consumed (recorded) only if this passes. + vop, err := g.verifier.Verify(signed.Blob, signed.Sig) + if err != nil { + d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRejected, Err: err} + g.record(intent, d) + return d + } + + // Role-scoping (the slice-4 job per verifier.go): the signer's pinned role must be + // authorized for THIS op class. + if !roleAuthorizes(vop.Signer.Role, intent.Class) { + d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRoleDenied, Verified: vop} + g.record(intent, d) + return d + } + + // Op-to-action binding: the verified op must name THIS exact action (op + target + + // params) — a signature for "restore guest X" cannot authorize destroying guest Y. + if !g.bindsToAction(vop, intent) { + d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonBindingMismatch, Verified: vop} + g.record(intent, d) + return d + } + + d := Decision{Allowed: true, Disposition: Destructive, Reason: ReasonSigned, Verified: vop} + g.record(intent, d) + return d +} + +// roleAuthorizes enforces the doc 04 §4 two-key role model: the cold recovery key +// authorizes ONLY key-rotation re-pins; the operational key authorizes ordinary +// destructive ops AND planned key-rotation. +func roleAuthorizes(role authz.KeyRole, class OpClass) bool { + if class == ClassKeyRotation { + return role == authz.RoleOperational || role == authz.RoleRecovery + } + return role == authz.RoleOperational +} + +// bindsToAction checks the verified op names this exact action: host (already checked +// by the verifier, re-asserted here), guest, op class, and params. This is the binding +// BEYOND the verifier's target check (doc 04 §2.3 binds host; this binds the full +// action). +func (g *Gate) bindsToAction(vop *authz.VerifiedOp, intent Intent) bool { + if vop.HostID != g.hostID || vop.HostID != intent.HostID { + return false + } + if vop.GuestID != intent.GuestID { + return false + } + if vop.Op != string(intent.Class) { + return false + } + return paramsEqual(vop.Params, intent.ParamsJSON) +} + +// paramsEqual compares two JSON param objects semantically (key order / whitespace +// independent). Absent params on both sides ({} or empty) compare equal. +func paramsEqual(a, b json.RawMessage) bool { + ax, aok := decodeParams(a) + bx, bok := decodeParams(b) + if !aok || !bok { + return false + } + return reflect.DeepEqual(ax, bx) +} + +func decodeParams(p json.RawMessage) (any, bool) { + if len(p) == 0 { + return map[string]any{}, true // absent == empty object + } + var v any + if err := json.Unmarshal(p, &v); err != nil { + return nil, false + } + if v == nil { + return map[string]any{}, true // explicit null == empty + } + return v, true +} + +func (g *Gate) record(intent Intent, d Decision) { + rec := AuditRecord{ + Time: time.Now().UTC(), + Class: intent.Class, + HostID: intent.HostID, + GuestID: intent.GuestID, + Source: intent.Source, + Disposition: d.Disposition, + Allowed: d.Allowed, + Reason: d.Reason, + } + if d.Verified != nil { + rec.KeyID = d.Verified.Signer.KeyID + rec.Nonce = d.Verified.Nonce + } + g.audit.Record(rec) + g.logger.Info("gate decision", + "class", intent.Class, "guest", intent.GuestID, "source", intent.Source, + "disposition", d.Disposition, "allowed", d.Allowed, "reason", d.Reason) +} + +// intentForAction builds the gate Intent for a benign reconcile action. The provenance +// is the zero value (no agent-internal destroy evidence) and the source is the +// desired-state delta — reconcile never fabricates scratch/same-txn provenance. +func intentForAction(hostID string, act Action) Intent { + return Intent{ + Class: classOfAction(act.Kind), + HostID: hostID, + GuestID: strconv.Itoa(act.VMID), + VMID: act.VMID, + Provenance: Provenance{}, // benign actions need none; never hub-sourced + Source: SourceDesiredDelta, + } +} + +// noopAudit drops audit records (used when no sink is configured). +type noopAudit struct{} + +func (noopAudit) Record(AuditRecord) {} + +// SlogAudit is a minimal AuditSink that emits records to a logger. The durable, +// customer-visible audit log + its inclusion in the host-report (HostReport.AuditTail) +// is a later-slice concern; this keeps the signal flowing now without inventing that +// wire schema. +type SlogAudit struct{ Logger *slog.Logger } + +// Record logs the audit entry at info level. +func (s SlogAudit) Record(rec AuditRecord) { + if s.Logger == nil { + return + } + s.Logger.Info("audit: gate decision", + "class", rec.Class, "host", rec.HostID, "guest", rec.GuestID, "source", rec.Source, + "disposition", rec.Disposition, "allowed", rec.Allowed, "reason", rec.Reason, + "key_id", rec.KeyID, "nonce", auditNonce(rec.Nonce)) +} + +// auditNonce shortens a nonce for the log (full nonce is high-cardinality; a prefix is +// enough to correlate without bloating logs). +func auditNonce(n string) string { + if len(n) <= 8 { + return n + } + return n[:8] + "…" +} diff --git a/internal/reconcile/gate_test.go b/internal/reconcile/gate_test.go new file mode 100644 index 0000000..f5fbb80 --- /dev/null +++ b/internal/reconcile/gate_test.go @@ -0,0 +1,299 @@ +package reconcile + +import ( + "encoding/json" + "errors" + "path/filepath" + "testing" + "time" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" +) + +const testHost = "demo-felhom" + +// captureAudit records gate decisions so tests can assert audit is always written +// (audit is a signal, never the guard). +type captureAudit struct{ recs []AuditRecord } + +func (c *captureAudit) Record(r AuditRecord) { c.recs = append(c.recs, r) } + +// realVerifierAt builds a real authz.Verifier over a durable nonce store at path +// (reused across "restart" by reopening the same path), pinning the given signers. +func realVerifierAt(t *testing.T, path, hostID string, signers ...authz.AllowedSigner) (*authz.Verifier, *authz.FileNonceStore) { + t.Helper() + store, err := authz.OpenFileNonceStore(path) + if err != nil { + t.Fatalf("OpenFileNonceStore: %v", err) + } + t.Cleanup(func() { store.Close() }) + return authz.New(signers, store, hostID), store +} + +// destroyIntent is the canonical destructive fixture: destroy guest 9001, params +// {"purge":true} (mirrors the committed slice-2 op_blob.json shape). +func destroyIntent(source SourceKind) Intent { + return Intent{ + Class: ClassGuestDestroy, + HostID: testHost, + GuestID: "9001", + VMID: 9001, + ParamsJSON: json.RawMessage(`{"purge":true}`), + Source: source, + } +} + +func freshWindow() (issued, expires time.Time) { + now := time.Now().UTC() + return now.Add(-1 * time.Minute), now.Add(10 * time.Minute) +} + +// --- The adversarial matrix: each case must be INDEPENDENTLY rejected (or, the one +// positive case, accepted). --- + +func TestGate_DestructiveJobNoSignatureRefused(t *testing.T) { + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + aud := &captureAudit{} + g := NewGate(v, testHost, aud, nil) + + d := g.Authorize(destroyIntent(SourceOneShotJob), nil) + if d.Allowed || d.Reason != ReasonPendingSignature { + t.Fatalf("unsigned destructive job: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason) + } + if len(aud.recs) != 1 || aud.recs[0].Allowed { + t.Errorf("decision must be audited as refused: %+v", aud.recs) + } +} + +func TestGate_DestructiveDesiredDeltaNoSignatureRefused(t *testing.T) { + // Proves the agent distrusts hub DESIRED STATE for destructive change, not just + // jobs — same refusal, different source. + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + d := g.Authorize(destroyIntent(SourceDesiredDelta), nil) + if d.Allowed || d.Reason != ReasonPendingSignature { + t.Fatalf("unsigned destructive delta: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason) + } +} + +func TestGate_UnknownSignerRejected(t *testing.T) { + pinned := newTestSigner(t) + attacker := newTestSigner(t) // NOT pinned + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, pinned.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + signed := attacker.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":true}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || d.Reason != ReasonRejected || !errors.Is(d.Err, authz.ErrUnknownSigner) { + t.Fatalf("forged signer: got allowed=%v reason=%s err=%v, want rejected/ErrUnknownSigner", d.Allowed, d.Reason, d.Err) + } +} + +func TestGate_ExpiredSignatureRejected(t *testing.T) { + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + past := time.Now().UTC().Add(-2 * time.Hour) + signed := op.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":true}`, past, past.Add(time.Minute)) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || !errors.Is(d.Err, authz.ErrExpired) { + t.Fatalf("expired op: got allowed=%v err=%v, want ErrExpired", d.Allowed, d.Err) + } +} + +func TestGate_WrongHostTargetRejected(t *testing.T) { + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + signed := op.mint("guest_destroy", "some-other-host", "9001", "op1", nonce(), `{"purge":true}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || !errors.Is(d.Err, authz.ErrTarget) { + t.Fatalf("wrong host: got allowed=%v err=%v, want ErrTarget", d.Allowed, d.Err) + } +} + +func TestGate_WrongGuestBindingMismatch(t *testing.T) { + // host matches (verifier passes) but the signature names a DIFFERENT guest than the + // action — the op-to-action binding rejects it. + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + signed := op.mint("guest_destroy", testHost, "9002", "op1", nonce(), `{"purge":true}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) // intent targets 9001 + if d.Allowed || d.Reason != ReasonBindingMismatch { + t.Fatalf("guest mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason) + } +} + +func TestGate_WrongParamsBindingMismatch(t *testing.T) { + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + // signature authorizes purge=false; the action wants purge=true. + signed := op.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":false}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || d.Reason != ReasonBindingMismatch { + t.Fatalf("params mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason) + } +} + +func TestGate_WrongOpBindingMismatch(t *testing.T) { + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + // a valid signature for restore_overwrite cannot authorize a guest_destroy. + signed := op.mint("restore_overwrite", testHost, "9001", "op1", nonce(), `{"purge":true}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || d.Reason != ReasonBindingMismatch { + t.Fatalf("op mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason) + } +} + +func TestGate_RecoveryKeyOnOrdinaryDestructiveRoleDenied(t *testing.T) { + // A valid signature from the cold RECOVERY key on an ordinary destructive op is + // refused by role-scoping (recovery authorizes ONLY key-rotation). + rec := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, rec.allowed(t, "rec1", authz.RoleRecovery)) + g := NewGate(v, testHost, nil, nil) + + issued, expires := freshWindow() + signed := rec.mint("guest_destroy", testHost, "9001", "rec1", nonce(), `{"purge":true}`, issued, expires) + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || d.Reason != ReasonRoleDenied { + t.Fatalf("recovery on destroy: got allowed=%v reason=%s, want role_denied", d.Allowed, d.Reason) + } +} + +func TestGate_HubSuppliedScratchTagIgnored(t *testing.T) { + // A compromised hub attaches a "scratch" hint to a data-bearing guest's destroy + // delta to try to walk the gate unsigned. The intent built from a hub delta must + // NOT carry that as agent-internal provenance — so it stays destructive and is + // refused without a signature. + intent := intentFromHubDelta(hubDelta{Class: ClassGuestDestroy, HostID: testHost, GuestID: "9001", VMID: 9001, HubSaysScratch: true}) + if intent.Provenance.AgentTaggedScratch || intent.Provenance.SameTxnCreated { + t.Fatal("hub-supplied scratch must NOT become agent-internal provenance") + } + g := NewGate(nil, testHost, nil, nil) // no verifier even needed + d := g.Authorize(intent, nil) + if d.Allowed || d.Reason != ReasonPendingSignature { + t.Fatalf("hub-scratch destroy: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason) + } +} + +func TestGate_ValidOpAcceptedThenReplayRejected(t *testing.T) { + // The ONE positive case: valid signature, correct role, correct target, fresh + // nonce → accepted. A SECOND presentation (same nonce) → rejected (nonce consumed). + op := newTestSigner(t) + path := filepath.Join(t.TempDir(), "n.log") + v, _ := realVerifierAt(t, path, testHost, op.allowed(t, "op1", authz.RoleOperational)) + aud := &captureAudit{} + g := NewGate(v, testHost, aud, nil) + + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + d := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if !d.Allowed || d.Reason != ReasonSigned { + t.Fatalf("valid op: got allowed=%v reason=%s err=%v, want accepted/signed", d.Allowed, d.Reason, d.Err) + } + if d.Verified == nil || d.Verified.Nonce != n { + t.Fatalf("accepted op should surface the verified op with nonce %s", n) + } + // Replay the exact same signed op → nonce already consumed. + d2 := g.Authorize(destroyIntent(SourceOneShotJob), signed) + if d2.Allowed || !errors.Is(d2.Err, authz.ErrReplay) { + t.Fatalf("replay: got allowed=%v err=%v, want ErrReplay", d2.Allowed, d2.Err) + } +} + +func TestGate_ReplayAcrossRestartRejected(t *testing.T) { + // Replay protection must survive an agent restart (the durable nonce store). Accept + // once with verifier A, then reopen the SAME nonce-store path as verifier B (a + // restart) and replay → still rejected. + op := newTestSigner(t) + path := filepath.Join(t.TempDir(), "n.log") + signer := op.allowed(t, "op1", authz.RoleOperational) + + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + vA, storeA := realVerifierAt(t, path, testHost, signer) + if d := NewGate(vA, testHost, nil, nil).Authorize(destroyIntent(SourceOneShotJob), signed); !d.Allowed { + t.Fatalf("first presentation should be accepted: %+v", d) + } + storeA.Close() // simulate shutdown + + vB, _ := realVerifierAt(t, path, testHost, signer) // restart: reopen same nonce log + d := NewGate(vB, testHost, nil, nil).Authorize(destroyIntent(SourceOneShotJob), signed) + if d.Allowed || !errors.Is(d.Err, authz.ErrReplay) { + t.Fatalf("replay across restart: got allowed=%v err=%v, want ErrReplay", d.Allowed, d.Err) + } +} + +// --- gate unit tests (benign path, binding, params) --- + +func TestGate_BenignAllowedWithoutVerifier(t *testing.T) { + g := NewGate(nil, testHost, nil, nil) // no verifier at all + for _, k := range []ActionKind{ActionStart, ActionStop, ActionSetConfig} { + d := g.Authorize(intentForAction(testHost, Action{VMID: 100, Kind: k}), nil) + if !d.Allowed || d.Reason != ReasonBenign { + t.Errorf("benign %s: got allowed=%v reason=%s, want benign", k, d.Allowed, d.Reason) + } + } +} + +func TestParamsEqual(t *testing.T) { + eq := func(a, b string) bool { return paramsEqual(json.RawMessage(a), json.RawMessage(b)) } + if !eq(`{"purge":true}`, `{"purge":true}`) { + t.Error("identical params should be equal") + } + if !eq(`{"a":1,"b":2}`, `{"b":2,"a":1}`) { + t.Error("key order must not matter") + } + if eq(`{"purge":true}`, `{"purge":false}`) { + t.Error("different values must differ") + } + if !eq(``, `{}`) || !eq(`{}`, `null`) { + t.Error("absent / empty / null params should all compare equal") + } +} + +// --- helpers for the hub-scratch test: a stand-in for the slice-10 desired-delta → +// intent constructor, proving it never propagates hub-supplied provenance. --- + +type hubDelta struct { + Class OpClass + HostID string + GuestID string + VMID int + HubSaysScratch bool // a hostile/erroneous hub hint — MUST be ignored +} + +func intentFromHubDelta(d hubDelta) Intent { + // NOTE: HubSaysScratch is deliberately NOT mapped to Provenance. Agent-internal + // provenance (scratch/same-txn) is recorded by the agent's own journal, never taken + // from the hub (doc 03 §4). + return Intent{ + Class: d.Class, + HostID: d.HostID, + GuestID: d.GuestID, + VMID: d.VMID, + Provenance: Provenance{}, // always zero from an external source + Source: SourceDesiredDelta, + } +} diff --git a/internal/reconcile/job.go b/internal/reconcile/job.go new file mode 100644 index 0000000..49d3124 --- /dev/null +++ b/internal/reconcile/job.go @@ -0,0 +1,120 @@ +package reconcile + +import ( + "context" + "encoding/json" + "fmt" + "time" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" + "gitea.dooplex.hu/admin/felhom-agent/internal/proxmox" +) + +// DestructiveExecutor performs an authorized destructive op against the host. At slice +// 4 there is NO live implementation (guest-destroy / storage-wipe / restore-overwrite +// executors land in slices 6/7) — the consuming layer is wired and tested with fixture +// executors but never executes a real destructive op, because nothing serves +// destructive deltas until slice 10. It returns a Proxmox UPID (or "" for a synchronous +// op) so the journal/Recover path is identical to benign execution. +type DestructiveExecutor func(ctx context.Context, intent Intent, vop *authz.VerifiedOp) (upid string, err error) + +// JobResult is the outcome of RunSignedJob. +type JobResult struct { + Decision Decision + AlreadyApplied bool // the op's idempotency key was already applied (deduped, not re-run) + Executed bool // the executor ran and succeeded + Err error // execution error (after a successful authorization) +} + +// RunSignedJob is the signed one-shot consuming layer (doc 03 §4(b) / doc 04). It adds +// idempotency dedupe + journaling around the gate: +// +// 1. Dedupe: if the op's idempotency key (its nonce) is already applied, skip — a +// redelivered, already-completed op must not re-run (returns AlreadyApplied). +// 2. Gate: classify + verify + role-scope + op-to-action bind. A refusal returns the +// Decision and executes nothing. +// 3. Journal + execute: record started → run the executor → record the task id → +// record the terminal state under the idempotency key (so success marks the key +// applied; a crash mid-execute is resolved by Recover, never by idempotency alone). +// +// exec may be nil — then an AUTHORIZED destructive op is journaled as authorized but +// not executed (the slice-4 inert state: the gate works, the executor doesn't exist +// yet). A REFUSED op never reaches exec. +func (e *Engine) RunSignedJob(ctx context.Context, intent Intent, signed *SignedOp, exec DestructiveExecutor) JobResult { + idemKey := jobIdempotencyKey(signed) + + // 1. Idempotency dedupe (redelivery after a prior success). + if idemKey != "" && e.journal != nil && e.journal.AlreadyApplied(idemKey) { + e.logger.Info("job: idempotency key already applied; skipping", "key", auditNonce(idemKey)) + return JobResult{AlreadyApplied: true, Decision: Decision{Allowed: true, Reason: ReasonSigned}} + } + + // 2. Gate (classification + the full signed-op consuming policy). + dec := e.gate.Authorize(intent, signed) + if !dec.Allowed { + return JobResult{Decision: dec} + } + + // 3. Journal + execute. Benign authorized ops (no signature path) also flow here if + // routed as jobs; they carry no idempotency key and are simply executed. + opID := e.nextJobOpID(intent) + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + State: OpStarted, IdempKey: idemKey, At: time.Now().UTC()}) + + if exec == nil { + // Slice-4 inert: authorized, but no destructive executor wired. Record the + // authorization terminally (do NOT mark applied — nothing actually ran). + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + State: OpFailed, IdempKey: "", At: time.Now().UTC()}) + e.logger.Warn("job: authorized but no executor wired (slice-4 inert)", "class", intent.Class) + return JobResult{Decision: dec, Err: fmt.Errorf("reconcile: no executor for %s (not wired this slice)", intent.Class)} + } + + upid, err := exec(ctx, intent, dec.Verified) + if err != nil { + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + State: OpFailed, IdempKey: idemKey, At: time.Now().UTC()}) + return JobResult{Decision: dec, Err: err} + } + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + UPID: upid, State: OpTaskRunning, IdempKey: idemKey, At: time.Now().UTC()}) + + if upid != "" { + st, err := e.api.WaitTask(ctx, upid, proxmox.WaitOptions{}) + if err != nil { + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + UPID: upid, State: OpFailed, IdempKey: idemKey, At: time.Now().UTC()}) + return JobResult{Decision: dec, Err: err} + } + _ = st + } + + // Terminal success — marks the idempotency key applied (survives restart). + e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class), + UPID: upid, State: OpSucceeded, IdempKey: idemKey, At: time.Now().UTC()}) + return JobResult{Decision: dec, Executed: true} +} + +// jobIdempotencyKey derives the idempotency key from the signed op's nonce — unique +// per op (≥128-bit, doc 04 §2.1) and already the anti-replay token, so reusing it as +// the journal dedupe key is exact. Parsed from the UNVERIFIED blob: it is only a map +// key here (the gate's verifier is the trust boundary), and a forged blob is refused at +// the gate regardless. +func jobIdempotencyKey(signed *SignedOp) string { + if signed == nil || len(signed.Blob) == 0 { + return "" + } + var b struct { + Nonce string `json:"nonce"` + } + if json.Unmarshal(signed.Blob, &b) != nil { + return "" + } + return b.Nonce +} + +// nextJobOpID builds a per-attempt op id for a signed job (distinct namespace from +// reconcile op ids). +func (e *Engine) nextJobOpID(intent Intent) string { + return "job-" + string(intent.Class) + "-" + intent.GuestID + "-" + nextSeq(&e.opSeq) +} diff --git a/internal/reconcile/job_test.go b/internal/reconcile/job_test.go new file mode 100644 index 0000000..2a15d12 --- /dev/null +++ b/internal/reconcile/job_test.go @@ -0,0 +1,135 @@ +package reconcile + +import ( + "context" + "errors" + "path/filepath" + "testing" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" +) + +// newSignedEngine builds an engine whose gate has a real verifier pinning one +// operational key — for exercising the signed-job consuming layer end to end. +func newSignedEngine(t *testing.T, api GuestAPI) (*Engine, *Journal, testSigner) { + t.Helper() + j, err := OpenJournal(filepath.Join(t.TempDir(), "journal.log")) + if err != nil { + t.Fatalf("OpenJournal: %v", err) + } + t.Cleanup(func() { j.Close() }) + q := NewQueue() + t.Cleanup(q.Close) + op := newTestSigner(t) + v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational)) + g := NewGate(v, testHost, nil, nil) + e := NewEngine(EngineOptions{API: api, Queue: q, Journal: j, Gate: g, HostID: testHost}) + return e, j, op +} + +func TestRunSignedJob_ValidExecutesAndMarksApplied(t *testing.T) { + e, j, op := newSignedEngine(t, &fakeAPI{}) + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + calls := 0 + exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil } // synchronous + + res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec) + if !res.Executed || res.Err != nil { + t.Fatalf("valid job should execute, got %+v", res) + } + if calls != 1 { + t.Errorf("executor should run once, ran %d", calls) + } + if !j.AlreadyApplied(n) { + t.Error("successful job must mark its idempotency key (nonce) applied") + } +} + +func TestRunSignedJob_RedeliveryDedupedByIdempotencyKey(t *testing.T) { + // After success, a redelivered identical job must NOT re-run — the journal's + // idempotency key short-circuits BEFORE the verifier (so it reports already-applied, + // not a confusing replay rejection). + e, _, op := newSignedEngine(t, &fakeAPI{}) + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + calls := 0 + exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil } + + first := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec) + if !first.Executed { + t.Fatalf("first delivery should execute: %+v", first) + } + second := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec) + if !second.AlreadyApplied || second.Executed { + t.Fatalf("redelivery should be deduped (already applied), got %+v", second) + } + if calls != 1 { + t.Errorf("executor must run exactly once across redelivery, ran %d", calls) + } +} + +func TestRunSignedJob_RefusedDoesNotExecute(t *testing.T) { + e, j, _ := newSignedEngine(t, &fakeAPI{}) + attacker := newTestSigner(t) // not pinned + issued, expires := freshWindow() + n := nonce() + signed := attacker.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + calls := 0 + exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil } + + res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec) + if res.Executed || res.Decision.Allowed || !errors.Is(res.Decision.Err, authz.ErrUnknownSigner) { + t.Fatalf("forged job must be refused unexecuted, got %+v", res) + } + if calls != 0 { + t.Errorf("executor must not run for a refused job, ran %d", calls) + } + if j.AlreadyApplied(n) { + t.Error("a refused job must not mark its key applied") + } +} + +func TestRunSignedJob_NoExecutorInert(t *testing.T) { + // Slice-4 inert state: a VALID authorization with no destructive executor wired + // returns an error and does NOT mark the key applied (so it is retryable once the + // executor lands in a later slice). + e, j, op := newSignedEngine(t, &fakeAPI{}) + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, nil) + if !res.Decision.Allowed { + t.Fatalf("op should authorize even with no executor: %+v", res.Decision) + } + if res.Executed || res.Err == nil { + t.Fatalf("no-executor job should not execute and should error, got %+v", res) + } + if j.AlreadyApplied(n) { + t.Error("an unexecuted (no-executor) job must not mark its key applied") + } +} + +func TestRunSignedJob_ExecutorErrorJournaledFailed(t *testing.T) { + e, j, op := newSignedEngine(t, &fakeAPI{}) + issued, expires := freshWindow() + n := nonce() + signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires) + + exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { + return "", errors.New("destroy failed") + } + res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec) + if res.Executed || res.Err == nil { + t.Fatalf("executor error should propagate, got %+v", res) + } + if j.AlreadyApplied(n) { + t.Error("a failed execution must not mark its key applied") + } +} diff --git a/internal/reconcile/mint_test.go b/internal/reconcile/mint_test.go new file mode 100644 index 0000000..4fda3d7 --- /dev/null +++ b/internal/reconcile/mint_test.go @@ -0,0 +1,129 @@ +package reconcile + +import ( + "crypto/ed25519" + "crypto/rand" + "crypto/sha256" + "encoding/pem" + "fmt" + "testing" + "time" + + "gitea.dooplex.hu/admin/felhom-agent/internal/authz" + "golang.org/x/crypto/ssh" +) + +// In-test SSHSIG minter for the gate's adversarial matrix. It replicates the ~40 lines +// of SSHSIG framing (porting internal/authz/sshsig.go + mint_test.go) so reconcile's +// tests can produce valid AND adversarial signatures with now-relative timestamps. +// This lives only in reconcile's test binary — production authz is untouched (no +// signing capability is added to the verify-only security package), and the verifier's +// unexported clock (not injectable cross-package) is why we mint live rather than reuse +// the committed fixed-window fixture. +// +// The minted bytes round-trip through the REAL authz.Verifier, so a correct framing is +// proven by the positive case verifying. + +const sshsigMagic = "SSHSIG" + +type sshsigBlob struct { + Version uint32 + PublicKey string + Namespace string + Reserved string + HashAlgo string + Signature string +} + +// signedData recomputes the SSHSIG signed bytes: "SSHSIG" || marshal(ns, reserved, +// hash, H(message)). Mirrors authz.signedData exactly (sha256). +func signedDataForTest(ns string, msg []byte) []byte { + h := sha256.Sum256(msg) + body := ssh.Marshal(struct { + Namespace string + Reserved string + HashAlgo string + Hash []byte + }{ns, "", "sha256", h[:]}) + return append([]byte(sshsigMagic), body...) +} + +// mintArmor builds an armored SSHSIG over message using sign. +func mintArmor(pubMarshaled []byte, namespace string, message []byte, sign func([]byte) ssh.Signature) []byte { + sb := &sshsigBlob{Version: 1, PublicKey: string(pubMarshaled), Namespace: namespace, Reserved: "", HashAlgo: "sha256"} + sig := sign(signedDataForTest(namespace, message)) + sb.Signature = string(ssh.Marshal(&sig)) + raw := append([]byte(sshsigMagic), ssh.Marshal(sb)...) + return pem.EncodeToMemory(&pem.Block{Type: "SSH SIGNATURE", Bytes: raw}) +} + +// nonce returns a fresh 128-bit hex nonce (doc 04 §2.1: ≥128-bit random). +func nonce() string { + var b [16]byte + if _, err := rand.Read(b[:]); err != nil { + panic(err) + } + const hexdigits = "0123456789abcdef" + out := make([]byte, 32) + for i, x := range b { + out[i*2] = hexdigits[x>>4] + out[i*2+1] = hexdigits[x&0x0f] + } + return string(out) +} + +// canonicalBlob builds an op blob in the doc 04 §2.1 canonical field order (keys +// sorted at every level, no insignificant whitespace). +func canonicalBlob(op, hostID, guestID, keyID, nonce, paramsJSON string, issued, expires time.Time) []byte { + if paramsJSON == "" { + paramsJSON = "{}" + } + return []byte(fmt.Sprintf( + `{"expires_at":%q,"issued_at":%q,"key_id":%q,"nonce":%q,"op":%q,"params":%s,"target":{"guest_id":%q,"host_id":%q}}`, + expires.UTC().Format(time.RFC3339), issued.UTC().Format(time.RFC3339), + keyID, nonce, op, paramsJSON, guestID, hostID)) +} + +// testSigner is a fresh ed25519 operator key: its public key, an authorized_keys line +// to pin it, and a sign closure. +type testSigner struct { + pub ssh.PublicKey + line string + sign func([]byte) ssh.Signature +} + +func newTestSigner(t *testing.T) testSigner { + t.Helper() + pub, priv, err := ed25519.GenerateKey(rand.Reader) + if err != nil { + t.Fatal(err) + } + sshPub, err := ssh.NewPublicKey(pub) + if err != nil { + t.Fatal(err) + } + return testSigner{ + pub: sshPub, + line: string(ssh.MarshalAuthorizedKey(sshPub)), + sign: func(d []byte) ssh.Signature { + return ssh.Signature{Format: ssh.KeyAlgoED25519, Blob: ed25519.Sign(priv, d)} + }, + } +} + +// allowed builds a pinned AllowedSigner for this key with the given id+role. +func (s testSigner) allowed(t *testing.T, keyID string, role authz.KeyRole) authz.AllowedSigner { + t.Helper() + as, err := authz.NewAllowedSigner(keyID, role, s.line) + if err != nil { + t.Fatalf("NewAllowedSigner: %v", err) + } + return as +} + +// mint builds a SignedOp (canonical blob + armored sig) for this signer. +func (s testSigner) mint(op, hostID, guestID, keyID, nonce, paramsJSON string, issued, expires time.Time) *SignedOp { + blob := canonicalBlob(op, hostID, guestID, keyID, nonce, paramsJSON, issued, expires) + sig := mintArmor(s.pub.Marshal(), authz.Namespace, blob, s.sign) + return &SignedOp{Blob: blob, Sig: sig} +} diff --git a/internal/reconcile/plan.go b/internal/reconcile/plan.go index 51ece0c..aaa151a 100644 --- a/internal/reconcile/plan.go +++ b/internal/reconcile/plan.go @@ -37,6 +37,13 @@ type Action struct { // the MiB unit Proxmox's LXC `memory` config field uses. const bytesPerMiB = 1024 * 1024 +// desiredMemoryMiB canonicalizes a desired byte count to the integer MiB that +// Proxmox's `memory` field stores and reports. Floor division is deliberate and +// convergent: the value returned here is exactly the value written via SetConfig, so a +// subsequent read returns the same MiB and the comparison settles (see Plan's memory +// note). The actual side (a.MemoryMiB) is already MiB from GuestConfig. +func desiredMemoryMiB(bytes int64) int64 { return bytes / bytesPerMiB } + // Plan computes the minimal benign action set converging actual → desired. It is a // pure function (deterministic, side-effect-free) so it is exhaustively fixture-test // -able. Actions are returned sorted by vmid, then config-before-runstate per guest. @@ -81,7 +88,14 @@ func Plan(desired DesiredState, actual ActualState, norm FieldNormalizers) []Act params["cores"] = strconv.Itoa(d.Spec.Cores) reasons = append(reasons, fmt.Sprintf("cores %d->%d", a.Cores, d.Spec.Cores)) } - if want := d.Spec.MemoryBytes / bytesPerMiB; want != a.MemoryMiB { + // Memory is canonicalized to MiB on BOTH sides before comparison — the + // numeric cousin of the description-newline normalization (string + // normalizers cover string fields; this is the integer one). We compare + // the SAME MiB value we then write, so a non-MiB-aligned desired + // converges in one pass (write `want` MiB → PVE stores `want` MiB → next + // read a.MemoryMiB == want → no further action), never perpetual drift. + // Slice 10 should still serve MiB-aligned MemoryBytes at the source. + if want := desiredMemoryMiB(d.Spec.MemoryBytes); want != a.MemoryMiB { params["memory"] = strconv.FormatInt(want, 10) reasons = append(reasons, fmt.Sprintf("memory %dMiB->%dMiB", a.MemoryMiB, want)) } diff --git a/internal/reconcile/plan_test.go b/internal/reconcile/plan_test.go index f2d8e14..d5efbce 100644 --- a/internal/reconcile/plan_test.go +++ b/internal/reconcile/plan_test.go @@ -73,6 +73,23 @@ func TestPlan_SpecDrift(t *testing.T) { mustActions(t, got) } +func TestPlan_MemoryNonAlignedConverges(t *testing.T) { + // Note-2 guard: a desired MemoryBytes that is NOT a clean MiB multiple must not + // cause perpetual drift. We compare in MiB and write the SAME MiB we compared, so it + // settles in one pass. + desiredBytes := int64(2049)*bytesPerMiB + 500000 // 2049 MiB + change → floors to 2049 + d := desired(DesiredGuest{VMID: 100, Spec: &hub.GuestSpec{Cores: 2, MemoryBytes: desiredBytes}}) + + // First pass: actual is 2048 MiB → one SetConfig memory=2049. + got := Plan(d, actual(ActualGuest{VMID: 100, Run: RunStopped, SpecKnown: true, Cores: 2, MemoryMiB: 2048}), nil) + mustActions(t, got, Action{VMID: 100, Kind: ActionSetConfig, Params: map[string]string{"memory": "2049"}}) + + // Apply it: actual becomes 2049 MiB. Re-plan against the SAME desired → no action. + if got2 := Plan(d, actual(ActualGuest{VMID: 100, Run: RunStopped, SpecKnown: true, Cores: 2, MemoryMiB: 2049}), nil); len(got2) != 0 { + t.Fatalf("non-MiB-aligned memory did not converge (perpetual drift): %+v", got2) + } +} + func TestPlan_DiskNotReconciled(t *testing.T) { // DiskBytes differs but is intentionally not reconciled (pct resize, later slice). got := Plan( diff --git a/internal/reconcile/recover.go b/internal/reconcile/recover.go new file mode 100644 index 0000000..1e1e7f7 --- /dev/null +++ b/internal/reconcile/recover.go @@ -0,0 +1,97 @@ +package reconcile + +import ( + "context" + "time" +) + +// Recover consumes the journal's in-flight set at startup: resume-or-rollback for any +// op that was mid-execution when the agent crashed (doc 03 §10). This MUST run before +// the engine begins issuing new mutations. +// +// Why it is load-bearing for signed destructive ops (and why it lands with the gate): +// the idempotency-key store dedupes a COMPLETED op, but an op that crashed AFTER the +// Proxmox POST and BEFORE its terminal record (OpTaskRunning) is not covered by that — +// its nonce is already consumed, so a redelivery is rejected as a replay, yet it never +// reached a terminal state. Only this startup consumer can resolve it: re-check the +// Proxmox task and record the real outcome. +// +// Resolution per in-flight entry: +// - has a task id (OpTaskRunning): re-read the task status once. Stopped → record the +// real terminal state (OK → succeeded, else failed). Still running → leave it +// in-flight (a later Recover or the task's own completion resolves it). Unreadable → +// leave it (cannot safely decide). +// - no task id (OpStarted only): the Proxmox POST was never confirmed, so the op +// never took effect — record failed (fail-safe, the documented FileNonceStore +// direction). A convergent reconcile op is simply re-issued next pass; a one-shot +// op did NOT mark its idempotency key applied, so it is not falsely deduped. +func (e *Engine) Recover(ctx context.Context) RecoverResult { + var res RecoverResult + if e.journal == nil { + return res + } + for _, entry := range e.journal.InFlight() { + res.Examined++ + if entry.UPID == "" { + // POST never confirmed → abandon (fail-safe). + e.append(terminal(entry, OpFailed)) + res.RolledBack++ + e.logger.Warn("recover: in-flight op had no task id; marked failed (fail-safe)", + "op_id", entry.OpID, "vmid", entry.VMID, "kind", entry.Kind) + continue + } + st, err := e.api.TaskStatusOnce(ctx, entry.UPID) + if err != nil { + res.Unresolved++ + e.logger.Warn("recover: cannot read in-flight task status; left in-flight", + "op_id", entry.OpID, "upid", entry.UPID, "err", err) + continue + } + if st.Running() { + res.StillRunning++ + e.logger.Info("recover: in-flight task still running; left in-flight", + "op_id", entry.OpID, "upid", entry.UPID) + continue + } + // Stopped: record the real outcome. + if st.OK() { + e.append(terminal(entry, OpSucceeded)) + res.Resumed++ + e.logger.Info("recover: in-flight task completed OK; marked succeeded", + "op_id", entry.OpID, "upid", entry.UPID) + } else { + e.append(terminal(entry, OpFailed)) + res.Failed++ + e.logger.Warn("recover: in-flight task ended non-OK; marked failed", + "op_id", entry.OpID, "upid", entry.UPID, "exitstatus", st.ExitStatus) + } + } + if res.Examined > 0 { + e.logger.Info("recover: in-flight journal reconciled", "result", res) + } + return res +} + +// RecoverResult summarizes a startup recovery pass. +type RecoverResult struct { + Examined int + Resumed int // task found completed OK and recorded succeeded + Failed int // task found ended non-OK and recorded failed + RolledBack int // no task id → abandoned (fail-safe) + StillRunning int // task still executing → left in-flight + Unresolved int // task status unreadable → left in-flight +} + +// terminal builds a terminal journal record preserving the op's identity, with the +// idempotency key carried through so a SUCCEEDED one-shot op marks its key applied. +func terminal(e JournalEntry, state OpState) JournalEntry { + return JournalEntry{ + OpID: e.OpID, + VMID: e.VMID, + Kind: e.Kind, + UPID: e.UPID, + State: state, + IdempKey: e.IdempKey, + At: time.Now().UTC(), + } +} diff --git a/internal/reconcile/recover_test.go b/internal/reconcile/recover_test.go new file mode 100644 index 0000000..55a24cb --- /dev/null +++ b/internal/reconcile/recover_test.go @@ -0,0 +1,103 @@ +package reconcile + +import ( + "context" + "errors" + "testing" + "time" + + "gitea.dooplex.hu/admin/felhom-agent/internal/proxmox" +) + +func seedInFlight(t *testing.T, j *Journal, e JournalEntry) { + t.Helper() + e.State = OpTaskRunning + if e.At.IsZero() { + e.At = time.Now().UTC() + } + if err := j.Append(e); err != nil { + t.Fatalf("seed: %v", err) + } +} + +func TestRecover_TaskCompletedOKMarksSucceeded(t *testing.T) { + api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) { + return proxmox.TaskStatus{Status: "stopped", ExitStatus: "OK"}, nil + }} + e, j, _ := newEngine(t, api, EmptyProvider{}) + seedInFlight(t, j, JournalEntry{OpID: "op1", VMID: 100, Kind: "set_config", UPID: "UPID:x:", IdempKey: "k1"}) + + res := e.Recover(context.Background()) + if res.Examined != 1 || res.Resumed != 1 { + t.Fatalf("want 1 resumed, got %+v", res) + } + if len(j.InFlight()) != 0 { + t.Errorf("resolved op should not be in-flight: %+v", j.InFlight()) + } + // A resumed one-shot op marks its idempotency key applied (it really completed) — + // this is the case idempotency-alone could not cover (Note 1). + if !j.AlreadyApplied("k1") { + t.Error("a recovered-succeeded op must mark its idempotency key applied") + } +} + +func TestRecover_TaskEndedNonOKMarksFailed(t *testing.T) { + api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) { + return proxmox.TaskStatus{Status: "stopped", ExitStatus: "got 403"}, nil + }} + e, j, _ := newEngine(t, api, EmptyProvider{}) + seedInFlight(t, j, JournalEntry{OpID: "op2", VMID: 100, Kind: "guest_destroy", UPID: "UPID:x:", IdempKey: "k2"}) + + res := e.Recover(context.Background()) + if res.Failed != 1 { + t.Fatalf("want 1 failed, got %+v", res) + } + if j.AlreadyApplied("k2") { + t.Error("a failed op must NOT mark its key applied (it may be retried)") + } +} + +func TestRecover_TaskStillRunningLeftInFlight(t *testing.T) { + api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) { + return proxmox.TaskStatus{Status: "running"}, nil + }} + e, j, _ := newEngine(t, api, EmptyProvider{}) + seedInFlight(t, j, JournalEntry{OpID: "op3", VMID: 100, Kind: "set_config", UPID: "UPID:x:"}) + + res := e.Recover(context.Background()) + if res.StillRunning != 1 || len(j.InFlight()) != 1 { + t.Fatalf("still-running task must be left in-flight, got res=%+v inflight=%d", res, len(j.InFlight())) + } +} + +func TestRecover_NoTaskIDRolledBack(t *testing.T) { + // OpStarted with no UPID: the POST was never confirmed → abandon (fail-safe). + e, j, _ := newEngine(t, &fakeAPI{}, EmptyProvider{}) + if err := j.Append(JournalEntry{OpID: "op4", VMID: 100, Kind: "start", State: OpStarted, At: time.Now().UTC()}); err != nil { + t.Fatal(err) + } + res := e.Recover(context.Background()) + if res.RolledBack != 1 || len(j.InFlight()) != 0 { + t.Fatalf("no-task op must be rolled back, got res=%+v inflight=%d", res, len(j.InFlight())) + } +} + +func TestRecover_UnreadableStatusLeftInFlight(t *testing.T) { + api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) { + return proxmox.TaskStatus{}, errors.New("api unreachable") + }} + e, j, _ := newEngine(t, api, EmptyProvider{}) + seedInFlight(t, j, JournalEntry{OpID: "op5", VMID: 100, Kind: "set_config", UPID: "UPID:x:"}) + + res := e.Recover(context.Background()) + if res.Unresolved != 1 || len(j.InFlight()) != 1 { + t.Fatalf("unreadable status must leave op in-flight, got res=%+v inflight=%d", res, len(j.InFlight())) + } +} + +func TestRecover_EmptyJournalNoop(t *testing.T) { + e, _, _ := newEngine(t, &fakeAPI{}, EmptyProvider{}) + if res := e.Recover(context.Background()); res.Examined != 0 { + t.Errorf("empty journal recover should be a no-op, got %+v", res) + } +} diff --git a/internal/reconcile/state.go b/internal/reconcile/state.go index ad7908e..a0b0f22 100644 --- a/internal/reconcile/state.go +++ b/internal/reconcile/state.go @@ -112,6 +112,9 @@ type GuestAPI interface { Stop(ctx context.Context, vmid int) (string, error) SetConfig(ctx context.Context, vmid int, params map[string]string) (string, error) WaitTask(ctx context.Context, upid string, opts proxmox.WaitOptions) (proxmox.TaskStatus, error) + // TaskStatusOnce is a single non-blocking task-status read — used by crash + // recovery to learn the outcome of an op that was in flight when the agent died. + TaskStatusOnce(ctx context.Context, upid string) (proxmox.TaskStatus, error) } // guestDescription decodes the (string-valued) `description` key from a GuestConfig's