v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer
The security core of slice 4: hub-supplied intent is no longer trusted for destructive change. The gate fronts the per-guest queue's executor, so every mutation passes it. Reuses internal/authz for all crypto (surface untouched). - Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing- ness, NOT by verb. Destroy/overwrite of customer data is destructive unless agent-internal provenance (same-journaled-txn create, or agent-tagged scratch) makes it benign — and that provenance is journal-recorded, NEVER hub-sourced. Unknown op class fails safe to destructive. - Reversibility gate: benign -> allowed unsigned; destructive -> requires a verified, role-scoped, action-bound operator signature, else pending_signature and never executed. Every decision audited (signal, never the guard). - Signed-op consuming layer over authz.Verifier.Verify (locked pipeline untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational= ordinary destructive + planned rotation) + op-to-action binding (op+host+ guest+params must match the gated action). - Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped execution via an injected DestructiveExecutor (nil this slice — inert). - Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at startup (resume-or-rollback) — covers an op that crashed after the POST and before its terminal record, which idempotency dedupe alone cannot. Added TaskStatusOnce to the GuestAPI seam. Wired into daemon startup. - Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift. - Daemon: builds the verifier from config signers (none = nil verifier, the common slice-4 state), the gate (+SlogAudit), runs Recover before mutating. Adversarial matrix proven against the REAL authz.Verifier with in-test-minted SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no signing added to the verify-only package): unsigned job + unsigned desired-state delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong host -> typed authz rejections; wrong guest/op/params -> binding_mismatch; recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag ignored -> refused; valid+role+target+fresh nonce -> accepted then replay rejected. Full module race-clean + vet-clean on the Linux build server. Inert this slice: no destructive deltas served until slice 10; the destructive path is classified, gated, and tested but not wired to live execution. CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -3,6 +3,75 @@
|
||||
All notable changes to **felhom-agent** are recorded here. Update on every code
|
||||
change that gets pushed.
|
||||
|
||||
## v0.4.0 — slice 4 Phase B: reversibility gate + signed-op consuming layer (2026-06-08)
|
||||
|
||||
The security core of slice 4: hub-supplied intent stops being trusted for destructive
|
||||
change. Layered in front of the per-guest queue's executor — **every** mutation now
|
||||
passes the gate. Reuses `internal/authz` for all crypto (untouched surface). Inert
|
||||
this slice: no destructive deltas are served until slice 10, so the destructive path is
|
||||
classified, gated, and adversarially tested but not wired to live execution.
|
||||
|
||||
### Added
|
||||
- **Classifier (`classify.go`, doc 03 §4)** — benign vs destructive by **provenance +
|
||||
data-bearing-ness, NOT by verb**. The `OpClass` vocabulary (seeded by the committed
|
||||
slice-2 `op_blob.json`: `guest_destroy`) is the agent-side contract slice 10 matches.
|
||||
Destroy/overwrite of customer data is destructive UNLESS **agent-internal**
|
||||
provenance (same-journaled-transaction create → compensating rollback, or
|
||||
agent-tagged scratch) makes it benign. `Provenance` is journal-recorded and **never
|
||||
populated from the hub** (its zero value is the only thing an external intent may
|
||||
carry). Unknown op class fails safe → destructive.
|
||||
- **Reversibility gate (`gate.go`)** — `Gate.Authorize(intent, signed)`: benign →
|
||||
allowed unsigned; destructive → requires a verified, role-authorized, action-bound
|
||||
operator signature, else refused **`pending_signature`**, never executed. Every
|
||||
decision is written to an `AuditSink` (audit is a signal, never the guard).
|
||||
- **Signed-op consuming layer over `authz`** — verifies via `authz.Verifier.Verify`
|
||||
(the locked pipeline, untouched), then enforces on the `VerifiedOp`:
|
||||
- **Role-scoping (doc 04 §4)** — recovery key authorizes key-rotation re-pins ONLY;
|
||||
operational key authorizes ordinary destructive ops + planned rotation.
|
||||
- **Op-to-action binding** — verified `op` + host + guest + `params` must match the
|
||||
gated action (a signature for guest X / op A can't authorize guest Y / op B);
|
||||
params compared semantically (key-order/whitespace independent).
|
||||
- **Signed-job orchestration (`job.go`)** — `RunSignedJob`: idempotency dedupe (the
|
||||
op nonce as the journal key — a redelivered completed op is skipped, not re-run),
|
||||
gate authorization, then journal-wrapped execution via an injected
|
||||
`DestructiveExecutor` (nil this slice — authorized destructive ops are inert, no
|
||||
executor wired until 6/7).
|
||||
- **Crash-recovery consumer (`recover.go`, Note 1 / doc 03 §10)** — `Engine.Recover`
|
||||
consumes the journal's `InFlight()` at startup: an op that crashed AFTER the Proxmox
|
||||
POST and BEFORE its terminal record (`OpTaskRunning`, nonce already consumed) is NOT
|
||||
covered by idempotency dedupe — only this resume-or-rollback resolves it (re-read the
|
||||
task via the new `TaskStatusOnce`, record the real outcome; a no-task-id op is
|
||||
abandoned fail-safe). Landed together with the signed-op executor, as Note 1 required.
|
||||
- **Daemon wiring** — `runDaemon` builds the verifier from `config.Authz.Signers` (a
|
||||
bad key / missing nonce-store path is a fatal misconfig; **no signers = nil verifier**,
|
||||
the common slice-4 state), constructs the gate (+ `SlogAudit`), runs `Recover` before
|
||||
issuing any mutation, and routes every reconcile action through the gate.
|
||||
|
||||
### Changed
|
||||
- **Memory comparison canonicalized (Note 2)** — `desiredMemoryMiB` makes the
|
||||
desired↔actual memory compare in the same MiB unit that is then written, so a
|
||||
non-MiB-aligned `MemoryBytes` converges in one pass instead of re-issuing SetConfig
|
||||
forever (the numeric cousin of the description-newline normalization). Test proves
|
||||
convergence. Slice 10 should still serve MiB-aligned specs at the source.
|
||||
|
||||
### Tests (the security proof — each independently rejected)
|
||||
- **Adversarial matrix** via the REAL `authz.Verifier` with in-test-minted SSHSIGs
|
||||
(framing replicated in reconcile's test binary; production authz untouched, no signing
|
||||
added to the verify-only package): unsigned destructive **job** → pending_signature;
|
||||
unsigned destructive **desired-state delta** → pending_signature (distrusts hub
|
||||
desired state, not just jobs); forged/unknown signer → `ErrUnknownSigner`; expired →
|
||||
`ErrExpired`; **replayed nonce across an agent restart** (durable `FileNonceStore`) →
|
||||
`ErrReplay`; wrong host → `ErrTarget`; wrong guest / wrong op / wrong params →
|
||||
binding_mismatch; **recovery key on ordinary destructive** → role_denied;
|
||||
**hub-supplied "scratch" tag ignored** → still destructive → refused; **valid + role +
|
||||
target + fresh nonce → accepted**, and a second presentation → `ErrReplay` (nonce
|
||||
consumed).
|
||||
- Classifier (benign/destructive/provenance/key-rotation/fail-safe), role-scoping,
|
||||
params binding, crash-recovery (resume OK / fail / still-running / no-task rollback /
|
||||
unreadable / one-shot key applied on resume), signed-job idempotency (execute once,
|
||||
dedupe redelivery, refused-not-executed, no-executor-inert, executor-error).
|
||||
- Full module **race-clean** (`go test -race`) + vet clean on the Linux build server.
|
||||
|
||||
## v0.4.0-rc1 — slice 4 Phase A: reconcile engine (structural; runs live, unfed) (2026-06-08)
|
||||
|
||||
The agent-side control core's structural half. **Checkpoint marker** — `-rc1` is the
|
||||
|
||||
@@ -15,7 +15,7 @@
|
||||
- Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`).
|
||||
- **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks.
|
||||
- `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
|
||||
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.4.0-rc1** (slice-4 Phase A checkpoint; v0.4.0 lands with Phase B). Bump on meaningful changes + add a CHANGELOG entry.
|
||||
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.4.0** (slice 4 complete: reconcile engine + reversibility gate). Bump on meaningful changes + add a CHANGELOG entry.
|
||||
|
||||
## Layout
|
||||
|
||||
@@ -48,8 +48,9 @@ Built in slices, all on `main`:
|
||||
- **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4).
|
||||
- **v0.3.1** — slice-3 validation follow-ups.
|
||||
- **v0.3.2** — slice-4 pre-check: reversible `SetConfig` step added to `--selftest=task`; passed live on guest 9999. Findings: LXC `description` write is **synchronous** (empty UPID — dual-mode modeling confirmed); PVE appends a trailing `\n` to `description` on read (reconcile must normalize). First live `VM.Config.*` exercise.
|
||||
- **v0.4.0-rc1** — slice-4 **Phase A** (structural): `internal/reconcile` — engine, per-guest serializer (§10), desired-state model + `DesiredProvider` seam, normalization layer (`NormDescription` promoted out of main.go), plan/diff engine (benign Start/Stop/SetConfig set), durable op journal + idempotency store. Wired into `runDaemon` sharing the queue. Runs **live but unfed** (EmptyProvider → zero mutations until slice 10). Checkpoint: awaiting validation before Phase B.
|
||||
- **Next: slice 4 Phase B** — the benign/destructive classifier, the reversibility gate, and the signed-op consuming layer over `internal/authz` (doc 03 §4 / doc 04), in front of the queue's executor → lands **v0.4.0**.
|
||||
- **v0.4.0-rc1** — slice-4 **Phase A** (structural): `internal/reconcile` — engine, per-guest serializer (§10), desired-state model + `DesiredProvider` seam, normalization layer (`NormDescription` promoted out of main.go), plan/diff engine (benign Start/Stop/SetConfig set), durable op journal + idempotency store. Wired into `runDaemon` sharing the queue. Runs **live but unfed** (EmptyProvider → zero mutations until slice 10).
|
||||
- **v0.4.0** — slice-4 **Phase B** (security core): the benign/destructive **classifier** (provenance + data-bearing, not by verb; scratch/same-txn provenance is agent-internal, never hub-sourced), the **reversibility gate** (destructive → `pending_signature` unless a verified, role-scoped, action-bound operator signature), the **signed-op consuming layer** over `internal/authz` (role-scoping per doc 04 §4, op-to-action binding, idempotency-by-nonce, audit), and the **crash-recovery consumer** (`Recover` over `InFlight()`, resume-or-rollback). The gate fronts the queue's executor (every mutation passes it). **Inert this slice** — no destructive deltas served until slice 10; the destructive path is classified, gated, and adversarially tested but not wired to live execution. `authz` surface untouched.
|
||||
- **Next: slice 5/6 (storage manifest, backup/restore)** — the slices that fill the host-report's empty storage/backup collections and add the destructive executors the gate already guards.
|
||||
|
||||
## Demo host (for live tests)
|
||||
|
||||
|
||||
@@ -1,82 +1,95 @@
|
||||
# REPORT — Slice 4 Phase A: reconcile engine (structural) (2026-06-08)
|
||||
# REPORT — Slice 4: reconcile engine + the reversibility gate (v0.4.0) (2026-06-08)
|
||||
|
||||
> Overwrite-latest report (most recent significant work only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
||||
|
||||
## Outcome
|
||||
|
||||
**Phase A of slice 4 is implemented, tested, and pushed as the checkpoint marker
|
||||
`v0.4.0-rc1`.** This is the structural half of the agent-side control core: the
|
||||
reconcile engine, the per-guest serializer (doc 03 §10), the desired-state model + its
|
||||
provider seam, the field-normalization layer, the plan/diff engine, and the durable
|
||||
operation journal + idempotency store — all adversarially fixture-tested.
|
||||
**Slice 4 is complete and pushed as `v0.4.0`.** Both phases landed:
|
||||
|
||||
**Per the task, I have STOPPED at the checkpoint and am awaiting the validation pass
|
||||
before starting Phase B** (the benign/destructive classifier, the reversibility gate,
|
||||
and the signed-op consuming layer over `internal/authz`). Phase B is the security core
|
||||
and earns isolated review.
|
||||
- **Phase A** (structural, pushed earlier as `v0.4.0-rc1`): the reconcile engine, the
|
||||
per-guest serializer (doc 03 §10), the desired-state model + provider seam, the
|
||||
field-normalization layer, the plan/diff engine, and the durable op journal +
|
||||
idempotency store. Runs **live but unfed** — `EmptyProvider` → zero mutations until
|
||||
slice 10 serves desired state.
|
||||
- **Phase B** (this push, the security core): the benign/destructive **classifier**,
|
||||
the **reversibility gate**, and the **signed-op consuming layer** over `internal/authz`
|
||||
— with role-scoping, op-to-action binding, idempotency/journaling, audit, and the
|
||||
crash-recovery consumer. The gate sits in front of the per-guest queue's executor, so
|
||||
**every mutation passes it**.
|
||||
|
||||
## What runs (and what deliberately doesn't)
|
||||
The whole module is **race-clean and vet-clean** on the Linux build server; 62 reconcile
|
||||
tests pass (the adversarial matrix runs against the real `authz.Verifier`).
|
||||
|
||||
The engine **runs live but unfed**. At slice 4 there is no desired-state source (hub
|
||||
serving is slice 10; provisioning is slice 7), so the only production `DesiredProvider`
|
||||
is `EmptyProvider` → the live engine reads state, computes an **empty action set**, and
|
||||
performs **zero mutations** every tick. That is the correct, expected slice-4 behavior;
|
||||
the first live convergence arrives when slice 10 serves desired state into the seam.
|
||||
## The security model (Phase B)
|
||||
|
||||
The wired action set is **benign-on-existing-guest only**: `Start`, `Stop`, `SetConfig`.
|
||||
Provisioning and the destructive set are out of scope for Phase A (the destructive set
|
||||
is classified and gated in Phase B but not wired to live execution — nothing serves
|
||||
destructive deltas yet).
|
||||
Hub-supplied intent is no longer trusted for destructive change — **by provenance +
|
||||
data-bearing-ness, not by verb** (doc 03 §4):
|
||||
|
||||
## Package `internal/reconcile`
|
||||
- **Benign** (unsigned): start/stop/restart/create, and destroying a resource the agent
|
||||
created in the **same journaled transaction** (compensating rollback) or **tagged
|
||||
scratch**. That scratch/same-txn provenance is **agent-internal, journal-recorded, and
|
||||
never accepted from the hub** — a compromised hub cannot relabel a data-bearing guest
|
||||
as scratch to walk the gate.
|
||||
- **Destructive** (signature required): destroy/overwrite of the only/primary copy of
|
||||
customer data — **regardless of whether it arrives as a job or a desired-state delta**.
|
||||
Absent/invalid signature → refused **`pending_signature`**, never executed.
|
||||
|
||||
- **`Queue` (per-guest serializer, doc 03 §10)** — the single choke point all mutation
|
||||
sources funnel through. Same-vmid jobs run strictly one-at-a-time in submit order;
|
||||
independent vmids run in parallel. Each vmid is an unbounded cond-var FIFO lane
|
||||
(non-blocking, order-preserving submission); `Close` drains pending jobs gracefully.
|
||||
- **Desired-state model + `DesiredProvider`** — `DesiredGuest` makes each field
|
||||
individually optional (run-state / `*hub.GuestSpec` / `*description`) so a source pins
|
||||
only what it manages. `EmptyProvider` (live, slice 4) and `StaticProvider` (fixtures).
|
||||
- **Normalization layer (`FieldNormalizers`)** — reconcile compares *normalized*
|
||||
desired-vs-actual. `description`'s trailing newline (the slice-4-proven quirk) is the
|
||||
first registered normalizer; the registry takes more as discovered. `normDesc` was
|
||||
**promoted** out of `main.go` to `reconcile.NormDescription`, and the `--selftest=task`
|
||||
round-trip now uses that shared helper — one source of truth.
|
||||
- **`Plan` (pure diff engine)** — minimal benign action set for guests in both desired
|
||||
and actual: normalized comparison, deterministic vmid order, config-before-run-state.
|
||||
Skips provision (slice 7) and destroy (gated, slice 10); never writes a config it
|
||||
couldn't first read; disk grow deferred.
|
||||
- **`Engine`** — reads desired+actual, plans, dispatches onto the shared queue. Honors
|
||||
the mutate.go dual-mode contract: non-empty UPID → `WaitTask`+assert; empty UPID →
|
||||
clean synchronous success. Per-action failures counted, never fatal.
|
||||
- **`Journal`** — durable fsync'd JSONL (mirrors `authz.FileNonceStore`): op lifecycle
|
||||
with the Proxmox task id (crash mid-op detected + re-checkable via `InFlight()`), plus
|
||||
an idempotency-key store so a one-shot op never double-runs across retries/restarts.
|
||||
Reconcile actions carry no idempotency key (convergent — must re-run on real drift).
|
||||
The signed-op consuming layer calls `authz.Verifier.Verify` (the locked
|
||||
namespace→allow-list→crypto→target→time→nonce pipeline, untouched) and then enforces
|
||||
the slice-4 policy on the `VerifiedOp`: **role-scoping** (recovery key = key-rotation
|
||||
only; operational key = ordinary destructive + planned rotation, doc 04 §4) and
|
||||
**op-to-action binding** (the verified op + host + guest + params must name the exact
|
||||
gated action). Idempotency keys the journal by the op nonce; every decision is audited
|
||||
(a signal, never the guard).
|
||||
|
||||
## Daemon wiring
|
||||
## Inert by design (slice-4 scope)
|
||||
|
||||
`runDaemon` now runs reconcile alongside the hub loop on the poll cadence, sharing the
|
||||
per-guest queue. The journal lives at a `journal.log` sibling of the nonce store. The
|
||||
daemon runs cleanly with **no desired state and no signers** — reconcile is a logged
|
||||
no-op; a journal-open failure degrades to journal-less rather than crashing.
|
||||
There is **no live destructive execution** this slice: nothing serves destructive deltas
|
||||
until slice 10, and the guest-destroy/storage-wipe/restore-overwrite executors land in
|
||||
6/7. So the destructive path is fully **classified, gated, and adversarially tested**,
|
||||
but `RunSignedJob`'s executor is nil in production — an authorized destructive op is
|
||||
journaled as authorized-but-not-executed. Reconcile itself only produces the benign
|
||||
Start/Stop/SetConfig set, all allowed through the gate unsigned.
|
||||
|
||||
## Adversarial proof (each case independently rejected)
|
||||
|
||||
Run against the **real** `authz.Verifier` with in-test-minted SSHSIGs (the ~40-line
|
||||
framing is replicated in reconcile's test binary — production `authz` is untouched and
|
||||
gains no signing capability; live minting is required because the verifier's clock is
|
||||
not cross-package injectable):
|
||||
|
||||
unsigned destructive **job** → pending_signature · unsigned destructive **desired-state
|
||||
delta** → pending_signature (distrusts hub desired state, not just jobs) · forged /
|
||||
unknown signer → `ErrUnknownSigner` · expired → `ErrExpired` · **replayed nonce across an
|
||||
agent restart** (durable `FileNonceStore`) → `ErrReplay` · wrong host → `ErrTarget` ·
|
||||
wrong guest / wrong op / wrong params → binding_mismatch · **recovery key on ordinary
|
||||
destructive** → role_denied · **hub-supplied "scratch" tag** on a data-bearing guest →
|
||||
ignored, still destructive → refused · **valid + correct role + correct target + fresh
|
||||
nonce → accepted**, and a second presentation → `ErrReplay`.
|
||||
|
||||
## The two forward-looking notes
|
||||
|
||||
- **Note 1 (carried in)** — the `InFlight()` **resume-or-rollback** startup consumer
|
||||
(`Engine.Recover`) landed **together with** the signed-op executor, as required. An op
|
||||
that crashed after the Proxmox POST but before its terminal record (`OpTaskRunning`,
|
||||
nonce already consumed) is not covered by idempotency dedupe — only this consumer
|
||||
resolves it (re-read the task via the new `TaskStatusOnce`, record the real outcome; a
|
||||
no-task-id op is abandoned fail-safe). Wired into daemon startup and tested.
|
||||
- **Note 2 (addressed)** — the memory comparison is canonicalized (`desiredMemoryMiB`):
|
||||
desired and actual compare in the same MiB unit that is then written, so a
|
||||
non-MiB-aligned `MemoryBytes` converges in one pass rather than re-issuing SetConfig
|
||||
every cycle. A test proves convergence. Recommendation stands that slice 10 serve
|
||||
MiB-aligned specs at the source.
|
||||
|
||||
## Verification
|
||||
|
||||
- Full module **race-clean** (`go test -race -count=1 ./...`) and `go vet` clean on the
|
||||
Linux build server (go1.26); all unit tests green locally and there.
|
||||
- Adversarial fixture coverage: serializer concurrency/ordering, normalization +
|
||||
extensibility seam, the full plan matrix (drift / no-false-drift / unmanaged /
|
||||
spec-unknown / scope skips / ordering / empty-desired), engine sync-vs-async +
|
||||
failure counting, and journal persistence + idempotency dedupe **across a simulated
|
||||
restart**.
|
||||
- No live Proxmox needed (the engine is unfed); the live exercise is deferred — there is
|
||||
nothing to converge until a desired-state source exists.
|
||||
- `go test -race -count=1 ./...` and `go vet ./...` clean on the Linux build server
|
||||
(go1.26); all tests green locally and there.
|
||||
- No live Proxmox needed — Phase A is unfed and Phase B's destructive path is inert this
|
||||
slice. The gate's crypto path is proven end-to-end against the real verifier.
|
||||
|
||||
## Next (after validation)
|
||||
## Conventions
|
||||
|
||||
Phase B: the classifier (benign vs destructive by provenance + data-bearing-ness, not by
|
||||
verb), the reversibility gate in front of the queue's executor, and the signed-op
|
||||
consuming layer over `internal/authz` with role-scoping + op-to-action binding + the
|
||||
adversarial rejection matrix — landing **v0.4.0**. I will not start it until the Phase-A
|
||||
validation passes.
|
||||
Version → **v0.4.0**. CHANGELOG has a per-phase entry (newest on top). No secrets in any
|
||||
committed file. Pushed to `main`. Per the task, I stop at this checkpoint and await the
|
||||
validation pass.
|
||||
|
||||
@@ -19,6 +19,7 @@ import (
|
||||
"syscall"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/config"
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/hub"
|
||||
applog "gitea.dooplex.hu/admin/felhom-agent/internal/log"
|
||||
@@ -28,7 +29,7 @@ import (
|
||||
|
||||
// version is the agent version. Overridable at build time with
|
||||
// -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version.
|
||||
var version = "0.4.0-rc1"
|
||||
var version = "0.4.0"
|
||||
|
||||
func main() {
|
||||
var (
|
||||
@@ -137,14 +138,35 @@ func runDaemon(cfg config.Config, logger *slog.Logger) int {
|
||||
defer journal.Close()
|
||||
}
|
||||
}
|
||||
// The reversibility gate (slice 4 Phase B) sits in front of every mutation. With
|
||||
// no signers pinned (the common slice-4 state) the verifier is nil: benign actions
|
||||
// pass, destructive intents are refused pending_signature — and reconcile only
|
||||
// produces benign actions, so nothing is gated away. A misconfigured signer key is
|
||||
// a security misconfig and is fatal.
|
||||
verifier, nonceStore, err := buildVerifier(cfg, logger)
|
||||
if err != nil {
|
||||
fmt.Fprintln(os.Stderr, "daemon: authz verifier:", err)
|
||||
return 2
|
||||
}
|
||||
if nonceStore != nil {
|
||||
defer nonceStore.Close()
|
||||
}
|
||||
gate := reconcile.NewGate(verifier, cfg.Hub.HostID, reconcile.SlogAudit{Logger: logger}, logger)
|
||||
|
||||
engine := reconcile.NewEngine(reconcile.EngineOptions{
|
||||
API: px,
|
||||
Queue: queue,
|
||||
Journal: journal,
|
||||
Provider: reconcile.EmptyProvider{}, // slice 4: no live desired-state source
|
||||
Gate: gate,
|
||||
HostID: cfg.Hub.HostID,
|
||||
Logger: logger,
|
||||
})
|
||||
|
||||
// Crash recovery (doc 03 §10): resolve any op that was in flight when the agent
|
||||
// last died BEFORE issuing new mutations. With an empty journal this is a no-op.
|
||||
engine.Recover(ctx)
|
||||
|
||||
// Run reconcile and the hub loop concurrently; either returning ends the daemon.
|
||||
errc := make(chan error, 2)
|
||||
go func() { errc <- engine.Run(ctx, interval) }()
|
||||
@@ -170,6 +192,35 @@ func reconcileJournalPath(cfg config.Config) string {
|
||||
return "/var/lib/felhom-agent/journal.log"
|
||||
}
|
||||
|
||||
// buildVerifier constructs the operator-signed-op verifier from config. With no signers
|
||||
// pinned it returns (nil, nil, nil) — the gate then refuses any destructive intent as
|
||||
// pending_signature (and reconcile serves none). With signers it requires a durable
|
||||
// nonce-store path (anti-replay must survive restarts) and parses every pinned key —
|
||||
// a bad key or missing path is a fatal security misconfig.
|
||||
func buildVerifier(cfg config.Config, logger *slog.Logger) (reconcile.OpVerifier, *authz.FileNonceStore, error) {
|
||||
if len(cfg.Authz.Signers) == 0 {
|
||||
logger.Info("daemon: no operator signers pinned; destructive ops will be refused pending_signature")
|
||||
return nil, nil, nil
|
||||
}
|
||||
if cfg.Authz.NonceStorePath == "" {
|
||||
return nil, nil, fmt.Errorf("authz.nonce_store_path is required when signers are configured")
|
||||
}
|
||||
signers := make([]authz.AllowedSigner, 0, len(cfg.Authz.Signers))
|
||||
for _, s := range cfg.Authz.Signers {
|
||||
as, err := authz.NewAllowedSigner(s.KeyID, authz.KeyRole(s.Role), s.PublicKey)
|
||||
if err != nil {
|
||||
return nil, nil, fmt.Errorf("pinned signer %q: %w", s.KeyID, err)
|
||||
}
|
||||
signers = append(signers, as)
|
||||
}
|
||||
store, err := authz.OpenFileNonceStore(cfg.Authz.NonceStorePath)
|
||||
if err != nil {
|
||||
return nil, nil, fmt.Errorf("nonce store: %w", err)
|
||||
}
|
||||
logger.Info("daemon: operator signers pinned", "count", len(signers))
|
||||
return authz.New(signers, store, cfg.Hub.HostID), store, nil
|
||||
}
|
||||
|
||||
// runSelftestHub validates hub config, does ONE collect + report, and prints the
|
||||
// report it would send plus the envelope it got back.
|
||||
func runSelftestHub(ctx context.Context, cfg config.Config, logger *slog.Logger) int {
|
||||
|
||||
@@ -0,0 +1,106 @@
|
||||
package reconcile
|
||||
|
||||
// The benign/destructive classifier (doc 03 §4). The gate decides whether an intended
|
||||
// action needs an operator signature by **provenance + data-bearing-ness, NOT by
|
||||
// verb**. The op CLASS encodes the semantic intent (a "detach storage" is its own
|
||||
// destructive class, not just another PUT) so classification never turns on the HTTP
|
||||
// method.
|
||||
|
||||
// OpClass is the semantic class of an intended action. This vocabulary is the
|
||||
// agent-side contract: the signed-op `op` field (doc 04 §2.1) and slice-10's hub /
|
||||
// operator CLI match these exact strings. The committed slice-2 fixture
|
||||
// (op="guest_destroy") seeds it.
|
||||
type OpClass string
|
||||
|
||||
const (
|
||||
// Benign-on-existing-guest set — wired to live execution this slice.
|
||||
ClassStart OpClass = "start"
|
||||
ClassStop OpClass = "stop"
|
||||
ClassSetConfig OpClass = "set_config" // benign sizing/description changes only
|
||||
|
||||
// Benign by construction — classified now, executors land in later slices.
|
||||
ClassCreate OpClass = "create" // provision a NEW guest (restore-to-new, slice 7)
|
||||
ClassRestart OpClass = "restart" // heal a crashed controller in-place (§4)
|
||||
|
||||
// Destructive set — destroying/overwriting the only/primary copy of customer
|
||||
// data. Classified and gated now; NOT wired to live execution this slice (nothing
|
||||
// serves destructive deltas until slice 10).
|
||||
ClassGuestDestroy OpClass = "guest_destroy"
|
||||
ClassStorageWipe OpClass = "storage_wipe" // storage detach/wipe
|
||||
ClassRestoreOverwrite OpClass = "restore_overwrite" // restore OVER an existing guest
|
||||
ClassDecommission OpClass = "decommission"
|
||||
|
||||
// Key-rotation re-pin (doc 04 §4) — destructive-class, role-scoped: the cold
|
||||
// recovery key authorizes ONLY this; the operational key authorizes this + ordinary
|
||||
// destructive ops.
|
||||
ClassKeyRotation OpClass = "key_rotation"
|
||||
)
|
||||
|
||||
// Disposition is the classifier verdict.
|
||||
type Disposition string
|
||||
|
||||
const (
|
||||
// Benign — the reconciler/executor MAY act without an operator signature.
|
||||
Benign Disposition = "benign"
|
||||
// Destructive — an operator signature bound to the action is REQUIRED.
|
||||
Destructive Disposition = "destructive"
|
||||
)
|
||||
|
||||
// Provenance is AGENT-INTERNAL evidence that an otherwise-destructive action is
|
||||
// actually safe (doc 03 §4). It is recorded in the operation journal by the agent's
|
||||
// own bookkeeping and is **NEVER populated from the hub or any external input** — else
|
||||
// a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
|
||||
// The zero value (no internal evidence) is the only value an externally-sourced intent
|
||||
// may carry.
|
||||
type Provenance struct {
|
||||
// SameTxnCreated: the agent created this resource earlier in the SAME journaled
|
||||
// transaction, so destroying it is a compensating rollback (§10), not data loss.
|
||||
SameTxnCreated bool
|
||||
// AgentTaggedScratch: the agent tagged this resource ephemeral/scratch (e.g. a
|
||||
// restore-test scratch guest, §8). Journal-recorded provenance only.
|
||||
AgentTaggedScratch bool
|
||||
}
|
||||
|
||||
// internalEvidence reports whether agent-internal provenance makes a destroy benign.
|
||||
func (p Provenance) internalEvidence() bool {
|
||||
return p.SameTxnCreated || p.AgentTaggedScratch
|
||||
}
|
||||
|
||||
// Classify returns the disposition for an op class given agent-internal provenance.
|
||||
//
|
||||
// Rules (doc 03 §4):
|
||||
// - create/start/stop/restart/benign-set_config → always Benign.
|
||||
// - destroy/overwrite of a data-bearing resource → Destructive, UNLESS agent-internal
|
||||
// provenance (same-transaction create, or agent-tagged scratch) makes it benign.
|
||||
// - key-rotation → always Destructive (signed); role-scoping picks the allowed key.
|
||||
// - an UNKNOWN class fails safe → Destructive (require a signature).
|
||||
func Classify(class OpClass, prov Provenance) Disposition {
|
||||
switch class {
|
||||
case ClassStart, ClassStop, ClassSetConfig, ClassCreate, ClassRestart:
|
||||
return Benign
|
||||
case ClassGuestDestroy, ClassStorageWipe, ClassRestoreOverwrite, ClassDecommission:
|
||||
if prov.internalEvidence() {
|
||||
return Benign // compensating rollback / scratch teardown
|
||||
}
|
||||
return Destructive
|
||||
case ClassKeyRotation:
|
||||
return Destructive
|
||||
default:
|
||||
return Destructive // fail safe: an unrecognized op is treated as destructive
|
||||
}
|
||||
}
|
||||
|
||||
// classOfAction maps a benign reconcile ActionKind to its OpClass, so every reconcile
|
||||
// mutation is classified and passed through the gate like any other intent.
|
||||
func classOfAction(k ActionKind) OpClass {
|
||||
switch k {
|
||||
case ActionStart:
|
||||
return ClassStart
|
||||
case ActionStop:
|
||||
return ClassStop
|
||||
case ActionSetConfig:
|
||||
return ClassSetConfig
|
||||
default:
|
||||
return OpClass(k)
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,70 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
)
|
||||
|
||||
func TestClassify_BenignClasses(t *testing.T) {
|
||||
for _, c := range []OpClass{ClassStart, ClassStop, ClassSetConfig, ClassCreate, ClassRestart} {
|
||||
if got := Classify(c, Provenance{}); got != Benign {
|
||||
t.Errorf("Classify(%s) = %s, want benign", c, got)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestClassify_DestructiveClassesNeedSignature(t *testing.T) {
|
||||
for _, c := range []OpClass{ClassGuestDestroy, ClassStorageWipe, ClassRestoreOverwrite, ClassDecommission, ClassKeyRotation} {
|
||||
if got := Classify(c, Provenance{}); got != Destructive {
|
||||
t.Errorf("Classify(%s) = %s, want destructive", c, got)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestClassify_InternalProvenanceMakesDestroyBenign(t *testing.T) {
|
||||
// Same-transaction create → compensating rollback is benign (§10).
|
||||
if got := Classify(ClassGuestDestroy, Provenance{SameTxnCreated: true}); got != Benign {
|
||||
t.Errorf("same-txn destroy = %s, want benign", got)
|
||||
}
|
||||
// Agent-tagged scratch teardown is benign (§8).
|
||||
if got := Classify(ClassGuestDestroy, Provenance{AgentTaggedScratch: true}); got != Benign {
|
||||
t.Errorf("scratch destroy = %s, want benign", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestClassify_KeyRotationAlwaysDestructive(t *testing.T) {
|
||||
// Even with internal provenance, key-rotation stays signed (role-scoping decides
|
||||
// which key) — provenance flags don't apply to it.
|
||||
if got := Classify(ClassKeyRotation, Provenance{SameTxnCreated: true, AgentTaggedScratch: true}); got != Destructive {
|
||||
t.Errorf("key_rotation = %s, want destructive", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestClassify_UnknownClassFailsSafe(t *testing.T) {
|
||||
if got := Classify(OpClass("totally_unknown_op"), Provenance{}); got != Destructive {
|
||||
t.Errorf("unknown class = %s, want destructive (fail-safe)", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRoleAuthorizes(t *testing.T) {
|
||||
op := authz.RoleOperational
|
||||
rec := authz.RoleRecovery
|
||||
cases := []struct {
|
||||
role authz.KeyRole
|
||||
class OpClass
|
||||
want bool
|
||||
}{
|
||||
{op, ClassGuestDestroy, true}, // operational does ordinary destructive
|
||||
{op, ClassDecommission, true}, //
|
||||
{op, ClassKeyRotation, true}, // operational does planned rotation
|
||||
{rec, ClassGuestDestroy, false}, // recovery may NOT do ordinary destructive
|
||||
{rec, ClassStorageWipe, false}, //
|
||||
{rec, ClassKeyRotation, true}, // recovery authorizes ONLY rotation
|
||||
}
|
||||
for _, c := range cases {
|
||||
if got := roleAuthorizes(c.role, c.class); got != c.want {
|
||||
t.Errorf("roleAuthorizes(%s, %s) = %v, want %v", c.role, c.class, got, c.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -27,19 +27,23 @@ type Engine struct {
|
||||
journal *Journal
|
||||
provider DesiredProvider
|
||||
norm FieldNormalizers
|
||||
gate *Gate
|
||||
hostID string
|
||||
logger *slog.Logger
|
||||
|
||||
opSeq uint64 // atomic; makes each op id unique per attempt
|
||||
}
|
||||
|
||||
// EngineOptions configures a new Engine. Norm defaults to DefaultNormalizers, Logger
|
||||
// to a discard logger.
|
||||
// to a discard logger, Gate to a no-verifier gate (benign-allow, destructive-pending).
|
||||
type EngineOptions struct {
|
||||
API GuestAPI
|
||||
Queue *Queue
|
||||
Journal *Journal
|
||||
Provider DesiredProvider
|
||||
Norm FieldNormalizers
|
||||
Gate *Gate
|
||||
HostID string
|
||||
Logger *slog.Logger
|
||||
}
|
||||
|
||||
@@ -58,12 +62,20 @@ func NewEngine(opts EngineOptions) *Engine {
|
||||
if provider == nil {
|
||||
provider = EmptyProvider{}
|
||||
}
|
||||
gate := opts.Gate
|
||||
if gate == nil {
|
||||
// No verifier configured: benign actions pass, destructive are pending. This is
|
||||
// the common slice-4 daemon state (no signers pinned, no desired state).
|
||||
gate = NewGate(nil, opts.HostID, nil, logger)
|
||||
}
|
||||
return &Engine{
|
||||
api: opts.API,
|
||||
queue: opts.Queue,
|
||||
journal: opts.Journal,
|
||||
provider: provider,
|
||||
norm: norm,
|
||||
gate: gate,
|
||||
hostID: opts.HostID,
|
||||
logger: logger,
|
||||
}
|
||||
}
|
||||
@@ -97,23 +109,39 @@ func (e *Engine) Reconcile(ctx context.Context) (Result, error) {
|
||||
return res, nil
|
||||
}
|
||||
|
||||
// Dispatch all actions onto the shared per-guest queue, then await each. Same-vmid
|
||||
// actions serialize in submit order; different vmids run concurrently.
|
||||
chans := make([]<-chan error, len(actions))
|
||||
// Every mutation passes the reversibility gate before the queue (doc 03 §4).
|
||||
// Reconcile only produces benign actions, so each is allowed unsigned — but the
|
||||
// gate is genuinely in the path: a destructive class here would be refused
|
||||
// (pending_signature) and never dispatched. A gate refusal counts as a failed
|
||||
// action (it should not happen for the benign reconcile set).
|
||||
type dispatched struct {
|
||||
act Action
|
||||
ch <-chan error
|
||||
}
|
||||
var sent []dispatched
|
||||
for i := range actions {
|
||||
act := actions[i]
|
||||
chans[i] = e.queue.Submit(act.VMID, func() error { return e.execute(ctx, act) })
|
||||
dec := e.gate.Authorize(intentForAction(e.hostID, act), nil)
|
||||
if !dec.Allowed {
|
||||
res.Failed++
|
||||
res.Errors = append(res.Errors, fmt.Errorf("reconcile: gate refused %s vmid %d: %s",
|
||||
act.Kind, act.VMID, dec.Reason))
|
||||
e.logger.Error("reconcile: gate refused a benign action (unexpected)",
|
||||
"vmid", act.VMID, "kind", act.Kind, "reason", dec.Reason)
|
||||
continue
|
||||
}
|
||||
sent = append(sent, dispatched{act: act, ch: e.queue.Submit(act.VMID, func() error { return e.execute(ctx, act) })})
|
||||
}
|
||||
for i, ch := range chans {
|
||||
if err := <-ch; err != nil {
|
||||
for _, d := range sent {
|
||||
if err := <-d.ch; err != nil {
|
||||
res.Failed++
|
||||
res.Errors = append(res.Errors, err)
|
||||
e.logger.Error("reconcile: action failed",
|
||||
"vmid", actions[i].VMID, "kind", actions[i].Kind, "err", err)
|
||||
"vmid", d.act.VMID, "kind", d.act.Kind, "err", err)
|
||||
} else {
|
||||
res.Executed++
|
||||
e.logger.Info("reconcile: action applied",
|
||||
"vmid", actions[i].VMID, "kind", actions[i].Kind, "reason", actions[i].Reason)
|
||||
"vmid", d.act.VMID, "kind", d.act.Kind, "reason", d.act.Reason)
|
||||
}
|
||||
}
|
||||
return res, nil
|
||||
@@ -227,8 +255,13 @@ func (e *Engine) reconcileOnce(ctx context.Context) {
|
||||
|
||||
// nextOpID builds a per-attempt unique op id (kind-vmid-seq) for journal correlation.
|
||||
func (e *Engine) nextOpID(act Action) string {
|
||||
n := atomic.AddUint64(&e.opSeq, 1)
|
||||
return string(act.Kind) + "-" + strconv.Itoa(act.VMID) + "-" + strconv.FormatUint(n, 10)
|
||||
return string(act.Kind) + "-" + strconv.Itoa(act.VMID) + "-" + nextSeq(&e.opSeq)
|
||||
}
|
||||
|
||||
// nextSeq atomically increments a counter and returns it as a string — the unique
|
||||
// suffix that distinguishes journal op ids across attempts.
|
||||
func nextSeq(p *uint64) string {
|
||||
return strconv.FormatUint(atomic.AddUint64(p, 1), 10)
|
||||
}
|
||||
|
||||
// append journals a lifecycle record, logging (never failing the op on) a journal I/O
|
||||
|
||||
@@ -23,6 +23,8 @@ type fakeAPI struct {
|
||||
// waitFunc maps a UPID to a (status, err); default = OK. Mirrors the real client,
|
||||
// which errors on a non-OK exitstatus.
|
||||
waitFunc func(upid string) (proxmox.TaskStatus, error)
|
||||
// statusFunc backs TaskStatusOnce (crash recovery); default = stopped/OK.
|
||||
statusFunc func(upid string) (proxmox.TaskStatus, error)
|
||||
|
||||
starts []int
|
||||
stops []int
|
||||
@@ -31,6 +33,13 @@ type fakeAPI struct {
|
||||
listErr error
|
||||
}
|
||||
|
||||
func (f *fakeAPI) TaskStatusOnce(_ context.Context, upid string) (proxmox.TaskStatus, error) {
|
||||
if f.statusFunc != nil {
|
||||
return f.statusFunc(upid)
|
||||
}
|
||||
return proxmox.TaskStatus{UPID: upid, Status: "stopped", ExitStatus: "OK"}, nil
|
||||
}
|
||||
|
||||
type setCall struct {
|
||||
vmid int
|
||||
params map[string]string
|
||||
|
||||
@@ -0,0 +1,291 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"reflect"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
)
|
||||
|
||||
// SourceKind records where an intent came from — audit/debug ONLY. Classification
|
||||
// does NOT depend on it: a destructive desired-state delta and a destructive one-shot
|
||||
// job are gated identically (the agent distrusts hub desired state for destructive
|
||||
// change, not just jobs — doc 03 §4).
|
||||
type SourceKind string
|
||||
|
||||
const (
|
||||
SourceDesiredDelta SourceKind = "desired_delta"
|
||||
SourceOneShotJob SourceKind = "one_shot_job"
|
||||
)
|
||||
|
||||
// Intent is an intended mutation presented to the gate. For benign reconcile actions
|
||||
// the engine builds one per planned Action; destructive intents (jobs / deltas) carry
|
||||
// their op class + canonical params for binding.
|
||||
type Intent struct {
|
||||
Class OpClass
|
||||
HostID string
|
||||
GuestID string // blob-style guest id ("" = host-scoped); matches OpBlob.target.guest_id
|
||||
VMID int // numeric, for queue routing (0 = host-scoped)
|
||||
// ParamsJSON is the canonical params (matching the signed blob's `params`) used for
|
||||
// op-to-action binding on destructive ops. Nil for benign actions (not bound).
|
||||
ParamsJSON json.RawMessage
|
||||
// Provenance is AGENT-INTERNAL only (never hub-sourced) — see classify.go.
|
||||
Provenance Provenance
|
||||
Source SourceKind
|
||||
}
|
||||
|
||||
// SignedOp is the opaque operator-signed blob+signature pair the hub queues (doc 04
|
||||
// §5). The agent never trusts it until authz.Verifier.Verify passes.
|
||||
type SignedOp struct {
|
||||
Blob []byte // the canonical OpBlob JSON bytes (verified over RAW bytes)
|
||||
Sig []byte // the armored SSHSIG
|
||||
}
|
||||
|
||||
// RefuseReason is a stable, machine-readable gate refusal reason.
|
||||
type RefuseReason string
|
||||
|
||||
const (
|
||||
ReasonBenign RefuseReason = "benign" // allowed, no signature needed
|
||||
ReasonSigned RefuseReason = "signed" // allowed by a verified op
|
||||
ReasonPendingSignature RefuseReason = "pending_signature" // destructive, no/again-needed signature
|
||||
ReasonRejected RefuseReason = "rejected" // signature failed authz verification
|
||||
ReasonRoleDenied RefuseReason = "role_denied" // signer role not authorized for this op class
|
||||
ReasonBindingMismatch RefuseReason = "binding_mismatch" // signature is for a different action
|
||||
)
|
||||
|
||||
// Decision is the gate verdict.
|
||||
type Decision struct {
|
||||
Allowed bool
|
||||
Disposition Disposition
|
||||
Reason RefuseReason
|
||||
// Verified is the authenticated op when a signature authorized the action.
|
||||
Verified *authz.VerifiedOp
|
||||
// Err is the underlying authz rejection (errors.Is-friendly: ErrUnknownSigner,
|
||||
// ErrExpired, ErrReplay, …) when Reason == ReasonRejected.
|
||||
Err error
|
||||
}
|
||||
|
||||
// OpVerifier is the crypto verifier seam — *authz.Verifier in production; a fake in
|
||||
// gate unit tests. The gate never re-implements any crypto; it only consumes the
|
||||
// verdict and enforces the policy layer on top (role-scoping + op-to-action binding).
|
||||
type OpVerifier interface {
|
||||
Verify(blob, sigArmored []byte) (*authz.VerifiedOp, error)
|
||||
}
|
||||
|
||||
// AuditSink records every gate decision to the customer-visible audit log. Audit is a
|
||||
// SIGNAL, never the guard (doc 03 §4 / doc 04 §5): a compromised hub could suppress a
|
||||
// notice, which is exactly why the signature — not the audit — is the control.
|
||||
type AuditSink interface {
|
||||
Record(rec AuditRecord)
|
||||
}
|
||||
|
||||
// AuditRecord is one audited gate decision.
|
||||
type AuditRecord struct {
|
||||
Time time.Time
|
||||
Class OpClass
|
||||
HostID string
|
||||
GuestID string
|
||||
Source SourceKind
|
||||
Disposition Disposition
|
||||
Allowed bool
|
||||
Reason RefuseReason
|
||||
KeyID string // matched signer's key id, when signed
|
||||
Nonce string // the op nonce, when signed
|
||||
}
|
||||
|
||||
// Gate is the reversibility gate: it sits in front of the per-guest queue's executor
|
||||
// so EVERY mutation passes it. Benign intents are allowed unsigned; destructive
|
||||
// intents require a verified, role-authorized, action-bound operator signature, else
|
||||
// they are refused with pending_signature and never executed.
|
||||
type Gate struct {
|
||||
verifier OpVerifier // may be nil (no signers pinned) → destructive is always pending_signature
|
||||
hostID string
|
||||
audit AuditSink
|
||||
logger *slog.Logger
|
||||
}
|
||||
|
||||
// NewGate builds a gate. verifier may be nil when no signers are configured (the
|
||||
// common slice-4 state) — then there is nothing destructive to authorize and any
|
||||
// destructive intent is refused pending_signature. audit/logger default to no-ops.
|
||||
func NewGate(verifier OpVerifier, hostID string, audit AuditSink, logger *slog.Logger) *Gate {
|
||||
if audit == nil {
|
||||
audit = noopAudit{}
|
||||
}
|
||||
if logger == nil {
|
||||
logger = slog.New(slog.NewTextHandler(discard{}, nil))
|
||||
}
|
||||
return &Gate{verifier: verifier, hostID: hostID, audit: audit, logger: logger}
|
||||
}
|
||||
|
||||
// Authorize classifies the intent and, for destructive intents, runs the full
|
||||
// consuming-layer policy over the verifier verdict. It writes the decision to the
|
||||
// audit log and returns it. It NEVER executes anything — the caller dispatches an
|
||||
// Allowed decision onto the queue.
|
||||
func (g *Gate) Authorize(intent Intent, signed *SignedOp) Decision {
|
||||
disp := Classify(intent.Class, intent.Provenance)
|
||||
|
||||
// Benign: allowed without a signature.
|
||||
if disp == Benign {
|
||||
d := Decision{Allowed: true, Disposition: Benign, Reason: ReasonBenign}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
// Destructive from here: a verified, role-authorized, action-bound signature is
|
||||
// mandatory. Missing signature OR no pinned verifier → pending_signature (refuse).
|
||||
if signed == nil || g.verifier == nil {
|
||||
d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonPendingSignature}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
// Crypto + namespace + allow-list + target + time + nonce — the LOCKED authz
|
||||
// pipeline. The nonce is consumed (recorded) only if this passes.
|
||||
vop, err := g.verifier.Verify(signed.Blob, signed.Sig)
|
||||
if err != nil {
|
||||
d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRejected, Err: err}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
// Role-scoping (the slice-4 job per verifier.go): the signer's pinned role must be
|
||||
// authorized for THIS op class.
|
||||
if !roleAuthorizes(vop.Signer.Role, intent.Class) {
|
||||
d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonRoleDenied, Verified: vop}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
// Op-to-action binding: the verified op must name THIS exact action (op + target +
|
||||
// params) — a signature for "restore guest X" cannot authorize destroying guest Y.
|
||||
if !g.bindsToAction(vop, intent) {
|
||||
d := Decision{Allowed: false, Disposition: Destructive, Reason: ReasonBindingMismatch, Verified: vop}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
d := Decision{Allowed: true, Disposition: Destructive, Reason: ReasonSigned, Verified: vop}
|
||||
g.record(intent, d)
|
||||
return d
|
||||
}
|
||||
|
||||
// roleAuthorizes enforces the doc 04 §4 two-key role model: the cold recovery key
|
||||
// authorizes ONLY key-rotation re-pins; the operational key authorizes ordinary
|
||||
// destructive ops AND planned key-rotation.
|
||||
func roleAuthorizes(role authz.KeyRole, class OpClass) bool {
|
||||
if class == ClassKeyRotation {
|
||||
return role == authz.RoleOperational || role == authz.RoleRecovery
|
||||
}
|
||||
return role == authz.RoleOperational
|
||||
}
|
||||
|
||||
// bindsToAction checks the verified op names this exact action: host (already checked
|
||||
// by the verifier, re-asserted here), guest, op class, and params. This is the binding
|
||||
// BEYOND the verifier's target check (doc 04 §2.3 binds host; this binds the full
|
||||
// action).
|
||||
func (g *Gate) bindsToAction(vop *authz.VerifiedOp, intent Intent) bool {
|
||||
if vop.HostID != g.hostID || vop.HostID != intent.HostID {
|
||||
return false
|
||||
}
|
||||
if vop.GuestID != intent.GuestID {
|
||||
return false
|
||||
}
|
||||
if vop.Op != string(intent.Class) {
|
||||
return false
|
||||
}
|
||||
return paramsEqual(vop.Params, intent.ParamsJSON)
|
||||
}
|
||||
|
||||
// paramsEqual compares two JSON param objects semantically (key order / whitespace
|
||||
// independent). Absent params on both sides ({} or empty) compare equal.
|
||||
func paramsEqual(a, b json.RawMessage) bool {
|
||||
ax, aok := decodeParams(a)
|
||||
bx, bok := decodeParams(b)
|
||||
if !aok || !bok {
|
||||
return false
|
||||
}
|
||||
return reflect.DeepEqual(ax, bx)
|
||||
}
|
||||
|
||||
func decodeParams(p json.RawMessage) (any, bool) {
|
||||
if len(p) == 0 {
|
||||
return map[string]any{}, true // absent == empty object
|
||||
}
|
||||
var v any
|
||||
if err := json.Unmarshal(p, &v); err != nil {
|
||||
return nil, false
|
||||
}
|
||||
if v == nil {
|
||||
return map[string]any{}, true // explicit null == empty
|
||||
}
|
||||
return v, true
|
||||
}
|
||||
|
||||
func (g *Gate) record(intent Intent, d Decision) {
|
||||
rec := AuditRecord{
|
||||
Time: time.Now().UTC(),
|
||||
Class: intent.Class,
|
||||
HostID: intent.HostID,
|
||||
GuestID: intent.GuestID,
|
||||
Source: intent.Source,
|
||||
Disposition: d.Disposition,
|
||||
Allowed: d.Allowed,
|
||||
Reason: d.Reason,
|
||||
}
|
||||
if d.Verified != nil {
|
||||
rec.KeyID = d.Verified.Signer.KeyID
|
||||
rec.Nonce = d.Verified.Nonce
|
||||
}
|
||||
g.audit.Record(rec)
|
||||
g.logger.Info("gate decision",
|
||||
"class", intent.Class, "guest", intent.GuestID, "source", intent.Source,
|
||||
"disposition", d.Disposition, "allowed", d.Allowed, "reason", d.Reason)
|
||||
}
|
||||
|
||||
// intentForAction builds the gate Intent for a benign reconcile action. The provenance
|
||||
// is the zero value (no agent-internal destroy evidence) and the source is the
|
||||
// desired-state delta — reconcile never fabricates scratch/same-txn provenance.
|
||||
func intentForAction(hostID string, act Action) Intent {
|
||||
return Intent{
|
||||
Class: classOfAction(act.Kind),
|
||||
HostID: hostID,
|
||||
GuestID: strconv.Itoa(act.VMID),
|
||||
VMID: act.VMID,
|
||||
Provenance: Provenance{}, // benign actions need none; never hub-sourced
|
||||
Source: SourceDesiredDelta,
|
||||
}
|
||||
}
|
||||
|
||||
// noopAudit drops audit records (used when no sink is configured).
|
||||
type noopAudit struct{}
|
||||
|
||||
func (noopAudit) Record(AuditRecord) {}
|
||||
|
||||
// SlogAudit is a minimal AuditSink that emits records to a logger. The durable,
|
||||
// customer-visible audit log + its inclusion in the host-report (HostReport.AuditTail)
|
||||
// is a later-slice concern; this keeps the signal flowing now without inventing that
|
||||
// wire schema.
|
||||
type SlogAudit struct{ Logger *slog.Logger }
|
||||
|
||||
// Record logs the audit entry at info level.
|
||||
func (s SlogAudit) Record(rec AuditRecord) {
|
||||
if s.Logger == nil {
|
||||
return
|
||||
}
|
||||
s.Logger.Info("audit: gate decision",
|
||||
"class", rec.Class, "host", rec.HostID, "guest", rec.GuestID, "source", rec.Source,
|
||||
"disposition", rec.Disposition, "allowed", rec.Allowed, "reason", rec.Reason,
|
||||
"key_id", rec.KeyID, "nonce", auditNonce(rec.Nonce))
|
||||
}
|
||||
|
||||
// auditNonce shortens a nonce for the log (full nonce is high-cardinality; a prefix is
|
||||
// enough to correlate without bloating logs).
|
||||
func auditNonce(n string) string {
|
||||
if len(n) <= 8 {
|
||||
return n
|
||||
}
|
||||
return n[:8] + "…"
|
||||
}
|
||||
@@ -0,0 +1,299 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
)
|
||||
|
||||
const testHost = "demo-felhom"
|
||||
|
||||
// captureAudit records gate decisions so tests can assert audit is always written
|
||||
// (audit is a signal, never the guard).
|
||||
type captureAudit struct{ recs []AuditRecord }
|
||||
|
||||
func (c *captureAudit) Record(r AuditRecord) { c.recs = append(c.recs, r) }
|
||||
|
||||
// realVerifierAt builds a real authz.Verifier over a durable nonce store at path
|
||||
// (reused across "restart" by reopening the same path), pinning the given signers.
|
||||
func realVerifierAt(t *testing.T, path, hostID string, signers ...authz.AllowedSigner) (*authz.Verifier, *authz.FileNonceStore) {
|
||||
t.Helper()
|
||||
store, err := authz.OpenFileNonceStore(path)
|
||||
if err != nil {
|
||||
t.Fatalf("OpenFileNonceStore: %v", err)
|
||||
}
|
||||
t.Cleanup(func() { store.Close() })
|
||||
return authz.New(signers, store, hostID), store
|
||||
}
|
||||
|
||||
// destroyIntent is the canonical destructive fixture: destroy guest 9001, params
|
||||
// {"purge":true} (mirrors the committed slice-2 op_blob.json shape).
|
||||
func destroyIntent(source SourceKind) Intent {
|
||||
return Intent{
|
||||
Class: ClassGuestDestroy,
|
||||
HostID: testHost,
|
||||
GuestID: "9001",
|
||||
VMID: 9001,
|
||||
ParamsJSON: json.RawMessage(`{"purge":true}`),
|
||||
Source: source,
|
||||
}
|
||||
}
|
||||
|
||||
func freshWindow() (issued, expires time.Time) {
|
||||
now := time.Now().UTC()
|
||||
return now.Add(-1 * time.Minute), now.Add(10 * time.Minute)
|
||||
}
|
||||
|
||||
// --- The adversarial matrix: each case must be INDEPENDENTLY rejected (or, the one
|
||||
// positive case, accepted). ---
|
||||
|
||||
func TestGate_DestructiveJobNoSignatureRefused(t *testing.T) {
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
aud := &captureAudit{}
|
||||
g := NewGate(v, testHost, aud, nil)
|
||||
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), nil)
|
||||
if d.Allowed || d.Reason != ReasonPendingSignature {
|
||||
t.Fatalf("unsigned destructive job: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason)
|
||||
}
|
||||
if len(aud.recs) != 1 || aud.recs[0].Allowed {
|
||||
t.Errorf("decision must be audited as refused: %+v", aud.recs)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_DestructiveDesiredDeltaNoSignatureRefused(t *testing.T) {
|
||||
// Proves the agent distrusts hub DESIRED STATE for destructive change, not just
|
||||
// jobs — same refusal, different source.
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
d := g.Authorize(destroyIntent(SourceDesiredDelta), nil)
|
||||
if d.Allowed || d.Reason != ReasonPendingSignature {
|
||||
t.Fatalf("unsigned destructive delta: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_UnknownSignerRejected(t *testing.T) {
|
||||
pinned := newTestSigner(t)
|
||||
attacker := newTestSigner(t) // NOT pinned
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, pinned.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
signed := attacker.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":true}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || d.Reason != ReasonRejected || !errors.Is(d.Err, authz.ErrUnknownSigner) {
|
||||
t.Fatalf("forged signer: got allowed=%v reason=%s err=%v, want rejected/ErrUnknownSigner", d.Allowed, d.Reason, d.Err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_ExpiredSignatureRejected(t *testing.T) {
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
past := time.Now().UTC().Add(-2 * time.Hour)
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":true}`, past, past.Add(time.Minute))
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || !errors.Is(d.Err, authz.ErrExpired) {
|
||||
t.Fatalf("expired op: got allowed=%v err=%v, want ErrExpired", d.Allowed, d.Err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_WrongHostTargetRejected(t *testing.T) {
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
signed := op.mint("guest_destroy", "some-other-host", "9001", "op1", nonce(), `{"purge":true}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || !errors.Is(d.Err, authz.ErrTarget) {
|
||||
t.Fatalf("wrong host: got allowed=%v err=%v, want ErrTarget", d.Allowed, d.Err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_WrongGuestBindingMismatch(t *testing.T) {
|
||||
// host matches (verifier passes) but the signature names a DIFFERENT guest than the
|
||||
// action — the op-to-action binding rejects it.
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
signed := op.mint("guest_destroy", testHost, "9002", "op1", nonce(), `{"purge":true}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed) // intent targets 9001
|
||||
if d.Allowed || d.Reason != ReasonBindingMismatch {
|
||||
t.Fatalf("guest mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_WrongParamsBindingMismatch(t *testing.T) {
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
// signature authorizes purge=false; the action wants purge=true.
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", nonce(), `{"purge":false}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || d.Reason != ReasonBindingMismatch {
|
||||
t.Fatalf("params mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_WrongOpBindingMismatch(t *testing.T) {
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
// a valid signature for restore_overwrite cannot authorize a guest_destroy.
|
||||
signed := op.mint("restore_overwrite", testHost, "9001", "op1", nonce(), `{"purge":true}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || d.Reason != ReasonBindingMismatch {
|
||||
t.Fatalf("op mismatch: got allowed=%v reason=%s, want binding_mismatch", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_RecoveryKeyOnOrdinaryDestructiveRoleDenied(t *testing.T) {
|
||||
// A valid signature from the cold RECOVERY key on an ordinary destructive op is
|
||||
// refused by role-scoping (recovery authorizes ONLY key-rotation).
|
||||
rec := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, rec.allowed(t, "rec1", authz.RoleRecovery))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
signed := rec.mint("guest_destroy", testHost, "9001", "rec1", nonce(), `{"purge":true}`, issued, expires)
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || d.Reason != ReasonRoleDenied {
|
||||
t.Fatalf("recovery on destroy: got allowed=%v reason=%s, want role_denied", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_HubSuppliedScratchTagIgnored(t *testing.T) {
|
||||
// A compromised hub attaches a "scratch" hint to a data-bearing guest's destroy
|
||||
// delta to try to walk the gate unsigned. The intent built from a hub delta must
|
||||
// NOT carry that as agent-internal provenance — so it stays destructive and is
|
||||
// refused without a signature.
|
||||
intent := intentFromHubDelta(hubDelta{Class: ClassGuestDestroy, HostID: testHost, GuestID: "9001", VMID: 9001, HubSaysScratch: true})
|
||||
if intent.Provenance.AgentTaggedScratch || intent.Provenance.SameTxnCreated {
|
||||
t.Fatal("hub-supplied scratch must NOT become agent-internal provenance")
|
||||
}
|
||||
g := NewGate(nil, testHost, nil, nil) // no verifier even needed
|
||||
d := g.Authorize(intent, nil)
|
||||
if d.Allowed || d.Reason != ReasonPendingSignature {
|
||||
t.Fatalf("hub-scratch destroy: got allowed=%v reason=%s, want pending_signature", d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_ValidOpAcceptedThenReplayRejected(t *testing.T) {
|
||||
// The ONE positive case: valid signature, correct role, correct target, fresh
|
||||
// nonce → accepted. A SECOND presentation (same nonce) → rejected (nonce consumed).
|
||||
op := newTestSigner(t)
|
||||
path := filepath.Join(t.TempDir(), "n.log")
|
||||
v, _ := realVerifierAt(t, path, testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
aud := &captureAudit{}
|
||||
g := NewGate(v, testHost, aud, nil)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
d := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if !d.Allowed || d.Reason != ReasonSigned {
|
||||
t.Fatalf("valid op: got allowed=%v reason=%s err=%v, want accepted/signed", d.Allowed, d.Reason, d.Err)
|
||||
}
|
||||
if d.Verified == nil || d.Verified.Nonce != n {
|
||||
t.Fatalf("accepted op should surface the verified op with nonce %s", n)
|
||||
}
|
||||
// Replay the exact same signed op → nonce already consumed.
|
||||
d2 := g.Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d2.Allowed || !errors.Is(d2.Err, authz.ErrReplay) {
|
||||
t.Fatalf("replay: got allowed=%v err=%v, want ErrReplay", d2.Allowed, d2.Err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGate_ReplayAcrossRestartRejected(t *testing.T) {
|
||||
// Replay protection must survive an agent restart (the durable nonce store). Accept
|
||||
// once with verifier A, then reopen the SAME nonce-store path as verifier B (a
|
||||
// restart) and replay → still rejected.
|
||||
op := newTestSigner(t)
|
||||
path := filepath.Join(t.TempDir(), "n.log")
|
||||
signer := op.allowed(t, "op1", authz.RoleOperational)
|
||||
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
vA, storeA := realVerifierAt(t, path, testHost, signer)
|
||||
if d := NewGate(vA, testHost, nil, nil).Authorize(destroyIntent(SourceOneShotJob), signed); !d.Allowed {
|
||||
t.Fatalf("first presentation should be accepted: %+v", d)
|
||||
}
|
||||
storeA.Close() // simulate shutdown
|
||||
|
||||
vB, _ := realVerifierAt(t, path, testHost, signer) // restart: reopen same nonce log
|
||||
d := NewGate(vB, testHost, nil, nil).Authorize(destroyIntent(SourceOneShotJob), signed)
|
||||
if d.Allowed || !errors.Is(d.Err, authz.ErrReplay) {
|
||||
t.Fatalf("replay across restart: got allowed=%v err=%v, want ErrReplay", d.Allowed, d.Err)
|
||||
}
|
||||
}
|
||||
|
||||
// --- gate unit tests (benign path, binding, params) ---
|
||||
|
||||
func TestGate_BenignAllowedWithoutVerifier(t *testing.T) {
|
||||
g := NewGate(nil, testHost, nil, nil) // no verifier at all
|
||||
for _, k := range []ActionKind{ActionStart, ActionStop, ActionSetConfig} {
|
||||
d := g.Authorize(intentForAction(testHost, Action{VMID: 100, Kind: k}), nil)
|
||||
if !d.Allowed || d.Reason != ReasonBenign {
|
||||
t.Errorf("benign %s: got allowed=%v reason=%s, want benign", k, d.Allowed, d.Reason)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestParamsEqual(t *testing.T) {
|
||||
eq := func(a, b string) bool { return paramsEqual(json.RawMessage(a), json.RawMessage(b)) }
|
||||
if !eq(`{"purge":true}`, `{"purge":true}`) {
|
||||
t.Error("identical params should be equal")
|
||||
}
|
||||
if !eq(`{"a":1,"b":2}`, `{"b":2,"a":1}`) {
|
||||
t.Error("key order must not matter")
|
||||
}
|
||||
if eq(`{"purge":true}`, `{"purge":false}`) {
|
||||
t.Error("different values must differ")
|
||||
}
|
||||
if !eq(``, `{}`) || !eq(`{}`, `null`) {
|
||||
t.Error("absent / empty / null params should all compare equal")
|
||||
}
|
||||
}
|
||||
|
||||
// --- helpers for the hub-scratch test: a stand-in for the slice-10 desired-delta →
|
||||
// intent constructor, proving it never propagates hub-supplied provenance. ---
|
||||
|
||||
type hubDelta struct {
|
||||
Class OpClass
|
||||
HostID string
|
||||
GuestID string
|
||||
VMID int
|
||||
HubSaysScratch bool // a hostile/erroneous hub hint — MUST be ignored
|
||||
}
|
||||
|
||||
func intentFromHubDelta(d hubDelta) Intent {
|
||||
// NOTE: HubSaysScratch is deliberately NOT mapped to Provenance. Agent-internal
|
||||
// provenance (scratch/same-txn) is recorded by the agent's own journal, never taken
|
||||
// from the hub (doc 03 §4).
|
||||
return Intent{
|
||||
Class: d.Class,
|
||||
HostID: d.HostID,
|
||||
GuestID: d.GuestID,
|
||||
VMID: d.VMID,
|
||||
Provenance: Provenance{}, // always zero from an external source
|
||||
Source: SourceDesiredDelta,
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,120 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/proxmox"
|
||||
)
|
||||
|
||||
// DestructiveExecutor performs an authorized destructive op against the host. At slice
|
||||
// 4 there is NO live implementation (guest-destroy / storage-wipe / restore-overwrite
|
||||
// executors land in slices 6/7) — the consuming layer is wired and tested with fixture
|
||||
// executors but never executes a real destructive op, because nothing serves
|
||||
// destructive deltas until slice 10. It returns a Proxmox UPID (or "" for a synchronous
|
||||
// op) so the journal/Recover path is identical to benign execution.
|
||||
type DestructiveExecutor func(ctx context.Context, intent Intent, vop *authz.VerifiedOp) (upid string, err error)
|
||||
|
||||
// JobResult is the outcome of RunSignedJob.
|
||||
type JobResult struct {
|
||||
Decision Decision
|
||||
AlreadyApplied bool // the op's idempotency key was already applied (deduped, not re-run)
|
||||
Executed bool // the executor ran and succeeded
|
||||
Err error // execution error (after a successful authorization)
|
||||
}
|
||||
|
||||
// RunSignedJob is the signed one-shot consuming layer (doc 03 §4(b) / doc 04). It adds
|
||||
// idempotency dedupe + journaling around the gate:
|
||||
//
|
||||
// 1. Dedupe: if the op's idempotency key (its nonce) is already applied, skip — a
|
||||
// redelivered, already-completed op must not re-run (returns AlreadyApplied).
|
||||
// 2. Gate: classify + verify + role-scope + op-to-action bind. A refusal returns the
|
||||
// Decision and executes nothing.
|
||||
// 3. Journal + execute: record started → run the executor → record the task id →
|
||||
// record the terminal state under the idempotency key (so success marks the key
|
||||
// applied; a crash mid-execute is resolved by Recover, never by idempotency alone).
|
||||
//
|
||||
// exec may be nil — then an AUTHORIZED destructive op is journaled as authorized but
|
||||
// not executed (the slice-4 inert state: the gate works, the executor doesn't exist
|
||||
// yet). A REFUSED op never reaches exec.
|
||||
func (e *Engine) RunSignedJob(ctx context.Context, intent Intent, signed *SignedOp, exec DestructiveExecutor) JobResult {
|
||||
idemKey := jobIdempotencyKey(signed)
|
||||
|
||||
// 1. Idempotency dedupe (redelivery after a prior success).
|
||||
if idemKey != "" && e.journal != nil && e.journal.AlreadyApplied(idemKey) {
|
||||
e.logger.Info("job: idempotency key already applied; skipping", "key", auditNonce(idemKey))
|
||||
return JobResult{AlreadyApplied: true, Decision: Decision{Allowed: true, Reason: ReasonSigned}}
|
||||
}
|
||||
|
||||
// 2. Gate (classification + the full signed-op consuming policy).
|
||||
dec := e.gate.Authorize(intent, signed)
|
||||
if !dec.Allowed {
|
||||
return JobResult{Decision: dec}
|
||||
}
|
||||
|
||||
// 3. Journal + execute. Benign authorized ops (no signature path) also flow here if
|
||||
// routed as jobs; they carry no idempotency key and are simply executed.
|
||||
opID := e.nextJobOpID(intent)
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
State: OpStarted, IdempKey: idemKey, At: time.Now().UTC()})
|
||||
|
||||
if exec == nil {
|
||||
// Slice-4 inert: authorized, but no destructive executor wired. Record the
|
||||
// authorization terminally (do NOT mark applied — nothing actually ran).
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
State: OpFailed, IdempKey: "", At: time.Now().UTC()})
|
||||
e.logger.Warn("job: authorized but no executor wired (slice-4 inert)", "class", intent.Class)
|
||||
return JobResult{Decision: dec, Err: fmt.Errorf("reconcile: no executor for %s (not wired this slice)", intent.Class)}
|
||||
}
|
||||
|
||||
upid, err := exec(ctx, intent, dec.Verified)
|
||||
if err != nil {
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
State: OpFailed, IdempKey: idemKey, At: time.Now().UTC()})
|
||||
return JobResult{Decision: dec, Err: err}
|
||||
}
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
UPID: upid, State: OpTaskRunning, IdempKey: idemKey, At: time.Now().UTC()})
|
||||
|
||||
if upid != "" {
|
||||
st, err := e.api.WaitTask(ctx, upid, proxmox.WaitOptions{})
|
||||
if err != nil {
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
UPID: upid, State: OpFailed, IdempKey: idemKey, At: time.Now().UTC()})
|
||||
return JobResult{Decision: dec, Err: err}
|
||||
}
|
||||
_ = st
|
||||
}
|
||||
|
||||
// Terminal success — marks the idempotency key applied (survives restart).
|
||||
e.append(JournalEntry{OpID: opID, VMID: intent.VMID, Kind: string(intent.Class),
|
||||
UPID: upid, State: OpSucceeded, IdempKey: idemKey, At: time.Now().UTC()})
|
||||
return JobResult{Decision: dec, Executed: true}
|
||||
}
|
||||
|
||||
// jobIdempotencyKey derives the idempotency key from the signed op's nonce — unique
|
||||
// per op (≥128-bit, doc 04 §2.1) and already the anti-replay token, so reusing it as
|
||||
// the journal dedupe key is exact. Parsed from the UNVERIFIED blob: it is only a map
|
||||
// key here (the gate's verifier is the trust boundary), and a forged blob is refused at
|
||||
// the gate regardless.
|
||||
func jobIdempotencyKey(signed *SignedOp) string {
|
||||
if signed == nil || len(signed.Blob) == 0 {
|
||||
return ""
|
||||
}
|
||||
var b struct {
|
||||
Nonce string `json:"nonce"`
|
||||
}
|
||||
if json.Unmarshal(signed.Blob, &b) != nil {
|
||||
return ""
|
||||
}
|
||||
return b.Nonce
|
||||
}
|
||||
|
||||
// nextJobOpID builds a per-attempt op id for a signed job (distinct namespace from
|
||||
// reconcile op ids).
|
||||
func (e *Engine) nextJobOpID(intent Intent) string {
|
||||
return "job-" + string(intent.Class) + "-" + intent.GuestID + "-" + nextSeq(&e.opSeq)
|
||||
}
|
||||
@@ -0,0 +1,135 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
)
|
||||
|
||||
// newSignedEngine builds an engine whose gate has a real verifier pinning one
|
||||
// operational key — for exercising the signed-job consuming layer end to end.
|
||||
func newSignedEngine(t *testing.T, api GuestAPI) (*Engine, *Journal, testSigner) {
|
||||
t.Helper()
|
||||
j, err := OpenJournal(filepath.Join(t.TempDir(), "journal.log"))
|
||||
if err != nil {
|
||||
t.Fatalf("OpenJournal: %v", err)
|
||||
}
|
||||
t.Cleanup(func() { j.Close() })
|
||||
q := NewQueue()
|
||||
t.Cleanup(q.Close)
|
||||
op := newTestSigner(t)
|
||||
v, _ := realVerifierAt(t, filepath.Join(t.TempDir(), "n.log"), testHost, op.allowed(t, "op1", authz.RoleOperational))
|
||||
g := NewGate(v, testHost, nil, nil)
|
||||
e := NewEngine(EngineOptions{API: api, Queue: q, Journal: j, Gate: g, HostID: testHost})
|
||||
return e, j, op
|
||||
}
|
||||
|
||||
func TestRunSignedJob_ValidExecutesAndMarksApplied(t *testing.T) {
|
||||
e, j, op := newSignedEngine(t, &fakeAPI{})
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
calls := 0
|
||||
exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil } // synchronous
|
||||
|
||||
res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec)
|
||||
if !res.Executed || res.Err != nil {
|
||||
t.Fatalf("valid job should execute, got %+v", res)
|
||||
}
|
||||
if calls != 1 {
|
||||
t.Errorf("executor should run once, ran %d", calls)
|
||||
}
|
||||
if !j.AlreadyApplied(n) {
|
||||
t.Error("successful job must mark its idempotency key (nonce) applied")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSignedJob_RedeliveryDedupedByIdempotencyKey(t *testing.T) {
|
||||
// After success, a redelivered identical job must NOT re-run — the journal's
|
||||
// idempotency key short-circuits BEFORE the verifier (so it reports already-applied,
|
||||
// not a confusing replay rejection).
|
||||
e, _, op := newSignedEngine(t, &fakeAPI{})
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
calls := 0
|
||||
exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil }
|
||||
|
||||
first := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec)
|
||||
if !first.Executed {
|
||||
t.Fatalf("first delivery should execute: %+v", first)
|
||||
}
|
||||
second := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec)
|
||||
if !second.AlreadyApplied || second.Executed {
|
||||
t.Fatalf("redelivery should be deduped (already applied), got %+v", second)
|
||||
}
|
||||
if calls != 1 {
|
||||
t.Errorf("executor must run exactly once across redelivery, ran %d", calls)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSignedJob_RefusedDoesNotExecute(t *testing.T) {
|
||||
e, j, _ := newSignedEngine(t, &fakeAPI{})
|
||||
attacker := newTestSigner(t) // not pinned
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := attacker.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
calls := 0
|
||||
exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) { calls++; return "", nil }
|
||||
|
||||
res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec)
|
||||
if res.Executed || res.Decision.Allowed || !errors.Is(res.Decision.Err, authz.ErrUnknownSigner) {
|
||||
t.Fatalf("forged job must be refused unexecuted, got %+v", res)
|
||||
}
|
||||
if calls != 0 {
|
||||
t.Errorf("executor must not run for a refused job, ran %d", calls)
|
||||
}
|
||||
if j.AlreadyApplied(n) {
|
||||
t.Error("a refused job must not mark its key applied")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSignedJob_NoExecutorInert(t *testing.T) {
|
||||
// Slice-4 inert state: a VALID authorization with no destructive executor wired
|
||||
// returns an error and does NOT mark the key applied (so it is retryable once the
|
||||
// executor lands in a later slice).
|
||||
e, j, op := newSignedEngine(t, &fakeAPI{})
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, nil)
|
||||
if !res.Decision.Allowed {
|
||||
t.Fatalf("op should authorize even with no executor: %+v", res.Decision)
|
||||
}
|
||||
if res.Executed || res.Err == nil {
|
||||
t.Fatalf("no-executor job should not execute and should error, got %+v", res)
|
||||
}
|
||||
if j.AlreadyApplied(n) {
|
||||
t.Error("an unexecuted (no-executor) job must not mark its key applied")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSignedJob_ExecutorErrorJournaledFailed(t *testing.T) {
|
||||
e, j, op := newSignedEngine(t, &fakeAPI{})
|
||||
issued, expires := freshWindow()
|
||||
n := nonce()
|
||||
signed := op.mint("guest_destroy", testHost, "9001", "op1", n, `{"purge":true}`, issued, expires)
|
||||
|
||||
exec := func(context.Context, Intent, *authz.VerifiedOp) (string, error) {
|
||||
return "", errors.New("destroy failed")
|
||||
}
|
||||
res := e.RunSignedJob(context.Background(), destroyIntent(SourceOneShotJob), signed, exec)
|
||||
if res.Executed || res.Err == nil {
|
||||
t.Fatalf("executor error should propagate, got %+v", res)
|
||||
}
|
||||
if j.AlreadyApplied(n) {
|
||||
t.Error("a failed execution must not mark its key applied")
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,129 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"crypto/ed25519"
|
||||
"crypto/rand"
|
||||
"crypto/sha256"
|
||||
"encoding/pem"
|
||||
"fmt"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/authz"
|
||||
"golang.org/x/crypto/ssh"
|
||||
)
|
||||
|
||||
// In-test SSHSIG minter for the gate's adversarial matrix. It replicates the ~40 lines
|
||||
// of SSHSIG framing (porting internal/authz/sshsig.go + mint_test.go) so reconcile's
|
||||
// tests can produce valid AND adversarial signatures with now-relative timestamps.
|
||||
// This lives only in reconcile's test binary — production authz is untouched (no
|
||||
// signing capability is added to the verify-only security package), and the verifier's
|
||||
// unexported clock (not injectable cross-package) is why we mint live rather than reuse
|
||||
// the committed fixed-window fixture.
|
||||
//
|
||||
// The minted bytes round-trip through the REAL authz.Verifier, so a correct framing is
|
||||
// proven by the positive case verifying.
|
||||
|
||||
const sshsigMagic = "SSHSIG"
|
||||
|
||||
type sshsigBlob struct {
|
||||
Version uint32
|
||||
PublicKey string
|
||||
Namespace string
|
||||
Reserved string
|
||||
HashAlgo string
|
||||
Signature string
|
||||
}
|
||||
|
||||
// signedData recomputes the SSHSIG signed bytes: "SSHSIG" || marshal(ns, reserved,
|
||||
// hash, H(message)). Mirrors authz.signedData exactly (sha256).
|
||||
func signedDataForTest(ns string, msg []byte) []byte {
|
||||
h := sha256.Sum256(msg)
|
||||
body := ssh.Marshal(struct {
|
||||
Namespace string
|
||||
Reserved string
|
||||
HashAlgo string
|
||||
Hash []byte
|
||||
}{ns, "", "sha256", h[:]})
|
||||
return append([]byte(sshsigMagic), body...)
|
||||
}
|
||||
|
||||
// mintArmor builds an armored SSHSIG over message using sign.
|
||||
func mintArmor(pubMarshaled []byte, namespace string, message []byte, sign func([]byte) ssh.Signature) []byte {
|
||||
sb := &sshsigBlob{Version: 1, PublicKey: string(pubMarshaled), Namespace: namespace, Reserved: "", HashAlgo: "sha256"}
|
||||
sig := sign(signedDataForTest(namespace, message))
|
||||
sb.Signature = string(ssh.Marshal(&sig))
|
||||
raw := append([]byte(sshsigMagic), ssh.Marshal(sb)...)
|
||||
return pem.EncodeToMemory(&pem.Block{Type: "SSH SIGNATURE", Bytes: raw})
|
||||
}
|
||||
|
||||
// nonce returns a fresh 128-bit hex nonce (doc 04 §2.1: ≥128-bit random).
|
||||
func nonce() string {
|
||||
var b [16]byte
|
||||
if _, err := rand.Read(b[:]); err != nil {
|
||||
panic(err)
|
||||
}
|
||||
const hexdigits = "0123456789abcdef"
|
||||
out := make([]byte, 32)
|
||||
for i, x := range b {
|
||||
out[i*2] = hexdigits[x>>4]
|
||||
out[i*2+1] = hexdigits[x&0x0f]
|
||||
}
|
||||
return string(out)
|
||||
}
|
||||
|
||||
// canonicalBlob builds an op blob in the doc 04 §2.1 canonical field order (keys
|
||||
// sorted at every level, no insignificant whitespace).
|
||||
func canonicalBlob(op, hostID, guestID, keyID, nonce, paramsJSON string, issued, expires time.Time) []byte {
|
||||
if paramsJSON == "" {
|
||||
paramsJSON = "{}"
|
||||
}
|
||||
return []byte(fmt.Sprintf(
|
||||
`{"expires_at":%q,"issued_at":%q,"key_id":%q,"nonce":%q,"op":%q,"params":%s,"target":{"guest_id":%q,"host_id":%q}}`,
|
||||
expires.UTC().Format(time.RFC3339), issued.UTC().Format(time.RFC3339),
|
||||
keyID, nonce, op, paramsJSON, guestID, hostID))
|
||||
}
|
||||
|
||||
// testSigner is a fresh ed25519 operator key: its public key, an authorized_keys line
|
||||
// to pin it, and a sign closure.
|
||||
type testSigner struct {
|
||||
pub ssh.PublicKey
|
||||
line string
|
||||
sign func([]byte) ssh.Signature
|
||||
}
|
||||
|
||||
func newTestSigner(t *testing.T) testSigner {
|
||||
t.Helper()
|
||||
pub, priv, err := ed25519.GenerateKey(rand.Reader)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
sshPub, err := ssh.NewPublicKey(pub)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
return testSigner{
|
||||
pub: sshPub,
|
||||
line: string(ssh.MarshalAuthorizedKey(sshPub)),
|
||||
sign: func(d []byte) ssh.Signature {
|
||||
return ssh.Signature{Format: ssh.KeyAlgoED25519, Blob: ed25519.Sign(priv, d)}
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
// allowed builds a pinned AllowedSigner for this key with the given id+role.
|
||||
func (s testSigner) allowed(t *testing.T, keyID string, role authz.KeyRole) authz.AllowedSigner {
|
||||
t.Helper()
|
||||
as, err := authz.NewAllowedSigner(keyID, role, s.line)
|
||||
if err != nil {
|
||||
t.Fatalf("NewAllowedSigner: %v", err)
|
||||
}
|
||||
return as
|
||||
}
|
||||
|
||||
// mint builds a SignedOp (canonical blob + armored sig) for this signer.
|
||||
func (s testSigner) mint(op, hostID, guestID, keyID, nonce, paramsJSON string, issued, expires time.Time) *SignedOp {
|
||||
blob := canonicalBlob(op, hostID, guestID, keyID, nonce, paramsJSON, issued, expires)
|
||||
sig := mintArmor(s.pub.Marshal(), authz.Namespace, blob, s.sign)
|
||||
return &SignedOp{Blob: blob, Sig: sig}
|
||||
}
|
||||
@@ -37,6 +37,13 @@ type Action struct {
|
||||
// the MiB unit Proxmox's LXC `memory` config field uses.
|
||||
const bytesPerMiB = 1024 * 1024
|
||||
|
||||
// desiredMemoryMiB canonicalizes a desired byte count to the integer MiB that
|
||||
// Proxmox's `memory` field stores and reports. Floor division is deliberate and
|
||||
// convergent: the value returned here is exactly the value written via SetConfig, so a
|
||||
// subsequent read returns the same MiB and the comparison settles (see Plan's memory
|
||||
// note). The actual side (a.MemoryMiB) is already MiB from GuestConfig.
|
||||
func desiredMemoryMiB(bytes int64) int64 { return bytes / bytesPerMiB }
|
||||
|
||||
// Plan computes the minimal benign action set converging actual → desired. It is a
|
||||
// pure function (deterministic, side-effect-free) so it is exhaustively fixture-test
|
||||
// -able. Actions are returned sorted by vmid, then config-before-runstate per guest.
|
||||
@@ -81,7 +88,14 @@ func Plan(desired DesiredState, actual ActualState, norm FieldNormalizers) []Act
|
||||
params["cores"] = strconv.Itoa(d.Spec.Cores)
|
||||
reasons = append(reasons, fmt.Sprintf("cores %d->%d", a.Cores, d.Spec.Cores))
|
||||
}
|
||||
if want := d.Spec.MemoryBytes / bytesPerMiB; want != a.MemoryMiB {
|
||||
// Memory is canonicalized to MiB on BOTH sides before comparison — the
|
||||
// numeric cousin of the description-newline normalization (string
|
||||
// normalizers cover string fields; this is the integer one). We compare
|
||||
// the SAME MiB value we then write, so a non-MiB-aligned desired
|
||||
// converges in one pass (write `want` MiB → PVE stores `want` MiB → next
|
||||
// read a.MemoryMiB == want → no further action), never perpetual drift.
|
||||
// Slice 10 should still serve MiB-aligned MemoryBytes at the source.
|
||||
if want := desiredMemoryMiB(d.Spec.MemoryBytes); want != a.MemoryMiB {
|
||||
params["memory"] = strconv.FormatInt(want, 10)
|
||||
reasons = append(reasons, fmt.Sprintf("memory %dMiB->%dMiB", a.MemoryMiB, want))
|
||||
}
|
||||
|
||||
@@ -73,6 +73,23 @@ func TestPlan_SpecDrift(t *testing.T) {
|
||||
mustActions(t, got)
|
||||
}
|
||||
|
||||
func TestPlan_MemoryNonAlignedConverges(t *testing.T) {
|
||||
// Note-2 guard: a desired MemoryBytes that is NOT a clean MiB multiple must not
|
||||
// cause perpetual drift. We compare in MiB and write the SAME MiB we compared, so it
|
||||
// settles in one pass.
|
||||
desiredBytes := int64(2049)*bytesPerMiB + 500000 // 2049 MiB + change → floors to 2049
|
||||
d := desired(DesiredGuest{VMID: 100, Spec: &hub.GuestSpec{Cores: 2, MemoryBytes: desiredBytes}})
|
||||
|
||||
// First pass: actual is 2048 MiB → one SetConfig memory=2049.
|
||||
got := Plan(d, actual(ActualGuest{VMID: 100, Run: RunStopped, SpecKnown: true, Cores: 2, MemoryMiB: 2048}), nil)
|
||||
mustActions(t, got, Action{VMID: 100, Kind: ActionSetConfig, Params: map[string]string{"memory": "2049"}})
|
||||
|
||||
// Apply it: actual becomes 2049 MiB. Re-plan against the SAME desired → no action.
|
||||
if got2 := Plan(d, actual(ActualGuest{VMID: 100, Run: RunStopped, SpecKnown: true, Cores: 2, MemoryMiB: 2049}), nil); len(got2) != 0 {
|
||||
t.Fatalf("non-MiB-aligned memory did not converge (perpetual drift): %+v", got2)
|
||||
}
|
||||
}
|
||||
|
||||
func TestPlan_DiskNotReconciled(t *testing.T) {
|
||||
// DiskBytes differs but is intentionally not reconciled (pct resize, later slice).
|
||||
got := Plan(
|
||||
|
||||
@@ -0,0 +1,97 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"context"
|
||||
"time"
|
||||
)
|
||||
|
||||
// Recover consumes the journal's in-flight set at startup: resume-or-rollback for any
|
||||
// op that was mid-execution when the agent crashed (doc 03 §10). This MUST run before
|
||||
// the engine begins issuing new mutations.
|
||||
//
|
||||
// Why it is load-bearing for signed destructive ops (and why it lands with the gate):
|
||||
// the idempotency-key store dedupes a COMPLETED op, but an op that crashed AFTER the
|
||||
// Proxmox POST and BEFORE its terminal record (OpTaskRunning) is not covered by that —
|
||||
// its nonce is already consumed, so a redelivery is rejected as a replay, yet it never
|
||||
// reached a terminal state. Only this startup consumer can resolve it: re-check the
|
||||
// Proxmox task and record the real outcome.
|
||||
//
|
||||
// Resolution per in-flight entry:
|
||||
// - has a task id (OpTaskRunning): re-read the task status once. Stopped → record the
|
||||
// real terminal state (OK → succeeded, else failed). Still running → leave it
|
||||
// in-flight (a later Recover or the task's own completion resolves it). Unreadable →
|
||||
// leave it (cannot safely decide).
|
||||
// - no task id (OpStarted only): the Proxmox POST was never confirmed, so the op
|
||||
// never took effect — record failed (fail-safe, the documented FileNonceStore
|
||||
// direction). A convergent reconcile op is simply re-issued next pass; a one-shot
|
||||
// op did NOT mark its idempotency key applied, so it is not falsely deduped.
|
||||
func (e *Engine) Recover(ctx context.Context) RecoverResult {
|
||||
var res RecoverResult
|
||||
if e.journal == nil {
|
||||
return res
|
||||
}
|
||||
for _, entry := range e.journal.InFlight() {
|
||||
res.Examined++
|
||||
if entry.UPID == "" {
|
||||
// POST never confirmed → abandon (fail-safe).
|
||||
e.append(terminal(entry, OpFailed))
|
||||
res.RolledBack++
|
||||
e.logger.Warn("recover: in-flight op had no task id; marked failed (fail-safe)",
|
||||
"op_id", entry.OpID, "vmid", entry.VMID, "kind", entry.Kind)
|
||||
continue
|
||||
}
|
||||
st, err := e.api.TaskStatusOnce(ctx, entry.UPID)
|
||||
if err != nil {
|
||||
res.Unresolved++
|
||||
e.logger.Warn("recover: cannot read in-flight task status; left in-flight",
|
||||
"op_id", entry.OpID, "upid", entry.UPID, "err", err)
|
||||
continue
|
||||
}
|
||||
if st.Running() {
|
||||
res.StillRunning++
|
||||
e.logger.Info("recover: in-flight task still running; left in-flight",
|
||||
"op_id", entry.OpID, "upid", entry.UPID)
|
||||
continue
|
||||
}
|
||||
// Stopped: record the real outcome.
|
||||
if st.OK() {
|
||||
e.append(terminal(entry, OpSucceeded))
|
||||
res.Resumed++
|
||||
e.logger.Info("recover: in-flight task completed OK; marked succeeded",
|
||||
"op_id", entry.OpID, "upid", entry.UPID)
|
||||
} else {
|
||||
e.append(terminal(entry, OpFailed))
|
||||
res.Failed++
|
||||
e.logger.Warn("recover: in-flight task ended non-OK; marked failed",
|
||||
"op_id", entry.OpID, "upid", entry.UPID, "exitstatus", st.ExitStatus)
|
||||
}
|
||||
}
|
||||
if res.Examined > 0 {
|
||||
e.logger.Info("recover: in-flight journal reconciled", "result", res)
|
||||
}
|
||||
return res
|
||||
}
|
||||
|
||||
// RecoverResult summarizes a startup recovery pass.
|
||||
type RecoverResult struct {
|
||||
Examined int
|
||||
Resumed int // task found completed OK and recorded succeeded
|
||||
Failed int // task found ended non-OK and recorded failed
|
||||
RolledBack int // no task id → abandoned (fail-safe)
|
||||
StillRunning int // task still executing → left in-flight
|
||||
Unresolved int // task status unreadable → left in-flight
|
||||
}
|
||||
|
||||
// terminal builds a terminal journal record preserving the op's identity, with the
|
||||
// idempotency key carried through so a SUCCEEDED one-shot op marks its key applied.
|
||||
func terminal(e JournalEntry, state OpState) JournalEntry {
|
||||
return JournalEntry{
|
||||
OpID: e.OpID,
|
||||
VMID: e.VMID,
|
||||
Kind: e.Kind,
|
||||
UPID: e.UPID,
|
||||
State: state,
|
||||
IdempKey: e.IdempKey,
|
||||
At: time.Now().UTC(),
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,103 @@
|
||||
package reconcile
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"gitea.dooplex.hu/admin/felhom-agent/internal/proxmox"
|
||||
)
|
||||
|
||||
func seedInFlight(t *testing.T, j *Journal, e JournalEntry) {
|
||||
t.Helper()
|
||||
e.State = OpTaskRunning
|
||||
if e.At.IsZero() {
|
||||
e.At = time.Now().UTC()
|
||||
}
|
||||
if err := j.Append(e); err != nil {
|
||||
t.Fatalf("seed: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_TaskCompletedOKMarksSucceeded(t *testing.T) {
|
||||
api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) {
|
||||
return proxmox.TaskStatus{Status: "stopped", ExitStatus: "OK"}, nil
|
||||
}}
|
||||
e, j, _ := newEngine(t, api, EmptyProvider{})
|
||||
seedInFlight(t, j, JournalEntry{OpID: "op1", VMID: 100, Kind: "set_config", UPID: "UPID:x:", IdempKey: "k1"})
|
||||
|
||||
res := e.Recover(context.Background())
|
||||
if res.Examined != 1 || res.Resumed != 1 {
|
||||
t.Fatalf("want 1 resumed, got %+v", res)
|
||||
}
|
||||
if len(j.InFlight()) != 0 {
|
||||
t.Errorf("resolved op should not be in-flight: %+v", j.InFlight())
|
||||
}
|
||||
// A resumed one-shot op marks its idempotency key applied (it really completed) —
|
||||
// this is the case idempotency-alone could not cover (Note 1).
|
||||
if !j.AlreadyApplied("k1") {
|
||||
t.Error("a recovered-succeeded op must mark its idempotency key applied")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_TaskEndedNonOKMarksFailed(t *testing.T) {
|
||||
api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) {
|
||||
return proxmox.TaskStatus{Status: "stopped", ExitStatus: "got 403"}, nil
|
||||
}}
|
||||
e, j, _ := newEngine(t, api, EmptyProvider{})
|
||||
seedInFlight(t, j, JournalEntry{OpID: "op2", VMID: 100, Kind: "guest_destroy", UPID: "UPID:x:", IdempKey: "k2"})
|
||||
|
||||
res := e.Recover(context.Background())
|
||||
if res.Failed != 1 {
|
||||
t.Fatalf("want 1 failed, got %+v", res)
|
||||
}
|
||||
if j.AlreadyApplied("k2") {
|
||||
t.Error("a failed op must NOT mark its key applied (it may be retried)")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_TaskStillRunningLeftInFlight(t *testing.T) {
|
||||
api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) {
|
||||
return proxmox.TaskStatus{Status: "running"}, nil
|
||||
}}
|
||||
e, j, _ := newEngine(t, api, EmptyProvider{})
|
||||
seedInFlight(t, j, JournalEntry{OpID: "op3", VMID: 100, Kind: "set_config", UPID: "UPID:x:"})
|
||||
|
||||
res := e.Recover(context.Background())
|
||||
if res.StillRunning != 1 || len(j.InFlight()) != 1 {
|
||||
t.Fatalf("still-running task must be left in-flight, got res=%+v inflight=%d", res, len(j.InFlight()))
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_NoTaskIDRolledBack(t *testing.T) {
|
||||
// OpStarted with no UPID: the POST was never confirmed → abandon (fail-safe).
|
||||
e, j, _ := newEngine(t, &fakeAPI{}, EmptyProvider{})
|
||||
if err := j.Append(JournalEntry{OpID: "op4", VMID: 100, Kind: "start", State: OpStarted, At: time.Now().UTC()}); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
res := e.Recover(context.Background())
|
||||
if res.RolledBack != 1 || len(j.InFlight()) != 0 {
|
||||
t.Fatalf("no-task op must be rolled back, got res=%+v inflight=%d", res, len(j.InFlight()))
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_UnreadableStatusLeftInFlight(t *testing.T) {
|
||||
api := &fakeAPI{statusFunc: func(string) (proxmox.TaskStatus, error) {
|
||||
return proxmox.TaskStatus{}, errors.New("api unreachable")
|
||||
}}
|
||||
e, j, _ := newEngine(t, api, EmptyProvider{})
|
||||
seedInFlight(t, j, JournalEntry{OpID: "op5", VMID: 100, Kind: "set_config", UPID: "UPID:x:"})
|
||||
|
||||
res := e.Recover(context.Background())
|
||||
if res.Unresolved != 1 || len(j.InFlight()) != 1 {
|
||||
t.Fatalf("unreadable status must leave op in-flight, got res=%+v inflight=%d", res, len(j.InFlight()))
|
||||
}
|
||||
}
|
||||
|
||||
func TestRecover_EmptyJournalNoop(t *testing.T) {
|
||||
e, _, _ := newEngine(t, &fakeAPI{}, EmptyProvider{})
|
||||
if res := e.Recover(context.Background()); res.Examined != 0 {
|
||||
t.Errorf("empty journal recover should be a no-op, got %+v", res)
|
||||
}
|
||||
}
|
||||
@@ -112,6 +112,9 @@ type GuestAPI interface {
|
||||
Stop(ctx context.Context, vmid int) (string, error)
|
||||
SetConfig(ctx context.Context, vmid int, params map[string]string) (string, error)
|
||||
WaitTask(ctx context.Context, upid string, opts proxmox.WaitOptions) (proxmox.TaskStatus, error)
|
||||
// TaskStatusOnce is a single non-blocking task-status read — used by crash
|
||||
// recovery to learn the outcome of an op that was in flight when the agent died.
|
||||
TaskStatusOnce(ctx context.Context, upid string) (proxmox.TaskStatus, error)
|
||||
}
|
||||
|
||||
// guestDescription decodes the (string-valued) `description` key from a GuestConfig's
|
||||
|
||||
Reference in New Issue
Block a user