feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)

internal/authz: production form of the Phase-4 SSHSIG signing primitive.

- Verifier.New/Verify with the LOCKED pipeline (namespace → allow-list by key
  material → crypto over RAW bytes → target → time → nonce LAST); each post-crypto
  stage rejects even with a valid sig; an invalid sig never burns a nonce.
- SSHSIG framing via x/crypto/ssh (no hand-rolled crypto); key-type-agnostic
  (ed25519 / sk-ssh-ed25519 / rsa / ecdsa via pub.Verify). Fixed namespace
  felhom-op-v1. Typed errors. OpBlob (fixed host_id/guest_id tags) + VerifiedOp.
- NonceStore: MemoryNonceStore + durable crash-safe FileNonceStore (fsync'd append
  log, replay-on-open, compaction, expiry-only pruning; survives restart).
- config.AuthzConfig (nonce path + pinned operational/recovery signer keys).
- Tests (14): real ssh-keygen fixture, per-stage rejection, nonce-not-burned,
  replay, persistence-across-restart, synthetic sk, byte-exactness.

Dep: golang.org/x/crypto v0.52.0 (declares go 1.25 — the Phase-4 doc's "Go 1.24.4 /
x/crypto v0.52.0" pairing doesn't build; build server upgraded to go1.26.0,
backward-compatible). Version 0.1.0 -> 0.2.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:23:02 +02:00
parent 43b7e96905
commit f0fee7e193
19 changed files with 1231 additions and 41 deletions
+60 -37
@@ -3,51 +3,74 @@
> This file holds the report for the **most recent** change, fully overwritten each task.
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
## Task: Agent scaffold + `proxmox` interaction package (slice 1) — v0.1.0
## Task: `authz` signed-op verifier (slice 2) — v0.2.0
Stood up the host-agent project and its foundation — the typed `proxmox` interaction
layer every other agent module will call — with a runnable read-only `--selftest`.
Pushed to `main` (main-only repo). Build/vet/test green; verified live against the demo host.
Turned the Phase-4 reference `VerifySignedOp` into a production package
(`internal/authz`): a key-type-agnostic SSHSIG verifier for operator-signed destructive
ops, the full anti-replay/authorization pipeline, and a durable, crash-safe nonce store.
This is what slice 4 (reconcile) calls to gate destructive desired-state deltas. Pushed to
`main`. Build/vet/test green locally (Go 1.26) and on the build server.
### Public surface
### Public surface (`internal/authz`)
- **`Verifier`** — `New(signers []AllowedSigner, store NonceStore, hostID string) *Verifier`;
`Verify(blob, sigArmored []byte) (*VerifiedOp, error)`. Optional `ClockSkew` (default 2m,
not-yet-valid only) and `Logger` (advisory key_id-mismatch warning).
- **`OpBlob`** — canonical signed object; `Target{HostID,GuestID}` with corrected
`host_id`/`guest_id` json tags; `Params json.RawMessage`, `Nonce`, `IssuedAt`, `ExpiresAt`, `KeyID`.
- **`VerifiedOp`** — `Op, HostID, GuestID, Params, Nonce, IssuedAt, ExpiresAt, KeyID (advisory),
Signer (matched), KeyIDMatchesSigner`.
- **`AllowedSigner`** + `NewAllowedSigner(keyID, role, authorizedKeyLine)`; roles
`RoleOperational` / `RoleRecovery` (doc 04 two-key model; role-scoping enforced by the caller).
- **`NonceStore`** interface + `MemoryNonceStore` (tests) and **`FileNonceStore`** (durable).
- **Typed errors**: `ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget,
ErrExpired, ErrNotYetValid, ErrReplay` (errors.Is-friendly).
- **Config**: `config.AuthzConfig` (nonce-store path + pinned `Signers`).
**`proxmox.Client`** (API backend):
- Read: `Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`, `ListStorage`, `NodeStorage`, `StorageContent`
- Async mutating (return a UPID): `RestoreLXC` (primary create path), `Vzdump`, `Snapshot`, `Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`
- Tasks: `WaitTask(ctx, upid, WaitOptions)`, `TaskStatusOnce`, `TaskLogTail`
- Errors: `*APIError` (parses the offending privilege from a 403), `*TaskError` (parses it from a failed task `exitstatus` + log tail)
- Types: `Version, Node, NodeStatus, Guest, GuestConfig (+Extra/MountPoints/Nets), Storage, StorageContent, TaskStatus, UPID`
### Locked pipeline (order load-bearing)
`parse armor → namespace (fixed felhom-op-v1) → parse pubkey → allow-list by key MATERIAL (not
key_id) → crypto verify over RAW received bytes → parse blob → target (host strict, guest
surfaced) → time window → nonce recorded LAST`. Each post-crypto stage rejects even with a
valid signature; an invalid signature can never consume a nonce.
**`proxmox.Privileged`** (fenced root-CLI; `Runner` iface, `ExecRunner` direct/`sudo -n`): `CreateGoldenLXC` (keyctl), `MountUSBByUUID`, `SMART`, `Sensors` — each documents *why it can't be the API*.
### Durable nonce store — mechanism & guarantee
fsync'd append-only JSONL log + in-memory index (replayed on open) + periodic compaction.
- **Crash-safe**: a nonce is written and `fsync`'d before `SeenOrRecord` returns `false`, so the
caller acts only *after* the durable record. A crash between verify and execute drops the op
(fail-safe) and never enables a replay. I/O failure → returns seen=true (op not executed).
- **Survives restart**: the log is replayed into the index on `OpenFileNonceStore`.
- **Pruning**: expired nonces are dropped only at compaction, never before their expiry. An expired op
is rejected by the time check before the nonce check is reached, so pruning is housekeeping, not a hole.
- **Concurrency-safe**: single mutex over file handle + index.
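A rough sketch of the mechanism described above, under stated assumptions: the `record` layout, the `openStore` constructor, and the exact `SeenOrRecord(nonce, exp) bool` signature are illustrative guesses, not the package's real API, and compaction is omitted. It shows the write-then-fsync-then-index order and the replay on open.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// record is one JSONL line in the append-only log (hypothetical layout).
type record struct {
	Nonce string    `json:"nonce"`
	Exp   time.Time `json:"exp"`
}

type fileNonceStore struct {
	mu   sync.Mutex // single mutex over file handle + index
	f    *os.File
	seen map[string]time.Time
}

// openStore opens (or creates) the log and replays it into the in-memory index.
func openStore(path string) (*fileNonceStore, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	s := &fileNonceStore{f: f, seen: map[string]time.Time{}}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var r record
		if json.Unmarshal(sc.Bytes(), &r) == nil { // skip an unparsable (torn) line
			s.seen[r.Nonce] = r.Exp
		}
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, err
	}
	// If a crash left a torn final line, terminate it so the next append
	// starts on a fresh line.
	if fi, err := f.Stat(); err == nil && fi.Size() > 0 {
		last := make([]byte, 1)
		if _, err := f.ReadAt(last, fi.Size()-1); err == nil && last[0] != '\n' {
			if _, err := f.Write([]byte("\n")); err != nil {
				f.Close()
				return nil, err
			}
		}
	}
	return s, nil
}

// SeenOrRecord returns false only after the nonce is durably on disk.
// Any I/O failure reports seen=true so the caller does not execute the op.
func (s *fileNonceStore) SeenOrRecord(nonce string, exp time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[nonce]; ok {
		return true
	}
	line, err := json.Marshal(record{Nonce: nonce, Exp: exp})
	if err != nil {
		return true
	}
	if _, err := s.f.Write(append(line, '\n')); err != nil {
		return true
	}
	if err := s.f.Sync(); err != nil { // fsync before acknowledging
		return true
	}
	s.seen[nonce] = exp
	return false
}

// run records a nonce, repeats it, then reopens the log to simulate a restart.
func run() (fresh, repeat, afterRestart bool, err error) {
	dir, err := os.MkdirTemp("", "nonce-sketch")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nonces.jsonl")
	exp := time.Now().Add(time.Hour)

	s, err := openStore(path)
	if err != nil {
		return
	}
	fresh = s.SeenOrRecord("n1", exp)
	repeat = s.SeenOrRecord("n1", exp)
	s.f.Close()

	s, err = openStore(path) // "restart": rebuild the index from the log
	if err != nil {
		return
	}
	defer s.f.Close()
	afterRestart = s.SeenOrRecord("n1", exp)
	return
}

func main() {
	fmt.Println(run())
}
```

A production store would also fsync the containing directory on creation and during compaction; that is left out here to keep the sketch short.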
### API-vs-root routing table
### OPEN choices
- **Clock skew**: 2-minute tolerance on *not-yet-valid* only; expiry not extended (window stays an
honest bound).
- **Durable mechanism**: fsync'd append log + compaction (simple, honest, no embedded-KV dep).
- **Fixtures**: committed real `ssh-keygen -Y sign` vector (hermetic + proves OpenSSH interop) +
in-Go minting for rejection cases; the sk case is synthetic (spec-faithful, no hardware).
- **Package name**: `authz` (control-plane-authorization layer, matches doc 04).
| Backend | Ops | Why |
|---|---|---|
| **API** | node status, list/status/config guests, storage list+content, task status/log, **restore**, vzdump, snapshot/rollback/delete-snap, set-config, start/stop | FelhomAgent 16-priv token |
| **root-CLI (fenced)** | golden `pct create` (keyctl=1), USB mount-by-UUID/fstab, SMART/sensors | keyctl is `root@pam`-only; host mounts + SMART aren't API ops |
### Test matrix (all pass — 14 tests)
Real ssh-keygen fixture · happy path · per-stage rejection {namespace, unknown-signer, tampered,
retargeted-host, expired, not-yet-valid, replay} · **invalid-sig-does-NOT-burn-nonce** (then the
valid op with that nonce still succeeds) · replay-rejected-across-restart (durable store) ·
key-type-agnostic synthetic **sk-ssh-ed25519** · byte-exactness (re-serialized blob fails crypto).
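The byte-exactness case in the matrix can be reproduced in miniature. The sketch below assumes plain `crypto/ed25519` in place of SSHSIG and a hypothetical two-field `blob`; it shows that a signature made over the received bytes stops verifying once the JSON is decoded and re-encoded, even though the decoded values are identical.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"fmt"
)

// blob is a hypothetical stand-in for the signed object.
type blob struct {
	Op    string `json:"op"`
	Nonce string `json:"nonce"`
}

// run signs raw bytes, then verifies both the raw and the re-serialized form.
func run() (rawOK, reserializedOK bool, err error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return
	}
	// The signer's encoder used a space after each separator.
	raw := []byte(`{"op": "destroy", "nonce": "n1"}`)
	sig := ed25519.Sign(priv, raw)

	var b blob
	if err = json.Unmarshal(raw, &b); err != nil {
		return
	}
	again, err := json.Marshal(b) // compact form: different bytes, same values
	if err != nil {
		return
	}
	rawOK = ed25519.Verify(pub, raw, sig)
	reserializedOK = ed25519.Verify(pub, again, sig)
	return
}

func main() {
	fmt.Println(run())
}
```

This is why the verifier has to check the signature over the bytes as received and parse them only afterwards.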
Fence is **structural** (`Client` has no runner, `Privileged` has no HTTP client) and asserted in `routing_test.go`.
### OPEN-item choices
- **Config:** JSON file + `FELHOM_AGENT_*` env overrides (stdlib, zero-dep; swappable to `yaml.v3` if YAML house-style is preferred). Token never logged (`Redacted()`).
- **Privileged runner / uid:** `Runner` iface; `ExecRunner{Mode: sudo|direct}`, default `sudo -n`. Proposed (not finalized): non-root service user + narrow sudoers allowlist for the 3 fenced commands.
- **Polling:** first poll immediate, then 1s → exponential backoff capped 5s, default total timeout 10m; honors ctx cancellation. Tunable via `WaitOptions`.
- **`--selftest=task`:** included (gated behind the flag + `-vmid`). Unit-tested via mocks; not run live (the live token was read-only).
- **Versioning:** `version` var in `main.go` (default `0.1.0`, `-ldflags -X main.version=`), `--version` flag.
### What the live host revealed (recorded, not guessed)
- Node name is **`demo-felhom`**; `felhom-pve` is only the SSH alias.
- `/nodes/{node}/status`: `cpu` is a 0..1 fraction, **`loadavg` is an array of strings**; `memory`/`rootfs`/`swap` nested.
- `vmid` is an **integer** in list/status; `status/current` carries no `vmid` (set from the path arg).
- Task: `status` ∈ {running, stopped}, `exitstatus` only once stopped; task log is `[{"n":N,"t":"…"}]`. UPID = `UPID:node:pid(hex):pstart(hex):starttime(hex):worker:id:user:`.
- `pveum user token add … --output-format json` returns `{"value":"…"}`.
- **No spike fact failed in practice** — 16-priv role, async/UPID model, keyctl boundary, dual-grant privsep all held. Teardown logged `ignore invalid acl token …`, confirming ACL auto-invalidation (phase1-2 §5).
### Corrections to the Phase-4 §7 reference (for production)
- `Target` needed `host_id`/`guest_id` json tags — fixed.
- **The doc's "Go 1.24.4 / x/crypto v0.52.0" does not hold**: x/crypto v0.52.0 declares
`go 1.25.0` and won't build on Go 1.24. Resolved by upgrading the build server to **go1.26.0**
(backward-compatible — felhom-controller/hub build unchanged; distro Go package left intact,
upstream Go fronted on PATH).
- Free function → constructed `Verifier`; returns full `VerifiedOp`; typed errors; clock-skew;
durable nonce store (the net-new engineering).
- **Shared-contract flag (not built)**: the hub and `felhom-sign` CLI must produce byte-identical
canonical JSON or signatures won't verify; a shared canonicalizer both import is the right home.
### Verification
- `go build/vet/test` green twice: locally (Go 1.26) and on the build server (Go 1.24.4).
- **Live read-only `--selftest`** (built on 192.168.0.180, against `https://192.168.0.162:8006`, **TLS fingerprint-pinned** — no insecure mode): version, nodes, node status, guests, storage all `[ ok ]`. slog confirmed the token rendered as `…=********`. Throwaway token created + torn down.
- Mutating ops + live `WaitTask` are unit-tested only (live run used a read-only token); `--selftest=task` is ready to exercise them against a real `FelhomAgent` token.
- `go build/vet/test` green locally (go1.26.0) and on the build server (upgraded to go1.26.0).
- Real OpenSSH `ssh-keygen` (OpenSSH 10.0p2) minted the committed fixture and self-verified it
before commit.
### Repo state
- Branch: `main` only (feature branch merged + deleted, local & remote). Latest: `chore(agent): add CHANGELOG, version the agent at 0.1.0`.
- Branch: `main` only. Dep: `golang.org/x/crypto v0.52.0` (+ `x/sys` indirect); `go 1.25.0`.