feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)
internal/authz: production form of the Phase-4 SSHSIG signing primitive.

- Verifier.New/Verify with the LOCKED pipeline (namespace → allow-list by key
  material → crypto over RAW bytes → target → time → nonce LAST); each
  post-crypto stage rejects even with a valid sig; an invalid sig never burns
  a nonce.
- SSHSIG framing via x/crypto/ssh (no hand-rolled crypto); key-type-agnostic
  (ed25519 / sk-ssh-ed25519 / rsa / ecdsa via pub.Verify). Fixed namespace
  felhom-op-v1. Typed errors. OpBlob (fixed host_id/guest_id tags) + VerifiedOp.
- NonceStore: MemoryNonceStore + durable crash-safe FileNonceStore (fsync'd
  append log, replay-on-open, compaction, expiry-only pruning; survives restart).
- config.AuthzConfig (nonce path + pinned operational/recovery signer keys).
- Tests (14): real ssh-keygen fixture, per-stage rejection, nonce-not-burned,
  replay, persistence-across-restart, synthetic sk, byte-exactness.

Dep: golang.org/x/crypto v0.52.0 (declares go 1.25; the Phase-4 doc's
"Go 1.24.4 / x/crypto v0.52.0" pairing doesn't build; build server upgraded to
go1.26.0, backward-compatible).
Version 0.1.0 -> 0.2.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -3,51 +3,74 @@
> This file holds the report for the **most recent** change, fully overwritten each task.
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).

## Task: Agent scaffold + `proxmox` interaction package (slice 1), v0.1.0
## Task: `authz` signed-op verifier (slice 2), v0.2.0

Stood up the host-agent project and its foundation (the typed `proxmox` interaction
layer every other agent module will call) with a runnable read-only `--selftest`.
Pushed to `main` (main-only repo). Build/vet/test green; verified live against the demo host.
Turned the Phase-4 reference `VerifySignedOp` into a production package
(`internal/authz`): a key-type-agnostic SSHSIG verifier for operator-signed destructive
ops, the full anti-replay/authorization pipeline, and a durable, crash-safe nonce store.
This is what slice 4 (reconcile) calls to gate destructive desired-state deltas. Pushed to
`main`. Build/vet/test green locally (Go 1.26) and on the build server.

### Public surface
### Public surface (`internal/authz`)
- **`Verifier`**: `New(signers []AllowedSigner, store NonceStore, hostID string) *Verifier`;
`Verify(blob, sigArmored []byte) (*VerifiedOp, error)`. Optional `ClockSkew` (default 2m,
not-yet-valid only) and `Logger` (advisory key_id-mismatch warning).
- **`OpBlob`**: canonical signed object; `Target{HostID,GuestID}` with corrected
`host_id`/`guest_id` json tags; `Params json.RawMessage`, `Nonce`, `IssuedAt`, `ExpiresAt`, `KeyID`.
- **`VerifiedOp`**: `Op, HostID, GuestID, Params, Nonce, IssuedAt, ExpiresAt, KeyID (advisory),
Signer (matched), KeyIDMatchesSigner`.
- **`AllowedSigner`** + `NewAllowedSigner(keyID, role, authorizedKeyLine)`; roles
`RoleOperational` / `RoleRecovery` (doc 04 two-key model; role-scoping enforced by the caller).
- **`NonceStore`** interface + `MemoryNonceStore` (tests) and **`FileNonceStore`** (durable).
- **Typed errors**: `ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget,
ErrExpired, ErrNotYetValid, ErrReplay` (errors.Is-friendly).
- **Config**: `config.AuthzConfig` (nonce-store path + pinned `Signers`).

**`proxmox.Client`** (API backend):
- Read: `Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`, `ListStorage`, `NodeStorage`, `StorageContent`
- Async mutating (return a UPID): `RestoreLXC` (primary create path), `Vzdump`, `Snapshot`, `Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`
- Tasks: `WaitTask(ctx, upid, WaitOptions)`, `TaskStatusOnce`, `TaskLogTail`
- Errors: `*APIError` (parses the offending privilege from a 403), `*TaskError` (parses it from a failed task `exitstatus` + log tail)
- Types: `Version, Node, NodeStatus, Guest, GuestConfig (+Extra/MountPoints/Nets), Storage, StorageContent, TaskStatus, UPID`
### Locked pipeline (order load-bearing)
`parse armor → namespace (fixed felhom-op-v1) → parse pubkey → allow-list by key MATERIAL (not
key_id) → crypto verify over RAW received bytes → parse blob → target (host strict, guest
surfaced) → time window → nonce recorded LAST`. Each post-crypto stage rejects even with a
valid signature; an invalid signature can never consume a nonce.
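
The ordering can be illustrated with a reduced, stdlib-only sketch. It is not the `internal/authz` code: plain `crypto/ed25519` stands in for SSHSIG framing, the armor, namespace and not-yet-valid stages are left out, and `blob`, `memStore`, `verify` and `demo` are hypothetical names. It only aims to show that the nonce is touched after every other check has passed.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Sentinel errors named after the typed errors listed in the report.
var (
	ErrUnknownSigner = errors.New("unknown signer")
	ErrBadSignature  = errors.New("bad signature")
	ErrMalformed     = errors.New("malformed blob")
	ErrTarget        = errors.New("wrong target host")
	ErrExpired       = errors.New("expired")
	ErrReplay        = errors.New("replayed nonce")
)

// blob is a reduced, hypothetical stand-in for OpBlob.
type blob struct {
	HostID    string    `json:"host_id"`
	Nonce     string    `json:"nonce"`
	ExpiresAt time.Time `json:"expires_at"`
}

// memStore is a toy nonce store: it reports whether n was already present
// and records it if it was not.
type memStore map[string]bool

func (m memStore) seenOrRecord(n string) bool {
	seen := m[n]
	m[n] = true
	return seen
}

// verify runs the stages in the locked order; the nonce is recorded last.
func verify(allowed []ed25519.PublicKey, pub ed25519.PublicKey, raw, sig []byte,
	hostID string, now time.Time, store memStore) error {
	known := false
	for _, k := range allowed { // allow-list by key material
		known = known || k.Equal(pub)
	}
	if !known {
		return ErrUnknownSigner
	}
	if !ed25519.Verify(pub, raw, sig) { // crypto over the raw received bytes
		return ErrBadSignature
	}
	var b blob
	if err := json.Unmarshal(raw, &b); err != nil { // parse only after crypto
		return ErrMalformed
	}
	if b.HostID != hostID {
		return ErrTarget
	}
	if !now.Before(b.ExpiresAt) {
		return ErrExpired
	}
	if store.seenOrRecord(b.Nonce) { // last: nothing above can burn a nonce
		return ErrReplay
	}
	return nil
}

// demo submits a tampered signature, then the valid op, then a replay of it.
func demo() (tampered, first, replay error) {
	pub, priv, _ := ed25519.GenerateKey(nil) // nil selects crypto/rand
	now := time.Now()
	raw, _ := json.Marshal(blob{HostID: "host-a", Nonce: "n-1", ExpiresAt: now.Add(time.Minute)})
	sig := ed25519.Sign(priv, raw)
	bad := append([]byte(nil), sig...)
	bad[0] ^= 0xff
	store, allowed := memStore{}, []ed25519.PublicKey{pub}
	tampered = verify(allowed, pub, raw, bad, "host-a", now, store)
	first = verify(allowed, pub, raw, sig, "host-a", now, store)
	replay = verify(allowed, pub, raw, sig, "host-a", now, store)
	return
}

func main() {
	tampered, first, replay := demo()
	fmt.Println(errors.Is(tampered, ErrBadSignature), first == nil, errors.Is(replay, ErrReplay))
}
```

In this sketch the tampered submission fails at the crypto stage, the valid op with the same nonce then succeeds, and only the third submission is a replay.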

**`proxmox.Privileged`** (fenced root-CLI; `Runner` iface, `ExecRunner` direct/`sudo -n`): `CreateGoldenLXC` (keyctl), `MountUSBByUUID`, `SMART`, `Sensors`; each documents *why it can't be the API*.
### Durable nonce store: mechanism & guarantee
fsync'd append-only JSONL log + in-memory index (replayed on open) + periodic compaction.
- **Crash-safe**: a nonce is written and `fsync`'d before `SeenOrRecord` returns `false`, so the
caller acts only *after* the durable record. A crash between verify and execute drops the op
(fail-safe) and never enables a replay. I/O failure → returns seen=true (op not executed).
- **Survives restart**: the log is replayed into the index on `OpenFileNonceStore`.
- **Pruning**: expired nonces dropped only at compaction (never before exp), and an expired op
is rejected by the time check before the nonce check, so pruning is housekeeping, not a hole.
- **Concurrency-safe**: single mutex over file handle + index.
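
The write-then-fsync-then-answer rule can be sketched as follows. This is a minimal illustration under assumptions, not `FileNonceStore` itself: the record layout, the `fileStore` type and the `seenOrRecord(nonce, exp)` signature are hypothetical, and compaction and torn-final-line repair are omitted.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// record is one JSONL line; the field names are an assumption.
type record struct {
	Nonce string    `json:"nonce"`
	Exp   time.Time `json:"exp"`
}

type fileStore struct {
	mu    sync.Mutex // single mutex over file handle + index
	f     *os.File
	index map[string]time.Time
}

// openFileStore opens (or creates) the log and replays it into the index.
func openFileStore(path string) (*fileStore, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	s := &fileStore{f: f, index: map[string]time.Time{}}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var r record
		// Lines that do not parse are skipped rather than trusted.
		if json.Unmarshal(sc.Bytes(), &r) == nil && r.Nonce != "" {
			s.index[r.Nonce] = r.Exp
		}
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, err
	}
	return s, nil
}

// seenOrRecord returns false only after the nonce is durably on disk.
// Any I/O failure is reported as "seen", so the caller does not execute.
func (s *fileStore) seenOrRecord(nonce string, exp time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.index[nonce]; ok {
		return true
	}
	line, err := json.Marshal(record{Nonce: nonce, Exp: exp})
	if err != nil {
		return true
	}
	if _, err := s.f.Write(append(line, '\n')); err != nil {
		return true
	}
	if err := s.f.Sync(); err != nil {
		return true
	}
	s.index[nonce] = exp
	return false
}

// demo records a nonce, retries it, then retries again after a reopen.
func demo() (first, second, afterReopen bool, err error) {
	dir, err := os.MkdirTemp("", "nonce-sketch")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nonces.jsonl")
	exp := time.Now().Add(time.Hour)
	s, err := openFileStore(path)
	if err != nil {
		return
	}
	first = s.seenOrRecord("n-1", exp)
	second = s.seenOrRecord("n-1", exp)
	s.f.Close()
	s2, err := openFileStore(path)
	if err != nil {
		return
	}
	defer s2.f.Close()
	afterReopen = s2.seenOrRecord("n-1", exp)
	return
}

func main() {
	fmt.Println(demo())
}
```

The reopen stands in for a process restart: the replayed index already holds the nonce, so the third call reports it as seen without writing anything.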

### API-vs-root routing table
### OPEN choices
- **Clock skew**: 2-minute tolerance on *not-yet-valid* only; expiry not extended (window stays an
honest bound).
- **Durable mechanism**: fsync'd append log + compaction (simple, honest, no embedded-KV dep).
- **Fixtures**: committed real `ssh-keygen -Y sign` vector (hermetic + proves OpenSSH interop) +
in-Go minting for rejection cases; the sk case is synthetic (spec-faithful, no hardware).
- **Package name**: `authz` (control-plane-authorization layer, matches doc 04).
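
The asymmetric clock-skew choice reduces to a small comparison. The sketch below is one way to express it, assuming hypothetical names (`checkWindow`, `demo`) and assuming an op whose expiry equals the current instant counts as expired; the report does not state how the package treats that boundary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrExpired     = errors.New("expired")
	ErrNotYetValid = errors.New("not yet valid")
)

// checkWindow tolerates skew on the issued-at side only; the expiry bound
// is compared against the unadjusted clock.
func checkWindow(now, issuedAt, expiresAt time.Time, skew time.Duration) error {
	if issuedAt.After(now.Add(skew)) {
		return ErrNotYetValid
	}
	if !now.Before(expiresAt) {
		return ErrExpired
	}
	return nil
}

// demo evaluates three ops against a fixed clock and a 2-minute skew.
func demo() (withinSkew, tooEarly, justExpired error) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	skew := 2 * time.Minute
	// Issued 90s in the future: inside the tolerance.
	withinSkew = checkWindow(now, now.Add(90*time.Second), now.Add(10*time.Minute), skew)
	// Issued 3m in the future: outside the tolerance.
	tooEarly = checkWindow(now, now.Add(3*time.Minute), now.Add(10*time.Minute), skew)
	// Expired one second ago: skew does not rescue it.
	justExpired = checkWindow(now, now.Add(-10*time.Minute), now.Add(-time.Second), skew)
	return
}

func main() {
	fmt.Println(demo())
}
```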

| Backend | Ops | Why |
|---|---|---|
| **API** | node status, list/status/config guests, storage list+content, task status/log, **restore**, vzdump, snapshot/rollback/delete-snap, set-config, start/stop | FelhomAgent 16-priv token |
| **root-CLI (fenced)** | golden `pct create` (keyctl=1), USB mount-by-UUID/fstab, SMART/sensors | keyctl is `root@pam`-only; host mounts + SMART aren't API ops |
### Test matrix (all pass, 14 tests)
Real ssh-keygen fixture · happy path · per-stage rejection {namespace, unknown-signer, tampered,
retargeted-host, expired, not-yet-valid, replay} · **invalid-sig-does-NOT-burn-nonce** (then the
valid op with that nonce still succeeds) · replay-rejected-across-restart (durable store) ·
key-type-agnostic synthetic **sk-ssh-ed25519** · byte-exactness (re-serialized blob fails crypto).
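
The byte-exactness case is easy to reproduce in isolation. The following sketch uses plain `crypto/ed25519` as a stand-in for SSHSIG and an illustrative payload; it is not the project's test, and `demo` is a hypothetical name. It shows that a decode/encode round trip changes the bytes even though the JSON value is unchanged.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"encoding/json"
	"fmt"
)

// demo signs a raw blob, re-serializes it through a map, and verifies both.
func demo() (rawOK, reserializedOK, sameBytes bool) {
	pub, priv, _ := ed25519.GenerateKey(nil) // nil selects crypto/rand
	// Signed bytes contain spaces and put "op" before "nonce".
	raw := []byte(`{"op": "destroy", "nonce": "n-1"}`)
	sig := ed25519.Sign(priv, raw)

	var v map[string]any
	_ = json.Unmarshal(raw, &v)
	// Marshal sorts map keys and drops the spaces, so the bytes differ.
	again, _ := json.Marshal(v)

	return ed25519.Verify(pub, raw, sig),
		ed25519.Verify(pub, again, sig),
		bytes.Equal(raw, again)
}

func main() {
	fmt.Println(demo())
}
```

This is also why the signature has to be checked over the received bytes before the blob is parsed, and why every signer must emit the same canonical form.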

Fence is **structural** (`Client` has no runner, `Privileged` has no HTTP client) and asserted in `routing_test.go`.

### OPEN-item choices
- **Config:** JSON file + `FELHOM_AGENT_*` env overrides (stdlib, zero-dep; swappable to `yaml.v3` if YAML house-style is preferred). Token never logged (`Redacted()`).
- **Privileged runner / uid:** `Runner` iface; `ExecRunner{Mode: sudo|direct}`, default `sudo -n`. Proposed (not finalized): non-root service user + narrow sudoers allowlist for the 3 fenced commands.
- **Polling:** first poll immediate, then 1s → exponential backoff capped 5s, default total timeout 10m; honors ctx cancellation. Tunable via `WaitOptions`.
- **`--selftest=task`:** included (gated behind the flag + `-vmid`). Unit-tested via mocks; not run live (the live token was read-only).
- **Versioning:** `version` var in `main.go` (default `0.1.0`, `-ldflags -X main.version=`), `--version` flag.
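
The polling schedule in the list above can be sketched as a loop. This is an illustration under assumptions, not `WaitTask`: the `waitOptions` field names, the doubling factor and the `poll` callback shape are hypothetical. With `Initial: time.Second`, `Max: 5 * time.Second` and `Timeout: 10 * time.Minute` the delays would run 1s, 2s, 4s, 5s, 5s, and so on.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitOptions mirrors the tunables named in the report; field names are assumed.
type waitOptions struct {
	Initial time.Duration // delay after the first, immediate poll
	Max     time.Duration // backoff cap
	Timeout time.Duration // total budget
}

// nextDelay doubles d and clamps it to max.
func nextDelay(d, max time.Duration) time.Duration {
	d *= 2
	if d > max {
		d = max
	}
	return d
}

// wait polls immediately, then backs off until poll reports done, poll
// fails, the total timeout elapses, or the parent context is cancelled.
func wait(ctx context.Context, o waitOptions, poll func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, o.Timeout)
	defer cancel()
	delay := o.Initial
	for {
		done, err := poll(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		delay = nextDelay(delay, o.Max)
	}
}

// demo uses millisecond-scale options so it finishes quickly.
func demo() (polls int, okErr, timeoutErr error) {
	fast := waitOptions{Initial: time.Millisecond, Max: 5 * time.Millisecond, Timeout: time.Second}
	okErr = wait(context.Background(), fast, func(context.Context) (bool, error) {
		polls++
		return polls == 3, nil
	})
	short := waitOptions{Initial: time.Millisecond, Max: 5 * time.Millisecond, Timeout: 20 * time.Millisecond}
	timeoutErr = wait(context.Background(), short, func(context.Context) (bool, error) {
		return false, nil
	})
	return
}

func main() {
	polls, okErr, timeoutErr := demo()
	fmt.Println(polls, okErr, errors.Is(timeoutErr, context.DeadlineExceeded))
}
```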

### What the live host revealed (recorded, not guessed)
- Node name is **`demo-felhom`**; `felhom-pve` is only the SSH alias.
- `/nodes/{node}/status`: `cpu` is a 0..1 fraction, **`loadavg` is an array of strings**; `memory`/`rootfs`/`swap` nested.
- `vmid` is an **integer** in list/status; `status/current` carries no `vmid` (set from the path arg).
- Task: `status` ∈ {running, stopped}, `exitstatus` only once stopped; task log is `[{"n":N,"t":"…"}]`. UPID = `UPID:node:pid(hex):pstart(hex):starttime(hex):worker:id:user:`.
- `pveum user token add … --output-format json` returns `{"value":"…"}`.
- **No spike fact failed in practice**: 16-priv role, async/UPID model, keyctl boundary, dual-grant privsep all held. Teardown logged `ignore invalid acl token …`, confirming ACL auto-invalidation (phase1-2 §5).
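
The UPID layout recorded above lends itself to a small parser. The sketch below is a hypothetical helper, not the package's `UPID` type; it assumes no field contains a colon, and the `sample` value is made up in the documented shape rather than taken from the live host.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// upid holds the fields of a task identifier (illustrative struct).
type upid struct {
	Node, Worker, ID, User string
	PID, PStart, StartTime uint64
}

// parseUPID splits UPID:node:pid:pstart:starttime:worker:id:user: and
// decodes the three hexadecimal fields.
func parseUPID(s string) (upid, error) {
	p := strings.Split(s, ":")
	// The trailing colon leaves an empty ninth element.
	if len(p) != 9 || p[0] != "UPID" || p[8] != "" {
		return upid{}, fmt.Errorf("malformed UPID %q", s)
	}
	u := upid{Node: p[1], Worker: p[5], ID: p[6], User: p[7]}
	var err error
	if u.PID, err = strconv.ParseUint(p[2], 16, 64); err != nil {
		return upid{}, fmt.Errorf("pid: %w", err)
	}
	if u.PStart, err = strconv.ParseUint(p[3], 16, 64); err != nil {
		return upid{}, fmt.Errorf("pstart: %w", err)
	}
	if u.StartTime, err = strconv.ParseUint(p[4], 16, 64); err != nil {
		return upid{}, fmt.Errorf("starttime: %w", err)
	}
	return u, nil
}

// sample is a made-up value, not output captured from a real node.
const sample = "UPID:demo-felhom:0000C3A1:0012AB34:66A1B2C3:vzdump:101:root@pam:"

func main() {
	u, err := parseUPID(sample)
	fmt.Printf("%+v %v\n", u, err)
}
```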
### Corrections to the Phase-4 §7 reference (for production)
- `Target` needed `host_id`/`guest_id` json tags (fixed).
- **The doc's "Go 1.24.4 / x/crypto v0.52.0" does not hold**: x/crypto v0.52.0 declares
`go 1.25.0` and won't build on Go 1.24. Resolved by upgrading the build server to **go1.26.0**
(backward-compatible: felhom-controller/hub build unchanged; distro Go package left intact,
upstream Go fronted on PATH).
- Free function → constructed `Verifier`; returns full `VerifiedOp`; typed errors; clock-skew;
durable nonce store (the net-new engineering).
- **Shared-contract flag (not built)**: the hub and `felhom-sign` CLI must produce byte-identical
canonical JSON or signatures won't verify; a shared canonicalizer both import is the right home.

### Verification
- `go build/vet/test` green twice: locally (Go 1.26) and on the build server (Go 1.24.4).
- **Live read-only `--selftest`** (built on 192.168.0.180, against `https://192.168.0.162:8006`, **TLS fingerprint-pinned**, no insecure mode): version, nodes, node status, guests, storage all `[ ok ]`. slog confirmed the token rendered as `…=********`. Throwaway token created + torn down.
- Mutating ops + live `WaitTask` are unit-tested only (live run used a read-only token); `--selftest=task` is ready to exercise them against a real `FelhomAgent` token.
- `go build/vet/test` green locally (go1.26.0) and on the build server (upgraded to go1.26.0).
- Real OpenSSH `ssh-keygen` (OpenSSH 10.0p2) minted the committed fixture and self-verified it
before commit.

### Repo state
- Branch: `main` only (feature branch merged + deleted, local & remote). Latest: `chore(agent): add CHANGELOG, version the agent at 0.1.0`.
- Branch: `main` only. Dep: `golang.org/x/crypto v0.52.0` (+ `x/sys` indirect); `go 1.25.0`.