admin ab77fa3544 feat(hub): host-report client + collector + first daemon loop (slice 3, v0.3.0)
internal/hub: the agent's first daemon — a periodic read-only host-report POSTed to
the hub (the heartbeat; no separate ping).

- HostReport wire contract (shared field-for-field with the hub ingest): host
  metrics, guests (vmid + spec), cloudflared status; storage/backups/restore-tests/
  pbs/audit collections DEFINED but emitted empty (slices 5/6 fill).
- Collector over a read-only proxmoxReader (adapted to the real proxmox surface;
  no proxmox changes) + a CloudflaredProber. Partial-failure: NodeStatus fail = hard
  (skip POST); per-guest GuestConfig fail = status "unknown", still report.
- Client: Bearer-auth POST, standard TLS (system roots / optional ca_file), typed
  TransportError/HTTPError, token never in errors.
- Loop: immediate first report, adopt hub poll_interval (clamp [60,3600]), resilient
  to collect/report errors, clean ctx-cancel shutdown.
- ControlEnvelope: only poll_interval_seconds acted on; blocked/desired_generation/
  has_signed_ops parsed-but-ignored (slice 4).
- config: HubConfig + FELHOM_AGENT_HUB_* overlay + mode-aware HubConfig.Validate +
  WithDefaults + hub-key redaction; example config updated.
- main: no-selftest mode is now the daemon; added --selftest=hub. Version -> 0.3.0.

Tests: report serialization, client (incl. token-redaction), collector partial-
failure, loop continuation+interval adoption, config. internal/proxmox + internal/
authz untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:20:09 +02:00


# Changelog
All notable changes to **felhom-agent** are recorded here. Update on every code
change that gets pushed.
## v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)
The agent's first daemon: a periodic read-only host-report POSTed to the hub (the
heartbeat). No Proxmox mutations, no desired-state/signed-op consumption, no
storage/backup collection yet — those are slices 4/5/6.
### Added
- **`internal/hub`** package:
  - **`HostReport`** wire contract (`report.go`) shared field-for-field with the hub
    ingest: host metrics, guests (`vmid` + spec), `cloudflared` status, and the
    `storage_targets`/`backups`/`restore_tests`/`pbs_snapshots`/`audit_tail`
    collections **defined but emitted empty** (typed `[]`, slices 5/6 fill them).
  - **`Collector`** (`collect.go`) builds the report from a read-only `proxmoxReader`
    (adapted to the real `internal/proxmox` surface: node held by the client, value
    returns, `proxmox.Guest`) + a `CloudflaredProber`. Partial-failure policy: a
    failed `NodeStatus` is a hard error (skip the POST); a failed per-guest
    `GuestConfig` degrades that guest to `status="unknown"` (spec omitted) but still
    sends; a cloudflared probe failure → `"unknown"`, never fatal.
  - **`CloudflaredProber`** + `SystemctlProber` (`systemctl is-active cloudflared`;
    read-only, NOT a Privileged/root op; tunnel management is a later slice).
  - **`Client`** (`client.go`): `POST /api/v1/host-report` with
    `Authorization: Bearer <key>`, standard TLS (system roots or optional `ca_file`;
    verification always on). Typed `*TransportError` / `*HTTPError`; the bearer token
    never appears in any error.
  - **`Loop`** (`loop.go`): the daemon. Immediate first report then tick; adopts the
    hub's `poll_interval_seconds` clamped to [60,3600]; resilient (a collect/report
    error is logged and the loop continues); clean shutdown on context cancel.
  - **`ControlEnvelope`**: only `poll_interval_seconds` is acted on; `blocked` /
    `desired_generation` / `has_signed_ops` are parsed-but-ignored (logged at most)
    pending reconcile (slice 4).
- **Config**: `HubConfig` (url/host_id/api_key/poll_seconds/timeout_seconds/ca_file),
`FELHOM_AGENT_HUB_*` env overlay, `HubConfig.Validate()` (mode-aware — proxmox-only
`--selftest=read|task` still runs without hub config), `WithDefaults()`, and
`Redacted()` now also blanks the hub key. `configs/agent.example.json` gains `hub`
(and `authz`) blocks.
- **`cmd/felhom-agent`**: the no-`--selftest` mode is now the **daemon** (poll loop);
added **`--selftest=hub`** (one collect+report, prints the report + envelope).
Version 0.2.0 → 0.3.0.
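
The collector's partial-failure policy above can be sketched roughly as follows. This is a minimal illustration, not the code in `collect.go`: the `source`, `guest`, `report`, `collect`, `demoSource` and `errNodeDown` names are hypothetical, the guest list is passed in directly, and guests are assumed to start as `"running"`.

```go
package main

import (
	"errors"
	"fmt"
)

// guest and report are illustrative shapes only (assumed, not the wire contract).
type guest struct {
	VMID   int
	Status string
	Cores  *int // nil stands in for "spec omitted"
}

type report struct {
	Guests      []guest
	Cloudflared string
}

// source is a hypothetical stand-in for the read-only reader and the prober.
type source struct {
	nodeStatus  func() error
	guestCores  func(vmid int) (int, error)
	cloudflared func() (string, error)
}

var errNodeDown = errors.New("node down")

// collect applies the documented policy: NodeStatus failure is a hard error,
// a per-guest failure degrades that guest, a cloudflared failure is never fatal.
func collect(src source, vmids []int) (*report, error) {
	if err := src.nodeStatus(); err != nil {
		return nil, fmt.Errorf("node status: %w", err) // caller skips the POST
	}
	rep := &report{Guests: []guest{}}
	for _, vmid := range vmids {
		g := guest{VMID: vmid, Status: "running"}
		cores, err := src.guestCores(vmid)
		if err != nil {
			g.Status = "unknown" // degrade, spec stays nil, still reported
		} else {
			g.Cores = &cores
		}
		rep.Guests = append(rep.Guests, g)
	}
	state, err := src.cloudflared()
	if err != nil {
		state = "unknown"
	}
	rep.Cloudflared = state
	return rep, nil
}

// demoSource fails the config read for vmid 102 and the cloudflared probe.
func demoSource(nodeErr error) source {
	return source{
		nodeStatus: func() error { return nodeErr },
		guestCores: func(vmid int) (int, error) {
			if vmid == 102 {
				return 0, errors.New("config read failed")
			}
			return 2, nil
		},
		cloudflared: func() (string, error) { return "", errors.New("probe failed") },
	}
}

func main() {
	rep, err := collect(demoSource(nil), []int{101, 102})
	if err != nil {
		fmt.Println("skip POST:", err)
		return
	}
	for _, g := range rep.Guests {
		fmt.Printf("vmid=%d status=%s has_spec=%t\n", g.VMID, g.Status, g.Cores != nil)
	}
	fmt.Println("cloudflared:", rep.Cloudflared)
}
```

Only the first failure mode returns an error, so the caller has a single signal for "do not POST this tick".
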
### Tests
- Report serialization (field names; empty collections are `[]` not `null`; spec
omitted when unknown); client (Bearer header, non-2xx→`*HTTPError`,
transport→`*TransportError`, **token never in error**); collector (host mapping,
guest spec, per-guest failure degrades-but-still-reports, NodeStatus hard error,
cloudflared error→unknown); loop (immediate first report, continuation after an
injected error, interval adoption + clamp); config (hub validate/redact/env).
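
The serialization rules tested above (empty collections as `[]`, spec omitted when unknown) follow from how `encoding/json` treats nil slices and nil pointers. A small sketch, assuming hypothetical `Report`, `Guest` and `GuestSpec` shapes; only the `vmid`, `status`, `spec`, `guests` and `backups` names come from this changelog, and the `[]string` element type is a placeholder:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GuestSpec is an assumed shape; the real spec fields are not listed here.
type GuestSpec struct {
	Cores int `json:"cores"`
}

type Guest struct {
	VMID   int        `json:"vmid"`
	Status string     `json:"status"`
	Spec   *GuestSpec `json:"spec,omitempty"` // a nil pointer is left out entirely
}

type Report struct {
	Guests  []Guest  `json:"guests"`
	Backups []string `json:"backups"` // placeholder element type
}

// newReport initialises every collection to a non-nil empty slice,
// because a nil slice would marshal as null instead of [].
func newReport() Report {
	return Report{Guests: []Guest{}, Backups: []string{}}
}

// encode is a small helper that marshals a report to a string.
func encode(r Report) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	r := newReport()
	r.Guests = append(r.Guests, Guest{VMID: 101, Status: "unknown"})
	fmt.Println(encode(r))        // guest without a spec key, backups as []
	fmt.Println(encode(Report{})) // zero value: both collections are null
}
```
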
### Notes
- `internal/proxmox` and `internal/authz` were **not touched** — no new proxmox
surface was needed (`ListLXC` already exposes status/maxmem/maxdisk; `GuestConfig`
exposes cores). The task's `proxmoxReader` sketch (node-arg/pointer/`LXC`) was
adapted to the real exports as instructed.
- **Defined-but-empty** this slice: `storage_targets`, `backups`, `restore_tests`,
`pbs_snapshots`, `audit_tail` (slices 5/6). **Parsed-but-ignored**: the envelope's
`blocked`/`desired_generation`/`has_signed_ops` (slice 4).
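
As an illustration of "parsed-but-ignored", the sketch below decodes all four envelope fields but lets only `poll_interval_seconds` influence behaviour, clamped to [60,3600]. The Go field types, the `nextIntervalSeconds` helper, and the choice to keep the current interval when the field is missing or the body is malformed are assumptions; the changelog does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// controlEnvelope uses the JSON names from the changelog; the Go types are assumed.
type controlEnvelope struct {
	PollIntervalSeconds int   `json:"poll_interval_seconds"`
	Blocked             bool  `json:"blocked"`            // parsed, not acted on
	DesiredGeneration   int64 `json:"desired_generation"` // parsed, not acted on
	HasSignedOps        bool  `json:"has_signed_ops"`     // parsed, not acted on
}

const (
	minPollSeconds = 60
	maxPollSeconds = 3600
)

// nextIntervalSeconds returns the interval the loop would adopt.
// Assumption: an undecodable body or a zero value keeps the current interval.
func nextIntervalSeconds(body []byte, current int) int {
	var env controlEnvelope
	if err := json.Unmarshal(body, &env); err != nil || env.PollIntervalSeconds == 0 {
		return current
	}
	s := env.PollIntervalSeconds
	if s < minPollSeconds {
		s = minPollSeconds
	}
	if s > maxPollSeconds {
		s = maxPollSeconds
	}
	return s
}

func main() {
	bodies := []string{
		`{"poll_interval_seconds":5,"blocked":true}`,
		`{"poll_interval_seconds":900,"has_signed_ops":true}`,
		`{"poll_interval_seconds":86400}`,
		`{}`,
	}
	for _, b := range bodies {
		d := time.Duration(nextIntervalSeconds([]byte(b), 300)) * time.Second
		fmt.Printf("%-52s -> %s\n", b, d)
	}
}
```
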
## v0.2.0 — `authz` signed-op verifier (slice 2) (2026-06-08)
Production form of the Phase-4 signing primitive: a key-type-agnostic SSHSIG
verifier for operator-signed destructive ops, with the full anti-replay/
authorization pipeline and a durable, crash-safe nonce store. What slice 4
(reconcile) will call to gate destructive desired-state deltas. No hub, no signing
CLI, no reconcile loop.
### Added
- **`internal/authz` `Verifier`**: `New(signers, store, hostID)` + `Verify(blob,
sigArmored) (*VerifiedOp, error)`. Runs the LOCKED pipeline (order is
load-bearing): parse armor → namespace → parse pubkey → allow-list (by key
**material**, `pub.Marshal()` equality, not key_id) → crypto verify (over the
**raw received bytes**, never re-canonicalized) → parse blob → target → time
window → **nonce recorded LAST**. Each post-crypto stage rejects even with a
valid signature.
- **SSHSIG framing** (`sshsig.go`) via `golang.org/x/crypto/ssh` — `pem.Decode` →
strip 6-byte magic → `ssh.Unmarshal` → `ssh.ParsePublicKey` → recompute signed
data with the named hash → `pub.Verify` (dispatches on key algorithm). No
hand-rolled crypto. Key-type-agnostic: ed25519 / **sk-ssh-ed25519 (FIDO2)** /
rsa / ecdsa via the one path.
- **Fixed namespace** `felhom-op-v1` (package constant, never caller-supplied).
- **`OpBlob`** (corrected `host_id`/`guest_id` json tags) + **`VerifiedOp`** (op,
host/guest, params, key_id, matched signer). key_id is advisory/audit only —
never an authz input.
- **Typed errors**: `ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature,
ErrTarget, ErrExpired, ErrNotYetValid, ErrReplay` (errors.Is-friendly).
- **`NonceStore`** + two impls: `MemoryNonceStore` (tests) and **`FileNonceStore`**
— durable, crash-safe (fsync'd append log, replayed into an index on open,
periodic compaction, expiry-only pruning). A nonce is fsync'd to disk before
`SeenOrRecord` returns false; replay protection survives restart; I/O failure
fails safe (reports seen=true). Target generalization: host_id matched strictly,
guest_id surfaced for the caller to route.
- **Config**: `AuthzConfig` (nonce-store path + pinned operator `signers` tagged
`operational`/`recovery` with a key_id, as authorized_keys lines).
- **Version 0.2.0.**
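
The reason the nonce is recorded last can be shown with a reduced sketch of the post-crypto stages. It is not the `internal/authz` implementation: the signature check is a boolean stub, the sentinel errors are local redefinitions of the names listed above, and the `op` fields, `check`, `memoryNonceStore` and the `SeenOrRecord(nonce, expiry)` signature are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrBadSignature = errors.New("bad signature")
	ErrTarget       = errors.New("wrong target host")
	ErrExpired      = errors.New("expired")
	ErrNotYetValid  = errors.New("not yet valid")
	ErrReplay       = errors.New("replayed nonce")
)

// op carries only the fields this sketch needs (names assumed).
type op struct {
	HostID    string
	Nonce     string
	NotBefore time.Time
	NotAfter  time.Time
}

type memoryNonceStore struct {
	mu   sync.Mutex
	seen map[string]time.Time // nonce -> expiry
}

func newMemoryNonceStore() *memoryNonceStore {
	return &memoryNonceStore{seen: make(map[string]time.Time)}
}

// SeenOrRecord reports true if the nonce was already recorded, else records it.
func (s *memoryNonceStore) SeenOrRecord(nonce string, expiry time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[nonce]; ok {
		return true
	}
	s.seen[nonce] = expiry
	return false
}

// check runs signature -> target -> time window -> nonce, in that order.
// sigOK stands in for the real SSHSIG verification.
func check(o op, sigOK bool, hostID string, now time.Time, skew time.Duration, store *memoryNonceStore) error {
	if !sigOK {
		return ErrBadSignature // returns before the store is touched
	}
	if o.HostID != hostID {
		return ErrTarget
	}
	if now.Add(skew).Before(o.NotBefore) {
		return ErrNotYetValid
	}
	if now.Add(-skew).After(o.NotAfter) {
		return ErrExpired
	}
	if store.SeenOrRecord(o.Nonce, o.NotAfter) {
		return ErrReplay
	}
	return nil
}

// demo submits the same op three times: forged, genuine, replayed.
func demo() []error {
	now := time.Now()
	store := newMemoryNonceStore()
	o := op{HostID: "h1", Nonce: "n-1", NotBefore: now.Add(-time.Minute), NotAfter: now.Add(time.Minute)}
	return []error{
		check(o, false, "h1", now, 30*time.Second, store),
		check(o, true, "h1", now, 30*time.Second, store),
		check(o, true, "h1", now, 30*time.Second, store),
	}
}

func main() {
	for i, err := range demo() {
		fmt.Printf("attempt %d: %v (replay=%t)\n", i+1, err, errors.Is(err, ErrReplay))
	}
}
```

Because the forged attempt returns before `SeenOrRecord` runs, the genuine op that follows is still accepted; only the third submission is rejected as a replay.
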
### Tests
- Real OpenSSH interop via a committed `ssh-keygen -Y sign` vector (hermetic CI);
per-stage rejection (each with an otherwise-valid sig); the headline
**invalid-sig-does-not-burn-the-nonce** invariant; replay; **persistence across
restart**; synthetic **sk-ssh-ed25519** through the unchanged path; byte-exactness
(a re-serialized blob fails crypto — not re-canonicalized).
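
The byte-exactness test is easy to motivate: decoding a JSON blob and encoding it again generally does not reproduce the received bytes, so a hash taken over the re-encoded form differs from the one the signer covered. A small sketch using only the standard library; the blob contents and the `reserialize` helper are made up for illustration:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// reserialize decodes a JSON object and encodes it again.
// json.Marshal sorts map keys and drops insignificant whitespace.
func reserialize(raw []byte) ([]byte, error) {
	var v map[string]any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

func main() {
	// Hypothetical blob as received: note the key order and the space.
	raw := []byte(`{"op":"destroy", "host_id":"h1"}`)

	re, err := reserialize(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("received:      %s\n", raw)
	fmt.Printf("re-serialized: %s\n", re)
	fmt.Printf("sha256(received)      = %x\n", sha256.Sum256(raw))
	fmt.Printf("sha256(re-serialized) = %x\n", sha256.Sum256(re))
	fmt.Println("identical bytes:", bytes.Equal(raw, re))
}
```
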
### Notes / corrections to the Phase-4 reference
- §7's `Target` lacked json tags (`host_id`/`guest_id`) — fixed.
- The doc paired "Go 1.24.4 / x/crypto v0.52.0", but v0.52.0 declares `go 1.25.0`
and does **not** build on Go 1.24. Resolved by upgrading the build server to
go1.26.0 (backward-compatible; felhom-controller/hub unaffected); the module is
`go 1.25.0` on x/crypto v0.52.0.
- API changes from the reference: the free function became a constructed
`Verifier` that returns the full `VerifiedOp` and typed errors; clock-skew
tolerance was added; the durable nonce store is the net-new work.
- **Shared-contract dependency flagged** (not built): the hub and the `felhom-sign`
CLI must emit byte-identical canonical JSON or signatures won't verify; a shared
canonicalizer both import would be the right home.
## v0.1.0 — Scaffold + `proxmox` interaction layer (slice 1) (2026-06-08)
First slice: stand up the host-agent project and its foundation — the typed
Proxmox interaction layer every other module will call. No reconcile loop, hub
client, signing, or storage/backup orchestration yet (later slices).
### Added
- **Project scaffold**: module `gitea.dooplex.hu/admin/felhom-agent`, binary
`felhom-agent` (`cmd/felhom-agent/`), Go 1.24, zero external dependencies
(pure stdlib). `--version` flag; `version` var overridable via
`-ldflags "-X main.version=<v>"`.
- **`internal/proxmox` — API backend (`Client`)**: hand-rolled REST client over
`https://<host>:8006/api2/json` with `PVEAPIToken` auth. Typed read ops
(`Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`,
`ListStorage`, `NodeStorage`, `StorageContent`) and async mutating ops
returning a UPID (`RestoreLXC` — the primary create path, `Vzdump`, `Snapshot`,
`Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`).
- **`WaitTask`**: polls `GET /nodes/{node}/tasks/{upid}/status` until stopped, then
asserts `exitstatus == "OK"` (authorization can surface at task execution, not
the POST — phase1-2 §1.3). Exponential backoff (1s→5s cap), context
cancellation + timeout. `*APIError` parses the offending privilege from a 403;
`*TaskError` parses it from a failed task exitstatus + log tail.
- **`internal/proxmox` — fenced root-CLI backend (`Privileged`)**: limited to the
proven OS-root exceptions only: `CreateGoldenLXC` (keyctl `pct create`),
`MountUSBByUUID`, `SMART`, `Sensors`; each cites why it can't be the API. Fence
is structural (Client never shells out, Privileged never makes an HTTP call) and
asserted in tests.
- **TLS trust**: SHA-256 leaf-cert pinning (the host serves a self-signed cert) or
a CA file; an explicitly-named `insecure_skip_verify` that is off by default. No
blanket verification disable.
- **`internal/config`**: JSON config file + `FELHOM_AGENT_*` env overrides; the
token secret is never logged (`Redacted()`).
- **`internal/log`**: slog setup (text, stderr, configurable level).
- **`cmd/felhom-agent --selftest`**: read-only health report against a live host
(version/nodes/status/guests/storage); `--selftest=task --vmid N` exercises
`WaitTask` on a reversible snapshot→rollback→delete op (gated; default selftest
mutates nothing).
- **Tests**: unit tests with a mock HTTP transport + mock runner (UPID parse,
`WaitTask` running→OK / failed-403 / timeout / ctx-cancel, 403→privilege error,
response decoding against shapes captured live from `demo-felhom`, config
redaction, and the API-vs-root routing fence).
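
The `WaitTask` polling shape (backoff from 1s to a 5s cap, with context cancellation) might look roughly like the sketch below. It is not the `internal/proxmox` code: the doubling factor is an assumption (the changelog gives only the start and the cap), and `nextDelay`, `scheduleSeconds`, `waitUntil` and `demoWait` are hypothetical names. The `exitstatus == "OK"` assertion is left out.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nextDelay doubles the delay and caps it (doubling is assumed).
func nextDelay(d, max time.Duration) time.Duration {
	d *= 2
	if d > max {
		d = max
	}
	return d
}

// scheduleSeconds lists the first n delays for a 1s start and a 5s cap.
func scheduleSeconds(n int) []int {
	out := make([]int, 0, n)
	d := time.Second
	for i := 0; i < n; i++ {
		out = append(out, int(d/time.Second))
		d = nextDelay(d, 5*time.Second)
	}
	return out
}

// waitUntil polls until poll reports done, poll fails, or ctx ends.
func waitUntil(ctx context.Context, poll func() (bool, error), first, max time.Duration) error {
	d := first
	for {
		done, err := poll()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // cancellation or timeout
		case <-time.After(d):
		}
		d = nextDelay(d, max)
	}
}

// demoWait uses millisecond delays so the example finishes quickly;
// the fake task reports "stopped" on the third poll.
func demoWait() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	polls := 0
	err := waitUntil(ctx, func() (bool, error) {
		polls++
		return polls == 3, nil
	}, time.Millisecond, 5*time.Millisecond)
	return polls, err
}

func main() {
	fmt.Println("delays in seconds:", scheduleSeconds(5))
	polls, err := demoWait()
	fmt.Println("polls:", polls, "err:", err)
}
```
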
### Notes
- Types are grounded in the spike findings
(`felhom.eu/documentation/proxmox-platform.md`, `tests/phase{0,1-2,3}-findings.md`)
and the exact JSON shapes captured live from `demo-felhom` (PVE 9.2.2).
- Verified: `go build/vet/test` green on Go 1.24.4 (build server) and a live
read-only `--selftest` against the demo host with TLS fingerprint pinning.
- The 16-privilege `FelhomAgent` role + privsep token (role on **both** user and
token) is provisioned out-of-band; the agent only consumes the token.