feat(hub): host-report client + collector + first daemon loop (slice 3, v0.3.0)

internal/hub: the agent's first daemon — a periodic read-only host-report POSTed to
the hub (the heartbeat; no separate ping).

- HostReport wire contract (shared field-for-field with the hub ingest): host metrics,
  guests (vmid + spec), cloudflared status; storage/backups/restore-tests/pbs/audit
  collections DEFINED but emitted empty (slices 5/6 fill).
- Collector over a read-only proxmoxReader (adapted to the real proxmox surface; no
  proxmox changes) + a CloudflaredProber. Partial-failure: NodeStatus fail = hard
  (skip POST); per-guest GuestConfig fail = status "unknown", still report.
- Client: Bearer-auth POST, standard TLS (system roots / optional ca_file), typed
  TransportError/HTTPError, token never in errors.
- Loop: immediate first report, adopt hub poll_interval (clamp [60,3600]), resilient
  to collect/report errors, clean ctx-cancel shutdown.
- ControlEnvelope: only poll_interval_seconds acted on;
  blocked/desired_generation/has_signed_ops parsed-but-ignored (slice 4).
- config: HubConfig + FELHOM_AGENT_HUB_* overlay + mode-aware HubConfig.Validate +
  WithDefaults + hub-key redaction; example config updated.
- main: no-selftest mode is now the daemon; added --selftest=hub. Version -> 0.3.0.

Tests: report serialization, client (incl. token-redaction), collector partial-failure,
loop continuation+interval adoption, config. internal/proxmox + internal/authz untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -3,74 +3,71 @@
> This file holds the report for the **most recent** change, fully overwritten each task.
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).

## Task: `authz` signed-op verifier (slice 2) — v0.2.0
## Task: hub client + host-report + first daemon loop (slice 3) — v0.3.0

Turned the Phase-4 reference `VerifySignedOp` into a production package
(`internal/authz`): a key-type-agnostic SSHSIG verifier for operator-signed destructive
ops, the full anti-replay/authorization pipeline, and a durable, crash-safe nonce store.
This is what slice 4 (reconcile) calls to gate destructive desired-state deltas. Pushed to
`main`. Build/vet/test green locally (Go 1.26) and on the build server.
The agent's **first daemon**: a periodic, read-only host-report POSTed to the hub — which
**is** the heartbeat (its server-side `received_at` is the dead-man's-switch signal). New
`internal/hub` package + config additions + `main.go` daemon wiring. Pushed to `main`;
build/vet/test green locally (go1.26) and on the build server.

### Public surface (`internal/authz`)
- **`Verifier`** — `New(signers []AllowedSigner, store NonceStore, hostID string) *Verifier`;
`Verify(blob, sigArmored []byte) (*VerifiedOp, error)`. Optional `ClockSkew` (default 2m,
not-yet-valid only) and `Logger` (advisory key_id-mismatch warning).
- **`OpBlob`** — canonical signed object; `Target{HostID,GuestID}` with corrected
`host_id`/`guest_id` json tags; `Params json.RawMessage`, `Nonce`, `IssuedAt`, `ExpiresAt`, `KeyID`.
- **`VerifiedOp`** — `Op, HostID, GuestID, Params, Nonce, IssuedAt, ExpiresAt, KeyID (advisory),
Signer (matched), KeyIDMatchesSigner`.
- **`AllowedSigner`** + `NewAllowedSigner(keyID, role, authorizedKeyLine)`; roles
`RoleOperational` / `RoleRecovery` (doc 04 two-key model; role-scoping enforced by the caller).
- **`NonceStore`** interface + `MemoryNonceStore` (tests) and **`FileNonceStore`** (durable).
- **Typed errors**: `ErrMalformed, ErrNamespace, ErrUnknownSigner, ErrBadSignature, ErrTarget,
ErrExpired, ErrNotYetValid, ErrReplay` (errors.Is-friendly).
- **Config**: `config.AuthzConfig` (nonce-store path + pinned `Signers`).
### `internal/hub` public surface
- **`HostReport`** + sub-types (`HostMetrics`, `Guest`, `GuestSpec`, `Cloudflared`,
`ControlEnvelope`) — the JSON wire contract shared field-for-field with the hub ingest.
- **`Collector`** — `NewCollector(px proxmoxReader, cf CloudflaredProber, hostID, agentVersion, logger)`;
`Collect(ctx) (*HostReport, error)`.
- **`CloudflaredProber`** interface + **`SystemctlProber`** (`systemctl is-active`).
- **`Client`** — `NewClient(cfg config.HubConfig, logger) (*Client, error)`;
`Report(ctx, *HostReport) (*ControlEnvelope, error)`; typed `*TransportError` / `*HTTPError`.
- **`Loop`** — `NewLoop(collector, client, interval, logger)`; `Run(ctx) error`. Constants
`MinPollSeconds=60` / `MaxPollSeconds=3600`.

### Locked pipeline (order load-bearing)
`parse armor → namespace (fixed felhom-op-v1) → parse pubkey → allow-list by key MATERIAL (not
key_id) → crypto verify over RAW received bytes → parse blob → target (host strict, guest
surfaced) → time window → nonce recorded LAST`. Each post-crypto stage rejects even with a
valid signature; an invalid signature can never consume a nonce.
### Config additions (`internal/config`)
- `HubConfig{URL, HostID, APIKey, PollSeconds, TimeoutSeconds, CAFile}` on `Config.Hub`.
- `FELHOM_AGENT_HUB_{URL,HOST_ID,API_KEY,POLL_SECONDS,TIMEOUT_SECONDS,CA_FILE}` overlay
(int parse errors warn to stderr + keep file value, never crash).
- `HubConfig.Validate()` (mode-aware — proxmox-only selftests unaffected; https required
except loopback for tests), `HubConfig.WithDefaults()` (900s/30s), `Redacted()` blanks the key.
- `configs/agent.example.json` gains `hub` (and `authz`) blocks.
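
The overlay rule for integer settings (warn on a parse error, keep the file value, never crash) can be sketched as follows. This is a minimal illustration, not the package's actual code: `overlayInt` and its injected `lookup` and `warnf` parameters are hypothetical names; only the environment variable name and the 900s default come from the report.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// overlayInt returns the integer held in the named environment variable when
// it is set and parses cleanly; otherwise it returns the value loaded from the
// config file. A parse failure is reported through warnf and never aborts.
// Hypothetical helper: name and signature are assumptions for illustration.
func overlayInt(
	name string,
	fileValue int,
	lookup func(string) (string, bool),
	warnf func(format string, args ...any),
) int {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return fileValue // variable unset: the file value stands
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		warnf("warning: ignoring %s=%q: %v\n", name, raw, err)
		return fileValue // bad value: warn and keep the file value
	}
	return n
}

func main() {
	warnf := func(format string, args ...any) {
		fmt.Fprintf(os.Stderr, format, args...)
	}
	// 900 stands in for the file (or default) poll interval in seconds.
	poll := overlayInt("FELHOM_AGENT_HUB_POLL_SECONDS", 900, os.LookupEnv, warnf)
	fmt.Println("poll seconds:", poll)
}
```

Injecting the lookup and warning functions keeps the helper testable without touching the real process environment or stderr.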

### Durable nonce store — mechanism & guarantee
fsync'd append-only JSONL log + in-memory index (replayed on open) + periodic compaction.
- **Crash-safe**: a nonce is written and `fsync`'d before `SeenOrRecord` returns `false`, so the
caller acts only *after* the durable record. A crash between verify and execute drops the op
(fail-safe) and never enables a replay. I/O failure → returns seen=true (op not executed).
- **Survives restart**: the log is replayed into the index on `OpenFileNonceStore`.
- **Pruning**: expired nonces dropped only at compaction (never before exp) — and an expired op
is rejected by the time check before the nonce check, so pruning is housekeeping, not a hole.
- **Concurrency-safe**: single mutex over file handle + index.
### Daemon-loop behaviour (`main.go`)
- No `--selftest` flag → **daemon**: validate proxmox + hub config → build read-path proxmox
client, collector, hub client, loop → `signal.NotifyContext(SIGINT, SIGTERM)` → `loop.Run`.
- **Immediate first report**, then tick at the interval; adopt the hub's
`poll_interval_seconds` (clamped [60,3600], reset the ticker on change).
- **Resilient**: any collect/report error is logged and the loop continues (survives hub 5xx
and transient proxmox read errors). Clean `nil` return on context cancel.
- **`--selftest=hub`**: one collect + report; prints the report it would send + the envelope.
- Startup line logs host_id/url/interval with the **key redacted**; no secret ever logged.
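
The interval adoption described above (clamp to [60,3600], reset the ticker only on change) might look roughly like this. It is a sketch under assumptions: `clampPollSeconds` and `nextInterval` are hypothetical helper names, and treating a zero or negative suggestion as "no change" is an assumption the report does not state; only the two constants are taken from the report.

```go
package main

import (
	"fmt"
	"time"
)

// Bounds from the report's MinPollSeconds / MaxPollSeconds constants.
const (
	MinPollSeconds = 60
	MaxPollSeconds = 3600
)

// clampPollSeconds bounds a hub-suggested interval to the allowed range.
func clampPollSeconds(s int) int {
	if s < MinPollSeconds {
		return MinPollSeconds
	}
	if s > MaxPollSeconds {
		return MaxPollSeconds
	}
	return s
}

// nextInterval returns the interval to use after a report and whether it
// differs from the current one. ASSUMPTION: a zero or negative suggestion
// means the hub expressed no preference, so the current interval is kept.
func nextInterval(current time.Duration, suggestedSeconds int) (time.Duration, bool) {
	if suggestedSeconds <= 0 {
		return current, false
	}
	next := time.Duration(clampPollSeconds(suggestedSeconds)) * time.Second
	return next, next != current
}

func main() {
	interval := 900 * time.Second
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// Simulated poll_interval_seconds values from successive envelopes.
	for _, suggested := range []int{0, 30, 900, 7200} {
		if next, changed := nextInterval(interval, suggested); changed {
			interval = next
			ticker.Reset(interval) // reset only when the interval changed
		}
		fmt.Printf("suggested=%d -> interval=%s\n", suggested, interval)
	}
}
```

Keeping the clamp and the change detection in pure functions lets the cycle-level behaviour be tested without waiting on a real ticker.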

### OPEN choices
- **Clock skew**: 2-minute tolerance on *not-yet-valid* only; expiry not extended (window stays an
honest bound).
- **Durable mechanism**: fsync'd append log + compaction (simple, honest, no embedded-KV dep).
- **Fixtures**: committed real `ssh-keygen -Y sign` vector (hermetic + proves OpenSSH interop) +
in-Go minting for rejection cases; the sk case is synthetic (spec-faithful, no hardware).
- **Package name**: `authz` (control-plane-authorization layer, matches doc 04).
### Explicitly deferred (defined now, not active)
- **Defined-but-EMPTY** this slice (slices 5/6 fill): `storage_targets`, `backups`,
`restore_tests`, `pbs_snapshots`, `audit_tail` — emitted as typed empty `[]`.
- **Parsed-but-IGNORED** (slice 4 / reconcile consumes): the envelope's `blocked`,
`desired_generation`, `has_signed_ops` — logged at most, never acted on.
- No per-guest work queue (zero Proxmox mutations this slice); no canonical JSON (nothing
signs the report); no controller_version (slice 8) — emitted `""`.
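
Emitting the deferred collections as `[]` rather than `null` depends on how Go's `encoding/json` treats slices: a nil slice marshals to `null`, while an allocated empty slice marshals to `[]`. The sketch below illustrates the difference. The `report` struct is a hypothetical stand-in with only two of the fields, and `[]string` is a placeholder element type, since the real element types are not shown in this report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report is a hypothetical, cut-down stand-in for HostReport. The JSON field
// names come from the report; the []string element type is a placeholder.
type report struct {
	StorageTargets []string `json:"storage_targets"`
	Backups        []string `json:"backups"`
}

// encode marshals r and returns the JSON text (or the error text).
func encode(r report) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Nil slices: encoding/json writes null for each collection.
	fmt.Println(encode(report{}))

	// Allocated empty slices: encoding/json writes [] for each collection.
	fmt.Println(encode(report{
		StorageTargets: []string{},
		Backups:        []string{},
	}))
}
```

A collector that always initializes these slices before returning keeps the wire contract stable for an ingest that expects arrays.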

### Test matrix (all pass — 14 tests)
Real ssh-keygen fixture · happy path · per-stage rejection {namespace, unknown-signer, tampered,
retargeted-host, expired, not-yet-valid, replay} · **invalid-sig-does-NOT-burn-nonce** (then the
valid op with that nonce still succeeds) · replay-rejected-across-restart (durable store) ·
key-type-agnostic synthetic **sk-ssh-ed25519** · byte-exactness (re-serialized blob fails crypto).
### proxmox surface
**No changes to `internal/proxmox` or `internal/authz`.** No new proxmox surface was needed:
`ListLXC` already returns status/maxmem/maxdisk and `GuestConfig` returns cores. The task's
`proxmoxReader` sketch (node-arg / pointer returns / `LXC` type) was **adapted to the real
exports** — `Node()` on the client, value returns, `proxmox.Guest` — per its instruction.

### Corrections to the Phase-4 §7 reference (for production)
- `Target` needed `host_id`/`guest_id` json tags — fixed.
- **The doc's "Go 1.24.4 / x/crypto v0.52.0" does not hold**: x/crypto v0.52.0 declares
`go 1.25.0` and won't build on Go 1.24. Resolved by upgrading the build server to **go1.26.0**
(backward-compatible — felhom-controller/hub build unchanged; distro Go package left intact,
upstream Go fronted on PATH).
- Free function → constructed `Verifier`; returns full `VerifiedOp`; typed errors; clock-skew;
durable nonce store (the net-new engineering).
- **Shared-contract flag (not built)**: the hub and `felhom-sign` CLI must produce byte-identical
canonical JSON or signatures won't verify; a shared canonicalizer both import is the right home.
### Test matrix (all green)
- **report**: field names match §4; empty collections serialize as `[]` not `null`; spec
omitted when unknown.
- **client**: sets `Bearer`; non-2xx → `*HTTPError` (status preserved); transport → `*TransportError`;
**asserts the bearer token never appears in any error string**.
- **collector**: `NodeStatus`→host block; `ListLXC`+`GuestConfig`→guest spec; a failing
`GuestConfig` → `status="unknown"` + omitted spec + **still returns a report**; a failing
`NodeStatus` → hard error; cloudflared probe error → `"unknown"`.
- **loop**: immediate first report; continues after an injected report error (≥3 cycles);
adopts + clamps the envelope interval (cycle-level) and applies a slower interval in `Run`.
- **config**: hub validate cases, key redaction, env overlay + defaults.
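
The client assertions above (status preserved in `*HTTPError`, token never present in any error string) can be illustrated with a small sketch. The type names come from the report, but the fields, message formats, and the `leaksToken` helper are assumptions made for illustration and may differ from the real package.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// HTTPError models a non-2xx reply from the hub. The StatusCode field and the
// message format are hypothetical.
type HTTPError struct{ StatusCode int }

func (e *HTTPError) Error() string {
	return fmt.Sprintf("hub returned HTTP %d", e.StatusCode)
}

// TransportError models a failure to reach the hub at all. The Err field is
// hypothetical; Unwrap keeps the cause reachable via errors.Is / errors.As.
type TransportError struct{ Err error }

func (e *TransportError) Error() string { return "hub transport: " + e.Err.Error() }
func (e *TransportError) Unwrap() error { return e.Err }

// leaksToken reports whether err's message contains the secret token.
// Hypothetical test helper.
func leaksToken(err error, token string) bool {
	return token != "" && strings.Contains(err.Error(), token)
}

func main() {
	const token = "example-bearer-token" // placeholder secret for the demo

	var err error = &TransportError{Err: errors.New("dial tcp: connection refused")}
	var te *TransportError
	fmt.Println("transport error:", errors.As(err, &te), "leaks:", leaksToken(err, token))

	err = &HTTPError{StatusCode: 503}
	var he *HTTPError
	if errors.As(err, &he) {
		fmt.Println("http status:", he.StatusCode, "leaks:", leaksToken(err, token))
	}
}
```

Because the token travels only in the `Authorization` header and neither error type stores request headers, the messages have no path by which to include it; a check like `leaksToken` simply guards against that changing later.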

### Verification
- `go build/vet/test` green locally (go1.26.0) and on the build server (upgraded to go1.26.0).
- Real OpenSSH `ssh-keygen` (OpenSSH 10.0p2) minted the committed fixture and self-verified it
before commit.
- `go build/vet/test` green locally (go1.26.0) and on the build server (go1.26.0). No live hub
or `systemctl` in unit tests (mock transport + fake prober/collector/reporter).

### Repo state
- Branch: `main` only. Dep: `golang.org/x/crypto v0.52.0` (+ `x/sys` indirect); `go 1.25.0`.
- Branch: `main` only. Version 0.3.0. Dep unchanged (`golang.org/x/crypto v0.52.0`).