diff --git a/CHANGELOG.md b/CHANGELOG.md index c42f560..f7aba91 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,23 @@ All notable changes to **felhom-agent** are recorded here. Update on every code change that gets pushed. +## v0.3.1 — slice-3 validation follow-ups (2026-06-08) + +### Changed +- **Collector keeps the known run-status on a `GuestConfig` failure** (`internal/hub/collect.go`): + previously a per-guest config-read error forced `status="unknown"`; now the run-status from + `ListLXC` is preserved (only the `spec` is dropped). An empty status is still normalized to + `unknown` (wire value is always `running|stopped|unknown`). Test renamed to + `TestCollect_GuestConfigFailureKeepsStatusOmitsSpec` and asserts the preserved `running` + nil spec. +- **`--selftest` usage** error string now reads `(want read|task|hub)`. + +### Added +- **Cross-repo contract fixture** `internal/hub/testdata/host-report.golden.json` + + `TestHostReport_ContractMatchesGolden` — compares the marshaled `HostReport` field-name sets + (top level + `host` + `guests[0]`) against the golden, failing on any json-tag drift. The file is + **kept byte-identical** with felhom-hub's copy (duplicated contract until a shared types module; + revisit when slices 5/6 populate the empty collections). Version → 0.3.1. + ## v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08) The agent's first daemon: a periodic read-only host-report POSTed to the hub (the diff --git a/REPORT.md b/REPORT.md index e8ee82e..b103347 100644 --- a/REPORT.md +++ b/REPORT.md @@ -3,71 +3,45 @@ > This file holds the report for the **most recent** change, fully overwritten each task. > Cumulative history lives in [CHANGELOG.md](CHANGELOG.md). -## Task: hub client + host-report + first daemon loop (slice 3) — v0.3.0 +## Task: slice-3 validation follow-ups — v0.3.1 -The agent's **first daemon**: a periodic, read-only host-report POSTed to the hub — which -**is** the heartbeat (its server-side `received_at` is the dead-man's-switch signal). New -`internal/hub` package + config additions + `main.go` daemon wiring. Pushed to `main`; -build/vet/test green locally (go1.26) and on the build server. +Small fixes surfaced during slice-3 validation (agent half). Pushed to `main`; build/vet/test +green locally (go1.26) and on the build server. -### `internal/hub` public surface -- **`HostReport`** + sub-types (`HostMetrics`, `Guest`, `GuestSpec`, `Cloudflared`, - `ControlEnvelope`) — the JSON wire contract shared field-for-field with the hub ingest. -- **`Collector`** — `NewCollector(px proxmoxReader, cf CloudflaredProber, hostID, agentVersion, logger)`; - `Collect(ctx) (*HostReport, error)`. -- **`CloudflaredProber`** interface + **`SystemctlProber`** (`systemctl is-active`). -- **`Client`** — `NewClient(cfg config.HubConfig, logger) (*Client, error)`; - `Report(ctx, *HostReport) (*ControlEnvelope, error)`; typed `*TransportError` / `*HTTPError`. -- **`Loop`** — `NewLoop(collector, client, interval, logger)`; `Run(ctx) error`. Constants - `MinPollSeconds=60` / `MaxPollSeconds=3600`. +### §1 — `--selftest` usage string +`selftestFlag.Set`'s error now reads `(want read|task|hub)` (was missing `hub`, which became a +valid mode in slice 3). Cosmetic. -### Config additions (`internal/config`) -- `HubConfig{URL, HostID, APIKey, PollSeconds, TimeoutSeconds, CAFile}` on `Config.Hub`. -- `FELHOM_AGENT_HUB_{URL,HOST_ID,API_KEY,POLL_SECONDS,TIMEOUT_SECONDS,CA_FILE}` overlay - (int parse errors warn to stderr + keep file value, never crash). -- `HubConfig.Validate()` (mode-aware — proxmox-only selftests unaffected; https required - except loopback for tests), `HubConfig.WithDefaults()` (900s/30s), `Redacted()` blanks the key. -- `configs/agent.example.json` gains `hub` (and `authz`) blocks. +### §2 — collector keeps run-status on a `GuestConfig` failure +`internal/hub/collect.go` `collectGuests`: a per-guest `GuestConfig` error no longer forces +`status="unknown"`. The run-status from `ListLXC` is **preserved** (only `spec` is dropped — that's +the only thing actually unknown). An *empty* status is still normalized to `unknown`, so the wire +value is always `running|stopped|unknown` (matches the hub handler's empty→unknown defaulting). +Test renamed `TestCollect_GuestConfigFailureKeepsStatusOmitsSpec`, now asserting the preserved +`running` status **and** nil spec (not a hollow `!= "unknown"` check). -### Daemon-loop behaviour (`main.go`) -- No `--selftest` flag → **daemon**: validate proxmox + hub config → build read-path proxmox - client, collector, hub client, loop → `signal.NotifyContext(SIGINT, SIGTERM)` → `loop.Run`. -- **Immediate first report**, then tick at the interval; adopt the hub's - `poll_interval_seconds` (clamped [60,3600], reset the ticker on change). -- **Resilient**: any collect/report error is logged and the loop continues (survives hub 5xx - and transient proxmox read errors). Clean `nil` return on context cancel. -- **`--selftest=hub`**: one collect + report; prints the report it would send + the envelope. -- Startup line logs host_id/url/interval with the **key redacted**; no secret ever logged. +### §4 — cross-repo contract golden fixture (agent half) +The host-report shape lives in two repos with nothing failing on drift (the hub ignores unknown +fields). Locked it with a golden sample: +- `internal/hub/testdata/host-report.golden.json` — a populated report (host block, two guests: + one `running` with `spec`, one `stopped`; `cloudflared`; the four empty collections + `audit_tail` + as `[]`). +- `TestHostReport_ContractMatchesGolden` — marshals a constructed `HostReport`, unmarshals the + golden, and compares **field-name key sets** at top level + `host` + `guests[0]`. A renamed/added/ + removed json tag fails it. -### Explicitly deferred (defined now, not active) -- **Defined-but-EMPTY** this slice (slices 5/6 fill): `storage_targets`, `backups`, - `restore_tests`, `pbs_snapshots`, `audit_tail` — emitted as typed empty `[]`. -- **Parsed-but-IGNORED** (slice 4 / reconcile consumes): the envelope's `blocked`, - `desired_generation`, `has_signed_ops` — logged at most, never acted on. -- No per-guest work queue (zero Proxmox mutations this slice); no canonical JSON (nothing - signs the report); no controller_version (slice 8) — emitted `""`. +**Caveat (called out):** this is a *duplicated* contract — the file must stay **byte-identical** +with `felhom-hub`'s `hub/internal/api/testdata/host-report.golden.json`. JSON can't carry a comment, +so the mandatory "keep byte-identical" note lives in the test file's doc comment in both repos +instead of a JSON header. When slices 5/6 add real `storage_targets`/`backups` fields, revisit +promoting this to a shared Go types module (the proper fix). -### proxmox surface -**No changes to `internal/proxmox` or `internal/authz`.** No new proxmox surface was needed: -`ListLXC` already returns status/maxmem/maxdisk and `GuestConfig` returns cores. The task's -`proxmoxReader` sketch (node-arg / pointer returns / `LXC` type) was **adapted to the real -exports** — `Node()` on the client, value returns, `proxmox.Guest` — per its instruction. - -### Test matrix (all green) -- **report**: field names match §4; empty collections serialize as `[]` not `null`; spec - omitted when unknown. -- **client**: sets `Bearer`; non-2xx → `*HTTPError` (status preserved); transport → `*TransportError`; - **asserts the bearer token never appears in any error string**. -- **collector**: `NodeStatus`→host block; `ListLXC`+`GuestConfig`→guest spec; a failing - `GuestConfig` → `status="unknown"` + omitted spec + **still returns a report**; a failing - `NodeStatus` → hard error; cloudflared probe error → `"unknown"`. -- **loop**: immediate first report; continues after an injected report error (≥3 cycles); - adopts + clamps the envelope interval (cycle-level) and applies a slower interval in `Run`. -- **config**: hub validate cases, key redaction, env overlay + defaults. +### Not touched (confirmed) +The daemon's proxmox client timeout is already bounded: `proxmox.NewClient` defaults `HTTPTimeout` +to 30s when zero, and `newProxmoxClient` leaves it zero. No change (was a "confirm" item). ### Verification -- `go build/vet/test` green locally (go1.26.0) and on the build server (go1.26.0). No live hub - or `systemctl` in unit tests (mock transport + fake prober/collector/reporter). +`go build/vet/test ./...` green locally (go1.26) and on the build server (go1.26). Version 0.3.0 → 0.3.1. ### Repo state -- Branch: `main` only. Version 0.3.0. Dep unchanged (`golang.org/x/crypto v0.52.0`). +Branch: `main` only. Dep unchanged (`golang.org/x/crypto v0.52.0`). diff --git a/cmd/felhom-agent/main.go b/cmd/felhom-agent/main.go index b5b20e4..b7d6cac 100644 --- a/cmd/felhom-agent/main.go +++ b/cmd/felhom-agent/main.go @@ -26,7 +26,7 @@ import ( // version is the agent version. Overridable at build time with // -ldflags "-X main.version="; defaults to the in-repo CHANGELOG version. -var version = "0.3.0" +var version = "0.3.1" func main() { var ( @@ -326,7 +326,7 @@ func (f *selftestFlag) Set(v string) error { case "hub": f.mode = "hub" default: - return fmt.Errorf("invalid --selftest value %q (want read|task)", v) + return fmt.Errorf("invalid --selftest value %q (want read|task|hub)", v) } return nil } diff --git a/internal/hub/collect.go b/internal/hub/collect.go index 5749164..3b2a64e 100644 --- a/internal/hub/collect.go +++ b/internal/hub/collect.go @@ -105,13 +105,18 @@ func (c *Collector) collectGuests(ctx context.Context) []Guest { guests := make([]Guest, 0, len(lxc)) for _, g := range lxc { entry := Guest{VMID: g.VMID, Name: g.Name, Status: g.Status, ControllerVersion: ""} + // Normalize an empty run-status to "unknown" so the wire value is always one + // of running|stopped|unknown (matches the hub handler's empty→unknown default). + if entry.Status == "" { + entry.Status = "unknown" + } // GuestConfig supplies cores; memory/disk come from the list entry (bytes). + // On failure, KEEP the known run-status from ListLXC — only the spec is lost. cfg, err := c.px.GuestConfig(ctx, g.VMID) if err != nil { - c.logger.Warn("hub: GuestConfig failed; guest degraded to unknown", + c.logger.Warn("hub: GuestConfig failed; spec omitted (run-status kept)", "vmid", g.VMID, "err", err) - entry.Status = "unknown" - entry.Spec = nil // omitted + entry.Spec = nil } else { entry.Spec = &GuestSpec{ Cores: cfg.Cores, diff --git a/internal/hub/collect_test.go b/internal/hub/collect_test.go index 31ea667..d5d29dd 100644 --- a/internal/hub/collect_test.go +++ b/internal/hub/collect_test.go @@ -58,7 +58,7 @@ func TestCollect_HostAndGuests(t *testing.T) { } } -func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) { +func TestCollect_GuestConfigFailureKeepsStatusOmitsSpec(t *testing.T) { px := &fakePx{ node: "demo-felhom", ns: newTestNodeStatus(), @@ -69,7 +69,7 @@ func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) { cfg: map[int]proxmox.GuestConfig{100: {Cores: 2}}, cfgErr: map[int]error{200: errors.New("config read failed")}, } - c := NewCollector(px, fakeProber{status: "active"}, "h", "0.3.0", quietLogger()) + c := NewCollector(px, fakeProber{status: "active"}, "h", "0.3.1", quietLogger()) r, err := c.Collect(context.Background()) if err != nil { t.Fatalf("a per-guest failure must NOT fail the whole report: %v", err) @@ -77,9 +77,14 @@ func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) { if len(r.Guests) != 2 { t.Fatalf("guests = %d", len(r.Guests)) } + // GuestConfig failed for vmid 200, but its run-status (from ListLXC) is known and + // must be PRESERVED — only the spec is dropped. bad := r.Guests[1] - if bad.Status != "unknown" || bad.Spec != nil { - t.Errorf("degraded guest = %+v (want status=unknown, spec=nil)", bad) + if bad.Status != "running" { + t.Errorf("status = %q, want preserved \"running\" (not forced to unknown)", bad.Status) + } + if bad.Spec != nil { + t.Errorf("spec = %+v, want nil (omitted on config failure)", bad.Spec) } } diff --git a/internal/hub/contract_test.go b/internal/hub/contract_test.go new file mode 100644 index 0000000..754966a --- /dev/null +++ b/internal/hub/contract_test.go @@ -0,0 +1,76 @@ +package hub + +import ( + "encoding/json" + "os" + "reflect" + "sort" + "testing" +) + +// The host-report shape is a contract DUPLICATED across two repos (no shared types +// module yet). testdata/host-report.golden.json MUST be kept byte-identical with +// felhom-hub's hub/internal/api/testdata/host-report.golden.json. This test fails +// if a json tag on HostReport/HostMetrics/Guest is renamed/added/removed relative +// to the golden, catching silent drift before slices 5/6 populate the empty +// collections. (Promote to a shared types module when those land.) +func TestHostReport_ContractMatchesGolden(t *testing.T) { + raw, err := os.ReadFile("testdata/host-report.golden.json") + if err != nil { + t.Fatal(err) + } + var golden map[string]any + if err := json.Unmarshal(raw, &golden); err != nil { + t.Fatalf("golden is not valid JSON: %v", err) + } + + // A constructed report mirroring the golden's populated shape (guests[0] has spec). + report := &HostReport{ + HostID: "demo-host-01", ReportedAt: "2026-06-08T12:00:00Z", AgentVersion: "0.3.1", + Host: HostMetrics{Node: "demo-felhom", LoadAvg: []string{"0.10"}}, + Guests: []Guest{ + {VMID: 100, Name: "a", Status: "running", ControllerVersion: "", Spec: &GuestSpec{Cores: 2}}, + {VMID: 101, Name: "b", Status: "stopped", ControllerVersion: ""}, + }, + StorageTargets: []StorageTarget{}, Backups: []Backup{}, RestoreTests: []RestoreTest{}, + PBSSnapshots: []PBSSnapshot{}, AuditTail: []AuditEntry{}, + Cloudflared: Cloudflared{Status: "active"}, + } + b, _ := json.Marshal(report) + var got map[string]any + json.Unmarshal(b, &got) + + assertSameKeys(t, "", golden, got) + assertSameKeys(t, "host", golden["host"], got["host"]) + assertSameKeys(t, "guests[0]", + firstElem(golden["guests"]), firstElem(got["guests"])) +} + +func firstElem(v any) any { + arr, ok := v.([]any) + if !ok || len(arr) == 0 { + return map[string]any{} + } + return arr[0] +} + +func assertSameKeys(t *testing.T, where string, a, b any) { + t.Helper() + ka, kb := keysOf(a), keysOf(b) + if !reflect.DeepEqual(ka, kb) { + t.Errorf("contract drift at %s:\n golden keys = %v\n struct keys = %v", where, ka, kb) + } +} + +func keysOf(v any) []string { + m, ok := v.(map[string]any) + if !ok { + return nil + } + keys := make([]string, 0, len(m)) + for k := range m { + keys = append(keys, k) + } + sort.Strings(keys) + return keys +} diff --git a/internal/hub/testdata/host-report.golden.json b/internal/hub/testdata/host-report.golden.json new file mode 100644 index 0000000..e2a3066 --- /dev/null +++ b/internal/hub/testdata/host-report.golden.json @@ -0,0 +1,38 @@ +{ + "host_id": "demo-host-01", + "reported_at": "2026-06-08T12:00:00Z", + "agent_version": "0.3.1", + "host": { + "node": "demo-felhom", + "cpu_percent": 3.2, + "memory_total_bytes": 16777216000, + "memory_used_bytes": 4194304000, + "memory_percent": 25, + "disk_total_bytes": 152000000000, + "disk_used_bytes": 30000000000, + "disk_percent": 19.7, + "loadavg": ["0.10", "0.20", "0.15"], + "uptime_seconds": 86400 + }, + "guests": [ + { + "vmid": 100, + "name": "felhom-cust-acme", + "status": "running", + "controller_version": "", + "spec": { "cores": 2, "memory_bytes": 2147483648, "disk_bytes": 21474836480 } + }, + { + "vmid": 101, + "name": "felhom-cust-beta", + "status": "stopped", + "controller_version": "" + } + ], + "storage_targets": [], + "backups": [], + "restore_tests": [], + "pbs_snapshots": [], + "cloudflared": { "status": "active" }, + "audit_tail": [] +}