fix(agent): slice-3 follow-ups — keep run-status on config fail, selftest usage, contract golden (v0.3.1)
- collect: a per-guest GuestConfig failure preserves the ListLXC run-status (only spec dropped); empty status normalized to "unknown". Test asserts preserved "running" + nil spec. - main: --selftest usage error now reads (want read|task|hub). - contract: testdata/host-report.golden.json + TestHostReport_ContractMatchesGolden (field-name key-set check vs golden; byte-identical with the hub copy). - version 0.3.0 -> 0.3.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -3,6 +3,23 @@
|
|||||||
All notable changes to **felhom-agent** are recorded here. Update on every code
|
All notable changes to **felhom-agent** are recorded here. Update on every code
|
||||||
change that gets pushed.
|
change that gets pushed.
|
||||||
|
|
||||||
|
## v0.3.1 — slice-3 validation follow-ups (2026-06-08)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- **Collector keeps the known run-status on a `GuestConfig` failure** (`internal/hub/collect.go`):
|
||||||
|
previously a per-guest config-read error forced `status="unknown"`; now the run-status from
|
||||||
|
`ListLXC` is preserved (only the `spec` is dropped). An empty status is still normalized to
|
||||||
|
`unknown` (wire value is always `running|stopped|unknown`). Test renamed to
|
||||||
|
`TestCollect_GuestConfigFailureKeepsStatusOmitsSpec` and asserts the preserved `running` + nil spec.
|
||||||
|
- **`--selftest` usage** error string now reads `(want read|task|hub)`.
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- **Cross-repo contract fixture** `internal/hub/testdata/host-report.golden.json` +
|
||||||
|
`TestHostReport_ContractMatchesGolden` — compares the marshaled `HostReport` field-name sets
|
||||||
|
(top level + `host` + `guests[0]`) against the golden, failing on any json-tag drift. The file is
|
||||||
|
**kept byte-identical** with felhom-hub's copy (duplicated contract until a shared types module;
|
||||||
|
revisit when slices 5/6 populate the empty collections). Version → 0.3.1.
|
||||||
|
|
||||||
## v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)
|
## v0.3.0 — hub client + host-report + first daemon loop (slice 3) (2026-06-08)
|
||||||
|
|
||||||
The agent's first daemon: a periodic read-only host-report POSTed to the hub (the
|
The agent's first daemon: a periodic read-only host-report POSTed to the hub (the
|
||||||
|
|||||||
@@ -3,71 +3,45 @@
|
|||||||
> This file holds the report for the **most recent** change, fully overwritten each task.
|
> This file holds the report for the **most recent** change, fully overwritten each task.
|
||||||
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
||||||
|
|
||||||
## Task: hub client + host-report + first daemon loop (slice 3) — v0.3.0
|
## Task: slice-3 validation follow-ups — v0.3.1
|
||||||
|
|
||||||
The agent's **first daemon**: a periodic, read-only host-report POSTed to the hub — which
|
Small fixes surfaced during slice-3 validation (agent half). Pushed to `main`; build/vet/test
|
||||||
**is** the heartbeat (its server-side `received_at` is the dead-man's-switch signal). New
|
green locally (go1.26) and on the build server.
|
||||||
`internal/hub` package + config additions + `main.go` daemon wiring. Pushed to `main`;
|
|
||||||
build/vet/test green locally (go1.26) and on the build server.
|
|
||||||
|
|
||||||
### `internal/hub` public surface
|
### §1 — `--selftest` usage string
|
||||||
- **`HostReport`** + sub-types (`HostMetrics`, `Guest`, `GuestSpec`, `Cloudflared`,
|
`selftestFlag.Set`'s error now reads `(want read|task|hub)` (was missing `hub`, which became a
|
||||||
`ControlEnvelope`) — the JSON wire contract shared field-for-field with the hub ingest.
|
valid mode in slice 3). Cosmetic.
|
||||||
- **`Collector`** — `NewCollector(px proxmoxReader, cf CloudflaredProber, hostID, agentVersion, logger)`;
|
|
||||||
`Collect(ctx) (*HostReport, error)`.
|
|
||||||
- **`CloudflaredProber`** interface + **`SystemctlProber`** (`systemctl is-active`).
|
|
||||||
- **`Client`** — `NewClient(cfg config.HubConfig, logger) (*Client, error)`;
|
|
||||||
`Report(ctx, *HostReport) (*ControlEnvelope, error)`; typed `*TransportError` / `*HTTPError`.
|
|
||||||
- **`Loop`** — `NewLoop(collector, client, interval, logger)`; `Run(ctx) error`. Constants
|
|
||||||
`MinPollSeconds=60` / `MaxPollSeconds=3600`.
|
|
||||||
|
|
||||||
### Config additions (`internal/config`)
|
### §2 — collector keeps run-status on a `GuestConfig` failure
|
||||||
- `HubConfig{URL, HostID, APIKey, PollSeconds, TimeoutSeconds, CAFile}` on `Config.Hub`.
|
`internal/hub/collect.go` `collectGuests`: a per-guest `GuestConfig` error no longer forces
|
||||||
- `FELHOM_AGENT_HUB_{URL,HOST_ID,API_KEY,POLL_SECONDS,TIMEOUT_SECONDS,CA_FILE}` overlay
|
`status="unknown"`. The run-status from `ListLXC` is **preserved** (only `spec` is dropped — that's
|
||||||
(int parse errors warn to stderr + keep file value, never crash).
|
the only thing actually unknown). An *empty* status is still normalized to `unknown`, so the wire
|
||||||
- `HubConfig.Validate()` (mode-aware — proxmox-only selftests unaffected; https required
|
value is always `running|stopped|unknown` (matches the hub handler's empty→unknown defaulting).
|
||||||
except loopback for tests), `HubConfig.WithDefaults()` (900s/30s), `Redacted()` blanks the key.
|
Test renamed `TestCollect_GuestConfigFailureKeepsStatusOmitsSpec`, now asserting the preserved
|
||||||
- `configs/agent.example.json` gains `hub` (and `authz`) blocks.
|
`running` status **and** nil spec (not a hollow `!= "unknown"` check).
|
||||||
|
|
||||||
### Daemon-loop behaviour (`main.go`)
|
### §4 — cross-repo contract golden fixture (agent half)
|
||||||
- No `--selftest` flag → **daemon**: validate proxmox + hub config → build read-path proxmox
|
The host-report shape lives in two repos with nothing failing on drift (the hub ignores unknown
|
||||||
client, collector, hub client, loop → `signal.NotifyContext(SIGINT, SIGTERM)` → `loop.Run`.
|
fields). Locked it with a golden sample:
|
||||||
- **Immediate first report**, then tick at the interval; adopt the hub's
|
- `internal/hub/testdata/host-report.golden.json` — a populated report (host block, two guests:
|
||||||
`poll_interval_seconds` (clamped [60,3600], reset the ticker on change).
|
one `running` with `spec`, one `stopped`; `cloudflared`; the four empty collections + `audit_tail`
|
||||||
- **Resilient**: any collect/report error is logged and the loop continues (survives hub 5xx
|
as `[]`).
|
||||||
and transient proxmox read errors). Clean `nil` return on context cancel.
|
- `TestHostReport_ContractMatchesGolden` — marshals a constructed `HostReport`, unmarshals the
|
||||||
- **`--selftest=hub`**: one collect + report; prints the report it would send + the envelope.
|
golden, and compares **field-name key sets** at top level + `host` + `guests[0]`. A renamed/added/
|
||||||
- Startup line logs host_id/url/interval with the **key redacted**; no secret ever logged.
|
removed json tag fails it.
|
||||||
|
|
||||||
### Explicitly deferred (defined now, not active)
|
**Caveat (called out):** this is a *duplicated* contract — the file must stay **byte-identical**
|
||||||
- **Defined-but-EMPTY** this slice (slices 5/6 fill): `storage_targets`, `backups`,
|
with `felhom-hub`'s `hub/internal/api/testdata/host-report.golden.json`. JSON can't carry a comment,
|
||||||
`restore_tests`, `pbs_snapshots`, `audit_tail` — emitted as typed empty `[]`.
|
so the mandatory "keep byte-identical" note lives in the test file's doc comment in both repos
|
||||||
- **Parsed-but-IGNORED** (slice 4 / reconcile consumes): the envelope's `blocked`,
|
instead of a JSON header. When slices 5/6 add real `storage_targets`/`backups` fields, revisit
|
||||||
`desired_generation`, `has_signed_ops` — logged at most, never acted on.
|
promoting this to a shared Go types module (the proper fix).
|
||||||
- No per-guest work queue (zero Proxmox mutations this slice); no canonical JSON (nothing
|
|
||||||
signs the report); no controller_version (slice 8) — emitted `""`.
|
|
||||||
|
|
||||||
### proxmox surface
|
### Not touched (confirmed)
|
||||||
**No changes to `internal/proxmox` or `internal/authz`.** No new proxmox surface was needed:
|
The daemon's proxmox client timeout is already bounded: `proxmox.NewClient` defaults `HTTPTimeout`
|
||||||
`ListLXC` already returns status/maxmem/maxdisk and `GuestConfig` returns cores. The task's
|
to 30s when zero, and `newProxmoxClient` leaves it zero. No change (was a "confirm" item).
|
||||||
`proxmoxReader` sketch (node-arg / pointer returns / `LXC` type) was **adapted to the real
|
|
||||||
exports** — `Node()` on the client, value returns, `proxmox.Guest` — per its instruction.
|
|
||||||
|
|
||||||
### Test matrix (all green)
|
|
||||||
- **report**: field names match §4; empty collections serialize as `[]` not `null`; spec
|
|
||||||
omitted when unknown.
|
|
||||||
- **client**: sets `Bearer`; non-2xx → `*HTTPError` (status preserved); transport → `*TransportError`;
|
|
||||||
**asserts the bearer token never appears in any error string**.
|
|
||||||
- **collector**: `NodeStatus`→host block; `ListLXC`+`GuestConfig`→guest spec; a failing
|
|
||||||
`GuestConfig` → `status="unknown"` + omitted spec + **still returns a report**; a failing
|
|
||||||
`NodeStatus` → hard error; cloudflared probe error → `"unknown"`.
|
|
||||||
- **loop**: immediate first report; continues after an injected report error (≥3 cycles);
|
|
||||||
adopts + clamps the envelope interval (cycle-level) and applies a slower interval in `Run`.
|
|
||||||
- **config**: hub validate cases, key redaction, env overlay + defaults.
|
|
||||||
|
|
||||||
### Verification
|
### Verification
|
||||||
- `go build/vet/test` green locally (go1.26.0) and on the build server (go1.26.0). No live hub
|
`go build/vet/test ./...` green locally (go1.26) and on the build server (go1.26). Version 0.3.0 → 0.3.1.
|
||||||
or `systemctl` in unit tests (mock transport + fake prober/collector/reporter).
|
|
||||||
|
|
||||||
### Repo state
|
### Repo state
|
||||||
- Branch: `main` only. Version 0.3.0. Dep unchanged (`golang.org/x/crypto v0.52.0`).
|
Branch: `main` only. Dep unchanged (`golang.org/x/crypto v0.52.0`).
|
||||||
|
|||||||
@@ -26,7 +26,7 @@ import (
|
|||||||
|
|
||||||
// version is the agent version. Overridable at build time with
|
// version is the agent version. Overridable at build time with
|
||||||
// -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version.
|
// -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version.
|
||||||
var version = "0.3.0"
|
var version = "0.3.1"
|
||||||
|
|
||||||
func main() {
|
func main() {
|
||||||
var (
|
var (
|
||||||
@@ -326,7 +326,7 @@ func (f *selftestFlag) Set(v string) error {
|
|||||||
case "hub":
|
case "hub":
|
||||||
f.mode = "hub"
|
f.mode = "hub"
|
||||||
default:
|
default:
|
||||||
return fmt.Errorf("invalid --selftest value %q (want read|task)", v)
|
return fmt.Errorf("invalid --selftest value %q (want read|task|hub)", v)
|
||||||
}
|
}
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -105,13 +105,18 @@ func (c *Collector) collectGuests(ctx context.Context) []Guest {
|
|||||||
guests := make([]Guest, 0, len(lxc))
|
guests := make([]Guest, 0, len(lxc))
|
||||||
for _, g := range lxc {
|
for _, g := range lxc {
|
||||||
entry := Guest{VMID: g.VMID, Name: g.Name, Status: g.Status, ControllerVersion: ""}
|
entry := Guest{VMID: g.VMID, Name: g.Name, Status: g.Status, ControllerVersion: ""}
|
||||||
|
// Normalize an empty run-status to "unknown" so the wire value is always one
|
||||||
|
// of running|stopped|unknown (matches the hub handler's empty→unknown default).
|
||||||
|
if entry.Status == "" {
|
||||||
|
entry.Status = "unknown"
|
||||||
|
}
|
||||||
// GuestConfig supplies cores; memory/disk come from the list entry (bytes).
|
// GuestConfig supplies cores; memory/disk come from the list entry (bytes).
|
||||||
|
// On failure, KEEP the known run-status from ListLXC — only the spec is lost.
|
||||||
cfg, err := c.px.GuestConfig(ctx, g.VMID)
|
cfg, err := c.px.GuestConfig(ctx, g.VMID)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
c.logger.Warn("hub: GuestConfig failed; guest degraded to unknown",
|
c.logger.Warn("hub: GuestConfig failed; spec omitted (run-status kept)",
|
||||||
"vmid", g.VMID, "err", err)
|
"vmid", g.VMID, "err", err)
|
||||||
entry.Status = "unknown"
|
entry.Spec = nil
|
||||||
entry.Spec = nil // omitted
|
|
||||||
} else {
|
} else {
|
||||||
entry.Spec = &GuestSpec{
|
entry.Spec = &GuestSpec{
|
||||||
Cores: cfg.Cores,
|
Cores: cfg.Cores,
|
||||||
|
|||||||
@@ -58,7 +58,7 @@ func TestCollect_HostAndGuests(t *testing.T) {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) {
|
func TestCollect_GuestConfigFailureKeepsStatusOmitsSpec(t *testing.T) {
|
||||||
px := &fakePx{
|
px := &fakePx{
|
||||||
node: "demo-felhom",
|
node: "demo-felhom",
|
||||||
ns: newTestNodeStatus(),
|
ns: newTestNodeStatus(),
|
||||||
@@ -69,7 +69,7 @@ func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) {
|
|||||||
cfg: map[int]proxmox.GuestConfig{100: {Cores: 2}},
|
cfg: map[int]proxmox.GuestConfig{100: {Cores: 2}},
|
||||||
cfgErr: map[int]error{200: errors.New("config read failed")},
|
cfgErr: map[int]error{200: errors.New("config read failed")},
|
||||||
}
|
}
|
||||||
c := NewCollector(px, fakeProber{status: "active"}, "h", "0.3.0", quietLogger())
|
c := NewCollector(px, fakeProber{status: "active"}, "h", "0.3.1", quietLogger())
|
||||||
r, err := c.Collect(context.Background())
|
r, err := c.Collect(context.Background())
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("a per-guest failure must NOT fail the whole report: %v", err)
|
t.Fatalf("a per-guest failure must NOT fail the whole report: %v", err)
|
||||||
@@ -77,9 +77,14 @@ func TestCollect_GuestConfigFailureDegradesButStillReports(t *testing.T) {
|
|||||||
if len(r.Guests) != 2 {
|
if len(r.Guests) != 2 {
|
||||||
t.Fatalf("guests = %d", len(r.Guests))
|
t.Fatalf("guests = %d", len(r.Guests))
|
||||||
}
|
}
|
||||||
|
// GuestConfig failed for vmid 200, but its run-status (from ListLXC) is known and
|
||||||
|
// must be PRESERVED — only the spec is dropped.
|
||||||
bad := r.Guests[1]
|
bad := r.Guests[1]
|
||||||
if bad.Status != "unknown" || bad.Spec != nil {
|
if bad.Status != "running" {
|
||||||
t.Errorf("degraded guest = %+v (want status=unknown, spec=nil)", bad)
|
t.Errorf("status = %q, want preserved \"running\" (not forced to unknown)", bad.Status)
|
||||||
|
}
|
||||||
|
if bad.Spec != nil {
|
||||||
|
t.Errorf("spec = %+v, want nil (omitted on config failure)", bad.Spec)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,76 @@
|
|||||||
|
package hub
|
||||||
|
|
||||||
|
import (
|
||||||
|
"encoding/json"
|
||||||
|
"os"
|
||||||
|
"reflect"
|
||||||
|
"sort"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
// The host-report shape is a contract DUPLICATED across two repos (no shared types
|
||||||
|
// module yet). testdata/host-report.golden.json MUST be kept byte-identical with
|
||||||
|
// felhom-hub's hub/internal/api/testdata/host-report.golden.json. This test fails
|
||||||
|
// if a json tag on HostReport/HostMetrics/Guest is renamed/added/removed relative
|
||||||
|
// to the golden, catching silent drift before slices 5/6 populate the empty
|
||||||
|
// collections. (Promote to a shared types module when those land.)
|
||||||
|
func TestHostReport_ContractMatchesGolden(t *testing.T) {
|
||||||
|
raw, err := os.ReadFile("testdata/host-report.golden.json")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatal(err)
|
||||||
|
}
|
||||||
|
var golden map[string]any
|
||||||
|
if err := json.Unmarshal(raw, &golden); err != nil {
|
||||||
|
t.Fatalf("golden is not valid JSON: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// A constructed report mirroring the golden's populated shape (guests[0] has spec).
|
||||||
|
report := &HostReport{
|
||||||
|
HostID: "demo-host-01", ReportedAt: "2026-06-08T12:00:00Z", AgentVersion: "0.3.1",
|
||||||
|
Host: HostMetrics{Node: "demo-felhom", LoadAvg: []string{"0.10"}},
|
||||||
|
Guests: []Guest{
|
||||||
|
{VMID: 100, Name: "a", Status: "running", ControllerVersion: "", Spec: &GuestSpec{Cores: 2}},
|
||||||
|
{VMID: 101, Name: "b", Status: "stopped", ControllerVersion: ""},
|
||||||
|
},
|
||||||
|
StorageTargets: []StorageTarget{}, Backups: []Backup{}, RestoreTests: []RestoreTest{},
|
||||||
|
PBSSnapshots: []PBSSnapshot{}, AuditTail: []AuditEntry{},
|
||||||
|
Cloudflared: Cloudflared{Status: "active"},
|
||||||
|
}
|
||||||
|
b, _ := json.Marshal(report)
|
||||||
|
var got map[string]any
|
||||||
|
json.Unmarshal(b, &got)
|
||||||
|
|
||||||
|
assertSameKeys(t, "<top>", golden, got)
|
||||||
|
assertSameKeys(t, "host", golden["host"], got["host"])
|
||||||
|
assertSameKeys(t, "guests[0]",
|
||||||
|
firstElem(golden["guests"]), firstElem(got["guests"]))
|
||||||
|
}
|
||||||
|
|
||||||
|
func firstElem(v any) any {
|
||||||
|
arr, ok := v.([]any)
|
||||||
|
if !ok || len(arr) == 0 {
|
||||||
|
return map[string]any{}
|
||||||
|
}
|
||||||
|
return arr[0]
|
||||||
|
}
|
||||||
|
|
||||||
|
func assertSameKeys(t *testing.T, where string, a, b any) {
|
||||||
|
t.Helper()
|
||||||
|
ka, kb := keysOf(a), keysOf(b)
|
||||||
|
if !reflect.DeepEqual(ka, kb) {
|
||||||
|
t.Errorf("contract drift at %s:\n golden keys = %v\n struct keys = %v", where, ka, kb)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func keysOf(v any) []string {
|
||||||
|
m, ok := v.(map[string]any)
|
||||||
|
if !ok {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
keys := make([]string, 0, len(m))
|
||||||
|
for k := range m {
|
||||||
|
keys = append(keys, k)
|
||||||
|
}
|
||||||
|
sort.Strings(keys)
|
||||||
|
return keys
|
||||||
|
}
|
||||||
+38
@@ -0,0 +1,38 @@
|
|||||||
|
{
|
||||||
|
"host_id": "demo-host-01",
|
||||||
|
"reported_at": "2026-06-08T12:00:00Z",
|
||||||
|
"agent_version": "0.3.1",
|
||||||
|
"host": {
|
||||||
|
"node": "demo-felhom",
|
||||||
|
"cpu_percent": 3.2,
|
||||||
|
"memory_total_bytes": 16777216000,
|
||||||
|
"memory_used_bytes": 4194304000,
|
||||||
|
"memory_percent": 25,
|
||||||
|
"disk_total_bytes": 152000000000,
|
||||||
|
"disk_used_bytes": 30000000000,
|
||||||
|
"disk_percent": 19.7,
|
||||||
|
"loadavg": ["0.10", "0.20", "0.15"],
|
||||||
|
"uptime_seconds": 86400
|
||||||
|
},
|
||||||
|
"guests": [
|
||||||
|
{
|
||||||
|
"vmid": 100,
|
||||||
|
"name": "felhom-cust-acme",
|
||||||
|
"status": "running",
|
||||||
|
"controller_version": "",
|
||||||
|
"spec": { "cores": 2, "memory_bytes": 2147483648, "disk_bytes": 21474836480 }
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"vmid": 101,
|
||||||
|
"name": "felhom-cust-beta",
|
||||||
|
"status": "stopped",
|
||||||
|
"controller_version": ""
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"storage_targets": [],
|
||||||
|
"backups": [],
|
||||||
|
"restore_tests": [],
|
||||||
|
"pbs_snapshots": [],
|
||||||
|
"cloudflared": { "status": "active" },
|
||||||
|
"audit_tail": []
|
||||||
|
}
|
||||||
Reference in New Issue
Block a user