felhom-agent/REPORT.md
2026-06-08 14:47:38 +02:00

felhom-agent — latest task report

This file holds the report for the most recent change and is fully overwritten with each task. Cumulative history lives in CHANGELOG.md.

Task: Agent scaffold + proxmox interaction package (slice 1) — v0.1.0

Stood up the host-agent project and its foundation — the typed proxmox interaction layer every other agent module will call — with a runnable read-only --selftest. Pushed to main (main-only repo). Build/vet/test green; verified live against the demo host.

Public surface

proxmox.Client (API backend):

  • Read: Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent
  • Async mutating (return a UPID): RestoreLXC (primary create path), Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop
  • Tasks: WaitTask(ctx, upid, WaitOptions), TaskStatusOnce, TaskLogTail
  • Errors: *APIError (parses the offending privilege from a 403), *TaskError (parses it from a failed task exitstatus + log tail)
  • Types: Version, Node, NodeStatus, Guest, GuestConfig (+Extra/MountPoints/Nets), Storage, StorageContent, TaskStatus, UPID

proxmox.Privileged (fenced root-CLI; Runner iface, ExecRunner direct/sudo -n): CreateGoldenLXC (keyctl), MountUSBByUUID, SMART, Sensors — each documents why it can't be the API.
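The two ExecRunner modes can be illustrated with a minimal sketch. This is not the agent's code: the Mode type, its constants, and the buildArgv helper are hypothetical names, and smartctl is only an illustrative command. The sketch assumes that the sudo mode does nothing more than prefix the command with sudo -n, so a missing sudoers entry fails immediately instead of prompting for a password.

```go
package main

import "fmt"

// Mode selects how a privileged command is launched.
// Hypothetical type; the report only names the two modes "sudo" and "direct".
type Mode int

const (
	ModeSudo   Mode = iota // run through "sudo -n" (non-interactive)
	ModeDirect             // run the binary as the current user
)

// buildArgv returns the full argument vector for the given mode.
// It never executes anything, which keeps the sketch safe to run anywhere.
func buildArgv(mode Mode, name string, args ...string) []string {
	cmd := append([]string{name}, args...)
	if mode == ModeSudo {
		return append([]string{"sudo", "-n"}, cmd...)
	}
	return cmd
}

func main() {
	// smartctl is an illustrative choice, not taken from the report.
	fmt.Println(buildArgv(ModeSudo, "smartctl", "-a", "/dev/sda"))
	fmt.Println(buildArgv(ModeDirect, "smartctl", "-a", "/dev/sda"))
}
```

Building the argument vector in one place would also make it straightforward to compare each command against a sudoers allowlist in a unit test.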

API-vs-root routing table

| Backend | Ops | Why |
| --- | --- | --- |
| API | node status, list/status/config guests, storage list+content, task status/log, restore, vzdump, snapshot/rollback/delete-snap, set-config, start/stop | FelhomAgent 16-priv token |
| root-CLI (fenced) | golden pct create (keyctl=1), USB mount-by-UUID/fstab, SMART/sensors | keyctl is root@pam-only; host mounts + SMART aren't API ops |

Fence is structural (Client has no runner, Privileged has no HTTP client) and asserted in routing_test.go.

OPEN-item choices

  • Config: JSON file + FELHOM_AGENT_* env overrides (stdlib, zero-dep; swappable to yaml.v3 if YAML house-style is preferred). Token never logged (Redacted()).
  • Privileged runner / uid: Runner iface; ExecRunner{Mode: sudo|direct}, default sudo -n. Proposed (not finalized): non-root service user + narrow sudoers allowlist for the 3 fenced commands.
  • Polling: first poll immediate, then 1s → exponential backoff capped 5s, default total timeout 10m; honors ctx cancellation. Tunable via WaitOptions.
  • --selftest=task: included (gated behind the flag + -vmid). Unit-tested via mocks; not run live (the live token was read-only).
  • Versioning: version var in main.go (default 0.1.0, -ldflags -X main.version=), --version flag.

What the live host revealed (recorded, not guessed)

  • Node name is demo-felhom; felhom-pve is only the SSH alias.
  • /nodes/{node}/status: cpu is a 0..1 fraction, loadavg is an array of strings; memory/rootfs/swap nested.
  • vmid is an integer in list/status; status/current carries no vmid (set from the path arg).
  • Task: status ∈ {running, stopped}, exitstatus only once stopped; task log is [{"n":N,"t":"…"}]. UPID = UPID:node:pid(hex):pstart(hex):starttime(hex):worker:id:user:.
  • pveum user token add … --output-format json returns {"value":"…"}.
  • No spike fact failed in practice: the 16-priv role, async/UPID model, keyctl boundary, and dual-grant privsep all held. Teardown logged "ignore invalid acl token …", confirming ACL auto-invalidation (phase1-2 §5).
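The UPID layout recorded above lends itself to a small parser. The sketch below is illustrative only: parsedUPID and parseUPID are hypothetical names (not the package's UPID type), the sample string is made up, and reading starttime as Unix seconds is an assumption the report does not state.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsedUPID holds the fields of
// UPID:node:pid(hex):pstart(hex):starttime(hex):worker:id:user:
type parsedUPID struct {
	Node      string
	PID       uint64
	PStart    uint64
	StartTime time.Time // assumes starttime is Unix seconds
	Worker    string
	ID        string
	User      string
}

// parseUPID splits on ':' and decodes the three hex fields.
// The trailing ':' yields an empty ninth element, which is checked explicitly.
func parseUPID(s string) (parsedUPID, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 9 || parts[0] != "UPID" || parts[8] != "" {
		return parsedUPID{}, fmt.Errorf("malformed UPID %q", s)
	}
	var nums [3]uint64
	for i := range nums {
		n, err := strconv.ParseUint(parts[2+i], 16, 64)
		if err != nil {
			return parsedUPID{}, fmt.Errorf("UPID field %d: %w", 2+i, err)
		}
		nums[i] = n
	}
	return parsedUPID{
		Node:      parts[1],
		PID:       nums[0],
		PStart:    nums[1],
		StartTime: time.Unix(int64(nums[2]), 0).UTC(),
		Worker:    parts[5],
		ID:        parts[6],
		User:      parts[7],
	}, nil
}

func main() {
	// Made-up sample value that follows the recorded layout.
	u, err := parseUPID("UPID:demo-felhom:0000ABCD:0012EF34:66644E5A:vzdump:101:root@pam:")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u.Node, u.PID, u.Worker, u.ID, u.User)
}
```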

Verification

  • go build/vet/test green twice: locally (Go 1.26) and on the build server (Go 1.24.4).
  • Live read-only --selftest (built on 192.168.0.180, against https://192.168.0.162:8006, TLS fingerprint-pinned — no insecure mode): version, nodes, node status, guests, storage all [ ok ]. slog confirmed the token rendered as …=********. Throwaway token created + torn down.
  • Mutating ops + live WaitTask are unit-tested only (live run used a read-only token); --selftest=task is ready to exercise them against a real FelhomAgent token.

Repo state

  • Branch: main only (feature branch merged + deleted, local & remote). Latest commit: chore(agent): add CHANGELOG, version the agent at 0.1.0.