felhom-agent/REPORT.md
admin 237452c8c6 docs: reflow CLAUDE.md; unify REPORT/CHANGELOG convention; add no-secrets rule
Also overwrite REPORT.md with the live --selftest=task validation on demo-felhom
(snapshot/rollback/delete on guest 9999, exitstatus=OK under the felhom-agent@pve
privsep token; slice-1 mutating-ops gap closed, slice 4 unblocked). No version bump.
Token secret stored out-of-band, not committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 20:54:18 +02:00


REPORT — live --selftest=task validation on the demo host (2026-06-08)

Overwrite-latest report (most recent significant run only). Cumulative history lives in CHANGELOG.md.

Outcome

--selftest=task passed live against the demo Proxmox host. This closes the slice-1 gap: the slice-1 mutating ops and WaitTask had been unit-tested only and never run against a live host. The shared async WaitTask foundation (UPID poll → assert exitstatus == "OK") is now validated live. Slice 4 (reconcile) is unblocked.
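The poll-then-assert shape of WaitTask can be sketched as follows. This is a minimal illustration, not the agent's actual code: `waitTask`, `StatusFunc`, `fakeTask`, `demoWait`, and the placeholder UPID are hypothetical names, and the fake status source stands in for what would presumably be a GET on the Proxmox task-status endpoint (`/nodes/{node}/tasks/{upid}/status`) in a real client.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// TaskStatus holds the two task-status fields that matter here:
// Status ("running" or "stopped") and ExitStatus ("OK" on success).
type TaskStatus struct {
	Status     string
	ExitStatus string
}

// StatusFunc is a hypothetical hook that fetches the status of one UPID.
type StatusFunc func(ctx context.Context, upid string) (TaskStatus, error)

// waitTask polls until the task has stopped, then requires exitstatus == "OK".
// A stopped task with any other exit status is reported as an error.
func waitTask(ctx context.Context, fetch StatusFunc, upid string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		st, err := fetch(ctx, upid)
		if err != nil {
			return fmt.Errorf("poll %s: %w", upid, err)
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task %s failed: exitstatus=%q", upid, st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// fakeTask reports "running" n times, then "stopped" with the given exit
// status. It replaces the live host for this sketch.
func fakeTask(n int, exit string) StatusFunc {
	calls := 0
	return func(ctx context.Context, upid string) (TaskStatus, error) {
		calls++
		if calls <= n {
			return TaskStatus{Status: "running"}, nil
		}
		return TaskStatus{Status: "stopped", ExitStatus: exit}, nil
	}
}

// demoWait runs waitTask against a fake task with a one-second deadline.
func demoWait(runningPolls int, exit string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitTask(ctx, fakeTask(runningPolls, exit), "UPID:example", time.Millisecond)
}

func main() {
	fmt.Println("task ending OK:", demoWait(2, "OK"))
	fmt.Println("task ending in error:", demoWait(1, "command failed"))
}
```

Keeping the status fetch behind a function value is one way to unit-test the loop without a host; the live run is what shows that the real endpoint behaves the same way.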

What ran

Executed the live-validation runbook end-to-end on node demo-felhom (https://192.168.0.162:8006, PVE 9.2.x). Provisioning ran as root@pam via the SSH alias felhom-pve; the agent ran from the build server 192.168.0.180 on go1.26.

  1. Provisioned the operator-tier token (Part A). Created the FelhomAgent role with the full 16 privileges, the felhom-agent@pve user, and a --privsep 1 token felhom-agent@pve!agent. Granted the role on both the user and the token, because a privilege-separated token's effective permissions are the intersection of the user's and the token's ACLs (the privsep intersection gotcha). Verified two ACL rows at /.
  2. Scratch guest (Part B). Created stopped LXC 9999 (felhom-selftest-scratch), rootfs on local-lvm (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale felhom-selftest snapshot present.
  3. Config + TLS (Part C). Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (BA:7C:99:7D:45:D0…). Built the agent (v0.3.1) on the build server.
  4. Read-only gate (Part D). --selftest=read clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed.
  5. Live mutating run (Part E). --selftest=task -vmid 9999 — snapshot → rollback → delete-snapshot, each returning a real UPID that WaitTask polled to exitstatus=OK.
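The fingerprint check in step 3 can be illustrated with a short sketch of leaf-certificate pinning in Go. It assumes the pin is stored as colon-separated hex, as in the report; `fingerprint`, `matchesPin`, and `pinnedTLSConfig` are hypothetical names, and the pin used in `main` is simply the SHA-256 of empty input, not the demo host's real value.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"strings"
)

// fingerprint formats the SHA-256 digest of a DER-encoded certificate as
// colon-separated uppercase hex.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	parts := make([]string, len(sum))
	for i, b := range sum {
		parts[i] = fmt.Sprintf("%02X", b)
	}
	return strings.Join(parts, ":")
}

// matchesPin compares a certificate's fingerprint with the pinned value,
// ignoring case and surrounding whitespace.
func matchesPin(der []byte, pinned string) bool {
	return strings.EqualFold(fingerprint(der), strings.TrimSpace(pinned))
}

// pinnedTLSConfig returns a TLS config that accepts a server only if its
// leaf certificate matches the pin. Chain verification is switched off and
// replaced by the pin check, which suits a self-signed host certificate.
func pinnedTLSConfig(pinned string) *tls.Config {
	return &tls.Config{
		InsecureSkipVerify: true, // the callback below is the only check
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("no peer certificate presented")
			}
			if !matchesPin(rawCerts[0], pinned) {
				return fmt.Errorf("leaf fingerprint %s does not match pin", fingerprint(rawCerts[0]))
			}
			return nil
		},
	}
}

func main() {
	// Placeholder pin: the SHA-256 of empty input, for illustration only.
	pin := fingerprint([]byte{})
	cfg := pinnedTLSConfig(pin)
	fmt.Println("matching leaf:", cfg.VerifyPeerCertificate([][]byte{{}}, nil))
	fmt.Println("other leaf:", cfg.VerifyPeerCertificate([][]byte{[]byte("other")}, nil))
}
```

With a pin like this, a regenerated host certificate fails the handshake instead of being silently trusted, which is why the runbook re-confirms the fingerprint before the run.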

Evidence

  • exitstatus=OK on all three ops (a 200 on the POST is explicitly not treated as success — the exitstatus assertion is the point of the run).
  • The task UPIDs name the token actor (…:vzsnapshot:9999:felhom-agent@pve!agent:, likewise vzrollback / vzdelsnapshot), which confirms the privsep token path was genuinely exercised, with no privilege drift.
  • Role: all 16 privileges present (VM.Snapshot, VM.Snapshot.Rollback, VM.Backup, the VM.Config.* set, VM.PowerMgmt, VM.Allocate/Audit, Datastore.*, Sys.Audit, SDN.Use).
  • ACLs: both -user felhom-agent@pve and -token felhom-agent@pve!agent carry FelhomAgent at /.
  • Post-state (Part F): felhom-selftest snapshot created then cleaned (only current remains); guest left stopped, as started.
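The actor check on the UPIDs can be automated with a small parser. The sketch below relies only on the tail shown in the evidence (task type, guest ID, user, trailing colon) and reads those fields from the end of the string; `parseUPIDTail` and `taskActor` are hypothetical names, and the leading fields of the sample UPID are made-up values.

```go
package main

import (
	"fmt"
	"strings"
)

// taskActor holds the last three UPID fields: task type, guest ID, and the
// user or token that started the task.
type taskActor struct {
	Type string
	ID   string
	User string
}

// parseUPIDTail extracts type, ID, and user from the end of a UPID, so the
// exact number of leading fields does not matter.
func parseUPIDTail(upid string) (taskActor, error) {
	if !strings.HasPrefix(upid, "UPID:") || !strings.HasSuffix(upid, ":") {
		return taskActor{}, fmt.Errorf("not a UPID: %q", upid)
	}
	fields := strings.Split(strings.TrimSuffix(upid, ":"), ":")
	if len(fields) < 4 {
		return taskActor{}, fmt.Errorf("UPID too short: %q", upid)
	}
	n := len(fields)
	return taskActor{Type: fields[n-3], ID: fields[n-2], User: fields[n-1]}, nil
}

func main() {
	// Sample with invented leading fields; only the tail mirrors the report.
	sample := "UPID:demo-felhom:00001234:00ABCDEF:66000000:vzsnapshot:9999:felhom-agent@pve!agent:"
	actor, err := parseUPIDTail(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("type=%s id=%s user=%s\n", actor.Type, actor.ID, actor.User)
	if actor.User != "felhom-agent@pve!agent" {
		fmt.Println("task did not run under the expected token")
	}
}
```

A selftest could compare `User` against the configured token ID after each operation, turning the manual inspection above into an assertion.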

Scope / not covered (by design)

  • Not validated live: Start/Stop/SetConfig (reversible and low-risk; SetConfig is used by reconcile, and an optional selftest extension could add all three); Vzdump (already confirmed live in the phase1-2 spike); and RestoreLXC / provision-by-restore (deferred until the golden base image exists, ~slice 7).
  • The run used a stopped guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker.

Credentials

The standing FelhomAgent operator token (felhom-agent@pve!agent) provisioned here is the one that slice 4 and later consume, so it was not deleted. Its secret is stored out-of-band and supplied to the agent via FELHOM_AGENT_PROXMOX_TOKEN; it is not persisted to the repo (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target.
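How the secret might travel from the environment to a request can be sketched as below. This assumes FELHOM_AGENT_PROXMOX_TOKEN holds only the token secret and that the token ID comes from config; the report does not state either detail. `authHeader` is a hypothetical helper, and the header value follows the Proxmox API token scheme `PVEAPIToken=USER@REALM!TOKENID=SECRET`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenEnv is the environment variable named in the report.
const tokenEnv = "FELHOM_AGENT_PROXMOX_TOKEN"

// authHeader builds the Authorization header value for a Proxmox API token.
// The lookup function is injected so the secret source can be swapped out
// in tests; in production it would be os.LookupEnv.
// Assumption: the variable carries only the secret, not "tokenid=secret".
func authHeader(tokenID string, lookup func(string) (string, bool)) (string, error) {
	secret, ok := lookup(tokenEnv)
	secret = strings.TrimSpace(secret)
	if !ok || secret == "" {
		return "", fmt.Errorf("%s is not set", tokenEnv)
	}
	return "PVEAPIToken=" + tokenID + "=" + secret, nil
}

func main() {
	header, err := authHeader("felhom-agent@pve!agent", os.LookupEnv)
	if err != nil {
		fmt.Println("cannot authenticate:", err)
		return
	}
	// Report only that a header was built; never print the secret itself.
	fmt.Println("authorization header ready, length:", len(header))
}
```

Failing fast when the variable is unset keeps a placeholder from the on-disk config from ever being sent to the host as a credential.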