Append a reversible SetConfig write+revert to runSelftestTask: read GuestConfig, write a `description` marker, verify it landed, restore the original (or delete if absent), verify the restore. Handles PVE's dual-mode SetConfig return (empty UPID = synchronous; UPID = WaitTask+assert OK).

Live self-gate PASSED on demo-felhom / guest 9999.

Findings:
- LXC `description` write is synchronous (empty UPID); dual-mode modeling confirmed; empty string is success, not an error.
- PVE appends a trailing newline to `description` on read; slice-4 reconcile must normalize description comparisons (hence normDesc helper).

First live exercise of the VM.Config.* privilege cluster. Standing operator token rotated during the run; new secret stored out-of-band, not in the repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
felhom-agent
The host agent for the Felhom platform — the operator-tier component that runs on each
Proxmox host and owns all Proxmox interaction (provision/restore guests, host storage,
backups, host+tunnel monitoring, hub control loop, per-guest local API). Design:
felhom.eu/documentation/architecture/03-host-agent.md.
Status: slice 1 of N. This repo currently contains the project scaffold and the
internal/proxmox interaction layer (the typed library every other module will call to talk to Proxmox), plus a runnable read-only --selftest. No reconcile loop, hub client, signing, or storage/backup orchestration yet; those are later slices.
Module: gitea.dooplex.hu/admin/felhom-agent · binary: felhom-agent · Go 1.24.
Layout
cmd/felhom-agent/ # entry point + --selftest (wiring only; no daemon loop yet)
internal/proxmox/ # the Proxmox interaction layer (API-first + fenced root-CLI)
internal/config/ # JSON config + env overrides (secrets never logged)
internal/log/ # slog setup
configs/agent.example.json
The proxmox package — model
Two backends, one fixed routing policy (the fence is structural — Client never shells out,
Privileged never makes an HTTP call; asserted in routing_test.go):
| Backend | Type | Used for |
|---|---|---|
| API (default) | proxmox.Client | everything the scoped FelhomAgent token can do |
| root-CLI (fenced) | proxmox.Privileged | the three proven OS-root exceptions only |
Grounded entirely in the spike findings (felhom.eu/documentation/proxmox-platform.md,
tests/phase{0,1-2,3}-findings.md). Every mutating API op is async: it returns a UPID and
the caller WaitTasks until the task stops, then asserts exitstatus == "OK" — authorization
can surface at task execution, not the HTTP POST (phase1-2 §1.3).
Public surface
Client (API):
- Read: Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent.
- Async mutating (return UPID): RestoreLXC (primary create path), Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop.
- Tasks: WaitTask, TaskStatusOnce, TaskLogTail.
- Errors: *APIError (parses the offending privilege from a 403), *TaskError (parses it from a failed task exitstatus).
Privileged (fenced root-CLI); each method documents why it can't be the API:
- CreateGoldenLXC: pct create with keyctl=1 (root@pam-only; the only root-fenced create, since the per-customer path provisions by restore, which preserves keyctl).
- MountUSBByUUID: host mount-by-UUID (not a Proxmox API op).
- SMART, Sensors: hardware reads (not API-exposed).
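As an illustration of the privilege-naming errors listed above, the sketch below pulls a path and privilege out of a permission-failure message. It assumes the message has the shape `Permission check failed (<path>, <privilege>)`; that shape, `PermissionError`, and `parsePermission` are assumptions made for this example and are not the package's *APIError or *TaskError.

```go
package main

import (
	"fmt"
	"regexp"
)

// permRe matches the ASSUMED message shape
// "Permission check failed (<path>, <privilege>)".
var permRe = regexp.MustCompile(`Permission check failed \(([^,]+),\s*([^)]+)\)`)

// PermissionError is a hypothetical illustration type.
type PermissionError struct {
	Path      string
	Privilege string
}

func (e *PermissionError) Error() string {
	return fmt.Sprintf("missing privilege %s on %s", e.Privilege, e.Path)
}

// parsePermission extracts the path and privilege from msg.
// The boolean is false when msg does not match the assumed shape.
func parsePermission(msg string) (*PermissionError, bool) {
	m := permRe.FindStringSubmatch(msg)
	if m == nil {
		return nil, false
	}
	return &PermissionError{Path: m[1], Privilege: m[2]}, true
}

func main() {
	if pe, ok := parsePermission("Permission check failed (/vms/9999, VM.Config.Options)"); ok {
		fmt.Println(pe)
	}
	_, ok := parsePermission("some unrelated error")
	fmt.Println(ok)
}
```

Returning a typed error rather than a string lets callers branch on the missing privilege without re-parsing the message.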
API-vs-root routing table
See the table in internal/proxmox/doc.go. Summary: the entire guest
lifecycle including restore is API-token-covered; OS-root is confined to golden-image
keyctl create, host mounts, and SMART/sensors (phase3 §B3).
TLS trust
The host serves a self-signed cert. Verification is not blanket-disabled. Pick one in
config: ca_file (PEM, full verify), fingerprint (SHA-256 of the host leaf cert — pinned
exact-cert match; the /nodes API returns each node's ssl_fingerprint to pin), or the
explicitly-named insecure_skip_verify (off by default; selftest-against-127.0.0.1 only).
Provisioning the token (out-of-band, operator side)
The agent only consumes a privilege-separated API token; role setup is a provisioning step. The role must be granted on both the user AND the token for the same path, or the intersection is empty and every call 403s (phase1-2 §1.2):
pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU \
VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot \
VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace \
Datastore.Audit Sys.Audit SDN.Use" # 16 privileges, validated Phase 3 B3
pveum user add felhom-agent@pve
pveum user token add felhom-agent@pve agent --privsep 1 # capture the secret (shown once)
pveum acl modify / -user 'felhom-agent@pve' -role FelhomAgent
pveum acl modify / -token 'felhom-agent@pve!agent' -role FelhomAgent
(VM.Config.CPUMemory is not a real privilege; SDN.Use is required for bridge use.)
Run
go build ./...
# read-only health check against the host:
./felhom-agent --config configs/agent.example.json --selftest
# or via env (keeps the secret off disk):
FELHOM_AGENT_PROXMOX_TOKEN='felhom-agent@pve!agent=SECRET' \
FELHOM_AGENT_PROXMOX_NODE=demo-felhom \
FELHOM_AGENT_PROXMOX_ENDPOINT=https://192.168.0.162:8006 \
FELHOM_AGENT_PROXMOX_TLS_FINGERPRINT='BA:7C:...:CF' \
./felhom-agent --selftest
--selftest (read-only) loads config, builds the API client, and runs the read queries (version,
nodes, node status, guests, storage), printing a short health report. It mutates nothing and says
so cleanly if the token/endpoint isn't configured.
--selftest=task --vmid N (explicitly gated) exercises WaitTask on a reversible op
(snapshot → rollback → delete-snapshot) against guest N. Default --selftest never mutates.
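A minimal sketch of how the env-supplied token might be validated, turned into a request header, and kept out of logs. `splitToken`, `authHeader`, and `redact` are hypothetical helpers, not the repo's internal/config code; the `PVEAPIToken=` header prefix is the Proxmox VE API-token scheme, and the env variable name is the one used in the commands above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// splitToken separates "user@realm!tokenid=SECRET" into the token ID and secret.
func splitToken(tok string) (string, string, error) {
	id, secret, ok := strings.Cut(tok, "=")
	if !ok || secret == "" || !strings.Contains(id, "@") || !strings.Contains(id, "!") {
		return "", "", errors.New("token must look like user@realm!tokenid=SECRET")
	}
	return id, secret, nil
}

// authHeader returns the Authorization header value for an API token.
func authHeader(tok string) (string, error) {
	if _, _, err := splitToken(tok); err != nil {
		return "", err
	}
	return "PVEAPIToken=" + tok, nil
}

// redact returns a log-safe form of the token that never includes the secret.
func redact(tok string) string {
	id, _, err := splitToken(tok)
	if err != nil {
		return "<invalid token>"
	}
	return id + "=<redacted>"
}

func main() {
	tok := os.Getenv("FELHOM_AGENT_PROXMOX_TOKEN")
	if tok == "" {
		fmt.Println("token not configured; nothing to do")
		return
	}
	if _, err := authHeader(tok); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("using token", redact(tok))
}
```

Validating the shape up front lets a misconfigured token fail with a clear message before any request is sent, and routing every log line through `redact` is one way to honor the "secrets never logged" rule.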
Process model (proposed, not finalized — see 03 §3/§12)
Native Go binary, systemd service, non-root service user holding the scoped token, with a
narrow sudoers allowlist for the three fenced ops. privileged.mode: "sudo" matches this;
"direct" is for dev/CI where the agent is already root.
Test
go vet ./... && go test ./...
Unit tests use a mock HTTP transport + mock runner (no live host): UPID parse, WaitTask
(running→OK / running→failed-403 / timeout / ctx-cancel), 403→privilege-named error, response
decoding against the captured live shapes, and the API-vs-root routing fence.