# felhom-agent

The **host agent** for the Felhom platform: the operator-tier component that runs on each Proxmox
host and owns *all* Proxmox interaction (provision/restore guests, host storage, backups,
host+tunnel monitoring, hub control loop, per-guest local API). Design:
[`felhom.eu/documentation/architecture/03-host-agent.md`](https://gitea.dooplex.hu/admin/felhom.eu/raw/branch/main/documentation/architecture/03-host-agent.md).

> **Status: slice 1 of N.** This repo currently contains the project scaffold and the
> **`internal/proxmox`** interaction layer (the typed library every other module will call to
> talk to Proxmox), plus a runnable read-only `--selftest`. **No** reconcile loop, hub client,
> signing, or storage/backup orchestration yet; those are later slices.

Module: `gitea.dooplex.hu/admin/felhom-agent` · binary: `felhom-agent` · Go 1.24.

## Layout

```
cmd/felhom-agent/     # entry point + --selftest (wiring only; no daemon loop yet)
internal/proxmox/     # the Proxmox interaction layer (API-first + fenced root-CLI)
internal/config/      # JSON config + env overrides (secrets never logged)
internal/log/         # slog setup
configs/agent.example.json
```

## The `proxmox` package — model

Two backends, one fixed routing policy. The fence is structural: `Client` never shells out and
`Privileged` never makes an HTTP call (asserted in `routing_test.go`).

| | Backend | Used for |
|---|---|---|
| **API (default)** | `proxmox.Client` | everything the scoped **FelhomAgent** token can do |
| **root-CLI (fenced)** | `proxmox.Privileged` | the **three** proven OS-root exceptions only |

Grounded entirely in the spike findings (`felhom.eu/documentation/proxmox-platform.md`,
`tests/phase{0,1-2,3}-findings.md`). Every mutating API op is **async**: it returns a UPID and the
caller `WaitTask`s until the task stops, then asserts `exitstatus == "OK"`; authorization can
surface at task execution, not at the HTTP POST (phase1-2 §1.3).

### Public surface

`Client` (API):

- Read: `Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`, `ListStorage`,
  `NodeStorage`, `StorageContent`.
- Async mutating (return UPID): `RestoreLXC` (primary create path), `Vzdump`, `Snapshot`,
  `Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`.
- Tasks: `WaitTask`, `TaskStatusOnce`, `TaskLogTail`.
- Errors: `*APIError` (parses the offending privilege from a 403), `*TaskError` (parses it from a
  failed task `exitstatus`).

`Privileged` (fenced root-CLI); each method documents *why it can't be the API*:

- `CreateGoldenLXC`: `pct create` with `keyctl=1` (root@pam-only; the only root-fenced create,
  because the per-customer path provisions by **restore**, which preserves keyctl).
- `MountUSBByUUID`: host mount-by-UUID (not a Proxmox API op).
- `SMART`, `Sensors`: hardware reads (not API-exposed).

### API-vs-root routing table

See the table in [`internal/proxmox/doc.go`](internal/proxmox/doc.go). Summary: the entire guest
lifecycle **including restore** is API-token-covered; OS-root is confined to golden-image `keyctl`
create, host mounts, and SMART/sensors (phase3 §B3).

### TLS trust

The host serves a self-signed cert. Verification is **not** blanket-disabled. Pick one in config:

- `ca_file` (PEM, full verify),
- `fingerprint` (SHA-256 of the host leaf cert, a pinned exact-cert match; the `/nodes` API returns
  each node's `ssl_fingerprint` to pin), or
- the explicitly-named `insecure_skip_verify` (off by default; selftest-against-127.0.0.1 only).

## Provisioning the token (out-of-band, operator side)

The agent only **consumes** a privilege-separated API token; role setup is a provisioning step.
The role must be granted on **both the user AND the token** for the same path, or the
intersection is empty and every call 403s (phase1-2 §1.2):

```bash
pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU \
VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot \
VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace \
Datastore.Audit Sys.Audit SDN.Use"   # 16 privileges, validated Phase 3 B3
pveum user add felhom-agent@pve
pveum user token add felhom-agent@pve agent --privsep 1   # capture the secret (shown once)
pveum acl modify / -user 'felhom-agent@pve' -role FelhomAgent
pveum acl modify / -token 'felhom-agent@pve!agent' -role FelhomAgent
```

(`VM.Config.CPUMemory` is **not** a real privilege; `SDN.Use` **is** required for bridge use.)

## Run

```bash
go build ./...                                # compile-check every package
go build -o felhom-agent ./cmd/felhom-agent   # write the ./felhom-agent binary used below
# read-only health check against the host:
./felhom-agent --config configs/agent.example.json --selftest
# or via env (keeps the secret off disk):
FELHOM_AGENT_PROXMOX_TOKEN='felhom-agent@pve!agent=SECRET' \
FELHOM_AGENT_PROXMOX_NODE=demo-felhom \
FELHOM_AGENT_PROXMOX_ENDPOINT=https://192.168.0.162:8006 \
FELHOM_AGENT_PROXMOX_TLS_FINGERPRINT='BA:7C:...:CF' \
./felhom-agent --selftest
```

`--selftest` (read-only) loads config, builds the API client, and runs the read queries (version,
nodes, node status, guests, storage), printing a short health report. It mutates nothing and says
so cleanly if the token/endpoint isn't configured.

`--selftest=task --vmid N` (explicitly gated) exercises `WaitTask` on a **reversible** op
(snapshot → rollback → delete-snapshot) against guest `N`. Default `--selftest` never mutates.

## Process model (proposed, not finalized; see 03 §3/§12)

Native Go binary, systemd service, **non-root** service user holding the scoped token, with a
**narrow sudoers allowlist** for the three fenced ops. `privileged.mode: "sudo"` matches this;
`"direct"` is for dev/CI where the agent is already root.

## Test

```bash
go vet ./... && go test ./...
```

Unit tests use a mock HTTP transport + mock runner (no live host): UPID parse, `WaitTask`
(running→OK / running→failed-403 / timeout / ctx-cancel), 403→privilege-named error, response
decoding against the captured live shapes, and the API-vs-root routing fence.