feat(agent): scaffold + proxmox interaction layer (slice 1)
Stand up the felhom-agent project (module gitea.dooplex.hu/admin/felhom-agent,
binary felhom-agent) and the internal/proxmox package: the typed library every
other agent module calls to talk to Proxmox.

- API-first Client (hand-rolled REST over net/http, PVEAPIToken auth) with
  typed read ops (version/nodes/status/lxc/config/storage) and async mutating
  ops (restore/vzdump/snapshot/rollback/delete-snapshot/setconfig/start/stop),
  each returning a UPID. WaitTask polls task status until stopped and asserts
  exitstatus OK (authz can surface at task exec, not the POST; phase1-2 §1.3).
- Fenced Privileged (root-CLI) backend for the THREE proven exceptions only
  (keyctl pct create, USB mount/fstab, SMART/sensors); each cites why it can't
  be the API. Fence is structural (Client never shells out, Privileged never
  HTTPs) and asserted in routing_test.go.
- TLS: SHA-256 leaf-cert pinning or CA file; insecure mode explicit + off by
  default. No blanket verification disable.
- 403 -> privilege-named APIError; failed task -> privilege-named TaskError.
- JSON config + env overrides (token never logged); slog logging.
- cmd/felhom-agent --selftest (read-only health report) + gated
  --selftest=task (reversible snapshot/rollback/delete exercise of WaitTask).
  No daemon loop yet.
- Types grounded in the spike findings and exact JSON shapes captured live
  from demo-felhom (PVE 9.2.2). Unit tests use a mock transport + runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,126 @@
# felhom-agent

The **host agent** for the Felhom platform: the operator-tier component that runs on each
Proxmox host and owns *all* Proxmox interaction (provision/restore guests, host storage,
backups, host+tunnel monitoring, hub control loop, per-guest local API). Design:
[`felhom.eu/documentation/architecture/03-host-agent.md`](https://gitea.dooplex.hu/admin/felhom.eu/raw/branch/main/documentation/architecture/03-host-agent.md).

> **Status: slice 1 of N.** This repo currently contains the project scaffold and the
> **`internal/proxmox`** interaction layer (the typed library every other module will call to
> talk to Proxmox), plus a runnable read-only `--selftest`. **No** reconcile loop, hub client,
> signing, or storage/backup orchestration yet; those are later slices.

Module: `gitea.dooplex.hu/admin/felhom-agent` · binary: `felhom-agent` · Go 1.24.

## Layout

```
cmd/felhom-agent/     # entry point + --selftest (wiring only; no daemon loop yet)
internal/proxmox/     # the Proxmox interaction layer (API-first + fenced root-CLI)
internal/config/      # JSON config + env overrides (secrets never logged)
internal/log/         # slog setup
configs/agent.example.json
```

## The `proxmox` package: model

Two backends, one fixed routing policy (the fence is structural: `Client` never shells out,
`Privileged` never makes an HTTP call; asserted in `routing_test.go`):

| | Backend | Used for |
|---|---|---|
| **API (default)** | `proxmox.Client` | everything the scoped **FelhomAgent** token can do |
| **root-CLI (fenced)** | `proxmox.Privileged` | the **three** proven OS-root exceptions only |

Grounded entirely in the spike findings (`felhom.eu/documentation/proxmox-platform.md`,
`tests/phase{0,1-2,3}-findings.md`). Every mutating API op is **async**: it returns a UPID and
the caller `WaitTask`s until the task stops, then asserts `exitstatus == "OK"`, because
authorization can surface at task execution, not the HTTP POST (phase1-2 §1.3).
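The polling contract described above can be sketched as follows. This is a minimal illustration, not the package's actual code: `taskStatus`, `statusFunc`, `waitTask`, `fakeTask`, and `runFake` are hypothetical names, and the status lookup is abstracted behind a function so the loop can run without a host.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// taskStatus holds the two task fields the wait loop needs.
// Hypothetical type for this sketch; the package's real types may differ.
type taskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // "OK" on success, otherwise an error string
}

// statusFunc fetches the current status of one task (hypothetical seam for testing).
type statusFunc func(ctx context.Context, upid string) (taskStatus, error)

// waitTask polls until the task reports "stopped", then requires exitstatus "OK".
func waitTask(ctx context.Context, upid string, every time.Duration, get statusFunc) error {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		st, err := get(ctx, upid)
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				// A POST that returned a UPID can still fail here, e.g. on authorization.
				return fmt.Errorf("task %s failed: %s", upid, st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // timeout or cancellation
		case <-ticker.C:
		}
	}
}

// fakeTask reports "running" n times, then "stopped" with the given exit status.
func fakeTask(n int, exit string) statusFunc {
	return func(context.Context, string) (taskStatus, error) {
		if n > 0 {
			n--
			return taskStatus{Status: "running"}, nil
		}
		return taskStatus{Status: "stopped", ExitStatus: exit}, nil
	}
}

// runFake drives waitTask against a fake task with a one-second deadline.
func runFake(n int, exit string) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitTask(ctx, "UPID:placeholder", time.Millisecond, fakeTask(n, exit))
}

func main() {
	fmt.Println(runFake(2, "OK"))                // <nil>
	fmt.Println(runFake(1, "permission denied")) // task UPID:placeholder failed: permission denied
}
```

Passing a `context.Context` lets the caller bound the wait with a timeout or cancel it, which lines up with the timeout and ctx-cancel cases the unit tests are said to cover.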
### Public surface

`Client` (API):

- Read: `Version`, `Nodes`, `NodeStatus`, `ListLXC`, `GuestStatus`, `GuestConfig`,
  `ListStorage`, `NodeStorage`, `StorageContent`.
- Async mutating (return UPID): `RestoreLXC` (primary create path), `Vzdump`, `Snapshot`,
  `Rollback`, `DeleteSnapshot`, `SetConfig`, `Start`, `Stop`.
- Tasks: `WaitTask`, `TaskStatusOnce`, `TaskLogTail`.
- Errors: `*APIError` (parses the offending privilege from a 403), `*TaskError` (parses it from
  a failed task `exitstatus`).

`Privileged` (fenced root-CLI); each method documents *why it can't be the API*:

- `CreateGoldenLXC`: `pct create` with `keyctl=1` (root@pam-only; the only root-fenced create,
  since the per-customer path provisions by **restore**, which preserves keyctl).
- `MountUSBByUUID`: host mount-by-UUID (not a Proxmox API op).
- `SMART`, `Sensors`: hardware reads (not API-exposed).
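To illustrate the privilege-named error listed under `Client`, here is a minimal sketch of pulling the offending privilege out of a 403 message. It assumes the message looks like `Permission check failed (<path>, <privilege>)`; that shape, the regular expression, and the `apiError` type are assumptions for illustration, not the package's real `*APIError`.

```go
package main

import (
	"fmt"
	"regexp"
)

// permRe matches "Permission check failed (<path>, <privilege>)".
// The message shape is an assumption for this sketch.
var permRe = regexp.MustCompile(`Permission check failed \(([^,]+),\s*([^)]+)\)`)

// apiError is a hypothetical stand-in for the package's *APIError.
type apiError struct {
	Status    int
	Message   string
	Path      string // ACL path the check ran against, if parsed
	Privilege string // missing privilege, if parsed
}

func (e *apiError) Error() string {
	if e.Privilege != "" {
		return fmt.Sprintf("proxmox: HTTP %d: missing %s on %s", e.Status, e.Privilege, e.Path)
	}
	return fmt.Sprintf("proxmox: HTTP %d: %s", e.Status, e.Message)
}

// newAPIError fills Path and Privilege only for a 403 whose message matches permRe.
func newAPIError(status int, msg string) *apiError {
	e := &apiError{Status: status, Message: msg}
	if status == 403 {
		if m := permRe.FindStringSubmatch(msg); m != nil {
			e.Path, e.Privilege = m[1], m[2]
		}
	}
	return e
}

func main() {
	fmt.Println(newAPIError(403, "Permission check failed (/vms/100, VM.Snapshot)"))
	fmt.Println(newAPIError(500, "internal error"))
}
```

Keeping the raw message alongside the parsed fields means an unexpected message format degrades to a plain error rather than losing information.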
### API-vs-root routing table

See the table in [`internal/proxmox/doc.go`](internal/proxmox/doc.go). Summary: the entire guest
lifecycle **including restore** is API-token-covered; OS-root is confined to golden-image
`keyctl` create, host mounts, and SMART/sensors (phase3 §B3).

### TLS trust

The host serves a self-signed cert. Verification is **not** blanket-disabled. Pick one in
config: `ca_file` (PEM, full verify), `fingerprint` (SHA-256 of the host leaf cert, pinned as an
exact-cert match; the `/nodes` API returns each node's `ssl_fingerprint` to pin), or the
explicitly-named `insecure_skip_verify` (off by default; selftest-against-127.0.0.1 only).
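The `fingerprint` option can be pictured with the sketch below, which uses the common Go idiom of replacing chain verification with an exact leaf-certificate check in `VerifyConnection`. Whether the package is built this way is an assumption; `normalize`, `fingerprint`, `checkPin`, and `pinnedTLS` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// normalize lowercases a fingerprint and drops colons, so "BA:7C" equals "ba7c".
func normalize(fp string) string {
	return strings.ToLower(strings.ReplaceAll(fp, ":", ""))
}

// fingerprint returns the hex SHA-256 of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// checkPin compares the leaf certificate's SHA-256 with the configured pin.
func checkPin(want string, leafDER []byte) error {
	if got := fingerprint(leafDER); got != normalize(want) {
		return fmt.Errorf("leaf fingerprint mismatch: got %s", got)
	}
	return nil
}

// pinnedTLS returns a tls.Config that accepts exactly one leaf certificate.
func pinnedTLS(want string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// A self-signed cert would not pass chain verification against the
		// system roots, so that check is replaced by the pin check below.
		InsecureSkipVerify: true,
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("no peer certificate")
			}
			return checkPin(want, cs.PeerCertificates[0].Raw)
		},
	}
}

func main() {
	der := []byte("stand-in for DER bytes") // placeholder, not a real certificate
	pin := strings.ToUpper(fingerprint(der))
	fmt.Println(checkPin(pin, der) == nil)
	fmt.Println(checkPin(pin, []byte("other")) == nil)
	_ = pinnedTLS(pin) // would be assigned to http.Transport.TLSClientConfig
}
```

`VerifyConnection` is used here rather than `VerifyPeerCertificate` because it also runs on resumed sessions, so the pin is checked on every connection.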
## Provisioning the token (out-of-band, operator side)

The agent only **consumes** a privilege-separated API token; role setup is a provisioning step.
The role must be granted on **both the user AND the token** for the same path, or the
intersection is empty and every call 403s (phase1-2 §1.2):

```bash
pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU \
VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot \
VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace \
Datastore.Audit Sys.Audit SDN.Use" # 16 privileges, validated Phase 3 B3
pveum user add felhom-agent@pve
pveum user token add felhom-agent@pve agent --privsep 1 # capture the secret (shown once)
pveum acl modify / -user 'felhom-agent@pve' -role FelhomAgent
pveum acl modify / -token 'felhom-agent@pve!agent' -role FelhomAgent
```

(`VM.Config.CPUMemory` is **not** a real privilege; `SDN.Use` **is** required for bridge use.)
## Run
|
||||
|
||||
```bash
|
||||
go build ./...
|
||||
# read-only health check against the host:
|
||||
./felhom-agent --config configs/agent.example.json --selftest
|
||||
# or via env (keeps the secret off disk):
|
||||
FELHOM_AGENT_PROXMOX_TOKEN='felhom-agent@pve!agent=SECRET' \
|
||||
FELHOM_AGENT_PROXMOX_NODE=demo-felhom \
|
||||
FELHOM_AGENT_PROXMOX_ENDPOINT=https://192.168.0.162:8006 \
|
||||
FELHOM_AGENT_PROXMOX_TLS_FINGERPRINT='BA:7C:...:CF' \
|
||||
./felhom-agent --selftest
|
||||
```
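The env-based invocation above passes the token as `user@realm!tokenid=secret`. A minimal sketch of how such overrides might be applied and turned into the `PVEAPIToken` authorization header follows. The struct, its JSON field names, and the helper functions are hypothetical; only the environment variable names and the token format come from this README.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// proxmoxConfig is a hypothetical subset of the agent config.
type proxmoxConfig struct {
	Endpoint string `json:"endpoint"`
	Node     string `json:"node"`
	Token    string `json:"token"` // "user@realm!tokenid=secret"
}

// applyEnv overrides file values with any non-empty environment variable.
func applyEnv(c *proxmoxConfig, getenv func(string) string) {
	for name, dst := range map[string]*string{
		"FELHOM_AGENT_PROXMOX_ENDPOINT": &c.Endpoint,
		"FELHOM_AGENT_PROXMOX_NODE":     &c.Node,
		"FELHOM_AGENT_PROXMOX_TOKEN":    &c.Token,
	} {
		if v := getenv(name); v != "" {
			*dst = v
		}
	}
}

// redact keeps the token id and hides the secret, for safe logging.
func redact(token string) string {
	id, _, _ := strings.Cut(token, "=")
	return id + "=***"
}

// newRequest builds an authenticated GET for an API path such as "/version".
func newRequest(c proxmoxConfig, path string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, c.Endpoint+"/api2/json"+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "PVEAPIToken="+c.Token)
	return req, nil
}

// demo applies fixed overrides and returns the header and the log-safe token.
func demo() (header, logged string) {
	cfg := proxmoxConfig{Endpoint: "https://127.0.0.1:8006"}
	env := map[string]string{"FELHOM_AGENT_PROXMOX_TOKEN": "felhom-agent@pve!agent=SECRET"}
	applyEnv(&cfg, func(k string) string { return env[k] })
	req, err := newRequest(cfg, "/version")
	if err != nil {
		panic(err)
	}
	return req.Header.Get("Authorization"), redact(cfg.Token)
}

func main() {
	var cfg proxmoxConfig
	applyEnv(&cfg, os.Getenv) // overrides from the real environment
	fmt.Println("token:", redact(cfg.Token))
	fmt.Println(demo())
}
```

Taking `getenv` as a parameter keeps the override logic testable without touching the process environment, and routing every log line through `redact` is one way to honor the "token never logged" rule.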
`--selftest` (read-only) loads config, builds the API client, and runs the read queries (version,
nodes, node status, guests, storage), printing a short health report. It mutates nothing, and it
reports cleanly when the token or endpoint isn't configured.

`--selftest=task --vmid N` (explicitly gated) exercises `WaitTask` on a **reversible** op
(snapshot → rollback → delete-snapshot) against guest `N`. Default `--selftest` never mutates.
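The reversible sequence behind `--selftest=task` might look roughly like the sketch below. The method names come from the public surface listed in this README, but their signatures, the `taskClient` interface, and the choice to always attempt the delete step are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// taskClient is the slice of the API this exercise needs.
// Signatures are assumptions; the README lists only the method names.
type taskClient interface {
	Snapshot(ctx context.Context, vmid int, name string) (upid string, err error)
	Rollback(ctx context.Context, vmid int, name string) (upid string, err error)
	DeleteSnapshot(ctx context.Context, vmid int, name string) (upid string, err error)
	WaitTask(ctx context.Context, upid string) error
}

// exerciseTask runs snapshot, rollback, delete-snapshot, waiting on each UPID.
func exerciseTask(ctx context.Context, c taskClient, vmid int, snap string) error {
	// step waits for the task only if the POST itself succeeded.
	step := func(upid string, err error) error {
		if err != nil {
			return err
		}
		return c.WaitTask(ctx, upid)
	}
	if err := step(c.Snapshot(ctx, vmid, snap)); err != nil {
		return fmt.Errorf("snapshot: %w", err)
	}
	rbErr := step(c.Rollback(ctx, vmid, snap))
	// Always try to remove the snapshot so the guest is left as it was found.
	delErr := step(c.DeleteSnapshot(ctx, vmid, snap))
	if rbErr != nil {
		return fmt.Errorf("rollback: %w", rbErr)
	}
	if delErr != nil {
		return fmt.Errorf("delete-snapshot: %w", delErr)
	}
	return nil
}

// fakeClient records the order of calls instead of talking to a host.
type fakeClient struct{ log []string }

func (f *fakeClient) op(name string) (string, error) {
	f.log = append(f.log, name)
	return "UPID:placeholder", nil
}

func (f *fakeClient) Snapshot(context.Context, int, string) (string, error) {
	return f.op("snapshot")
}

func (f *fakeClient) Rollback(context.Context, int, string) (string, error) {
	return f.op("rollback")
}

func (f *fakeClient) DeleteSnapshot(context.Context, int, string) (string, error) {
	return f.op("delete-snapshot")
}

func (f *fakeClient) WaitTask(context.Context, string) error {
	f.log = append(f.log, "wait")
	return nil
}

// demoSteps returns the recorded call order as a comma-separated string.
func demoSteps() string {
	f := &fakeClient{}
	if err := exerciseTask(context.Background(), f, 100, "selftest"); err != nil {
		panic(err)
	}
	return strings.Join(f.log, ",")
}

func main() {
	fmt.Println(demoSteps()) // snapshot,wait,rollback,wait,delete-snapshot,wait
}
```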
## Process model (proposed, not finalized; see 03 §3/§12)

Native Go binary, systemd service, **non-root** service user holding the scoped token, with a
**narrow sudoers allowlist** for the three fenced ops. `privileged.mode: "sudo"` matches this;
`"direct"` is for dev/CI where the agent is already root.
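As a rough picture of how the two `privileged.mode` values could differ, the sketch below builds the command line for one fenced op in each mode. Everything here is hypothetical: the `runner` interface, the `privileged` type, and the `smartctl -a` invocation are illustrative stand-ins, since the README does not show the actual commands.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runner executes one host command (hypothetical seam; tests swap in a fake).
type runner interface {
	Run(name string, args ...string) (string, error)
}

// execRunner runs the command for real.
type execRunner struct{}

func (execRunner) Run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

// recorder captures the argv instead of executing it.
type recorder struct{ calls []string }

func (r *recorder) Run(name string, args ...string) (string, error) {
	r.calls = append(r.calls, strings.Join(append([]string{name}, args...), " "))
	return "", nil
}

// privileged is a hypothetical stand-in for the fenced root-CLI backend.
type privileged struct {
	mode string // "sudo" or "direct"
	run  runner
}

// invoke prefixes the command with sudo when the agent is not already root.
func (p privileged) invoke(name string, args ...string) (string, error) {
	switch p.mode {
	case "sudo":
		// -n makes sudo fail rather than prompt for a password.
		return p.run.Run("sudo", append([]string{"-n", name}, args...)...)
	case "direct":
		return p.run.Run(name, args...)
	default:
		return "", fmt.Errorf("unknown privileged.mode %q", p.mode)
	}
}

// SMART reads drive health; the exact command line is an assumption.
func (p privileged) SMART(device string) (string, error) {
	return p.invoke("smartctl", "-a", device)
}

// demoArgv returns the command line the given mode would run.
func demoArgv(mode string) (string, error) {
	rec := &recorder{}
	if _, err := (privileged{mode: mode, run: rec}).SMART("/dev/sda"); err != nil {
		return "", err
	}
	return rec.calls[0], nil
}

func main() {
	fmt.Println(demoArgv("sudo"))   // sudo -n smartctl -a /dev/sda <nil>
	fmt.Println(demoArgv("direct")) // smartctl -a /dev/sda <nil>
	_ = execRunner{}                // the real runner, unused in this demo
}
```

A fixed argv per method is what makes a narrow sudoers allowlist practical: the allowlist only has to name the handful of command lines the fenced backend can produce.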
## Test

```bash
go vet ./... && go test ./...
```

Unit tests use a mock HTTP transport + mock runner (no live host): UPID parse, `WaitTask`
(running→OK / running→failed-403 / timeout / ctx-cancel), 403→privilege-named error, response
decoding against the captured live shapes, and the API-vs-root routing fence.
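A mock HTTP transport of the kind mentioned above can be as small as a function that implements `http.RoundTripper`. The sketch below is illustrative only: `roundTripFunc`, `cannedJSON`, `fetchVersion`, and `demoVersion` are hypothetical names, and the canned body is a simplified example rather than one of the captured live shapes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// cannedJSON answers every request with the given status and body.
func cannedJSON(status int, body string) http.RoundTripper {
	return roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: status,
			Header:     http.Header{"Content-Type": []string{"application/json"}},
			Body:       io.NopCloser(strings.NewReader(body)),
			Request:    r,
		}, nil
	})
}

// fetchVersion decodes data.version from the API's {"data": ...} envelope.
func fetchVersion(hc *http.Client, base string) (string, error) {
	resp, err := hc.Get(base + "/api2/json/version")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var env struct {
		Data struct {
			Version string `json:"version"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&env); err != nil {
		return "", err
	}
	return env.Data.Version, nil
}

// demoVersion runs fetchVersion against a canned response; no network is used.
func demoVersion(status int, body string) (string, error) {
	hc := &http.Client{Transport: cannedJSON(status, body)}
	return fetchVersion(hc, "https://pve.invalid:8006")
}

func main() {
	fmt.Println(demoVersion(200, `{"data":{"version":"9.2.2"}}`))
	fmt.Println(demoVersion(403, `{"data":null}`))
}
```

Because the transport is swapped at the `http.Client` level, the code under test keeps its real request-building and decoding paths; only the network hop is replaced.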