felhom-agent/docs/Proxmox_Spike_-_API_&_Access-Control_Reference.md
2026-06-07 20:20:52 +02:00


Proxmox Spike — API & Access-Control Reference

Reference for the controller-as-guest architecture, synthesized from current Proxmox VE 9.x documentation (June 2026).

Items marked [confirm on box] should be verified once PVE is installed; treat them as Phase 0/1 verification steps, not as settled fact. The Proxmox CLI tools (pvesh, pveum, qm, pct, vzdump) are largely front ends to the same API that is exposed over REST, so nearly everything below is reachable from Go.


1. API fundamentals

  • Base URL: https://192.168.0.162:8006/api2/json
  • Auth (API token): HTTP header Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET. The secret is shown only once, at creation; capture it immediately, because it cannot be retrieved later.
  • Response shape: { "data": ... }; errors come back via HTTP status + body.
  • Discovery (do this live on the box instead of trusting any doc):
    • pvesh get /version
    • pvesh ls /nodes/<node>/qemu/<vmid>
    • Full schema browser: https://pve.proxmox.com/pve-docs/api-viewer/
    • "What call does the GUI make?" → perform the action in the web UI with browser DevTools → Network open and read the request. Fastest way to find the exact endpoint + params for anything.
  • Async tasks: long operations (backup, restore, clone) return a UPID (task ID), not a result. Poll GET /nodes/<node>/tasks/<upid>/status until status is stopped, then check exitstatus (expected to be OK on success). The controller must poll rather than block on the original request. [confirm on box] the exact polling/response shape.
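
The token header, the { "data": ... } envelope, and the polling loop described above could be sketched in Go roughly as follows. This is a minimal sketch, not the controller's actual client: the names tokenAuthHeader, decodeData, waitTask, and demoPoll are hypothetical, the fetch callback stands in for the real HTTPS call, and treating exitstatus "OK" as the only success value is an assumption to confirm on the box.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// tokenAuthHeader formats the Authorization header value for an API token.
func tokenAuthHeader(userAtRealm, tokenID, secret string) string {
	return fmt.Sprintf("PVEAPIToken=%s!%s=%s", userAtRealm, tokenID, secret)
}

// decodeData unwraps the {"data": ...} envelope into out.
func decodeData(body []byte, out any) error {
	var env struct {
		Data json.RawMessage `json:"data"`
	}
	if err := json.Unmarshal(body, &env); err != nil {
		return fmt.Errorf("decode envelope: %w", err)
	}
	if len(env.Data) == 0 {
		return fmt.Errorf("response has no data field")
	}
	return json.Unmarshal(env.Data, out)
}

// taskStatus holds only the two fields named in this document;
// the full response shape is still [confirm on box].
type taskStatus struct {
	Status     string `json:"status"`
	ExitStatus string `json:"exitstatus"`
}

// waitTask calls fetch until the task reports "stopped" or ctx is done.
// fetch is a stand-in for GET /nodes/<node>/tasks/<upid>/status.
func waitTask(ctx context.Context, fetch func(context.Context) ([]byte, error), every time.Duration) error {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		body, err := fetch(ctx)
		if err != nil {
			return err
		}
		var st taskStatus
		if err := decodeData(body, &st); err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" { // assumption: "OK" marks success
				return fmt.Errorf("task failed: %s", st.ExitStatus)
			}
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll runs waitTask against canned responses and returns the poll count.
func demoPoll() (int, error) {
	calls := 0
	fake := func(context.Context) ([]byte, error) {
		calls++
		if calls < 3 {
			return []byte(`{"data":{"status":"running"}}`), nil
		}
		return []byte(`{"data":{"status":"stopped","exitstatus":"OK"}}`), nil
	}
	err := waitTask(context.Background(), fake, 10*time.Millisecond)
	return calls, err
}

func main() {
	fmt.Println(tokenAuthHeader("felhom-ctl@pve", "ctl", "<SECRET>"))
	fmt.Println(demoPoll()) // 3 <nil>
}
```

Passing a context with a deadline to waitTask bounds how long the controller waits on a stuck task without blocking the rest of the agent.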

2. RBAC model — (path, principal, role)

An ACL entry is a triple of (path, user/group/token, role). A role is a bundle of privileges, assigned at the most specific path possible.

  • Paths: /, /vms/<vmid>, /nodes/<node>, /storage/<store>, /pool/<pool>, /access/...
  • Predefined roles include: PVEAuditor (read-only), PVEVMUser, PVEVMAdmin, PVEDatastoreUser, PVEAdmin, PVEUserAdmin.
  • API tokens with privilege separation (--privsep 1): the token's effective permissions are the intersection of (a) the backing user's permissions and (b) the token's own ACLs. A privsep token can therefore never exceed its user, and you grant it a separate, minimal ACL. This is exactly the property the in-guest controller needs.
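
The intersection rule can be modelled in a few lines of Go. This is a toy model only: it compares privileges granted on one exact path and ignores role expansion, propagation, and inheritance, and the acl type, the effective function, and VMID 101 are illustrative assumptions rather than anything PVE exposes.

```go
package main

import (
	"fmt"
	"sort"
)

// acl maps a path to the set of privileges granted on that exact path.
type acl map[string]map[string]bool

// effective returns the privileges present in BOTH the user's and the
// token's grants for path, which is the privsep rule described above.
func effective(user, token acl, path string) []string {
	out := []string{}
	for priv := range token[path] {
		if user[path][priv] {
			out = append(out, priv)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	user := acl{"/vms/101": {"VM.Audit": true, "VM.Backup": true, "VM.Snapshot": true}}
	token := acl{"/vms/101": {"VM.Audit": true, "VM.Backup": true, "VM.PowerMgmt": true}}

	// The token asked for VM.PowerMgmt, but the user lacks it, so it is dropped.
	fmt.Println(effective(user, token, "/vms/101")) // [VM.Audit VM.Backup]

	// A backing user with no grants leaves the token with nothing.
	fmt.Println(effective(acl{}, token, "/vms/101")) // []
}
```

The second call is the case to watch in practice: the backing user needs its own ACL entries on the same paths, or the token's grants intersect to nothing.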

Introspection:

pveum role list
pvesh get /access/roles/PVEVMAdmin
pveum user permissions <user> --path /vms/<vmid>
pveum user token permissions <user> <tokenid> --path /vms/<vmid>

3. Two-tier privilege model (our architecture decision)

Tier A — in-guest controller (customer-facing, NARROW). Runs inside the customer's guest. Token scoped to that guest's own VMID only: read its own status/config, snapshot itself, back itself up, write the backup to the datastore. Cannot see or touch other guests. The LXC/VM's own privilege level is irrelevant here — reaching host:8006 is just an HTTPS call + token.

Tier B — operator (provisioning, BROAD). Creates/destroys guests, builds the golden template, attaches storage, wires PBS. Lives operator-side (hub / tooling), never on the customer box.

Phase 1 runbook — minimal self-backup role + scoped token

# 1. Custom least-privilege role: "back up / snapshot myself"
#    [confirm on box: exact privilege names via `pveum role list` / api-viewer]
pveum role add FelhomSelfBackup \
  -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"

# 2. Dedicated API-only user in the PVE realm (no login password)
pveum user add felhom-ctl@pve --comment "In-guest controller (self-backup)"

# 3. Privsep token for that user (SECRET shown once)
pveum user token add felhom-ctl@pve ctl --privsep 1

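# 4. Scope BOTH the user and the token to one guest + the backup datastore only.
#    A privsep token gets the intersection of user and token ACLs, so a user
#    with no ACLs of its own would leave the token with no permissions.
pveum acl modify /vms/<vmid>      -user  'felhom-ctl@pve'     -role FelhomSelfBackup
pveum acl modify /storage/<store> -user  'felhom-ctl@pve'     -role FelhomSelfBackup
pveum acl modify /vms/<vmid>      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
pveum acl modify /storage/<store> -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup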

# 5. Test FROM INSIDE the guest
#    (single quotes stop an interactive shell from history-expanding the "!";
#     -k skips TLS verification and is acceptable for the spike only)
curl -k https://<host>:8006/api2/json/version \
  -H 'Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>'

curl -k -X POST https://<host>:8006/api2/json/nodes/<node>/vzdump \
  -H 'Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>' \
  -d 'vmid=<vmid>&storage=<store>&mode=snapshot'

Pass criteria: the token backs up its OWN vmid, and returns 403 on any other vmid. That single result validates the whole controller-as-guest design.
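
Once the controller makes these calls from Go, the pass criteria can be encoded as a check over the two HTTP status codes. This is a minimal sketch: isolationVerdict is a hypothetical helper, and it assumes a successful backup request answers with a 2xx status.

```go
package main

import (
	"fmt"
	"net/http"
)

// isolationVerdict applies the Phase 1 pass criteria to two observed
// HTTP status codes: the vzdump call against the token's own vmid and
// the same call against some other vmid.
func isolationVerdict(ownStatus, otherStatus int) error {
	if ownStatus < 200 || ownStatus > 299 {
		return fmt.Errorf("own vmid: want 2xx, got %d", ownStatus)
	}
	if otherStatus != http.StatusForbidden {
		return fmt.Errorf("other vmid: want 403, got %d", otherStatus)
	}
	return nil
}

func main() {
	fmt.Println(isolationVerdict(200, 403)) // <nil>
	fmt.Println(isolationVerdict(200, 200)) // other vmid: want 403, got 200
}
```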

Open question to settle here: does Tier A also need VM.PowerMgmt so it can stop/start its own guest for stop-mode backups? Likely yes — add it and re-test.


4. Backup / restore (vzdump)

Modes:

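  • stop: orderly guest shutdown, then backup. For a VM the guest is restarted as soon as the backup has started and the copy continues in the background, so downtime is short and well defined; for an LXC the container stays down for the whole backup. Highest consistency.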
  • snapshot — lowest downtime; copies blocks while running. Small inconsistency risk unless the guest cooperates (see below).
  • suspend — legacy/compat, longer downtime, not recommended.

App-consistency — the concrete version of the earlier warning:

  • VM: install qemu-guest-agent in the guest and set agent: 1. snapshot-mode vzdump then calls guest-fsfreeze-freeze / -thaw around the copy → near-free filesystem consistency. This is a real point in the VM's favour over LXC.
  • LXC: no guest agent → no fsfreeze. App-consistency becomes the controller's job: quiesce in-guest first (stop stacks / flush DBs) then vzdump, or use stop mode. Same lesson as the restic work, moved to the guest layer.
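
One way to turn these rules into controller policy is a small decision function. This is a sketch of one possible reading of the guidance above, not a Proxmox feature: guestKind, backupPlan, and planBackup are hypothetical names, and canQuiesce stands for "the controller knows how to stop stacks / flush DBs in this guest".

```go
package main

import "fmt"

type guestKind string

const (
	kindVM  guestKind = "qemu"
	kindLXC guestKind = "lxc"
)

// backupPlan is what the controller would do for one backup run.
type backupPlan struct {
	Mode           string // vzdump mode: "snapshot" or "stop"
	QuiesceInGuest bool   // controller must quiesce apps before calling vzdump
}

// planBackup picks a mode from the guest type and available cooperation.
func planBackup(kind guestKind, agentEnabled, canQuiesce bool) backupPlan {
	switch {
	case kind == kindVM && agentEnabled:
		// The guest agent freezes filesystems around the copy.
		return backupPlan{Mode: "snapshot"}
	case canQuiesce:
		// No fsfreeze available: quiesce in-guest first, then snapshot.
		return backupPlan{Mode: "snapshot", QuiesceInGuest: true}
	default:
		// Nothing cooperates: fall back to stop mode for consistency.
		return backupPlan{Mode: "stop"}
	}
}

func main() {
	fmt.Printf("%+v\n", planBackup(kindVM, true, false))
	fmt.Printf("%+v\n", planBackup(kindLXC, false, true))
	fmt.Printf("%+v\n", planBackup(kindLXC, false, false))
}
```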

CLI / API:

vzdump <vmid> --mode snapshot --storage <store>                 # CLI
# API (async → UPID):
POST /api2/json/nodes/<node>/vzdump        params: vmid, storage, mode, ...
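
From Go, the same POST could be built with the standard library along these lines. This is a sketch under assumptions: vzdumpForm and newVzdumpRequest are hypothetical helpers, the host pve.example, node pve1, VMID 101, and storage name backups are placeholders, and only the three parameters named above are sent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// vzdumpForm encodes the three parameters named above as a form body.
func vzdumpForm(vmid int, storage, mode string) string {
	form := url.Values{}
	form.Set("vmid", strconv.Itoa(vmid))
	form.Set("storage", storage)
	form.Set("mode", mode)
	return form.Encode() // keys are emitted in sorted order
}

// newVzdumpRequest builds (but does not send) the backup request.
func newVzdumpRequest(baseURL, node, authHeader string, vmid int, storage, mode string) (*http.Request, error) {
	endpoint := strings.TrimRight(baseURL, "/") + "/nodes/" + url.PathEscape(node) + "/vzdump"
	body := strings.NewReader(vzdumpForm(vmid, storage, mode))
	req, err := http.NewRequest(http.MethodPost, endpoint, body)
	if err != nil {
		return nil, fmt.Errorf("build vzdump request: %w", err)
	}
	req.Header.Set("Authorization", authHeader)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	req, err := newVzdumpRequest("https://pve.example:8006/api2/json", "pve1",
		"PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>", 101, "backups", "snapshot")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)               // POST /api2/json/nodes/pve1/vzdump
	fmt.Println(vzdumpForm(101, "backups", "snapshot")) // mode=snapshot&storage=backups&vmid=101
}
```

The response to this request carries the UPID in its data field, which is then polled as described in section 1.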

Restore is NOT a single "restore" call — you recreate the guest from the archive:

  • VM: qmrestore <archive> <newvmid> / POST /nodes/<node>/qemu with archive=...
  • LXC: pct restore <newvmid> <archive> / POST /nodes/<node>/lxc with the archive passed as ostemplate plus restore=1 [confirm on box]
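
Because restore is a guest-creation call, the VM and LXC variants differ only in path and parameter names. The sketch below builds both; restoreCall and buildRestore are hypothetical, the node name and new VMID are placeholders, and the LXC parameter names (ostemplate, restore) are an assumption to verify in the api-viewer before relying on them.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// restoreCall describes one POST that recreates a guest from an archive.
type restoreCall struct {
	Path   string
	Params url.Values
}

// buildRestore returns the create-guest call for kind "qemu" or "lxc".
func buildRestore(kind, node string, newVMID int, archive string) (restoreCall, error) {
	params := url.Values{}
	params.Set("vmid", strconv.Itoa(newVMID))
	switch kind {
	case "qemu":
		params.Set("archive", archive)
	case "lxc":
		// Assumed parameter names; [confirm on box].
		params.Set("ostemplate", archive)
		params.Set("restore", "1")
	default:
		return restoreCall{}, fmt.Errorf("unknown guest kind %q", kind)
	}
	path := "/nodes/" + url.PathEscape(node) + "/" + kind
	return restoreCall{Path: path, Params: params}, nil
}

func main() {
	for _, kind := range []string{"qemu", "lxc"} {
		call, err := buildRestore(kind, "pve1", 9001, "<archive>")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("POST", call.Path, call.Params.Encode())
	}
}
```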

Phase 2's real-restore test = restore to a fresh vmid and boot it. Do not declare the backup "working" until a restored guest actually runs.


5. Key REST endpoints (qemu shown; lxc is parallel under /lxc)

GET  /nodes
GET  /nodes/<node>/qemu                          list VMs
GET  /nodes/<node>/qemu/<vmid>/status/current    live status
GET  /nodes/<node>/qemu/<vmid>/config            config
POST /nodes/<node>/qemu/<vmid>/status/{start,stop,shutdown,reboot}
POST /nodes/<node>/qemu/<vmid>/snapshot          (snapname, description)
GET  /nodes/<node>/qemu/<vmid>/snapshot          list snapshots
POST /nodes/<node>/qemu/<vmid>/snapshot/<snap>/rollback
POST /nodes/<node>/vzdump                         backup (async, UPID)
GET  /nodes/<node>/tasks/<upid>/status            poll async task

LXC: replace /qemu/ with /lxc/. For Docker-in-LXC the container needs features nesting=1,keyctl=1 (pct set <vmid> -features nesting=1,keyctl=1, or the features property on POST /nodes/<node>/lxc) — [confirm on box].
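
Since the two endpoint trees are parallel, a Go client can keep a single path builder and pass the guest type as data. A minimal sketch, with guestPath as a hypothetical helper and pve1 / 101 as placeholder values:

```go
package main

import (
	"fmt"
	"net/url"
)

// guestPath builds a per-guest API path for kind "qemu" or "lxc".
// suffix is the part after the vmid, e.g. "status/current" or "snapshot".
func guestPath(kind, node string, vmid int, suffix string) (string, error) {
	if kind != "qemu" && kind != "lxc" {
		return "", fmt.Errorf("unknown guest kind %q", kind)
	}
	return fmt.Sprintf("/nodes/%s/%s/%d/%s", url.PathEscape(node), kind, vmid, suffix), nil
}

func main() {
	for _, kind := range []string{"qemu", "lxc"} {
		p, err := guestPath(kind, "pve1", 101, "status/current")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("GET", p)
	}
}
```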


6. Phase 0 confirm-on-box checklist

  • PVE 9.2 installed; storage = LVM-thin (leave free space to also test dir/qcow2)
  • Exact privilege set for FelhomSelfBackup (pveum role list, or pvesh get /access/roles/FelhomSelfBackup)
  • UPID task-polling response shape
  • Docker official apt repo has a trixie channel
  • LXC features nesting=1,keyctl=1 syntax + Docker actually runs inside an LXC
  • Baseline idle + under-load RAM/CPU: one Debian VM vs one Debian LXC, identical resources