Proxmox Platform Reference
Authoritative, living reference for the Proxmox platform underneath proxmox-controller.
It records facts about Proxmox and what we validated about it — not Felhom design
decisions. Where a design choice exists, this doc points to the (future) controller
architecture document rather than making the choice here.
Evidence base (raw, chronological spike logs — kept as the underlying record):
- tests/phase0-findings.md — VM-vs-LXC overhead, Docker-in-LXC viability
- tests/phase1-2-findings.md — privilege model, backup/restore round-trip
- tests/Proxmox_Spike_-API&_Access-Control_Reference.md — superseded pre-spike reference (contains a known privsep error; do not cite as authoritative)
Every nontrivial claim links to its evidence section. Validated on a single host
(demo-felhom, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and
measurements as indicative, not benchmarks.
1. Platform baseline
Validated stack [phase0 §1]:
| Component | Version |
|---|---|
| Proxmox VE (pve-manager) | 9.2.2 (b9984c6d90a4bd80) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel 7.0.2-6-pve |
| pve-qemu-kvm | 11.0.0-3 |
| qemu-server | 9.1.15 |
| pve-container | 6.1.10 |
| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
| criu | 4.1.1-1 |
pvesh get /version → release 9.2. Always confirm the node name on the box
(pvesh get /nodes) rather than hard-coding it.
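A minimal sketch of discovering the node name at runtime instead of hard-coding it, written in Go because the controller is a Go program. It decodes only the `node` and `status` fields of the `{"data": ...}` envelope returned by `GET /nodes`; the helper name `nodeNames` and the sample body are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodesResponse models the {"data": [...]} envelope of GET /nodes.
// Only "node" and "status" are decoded; other fields are ignored.
type nodesResponse struct {
	Data []struct {
		Node   string `json:"node"`
		Status string `json:"status"`
	} `json:"data"`
}

// nodeNames extracts node names from a raw GET /nodes response body.
func nodeNames(body []byte) ([]string, error) {
	var resp nodesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode /nodes: %w", err)
	}
	names := make([]string, 0, len(resp.Data))
	for _, n := range resp.Data {
		names = append(names, n.Node)
	}
	return names, nil
}

func main() {
	// Illustrative body; a real one comes from GET /api2/json/nodes.
	body := []byte(`{"data":[{"node":"demo-felhom","status":"online"}]}`)
	names, err := nodeNames(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [demo-felhom]
}
```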
1.1 Storage backends
Two backends were present and exercised [phase0 §1, phase1-2 §pre-flight]:
| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| local | dir | /var/lib/vz | iso, vztmpl, backup, import | ISOs, CT templates, vzdump archives |
| local-lvm | lvmthin | VG pve, thinpool data | rootdir, images | guest disk volumes |
Why backups cannot live on LVM-thin: LVM-thin is a block backend — it allocates
logical volumes for guest disks. Backup archives and templates are files, which require a
file-level backend (dir, NFS, CIFS, or PBS). A vzdump target must therefore be a
storage whose content types include backup (here, local); pointing vzdump at
local-lvm is not valid. [phase1-2 §pre-flight / §2.1]
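The content-type rule can be checked mechanically before choosing a vzdump target. A sketch, assuming the controller has already fetched storage definitions (for example from `GET /storage`, where `content` is a comma-separated list); the `storage` struct and `backupTargets` helper are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// storage is a hypothetical, trimmed view of one storage definition.
// Content is the comma-separated content-type list, e.g. "iso,vztmpl,backup".
type storage struct {
	ID      string
	Type    string
	Content string
}

// backupTargets returns the IDs of storages whose content types include
// "backup", i.e. the only valid vzdump targets.
func backupTargets(stores []storage) []string {
	var out []string
	for _, s := range stores {
		for _, c := range strings.Split(s.Content, ",") {
			if strings.TrimSpace(c) == "backup" {
				out = append(out, s.ID)
				break
			}
		}
	}
	return out
}

func main() {
	stores := []storage{
		{ID: "local", Type: "dir", Content: "iso,vztmpl,backup,import"},
		{ID: "local-lvm", Type: "lvmthin", Content: "rootdir,images"},
	}
	fmt.Println(backupTargets(stores)) // [local]
}
```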
1.2 Repositories
PVE 9 uses deb822 .sources files under /etc/apt/sources.list.d/. For a host
without a subscription, the enterprise repos (pve-enterprise.sources,
ceph-*-enterprise.sources) must be disabled (they return 401) and a no-subscription repo
enabled. The spike host arrived with the no-subscription repo already configured and the
host updated [phase0 baseline]; the repo setup itself was not a
spike deliverable — the canonical no-subscription .sources is the standard Proxmox 9
procedure (/etc/apt/sources.list.d/pve-no-subscription.sources with
Components: pve-no-subscription). Treat the exact commands as standard setup, not
spike-validated.
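Because the repo setup was not spike-validated, a controller might at least verify it as a preflight check. This sketch parses deb822 `.sources` stanzas and applies the `Enabled` rule from sources.list(5) (default yes; `Enabled: no` disables the stanza). It is deliberately incomplete: comments and continuation lines (such as an inline `Signed-By` key) are skipped, and `parseSources` is a hypothetical helper, not an apt API.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// stanza is one deb822 paragraph: lower-cased field name -> value.
type stanza map[string]string

// enabled applies the sources.list(5) rule: "Enabled" defaults to yes and
// "Enabled: no" turns the stanza off.
func (s stanza) enabled() bool { return !strings.EqualFold(s["enabled"], "no") }

// parseSources splits .sources text into stanzas separated by blank lines.
// Comment lines and continuation lines (leading space or tab) are skipped:
// a sketch, not a full deb822 parser.
func parseSources(text string) []stanza {
	var out []stanza
	cur := stanza{}
	flush := func() {
		if len(cur) > 0 {
			out = append(out, cur)
			cur = stanza{}
		}
	}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.TrimSpace(line) == "":
			flush()
		case strings.HasPrefix(line, "#"), strings.HasPrefix(line, " "), strings.HasPrefix(line, "\t"):
			// comment or continuation line: ignored in this sketch
		default:
			if key, val, ok := strings.Cut(line, ":"); ok {
				cur[strings.ToLower(strings.TrimSpace(key))] = strings.TrimSpace(val)
			}
		}
	}
	flush()
	return out
}

func main() {
	// Illustrative pve-enterprise.sources after it has been disabled.
	text := "Types: deb\nURIs: https://enterprise.proxmox.com/debian/pve\nSuites: trixie\nComponents: pve-enterprise\nEnabled: no\n"
	for _, s := range parseSources(text) {
		fmt.Println(s["components"], s.enabled()) // pve-enterprise false
	}
}
```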
Docker repository (validated): Docker's official apt repo has a trixie channel;
no fallback to Debian's docker.io was needed. Installed Docker 29.5.3 from it in both
guest types. [phase0 §1]
2. Guest model (LXC vs VM) — validated facts
Both guest types ran the identical workload (Debian 13, Docker 29.5.3, a postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) [phase0].
2.1 Isolation characteristic (fact, not recommendation)
- LXC is an OS-level container: it shares the host kernel. Docker-in-LXC needs the container configured for nesting (see §2.3).
- VM runs its own guest kernel under KVM/QEMU, with full hardware-level isolation and its own firmware.
The trade-offs below follow directly from this difference.
2.2 Resource overhead (measured)
Host RAM used = MemTotal − MemAvailable, deltas vs a both-stopped baseline of 1702 MB;
one guest measured at a time [phase0 §2]:
| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | +211 MB | +2056 MB | structural, see below |
| Under-load host-RAM delta | +410 MB | +2084 MB | |
| Per-guest attribution | cgroup memory.current 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as %guest (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |
¹ cgroup memory.current counts reclaimable page cache shared with the host and
overstates the LXC's true incremental cost; the +211 MB host delta is the honest
number [phase0 §4.4].
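The host-RAM figure is plain arithmetic on `/proc/meminfo`. A sketch of that calculation, assuming MB means 1024 kB (the spike's exact unit convention is not recorded); the sample input is illustrative, constructed to reproduce the +211 MB idle LXC delta, and is not a captured reading. `usedMB` is a hypothetical helper.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// usedMB returns MemTotal - MemAvailable from /proc/meminfo text (kB values),
// converted to MB assuming 1 MB = 1024 kB.
func usedMB(meminfo string) (int64, error) {
	vals := map[string]int64{}
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		f := strings.Fields(sc.Text()) // e.g. ["MemTotal:", "16384000", "kB"]
		if len(f) < 2 {
			continue
		}
		key := strings.TrimSuffix(f[0], ":")
		if key != "MemTotal" && key != "MemAvailable" {
			continue
		}
		n, err := strconv.ParseInt(f[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parse %s: %w", key, err)
		}
		vals[key] = n
	}
	total, okT := vals["MemTotal"]
	avail, okA := vals["MemAvailable"]
	if !okT || !okA {
		return 0, errors.New("MemTotal or MemAvailable missing")
	}
	return (total - avail) / 1024, nil
}

func main() {
	const baselineMB = 1702 // both-guests-stopped baseline
	// Illustrative input, not a captured reading.
	sample := "MemTotal:       16384000 kB\nMemFree:         1000000 kB\nMemAvailable:   14425088 kB\n"
	used, err := usedMB(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("used %d MB, delta %+d MB\n", used, used-baselineMB) // used 1913 MB, delta +211 MB
}
```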
Why the RAM gap is structural [phase0 §4.3]: LXC processes share the host kernel and page cache, so only the working set counts against the host. A VM with no ballooning configured has KVM back every guest-touched page (including the guest's own page cache), so its host cost ≈ the full RAM allocation and is largely load-independent. Ballooning / KSM were not tested and could change the VM figure.
2.3 Docker-in-LXC viability (validated)
Docker ran cleanly in an unprivileged LXC configured with
--features nesting=1,keyctl=1 --unprivileged 1 (PVE 9 syntax, accepted by pct create)
[phase0 §3]:
- `docker run hello-world` → success; full 3-container stack healthy.
- Storage driver: `overlayfs` (cgroup v2, systemd cgroup driver); no `vfs` fallback. Docker 29 names the overlay driver `overlayfs` via the containerd snapshotter image store; it is the same overlay technology as the legacy `overlay2`.
- Named volume persisted writes; multi-container networking and the published port worked (`curl localhost:8080` → 200); 0 failed transactions under load.
- No privileged-container fallback was needed.
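The same container settings can be requested over the API. A sketch that builds the form body for `POST /nodes/<node>/lxc`, mirroring `--features nesting=1,keyctl=1 --unprivileged 1` plus the spike's 2 vCPU / 2048 MB sizing; the VMID, template volume ID, and the helper name `lxcCreateForm` are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// lxcCreateForm builds the form body for POST /nodes/<node>/lxc. It mirrors
// `pct create <vmid> <template> --features nesting=1,keyctl=1 --unprivileged 1`
// with the spike's sizing. Hypothetical helper; values are placeholders.
func lxcCreateForm(vmid int, ostemplate, storage string) url.Values {
	return url.Values{
		"vmid":         {strconv.Itoa(vmid)},
		"ostemplate":   {ostemplate},           // template volume ID on "local"
		"storage":      {storage},              // rootfs storage, e.g. "local-lvm"
		"unprivileged": {"1"},                  // unprivileged container
		"features":     {"nesting=1,keyctl=1"}, // required for Docker-in-LXC
		"cores":        {"2"},
		"memory":       {"2048"}, // MB
	}
}

func main() {
	form := lxcCreateForm(200, "local:vztmpl/<debian-13-template>", "local-lvm")
	fmt.Println(form.Encode())
}
```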
2.4 Guest agent & app-consistency capability
- VM: `qemu-guest-agent` installs and reports (`agent: 1`), enabling `guest-fsfreeze`-based app-consistent `snapshot` backups [phase0 §4.8]. The Debian genericcloud image does not ship the agent; it must be installed in-guest.
- LXC: no guest agent exists → no fsfreeze (see §4.2).
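Whether a VM is eligible for fsfreeze-based backups starts with its `agent` config property. A sketch of reading that property string; it handles only the `1` / `enabled=1` forms (other boolean spellings are not handled), `fstrim_cloned_disks` appears only as an example extra option, and `agentEnabled` is hypothetical. The flag says the agent is configured, not that it is running; `POST /nodes/<node>/qemu/<vmid>/agent/ping` is the API call that checks the latter.

```go
package main

import (
	"fmt"
	"strings"
)

// agentEnabled reports whether a VM's "agent" config property (as returned by
// GET /nodes/<node>/qemu/<vmid>/config) enables qemu-guest-agent. Handles
// "1", "enabled=1" and "1,<more options>"; an empty value means disabled.
func agentEnabled(prop string) bool {
	for _, part := range strings.Split(prop, ",") {
		part = strings.TrimSpace(part)
		key, val, hasEq := strings.Cut(part, "=")
		if !hasEq {
			return part == "1" // unnamed value is the default "enabled" key
		}
		if key == "enabled" {
			return val == "1"
		}
	}
	return false
}

func main() {
	for _, p := range []string{"1", "enabled=1,fstrim_cloned_disks=1", "0", ""} {
		fmt.Printf("%q -> %v\n", p, agentEnabled(p))
	}
}
```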
3. API & access control
3.1 Fundamentals
- Base URL: `https://<host>:8006/api2/json`. Every `pve*` CLI is a thin wrapper over this REST API.
- Token auth header: `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The secret is shown once at creation. Response envelope: `{"data": ...}`.
- TLS reality: the host serves the default self-signed certificate. `curl` without `-k` fails with `SSL certificate problem: unable to get local issuer certificate` [phase1-2 §1.5]. Production trust (pin the PVE CA / install a real cert) is a separate, not-yet-decided concern.
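A sketch of the token header and the response envelope using only the Go standard library. The user `controller@pve`, token ID `ctl`, and secret are placeholders, and the helper names are hypothetical. The request is built but not sent, and no TLS configuration is shown, because production trust is still undecided; skipping verification was only a spike convenience.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// tokenHeader formats PVEAPIToken=USER@REALM!TOKENID=SECRET.
func tokenHeader(user, tokenID, secret string) string {
	return fmt.Sprintf("PVEAPIToken=%s!%s=%s", user, tokenID, secret)
}

// newRequest builds, but does not send, a token-authenticated GET against
// https://<host>:8006/api2/json<path>.
func newRequest(host, path, user, tokenID, secret string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://"+host+":8006/api2/json"+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", tokenHeader(user, tokenID, secret))
	return req, nil
}

// decodeData unwraps the {"data": ...} envelope into T.
func decodeData[T any](body []byte) (T, error) {
	var env struct {
		Data T `json:"data"`
	}
	err := json.Unmarshal(body, &env)
	return env.Data, err
}

func main() {
	req, err := newRequest("192.168.0.162", "/version", "controller@pve", "ctl", "SECRET")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL, req.Header.Get("Authorization"))

	v, err := decodeData[map[string]string]([]byte(`{"data":{"release":"9.2"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(v["release"]) // 9.2
}
```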
3.2 RBAC model
An ACL entry is a triple (path, principal, role); a role is a bundle of privileges,
assigned at the most specific path. Paths include /, /vms/<vmid>, /nodes/<node>,
/storage/<store>, /pool/<pool>, /access/....
Introspection (corrected for PVE 9) [phase1-2 §1.1]:
- `pveum role list` lists roles with their privileges.
- ⚠️ `pveum role info <role>` does not exist in PVE 9 (the old reference used it).
- `pveum acl list` and `pveum user permissions <user> --path <path>` show the ACL table and a user's effective permissions on a path.
3.3 Privilege-separated tokens — the intersection rule (corrected)
A privsep token's (`--privsep 1`) effective permissions are the intersection of (a) the backing user's permissions and (b) the token's own ACLs. The role must therefore be granted on BOTH the user AND the token for the same path. Granting it on the token only yields an empty intersection and a 403 even on self-calls. [phase1-2 §1.2]
This corrects the superseded reference (§3 there grants the ACL to the token only). The intersection is what keeps a privsep token ≤ its user while still being independently scopeable to a narrow path.
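The intersection rule reduces to set arithmetic on a single path. A toy model that ignores ACL inheritance/propagation and groups; `privSet` and `effective` are hypothetical names, not PVE code.

```go
package main

import (
	"fmt"
	"sort"
)

// privSet is the set of privilege names granted on one ACL path.
type privSet map[string]bool

// effective returns what a privsep token actually holds on a path:
// the intersection of the backing user's privileges and the token's own.
func effective(user, token privSet) []string {
	var out []string
	for p := range token {
		if user[p] {
			out = append(out, p)
		}
	}
	sort.Strings(out) // deterministic order for display
	return out
}

func main() {
	role := privSet{"VM.Audit": true, "VM.Backup": true}
	fmt.Println(effective(role, role))      // [VM.Audit VM.Backup] (dual grant)
	fmt.Println(effective(privSet{}, role)) // [] (token-only grant: 403)
}
```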
Working pattern (validated):
```sh
pveum role add <Role> -privs "<priv> <priv> ..."                      # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1                 # capture SECRET (shown once)
pveum acl modify <path> -user '<user>@pve' -role <Role>               # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role>    # ...AND the token
```
`pveum acl delete` requires `--roles`; a bare `-user`/`-token` plus path fails with
`400 roles: property is missing`. Deleting the token, user, or role automatically invalidates
the ACLs that reference it. [phase1-2 §5]
3.4 Validated minimal self-backup role
A token scoped to one VMID + the backup datastore can audit, snapshot, and back up only that guest, and is denied on every other guest and on create/allocate [phase1-2 §1.3–1.4]:
Minimal role for self-audit, self-snapshot, and both `snapshot`- and `stop`-mode self-backup:
`VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
⚠️ VM.PowerMgmt is NOT required for stop-mode backup — vzdump performs the guest
shutdown/restart internally under VM.Backup (tested: stop-mode self-backup returned
exitstatus OK without it) [phase1-2 §1.4]. This corrects the
old reference's "likely yes" guess.
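A sketch that keeps the validated privilege set in one place, renders it in the space-separated form that `-privs` expects, and reports what an existing role lacks. The variable and helper names, and the role name `SelfBackup` in the usage line, are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// minimalSelfBackup is the validated minimal privilege set. VM.PowerMgmt is
// intentionally absent: stop-mode backup succeeded without it.
var minimalSelfBackup = []string{
	"VM.Audit", "VM.Snapshot", "VM.Backup",
	"Datastore.AllocateSpace", "Datastore.Audit",
}

// privsArg renders privileges as the single space-separated string that
// `pveum role add <Role> -privs "..."` expects.
func privsArg(privs []string) string { return strings.Join(privs, " ") }

// missing lists the minimal privileges that `have` lacks, in slice order.
func missing(have []string) []string {
	got := make(map[string]bool, len(have))
	for _, p := range have {
		got[p] = true
	}
	var out []string
	for _, p := range minimalSelfBackup {
		if !got[p] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Printf("pveum role add SelfBackup -privs %q\n", privsArg(minimalSelfBackup))
	fmt.Println(missing([]string{"VM.Audit", "VM.Backup"}))
}
```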
Validated boundary (token scoped to /vms/<self> + /storage/local):
| Operation | Result |
|---|---|
| GET /version | 200 |
| GET self status, POST self snapshot, POST self vzdump | 200 / task OK |
| GET/POST against another guest's vmid | 403 (read) / task 403 (backup) |
| POST /nodes/<node>/lxc (create/allocate a guest) | 403 (create/allocate is operator-tier) |
3.5 Async tasks — trust exitstatus, not the POST
Long operations (vzdump, snapshot, clone, restore) return a UPID, not a result.
Poll GET /nodes/<node>/tasks/<upid>/status until status: stopped, then read
exitstatus [phase1-2 §1.3].
⚠️ Authorization can surface at task execution, not at the HTTP POST. A `vzdump` against an unauthorized vmid returns HTTP 200 + a UPID, but the task then ends with `exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)"` and produces no archive. A caller that trusts the 200 would wrongly believe the backup ran. Always poll the task and check `exitstatus`.
(The task owner — including a token — can read its own task status: 200.)
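A sketch of that polling rule in Go. The status fetch is injected as a function so the example runs without a host; `waitTask`, `statusFunc`, the poll interval, and the poll cap are hypothetical choices. Any `exitstatus` other than `OK` is treated as failure here, which is the conservative reading.

```go
package main

import (
	"fmt"
	"time"
)

// taskStatus is the subset of GET /nodes/<node>/tasks/<upid>/status used here.
type taskStatus struct {
	Status     string `json:"status"`     // "running" or "stopped"
	ExitStatus string `json:"exitstatus"` // set once stopped; "OK" on success
}

// statusFunc fetches a task's status. In a real client it would wrap the GET
// above; it is injected here so the sketch runs without a PVE host.
type statusFunc func(upid string) (taskStatus, error)

// waitTask polls until the task stops, then succeeds only on exitstatus "OK".
// The HTTP 200 + UPID from the original POST proves nothing by itself.
func waitTask(get statusFunc, upid string, every time.Duration, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		st, err := get(upid)
		if err != nil {
			return err
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task %s failed: %s", upid, st.ExitStatus)
			}
			return nil
		}
		time.Sleep(every)
	}
	return fmt.Errorf("task %s still running after %d polls", upid, maxPolls)
}

func main() {
	polls := 0
	// Fake status source: running twice, then the authz failure described above.
	fake := func(upid string) (taskStatus, error) {
		polls++
		if polls < 3 {
			return taskStatus{Status: "running"}, nil
		}
		return taskStatus{Status: "stopped", ExitStatus: "403 Permission check failed (/vms/102, VM.Backup)"}, nil
	}
	if err := waitTask(fake, "UPID:example", 10*time.Millisecond, 20); err != nil {
		fmt.Println("backup did not run:", err)
	}
}
```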
4. Backup & restore (vzdump / pct restore)
4.1 Modes
- `stop`: orderly guest shutdown → backup → restart. Highest consistency, defined downtime. (For LXC the shutdown/restart is internal to `vzdump` and needs only `VM.Backup`; see §3.4.)
- `snapshot`: lowest downtime; copies blocks while running. Consistency depends on the guest cooperating (§4.2).
- `suspend`: legacy/compat, not used.
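A sketch of the backup request body for `POST /nodes/<node>/vzdump`, using the `vmid`, `mode`, `storage`, and `compress` parameters (zstd, matching §4.3). Only the two modes in use are accepted; VMID 101 and the helper name `vzdumpForm` are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// vzdumpForm builds the form for POST /nodes/<node>/vzdump. Only "stop" and
// "snapshot" are accepted; "suspend" is rejected on purpose.
func vzdumpForm(vmid int, mode, storage string) (url.Values, error) {
	if mode != "stop" && mode != "snapshot" {
		return nil, fmt.Errorf("unsupported vzdump mode %q", mode)
	}
	return url.Values{
		"vmid":     {strconv.Itoa(vmid)},
		"mode":     {mode},
		"storage":  {storage}, // needs content type "backup", e.g. "local"
		"compress": {"zstd"},
	}, nil
}

func main() {
	form, err := vzdumpForm(101, "stop", "local")
	if err != nil {
		panic(err)
	}
	fmt.Println(form.Encode()) // the POST returns a UPID that must be polled
}
```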
4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC
⚠️ An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze. A running-stack LXC backup is therefore crash-consistent (filesystem-level), not app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first (stop the stack / flush DBs) or use `stop` mode. A VM with `qemu-guest-agent` gets `guest-fsfreeze` around the copy → near-free app-consistency. [phase1-2 §2.1, phase0 §4.8]
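The consistency rules above can be restated as a small lookup that a controller could consult before choosing a mode. A sketch with hypothetical type and function names; the VM-without-agent branch is inferred (no agent, so nothing to freeze) rather than spike-tested.

```go
package main

import "fmt"

type guestKind int

const (
	LXC guestKind = iota
	VM
)

type consistency string

const (
	CrashConsistent consistency = "crash-consistent"
	FSFreeze        consistency = "app-consistent via guest-fsfreeze"
	Quiesced        consistency = "quiesced"
)

// expected restates the rules as a lookup: what a vzdump of the given guest
// and mode yields. quiescedByCaller means the stack was stopped or the DBs
// flushed in-guest before the backup.
func expected(kind guestKind, mode string, agent, quiescedByCaller bool) consistency {
	switch {
	case mode == "stop":
		return Quiesced // guest is shut down for the copy
	case kind == VM && agent:
		return FSFreeze
	case quiescedByCaller:
		return Quiesced
	default:
		return CrashConsistent // e.g. a running LXC in snapshot mode
	}
}

func main() {
	fmt.Println(expected(LXC, "snapshot", false, false)) // crash-consistent
	fmt.Println(expected(VM, "snapshot", true, false))   // app-consistent via guest-fsfreeze
	fmt.Println(expected(LXC, "stop", false, false))     // quiesced
}
```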
Validated restore behaviour (LXC, Postgres) [phase1-2 §2.2]:
- Crash-consistent (running): on first start Postgres ran automatic WAL recovery (`database system was interrupted … not properly shut down; automatic recovery in progress … redo done … ready to accept connections`) and the data was intact.
- Quiesced (stack stopped): clean start, no recovery, data intact.
- Both restored correctly here on an idle-at-backup DB; this is not a durability guarantee under heavy write load (§6).
4.3 What a backup captures
A single LXC vzdump captures the container rootfs including the Docker named volumes
(they live in the rootfs) — one backup = the whole guest and its data. Validated: a
sentinel row survived both variants [phase1-2 §2.2].
Sizes/timings (2.5 GiB source, zstd) [phase1-2 §2.1–2.2]: backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.
4.4 Restore = recreate-from-archive (identity is preserved)
There is no single "restore" call — you recreate the guest from the archive into a new VMID:
- LXC: `pct restore <newid> <archive> --storage <store>`
- VM: `qmrestore <archive> <newid>` (or `POST /nodes/<node>/qemu` with `archive=`)

⚠️ `pct restore` preserves the source config, including the MAC address and hostname. Restoring while the original still runs causes a MAC/hostname collision on the bridge; reset network identity before starting (`pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp` regenerates the MAC). [phase1-2 §2.2]
Restored config survives intact: unprivileged: 1 and features: nesting=1,keyctl=1
are preserved, so Docker runs in the restored CT [phase1-2 §2.2].
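Over the API, `pct restore` corresponds to creating a container with `restore=1` and the archive as `ostemplate`, and the identity reset is a config update. A sketch of both form bodies; VMID 201, the hostname, and the helper names are hypothetical, and the restore call returns a UPID like any other long operation. Replacing the hostname is added here to cover the hostname half of the collision; the spike validated only the `net0` reset.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// restoreForm builds POST /nodes/<node>/lxc with restore=1, the API form of
// `pct restore <newid> <archive> --storage <store>`.
func restoreForm(newID int, archive, storage string) url.Values {
	return url.Values{
		"vmid":       {strconv.Itoa(newID)},
		"ostemplate": {archive}, // vzdump archive volume ID on a backup storage
		"restore":    {"1"},
		"storage":    {storage},
	}
}

// identityForm builds PUT /nodes/<node>/lxc/<newid>/config: net0 is rewritten
// without a hwaddr (so a new MAC is generated) and the hostname is replaced.
func identityForm(hostname string) url.Values {
	return url.Values{
		"net0":     {"name=eth0,bridge=vmbr0,ip=dhcp"},
		"hostname": {hostname},
	}
}

func main() {
	fmt.Println(restoreForm(201, "local:backup/<archive>", "local-lvm").Encode())
	fmt.Println(identityForm("app-restore-test").Encode())
}
```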
4.5 Snapshots
A running, unprivileged LXC can be snapshotted on LVM-thin with no stop required
(exitstatus OK; snapshot listed while the CT stays running)
[phase1-2 §1.6]. This is the mechanism available for a
snapshot-before-change rollback flow.
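A snapshot-before-change flow needs two API calls: take a snapshot, and roll back to it if the change fails. A sketch of the paths and form, with a hypothetical naming scheme assumed to fit PVE's snapshot-name format; both calls return a UPID, so the §3.5 rule (poll until `exitstatus: OK`) applies to each. Only the snapshot step was validated in the spike.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// preChangeName makes a snapshot name from a timestamp. Hypothetical scheme,
// assumed to fit PVE's snapshot-name format (starts with a letter; letters,
// digits, '-' and '_').
func preChangeName(t time.Time) string {
	return "prechange-" + t.UTC().Format("20060102-150405")
}

// snapshotCreate returns the path and form for POST .../lxc/<vmid>/snapshot.
func snapshotCreate(node string, vmid int, name string) (string, url.Values) {
	path := fmt.Sprintf("/nodes/%s/lxc/%d/snapshot", node, vmid)
	return path, url.Values{"snapname": {name}}
}

// snapshotRollback returns the path for POST .../snapshot/<name>/rollback.
func snapshotRollback(node string, vmid int, name string) string {
	return fmt.Sprintf("/nodes/%s/lxc/%d/snapshot/%s/rollback", node, vmid, name)
}

func main() {
	name := preChangeName(time.Now())
	path, form := snapshotCreate("demo-felhom", 101, name)
	fmt.Println("POST", path, form.Encode())
	fmt.Println("POST", snapshotRollback("demo-felhom", 101, name)) // only if the change fails
}
```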
4.6 PBS (Proxmox Backup Server)
Not yet validated. No PBS datastore was configured or tested in the spike. All backup
findings above are for vzdump to a dir storage. PBS (dedup, incremental, remote,
dirty-bitmap) is pending.
5. Gotchas & operational notes (quick reference)
| Gotcha | Detail | Evidence |
|---|---|---|
| deb822 repos | PVE 9 repos are .sources files; disable enterprise, enable no-subscription | standard setup |
| Privsep dual-grant | privsep token needs the role on both user and token, else empty intersection → 403 | phase1-2 §1.2 |
| Async authz | vzdump POST returns 200+UPID even when unauthorized; the 403 is in the task exitstatus; poll it | phase1-2 §1.3 |
| No fsfreeze for LXC | running-LXC snapshot backup is crash-consistent only; quiesce or use stop for app-consistency | phase1-2 §2.1 |
| Restore identity collision | pct restore keeps source MAC + hostname; reset before starting alongside the original | phase1-2 §2.2 |
| Restart policy for self-heal | restored/rebooted containers come up exited with no restart policy; need a restart policy or an explicit compose up -d to return automatically | phase1-2 §2.2/§3 |
| Self-signed TLS | host cert is self-signed; curl needs -k until trust is set up | phase1-2 §1.5 |
| pveum role info gone | use pveum role list in PVE 9 | phase1-2 §1.1 |
| pveum acl delete needs --roles | bare -user/-token path fails with 400 roles: property is missing | phase1-2 §5 |
| VM.PowerMgmt not needed | stop-mode backup works under VM.Backup alone | phase1-2 §1.4 |
6. Validated vs open
Validated by the spike
| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; local (dir) vs local-lvm (thin) roles | phase0 §1, phase1-2 pre-flight |
| Docker runs in an unprivileged LXC (nesting=1,keyctl=1), driver overlayfs, cgroup v2 | phase0 §3 |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | phase0 §2 |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | phase1-2 §1.2 |
| Minimal self-backup role; VM.PowerMgmt unnecessary | phase1-2 §1.4 |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | phase1-2 §1.3 |
| Async UPID model; vzdump authz surfaces in exitstatus, not the POST | phase1-2 §1.3 |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | phase1-2 §1.6 |
| vzdump → pct restore round-trip; one backup captures Docker volumes; config survives | phase1-2 §2 |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | phase1-2 §2.2 |
Not yet validated (do not assume)
| Open item | Why it matters |
|---|---|
| PBS (dedup/incremental/remote backup) | the only backup path tested was vzdump to a dir |
| The real controller running inside an LXC reaching host:8006 | spike used curl/CLI, not the actual Go controller |
| App-consistency under heavy write load | WAL recovery was validated only on an idle-at-backup DB |
| Live migration / restore to a different host | single-node spike only |
| Ballooning / KSM effect on VM RAM cost | VM RAM measured with neither configured |
| Cluster / HA behaviour | single node |
| Production TLS trust for the API | all calls used -k against a self-signed cert |
| deb822 no-subscription repo setup as a controlled step | host arrived pre-configured |
7. Scope boundary
This document holds platform facts only. Felhom design decisions — e.g. which guest type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are out of scope and belong in the controller-architecture document. Where this reference notes a decision exists, the decision itself is recorded there, not here.