Proxmox Platform Reference
Authoritative, living reference for the Proxmox platform underneath proxmox-controller.
It records facts about Proxmox and what we validated about it — not Felhom design
decisions. Where a design choice exists, this doc points to the (future) controller
architecture document rather than making the choice here.
Evidence base (raw, chronological spike logs — kept as the underlying record):
- tests/phase0-findings.md — VM-vs-LXC overhead, Docker-in-LXC viability
- tests/phase1-2-findings.md — privilege model, backup/restore round-trip
- tests/Proxmox_Spike_-API&_Access-Control_Reference.md — superseded pre-spike reference (contains a known privsep error; do not cite as authoritative)
Every nontrivial claim links to its evidence section. Validated on a single host
(demo-felhom, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and
measurements as indicative, not benchmarks.
1. Platform baseline
Validated stack [phase0 §1]:
| Component | Version |
|---|---|
| Proxmox VE (`pve-manager`) | 9.2.2 (b9984c6d90a4bd80) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel 7.0.2-6-pve |
| `pve-qemu-kvm` | 11.0.0-3 |
| `qemu-server` | 9.1.15 |
| `pve-container` | 6.1.10 |
| `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 |
| `criu` | 4.1.1-1 |
pvesh get /version → release 9.2. Always confirm the node name on the box
(pvesh get /nodes) rather than hard-coding it.
1.1 Storage backends
Two backends were present and exercised [phase0 §1, phase1-2 §pre-flight]:
| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| `local` | `dir` | `/var/lib/vz` | iso, vztmpl, backup, import | ISOs, CT templates, vzdump archives |
| `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | rootdir, images | guest disk volumes |
Why backups cannot live on LVM-thin: LVM-thin is a block backend — it allocates
logical volumes for guest disks. Backup archives and templates are files, which require a
file-level backend (dir, NFS, CIFS, or PBS). A vzdump target must therefore be a
storage whose content types include backup (here, local); pointing vzdump at
local-lvm is not valid. [phase1-2 §pre-flight / §2.1]
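The content-type rule above can be checked before a backup target is chosen. The following is a minimal sketch, not spike-validated; the function name `storage_accepts_backup` is hypothetical, and it only inspects a content-type string such as the ones in the table above.

```shell
# Succeeds when a comma-separated content-type list includes "backup".
# Spaces are stripped so "iso, vztmpl, backup" and "iso,vztmpl,backup" both work.
storage_accepts_backup() {
  case ",$(printf '%s' "$1" | tr -d ' ')," in
    *,backup,*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the table's values, `storage_accepts_backup "iso, vztmpl, backup, import"` succeeds for `local` and `storage_accepts_backup "rootdir, images"` fails for `local-lvm`. On a live host, `pvesm status --content backup` should list the storages that qualify.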
1.2 Repositories
PVE 9 uses deb822 .sources files under /etc/apt/sources.list.d/. For a host
without a subscription, the enterprise repos (pve-enterprise.sources,
ceph-*-enterprise.sources) must be disabled (they return 401) and a no-subscription repo
enabled. The spike host arrived with the no-subscription repo already configured and the
host updated [phase0 baseline], so repo setup was not a spike deliverable. The canonical
setup is the standard Proxmox 9 procedure: a
/etc/apt/sources.list.d/pve-no-subscription.sources file with
Components: pve-no-subscription. Treat the exact commands as standard setup, not
spike-validated.
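For orientation, a deb822 stanza for the no-subscription repo might look like the sketch below. Only the file name and `Components: pve-no-subscription` come from this document. The URI, suite, and keyring path are assumptions taken from the standard Proxmox procedure and were not validated by the spike. The function name is hypothetical.

```shell
# Print a deb822 stanza for the PVE 9 no-subscription repository.
# ASSUMPTION: URIs, Suites and Signed-By follow the standard Proxmox 9 setup;
# verify them against the official documentation before use.
print_no_subscription_sources() {
  cat <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
}
```

As root, the output would be redirected to `/etc/apt/sources.list.d/pve-no-subscription.sources`, followed by `apt update`. The enterprise `.sources` files can usually be disabled by adding an `Enabled: false` field to their stanzas rather than deleting them.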
Docker repository (validated): Docker's official apt repo has a trixie channel;
no fallback to Debian's docker.io was needed. Installed Docker 29.5.3 from it in both
guest types. [phase0 §1]
2. Guest model (LXC vs VM) — validated facts
Both guest types ran the identical workload (Debian 13, Docker 29.5.3, a postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) [phase0].
2.1 Isolation characteristic (fact, not recommendation)
- LXC is an OS-level container: it shares the host kernel. Docker-in-LXC needs the container configured for nesting (see §2.3).
- VM runs its own guest kernel under KVM/QEMU, with full hardware-level isolation and its own firmware.
The trade-offs below follow directly from this difference.
2.2 Resource overhead (measured)
Host RAM used = MemTotal − MemAvailable, deltas vs a both-stopped baseline of 1702 MB;
one guest measured at a time [phase0 §2]:
| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | +211 MB | +2056 MB | structural, see below |
| Under-load host-RAM delta | +410 MB | +2084 MB | |
| Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as %guest (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |
¹ cgroup memory.current counts reclaimable page cache shared with the host and
overstates the LXC's true incremental cost; the +211 MB host delta is the honest
number [phase0 §4.4].
Why the RAM gap is structural [phase0 §4.3]: LXC processes share the host kernel and page cache, so only the working set counts against the host. A VM with no ballooning configured has KVM back every guest-touched page (including the guest's own page cache), so its host cost ≈ the full RAM allocation and is largely load-independent. Ballooning / KSM were not tested and could change the VM figure.
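The host-RAM figure used in the table (MemTotal − MemAvailable, then a delta against the both-stopped baseline) can be reproduced with a short helper. This is a sketch of the arithmetic only, not the spike's actual measurement script. The function names are hypothetical, and values are converted from the kB reported by `/proc/meminfo` by dividing by 1024.

```shell
# Print host RAM used in MB (MemTotal - MemAvailable) from a meminfo-format file.
# Defaults to /proc/meminfo; a file argument allows offline re-computation.
host_ram_used_mb() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END {printf "%d\n", (t - a) / 1024}' "${1:-/proc/meminfo}"
}

# Print the delta in MB between a reading and the both-stopped baseline.
ram_delta_mb() {
  echo $(( $1 - $2 ))
}
```

For example, a reading of 1913 MB against the 1702 MB baseline gives `ram_delta_mb 1913 1702` = 211, the idle LXC delta in the table. Measure one guest at a time, as the spike did, so the delta is attributable.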
2.3 Docker-in-LXC viability (validated)
Docker ran cleanly in an unprivileged LXC configured with
--features nesting=1,keyctl=1 --unprivileged 1 (PVE 9 syntax, accepted by pct create)
[phase0 §3]:
- `docker run hello-world` → success; full 3-container stack healthy.
- Storage driver: `overlayfs` (cgroup v2, systemd cgroup driver), no `vfs` fallback. (Docker 29 names the overlay driver `overlayfs` via the containerd snapshotter image store; same overlay technology as the legacy `overlay2`.)
- Named volume persisted writes; multi-container networking + published port worked (`curl localhost:8080` → 200); 0 failed transactions under load.
- No privileged-container fallback was needed.
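A provisioning call with the validated flags could be wrapped as below. This is a sketch under stated assumptions: the function name is hypothetical, the template volume ID is passed in by the caller, and the sizing (2 cores, 2048 MB, 10 GB on `local-lvm`) mirrors the spike's test guests rather than any recommendation. Because of `keyctl=1`, it has to run as OS root (see §3.6).

```shell
# Create an unprivileged, Docker-capable LXC with the flags validated in the spike.
# Usage: create_docker_lxc <vmid> <template-volid> <hostname>
# ASSUMPTION: sizing, storage and bridge mirror the spike guests; adjust as needed.
create_docker_lxc() {
  local vmid="$1" template="$2" hostname="$3"
  pct create "$vmid" "$template" \
    --hostname "$hostname" \
    --unprivileged 1 \
    --features nesting=1,keyctl=1 \
    --cores 2 --memory 2048 \
    --rootfs local-lvm:10 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp
}
```

The template argument is a storage volume ID for a CT template held on `local` (content type `vztmpl`); its exact name depends on what has been downloaded to the host.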
2.4 Guest agent & app-consistency capability
- VM: `qemu-guest-agent` installs and reports (`agent: 1`), enabling `guest-fsfreeze`-based app-consistent `snapshot` backups [phase0 §4.8]. The Debian `genericcloud` image does not ship the agent; it must be installed in-guest.
- LXC: no guest agent exists → no fsfreeze (see §4.2).
3. API & access control
3.1 Fundamentals
- Base URL: `https://<host>:8006/api2/json`. Every `pve*` CLI is a thin wrapper over this REST API.
- Token auth header: `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The secret is shown once at creation. Response envelope: `{"data": ...}`.
- TLS reality: the host serves the default self-signed certificate. `curl` without `-k` fails with `SSL certificate problem: unable to get local issuer certificate` [phase1-2 §1.5]. Production trust (pin the PVE CA / install a real cert) is a separate, not-yet-decided concern.
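The header format and base URL above can be combined into a small `curl` wrapper. This is a minimal sketch: the function names and the `PVE_*` environment variables are hypothetical, and `PVE_CURL_OPTS=-k` is only a stopgap while the host still serves the self-signed certificate.

```shell
# Build the token auth header from user, realm, token ID and secret.
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s@%s!%s=%s\n' "$1" "$2" "$3" "$4"
}

# GET an API path such as /version and print the {"data": ...} envelope.
# ASSUMPTION: PVE_HOST, PVE_USER, PVE_REALM, PVE_TOKENID, PVE_SECRET and
# PVE_CURL_OPTS are illustrative variable names set by the caller.
pve_get() {
  curl --silent --show-error --fail ${PVE_CURL_OPTS:-} \
    -H "$(pve_auth_header "$PVE_USER" "$PVE_REALM" "$PVE_TOKENID" "$PVE_SECRET")" \
    "https://${PVE_HOST}:8006/api2/json$1"
}
```

With the variables exported, `pve_get /version` should return the release information. Keep the secret out of shell history and process listings where possible, since it is shown only once at creation.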
3.2 RBAC model
An ACL entry is a triple (path, principal, role); a role is a bundle of privileges,
assigned at the most specific path. Paths include /, /vms/<vmid>, /nodes/<node>,
/storage/<store>, /pool/<pool>, /access/....
Introspection (corrected for PVE 9) [phase1-2 §1.1]:
- `pveum role list`: lists roles with their privileges.
- ⚠️ `pveum role info <role>` does not exist in PVE 9 (the old reference used it).
- `pveum acl list`, `pveum user permissions <user> --path <path>`.
3.3 Privilege-separated tokens — the intersection rule (corrected)
A privsep token's (`--privsep 1`) effective permissions are the intersection of (a) the backing user's permissions and (b) the token's own ACLs. The role must therefore be granted on BOTH the user AND the token for the same path. Granting it on the token only yields an empty intersection and a 403 even on self-calls. [phase1-2 §1.2]
This corrects the superseded reference (§3 there grants the ACL to the token only). The intersection is what keeps a privsep token ≤ its user while still being independently scopeable to a narrow path.
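The intersection rule can be illustrated with a toy model that treats each side's privileges on one path as a space-separated list. This is only an illustration of the set logic, not how PVE evaluates ACLs internally. The function name is hypothetical.

```shell
# Print the privileges present in BOTH lists, one per line.
# $1 = user's privileges on the path, $2 = token's privileges on the same path.
effective_privs() {
  local p
  for p in $1; do            # unquoted on purpose: split the list into words
    case " $2 " in
      *" $p "*) printf '%s\n' "$p" ;;
    esac
  done
}
```

A token-only grant corresponds to `effective_privs "" "VM.Audit VM.Backup"`, which prints nothing: the empty intersection behind the 403. Granting the same role on both sides yields the full list, and a narrower token grant yields only the overlap.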
Working pattern (validated):
pveum role add <Role> -privs "<priv> <priv> ..." # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1 # capture SECRET (shown once)
pveum acl modify <path> -user '<user>@pve' -role <Role> # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role> # ...AND the token
pveum acl delete requires --roles (a bare -user/-token path errors
400 roles: property is missing). Deleting the token/user/role auto-invalidates the
referencing ACLs. [phase1-2 §5]
3.4 Validated minimal self-backup role
A token scoped to one VMID + the backup datastore can audit, snapshot, and back up only that guest, and is denied on every other guest and on create/allocate [phase1-2 §1.3–1.4]:
Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode self-backup: `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
⚠️ VM.PowerMgmt is NOT required for stop-mode backup — vzdump performs the guest
shutdown/restart internally under VM.Backup (tested: stop-mode self-backup returned
exitstatus OK without it) [phase1-2 §1.4]. This corrects the
old reference's "likely yes" guess.
Validated boundary (token scoped to /vms/<self> + /storage/local):
| Operation | Result |
|---|---|
| `GET /version` | 200 |
| `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task OK |
| `GET`/`POST` against another guest's vmid | 403 (read) / task 403 (backup) |
| `POST /nodes/<node>/lxc` (create/allocate a guest) | 403; create/allocate is operator-tier |
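Combining the minimal role with the dual-grant pattern from §3.3, the grants for one guest span two paths (`/vms/<self>` and the backup storage), each needing both a user and a token entry. The sketch below only prints the commands for review, it does not run them. The function name and the default role name `SelfBackup` are hypothetical.

```shell
# Print (not run) the pveum commands for a per-guest self-backup token.
# Usage: print_self_backup_grants <user> <tokenid> <vmid> [storage] [role]
print_self_backup_grants() {
  local user="$1" tokenid="$2" vmid="$3" store="${4:-local}" role="${5:-SelfBackup}"
  local privs="VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
  local path
  printf 'pveum role add %s -privs "%s"\n' "$role" "$privs"
  printf 'pveum user add %s@pve\n' "$user"
  printf 'pveum user token add %s@pve %s --privsep 1\n' "$user" "$tokenid"
  for path in "/vms/${vmid}" "/storage/${store}"; do
    # \047 is a single quote; it keeps the "!" safe from shell history expansion.
    printf 'pveum acl modify %s -user \047%s@pve\047 -role %s\n' "$path" "$user" "$role"
    printf 'pveum acl modify %s -token \047%s@pve!%s\047 -role %s\n' "$path" "$user" "$tokenid" "$role"
  done
}
```

Printing first makes the four ACL entries (two paths, user and token each) easy to audit before they are applied on the host.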
3.5 Async tasks — trust exitstatus, not the POST
Long operations (vzdump, snapshot, clone, restore) return a UPID, not a result.
Poll GET /nodes/<node>/tasks/<upid>/status until status: stopped, then read
exitstatus [phase1-2 §1.3].
⚠️ Authorization can surface at task execution, not at the HTTP POST. A `vzdump` against an unauthorized vmid returns HTTP 200 + a UPID, but the task then ends with `exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)"` and produces no archive. A caller that trusts the 200 would wrongly believe the backup ran. Always poll the task and check `exitstatus`.
(The task owner — including a token — can read its own task status: 200.)
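The poll-then-check rule can be sketched as a loop that treats anything other than `exitstatus` = `OK` as failure. This is a sketch under assumptions: the function names and the `FETCH` override are hypothetical, the fetcher assumes `pvesh` and `jq` are available on the host, and the retry count and delay defaults are arbitrary.

```shell
# Print "<status> <exitstatus>" for a task (exitstatus is empty while running).
# ASSUMPTION: runs on the PVE host with pvesh and jq installed.
fetch_task_status() {
  pvesh get "/nodes/$1/tasks/$2/status" --output-format json \
    | jq -r '"\(.status) \(.exitstatus // "")"'
}

# Poll until the task stops; succeed only when exitstatus is exactly "OK".
# Usage: wait_for_task <node> <upid> [tries] [delay-seconds]
# Returns 0 = OK, 1 = task ended with an error, 2 = timed out.
wait_for_task() {
  local node="$1" upid="$2" tries="${3:-60}" delay="${4:-2}"
  local line status exitstatus
  while [ "$tries" -gt 0 ]; do
    line=$("${FETCH:-fetch_task_status}" "$node" "$upid")
    status=${line%% *}       # first word
    exitstatus=${line#* }    # remainder, may contain spaces
    if [ "$status" = "stopped" ]; then
      [ "$exitstatus" = "OK" ] && return 0
      echo "task failed: ${exitstatus}" >&2
      return 1
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "timed out waiting for ${upid}" >&2
  return 2
}
```

Setting `FETCH` to another function name swaps in a different transport, for example a `curl`-based fetcher for a token that has no shell on the host. A task that ends with the 403 exit status described above makes `wait_for_task` return 1 even though the original POST returned 200.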
3.6 Operator-tier agent role & root-vs-API boundary (validated)
The operator-tier host agent (03-host-agent.md) needs a far broader role than the
Phase-1 guest self-backup role (which is denied create/allocate — §3.4). The minimal role
that drives the full guest lifecycle via an API token, validated by paring
[phase3 §B3]:
`FelhomAgent` (operator-tier, 16 privileges): `VM.Allocate, VM.Audit, VM.Config.Disk, VM.Config.CPU, VM.Config.Memory, VM.Config.Network, VM.Config.Options, VM.PowerMgmt, VM.Snapshot, VM.Snapshot.Rollback, VM.Backup, Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit, Sys.Audit, SDN.Use`
Paring proved: `SDN.Use` is required (PVE 9 gates bridge use; omitting it → `403 (/sdn/zones/localnetwork/vmbr0, SDN.Use)`); `Sys.Audit` is required for host metrics (`GET /nodes/<node>/status`); `VM.Config.Network` / `VM.Config.Options` are required for NIC/onboot config; `Datastore.AllocateTemplate` is not needed (drop it). NB `VM.Config.CPUMemory` is not a real privilege; it is `VM.Config.CPU` + `VM.Config.Memory`.
Root-vs-API boundary [phase3 §B3] — nearly the entire guest lifecycle, including restore, is API-token-covered; the genuine OS-root residual is narrow:
| Operation | Coverage |
|---|---|
| Create LXC (nesting-only), config, allocate, start/stop, snapshot/rollback, vzdump, restore, destroy, add storage definition, host metrics | scoped API token (the FelhomAgent role) |
| ⚠️ Create LXC with `keyctl=1` (Docker needs it, see §2.3) | OS root `root@pam` only |
| USB physical mount-by-UUID / systemd mount unit / fstab; SMART/sensors | OS root / narrow sudoers |
⚠️ `keyctl=1` (and any feature flag except `nesting`) can be set only by an actual `root@pam` session: `changing feature flags (except nesting) is only allowed for root@pam`. No API token qualifies, not even a non-privsep `root@pam` token (same 403). So fresh provisioning of a Docker-capable LXC needs `pct create` as OS root (or a narrow sudoers entry). Restore is exempt: a token-authorized `vzrestore` preserves `keyctl=1` from the archive, so the DR path needs no root.
4. Backup & restore (vzdump / pct restore)
4.1 Modes
- `stop`: orderly guest shutdown → backup → restart. Highest consistency, defined downtime. (For LXC the shutdown/restart is internal to `vzdump`; needs only `VM.Backup`, see §3.4.)
- `snapshot`: lowest downtime; copies blocks while running. Consistency depends on the guest cooperating (§4.2).
- `suspend`: legacy/compat, not used.
4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC
⚠️ An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze. A running-stack LXC backup is therefore crash-consistent (filesystem-level), not app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first (stop the stack / flush DBs) or use `stop` mode. A VM with `qemu-guest-agent` gets `guest-fsfreeze` around the copy → near-free app-consistency. [phase1-2 §2.1, phase0 §4.8]
Validated restore behaviour (LXC, Postgres) [phase1-2 §2.2]:
- Crash-consistent (running): on first start Postgres ran automatic WAL recovery (`database system was interrupted … not properly shut down; automatic recovery in progress … redo done … ready to accept connections`) and the data was intact.
- Quiesced (stack stopped): clean start, no recovery, data intact.
- Both restored correctly here on an idle-at-backup DB; this is not a durability guarantee under heavy write load (§6).
4.3 What a backup captures
A single LXC vzdump captures the container rootfs including the Docker named volumes
(they live in the rootfs) — one backup = the whole guest and its data. Validated: a
sentinel row survived both variants [phase1-2 §2.2].
Sizes/timings (2.5 GiB source, zstd) [phase1-2 §2.1–2.2]: backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.
4.4 Restore = recreate-from-archive (identity is preserved)
There is no single "restore" call — you recreate the guest from the archive into a new VMID:
- LXC: `pct restore <newid> <archive> --storage <store>`
- VM: `qmrestore <archive> <newid>` (or `POST /nodes/<node>/qemu` with `archive=`)
⚠️ `pct restore` preserves the source config, including the MAC address and hostname. Restoring while the original still runs causes a MAC/hostname collision on the bridge; reset network identity (`pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp` regenerates the MAC) before starting. [phase1-2 §2.2]
Restored config survives intact: unprivileged: 1 and features: nesting=1,keyctl=1
are preserved, so Docker runs in the restored CT [phase1-2 §2.2].
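The restore-then-reset sequence for an LXC can be sketched as one function. The function name and argument order are hypothetical, and the `net0` string is the one quoted above. Setting a new hostname through `pct set -hostname` is an addition here to address the hostname half of the collision, and was not part of the validated run.

```shell
# Restore an archive into a new VMID, give it a fresh network identity, start it.
# Usage: restore_as_new_guest <newid> <archive> <storage> <new-hostname>
restore_as_new_guest() {
  local newid="$1" archive="$2" store="$3" hostname="$4"
  pct restore "$newid" "$archive" --storage "$store" || return 1
  # Re-declaring net0 without a MAC makes PVE generate a new one.
  pct set "$newid" -net0 name=eth0,bridge=vmbr0,ip=dhcp || return 1
  # ASSUMPTION: renaming avoids the hostname clash with a still-running original.
  pct set "$newid" -hostname "$hostname" || return 1
  pct start "$newid"
}
```

Each step stops the sequence on failure, so the guest is never started while it still carries the original's MAC address.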
4.5 Snapshots
A running, unprivileged LXC can be snapshotted on LVM-thin with no stop required
(exitstatus OK; snapshot listed while the CT stays running)
[phase1-2 §1.6]. This is the mechanism available for a
snapshot-before-change rollback flow.
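A snapshot-before-change flow built on this mechanism might look like the sketch below. Only the snapshot step was validated in the spike. The rollback step, the function name, and the snapshot name are assumptions for illustration, and whether the CT is running after a rollback should be checked rather than assumed.

```shell
# Snapshot a CT, run a change command, and roll back if the change fails.
# Usage: change_with_rollback <vmid> <snapname> <command> [args...]
change_with_rollback() {
  local vmid="$1" snap="$2"
  shift 2
  pct snapshot "$vmid" "$snap" || return 1
  if "$@"; then
    return 0
  fi
  echo "change failed; rolling back ${vmid} to ${snap}" >&2
  pct rollback "$vmid" "$snap"
  return 1
}
```

For example, `change_with_rollback 101 pre-upgrade pct exec 101 -- apt-get -y upgrade` would snapshot guest 101 and roll it back if the upgrade exits non-zero. Old snapshots are not cleaned up by this sketch.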
4.6 PBS (Proxmox Backup Server)
Not yet validated. No PBS datastore was configured or tested in the spike. All backup
findings above are for vzdump to a dir storage. PBS (dedup, incremental, remote,
dirty-bitmap) is pending.
4.7 vzdump scope by LXC mount type (validated)
A stop-mode vzdump includes/excludes each LXC mount point by type and the backup flag
[phase3 §B2]. Validated three ways (vzdump log, archive grep,
restore):
| Location | `backup` flag | In the vzdump? |
|---|---|---|
| rootfs (and anything inside it) | — | included (always) |
| Docker named volume (default driver) | — | included — it lives in the rootfs (/var/lib/docker/volumes/<v>/_data) |
| volume mount point (`mpN`) | `backup=1` | included |
| volume mount point (`mpN`) | `backup=0` | excluded (vol recreated empty on restore) |
| bind mount point (`mpN: /host/path`) | n/a | excluded ("not a volume"); data is not in the archive |
⚠️ The `backup=<boolean>` flag is honoured ONLY for volume mount points. A Docker named volume is in the rootfs and is always captured, so a "bulk" volume left as a default named volume is silently swept into the whole-guest image. To keep bulk data out, realize it as a dedicated `backup=0` volume mount point (proven recipe: `pct set <id> -mpN <storage>:<size>,mp=/mnt/bulk,backup=0` then `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol`). A bind mount's data is excluded from the archive entirely; on same-host restore it reappears only because the bind config re-attaches the same host dir. On a different host (true DR) it is gone unless backed up separately.
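The inclusion table can be expressed as a small decision helper, for example to audit a guest's mount points before relying on a backup. This is a sketch: the function name and the kind labels (`rootfs`, `volume`, `bind`) are hypothetical. Treating an unset flag on a volume mount point as `0` is an assumption, since the table only covers the two explicit values.

```shell
# Print "included" or "excluded" for a location, following the table above.
# Usage: vzdump_includes <rootfs|volume|bind> [backup-flag]
# A Docker named volume (default driver) counts as rootfs.
vzdump_includes() {
  case "$1" in
    rootfs) echo included ;;
    volume)
      # ASSUMPTION: an unset backup flag is treated as 0 (excluded).
      if [ "${2:-0}" = "1" ]; then echo included; else echo excluded; fi ;;
    bind) echo excluded ;;
    *) echo unknown; return 1 ;;
  esac
}
```

The `backup` argument is ignored for `rootfs` and `bind`, matching the finding that the flag is honoured only for volume mount points.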
5. Gotchas & operational notes (quick reference)
| Gotcha | Detail | Evidence |
|---|---|---|
| deb822 repos | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup |
| Privsep dual-grant | privsep token needs the role on both user and token, else empty intersection → 403 | phase1-2 §1.2 |
| Async authz | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | phase1-2 §1.3 |
| No fsfreeze for LXC | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | phase1-2 §2.1 |
| Restore identity collision | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | phase1-2 §2.2 |
| Restart policy for self-heal | restored/rebooted containers come up `exited` with no restart policy; need a restart policy or an explicit `compose up -d` to return automatically | phase1-2 §2.2/§3 |
| Self-signed TLS | host cert is self-signed; `curl` needs `-k` until trust is set up | phase1-2 §1.5 |
| `pveum role info` gone | use `pveum role list` in PVE 9 | phase1-2 §1.1 |
| `pveum acl delete` needs `--roles` | bare `-user`/`-token` path errors `400 roles: property is missing` | phase1-2 §5 |
| `VM.PowerMgmt` not needed | stop-mode backup works under `VM.Backup` alone | phase1-2 §1.4 |
| `keyctl=1` is root-only | feature flags except `nesting` need a `root@pam` session; no API token (even root's) can set them; restore preserves them | phase3 §B3 |
| `SDN.Use` gates bridge use | PVE 9 needs `SDN.Use` to attach a NIC to `vmbr0`; omit it → 403 | phase3 §B3 |
| Docker named vol = always backed up | named volumes live in rootfs; only volume mountpoints honour `backup=0`; bulk must be a dedicated `backup=0` mp | phase3 §B2 |
6. Validated vs open
Validated by the spike
| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | phase0 §1, phase1-2 pre-flight |
| Docker runs in an unprivileged LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | phase0 §3 |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | phase0 §2 |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | phase1-2 §1.2 |
| Minimal self-backup role; `VM.PowerMgmt` unnecessary | phase1-2 §1.4 |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | phase1-2 §1.3 |
| Async UPID model; `vzdump` authz surfaces in `exitstatus`, not the POST | phase1-2 §1.3 |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | phase1-2 §1.6 |
| `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | phase1-2 §2 |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | phase1-2 §2.2 |
| LXC `vzdump` scope by mount type; `backup=0` excludes volume mps; Docker named vols ride rootfs; proven bulk-exclusion recipe | phase3 §B2 |
| Operator agent role (16 privs); guest lifecycle incl. restore is API-token-covered; `keyctl` create is `root@pam`-only | phase3 §B3 |
Not yet validated (do not assume)
| Open item | Why it matters |
|---|---|
| PBS (dedup/incremental/remote backup) | the only backup path tested was vzdump to a dir |
| The real controller running inside an LXC reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller |
| App-consistency under heavy write load | WAL recovery was validated only on an idle-at-backup DB |
| Live migration / restore to a different host | single-node spike only |
| Ballooning / KSM effect on VM RAM cost | VM RAM measured with neither configured |
| Cluster / HA behaviour | single node |
| Production TLS trust for the API | all calls used -k against a self-signed cert |
| deb822 no-subscription repo setup as a controlled step | host arrived pre-configured |
7. Scope boundary
This document holds platform facts only. Felhom design decisions — e.g. which guest type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are out of scope and belong in the controller-architecture document. Where this reference notes a decision exists, the decision itself is recorded there, not here.