Proxmox Platform Reference

Authoritative, living reference for the Proxmox platform underneath proxmox-controller. It records facts about Proxmox and what we validated about it — not Felhom design decisions. Where a design choice exists, this doc points to the (future) controller architecture document rather than making the choice here.

Evidence base (raw, chronological spike logs — kept as the underlying record):

Every nontrivial claim links to its evidence section. Validated on a single host (demo-felhom, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and measurements as indicative, not benchmarks.


1. Platform baseline

Validated stack [phase0 §1]:

| Component | Version |
|---|---|
| Proxmox VE (pve-manager) | 9.2.2 (b9984c6d90a4bd80) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel 7.0.2-6-pve |
| pve-qemu-kvm | 11.0.0-3 |
| qemu-server | 9.1.15 |
| pve-container | 6.1.10 |
| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
| criu | 4.1.1-1 |

pvesh get /version → release 9.2. Always confirm the node name on the box (pvesh get /nodes) rather than hard-coding it.

1.1 Storage backends

Two backends were present and exercised [phase0 §1, phase1-2 §pre-flight]:

| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| local | dir | /var/lib/vz | iso, vztmpl, backup, import | ISOs, CT templates, vzdump archives |
| local-lvm | lvmthin | VG pve, thinpool data | rootdir, images | guest disk volumes |

Why backups cannot live on LVM-thin: LVM-thin is a block backend — it allocates logical volumes for guest disks. Backup archives and templates are files, which require a file-level backend (dir, NFS, CIFS, or PBS). A vzdump target must therefore be a storage whose content types include backup (here, local); pointing vzdump at local-lvm is not valid. [phase1-2 §pre-flight / §2.1]
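The rule above can be checked mechanically before a backup is attempted. The following is a minimal sketch, assuming the caller has already obtained the storage's content-type list as a comma-separated string (as in the table above); the function name `storage_supports_backup` is hypothetical.

```shell
# Return 0 if a comma-separated content-type list includes "backup".
# Spaces around the commas are tolerated.
storage_supports_backup() {
    _list=$(printf '%s' "$1" | tr -d ' ')
    case ",$_list," in
        *,backup,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With the two storages above, `storage_supports_backup "iso, vztmpl, backup, import"` succeeds (local) and `storage_supports_backup "rootdir, images"` fails (local-lvm).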

1.2 Repositories

PVE 9 uses deb822 .sources files under /etc/apt/sources.list.d/. For a host without a subscription, the enterprise repos (pve-enterprise.sources, ceph-*-enterprise.sources) must be disabled (they return 401) and a no-subscription repo enabled. The spike host arrived with the no-subscription repo already configured and the host updated [phase0 baseline]; the repo setup itself was not a spike deliverable — the canonical no-subscription .sources is the standard Proxmox 9 procedure (/etc/apt/sources.list.d/pve-no-subscription.sources with Components: pve-no-subscription). Treat the exact commands as standard setup, not spike-validated.
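For orientation, the standard no-subscription stanza might look like the sketch below. It was not validated by the spike: the URI, suite, and keyring path are taken from the stock Proxmox VE 9 procedure as an assumption, and the function name is hypothetical. The function only prints the stanza; writing it to disk is left as a commented step for root.

```shell
# Print a deb822 stanza for the PVE 9 no-subscription repository.
# Assumption: stock Proxmox VE 9 URI, suite and keyring path (not spike-validated).
no_subscription_sources() {
    cat <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
}

# As root on the host (not executed here):
#   no_subscription_sources > /etc/apt/sources.list.d/pve-no-subscription.sources
```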

Docker repository (validated): Docker's official apt repo has a trixie channel; no fallback to Debian's docker.io was needed. Installed Docker 29.5.3 from it in both guest types. [phase0 §1]


2. Guest model (LXC vs VM) — validated facts

Both guest types ran the identical workload (Debian 13, Docker 29.5.3, a postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) [phase0].

2.1 Isolation characteristic (fact, not recommendation)

  • LXC is an OS-level container: it shares the host kernel. Docker-in-LXC needs the container configured for nesting (see §2.3).
  • VM runs its own guest kernel under KVM/QEMU, with full hardware-level isolation and its own firmware.

The trade-offs below follow directly from this difference.

2.2 Resource overhead (measured)

Host RAM used = MemTotal − MemAvailable, deltas vs a both-stopped baseline of 1702 MB; one guest measured at a time [phase0 §2]:

| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | +211 MB | +2056 MB | structural, see below |
| Under-load host-RAM delta | +410 MB | +2084 MB | |
| Per-guest attribution | cgroup memory.current 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as %guest (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |

¹ cgroup memory.current counts reclaimable page cache shared with the host and overstates the LXC's true incremental cost; the +211 MB host delta is the honest number [phase0 §4.4].

Why the RAM gap is structural [phase0 §4.3]: LXC processes share the host kernel and page cache, so only the working set counts against the host. A VM with no ballooning configured has KVM back every guest-touched page (including the guest's own page cache), so its host cost ≈ the full RAM allocation and is largely load-independent. Ballooning / KSM were not tested and could change the VM figure.
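The host-RAM figures in this section follow from one subtraction over /proc/meminfo. A minimal sketch of that arithmetic follows, assuming the usual kB units in meminfo and truncating to whole MB; both function names are hypothetical and this is not the spike's actual measurement script.

```shell
# Print host RAM used in MB (MemTotal - MemAvailable).
# Reads meminfo-format text on stdin; meminfo values are in kB.
host_ram_used_mb() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END {printf "%d\n", (t - a) / 1024}'
}

# Print the delta in MB between a measurement and the both-stopped baseline.
ram_delta_mb() {
    echo $(( $1 - $2 ))
}

# On a host: host_ram_used_mb < /proc/meminfo
```

For example, a measurement of 1913 MB against the 1702 MB baseline gives the +211 MB idle LXC delta reported above.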

2.3 Docker-in-LXC viability (validated)

Docker ran cleanly in an unprivileged LXC configured with --features nesting=1,keyctl=1 --unprivileged 1 (PVE 9 syntax, accepted by pct create) [phase0 §3]:

  • docker run hello-world → success; full 3-container stack healthy.
  • Storage driver: overlayfs (cgroup v2, systemd cgroup driver) — no vfs fallback. (Docker 29 names the overlay driver overlayfs via the containerd snapshotter image store; same overlay technology as the legacy overlay2.)
  • Named volume persisted writes; multi-container networking + published port worked (curl localhost:8080 → 200); 0 failed transactions under load.
  • No privileged-container fallback was needed.
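The validated container shape can be summarised as a single pct create invocation. The sketch below only prints the command (a dry run); the VMID and template volume are caller-supplied placeholders, the resource values mirror the test guests in §2, and the function name is hypothetical. Because of keyctl=1 the real command has to run as OS root (§3.6).

```shell
# Print (do not run) a pct create command for a Docker-capable unprivileged LXC.
# $1 = vmid, $2 = template volume id (both supplied by the caller).
docker_lxc_create_cmd() {
    printf 'pct create %s %s --unprivileged 1 --features nesting=1,keyctl=1 --cores 2 --memory 2048 --rootfs local-lvm:10 --net0 name=eth0,bridge=vmbr0,ip=dhcp\n' \
        "$1" "$2"
}
```

Printing rather than executing keeps the sketch safe to paste; an operator would review the output and run it by hand as root.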

2.4 Guest agent & app-consistency capability

  • VM: qemu-guest-agent installs and reports (agent: 1), enabling guest-fsfreeze-based app-consistent snapshot backups [phase0 §4.8]. The Debian genericcloud image does not ship the agent — it must be installed in-guest.
  • LXC: no guest agent exists → no fsfreeze (see §4.2).

3. API & access control

3.1 Fundamentals

  • Base URL: https://<host>:8006/api2/json. Every pve* CLI is a thin wrapper over this REST API.
  • Token auth header: Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET. The secret is shown once at creation. Response envelope: {"data": ...}.
  • TLS reality: the host serves the default self-signed certificate. curl without -k fails with "SSL certificate problem: unable to get local issuer certificate" [phase1-2 §1.5]. Production trust (pin the PVE CA / install a real cert) is a separate, not-yet-decided concern.
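The fundamentals above fit in a small wrapper. This is a sketch under stated assumptions: the environment variable names (PVE_USER, PVE_TOKENID, PVE_SECRET) and function names are hypothetical, and -k is used only because the spike host serves a self-signed certificate; it should be dropped once trust is configured.

```shell
# Build the token auth header.
# $1 = USER@REALM, $2 = TOKENID, $3 = SECRET
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# GET an API path and print the {"data": ...} envelope.
# $1 = host, $2 = path below /api2/json (for example /version).
# -k skips certificate verification (self-signed cert on the spike host).
pve_get() {
    curl -sS -k \
        -H "$(pve_auth_header "$PVE_USER" "$PVE_TOKENID" "$PVE_SECRET")" \
        "https://$1:8006/api2/json$2"
}
```

A call such as `pve_get 192.168.0.162 /version` would then return the version envelope, provided the three variables are set.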

3.2 RBAC model

An ACL entry is a triple (path, principal, role); a role is a bundle of privileges, assigned at the most specific path. Paths include /, /vms/<vmid>, /nodes/<node>, /storage/<store>, /pool/<pool>, /access/....

Introspection (corrected for PVE 9) [phase1-2 §1.1]:

  • pveum role list — lists roles with their privileges.
  • ⚠️ pveum role info <role> does not exist in PVE 9 (the old reference used it).
  • pveum acl list, pveum user permissions <user> --path <path>.

3.3 Privilege-separated tokens — the intersection rule (corrected)

A privsep token's (--privsep 1) effective permissions are the intersection of (a) the backing user's permissions and (b) the token's own ACLs. The role must therefore be granted on BOTH the user AND the token for the same path. Granting it on the token only yields an empty intersection and a 403 even on self-calls. [phase1-2 §1.2]

This corrects the superseded reference (§3 there grants the ACL to the token only). The intersection is what keeps a privsep token ≤ its user while still being independently scopeable to a narrow path.

Working pattern (validated):

pveum role add <Role> -privs "<priv> <priv> ..."          # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1     # capture SECRET (shown once)
pveum acl modify <path> -user  '<user>@pve'         -role <Role>   # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role> # ...AND the token

pveum acl delete requires --roles (a bare -user/-token path fails with 400 "roles: property is missing"). Deleting the token/user/role auto-invalidates the referencing ACLs. [phase1-2 §5]
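The intersection rule can be illustrated with a toy model. The sketch below treats each side's grant on a single path as a space-separated privilege list and prints what survives; it is a simplification for illustration (real evaluation is per path and done by Proxmox itself), and the function name is hypothetical.

```shell
# Print the privileges present in BOTH lists (user grant ∩ token grant).
# $1 = privileges granted to the user on a path
# $2 = privileges granted to the token on the same path
effective_privs() (
    out=""
    for p in $1; do
        case " $2 " in
            *" $p "*) out="$out${out:+ }$p" ;;
        esac
    done
    printf '%s\n' "$out"
)
```

A token-only grant corresponds to an empty first argument, so `effective_privs "" "VM.Audit VM.Backup"` prints an empty line, which matches the 403 described above.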

3.4 Validated minimal self-backup role

A token scoped to one VMID + the backup datastore can audit, snapshot, and back up only that guest, and is denied on every other guest and on create/allocate [phase1-2 §1.3–1.4]:

Minimal role for self-audit + self-snapshot + both snapshot- and stop-mode self-backup: VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit

⚠️ VM.PowerMgmt is NOT required for stop-mode backup: vzdump performs the guest shutdown/restart internally under VM.Backup (tested: stop-mode self-backup returned exitstatus OK without it) [phase1-2 §1.4]. This corrects the old reference's "likely yes" guess.

Validated boundary (token scoped to /vms/<self> + /storage/local):

| Operation | Result |
|---|---|
| GET /version | 200 |
| GET self status, POST self snapshot, POST self vzdump | 200 / task OK |
| GET/POST against another guest's vmid | 403 (read) / task 403 (backup) |
| POST /nodes/<node>/lxc (create/allocate a guest) | 403 (create/allocate is operator-tier) |

3.5 Async tasks — trust exitstatus, not the POST

Long operations (vzdump, snapshot, clone, restore) return a UPID, not a result. Poll GET /nodes/<node>/tasks/<upid>/status until status: stopped, then read exitstatus [phase1-2 §1.3].

⚠️ Authorization can surface at task execution, not at the HTTP POST. A vzdump against an unauthorized vmid returns HTTP 200 + a UPID, but the task then ends exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)" and produces no archive. A caller that trusts the 200 would wrongly believe the backup ran. Always poll the task and check exitstatus.

(The task owner — including a token — can read its own task status: 200.)
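The poll-then-check rule can be sketched as follows. Assumptions: the task-status response is flat enough for a crude sed-based field extractor (jq would be the more robust choice where it is installed), and the caller supplies a command that prints the status JSON for the UPID in question. All function names are hypothetical.

```shell
# Crude extractor for a string field in a flat JSON response on stdin.
# $1 = key. Does not handle escaped quotes.
json_str_field() {
    sed -n 's/.*"'"$1"'" *: *"\([^"]*\)".*/\1/p'
}

# Classify a task-status response: prints "running", "ok" or "failed: <exitstatus>".
task_outcome() {
    _status=$(printf '%s\n' "$1" | json_str_field status)
    _exit=$(printf '%s\n' "$1" | json_str_field exitstatus)
    if [ "$_status" != "stopped" ]; then
        echo "running"
    elif [ "$_exit" = "OK" ]; then
        echo "ok"
    else
        echo "failed: $_exit"
    fi
}

# Poll until the task stops. $1 = command that prints the task-status JSON.
# Returns 0 only when exitstatus is OK.
wait_task() {
    while :; do
        _out=$(task_outcome "$($1)")
        [ "$_out" = "running" ] || { echo "$_out"; break; }
        sleep 2
    done
    [ "$_out" = "ok" ]
}
```

The point of the design is that the HTTP status of the original POST never enters the decision; only the task's exitstatus does.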

3.6 Operator-tier agent role & root-vs-API boundary (validated)

The operator-tier host agent (03-host-agent.md) needs a far broader role than the Phase-1 guest self-backup role (which is denied create/allocate — §3.4). The minimal role that drives the full guest lifecycle via an API token, validated by paring [phase3 §B3]:

FelhomAgent (operator-tier, 16 privileges): VM.Allocate, VM.Audit, VM.Config.Disk, VM.Config.CPU, VM.Config.Memory, VM.Config.Network, VM.Config.Options, VM.PowerMgmt, VM.Snapshot, VM.Snapshot.Rollback, VM.Backup, Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit, Sys.Audit, SDN.Use

Paring proved: SDN.Use is required (PVE 9 gates bridge use; omitting it → 403 (/sdn/zones/localnetwork/vmbr0, SDN.Use)); Sys.Audit required for host metrics (GET /nodes/<node>/status); VM.Config.Network/VM.Config.Options required for NIC/onboot config; Datastore.AllocateTemplate not needed (drop it). NB VM.Config.CPUMemory is not a real privilege — it is VM.Config.CPU + VM.Config.Memory.
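Keeping the privilege list in one place makes the role easy to create and to audit. A minimal sketch, with hypothetical function names; the pveum call is shown as a comment because it must run as root on the host and uses the space-separated -privs form from §3.3.

```shell
# Print the 16 operator-tier privileges as a space-separated list.
felhom_agent_privs() {
    echo "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace Datastore.Audit Sys.Audit SDN.Use"
}

# Return 0 if the role contains privilege $1.
felhom_agent_has_priv() {
    case " $(felhom_agent_privs) " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# As root on the host (not executed here):
#   pveum role add FelhomAgent -privs "$(felhom_agent_privs)"
```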

Root-vs-API boundary [phase3 §B3] — nearly the entire guest lifecycle, including restore, is API-token-covered; the genuine OS-root residual is narrow:

| Operation | Coverage |
|---|---|
| Create LXC (nesting-only), config, allocate, start/stop, snapshot/rollback, vzdump, restore, destroy, add storage definition, host metrics | scoped API token (the FelhomAgent role) |
| ⚠️ Create LXC with keyctl=1 (Docker needs it, see §2.3) | OS root (root@pam only) |
| USB physical mount-by-UUID / systemd mount unit / fstab; SMART/sensors | OS root / narrow sudoers |

⚠️ keyctl=1 (and any feature flag except nesting) can be set only by an actual root@pam session; anything else is rejected with "changing feature flags (except nesting) is only allowed for root@pam". No API token qualifies, not even a non-privsep root@pam token (same 403). So fresh provisioning of a Docker-capable LXC needs pct create as OS root (or a narrow sudoers entry). Restore is exempt: a token-authorized vzrestore preserves keyctl=1 from the archive, so the DR path needs no root.
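A provisioning path can decide up front whether a create request is API-token-eligible. The sketch below is a simplification: it treats any feature key other than nesting as root-only regardless of its value, and the function name is hypothetical.

```shell
# Return 0 if a features string (e.g. "nesting=1,keyctl=1") contains any flag
# other than nesting, i.e. the create must go through OS root.
features_need_root() {
    for f in $(printf '%s' "$1" | tr ',' ' '); do
        case "$f" in
            nesting=*) ;;
            *) return 0 ;;
        esac
    done
    return 1
}
```

So `features_need_root "nesting=1,keyctl=1"` succeeds (root path), while `features_need_root "nesting=1"` fails (token path is sufficient).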


4. Backup & restore (vzdump / pct restore)

4.1 Modes

  • stop — orderly guest shutdown → backup → restart. Highest consistency, defined downtime. (For LXC the shutdown/restart is internal to vzdump; needs only VM.Backup — §3.4.)
  • snapshot — lowest downtime; copies blocks while running. Consistency depends on the guest cooperating (§4.2).
  • suspend — legacy/compat, not used.

4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC

⚠️ An LXC has no guest agent, so snapshot-mode vzdump does NOT fsfreeze. A running-stack LXC backup is therefore crash-consistent (filesystem-level), not app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first (stop the stack / flush DBs) or use stop mode. A VM with qemu-guest-agent gets guest-fsfreeze around the copy → near-free app-consistency. [phase1-2 §2.1, phase0 §4.8]

Validated restore behaviour (LXC, Postgres) [phase1-2 §2.2]:

  • Crash-consistent (running): on first start Postgres ran automatic WAL recovery (database system was interrupted … not properly shut down; automatic recovery in progress … redo done … ready to accept connections) and the data was intact.
  • Quiesced (stack stopped): clean start, no recovery, data intact.
  • Both restored correctly here on an idle-at-backup DB; this is not a durability guarantee under heavy write load (§6).
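One way to get the quiesced variant for an LXC is to stop the in-guest stack around the vzdump call. The following is a sketch only: the compose directory is a hypothetical caller-supplied path, the function name and the RUN convention are illustrative, and by default the function prints the commands instead of running them (set RUN to an empty string on the host to execute).

```shell
# Quiesce the in-guest Docker stack, back up, then start the stack again.
# $1 = vmid, $2 = compose project directory inside the guest (hypothetical).
# RUN defaults to "echo" (dry run); RUN= executes the commands for real.
quiesced_backup() {
    _vmid=$1; _dir=$2
    ${RUN-echo} pct exec "$_vmid" -- sh -c "cd $_dir && docker compose stop" || return 1
    ${RUN-echo} vzdump "$_vmid" --mode snapshot --storage local --compress zstd
    _rc=$?
    # Bring the stack back even if the backup failed.
    ${RUN-echo} pct exec "$_vmid" -- sh -c "cd $_dir && docker compose start"
    return "$_rc"
}
```

The vzdump exit code is preserved so that a failed backup is still reported after the stack has been restarted.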

4.3 What a backup captures

A single LXC vzdump captures the container rootfs including the Docker named volumes (they live in the rootfs) — one backup = the whole guest and its data. Validated: a sentinel row survived both variants [phase1-2 §2.2].

Sizes/timings (2.5 GiB source, zstd) [phase1-2 §2.1–2.2]: backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.

4.4 Restore = recreate-from-archive (identity is preserved)

There is no single "restore" call — you recreate the guest from the archive into a new VMID:

  • LXC: pct restore <newid> <archive> --storage <store>
  • VM: qmrestore <archive> <newid> (or POST /nodes/<node>/qemu with archive=)

⚠️ pct restore preserves the source config — including the MAC address and hostname. Restoring while the original still runs causes a MAC/hostname collision on the bridge; reset network identity (pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp regenerates the MAC) before starting. [phase1-2 §2.2]

Restored config survives intact: unprivileged: 1 and features: nesting=1,keyctl=1 are preserved, so Docker runs in the restored CT [phase1-2 §2.2].
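The restore-then-reset sequence for an LXC can be sketched as below. Assumptions: the replacement hostname `restored-<id>` is purely illustrative, the function name and RUN convention are hypothetical, and the function prints the commands by default (set RUN to an empty string on the host to execute them).

```shell
# Restore an LXC into a new VMID and reset its network identity before start.
# $1 = new vmid, $2 = archive path, $3 = target storage.
# RUN defaults to "echo" (dry run); RUN= executes the commands for real.
restore_with_new_identity() {
    _newid=$1; _archive=$2; _store=$3
    ${RUN-echo} pct restore "$_newid" "$_archive" --storage "$_store" || return 1
    # Re-declaring net0 without hwaddr regenerates the MAC.
    ${RUN-echo} pct set "$_newid" -net0 name=eth0,bridge=vmbr0,ip=dhcp || return 1
    # Illustrative hostname so it does not collide with the original.
    ${RUN-echo} pct set "$_newid" -hostname "restored-$_newid" || return 1
    ${RUN-echo} pct start "$_newid"
}
```

Starting the guest is deliberately the last step, so the collision window described above never opens.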

4.5 Snapshots

A running, unprivileged LXC can be snapshotted on LVM-thin with no stop required (exitstatus OK; snapshot listed while the CT stays running) [phase1-2 §1.6]. This is the mechanism available for a snapshot-before-change rollback flow.

4.6 PBS (Proxmox Backup Server)

Not yet validated. No PBS datastore was configured or tested in the spike. All backup findings above are for vzdump to a dir storage. PBS (dedup, incremental, remote, dirty-bitmap) is pending.

4.7 vzdump scope by LXC mount type (validated)

A stop-mode vzdump includes/excludes each LXC mount point by type and the backup flag [phase3 §B2]. Validated three ways (vzdump log, archive grep, restore):

| Location | backup flag | In the vzdump? |
|---|---|---|
| rootfs (and anything inside it) | | included (always) |
| Docker named volume (default driver) | | included: it lives in the rootfs (/var/lib/docker/volumes/<v>/_data) |
| volume mount point (mpN) | backup=1 | included |
| volume mount point (mpN) | backup=0 | excluded (vol recreated empty on restore) |
| bind mount point (mpN: /host/path) | n/a | excluded ("not a volume"); data is not in the archive |

⚠️ The backup=<boolean> flag is honoured ONLY for volume mount points. A Docker named volume is in the rootfs and is always captured — so a "bulk" volume left as a default named volume is silently swept into the whole-guest image. To keep bulk data out, realize it as a dedicated backup=0 volume mount point (proven recipe: pct set <id> -mpN <storage>:<size>,mp=/mnt/bulk,backup=0 then docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol). A bind mount's data is excluded from the archive entirely; on same-host restore it reappears only because the bind config re-attaches the same host dir — on a different host (true DR) it is gone unless backed up separately.
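The scope table can be expressed as a small classifier over container config entries. This is a sketch with a hypothetical function name; one case goes beyond what the spike validated and is labelled as an assumption: a volume mount point with no backup flag at all is treated as excluded, following Proxmox's documented default.

```shell
# Print "included" or "excluded" for one LXC mount entry.
# $1 = config key (rootfs or mpN), $2 = config value.
vzdump_scope() {
    case "$1" in
        rootfs) echo included; return ;;
    esac
    case "$2" in
        /*) echo excluded ;;                        # bind mount: "not a volume"
        *,backup=1|*,backup=1,*) echo included ;;   # volume mount point, opted in
        *) echo excluded ;;                         # backup=0, or unset (assumed default)
    esac
}
```

Note that a bind mount is reported as excluded even if it carries backup=1, since the flag is honoured only for volume mount points.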


5. Gotchas & operational notes (quick reference)

| Gotcha | Detail | Evidence |
|---|---|---|
| deb822 repos | PVE 9 repos are .sources files; disable enterprise, enable no-subscription | standard setup |
| Privsep dual-grant | privsep token needs the role on both user and token, else empty intersection → 403 | phase1-2 §1.2 |
| Async authz | vzdump POST returns 200+UPID even when unauthorized; the 403 is in the task exitstatus; poll it | phase1-2 §1.3 |
| No fsfreeze for LXC | running-LXC snapshot backup is crash-consistent only; quiesce or use stop for app-consistency | phase1-2 §2.1 |
| Restore identity collision | pct restore keeps source MAC + hostname; reset before starting alongside the original | phase1-2 §2.2 |
| Restart policy for self-heal | restored/rebooted containers come up exited with no restart policy; need a restart policy or an explicit compose up -d to return automatically | phase1-2 §2.2/§3 |
| Self-signed TLS | host cert is self-signed; curl needs -k until trust is set up | phase1-2 §1.5 |
| pveum role info gone | use pveum role list in PVE 9 | phase1-2 §1.1 |
| pveum acl delete needs --roles | bare -user/-token path errors 400 roles: property is missing | phase1-2 §5 |
| VM.PowerMgmt not needed | stop-mode backup works under VM.Backup alone | phase1-2 §1.4 |
| keyctl=1 is root-only | feature flags except nesting need a root@pam session; no API token (even root's) can set them; restore preserves them | phase3 §B3 |
| SDN.Use gates bridge use | PVE 9 needs SDN.Use to attach a NIC to vmbr0; omit it → 403 | phase3 §B3 |
| Docker named vol = always backed up | named volumes live in rootfs; only volume mountpoints honour backup=0; bulk must be a dedicated backup=0 mp | phase3 §B2 |

6. Validated vs open

Validated by the spike

| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; local (dir) vs local-lvm (thin) roles | phase0 §1, phase1-2 pre-flight |
| Docker runs in an unprivileged LXC (nesting=1,keyctl=1), driver overlayfs, cgroup v2 | phase0 §3 |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | phase0 §2 |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | phase1-2 §1.2 |
| Minimal self-backup role; VM.PowerMgmt unnecessary | phase1-2 §1.4 |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | phase1-2 §1.3 |
| Async UPID model; vzdump authz surfaces in exitstatus, not the POST | phase1-2 §1.3 |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | phase1-2 §1.6 |
| vzdump → pct restore round-trip; one backup captures Docker volumes; config survives | phase1-2 §2 |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | phase1-2 §2.2 |
| LXC vzdump scope by mount type; backup=0 excludes volume mps; Docker named vols ride rootfs; proven bulk-exclusion recipe | phase3 §B2 |
| Operator agent role (16 privs); guest lifecycle incl. restore is API-token-covered; keyctl create is root@pam-only | phase3 §B3 |

Not yet validated (do not assume)

| Open item | Why it matters |
|---|---|
| PBS (dedup/incremental/remote backup) | the only backup path tested was vzdump to a dir |
| The real controller running inside an LXC reaching host:8006 | spike used curl/CLI, not the actual Go controller |
| App-consistency under heavy write load | WAL recovery was validated only on an idle-at-backup DB |
| Live migration / restore to a different host | single-node spike only |
| Ballooning / KSM effect on VM RAM cost | VM RAM measured with neither configured |
| Cluster / HA behaviour | single node |
| Production TLS trust for the API | all calls used -k against a self-signed cert |
| deb822 no-subscription repo setup as a controlled step | host arrived pre-configured |

7. Scope boundary

This document holds platform facts only. Felhom design decisions — e.g. which guest type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are out of scope and belong in the controller-architecture document. Where this reference notes a decision exists, the decision itself is recorded there, not here.