Proxmox Platform Reference

Authoritative, living reference for the Proxmox platform underneath proxmox-controller. It records facts about Proxmox and what we validated about it — not Felhom design decisions. Where a design choice exists, this doc points to the (future) controller architecture document rather than making the choice here.

Evidence base (raw, chronological spike logs — kept as the underlying record):

Every nontrivial claim links to its evidence section. Validated on a single host (demo-felhom, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and measurements as indicative, not benchmarks.


1. Platform baseline

Validated stack [phase0 §1]:

| Component | Version |
| --- | --- |
| Proxmox VE (pve-manager) | 9.2.2 (b9984c6d90a4bd80) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel 7.0.2-6-pve |
| pve-qemu-kvm | 11.0.0-3 |
| qemu-server | 9.1.15 |
| pve-container | 6.1.10 |
| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
| criu | 4.1.1-1 |

pvesh get /version → release 9.2. Always confirm the node name on the box (pvesh get /nodes) rather than hard-coding it.

1.1 Storage backends

Two backends were present and exercised [phase0 §1, phase1-2 §pre-flight]:

| Storage | Type | Path / VG | Content types | Holds |
| --- | --- | --- | --- | --- |
| local | dir | /var/lib/vz | iso, vztmpl, backup, import | ISOs, CT templates, vzdump archives |
| local-lvm | lvmthin | VG pve, thinpool data | rootdir, images | guest disk volumes |

Why backups cannot live on LVM-thin: LVM-thin is a block backend — it allocates logical volumes for guest disks. Backup archives and templates are files, which require a file-level backend (dir, NFS, CIFS, or PBS). A vzdump target must therefore be a storage whose content types include backup (here, local); pointing vzdump at local-lvm is not valid. [phase1-2 §pre-flight / §2.1]
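The content-type rule can be checked before a backup is attempted. A minimal sketch, assuming the caller already has the storage's content list as a comma-separated string (for example as it appears in the storage configuration); the helper name storage_supports_backup is illustrative and not part of Proxmox.

```shell
# Return 0 if a comma-separated content-type list includes "backup".
# storage_supports_backup is a hypothetical helper name, not a Proxmox command.
storage_supports_backup() {
  case ",$1," in
    *,backup,*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Example with the two storages from the table above:
#   storage_supports_backup "iso,vztmpl,backup,import"   # local     -> usable as a vzdump target
#   storage_supports_backup "rootdir,images"             # local-lvm -> not usable
```

On the host itself, pvesm status --content backup lists only the storages that accept backups, which gives the same answer without parsing anything.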

1.2 Repositories

PVE 9 uses deb822 .sources files under /etc/apt/sources.list.d/. For a host without a subscription, the enterprise repos (pve-enterprise.sources, ceph-*-enterprise.sources) must be disabled (they return 401) and a no-subscription repo enabled. The spike host arrived with the no-subscription repo already configured and the host updated [phase0 baseline]; the repo setup itself was not a spike deliverable — the canonical no-subscription .sources is the standard Proxmox 9 procedure (/etc/apt/sources.list.d/pve-no-subscription.sources with Components: pve-no-subscription). Treat the exact commands as standard setup, not spike-validated.

Docker repository (validated): Docker's official apt repo has a trixie channel; no fallback to Debian's docker.io was needed. Installed Docker 29.5.3 from it in both guest types. [phase0 §1]


2. Guest model (LXC vs VM) — validated facts

Both guest types ran the identical workload (Debian 13, Docker 29.5.3, a postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) [phase0].

2.1 Isolation characteristic (fact, not recommendation)

  • LXC is an OS-level container: it shares the host kernel. Docker-in-LXC needs the container configured for nesting (see §2.3).
  • VM runs its own guest kernel under KVM/QEMU, with full hardware-level isolation and its own firmware.

The trade-offs below follow directly from this difference.

2.2 Resource overhead (measured)

Host RAM used = MemTotal − MemAvailable, deltas vs a both-stopped baseline of 1702 MB; one guest measured at a time [phase0 §2]:

| Metric | LXC | VM | Note |
| --- | --- | --- | --- |
| Idle host-RAM delta | +211 MB | +2056 MB | structural, see below |
| Under-load host-RAM delta | +410 MB | +2084 MB | |
| Per-guest attribution | cgroup memory.current 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as %guest (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |

¹ cgroup memory.current counts reclaimable page cache shared with the host and overstates the LXC's true incremental cost; the +211 MB host delta is the honest number [phase0 §4.4].

Why the RAM gap is structural [phase0 §4.3]: LXC processes share the host kernel and page cache, so only the working set counts against the host. A VM with no ballooning configured has KVM back every guest-touched page (including the guest's own page cache), so its host cost ≈ the full RAM allocation and is largely load-independent. Ballooning / KSM were not tested and could change the VM figure.
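The host-RAM figure (MemTotal minus MemAvailable) can be reproduced with a few lines of shell. This is a sketch of the arithmetic only, not the spike's actual measurement script; the function names are illustrative, and the kB-to-MB conversion by 1024 is an assumption about how the spike rounded.

```shell
# Print host RAM used in MB (MemTotal - MemAvailable) from a meminfo-format file.
# Defaults to /proc/meminfo, where values are reported in kB.
host_ram_used_mb() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d\n", (t-a)/1024}' \
    "${1:-/proc/meminfo}"
}

# Delta of a measurement against a baseline, both in MB.
ram_delta_mb() {
  echo $(( $1 - $2 ))
}

# Example, using the both-stopped baseline of 1702 MB from this section:
#   used=$(host_ram_used_mb); ram_delta_mb "$used" 1702
```

Measuring one guest at a time, as the spike did, keeps the delta attributable to that guest.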

2.3 Docker-in-LXC viability (validated)

Docker ran cleanly in an unprivileged LXC configured with --features nesting=1,keyctl=1 --unprivileged 1 (PVE 9 syntax, accepted by pct create) [phase0 §3]:

  • docker run hello-world → success; full 3-container stack healthy.
  • Storage driver: overlayfs (cgroup v2, systemd cgroup driver) — no vfs fallback. (Docker 29 names the overlay driver overlayfs via the containerd snapshotter image store; same overlay technology as the legacy overlay2.)
  • Named volume persisted writes; multi-container networking + published port worked (curl localhost:8080 → 200); 0 failed transactions under load.
  • No privileged-container fallback was needed.
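For orientation, a container with these settings could be created roughly as follows. This is a sketch, not the spike's exact command: the VMID and template volume are supplied by the caller, and the storage name (local-lvm), bridge (vmbr0) and DHCP addressing are assumptions carried over from this host.

```shell
# Create an unprivileged, nesting-enabled LXC sized like the spike guest
# (2 vCPU, 2048 MB, 10 GB root disk).
# $1 = VMID, $2 = template volume ID; both are caller-supplied.
# create_docker_lxc is a hypothetical wrapper name.
create_docker_lxc() {
  pct create "$1" "$2" \
    --unprivileged 1 \
    --features nesting=1,keyctl=1 \
    --cores 2 --memory 2048 \
    --rootfs local-lvm:10 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp
}
```

Once Docker is installed in the guest, pct exec <vmid> -- docker info --format '{{.Driver}}' is one way to confirm the storage driver is not vfs.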

2.4 Guest agent & app-consistency capability

  • VM: qemu-guest-agent installs and reports (agent: 1), enabling guest-fsfreeze-based app-consistent snapshot backups [phase0 §4.8]. The Debian genericcloud image does not ship the agent — it must be installed in-guest.
  • LXC: no guest agent exists → no fsfreeze (see §4.2).

3. API & access control

3.1 Fundamentals

  • Base URL: https://<host>:8006/api2/json. Every pve* CLI is a thin wrapper over this REST API.
  • Token auth header: Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET. The secret is shown once at creation. Response envelope: {"data": ...}.
  • TLS reality: the host serves the default self-signed certificate. curl without -k fails with "SSL certificate problem: unable to get local issuer certificate" [phase1-2 §1.5]. Production trust (pin the PVE CA / install a real cert) is a separate, not-yet-decided concern.
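The fundamentals above fit in a small curl wrapper. A minimal sketch, assuming the token ID and secret are provided through environment variables; PVE_HOST, PVE_TOKEN_ID and PVE_TOKEN_SECRET are hypothetical names chosen for this example.

```shell
# Build the token auth header.
# $1 = full token ID (USER@REALM!TOKENID), $2 = secret.
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s=%s\n' "$1" "$2"
}

# GET an API path and print the raw {"data": ...} envelope.
# -k skips certificate verification and is only a stopgap until trust is configured.
pve_get() {
  curl -sS -k \
    -H "$(pve_auth_header "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET")" \
    "https://${PVE_HOST}:8006/api2/json$1"
}

# Example: pve_get /version
```

Replacing -k with curl's --cacert option is one way to verify the server against a pinned CA once the trust question is settled.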

3.2 RBAC model

An ACL entry is a triple (path, principal, role); a role is a bundle of privileges, assigned at the most specific path. Paths include /, /vms/<vmid>, /nodes/<node>, /storage/<store>, /pool/<pool>, /access/....

Introspection (corrected for PVE 9) [phase1-2 §1.1]:

  • pveum role list — lists roles with their privileges.
  • ⚠️ pveum role info <role> does not exist in PVE 9 (the old reference used it).
  • pveum acl list, pveum user permissions <user> --path <path>.

3.3 Privilege-separated tokens — the intersection rule (corrected)

A privsep token's (--privsep 1) effective permissions are the intersection of (a) the backing user's permissions and (b) the token's own ACLs. The role must therefore be granted on BOTH the user AND the token for the same path. Granting it on the token only yields an empty intersection and a 403 even on self-calls. [phase1-2 §1.2]

This corrects the superseded reference (§3 there grants the ACL to the token only). The intersection is what keeps a privsep token ≤ its user while still being independently scopeable to a narrow path.
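The intersection rule can be pictured as a set operation on one path. The following is a deliberately simplified model for illustration, not how Proxmox computes permissions internally; effective_privs is a made-up name, and real ACLs also involve propagation and multiple paths.

```shell
# Simplified model of the privsep rule for ONE path: a privilege is effective only
# if both the backing user and the token hold it.
# $1 = user's privileges, $2 = token's privileges (space-separated lists).
effective_privs() (
  out=""
  for p in $1; do
    case " $2 " in
      *" $p "*) out="${out:+$out }$p" ;;
    esac
  done
  printf '%s\n' "$out"
)

# Token-only grant: the user holds nothing on the path, so nothing is effective.
#   effective_privs "" "VM.Audit VM.Backup"
# Dual grant: both sides hold the role, so the role's privileges are effective.
#   effective_privs "VM.Audit VM.Backup" "VM.Audit VM.Backup"
```

The first call prints an empty line, which corresponds to the 403 seen when only the token was granted the role.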

Working pattern (validated):

pveum role add <Role> -privs "<priv> <priv> ..."          # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1     # capture SECRET (shown once)
pveum acl modify <path> -user  '<user>@pve'         -role <Role>   # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role> # ...AND the token

pveum acl delete requires --roles (a bare -user/-token path errors 400 roles: property is missing). Deleting the token/user/role auto-invalidates the referencing ACLs. [phase1-2 §5]

3.4 Validated minimal self-backup role

A token scoped to one VMID + the backup datastore can audit, snapshot, and back up only that guest, and is denied on every other guest and on create/allocate [phase1-2 §1.3–1.4]:

Minimal role for self-audit + self-snapshot + both snapshot- and stop-mode self-backup: VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit

⚠️ VM.PowerMgmt is NOT required for stop-mode backup: vzdump performs the guest shutdown/restart internally under VM.Backup (tested: stop-mode self-backup returned exitstatus OK without it) [phase1-2 §1.4]. This corrects the old reference's "likely yes" guess.

Validated boundary (token scoped to /vms/<self> + /storage/local):

| Operation | Result |
| --- | --- |
| GET /version | 200 |
| GET self status, POST self snapshot, POST self vzdump | 200 / task OK |
| GET/POST against another guest's vmid | 403 (read) / task 403 (backup) |
| POST /nodes/<node>/lxc (create/allocate a guest) | 403 (create/allocate is operator-tier) |

3.5 Async tasks — trust exitstatus, not the POST

Long operations (vzdump, snapshot, clone, restore) return a UPID, not a result. Poll GET /nodes/<node>/tasks/<upid>/status until status: stopped, then read exitstatus [phase1-2 §1.3].

⚠️ Authorization can surface at task execution, not at the HTTP POST. A vzdump against an unauthorized vmid returns HTTP 200 + a UPID, but the task then ends exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)" and produces no archive. A caller that trusts the 200 would wrongly believe the backup ran. Always poll the task and check exitstatus.

(The task owner — including a token — can read its own task status: 200.)
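The poll-then-check rule can be sketched as a shell function. This assumes curl and jq are installed; PVE_HOST, PVE_NODE, PVE_TOKEN_ID and PVE_TOKEN_SECRET are hypothetical environment variable names, and the 2-second interval and attempt limit are arbitrary choices.

```shell
# A task counts as successful only when it has stopped AND its exitstatus is exactly "OK".
task_ok() {
  [ "$1" = "stopped" ] && [ "$2" = "OK" ]
}

# Poll a task until it stops or attempts run out, then print its exitstatus.
# $1 = UPID, $2 = max polls (default 150, i.e. about 5 minutes at 2 s each).
# Returns 0 on OK, 1 on a failed task, 2 on transport error or timeout.
pve_wait_task() {
  upid=$1
  tries=${2:-150}
  url="https://${PVE_HOST}:8006/api2/json/nodes/${PVE_NODE}/tasks/${upid}/status"
  while [ "$tries" -gt 0 ]; do
    body=$(curl -sS -k \
      -H "Authorization: PVEAPIToken=${PVE_TOKEN_ID}=${PVE_TOKEN_SECRET}" "$url") || return 2
    status=$(printf '%s' "$body" | jq -r '.data.status // empty')
    exitstatus=$(printf '%s' "$body" | jq -r '.data.exitstatus // empty')
    if [ "$status" = "stopped" ]; then
      printf '%s\n' "$exitstatus"
      task_ok "$status" "$exitstatus" && return 0
      return 1
    fi
    tries=$((tries - 1))
    sleep 2
  done
  return 2
}
```

With this shape, the unauthorized vzdump case described above returns 1 and prints the 403 message instead of being mistaken for success.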


4. Backup & restore (vzdump / pct restore)

4.1 Modes

  • stop — orderly guest shutdown → backup → restart. Highest consistency, defined downtime. (For LXC the shutdown/restart is internal to vzdump; needs only VM.Backup — §3.4.)
  • snapshot — lowest downtime; copies blocks while running. Consistency depends on the guest cooperating (§4.2).
  • suspend — legacy/compat, not used.

4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC

⚠️ An LXC has no guest agent, so snapshot-mode vzdump does NOT fsfreeze. A running-stack LXC backup is therefore crash-consistent (filesystem-level), not app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first (stop the stack / flush DBs) or use stop mode. A VM with qemu-guest-agent gets guest-fsfreeze around the copy → near-free app-consistency. [phase1-2 §2.1, phase0 §4.8]
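One way for a caller to get a quiesced LXC backup is to stop the compose stack inside the guest, run vzdump, and start the stack again. A sketch under assumptions: it runs on the PVE host as a user allowed to use pct and vzdump, the compose project directory inside the guest is supplied by the caller, and the storage name local and zstd compression mirror this host.

```shell
# Quiesce the stack inside the CT, back up, then restart the stack even if the backup failed.
# $1 = VMID, $2 = compose project directory inside the guest (caller-supplied).
# backup_lxc_quiesced is a hypothetical wrapper name.
backup_lxc_quiesced() {
  vmid=$1
  dir=$2
  pct exec "$vmid" -- docker compose --project-directory "$dir" stop || return 1
  rc=0
  vzdump "$vmid" --mode snapshot --storage local --compress zstd || rc=$?
  pct exec "$vmid" -- docker compose --project-directory "$dir" start || return 1
  return "$rc"
}
```

The backup's exit code is captured rather than acted on immediately so that a failed backup never leaves the stack stopped.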

Validated restore behaviour (LXC, Postgres) [phase1-2 §2.2]:

  • Crash-consistent (running): on first start Postgres ran automatic WAL recovery (database system was interrupted … not properly shut down; automatic recovery in progress … redo done … ready to accept connections) and the data was intact.
  • Quiesced (stack stopped): clean start, no recovery, data intact.
  • Both restored correctly here on an idle-at-backup DB; this is not a durability guarantee under heavy write load (§6).

4.3 What a backup captures

A single LXC vzdump captures the container rootfs including the Docker named volumes (they live in the rootfs) — one backup = the whole guest and its data. Validated: a sentinel row survived both variants [phase1-2 §2.2].

Sizes/timings (2.5 GiB source, zstd) [phase1-2 §2.1–2.2]: backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.

4.4 Restore = recreate-from-archive (identity is preserved)

There is no single "restore" call — you recreate the guest from the archive into a new VMID:

  • LXC: pct restore <newid> <archive> --storage <store>
  • VM: qmrestore <archive> <newid> (or POST /nodes/<node>/qemu with archive=)

⚠️ pct restore preserves the source config — including the MAC address and hostname. Restoring while the original still runs causes a MAC/hostname collision on the bridge; reset network identity (pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp regenerates the MAC) before starting. [phase1-2 §2.2]

Restored config survives intact: unprivileged: 1 and features: nesting=1,keyctl=1 are preserved, so Docker runs in the restored CT [phase1-2 §2.2].
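Putting the restore and the identity reset together, a restore-as-copy flow might look like the sketch below. The wrapper name is hypothetical, the new hostname is caller-supplied, and the bridge (vmbr0) and DHCP settings mirror the pct set example above rather than any general default.

```shell
# Restore an archive into a NEW VMID and give it a fresh network identity before first start.
# $1 = new VMID, $2 = archive path or volume ID, $3 = target storage, $4 = new hostname.
# net0 is rewritten without an explicit hwaddr, so a new MAC is generated.
restore_lxc_copy() {
  pct restore "$1" "$2" --storage "$3" &&
    pct set "$1" -net0 name=eth0,bridge=vmbr0,ip=dhcp &&
    pct set "$1" -hostname "$4" &&
    pct start "$1"
}
```

Chaining with && means the container is never started if the restore or either identity change fails.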

4.5 Snapshots

A running, unprivileged LXC can be snapshotted on LVM-thin with no stop required (exitstatus OK; snapshot listed while the CT stays running) [phase1-2 §1.6]. This is the mechanism available for a snapshot-before-change rollback flow.
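A snapshot-before-change flow could be wrapped as follows. Only the snapshot step was validated in the spike; the rollback path in this sketch is an untested assumption, and with_snapshot is a hypothetical helper name.

```shell
# Take a snapshot, run a change command, and roll back if the change fails.
# $1 = VMID, $2 = snapshot name, remaining arguments = the change command to run.
# Returns 0 if the change succeeded, 1 otherwise (after attempting a rollback).
with_snapshot() {
  vmid=$1
  snap=$2
  shift 2
  pct snapshot "$vmid" "$snap" || return 1
  if "$@"; then
    return 0
  fi
  pct rollback "$vmid" "$snap"
  return 1
}

# Example (the upgrade script path is hypothetical):
#   with_snapshot 101 pre_upgrade pct exec 101 -- /usr/local/bin/upgrade-stack.sh
```

The snapshot is left in place on success, so the caller decides when it is safe to delete it.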

4.6 PBS (Proxmox Backup Server)

Not yet validated. No PBS datastore was configured or tested in the spike. All backup findings above are for vzdump to a dir storage. PBS (dedup, incremental, remote, dirty-bitmap) is pending.


5. Gotchas & operational notes (quick reference)

| Gotcha | Detail | Evidence |
| --- | --- | --- |
| deb822 repos | PVE 9 repos are .sources files; disable enterprise, enable no-subscription | standard setup |
| Privsep dual-grant | privsep token needs the role on both user and token, else empty intersection → 403 | phase1-2 §1.2 |
| Async authz | vzdump POST returns 200+UPID even when unauthorized; the 403 is in the task exitstatus; poll it | phase1-2 §1.3 |
| No fsfreeze for LXC | running-LXC snapshot backup is crash-consistent only; quiesce or use stop for app-consistency | phase1-2 §2.1 |
| Restore identity collision | pct restore keeps source MAC + hostname; reset before starting alongside the original | phase1-2 §2.2 |
| Restart policy for self-heal | restored/rebooted containers come up exited with no restart policy; need a restart policy or an explicit compose up -d to return automatically | phase1-2 §2.2/§3 |
| Self-signed TLS | host cert is self-signed; curl needs -k until trust is set up | phase1-2 §1.5 |
| pveum role info gone | use pveum role list in PVE 9 | phase1-2 §1.1 |
| pveum acl delete needs --roles | bare -user/-token path errors 400 roles: property is missing | phase1-2 §5 |
| VM.PowerMgmt not needed | stop-mode backup works under VM.Backup alone | phase1-2 §1.4 |

6. Validated vs open

Validated by the spike

| Fact | Evidence |
| --- | --- |
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; local (dir) vs local-lvm (thin) roles | phase0 §1, phase1-2 pre-flight |
| Docker runs in an unprivileged LXC (nesting=1,keyctl=1), driver overlayfs, cgroup v2 | phase0 §3 |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | phase0 §2 |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | phase1-2 §1.2 |
| Minimal self-backup role; VM.PowerMgmt unnecessary | phase1-2 §1.4 |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | phase1-2 §1.3 |
| Async UPID model; vzdump authz surfaces in exitstatus, not the POST | phase1-2 §1.3 |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | phase1-2 §1.6 |
| vzdump / pct restore round-trip; one backup captures Docker volumes; config survives | phase1-2 §2 |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | phase1-2 §2.2 |

Not yet validated (do not assume)

| Open item | Why it matters |
| --- | --- |
| PBS (dedup/incremental/remote backup) | the only backup path tested was vzdump to a dir |
| The real controller running inside an LXC reaching host:8006 | spike used curl/CLI, not the actual Go controller |
| App-consistency under heavy write load | WAL recovery was validated only on an idle-at-backup DB |
| Live migration / restore to a different host | single-node spike only |
| Ballooning / KSM effect on VM RAM cost | VM RAM measured with neither configured |
| Cluster / HA behaviour | single node |
| Production TLS trust for the API | all calls used -k against a self-signed cert |
| deb822 no-subscription repo setup as a controlled step | host arrived pre-configured |

7. Scope boundary

This document holds platform facts only. Felhom design decisions — e.g. which guest type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are out of scope and belong in the controller-architecture document. Where this reference notes a decision exists, the decision itself is recorded there, not here.