# Proxmox Platform Reference

Authoritative, living reference for the Proxmox platform underneath `proxmox-controller`. It records **facts about Proxmox and what we validated about it** — not Felhom design decisions. Where a design choice exists, this doc points to the (future) controller architecture document rather than making the choice here.

**Evidence base** (raw, chronological spike logs — kept as the underlying record):

- [tests/phase0-findings.md](tests/phase0-findings.md) — VM-vs-LXC overhead, Docker-in-LXC viability
- [tests/phase1-2-findings.md](tests/phase1-2-findings.md) — privilege model, backup/restore round-trip
- [tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md](tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md) — **superseded** pre-spike reference (contains a known privsep error; do not cite as authoritative)

Every nontrivial claim links to its evidence section. Validated on a single host (`demo-felhom`, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and measurements as indicative, not benchmarks.

---

## 1. Platform baseline

Validated stack [[phase0 §1](tests/phase0-findings.md)]:

| Component | Version |
|---|---|
| Proxmox VE (`pve-manager`) | **9.2.2** (`b9984c6d90a4bd80`) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel **7.0.2-6-pve** |
| `pve-qemu-kvm` | 11.0.0-3 |
| `qemu-server` | 9.1.15 |
| `pve-container` | 6.1.10 |
| `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 |
| `criu` | 4.1.1-1 |

`pvesh get /version` → release 9.2. Always confirm the node name on the box (`pvesh get /nodes`) rather than hard-coding it.
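
The node-name advice above can be sketched as a small shell helper. This is a minimal sketch, not spike-validated: it assumes `pvesh get /nodes --output-format json` prints a JSON array whose entries carry a `node` field, and the function names `node_names` and `resolve_single_node` are hypothetical. The regex extraction is deliberately narrow (no `jq` dependency) and is not a general JSON parser.

```bash
#!/usr/bin/env bash

# Extract every "node" value from JSON read on stdin.
# Regex-based on purpose: enough for this one flat field, not a JSON parser.
node_names() {
  grep -o '"node"[[:space:]]*:[[:space:]]*"[^"]*"' | sed -E 's/.*"([^"]*)"$/\1/'
}

# Hypothetical helper: print the node name only if exactly one node is listed.
# On a single-node host this fails loudly instead of guessing.
resolve_single_node() {
  local names count
  names="$(node_names)"
  count="$(printf '%s\n' "$names" | grep -c . || true)"
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 node, found $count" >&2
    return 1
  fi
  printf '%s\n' "$names"
}
```

On the host, usage would look like `NODE="$(pvesh get /nodes --output-format json | resolve_single_node)"`. Failing on anything other than one node matches the single-node scope of the spike; a cluster would need an explicit selection rule instead.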

### 1.1 Storage backends

Two backends were present and exercised [[phase0 §1](tests/phase0-findings.md), [phase1-2 §pre-flight](tests/phase1-2-findings.md)]:

| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| `local` | `dir` | `/var/lib/vz` | `iso, vztmpl, backup, import` | ISOs, CT templates, **vzdump archives** |
| `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | `rootdir, images` | guest disk volumes |

**Why backups cannot live on LVM-thin:** LVM-thin is a *block* backend — it allocates logical volumes for guest disks. Backup archives and templates are *files*, which require a file-level backend (`dir`, NFS, CIFS, or PBS). A `vzdump` target must therefore be a storage whose content types include `backup` (here, `local`); pointing `vzdump` at `local-lvm` is not valid. [[phase1-2 §pre-flight / §2.1](tests/phase1-2-findings.md)]

### 1.2 Repositories

PVE 9 uses **deb822** `.sources` files under `/etc/apt/sources.list.d/`. For a host without a subscription, the enterprise repos (`pve-enterprise.sources`, `ceph-*-enterprise.sources`) must be disabled (they return 401) and a no-subscription repo enabled. *The spike host arrived with the no-subscription repo already configured and the host updated [[phase0 baseline](tests/phase0-findings.md)]; the repo setup itself was not a spike deliverable* — the canonical no-subscription `.sources` is the standard Proxmox 9 procedure (`/etc/apt/sources.list.d/pve-no-subscription.sources` with `Components: pve-no-subscription`). Treat the exact commands as standard setup, not spike-validated.

**Docker repository (validated):** Docker's official apt repo **has a `trixie` channel**; no fallback to Debian's `docker.io` was needed. Installed Docker **29.5.3** from it in both guest types. [[phase0 §1](tests/phase0-findings.md)]

---

## 2. Guest model (LXC vs VM) — validated facts

Both guest types ran the **identical** workload (Debian 13, Docker 29.5.3, a postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) [[phase0](tests/phase0-findings.md)].

### 2.1 Isolation characteristic (fact, not recommendation)

- **LXC** is an OS-level container: it **shares the host kernel**. Docker-in-LXC needs the container configured for nesting (see §2.3).
- **VM** runs its **own guest kernel** under KVM/QEMU, with full hardware-level isolation and its own firmware.

The trade-offs below follow directly from this difference.

### 2.2 Resource overhead (measured)

Host RAM used = `MemTotal − MemAvailable`, deltas vs a both-stopped baseline of 1702 MB; one guest measured at a time [[phase0 §2](tests/phase0-findings.md)]:

| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | **+211 MB** | **+2056 MB** | structural, see below |
| Under-load host-RAM delta | **+410 MB** | **+2084 MB** | |
| Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as `%guest` (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |

¹ `cgroup memory.current` counts reclaimable page cache shared with the host and **overstates** the LXC's true incremental cost; the +211 MB host delta is the honest number [[phase0 §4.4](tests/phase0-findings.md)].

**Why the RAM gap is structural** [[phase0 §4.3](tests/phase0-findings.md)]: LXC processes share the host kernel and page cache, so only the working set counts against the host.
A VM with **no ballooning configured** has KVM back every guest-touched page (including the guest's own page cache), so its host cost ≈ the full RAM allocation and is largely load-independent. *Ballooning / KSM were not tested* and could change the VM figure.

### 2.3 Docker-in-LXC viability (validated)

Docker ran **cleanly in an *unprivileged* LXC** configured with `--features nesting=1,keyctl=1 --unprivileged 1` (PVE 9 syntax, accepted by `pct create`) [[phase0 §3](tests/phase0-findings.md)]:

- `docker run hello-world` → success; full 3-container stack healthy.
- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs` fallback**. (Docker 29 names the overlay driver `overlayfs` via the containerd snapshotter image store; same overlay technology as the legacy `overlay2`.)
- Named volume persisted writes; multi-container networking + published port worked (`curl localhost:8080` → 200); 0 failed transactions under load.
- No privileged-container fallback was needed.

### 2.4 Guest agent & app-consistency capability

- **VM:** `qemu-guest-agent` installs and reports (`agent: 1`), enabling `guest-fsfreeze`-based app-consistent `snapshot` backups [[phase0 §4.8](tests/phase0-findings.md)]. The Debian genericcloud image does **not** ship the agent — it must be installed in-guest.
- **LXC:** no guest agent exists → **no fsfreeze** (see §4.2).

---

## 3. API & access control

### 3.1 Fundamentals

- **Base URL:** `https://:8006/api2/json`. Every `pve*` CLI is a thin wrapper over this REST API.
- **Token auth header:** `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The secret is shown **once** at creation. Response envelope: `{"data": ...}`.
- **TLS reality:** the host serves the default **self-signed** certificate. `curl` without `-k` fails `SSL certificate problem: unable to get local issuer certificate` [[phase1-2 §1.5](tests/phase1-2-findings.md)]. Production trust (pin the PVE CA / install a real cert) is a separate, not-yet-decided concern.
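
The three fundamentals above (base URL, token header, self-signed TLS) combine into one request shape. The following is a minimal sketch under assumptions, not the controller's client: the environment variable names `PVE_HOST`, `PVE_USER`, `PVE_TOKEN_ID`, and `PVE_SECRET` and the helper names are hypothetical, and `-k` only mirrors the spike's untrusted-certificate setup.

```bash
#!/usr/bin/env bash

# Build the token auth header from its parts (format as given in §3.1).
# Arguments: USER@REALM, TOKENID, SECRET.
pve_auth_header() {
  local user_realm="$1" token_id="$2" secret="$3"
  printf 'Authorization: PVEAPIToken=%s!%s=%s' "$user_realm" "$token_id" "$secret"
}

# Hypothetical wrapper: GET a path under /api2/json and print the raw
# {"data": ...} envelope. PVE_HOST, PVE_USER, PVE_TOKEN_ID and PVE_SECRET
# are assumed environment variables, not names from the spike.
# -k disables certificate verification; acceptable only until real trust exists.
pve_get() {
  curl -fsS -k \
    -H "$(pve_auth_header "$PVE_USER" "$PVE_TOKEN_ID" "$PVE_SECRET")" \
    "https://${PVE_HOST}:8006/api2/json$1"
}
```

With the variables exported, `pve_get /version` would exercise the simplest authenticated call. Keeping the header builder separate from the transport makes the secret handling easy to audit and lets the `-k` flag be replaced in one place once production TLS trust is decided.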

### 3.2 RBAC model

An ACL entry is a triple **(path, principal, role)**; a role is a bundle of privileges, assigned at the most specific path. Paths include `/`, `/vms/`, `/nodes/`, `/storage/`, `/pool/`, `/access/...`.

Introspection (**corrected for PVE 9**) [[phase1-2 §1.1](tests/phase1-2-findings.md)]:

- `pveum role list` — lists roles **with their privileges**.
- ⚠️ `pveum role info ` **does not exist in PVE 9** (the old reference used it).
- `pveum acl list`, `pveum user permissions --path `.

### 3.3 Privilege-separated tokens — the intersection rule (corrected)

> **A privsep token's (`--privsep 1`) effective permissions are the *intersection* of (a)
> the backing user's permissions and (b) the token's own ACLs.** The role must therefore be
> granted on **BOTH the user AND the token** for the same path. Granting it on the token
> only yields an **empty intersection** and a **403 even on self-calls.**
> [[phase1-2 §1.2](tests/phase1-2-findings.md)]

This corrects the superseded reference (§3 there grants the ACL to the token only). The intersection is what keeps a privsep token ≤ its user while still being independently scopeable to a narrow path.

Working pattern (validated):

```bash
pveum role add -privs " ..." # NB: -privs is space-separated
pveum user add @pve
pveum user token add @pve --privsep 1 # capture SECRET (shown once)
pveum acl modify -user '@pve' -role # BOTH the user...
pveum acl modify -token '@pve!' -role # ...AND the token
```

`pveum acl delete` **requires `--roles`** (a bare `-user`/`-token` path errors `400 roles: property is missing`). Deleting the token/user/role auto-invalidates the referencing ACLs.
[[phase1-2 §5](tests/phase1-2-findings.md)]

### 3.4 Validated minimal self-backup role

A token scoped to **one VMID + the backup datastore** can audit, snapshot, and back up **only that guest**, and is denied on every other guest and on create/allocate [[phase1-2 §1.3–1.4](tests/phase1-2-findings.md)]:

> **Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode
> self-backup:**
> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`

⚠️ **`VM.PowerMgmt` is NOT required for stop-mode backup** — `vzdump` performs the guest shutdown/restart internally under `VM.Backup` (tested: stop-mode self-backup returned `exitstatus OK` without it) [[phase1-2 §1.4](tests/phase1-2-findings.md)]. This corrects the old reference's "likely yes" guess.

Validated boundary (token scoped to `/vms/` + `/storage/local`):

| Operation | Result |
|---|---|
| `GET /version` | 200 |
| `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task `OK` |
| `GET`/`POST` against **another** guest's vmid | **403** (read) / task **403** (backup) |
| `POST /nodes//lxc` (create/allocate a guest) | **403** — create/allocate is operator-tier |

### 3.5 Async tasks — trust `exitstatus`, not the POST

Long operations (`vzdump`, `snapshot`, clone, restore) return a **UPID**, not a result. Poll `GET /nodes//tasks//status` until `status: stopped`, then read `exitstatus` [[phase1-2 §1.3](tests/phase1-2-findings.md)].

> ⚠️ **Authorization can surface at task execution, not at the HTTP POST.** A `vzdump`
> against an unauthorized vmid returns **HTTP 200 + a UPID**, but the task then ends
> `exitstatus: "403 Permission check failed (/vms/, VM.Backup)"` and produces **no
> archive**. A caller that trusts the 200 would wrongly believe the backup ran. Always poll
> the task and check `exitstatus`.

(The task owner — including a token — can read its own task status: 200.)
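
The poll-then-check rule above can be sketched as a small shell loop. This is an illustrative sketch under assumptions, not the controller's implementation: `json_field` and `wait_for_task` are hypothetical names, the JSON is read with a narrow regex rather than a real parser, and any `exitstatus` other than `OK` is treated as failure. The command that fetches the task-status JSON is passed in by the caller, so no endpoint details are assumed here.

```bash
#!/usr/bin/env bash

# Print the first string value of a given key from flat JSON on stdin.
# Regex-based; adequate for the status/exitstatus fields, not general JSON.
json_field() {
  local key="$1"
  grep -o "\"$key\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" \
    | head -n 1 \
    | sed -E 's/^[^:]*:[[:space:]]*"(.*)"$/\1/'
}

# Poll until the task reports status "stopped", then trust only exitstatus.
# Arguments: max attempts, delay in seconds, then the command (and its
# arguments) that prints the task-status JSON.
# Returns 0 on exitstatus OK, 1 on any other exitstatus,
# 2 if the fetch command fails, 3 if the polling budget runs out.
wait_for_task() {
  local attempts="$1" delay="$2" body status exitstatus
  shift 2
  while [ "$attempts" -gt 0 ]; do
    body="$("$@")" || return 2
    status="$(printf '%s' "$body" | json_field status)"
    if [ "$status" = "stopped" ]; then
      exitstatus="$(printf '%s' "$body" | json_field exitstatus)"
      if [ "$exitstatus" = "OK" ]; then
        return 0
      fi
      echo "task failed: $exitstatus" >&2
      return 1
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "task still running after polling budget" >&2
  return 3
}
```

A caller would pass whatever command retrieves the task status, for example a `curl` invocation against the task status endpoint, as the trailing arguments. Separate return codes keep "denied at execution" (1) apart from "could not ask" (2) and "still running" (3), which is the distinction the 200-plus-UPID behaviour makes necessary.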

### 3.6 Operator-tier agent role & root-vs-API boundary (validated)

The operator-tier **host agent** (`03-host-agent.md`) needs a far broader role than the Phase-1 *guest self-backup* role (which is denied create/allocate — §3.4). The minimal role that drives the full guest lifecycle via an API token, validated by paring [[phase3 §B3](tests/phase3-findings.md)]:

> **`FelhomAgent` (operator-tier, 16 privileges):**
> `VM.Allocate, VM.Audit, VM.Config.Disk, VM.Config.CPU, VM.Config.Memory, VM.Config.Network,
> VM.Config.Options, VM.PowerMgmt, VM.Snapshot, VM.Snapshot.Rollback, VM.Backup,
> Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit, Sys.Audit, SDN.Use`
>
> Paring proved: `SDN.Use` is **required** (PVE 9 gates bridge use; omitting it → `403
> (/sdn/zones/localnetwork/vmbr0, SDN.Use)`); `Sys.Audit` required for host metrics
> (`GET /nodes//status`); `VM.Config.Network`/`VM.Config.Options` required for NIC/onboot
> config; `Datastore.AllocateTemplate` **not** needed (drop it). NB `VM.Config.CPUMemory` is
> not a real privilege — it is `VM.Config.CPU` + `VM.Config.Memory`.

**Root-vs-API boundary** [[phase3 §B3](tests/phase3-findings.md)] — nearly the entire guest lifecycle, **including restore**, is API-token-covered; the genuine OS-root residual is narrow:

| Operation | Coverage |
|---|---|
| Create LXC (nesting-only), config, allocate, start/stop, snapshot/rollback, vzdump, **restore**, destroy, add storage definition, host metrics | **scoped API token** (the `FelhomAgent` role) |
| ⚠️ **Create LXC with `keyctl=1`** (Docker needs it — §2.3) | **OS root `root@pam` only** |
| USB physical mount-by-UUID / systemd mount unit / fstab; SMART/sensors | OS root / narrow sudoers |

> ⚠️ **`keyctl=1` (and any feature flag except `nesting`) can be set only by an actual
> `root@pam` session** — `changing feature flags (except nesting) is only allowed for
> root@pam`. **No API token qualifies**, not even a non-privsep `root@pam` token (same 403).
> So *fresh provisioning* of a Docker-capable LXC needs `pct create` as OS root (or a narrow
> sudoers entry). **Restore is exempt:** a token-authorized `vzrestore` **preserves
> `keyctl=1`** from the archive — the DR path needs no root.

---

## 4. Backup & restore (`vzdump` / `pct restore`)

### 4.1 Modes

- **`stop`** — orderly guest shutdown → backup → restart. Highest consistency, defined downtime. (For LXC the shutdown/restart is internal to `vzdump`; needs only `VM.Backup` — §3.4.)
- **`snapshot`** — lowest downtime; copies blocks while running. Consistency depends on the guest cooperating (§4.2).
- **`suspend`** — legacy/compat, not used.

### 4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC

> ⚠️ **An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze.** A
> running-stack LXC backup is therefore **crash-consistent** (filesystem-level), not
> app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first
> (stop the stack / flush DBs) or use `stop` mode. A **VM** with `qemu-guest-agent` gets
> `guest-fsfreeze` around the copy → near-free app-consistency. [[phase1-2 §2.1](tests/phase1-2-findings.md), [phase0 §4.8](tests/phase0-findings.md)]

**Validated restore behaviour** (LXC, Postgres) [[phase1-2 §2.2](tests/phase1-2-findings.md)]:

- **Crash-consistent (running):** on first start Postgres ran **automatic WAL recovery** (`database system was interrupted … not properly shut down; automatic recovery in progress … redo done … ready to accept connections`) and the data was intact.
- **Quiesced (stack stopped):** clean start, no recovery, data intact.
- Both restored correctly here on an idle-at-backup DB; this is **not** a durability guarantee under heavy write load (§6).

### 4.3 What a backup captures

A single LXC `vzdump` captures the container rootfs **including the Docker named volumes** (they live in the rootfs) — one backup = the whole guest and its data.
Validated: a sentinel row survived both variants [[phase1-2 §2.2](tests/phase1-2-findings.md)].

Sizes/timings (2.5 GiB source, zstd) [[phase1-2 §2.1–2.2](tests/phase1-2-findings.md)]: backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.

### 4.4 Restore = recreate-from-archive (identity is preserved)

There is no single "restore" call — you recreate the guest from the archive into a **new VMID**:

- **LXC:** `pct restore --storage `
- **VM:** `qmrestore ` (or `POST /nodes//qemu` with `archive=`)

> ⚠️ **`pct restore` preserves the source config — including the MAC address and
> hostname.** Restoring while the original still runs causes a **MAC/hostname collision** on
> the bridge; reset network identity (`pct set -net0 name=eth0,bridge=vmbr0,ip=dhcp`
> regenerates the MAC) before starting. [[phase1-2 §2.2](tests/phase1-2-findings.md)]

**Restored config survives intact:** `unprivileged: 1` and `features: nesting=1,keyctl=1` are preserved, so Docker runs in the restored CT [[phase1-2 §2.2](tests/phase1-2-findings.md)].

### 4.5 Snapshots

A **running, unprivileged LXC can be snapshotted on LVM-thin** with no stop required (`exitstatus OK`; snapshot listed while the CT stays `running`) [[phase1-2 §1.6](tests/phase1-2-findings.md)]. This is the mechanism available for a snapshot-before-change rollback flow.

### 4.6 PBS (Proxmox Backup Server)

**Not yet validated.** No PBS datastore was configured or tested in the spike. All backup findings above are for `vzdump` to a `dir` storage. PBS (dedup, incremental, remote, dirty-bitmap) is pending.

### 4.7 vzdump scope by LXC mount type (validated)

A stop-mode `vzdump` includes/excludes each LXC mount point by **type and the `backup` flag** [[phase3 §B2](tests/phase3-findings.md)]. Validated three ways (vzdump log, archive grep, restore):

| Location | `backup` flag | In the vzdump? |
|---|---|---|
| rootfs (and anything inside it) | — | **included** (always) |
| **Docker named volume** (default driver) | — | **included** — it lives in the rootfs (`/var/lib/docker/volumes//_data`) |
| volume mount point (`mpN`) | `backup=1` | included |
| volume mount point (`mpN`) | `backup=0` | **excluded** (vol recreated empty on restore) |
| bind mount point (`mpN: /host/path`) | n/a | **excluded** ("not a volume"); data is *not* in the archive |

> ⚠️ **The `backup=` flag is honoured ONLY for *volume* mount points.** A **Docker
> named volume is in the rootfs and is always captured** — so a "bulk" volume left as a
> default named volume is silently swept into the whole-guest image. To keep bulk data **out**,
> realize it as a dedicated `backup=0` volume mount point (proven recipe:
> `pct set -mpN :,mp=/mnt/bulk,backup=0` then
> `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol`).
> A **bind mount's** data is excluded from the archive entirely; on same-host restore it
> reappears only because the bind config re-attaches the same host dir — on a *different* host
> (true DR) it is gone unless backed up separately.

---

## 5. Gotchas & operational notes (quick reference)

| Gotcha | Detail | Evidence |
|---|---|---|
| **deb822 repos** | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup |
| **Privsep dual-grant** | privsep token needs the role on **both** user and token, else empty intersection → 403 | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| **Async authz** | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| **No fsfreeze for LXC** | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | [phase1-2 §2.1](tests/phase1-2-findings.md) |
| **Restore identity collision** | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | [phase1-2 §2.2](tests/phase1-2-findings.md) |
| **Restart policy for self-heal** | restored/rebooted containers come up `exited` with no restart policy; need a restart policy or an explicit `compose up -d` to return automatically | [phase1-2 §2.2/§3](tests/phase1-2-findings.md) |
| **Self-signed TLS** | host cert is self-signed; `curl` needs `-k` until trust is set up | [phase1-2 §1.5](tests/phase1-2-findings.md) |
| **`pveum role info` gone** | use `pveum role list` in PVE 9 | [phase1-2 §1.1](tests/phase1-2-findings.md) |
| **`pveum acl delete` needs `--roles`** | bare `-user`/`-token` path errors `400 roles: property is missing` | [phase1-2 §5](tests/phase1-2-findings.md) |
| **`VM.PowerMgmt` not needed** | stop-mode backup works under `VM.Backup` alone | [phase1-2 §1.4](tests/phase1-2-findings.md) |
| **`keyctl=1` is root-only** | feature flags except `nesting` need a `root@pam` session; no API token (even root's) can set them; restore preserves them | [phase3 §B3](tests/phase3-findings.md) |
| **`SDN.Use` gates bridge use** | PVE 9 needs `SDN.Use` to attach a NIC to `vmbr0`; omit it → 403 | [phase3 §B3](tests/phase3-findings.md) |
| **Docker named vol = always backed up** | named volumes live in rootfs; only *volume mountpoints* honour `backup=0`; bulk must be a dedicated `backup=0` mp | [phase3 §B2](tests/phase3-findings.md) |

---

## 6. Validated vs open

### Validated by the spike

| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | [phase0 §1](tests/phase0-findings.md), [phase1-2 pre-flight](tests/phase1-2-findings.md) |
| Docker runs in an **unprivileged** LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | [phase0 §3](tests/phase0-findings.md) |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | [phase0 §2](tests/phase0-findings.md) |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| Minimal self-backup role; `VM.PowerMgmt` unnecessary | [phase1-2 §1.4](tests/phase1-2-findings.md) |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Async UPID model; vzdump authz surfaces in `exitstatus`, not the POST | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | [phase1-2 §1.6](tests/phase1-2-findings.md) |
| `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | [phase1-2 §2](tests/phase1-2-findings.md) |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | [phase1-2 §2.2](tests/phase1-2-findings.md) |
| LXC vzdump scope by mount type; `backup=0` excludes volume mps; Docker named vols ride rootfs; proven bulk-exclusion recipe | [phase3 §B2](tests/phase3-findings.md) |
| Operator agent role (16 privs); guest lifecycle incl. restore is API-token-covered; `keyctl` create is `root@pam`-only | [phase3 §B3](tests/phase3-findings.md) |

### Not yet validated (do not assume)

| Open item | Why it matters |
|---|---|
| **PBS** (dedup/incremental/remote backup) | the only backup path tested was `vzdump` to a `dir` |
| **The real controller running inside an LXC** reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller |
| **App-consistency under heavy write load** | WAL recovery was validated only on an idle-at-backup DB |
| **Live migration / restore to a different host** | single-node spike only |
| **Ballooning / KSM** effect on VM RAM cost | VM RAM measured with neither configured |
| **Cluster / HA** behaviour | single node |
| **Production TLS trust** for the API | all calls used `-k` against a self-signed cert |
| **deb822 no-subscription repo setup** as a controlled step | host arrived pre-configured |

---

## 7. Scope boundary

This document holds **platform facts only.** Felhom design decisions — e.g. which guest type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are **out of scope** and belong in the controller-architecture document. Where this reference notes a decision exists, the decision itself is recorded there, not here.