felhom-agent/docs/proxmox-platform.md
2026-06-07 20:46:01 +02:00

# Proxmox Platform Reference
Authoritative, living reference for the Proxmox platform underneath `proxmox-controller`.
It records **facts about Proxmox and what we validated about it** — not Felhom design
decisions. Where a design choice exists, this doc points to the (future) controller
architecture document rather than making the choice here.
**Evidence base** (raw, chronological spike logs — kept as the underlying record):
- [tests/phase0-findings.md](tests/phase0-findings.md) — VM-vs-LXC overhead, Docker-in-LXC viability
- [tests/phase1-2-findings.md](tests/phase1-2-findings.md) — privilege model, backup/restore round-trip
- [tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md](tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md) — **superseded** pre-spike reference (contains a known privsep error; do not cite as authoritative)
Every nontrivial claim links to its evidence section. Validated on a single host
(`demo-felhom`, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and
measurements as indicative, not benchmarks.
---
## 1. Platform baseline
Validated stack [[phase0 §1](tests/phase0-findings.md)]:
| Component | Version |
|---|---|
| Proxmox VE (`pve-manager`) | **9.2.2** (`b9984c6d90a4bd80`) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel **7.0.2-6-pve** |
| `pve-qemu-kvm` | 11.0.0-3 |
| `qemu-server` | 9.1.15 |
| `pve-container` | 6.1.10 |
| `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 |
| `criu` | 4.1.1-1 |
`pvesh get /version` → release 9.2. Always confirm the node name on the box
(`pvesh get /nodes`) rather than hard-coding it.
### 1.1 Storage backends
Two backends were present and exercised [[phase0 §1](tests/phase0-findings.md), [phase1-2 §pre-flight](tests/phase1-2-findings.md)]:
| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| `local` | `dir` | `/var/lib/vz` | `iso, vztmpl, backup, import` | ISOs, CT templates, **vzdump archives** |
| `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | `rootdir, images` | guest disk volumes |
**Why backups cannot live on LVM-thin:** LVM-thin is a *block* backend — it allocates
logical volumes for guest disks. Backup archives and templates are *files*, which require a
file-level backend (`dir`, NFS, CIFS, or PBS). A `vzdump` target must therefore be a
storage whose content types include `backup` (here, `local`); pointing `vzdump` at
`local-lvm` is not valid. [[phase1-2 §pre-flight / §2.1](tests/phase1-2-findings.md)]
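The content-type rule above can be sketched as a small pre-check before choosing a `vzdump` target. This is a minimal illustration, assuming the caller already has the storage's content-type string in hand; the helper name `storage_accepts_backup` is hypothetical and not part of any Proxmox tool.

```bash
# Hypothetical helper: succeed only if a comma-separated content-type list
# (as in the storage table above) contains "backup".
storage_accepts_backup() {
  content=$(printf '%s' "$1" | tr -d ' ')   # tolerate "iso, vztmpl, backup"
  case ",$content," in
    *,backup,*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Content strings copied from the storage table above.
storage_accepts_backup "iso, vztmpl, backup, import" && echo "local: usable as vzdump target"
storage_accepts_backup "rootdir, images" || echo "local-lvm: not a vzdump target"
```

On a real host, `pvesm status --content backup` should list the storages that accept backups directly; that command was not part of the spike evidence.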
### 1.2 Repositories
PVE 9 uses **deb822** `.sources` files under `/etc/apt/sources.list.d/`. For a host
without a subscription, the enterprise repos (`pve-enterprise.sources`,
`ceph-*-enterprise.sources`) must be disabled (they return 401) and a no-subscription repo
enabled. The spike host arrived with the no-subscription repo already configured and the
host updated [[phase0 baseline](tests/phase0-findings.md)], so repo setup was not a spike
deliverable. The canonical no-subscription `.sources` file follows the standard Proxmox 9
procedure (`/etc/apt/sources.list.d/pve-no-subscription.sources` with
`Components: pve-no-subscription`). Treat the exact commands as standard setup, not
spike-validated.
**Docker repository (validated):** Docker's official apt repo **has a `trixie` channel**;
no fallback to Debian's `docker.io` was needed. Installed Docker **29.5.3** from it in both
guest types. [[phase0 §1](tests/phase0-findings.md)]
---
## 2. Guest model (LXC vs VM) — validated facts
Both guest types ran the **identical** workload (Debian 13, Docker 29.5.3, a
postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB)
[[phase0](tests/phase0-findings.md)].
### 2.1 Isolation characteristic (fact, not recommendation)
- **LXC** is an OS-level container: it **shares the host kernel**. Docker-in-LXC needs the
container configured for nesting (see §2.3).
- **VM** runs its **own guest kernel** under KVM/QEMU, with full hardware-level isolation
and its own firmware.
The trade-offs below follow directly from this difference.
### 2.2 Resource overhead (measured)
Host RAM used = `MemTotal - MemAvailable`, deltas vs a both-stopped baseline of 1702 MB;
one guest measured at a time [[phase0 §2](tests/phase0-findings.md)]:
| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | **+211 MB** | **+2056 MB** | structural, see below |
| Under-load host-RAM delta | **+410 MB** | **+2084 MB** | |
| Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as `%guest` (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10-15 s | ~60-75 s | template-extract vs qcow2-import+boot |
¹ `cgroup memory.current` counts reclaimable page cache shared with the host and
**overstates** the LXC's true incremental cost; the +211 MB host delta is the honest
number [[phase0 §4.4](tests/phase0-findings.md)].
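The host-RAM figure used in the table (`MemTotal - MemAvailable`, then a delta against the both-stopped baseline) can be reproduced with a short helper. This is a sketch, not the spike's actual measurement script; the function names are hypothetical, and the optional file argument exists only so the arithmetic can be checked offline.

```bash
# Host RAM used in MB = MemTotal - MemAvailable (both are reported in kB).
# Reads /proc/meminfo by default; pass another file to check the math offline.
host_ram_used_mb() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END              { printf "%d\n", (total - avail) / 1024 }' "${1:-/proc/meminfo}"
}

# Delta against a baseline taken with both guests stopped (1702 MB in the spike).
host_ram_delta_mb() {   # usage: host_ram_delta_mb <baseline_mb> [meminfo_file]
  echo $(( $(host_ram_used_mb "$2") - $1 ))
}

# Example on the host, with one guest running:
#   host_ram_delta_mb 1702
```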
**Why the RAM gap is structural** [[phase0 §4.3](tests/phase0-findings.md)]: LXC processes
share the host kernel and page cache, so only the working set counts against the host. A VM
with **no ballooning configured** has KVM back every guest-touched page (including the
guest's own page cache), so its host cost ≈ the full RAM allocation and is largely
load-independent. *Ballooning / KSM were not tested* and could change the VM figure.
### 2.3 Docker-in-LXC viability (validated)
Docker ran **cleanly in an *unprivileged* LXC** configured with
`--features nesting=1,keyctl=1 --unprivileged 1` (PVE 9 syntax, accepted by `pct create`)
[[phase0 §3](tests/phase0-findings.md)]:
- `docker run hello-world` → success; full 3-container stack healthy.
- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
fallback**. (Docker 29 names the overlay driver `overlayfs` via the containerd
snapshotter image store; same overlay technology as the legacy `overlay2`.)
- Named volume persisted writes; multi-container networking + published port worked
(`curl localhost:8080` → 200); 0 failed transactions under load.
- No privileged-container fallback was needed.
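A container like the one validated above could be created along these lines. This is a sketch under assumptions: the VMID, template volume ID, bridge name and DHCP addressing are placeholders, and only the `--features`/`--unprivileged` flags and the 2 vCPU / 2048 MB / ~10 GB sizing come from the spike. Setting `RUN=echo` prints the command instead of running it.

```bash
# Sketch only: VMID, template, storage and bridge names are assumptions.
# Set RUN=echo to print the command instead of executing it.
create_docker_lxc() {   # usage: create_docker_lxc <vmid> <ostemplate-volid>
  ${RUN:-} pct create "$1" "$2" \
    --unprivileged 1 --features nesting=1,keyctl=1 \
    --cores 2 --memory 2048 --rootfs local-lvm:10 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp
}

# Dry run, then (once Docker is installed in the guest) confirm the driver:
#   RUN=echo create_docker_lxc <vmid> local:vztmpl/<template>
#   pct exec <vmid> -- docker info --format '{{.Driver}}'
```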
### 2.4 Guest agent & app-consistency capability
- **VM:** `qemu-guest-agent` installs and reports (`agent: 1`), enabling
`guest-fsfreeze`-based app-consistent `snapshot` backups [[phase0 §4.8](tests/phase0-findings.md)].
The Debian genericcloud image does **not** ship the agent — it must be installed
in-guest.
- **LXC:** no guest agent exists → **no fsfreeze** (see §4.2).
---
## 3. API & access control
### 3.1 Fundamentals
- **Base URL:** `https://<host>:8006/api2/json`. Every `pve*` CLI is a thin wrapper over
this REST API.
- **Token auth header:** `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The
secret is shown **once** at creation. Response envelope: `{"data": ...}`.
- **TLS reality:** the host serves the default **self-signed** certificate. `curl` without
`-k` fails with `SSL certificate problem: unable to get local issuer certificate`
[[phase1-2 §1.5](tests/phase1-2-findings.md)]. Production trust (pin the PVE CA / install
a real cert) is a separate, not-yet-decided concern.
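The header format and base URL above combine as follows. This is a minimal sketch: `PVE_HOST` and `PVE_HEADER` are assumed environment variables, the helper names are hypothetical, and `-k` is used only because the spike host serves a self-signed certificate.

```bash
# Build the token header line. All four inputs are caller-supplied.
pve_auth_header() {   # usage: pve_auth_header <user> <realm> <tokenid> <secret>
  printf 'Authorization: PVEAPIToken=%s@%s!%s=%s\n' "$1" "$2" "$3" "$4"
}

# GET an API path and print the raw {"data": ...} envelope.
# PVE_HOST and PVE_HEADER are assumed to be set by the caller; -k skips TLS
# verification and is only acceptable until real trust is configured.
pve_get() {   # usage: pve_get /version
  curl -ksS -H "$PVE_HEADER" "https://${PVE_HOST}:8006/api2/json$1"
}

# Example:
#   PVE_HOST=<host> PVE_HEADER=$(pve_auth_header <user> pve <tokenid> <secret>)
#   pve_get /version
```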
### 3.2 RBAC model
An ACL entry is a triple **(path, principal, role)**; a role is a bundle of privileges,
assigned at the most specific path. Paths include `/`, `/vms/<vmid>`, `/nodes/<node>`,
`/storage/<store>`, `/pool/<pool>`, `/access/...`.
Introspection (**corrected for PVE 9**) [[phase1-2 §1.1](tests/phase1-2-findings.md)]:
- `pveum role list` — lists roles **with their privileges**.
- ⚠️ `pveum role info <role>` **does not exist in PVE 9** (the old reference used it).
- `pveum acl list`, `pveum user permissions <user> --path <path>`.
### 3.3 Privilege-separated tokens — the intersection rule (corrected)
> **A privsep token's (`--privsep 1`) effective permissions are the *intersection* of (a)
> the backing user's permissions and (b) the token's own ACLs.** The role must therefore be
> granted on **BOTH the user AND the token** for the same path. Granting it on the token
> only yields an **empty intersection** and a **403 even on self-calls.**
> [[phase1-2 §1.2](tests/phase1-2-findings.md)]
This corrects the superseded reference (§3 there grants the ACL to the token only). The
intersection is what keeps a privsep token ≤ its user while still being independently
scopeable to a narrow path.
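The intersection rule can be illustrated with a toy model. This is not how Proxmox evaluates ACLs internally; it only shows, for a single path, why a token-only grant yields nothing. The function name is hypothetical and the inputs are space-separated privilege lists.

```bash
# Toy model of the rule for one path:
# effective privileges = (privileges granted to the user) ∩ (privileges granted to the token).
effective_privs() {   # usage: effective_privs "<user privs>" "<token privs>"
  out=""
  for p in $1; do
    case " $2 " in
      *" $p "*) out="${out:+$out }$p" ;;
    esac
  done
  printf '%s\n' "$out"
}

effective_privs "VM.Audit VM.Backup" "VM.Audit VM.Backup"   # role granted to both
effective_privs ""                   "VM.Audit VM.Backup"   # token-only grant: empty
```

The second call prints an empty line, which corresponds to the 403 described above.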
Working pattern (validated):
```bash
pveum role add <Role> -privs "<priv> <priv> ..." # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1 # capture SECRET (shown once)
pveum acl modify <path> -user '<user>@pve' -role <Role> # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>' -role <Role> # ...AND the token
```
`pveum acl delete` **requires `--roles`** (a bare `-user`/`-token` path errors
`400 roles: property is missing`). Deleting the token/user/role auto-invalidates the
referencing ACLs. [[phase1-2 §5](tests/phase1-2-findings.md)]
### 3.4 Validated minimal self-backup role
A token scoped to **one VMID + the backup datastore** can audit, snapshot, and back up
**only that guest**, and is denied on every other guest and on create/allocate
[[phase1-2 §1.3-1.4](tests/phase1-2-findings.md)]:
> **Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode
> self-backup:**
> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
⚠️ **`VM.PowerMgmt` is NOT required for stop-mode backup** — `vzdump` performs the guest
shutdown/restart internally under `VM.Backup` (tested: stop-mode self-backup returned
`exitstatus OK` without it) [[phase1-2 §1.4](tests/phase1-2-findings.md)]. This corrects the
old reference's "likely yes" guess.
Validated boundary (token scoped to `/vms/<self>` + `/storage/local`):
| Operation | Result |
|---|---|
| `GET /version` | 200 |
| `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task `OK` |
| `GET`/`POST` against **another** guest's vmid | **403** (read) / task **403** (backup) |
| `POST /nodes/<node>/lxc` (create/allocate a guest) | **403** — create/allocate is operator-tier |
### 3.5 Async tasks — trust `exitstatus`, not the POST
Long operations (`vzdump`, `snapshot`, clone, restore) return a **UPID**, not a result.
Poll `GET /nodes/<node>/tasks/<upid>/status` until `status: stopped`, then read
`exitstatus` [[phase1-2 §1.3](tests/phase1-2-findings.md)].
> ⚠️ **Authorization can surface at task execution, not at the HTTP POST.** A `vzdump`
> against an unauthorized vmid returns **HTTP 200 + a UPID**, but the task then ends
> `exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)"` and produces **no
> archive**. A caller that trusts the 200 would wrongly believe the backup ran. Always poll
> the task and check `exitstatus`.
(The task owner — including a token — can read its own task status: 200.)
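The poll-then-check pattern could look like the sketch below. It assumes `PVE_HOST`, `PVE_NODE` and `PVE_HEADER` are set by the caller, passes the UPID through unencoded, and extracts fields with `sed` from the compact JSON envelope, which is fragile; a real client should use a proper JSON parser. All function names are hypothetical.

```bash
# Fetch one task-status document (assumed env: PVE_HOST, PVE_NODE, PVE_HEADER).
pve_task_status_json() {
  curl -ksS -H "$PVE_HEADER" \
    "https://${PVE_HOST}:8006/api2/json/nodes/${PVE_NODE}/tasks/$1/status"
}

# Extract a string field from compact JSON on stdin (fragile by design).
json_field() {   # usage: json_field <name>
  sed -n 's/.*"'"$1"'":"\([^"]*\)".*/\1/p'
}

# Poll until the task stops; succeed only when exitstatus is exactly "OK".
pve_wait_task() {   # usage: pve_wait_task <upid> [interval_s]
  while :; do
    doc=$(pve_task_status_json "$1") || return 2
    if [ "$(printf '%s' "$doc" | json_field status)" = "stopped" ]; then
      exitstatus=$(printf '%s' "$doc" | json_field exitstatus)
      printf '%s\n' "$exitstatus"
      [ "$exitstatus" = "OK" ]
      return
    fi
    sleep "${2:-2}"
  done
}
```

A caller would treat a non-zero return from `pve_wait_task` as "no backup was produced", regardless of the HTTP 200 on the original POST.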
---
## 4. Backup & restore (`vzdump` / `pct restore`)
### 4.1 Modes
- **`stop`** — orderly guest shutdown → backup → restart. Highest consistency, defined
downtime. (For LXC the shutdown/restart is internal to `vzdump`; needs only `VM.Backup`,
see §3.4.)
- **`snapshot`** — lowest downtime; copies blocks while running. Consistency depends on the
guest cooperating (§4.2).
- **`suspend`** — legacy/compat, not used.
### 4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC
> ⚠️ **An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze.** A
> running-stack LXC backup is therefore **crash-consistent** (filesystem-level), not
> app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first
> (stop the stack / flush DBs) or use `stop` mode. A **VM** with `qemu-guest-agent` gets
> `guest-fsfreeze` around the copy → near-free app-consistency. [[phase1-2 §2.1](tests/phase1-2-findings.md), [phase0 §4.8](tests/phase0-findings.md)]
**Validated restore behaviour** (LXC, Postgres) [[phase1-2 §2.2](tests/phase1-2-findings.md)]:
- **Crash-consistent (running):** on first start Postgres ran **automatic WAL recovery**
(`database system was interrupted … not properly shut down; automatic recovery in
progress … redo done … ready to accept connections`) and the data was intact.
- **Quiesced (stack stopped):** clean start, no recovery, data intact.
- Both restored correctly here on an idle-at-backup DB; this is **not** a durability
guarantee under heavy write load (§6).
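One way to get the quiesced variant for an LXC is to stop the stack in-guest, run the backup, and start the stack again. The sketch below assumes the stack is managed by Docker Compose in a known directory (a hypothetical path supplied by the caller) and that `local` is the backup storage; it is not the procedure used in the spike. `RUN=echo` gives a dry run.

```bash
# Sketch: quiesce the stack in-guest, back up, then bring the stack back.
# The compose directory is caller-supplied; RUN=echo prints instead of executing.
quiesced_lxc_backup() {   # usage: quiesced_lxc_backup <vmid> <compose_dir>
  ${RUN:-} pct exec "$1" -- docker compose --project-directory "$2" stop || return 1
  ${RUN:-} vzdump "$1" --mode snapshot --storage local --compress zstd && rc=0 || rc=$?
  # Restart the stack whether or not the backup succeeded.
  ${RUN:-} pct exec "$1" -- docker compose --project-directory "$2" start
  return "$rc"
}

# Dry run:
#   RUN=echo quiesced_lxc_backup <vmid> <compose_dir>
```

When run directly as `vzdump` on the host, the command's exit code reflects the backup result; when the backup is started through the API, the task `exitstatus` is the value to check.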
### 4.3 What a backup captures
A single LXC `vzdump` captures the container rootfs **including the Docker named volumes**
(they live in the rootfs) — one backup = the whole guest and its data. Validated: a
sentinel row survived both variants [[phase1-2 §2.2](tests/phase1-2-findings.md)].
Sizes/timings (2.5 GiB source, zstd) [[phase1-2 §2.1-2.2](tests/phase1-2-findings.md)]:
backup ~934 MB (~2.7:1) in ~22-25 s; restore in ~11-12 s.
### 4.4 Restore = recreate-from-archive (identity is preserved)
There is no single "restore" call — you recreate the guest from the archive into a **new
VMID**:
- **LXC:** `pct restore <newid> <archive> --storage <store>`
- **VM:** `qmrestore <archive> <newid>` (or `POST /nodes/<node>/qemu` with `archive=`)
> ⚠️ **`pct restore` preserves the source config — including the MAC address and
> hostname.** Restoring while the original still runs causes a **MAC/hostname collision** on
> the bridge; reset network identity (`pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp`
> regenerates the MAC) before starting. [[phase1-2 §2.2](tests/phase1-2-findings.md)]
**Restored config survives intact:** `unprivileged: 1` and `features: nesting=1,keyctl=1`
are preserved, so Docker runs in the restored CT [[phase1-2 §2.2](tests/phase1-2-findings.md)].
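The restore-then-reset sequence described above might be scripted as below. This is a sketch: the target storage, bridge name, DHCP addressing and the new hostname are assumptions, and the function name is hypothetical. `RUN=echo` prints the commands for review.

```bash
# Sketch: restore into a new VMID and give it a fresh network identity before
# first start. Storage, bridge and addressing are assumptions.
restore_lxc_copy() {   # usage: restore_lxc_copy <newid> <archive> <hostname>
  ${RUN:-} pct restore "$1" "$2" --storage local-lvm || return 1
  # Re-declaring net0 without hwaddr regenerates the MAC (see the warning above).
  ${RUN:-} pct set "$1" -net0 name=eth0,bridge=vmbr0,ip=dhcp || return 1
  ${RUN:-} pct set "$1" -hostname "$3" || return 1
  ${RUN:-} pct start "$1"
}

# Dry run:
#   RUN=echo restore_lxc_copy <newid> <archive> <new-hostname>
```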
### 4.5 Snapshots
A **running, unprivileged LXC can be snapshotted on LVM-thin** with no stop required
(`exitstatus OK`; snapshot listed while the CT stays `running`)
[[phase1-2 §1.6](tests/phase1-2-findings.md)]. This is the mechanism available for a
snapshot-before-change rollback flow.
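A snapshot-before-change flow could be wrapped as in the sketch below. Only the snapshot step corresponds to a validated fact; the rollback path is an assumption about how such a flow would use `pct rollback` and was not exercised in the spike. The function name and snapshot name are hypothetical.

```bash
# Sketch of snapshot-before-change. The change is passed in as a command;
# RUN=echo prints the pct calls instead of executing them.
with_snapshot() {   # usage: with_snapshot <vmid> <snapname> <command...>
  vmid=$1 snap=$2; shift 2
  ${RUN:-} pct snapshot "$vmid" "$snap" || return 1
  if "$@"; then
    return 0
  fi
  echo "change failed, rolling back to $snap" >&2
  ${RUN:-} pct rollback "$vmid" "$snap"
  return 1
}

# Example:
#   with_snapshot <vmid> pre-change <command that applies the change>
```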
### 4.6 PBS (Proxmox Backup Server)
**Not yet validated.** No PBS datastore was configured or tested in the spike. All backup
findings above are for `vzdump` to a `dir` storage. PBS (dedup, incremental, remote,
dirty-bitmap) is pending.
---
## 5. Gotchas & operational notes (quick reference)
| Gotcha | Detail | Evidence |
|---|---|---|
| **deb822 repos** | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup |
| **Privsep dual-grant** | privsep token needs the role on **both** user and token, else empty intersection → 403 | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| **Async authz** | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| **No fsfreeze for LXC** | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | [phase1-2 §2.1](tests/phase1-2-findings.md) |
| **Restore identity collision** | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | [phase1-2 §2.2](tests/phase1-2-findings.md) |
| **Restart policy for self-heal** | Docker containers in a restored/rebooted guest come up `exited` when they have no restart policy; set a restart policy or run an explicit `compose up -d` so the stack returns automatically | [phase1-2 §2.2/§3](tests/phase1-2-findings.md) |
| **Self-signed TLS** | host cert is self-signed; `curl` needs `-k` until trust is set up | [phase1-2 §1.5](tests/phase1-2-findings.md) |
| **`pveum role info` gone** | use `pveum role list` in PVE 9 | [phase1-2 §1.1](tests/phase1-2-findings.md) |
| **`pveum acl delete` needs `--roles`** | bare `-user`/`-token` path errors `400 roles: property is missing` | [phase1-2 §5](tests/phase1-2-findings.md) |
| **`VM.PowerMgmt` not needed** | stop-mode backup works under `VM.Backup` alone | [phase1-2 §1.4](tests/phase1-2-findings.md) |
---
## 6. Validated vs open
### Validated by the spike
| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | [phase0 §1](tests/phase0-findings.md), [phase1-2 pre-flight](tests/phase1-2-findings.md) |
| Docker runs in an **unprivileged** LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | [phase0 §3](tests/phase0-findings.md) |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | [phase0 §2](tests/phase0-findings.md) |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| Minimal self-backup role; `VM.PowerMgmt` unnecessary | [phase1-2 §1.4](tests/phase1-2-findings.md) |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Async UPID model; vzdump authz surfaces in `exitstatus`, not the POST | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | [phase1-2 §1.6](tests/phase1-2-findings.md) |
| `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | [phase1-2 §2](tests/phase1-2-findings.md) |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | [phase1-2 §2.2](tests/phase1-2-findings.md) |
### Not yet validated (do not assume)
| Open item | Why it matters |
|---|---|
| **PBS** (dedup/incremental/remote backup) | the only backup path tested was `vzdump` to a `dir` |
| **The real controller running inside an LXC** reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller |
| **App-consistency under heavy write load** | WAL recovery was validated only on an idle-at-backup DB |
| **Live migration / restore to a different host** | single-node spike only |
| **Ballooning / KSM** effect on VM RAM cost | VM RAM measured with neither configured |
| **Cluster / HA** behaviour | single node |
| **Production TLS trust** for the API | all calls used `-k` against a self-signed cert |
| **deb822 no-subscription repo setup** as a controlled step | host arrived pre-configured |
---
## 7. Scope boundary
This document holds **platform facts only.** Felhom design decisions — e.g. which guest
type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are
**out of scope** and belong in the controller-architecture document. Where this reference
notes a decision exists, the decision itself is recorded there, not here.