doc update
# Proxmox Platform Reference

Authoritative, living reference for the Proxmox platform underneath `proxmox-controller`.
It records **facts about Proxmox and what we validated about it**, not Felhom design
decisions. Where a design choice exists, this doc points to the (future) controller
architecture document rather than making the choice here.

**Evidence base** (raw, chronological spike logs, kept as the underlying record):
- [tests/phase0-findings.md](tests/phase0-findings.md): VM-vs-LXC overhead, Docker-in-LXC viability
- [tests/phase1-2-findings.md](tests/phase1-2-findings.md): privilege model, backup/restore round-trip
- [tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md](tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md): **superseded** pre-spike reference (contains a known privsep error; do not cite as authoritative)

Every nontrivial claim links to its evidence section. Validated on a single host
(`demo-felhom`, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and
measurements as indicative, not benchmarks.

---

## 1. Platform baseline

Validated stack [[phase0 §1](tests/phase0-findings.md)]:

| Component | Version |
|---|---|
| Proxmox VE (`pve-manager`) | **9.2.2** (`b9984c6d90a4bd80`) |
| OS | Debian 13 (Trixie) |
| Kernel | proxmox-kernel **7.0.2-6-pve** |
| `pve-qemu-kvm` | 11.0.0-3 |
| `qemu-server` | 9.1.15 |
| `pve-container` | 6.1.10 |
| `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 |
| `criu` | 4.1.1-1 |

`pvesh get /version` → release 9.2. Always confirm the node name on the box
(`pvesh get /nodes`) rather than hard-coding it.

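
A minimal sketch of discovering the node name instead of hard-coding it. It assumes `pvesh get /nodes --output-format json` emits compact JSON with a `node` key per entry; `node_names` is a hypothetical helper, and the sample payload is hand-written rather than captured from the spike host.

```bash
# Extract node names from the JSON printed by `pvesh get /nodes --output-format json`.
# Naive pattern match on "node":"<name>"; a real script would use a JSON parser.
node_names() {
    grep -o '"node"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# On a PVE host (not run here):
#   NODE=$(pvesh get /nodes --output-format json | node_names | head -n 1)

# Offline demonstration with a hand-written sample payload:
sample='[{"node":"demo-felhom","status":"online","type":"node"}]'
printf '%s\n' "$sample" | node_names
```

On a single-node host the first entry is the only one; on a cluster the caller would still have to choose which node it means.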
### 1.1 Storage backends
Two backends were present and exercised [[phase0 §1](tests/phase0-findings.md), [phase1-2 §pre-flight](tests/phase1-2-findings.md)]:

| Storage | Type | Path / VG | Content types | Holds |
|---|---|---|---|---|
| `local` | `dir` | `/var/lib/vz` | `iso, vztmpl, backup, import` | ISOs, CT templates, **vzdump archives** |
| `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | `rootdir, images` | guest disk volumes |

**Why backups cannot live on LVM-thin:** LVM-thin is a *block* backend: it allocates
logical volumes for guest disks. Backup archives and templates are *files*, which require a
file-level backend (`dir`, NFS, CIFS, or PBS). A `vzdump` target must therefore be a
storage whose content types include `backup` (here, `local`); pointing `vzdump` at
`local-lvm` is not valid. [[phase1-2 §pre-flight / §2.1](tests/phase1-2-findings.md)]

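
The content-type rule above can be checked mechanically before a backup is attempted. The sketch below is an illustration, not a spike artifact: `storage_allows_backup` is a hypothetical helper, and the two content lists are copied from the table in this section.

```bash
# Succeed (exit 0) when a comma-separated content-type list includes "backup".
storage_allows_backup() {
    case ",$(printf '%s' "$1" | tr -d ' ')," in
        *,backup,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify the two storages from the table above.
for entry in "local:iso, vztmpl, backup, import" "local-lvm:rootdir, images"; do
    name=${entry%%:*}
    content=${entry#*:}
    if storage_allows_backup "$content"; then
        echo "$name: valid vzdump target"
    else
        echo "$name: not a vzdump target (no 'backup' content type)"
    fi
done
```

On a host, the content lists would come from the storage configuration rather than from literals.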
### 1.2 Repositories
PVE 9 uses **deb822** `.sources` files under `/etc/apt/sources.list.d/`. For a host
without a subscription, the enterprise repos (`pve-enterprise.sources`,
`ceph-*-enterprise.sources`) must be disabled (they return 401) and a no-subscription repo
enabled. The spike host arrived with the no-subscription repo already configured and the
host updated [[phase0 baseline](tests/phase0-findings.md)], so the repo setup itself was
not a spike deliverable. The canonical no-subscription source is the standard Proxmox 9
procedure (`/etc/apt/sources.list.d/pve-no-subscription.sources` with
`Components: pve-no-subscription`). Treat the exact commands as standard setup, not
spike-validated.

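
For orientation, a sketch of what such a deb822 stanza typically looks like. Only the file name and the `Components` value come from this document; the URI, suite, and keyring path are the commonly documented values for PVE 9 on Debian 13 and are assumptions here, not spike-validated. `write_no_subscription_sources` is a hypothetical helper, and the demo writes to a temporary directory rather than to `/etc/apt`.

```bash
# Write a no-subscription deb822 stanza to the path given as $1.
# URIs, Suites and Signed-By are assumed standard values, not spike-validated.
write_no_subscription_sources() {
    cat > "$1" <<'EOF'
Types: deb
URIs: http://download.proxmox.com/debian/pve
Suites: trixie
Components: pve-no-subscription
Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg
EOF
}

# Demonstration against a scratch directory, so nothing under /etc is touched.
tmp=$(mktemp -d)
write_no_subscription_sources "$tmp/pve-no-subscription.sources"
cat "$tmp/pve-no-subscription.sources"
```

Disabling an enterprise source in the same format is usually done by adding an `Enabled: no` field to its stanza; that step is likewise standard setup rather than something the spike exercised.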
**Docker repository (validated):** Docker's official apt repo **has a `trixie` channel**;
no fallback to Debian's `docker.io` was needed. Installed Docker **29.5.3** from it in both
guest types. [[phase0 §1](tests/phase0-findings.md)]

---

## 2. Guest model (LXC vs VM): validated facts

Both guest types ran the **identical** workload (Debian 13, Docker 29.5.3, a
postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB)
[[phase0](tests/phase0-findings.md)].

### 2.1 Isolation characteristic (fact, not recommendation)
- **LXC** is an OS-level container: it **shares the host kernel**. Docker-in-LXC needs the
  container configured for nesting (see §2.3).
- **VM** runs its **own guest kernel** under KVM/QEMU, with full hardware-level isolation
  and its own firmware.

The trade-offs below follow directly from this difference.

### 2.2 Resource overhead (measured)
Host RAM used = `MemTotal − MemAvailable`, deltas vs a both-stopped baseline of 1702 MB;
one guest measured at a time [[phase0 §2](tests/phase0-findings.md)]:

| Metric | LXC | VM | Note |
|---|---|---|---|
| Idle host-RAM delta | **+211 MB** | **+2056 MB** | structural, see below |
| Under-load host-RAM delta | **+410 MB** | **+2084 MB** | |
| Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | |
| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor |
| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as `%guest` (31.9 %) |
| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both |
| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated |
| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot |

¹ `cgroup memory.current` counts reclaimable page cache shared with the host and
**overstates** the LXC's true incremental cost; the +211 MB host delta is the honest
number [[phase0 §4.4](tests/phase0-findings.md)].

**Why the RAM gap is structural** [[phase0 §4.3](tests/phase0-findings.md)]: LXC processes
share the host kernel and page cache, so only the working set counts against the host. A VM
with **no ballooning configured** has KVM back every guest-touched page (including the
guest's own page cache), so its host cost ≈ the full RAM allocation and is largely
load-independent. *Ballooning / KSM were not tested* and could change the VM figure.

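
The host-RAM metric used in this section (`MemTotal − MemAvailable`, then a delta against a baseline) can be sampled as sketched below. This is a reconstruction of the arithmetic, not the spike's actual tooling; `host_ram_used_mb` and `ram_delta_mb` are hypothetical helper names, and the division by 1024 assumes the table's MB figures were derived from the kB values in `/proc/meminfo`.

```bash
# Print host RAM used in MB, computed as MemTotal - MemAvailable (kB / 1024).
# Reads /proc/meminfo by default; pass another file as $1 to work offline.
host_ram_used_mb() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END              { printf "%d\n", (total - avail) / 1024 }' "${1:-/proc/meminfo}"
}

# Print the delta between a current reading ($1) and a baseline ($2), both in MB.
ram_delta_mb() {
    echo $(( $1 - $2 ))
}

# Intended use on the host (not run here):
#   baseline=$(host_ram_used_mb)      # both guests stopped
#   ... start exactly one guest, let it settle ...
#   ram_delta_mb "$(host_ram_used_mb)" "$baseline"
```

Measuring one guest at a time, as the spike did, keeps the delta attributable to that guest.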
### 2.3 Docker-in-LXC viability (validated)
Docker ran **cleanly in an *unprivileged* LXC** configured with
`--features nesting=1,keyctl=1 --unprivileged 1` (PVE 9 syntax, accepted by `pct create`)
[[phase0 §3](tests/phase0-findings.md)]:

- `docker run hello-world` → success; full 3-container stack healthy.
- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver), with **no `vfs`
  fallback**. (Docker 29 names the overlay driver `overlayfs` via the containerd
  snapshotter image store; same overlay technology as the legacy `overlay2`.)
- Named volume persisted writes; multi-container networking + published port worked
  (`curl localhost:8080` → 200); 0 failed transactions under load.
- No privileged-container fallback was needed.

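
Whether a given container carries the settings listed above can be verified from its config. The sketch assumes `pct config <vmid>` prints `key: value` lines, including `unprivileged: 1` and a `features:` line; `ct_docker_ready` is a hypothetical helper, and the sample config is hand-written.

```bash
# Read `pct config <vmid>` output on stdin; succeed only if the container is
# unprivileged and has both nesting=1 and keyctl=1 in its features line.
ct_docker_ready() {
    awk -F': ' '
        $1 == "unprivileged" && $2 == "1" { unpriv = 1 }
        $1 == "features" {
            if ($2 ~ /(^|,)nesting=1(,|$)/) nesting = 1
            if ($2 ~ /(^|,)keyctl=1(,|$)/)  keyctl = 1
        }
        END { exit !(unpriv && nesting && keyctl) }
    '
}

# Offline demonstration with a hand-written sample config:
sample_config='arch: amd64
features: nesting=1,keyctl=1
hostname: demo-ct
unprivileged: 1'
printf '%s\n' "$sample_config" | ct_docker_ready && echo "config allows Docker-in-LXC"

# On a PVE host (not run here):
#   pct config <vmid> | ct_docker_ready
```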
### 2.4 Guest agent & app-consistency capability
- **VM:** `qemu-guest-agent` installs and reports (`agent: 1`), enabling
  `guest-fsfreeze`-based app-consistent `snapshot` backups [[phase0 §4.8](tests/phase0-findings.md)].
  The Debian genericcloud image does **not** ship the agent; it must be installed
  in-guest.
- **LXC:** no guest agent exists → **no fsfreeze** (see §4.2).

---

## 3. API & access control

### 3.1 Fundamentals
- **Base URL:** `https://<host>:8006/api2/json`. Every `pve*` CLI is a thin wrapper over
  this REST API.
- **Token auth header:** `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The
  secret is shown **once** at creation. Response envelope: `{"data": ...}`.
- **TLS reality:** the host serves the default **self-signed** certificate. `curl` without
  `-k` fails with `SSL certificate problem: unable to get local issuer certificate`
  [[phase1-2 §1.5](tests/phase1-2-findings.md)]. Production trust (pin the PVE CA / install
  a real cert) is a separate, not-yet-decided concern.

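
A minimal sketch of the fundamentals above: building the token header and issuing a `GET`. The user, token id, and secret are hypothetical placeholders; `pve_auth_header` and `pve_get` are illustrative helper names. The `-k` flag mirrors the self-signed setup of the spike host and is not a recommendation for production.

```bash
# Build the Authorization header from its parts.
# $1 = user@realm, $2 = token id, $3 = secret
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# GET an API path and print the {"data": ...} envelope.
# $1 = host, $2 = API path such as /version; PVE_AUTH must hold the header.
# -k skips certificate verification (self-signed cert, see above).
pve_get() {
    curl -ksS -H "$PVE_AUTH" "https://$1:8006/api2/json$2"
}

# Placeholder values, for illustration only:
PVE_AUTH=$(pve_auth_header 'controller@pve' 'selfbackup' 'REPLACE-WITH-SECRET')
echo "$PVE_AUTH"

# Against a real host (not run here):
#   pve_get 192.168.0.162 /version
```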
### 3.2 RBAC model
An ACL entry is a triple **(path, principal, role)**; a role is a bundle of privileges,
assigned at the most specific path. Paths include `/`, `/vms/<vmid>`, `/nodes/<node>`,
`/storage/<store>`, `/pool/<pool>`, `/access/...`.

Introspection (**corrected for PVE 9**) [[phase1-2 §1.1](tests/phase1-2-findings.md)]:
- `pveum role list`: lists roles **with their privileges**.
- ⚠️ `pveum role info <role>` **does not exist in PVE 9** (the old reference used it).
- `pveum acl list`, `pveum user permissions <user> --path <path>`.

### 3.3 Privilege-separated tokens: the intersection rule (corrected)
> **A privsep token's (`--privsep 1`) effective permissions are the *intersection* of (a)
> the backing user's permissions and (b) the token's own ACLs.** The role must therefore be
> granted on **BOTH the user AND the token** for the same path. Granting it on the token
> only yields an **empty intersection** and a **403 even on self-calls.**
> [[phase1-2 §1.2](tests/phase1-2-findings.md)]

This corrects the superseded reference (§3 there grants the ACL to the token only). The
intersection is what keeps a privsep token ≤ its user while still being independently
scopeable to a narrow path.

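
The intersection rule can be pictured with a toy model that treats each side's grant on one path as a plain privilege list. This is an illustration of the set logic only, not how PVE evaluates ACLs internally; `effective_privs` is a hypothetical helper.

```bash
# Print the privileges present in BOTH space-separated lists (set intersection).
# $1 = privileges the backing user holds on a path
# $2 = privileges the token holds on the same path
effective_privs() {
    out=""
    for p in $1; do
        case " $2 " in
            *" $p "*) out="$out${out:+ }$p" ;;
        esac
    done
    printf '%s\n' "$out"
}

role="VM.Audit VM.Snapshot VM.Backup"

# Role granted to both user and token: the token can use all of it.
effective_privs "$role" "$role"

# Role granted to the token only: the user side is empty, so nothing survives.
effective_privs "" "$role"
```

The second call prints an empty line, which corresponds to the 403 described in the rule above.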
Working pattern (validated):
```bash
pveum role add <Role> -privs "<priv> <priv> ..."          # NB: -privs is space-separated
pveum user add <user>@pve
pveum user token add <user>@pve <tokenid> --privsep 1      # capture SECRET (shown once)
pveum acl modify <path> -user  '<user>@pve'             -role <Role>   # BOTH the user...
pveum acl modify <path> -token '<user>@pve!<tokenid>'   -role <Role>   # ...AND the token
```
`pveum acl delete` **requires `--roles`** (a bare `-user`/`-token` path errors with
`400 roles: property is missing`). Deleting the token/user/role auto-invalidates the
referencing ACLs. [[phase1-2 §5](tests/phase1-2-findings.md)]

### 3.4 Validated minimal self-backup role
A token scoped to **one VMID + the backup datastore** can audit, snapshot, and back up
**only that guest**, and is denied on every other guest and on create/allocate
[[phase1-2 §1.3–1.4](tests/phase1-2-findings.md)]:

> **Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode
> self-backup:**
> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`

⚠️ **`VM.PowerMgmt` is NOT required for stop-mode backup**: `vzdump` performs the guest
shutdown/restart internally under `VM.Backup` (tested: stop-mode self-backup returned
`exitstatus OK` without it) [[phase1-2 §1.4](tests/phase1-2-findings.md)]. This corrects the
old reference's "likely yes" guess.

Validated boundary (token scoped to `/vms/<self>` + `/storage/local`):

| Operation | Result |
|---|---|
| `GET /version` | 200 |
| `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task `OK` |
| `GET`/`POST` against **another** guest's vmid | **403** (read) / task **403** (backup) |
| `POST /nodes/<node>/lxc` (create/allocate a guest) | **403** (create/allocate is operator-tier) |

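
Combining the minimal role with the dual-grant pattern from §3.3 gives a setup along the following lines. The sketch only prints the commands so they can be reviewed before use. The role, user, and token names are hypothetical, and it assumes the same role is granted on both `/vms/<vmid>` and `/storage/local`, which matches the validated scope but may not be the only workable split.

```bash
# Print (do not execute) the pveum commands for a one-guest self-backup token.
# $1 = vmid, $2 = user@realm, $3 = token id, $4 = role name (all placeholders)
print_self_backup_setup() {
    vmid=$1 user=$2 tokenid=$3 role=$4
    privs="VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
    token=$user'!'$tokenid

    printf 'pveum role add %s -privs "%s"\n' "$role" "$privs"
    printf 'pveum user add %s\n' "$user"
    printf 'pveum user token add %s %s --privsep 1\n' "$user" "$tokenid"
    for path in "/vms/$vmid" "/storage/local"; do
        # Dual grant: the role goes on the user AND on the token for each path.
        printf "pveum acl modify %s -user '%s' -role %s\n" "$path" "$user" "$role"
        printf "pveum acl modify %s -token '%s' -role %s\n" "$path" "$token" "$role"
    done
}

print_self_backup_setup 101 'selfbackup@pve' 'ct101' 'SelfBackup'
```

Printing first keeps the token secret out of the script: the real `pveum user token add` call shows it once, interactively.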
### 3.5 Async tasks: trust `exitstatus`, not the POST
Long operations (`vzdump`, `snapshot`, clone, restore) return a **UPID**, not a result.
Poll `GET /nodes/<node>/tasks/<upid>/status` until `status: stopped`, then read
`exitstatus` [[phase1-2 §1.3](tests/phase1-2-findings.md)].

> ⚠️ **Authorization can surface at task execution, not at the HTTP POST.** A `vzdump`
> against an unauthorized vmid returns **HTTP 200 + a UPID**, but the task then ends
> `exitstatus: "403 Permission check failed (/vms/<id>, VM.Backup)"` and produces **no
> archive**. A caller that trusts the 200 would wrongly believe the backup ran. Always poll
> the task and check `exitstatus`.

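
The polling rule above can be sketched as a small loop. This is an illustrative shape, not the controller's implementation: `json_field` and `wait_for_task` are hypothetical helpers, the JSON is parsed with a naive pattern match that assumes compact output and no escaped quotes, and the status fetcher is passed in as a command so the loop can be exercised without a host.

```bash
# Extract a string field (by exact key name) from a JSON envelope on stdin.
json_field() {
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until the task reports status "stopped", then succeed only if
# exitstatus is exactly "OK".
# $1 = command that prints the task-status JSON, $2 = poll interval in seconds
wait_for_task() {
    while :; do
        body=$($1) || return 2
        [ "$(printf '%s\n' "$body" | json_field status)" = "stopped" ] && break
        sleep "${2:-2}"
    done
    exitstatus=$(printf '%s\n' "$body" | json_field exitstatus)
    echo "exitstatus: $exitstatus"
    [ "$exitstatus" = "OK" ]
}

# A real fetcher would wrap the status endpoint, for example (not run here):
#   fetch_status() { curl -ksS -H "$PVE_AUTH" \
#       "https://$HOST:8006/api2/json/nodes/$NODE/tasks/$UPID/status"; }
#   wait_for_task fetch_status 2

# Offline demonstration with a canned response:
demo_status() { echo '{"data":{"status":"stopped","exitstatus":"OK"}}'; }
wait_for_task demo_status 0
```

Returning non-zero for any `exitstatus` other than `OK` is what turns the "HTTP 200 but task 403" case into a visible failure for the caller.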
(The task owner, including a token, can read its own task status: 200.)

---

## 4. Backup & restore (`vzdump` / `pct restore`)

### 4.1 Modes
- **`stop`**: orderly guest shutdown → backup → restart. Highest consistency, defined
  downtime. (For LXC the shutdown/restart is internal to `vzdump` and needs only
  `VM.Backup`; see §3.4.)
- **`snapshot`**: lowest downtime; copies blocks while running. Consistency depends on the
  guest cooperating (§4.2).
- **`suspend`**: legacy/compat, not used.

### 4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC
> ⚠️ **An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze.** A
> running-stack LXC backup is therefore **crash-consistent** (filesystem-level), not
> app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first
> (stop the stack / flush DBs) or use `stop` mode. A **VM** with `qemu-guest-agent` gets
> `guest-fsfreeze` around the copy → near-free app-consistency. [[phase1-2 §2.1](tests/phase1-2-findings.md), [phase0 §4.8](tests/phase0-findings.md)]

**Validated restore behaviour** (LXC, Postgres) [[phase1-2 §2.2](tests/phase1-2-findings.md)]:
- **Crash-consistent (running):** on first start Postgres ran **automatic WAL recovery**
  (`database system was interrupted … not properly shut down; automatic recovery in
  progress … redo done … ready to accept connections`) and the data was intact.
- **Quiesced (stack stopped):** clean start, no recovery, data intact.
- Both restored correctly here on an idle-at-backup DB; this is **not** a durability
  guarantee under heavy write load (§6).

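
The "quiesce in-guest first" option can be sketched as stop stack, back up, start stack. This is one possible shape under stated assumptions, not the procedure the spike ran: it assumes a Docker Compose project inside the container at a hypothetical path, and it runs from the host via `pct exec`. The `PCT` and `VZDUMP` variables exist only so the binaries can be stubbed for a dry run.

```bash
PCT=${PCT:-pct}
VZDUMP=${VZDUMP:-vzdump}

# Stop the compose stack inside the CT, take a snapshot-mode backup, then
# start the stack again even if the backup failed.
# $1 = vmid, $2 = compose project directory inside the CT, $3 = backup storage
quiesced_lxc_backup() {
    vmid=$1 dir=$2 store=${3:-local}
    $PCT exec "$vmid" -- docker compose --project-directory "$dir" stop || return 1
    $VZDUMP "$vmid" --mode snapshot --storage "$store" --compress zstd
    rc=$?
    $PCT exec "$vmid" -- docker compose --project-directory "$dir" start || return 1
    return "$rc"
}

# Dry run that prints the commands instead of executing them:
#   PCT="echo pct" VZDUMP="echo vzdump" quiesced_lxc_backup 101 /opt/stack
```

Restarting the stack before returning the backup's status means a failed backup does not leave the application down.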
### 4.3 What a backup captures
A single LXC `vzdump` captures the container rootfs **including the Docker named volumes**
(they live in the rootfs), so one backup = the whole guest and its data. Validated: a
sentinel row survived both variants [[phase1-2 §2.2](tests/phase1-2-findings.md)].

Sizes/timings (2.5 GiB source, zstd) [[phase1-2 §2.1–2.2](tests/phase1-2-findings.md)]:
backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s.

### 4.4 Restore = recreate-from-archive (identity is preserved)
There is no single "restore" call; you recreate the guest from the archive into a **new
VMID**:
- **LXC:** `pct restore <newid> <archive> --storage <store>`
- **VM:** `qmrestore <archive> <newid>` (or `POST /nodes/<node>/qemu` with `archive=`)

> ⚠️ **`pct restore` preserves the source config, including the MAC address and
> hostname.** Restoring while the original still runs causes a **MAC/hostname collision** on
> the bridge; reset network identity (`pct set <id> -net0 name=eth0,bridge=vmbr0,ip=dhcp`
> regenerates the MAC) before starting. [[phase1-2 §2.2](tests/phase1-2-findings.md)]

**Restored config survives intact:** `unprivileged: 1` and `features: nesting=1,keyctl=1`
are preserved, so Docker runs in the restored CT [[phase1-2 §2.2](tests/phase1-2-findings.md)].

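
Put together, restoring an LXC alongside the still-running original looks roughly like the sketch below. The `pct restore` and `-net0` commands are the ones quoted in this section; the hostname reset and its `restored-<id>` naming scheme are assumptions added for illustration. The `PCT` variable allows a dry run by stubbing the binary.

```bash
PCT=${PCT:-pct}

# Restore an archive into a new VMID and reset its network identity before
# the first start, so it cannot collide with the original on the bridge.
# $1 = new vmid, $2 = archive path, $3 = target storage for the rootfs
restore_as_new_ct() {
    newid=$1 archive=$2 store=${3:-local-lvm}
    $PCT restore "$newid" "$archive" --storage "$store" || return 1
    # Re-setting net0 without an explicit hwaddr yields a fresh MAC.
    $PCT set "$newid" -net0 name=eth0,bridge=vmbr0,ip=dhcp || return 1
    # Hypothetical naming scheme to avoid the hostname collision.
    $PCT set "$newid" -hostname "restored-$newid" || return 1
    $PCT start "$newid"
}

# Dry run that prints the commands instead of executing them:
#   PCT="echo pct" restore_as_new_ct 201 <archive>
```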
### 4.5 Snapshots
A **running, unprivileged LXC can be snapshotted on LVM-thin** with no stop required
(`exitstatus OK`; snapshot listed while the CT stays `running`)
[[phase1-2 §1.6](tests/phase1-2-findings.md)]. This is the mechanism available for a
snapshot-before-change rollback flow.

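
A snapshot-before-change flow built on that mechanism could take the following shape. Only the snapshot step was validated in the spike; the rollback step, and the state of the CT after a rollback, are assumptions here. `with_snapshot` is a hypothetical helper, and `PCT` can be stubbed for a dry run.

```bash
PCT=${PCT:-pct}

# Take a snapshot, run the given change command, and roll back if it fails.
# $1 = vmid, $2 = snapshot name, remaining arguments = the change command
with_snapshot() {
    vmid=$1 snap=$2
    shift 2
    $PCT snapshot "$vmid" "$snap" || return 1
    if "$@"; then
        return 0
    fi
    echo "change failed; rolling back $vmid to $snap" >&2
    $PCT rollback "$vmid" "$snap"
    return 1
}

# Example shape (not run here); upgrade_stack is a placeholder for the change:
#   with_snapshot 101 pre-change upgrade_stack
```

The function returns non-zero whenever the change failed, regardless of whether the rollback itself succeeded, so the caller always learns that the change did not land.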
### 4.6 PBS (Proxmox Backup Server)
**Not yet validated.** No PBS datastore was configured or tested in the spike. All backup
findings above are for `vzdump` to a `dir` storage. PBS (dedup, incremental, remote,
dirty-bitmap) is pending.

---

## 5. Gotchas & operational notes (quick reference)

| Gotcha | Detail | Evidence |
|---|---|---|
| **deb822 repos** | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup |
| **Privsep dual-grant** | privsep token needs the role on **both** user and token, else empty intersection → 403 | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| **Async authz** | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| **No fsfreeze for LXC** | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | [phase1-2 §2.1](tests/phase1-2-findings.md) |
| **Restore identity collision** | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | [phase1-2 §2.2](tests/phase1-2-findings.md) |
| **Restart policy for self-heal** | Docker containers in a restored or rebooted guest come up `exited` when they have no restart policy; they need a restart policy or an explicit `compose up -d` to return automatically | [phase1-2 §2.2/§3](tests/phase1-2-findings.md) |
| **Self-signed TLS** | host cert is self-signed; `curl` needs `-k` until trust is set up | [phase1-2 §1.5](tests/phase1-2-findings.md) |
| **`pveum role info` gone** | use `pveum role list` in PVE 9 | [phase1-2 §1.1](tests/phase1-2-findings.md) |
| **`pveum acl delete` needs `--roles`** | bare `-user`/`-token` path errors with `400 roles: property is missing` | [phase1-2 §5](tests/phase1-2-findings.md) |
| **`VM.PowerMgmt` not needed** | stop-mode backup works under `VM.Backup` alone | [phase1-2 §1.4](tests/phase1-2-findings.md) |

---

## 6. Validated vs open

### Validated by the spike
| Fact | Evidence |
|---|---|
| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | [phase0 §1](tests/phase0-findings.md), [phase1-2 pre-flight](tests/phase1-2-findings.md) |
| Docker runs in an **unprivileged** LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | [phase0 §3](tests/phase0-findings.md) |
| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | [phase0 §2](tests/phase0-findings.md) |
| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | [phase1-2 §1.2](tests/phase1-2-findings.md) |
| Minimal self-backup role; `VM.PowerMgmt` unnecessary | [phase1-2 §1.4](tests/phase1-2-findings.md) |
| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Async UPID model; vzdump authz surfaces in `exitstatus`, not the POST | [phase1-2 §1.3](tests/phase1-2-findings.md) |
| Running, unprivileged LXC snapshots on LVM-thin (no stop) | [phase1-2 §1.6](tests/phase1-2-findings.md) |
| `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | [phase1-2 §2](tests/phase1-2-findings.md) |
| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | [phase1-2 §2.2](tests/phase1-2-findings.md) |

### Not yet validated (do not assume)
| Open item | Why it matters |
|---|---|
| **PBS** (dedup/incremental/remote backup) | the only backup path tested was `vzdump` to a `dir` |
| **The real controller running inside an LXC** reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller |
| **App-consistency under heavy write load** | WAL recovery was validated only on an idle-at-backup DB |
| **Live migration / restore to a different host** | single-node spike only |
| **Ballooning / KSM** effect on VM RAM cost | VM RAM measured with neither configured |
| **Cluster / HA** behaviour | single node |
| **Production TLS trust** for the API | all calls used `-k` against a self-signed cert |
| **deb822 no-subscription repo setup** as a controlled step | host arrived pre-configured |

---

## 7. Scope boundary

This document holds **platform facts only.** Felhom design decisions (e.g. which guest
type is the default, whether to use privsep or non-privsep tokens, where PBS lives) are
**out of scope** and belong in the controller-architecture document. Where this reference
notes a decision exists, the decision itself is recorded there, not here.