From 060bfb8ffd9234c9938404801a7744c5a828dbed Mon Sep 17 00:00:00 2001 From: kisfenyo Date: Sun, 7 Jun 2026 20:46:01 +0200 Subject: [PATCH] doc update --- docs/proxmox-platform.md | 324 ++++++++++++++++++ ..._Spike_-_API_&_Access-Control_Reference.md | 7 + docs/{ => tests}/phase0-findings.md | 0 docs/{ => tests}/phase1-2-findings.md | 0 4 files changed, 331 insertions(+) create mode 100644 docs/proxmox-platform.md rename docs/{ => tests}/Proxmox_Spike_-_API_&_Access-Control_Reference.md (94%) rename docs/{ => tests}/phase0-findings.md (100%) rename docs/{ => tests}/phase1-2-findings.md (100%) diff --git a/docs/proxmox-platform.md b/docs/proxmox-platform.md new file mode 100644 index 0000000..e7f1cc0 --- /dev/null +++ b/docs/proxmox-platform.md @@ -0,0 +1,324 @@ +# Proxmox Platform Reference + +Authoritative, living reference for the Proxmox platform underneath `proxmox-controller`. +It records **facts about Proxmox and what we validated about it** — not Felhom design +decisions. Where a design choice exists, this doc points to the (future) controller +architecture document rather than making the choice here. + +**Evidence base** (raw, chronological spike logs — kept as the underlying record): +- [tests/phase0-findings.md](tests/phase0-findings.md) — VM-vs-LXC overhead, Docker-in-LXC viability +- [tests/phase1-2-findings.md](tests/phase1-2-findings.md) — privilege model, backup/restore round-trip +- [tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md](tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md) — **superseded** pre-spike reference (contains a known privsep error; do not cite as authoritative) + +Every nontrivial claim links to its evidence section. Validated on a single host +(`demo-felhom`, 192.168.0.162, 4 vCPU / 16 GB) on 2026-06-07; treat single-run timings and +measurements as indicative, not benchmarks. + +--- + +## 1. 
Platform baseline + +Validated stack [[phase0 §1](tests/phase0-findings.md)]: + +| Component | Version | +|---|---| +| Proxmox VE (`pve-manager`) | **9.2.2** (`b9984c6d90a4bd80`) | +| OS | Debian 13 (Trixie) | +| Kernel | proxmox-kernel **7.0.2-6-pve** | +| `pve-qemu-kvm` | 11.0.0-3 | +| `qemu-server` | 9.1.15 | +| `pve-container` | 6.1.10 | +| `lxc-pve` / `lxcfs` | 7.0.0-2 / 7.0.0-pve1 | +| `criu` | 4.1.1-1 | + +`pvesh get /version` → release 9.2. Always confirm the node name on the box +(`pvesh get /nodes`) rather than hard-coding it. + +### 1.1 Storage backends +Two backends were present and exercised [[phase0 §1](tests/phase0-findings.md), [phase1-2 §pre-flight](tests/phase1-2-findings.md)]: + +| Storage | Type | Path / VG | Content types | Holds | +|---|---|---|---|---| +| `local` | `dir` | `/var/lib/vz` | `iso, vztmpl, backup, import` | ISOs, CT templates, **vzdump archives** | +| `local-lvm` | `lvmthin` | VG `pve`, thinpool `data` | `rootdir, images` | guest disk volumes | + +**Why backups cannot live on LVM-thin:** LVM-thin is a *block* backend — it allocates +logical volumes for guest disks. Backup archives and templates are *files*, which require a +file-level backend (`dir`, NFS, CIFS, or PBS). A `vzdump` target must therefore be a +storage whose content types include `backup` (here, `local`); pointing `vzdump` at +`local-lvm` is not valid. [[phase1-2 §pre-flight / §2.1](tests/phase1-2-findings.md)] + +### 1.2 Repositories +PVE 9 uses **deb822** `.sources` files under `/etc/apt/sources.list.d/`. For a host +without a subscription, the enterprise repos (`pve-enterprise.sources`, +`ceph-*-enterprise.sources`) must be disabled (they return 401) and a no-subscription repo +enabled. 
*The spike host arrived with the no-subscription repo already configured and the +host updated [[phase0 baseline](tests/phase0-findings.md)]; the repo setup itself was not a +spike deliverable* — the canonical no-subscription `.sources` is the standard Proxmox 9 +procedure (`/etc/apt/sources.list.d/pve-no-subscription.sources` with +`Components: pve-no-subscription`). Treat the exact commands as standard setup, not +spike-validated. + +**Docker repository (validated):** Docker's official apt repo **has a `trixie` channel**; +no fallback to Debian's `docker.io` was needed. Installed Docker **29.5.3** from it in both +guest types. [[phase0 §1](tests/phase0-findings.md)] + +--- + +## 2. Guest model (LXC vs VM) — validated facts + +Both guest types ran the **identical** workload (Debian 13, Docker 29.5.3, a +postgres/redis/nginx compose stack) under identical resources (2 vCPU, 2048 MB, ~10 GB) +[[phase0](tests/phase0-findings.md)]. + +### 2.1 Isolation characteristic (fact, not recommendation) +- **LXC** is an OS-level container: it **shares the host kernel**. Docker-in-LXC needs the + container configured for nesting (see §2.3). +- **VM** runs its **own guest kernel** under KVM/QEMU, with full hardware-level isolation + and its own firmware. + +The trade-offs below follow directly from this difference. 
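The shared-kernel difference can be observed directly. The sketch below is a minimal illustration, not something the spike ran: it compares two kernel release strings, and the commented usage assumes a hypothetical container ID in `CTID`. Inside an LXC, `uname -r` reports the host's kernel release; inside a VM it reports the guest's own.

```shell
#!/bin/sh
# Sketch: decide whether a guest shares the host kernel by comparing
# kernel release strings. Pure string comparison; no Proxmox calls here.
shares_host_kernel() {
    host_release=$1
    guest_release=$2
    [ "$host_release" = "$guest_release" ]
}

# Example use on a PVE host (CTID is a hypothetical container ID):
#   if shares_host_kernel "$(uname -r)" "$(pct exec "$CTID" -- uname -r)"; then
#       echo "guest shares the host kernel (LXC)"
#   fi
```

The comparison is kept separate from the `pct exec` call so the check can be reused with a release string gathered any other way, for example read from inside a VM.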
+ +### 2.2 Resource overhead (measured) +Host RAM used = `MemTotal − MemAvailable`, deltas vs a both-stopped baseline of 1702 MB; +one guest measured at a time [[phase0 §2](tests/phase0-findings.md)]: + +| Metric | LXC | VM | Note | +|---|---|---|---| +| Idle host-RAM delta | **+211 MB** | **+2056 MB** | structural, see below | +| Under-load host-RAM delta | **+410 MB** | **+2084 MB** | | +| Per-guest attribution | cgroup `memory.current` 1961 MB¹ | KVM RSS ~2031 MB | | +| Idle host CPU used | ~0.3 % | ~6.0 % | VM has an emulation/guest-kernel floor | +| Under-load host CPU used | ~39.4 % | ~53.9 % | VM work shows as `%guest` (31.9 %) | +| pgbench throughput | 2211 tps | 1820 tps | identical load, 0 failed both | +| Disk used (host thin-LV) | ~2.67 GiB | ~2.94 GiB | of 10 GiB allocated | +| Provisioning (create→ready) | ~10–15 s | ~60–75 s | template-extract vs qcow2-import+boot | + +¹ `cgroup memory.current` counts reclaimable page cache shared with the host and +**overstates** the LXC's true incremental cost; the +211 MB host delta is the honest +number [[phase0 §4.4](tests/phase0-findings.md)]. + +**Why the RAM gap is structural** [[phase0 §4.3](tests/phase0-findings.md)]: LXC processes +share the host kernel and page cache, so only the working set counts against the host. A VM +with **no ballooning configured** has KVM back every guest-touched page (including the +guest's own page cache), so its host cost ≈ the full RAM allocation and is largely +load-independent. *Ballooning / KSM were not tested* and could change the VM figure. + +### 2.3 Docker-in-LXC viability (validated) +Docker ran **cleanly in an *unprivileged* LXC** configured with +`--features nesting=1,keyctl=1 --unprivileged 1` (PVE 9 syntax, accepted by `pct create`) +[[phase0 §3](tests/phase0-findings.md)]: + +- `docker run hello-world` → success; full 3-container stack healthy. +- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs` + fallback**. 
(Docker 29 names the overlay driver `overlayfs` via the containerd + snapshotter image store; same overlay technology as the legacy `overlay2`.) +- Named volume persisted writes; multi-container networking + published port worked + (`curl localhost:8080` → 200); 0 failed transactions under load. +- No privileged-container fallback was needed. + +### 2.4 Guest agent & app-consistency capability +- **VM:** `qemu-guest-agent` installs and reports (`agent: 1`), enabling + `guest-fsfreeze`-based app-consistent `snapshot` backups [[phase0 §4.8](tests/phase0-findings.md)]. + The Debian genericcloud image does **not** ship the agent — it must be installed + in-guest. +- **LXC:** no guest agent exists → **no fsfreeze** (see §4.2). + +--- + +## 3. API & access control + +### 3.1 Fundamentals +- **Base URL:** `https://:8006/api2/json`. Every `pve*` CLI is a thin wrapper over + this REST API. +- **Token auth header:** `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`. The + secret is shown **once** at creation. Response envelope: `{"data": ...}`. +- **TLS reality:** the host serves the default **self-signed** certificate. `curl` without + `-k` fails `SSL certificate problem: unable to get local issuer certificate` + [[phase1-2 §1.5](tests/phase1-2-findings.md)]. Production trust (pin the PVE CA / install + a real cert) is a separate, not-yet-decided concern. + +### 3.2 RBAC model +An ACL entry is a triple **(path, principal, role)**; a role is a bundle of privileges, +assigned at the most specific path. Paths include `/`, `/vms/`, `/nodes/`, +`/storage/`, `/pool/`, `/access/...`. + +Introspection (**corrected for PVE 9**) [[phase1-2 §1.1](tests/phase1-2-findings.md)]: +- `pveum role list` — lists roles **with their privileges**. +- ⚠️ `pveum role info ` **does not exist in PVE 9** (the old reference used it). +- `pveum acl list`, `pveum user permissions --path `. 
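The base URL and token header from §3.1 combine into one authenticated request, and what that request is allowed to do is then decided by the ACL entries described above. The following is a minimal sketch under stated assumptions: the helper names `pve_auth_header` and `pve_api_url` are illustrative, and `PVE_HOST`, `PVE_USER`, `PVE_TOKEN_ID`, and `PVE_SECRET` are hypothetical environment variables, not names used by the spike.

```shell
#!/bin/sh
# Build the token auth header described in section 3.1.
# Arguments: user@realm, token ID, secret (all caller-supplied).
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# Build an API URL from a host and an API path such as /version.
pve_api_url() {
    printf 'https://%s:8006/api2/json%s\n' "$1" "$2"
}

# Example (the PVE_* variables are hypothetical; -k skips certificate
# verification and is only acceptable while the host still serves its
# self-signed certificate):
#   curl -sS -k \
#       -H "$(pve_auth_header "$PVE_USER" "$PVE_TOKEN_ID" "$PVE_SECRET")" \
#       "$(pve_api_url "$PVE_HOST" /version)"
```

Keeping the secret in an environment variable rather than on the command line is a choice made for this sketch, since the secret is shown only once at token creation.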
+ +### 3.3 Privilege-separated tokens — the intersection rule (corrected) +> **A privsep token's (`--privsep 1`) effective permissions are the *intersection* of (a) +> the backing user's permissions and (b) the token's own ACLs.** The role must therefore be +> granted on **BOTH the user AND the token** for the same path. Granting it on the token +> only yields an **empty intersection** and a **403 even on self-calls.** +> [[phase1-2 §1.2](tests/phase1-2-findings.md)] + +This corrects the superseded reference (§3 there grants the ACL to the token only). The +intersection is what keeps a privsep token ≤ its user while still being independently +scopeable to a narrow path. + +Working pattern (validated): +```bash +pveum role add -privs " ..." # NB: -privs is space-separated +pveum user add @pve +pveum user token add @pve --privsep 1 # capture SECRET (shown once) +pveum acl modify -user '@pve' -role # BOTH the user... +pveum acl modify -token '@pve!' -role # ...AND the token +``` +`pveum acl delete` **requires `--roles`** (a bare `-user`/`-token` path errors +`400 roles: property is missing`). Deleting the token/user/role auto-invalidates the +referencing ACLs. [[phase1-2 §5](tests/phase1-2-findings.md)] + +### 3.4 Validated minimal self-backup role +A token scoped to **one VMID + the backup datastore** can audit, snapshot, and back up +**only that guest**, and is denied on every other guest and on create/allocate +[[phase1-2 §1.3–1.4](tests/phase1-2-findings.md)]: + +> **Minimal role for self-audit + self-snapshot + both `snapshot`- and `stop`-mode +> self-backup:** +> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit` + +⚠️ **`VM.PowerMgmt` is NOT required for stop-mode backup** — `vzdump` performs the guest +shutdown/restart internally under `VM.Backup` (tested: stop-mode self-backup returned +`exitstatus OK` without it) [[phase1-2 §1.4](tests/phase1-2-findings.md)]. This corrects the +old reference's "likely yes" guess. 
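Combining the dual-grant pattern from §3.3 with the minimal privilege list gives the following sketch. It only prints the `pveum` commands so they can be reviewed before running. The function name and the example role, user, token ID, and VMID are hypothetical, and granting the same role on both the guest path and the storage path is an assumption made for brevity, not something the spike prescribes.

```shell
#!/bin/sh
# Privileges of the validated minimal self-backup role (section 3.4),
# space-separated as `pveum role add -privs` expects.
SELF_BACKUP_PRIVS="VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"

# Print (do not run) the pveum commands for a dual-granted privsep token.
# Arguments: role name, user (user@realm), token ID, VMID, backup storage.
print_self_backup_setup() {
    role=$1 user=$2 tokenid=$3 vmid=$4 storage=$5
    printf 'pveum role add %s -privs "%s"\n' "$role" "$SELF_BACKUP_PRIVS"
    printf 'pveum user add %s\n' "$user"
    printf 'pveum user token add %s %s --privsep 1\n' "$user" "$tokenid"
    for path in "/vms/$vmid" "/storage/$storage"; do
        # Intersection rule: grant the role to BOTH the user and the token.
        printf "pveum acl modify %s -user '%s' -role %s\n" "$path" "$user" "$role"
        printf "pveum acl modify %s -token '%s!%s' -role %s\n" "$path" "$user" "$tokenid" "$role"
    done
}

# Example with hypothetical names:
#   print_self_backup_setup SelfBackup ctl@pve selfbackup 101 local
```

Printing instead of executing keeps the sketch safe to run anywhere. The token secret is still shown only once, by the real `pveum user token add` call.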
+ +Validated boundary (token scoped to `/vms/` + `/storage/local`): + +| Operation | Result | +|---|---| +| `GET /version` | 200 | +| `GET` self status, `POST` self snapshot, `POST` self vzdump | 200 / task `OK` | +| `GET`/`POST` against **another** guest's vmid | **403** (read) / task **403** (backup) | +| `POST /nodes//lxc` (create/allocate a guest) | **403** — create/allocate is operator-tier | + +### 3.5 Async tasks — trust `exitstatus`, not the POST +Long operations (`vzdump`, `snapshot`, clone, restore) return a **UPID**, not a result. +Poll `GET /nodes//tasks//status` until `status: stopped`, then read +`exitstatus` [[phase1-2 §1.3](tests/phase1-2-findings.md)]. + +> ⚠️ **Authorization can surface at task execution, not at the HTTP POST.** A `vzdump` +> against an unauthorized vmid returns **HTTP 200 + a UPID**, but the task then ends +> `exitstatus: "403 Permission check failed (/vms/, VM.Backup)"` and produces **no +> archive**. A caller that trusts the 200 would wrongly believe the backup ran. Always poll +> the task and check `exitstatus`. + +(The task owner — including a token — can read its own task status: 200.) + +--- + +## 4. Backup & restore (`vzdump` / `pct restore`) + +### 4.1 Modes +- **`stop`** — orderly guest shutdown → backup → restart. Highest consistency, defined + downtime. (For LXC the shutdown/restart is internal to `vzdump`; needs only `VM.Backup` — + §3.4.) +- **`snapshot`** — lowest downtime; copies blocks while running. Consistency depends on the + guest cooperating (§4.2). +- **`suspend`** — legacy/compat, not used. + +### 4.2 Consistency: crash-consistent vs quiesced, and no-fsfreeze-for-LXC +> ⚠️ **An LXC has no guest agent, so `snapshot`-mode `vzdump` does NOT fsfreeze.** A +> running-stack LXC backup is therefore **crash-consistent** (filesystem-level), not +> app-consistent. App-consistency for an LXC is the caller's job: quiesce in-guest first +> (stop the stack / flush DBs) or use `stop` mode. 
A **VM** with `qemu-guest-agent` gets +> `guest-fsfreeze` around the copy → near-free app-consistency. [[phase1-2 §2.1](tests/phase1-2-findings.md), [phase0 §4.8](tests/phase0-findings.md)] + +**Validated restore behaviour** (LXC, Postgres) [[phase1-2 §2.2](tests/phase1-2-findings.md)]: +- **Crash-consistent (running):** on first start Postgres ran **automatic WAL recovery** + (`database system was interrupted … not properly shut down; automatic recovery in + progress … redo done … ready to accept connections`) and the data was intact. +- **Quiesced (stack stopped):** clean start, no recovery, data intact. +- Both restored correctly here on an idle-at-backup DB; this is **not** a durability + guarantee under heavy write load (§6). + +### 4.3 What a backup captures +A single LXC `vzdump` captures the container rootfs **including the Docker named volumes** +(they live in the rootfs) — one backup = the whole guest and its data. Validated: a +sentinel row survived both variants [[phase1-2 §2.2](tests/phase1-2-findings.md)]. + +Sizes/timings (2.5 GiB source, zstd) [[phase1-2 §2.1–2.2](tests/phase1-2-findings.md)]: +backup ~934 MB (~2.7:1) in ~22–25 s; restore in ~11–12 s. + +### 4.4 Restore = recreate-from-archive (identity is preserved) +There is no single "restore" call — you recreate the guest from the archive into a **new +VMID**: +- **LXC:** `pct restore --storage ` +- **VM:** `qmrestore ` (or `POST /nodes//qemu` with `archive=`) + +> ⚠️ **`pct restore` preserves the source config — including the MAC address and +> hostname.** Restoring while the original still runs causes a **MAC/hostname collision** on +> the bridge; reset network identity (`pct set -net0 name=eth0,bridge=vmbr0,ip=dhcp` +> regenerates the MAC) before starting. [[phase1-2 §2.2](tests/phase1-2-findings.md)] + +**Restored config survives intact:** `unprivileged: 1` and `features: nesting=1,keyctl=1` +are preserved, so Docker runs in the restored CT [[phase1-2 §2.2](tests/phase1-2-findings.md)]. 
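The restore-then-reset sequence above can be sketched as a small plan generator. This is an illustration under assumptions, not the spike's script: it prints the commands instead of running them, the function name is hypothetical, and the archive path, new VMID, and hostname are supplied by the caller. Setting the hostname with `pct set --hostname` is an added step to avoid the hostname collision; the `net0` line is the reset quoted above.

```shell
#!/bin/sh
# Print (do not run) the steps to restore an LXC archive into a new VMID
# and reset its network identity before first start (section 4.4).
# Arguments: new VMID, archive path, target storage, new hostname.
print_lxc_restore_plan() {
    newid=$1 archive=$2 storage=$3 newname=$4
    printf 'pct restore %s %s --storage %s\n' "$newid" "$archive" "$storage"
    # Per section 4.4, re-specifying net0 this way regenerates the MAC.
    printf 'pct set %s -net0 name=eth0,bridge=vmbr0,ip=dhcp\n' "$newid"
    printf 'pct set %s --hostname %s\n' "$newid" "$newname"
    printf 'pct start %s\n' "$newid"
}

# Example with hypothetical values (ARCHIVE holds the vzdump archive path):
#   print_lxc_restore_plan 201 "$ARCHIVE" local-lvm restored-ct
```

The start command comes last on purpose, so that the restored container never joins the bridge with the source guest's MAC or hostname.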
+ +### 4.5 Snapshots +A **running, unprivileged LXC can be snapshotted on LVM-thin** with no stop required +(`exitstatus OK`; snapshot listed while the CT stays `running`) +[[phase1-2 §1.6](tests/phase1-2-findings.md)]. This is the mechanism available for a +snapshot-before-change rollback flow. + +### 4.6 PBS (Proxmox Backup Server) +**Not yet validated.** No PBS datastore was configured or tested in the spike. All backup +findings above are for `vzdump` to a `dir` storage. PBS (dedup, incremental, remote, dirty- +bitmap) is pending. + +--- + +## 5. Gotchas & operational notes (quick reference) + +| Gotcha | Detail | Evidence | +|---|---|---| +| **deb822 repos** | PVE 9 repos are `.sources` files; disable enterprise, enable no-subscription | standard setup | +| **Privsep dual-grant** | privsep token needs the role on **both** user and token, else empty intersection → 403 | [phase1-2 §1.2](tests/phase1-2-findings.md) | +| **Async authz** | `vzdump` POST returns 200+UPID even when unauthorized; the 403 is in the task `exitstatus`; poll it | [phase1-2 §1.3](tests/phase1-2-findings.md) | +| **No fsfreeze for LXC** | running-LXC `snapshot` backup is crash-consistent only; quiesce or use `stop` for app-consistency | [phase1-2 §2.1](tests/phase1-2-findings.md) | +| **Restore identity collision** | `pct restore` keeps source MAC + hostname; reset before starting alongside the original | [phase1-2 §2.2](tests/phase1-2-findings.md) | +| **Restart policy for self-heal** | restored/rebooted containers come up `exited` with no restart policy; need a restart policy or an explicit `compose up -d` to return automatically | [phase1-2 §2.2/§3](tests/phase1-2-findings.md) | +| **Self-signed TLS** | host cert is self-signed; `curl` needs `-k` until trust is set up | [phase1-2 §1.5](tests/phase1-2-findings.md) | +| **`pveum role info` gone** | use `pveum role list` in PVE 9 | [phase1-2 §1.1](tests/phase1-2-findings.md) | +| **`pveum acl delete` needs `--roles`** | bare 
`-user`/`-token` path errors `400 roles: property is missing` | [phase1-2 §5](tests/phase1-2-findings.md) | +| **`VM.PowerMgmt` not needed** | stop-mode backup works under `VM.Backup` alone | [phase1-2 §1.4](tests/phase1-2-findings.md) | + +--- + +## 6. Validated vs open + +### Validated by the spike +| Fact | Evidence | +|---|---| +| PVE 9.2.2 / Debian 13 / kernel 7.0.2 baseline; `local` (dir) vs `local-lvm` (thin) roles | [phase0 §1](tests/phase0-findings.md), [phase1-2 pre-flight](tests/phase1-2-findings.md) | +| Docker runs in an **unprivileged** LXC (`nesting=1,keyctl=1`), driver `overlayfs`, cgroup v2 | [phase0 §3](tests/phase0-findings.md) | +| LXC vs VM overhead (idle host RAM +211 MB vs +2056 MB; CPU/throughput/provisioning) | [phase0 §2](tests/phase0-findings.md) | +| Privsep token = intersection of user ∩ token ACLs (dual-grant required) | [phase1-2 §1.2](tests/phase1-2-findings.md) | +| Minimal self-backup role; `VM.PowerMgmt` unnecessary | [phase1-2 §1.4](tests/phase1-2-findings.md) | +| Token scoped to one VMID: self-ops succeed, cross-guest + create/allocate denied | [phase1-2 §1.3](tests/phase1-2-findings.md) | +| Async UPID model; vzdump authz surfaces in `exitstatus`, not the POST | [phase1-2 §1.3](tests/phase1-2-findings.md) | +| Running, unprivileged LXC snapshots on LVM-thin (no stop) | [phase1-2 §1.6](tests/phase1-2-findings.md) | +| `vzdump` → `pct restore` round-trip; one backup captures Docker volumes; config survives | [phase1-2 §2](tests/phase1-2-findings.md) | +| Crash-consistent restore recovers via Postgres WAL; quiesced restores clean | [phase1-2 §2.2](tests/phase1-2-findings.md) | + +### Not yet validated (do not assume) +| Open item | Why it matters | +|---|---| +| **PBS** (dedup/incremental/remote backup) | the only backup path tested was `vzdump` to a `dir` | +| **The real controller running inside an LXC** reaching `host:8006` | spike used `curl`/CLI, not the actual Go controller | +| **App-consistency under heavy write load** | 
WAL recovery was validated only on an idle-at-backup DB | +| **Live migration / restore to a different host** | single-node spike only | +| **Ballooning / KSM** effect on VM RAM cost | VM RAM measured with neither configured | +| **Cluster / HA** behaviour | single node | +| **Production TLS trust** for the API | all calls used `-k` against a self-signed cert | +| **deb822 no-subscription repo setup** as a controlled step | host arrived pre-configured | + +--- + +## 7. Scope boundary + +This document holds **platform facts only.** Felhom design decisions — e.g. which guest +type is the default, whether to use privsep or non-privsep tokens, where PBS lives — are +**out of scope** and belong in the controller-architecture document. Where this reference +notes a decision exists, the decision itself is recorded there, not here. diff --git a/docs/Proxmox_Spike_-_API_&_Access-Control_Reference.md b/docs/tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md similarity index 94% rename from docs/Proxmox_Spike_-_API_&_Access-Control_Reference.md rename to docs/tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md index 225d5b9..bc929c5 100644 --- a/docs/Proxmox_Spike_-_API_&_Access-Control_Reference.md +++ b/docs/tests/Proxmox_Spike_-_API_&_Access-Control_Reference.md @@ -1,3 +1,10 @@ +> ⚠️ **SUPERSEDED — spike evidence only, not authoritative.** This is the *pre-spike* +> reference and contains at least one known error (the privsep/ACL mechanism in §3 — it +> grants the ACL to the token only, which yields an empty intersection and a 403 even on +> self-calls). For the corrected, validated facts read +> [`../proxmox-platform.md`](../proxmox-platform.md). Kept here unchanged as the record of +> what we believed going into the spike. 
+ # Proxmox Spike — API & Access-Control Reference Reference for the **controller-as-guest** architecture, synthesized from current diff --git a/docs/phase0-findings.md b/docs/tests/phase0-findings.md similarity index 100% rename from docs/phase0-findings.md rename to docs/tests/phase0-findings.md diff --git a/docs/phase1-2-findings.md b/docs/tests/phase1-2-findings.md similarity index 100% rename from docs/phase1-2-findings.md rename to docs/tests/phase1-2-findings.md