docs updated

2026-06-07 20:20:52 +02:00
commit 23169cbef7
3 changed files with 806 additions and 0 deletions
@@ -0,0 +1,169 @@
+# Proxmox Spike — API & Access-Control Reference
+
+Reference for the **controller-as-guest** architecture, synthesized from current
+Proxmox VE 9.x documentation (June 2026).
+
+Items marked **[confirm on box]** should be verified once PVE is installed —
+treat them as Phase 0/1 verification steps, not gospel. Every Proxmox CLI tool
+is a thin wrapper over the same REST API, so anything below is reachable from Go.
+
+---
+
+## 1. API fundamentals
+
+- **Base URL:** `https://192.168.0.162:8006/api2/json`
+- **Auth (API token):** HTTP header
+  `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`
+  The secret is shown **once** at creation — capture it immediately, it can't be
+  retrieved again.
+- **Response shape:** `{ "data": ... }`; errors come back via HTTP status + body.
+- **Discovery (do this live on the box instead of trusting any doc):**
+  - `pvesh get /version`
+  - `pvesh ls /nodes/<node>/qemu/<vmid>`
+  - Full schema browser: `https://pve.proxmox.com/pve-docs/api-viewer/`
+  - "What call does the GUI make?" → perform the action in the web UI with
+    browser DevTools → Network open and read the request. Fastest way to find
+    the exact endpoint + params for anything.
+- **Async tasks:** long operations (backup, restore, clone) return a **UPID**
+  (task id), not a result. Poll `GET /nodes/<node>/tasks/<upid>/status` until
+  `status: stopped`, then check `exitstatus`. The controller must poll, not
+  block. **[confirm on box]** the exact polling/response shape.
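+
+A minimal Go sketch of the token header described above, using only the standard
+library. `newPVERequest` is a hypothetical helper name; the host, token id and
+`<SECRET>` are placeholders. It builds the request without sending it, so nothing
+here depends on a live box.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+)
+
+// newPVERequest builds (but does not send) an API-token-authenticated request.
+// baseURL, tokenID and secret are supplied by the caller; the helper name is
+// illustrative and not part of any Proxmox client library.
+func newPVERequest(method, baseURL, path, tokenID, secret string) (*http.Request, error) {
+	req, err := http.NewRequest(method, baseURL+path, nil)
+	if err != nil {
+		return nil, err
+	}
+	// Header format from this section: PVEAPIToken=USER@REALM!TOKENID=SECRET
+	req.Header.Set("Authorization", "PVEAPIToken="+tokenID+"="+secret)
+	return req, nil
+}
+
+func main() {
+	req, err := newPVERequest(http.MethodGet,
+		"https://192.168.0.162:8006/api2/json", "/version",
+		"felhom-ctl@pve!ctl", "<SECRET>")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(req.Method, req.URL.String())
+	fmt.Println(req.Header.Get("Authorization"))
+}
+```
+
+Sending the request through an `http.Client` is left out on purpose: certificate
+trust for `:8006` is a separate decision.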
+
+---
+
+## 2. RBAC model — (path, principal, role)
+
+An ACL entry is a triple of **(path, user/group/token, role)**. A role is a
+bundle of privileges, assigned at the most specific path possible.
+
+- **Paths:** `/`, `/vms/<vmid>`, `/nodes/<node>`, `/storage/<store>`,
+  `/pool/<pool>`, `/access/...`
+- **Predefined roles include:** `PVEAuditor` (read-only), `PVEVMUser`,
+  `PVEVMAdmin`, `PVEDatastoreUser`, `PVEAdmin`, `PVEUserAdmin`.
+- **API tokens with privilege separation (`--privsep 1`):** the token's
+  effective permissions are the **intersection** of (a) the backing user's
+  permissions and (b) the token's own ACLs. A privsep token can therefore never
+  exceed its user, and you grant it a separate, minimal ACL. This is exactly the
+  property the in-guest controller needs.
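+
+The intersection rule can be pictured as plain set logic. The Go sketch below is a
+toy model only: `effective` is a hypothetical name, and it says nothing about how
+PVE evaluates ACLs internally (path inheritance and propagation are ignored).
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// effective returns the privileges present in BOTH sets, which is how this
+// section describes a privilege-separated token's rights on a given path.
+func effective(user, token []string) []string {
+	inUser := make(map[string]bool, len(user))
+	for _, p := range user {
+		inUser[p] = true
+	}
+	var out []string
+	for _, p := range token {
+		if inUser[p] {
+			out = append(out, p)
+		}
+	}
+	sort.Strings(out)
+	return out
+}
+
+func main() {
+	token := []string{"VM.Audit", "VM.Snapshot", "VM.Backup"}
+	// User holds more than the token: the token still only gets the overlap.
+	fmt.Println(effective([]string{"VM.Audit", "VM.Backup", "VM.PowerMgmt"}, token))
+	// User holds nothing on the path: the overlap is empty.
+	fmt.Println(effective(nil, token))
+}
+```
+
+Under this model a token ACL alone grants nothing if the backing user has no
+permission on the same path, which is worth checking early on the box.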
+
+Introspection:
+```bash
+pveum role list
+pveum role info PVEVMAdmin
+pveum user permissions <user> --path /vms/<vmid>
+```
+
+---
+
+## 3. Two-tier privilege model (our architecture decision)
+
+**Tier A — in-guest controller (customer-facing, NARROW).**
+Runs inside the customer's guest. Token scoped to *that guest's own VMID only*:
+read its own status/config, snapshot itself, back itself up, write the backup to
+the datastore. Cannot see or touch other guests. The LXC/VM's own privilege
+level is irrelevant here — reaching `host:8006` is just an HTTPS call + token.
+
+**Tier B — operator (provisioning, BROAD).**
+Creates/destroys guests, builds the golden template, attaches storage, wires PBS.
+Lives operator-side (hub / tooling), never on the customer box.
+
+### Phase 1 runbook — minimal self-backup role + scoped token
+
+```bash
+# 1. Custom least-privilege role: "back up / snapshot myself"
+#    [confirm on box: exact privilege names via `pveum role list` / api-viewer]
+pveum role add FelhomSelfBackup \
+  -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
+
+# 2. Dedicated API-only user in the PVE realm (no login password)
+pveum user add felhom-ctl@pve --comment "In-guest controller (self-backup)"
+
+# 3. Privsep token for that user (SECRET shown once)
+pveum user token add felhom-ctl@pve ctl --privsep 1
+
+# 4. Scope the TOKEN to one guest + the backup datastore only
+pveum acl modify /vms/<vmid>      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+pveum acl modify /storage/<store> -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+
+# 5. Test FROM INSIDE the guest
+curl -k https://<host>:8006/api2/json/version \
+  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>"
+
+curl -k -X POST https://<host>:8006/api2/json/nodes/<node>/vzdump \
+  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>" \
+  -d "vmid=<vmid>&storage=<store>&mode=snapshot"
+```
+
+**Pass criteria:** the token can back up its OWN vmid, and any call against another
+vmid is refused with **403**. That single result validates the whole controller-as-guest design.
+
+**Open question to settle here:** does Tier A also need `VM.PowerMgmt` so it can
+stop/start its own guest for `stop`-mode backups? Likely yes — add it and re-test.
+
+---
+
+## 4. Backup / restore (vzdump)
+
+**Modes:**
+- **`stop`** — orderly guest shutdown → live backup → resume. Highest
+  consistency, short defined downtime.
+- **`snapshot`** — lowest downtime; copies blocks while running. *Small
+  inconsistency risk* unless the guest cooperates (see below).
+- **`suspend`** — legacy/compat, longer downtime, not recommended.
+
+**App-consistency — the concrete version of the earlier warning:**
+- **VM:** install `qemu-guest-agent` in the guest and set `agent: 1`.
+  `snapshot`-mode vzdump then calls `guest-fsfreeze-freeze` / `-thaw` around the
+  copy → near-free filesystem consistency. **This is a real point in the VM's
+  favour over LXC.**
+- **LXC:** no guest agent → no fsfreeze. App-consistency becomes the
+  *controller's* job: quiesce in-guest first (stop stacks / flush DBs) **then**
+  vzdump, or use `stop` mode. Same lesson as the restic work, moved to the guest
+  layer.
+
+**CLI / API:**
+```bash
+vzdump <vmid> --mode snapshot --storage <store>                 # CLI
+# API (async → UPID):
+POST /api2/json/nodes/<node>/vzdump        params: vmid, storage, mode, ...
+```
+
+**Restore is NOT a single "restore" call** — you recreate the guest from the
+archive:
+- **VM:** `qmrestore <archive> <newvmid>`  /  `POST /nodes/<node>/qemu` with `archive=...`
+- **LXC:** `pct restore <newvmid> <archive>`  /  `POST /nodes/<node>/lxc` with `ostemplate=<archive>` and `restore=1` **[confirm on box]**
+
+Phase 2's real-restore test = restore to a **fresh vmid** and boot it. Do not
+declare the backup "working" until a restored guest actually runs.
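+
+A hedged Go sketch of how the two restore calls differ in their form parameters.
+`restoreRequest` is a hypothetical helper; the `ostemplate` + `restore=1` pairing
+for containers and the `<store>:backup/<file>` volume-id shape are assumptions to
+confirm in the api-viewer. Only the request is assembled, nothing is sent.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/url"
+	"strconv"
+)
+
+// restoreRequest returns the endpoint path and form body for recreating a
+// guest from a vzdump archive. kind is "qemu" or "lxc".
+func restoreRequest(node, kind string, newVMID int, archive, storage string) (string, url.Values, error) {
+	form := url.Values{}
+	form.Set("vmid", strconv.Itoa(newVMID))
+	form.Set("storage", storage)
+	switch kind {
+	case "qemu":
+		form.Set("archive", archive)
+	case "lxc":
+		// Assumption: container restore passes the archive as `ostemplate`
+		// together with `restore=1`.
+		form.Set("ostemplate", archive)
+		form.Set("restore", "1")
+	default:
+		return "", nil, fmt.Errorf("unknown guest kind %q", kind)
+	}
+	return "/nodes/" + url.PathEscape(node) + "/" + kind, form, nil
+}
+
+func main() {
+	// Node name, vmid and archive volume id are placeholders for illustration.
+	path, form, err := restoreRequest("demo-felhom", "lxc", 9002,
+		"local:backup/<archive>", "local-lvm")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println("POST", path)
+	fmt.Println(form.Encode())
+}
+```
+
+Both calls are async, so the response to either POST is a UPID to poll, not a
+restored guest.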
+
+---
+
+## 5. Key REST endpoints (qemu shown; lxc is parallel under `/lxc`)
+
+```
+GET  /nodes
+GET  /nodes/<node>/qemu                          list VMs
+GET  /nodes/<node>/qemu/<vmid>/status/current    live status
+GET  /nodes/<node>/qemu/<vmid>/config            config
+POST /nodes/<node>/qemu/<vmid>/status/{start,stop,shutdown,reboot}
+POST /nodes/<node>/qemu/<vmid>/snapshot          (snapname, description)
+GET  /nodes/<node>/qemu/<vmid>/snapshot          list snapshots
+POST /nodes/<node>/qemu/<vmid>/snapshot/<snap>/rollback
+POST /nodes/<node>/vzdump                         backup (async, UPID)
+GET  /nodes/<node>/tasks/<upid>/status            poll async task
+```
+
+LXC: replace `/qemu/` with `/lxc/`. For **Docker-in-LXC** the container needs
+`features nesting=1,keyctl=1` (`pct set <vmid> -features nesting=1,keyctl=1`, or
+the `features` property on `POST /nodes/<node>/lxc`) — **[confirm on box]**.
+
+---
+
+## 6. Phase 0 confirm-on-box checklist
+
+- [ ] PVE 9.2 installed; storage = LVM-thin (leave free space to also test dir/qcow2)
+- [ ] Exact privilege set for `FelhomSelfBackup` (`pveum role info`)
+- [ ] UPID task-polling response shape
+- [ ] Docker official apt repo has a `trixie` channel
+- [ ] LXC `features nesting=1,keyctl=1` syntax + Docker actually runs inside an LXC
+- [ ] Baseline idle + under-load RAM/CPU: one Debian VM vs one Debian LXC, identical resources
@@ -0,0 +1,331 @@
+# Phase 0 — VM vs LXC Overhead Spike: Findings
+
+**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie),
+kernel 7.0.2-6-pve, 4 vCPU, 16 GB RAM (15771 MB `MemTotal`).
+**Date:** 2026-06-07. **Measured one guest at a time, the other fully stopped.**
+
+> This document presents **data and observations only**. No recommendation or verdict —
+> the architecture decision is made elsewhere.
+
+---
+
+## 1. Provenance
+
+### Platform
+| Component | Version |
+|---|---|
+| pve-manager | 9.2.2 (`b9984c6d90a4bd80`) |
+| kernel | proxmox-kernel 7.0.2-6-pve |
+| pve-qemu-kvm | 11.0.0-3 |
+| qemu-server | 9.1.15 |
+| pve-container | 6.1.10 |
+| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
+| criu | 4.1.1-1 |
+
+`pvesh get /version` → release 9.2, version 9.2.2.
+
+### Guest images
+| | LXC (9001) | VM (9000) |
+|---|---|---|
+| Source | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` | `debian-13-genericcloud-amd64.qcow2` |
+| Build | Debian 13.1 standard CT template (downloaded via `pveam`, checksum verified) | cloud build **20260601-2496**; in-guest reports Debian **13.5** after `apt update` |
+| qcow2 | n/a | virtual 3 GiB, on-disk 323 MiB, compat 1.1/zlib |
+
+### Docker (identical in both guests)
+| | LXC | VM |
+|---|---|---|
+| Source | Docker official apt repo, **`trixie` channel** (confirmed present) | same |
+| Version | **29.5.3** build d1c06ef | **29.5.3** build d1c06ef |
+| Storage Driver | **`overlayfs`** (not vfs) | **`overlayfs`** (not vfs) |
+| Cgroup Version / Driver | **v2 / systemd** | **v2 / systemd** |
+| `hello-world` | OK | OK |
+
+> Docker's official repo **does** have a `trixie` channel — no fallback to Debian's
+> `docker.io` was needed. Docker 29 reports the driver as `overlayfs` (the containerd
+> snapshotter image store) rather than the legacy name `overlay2`; this is the same
+> overlay technology and is **not** a `vfs` fallback.
+
+---
+
+## 2. Comparison table
+
+Baseline (both guests stopped): host RAM used **median 1702 MB** (range 1699–1703);
+host CPU **~0.1 % used** (99.9 % idle). All RAM deltas below are vs this baseline.
+Host RAM used = `MemTotal − MemAvailable`, 5 samples ~3 s apart (median reported).
+
+| Metric | LXC (9001) | VM (9000) | Δ (VM − LXC) |
+|---|---|---|---|
+| **Idle host-RAM delta** | **+211 MB** (1913) | **+2056 MB** (3758) | **+1845 MB** |
+| **Under-load host-RAM delta** | **+410 MB** (2112) | **+2084 MB** (3786) | **+1674 MB** |
+| **Per-guest mem attribution** | cgroup `memory.current` = **1961 MB**¹ | KVM process RSS = **2031 MB** (idle) / **2047 MB** (load) | — |
+| **Idle host CPU used** | **~0.3 %** (0.20 usr + 0.10 sys) | **~6.0 %** (3.37 usr + 2.31 sys + 0.29 guest) | **+5.7 pp** |
+| **Under-load host CPU used** | **~39.4 %** (17.1 usr + 7.5 sys + 14.5 iowait + 0.3 soft) | **~53.9 %** (31.9 guest + 16.4 iowait + 3.4 sys + 1.7 usr + 0.6 soft) | **+14.5 pp** |
+| **pgbench throughput** | **2211.7 tps**, lat 1.809 ms, 132 710 tx/60 s, 0 failed | **1819.6 tps**, lat 2.198 ms, 163 764 tx/90 s, 0 failed² | **−392 tps** |
+| **Disk allocated** | 10 GiB | 10 GiB | 0 |
+| **Disk used (host thin-LV)** | 26.73 % ≈ **2.67 GiB** | 29.33 % ≈ **2.93 GiB** | +0.26 GiB |
+| **Disk used (inside guest)** | 2.1 GiB / 9.7 GiB | 2.4 GiB / 9.7 GiB | +0.3 GiB |
+| **Provisioning (rough, create→ready)** | ~10–15 s³ | ~60–75 s³ | — |
+
+¹ `memory.current` counts reclaimable page cache shared with the host and therefore
+**overstates** the LXC's true incremental cost; the +211 MB host-RAM delta is the honest
+number. ² VM 60 s runs gave 1739 & 1759 tps — consistent with the 90 s definitive run.
+³ Guest-creation step only; see §4. Docker install + first image pull (~network-bound,
+~identical for both) is excluded.
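+
+The median-and-delta arithmetic behind the RAM rows can be reproduced with a few
+lines of Go. This is a sketch for re-checking the table, not the tooling used
+during the spike; `median` is a hypothetical helper and assumes a non-empty,
+odd-length sample set. The sample values are the ones recorded in the appendix.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// median returns the middle value of an odd-length sample set.
+// It copies the input so the caller's slice is left untouched.
+func median(samples []int) int {
+	s := append([]int(nil), samples...)
+	sort.Ints(s)
+	return s[len(s)/2]
+}
+
+func main() {
+	// Host RAM used (MB), five samples each.
+	baseline := median([]int{1699, 1702, 1702, 1702, 1703})
+	lxcIdle := median([]int{1913, 1914, 1913, 1914, 1913})
+	vmIdle := median([]int{3758, 3757, 3754, 3759, 3758})
+
+	fmt.Println("baseline:", baseline)
+	fmt.Println("LXC idle delta:", lxcIdle-baseline)
+	fmt.Println("VM idle delta:", vmIdle-baseline)
+}
+```
+
+Running it prints 1702, 211 and 2056, matching the baseline and the idle deltas
+in the table.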
+
+### Inside-guest `free -m` (context only — not the decisive number)
+| | total | used | buff/cache | available |
+|---|---|---|---|---|
+| LXC idle | 2048 | 125 | 1851 | 1922 |
+| VM idle | 1974 | 509 | 1524 | 1464 |
+
+The VM sees **1974 MB** usable of 2048 allocated (firmware/kernel reservation).
+
+---
+
+## 3. Docker-in-LXC viability
+
+**Worked cleanly in an *unprivileged* LXC with `--features nesting=1,keyctl=1`. No
+privileged fallback was needed.**
+
+- `--features nesting=1,keyctl=1 --unprivileged 1` accepted by `pct create` (PVE 9
+  syntax confirmed via `pct help create`).
+- `docker run hello-world` → success.
+- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
+  fallback**.
+- Full 3-container stack (`postgres:17`, `redis:7`, `nginx:alpine`) came up healthy.
+- Named volume `pgdata` persisted a write (`SELECT count` returned 1 after table
+  create/insert).
+- Multi-container networking + published port worked: `curl localhost:8080` → **HTTP 200**.
+- 60 s pgbench load: **0 failed transactions**.
+
+No errors, no `dmesg`/`journalctl` anomalies, no workarounds. The privileged-LXC
+fallback path (step A5) was therefore **not exercised**.
+
+---
+
+## 4. Observations & confounds
+
+1. **VM under-load CPU required a re-measurement (diagnosed, not hidden).** The first
+   VM-load sample showed host CPU ~5 %, indistinguishable from *idle* (~6 %), while pgbench nonetheless
+   completed a full 60 s run (1739 tps). Root cause: the VM load was launched through a
+   **nested SSH + `nohup &`** layer (host→VM), which started pgbench *after* the sampling
+   window. The LXC path used local `pct exec` (no nested SSH) so its first sample was
+   valid. Re-running with pgbench held in the **foreground of a long-lived SSH channel**
+   (guaranteed active) and sampling during a confirmed window gave the true **53.9 %**
+   (`%guest`=31.9). **Confound:** the two guests' load was driven through different
+   plumbing (`pct exec` vs nested SSH); the *throughput* numbers are unaffected
+   (pgbench self-reports its own duration), but the CPU figures came from
+   methodologically asymmetric harnesses.
+2. **Baseline drift from residual page cache.** After stopping each guest, host RAM did
+   not snap back to 1702 MB immediately (e.g. 1895 MB just after the LXC stopped;
+   1965→1794 MB drifting down after the VM). This is reclaimable cache, not a leak.
+   Treat all RAM deltas as ±~100 MB.
+3. **The headline RAM gap is structural, not incidental.** LXC processes share the host
+   kernel and page cache, so only the working set counts against the host (+211 MB idle).
+   The VM, with **no ballooning configured**, has KVM back every guest-touched page —
+   including the guest's own 1.5 GB page cache — so the host cost ≈ the full 2 GB
+   allocation (KVM RSS ≈ 2031 MB) and is **largely load-independent** (3758 idle → 3786
+   load). Ballooning / KSM were not tested and could change this.
+4. **`cgroup memory.current` ≠ host cost.** For the LXC it read 1961 MB (near the 2 GB
+   limit) because it includes reclaimable page cache; the real incremental host cost was
+   +211 MB. Per the protocol, `MemTotal − MemAvailable` is the decisive metric.
+5. **VM idle CPU floor (~6 %) vs LXC (~0.3 %).** QEMU device emulation + a full guest
+   kernel's timer/housekeeping impose a small constant CPU cost even at rest.
+6. **Throughput vs CPU trade.** The VM did about 18 % *less* work (1820 vs 2212 tps) for
+   *more* host CPU (53.9 vs 39.4 %). The extra cost surfaces as `%guest` (31.9 %), which
+   bundles the actual DB work with virtualization overhead, whereas in the LXC the same
+   DB work appears directly as host `%usr`/`%sys`. iowait was comparable (14.5 vs 16.4 %, WAL fsync).
+7. **Workload fits in RAM.** pgbench scale `-s 10` (~150 MB) fits in cache in both
+   guests, so the test is commit/CPU-bound rather than disk-bound; a larger-than-RAM
+   dataset would stress the storage paths differently and is not covered here.
+8. **qemu-guest-agent confirmed on the VM** (`qm guest cmd 9000 ping` → OK). This enables
+   `guest-fsfreeze`-based app-consistent `snapshot`-mode vzdump for the VM — a capability
+   the LXC has no equivalent for. The genericcloud image does **not** ship the agent;
+   it had to be installed in-guest (and the VM IP had to be found via `nmap`/MAC until
+   the agent was up).
+9. **Provisioning asymmetry foreshadows cloning.** LXC create is template-extract-bound
+   (526 MiB at 387 MiB/s + SSH keygen, ~10–15 s). VM create is qcow2-import-bound (3 GiB
+   → LVM ≈ 30 s) plus a full firmware boot to SSH-ready (~30–45 s). Figures are rough,
+   single-run, and exclude the shared network-bound Docker install + first image pull.
+
+---
+
+## 5. Raw command log (appendix)
+
+### 5.1 Provenance
+```
+$ pveversion -v | grep ...
+pve-manager: 9.2.2 (running version: 9.2.2/b9984c6d90a4bd80)
+proxmox-kernel-7.0: 7.0.2-6
+criu: 4.1.1-1
+lxc-pve: 7.0.0-2
+lxcfs: 7.0.0-pve1
+pve-container: 6.1.10
+pve-qemu-kvm: 11.0.0-3
+qemu-server: 9.1.15
+
+$ pvesm status
+local         dir      active  98497780  4333576  89114656  4.40%
+local-lvm  lvmthin    active 365760512        0 365760512  0.00%
+
+# Docker repo trixie channel:
+$ curl -fsSL https://download.docker.com/linux/debian/dists/ | grep -oE 'trixie|bookworm|bullseye'
+bookworm / bullseye / trixie        # trixie present
+
+# Cloud image:
+$ qemu-img info debian-13-genericcloud-amd64.qcow2
+virtual size: 3 GiB ; disk size: 323 MiB ; compat 1.1 ; build 20260601-2496
+```
+
+### 5.2 Baseline (both guests stopped)
+```
+$ for i in 1 2 3 4 5; do awk '/^MemTotal:/{t=$2} /^MemAvailable:/{a=$2} END{printf "used=%d MB\n", (t-a)/1024}' /proc/meminfo; sleep 3; done
+used=1699 MB / 1702 / 1702 / 1702 / 1703 MB      (median 1702)
+
+$ mpstat 1 5
+Average: all 0.05 usr 0.05 sys ... 99.90 idle
+```
+
+### 5.3 LXC 9001 — create + Docker
+```
+$ pct create 9001 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
+    --hostname spike-lxc --cores 2 --memory 2048 --rootfs local-lvm:10 \
+    --net0 name=eth0,bridge=vmbr0,ip=dhcp --features nesting=1,keyctl=1 \
+    --unprivileged 1 --start 1
+  Logical volume "vm-9001-disk-0" created.
+  extracting archive ... Total bytes read: 551505920 (526MiB, 387MiB/s)
+  Creating SSH host key ... done
+=== exit: 0 ; status: running
+features: nesting=1,keyctl=1 ; unprivileged: 1 ; ip 192.168.0.115/24
+
+# Docker install (official repo, trixie stable): DOCKER-INSTALL-OK
+$ docker --version            -> Docker version 29.5.3, build d1c06ef
+$ docker run --rm hello-world -> Hello from Docker!
+$ docker info | grep -iE 'Storage Driver|Cgroup'
+ Storage Driver: overlayfs
+ Cgroup Driver: systemd
+ Cgroup Version: 2
+ Server Version: 29.5.3 ; Kernel: 7.0.2-6-pve ; OS: Debian GNU/Linux 13 (trixie)
+```
+
+### 5.4 LXC 9001 — stack health
+```
+$ docker compose ps
+spike-cache-1  running   Up
+spike-db-1     running   Up
+spike-web-1    running   Up
+$ curl -s -o /dev/null -w 'HTTP %{http_code}' localhost:8080   -> HTTP 200
+$ psql CREATE TABLE spike_persist; INSERT; SELECT count(*)     -> 1   (volume persists)
+```
+
+### 5.5 LXC 9001 — idle measurement
+```
+Host RAM used (5x3s): 1913 / 1914 / 1913 / 1914 / 1913 MB     (median 1913, Δ +211)
+cgroup memory.current: 2056036352 B = 1961 MB
+inside free -m: total 2048 used 125 buff/cache 1851 available 1922
+mpstat 1 5 Average: 0.20 usr 0.10 sys ... 99.70 idle   (~0.3% used)
+pct df 9001: rootfs 9.7G size, 2.1G used, 21.6%
+```
+
+### 5.6 LXC 9001 — under-load measurement
+```
+$ pgbench -i -s 10  -> done in 1.39 s
+$ pgbench -T 60 -c 4 (run concurrently with sampling):
+Host RAM used (5x3s): 2149 / 2143 / 2112 / 2086 / 2071 MB     (median 2112, Δ +410)
+cgroup memory.current: 2130382848 B = 2032 MB
+mpstat 1 5 Average: 17.10 usr 7.50 sys 14.50 iowait 0.31 soft 60.59 idle  (~39.4% used)
+pgbench result: scaling 10, clients 4, 60 s
+  transactions: 132710 ; failed 0 (0.000%)
+  latency average = 1.809 ms ; tps = 2211.713864
+host thin LV vm-9001-disk-0: 10240 MB, Data% 26.73  (≈2.67 GiB)
+```
+
+### 5.7 VM 9000 — create + cloud-init
+```
+$ qm create 9000 --name spike-vm --cores 2 --memory 2048 \
+    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single --agent 1
+$ qm set 9000 --scsi0 local-lvm:0,import-from=/var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
+  transferred 3.0 GiB of 3.0 GiB (100.00%)
+  scsi0: successfully created disk 'local-lvm:vm-9000-disk-0,size=3G'
+$ qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
+$ qm disk resize 9000 scsi0 10G        -> resized 3.00 -> 10.00 GiB
+$ qm set 9000 --ciuser spike --cipassword spike --sshkeys /root/spike-pubkey.pub --ipconfig0 ip=dhcp
+   # pubkey file = the two real keys from the host's /etc/pve/priv/authorized_keys
+   #   (incl. ssh-ed25519 ...kisfenyo@windows — the same workstation key)
+$ qm start 9000   -> start-ok
+```
+
+### 5.8 VM 9000 — IP discovery + guest agent + Docker
+```
+# genericcloud has no guest-agent at first boot -> qm guest cmd ping failed.
+# IP found via MAC on the bridge:
+$ nmap -sn 192.168.0.0/24 | grep -B2 BC:24:11:C7:41:87
+  Nmap scan report for 192.168.0.155 ; MAC BC:24:11:C7:41:87 (Proxmox)
+$ ssh -i /root/.ssh/id_rsa spike@192.168.0.155 'hostname; cat /etc/debian_version'
+  spike-vm ; 13.5
+# install qemu-guest-agent + Docker (official repo, trixie): VM-INSTALL-OK
+$ qm guest cmd 9000 ping            -> AGENT OK   (fsfreeze available)
+$ docker --version                  -> Docker version 29.5.3, build d1c06ef
+$ docker run --rm hello-world       -> Hello from Docker!
+$ docker info | grep -iE 'Storage Driver|Cgroup'
+ Storage Driver: overlayfs ; Cgroup Driver: systemd ; Cgroup Version: 2
+```
+
+### 5.9 VM 9000 — stack health
+```
+$ docker compose ps -> spike-cache-1 / spike-db-1 / spike-web-1 all running
+$ curl ... localhost:8080 -> HTTP 200
+$ psql ... SELECT count(*) -> 1   (volume persists)
+```
+
+### 5.10 VM 9000 — idle measurement
+```
+Host RAM used (5x3s): 3758 / 3757 / 3754 / 3759 / 3758 MB     (median 3758, Δ +2056)
+KVM process RSS / VSZ: 2079988 / 3380896 KiB  (RSS = 2031 MB)
+inside free -m: total 1974 used 509 buff/cache 1524 available 1464
+mpstat 1 5 Average: 3.37 usr 2.31 sys 0.29 guest ... 94.04 idle  (~6.0% used)
+qm config: scsi0 local-lvm:vm-9000-disk-0,size=10G
+host thin LV vm-9000-disk-0: 10240 MB, Data% 29.33  (≈2.93 GiB)
+inside df -h /: 9.7G size, 2.4G used, 25%
+```
+
+### 5.11 VM 9000 — under-load measurement (definitive, load confirmed active)
+```
+# First attempt (nested-ssh + nohup &) launched pgbench AFTER the sample window ->
+# host CPU read a false ~5% (identical to idle). Diagnosed; re-run below holds
+# pgbench in the foreground of a long-lived SSH channel and samples during it.
+
+$ pgbench -T 90 -c 4 (foreground, channel held):
+  transactions: 163764 ; failed 0 (0.000%)
+  latency average = 2.198 ms ; tps = 1819.602345
+  (60 s confirmation runs: 1739 & 1759 tps)
+
+# Sampled 10 s into the confirmed-active load:
+Host RAM used (5x3s): 3784 / 3786 / 3786 / 3786 / 3786 MB     (median 3786, Δ +2084)
+KVM process RSS / VSZ: 2096508 / 4495008 KiB  (RSS = 2047 MB)
+guest uptime: load average 1.71 (2 vCPU)  -> vCPUs busy
+mpstat 1 8 Average:
+  1.70 usr  3.40 sys  16.35 iowait  0.58 soft  31.89 guest  46.08 idle   (~53.9% used)
+```
+
+### 5.12 Teardown state
+```
+$ qm list  -> 9000 spike-vm stopped
+$ pct list -> 9001 spike-lxc stopped
+# both present, both stopped (numbers can be re-checked)
+```
+
+---
+
+## 6. Teardown — destroy commands (NOT run)
+
+Both guests were left **stopped but present**. To remove them:
+
+```bash
+qm destroy 9000 --purge            # VM   (also removes cloudinit + disks)
+pct destroy 9001 --purge           # LXC
+# optional spike artifacts on the host:
+rm -f /var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
+rm -f /root/spike-pubkey.pub /root/vm-install.sh
+# (Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst)
+```
@@ -0,0 +1,306 @@
+# Phase 1 + 2 — Privilege Model & Backup/Restore Round-Trip: Findings
+
+**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, node confirmed via
+`pvesh get /nodes` → `demo-felhom`. Storage: `local` (dir, content
+`iso,vztmpl,backup,import`), `local-lvm` (LVM-thin, `rootdir,images`).
+**Subject:** LXC `9001` (`spike-lxc`, unprivileged, `nesting=1,keyctl=1`, Docker +
+postgres/redis/nginx stack). **Date:** 2026-06-07.
+
+> Data and observations only — **no recommendation or verdict**.
+
+## Hypotheses — verdicts at a glance
+| | Hypothesis | Result |
+|---|---|---|
+| **H1** | Backup scopes to one VMID; restore/create needs node/pool allocate → denied to narrow token | **CONFIRMED** (create CT = 403) |
+| **H2** | An LXC vzdump captures the Docker volumes (they live in the container rootfs) | **CONFIRMED** (sentinel survived both restores) |
+| **H3** | Crash-consistent (running) *and* quiesced (stopped) backups both restore cleanly | **CONFIRMED** (A via WAL recovery, B clean start) |
+| **H4** | Running unprivileged LXC snapshots on LVM-thin; restored CT keeps unprivileged+nesting/keyctl | **CONFIRMED** (live snapshot OK; config survived) |
+
+---
+
+## 1. Phase 1 — Privilege model
+
+### 1.1 Setup (operator side, root)
+```
+pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
+pveum user add felhom-ctl@pve --comment "spike in-guest controller"
+pveum user token add felhom-ctl@pve ctl --privsep 1   # secret: b6547d9d-... (ephemeral, spike-only)
+pveum acl modify /vms/9001      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+```
+Privilege names were verified against `PVEVMAdmin` / `PVEDatastoreUser` via
+`pveum role list` first. **Note:** the reference doc's introspection command
+`pveum role info <role>` **does not exist in PVE 9** — only `pveum role list` works.
+
+### 1.2 ⚠️ Privsep gotcha — the doc's runbook is incomplete
+With `--privsep 1`, a token's effective rights are the **intersection of the backing
+user's permissions AND the token's own ACLs**. The reference doc (§3) grants ACLs to the
+**token only**. With the user `felhom-ctl@pve` holding **no** permissions, the
+intersection was **empty** — the first self-audit call returned:
+```
+HTTP 403  {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}
+```
+**Fix applied:** also grant the user the role on the same paths
+(`pveum acl modify /vms/9001 -user felhom-ctl@pve -role FelhomSelfBackup`, same for
+`/storage/local`). After that the self-calls succeeded. **A privsep token needs the
+permission present on *both* the user and the token** (the token ACL is what keeps the
+token ≤ user / narrowly scoped). This must be reflected in the controller provisioning.
+
+### 1.3 Test matrix (every call run from **inside** the unprivileged LXC, `pct exec 9001`)
+`H=192.168.0.162  N=demo-felhom  AUTH="PVEAPIToken=felhom-ctl@pve!ctl=<secret>"`
+
+| # | Call | Expected | **Actual** | Notes |
+|---|---|---|---|---|
+| 1 | `GET /version` | 200 | **200** | reachable + auth from inside LXC (no privilege needed) |
+| 2 | `GET /nodes/$N/lxc/9001/status/current` | 200 | **200**¹ | self audit (after privsep fix) |
+| 3 | `POST /nodes/$N/lxc/9001/snapshot snapname=spk1` | 200/UPID→OK | **200, task exitstatus OK** | **running-LXC self-snapshot (H4)** |
+| 4 | `POST /nodes/$N/vzdump vmid=9001 storage=local mode=snapshot` | 200/UPID→OK | **200, task exitstatus OK** | self backup, archive produced |
+| 5 | `GET /nodes/$N/qemu/9000/status/current` | 403 | **403** | `Permission check failed (/vms/9000, VM.Audit)` |
+| 6 | `POST /nodes/$N/vzdump vmid=9000 storage=local` | 403 | **200 POST → task exitstatus 403**² | see note |
+| 7 | `POST /nodes/$N/lxc` (create CT) | 403 | **403** | `Permission check failed` — **proves create/allocate is operator-tier (H1)** |
+
+¹ before the privsep fix this was 403; see §1.2.
+² **Important nuance:** the `vzdump` endpoint accepts the POST and returns a UPID even for
+an unauthorized vmid; the authorization failure surfaces at **task execution**, not at the
+HTTP layer. Polled from root:
+`exitstatus: "403 Permission check failed (/vms/9000, VM.Backup)"`, and **no 9000 archive
+was created**. The boundary holds — but a controller must **poll the task exitstatus**, not
+trust the POST's 200, to know a cross-guest backup was actually refused.
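+
+A minimal Go sketch of the "poll, then read `exitstatus`" rule. `taskStatus` and
+`taskOutcome` are hypothetical names, and the JSON payloads are hand-written
+samples shaped after the values quoted in this section, not captured responses.
+Only the `status` and `exitstatus` fields are assumed.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// taskStatus mirrors only the two fields this document relies on.
+type taskStatus struct {
+	Data struct {
+		Status     string `json:"status"`
+		ExitStatus string `json:"exitstatus"`
+	} `json:"data"`
+}
+
+// taskOutcome reports whether a task has finished and, if so, whether it
+// succeeded. A task that is not yet "stopped" must be polled again.
+func taskOutcome(body []byte) (done, ok bool, exit string, err error) {
+	var ts taskStatus
+	if err = json.Unmarshal(body, &ts); err != nil {
+		return false, false, "", err
+	}
+	if ts.Data.Status != "stopped" {
+		return false, false, "", nil
+	}
+	return true, ts.Data.ExitStatus == "OK", ts.Data.ExitStatus, nil
+}
+
+func main() {
+	samples := []string{
+		`{"data":{"status":"running"}}`,
+		`{"data":{"status":"stopped","exitstatus":"OK"}}`,
+		`{"data":{"status":"stopped","exitstatus":"403 Permission check failed (/vms/9000, VM.Backup)"}}`,
+	}
+	for _, s := range samples {
+		done, ok, exit, err := taskOutcome([]byte(s))
+		fmt.Printf("done=%v ok=%v exit=%q err=%v\n", done, ok, exit, err)
+	}
+}
+```
+
+Treating anything other than the literal `OK` as failure is the conservative
+choice here: a refused cross-guest backup then can never be mistaken for success.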
+
+**Pass criteria met:** self-ops (1–4) succeed; cross-guest read (5), cross-guest backup
+(6, at task level), and create/allocate (7) are denied. The controller-as-guest boundary
+and the two-tier split are validated.
+
+### 1.4 Final minimal role — `VM.PowerMgmt` **not** required
+The doc's open question was whether Tier A needs `VM.PowerMgmt` for stop-mode backups
+("Likely yes"). **Tested and refuted:** a **stop-mode** self-vzdump submitted by the token
+(`vmid=9001 mode=stop`) completed with **`exitstatus: OK`** using the role *without*
+`VM.PowerMgmt`. `vzdump` performs the guest shutdown/restart internally under
+`VM.Backup`; no separate power privilege is needed.
+
+> **Final minimal role (`FelhomSelfBackup`) — satisfies self-audit, self-snapshot, and
+> both `snapshot`- and `stop`-mode self-backup:**
+> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
+> (`VM.PowerMgmt` deliberately omitted — confirmed unnecessary.)
+
+### 1.5 TLS observation
+From inside the LXC, `curl` **without** `-k`:
+```
+curl: (60) SSL certificate problem: unable to get local issuer certificate
+```
+The host serves the default self-signed PVE cert; all tests used `-k`. Production trust
+(pin the PVE CA / issue a proper cert) is a separate design decision, flagged here.
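+
+One way the "pin the PVE CA" option could look in a Go controller, as a sketch
+only. `pinnedClient` is a hypothetical helper; where the CA PEM comes from and how
+it reaches the guest are not decided here. Note that pinning the CA does not skip
+hostname checks, so the served cert would still need to match the address used.
+
+```go
+package main
+
+import (
+	"crypto/tls"
+	"crypto/x509"
+	"errors"
+	"fmt"
+	"net/http"
+	"time"
+)
+
+// pinnedClient returns an HTTP client that trusts only the CA certificates
+// contained in caPEM, instead of disabling verification as `curl -k` does.
+func pinnedClient(caPEM []byte) (*http.Client, error) {
+	pool := x509.NewCertPool()
+	if !pool.AppendCertsFromPEM(caPEM) {
+		return nil, errors.New("no usable certificate found in CA PEM")
+	}
+	return &http.Client{
+		Timeout: 30 * time.Second,
+		Transport: &http.Transport{
+			TLSClientConfig: &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12},
+		},
+	}, nil
+}
+
+func main() {
+	// Placeholder input: a real caller would pass the PVE CA certificate bytes.
+	_, err := pinnedClient([]byte("not a certificate"))
+	fmt.Println("error for invalid PEM:", err != nil)
+}
+```
+
+Failing closed on an unreadable CA keeps a misprovisioned guest from silently
+falling back to unverified TLS.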
+
+### 1.6 Running-LXC snapshot (H4)
+Call #3 snapshotted the **running** unprivileged LXC on LVM-thin (`exitstatus OK`).
+`pct listsnapshot 9001` shows `spk1` with `pct status 9001 = running`. **No stop
+required** — the snapshot-before-update rollback flow is viable on a live container.
+
+---
+
+## 2. Phase 2 — Backup → real restore round-trip
+
+Sentinel written pre-flight into the `pgdata` volume:
+`restore_check(42,'phase2-sentinel')` → clean read `42|phase2-sentinel`.
+
+### 2.1 Backups (operator/root side)
+| Variant | Mode | Stack state | Task time | Wall | Archive | Size (zstd) |
+|---|---|---|---|---|---|---|
+| **A (crash-consistent)** | `snapshot` | **running** | 00:00:24 | 25 s | `vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst` | **934 MiB** (979,718,569 B) |
+| **B (quiesced)** | `snapshot` | **stopped** (`docker compose stop`) | 00:00:21 | 22 s | `vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst` | **934 MiB** (979,671,582 B) |
+
+Both from a 2.5 GiB source; zstd → ~934 MiB (~2.7:1). The stack was restarted after
+Variant B. **LXC snapshot-mode vzdump does *not* fsfreeze** (no guest agent in an LXC —
+consistent with the Phase 0 finding) → Variant A is genuinely crash-consistent.
+
+### 2.2 Restore → fresh VMID → boot → verify
+| Check | 9002 (Variant A) | 9003 (Variant B) |
+|---|---|---|
+| Restore time (`pct restore … --storage local-lvm`) | **12 s** | **11 s** |
+| `unprivileged: 1` survived | **yes** | **yes** |
+| `features: nesting=1,keyctl=1` survived | **yes** | **yes** |
+| Containers after boot | `exited` (no restart policy) → `docker compose up -d` | same |
+| 3 containers healthy | **yes** | **yes** |
+| `curl localhost:8080` | **HTTP 200** | **HTTP 200** |
+| **Sentinel `(42,'phase2-sentinel')`** | **PRESENT** | **PRESENT** |
+| Postgres first-start | **WAL crash recovery** (see below) | **clean start, no recovery** |
+
+> Restored CTs inherit 9001's fixed `hwaddr`. To avoid a MAC clash with the still-running
+> 9001 on `vmbr0`, `net0` was reset to auto-generate a fresh MAC before boot. All
+> verification (stack health, `curl localhost`, sentinel) is guest-internal and needs no
+> external network — and the Docker images are inside the restored rootfs, so no pulls.
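+
+A small Go sketch of the MAC-reset step as a string transformation. It assumes,
+without having verified the exact mechanism, that re-applying `net0` with the
+`hwaddr=` option removed lets PVE generate a fresh MAC. `dropHWAddr` is a
+hypothetical helper and the MAC below is a placeholder.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// dropHWAddr removes the hwaddr=... option from a netN config string and
+// leaves every other option in its original order.
+func dropHWAddr(netCfg string) string {
+	var kept []string
+	for _, kv := range strings.Split(netCfg, ",") {
+		if strings.HasPrefix(strings.ToLower(strings.TrimSpace(kv)), "hwaddr=") {
+			continue
+		}
+		kept = append(kept, kv)
+	}
+	return strings.Join(kept, ",")
+}
+
+func main() {
+	restored := "name=eth0,bridge=vmbr0,hwaddr=AA:BB:CC:DD:EE:FF,ip=dhcp,type=veth"
+	fmt.Println(dropHWAddr(restored))
+}
+```
+
+The same pass is a natural place to handle the other identity fields (hostname,
+static IP) before the restored guest is first booted.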
+
+**Variant A — Postgres automatic WAL recovery on 9002 (verbatim, post-restore boot):**
+```
+LOG:  database system was interrupted; last known up at 2026-06-07 18:13:21 UTC
+LOG:  database system was not properly shut down; automatic recovery in progress
+LOG:  redo starts at 0/CB12838
+LOG:  invalid record length at 0/CB12870: expected at least 24, got 0   # normal end-of-WAL
+LOG:  redo done at 0/CB12838 ...
+LOG:  checkpoint starting: end-of-recovery immediate wait
+LOG:  database system is ready to accept connections
+```
+**Variant B — clean start on 9003 (verbatim, post-restore boot):**
+```
+LOG:  database system was shut down at 2026-06-07 18:14:39 UTC
+LOG:  database system is ready to accept connections
+```
+
+**H2 confirmed:** one LXC vzdump captured the whole customer including the Docker named
+volume — the sentinel data restored in both guests. **H3 confirmed:** both variants
+restored to a bootable guest with intact data; the crash-consistent one recovered via WAL
+with no manual intervention, the quiesced one started clean. **H4 confirmed:** restored
+config preserved `unprivileged` + `nesting/keyctl`, so Docker ran in the restored CT.
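+
+The A/B distinction above is visible in the Postgres startup log alone, so a restore
+verification step could classify it automatically. A minimal Go sketch that matches
+only the message fragments quoted in this section; the type and function names are
+illustrative, and other Postgres versions or locales may word these messages
+differently:
+
+```go
+package main
+
+import "strings"
+
+type startupKind string
+
+const (
+	startClean     startupKind = "clean"     // previous shutdown was orderly
+	startRecovered startupKind = "recovered" // WAL crash recovery ran and finished
+	startUnknown   startupKind = "unknown"   // not ready, or no known marker
+)
+
+// classifyStartup inspects a Postgres startup log excerpt.
+func classifyStartup(log string) startupKind {
+	ready := strings.Contains(log, "database system is ready to accept connections")
+	switch {
+	case !ready:
+		return startUnknown
+	case strings.Contains(log, "automatic recovery in progress"):
+		return startRecovered
+	case strings.Contains(log, "database system was shut down at"):
+		return startClean
+	default:
+		return startUnknown
+	}
+}
+```
+
+Fed the 9002 excerpt it returns `recovered`; fed the 9003 excerpt it returns `clean`.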
+
+---
+
+## 3. Observations & confounds
+1. **Privsep token needs perms on user *and* token** (§1.2) — the single most important
+   correction to the reference runbook; without it every scoped call 403s.
+2. **vzdump authorization is task-level, not POST-level** (§1.3 note ²) — a 200 + UPID
+   does **not** mean authorized. The controller must poll `exitstatus`. This is also the
+   general async-task lesson: every backup/snapshot/restore returns a UPID and the real
+   result is in the task status.
+3. **`pveum role info` is not a command in PVE 9** (`unknown command`, §4.2); use
+   `pveum role list`. Minor doc drift.
+4. **`VM.PowerMgmt` not needed for stop-mode backup** (§1.4) — narrower role than the doc
+   assumed.
+5. **No fsfreeze for LXC** — Variant A relied on Postgres's own WAL crash recovery, which
+   worked here for an idle-at-backup DB. Under heavy write load, app-consistency for LXC
+   still rests on the controller quiescing first (or stop-mode), exactly as the reference
+   warned. This single test is not a durability guarantee under load.
+6. **Restore MAC collision** (§2.2) — `pct restore` preserves the source `hwaddr`;
+   restoring while the original runs needs a MAC reset (or the original stopped). The
+   controller's restore flow must handle identity (MAC/hostname/IP) to avoid clashes.
+7. **No restart policy on the compose services** — restored containers came up `exited`;
+   `docker compose up -d` (or a restart policy / systemd unit) is required for the stack
+   to return automatically after a restore or guest reboot.
+8. **Restore is fast, backup dominated by I/O** — restores were 11–12 s (extract at
+   ~524 MiB/s); backups ~22–25 s (read 2.5 GiB at ~108–119 MiB/s + zstd). Single runs,
+   idle host, ~150 MB DB; not a throughput benchmark.
+9. **Sequencing artifact:** a Phase-1 stop-mode self-backup ran before Phase 2 and
+   stopped/started 9001; the stack was brought back up and the sentinel re-verified
+   before the Variant A/B backups, so it does not affect the round-trip results.
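+
+Observation 2 implies the controller should never treat the POST response as the
+result. A hedged Go sketch of the decision step, assuming the task-status body has the
+shape `{"data":{"status":...,"exitstatus":...}}` (the field names come from the
+reference doc and the §4.3 log; the exact shape is still a confirm-on-box item). The
+names `taskStatus` and `taskOutcome` are illustrative:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// taskStatus mirrors only the fields this sketch needs.
+type taskStatus struct {
+	Data struct {
+		Status     string `json:"status"`
+		ExitStatus string `json:"exitstatus"`
+	} `json:"data"`
+}
+
+// taskOutcome reports whether the task has finished and, if so,
+// whether it succeeded. Any exitstatus other than "OK" is an error.
+func taskOutcome(body []byte) (done bool, err error) {
+	var ts taskStatus
+	if err := json.Unmarshal(body, &ts); err != nil {
+		return false, fmt.Errorf("decode task status: %w", err)
+	}
+	if ts.Data.Status != "stopped" {
+		return false, nil // still running: poll again later
+	}
+	if ts.Data.ExitStatus != "OK" {
+		return true, fmt.Errorf("task failed: %s", ts.Data.ExitStatus)
+	}
+	return true, nil
+}
+```
+
+With the #6 case from §4.3, the POST returns 200 but `taskOutcome` on the final status
+yields `done == true` and a non-nil error carrying the `403 Permission check failed`
+text, which is the signal the controller has to act on.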
+
+---
+
+## 4. Raw command log (appendix)
+
+### 4.1 Pre-flight
+```
+$ pvesh get /nodes  -> node: demo-felhom
+$ cat /etc/pve/storage.cfg
+dir: local   ... content iso,vztmpl,backup,import        # 'backup' present
+lvmthin: local-lvm ... content rootdir,images            # no backup (expected)
+$ pct start 9001 ; docker compose up -d  -> 3 containers Started
+$ curl localhost:8080  -> HTTP 200
+# sentinel:
+CREATE TABLE ; INSERT 0 1 ; SELECT count -> 1 ; SELECT * -> 42 | phase2-sentinel
+```
+
+### 4.2 Phase 1 — role/user/token/ACL
+```
+$ pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"  -> role-ok
+$ pveum user add felhom-ctl@pve --comment "spike in-guest controller"  -> user-ok
+$ pveum user token add felhom-ctl@pve ctl --privsep 1
+  {"full-tokenid":"felhom-ctl@pve!ctl","info":{"privsep":"1"},"value":"b6547d9d-08ec-4f22-beb8-a551dc2cd69d"}
+$ pveum acl modify /vms/9001 -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup   -> ok
+$ pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup -> ok
+$ pveum role list | grep FelhomSelfBackup
+  FelhomSelfBackup | Datastore.AllocateSpace,Datastore.Audit,VM.Audit,VM.Backup,VM.Snapshot
+$ pveum role info FelhomSelfBackup   -> ERROR: unknown command 'pveum role info'   # PVE9 has no 'role info'
+```
+
+### 4.3 Phase 1 — matrix (from inside LXC)
+```
+# TLS without -k:
+curl: (60) SSL certificate problem: unable to get local issuer certificate
+
+# BEFORE privsep fix:
+#2 GET self status -> HTTP 403 {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}
+
+# privsep fix:
+$ pveum acl modify /vms/9001 -user 'felhom-ctl@pve' -role FelhomSelfBackup  -> ok
+$ pveum acl modify /storage/local -user 'felhom-ctl@pve' -role FelhomSelfBackup -> ok
+
+# AFTER fix:
+#1 GET /version                         -> HTTP 200
+#2 GET /nodes/.../lxc/9001/status/current -> HTTP 200 {"data":{...,"status":"running",...}}
+#5 GET /nodes/.../qemu/9000/status/current -> HTTP 403 (/vms/9000, VM.Audit)
+#6 POST vzdump vmid=9000 -> HTTP 200 {"data":"UPID:...vzdump:9000:felhom-ctl@pve!ctl:"}
+   root poll: exitstatus="403 Permission check failed (/vms/9000, VM.Backup)"
+   task log: TASK ERROR: 403 Permission check failed (/vms/9000, VM.Backup)
+   /var/lib/vz/dump: no 9000 archive created
+#7 POST /nodes/.../lxc (create CT vmid=9009) -> HTTP 403 {"message":"Permission check failed\n"}
+
+#3 POST lxc/9001/snapshot snapname=spk1 -> HTTP 200 UPID:...vzsnapshot:9001...
+   root: exitstatus "OK" ; pct listsnapshot 9001 -> spk1 ; pct status 9001 -> running
+#4 POST vzdump vmid=9001 storage=local mode=snapshot -> HTTP 200 UPID:...vzdump:9001...
+   root: exitstatus "OK"
+   token can read own task status: HTTP 200 {"...exitstatus":"OK"}   # earlier poll TIMEOUTs were a shell-quoting bug in the helper, not a perms issue
+
+# stop-mode self-backup (VM.PowerMgmt test):
+$ token POST vzdump vmid=9001 storage=local mode=stop -> HTTP 200 UPID:...vzdump:9001...
+   root poll: exitstatus "OK"     # SUCCEEDED without VM.PowerMgmt in the role
+```
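+
+The calls in this matrix all need the same two pieces: the token header and, for
+polling, the task-status path. A small Go sketch of both, using the header format and
+endpoint from the reference doc; the function names are illustrative, and escaping the
+UPID as a path segment is a precaution assumed here, not something verified on the box:
+
+```go
+package main
+
+import "net/url"
+
+// authHeader builds the Authorization header value for an API token:
+// PVEAPIToken=USER@REALM!TOKENID=SECRET
+func authHeader(user, tokenID, secret string) string {
+	return "PVEAPIToken=" + user + "!" + tokenID + "=" + secret
+}
+
+// taskStatusPath builds the API path used to poll a task by UPID.
+// Node and UPID are escaped as single path segments.
+func taskStatusPath(node, upid string) string {
+	return "/nodes/" + url.PathEscape(node) + "/tasks/" + url.PathEscape(upid) + "/status"
+}
+```
+
+Building the path in code rather than in a shell helper also sidesteps the kind of
+quoting bug noted for the earlier poll timeouts.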
+
+### 4.4 Phase 2 — backups
+```
+# Variant A (running):
+$ vzdump 9001 --mode snapshot --storage local --compress zstd
+INFO: Total bytes written: 2585589760 (2.5GiB, 108MiB/s)
+INFO: archive file size: 934MB
+INFO: Finished Backup of VM 9001 (00:00:24)   ; WALL_SECONDS=25
+-> vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst  (979718569 B)
+
+# Variant B (stopped):
+$ docker compose stop   (cache,db,web Stopped)
+$ vzdump 9001 --mode snapshot --storage local --compress zstd
+INFO: Total bytes written: 2585825280 (2.5GiB, 119MiB/s)
+INFO: Finished Backup of VM 9001 (00:00:21)   ; WALL_SECONDS=22
+-> vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst  (979671582 B)
+$ docker compose start   (db,cache,web Started)
+```
+
+### 4.5 Phase 2 — restores + verification
+```
+# A -> 9002:
+$ pct restore 9002 .../20_13_43.tar.zst --storage local-lvm
+  Total bytes read: 2585589760 (2.5GiB, 524MiB/s) ; RESTORE_A_SECONDS=12
+$ pct config 9002 -> features: nesting=1,keyctl=1 ; unprivileged: 1
+$ pct set 9002 -net0 name=eth0,bridge=vmbr0,ip=dhcp   # fresh MAC BC:24:11:E3:F4:64
+$ pct start 9002 ; docker compose up -d -> 3 running ; curl -> HTTP 200
+$ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
+  db log: "was interrupted ... not properly shut down; automatic recovery in progress
+           redo starts/redo done ... database system is ready to accept connections"
+
+# B -> 9003:
+$ pct restore 9003 .../20_14_40.tar.zst --storage local-lvm
+  Total bytes read: 2585825280 (2.5GiB, 524MiB/s) ; RESTORE_B_SECONDS=11
+$ pct config 9003 -> features: nesting=1,keyctl=1 ; unprivileged: 1
+$ pct set 9003 -net0 ... (fresh MAC) ; pct start 9003 ; docker compose up -d -> 3 running ; curl 200
+$ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
+  db log: "database system was shut down at ... ; database system is ready to accept connections"  # clean
+```
+
+---
+
+## 5. Teardown (executed — see §6 for what was left)
+Restore targets destroyed; Phase 1 objects and spike artifacts removed; `9000`/`9001`
+left **stopped-but-present**.
+
+```bash
+pct destroy 9002 --purge ; pct destroy 9003 --purge
+pveum acl delete /vms/9001      -user 'felhom-ctl@pve' -role FelhomSelfBackup ; pveum acl delete /vms/9001      -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+pveum acl delete /storage/local -user 'felhom-ctl@pve' -role FelhomSelfBackup ; pveum acl delete /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
+pveum user token remove felhom-ctl@pve ctl ; pveum user delete felhom-ctl@pve ; pveum role delete FelhomSelfBackup
+pct delsnapshot 9001 spk1
+rm -f /var/lib/vz/dump/vzdump-lxc-9001-*.tar.zst /var/lib/vz/dump/vzdump-lxc-9001-*.log
+pct stop 9001     # back to stopped-but-present
+```
+
+## 6. To destroy 9000/9001 later (NOT run — left stopped-but-present)
+```bash
+qm destroy 9000 --purge        # VM  (Phase 0 subject)
+pct destroy 9001 --purge       # LXC (Phase 0/1/2 subject)
+# Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst
+```