moved documentation to felhom.eu
@@ -0,0 +1,176 @@
> ⚠️ **SUPERSEDED — spike evidence only, not authoritative.** This is the *pre-spike*
> reference and contains at least one known error (the privsep/ACL mechanism in §3 — it
> grants the ACL to the token only, which yields an empty intersection and a 403 even on
> self-calls). For the corrected, validated facts read
> [`../proxmox-platform.md`](../proxmox-platform.md). Kept here unchanged as the record of
> what we believed going into the spike.

# Proxmox Spike — API & Access-Control Reference

Reference for the **controller-as-guest** architecture, synthesized from current
Proxmox VE 9.x documentation (June 2026).

Items marked **[confirm on box]** should be verified once PVE is installed —
treat them as Phase 0/1 verification steps, not gospel. Every Proxmox CLI tool
is a thin wrapper over the same REST API, so anything below is reachable from Go.

---

## 1. API fundamentals

- **Base URL:** `https://192.168.0.162:8006/api2/json`
- **Auth (API token):** HTTP header
  `Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET`
  The secret is shown **once** at creation — capture it immediately, it can't be
  retrieved again.
- **Response shape:** `{ "data": ... }`; errors come back via HTTP status + body.
- **Discovery (do this live on the box instead of trusting any doc):**
  - `pvesh get /version`
  - `pvesh ls /nodes/<node>/qemu/<vmid>`
  - Full schema browser: `https://pve.proxmox.com/pve-docs/api-viewer/`
  - "What call does the GUI make?" → perform the action in the web UI with
    browser DevTools → Network open and read the request. Fastest way to find
    the exact endpoint + params for anything.
- **Async tasks:** long operations (backup, restore, clone) return a **UPID**
  (task id), not a result. Poll `GET /nodes/<node>/tasks/<upid>/status` until
  `status: stopped`, then check `exitstatus`. The controller must poll, not
  block. **[confirm on box]** the exact polling/response shape.
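
The header format and the `{ "data": ... }` envelope described in the list above can be sketched as follows. This is a minimal illustration in Python, not a definitive client: the helper names `pve_auth_header` and `unwrap` are hypothetical, and the envelope shape is assumed from the bullet above and remains a **[confirm on box]** item.

```python
import json


def pve_auth_header(user, realm, token_id, secret):
    """Build the Authorization header for a PVE API token.

    Follows the documented form PVEAPIToken=USER@REALM!TOKENID=SECRET.
    """
    return {"Authorization": f"PVEAPIToken={user}@{realm}!{token_id}={secret}"}


def unwrap(body_text):
    """Return the payload under the top-level "data" key of a response body.

    Assumption: every successful response is wrapped as {"data": ...}.
    """
    return json.loads(body_text)["data"]


# Example with placeholder values (no real secret):
header = pve_auth_header("felhom-ctl", "pve", "ctl", "SECRET")
payload = unwrap('{"data": {"version": "9.2.2"}}')
```

Keeping the header construction in one place makes it harder to log or mangle the secret by accident; the secret itself should come from the controller's own configuration, never from source code.
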

---

## 2. RBAC model — (path, principal, role)

An ACL entry is a triple of **(path, user/group/token, role)**. A role is a
bundle of privileges, assigned at the most specific path possible.

- **Paths:** `/`, `/vms/<vmid>`, `/nodes/<node>`, `/storage/<store>`,
  `/pool/<pool>`, `/access/...`
- **Predefined roles include:** `PVEAuditor` (read-only), `PVEVMUser`,
  `PVEVMAdmin`, `PVEDatastoreUser`, `PVEAdmin`, `PVEUserAdmin`.
- **API tokens with privilege separation (`--privsep 1`):** the token's
  effective permissions are the **intersection** of (a) the backing user's
  permissions and (b) the token's own ACLs. A privsep token can therefore never
  exceed its user, and you grant it a separate, minimal ACL. This is exactly the
  property the in-guest controller needs.

Introspection:
```bash
pveum role list
pveum role info PVEVMAdmin
pveum user permissions <user> --path /vms/<vmid>
```

---

## 3. Two-tier privilege model (our architecture decision)

**Tier A — in-guest controller (customer-facing, NARROW).**
Runs inside the customer's guest. Token scoped to *that guest's own VMID only*:
read its own status/config, snapshot itself, back itself up, write the backup to
the datastore. Cannot see or touch other guests. The LXC/VM's own privilege
level is irrelevant here — reaching `host:8006` is just an HTTPS call + token.

**Tier B — operator (provisioning, BROAD).**
Creates/destroys guests, builds the golden template, attaches storage, wires PBS.
Lives operator-side (hub / tooling), never on the customer box.

### Phase 1 runbook — minimal self-backup role + scoped token

```bash
# 1. Custom least-privilege role: "back up / snapshot myself"
#    [confirm on box: exact privilege names via `pveum role list` / api-viewer]
pveum role add FelhomSelfBackup \
  -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"

# 2. Dedicated API-only user in the PVE realm (no login password)
pveum user add felhom-ctl@pve --comment "In-guest controller (self-backup)"

# 3. Privsep token for that user (SECRET shown once)
pveum user token add felhom-ctl@pve ctl --privsep 1

# 4. Scope the TOKEN to one guest + the backup datastore only
pveum acl modify /vms/<vmid> -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
pveum acl modify /storage/<store> -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup

# 5. Test FROM INSIDE the guest
curl -k https://<host>:8006/api2/json/version \
  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>"

curl -k -X POST https://<host>:8006/api2/json/nodes/<node>/vzdump \
  -H "Authorization: PVEAPIToken=felhom-ctl@pve!ctl=<SECRET>" \
  -d "vmid=<vmid>&storage=<store>&mode=snapshot"
```

**Pass criteria:** the token backs up its OWN vmid, and returns **403** on any
other vmid. That single result validates the whole controller-as-guest design.

**Open question to settle here:** does Tier A also need `VM.PowerMgmt` so it can
stop/start its own guest for `stop`-mode backups? Likely yes — add it and re-test.

---

## 4. Backup / restore (vzdump)

**Modes:**
- **`stop`** — orderly guest shutdown → live backup → resume. Highest
  consistency, short defined downtime.
- **`snapshot`** — lowest downtime; copies blocks while running. *Small
  inconsistency risk* unless the guest cooperates (see below).
- **`suspend`** — legacy/compat, longer downtime, not recommended.

**App-consistency — the concrete version of the earlier warning:**
- **VM:** install `qemu-guest-agent` in the guest and set `agent: 1`.
  `snapshot`-mode vzdump then calls `guest-fsfreeze-freeze` / `-thaw` around the
  copy → near-free filesystem consistency. **This is a real point in the VM's
  favour over LXC.**
- **LXC:** no guest agent → no fsfreeze. App-consistency becomes the
  *controller's* job: quiesce in-guest first (stop stacks / flush DBs) **then**
  vzdump, or use `stop` mode. Same lesson as the restic work, moved to the guest
  layer.

**CLI / API:**
```bash
vzdump <vmid> --mode snapshot --storage <store>        # CLI
# API (async → UPID):
POST /api2/json/nodes/<node>/vzdump   params: vmid, storage, mode, ...
```

**Restore is NOT a single "restore" call** — you recreate the guest from the
archive:
- **VM:** `qmrestore <archive> <newvmid>` / `POST /nodes/<node>/qemu` with `archive=...`
- **LXC:** `pct restore <newvmid> <archive>` / `POST /nodes/<node>/lxc` with the archive as source

Phase 2's real-restore test = restore to a **fresh vmid** and boot it. Do not
declare the backup "working" until a restored guest actually runs.

---

## 5. Key REST endpoints (qemu shown; lxc is parallel under `/lxc`)

```
GET  /nodes
GET  /nodes/<node>/qemu                              list VMs
GET  /nodes/<node>/qemu/<vmid>/status/current        live status
GET  /nodes/<node>/qemu/<vmid>/config                config
POST /nodes/<node>/qemu/<vmid>/status/{start,stop,shutdown,reboot}
POST /nodes/<node>/qemu/<vmid>/snapshot              (snapname, description)
GET  /nodes/<node>/qemu/<vmid>/snapshot              list snapshots
POST /nodes/<node>/qemu/<vmid>/snapshot/<snap>/rollback
POST /nodes/<node>/vzdump                            backup (async, UPID)
GET  /nodes/<node>/tasks/<upid>/status               poll async task
```

LXC: replace `/qemu/` with `/lxc/`. For **Docker-in-LXC** the container needs
`features nesting=1,keyctl=1` (`pct set <vmid> -features nesting=1,keyctl=1`, or
the `features` property on `POST /nodes/<node>/lxc`) — **[confirm on box]**.

---

## 6. Phase 0 confirm-on-box checklist

- [ ] PVE 9.2 installed; storage = LVM-thin (leave free space to also test dir/qcow2)
- [ ] Exact privilege set for `FelhomSelfBackup` (`pveum role info`)
- [ ] UPID task-polling response shape
- [ ] Docker official apt repo has a `trixie` channel
- [ ] LXC `features nesting=1,keyctl=1` syntax + Docker actually runs inside an LXC
- [ ] Baseline idle + under-load RAM/CPU: one Debian VM vs one Debian LXC, identical resources
@@ -0,0 +1,331 @@
# Phase 0 — VM vs LXC Overhead Spike: Findings

**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie),
kernel 7.0.2-6-pve, 4 vCPU, 16 GB RAM (15771 MB `MemTotal`).
**Date:** 2026-06-07. **Measured one guest at a time, the other fully stopped.**

> This document presents **data and observations only**. No recommendation or verdict —
> the architecture decision is made elsewhere.

---

## 1. Provenance

### Platform
| Component | Version |
|---|---|
| pve-manager | 9.2.2 (`b9984c6d90a4bd80`) |
| kernel | proxmox-kernel 7.0.2-6-pve |
| pve-qemu-kvm | 11.0.0-3 |
| qemu-server | 9.1.15 |
| pve-container | 6.1.10 |
| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
| criu | 4.1.1-1 |

`pvesh get /version` → release 9.2, version 9.2.2.

### Guest images
| | LXC (9001) | VM (9000) |
|---|---|---|
| Source | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` | `debian-13-genericcloud-amd64.qcow2` |
| Build | Debian 13.1 standard CT template (downloaded via `pveam`, checksum verified) | cloud build **20260601-2496**; in-guest reports Debian **13.5** after `apt update` |
| qcow2 | n/a | virtual 3 GiB, on-disk 323 MiB, compat 1.1/zlib |

### Docker (identical in both guests)
| | LXC | VM |
|---|---|---|
| Source | Docker official apt repo, **`trixie` channel** (confirmed present) | same |
| Version | **29.5.3** build d1c06ef | **29.5.3** build d1c06ef |
| Storage Driver | **`overlayfs`** (not vfs) | **`overlayfs`** (not vfs) |
| Cgroup Version / Driver | **v2 / systemd** | **v2 / systemd** |
| `hello-world` | OK | OK |

> Docker's official repo **does** have a `trixie` channel — no fallback to Debian's
> `docker.io` was needed. Docker 29 reports the driver as `overlayfs` (the containerd
> snapshotter image store) rather than the legacy name `overlay2`; this is the same
> overlay technology and is **not** a `vfs` fallback.

---

## 2. Comparison table

Baseline (both guests stopped): host RAM used **median 1702 MB** (range 1699–1703);
host CPU **~0.1 % used** (99.9 % idle). All RAM deltas below are vs this baseline.
Host RAM used = `MemTotal − MemAvailable`, 5 samples ~3 s apart (median reported).
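
The sampling rule above (`MemTotal − MemAvailable`, median of five samples) can be sketched as follows. This is an illustration in Python, not the harness used for the spike: the function names are hypothetical, and flooring kB to whole MB with `// 1024` is an assumption about how the reported figures were rounded.

```python
import statistics


def host_ram_used_mb(meminfo_text):
    """Compute MemTotal - MemAvailable in MB from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the value (kB for memory fields).
            fields[key.strip()] = int(parts[0])
    return (fields["MemTotal"] - fields["MemAvailable"]) // 1024


def median_used(samples_mb):
    """Median of the per-sample 'used' figures, as reported in the tables."""
    return statistics.median(samples_mb)


# The five baseline samples from the raw command log:
baseline = median_used([1699, 1702, 1702, 1702, 1703])
```

On the host, `host_ram_used_mb(open("/proc/meminfo").read())` would be called once per sample with a roughly 3 s sleep in between; the median then damps the small sample-to-sample jitter visible in the log.
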

| Metric | LXC (9001) | VM (9000) | Δ (VM − LXC) |
|---|---|---|---|
| **Idle host-RAM delta** | **+211 MB** (1913) | **+2056 MB** (3758) | **+1845 MB** |
| **Under-load host-RAM delta** | **+410 MB** (2112) | **+2084 MB** (3786) | **+1674 MB** |
| **Per-guest mem attribution** | cgroup `memory.current` = **1961 MB**¹ | KVM process RSS = **2031 MB** (idle) / **2047 MB** (load) | — |
| **Idle host CPU used** | **~0.3 %** (0.20 usr + 0.10 sys) | **~6.0 %** (3.37 usr + 2.31 sys + 0.29 guest) | **+5.7 pp** |
| **Under-load host CPU used** | **~39.4 %** (17.1 usr + 7.5 sys + 14.5 iowait + 0.3 soft) | **~53.9 %** (31.9 guest + 16.4 iowait + 3.4 sys + 1.7 usr + 0.6 soft) | **+14.5 pp** |
| **pgbench throughput** | **2211.7 tps**, lat 1.809 ms, 132 710 tx/60 s, 0 failed | **1819.6 tps**, lat 2.198 ms, 163 764 tx/90 s, 0 failed² | **−392 tps** |
| **Disk allocated** | 10 GiB | 10 GiB | 0 |
| **Disk used (host thin-LV)** | 26.73 % ≈ **2.67 GiB** | 29.33 % ≈ **2.94 GiB** | +0.27 GiB |
| **Disk used (inside guest)** | 2.1 GiB / 9.7 GiB | 2.4 GiB / 9.7 GiB | +0.3 GiB |
| **Provisioning (rough, create→ready)** | ~10–15 s³ | ~60–75 s³ | — |

¹ `memory.current` counts reclaimable page cache shared with the host and therefore
**overstates** the LXC's true incremental cost; the +211 MB host-RAM delta is the honest
number.

² VM 60 s runs gave 1739 & 1759 tps — consistent with the 90 s definitive run.

³ Guest-creation step only; see §4. Docker install + first image pull (~network-bound,
~identical for both) is excluded.

### Inside-guest `free -m` (context only — not the decisive number)
| | total | used | buff/cache | available |
|---|---|---|---|---|
| LXC idle | 2048 | 125 | 1851 | 1922 |
| VM idle | 1974 | 509 | 1524 | 1464 |

The VM sees **1974 MB** usable of 2048 allocated (firmware/kernel reservation).

---

## 3. Docker-in-LXC viability

**Worked cleanly in an *unprivileged* LXC with `--features nesting=1,keyctl=1`. No
privileged fallback was needed.**

- `--features nesting=1,keyctl=1 --unprivileged 1` accepted by `pct create` (PVE 9
  syntax confirmed via `pct help create`).
- `docker run hello-world` → success.
- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
  fallback**.
- Full 3-container stack (`postgres:17`, `redis:7`, `nginx:alpine`) came up healthy.
- Named volume `pgdata` persisted a write (`SELECT count` returned 1 after table
  create/insert).
- Multi-container networking + published port worked: `curl localhost:8080` → **HTTP 200**.
- 60 s pgbench load: **0 failed transactions**.

No errors, no `dmesg`/`journalctl` anomalies, no workarounds. The privileged-LXC
fallback path (step A5) was therefore **not exercised**.

---

## 4. Observations & confounds

1. **VM under-load CPU required a re-measurement (diagnosed, not hidden).** The first
   VM-load sample showed host CPU ~5 % — identical to *idle* — while pgbench nonetheless
   completed a full 60 s run (1739 tps). Root cause: the VM load was launched through a
   **nested SSH + `nohup &`** layer (host→VM), which started pgbench *after* the sampling
   window. The LXC path used local `pct exec` (no nested SSH) so its first sample was
   valid. Re-running with pgbench held in the **foreground of a long-lived SSH channel**
   (guaranteed active) and sampling during a confirmed window gave the true **53.9 %**
   (`%guest`=31.9). **Confound:** the two guests' load was driven through different
   plumbing (`pct exec` vs nested SSH); the *throughput* numbers are unaffected
   (pgbench self-reports its own duration), but the CPU figures came from
   methodologically asymmetric harnesses.
2. **Baseline drift from residual page cache.** After stopping each guest, host RAM did
   not snap back to 1702 MB immediately (e.g. 1895 MB just after the LXC stopped;
   1965→1794 MB drifting down after the VM). This is reclaimable cache, not a leak.
   Treat all RAM deltas as ±~100 MB.
3. **The headline RAM gap is structural, not incidental.** LXC processes share the host
   kernel and page cache, so only the working set counts against the host (+211 MB idle).
   The VM, with **no ballooning configured**, has KVM back every guest-touched page —
   including the guest's own 1.5 GB page cache — so the host cost ≈ the full 2 GB
   allocation (KVM RSS ≈ 2031 MB) and is **largely load-independent** (3758 idle → 3786
   load). Ballooning / KSM were not tested and could change this.
4. **`cgroup memory.current` ≠ host cost.** For the LXC it read 1961 MB (near the 2 GB
   limit) because it includes reclaimable page cache; the real incremental host cost was
   +211 MB. Per the protocol, `MemTotal − MemAvailable` is the decisive metric.
5. **VM idle CPU floor (~6 %) vs LXC (~0.3 %).** QEMU device emulation + a full guest
   kernel's timer/housekeeping impose a small constant CPU cost even at rest.
6. **Throughput vs CPU trade.** The VM did slightly *less* work (1820 vs 2211 tps) for
   *more* host CPU (53.9 vs 39.4 %). The extra cost surfaces as `%guest` (31.9 %) — the
   actual DB work *plus* virtualization overhead — whereas in the LXC the same DB work
   appears directly as host `%usr`/`%sys`. iowait was comparable (~15–16 %, WAL fsync).
7. **Workload fits in RAM.** pgbench scale `-s 10` (~150 MB) fits in cache in both
   guests, so the test is commit/CPU-bound rather than disk-bound; a larger-than-RAM
   dataset would stress the storage paths differently and is not covered here.
8. **qemu-guest-agent confirmed on the VM** (`qm guest cmd 9000 ping` → OK). This enables
   `guest-fsfreeze`-based app-consistent `snapshot`-mode vzdump for the VM — a capability
   the LXC has no equivalent for. The genericcloud image does **not** ship the agent;
   it had to be installed in-guest (and the VM IP had to be found via `nmap`/MAC until
   the agent was up).
9. **Provisioning asymmetry foreshadows cloning.** LXC create is template-extract-bound
   (526 MiB at 387 MiB/s + SSH keygen, ~10–15 s). VM create is qcow2-import-bound (3 GiB
   → LVM ≈ 30 s) plus a full firmware boot to SSH-ready (~30–45 s). Figures are rough,
   single-run, and exclude the shared network-bound Docker install + first image pull.
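
The throughput-vs-CPU trade in observation 6 can be expressed as transactions per second per percentage point of host CPU. The sketch below only re-derives a ratio from the table's figures; `tps_per_cpu_point` is a hypothetical helper, and because the CPU figures came from asymmetric harnesses (observation 1), the result is illustrative arithmetic and not an additional measurement.

```python
def tps_per_cpu_point(tps, host_cpu_used_pct):
    """Throughput delivered per percentage point of host CPU used."""
    return tps / host_cpu_used_pct


# Figures taken from the comparison table in section 2.
lxc_efficiency = tps_per_cpu_point(2211.7, 39.4)  # LXC under load
vm_efficiency = tps_per_cpu_point(1819.6, 53.9)   # VM under load

# Fraction of the LXC's per-CPU throughput that the VM achieved.
relative = vm_efficiency / lxc_efficiency
```
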

---

## 5. Raw command log (appendix)

### 5.1 Provenance
```
$ pveversion -v | grep ...
pve-manager: 9.2.2 (running version: 9.2.2/b9984c6d90a4bd80)
proxmox-kernel-7.0: 7.0.2-6
criu: 4.1.1-1
lxc-pve: 7.0.0-2
lxcfs: 7.0.0-pve1
pve-container: 6.1.10
pve-qemu-kvm: 11.0.0-3
qemu-server: 9.1.15

$ pvesm status
local      dir      active   98497780   4333576   89114656   4.40%
local-lvm  lvmthin  active  365760512         0  365760512   0.00%

# Docker repo trixie channel:
$ curl -fsSL https://download.docker.com/linux/debian/dists/ | grep -oE 'trixie|bookworm|bullseye'
bookworm / bullseye / trixie      # trixie present

# Cloud image:
$ qemu-img info debian-13-genericcloud-amd64.qcow2
virtual size: 3 GiB ; disk size: 323 MiB ; compat 1.1 ; build 20260601-2496
```

### 5.2 Baseline (both guests stopped)
```
$ for i in 1..5; awk MemTotal-MemAvailable /proc/meminfo ; sleep 3
used=1699 MB / 1702 / 1702 / 1702 / 1703 MB   (median 1702)

$ mpstat 1 5
Average: all 0.05 usr 0.05 sys ... 99.90 idle
```

### 5.3 LXC 9001 — create + Docker
```
$ pct create 9001 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
    --hostname spike-lxc --cores 2 --memory 2048 --rootfs local-lvm:10 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp --features nesting=1,keyctl=1 \
    --unprivileged 1 --start 1
Logical volume "vm-9001-disk-0" created.
extracting archive ... Total bytes read: 551505920 (526MiB, 387MiB/s)
Creating SSH host key ... done
=== exit: 0 ; status: running
features: nesting=1,keyctl=1 ; unprivileged: 1 ; ip 192.168.0.115/24

# Docker install (official repo, trixie stable): DOCKER-INSTALL-OK
$ docker --version            -> Docker version 29.5.3, build d1c06ef
$ docker run --rm hello-world -> Hello from Docker!
$ docker info | grep -iE 'Storage Driver|Cgroup'
Storage Driver: overlayfs
Cgroup Driver: systemd
Cgroup Version: 2
Server Version: 29.5.3 ; Kernel: 7.0.2-6-pve ; OS: Debian GNU/Linux 13 (trixie)
```

### 5.4 LXC 9001 — stack health
```
$ docker compose ps
spike-cache-1 running Up
spike-db-1 running Up
spike-web-1 running Up
$ curl -s -o /dev/null -w 'HTTP %{http_code}' localhost:8080 -> HTTP 200
$ psql CREATE TABLE spike_persist; INSERT; SELECT count(*) -> 1 (volume persists)
```

### 5.5 LXC 9001 — idle measurement
```
Host RAM used (5x3s): 1913 / 1914 / 1913 / 1914 / 1913 MB (median 1913, Δ +211)
cgroup memory.current: 2056036352 B = 1961 MB
inside free -m: total 2048 used 125 buff/cache 1851 available 1922
mpstat 1 5 Average: 0.20 usr 0.10 sys ... 99.70 idle (~0.3% used)
pct df 9001: rootfs 9.7G size, 2.1G used, 21.6%
```

### 5.6 LXC 9001 — under-load measurement
```
$ pgbench -i -s 10 -> done in 1.39 s
$ pgbench -T 60 -c 4 (run concurrently with sampling):
Host RAM used (5x3s): 2149 / 2143 / 2112 / 2086 / 2071 MB (median 2112, Δ +410)
cgroup memory.current: 2130382848 B = 2032 MB
mpstat 1 5 Average: 17.10 usr 7.50 sys 14.50 iowait 0.31 soft 60.59 idle (~39.4% used)
pgbench result: scaling 10, clients 4, 60 s
transactions: 132710 ; failed 0 (0.000%)
latency average = 1.809 ms ; tps = 2211.713864
host thin LV vm-9001-disk-0: 10240 MB, Data% 26.73 (≈2.67 GiB)
```

### 5.7 VM 9000 — create + cloud-init
```
$ qm create 9000 --name spike-vm --cores 2 --memory 2048 \
    --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single --agent 1
$ qm set 9000 --scsi0 local-lvm:0,import-from=/var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
transferred 3.0 GiB of 3.0 GiB (100.00%)
scsi0: successfully created disk 'local-lvm:vm-9000-disk-0,size=3G'
$ qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
$ qm disk resize 9000 scsi0 10G -> resized 3.00 -> 10.00 GiB
$ qm set 9000 --ciuser spike --cipassword spike --sshkeys /root/spike-pubkey.pub --ipconfig0 ip=dhcp
# pubkey file = the two real keys from the host's /etc/pve/priv/authorized_keys
# (incl. ssh-ed25519 ...kisfenyo@windows — the same workstation key)
$ qm start 9000 -> start-ok
```

### 5.8 VM 9000 — IP discovery + guest agent + Docker
```
# genericcloud has no guest-agent at first boot -> qm guest cmd ping failed.
# IP found via MAC on the bridge:
$ nmap -sn 192.168.0.0/24 | grep -B2 BC:24:11:C7:41:87
Nmap scan report for 192.168.0.155 ; MAC BC:24:11:C7:41:87 (Proxmox)
$ ssh -i /root/.ssh/id_rsa spike@192.168.0.155 'hostname; cat /etc/debian_version'
spike-vm ; 13.5
# install qemu-guest-agent + Docker (official repo, trixie): VM-INSTALL-OK
$ qm guest cmd 9000 ping      -> AGENT OK (fsfreeze available)
$ docker --version            -> Docker version 29.5.3, build d1c06ef
$ docker run --rm hello-world -> Hello from Docker!
$ docker info | grep -iE 'Storage Driver|Cgroup'
Storage Driver: overlayfs ; Cgroup Driver: systemd ; Cgroup Version: 2
```

### 5.9 VM 9000 — stack health
```
$ docker compose ps -> spike-cache-1 / spike-db-1 / spike-web-1 all running
$ curl ... localhost:8080 -> HTTP 200
$ psql ... SELECT count(*) -> 1 (volume persists)
```

### 5.10 VM 9000 — idle measurement
```
Host RAM used (5x3s): 3758 / 3757 / 3754 / 3759 / 3758 MB (median 3758, Δ +2056)
KVM process RSS / VSZ: 2079988 / 3380896 KiB (RSS = 2031 MB)
inside free -m: total 1974 used 509 buff/cache 1524 available 1464
mpstat 1 5 Average: 3.37 usr 2.31 sys 0.29 guest ... 94.04 idle (~6.0% used)
qm config: scsi0 local-lvm:vm-9000-disk-0,size=10G
host thin LV vm-9000-disk-0: 10240 MB, Data% 29.33 (≈2.94 GiB)
inside df -h /: 9.7G size, 2.4G used, 25%
```

### 5.11 VM 9000 — under-load measurement (definitive, load confirmed active)
```
# First attempt (nested-ssh + nohup &) launched pgbench AFTER the sample window ->
# host CPU read a false ~5% (identical to idle). Diagnosed; re-run below holds
# pgbench in the foreground of a long-lived SSH channel and samples during it.

$ pgbench -T 90 -c 4 (foreground, channel held):
transactions: 163764 ; failed 0 (0.000%)
latency average = 2.198 ms ; tps = 1819.602345
(60 s confirmation runs: 1739 & 1759 tps)

# Sampled 10 s into the confirmed-active load:
Host RAM used (5x3s): 3784 / 3786 / 3786 / 3786 / 3786 MB (median 3786, Δ +2084)
KVM process RSS / VSZ: 2096508 / 4495008 KiB (RSS = 2047 MB)
guest uptime: load average 1.71 (2 vCPU) -> vCPUs busy
mpstat 1 8 Average:
1.70 usr 3.40 sys 16.35 iowait 0.58 soft 31.89 guest 46.08 idle (~53.9% used)
```

### 5.12 Teardown state
```
$ qm list  -> 9000 spike-vm stopped
$ pct list -> 9001 spike-lxc stopped
# both present, both stopped (numbers can be re-checked)
```

---

## 6. Teardown — destroy commands (NOT run)

Both guests were left **stopped but present**. To remove them:

```bash
qm destroy 9000 --purge      # VM (also removes cloudinit + disks)
pct destroy 9001 --purge     # LXC
# optional spike artifacts on the host:
rm -f /var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
rm -f /root/spike-pubkey.pub /root/vm-install.sh
# (Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst)
```
@@ -0,0 +1,315 @@
# Phase 1 + 2 — Privilege Model & Backup/Restore Round-Trip: Findings

**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, node confirmed via
`pvesh get /nodes` → `demo-felhom`. Storage: `local` (dir, content
`iso,vztmpl,backup,import`), `local-lvm` (LVM-thin, `rootdir,images`).
**Subject:** LXC `9001` (`spike-lxc`, unprivileged, `nesting=1,keyctl=1`, Docker +
postgres/redis/nginx stack). **Date:** 2026-06-07.

> Data and observations only — **no recommendation or verdict**.

## Hypotheses — verdicts at a glance
| | Hypothesis | Result |
|---|---|---|
| **H1** | Backup scopes to one VMID; restore/create needs node/pool allocate → denied to narrow token | **CONFIRMED** (create CT = 403) |
| **H2** | An LXC vzdump captures the Docker volumes (they live in the container rootfs) | **CONFIRMED** (sentinel survived both restores) |
| **H3** | Crash-consistent (running) *and* quiesced (stopped) backups both restore cleanly | **CONFIRMED** (A via WAL recovery, B clean start) |
| **H4** | Running unprivileged LXC snapshots on LVM-thin; restored CT keeps unprivileged+nesting/keyctl | **CONFIRMED** (live snapshot OK; config survived) |

---

## 1. Phase 1 — Privilege model

### 1.1 Setup (operator side, root)
```
pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit"
pveum user add felhom-ctl@pve --comment "spike in-guest controller"
pveum user token add felhom-ctl@pve ctl --privsep 1     # secret: b6547d9d-... (ephemeral, spike-only)
pveum acl modify /vms/9001 -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup
```
Privilege names were verified against `PVEVMAdmin` / `PVEDatastoreUser` via
`pveum role list` first. **Note:** the reference doc's introspection command
`pveum role info <role>` **does not exist in PVE 9** — only `pveum role list` works.

### 1.2 ⚠️ Privsep gotcha — the doc's runbook is incomplete
With `--privsep 1`, a token's effective rights are the **intersection of the backing
user's permissions AND the token's own ACLs**. The reference doc (§3) grants ACLs to the
**token only**. With the user `felhom-ctl@pve` holding **no** permissions, the
intersection was **empty** — the first self-audit call returned:
```
HTTP 403 {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}
```
**Fix applied:** also grant the user the role on the same paths
(`pveum acl modify /vms/9001 -user felhom-ctl@pve -role FelhomSelfBackup`, same for
`/storage/local`). After that the self-calls succeeded. **A privsep token needs the
permission present on *both* the user and the token** (the token ACL is what keeps the
token ≤ user / narrowly scoped). This must be reflected in the controller provisioning.
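
The intersection rule in §1.2 can be modelled with plain sets. This is a deliberately simplified sketch: it treats ACLs as exact path matches and ignores propagation and inheritance, and the names `effective_privs`, `ROLE`, `token_acl` and `user_acl_fixed` are hypothetical, not part of any PVE API.

```python
# Privileges in the FelhomSelfBackup role, as listed in the setup above.
ROLE = frozenset({
    "VM.Audit", "VM.Snapshot", "VM.Backup",
    "Datastore.AllocateSpace", "Datastore.Audit",
})


def effective_privs(user_acl, token_acl, path):
    """Privileges a privsep token holds on `path`.

    A privilege counts only if BOTH the backing user and the token are
    granted it (simplification: exact path match, no propagation).
    """
    return user_acl.get(path, frozenset()) & token_acl.get(path, frozenset())


# Token ACLs as granted by the reference runbook.
token_acl = {"/vms/9001": ROLE, "/storage/local": ROLE}

# Before the fix: the user holds nothing, so the intersection is empty.
before_fix = effective_privs({}, token_acl, "/vms/9001")

# After the fix: the user holds the same role on the same paths.
user_acl_fixed = {"/vms/9001": ROLE, "/storage/local": ROLE}
after_fix = effective_privs(user_acl_fixed, token_acl, "/vms/9001")
```

In this model the empty `before_fix` set corresponds to the 403 on the first self-audit call, and any path missing from either ACL (for example `/vms/9000`) stays empty after the fix as well.
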

### 1.3 Test matrix (every call run from **inside** the unprivileged LXC, `pct exec 9001`)
`H=192.168.0.162 N=demo-felhom AUTH="PVEAPIToken=felhom-ctl@pve!ctl=<secret>"`

| # | Call | Expected | **Actual** | Notes |
|---|---|---|---|---|
| 1 | `GET /version` | 200 | **200** | reachable + auth from inside LXC (no privilege needed) |
| 2 | `GET /nodes/$N/lxc/9001/status/current` | 200 | **200**¹ | self audit (after privsep fix) |
| 3 | `POST /nodes/$N/lxc/9001/snapshot snapname=spk1` | 200/UPID→OK | **200, task exitstatus OK** | **running-LXC self-snapshot (H4)** |
| 4 | `POST /nodes/$N/vzdump vmid=9001 storage=local mode=snapshot` | 200/UPID→OK | **200, task exitstatus OK** | self backup, archive produced |
| 5 | `GET /nodes/$N/qemu/9000/status/current` | 403 | **403** | `Permission check failed (/vms/9000, VM.Audit)` |
| 6 | `POST /nodes/$N/vzdump vmid=9000 storage=local` | 403 | **200 POST → task exitstatus 403**² | see note |
| 7 | `POST /nodes/$N/lxc` (create CT) | 403 | **403** | `Permission check failed` — **proves create/allocate is operator-tier (H1)** |

¹ before the privsep fix this was 403; see §1.2.

² **Important nuance:** the `vzdump` endpoint accepts the POST and returns a UPID even for
an unauthorized vmid; the authorization failure surfaces at **task execution**, not at the
HTTP layer. Polled from root:
`exitstatus: "403 Permission check failed (/vms/9000, VM.Backup)"`, and **no 9000 archive
was created**. The boundary holds — but a controller must **poll the task exitstatus**, not
trust the POST's 200, to know a cross-guest backup was actually refused.
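
The rule in note ² (judge a backup by the task's `exitstatus`, not by the POST's 200) can be sketched as a small classifier. This is a minimal sketch under stated assumptions: `task_outcome` is a hypothetical helper, the input is assumed to be the task-status object with the `status` and `exitstatus` fields described in the reference doc, and treating every `exitstatus` other than `OK` as a failure is an assumption, not confirmed PVE behaviour.

```python
def task_outcome(task):
    """Classify a polled task-status object.

    Returns ("running", None) while the task has not stopped,
    ("ok", None) on exitstatus "OK", and ("failed", <exitstatus>) otherwise.
    """
    if task.get("status") != "stopped":
        return ("running", None)
    exitstatus = task.get("exitstatus", "")
    if exitstatus == "OK":
        return ("ok", None)
    return ("failed", exitstatus)


# The cross-guest vzdump from row 6: accepted with 200, refused at task level.
refused = task_outcome({
    "status": "stopped",
    "exitstatus": "403 Permission check failed (/vms/9000, VM.Backup)",
})
```

A controller loop would call the task-status endpoint repeatedly, pass each response's payload through `task_outcome`, and report success only on `"ok"`.
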
|
||||
|
||||
**Pass criteria met:** self-ops (1–4) succeed; cross-guest read (5), cross-guest backup
|
||||
(6, at task level), and create/allocate (7) are denied. The controller-as-guest boundary
|
||||
and the two-tier split are validated.
|
||||
|
||||
### 1.4 Final minimal role — `VM.PowerMgmt` **not** required
The doc's open question was: "does Tier A need `VM.PowerMgmt` for stop-mode backups? Likely
yes". **Tested and refuted:** a **stop-mode** self-vzdump submitted by the token
(`vmid=9001 mode=stop`) completed with **`exitstatus: OK`** using the role *without*
`VM.PowerMgmt`. `vzdump` performs the guest shutdown/restart internally under
`VM.Backup`; no separate power privilege is needed.

> **Final minimal role (`FelhomSelfBackup`) — satisfies self-audit, self-snapshot, and
> both `snapshot`- and `stop`-mode self-backup:**
> `VM.Audit, VM.Snapshot, VM.Backup, Datastore.AllocateSpace, Datastore.Audit`
> (`VM.PowerMgmt` deliberately omitted — confirmed unnecessary.)

### 1.5 TLS observation
From inside the LXC, `curl` **without** `-k`:
```
curl: (60) SSL certificate problem: unable to get local issuer certificate
```
The host serves the default self-signed PVE cert; all tests used `-k`. Production trust
(pin the PVE CA / issue a proper cert) is a separate design decision, flagged here.

### 1.6 Running-LXC snapshot (H4)
Call #3 snapshotted the **running** unprivileged LXC on LVM-thin (`exitstatus OK`).
`pct listsnapshot 9001` shows `spk1` with `pct status 9001 = running`. **No stop
required** — the snapshot-before-update rollback flow is viable on a live container.

---

## 2. Phase 2 — Backup → real restore round-trip

Sentinel written pre-flight into the `pgdata` volume:
`restore_check(42,'phase2-sentinel')` → clean read `42|phase2-sentinel`.

### 2.1 Backups (operator/root side)
| Variant | Mode | Stack state | Task time | Wall | Archive | Size (zstd) |
|---|---|---|---|---|---|---|
| **A — crash-consistent** | `snapshot` | **running** | 00:00:24 | 25 s | `vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst` | **934 MB** (979,718,569 B) |
| **B — quiesced** | `snapshot` | **stopped** (`docker compose stop`) | 00:00:21 | 22 s | `vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst` | **934 MB** (979,671,582 B) |

Both from a 2.5 GiB source (2,585,589,760 B for Variant A); zstd → ~934 MB (~2.6:1). The stack was restarted after
Variant B. **LXC snapshot-mode vzdump does *not* fsfreeze** (no guest agent in an LXC —
consistent with the Phase 0 finding) → Variant A is genuinely crash-consistent.

### 2.2 Restore → fresh VMID → boot → verify
| Check | 9002 (Variant A) | 9003 (Variant B) |
|---|---|---|
| Restore time (`pct restore … --storage local-lvm`) | **12 s** | **11 s** |
| `unprivileged: 1` survived | **yes** | **yes** |
| `features: nesting=1,keyctl=1` survived | **yes** | **yes** |
| Containers after boot | `exited` (no restart policy) → `docker compose up -d` | same |
| 3 containers healthy | **yes** | **yes** |
| `curl localhost:8080` | **HTTP 200** | **HTTP 200** |
| **Sentinel `(42,'phase2-sentinel')`** | **PRESENT** | **PRESENT** |
| Postgres first-start | **WAL crash recovery** (see below) | **clean start, no recovery** |

> Restored CTs inherit 9001's fixed `hwaddr`. To avoid a MAC clash with the still-running
> 9001 on `vmbr0`, `net0` was reset to auto-generate a fresh MAC before boot. All
> verification (stack health, `curl localhost`, sentinel) is guest-internal and needs no
> external network — and the Docker images are inside the restored rootfs, so no pulls.

**Variant A — Postgres automatic WAL recovery on 9002 (verbatim, post-restore boot):**
```
LOG: database system was interrupted; last known up at 2026-06-07 18:13:21 UTC
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/CB12838
LOG: invalid record length at 0/CB12870: expected at least 24, got 0 # normal end-of-WAL
LOG: redo done at 0/CB12838 ...
LOG: checkpoint starting: end-of-recovery immediate wait
LOG: database system is ready to accept connections
```
**Variant B — clean start on 9003 (verbatim, post-restore boot):**
```
LOG: database system was shut down at 2026-06-07 18:14:39 UTC
LOG: database system is ready to accept connections
```

**H2 confirmed:** one LXC vzdump captured the whole customer including the Docker named
volume — the sentinel data restored in both guests. **H3 confirmed:** both variants
restored to a bootable guest with intact data; the crash-consistent one recovered via WAL
with no manual intervention, the quiesced one started clean. **H4 confirmed:** restored
config preserved `unprivileged` + `nesting/keyctl`, so Docker ran in the restored CT.

---

## 3. Observations & confounds
1. **Privsep token needs perms on user *and* token** (§1.2) — the single most important
   correction to the reference runbook; without it every scoped call 403s.
2. **vzdump authorization is task-level, not POST-level** (§1.3 note ²) — a 200 + UPID
   does **not** mean authorized. The controller must poll `exitstatus`. This is also the
   general async-task lesson: every backup/snapshot/restore returns a UPID and the real
   result is in the task status.
3. **`pveum role info` is gone in PVE 9** — use `pveum role list`. Minor doc drift.
4. **`VM.PowerMgmt` not needed for stop-mode backup** (§1.4) — narrower role than the doc
   assumed.
5. **No fsfreeze for LXC** — Variant A relied on Postgres's own WAL crash recovery, which
   worked here for an idle-at-backup DB. Under heavy write load, app-consistency for LXC
   still rests on the controller quiescing first (or stop-mode), exactly as the reference
   warned. This single test is not a durability guarantee under load.
6. **Restore MAC collision** (§2.2) — `pct restore` preserves the source `hwaddr`;
   restoring while the original runs needs a MAC reset (or the original stopped). The
   controller's restore flow must handle identity (MAC/hostname/IP) to avoid clashes.
7. **No restart policy on the compose services** — restored containers came up `exited`;
   `docker compose up -d` (or a restart policy / systemd unit) is required for the stack
   to return automatically after a restore or guest reboot.
8. **Restore is fast, backup dominated by I/O** — restores were 11–12 s (extract at
   ~524 MiB/s); backups ~22–25 s (read 2.5 GiB at ~108–119 MiB/s + zstd). Single runs,
   idle host, ~150 MB DB; not a throughput benchmark.
9. **Sequencing artifact:** a Phase-1 stop-mode self-backup ran before Phase 2 and
   stopped/started 9001; the stack was brought back up and the sentinel re-verified
   before the Variant A/B backups, so it does not affect the round-trip results.

---

## 4. Raw command log (appendix)

### 4.1 Pre-flight
```
$ pvesh get /nodes -> node: demo-felhom
$ cat /etc/pve/storage.cfg
dir: local ... content iso,vztmpl,backup,import # 'backup' present
lvmthin: local-lvm ... content rootdir,images # no backup (expected)
$ pct start 9001 ; docker compose up -d -> 3 containers Started
$ curl localhost:8080 -> HTTP 200
# sentinel:
CREATE TABLE ; INSERT 0 1 ; SELECT count -> 1 ; SELECT * -> 42 | phase2-sentinel
```

### 4.2 Phase 1 — role/user/token/ACL
```
$ pveum role add FelhomSelfBackup -privs "VM.Audit VM.Snapshot VM.Backup Datastore.AllocateSpace Datastore.Audit" -> role-ok
$ pveum user add felhom-ctl@pve --comment "spike in-guest controller" -> user-ok
$ pveum user token add felhom-ctl@pve ctl --privsep 1
{"full-tokenid":"felhom-ctl@pve!ctl","info":{"privsep":"1"},"value":"b6547d9d-08ec-4f22-beb8-a551dc2cd69d"}
$ pveum acl modify /vms/9001 -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup -> ok
$ pveum acl modify /storage/local -token 'felhom-ctl@pve!ctl' -role FelhomSelfBackup -> ok
$ pveum role list | grep FelhomSelfBackup
FelhomSelfBackup | Datastore.AllocateSpace,Datastore.Audit,VM.Audit,VM.Backup,VM.Snapshot
$ pveum role info FelhomSelfBackup -> ERROR: unknown command 'pveum role info' # PVE9 has no 'role info'
```

### 4.3 Phase 1 — matrix (from inside LXC)
```
# TLS without -k:
curl: (60) SSL certificate problem: unable to get local issuer certificate

# BEFORE privsep fix:
#2 GET self status -> HTTP 403 {"message":"Permission check failed (/vms/9001, VM.Audit)\n"}

# privsep fix:
$ pveum acl modify /vms/9001 -user 'felhom-ctl@pve' -role FelhomSelfBackup -> ok
$ pveum acl modify /storage/local -user 'felhom-ctl@pve' -role FelhomSelfBackup -> ok

# AFTER fix:
#1 GET /version -> HTTP 200
#2 GET /nodes/.../lxc/9001/status/current -> HTTP 200 {"data":{...,"status":"running",...}}
#5 GET /nodes/.../qemu/9000/status/current -> HTTP 403 (/vms/9000, VM.Audit)
#6 POST vzdump vmid=9000 -> HTTP 200 {"data":"UPID:...vzdump:9000:felhom-ctl@pve!ctl:"}
root poll: exitstatus="403 Permission check failed (/vms/9000, VM.Backup)"
task log: TASK ERROR: 403 Permission check failed (/vms/9000, VM.Backup)
/var/lib/vz/dump: no 9000 archive created
#7 POST /nodes/.../lxc (create CT vmid=9009) -> HTTP 403 {"message":"Permission check failed\n"}

#3 POST lxc/9001/snapshot snapname=spk1 -> HTTP 200 UPID:...vzsnapshot:9001...
root: exitstatus "OK" ; pct listsnapshot 9001 -> spk1 ; pct status 9001 -> running
#4 POST vzdump vmid=9001 storage=local mode=snapshot -> HTTP 200 UPID:...vzdump:9001...
root: exitstatus "OK"
token can read own task status: HTTP 200 {"...exitstatus":"OK"} # earlier poll TIMEOUTs were a shell-quoting bug in the helper, not a perms issue

# stop-mode self-backup (VM.PowerMgmt test):
$ token POST vzdump vmid=9001 storage=local mode=stop -> HTTP 200 UPID:...vzdump:9001...
root poll: exitstatus "OK" # SUCCEEDED without VM.PowerMgmt in the role
```

### 4.4 Phase 2 — backups
```
# Variant A (running):
$ vzdump 9001 --mode snapshot --storage local --compress zstd
INFO: Total bytes written: 2585589760 (2.5GiB, 108MiB/s)
INFO: archive file size: 934MB
INFO: Finished Backup of VM 9001 (00:00:24) ; WALL_SECONDS=25
-> vzdump-lxc-9001-2026_06_07-20_13_43.tar.zst (979718569 B)

# Variant B (stopped):
$ docker compose stop (cache,db,web Stopped)
$ vzdump 9001 --mode snapshot --storage local --compress zstd
INFO: Total bytes written: 2585825280 (2.5GiB, 119MiB/s)
INFO: Finished Backup of VM 9001 (00:00:21) ; WALL_SECONDS=22
-> vzdump-lxc-9001-2026_06_07-20_14_40.tar.zst (979671582 B)
$ docker compose start (db,cache,web Started)
```

### 4.5 Phase 2 — restores + verification
```
# A -> 9002:
$ pct restore 9002 .../20_13_43.tar.zst --storage local-lvm
Total bytes read: 2585589760 (2.5GiB, 524MiB/s) ; RESTORE_A_SECONDS=12
$ pct config 9002 -> features: nesting=1,keyctl=1 ; unprivileged: 1
$ pct set 9002 -net0 name=eth0,bridge=vmbr0,ip=dhcp # fresh MAC BC:24:11:E3:F4:64
$ pct start 9002 ; docker compose up -d -> 3 running ; curl -> HTTP 200
$ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
db log: "was interrupted ... not properly shut down; automatic recovery in progress
redo starts/redo done ... database system is ready to accept connections"

# B -> 9003:
$ pct restore 9003 .../20_14_40.tar.zst --storage local-lvm
Total bytes read: 2585825280 (2.5GiB, 524MiB/s) ; RESTORE_B_SECONDS=11
$ pct config 9003 -> features: nesting=1,keyctl=1 ; unprivileged: 1
$ pct set 9003 -net0 ... (fresh MAC) ; pct start 9003 ; docker compose up -d -> 3 running ; curl 200
$ psql SELECT * FROM restore_check -> 42 | phase2-sentinel
db log: "database system was shut down at ... ; database system is ready to accept connections" # clean
```

---

## 5. Teardown (executed)
Restore targets destroyed; Phase 1 objects and spike artifacts removed; `9000`/`9001`
left **stopped-but-present**. Verified clean: `felhom-ctl@pve` deleted, no spike ACLs,
empty `dump/`, `spk1` removed.

> **Correction:** `pveum acl delete` **requires `--roles`** (a bare `-user`/`-token`
> path errors `400 roles: property is missing`). In practice the explicit ACL deletes
> are unnecessary — deleting the token/user/role **auto-invalidates** the referencing
> ACLs (PVE logs `ignore invalid acl token …` and drops them).

```bash
pct stop 9002 ; pct stop 9003 ; pct destroy 9002 --purge ; pct destroy 9003 --purge
# correct ACL-delete syntax (needs --roles), or just let user/role deletion clean them:
pveum acl delete /vms/9001 --roles FelhomSelfBackup --users 'felhom-ctl@pve'
pveum acl delete /vms/9001 --roles FelhomSelfBackup --tokens 'felhom-ctl@pve!ctl'
pveum acl delete /storage/local --roles FelhomSelfBackup --users 'felhom-ctl@pve'
pveum acl delete /storage/local --roles FelhomSelfBackup --tokens 'felhom-ctl@pve!ctl'
pveum user token remove felhom-ctl@pve ctl ; pveum user delete felhom-ctl@pve ; pveum role delete FelhomSelfBackup
pct delsnapshot 9001 spk1
rm -f /var/lib/vz/dump/vzdump-lxc-9001-*.tar.zst /var/lib/vz/dump/vzdump-lxc-9001-*.log
pct stop 9001 # back to stopped-but-present
```

## 6. To destroy 9000/9001 later (NOT run — left stopped-but-present)
```bash
qm destroy 9000 --purge # VM (Phase 0 subject)
pct destroy 9001 --purge # LXC (Phase 0/1/2 subject)
# Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst
```
@@ -0,0 +1,234 @@
# Phase 3 — vzdump exclusion (B2) & agent operator role + root boundary (B3): Findings

**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, node confirmed via
`pvesh get /nodes` → `demo-felhom`. **Date:** 2026-06-08. Throwaway resources (VMIDs
9010-9023, role/user `FelhomAgent`/`felhom-agent@pve`); all torn down (only the pre-existing
9000/9001 remain, stopped). Every Proxmox op polled to `task exitstatus` (not the POST
return).

> Validates the two items the design review (`_design-review.md`) flagged as unvalidated:
> **B2** (what vzdump includes/excludes per LXC mount type + how to keep bulk out) and **B3**
> (the least-privilege operator role + the root-vs-API boundary). Data only.

---

## B2 — vzdump inclusion/exclusion matrix

**Setup:** one unprivileged LXC `9010` (`nesting=1,keyctl=1`, overlayfs), Docker 29.5.3
installed, with six sentinel locations:

| # | location | config |
|---|---|---|
| 1 | rootfs file `/SENTINEL_ROOTFS` | rootfs (`local-lvm:8`) |
| 2 | Docker **named** volume `b2vol` → `SENTINEL_DOCKERVOL` | default driver |
| 3 | `mp1` volume mount `/mnt/mp1` → `SENTINEL_MP1` | `local-lvm:1,backup=1` |
| 4 | `mp2` volume mount `/mnt/mp2` → `SENTINEL_MP2` | `local-lvm:1,backup=0` |
| 5 | `mp3` **bind** mount `/mnt/mp3` → `SENTINEL_MP3` | host `/root/b2-bindsrc` |
| 6 | bulk Docker vol `bulkvol` bound onto mp2 → `SENTINEL_BULK` | `--driver local -o type=none -o o=bind -o device=/mnt/mp2` |

**The "trap" confirmed at setup:** the Docker named volume's on-disk path is
`/var/lib/docker/volumes/b2vol/_data` — **inside the LXC rootfs**.

### Result matrix (stop-mode vzdump → `local`, verified 3 ways: vzdump log, archive grep, restore to 9011)

| Sentinel | location | flag | **in archive?** | restored 9011 |
|---|---|---|---|---|
| `SENTINEL_ROOTFS` | rootfs | — | **INCLUDED** | present |
| `SENTINEL_DOCKERVOL` | Docker named vol (in rootfs) | — | **INCLUDED** ⚠️ the trap | present |
| `SENTINEL_MP1` | volume mp | `backup=1` | **INCLUDED** | present |
| `SENTINEL_MP2` | volume mp | `backup=0` | **EXCLUDED** | absent (vol recreated empty) |
| `SENTINEL_MP3` | bind mount | n/a | **EXCLUDED** | reappears via re-bind only¹ |
| `SENTINEL_BULK` | Docker vol on mp2 | `backup=0` | **EXCLUDED** | absent |

¹ The bind-mount **data is not in the archive** (archive grep shows no mp3 path). It
reappears in the restored 9011 only because `pct restore` preserves the bind config
`mp3: /root/b2-bindsrc` and re-attaches the **same host dir**. On a *different* host (true DR)
the bind data would be gone unless backed up separately — important for DR planning.

**vzdump log (verbatim) — the authoritative per-mount decision:**
```
INFO: including mount point rootfs ('/') in backup
INFO: including mount point mp1 ('/mnt/mp1') in backup
INFO: excluding volume mount point mp2 ('/mnt/mp2') from backup (disabled)
INFO: excluding bind mount point mp3 ('/mnt/mp3') from backup (not a volume)
```
**Archive contents (verbatim) — `tar --zstd -tf … | grep SENTINEL`:**
```
./var/lib/docker/volumes/b2vol/_data/SENTINEL_DOCKERVOL
./SENTINEL_ROOTFS
./mnt/mp1/SENTINEL_MP1
```
**Restore verification (verbatim) — sentinels in restored 9011:**
```
PRESENT : /SENTINEL_ROOTFS
PRESENT : /var/lib/docker/volumes/b2vol/_data/SENTINEL_DOCKERVOL
PRESENT : /mnt/mp1/SENTINEL_MP1
ABSENT : /mnt/mp2/SENTINEL_MP2
ABSENT : /mnt/mp2/SENTINEL_BULK
PRESENT : /mnt/mp3/SENTINEL_MP3 # via re-bind to same host dir, NOT from archive
```

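The four vzdump log lines above reduce to a small decision rule that a placement component could encode. The Go sketch below restates only what this test observed; the `mount` struct and `includedInVzdump` function are hypothetical names, and treating an unset `backup` flag like `backup=0` is an assumption, since only explicit `0` and `1` were tested here.

```go
package main

import "fmt"

// mount describes one LXC mount entry as it matters to vzdump.
// The struct and its field names are hypothetical, for illustration only.
type mount struct {
	Key    string // "rootfs", "mp1", ...
	Bind   bool   // true when the source is a host path (bind mount)
	Backup bool   // backup= flag; unset is treated as false (assumption)
}

// includedInVzdump reproduces the outcomes logged above and returns the
// decision together with a short reason.
func includedInVzdump(m mount) (bool, string) {
	switch {
	case m.Key == "rootfs":
		return true, "rootfs is always included"
	case m.Bind:
		return false, "not a volume"
	case !m.Backup:
		return false, "disabled"
	default:
		return true, "volume mount point with backup=1"
	}
}

func main() {
	for _, m := range []mount{
		{Key: "rootfs"},
		{Key: "mp1", Backup: true},
		{Key: "mp2", Backup: false},
		{Key: "mp3", Bind: true},
	} {
		ok, why := includedInVzdump(m)
		fmt.Printf("%s included=%t (%s)\n", m.Key, ok, why)
	}
}
```

Note that a Docker named volume never appears as its own entry in this model: it lives inside `rootfs`, which is why it is always captured.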
### Proven bulk-exclusion recipe
A "bulk" Docker volume is kept out of the guest vzdump by binding it onto a **volume
mountpoint with `backup=0`**:
1. Attach a Proxmox volume mountpoint with the flag:
   `pct set <id> -mpN <storage>:<size>,mp=/mnt/bulk,backup=0`
2. Realize the Docker volume on that path:
   `docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk bulkvol`
   (or a compose bind to `/mnt/bulk`).
3. Data written through `bulkvol` lands on the `backup=0` mountpoint → **excluded** from
   vzdump, while rootfs/hot sentinels are **included**. Verified: `SENTINEL_BULK` absent from
   archive and restore; `SENTINEL_ROOTFS` present.

### The trap, stated for the placement component
`backup=<boolean>` is **only honoured for volume mount points** (confirmed: pct manpage +
vzdump log "excluding volume mount point … (disabled)"). A Docker **named volume uses the
default driver and lands in the rootfs**, which is **always backed up** — so a "bulk" volume
left as an ordinary named volume is **silently swept into the whole-guest image**. The
per-volume placement component **must** realize every `bulk` volume as a dedicated `backup=0`
mountpoint (or external bind mount), never a default named volume.

---

## B3 — agent operator role + root-vs-API boundary

**Caveat applied (Phase 1):** privsep token needs the role on **both** user and token. Setup:
user `felhom-agent@pve` + privsep token `agent`, role `FelhomAgent`, dual-granted at `/`.
All ops driven **as the token** via the REST API; task `exitstatus` polled.

> ⚠️ **Terminology:** the Phase-1 `FelhomSelfBackup` role is the discarded **guest-side
> self-backup** role (scoped to one guest, *denied* create/allocate). `FelhomAgent` here is
> its **operator-tier replacement** — a different, broader role. Do not conflate.

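The Phase 1 caveat is easiest to remember as a set intersection: a privilege-separated token can use only what both its owning user and the token itself have been granted on a path. The Go sketch below is a simplified model of that rule, not Proxmox's implementation; `effectivePrivs` is a hypothetical name, and ACL propagation and multiple roles are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// effectivePrivs models a privsep token's usable privileges on one path:
// those granted to both the owning user and the token (simplified model).
func effectivePrivs(user, token []string) []string {
	granted := make(map[string]bool, len(user))
	for _, p := range user {
		granted[p] = true
	}
	var out []string
	for _, p := range token {
		if granted[p] {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	role := []string{"VM.Audit", "VM.Backup", "VM.Snapshot"}
	// ACL on the token only: empty intersection, so every call is a 403.
	fmt.Println(len(effectivePrivs(nil, role)))
	// ACL on both user and token: the full role is usable.
	fmt.Println(len(effectivePrivs(role, role)))
}
```

This is why the setup here grants `FelhomAgent` twice at `/`, once with `-user` and once with `-token`.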
### Op matrix (as the scoped token)

| # | Operation | API call | Result |
|---|---|---|---|
| read | host status | `GET /nodes/$N/status` | **200** (needs `Sys.Audit`) |
| read | storage list | `GET /storage` | **200** (`Datastore.Audit`) |
| 1 | **create LXC, `nesting=1,keyctl=1`** | `POST /nodes/$N/lxc` | **403** — `changing feature flags (except nesting) is only allowed for root@pam` |
| 1′ | create LXC, **nesting-only** | `POST /nodes/$N/lxc` | **200 / OK** |
| 2 | set config (mem/cpu/options + mountpoint w/ `backup` flag) | `PUT /nodes/$N/lxc/<id>/config` | **200** |
| 3 | allocate volume | `POST /nodes/$N/storage/local-lvm/content` | **200** (`Datastore.AllocateSpace`) |
| 4 | start | `POST …/status/start` | **OK** (`VM.PowerMgmt`) |
| 5 | stop | `POST …/status/stop` | **OK** |
| 6a | snapshot | `POST …/snapshot` | **OK** (`VM.Snapshot`) |
| 6b | rollback | `POST …/snapshot/s1/rollback` | **OK** (`VM.Snapshot.Rollback`) |
| 7 | stop-mode backup | `POST /nodes/$N/vzdump mode=stop` | **OK** (`VM.Backup`) |
| 8 | restore → fresh vmid | `POST /nodes/$N/lxc restore=1` | **OK** — and **restored CT kept `features: nesting=1,keyctl=1`** |
| 9 | destroy CT | `DELETE /nodes/$N/lxc/<id>?purge=1` | **OK** (`VM.Allocate`) |
| 9b | add storage definition (dir) | `POST /storage` | **200** (`Datastore.Allocate`, **no root**) |

**The two headline results:**
1. **`keyctl=1` on create is `root@pam`-only.** Verbatim:
   `Permission check failed (changing feature flags (except nesting) is only allowed for root@pam)`.
   Confirmed this is **not** token-fixable: a **non-privsep `root@pam` token** got the **same
   403**. Only an actual `root@pam` session (OS root / `pct create` as root) can set it.
   `nesting` alone is allowed for a scoped token.
2. **Restore preserves `keyctl`.** A token-authorized `vzrestore` of a keyctl archive produced
   `9021` with `features: nesting=1,keyctl=1, unprivileged: 1`. So the **DR/restore path is
   fully token-covered**; only *fresh provisioning* needs root for the keyctl flag.

### Paring (each drop shown to still pass, or proven needed)

| Privilege | Verdict | Evidence |
|---|---|---|
| `Datastore.AllocateTemplate` | **DROP** (unnecessary) | create-from-template succeeded without it (200/OK) |
| `Sys.Audit` | **KEEP** | `GET /nodes/$N/status` → **403** without it (host metrics, `03` §5) |
| `VM.Config.Network` | **KEEP** | create with `net0` → **403 (/vms/…, VM.Config.Network)** without it |
| `VM.Config.Options` | **KEEP** | config `onboot=1` → **403 (/vms/…, VM.Config.Options)** without it |
| `SDN.Use` | **KEEP (added vs review sketch)** | create → **403 (/sdn/zones/localnetwork/vmbr0, SDN.Use)** without it |

> Corrections to the review's candidate sketch: `VM.Config.CPUMemory` is **not a real
> privilege** — split into `VM.Config.CPU` + `VM.Config.Memory`. `SDN.Use` was **missing** and
> is **required** (PVE 9 gates bridge use behind it). `Datastore.AllocateTemplate` is **not
> needed**.

### Final minimal `FelhomAgent` role (proven sufficient for ops 1′–9b)
```
VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU VM.Config.Memory
VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback
VM.Backup Datastore.Allocate Datastore.AllocateSpace Datastore.Audit Sys.Audit SDN.Use
```
(16 privileges. `Datastore.Allocate` is for the storage-definition add; drop it if the agent
never creates Proxmox storage entries via the API. `VM.PowerMgmt` is for start/stop lifecycle
— not for the backup itself, consistent with `proxmox-platform.md` §3.4.)

### Root-vs-API boundary table (answers `03` §3)

| Agent host operation | Coverage | Notes |
|---|---|---|
| Create unprivileged LXC, **nesting-only** | **API token** | `VM.Allocate`+`VM.Config.*`+`Datastore.AllocateSpace`+`SDN.Use` |
| **Create with `keyctl=1` (Docker needs it — Phase 0)** | **OS root `root@pam`** (`pct create` as root / sudoers) | no API token works, incl. a root@pam token |
| Set config (mem/cpu/net/options/mountpoint + `backup` flag) | API token | |
| Allocate guest volume | API token | `Datastore.AllocateSpace` |
| Start / stop / snapshot / rollback | API token | `VM.PowerMgmt` / `VM.Snapshot(.Rollback)` |
| vzdump backup (stop/snapshot mode) | API token | `VM.Backup` |
| **Restore from vzdump (preserves keyctl)** | **API token** | DR path needs no root |
| Destroy guest (scratch + compensating rollback, B1) | API token | `VM.Allocate` |
| Add Proxmox **storage definition** (dir/nfs/cifs/pbs) | API token | `Datastore.Allocate`; the *definition* only |
| Host status / metrics report | API token | `Sys.Audit` |
| **USB physical mount-by-UUID / systemd mount unit / fstab** | **OS root / narrow sudoers** | not a Proxmox API op (host-level mount; not tested here) |
| **SMART / hardware sensors** | OS root | not API-exposed |

**Boundary summary:** nearly the entire guest lifecycle — including **restore** — is covered
by the scoped token. The genuine OS-root residual is narrow: **(1) fresh creation of a
Docker-capable LXC (the `keyctl` flag), (2) physical USB mount-by-UUID / systemd mount units /
fstab, (3) hardware/SMART.** This supports `03` §3's "non-root service + scoped token + narrow
sudoers" model — with the **specific** sudoers/root entries being: `pct create` (or just the
keyctl-setting step) and the host mount operations.

---

## Raw command log (appendix)

### B2
```
pct create 9010 ... --features nesting=1,keyctl=1 --unprivileged 1 # rootfs local-lvm:8
pct set 9010 -mp1 local-lvm:1,mp=/mnt/mp1,backup=1
pct set 9010 -mp2 local-lvm:1,mp=/mnt/mp2,backup=0
pct set 9010 -mp3 /root/b2-bindsrc,mp=/mnt/mp3
# docker named vol: docker volume inspect b2vol -> /var/lib/docker/volumes/b2vol/_data
# bulk: docker volume create --driver local -o type=none -o o=bind -o device=/mnt/mp2 bulkvol
vzdump 9010 --mode stop --storage local --compress zstd
# INFO: including mount point rootfs ('/') in backup
# INFO: including mount point mp1 ('/mnt/mp1') in backup
# INFO: excluding volume mount point mp2 ('/mnt/mp2') from backup (disabled)
# INFO: excluding bind mount point mp3 ('/mnt/mp3') from backup (not a volume)
tar --zstd -tf <archive> | grep SENTINEL # -> rootfs, dockervol, mp1 only
pct restore 9011 <archive> --storage local-lvm # -> mp2/bulk absent, mp3 via re-bind
```

### B3
```
pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace Datastore.AllocateTemplate Datastore.Audit Sys.Audit" # candidate (pre-SDN)
pveum user add felhom-agent@pve ; pveum user token add felhom-agent@pve agent --privsep 1
pveum acl modify / -user 'felhom-agent@pve' -role FelhomAgent
pveum acl modify / -token 'felhom-agent@pve!agent' -role FelhomAgent

# token create with keyctl:
POST /nodes/demo-felhom/lxc ... features=nesting=1,keyctl=1
-> 403 "changing feature flags (except nesting) is only allowed for root@pam"
# + SDN.Use missing initially:
-> 403 "Permission check failed (/sdn/zones/localnetwork/vmbr0, SDN.Use)"
# root@pam non-privsep token, keyctl create:
-> 403 (same "only allowed for root@pam") # tokens never qualify

# token nesting-only create / config(PUT) / start / stop / snapshot / rollback /
# vzdump(stop) / restore->9021 (kept keyctl) / destroy / POST /storage -> all 200/OK

# paring:
GET /nodes/$N/status without Sys.Audit -> 403 (KEEP)
create net0 without VM.Config.Network -> 403 (KEEP)
config onboot=1 without VM.Config.Options -> 403 (KEEP)
create from template without Datastore.AllocateTemplate -> OK (DROP)
```

### Teardown
```
pct destroy 9010 9011 9021 --purge # 9020/9022/9023 already destroyed during tests
pveum user token remove felhom-agent@pve agent ; pveum user delete felhom-agent@pve
pveum role delete FelhomAgent # ACLs at / auto-invalidated
rm -f /var/lib/vz/dump/vzdump-lxc-9010-* /var/lib/vz/dump/vzdump-lxc-9020-*
# verified: only 9000/9001 remain (stopped-but-present); no felhom-agent user/role; dump dir empty
```
@@ -0,0 +1,257 @@
# Phase 4 — Control-plane signing primitive (SSHSIG + Go verify): Findings

**Where run:** build server `192.168.0.180` (Debian 13, **Go 1.24.4**, **OpenSSH 10.0p2**),
no Proxmox. **Date:** 2026-06-08. Throwaway key generated, used, and **deleted** — no private
key, passphrase, or `.sig` committed.

> De-risks the signing primitive *before* it is written into `04-control-plane-authorization.md`
> or the agent's verify code. **Verdict up front: the approach works cleanly and is
> key-type-agnostic — no fallback needed.** Go verifies the armored `SSHSIG` format, every
> tamper/replay/authorization case is rejected, and a synthetic FIDO2 `sk-ssh-ed25519` signature
> verifies through the **unchanged** code path (true hardware drop-in).

---

## 0. Result at a glance — 14/14 checks pass

```
== Step 2: SSHSIG signature verification (key-type-agnostic path) ==
PASS correct verified, op="guest_destroy"
PASS wrong key rejected: signer not in allowed set
PASS tampered blob rejected: signature invalid: ssh: signature did not verify
PASS wrong namespace rejected: namespace mismatch: got "felhom-op-wrong" want "felhom-op-v1"

== Step 3: anti-replay / authorization (valid signature, still rejected) ==
PASS first use verified, op="guest_destroy"
PASS replay (same nonce) rejected: replay: nonce a1b2c3d4...8f90 already seen
PASS expired rejected: expired (expires_at=2020-01-02 ..., now=2026-06-08 ...)
PASS not-yet-valid rejected: not yet valid (issued_at=2030-01-01 ...)
PASS retargeted host rejected: target mismatch: blob=demo-felhom/9001 this=other-host/9001
PASS retargeted guest rejected: target mismatch: blob=demo-felhom/9001 this=demo-felhom/8888

== Step 4: key-type-agnosticism — FIDO2 sk-ssh-ed25519 (synthetic, no device) ==
PASS parses sk pubkey type="sk-ssh-ed25519@openssh.com"
PASS authorized_keys form sk-ssh-ed25519@openssh.com AAAAGnNrLXNzaC1lZDI1NTE5...
PASS sk end-to-end verify verified, op="guest_destroy"
```

---

## 1. Software round-trip (baseline, CLI)

- Key: `ssh-keygen -t ed25519 -f felhom-op -N '<passphrase>' -C felhom-operator`.
  (Signing non-interactively used an `SSH_ASKPASS` helper + `setsid -w`; in production the
  operator key lives behind an agent or a FIDO2 device, so the at-sign passphrase prompt is a
  non-issue. The passphrase mechanics are **not** what this spike de-risks.)
- Sign with a **domain-separated namespace**:
  `ssh-keygen -Y sign -f felhom-op -n felhom-op-v1 blob.json` → `blob.json.sig`
  (armored `-----BEGIN SSH SIGNATURE-----`).
- Baseline verify (CLI sanity) with an allow-list:
  ```
  allowed_signers: felhom-operator namespaces="felhom-op-v1" ssh-ed25519 AAAAC3...
  $ ssh-keygen -Y verify -f allowed_signers -I felhom-operator -n felhom-op-v1 \
      -s blob.json.sig < blob.json
  Good "felhom-op-v1" signature for felhom-operator with ED25519 key SHA256:y0Lj8dIYTM6...
  ```

## 2. Canonical op blob spec (documented)
|
||||
|
||||
The signature covers **these exact bytes**; the operator CLI (also Go) must reproduce them
|
||||
byte-for-byte. **Canonical form: JSON, keys sorted lexicographically at every level, no
|
||||
insignificant whitespace, no trailing newline, UTF-8.**
|
||||
|
||||
```json
|
||||
{"expires_at":"<RFC3339 UTC>","issued_at":"<RFC3339 UTC>","key_id":"<id>","nonce":"<128-bit hex>","op":"<op>","params":{...},"target":{"guest_id":"<vmid>","host_id":"<node>"}}
|
||||
```
|
||||
|
||||
| field | meaning |
|
||||
|---|---|
|
||||
| `op` | the operation, e.g. `guest_destroy`, `storage_detach`, `restore_overwrite` |
|
||||
| `target.host_id` / `target.guest_id` | the box + guest the op is bound to (anti-retarget) |
|
||||
| `params` | op-specific arguments (themselves canonical-sorted) |
|
||||
| `nonce` | unique per op (anti-replay); ≥128-bit random |
|
||||
| `issued_at` / `expires_at` | validity window (short — minutes) |
|
||||
| `key_id` | which operator key (for rotation / audit) |
|
||||
|
||||
Exact test blob (236 bytes): `{"expires_at":"2026-06-09T00:00:00Z","issued_at":"2026-06-08T00:00:00Z","key_id":"felhom-op-1","nonce":"a1b2c3d4e5f60718293a4b5c6d7e8f90","op":"guest_destroy","params":{"purge":true},"target":{"guest_id":"9001","host_id":"demo-felhom"}}`
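
One way the operator CLI could reproduce the canonical form is sketched below, using only Go's standard library. `Canonicalize`, `testBlob` and `shuffledBlob` are hypothetical names introduced for illustration and are not part of the spike code. The sketch relies on `encoding/json` emitting map keys in sorted order, and it assumes both sides use the same routine; strings that need escaping could still be encoded differently by another JSON library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// testBlob is the exact test blob quoted above (236 bytes).
const testBlob = `{"expires_at":"2026-06-09T00:00:00Z","issued_at":"2026-06-08T00:00:00Z","key_id":"felhom-op-1","nonce":"a1b2c3d4e5f60718293a4b5c6d7e8f90","op":"guest_destroy","params":{"purge":true},"target":{"guest_id":"9001","host_id":"demo-felhom"}}`

// shuffledBlob carries the same fields in a different order, with extra whitespace.
const shuffledBlob = `{"target": {"host_id": "demo-felhom", "guest_id": "9001"}, "params": {"purge": true}, "op": "guest_destroy", "nonce": "a1b2c3d4e5f60718293a4b5c6d7e8f90", "key_id": "felhom-op-1", "issued_at": "2026-06-08T00:00:00Z", "expires_at": "2026-06-09T00:00:00Z"}`

// Canonicalize (hypothetical helper) re-encodes JSON with keys sorted at
// every level, no insignificant whitespace and no trailing newline.
func Canonicalize(raw []byte) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numeric literals as written instead of float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // do not rewrite <, > and & as \u00XX escapes
	if err := enc.Encode(v); err != nil { // objects decode to maps; map keys are emitted sorted
		return nil, err
	}
	// Encode appends a newline, which the canonical form forbids.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

func main() {
	out, err := Canonicalize([]byte(shuffledBlob))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out), string(out) == testBlob)
}
```

The signature is computed over the bytes this routine returns, so the operator CLI would sign `out` rather than whatever formatting the blob had on input.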
|
||||
|
||||
> Note: the SSHSIG **namespace** (`felhom-op-v1`) is the cryptographic domain separator and is
|
||||
> a **fixed constant in the verifier**, never caller-supplied — a signature minted for any
|
||||
> other namespace must not verify (proven: "wrong namespace" rejected).
|
||||
|
||||
## 3. Go SSHSIG verify — approach + implementation cost
|
||||
|
||||
**It is not a one-call verify, but it is clean — no hand-rolled crypto.** The only manual work
|
||||
is SSHSIG *framing*; all crypto and key-type dispatch is the library's. Steps:
|
||||
|
||||
1. `pem.Decode` the armor → `block.Type == "SSH SIGNATURE"`, `block.Bytes` is the binary SSHSIG.
|
||||
*(Go's `encoding/pem` parses the armor directly — no manual base64/line handling.)*
|
||||
2. Strip the literal 6-byte `SSHSIG` magic preamble (it is **not** length-prefixed).
|
||||
3. `ssh.Unmarshal` the rest into a struct `{Version uint32; PublicKey, Namespace, Reserved,
|
||||
HashAlgo, Signature string}` — library does the SSH wire parsing.
|
||||
4. `ssh.ParsePublicKey([]byte(PublicKey))` → an `ssh.PublicKey`.
|
||||
5. Recompute the signed data per spec: `"SSHSIG" || string(namespace) || string(reserved) ||
|
||||
string(hash_algorithm) || string(H(message))`, where `H` is the **named** hash
|
||||
(`sha256`/`sha512`) — built with one `ssh.Marshal`.
|
||||
6. `ssh.Unmarshal([]byte(Signature))` into `ssh.Signature`, then **`pub.Verify(signed, &sig)`** —
|
||||
which **dispatches on the key's own algorithm** (this is what makes it key-agnostic).
|
||||
|
||||
**Cost verdict:** ~40 lines of framing in one file, zero crypto implemented by us. Well within
|
||||
the agent's budget; **no reason to fall back** to a different primitive.
|
||||
|
||||
## 4. Anti-replay / authorization layer (on top of signature validity)
|
||||
|
||||
Enforced in `VerifySignedOp` *after* the signature check, each proven to reject **even with a
|
||||
valid signature** (Step 3 output above):
|
||||
|
||||
- **replay** — nonce already recorded in the window → reject;
|
||||
- **expired / not-yet-valid** — `now ∉ [issued_at, expires_at]` → reject (both sides shown);
|
||||
- **retargeted** — `target.host_id`/`guest_id` ≠ this box/guest → reject (both shown).
|
||||
|
||||
(Order matters: parse SSHSIG armor → namespace → allow-list → crypto verify → target → time → nonce, so
|
||||
a replayed *but otherwise valid* op is still caught, and an invalid sig never consumes a nonce.)
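
The nonce check needs a store behind it. A minimal in-memory sketch that would satisfy the `SeenOrRecord(nonce, exp)` shape used by the verifier is shown below; `memNonceStore`, `newMemNonceStore` and the injected clock are hypothetical names, not the spike's implementation. It assumes nonces only need to be remembered until the op's `expires_at`, because the time-window check runs first and rejects anything older.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// memNonceStore is a hypothetical in-memory nonce store.
type memNonceStore struct {
	mu   sync.Mutex
	now  func() time.Time     // injected clock, so tests can advance time
	seen map[string]time.Time // nonce -> expires_at of the op that used it
}

func newMemNonceStore(now func() time.Time) *memNonceStore {
	return &memNonceStore{now: now, seen: make(map[string]time.Time)}
}

// SeenOrRecord reports whether nonce was already recorded. If it was not,
// the nonce is recorded until exp and false is returned.
func (s *memNonceStore) SeenOrRecord(nonce string, exp time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := s.now()
	// Prune entries whose op could no longer pass the time-window check.
	for n, e := range s.seen {
		if now.After(e) {
			delete(s.seen, n)
		}
	}
	if _, dup := s.seen[nonce]; dup {
		return true
	}
	s.seen[nonce] = exp
	return false
}

// demo exercises first use, replay inside the window, and reuse after pruning.
func demo() (first, replay, afterExpiry bool) {
	clock := time.Date(2026, 6, 8, 0, 0, 0, 0, time.UTC)
	store := newMemNonceStore(func() time.Time { return clock })
	exp := clock.Add(5 * time.Minute)
	const nonce = "a1b2c3d4e5f60718293a4b5c6d7e8f90"

	first = store.SeenOrRecord(nonce, exp)  // not seen yet
	replay = store.SeenOrRecord(nonce, exp) // same nonce again
	clock = clock.Add(10 * time.Minute)     // move past expires_at
	afterExpiry = store.SeenOrRecord(nonce, exp)
	return first, replay, afterExpiry
}

func main() {
	fmt.Println(demo())
}
```

Two caveats apply to this sketch. The prune is a linear scan on every call, which is only reasonable for small windows. The map is also lost on restart, so a real agent would probably need to persist it, or refuse ops issued before its own start time, to keep the replay guarantee across reboots.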
|
||||
|
||||
## 5. Key-type-agnosticism — **TRUE DROP-IN** (no box change for FIDO2 later)
|
||||
|
||||
No FIDO2 device was used (by choice). Instead the spike **emulated the authenticator exactly**:
|
||||
|
||||
- Synthesized a well-formed `sk-ssh-ed25519@openssh.com` public key; `ssh.ParsePublicKey` parses
|
||||
it and `ssh.MarshalAuthorizedKey` round-trips it.
|
||||
- Constructed a real `SSHSIG` whose inner signature follows the sk scheme (per OpenSSH
|
||||
`PROTOCOL.u2f`): `ed25519` over `sha256(application) || flags || counter || sha256(signed_data)`,
|
||||
with the blob `string(format) string(ed25519_sig) byte(flags) uint32(counter)` — i.e. exactly
|
||||
what a FIDO2 key emits.
|
||||
- Ran it through the **unchanged `VerifySignedOp`** → **verified** (`op="guest_destroy"`).
|
||||
|
||||
**Verdict: true drop-in.** `pub.Verify` for `sk-ssh-ed25519` is implemented in
|
||||
`golang.org/x/crypto/ssh` **v0.52.0** (it reconstructs `appDigest‖flags‖counter‖dataDigest` and
|
||||
`ed25519.Verify`s it). Introducing a hardware operator key later is a **no-op on the boxes** —
|
||||
the agent's verify code is identical; only the operator's signer key (and the allowed-signers
|
||||
set entry) changes. No sk-specific handler is needed.
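
The sk signed-data construction described above can be illustrated with the standard library alone. This is a sketch of the byte layout only, not the spike's harness: `skSignedBytes` and `demoSK` are hypothetical names, the application string `"ssh:"` and flags value `0x01` (user presence) are assumed example values, and the placeholder message stands in for the real SSHSIG signed data.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// skSignedBytes (hypothetical helper) builds the bytes an sk-ssh-ed25519
// authenticator signs:
// sha256(application) || flags || counter (big-endian uint32) || sha256(signedData).
func skSignedBytes(application string, flags byte, counter uint32, signedData []byte) []byte {
	app := sha256.Sum256([]byte(application))
	msg := sha256.Sum256(signedData)
	out := make([]byte, 0, len(app)+1+4+len(msg))
	out = append(out, app[:]...)
	out = append(out, flags)
	out = binary.BigEndian.AppendUint32(out, counter)
	out = append(out, msg[:]...)
	return out
}

// demoSK signs like an emulated authenticator, then verifies once with the
// matching counter and once with a different counter.
func demoSK() (ok, tampered bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	signed := []byte("placeholder for the SSHSIG signed data") // assumption: stand-in message

	sig := ed25519.Sign(priv, skSignedBytes("ssh:", 0x01, 7, signed))

	ok = ed25519.Verify(pub, skSignedBytes("ssh:", 0x01, 7, signed), sig)
	tampered = ed25519.Verify(pub, skSignedBytes("ssh:", 0x01, 8, signed), sig)
	return ok, tampered
}

func main() {
	fmt.Println(demoSK())
}
```

Because flags and counter are part of the signed bytes, a verifier that feeds the SSHSIG signed data straight into `ed25519.Verify` cannot accept such a signature, which is the gap the library's sk-aware `Verify` closes.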
|
||||
|
||||
> Because verification dispatches on the type of the public key embedded in the signature, the same path also
|
||||
> accepts `ssh-ed25519`, `rsa-sha2-*`, `ecdsa-sha2-*`, etc. — algorithm choice is the operator's,
|
||||
> not the agent's.
|
||||
|
||||
## 6. Fallback (not taken) and its cost
|
||||
|
||||
A fallback would be a **raw Ed25519 detached signature** (or `minisign`): trivially one
|
||||
`ed25519.Verify` call, no SSHSIG framing. **Rejected** because it **loses the clean FIDO2 path** —
|
||||
a raw-Ed25519 verifier cannot consume an `sk-ssh-ed25519` signature (which carries flags+counter
|
||||
and a different signed-data construction), so the future hardware swap would require **changing
|
||||
the verifier on every box**. SSHSIG buys exactly the key-type-agnosticism (§5) that a raw scheme
|
||||
forfeits, at a one-file framing cost (§3). **No fallback is warranted.**
|
||||
|
||||
## 7. Reference verifier (seed of the agent's verify code)
|
||||
|
||||
Verified working on Go 1.24.4 / `x/crypto` v0.52.0. (Test harness omitted; this is the verify
|
||||
core + SSHSIG framing + anti-replay/authz.)
|
||||
|
||||
```go
package agent // illustrative package name

import (
	"bytes"
	"crypto/sha256"
	"crypto/sha512"
	"encoding/json"
	"encoding/pem"
	"errors"
	"fmt"
	"hash"
	"time"

	"golang.org/x/crypto/ssh"
)

const Namespace = "felhom-op-v1" // FIXED domain separator, never caller-supplied
const sshsigMagic = "SSHSIG"

// Target binds an op to one box and one guest (anti-retarget).
// The json tags are required: encoding/json does not map host_id to HostID on its own.
type Target struct {
	HostID  string `json:"host_id"`
	GuestID string `json:"guest_id"`
}

// OpBlob is the decoded form of the canonical op blob (§2).
type OpBlob struct {
	Op        string          `json:"op"`
	Target    Target          `json:"target"`
	Params    json.RawMessage `json:"params"`
	Nonce     string          `json:"nonce"`
	IssuedAt  time.Time       `json:"issued_at"`
	ExpiresAt time.Time       `json:"expires_at"`
	KeyID     string          `json:"key_id"`
}

// NonceStore reports whether nonce was already seen; if not, it records it until exp.
type NonceStore interface {
	SeenOrRecord(nonce string, exp time.Time) bool
}

// sshsigBlob mirrors the SSHSIG wire layout that follows the 6-byte magic.
// Field order matters: ssh.Unmarshal fills the fields in declaration order.
type sshsigBlob struct {
	Version   uint32
	PublicKey string
	Namespace string
	Reserved  string
	HashAlgo  string
	Signature string
}

// hashByName returns the hash named in the SSHSIG blob.
func hashByName(n string) (hash.Hash, error) {
	switch n {
	case "sha256":
		return sha256.New(), nil
	case "sha512":
		return sha512.New(), nil
	}
	return nil, fmt.Errorf("unsupported SSHSIG hash %q", n)
}

// parseArmoredSSHSIG decodes the PEM armor, checks the magic and parses the wire struct.
func parseArmoredSSHSIG(armored []byte) (*sshsigBlob, error) {
	block, _ := pem.Decode(armored)
	if block == nil || block.Type != "SSH SIGNATURE" {
		return nil, errors.New("not an SSH SIGNATURE armor")
	}
	if len(block.Bytes) < 6 || string(block.Bytes[:6]) != sshsigMagic {
		return nil, errors.New("missing SSHSIG magic")
	}
	var sb sshsigBlob
	if err := ssh.Unmarshal(block.Bytes[6:], &sb); err != nil {
		return nil, err
	}
	if sb.Version != 1 {
		return nil, fmt.Errorf("bad version %d", sb.Version)
	}
	return &sb, nil
}

// signedData rebuilds the bytes the signer actually signed:
// "SSHSIG" || string(namespace) || string(reserved) || string(hash_algo) || string(H(msg)).
func signedData(sb *sshsigBlob, msg []byte) ([]byte, error) {
	h, err := hashByName(sb.HashAlgo)
	if err != nil {
		return nil, err
	}
	h.Write(msg)
	md := h.Sum(nil)
	body := ssh.Marshal(struct {
		Namespace, Reserved, HashAlgo string
		Hash                          []byte
	}{sb.Namespace, sb.Reserved, sb.HashAlgo, md})
	return append([]byte(sshsigMagic), body...), nil
}

// VerifySignedOp: key-type-agnostic signature verify + anti-replay/authorization.
// allowedSigners is the trusted operator set (one key now; a quorum set later).
// On success it returns the op name.
func VerifySignedOp(blob, sigArmored []byte, allowedSigners []ssh.PublicKey,
	thisHostID, thisGuestID string, seenNonces NonceStore) (string, error) {

	sb, err := parseArmoredSSHSIG(sigArmored)
	if err != nil {
		return "", err
	}
	if sb.Namespace != Namespace {
		return "", fmt.Errorf("namespace mismatch: got %q want %q", sb.Namespace, Namespace)
	}
	pub, err := ssh.ParsePublicKey([]byte(sb.PublicKey))
	if err != nil {
		return "", err
	}
	allowed := false
	for _, a := range allowedSigners {
		if bytes.Equal(a.Marshal(), pub.Marshal()) {
			allowed = true
			break
		}
	}
	if !allowed {
		return "", errors.New("signer not in allowed set")
	}

	signed, err := signedData(sb, blob)
	if err != nil {
		return "", err
	}
	var inner ssh.Signature
	if err := ssh.Unmarshal([]byte(sb.Signature), &inner); err != nil {
		return "", err
	}
	if err := pub.Verify(signed, &inner); err != nil { // dispatches on key algorithm
		return "", fmt.Errorf("signature invalid: %w", err)
	}

	// The signature is valid; now enforce target, time window and nonce, in that order.
	var op OpBlob
	if err := json.Unmarshal(blob, &op); err != nil {
		return "", err
	}
	if op.Target.HostID != thisHostID || op.Target.GuestID != thisGuestID {
		return "", fmt.Errorf("target mismatch: blob=%s/%s this=%s/%s",
			op.Target.HostID, op.Target.GuestID, thisHostID, thisGuestID)
	}
	now := time.Now().UTC()
	if now.Before(op.IssuedAt) {
		return "", errors.New("not yet valid")
	}
	if now.After(op.ExpiresAt) {
		return "", errors.New("expired")
	}
	if seenNonces.SeenOrRecord(op.Nonce, op.ExpiresAt) {
		return "", fmt.Errorf("replay: nonce %s already seen", op.Nonce)
	}
	return op.Op, nil
}
```
|
||||
|
||||
## 8. Inputs to the design doc (`04-control-plane-authorization.md`)
|
||||
|
||||
- **Primitive confirmed:** SSHSIG (`ssh-keygen -Y sign` / armored `BEGIN SSH SIGNATURE`),
|
||||
verified in Go via `pem.Decode` + `ssh.Unmarshal` + `ssh.ParsePublicKey` + `pub.Verify`. Low
|
||||
implementation cost; no crypto hand-rolled.
|
||||
- **Hub cannot forge:** the operator private key never touches the hub; the hub only queues the
|
||||
opaque armored blob (matches `03` §4).
|
||||
- **Key-type-agnostic / hardware-ready:** software `ed25519` now, FIDO2 `sk-ssh-ed25519` later is
|
||||
a **box no-op** (proven end-to-end). The verifier hardcodes neither key type nor algorithm.
|
||||
- **`allowedSigners` is a set:** single signer today; **threshold/quorum is just set sizing** plus
|
||||
an N-of-M policy on top (out of scope here).
|
||||
- **Anti-replay/authz are mandatory and cheap:** namespace (fixed), allow-list, then crypto,
|
||||
then target-binding, time-window, nonce — all enforced and tested.
|
||||
- **Canonical blob (§2)** is the shared contract between the operator CLI and the agent verifier.
|
||||