# Phase 0 — VM vs LXC Overhead Spike: Findings
**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie),
kernel 7.0.2-6-pve, 4 vCPU, 16 GB RAM (15771 MB `MemTotal`).
**Date:** 2026-06-07. **Measured one guest at a time, the other fully stopped.**
> This document presents **data and observations only**. No recommendation or verdict —
> the architecture decision is made elsewhere.
---
## 1. Provenance
### Platform
| Component | Version |
|---|---|
| pve-manager | 9.2.2 (`b9984c6d90a4bd80`) |
| kernel | proxmox-kernel 7.0.2-6-pve |
| pve-qemu-kvm | 11.0.0-3 |
| qemu-server | 9.1.15 |
| pve-container | 6.1.10 |
| lxc-pve / lxcfs | 7.0.0-2 / 7.0.0-pve1 |
| criu | 4.1.1-1 |
`pvesh get /version` → release 9.2, version 9.2.2.
### Guest images
| | LXC (9001) | VM (9000) |
|---|---|---|
| Source | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` | `debian-13-genericcloud-amd64.qcow2` |
| Build | Debian 13.1 standard CT template (downloaded via `pveam`, checksum verified) | cloud build **20260601-2496**; in-guest reports Debian **13.5** after `apt update` |
| qcow2 | n/a | virtual 3 GiB, on-disk 323 MiB, compat 1.1/zlib |
### Docker (identical in both guests)
| | LXC | VM |
|---|---|---|
| Source | Docker official apt repo, **`trixie` channel** (confirmed present) | same |
| Version | **29.5.3** build d1c06ef | **29.5.3** build d1c06ef |
| Storage Driver | **`overlayfs`** (not vfs) | **`overlayfs`** (not vfs) |
| Cgroup Version / Driver | **v2 / systemd** | **v2 / systemd** |
| `hello-world` | OK | OK |
> Docker's official repo **does** have a `trixie` channel — no fallback to Debian's
> `docker.io` was needed. Docker 29 reports the driver as `overlayfs` (the containerd
> snapshotter image store) rather than the legacy name `overlay2`; this is the same
> overlay technology and is **not** a `vfs` fallback.
---
## 2. Comparison table
Baseline (both guests stopped): host RAM used **median 1702 MB** (range 1699–1703);
host CPU **~0.1 % used** (99.9 % idle). All RAM deltas below are vs this baseline.
Host RAM used = `MemTotal - MemAvailable`, 5 samples ~3 s apart (median reported).
| Metric | LXC (9001) | VM (9000) | Δ (VM − LXC) |
|---|---|---|---|
| **Idle host-RAM delta** | **+211 MB** (1913) | **+2056 MB** (3758) | **+1845 MB** |
| **Under-load host-RAM delta** | **+410 MB** (2112) | **+2084 MB** (3786) | **+1674 MB** |
| **Per-guest mem attribution** | cgroup `memory.current` = **1961 MB**¹ | KVM process RSS = **2031 MB** (idle) / **2047 MB** (load) | — |
| **Idle host CPU used** | **~0.3 %** (0.20 usr + 0.10 sys) | **~6.0 %** (3.37 usr + 2.31 sys + 0.29 guest) | **+5.7 pp** |
| **Under-load host CPU used** | **~39.4 %** (17.1 usr + 7.5 sys + 14.5 iowait + 0.3 soft) | **~53.9 %** (31.9 guest + 16.4 iowait + 3.4 sys + 1.7 usr + 0.6 soft) | **+14.5 pp** |
| **pgbench throughput** | **2211.7 tps**, lat 1.809 ms, 132 710 tx/60 s, 0 failed | **1819.6 tps**, lat 2.198 ms, 163 764 tx/90 s, 0 failed² | **−392 tps** |
| **Disk allocated** | 10 GiB | 10 GiB | 0 |
| **Disk used (host thin-LV)** | 26.73 % ≈ **2.67 GiB** | 29.33 % ≈ **2.93 GiB** | +0.26 GiB |
| **Disk used (inside guest)** | 2.1 GiB / 9.7 GiB | 2.4 GiB / 9.7 GiB | +0.3 GiB |
| **Provisioning (rough, create→ready)** | ~10–15 s³ | ~60–75 s³ | — |
¹ `memory.current` counts reclaimable page cache shared with the host and therefore
**overstates** the LXC's true incremental cost; the +211 MB host-RAM delta is the honest
number. ² VM 60 s runs gave 1739 & 1759 tps — consistent with the 90 s definitive run.
³ Guest-creation step only; see §4. Docker install + first image pull (~network-bound,
~identical for both) is excluded.
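
The throughput row can be sanity-checked against the transaction counts in the same row. A minimal sketch, assuming nothing beyond the figures already quoted; `tps_estimate` is a hypothetical helper name, and a small gap versus pgbench's own figure is plausible if pgbench measures over a slightly different interval than the nominal `-T` value:

```bash
# Hypothetical helper: approximate tps as transactions / nominal run length.
# $1 = transaction count, $2 = run length in seconds.
tps_estimate() {
  awk -v tx="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", tx / secs }'
}
```

For example, `tps_estimate 132710 60` and `tps_estimate 163764 90` give values close to the 2211.7 and 1819.6 tps that pgbench reported for the LXC and VM runs.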
### Inside-guest `free -m` (context only — not the decisive number)
| | total | used | buff/cache | available |
|---|---|---|---|---|
| LXC idle | 2048 | 125 | 1851 | 1922 |
| VM idle | 1974 | 509 | 1524 | 1464 |
The VM sees **1974 MB** usable of 2048 allocated (firmware/kernel reservation).
---
## 3. Docker-in-LXC viability
**Worked cleanly in an *unprivileged* LXC with `--features nesting=1,keyctl=1`. No
privileged fallback was needed.**
- `--features nesting=1,keyctl=1 --unprivileged 1` accepted by `pct create` (PVE 9
syntax confirmed via `pct help create`).
- `docker run hello-world` → success.
- **Storage driver: `overlayfs`** (cgroup v2, systemd cgroup driver) — **no `vfs`
fallback**.
- Full 3-container stack (`postgres:17`, `redis:7`, `nginx:alpine`) came up healthy.
- Named volume `pgdata` persisted a write (`SELECT count` returned 1 after table
create/insert).
- Multi-container networking + published port worked: `curl localhost:8080` → **HTTP 200**.
- 60 s pgbench load: **0 failed transactions**.
No errors, no `dmesg`/`journalctl` anomalies, no workarounds. The privileged-LXC
fallback path (step A5) was therefore **not exercised**.
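
The storage-driver and cgroup checks above could be scripted roughly as follows. This is a sketch, not the procedure used in the spike: the function names are hypothetical, and it assumes the `Driver`, `CgroupVersion`, and `CgroupDriver` fields of `docker info --format`.

```bash
# Hypothetical helper: succeed unless the driver name is empty or the vfs fallback.
driver_ok() {
  case "$1" in
    vfs|"") return 1 ;;
    *)      return 0 ;;
  esac
}

# Query the local daemon, print the three values the findings quote, and
# return non-zero if the storage driver is vfs (2 if the daemon is unreachable).
check_docker_runtime() {
  local driver cgver cgdrv
  driver=$(docker info --format '{{.Driver}}') || return 2
  cgver=$(docker info --format '{{.CgroupVersion}}') || return 2
  cgdrv=$(docker info --format '{{.CgroupDriver}}') || return 2
  echo "storage=$driver cgroup=v$cgver/$cgdrv"
  driver_ok "$driver"
}
```

Run inside either guest, `check_docker_runtime` would be expected to print the driver and cgroup values listed in §1 and exit 0 when no `vfs` fallback occurred.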
---
## 4. Observations & confounds
1. **VM under-load CPU required a re-measurement (diagnosed, not hidden).** The first
VM-load sample showed host CPU ~5 % — identical to *idle* — while pgbench nonetheless
completed a full 60 s run (1739 tps). Root cause: the VM load was launched through a
**nested SSH + `nohup &`** layer (host→VM), which started pgbench *after* the sampling
window. The LXC path used local `pct exec` (no nested SSH) so its first sample was
valid. Re-running with pgbench held in the **foreground of a long-lived SSH channel**
(guaranteed active) and sampling during a confirmed window gave the true **53.9 %**
(`%guest`=31.9). **Confound:** the two guests' load was driven through different
plumbing (`pct exec` vs nested SSH); the *throughput* numbers are unaffected
(pgbench self-reports its own duration), but the CPU figures came from
methodologically asymmetric harnesses.
2. **Baseline drift from residual page cache.** After stopping each guest, host RAM did
not snap back to 1702 MB immediately (e.g. 1895 MB just after the LXC stopped;
1965→1794 MB drifting down after the VM). This is reclaimable cache, not a leak.
Treat all RAM deltas as ±~100 MB.
3. **The headline RAM gap is structural, not incidental.** LXC processes share the host
kernel and page cache, so only the working set counts against the host (+211 MB idle).
The VM, with **no ballooning configured**, has KVM back every guest-touched page —
including the guest's own 1.5 GB page cache — so the host cost ≈ the full 2 GB
allocation (KVM RSS ≈ 2031 MB) and is **largely load-independent** (3758 idle → 3786
load). Ballooning / KSM were not tested and could change this.
4. **`cgroup memory.current` ≠ host cost.** For the LXC it read 1961 MB (near the 2 GB
limit) because it includes reclaimable page cache; the real incremental host cost was
+211 MB. Per the protocol, `MemTotal - MemAvailable` is the decisive metric.
5. **VM idle CPU floor (~6 %) vs LXC (~0.3 %).** QEMU device emulation + a full guest
kernel's timer/housekeeping impose a small constant CPU cost even at rest.
6. **Throughput vs CPU trade.** The VM did about 18 % *less* work (1820 vs 2212 tps) for
*more* host CPU (53.9 vs 39.4 %). The extra cost surfaces as `%guest` (31.9 %) — the
actual DB work *plus* virtualization overhead — whereas in the LXC the same DB work
appears directly as host `%usr`/`%sys`. iowait was comparable (~15–16 %, WAL fsync).
7. **Workload fits in RAM.** pgbench scale `-s 10` (~150 MB) fits in cache in both
guests, so the test is commit/CPU-bound rather than disk-bound; a larger-than-RAM
dataset would stress the storage paths differently and is not covered here.
8. **qemu-guest-agent confirmed on the VM** (`qm guest cmd 9000 ping` → OK). This enables
`guest-fsfreeze`-based app-consistent `snapshot`-mode vzdump for the VM — a capability
the LXC has no equivalent for. The genericcloud image does **not** ship the agent;
it had to be installed in-guest (and the VM IP had to be found via `nmap`/MAC until
the agent was up).
9. **Provisioning asymmetry foreshadows cloning.** LXC create is template-extract-bound
(526 MiB at 387 MiB/s + SSH keygen, ~10–15 s). VM create is qcow2-import-bound (3 GiB
→ LVM ≈ 30 s) plus a full firmware boot to SSH-ready (~30–45 s). Figures are rough,
single-run, and exclude the shared network-bound Docker install + first image pull.
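
The "% used" CPU figures quoted in the observations are 100 minus `%idle` from the `mpstat` average row. A minimal sketch of that arithmetic, assuming `%idle` is the last column of the `Average: all` row (as in sysstat's default CPU report) and a `.` decimal separator; `cpu_used_pct` is a hypothetical helper name:

```bash
# Hypothetical helper: read mpstat output on stdin and print 100 - %idle
# from the "Average: all" row, rounded to one decimal place.
cpu_used_pct() {
  awk '$1 == "Average:" && $2 == "all" { printf "%.1f\n", 100 - $NF }'
}
```

Usage would look like `LC_ALL=C mpstat 1 5 | cpu_used_pct`; with the VM under-load average (46.08 idle) this yields 53.9.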
---
## 5. Raw command log (appendix)
### 5.1 Provenance
```
$ pveversion -v | grep ...
pve-manager: 9.2.2 (running version: 9.2.2/b9984c6d90a4bd80)
proxmox-kernel-7.0: 7.0.2-6
criu: 4.1.1-1
lxc-pve: 7.0.0-2
lxcfs: 7.0.0-pve1
pve-container: 6.1.10
pve-qemu-kvm: 11.0.0-3
qemu-server: 9.1.15
$ pvesm status
local dir active 98497780 4333576 89114656 4.40%
local-lvm lvmthin active 365760512 0 365760512 0.00%
# Docker repo trixie channel:
$ curl -fsSL https://download.docker.com/linux/debian/dists/ | grep -oE 'trixie|bookworm|bullseye'
bookworm / bullseye / trixie # trixie present
# Cloud image:
$ qemu-img info debian-13-genericcloud-amd64.qcow2
virtual size: 3 GiB ; disk size: 323 MiB ; compat 1.1 ; build 20260601-2496
```
### 5.2 Baseline (both guests stopped)
```
$ for i in 1..5; awk MemTotal-MemAvailable /proc/meminfo ; sleep 3
used=1699 MB / 1702 / 1702 / 1702 / 1703 MB (median 1702)
$ mpstat 1 5
Average: all 0.05 usr 0.05 sys ... 99.90 idle
```
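
The baseline command above is logged in abbreviated form. A runnable sketch of the same measurement (host RAM used = `MemTotal - MemAvailable`, several samples, median reported) might look like the following; the function names are hypothetical, and `/proc/meminfo` values are assumed to be in kB:

```bash
# Print used MB for one meminfo snapshot (file defaults to /proc/meminfo).
host_used_mb() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END {printf "%d\n", (t - a) / 1024}' "${1:-/proc/meminfo}"
}

# Print the median of the integers given as arguments
# (the lower-middle value when the count is even).
median() {
  printf '%s\n' "$@" | sort -n | awk '{v[NR]=$1} END {print v[int((NR + 1) / 2)]}'
}

# Take COUNT samples INTERVAL seconds apart and print their median.
sample_host_ram() {
  local count=${1:-5} interval=${2:-3} samples="" i=1
  while [ "$i" -le "$count" ]; do
    samples="$samples $(host_used_mb)"
    [ "$i" -lt "$count" ] && sleep "$interval"
    i=$((i + 1))
  done
  # Word splitting of $samples is intentional: one argument per sample.
  median $samples
}
```

Calling `sample_host_ram 5 3` on the host corresponds to the "5x3s" sampling used throughout this appendix.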
### 5.3 LXC 9001 — create + Docker
```
$ pct create 9001 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
--hostname spike-lxc --cores 2 --memory 2048 --rootfs local-lvm:10 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp --features nesting=1,keyctl=1 \
--unprivileged 1 --start 1
Logical volume "vm-9001-disk-0" created.
extracting archive ... Total bytes read: 551505920 (526MiB, 387MiB/s)
Creating SSH host key ... done
=== exit: 0 ; status: running
features: nesting=1,keyctl=1 ; unprivileged: 1 ; ip 192.168.0.115/24
# Docker install (official repo, trixie stable): DOCKER-INSTALL-OK
$ docker --version -> Docker version 29.5.3, build d1c06ef
$ docker run --rm hello-world -> Hello from Docker!
$ docker info | grep -iE 'Storage Driver|Cgroup'
Storage Driver: overlayfs
Cgroup Driver: systemd
Cgroup Version: 2
Server Version: 29.5.3 ; Kernel: 7.0.2-6-pve ; OS: Debian GNU/Linux 13 (trixie)
```
### 5.4 LXC 9001 — stack health
```
$ docker compose ps
spike-cache-1 running Up
spike-db-1 running Up
spike-web-1 running Up
$ curl -s -o /dev/null -w 'HTTP %{http_code}' localhost:8080 -> HTTP 200
$ psql CREATE TABLE spike_persist; INSERT; SELECT count(*) -> 1 (volume persists)
```
### 5.5 LXC 9001 — idle measurement
```
Host RAM used (5x3s): 1913 / 1914 / 1913 / 1914 / 1913 MB (median 1913, Δ +211)
cgroup memory.current: 2056036352 B = 1961 MB
inside free -m: total 2048 used 125 buff/cache 1851 available 1922
mpstat 1 5 Average: 0.20 usr 0.10 sys ... 99.70 idle (~0.3% used)
pct df 9001: rootfs 9.7G size, 2.1G used, 21.6%
```
### 5.6 LXC 9001 — under-load measurement
```
$ pgbench -i -s 10 -> done in 1.39 s
$ pgbench -T 60 -c 4 (run concurrently with sampling):
Host RAM used (5x3s): 2149 / 2143 / 2112 / 2086 / 2071 MB (median 2112, Δ +410)
cgroup memory.current: 2130382848 B = 2032 MB
mpstat 1 5 Average: 17.10 usr 7.50 sys 14.50 iowait 0.31 soft 60.59 idle (~39.4% used)
pgbench result: scaling 10, clients 4, 60 s
transactions: 132710 ; failed 0 (0.000%)
latency average = 1.809 ms ; tps = 2211.713864
host thin LV vm-9001-disk-0: 10240 MB, Data% 26.73 (≈2.67 GiB)
```
### 5.7 VM 9000 — create + cloud-init
```
$ qm create 9000 --name spike-vm --cores 2 --memory 2048 \
--net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-single --agent 1
$ qm set 9000 --scsi0 local-lvm:0,import-from=/var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
transferred 3.0 GiB of 3.0 GiB (100.00%)
scsi0: successfully created disk 'local-lvm:vm-9000-disk-0,size=3G'
$ qm set 9000 --ide2 local-lvm:cloudinit --boot order=scsi0 --serial0 socket --vga serial0
$ qm disk resize 9000 scsi0 10G -> resized 3.00 -> 10.00 GiB
$ qm set 9000 --ciuser spike --cipassword spike --sshkeys /root/spike-pubkey.pub --ipconfig0 ip=dhcp
# pubkey file = the two real keys from the host's /etc/pve/priv/authorized_keys
# (incl. ssh-ed25519 ...kisfenyo@windows — the same workstation key)
$ qm start 9000 -> start-ok
```
### 5.8 VM 9000 — IP discovery + guest agent + Docker
```
# genericcloud has no guest-agent at first boot -> qm guest cmd ping failed.
# IP found via MAC on the bridge:
$ nmap -sn 192.168.0.0/24 | grep -B2 BC:24:11:C7:41:87
Nmap scan report for 192.168.0.155 ; MAC BC:24:11:C7:41:87 (Proxmox)
$ ssh -i /root/.ssh/id_rsa spike@192.168.0.155 'hostname; cat /etc/debian_version'
spike-vm ; 13.5
# install qemu-guest-agent + Docker (official repo, trixie): VM-INSTALL-OK
$ qm guest cmd 9000 ping -> AGENT OK (fsfreeze available)
$ docker --version -> Docker version 29.5.3, build d1c06ef
$ docker run --rm hello-world -> Hello from Docker!
$ docker info | grep -iE 'Storage Driver|Cgroup'
Storage Driver: overlayfs ; Cgroup Driver: systemd ; Cgroup Version: 2
```
### 5.9 VM 9000 — stack health
```
$ docker compose ps -> spike-cache-1 / spike-db-1 / spike-web-1 all running
$ curl ... localhost:8080 -> HTTP 200
$ psql ... SELECT count(*) -> 1 (volume persists)
```
### 5.10 VM 9000 — idle measurement
```
Host RAM used (5x3s): 3758 / 3757 / 3754 / 3759 / 3758 MB (median 3758, Δ +2056)
KVM process RSS / VSZ: 2079988 / 3380896 KiB (RSS = 2031 MB)
inside free -m: total 1974 used 509 buff/cache 1524 available 1464
mpstat 1 5 Average: 3.37 usr 2.31 sys 0.29 guest ... 94.04 idle (~6.0% used)
qm config: scsi0 local-lvm:vm-9000-disk-0,size=10G
host thin LV vm-9000-disk-0: 10240 MB, Data% 29.33 (≈2.93 GiB)
inside df -h /: 9.7G size, 2.4G used, 25%
```
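
The "≈ GiB" figures next to each thin LV's `Data%` are the LV size multiplied by that percentage. A small sketch of the conversion, where `thin_used_gib` is a hypothetical helper; the size and percentage would typically come from `lvs -o lv_size,data_percent`, with the volume group name depending on the host:

```bash
# Hypothetical helper: convert a thin LV's size (MiB) and Data% into GiB used.
# $1 = LV size in MiB, $2 = Data% as reported for the thin volume.
thin_used_gib() {
  awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", size * pct / 100 / 1024 }'
}
```

With the values logged for the two guests, `thin_used_gib 10240 26.73` gives 2.67 and `thin_used_gib 10240 29.33` gives 2.93.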
### 5.11 VM 9000 — under-load measurement (definitive, load confirmed active)
```
# First attempt (nested-ssh + nohup &) launched pgbench AFTER the sample window ->
# host CPU read a false ~5% (identical to idle). Diagnosed; re-run below holds
# pgbench in the foreground of a long-lived SSH channel and samples during it.
$ pgbench -T 90 -c 4 (foreground, channel held):
transactions: 163764 ; failed 0 (0.000%)
latency average = 2.198 ms ; tps = 1819.602345
(60 s confirmation runs: 1739 & 1759 tps)
# Sampled 10 s into the confirmed-active load:
Host RAM used (5x3s): 3784 / 3786 / 3786 / 3786 / 3786 MB (median 3786, Δ +2084)
KVM process RSS / VSZ: 2096508 / 4495008 KiB (RSS = 2047 MB)
guest uptime: load average 1.71 (2 vCPU) -> vCPUs busy
mpstat 1 8 Average:
1.70 usr 3.40 sys 16.35 iowait 0.58 soft 31.89 guest 46.08 idle (~53.9% used)
```
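
The failure mode described in the comments above (sampling while the load was not yet running) can be guarded against by checking that the load process is alive before sampling. The sketch below is one way to do that and is not the harness used in the spike; `sample_during_load` and its arguments are hypothetical.

```bash
# Run LOAD_CMD in the background, wait WARMUP seconds, and run SAMPLE_CMD only
# if the load is still alive. Returns 1 when the load had already exited,
# otherwise the exit status of SAMPLE_CMD.
# $1 = warmup seconds, $2 = load command string, $3 = sample command string.
sample_during_load() {
  local warmup=$1 load_cmd=$2 sample_cmd=$3 load_pid rc
  sh -c "$load_cmd" &
  load_pid=$!
  sleep "$warmup"
  if ! kill -0 "$load_pid" 2>/dev/null; then
    wait "$load_pid"
    echo "load not active during sample window" >&2
    return 1
  fi
  sh -c "$sample_cmd"
  rc=$?
  wait "$load_pid"   # let the load finish before returning
  return "$rc"
}
```

A call shaped like `sample_during_load 10 "ssh spike@192.168.0.155 '<pgbench command>'" "mpstat 1 8"` keeps the SSH channel as the tracked process, so the sample is skipped rather than silently taken against an idle guest. Note that this checks only that the local launcher is alive, not that pgbench has started inside the guest.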
### 5.12 Teardown state
```
$ qm list -> 9000 spike-vm stopped
$ pct list -> 9001 spike-lxc stopped
# both present, both stopped (numbers can be re-checked)
```
---
## 6. Teardown — destroy commands (NOT run)
Both guests were left **stopped but present**. To remove them:
```bash
qm destroy 9000 --purge # VM (also removes cloudinit + disks)
pct destroy 9001 --purge # LXC
# optional spike artifacts on the host:
rm -f /var/lib/vz/template/qcow2/debian-13-genericcloud-amd64.qcow2
rm -f /root/spike-pubkey.pub /root/vm-install.sh
# (Debian 13 CT template left in place: local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst)
```
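
Before running the destroy commands, it may be worth confirming that both guests are still stopped. A minimal sketch, assuming `qm status` and `pct status` print a single `status: <state>` line; the helper names are hypothetical:

```bash
# Succeeds only when a status line reads exactly "status: stopped".
is_stopped() {
  [ "$1" = "status: stopped" ]
}

# $1 = tool (qm or pct), $2 = guest ID. Succeeds if that guest is stopped.
guest_is_stopped() {
  is_stopped "$("$1" status "$2")"
}
```

For example, `guest_is_stopped qm 9000 && guest_is_stopped pct 9001` could gate the teardown block above.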