diff --git a/documentation/tests/slice7-bringup-spike-findings.md b/documentation/tests/slice7-bringup-spike-findings.md
new file mode 100644
index 0000000..6b7dc8d
--- /dev/null
+++ b/documentation/tests/slice7-bringup-spike-findings.md
@@ -0,0 +1,141 @@
+# Slice 7 Phase 0 — Golden base build + live bring-up (front half): Findings
+
+**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge `vmbr0`,
+LAN DHCP (router at 192.168.0.1).
+**Date:** 2026-06-09. **Driver:** SPIKE-RUNBOOK (root@pam CLI for the golden build; the
+`FelhomAgent` API token for the per-customer front-half ops — restore/config/resize/start).
+**VMIDs:** golden-build `9100`, restored-test `9101` (both torn down; golden archive kept).
+
+> This document presents **data, observations, and the resulting design deliverables** (the
+> identity-reset field list). It feeds the spec of the unified bring-up reconcile job (Phase 7.1).
+
+---
+
+## 1. Provenance / setup
+
+| Component | Value |
+|---|---|
+| Template | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` |
+| Restore storage · Archive storage | `local-lvm` (lvmthin) · `local` (dir, `/var/lib/vz/dump`) |
+| Token | `felhom@pve!agent` (the `FelhomAgent` 16-priv role; by reference) |
+| Golden archive (KEPT) | `local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst` (298 MB) |
+| openssh-server (in guest) | `1:10.0p1-7` |
+| Docker storage driver | **`overlayfs`** (not `overlay2`/`vfs`) — consistent with phase0 |
+
+Token API smoke (S0): `GET /version` → 200, `GET /nodes/demo-felhom/lxc` → 200. Token holds
+`VM.Allocate`, `Datastore.Allocate`/`AllocateSpace`, `VM.Config.{Disk,Network,Options,…}`,
+`VM.PowerMgmt`, `VM.Backup`, etc. (full set confirmed via `/access/permissions`).
+
+## 2. Golden recipe (validated — build the real golden from this)
+
+1. **Create (root@pam — the one root step; `keyctl=1` is root-only, phase3 #1):**
+   ```
+   pct create 9100 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
+     --hostname felhom-golden --unprivileged 1 --features nesting=1,keyctl=1 \
+     --rootfs local-lvm:8 --cores 2 --memory 2048 \
+     --net0 name=eth0,bridge=vmbr0,ip=dhcp --onboot 0
+   ```
+   (`pct create` auto-generates SSH host keys — these get wiped in step 3.)
+2. **Docker (official apt repo, `trixie` channel):** `ca-certificates curl` → keyring →
+   `docker-ce docker-ce-cli containerd.io`. Confirmed working in the build guest:
+   `docker run --rm hello-world` → "Hello from Docker!", **storage driver `overlayfs`**.
+3. **Identity-clean + minimize (guest-internal, run during build):**
+   ```
+   systemctl stop docker containerd
+   apt-get clean; rm -rf /var/lib/apt/lists/*
+   rm -f /etc/ssh/ssh_host_*                 # SSH host keys
+   truncate -s 0 /etc/machine-id             # systemd regenerates on first boot
+   rm -f /var/lib/dbus/machine-id; ln -sf /etc/machine-id /var/lib/dbus/machine-id
+   rm -rf /var/log/*; : > /root/.bash_history
+   rm -f /etc/hostname                       # set per-guest at provision
+   ```
+4. **Stop + archive (root vzdump is fine for the build):**
+   `pct stop 9100; vzdump 9100 --storage local --mode stop --compress zstd`.
+5. **Archive carries keyctl (verified, phase3 method — embedded `./etc/vzdump/pct.conf`):**
+   `features: nesting=1,keyctl=1` · `unprivileged: 1`. **It also carries the build guest's
+   baked MAC** `BC:24:11:63:43:F4` and `hostname: felhom-golden` — see §4.
+
+## 3. Result matrix
+
+| Property | As-restored (9101, stopped, pre-reset) | Front-half reset (token) | After link-up boot |
+|---|---|---|---|
+| keyctl / nesting / unpriv | **preserved** `nesting=1,keyctl=1,unprivileged:1` | — | **Docker runs** (`hello-world` OK) — keyctl *functional*, not just flag-present |
+| **MAC** | **KEPT golden's** `BC:24:11:63:43:F4` | reset → fresh `BC:24:11:A6:C0:DE` (PUT net0, **omit hwaddr** → PVE regenerates) | DHCP lease `192.168.0.109`; MAC unique; no LAN collision |
+| **hostname** | **KEPT golden's** `felhom-golden` (config field; `/etc/hostname` file absent) | reset → `felhom-spike-9101` (PUT hostname) | **propagated** inside (`hostname` = `felhom-spike-9101`) |
+| **machine-id** | **empty** (baked `truncate`) | — | **auto-regenerated by systemd** → `faeffb0bc1b8403089cdd0b981cff109` (unique) |
+| **SSH host keys** | **absent** (baked `rm`) | — | **NOT regenerated; `ssh.service` FAILED** — see Finding F3 |
+| rootfs | 8 G | **resize → 10 G** (`PUT /resize disk=rootfs size=+2G`) | — |
+| mp0 mount | n/a | attached `local-lvm:1,mp=/mnt/spike-test` (transient 500 → retry 200, F4) | present + **writable** (ext4) |
+
+Token ops all ran as `felhom-agent@pve!agent` (restore `vzrestore` OK, start `vzstart` OK) —
+the per-customer front half is **fully token-covered**.
+
+## 4. Findings
+
+- **F1 — MAC reset is UNCONDITIONAL.** A token `vzrestore` **preserves the archived MAC**
+  (9101 came up with the golden's `BC:24:11:63:43:F4`). Every guest restored from the golden
+  would therefore share one MAC → guaranteed L2 collision. The reconcile job **must** reset MAC
+  on every provision (host-side: `PUT net0` with `hwaddr` omitted → PVE generates a fresh
+  `BC:24:11:xx:xx:xx`). This settles the §9 "MAC handling" question for the *provision* path:
+  always reset. (DR-restore of a *customer* backup is the separate continuity case — §9.)
+- **F2 — hostname is carried in the config and must be reset host-side.** The archive's
+  `hostname:` field restored verbatim (`felhom-golden`); `PUT hostname=` resets it and it
+  **propagates into the guest** on boot. Host-side, token-covered — no guest-internal step.
+- **F3 — machine-id regenerates for free; SSH host keys do NOT (design consequence).**
+  - `machine-id`: bake `truncate -s 0` → **systemd regenerates it on first boot** (confirmed
+    non-empty + unique). No agent action needed. ✓ free.
+  - SSH host keys: bake `rm` → on Debian 13 they are **not** regenerated at boot (the keygen is
+    a `pct create` hook + a package-install action; **`pct restore` runs neither**). Result:
+    `openssh-server` is installed and `ssh.service` is **enabled but FAILED** on first boot (no
+    host keys). `ssh-keygen -A` regenerates them cleanly (unique fingerprint
+    `SHA256:MAX191…ED25519`, `root@felhom-spike-9101`).
+    → **The bring-up reconcile job must regenerate SSH host keys guest-side** (`ssh-keygen -A`,
+    or `dpkg-reconfigure openssh-server`). **This widens the agent's guest-internal surface**
+    beyond pure host-side config — the one real design consequence this spike surfaced.
+    *Alternative to consider in the spec:* bake a one-shot first-boot unit into the golden that
+    runs `ssh-keygen -A` (keeps regeneration guest-internal-but-baked, so the agent stays
+    host-side-only). Either way it must be decided; it is **not** free like machine-id.
+- **F4 — transient config-lock 500 on back-to-back PUTs.** A `mp0` attach issued immediately
+  after a `resize` returned **HTTP 500**, then succeeded (200) on retry seconds later — a
+  config-lock contention, **not** a permission issue (token holds `VM.Config.Disk` +
+  `Datastore.AllocateSpace`). The reconcile job's existing **per-guest serialization** avoids
+  this; add a **retry on transient 500** for safety.
+- **F5 — keyctl-through-restore is *functional*, not just flag-present.** Docker started and
+  ran `hello-world` in the *restored* guest — re-confirms phase3 #8 on the golden specifically.
+- **F6 — Docker storage driver is `overlayfs`** (not `overlay2`), matching phase0's LXC result.
+  No extra config beyond `nesting=1,keyctl=1` was needed.
+- **F7 — live link-up surfaced no DHCP/ARP problem.** Fresh MAC → fresh lease `192.168.0.109`;
+  the golden's old MAC only lingered as a STALE IPv6-neighbour cache entry from the (stopped)
+  build guest. No active collision.
+
+## 5. Identity-reset deliverable (the §9 / doc-13 open item — settled for the *provision* path)
+
+| Field | Restore leaves it as | Who resets it | Where | Cost |
+|---|---|---|---|---|
+| MAC | golden's archived MAC | reconcile job (unconditional) | **host-side** token `PUT net0` (omit hwaddr) | cheap |
+| hostname | golden's archived hostname | reconcile job | **host-side** token `PUT hostname` | cheap |
+| machine-id | empty (baked) | **systemd, first boot** | guest first-boot regen (golden bake) | **free** |
+| SSH host keys | absent (baked) | reconcile job | **guest-side** `ssh-keygen -A` (or baked first-boot unit) | **surface-widening — flag** |
+
+**Reconcile-job front-half reset set (provision):** host-side `{MAC, hostname}` via token config;
+guest-side `{SSH host keys}` via `ssh-keygen -A` (or a baked first-boot unit); `{machine-id}` is
+handled for free by the bake-clean golden. Restic / tunnel / hub identity are **out of scope**
+here (back half, slice 8 / DR policy §9).
+
+## 6. Verdict
+
+**READY to spec the unified bring-up reconcile job (Phase 7.1).** The golden recipe is validated
+end-to-end and the token-covered front half (restore → reset MAC+hostname → resize → attach
+mount → start link-up) works with Docker functional in the restored guest. **One design change
+the findings force:** the front half is **not** purely host-side — SSH-host-key regeneration is a
+guest-internal step (F3). The spec must choose between an agent-run `ssh-keygen -A` (widening the
+guest-internal surface) and a baked first-boot unit in the golden (keeping the agent host-side).
+machine-id needs no such step. MAC reset is unconditional (F1).
+
+## 7. Out of scope (not done here — note for the implementation)
+
+- Controller deploy / bootstrap / per-guest local-token mint — **slice 8** (back half).
+- Restic / tunnel / hub identity handling — DR identity policy (§9) + slice 8/10.
+- Reconcile-job journaling + compensating rollback — the **implementation** (Phase 7.1),
+  specced from these findings; this spike restored/destroyed manually without the journal.
+- PBS escrow (§8a) — separate slice-7 thread.
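
Review note on F4 (retry on transient 500): the retry the finding asks for could look roughly like the sketch below. It is not the reconcile job's actual code. `retry_on_500`, `pve_put`, `PVE_HOST`, `PVE_TOKEN`, `RETRY_ATTEMPTS` and `RETRY_DELAY` are hypothetical names introduced here, and the attempt count and delay are placeholder defaults rather than values measured in the spike. TLS trust for the PVE certificate is assumed to be configured already.

```shell
# Sketch only: retry a config call while the API answers HTTP 500 (F4).
# All function and variable names here are illustrative assumptions.

# Run "$@" (a command that prints an HTTP status code) until the code is
# not 500 or the attempt budget is spent. Prints the last status code and
# returns 0 only for a 2xx result.
retry_on_500() {
  local attempts
  local delay
  local status
  local i
  attempts="${RETRY_ATTEMPTS:-3}"
  delay="${RETRY_DELAY:-2}"
  status=""
  i=1
  while [ "$i" -le "$attempts" ]; do
    status="$("$@")"
    if [ "$status" != "500" ]; then
      break
    fi
    # Sleep only if another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$status"
  case "$status" in
    2??) return 0 ;;
    *) return 1 ;;
  esac
}

# Issue one PUT against the PVE API and print only the HTTP status code.
# PVE_TOKEN is assumed to hold "USER@REALM!TOKENID=SECRET".
pve_put() {
  local path
  path="$1"
  shift
  curl --silent --output /dev/null --write-out '%{http_code}' \
    --request PUT \
    --header "Authorization: PVEAPIToken=${PVE_TOKEN}" \
    "$@" "https://${PVE_HOST}:8006/api2/json${path}"
}
```

With those helpers, the mp0 attach from the result matrix might be issued as `retry_on_500 pve_put /nodes/demo-felhom/lxc/9101/config --data-urlencode 'mp0=local-lvm:1,mp=/mnt/spike-test'`. Only a 500 is retried, so a permission error (for example 403) still surfaces on the first attempt, which keeps the retry from masking the kind of failure F4 rules out.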