SPIKE-RUNBOOK Slice 7 Phase 0, executed live on demo-felhom. Golden base (Debian 13 + Docker, nesting=1,keyctl=1, identity-cleaned) built as root@pam, archived, then token-restored to a throwaway guest and brought up LINK-UP with the FelhomAgent token (restore/config/resize/start all token-covered).

Key findings:
- MAC reset is UNCONDITIONAL: vzrestore preserves the archived MAC (F1).
- hostname reset is host-side token config (F2).
- machine-id auto-regenerates on first boot (free); SSH host keys do NOT: ssh.service fails, agent must run ssh-keygen -A guest-side OR bake a first-boot unit (F3, the one surface-widening design consequence).
- keyctl-through-restore is functional (Docker hello-world in the restored guest); storage driver overlayfs (F5/F6).
- Settles the §9 / doc-13 identity-reset field list for the provision path.

Verdict: READY to spec the unified bring-up reconcile job (Phase 7.1). Golden archive kept; both spike guests torn down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Slice 7 Phase 0 — Golden base build + live bring-up (front half): Findings
Host: demo-felhom (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge vmbr0,
LAN DHCP (router at 192.168.0.1).
Date: 2026-06-09. Driver: SPIKE-RUNBOOK (root@pam CLI for the golden build; the
FelhomAgent API token for the per-customer front-half ops — restore/config/resize/start).
VMIDs: golden-build 9100, restored-test 9101 (both torn down; golden archive kept).
This document presents data, observations, and the resulting design deliverables (the identity-reset field list). It feeds the spec of the unified bring-up reconcile job (Phase 7.1).
1. Provenance / setup
| Component | Value |
|---|---|
| Template | local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst |
| Restore storage | local-lvm (lvmthin) · Archive storage: local |
| Token | felhom@pve!agent (the FelhomAgent 16-priv role; by reference) |
| Golden archive (KEPT) | local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst (298 MB) |
| openssh-server (in guest) | 1:10.0p1-7 |
| Docker storage driver | overlayfs (not overlay2/vfs) — consistent with phase0 |
Token API smoke (S0): GET /version → 200, GET /nodes/demo-felhom/lxc → 200. Token holds
VM.Allocate, Datastore.Allocate/AllocateSpace, VM.Config.{Disk,Network,Options,…},
VM.PowerMgmt, VM.Backup, etc. (full set confirmed via /access/permissions).
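
The smoke check above can be reproduced with plain `curl`. This is a minimal sketch, not the runbook's actual script: the function names and the `PVE_HOST`, `PVE_TOKEN_ID` and `PVE_TOKEN_SECRET` variables are hypothetical, port 8006 is the Proxmox VE default, and `-k` assumes the host still uses its self-signed certificate.

```shell
# Build the Authorization header for a Proxmox VE API token.
# $1 = token id (user@realm!name), $2 = token secret
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s=%s\n' "$1" "$2"
}

# Build the JSON API URL for a path such as /version.
# $1 = host, $2 = API path
pve_url() {
    printf 'https://%s:8006/api2/json%s\n' "$1" "$2"
}

# Print the HTTP status of a GET request against the API.
# Reads PVE_HOST, PVE_TOKEN_ID, PVE_TOKEN_SECRET (hypothetical names).
# -k skips TLS verification (assumption: self-signed host certificate).
pve_get_status() {
    curl -sS -k -o /dev/null -w '%{http_code}\n' \
        -H "$(pve_auth_header "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET")" \
        "$(pve_url "$PVE_HOST" "$1")"
}
```

With the three variables exported, `pve_get_status /version` and `pve_get_status /nodes/demo-felhom/lxc` should each print `200` if the token is accepted.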
2. Golden recipe (validated — build the real golden from this)
1. Create (root@pam; the one root step, since `keyctl=1` is root-only, phase3 #1):

       pct create 9100 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
         --hostname felhom-golden --unprivileged 1 --features nesting=1,keyctl=1 \
         --rootfs local-lvm:8 --cores 2 --memory 2048 \
         --net0 name=eth0,bridge=vmbr0,ip=dhcp --onboot 0

   (`pct create` auto-generates SSH host keys; these get wiped in step 3.)
2. Docker (official apt repo, `trixie` channel): `ca-certificates curl` → keyring → `docker-ce docker-ce-cli containerd.io`. Confirmed working in the build guest: `docker run --rm hello-world` → "Hello from Docker!", storage driver `overlayfs`.
3. Identity-clean + minimize (guest-internal, run during build):

       systemctl stop docker containerd
       apt-get clean; rm -rf /var/lib/apt/lists/*
       rm -f /etc/ssh/ssh_host_*            # SSH host keys
       truncate -s 0 /etc/machine-id        # systemd regenerates on first boot
       rm -f /var/lib/dbus/machine-id; ln -sf /etc/machine-id /var/lib/dbus/machine-id
       rm -rf /var/log/*; : > /root/.bash_history
       rm -f /etc/hostname                  # set per-guest at provision

4. Stop + archive (root vzdump is fine for the build): `pct stop 9100; vzdump 9100 --storage local --mode stop --compress zstd`.
5. Archive carries keyctl (verified, phase3 method: embedded `./etc/vzdump/pct.conf`): `features: nesting=1,keyctl=1` · `unprivileged: 1`. It also carries the build guest's baked MAC `BC:24:11:63:43:F4` and `hostname: felhom-golden`; see §4.
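
To see which identity fields an archive will carry into a restore, one option is to read its embedded config and pull out the relevant keys. The sketch below assumes the config text arrives on stdin in the usual `key: value` form; `conf_value` and `conf_hwaddr` are hypothetical helper names, not part of any Proxmox tool.

```shell
# Print the value of a top-level "key: value" line from a pct.conf on stdin.
# $1 = key name, e.g. features or hostname
conf_value() {
    sed -n "s/^$1: //p"
}

# Print the MAC address baked into the net0 line of a pct.conf on stdin.
conf_hwaddr() {
    sed -n 's/^net0: .*hwaddr=\([0-9A-Fa-f:]*\).*/\1/p'
}
```

On the PVE host, `pvesm extractconfig local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst | conf_hwaddr` would print the archived MAC, and piping the same output into `conf_value features` would show whether `keyctl=1` survived the archive step.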
3. Result matrix
| Property | As-restored (9101, stopped, pre-reset) | Front-half reset (token) | After link-up boot |
|---|---|---|---|
| keyctl / nesting / unpriv | preserved nesting=1,keyctl=1,unprivileged:1 | - | Docker runs (hello-world OK); keyctl functional, not just flag-present |
| MAC | KEPT golden's BC:24:11:63:43:F4 | reset → fresh BC:24:11:A6:C0:DE (PUT net0, omit hwaddr → PVE regenerates) | DHCP lease 192.168.0.109; MAC unique; no LAN collision |
| hostname | KEPT golden's felhom-golden (config field; /etc/hostname file absent) | reset → felhom-spike-9101 (PUT hostname) | propagated inside (hostname = felhom-spike-9101) |
| machine-id | empty (baked truncate) | - | auto-regenerated by systemd → faeffb0bc1b8403089cdd0b981cff109 (unique) |
| SSH host keys | absent (baked rm) | - | NOT regenerated; ssh.service FAILED (see Finding F3) |
| rootfs | 8 G | resize → 10 G (PUT /resize disk=rootfs size=+2G) | - |
| mp0 mount | n/a | attached local-lvm:1,mp=/mnt/spike-test (transient 500 → retry 200, F4) | present + writable (ext4) |
Token ops all ran as felhom-agent@pve!agent (restore vzrestore OK, start vzstart OK) —
the per-customer front half is fully token-covered.
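
The token-covered front half can be written down as an ordered call plan against the Proxmox VE API. The sketch below only prints that plan and does not call the API. The function name is hypothetical, and the parameter values (storage, net0 string, resize delta, mount point) are simply the ones used in this spike, not a proposed default.

```shell
# Print the front-half call plan, one "METHOD PATH PARAMS" line per step.
# $1 = node, $2 = vmid, $3 = archive volume id, $4 = new hostname
front_half_plan() {
    fh_base="/nodes/$1/lxc"
    # 1. restore the golden archive into a new guest
    printf 'POST %s vmid=%s ostemplate=%s restore=1 storage=local-lvm\n' "$fh_base" "$2" "$3"
    # 2. rewrite net0 without the MAC field so PVE generates a fresh one
    printf 'PUT %s/%s/config net0=name=eth0,bridge=vmbr0,ip=dhcp\n' "$fh_base" "$2"
    # 3. replace the archived hostname
    printf 'PUT %s/%s/config hostname=%s\n' "$fh_base" "$2" "$4"
    # 4. grow the root filesystem
    printf 'PUT %s/%s/resize disk=rootfs size=+2G\n' "$fh_base" "$2"
    # 5. attach the data mount
    printf 'PUT %s/%s/config mp0=local-lvm:1,mp=/mnt/spike-test\n' "$fh_base" "$2"
    # 6. start the guest (link-up)
    printf 'POST %s/%s/status/start\n' "$fh_base" "$2"
}
```

For example, `front_half_plan demo-felhom 9101 local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst felhom-spike-9101` lists the six calls in the order they ran here. Keeping the plan as data makes it easy to serialize the calls per guest before any of them is sent.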
4. Findings
- F1: MAC reset is UNCONDITIONAL. A token `vzrestore` preserves the archived MAC (9101 came up with the golden's `BC:24:11:63:43:F4`). Every guest restored from the golden would therefore share one MAC → guaranteed L2 collision. The reconcile job must reset MAC on every provision (host-side: `PUT net0` with `hwaddr` omitted → PVE generates a fresh `BC:24:11:xx:xx:xx`). This settles the §9 "MAC handling" question for the provision path: always reset. (DR-restore of a customer backup is the separate continuity case, §9.)
- F2: hostname is carried in the config and must be reset host-side. The archive's `hostname:` field restored verbatim (`felhom-golden`); `PUT hostname=` resets it and it propagates into the guest on boot. Host-side, token-covered; no guest-internal step.
- F3: machine-id regenerates for free; SSH host keys do NOT (design consequence).
  - machine-id: bake `truncate -s 0` → systemd regenerates it on first boot (confirmed non-empty + unique). No agent action needed. ✓ free.
  - SSH host keys: bake `rm` → on Debian 13 they are not regenerated at boot (the keygen is a `pct create` hook + a package-install action; `pct restore` runs neither). Result: `openssh-server` is installed and `ssh.service` is enabled but FAILED on first boot (no host keys). `ssh-keygen -A` regenerates them cleanly (unique fingerprint `SHA256:MAX191…` ED25519, `root@felhom-spike-9101`). → The bring-up reconcile job must regenerate SSH host keys guest-side (`ssh-keygen -A`, or `dpkg-reconfigure openssh-server`). This widens the agent's guest-internal surface beyond pure host-side config, and it is the one real design consequence this spike surfaced. Alternative to consider in the spec: bake a one-shot first-boot unit into the golden that runs `ssh-keygen -A` (keeps regeneration guest-internal-but-baked, so the agent stays host-side-only). Either way it must be decided; it is not free like machine-id.
- F4: transient config-lock 500 on back-to-back PUTs. A `mp0` attach issued immediately after a `resize` returned HTTP 500, then succeeded (200) on retry seconds later. This is config-lock contention, not a permission issue (token holds `VM.Config.Disk` + `Datastore.AllocateSpace`). The reconcile job's existing per-guest serialization avoids this; add a retry on transient 500 for safety.
- F5: keyctl-through-restore is functional, not just flag-present. Docker started and ran `hello-world` in the restored guest, which re-confirms phase3 #8 on the golden specifically.
- F6: Docker storage driver is `overlayfs` (not `overlay2`), matching phase0's LXC result. No extra config beyond `nesting=1,keyctl=1` was needed.
- F7: live link-up surfaced no DHCP/ARP problem. Fresh MAC → fresh lease `192.168.0.109`; the golden's old MAC only lingered as a STALE IPv6-neighbour cache entry from the (stopped) build guest. No active collision.
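
The retry suggested in F4 can be sketched as a small wrapper around any command that prints an HTTP status code. This is an illustration under assumptions, not the reconcile job's code: `retry_on_500` is a hypothetical name, and the attempt count and delay are left to the caller because the spike only observed that a retry "seconds later" succeeded.

```shell
# Run a command that prints an HTTP status code, retrying only on 500.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command
# Prints the final status; returns 0 for 2xx and 1 otherwise.
retry_on_500() {
    r5_max=$1; r5_delay=$2; shift 2
    r5_attempt=1
    while :; do
        r5_status=$("$@")
        # Stop on anything other than 500, or when attempts are exhausted.
        if [ "$r5_status" != 500 ] || [ "$r5_attempt" -ge "$r5_max" ]; then
            printf '%s\n' "$r5_status"
            case $r5_status in
                2??) return 0 ;;
                *)   return 1 ;;
            esac
        fi
        r5_attempt=$((r5_attempt + 1))
        sleep "$r5_delay"
    done
}
```

Only 500 is retried, so a 403 from a missing privilege still fails immediately instead of being masked by the retry loop.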
5. Identity-reset deliverable (the §9 / doc-13 open item — settled for the provision path)
| Field | Restore leaves it as | Who resets it | Where | Cost |
|---|---|---|---|---|
| MAC | golden's archived MAC | reconcile job (unconditional) | host-side token PUT net0 (omit hwaddr) | cheap |
| hostname | golden's archived hostname | reconcile job | host-side token PUT hostname | cheap |
| machine-id | empty (baked) | systemd, first boot | guest first-boot regen (golden bake) | free |
| SSH host keys | absent (baked) | reconcile job | guest-side ssh-keygen -A (or baked first-boot unit) | surface-widening (flag) |
Reconcile-job front-half reset set (provision): host-side {MAC, hostname} via token config;
guest-side {SSH host keys} via ssh-keygen -A (or a baked first-boot unit); {machine-id} is
handled for free by the bake-clean golden. Restic / tunnel / hub identity are out of scope
here (back half, slice 8 / DR policy §9).
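
The two guest-side fields in this set can be checked after first boot with a short script. The sketch below is an assumption about how such a check might look: `identity_check` is a hypothetical name, and it takes a root prefix so it can be pointed at a mounted rootfs as well as the live guest (pass an empty string for `/`).

```shell
# Report guest identity state under a root prefix ("" for the live guest).
# $1 = root prefix
identity_check() {
    id_root=$1
    # -s is true only for a file that exists and is non-empty
    if [ -s "$id_root/etc/machine-id" ]; then
        echo "machine-id: present"
    else
        echo "machine-id: empty"
    fi
    # ls fails when the glob matches no host key files
    if ls "$id_root"/etc/ssh/ssh_host_*_key >/dev/null 2>&1; then
        echo "ssh-host-keys: present"
    else
        echo "ssh-host-keys: absent"
    fi
}
```

Run inside a guest restored from the golden, this would be expected to report `machine-id: present` and `ssh-host-keys: absent` until `ssh-keygen -A` has run, which matches the result matrix.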
6. Verdict
READY to spec the unified bring-up reconcile job (Phase 7.1). The golden recipe is validated
end-to-end and the token-covered front half (restore → reset MAC+hostname → resize → attach
mount → start link-up) works with Docker functional in the restored guest. One design change
the findings force: the front half is not purely host-side — SSH-host-key regeneration is a
guest-internal step (F3). The spec must choose between an agent-run ssh-keygen -A (widening the
guest-internal surface) and a baked first-boot unit in the golden (keeping the agent host-side).
machine-id needs no such step. MAC reset is unconditional (F1).
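
For the baked first-boot option, a one-shot systemd unit along the following lines is one possibility. It is a sketch for the spec discussion, not a validated part of the golden: the unit name `felhom-hostkeys.service` and the function name are hypothetical, and the condition assumes that a missing ED25519 host key means no host keys exist.

```shell
# Write a one-shot unit that regenerates SSH host keys before ssh.service.
# $1 = root prefix ("" when run inside the build guest)
write_hostkey_unit() {
    hk_dir="$1/etc/systemd/system"
    mkdir -p "$hk_dir"
    cat > "$hk_dir/felhom-hostkeys.service" <<'EOF'
[Unit]
Description=Generate missing SSH host keys (sketch)
Before=ssh.service
ConditionPathExists=!/etc/ssh/ssh_host_ed25519_key

[Service]
Type=oneshot
ExecStart=/usr/bin/ssh-keygen -A

[Install]
WantedBy=multi-user.target
EOF
}
```

In the build guest this would be followed by `systemctl enable felhom-hostkeys.service` before the identity-clean step. Because the condition skips the unit once keys exist, it is harmless on later boots, and the agent's front half stays host-side-only.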
7. Out of scope (not done here — note for the implementation)
- Controller deploy / bootstrap / per-guest local-token mint — slice 8 (back half).
- Restic / tunnel / hub identity handling — DR identity policy (§9) + slice 8/10.
- Reconcile-job journaling + compensating rollback — the implementation (Phase 7.1), specced from these findings; this spike restored/destroyed manually without the journal.
- PBS escrow (§8a) — separate slice-7 thread.