felhom.eu/documentation/tests/slice7-bringup-spike-findings.md
admin 33429933af spike(slice7): golden base build + live bring-up front-half findings
SPIKE-RUNBOOK Slice 7 Phase 0, executed live on demo-felhom. Golden base
(Debian 13 + Docker, nesting=1,keyctl=1, identity-cleaned) built as root@pam,
archived, then token-restored to a throwaway guest and brought up LINK-UP with
the FelhomAgent token (restore/config/resize/start all token-covered).

Key findings:
- MAC reset is UNCONDITIONAL — vzrestore preserves the archived MAC (F1).
- hostname reset is host-side token config (F2).
- machine-id auto-regenerates on first boot (free); SSH host keys do NOT —
  ssh.service fails, agent must run ssh-keygen -A guest-side OR bake a first-boot
  unit (F3, the one surface-widening design consequence).
- keyctl-through-restore is functional (Docker hello-world in the restored guest);
  storage driver overlayfs (F5/F6).
- Settles the §9 / doc-13 identity-reset field list for the provision path.

Verdict: READY to spec the unified bring-up reconcile job (Phase 7.1).
Golden archive kept; both spike guests torn down.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 20:48:50 +02:00


Slice 7 Phase 0 — Golden base build + live bring-up (front half): Findings

Host: demo-felhom (192.168.0.162), Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge vmbr0, LAN DHCP (router at 192.168.0.1).
Date: 2026-06-09.
Driver: SPIKE-RUNBOOK (root@pam CLI for the golden build; the FelhomAgent API token for the per-customer front-half ops: restore/config/resize/start).
VMIDs: golden-build 9100, restored-test 9101 (both torn down; golden archive kept).

This document presents data, observations, and the resulting design deliverables (the identity-reset field list). It feeds the spec of the unified bring-up reconcile job (Phase 7.1).


1. Provenance / setup

| Component | Value |
|---|---|
| Template | local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst |
| Restore storage | local-lvm (lvmthin) |
| Archive storage | local |
| Token | felhom@pve!agent (the FelhomAgent 16-priv role; by reference) |
| Golden archive (KEPT) | local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst (298 MB) |
| openssh-server (in guest) | 1:10.0p1-7 |
| Docker storage driver | overlayfs (not overlay2/vfs); consistent with phase0 |

Token API smoke (S0): GET /version → 200, GET /nodes/demo-felhom/lxc → 200. Token holds VM.Allocate, Datastore.Allocate/AllocateSpace, VM.Config.{Disk,Network,Options,…}, VM.PowerMgmt, VM.Backup, etc. (full set confirmed via /access/permissions).
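
For reference, the S0 smoke check could be reproduced with curl along these lines. This is a minimal sketch, not the runbook's actual script: the helper names and the environment variables PVE_HOST, PVE_TOKEN_ID and PVE_TOKEN_SECRET are hypothetical, and -k is used only on the assumption that the demo host presents a self-signed certificate.

```shell
# Hypothetical helpers for the S0 token smoke check.
# PVE_HOST, PVE_TOKEN_ID, PVE_TOKEN_SECRET are assumed variable names.

# Build the API-token header: PVEAPIToken=<user>@<realm>!<tokenid>=<secret>
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s=%s' "$1" "$2"
}

# Print only the HTTP status code of a GET against the given API path.
pve_get_status() {
  curl -sk -o /dev/null -w '%{http_code}' \
    -H "$(pve_auth_header "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET")" \
    "https://${PVE_HOST}:8006/api2/json$1"
}

# Usage (S0 saw 200 for both):
#   pve_get_status /version
#   pve_get_status /nodes/demo-felhom/lxc
```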

2. Golden recipe (validated — build the real golden from this)

  1. Create (root@pam — the one root step; keyctl=1 is root-only, phase3 #1):
    pct create 9100 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
      --hostname felhom-golden --unprivileged 1 --features nesting=1,keyctl=1 \
      --rootfs local-lvm:8 --cores 2 --memory 2048 \
      --net0 name=eth0,bridge=vmbr0,ip=dhcp --onboot 0
    
    (pct create auto-generates SSH host keys — these get wiped in step 3.)
  2. Docker (official apt repo, trixie channel): ca-certificates curl → keyring → docker-ce docker-ce-cli containerd.io. Confirmed working in the build guest: docker run --rm hello-world → "Hello from Docker!", storage driver overlayfs.
  3. Identity-clean + minimize (guest-internal, run during build):
    systemctl stop docker containerd
    apt-get clean; rm -rf /var/lib/apt/lists/*
    rm -f /etc/ssh/ssh_host_*            # SSH host keys
    truncate -s 0 /etc/machine-id        # systemd regenerates on first boot
    rm -f /var/lib/dbus/machine-id; ln -sf /etc/machine-id /var/lib/dbus/machine-id
    rm -rf /var/log/*; : > /root/.bash_history
    rm -f /etc/hostname                  # set per-guest at provision
    
  4. Stop + archive (root vzdump is fine for the build): pct stop 9100; vzdump 9100 --storage local --mode stop --compress zstd.
  5. Archive carries keyctl (verified, phase3 method — embedded ./etc/vzdump/pct.conf): features: nesting=1,keyctl=1 · unprivileged: 1. It also carries the build guest's baked MAC BC:24:11:63:43:F4 and hostname: felhom-golden — see §4.
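
The step-5 check can be scripted roughly as follows. This is a sketch that assumes GNU tar with zstd support on the PVE host; the function names and the ARCHIVE variable are hypothetical, and only the embedded path ./etc/vzdump/pct.conf is taken from the run above.

```shell
# Print the container config embedded in a vzdump LXC archive.
# Assumes GNU tar built with zstd support (--zstd).
archive_pct_conf() {
  tar --zstd -xOf "$1" ./etc/vzdump/pct.conf
}

# Succeed if the config on stdin lists the given key=value on its features: line,
# regardless of where it sits in the comma-separated list.
conf_has_feature() {
  grep -Eq "^features:[[:space:]]*([^,]+,)*$1(,|\$)"
}

# Usage (ARCHIVE is the filesystem path of the golden archive):
#   archive_pct_conf "$ARCHIVE" | conf_has_feature keyctl=1 && echo "keyctl carried"
```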

3. Result matrix

| Property | As-restored (9101, stopped, pre-reset) | Front-half reset (token) | After link-up boot |
|---|---|---|---|
| keyctl / nesting / unpriv | preserved nesting=1,keyctl=1,unprivileged:1 | | Docker runs (hello-world OK); keyctl functional, not just flag-present |
| MAC | KEPT golden's BC:24:11:63:43:F4 | reset → fresh BC:24:11:A6:C0:DE (PUT net0, omit hwaddr → PVE regenerates) | DHCP lease 192.168.0.109; MAC unique; no LAN collision |
| hostname | KEPT golden's felhom-golden (config field; /etc/hostname file absent) | reset → felhom-spike-9101 (PUT hostname) | propagated inside (hostname = felhom-spike-9101) |
| machine-id | empty (baked truncate) | | auto-regenerated by systemd: faeffb0bc1b8403089cdd0b981cff109 (unique) |
| SSH host keys | absent (baked rm) | | NOT regenerated; ssh.service FAILED; see Finding F3 |
| rootfs | 8 G | resize → 10 G (PUT /resize disk=rootfs size=+2G) | |
| mp0 mount | n/a | attached local-lvm:1,mp=/mnt/spike-test (transient 500 → retry 200, F4) | present + writable (ext4) |

Token ops all ran as felhom-agent@pve!agent (restore vzrestore OK, start vzstart OK) — the per-customer front half is fully token-covered.
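
The guest-side rows of the matrix (machine-id, SSH host keys) can be checked with a small script like the one below, run inside the guest before and after first boot. It is an illustrative sketch; the function name and output wording are hypothetical and were not part of the spike.

```shell
# Report guest-side identity state under a root directory (default: /).
guest_identity_report() {
  local root="${1:-/}"
  # machine-id counts as present only if the file exists and is non-empty.
  if [ -s "$root/etc/machine-id" ]; then
    echo "machine-id: present"
  else
    echo "machine-id: empty"
  fi
  # Any private host key file counts as "present".
  if ls "$root"/etc/ssh/ssh_host_*_key >/dev/null 2>&1; then
    echo "ssh-host-keys: present"
  else
    echo "ssh-host-keys: absent"
  fi
}

# Usage inside the guest:
#   guest_identity_report
```

Against the as-restored golden this should report both as empty/absent; after the link-up boot observed here, machine-id would be present while the host keys stay absent until ssh-keygen -A runs.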

4. Findings

  • F1 — MAC reset is UNCONDITIONAL. A token vzrestore preserves the archived MAC (9101 came up with the golden's BC:24:11:63:43:F4). Every guest restored from the golden would therefore share one MAC → guaranteed L2 collision. The reconcile job must reset MAC on every provision (host-side: PUT net0 with hwaddr omitted → PVE generates a fresh BC:24:11:xx:xx:xx). This settles the §9 "MAC handling" question for the provision path: always reset. (DR-restore of a customer backup is the separate continuity case — §9.)
  • F2 — hostname is carried in the config and must be reset host-side. The archive's hostname: field restored verbatim (felhom-golden); PUT hostname= resets it and it propagates into the guest on boot. Host-side, token-covered — no guest-internal step.
  • F3 — machine-id regenerates for free; SSH host keys do NOT (design consequence).
    • machine-id: bake truncate -s 0 → systemd regenerates it on first boot (confirmed non-empty + unique). No agent action needed. ✓ free.
    • SSH host keys: bake rm → on Debian 13 they are not regenerated at boot (the keygen is a pct create hook + a package-install action; pct restore runs neither). Result: openssh-server is installed and ssh.service is enabled but FAILED on first boot (no host keys). ssh-keygen -A regenerates them cleanly (unique fingerprint SHA256:MAX191…ED25519, root@felhom-spike-9101). → The bring-up reconcile job must regenerate SSH host keys guest-side (ssh-keygen -A, or dpkg-reconfigure openssh-server). This widens the agent's guest-internal surface beyond pure host-side config — the one real design consequence this spike surfaced. Alternative to consider in the spec: bake a one-shot first-boot unit into the golden that runs ssh-keygen -A (keeps regeneration guest-internal-but-baked, so the agent stays host-side-only). Either way it must be decided; it is not free like machine-id.
  • F4 — transient config-lock 500 on back-to-back PUTs. A mp0 attach issued immediately after a resize returned HTTP 500, then succeeded (200) on retry seconds later — a config-lock contention, not a permission issue (token holds VM.Config.Disk + Datastore.AllocateSpace). The reconcile job's existing per-guest serialization avoids this; add a retry on transient 500 for safety.
  • F5 — keyctl-through-restore is functional, not just flag-present. Docker started and ran hello-world in the restored guest — re-confirms phase3 #8 on the golden specifically.
  • F6 — Docker storage driver is overlayfs (not overlay2), matching phase0's LXC result. No extra config beyond nesting=1,keyctl=1 was needed.
  • F7 — live link-up surfaced no DHCP/ARP problem. Fresh MAC → fresh lease 192.168.0.109; the golden's old MAC only lingered as a STALE IPv6-neighbour cache entry from the (stopped) build guest. No active collision.
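
The retry suggested in F4 could look roughly like this. It is a sketch only: retry_on_500 is a hypothetical helper, and the attempt count and delay are arbitrary placeholders rather than values measured in the spike. It wraps any command that prints an HTTP status code and retries only while that code is 500.

```shell
# Run a command that prints an HTTP status code; retry while it prints 500.
# Args: <max_attempts> <delay_seconds> <command...>
# Prints the final status code; returns 0 only for a 2xx result.
retry_on_500() {
  local max_attempts="$1" delay="$2" attempt=1 status
  shift 2
  while :; do
    status=$("$@")
    if [ "$status" != "500" ] || [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$status"
      [ "$status" -ge 200 ] 2>/dev/null && [ "$status" -lt 300 ]
      return $?
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (put_mp0 is a hypothetical function that issues the mp0 PUT with
# curl -w '%{http_code}' -o /dev/null and prints the status):
#   retry_on_500 3 2 put_mp0
```

Other non-2xx codes (for example a 403) are returned immediately rather than retried, so a real permission problem is not masked.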

5. Identity-reset deliverable (the §9 / doc-13 open item — settled for the provision path)

| Field | Restore leaves it as | Who resets it | Where | Cost |
|---|---|---|---|---|
| MAC | golden's archived MAC | reconcile job (unconditional) | host-side token PUT net0 (omit hwaddr) | cheap |
| hostname | golden's archived hostname | reconcile job | host-side token PUT hostname | cheap |
| machine-id | empty (baked) | systemd, first boot | guest first-boot regen (golden bake) | free |
| SSH host keys | absent (baked) | reconcile job | guest-side ssh-keygen -A (or baked first-boot unit) | surface-widening; flag |

Reconcile-job front-half reset set (provision): host-side {MAC, hostname} via token config; guest-side {SSH host keys} via ssh-keygen -A (or a baked first-boot unit); {machine-id} is handled for free by the bake-clean golden. Restic / tunnel / hub identity are out of scope here (back half, slice 8 / DR policy §9).
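
The host-side half of that reset set might be sketched as below. The assumptions are stated loudly: strip_hwaddr and reset_identity are hypothetical names, the PVE_* variables are placeholders, the sample net0 string in the usage line is illustrative, and sending net0 and hostname in a single PUT is a design choice of this sketch (the spike issued them as separate PUTs).

```shell
# Drop the hwaddr= key from a net0 option string so PVE generates a fresh MAC.
strip_hwaddr() {
  printf '%s\n' "$1" | sed -E 's/(^|,)hwaddr=[^,]*//; s/^,//'
}

# Hypothetical wrapper: PUT net0 (without hwaddr) and hostname to the guest config.
# PVE_HOST, PVE_NODE, PVE_TOKEN_ID, PVE_TOKEN_SECRET are assumed variable names.
reset_identity() {
  local vmid="$1" net0="$2" new_hostname="$3"
  curl -sk -X PUT \
    -H "Authorization: PVEAPIToken=${PVE_TOKEN_ID}=${PVE_TOKEN_SECRET}" \
    --data-urlencode "net0=$(strip_hwaddr "$net0")" \
    --data-urlencode "hostname=${new_hostname}" \
    "https://${PVE_HOST}:8006/api2/json/nodes/${PVE_NODE}/lxc/${vmid}/config"
}

# Usage:
#   reset_identity 9101 'name=eth0,bridge=vmbr0,hwaddr=BC:24:11:63:43:F4,ip=dhcp' felhom-spike-9101
```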

6. Verdict

READY to spec the unified bring-up reconcile job (Phase 7.1). The golden recipe is validated end-to-end and the token-covered front half (restore → reset MAC+hostname → resize → attach mount → start link-up) works with Docker functional in the restored guest. One design change the findings force: the front half is not purely host-side — SSH-host-key regeneration is a guest-internal step (F3). The spec must choose between an agent-run ssh-keygen -A (widening the guest-internal surface) and a baked first-boot unit in the golden (keeping the agent host-side). machine-id needs no such step. MAC reset is unconditional (F1).
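
To make the second option concrete, a baked first-boot unit could be written into the golden during the build roughly as follows. This was not tested in the spike; the unit name felhom-ssh-hostkeys.service and the install_hostkey_unit helper are hypothetical, and the unit assumes Debian's ssh.service name and /usr/bin/ssh-keygen path.

```shell
# Write a one-shot unit that runs ssh-keygen -A only when no host keys exist.
# $1 is the target root (empty for the live build guest's own /).
install_hostkey_unit() {
  local root="${1:-}"
  mkdir -p "$root/etc/systemd/system"
  cat > "$root/etc/systemd/system/felhom-ssh-hostkeys.service" <<'EOF'
[Unit]
Description=Generate SSH host keys if absent (hypothetical unit)
Before=ssh.service
ConditionPathExistsGlob=!/etc/ssh/ssh_host_*_key

[Service]
Type=oneshot
ExecStart=/usr/bin/ssh-keygen -A

[Install]
WantedBy=multi-user.target
EOF
}

# During the golden build, before the identity-clean step:
#   install_hostkey_unit && systemctl enable felhom-ssh-hostkeys.service
```

Because the condition is a negated glob on the key files, the unit would be skipped on every boot after the first, so it leaves later key rotation to the operator.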

7. Out of scope (not done here — note for the implementation)

  • Controller deploy / bootstrap / per-guest local-token mint — slice 8 (back half).
  • Restic / tunnel / hub identity handling — DR identity policy (§9) + slice 8/10.
  • Reconcile-job journaling + compensating rollback — the implementation (Phase 7.1), specced from these findings; this spike restored/destroyed manually without the journal.
  • PBS escrow (§8a) — separate slice-7 thread.