# Slice 7 Phase 0 — Golden base build + live bring-up (front half): Findings

**Host:** `demo-felhom` (192.168.0.162) — Proxmox VE 9.2.2, Debian 13 (Trixie). Bridge `vmbr0`,
LAN DHCP (router at 192.168.0.1).
**Date:** 2026-06-09. **Driver:** SPIKE-RUNBOOK (root@pam CLI for the golden build; the
`FelhomAgent` API token for the per-customer front-half ops — restore/config/resize/start).
**VMIDs:** golden-build `9100`, restored-test `9101` (both torn down; golden archive kept).

> This document presents **data, observations, and the resulting design deliverables** (the
> identity-reset field list). It feeds the spec of the unified bring-up reconcile job (Phase 7.1).

---

## 1. Provenance / setup

| Component | Value |
|---|---|
| Template | `local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst` |
| Restore storage | `local-lvm` (lvmthin) |
| Archive storage | `local` (dir, `/var/lib/vz/dump`) |
| Token | `felhom@pve!agent` (the `FelhomAgent` 16-priv role; by reference) |
| Golden archive (KEPT) | `local:backup/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst` (298 MB) |
| openssh-server (in guest) | `1:10.0p1-7` |
| Docker storage driver | **`overlayfs`** (not `overlay2`/`vfs`) — consistent with phase0 |

Token API smoke (S0): `GET /version` → 200, `GET /nodes/demo-felhom/lxc` → 200. Token holds
`VM.Allocate`, `Datastore.Allocate`/`AllocateSpace`, `VM.Config.{Disk,Network,Options,…}`,
`VM.PowerMgmt`, `VM.Backup`, etc. (full set confirmed via `/access/permissions`).

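The S0 smoke check can be reproduced with a couple of small helpers. This is a minimal sketch, not the runbook's actual script: `PVE_URL`, `PVE_TOKEN_ID`, and `PVE_TOKEN_SECRET` are hypothetical environment variables, and the helper names are invented for illustration. The only fixed part is the `PVEAPIToken=<user>@<realm>!<tokenid>=<secret>` header format that Proxmox VE expects for API tokens.

```shell
# Build the Proxmox VE API-token header.
# $1 = full token id (user@realm!tokenname), $2 = token secret.
pve_auth_header() {
    printf 'Authorization: PVEAPIToken=%s=%s\n' "$1" "$2"
}

# Print only the HTTP status code of a GET against the API.
# Assumes PVE_URL (e.g. https://demo-felhom:8006), PVE_TOKEN_ID and
# PVE_TOKEN_SECRET are exported; all three names are hypothetical.
# A default PVE install uses a self-signed certificate, so curl will
# additionally need --cacert (or equivalent) in practice.
pve_get_status() {
    curl -sS -o /dev/null -w '%{http_code}\n' \
        -H "$(pve_auth_header "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET")" \
        "$PVE_URL/api2/json$1"
}

# Usage (not executed here):
#   pve_get_status /version
#   pve_get_status /nodes/demo-felhom/lxc
```

Both calls are expected to print `200` when the token is valid and has the privileges listed above.
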
## 2. Golden recipe (validated — build the real golden from this)

1. **Create (root@pam — the one root step; `keyctl=1` is root-only, phase3 #1):**
   ```
   pct create 9100 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
     --hostname felhom-golden --unprivileged 1 --features nesting=1,keyctl=1 \
     --rootfs local-lvm:8 --cores 2 --memory 2048 \
     --net0 name=eth0,bridge=vmbr0,ip=dhcp --onboot 0
   ```
   (`pct create` auto-generates SSH host keys — these get wiped in step 3.)
2. **Docker (official apt repo, `trixie` channel):** `ca-certificates curl` → keyring →
   `docker-ce docker-ce-cli containerd.io`. Confirmed working in the build guest:
   `docker run --rm hello-world` → "Hello from Docker!", **storage driver `overlayfs`**.
3. **Identity-clean + minimize (guest-internal, run during build):**
   ```
   systemctl stop docker containerd
   apt-get clean; rm -rf /var/lib/apt/lists/*
   rm -f /etc/ssh/ssh_host_*                 # SSH host keys
   truncate -s 0 /etc/machine-id             # systemd regenerates on first boot
   rm -f /var/lib/dbus/machine-id; ln -sf /etc/machine-id /var/lib/dbus/machine-id
   rm -rf /var/log/*; : > /root/.bash_history
   rm -f /etc/hostname                       # set per-guest at provision
   ```
4. **Stop + archive (root vzdump is fine for the build):**
   `pct stop 9100; vzdump 9100 --storage local --mode stop --compress zstd`.
5. **Archive carries keyctl (verified, phase3 method — embedded `./etc/vzdump/pct.conf`):**
   `features: nesting=1,keyctl=1` · `unprivileged: 1`. **It also carries the build guest's
   baked MAC** `BC:24:11:63:43:F4` and `hostname: felhom-golden` — see §4.

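The step-5 check (reading the embedded `./etc/vzdump/pct.conf`) could be scripted roughly as follows. This is a sketch under assumptions: `golden_conf_ok` is a hypothetical helper name, and it only tests the two fields the recipe depends on.

```shell
# Read a pct.conf from stdin and succeed only if it carries
# nesting=1 and keyctl=1 in its features line and is unprivileged.
golden_conf_ok() {
    conf=$(cat)
    # Extract the comma-separated feature list, e.g. "nesting=1,keyctl=1".
    feats=$(printf '%s\n' "$conf" | sed -n 's/^features:[[:space:]]*//p')
    case ",$feats," in *,nesting=1,*) ;; *) return 1 ;; esac
    case ",$feats," in *,keyctl=1,*) ;; *) return 1 ;; esac
    printf '%s\n' "$conf" | grep -Eq '^unprivileged:[[:space:]]*1$' || return 1
}

# Usage on the PVE host (not executed here; needs GNU tar with zstd support):
#   tar --zstd -xOf /var/lib/vz/dump/vzdump-lxc-9100-2026_06_09-20_41_10.tar.zst \
#       ./etc/vzdump/pct.conf | golden_conf_ok && echo "golden config OK"
```

The same extraction also shows the baked `net0` MAC and `hostname:` lines, which is where the reset requirement in §4 comes from.
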
## 3. Result matrix

| Property | As-restored (9101, stopped, pre-reset) | Front-half reset (token) | After link-up boot |
|---|---|---|---|
| keyctl / nesting / unpriv | **preserved** `nesting=1,keyctl=1,unprivileged:1` | — | **Docker runs** (`hello-world` OK) — keyctl *functional*, not just flag-present |
| **MAC** | **KEPT golden's** `BC:24:11:63:43:F4` | reset → fresh `BC:24:11:A6:C0:DE` (PUT net0, **omit hwaddr** → PVE regenerates) | DHCP lease `192.168.0.109`; MAC unique; no LAN collision |
| **hostname** | **KEPT golden's** `felhom-golden` (config field; `/etc/hostname` file absent) | reset → `felhom-spike-9101` (PUT hostname) | **propagated** inside (`hostname` = `felhom-spike-9101`) |
| **machine-id** | **empty** (baked `truncate`) | — | **auto-regenerated by systemd** → `faeffb0bc1b8403089cdd0b981cff109` (unique) |
| **SSH host keys** | **absent** (baked `rm`) | — | **NOT regenerated; `ssh.service` FAILED** — see Finding F3 |
| rootfs | 8 G | **resize → 10 G** (`PUT /resize disk=rootfs size=+2G`) | — |
| mp0 mount | n/a | attached `local-lvm:1,mp=/mnt/spike-test` (transient 500 → retry 200, F4) | present + **writable** (ext4) |

Token ops all ran as `felhom-agent@pve!agent` (restore `vzrestore` OK, start `vzstart` OK) —
the per-customer front half is **fully token-covered**.

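The MAC row's "PUT net0, omit hwaddr" step amounts to re-sending the restored `net0` value with its `hwaddr=` key removed. A minimal sketch of that string manipulation follows; `strip_hwaddr` is a hypothetical helper, and the `curl` call in the comment reuses assumed variables (`PVE_URL`, `AUTH_HEADER`) that are not part of the spike.

```shell
# Remove the hwaddr=... key from a PVE LXC net0 value, keeping the
# other keys in their original order.
strip_hwaddr() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -v '^hwaddr=' | paste -sd, -
}

# Usage against the API (not executed here; PVE_URL and AUTH_HEADER are
# assumed to be set by the caller):
#   old='name=eth0,bridge=vmbr0,hwaddr=BC:24:11:63:43:F4,ip=dhcp,type=veth'
#   curl -sS -X PUT -H "$AUTH_HEADER" \
#        --data-urlencode "net0=$(strip_hwaddr "$old")" \
#        "$PVE_URL/api2/json/nodes/demo-felhom/lxc/9101/config"
```

Deriving the new value from the restored one, rather than hard-coding it, keeps the bridge and IP settings of the golden intact while dropping only the MAC.
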
## 4. Findings

- **F1 — MAC reset is UNCONDITIONAL.** A token `vzrestore` **preserves the archived MAC**
  (9101 came up with the golden's `BC:24:11:63:43:F4`). Every guest restored from the golden
  would therefore share one MAC → guaranteed L2 collision. The reconcile job **must** reset MAC
  on every provision (host-side: `PUT net0` with `hwaddr` omitted → PVE generates a fresh
  `BC:24:11:xx:xx:xx`). This settles the §9 "MAC handling" question for the *provision* path:
  always reset. (DR-restore of a *customer* backup is the separate continuity case — §9.)
- **F2 — hostname is carried in the config and must be reset host-side.** The archive's
  `hostname:` field restored verbatim (`felhom-golden`); `PUT hostname=` resets it and it
  **propagates into the guest** on boot. Host-side, token-covered — no guest-internal step.
- **F3 — machine-id regenerates for free; SSH host keys do NOT (design consequence).**
  - `machine-id`: bake `truncate -s 0` → **systemd regenerates it on first boot** (confirmed
    non-empty + unique). No agent action needed. ✓ free.
  - SSH host keys: bake `rm` → on Debian 13 they are **not** regenerated at boot (the keygen is
    a `pct create` hook + a package-install action; **`pct restore` runs neither**). Result:
    `openssh-server` is installed and `ssh.service` is **enabled but FAILED** on first boot (no
    host keys). `ssh-keygen -A` regenerates them cleanly (unique fingerprint
    `SHA256:MAX191…ED25519`, `root@felhom-spike-9101`).

  → **The bring-up reconcile job must regenerate SSH host keys guest-side** (`ssh-keygen -A`,
  or `dpkg-reconfigure openssh-server`). **This widens the agent's guest-internal surface**
  beyond pure host-side config — the one real design consequence this spike surfaced.
  *Alternative to consider in the spec:* bake a one-shot first-boot unit into the golden that
  runs `ssh-keygen -A` (keeps regeneration guest-internal-but-baked, so the agent stays
  host-side-only). Either way it must be decided; it is **not** free like machine-id.
- **F4 — transient config-lock 500 on back-to-back PUTs.** A `mp0` attach issued immediately
  after a `resize` returned **HTTP 500**, then succeeded (200) on retry seconds later — a
  config-lock contention, **not** a permission issue (token holds `VM.Config.Disk` +
  `Datastore.AllocateSpace`). The reconcile job's existing **per-guest serialization** avoids
  this; add a **retry on transient 500** for safety.
- **F5 — keyctl-through-restore is *functional*, not just flag-present.** Docker started and
  ran `hello-world` in the *restored* guest — re-confirms phase3 #8 on the golden specifically.
- **F6 — Docker storage driver is `overlayfs`** (not `overlay2`), matching phase0's LXC result.
  No extra config beyond `nesting=1,keyctl=1` was needed.
- **F7 — live link-up surfaced no DHCP/ARP problem.** Fresh MAC → fresh lease `192.168.0.109`;
  the golden's old MAC only lingered as a STALE IPv6-neighbour cache entry from the (stopped)
  build guest. No active collision.

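The "retry on transient 500" recommended in F4 could look something like the wrapper below. It is a sketch only: `retry_on_500` is a hypothetical name, the attempt count and delay are caller-chosen, and it assumes the wrapped command prints nothing but the HTTP status code (for example a `curl -o /dev/null -w '%{http_code}'` call).

```shell
# Run a command that prints an HTTP status code and retry while it prints 500.
# $1 = maximum attempts, $2 = delay between attempts in seconds,
# remaining arguments = the command to run.
# Prints the final status; returns 1 if it is still 500 after the last attempt.
retry_on_500() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        status=$("$@")
        if [ "$status" != 500 ]; then
            printf '%s\n' "$status"
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            printf '%s\n' "$status"
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage (not executed here; attach_mp0 is a hypothetical function that
# performs the PUT and prints the HTTP status):
#   retry_on_500 3 2 attach_mp0
```

Only a 500 triggers a retry; any other status, including 4xx permission errors, is returned immediately so that real failures are not masked.
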
## 5. Identity-reset deliverable (the §9 / doc-13 open item — settled for the *provision* path)

| Field | Restore leaves it as | Who resets it | Where | Cost |
|---|---|---|---|---|
| MAC | golden's archived MAC | reconcile job (unconditional) | **host-side** token `PUT net0` (omit hwaddr) | cheap |
| hostname | golden's archived hostname | reconcile job | **host-side** token `PUT hostname` | cheap |
| machine-id | empty (baked) | **systemd, first boot** | guest first-boot regen (golden bake) | **free** |
| SSH host keys | absent (baked) | reconcile job | **guest-side** `ssh-keygen -A` (or baked first-boot unit) | **surface-widening — flag** |

**Reconcile-job front-half reset set (provision):** host-side `{MAC, hostname}` via token config;
guest-side `{SSH host keys}` via `ssh-keygen -A` (or a baked first-boot unit); `{machine-id}` is
handled for free by the bake-clean golden. Restic / tunnel / hub identity are **out of scope**
here (back half, slice 8 / DR policy §9).

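A post-boot check of the two guest-internal fields in the table above might be sketched as follows. `identity_ready` is a hypothetical helper, and how it would be run inside the guest (or against a mounted rootfs) is left open here, since that is exactly the surface question the spec has to settle.

```shell
# Succeed when the filesystem rooted at $1 has a non-empty machine-id
# and at least one SSH host private key; otherwise print the reason.
identity_ready() {
    root=$1
    if [ ! -s "$root/etc/machine-id" ]; then
        echo "machine-id empty or missing"
        return 1
    fi
    for key in "$root"/etc/ssh/ssh_host_*_key; do
        # If the glob matches nothing, $key is the literal pattern and -f fails.
        if [ -f "$key" ]; then
            return 0
        fi
    done
    echo "no SSH host keys"
    return 1
}

# Usage inside a booted guest (not executed here):
#   identity_ready / || echo "identity reset incomplete"
```

On the restored spike guest before `ssh-keygen -A` was run, a check like this would have reported the missing host keys while accepting the regenerated machine-id.
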
## 6. Verdict

**READY to spec the unified bring-up reconcile job (Phase 7.1).** The golden recipe is validated
end-to-end and the token-covered front half (restore → reset MAC+hostname → resize → attach
mount → start link-up) works with Docker functional in the restored guest. **One design change
the findings force:** the front half is **not** purely host-side — SSH-host-key regeneration is a
guest-internal step (F3). The spec must choose between an agent-run `ssh-keygen -A` (widening the
guest-internal surface) and a baked first-boot unit in the golden (keeping the agent host-side).
machine-id needs no such step. MAC reset is unconditional (F1).

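For the baked first-boot option, one possible shape is a one-shot systemd unit written into the golden during the identity-clean step. This was not built or tested in the spike; the unit name `felhom-ssh-hostkeys.service` and the helper `write_hostkey_unit` are hypothetical, and the sketch assumes Debian's `ssh.service` unit name and `/usr/bin/ssh-keygen` path.

```shell
# Write and enable a one-shot unit that regenerates missing SSH host keys
# before ssh.service starts.
# $1 = root of the golden's filesystem ("/" when run inside the build guest).
write_hostkey_unit() {
    unit_dir="$1/etc/systemd/system"
    mkdir -p "$unit_dir/multi-user.target.wants"
    cat > "$unit_dir/felhom-ssh-hostkeys.service" <<'EOF'
[Unit]
Description=Regenerate missing SSH host keys on first boot
Before=ssh.service
ConditionPathExists=!/etc/ssh/ssh_host_ed25519_key

[Service]
Type=oneshot
ExecStart=/usr/bin/ssh-keygen -A

[Install]
WantedBy=multi-user.target
EOF
    # Same effect as `systemctl enable`, done by hand so it also works
    # on an offline root filesystem.
    ln -sf ../felhom-ssh-hostkeys.service \
        "$unit_dir/multi-user.target.wants/felhom-ssh-hostkeys.service"
}

# Usage in the build guest, after `rm -f /etc/ssh/ssh_host_*` (not executed here):
#   write_hostkey_unit /
```

The `ConditionPathExists=!` guard makes the unit a no-op on later boots, so keys are generated once per restored guest and the agent itself never has to enter the guest for this step.
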
## 7. Out of scope (not done here — note for the implementation)

- Controller deploy / bootstrap / per-guest local-token mint — **slice 8** (back half).
- Restic / tunnel / hub identity handling — DR identity policy (§9) + slice 8/10.
- Reconcile-job journaling + compensating rollback — the **implementation** (Phase 7.1),
  specced from these findings; this spike restored/destroyed manually without the journal.
- PBS escrow (§8a) — separate slice-7 thread.