v0.3.2: reversible SetConfig step in --selftest=task (slice-4 pre-check)
Append a reversible SetConfig write+revert to runSelftestTask: read GuestConfig, write a `description` marker, verify it landed, restore the original (or delete if absent), verify the restore. Handles PVE's dual-mode SetConfig return (empty UPID = synchronous; UPID = WaitTask+assert OK).

Live self-gate PASSED on demo-felhom / guest 9999. Findings:

- LXC `description` write is synchronous (empty UPID) — dual-mode modeling confirmed; empty string is success, not an error.
- PVE appends a trailing newline to `description` on read; slice-4 reconcile must normalize description comparisons (hence normDesc helper).

First live exercise of the VM.Config.* privilege cluster. Standing operator token rotated during the run; new secret stored out-of-band, not in the repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -1,34 +1,76 @@
# REPORT — live `--selftest=task` validation on the demo host (2026-06-08)
# REPORT — `SetConfig` selftest extension, live self-gate (2026-06-08)

> Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).

## Outcome

**`--selftest=task` PASSED live against the demo Proxmox host.** The slice-1 gap — slice-1 mutating ops + `WaitTask` were unit-tested only, never run against a live host — is **closed**. The shared async `WaitTask` foundation (UPID poll → assert `exitstatus == "OK"`) is now validated live. **Slice 4 (reconcile) is unblocked.**
**`SetConfig` PASSED live under the scoped operator token.** The slice-4 pre-check is
satisfied — `--selftest=task -vmid 9999` now exercises a reversible `SetConfig`
write+revert end-to-end and reached `=== selftest=task OK ===` (exit 0). Reconcile
(slice 4) can be built on `SetConfig` with confidence.

## What ran
## What was implemented

Executed the live-validation runbook end-to-end on node `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.x; `root@pam` via SSH alias `felhom-pve` for provisioning, agent run from the build server `192.168.0.180` on go1.26).
A reversible `SetConfig` step appended to the existing `runSelftestTask` flow
(`cmd/felhom-agent/main.go`, `selftestSetConfig`), keeping the prior
snapshot → rollback → delete-snapshot steps intact. Against guest 9999:

1. **Provisioned the operator-tier token (Part A).** Created the `FelhomAgent` role with the full **16 privileges**, the `felhom-agent@pve` user, and a `--privsep 1` token `felhom-agent@pve!agent`. Granted the role on **both** the user **and** the token (the privsep intersection gotcha) — verified two ACL rows at `/`.
2. **Scratch guest (Part B).** Created stopped LXC **9999** (`felhom-selftest-scratch`), rootfs on `local-lvm` (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale `felhom-selftest` snapshot present.
3. **Config + TLS (Part C).** Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (`BA:7C:99:7D:45:D0…`). Built the agent (v0.3.1) on the build server.
4. **Read-only gate (Part D).** `--selftest=read` clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed.
5. **Live mutating run (Part E).** `--selftest=task -vmid 9999` — snapshot → rollback → delete-snapshot, each returning a real UPID that `WaitTask` polled to `exitstatus=OK`.
1. `GuestConfig` — capture the original `description` (was **absent**).
2. `SetConfig description="felhom-selftest <RFC3339>"` — dual-mode return handled per
the `mutate.go` contract (empty UPID = synchronous; UPID = `WaitTask`+assert OK).
3. `GuestConfig` again — confirm the marker landed.
4. **Restore** — original was absent, so `SetConfig delete=description`; confirm cleared.
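
The capture, write, verify, restore, verify sequence above can be sketched roughly as follows. This is an illustration only, not the code in `selftestSetConfig`: the `configStore` interface, the `fakeGuest` in-memory stand-in, and the `reversibleWrite` and `norm` names are all hypothetical, since the report does not show the real `GuestConfig`/`SetConfig` signatures. The fake mimics the one PVE behaviour the run observed (a trailing newline appended on read).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// configStore is a hypothetical stand-in for the agent's Proxmox client.
type configStore interface {
	Get(key string) (value string, present bool)
	Set(key, value string) error
	Delete(key string) error
}

// fakeGuest is an in-memory guest config. Like the live host in this run,
// it returns stored values with a trailing newline appended.
type fakeGuest map[string]string

func (g fakeGuest) Get(k string) (string, bool) {
	v, ok := g[k]
	if !ok {
		return "", false
	}
	return v + "\n", true
}
func (g fakeGuest) Set(k, v string) error { g[k] = v; return nil }
func (g fakeGuest) Delete(k string) error { delete(g, k); return nil }

// norm strips trailing newlines so read-back values compare cleanly.
func norm(s string) string { return strings.TrimRight(s, "\n") }

// reversibleWrite writes marker under key, verifies it, then restores the
// original value (or deletes the key if it was absent) and verifies that too.
func reversibleWrite(s configStore, key, marker string) error {
	orig, had := s.Get(key)
	orig = norm(orig)
	if err := s.Set(key, marker); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	if got, ok := s.Get(key); !ok || norm(got) != marker {
		return errors.New("verify-write: marker did not land")
	}
	var err error
	if had {
		err = s.Set(key, orig)
	} else {
		err = s.Delete(key)
	}
	if err != nil {
		return fmt.Errorf("revert: %w", err)
	}
	if got, ok := s.Get(key); ok != had || norm(got) != orig {
		return errors.New("verify-revert: original not restored")
	}
	return nil
}

func main() {
	guest := fakeGuest{} // description absent, as on guest 9999
	err := reversibleWrite(guest, "description", "felhom-selftest marker")
	fmt.Println("error:", err, "keys left:", len(guest))
}
```

Tracking "absent" separately from "empty" is the design point here: restoring an absent key by writing an empty string would leave the guest in a different state than it started in.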

## Evidence
Output matches the existing format:
```
[ ok ] setconfig synchronous exitstatus=OK
[ ok ] verify-write description verified == marker
[ ok ] setconfig-revert synchronous exitstatus=OK
[ ok ] verify-revert description restored to original
```

- **`exitstatus=OK`** on all three ops (a `200` on the POST is explicitly *not* treated as success — the `exitstatus` assertion is the point of the run).
- The task **UPIDs name the token actor** (`…:vzsnapshot:9999:felhom-agent@pve!agent:`, likewise `vzrollback` / `vzdelsnapshot`) — confirming the privsep token path was genuinely exercised, no privilege drift.
- **Role:** all 16 privileges present (`VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, the `VM.Config.*` set, `VM.PowerMgmt`, `VM.Allocate/Audit`, `Datastore.*`, `Sys.Audit`, `SDN.Use`).
- **ACLs:** both `-user felhom-agent@pve` and `-token felhom-agent@pve!agent` carry `FelhomAgent` at `/`.
- **Post-state (Part F):** `felhom-selftest` snapshot created then cleaned (only `current` remains); guest left **stopped**, as started.
## Key finding — synchronous, not async

## Scope / not covered (by design)
**The LXC `description` write came back synchronous (empty UPID).** PVE applied it
inline with no task object; the agent printed `synchronous exitstatus=OK` on the
empty-string path. This confirms the agent's **dual-mode `SetConfig` modeling matches
Proxmox reality**: for `description`, the empty-UPID branch is the live path, and
treating `""` as success (not an error) is correct. This was the **first live exercise
of the `VM.Config.*` privilege cluster** (previously only the snapshot/rollback/backup
privileges had been run live).
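
A minimal sketch of the dual-mode handling described above, assuming the caller already has the UPID string returned by the write. The `settle` function and the `waitFunc` type are hypothetical names standing in for the agent's `mutate.go` contract and `WaitTask`, whose real signatures are not shown in this report.

```go
package main

import "fmt"

// waitFunc is a hypothetical stand-in for WaitTask: it blocks until the task
// identified by upid finishes and returns the task's exitstatus.
type waitFunc func(upid string) (exitstatus string, err error)

// settle resolves a SetConfig-style return value.
// An empty UPID means the host applied the change inline: success, nothing to wait on.
// A non-empty UPID means a task was spawned: wait for it and require exitstatus "OK".
func settle(upid string, wait waitFunc) (mode string, err error) {
	if upid == "" {
		return "synchronous", nil
	}
	status, err := wait(upid)
	if err != nil {
		return "async", fmt.Errorf("wait %s: %w", upid, err)
	}
	if status != "OK" {
		return "async", fmt.Errorf("task %s: exitstatus=%q", upid, status)
	}
	return "async", nil
}

func main() {
	// okWait pretends every task finishes cleanly.
	okWait := func(string) (string, error) { return "OK", nil }
	for _, upid := range []string{"", "UPID:example"} {
		mode, err := settle(upid, okWait)
		fmt.Printf("upid=%q mode=%s err=%v\n", upid, mode, err)
	}
}
```

The empty-string branch returns before `wait` is ever called, which is why a synchronous write cannot be mistaken for a failed task lookup.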

- **Not validated live:** `Start`/`Stop`/`SetConfig` (reversible, low-risk; `SetConfig` is used by reconcile — an optional selftest extension could add them), `Vzdump` (already confirmed live in the phase1-2 spike), and `RestoreLXC` / provision-by-restore (deferred until the golden base image exists, ~slice 7).
- The run used a **stopped** guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker.
## Second finding — `description` trailing-newline normalization

PVE **appends a trailing `\n` to `description` on read** (stored URL-encoded as
`%0A...`). The first live run surfaced this as a (false) verify mismatch:
`got="...Z\n"` vs `want="...Z"`. The write had genuinely landed — only my exact-match
check was too strict. Fixed with `normDesc` (strip trailing newline) at every
comparison point, and the run went green. **This is load-bearing intel for slice 4:**
a reconcile that compares desired vs actual `description` verbatim will detect
perpetual drift; it must normalize the trailing newline.
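
One plausible shape for the normalization, shown as a sketch. The body of the real `normDesc` is not included in this report, so the implementation here is an assumption, and `descEqual` is a hypothetical helper added only for illustration. This version trims every trailing newline rather than exactly one, which is a slightly more forgiving choice than the report describes.

```go
package main

import (
	"fmt"
	"strings"
)

// normDesc removes trailing newlines from a description value.
// Assumption: trimming all trailing "\n" is acceptable for comparison purposes.
func normDesc(s string) string {
	return strings.TrimRight(s, "\n")
}

// descEqual (hypothetical) compares desired and actual descriptions after
// normalizing both sides, so a host-appended newline is not reported as drift.
func descEqual(want, got string) bool {
	return normDesc(want) == normDesc(got)
}

func main() {
	want := "felhom-selftest marker"
	got := want + "\n" // what a read-back looked like in the first live run
	fmt.Println("verbatim equal:", want == got)
	fmt.Println("normalized equal:", descEqual(want, got))
}
```

Normalizing both sides, not only the read-back value, keeps the comparison symmetric if a desired description ever arrives with its own trailing newline.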

## Live run environment

- Built **v0.3.2** on the build server (192.168.0.180, go1.26), pointed at
`demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.2).
- Pinned leaf-cert SHA-256 fingerprint re-verified — still
`BA:7C:99:7D:45:D0…` (matches the agent's pin).
- `--selftest=read` clean first (PVE 9.2.2, node online, guests 9001+9999 visible,
storages listed), then the gated `--selftest=task -vmid 9999`.
- Task UPIDs name the token actor (`…:vzsnapshot:9999:felhom-agent@pve!agent:` etc.) —
privsep token path genuinely exercised, no privilege drift.
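
The actor check in the last bullet could be automated along these lines. This is a sketch under a stated assumption: it takes the UPID to be nine colon-separated fields (`UPID:node:pid:pstart:starttime:type:id:user:`), which is consistent with the fragment quoted above but is not spelled out in this report. `parseUPID` and `upidInfo` are hypothetical names, and the sample UPID in `main` is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// upidInfo holds the fields of interest from a task UPID.
type upidInfo struct {
	Node, Type, ID, User string
}

// parseUPID splits a UPID on ":" and picks out node, task type, guest ID, and
// acting user. Assumed layout: UPID:node:pid:pstart:starttime:type:id:user:
func parseUPID(upid string) (upidInfo, error) {
	parts := strings.Split(upid, ":")
	if len(parts) != 9 || parts[0] != "UPID" {
		return upidInfo{}, fmt.Errorf("unexpected UPID shape: %q", upid)
	}
	return upidInfo{Node: parts[1], Type: parts[5], ID: parts[6], User: parts[7]}, nil
}

func main() {
	// Made-up sample; only the type, guest ID, and user mirror the report.
	sample := "UPID:demo-felhom:0001A2B3:00C4D5E6:665F1A2B:vzsnapshot:9999:felhom-agent@pve!agent:"
	info, err := parseUPID(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// A token actor carries a "!" between the user and the token name.
	fmt.Println("type:", info.Type, "guest:", info.ID, "token actor:", strings.Contains(info.User, "!"))
}
```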

## Post-state

Guest **9999** left pristine: **stopped**, `description` **absent**, only `current`
remains (no leftover `felhom-selftest` snapshot).

## Credentials

The standing `FelhomAgent` operator token (`felhom-agent@pve!agent`) provisioned here is the one slice 4+ consumes — **not deleted**. Its secret is **stored out-of-band**, supplied to the agent via `FELHOM_AGENT_PROXMOX_TOKEN`; it is **not persisted to the repo** (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target.
The standing operator token (`felhom-agent@pve!agent`, privsep) was **rotated** during
this run: the prior secret was not retrievable (PVE reveals a token secret only once
at creation), so a fresh secret was minted via `root@felhom-pve` and the `FelhomAgent`
role re-confirmed on **both** the user and the token ACL at `/` (privsep intersection
gotcha). The agent read the new secret from `FELHOM_AGENT_PROXMOX_TOKEN`; it is
**not persisted to the repo**, and the on-disk demo config carries only a
placeholder. The new secret is **stored out-of-band**.