v0.3.2: reversible SetConfig step in --selftest=task (slice-4 pre-check)

Append a reversible SetConfig write+revert to runSelftestTask: read
GuestConfig, write a `description` marker, verify it landed, restore the
original (or delete if absent), verify the restore. Handles PVE's dual-mode
SetConfig return (empty UPID = synchronous; UPID = WaitTask+assert OK).

Live self-gate PASSED on demo-felhom / guest 9999. Findings:
- LXC `description` write is synchronous (empty UPID) — dual-mode modeling
  confirmed; empty string is success, not an error.
- PVE appends a trailing newline to `description` on read; slice-4 reconcile
  must normalize description comparisons (hence normDesc helper).

First live exercise of the VM.Config.* privilege cluster. Standing operator
token rotated during the run; new secret stored out-of-band, not in the repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-08 21:13:04 +02:00
parent 237452c8c6
commit 605ce25f58
4 changed files with 233 additions and 22 deletions
+32
View File
@@ -3,6 +3,38 @@
All notable changes to **felhom-agent** are recorded here. Update on every code All notable changes to **felhom-agent** are recorded here. Update on every code
change that gets pushed. change that gets pushed.
## v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)
The gate before slice 4: prove `SetConfig` works live under the scoped token before
reconcile is built on it. **Self-gated live run PASSED** on `demo-felhom`/guest 9999.
### Added
- **Reversible `SetConfig` step appended to `--selftest=task`** (`cmd/felhom-agent/main.go`,
`selftestSetConfig`): read `GuestConfig` → write a `description` marker
(`felhom-selftest <RFC3339>`) → verify it landed → restore the original value (or
`delete` the key if it was absent) → verify the restore. Handles PVE's dual-mode
`SetConfig` return per the `mutate.go` contract: empty UPID = synchronous success
(printed `synchronous`); non-empty UPID = `WaitTask` + assert `exitstatus=OK`.
The existing snapshot → rollback → delete-snapshot steps are unchanged. First live
exercise of the **`VM.Config.*`** privilege cluster.
- **`normDesc` / `extraString` helpers** — `extraString` decodes a string-valued key
from `GuestConfig.Extra` (raw JSON); `normDesc` strips the trailing newline PVE
appends to `description` on read, so a written value round-trips equal.
### Finding (live)
- The LXC `description` write returned **synchronous (empty UPID)** — PVE applied it
inline, no task. The agent's dual-mode `SetConfig` modeling is correct: the
empty-string path is real and must not be treated as an error.
- PVE **appends a trailing `\n` to `description`** on read (stored URL-encoded as
`%0A`). A naive exact-match reconcile would see perpetual drift — slice-4 reconcile
must normalize `description` comparisons (hence `normDesc`).
### Ops
- Standing operator token (`felhom-agent@pve!agent`, privsep) **rotated** during this
run (the prior secret was not retrievable); role + both user/token ACL rows
re-confirmed at `/`. New secret stored out-of-band, **not persisted to the repo**.
Guest 9999 left pristine (stopped, no `description`, no leftover snapshot). Version → 0.3.2.
## Docs + live validation — no version bump (2026-06-08) ## Docs + live validation — no version bump (2026-06-08)
### Changed ### Changed
+3 -2
View File
@@ -15,7 +15,7 @@
- Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`). - Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`).
- **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks. - **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks.
- `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host). - `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.3.1.** Bump on meaningful changes + add a CHANGELOG entry. - Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.3.2.** Bump on meaningful changes + add a CHANGELOG entry.
## Layout ## Layout
@@ -47,7 +47,8 @@ Built in slices, all on `main`:
- **v0.2.0** slice 2 — `internal/authz` signed-op verifier. - **v0.2.0** slice 2 — `internal/authz` signed-op verifier.
- **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4). - **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4).
- **v0.3.1** — slice-3 validation follow-ups. - **v0.3.1** — slice-3 validation follow-ups.
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. **Gated** on passing the live `--selftest=task` runbook first. - **v0.3.2** — slice-4 pre-check: reversible `SetConfig` step added to `--selftest=task`; passed live on guest 9999. Findings: LXC `description` write is **synchronous** (empty UPID — dual-mode modeling confirmed); PVE appends a trailing `\n` to `description` on read (reconcile must normalize). First live `VM.Config.*` exercise.
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. The live `--selftest=task` gate (snapshot/rollback/delete + `SetConfig`) is now **passed**.
## Demo host (for live tests) ## Demo host (for live tests)
+61 -19
View File
@@ -1,34 +1,76 @@
# REPORT — live `--selftest=task` validation on the demo host (2026-06-08) # REPORT — `SetConfig` selftest extension, live self-gate (2026-06-08)
> Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md). > Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
## Outcome ## Outcome
**`--selftest=task` PASSED live against the demo Proxmox host.** The slice-1 gap — slice-1 mutating ops + `WaitTask` were unit-tested only, never run against a live host — is **closed**. The shared async `WaitTask` foundation (UPID poll → assert `exitstatus == "OK"`) is now validated live. **Slice 4 (reconcile) is unblocked.** **`SetConfig` PASSED live under the scoped operator token.** The slice-4 pre-check is
satisfied — `--selftest=task -vmid 9999` now exercises a reversible `SetConfig`
write+revert end-to-end and reached `=== selftest=task OK ===` (exit 0). Reconcile
(slice 4) can be built on `SetConfig` with confidence.
## What ran ## What was implemented
Executed the live-validation runbook end-to-end on node `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.x; `root@pam` via SSH alias `felhom-pve` for provisioning, agent run from the build server `192.168.0.180` on go1.26). A reversible `SetConfig` step appended to the existing `runSelftestTask` flow
(`cmd/felhom-agent/main.go`, `selftestSetConfig`), keeping the prior
snapshot → rollback → delete-snapshot steps intact. Against guest 9999:
1. **Provisioned the operator-tier token (Part A).** Created the `FelhomAgent` role with the full **16 privileges**, the `felhom-agent@pve` user, and a `--privsep 1` token `felhom-agent@pve!agent`. Granted the role on **both** the user **and** the token (the privsep intersection gotcha) — verified two ACL rows at `/`. 1. `GuestConfig` — capture the original `description` (was **absent**).
2. **Scratch guest (Part B).** Created stopped LXC **9999** (`felhom-selftest-scratch`), rootfs on `local-lvm` (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale `felhom-selftest` snapshot present. 2. `SetConfig description="felhom-selftest <RFC3339>"` — dual-mode return handled per
3. **Config + TLS (Part C).** Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (`BA:7C:99:7D:45:D0…`). Built the agent (v0.3.1) on the build server. the `mutate.go` contract (empty UPID = synchronous; UPID = `WaitTask`+assert OK).
4. **Read-only gate (Part D).** `--selftest=read` clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed. 3. `GuestConfig` again — confirm the marker landed.
5. **Live mutating run (Part E).** `--selftest=task -vmid 9999` — snapshot → rollback → delete-snapshot, each returning a real UPID that `WaitTask` polled to `exitstatus=OK`. 4. **Restore** — original was absent, so `SetConfig delete=description`; confirm cleared.
## Evidence Output matches the existing format:
```
[ ok ] setconfig synchronous exitstatus=OK
[ ok ] verify-write description verified == marker
[ ok ] setconfig-revert synchronous exitstatus=OK
[ ok ] verify-revert description restored to original
```
- **`exitstatus=OK`** on all three ops (a `200` on the POST is explicitly *not* treated as success — the `exitstatus` assertion is the point of the run). ## Key finding — synchronous, not async
- The task **UPIDs name the token actor** (`…:vzsnapshot:9999:felhom-agent@pve!agent:`, likewise `vzrollback` / `vzdelsnapshot`) — confirming the privsep token path was genuinely exercised, no privilege drift.
- **Role:** all 16 privileges present (`VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, the `VM.Config.*` set, `VM.PowerMgmt`, `VM.Allocate/Audit`, `Datastore.*`, `Sys.Audit`, `SDN.Use`).
- **ACLs:** both `-user felhom-agent@pve` and `-token felhom-agent@pve!agent` carry `FelhomAgent` at `/`.
- **Post-state (Part F):** `felhom-selftest` snapshot created then cleaned (only `current` remains); guest left **stopped**, as started.
## Scope / not covered (by design) **The LXC `description` write came back synchronous (empty UPID).** PVE applied it
inline with no task object; the agent printed `synchronous exitstatus=OK` on the
empty-string path. This confirms the agent's **dual-mode `SetConfig` modeling matches
Proxmox reality**: for `description`, the empty-UPID branch is the live path, and
treating `""` as success (not an error) is correct. This was the **first live exercise
of the `VM.Config.*` privilege cluster** (previously only the snapshot/rollback/backup
privileges had been run live).
- **Not validated live:** `Start`/`Stop`/`SetConfig` (reversible, low-risk; `SetConfig` is used by reconcile — an optional selftest extension could add them), `Vzdump` (already confirmed live in the phase1-2 spike), and `RestoreLXC` / provision-by-restore (deferred until the golden base image exists, ~slice 7). ## Second finding — `description` trailing-newline normalization
- The run used a **stopped** guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker.
PVE **appends a trailing `\n` to `description` on read** (stored URL-encoded as
`%0A...`). The first live run surfaced this as a (false) verify mismatch:
`got="...Z\n"` vs `want="...Z"`. The write had genuinely landed — only my exact-match
check was too strict. Fixed with `normDesc` (strip trailing newline) at every
comparison point, and the run went green. **This is load-bearing intel for slice 4:**
a reconcile that compares desired vs actual `description` verbatim will detect
perpetual drift; it must normalize the trailing newline.
## Live run environment
- Built **v0.3.2** on the build server (192.168.0.180, go1.26), pointed at
`demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.2).
- Pinned leaf-cert SHA-256 fingerprint re-verified — still
`BA:7C:99:7D:45:D0…` (matches the agent's pin).
- `--selftest=read` clean first (PVE 9.2.2, node online, guests 9001+9999 visible,
storages listed), then the gated `--selftest=task -vmid 9999`.
- Task UPIDs name the token actor (`…:vzsnapshot:9999:felhom-agent@pve!agent:` etc.) —
privsep token path genuinely exercised, no privilege drift.
## Post-state
Guest **9999** left pristine: **stopped**, `description` **absent**, only `current`
remains (no leftover `felhom-selftest` snapshot).
## Credentials ## Credentials
The standing `FelhomAgent` operator token (`felhom-agent@pve!agent`) provisioned here is the one slice 4+ consumes — **not deleted**. Its secret is **stored out-of-band**, supplied to the agent via `FELHOM_AGENT_PROXMOX_TOKEN`; it is **not persisted to the repo** (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target. The standing operator token (`felhom-agent@pve!agent`, privsep) was **rotated** during
this run — the prior secret was not retrievable (PVE reveals a token secret only once
at creation), so a fresh secret was minted via `root@felhom-pve` and the `FelhomAgent`
role re-confirmed on **both** the user and the token ACL at `/` (privsep intersection
gotcha). The token was consumed via the **standing operator token through
`FELHOM_AGENT_PROXMOX_TOKEN`, not persisted to the repo** — the on-disk demo config
carries only a placeholder. The new secret is **stored out-of-band**.
+137 -1
View File
@@ -15,6 +15,7 @@ import (
"log/slog" "log/slog"
"os" "os"
"os/signal" "os/signal"
"strings"
"syscall" "syscall"
"time" "time"
@@ -26,7 +27,7 @@ import (
// version is the agent version. Overridable at build time with // version is the agent version. Overridable at build time with
// -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version. // -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version.
var version = "0.3.1" var version = "0.3.2"
func main() { func main() {
var ( var (
@@ -287,10 +288,145 @@ func runSelftestTask(ctx context.Context, cfg config.Config, logger *slog.Logger
} }
fmt.Printf(" [ ok ] %-16s exitstatus=%s\n", st.name, status.ExitStatus) fmt.Printf(" [ ok ] %-16s exitstatus=%s\n", st.name, status.ExitStatus)
} }
// Reversible SetConfig exercise: this is the first live use of the VM.Config.*
// privilege cluster. It round-trips the cosmetic `description` field (no runtime
// effect, fully reversible) to prove SetConfig works under the scoped token
// before slice 4's reconcile is built on top of it.
if rc := selftestSetConfig(ctx, client, vmid); rc != 0 {
return rc
}
fmt.Println("=== selftest=task OK ===") fmt.Println("=== selftest=task OK ===")
return 0 return 0
} }
// selftestSetConfig performs a reversible write+revert of the LXC `description`
// field on vmid and asserts both land. Returns 0 on success, non-zero (with a
// printed [FAIL] line) on any failure — the caller stops on a non-zero return.
func selftestSetConfig(ctx context.Context, client *proxmox.Client, vmid int) int {
// 1. Read current state; capture the original description (may be absent).
cfg, err := client.GuestConfig(ctx, vmid)
if err != nil {
fmt.Printf(" [FAIL] %-16s GuestConfig: %v\n", "setconfig", err)
return 1
}
origDesc, origPresent, err := extraString(cfg, "description")
if err != nil {
fmt.Printf(" [FAIL] %-16s decode description: %v\n", "setconfig", err)
return 1
}
// PVE normalizes `description` by appending a trailing newline on read, so all
// comparisons here use normDesc (strip trailing newlines) and restores write the
// normalized original — otherwise an exact-match check sees false drift. This is
// load-bearing intel for slice-4 reconcile (compare descriptions normalized).
origDesc = normDesc(origDesc)
// 2. Write the marker.
marker := "felhom-selftest " + time.Now().UTC().Format(time.RFC3339)
if rc := applySetConfig(ctx, client, vmid, "setconfig", map[string]string{"description": marker}); rc != 0 {
return rc
}
// 3. Verify the marker landed.
cfg, err = client.GuestConfig(ctx, vmid)
if err != nil {
fmt.Printf(" [FAIL] %-16s GuestConfig (verify): %v\n", "setconfig", err)
return 1
}
got, present, err := extraString(cfg, "description")
if err != nil {
fmt.Printf(" [FAIL] %-16s decode description (verify): %v\n", "setconfig", err)
return 1
}
if !present || normDesc(got) != marker {
fmt.Printf(" [FAIL] %-16s write did not land: present=%v got=%q want=%q\n", "setconfig", present, got, marker)
return 1
}
fmt.Printf(" [ ok ] %-16s description verified == marker\n", "verify-write")
// 4. Restore the original value (or clear it if it was absent originally).
var revert map[string]string
if origPresent {
revert = map[string]string{"description": origDesc}
} else {
revert = map[string]string{"delete": "description"}
}
if rc := applySetConfig(ctx, client, vmid, "setconfig-revert", revert); rc != 0 {
return rc
}
// 5. Confirm the restore.
cfg, err = client.GuestConfig(ctx, vmid)
if err != nil {
fmt.Printf(" [FAIL] %-16s GuestConfig (revert verify): %v\n", "setconfig-revert", err)
return 1
}
got, present, err = extraString(cfg, "description")
if err != nil {
fmt.Printf(" [FAIL] %-16s decode description (revert verify): %v\n", "setconfig-revert", err)
return 1
}
if origPresent {
if !present || normDesc(got) != origDesc {
fmt.Printf(" [FAIL] %-16s revert did not restore: present=%v got=%q want=%q\n", "setconfig-revert", present, normDesc(got), origDesc)
return 1
}
} else if present {
fmt.Printf(" [FAIL] %-16s revert did not clear: still present got=%q\n", "setconfig-revert", got)
return 1
}
fmt.Printf(" [ ok ] %-16s description restored to original\n", "verify-revert")
return 0
}
// applySetConfig runs one SetConfig and asserts success, handling PVE's dual-mode
// return: a UPID means async (WaitTask + assert exitstatus OK); an empty string
// means PVE applied it synchronously (not an error — phase1-2/mutate.go contract).
func applySetConfig(ctx context.Context, client *proxmox.Client, vmid int, step string, params map[string]string) int {
upid, err := client.SetConfig(ctx, vmid, params)
if err != nil {
fmt.Printf(" [FAIL] %-16s %v\n", step, err)
return 1
}
if upid == "" {
// Synchronous path: empty UPID is a clean success.
fmt.Printf(" [ ok ] %-16s synchronous exitstatus=OK\n", step)
return 0
}
status, err := client.WaitTask(ctx, upid, proxmox.WaitOptions{})
if err != nil {
fmt.Printf(" [FAIL] %-16s %s %v\n", step, upid, err)
return 1
}
if status.ExitStatus != "OK" {
fmt.Printf(" [FAIL] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus)
return 1
}
fmt.Printf(" [ ok ] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus)
return 0
}
// normDesc strips trailing newlines that PVE appends to the `description` field
// on read, so a written value round-trips equal. (PVE stores `description` with a
// trailing "\n"; comparing raw would always mismatch.)
func normDesc(s string) string { return strings.TrimRight(s, "\n") }
// extraString reads a string-valued key from GuestConfig.Extra (raw JSON). It
// returns ("", false, nil) when the key is absent, and decodes the JSON string
// otherwise.
func extraString(cfg proxmox.GuestConfig, key string) (string, bool, error) {
raw, ok := cfg.Extra[key]
if !ok || len(raw) == 0 {
return "", false, nil
}
var s string
if err := json.Unmarshal(raw, &s); err != nil {
return "", false, err
}
return s, true, nil
}
// --- small helpers / flag type --- // --- small helpers / flag type ---
func envOr(key, def string) string { func envOr(key, def string) string {