diff --git a/CHANGELOG.md b/CHANGELOG.md index 1fa8a5b..1cf3629 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,38 @@ All notable changes to **felhom-agent** are recorded here. Update on every code change that gets pushed. +## v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08) + +The gate before slice 4: prove `SetConfig` works live under the scoped token before +reconcile is built on it. **Self-gated live run PASSED** on `demo-felhom`/guest 9999. + +### Added +- **Reversible `SetConfig` step appended to `--selftest=task`** (`cmd/felhom-agent/main.go`, + `selftestSetConfig`): read `GuestConfig` → write a `description` marker + (`felhom-selftest `) → verify it landed → restore the original value (or + `delete` the key if it was absent) → verify the restore. Handles PVE's dual-mode + `SetConfig` return per the `mutate.go` contract: empty UPID = synchronous success + (printed `synchronous`); non-empty UPID = `WaitTask` + assert `exitstatus=OK`. + The existing snapshot → rollback → delete-snapshot steps are unchanged. First live + exercise of the **`VM.Config.*`** privilege cluster. +- **`normDesc` / `extraString` helpers** — `extraString` decodes a string-valued key + from `GuestConfig.Extra` (raw JSON); `normDesc` strips the trailing newline PVE + appends to `description` on read, so a written value round-trips equal. + +### Finding (live) +- The LXC `description` write returned **synchronous (empty UPID)** — PVE applied it + inline, no task. The agent's dual-mode `SetConfig` modeling is correct: the + empty-string path is real and must not be treated as an error. +- PVE **appends a trailing `\n` to `description`** on read (stored URL-encoded as + `%0A`). A naive exact-match reconcile would see perpetual drift — slice-4 reconcile + must normalize `description` comparisons (hence `normDesc`). + +### Ops +- Standing operator token (`felhom-agent@pve!agent`, privsep) **rotated** during this + run (the prior secret was not retrievable); role + both user/token ACL rows + re-confirmed at `/`. New secret stored out-of-band, **not persisted to the repo**. + Guest 9999 left pristine (stopped, no `description`, no leftover snapshot). Version → 0.3.2. + ## Docs + live validation — no version bump (2026-06-08) ### Changed diff --git a/CLAUDE.md b/CLAUDE.md index b8efafa..12342dc 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -15,7 +15,7 @@ - Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`). - **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks. - `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host). -- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version="`; `--version` flag. **Current: v0.3.1.** Bump on meaningful changes + add a CHANGELOG entry. +- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version="`; `--version` flag. **Current: v0.3.2.** Bump on meaningful changes + add a CHANGELOG entry. ## Layout @@ -47,7 +47,8 @@ Built in slices, all on `main`: - **v0.2.0** slice 2 — `internal/authz` signed-op verifier. - **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4). - **v0.3.1** — slice-3 validation follow-ups. -- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. **Gated** on passing the live `--selftest=task` runbook first. +- **v0.3.2** — slice-4 pre-check: reversible `SetConfig` step added to `--selftest=task`; passed live on guest 9999. Findings: LXC `description` write is **synchronous** (empty UPID — dual-mode modeling confirmed); PVE appends a trailing `\n` to `description` on read (reconcile must normalize). First live `VM.Config.*` exercise. +- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. The live `--selftest=task` gate (snapshot/rollback/delete + `SetConfig`) is now **passed**. ## Demo host (for live tests) diff --git a/REPORT.md b/REPORT.md index 0fb1ce5..c62024f 100644 --- a/REPORT.md +++ b/REPORT.md @@ -1,34 +1,76 @@ -# REPORT — live `--selftest=task` validation on the demo host (2026-06-08) +# REPORT — `SetConfig` selftest extension, live self-gate (2026-06-08) > Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md). ## Outcome -**`--selftest=task` PASSED live against the demo Proxmox host.** The slice-1 gap — slice-1 mutating ops + `WaitTask` were unit-tested only, never run against a live host — is **closed**. The shared async `WaitTask` foundation (UPID poll → assert `exitstatus == "OK"`) is now validated live. **Slice 4 (reconcile) is unblocked.** +**`SetConfig` PASSED live under the scoped operator token.** The slice-4 pre-check is +satisfied — `--selftest=task -vmid 9999` now exercises a reversible `SetConfig` +write+revert end-to-end and reached `=== selftest=task OK ===` (exit 0). Reconcile +(slice 4) can be built on `SetConfig` with confidence. -## What ran +## What was implemented -Executed the live-validation runbook end-to-end on node `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.x; `root@pam` via SSH alias `felhom-pve` for provisioning, agent run from the build server `192.168.0.180` on go1.26). +A reversible `SetConfig` step appended to the existing `runSelftestTask` flow +(`cmd/felhom-agent/main.go`, `selftestSetConfig`), keeping the prior +snapshot → rollback → delete-snapshot steps intact. Against guest 9999: -1. **Provisioned the operator-tier token (Part A).** Created the `FelhomAgent` role with the full **16 privileges**, the `felhom-agent@pve` user, and a `--privsep 1` token `felhom-agent@pve!agent`. Granted the role on **both** the user **and** the token (the privsep intersection gotcha) — verified two ACL rows at `/`. -2. **Scratch guest (Part B).** Created stopped LXC **9999** (`felhom-selftest-scratch`), rootfs on `local-lvm` (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale `felhom-selftest` snapshot present. -3. **Config + TLS (Part C).** Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (`BA:7C:99:7D:45:D0…`). Built the agent (v0.3.1) on the build server. -4. **Read-only gate (Part D).** `--selftest=read` clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed. -5. **Live mutating run (Part E).** `--selftest=task -vmid 9999` — snapshot → rollback → delete-snapshot, each returning a real UPID that `WaitTask` polled to `exitstatus=OK`. +1. `GuestConfig` — capture the original `description` (was **absent**). +2. `SetConfig description="felhom-selftest "` — dual-mode return handled per + the `mutate.go` contract (empty UPID = synchronous; UPID = `WaitTask`+assert OK). +3. `GuestConfig` again — confirm the marker landed. +4. **Restore** — original was absent, so `SetConfig delete=description`; confirm cleared. -## Evidence +Output matches the existing format: +``` + [ ok ] setconfig synchronous exitstatus=OK + [ ok ] verify-write description verified == marker + [ ok ] setconfig-revert synchronous exitstatus=OK + [ ok ] verify-revert description restored to original +``` -- **`exitstatus=OK`** on all three ops (a `200` on the POST is explicitly *not* treated as success — the `exitstatus` assertion is the point of the run). -- The task **UPIDs name the token actor** (`…:vzsnapshot:9999:felhom-agent@pve!agent:`, likewise `vzrollback` / `vzdelsnapshot`) — confirming the privsep token path was genuinely exercised, no privilege drift. -- **Role:** all 16 privileges present (`VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, the `VM.Config.*` set, `VM.PowerMgmt`, `VM.Allocate/Audit`, `Datastore.*`, `Sys.Audit`, `SDN.Use`). -- **ACLs:** both `-user felhom-agent@pve` and `-token felhom-agent@pve!agent` carry `FelhomAgent` at `/`. -- **Post-state (Part F):** `felhom-selftest` snapshot created then cleaned (only `current` remains); guest left **stopped**, as started. +## Key finding — synchronous, not async -## Scope / not covered (by design) +**The LXC `description` write came back synchronous (empty UPID).** PVE applied it +inline with no task object; the agent printed `synchronous exitstatus=OK` on the +empty-string path. This confirms the agent's **dual-mode `SetConfig` modeling matches +Proxmox reality**: for `description`, the empty-UPID branch is the live path, and +treating `""` as success (not an error) is correct. This was the **first live exercise +of the `VM.Config.*` privilege cluster** (previously only the snapshot/rollback/backup +privileges had been run live). -- **Not validated live:** `Start`/`Stop`/`SetConfig` (reversible, low-risk; `SetConfig` is used by reconcile — an optional selftest extension could add them), `Vzdump` (already confirmed live in the phase1-2 spike), and `RestoreLXC` / provision-by-restore (deferred until the golden base image exists, ~slice 7). -- The run used a **stopped** guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker. +## Second finding — `description` trailing-newline normalization + +PVE **appends a trailing `\n` to `description` on read** (stored URL-encoded as +`%0A...`). The first live run surfaced this as a (false) verify mismatch: +`got="...Z\n"` vs `want="...Z"`. The write had genuinely landed — only my exact-match +check was too strict. Fixed with `normDesc` (strip trailing newline) at every +comparison point, and the run went green. **This is load-bearing intel for slice 4:** +a reconcile that compares desired vs actual `description` verbatim will detect +perpetual drift; it must normalize the trailing newline. + +## Live run environment + +- Built **v0.3.2** on the build server (192.168.0.180, go1.26), pointed at + `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.2). +- Pinned leaf-cert SHA-256 fingerprint re-verified — still + `BA:7C:99:7D:45:D0…` (matches the agent's pin). +- `--selftest=read` clean first (PVE 9.2.2, node online, guests 9001+9999 visible, + storages listed), then the gated `--selftest=task -vmid 9999`. +- Task UPIDs name the token actor (`…:vzsnapshot:9999:felhom-agent@pve!agent:` etc.) — + privsep token path genuinely exercised, no privilege drift. + +## Post-state + +Guest **9999** left pristine: **stopped**, `description` **absent**, only `current` +remains (no leftover `felhom-selftest` snapshot). ## Credentials -The standing `FelhomAgent` operator token (`felhom-agent@pve!agent`) provisioned here is the one slice 4+ consumes — **not deleted**. Its secret is **stored out-of-band**, supplied to the agent via `FELHOM_AGENT_PROXMOX_TOKEN`; it is **not persisted to the repo** (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target. +The standing operator token (`felhom-agent@pve!agent`, privsep) was **rotated** during +this run — the prior secret was not retrievable (PVE reveals a token secret only once +at creation), so a fresh secret was minted via `root@felhom-pve` and the `FelhomAgent` +role re-confirmed on **both** the user and the token ACL at `/` (privsep intersection +gotcha). The token was consumed via the **standing operator token through +`FELHOM_AGENT_PROXMOX_TOKEN`, not persisted to the repo** — the on-disk demo config +carries only a placeholder. The new secret is **stored out-of-band**. diff --git a/cmd/felhom-agent/main.go b/cmd/felhom-agent/main.go index b7d6cac..240486b 100644 --- a/cmd/felhom-agent/main.go +++ b/cmd/felhom-agent/main.go @@ -15,6 +15,7 @@ import ( "log/slog" "os" "os/signal" + "strings" "syscall" "time" @@ -26,7 +27,7 @@ import ( // version is the agent version. Overridable at build time with // -ldflags "-X main.version="; defaults to the in-repo CHANGELOG version. -var version = "0.3.1" +var version = "0.3.2" func main() { var ( @@ -287,10 +288,145 @@ func runSelftestTask(ctx context.Context, cfg config.Config, logger *slog.Logger } fmt.Printf(" [ ok ] %-16s exitstatus=%s\n", st.name, status.ExitStatus) } + + // Reversible SetConfig exercise: this is the first live use of the VM.Config.* + // privilege cluster. It round-trips the cosmetic `description` field (no runtime + // effect, fully reversible) to prove SetConfig works under the scoped token + // before slice 4's reconcile is built on top of it. + if rc := selftestSetConfig(ctx, client, vmid); rc != 0 { + return rc + } + fmt.Println("=== selftest=task OK ===") return 0 } +// selftestSetConfig performs a reversible write+revert of the LXC `description` +// field on vmid and asserts both land. Returns 0 on success, non-zero (with a +// printed [FAIL] line) on any failure — the caller stops on a non-zero return. +func selftestSetConfig(ctx context.Context, client *proxmox.Client, vmid int) int { + // 1. Read current state; capture the original description (may be absent). + cfg, err := client.GuestConfig(ctx, vmid) + if err != nil { + fmt.Printf(" [FAIL] %-16s GuestConfig: %v\n", "setconfig", err) + return 1 + } + origDesc, origPresent, err := extraString(cfg, "description") + if err != nil { + fmt.Printf(" [FAIL] %-16s decode description: %v\n", "setconfig", err) + return 1 + } + // PVE normalizes `description` by appending a trailing newline on read, so all + // comparisons here use normDesc (strip trailing newlines) and restores write the + // normalized original — otherwise an exact-match check sees false drift. This is + // load-bearing intel for slice-4 reconcile (compare descriptions normalized). + origDesc = normDesc(origDesc) + + // 2. Write the marker. + marker := "felhom-selftest " + time.Now().UTC().Format(time.RFC3339) + if rc := applySetConfig(ctx, client, vmid, "setconfig", map[string]string{"description": marker}); rc != 0 { + return rc + } + + // 3. Verify the marker landed. + cfg, err = client.GuestConfig(ctx, vmid) + if err != nil { + fmt.Printf(" [FAIL] %-16s GuestConfig (verify): %v\n", "setconfig", err) + return 1 + } + got, present, err := extraString(cfg, "description") + if err != nil { + fmt.Printf(" [FAIL] %-16s decode description (verify): %v\n", "setconfig", err) + return 1 + } + if !present || normDesc(got) != marker { + fmt.Printf(" [FAIL] %-16s write did not land: present=%v got=%q want=%q\n", "setconfig", present, got, marker) + return 1 + } + fmt.Printf(" [ ok ] %-16s description verified == marker\n", "verify-write") + + // 4. Restore the original value (or clear it if it was absent originally). + var revert map[string]string + if origPresent { + revert = map[string]string{"description": origDesc} + } else { + revert = map[string]string{"delete": "description"} + } + if rc := applySetConfig(ctx, client, vmid, "setconfig-revert", revert); rc != 0 { + return rc + } + + // 5. Confirm the restore. + cfg, err = client.GuestConfig(ctx, vmid) + if err != nil { + fmt.Printf(" [FAIL] %-16s GuestConfig (revert verify): %v\n", "setconfig-revert", err) + return 1 + } + got, present, err = extraString(cfg, "description") + if err != nil { + fmt.Printf(" [FAIL] %-16s decode description (revert verify): %v\n", "setconfig-revert", err) + return 1 + } + if origPresent { + if !present || normDesc(got) != origDesc { + fmt.Printf(" [FAIL] %-16s revert did not restore: present=%v got=%q want=%q\n", "setconfig-revert", present, normDesc(got), origDesc) + return 1 + } + } else if present { + fmt.Printf(" [FAIL] %-16s revert did not clear: still present got=%q\n", "setconfig-revert", got) + return 1 + } + fmt.Printf(" [ ok ] %-16s description restored to original\n", "verify-revert") + return 0 +} + +// applySetConfig runs one SetConfig and asserts success, handling PVE's dual-mode +// return: a UPID means async (WaitTask + assert exitstatus OK); an empty string +// means PVE applied it synchronously (not an error — phase1-2/mutate.go contract). +func applySetConfig(ctx context.Context, client *proxmox.Client, vmid int, step string, params map[string]string) int { + upid, err := client.SetConfig(ctx, vmid, params) + if err != nil { + fmt.Printf(" [FAIL] %-16s %v\n", step, err) + return 1 + } + if upid == "" { + // Synchronous path: empty UPID is a clean success. + fmt.Printf(" [ ok ] %-16s synchronous exitstatus=OK\n", step) + return 0 + } + status, err := client.WaitTask(ctx, upid, proxmox.WaitOptions{}) + if err != nil { + fmt.Printf(" [FAIL] %-16s %s %v\n", step, upid, err) + return 1 + } + if status.ExitStatus != "OK" { + fmt.Printf(" [FAIL] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus) + return 1 + } + fmt.Printf(" [ ok ] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus) + return 0 +} + +// normDesc strips trailing newlines that PVE appends to the `description` field +// on read, so a written value round-trips equal. (PVE stores `description` with a +// trailing "\n"; comparing raw would always mismatch.) +func normDesc(s string) string { return strings.TrimRight(s, "\n") } + +// extraString reads a string-valued key from GuestConfig.Extra (raw JSON). It +// returns ("", false, nil) when the key is absent, and decodes the JSON string +// otherwise. +func extraString(cfg proxmox.GuestConfig, key string) (string, bool, error) { + raw, ok := cfg.Extra[key] + if !ok || len(raw) == 0 { + return "", false, nil + } + var s string + if err := json.Unmarshal(raw, &s); err != nil { + return "", false, err + } + return s, true, nil +} + // --- small helpers / flag type --- func envOr(key, def string) string {