v0.3.2: reversible SetConfig step in --selftest=task (slice-4 pre-check)
Append a reversible SetConfig write+revert to runSelftestTask: read GuestConfig, write a `description` marker, verify it landed, restore the original (or delete if absent), verify the restore. Handles PVE's dual-mode SetConfig return (empty UPID = synchronous; UPID = WaitTask+assert OK). Live self-gate PASSED on demo-felhom / guest 9999. Findings: - LXC `description` write is synchronous (empty UPID) — dual-mode modeling confirmed; empty string is success, not an error. - PVE appends a trailing newline to `description` on read; slice-4 reconcile must normalize description comparisons (hence normDesc helper). First live exercise of the VM.Config.* privilege cluster. Standing operator token rotated during the run; new secret stored out-of-band, not in the repo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -3,6 +3,38 @@
|
||||
All notable changes to **felhom-agent** are recorded here. Update on every code
|
||||
change that gets pushed.
|
||||
|
||||
## v0.3.2 — SetConfig selftest extension (slice-4 pre-check) (2026-06-08)
|
||||
|
||||
The gate before slice 4: prove `SetConfig` works live under the scoped token before
|
||||
reconcile is built on it. **Self-gated live run PASSED** on `demo-felhom`/guest 9999.
|
||||
|
||||
### Added
|
||||
- **Reversible `SetConfig` step appended to `--selftest=task`** (`cmd/felhom-agent/main.go`,
|
||||
`selftestSetConfig`): read `GuestConfig` → write a `description` marker
|
||||
(`felhom-selftest <RFC3339>`) → verify it landed → restore the original value (or
|
||||
`delete` the key if it was absent) → verify the restore. Handles PVE's dual-mode
|
||||
`SetConfig` return per the `mutate.go` contract: empty UPID = synchronous success
|
||||
(printed `synchronous`); non-empty UPID = `WaitTask` + assert `exitstatus=OK`.
|
||||
The existing snapshot → rollback → delete-snapshot steps are unchanged. First live
|
||||
exercise of the **`VM.Config.*`** privilege cluster.
|
||||
- **`normDesc` / `extraString` helpers** — `extraString` decodes a string-valued key
|
||||
from `GuestConfig.Extra` (raw JSON); `normDesc` strips the trailing newline PVE
|
||||
appends to `description` on read, so a written value round-trips equal.
|
||||
|
||||
### Finding (live)
|
||||
- The LXC `description` write returned **synchronous (empty UPID)** — PVE applied it
|
||||
inline, no task. The agent's dual-mode `SetConfig` modeling is correct: the
|
||||
empty-string path is real and must not be treated as an error.
|
||||
- PVE **appends a trailing `\n` to `description`** on read (stored URL-encoded as
|
||||
`%0A`). A naive exact-match reconcile would see perpetual drift — slice-4 reconcile
|
||||
must normalize `description` comparisons (hence `normDesc`).
|
||||
|
||||
### Ops
|
||||
- Standing operator token (`felhom-agent@pve!agent`, privsep) **rotated** during this
|
||||
run (the prior secret was not retrievable); role + both user/token ACL rows
|
||||
re-confirmed at `/`. New secret stored out-of-band, **not persisted to the repo**.
|
||||
Guest 9999 left pristine (stopped, no `description`, no leftover snapshot). Version → 0.3.2.
|
||||
|
||||
## Docs + live validation — no version bump (2026-06-08)
|
||||
|
||||
### Changed
|
||||
|
||||
@@ -15,7 +15,7 @@
|
||||
- Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`).
|
||||
- **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks.
|
||||
- `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
|
||||
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.3.1.** Bump on meaningful changes + add a CHANGELOG entry.
|
||||
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.3.2.** Bump on meaningful changes + add a CHANGELOG entry.
|
||||
|
||||
## Layout
|
||||
|
||||
@@ -47,7 +47,8 @@ Built in slices, all on `main`:
|
||||
- **v0.2.0** slice 2 — `internal/authz` signed-op verifier.
|
||||
- **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4).
|
||||
- **v0.3.1** — slice-3 validation follow-ups.
|
||||
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. **Gated** on passing the live `--selftest=task` runbook first.
|
||||
- **v0.3.2** — slice-4 pre-check: reversible `SetConfig` step added to `--selftest=task`; passed live on guest 9999. Findings: LXC `description` write is **synchronous** (empty UPID — dual-mode modeling confirmed); PVE appends a trailing `\n` to `description` on read (reconcile must normalize). First live `VM.Config.*` exercise.
|
||||
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. The live `--selftest=task` gate (snapshot/rollback/delete + `SetConfig`) is now **passed**.
|
||||
|
||||
## Demo host (for live tests)
|
||||
|
||||
|
||||
@@ -1,34 +1,76 @@
|
||||
# REPORT — live `--selftest=task` validation on the demo host (2026-06-08)
|
||||
# REPORT — `SetConfig` selftest extension, live self-gate (2026-06-08)
|
||||
|
||||
> Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
||||
|
||||
## Outcome
|
||||
|
||||
**`--selftest=task` PASSED live against the demo Proxmox host.** The slice-1 gap — slice-1 mutating ops + `WaitTask` were unit-tested only, never run against a live host — is **closed**. The shared async `WaitTask` foundation (UPID poll → assert `exitstatus == "OK"`) is now validated live. **Slice 4 (reconcile) is unblocked.**
|
||||
**`SetConfig` PASSED live under the scoped operator token.** The slice-4 pre-check is
|
||||
satisfied — `--selftest=task -vmid 9999` now exercises a reversible `SetConfig`
|
||||
write+revert end-to-end and reached `=== selftest=task OK ===` (exit 0). Reconcile
|
||||
(slice 4) can be built on `SetConfig` with confidence.
|
||||
|
||||
## What ran
|
||||
## What was implemented
|
||||
|
||||
Executed the live-validation runbook end-to-end on node `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.x; `root@pam` via SSH alias `felhom-pve` for provisioning, agent run from the build server `192.168.0.180` on go1.26).
|
||||
A reversible `SetConfig` step appended to the existing `runSelftestTask` flow
|
||||
(`cmd/felhom-agent/main.go`, `selftestSetConfig`), keeping the prior
|
||||
snapshot → rollback → delete-snapshot steps intact. Against guest 9999:
|
||||
|
||||
1. **Provisioned the operator-tier token (Part A).** Created the `FelhomAgent` role with the full **16 privileges**, the `felhom-agent@pve` user, and a `--privsep 1` token `felhom-agent@pve!agent`. Granted the role on **both** the user **and** the token (the privsep intersection gotcha) — verified two ACL rows at `/`.
|
||||
2. **Scratch guest (Part B).** Created stopped LXC **9999** (`felhom-selftest-scratch`), rootfs on `local-lvm` (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale `felhom-selftest` snapshot present.
|
||||
3. **Config + TLS (Part C).** Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (`BA:7C:99:7D:45:D0…`). Built the agent (v0.3.1) on the build server.
|
||||
4. **Read-only gate (Part D).** `--selftest=read` clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed.
|
||||
5. **Live mutating run (Part E).** `--selftest=task -vmid 9999` — snapshot → rollback → delete-snapshot, each returning a real UPID that `WaitTask` polled to `exitstatus=OK`.
|
||||
1. `GuestConfig` — capture the original `description` (was **absent**).
|
||||
2. `SetConfig description="felhom-selftest <RFC3339>"` — dual-mode return handled per
|
||||
the `mutate.go` contract (empty UPID = synchronous; UPID = `WaitTask`+assert OK).
|
||||
3. `GuestConfig` again — confirm the marker landed.
|
||||
4. **Restore** — original was absent, so `SetConfig delete=description`; confirm cleared.
|
||||
|
||||
## Evidence
|
||||
Output matches the existing format:
|
||||
```
|
||||
[ ok ] setconfig synchronous exitstatus=OK
|
||||
[ ok ] verify-write description verified == marker
|
||||
[ ok ] setconfig-revert synchronous exitstatus=OK
|
||||
[ ok ] verify-revert description restored to original
|
||||
```
|
||||
|
||||
- **`exitstatus=OK`** on all three ops (a `200` on the POST is explicitly *not* treated as success — the `exitstatus` assertion is the point of the run).
|
||||
- The task **UPIDs name the token actor** (`…:vzsnapshot:9999:felhom-agent@pve!agent:`, likewise `vzrollback` / `vzdelsnapshot`) — confirming the privsep token path was genuinely exercised, no privilege drift.
|
||||
- **Role:** all 16 privileges present (`VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, the `VM.Config.*` set, `VM.PowerMgmt`, `VM.Allocate/Audit`, `Datastore.*`, `Sys.Audit`, `SDN.Use`).
|
||||
- **ACLs:** both `-user felhom-agent@pve` and `-token felhom-agent@pve!agent` carry `FelhomAgent` at `/`.
|
||||
- **Post-state (Part F):** `felhom-selftest` snapshot created then cleaned (only `current` remains); guest left **stopped**, as started.
|
||||
## Key finding — synchronous, not async
|
||||
|
||||
## Scope / not covered (by design)
|
||||
**The LXC `description` write came back synchronous (empty UPID).** PVE applied it
|
||||
inline with no task object; the agent printed `synchronous exitstatus=OK` on the
|
||||
empty-string path. This confirms the agent's **dual-mode `SetConfig` modeling matches
|
||||
Proxmox reality**: for `description`, the empty-UPID branch is the live path, and
|
||||
treating `""` as success (not an error) is correct. This was the **first live exercise
|
||||
of the `VM.Config.*` privilege cluster** (previously only the snapshot/rollback/backup
|
||||
privileges had been run live).
|
||||
|
||||
- **Not validated live:** `Start`/`Stop`/`SetConfig` (reversible, low-risk; `SetConfig` is used by reconcile — an optional selftest extension could add them), `Vzdump` (already confirmed live in the phase1-2 spike), and `RestoreLXC` / provision-by-restore (deferred until the golden base image exists, ~slice 7).
|
||||
- The run used a **stopped** guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker.
|
||||
## Second finding — `description` trailing-newline normalization
|
||||
|
||||
PVE **appends a trailing `\n` to `description` on read** (stored URL-encoded as
|
||||
`%0A...`). The first live run surfaced this as a (false) verify mismatch:
|
||||
`got="...Z\n"` vs `want="...Z"`. The write had genuinely landed — only my exact-match
|
||||
check was too strict. Fixed with `normDesc` (strip trailing newline) at every
|
||||
comparison point, and the run went green. **This is load-bearing intel for slice 4:**
|
||||
a reconcile that compares desired vs actual `description` verbatim will detect
|
||||
perpetual drift; it must normalize the trailing newline.
|
||||
|
||||
## Live run environment
|
||||
|
||||
- Built **v0.3.2** on the build server (192.168.0.180, go1.26), pointed at
|
||||
`demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.2).
|
||||
- Pinned leaf-cert SHA-256 fingerprint re-verified — still
|
||||
`BA:7C:99:7D:45:D0…` (matches the agent's pin).
|
||||
- `--selftest=read` clean first (PVE 9.2.2, node online, guests 9001+9999 visible,
|
||||
storages listed), then the gated `--selftest=task -vmid 9999`.
|
||||
- Task UPIDs name the token actor (`…:vzsnapshot:9999:felhom-agent@pve!agent:` etc.) —
|
||||
privsep token path genuinely exercised, no privilege drift.
|
||||
|
||||
## Post-state
|
||||
|
||||
Guest **9999** left pristine: **stopped**, `description` **absent**, only `current`
|
||||
remains (no leftover `felhom-selftest` snapshot).
|
||||
|
||||
## Credentials
|
||||
|
||||
The standing `FelhomAgent` operator token (`felhom-agent@pve!agent`) provisioned here is the one slice 4+ consumes — **not deleted**. Its secret is **stored out-of-band**, supplied to the agent via `FELHOM_AGENT_PROXMOX_TOKEN`; it is **not persisted to the repo** (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target.
|
||||
The standing operator token (`felhom-agent@pve!agent`, privsep) was **rotated** during
|
||||
this run — the prior secret was not retrievable (PVE reveals a token secret only once
|
||||
at creation), so a fresh secret was minted via `root@felhom-pve` and the `FelhomAgent`
|
||||
role re-confirmed on **both** the user and the token ACL at `/` (privsep intersection
|
||||
gotcha). The token was consumed via the **standing operator token through
|
||||
`FELHOM_AGENT_PROXMOX_TOKEN`, not persisted to the repo** — the on-disk demo config
|
||||
carries only a placeholder. The new secret is **stored out-of-band**.
|
||||
|
||||
+137
-1
@@ -15,6 +15,7 @@ import (
|
||||
"log/slog"
|
||||
"os"
|
||||
"os/signal"
|
||||
"strings"
|
||||
"syscall"
|
||||
"time"
|
||||
|
||||
@@ -26,7 +27,7 @@ import (
|
||||
|
||||
// version is the agent version. Overridable at build time with
|
||||
// -ldflags "-X main.version=<v>"; defaults to the in-repo CHANGELOG version.
|
||||
var version = "0.3.1"
|
||||
var version = "0.3.2"
|
||||
|
||||
func main() {
|
||||
var (
|
||||
@@ -287,10 +288,145 @@ func runSelftestTask(ctx context.Context, cfg config.Config, logger *slog.Logger
|
||||
}
|
||||
fmt.Printf(" [ ok ] %-16s exitstatus=%s\n", st.name, status.ExitStatus)
|
||||
}
|
||||
|
||||
// Reversible SetConfig exercise: this is the first live use of the VM.Config.*
|
||||
// privilege cluster. It round-trips the cosmetic `description` field (no runtime
|
||||
// effect, fully reversible) to prove SetConfig works under the scoped token
|
||||
// before slice 4's reconcile is built on top of it.
|
||||
if rc := selftestSetConfig(ctx, client, vmid); rc != 0 {
|
||||
return rc
|
||||
}
|
||||
|
||||
fmt.Println("=== selftest=task OK ===")
|
||||
return 0
|
||||
}
|
||||
|
||||
// selftestSetConfig performs a reversible write+revert of the LXC `description`
|
||||
// field on vmid and asserts both land. Returns 0 on success, non-zero (with a
|
||||
// printed [FAIL] line) on any failure — the caller stops on a non-zero return.
|
||||
func selftestSetConfig(ctx context.Context, client *proxmox.Client, vmid int) int {
|
||||
// 1. Read current state; capture the original description (may be absent).
|
||||
cfg, err := client.GuestConfig(ctx, vmid)
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s GuestConfig: %v\n", "setconfig", err)
|
||||
return 1
|
||||
}
|
||||
origDesc, origPresent, err := extraString(cfg, "description")
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s decode description: %v\n", "setconfig", err)
|
||||
return 1
|
||||
}
|
||||
// PVE normalizes `description` by appending a trailing newline on read, so all
|
||||
// comparisons here use normDesc (strip trailing newlines) and restores write the
|
||||
// normalized original — otherwise an exact-match check sees false drift. This is
|
||||
// load-bearing intel for slice-4 reconcile (compare descriptions normalized).
|
||||
origDesc = normDesc(origDesc)
|
||||
|
||||
// 2. Write the marker.
|
||||
marker := "felhom-selftest " + time.Now().UTC().Format(time.RFC3339)
|
||||
if rc := applySetConfig(ctx, client, vmid, "setconfig", map[string]string{"description": marker}); rc != 0 {
|
||||
return rc
|
||||
}
|
||||
|
||||
// 3. Verify the marker landed.
|
||||
cfg, err = client.GuestConfig(ctx, vmid)
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s GuestConfig (verify): %v\n", "setconfig", err)
|
||||
return 1
|
||||
}
|
||||
got, present, err := extraString(cfg, "description")
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s decode description (verify): %v\n", "setconfig", err)
|
||||
return 1
|
||||
}
|
||||
if !present || normDesc(got) != marker {
|
||||
fmt.Printf(" [FAIL] %-16s write did not land: present=%v got=%q want=%q\n", "setconfig", present, got, marker)
|
||||
return 1
|
||||
}
|
||||
fmt.Printf(" [ ok ] %-16s description verified == marker\n", "verify-write")
|
||||
|
||||
// 4. Restore the original value (or clear it if it was absent originally).
|
||||
var revert map[string]string
|
||||
if origPresent {
|
||||
revert = map[string]string{"description": origDesc}
|
||||
} else {
|
||||
revert = map[string]string{"delete": "description"}
|
||||
}
|
||||
if rc := applySetConfig(ctx, client, vmid, "setconfig-revert", revert); rc != 0 {
|
||||
return rc
|
||||
}
|
||||
|
||||
// 5. Confirm the restore.
|
||||
cfg, err = client.GuestConfig(ctx, vmid)
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s GuestConfig (revert verify): %v\n", "setconfig-revert", err)
|
||||
return 1
|
||||
}
|
||||
got, present, err = extraString(cfg, "description")
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s decode description (revert verify): %v\n", "setconfig-revert", err)
|
||||
return 1
|
||||
}
|
||||
if origPresent {
|
||||
if !present || normDesc(got) != origDesc {
|
||||
fmt.Printf(" [FAIL] %-16s revert did not restore: present=%v got=%q want=%q\n", "setconfig-revert", present, normDesc(got), origDesc)
|
||||
return 1
|
||||
}
|
||||
} else if present {
|
||||
fmt.Printf(" [FAIL] %-16s revert did not clear: still present got=%q\n", "setconfig-revert", got)
|
||||
return 1
|
||||
}
|
||||
fmt.Printf(" [ ok ] %-16s description restored to original\n", "verify-revert")
|
||||
return 0
|
||||
}
|
||||
|
||||
// applySetConfig runs one SetConfig and asserts success, handling PVE's dual-mode
|
||||
// return: a UPID means async (WaitTask + assert exitstatus OK); an empty string
|
||||
// means PVE applied it synchronously (not an error — phase1-2/mutate.go contract).
|
||||
func applySetConfig(ctx context.Context, client *proxmox.Client, vmid int, step string, params map[string]string) int {
|
||||
upid, err := client.SetConfig(ctx, vmid, params)
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s %v\n", step, err)
|
||||
return 1
|
||||
}
|
||||
if upid == "" {
|
||||
// Synchronous path: empty UPID is a clean success.
|
||||
fmt.Printf(" [ ok ] %-16s synchronous exitstatus=OK\n", step)
|
||||
return 0
|
||||
}
|
||||
status, err := client.WaitTask(ctx, upid, proxmox.WaitOptions{})
|
||||
if err != nil {
|
||||
fmt.Printf(" [FAIL] %-16s %s %v\n", step, upid, err)
|
||||
return 1
|
||||
}
|
||||
if status.ExitStatus != "OK" {
|
||||
fmt.Printf(" [FAIL] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus)
|
||||
return 1
|
||||
}
|
||||
fmt.Printf(" [ ok ] %-16s %s exitstatus=%s\n", step, upid, status.ExitStatus)
|
||||
return 0
|
||||
}
|
||||
|
||||
// normDesc strips trailing newlines that PVE appends to the `description` field
|
||||
// on read, so a written value round-trips equal. (PVE stores `description` with a
|
||||
// trailing "\n"; comparing raw would always mismatch.)
|
||||
func normDesc(s string) string { return strings.TrimRight(s, "\n") }
|
||||
|
||||
// extraString reads a string-valued key from GuestConfig.Extra (raw JSON). It
|
||||
// returns ("", false, nil) when the key is absent, and decodes the JSON string
|
||||
// otherwise.
|
||||
func extraString(cfg proxmox.GuestConfig, key string) (string, bool, error) {
|
||||
raw, ok := cfg.Extra[key]
|
||||
if !ok || len(raw) == 0 {
|
||||
return "", false, nil
|
||||
}
|
||||
var s string
|
||||
if err := json.Unmarshal(raw, &s); err != nil {
|
||||
return "", false, err
|
||||
}
|
||||
return s, true, nil
|
||||
}
|
||||
|
||||
// --- small helpers / flag type ---
|
||||
|
||||
func envOr(key, def string) string {
|
||||
|
||||
Reference in New Issue
Block a user