Append a reversible SetConfig write+revert to runSelftestTask: read GuestConfig, write a `description` marker, verify it landed, restore the original (or delete if absent), verify the restore. Handles PVE's dual-mode SetConfig return (empty UPID = synchronous; UPID = WaitTask+assert OK).

Live self-gate PASSED on demo-felhom / guest 9999. Findings:
- LXC `description` write is synchronous (empty UPID); dual-mode modeling confirmed; empty string is success, not an error.
- PVE appends a trailing newline to `description` on read; slice-4 reconcile must normalize description comparisons (hence normDesc helper).

First live exercise of the VM.Config.* privilege cluster. Standing operator token rotated during the run; new secret stored out-of-band, not in the repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CLAUDE.md — felhom-agent
Place at the repo root (felhom-agent/CLAUDE.md). Loads when Claude Code touches this repo. Keep under ~200 lines. The cross-repo orientation lives in the workspace-root e:\git\CLAUDE.md; this file is felhom-agent-specific.
What this repo is
felhom-agent is the operator-tier host agent that runs on each Proxmox host and owns all Proxmox interaction: provision/restore guests, host storage, backup/restore orchestration, the hub control loop, and a narrow per-guest local API. It is the most privilege-sensitive component.
- It is the renamed former proxmox-controller repo.
- Distinct from felhom-controller, which is the in-guest controller (Docker-only, no Proxmox creds). Do not confuse them.
- Control plane, not data plane: if the agent dies, apps keep serving; only management degrades.
Build / run
- Module gitea.dooplex.hu/admin/felhom-agent; binary felhom-agent (cmd/felhom-agent/).
- Pure Go stdlib + golang.org/x/crypto only; no web frameworks.
- go.mod directive go 1.25.0; dep golang.org/x/crypto v0.52.0 (declares go 1.25, will NOT build on Go 1.24). The build server (192.168.0.180) runs go1.26.0 (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
- Version: version var in cmd/felhom-agent/main.go, overridable via -ldflags "-X main.version=<v>"; --version flag. Current: v0.3.2. Bump on meaningful changes + add a CHANGELOG entry.
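The version wiring in the last bullet can be sketched roughly as follows. This is an illustration, not the repo's actual main.go: the "dev" default, the `versionString` helper, and the `run` function are assumptions made for the example.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// version is meant to be overridden at link time, for example:
//   go build -ldflags "-X main.version=v0.3.2" ./cmd/felhom-agent/
// The "dev" default is an assumption for this sketch.
var version = "dev"

// versionString is a hypothetical helper that formats the --version output.
func versionString(v string) string {
	return "felhom-agent " + v
}

// run parses the flags and prints the version when --version is given.
func run(args []string, out io.Writer) error {
	fs := flag.NewFlagSet("felhom-agent", flag.ContinueOnError)
	showVersion := fs.Bool("version", false, "print the version and exit")
	if err := fs.Parse(args); err != nil {
		return err
	}
	if *showVersion {
		fmt.Fprintln(out, versionString(version))
	}
	return nil
}

func main() {
	if err := run(os.Args[1:], os.Stdout); err != nil {
		os.Exit(2)
	}
}
```

The `-X` linker flag only works on a package-level string variable, which is why `version` is a `var` rather than a `const`.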
Layout
cmd/felhom-agent/ main + flag handling + --selftest modes + the daemon entry
internal/config/ JSON config + FELHOM_AGENT_* env overlay; secrets redacted (Redacted())
internal/log/ slog setup
internal/proxmox/ API-first Client + fenced root-CLI Privileged + UPID WaitTask
internal/authz/ operator signed-op verifier (SSHSIG); durable FileNonceStore
internal/hub/ daemon: HostReport collector + Bearer client + resilient Loop
Proxmox model (the load-bearing rules)
- API-first via a scoped FelhomAgent token (16 privileges). Raw root-CLI is fenced to exactly 3 exceptions: keyctl pct create (golden image), USB mount/fstab, SMART/sensors. Client never shells out; Privileged never makes HTTP calls (asserted by tests). Keep that fence.
- Mutating ops are async → return a UPID → WaitTask asserts exitstatus == "OK". A 200 on the POST is not success; authorization can fail at task execution, not the POST. Exception: SetConfig is dual-mode; an empty UPID means the write completed synchronously and is success, not an error.
- TLS: SHA-256 leaf-cert pinning (the host serves a self-signed cert). No insecure default.
- Privsep token gotcha: a --privsep 1 token's rights = intersection of the backing user's perms AND the token's ACLs, so the role must be granted on both user and token, or every call 403s. (Token provisioning is out-of-band / human-run; the agent only consumes the token.)
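The "200 on the POST is not success" rule can be sketched as below. This is a minimal illustration under assumptions: the `TaskStatus` struct, the `poll` callback, and the attempt/interval parameters are hypothetical stand-ins, not the signature of the real WaitTask in internal/proxmox. The only Proxmox facts relied on are the task-status fields `status` and `exitstatus`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// TaskStatus holds the two task-status fields that matter here.
// The struct is a hypothetical stand-in for the agent's own type.
type TaskStatus struct {
	Status     string // "running" or "stopped"
	ExitStatus string // "OK" on success, an error text otherwise
}

// waitTask polls until the task stops, then asserts exitstatus == "OK".
// poll is a hypothetical hook that fetches the current status for one UPID.
func waitTask(poll func() (TaskStatus, error), attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		st, err := poll()
		if err != nil {
			return fmt.Errorf("poll task: %w", err)
		}
		if st.Status == "stopped" {
			if st.ExitStatus != "OK" {
				return fmt.Errorf("task failed: exitstatus=%q", st.ExitStatus)
			}
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("task did not stop within the allowed attempts")
}

func main() {
	// A task that the POST accepted but that was denied at execution time.
	denied := func() (TaskStatus, error) {
		return TaskStatus{Status: "stopped", ExitStatus: "permission denied"}, nil
	}
	fmt.Println(waitTask(denied, 3, 0))
}
```

Treating "stopped with a non-OK exit status" as an error is what surfaces authorization failures that only appear once the task actually runs.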
Design + platform facts (read before designing)
- Design doc: felhom.eu/documentation/architecture/03-host-agent.md (locked).
- Platform facts: felhom.eu/documentation/proxmox-platform.md + tests/phase{0,1-2,3,4}-findings.md.
Current state
Built in slices, all on main:
- v0.1.0 slice 1: scaffold + internal/proxmox + internal/config/log + --selftest.
- v0.2.0 slice 2: internal/authz signed-op verifier.
- v0.3.0 slice 3: internal/hub, the first daemon loop (the mode with no --selftest flag) posting a read-only HostReport to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are defined-but-empty (slices 5/6); the envelope's desired-state/signed-ops fields are parsed-but-ignored (slice 4).
- v0.3.1: slice-3 validation follow-ups.
- v0.3.2: slice-4 pre-check. Reversible SetConfig step added to --selftest=task; passed live on guest 9999. Findings: LXC description write is synchronous (empty UPID; dual-mode modeling confirmed); PVE appends a trailing \n to description on read (reconcile must normalize). First live VM.Config.* exercise.
- Next: slice 4 (reconcile + benign/destructive gate), the first slice that issues real Proxmox mutations. The live --selftest=task gate (snapshot/rollback/delete + SetConfig) is now passed.
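The two v0.3.2 findings translate into two small checks, sketched here. Only the name normDesc comes from the notes; its body, the `needsWait` helper, and the marker string are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// normDesc strips trailing newlines so a description written as "x"
// compares equal to the "x\n" that PVE returns on read.
// The implementation is an assumption; only the name is from the notes.
func normDesc(s string) string {
	return strings.TrimRight(s, "\n")
}

// needsWait interprets SetConfig's dual-mode return (hypothetical helper):
// an empty UPID means the write already completed synchronously,
// a non-empty UPID means the caller must wait for the task and assert OK.
func needsWait(upid string) bool {
	return upid != ""
}

func main() {
	written := "felhom-selftest-marker" // hypothetical marker value
	readBack := written + "\n"          // what a read returns after the write

	fmt.Println(normDesc(readBack) == normDesc(written))
	fmt.Println(needsWait(""))
}
```

Normalizing both sides of the comparison keeps reconcile from seeing a permanent diff on a description it has just written.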
Demo host (for live tests)
Node demo-felhom, API https://192.168.0.162:8006, PVE 9.2.2; leaf-cert SHA-256 fingerprint starts BA:7C:99:7D:45:D0… (verify it still matches before a live run, because the agent pins it). pveum/pct ops need root@pam on the PVE host (SSH alias felhom-pve), which is available to Claude Code.
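Leaf-cert pinning against a self-signed host could look roughly like this with the Go standard library. It is a sketch under assumptions: `fingerprint`, `checkPin`, and `pinnedTLSConfig` are hypothetical names, and the pin used in `main` is computed from stand-in bytes rather than taken from the demo host.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// fingerprint renders the SHA-256 of DER bytes as colon-separated upper-case hex.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	h := strings.ToUpper(hex.EncodeToString(sum[:]))
	parts := make([]string, 0, len(h)/2)
	for i := 0; i < len(h); i += 2 {
		parts = append(parts, h[i:i+2])
	}
	return strings.Join(parts, ":")
}

// checkPin compares a leaf certificate's fingerprint with the configured pin.
func checkPin(der []byte, pin string) error {
	if got := fingerprint(der); !strings.EqualFold(got, pin) {
		return fmt.Errorf("leaf fingerprint mismatch: got %s", got)
	}
	return nil
}

// pinnedTLSConfig skips CA chain validation (the cert is self-signed)
// and replaces it with a strict check of the leaf fingerprint.
func pinnedTLSConfig(pin string) *tls.Config {
	return &tls.Config{
		InsecureSkipVerify: true, // chain check is replaced by the pin check below
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("no peer certificate presented")
			}
			return checkPin(cs.PeerCertificates[0].Raw, pin)
		},
	}
}

func main() {
	der := []byte("stand-in for the leaf certificate's DER bytes")
	pin := fingerprint(der)
	cfg := pinnedTLSConfig(pin)
	fmt.Println(cfg.VerifyConnection != nil, checkPin(der, pin) == nil)
}
```

`VerifyConnection` runs on every handshake, including resumed sessions, so the pin check cannot be bypassed by session resumption.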
Selftest modes (run from the build server, pointed at the demo API):
- --selftest / --selftest=read: read-only health checks.
- --selftest=task -vmid N: reversible snapshot→rollback→delete on guest N, plus the SetConfig write+revert step added in v0.3.2 (gated; never under bare --selftest).
- --selftest=hub: one collect + report round-trip to the hub.
- No flag → the daemon (poll loop); requires hub config.
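One way to get a flag that works both bare (--selftest) and with a value (--selftest=task) using only the standard flag package is a custom flag.Value that also reports IsBoolFlag. The sketch below is an assumption about how such parsing could be done, not the agent's actual flag handling; `selftestMode` and `parseMode` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// selftestMode is "" for the daemon, or one of "read", "task", "hub".
type selftestMode string

func (m *selftestMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

// IsBoolFlag lets the flag package accept a bare --selftest,
// in which case Set is called with "true".
func (m *selftestMode) IsBoolFlag() bool { return true }

func (m *selftestMode) Set(v string) error {
	switch v {
	case "true", "read": // bare --selftest means the read-only checks
		*m = "read"
	case "task", "hub":
		*m = selftestMode(v)
	default:
		return fmt.Errorf("unknown selftest mode %q", v)
	}
	return nil
}

// parseMode returns the selected mode and vmid; task mode requires -vmid.
func parseMode(args []string) (selftestMode, int, error) {
	fs := flag.NewFlagSet("felhom-agent", flag.ContinueOnError)
	var mode selftestMode
	fs.Var(&mode, "selftest", "read | task | hub (bare flag = read)")
	vmid := fs.Int("vmid", 0, "guest id for --selftest=task")
	if err := fs.Parse(args); err != nil {
		return "", 0, err
	}
	if mode == "task" && *vmid <= 0 {
		return "", 0, fmt.Errorf("--selftest=task requires -vmid N (got %d)", *vmid)
	}
	return mode, *vmid, nil
}

func main() {
	mode, vmid, err := parseMode(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("mode=%q vmid=%d\n", mode, vmid)
}
```

Because the flag is treated as boolean-style, the value must be attached with `=`; `--selftest task` would parse as bare --selftest followed by a positional argument, which keeps the mutating task mode from being selected by accident.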
Conventions
- Push to main directly; no feature branches.
- In every repository where you make a change, update both files in that repo:
  - CHANGELOG.md: a cumulative log of all changes; newest entry on top.
  - REPORT.md: overwrite with a summary of the most recent implementation (or significant validation/operational run) only; not cumulative.
- Never write secrets (tokens, passwords, private keys, API keys) into CHANGELOG.md, REPORT.md, or any committed file. Reference them as "stored out-of-band" instead.
- Code quality: verify generated code for bugs/edge cases; add debug logging; ask rather than guess when you'd otherwise invent input/output.
Workflow & artifacts
- Implement TASK.md / TASK-*.md specs (when placed as TASK.md or told to implement one), then push + CHANGELOG + REPORT.md.
- RUNBOOK-*.md: an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks; mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data, but demo scratch guests are fair game.)