docs: reflow CLAUDE.md; unify REPORT/CHANGELOG convention; add no-secrets rule
Also overwrite REPORT.md with the live --selftest=task validation on demo-felhom (snapshot/rollback/delete on guest 9999, exitstatus=OK under the felhom-agent@pve privsep token; slice-1 mutating-ops gap closed, slice 4 unblocked). No version bump. Token secret stored out-of-band, not committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -3,6 +3,15 @@
|
|||||||
All notable changes to **felhom-agent** are recorded here. Update on every code
|
All notable changes to **felhom-agent** are recorded here. Update on every code
|
||||||
change that gets pushed.
|
change that gets pushed.
|
||||||
|
|
||||||
|
## Docs + live validation — no version bump (2026-06-08)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- **Reflowed `CLAUDE.md`** — removed hard mid-paragraph line wraps (prose, list items, blockquotes now single-line, soft-wrapped); code blocks and tables untouched; rendered output unchanged.
|
||||||
|
- **Unified the REPORT/CHANGELOG convention** in `CLAUDE.md`: `CHANGELOG.md` is the cumulative log (newest on top); `REPORT.md` is overwritten with the most-recent implementation/validation only. Added an explicit **no-secrets** rule (never write tokens/passwords/keys into committed files; reference them as stored out-of-band).
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- **`REPORT.md`** rewritten for the live `--selftest=task` validation on the demo host (`demo-felhom`): snapshot → rollback → delete-snapshot on guest 9999, each polled to `exitstatus=OK` under the `felhom-agent@pve!agent` privsep token (UPIDs name the token actor — privsep path genuinely exercised); 16-privilege `FelhomAgent` role + both user & token ACLs confirmed; `--selftest=read` clean. Closes the slice-1 "mutating ops unit-tested only" gap; `WaitTask` async foundation validated live → **slice 4 unblocked**. (Token secret stored out-of-band, not in the repo.)
|
||||||
|
|
||||||
## v0.3.1 — slice-3 validation follow-ups (2026-06-08)
|
## v0.3.1 — slice-3 validation follow-ups (2026-06-08)
|
||||||
|
|
||||||
### Changed
|
### Changed
|
||||||
|
|||||||
@@ -1,29 +1,21 @@
|
|||||||
# CLAUDE.md — `felhom-agent`
|
# CLAUDE.md — `felhom-agent`
|
||||||
|
|
||||||
> Place at the repo root (`felhom-agent/CLAUDE.md`). Loads when Claude Code touches this repo.
|
> Place at the repo root (`felhom-agent/CLAUDE.md`). Loads when Claude Code touches this repo. Keep under ~200 lines. The cross-repo orientation lives in the workspace-root `e:\git\CLAUDE.md`; this file is `felhom-agent`-specific.
|
||||||
> Keep under ~200 lines. The cross-repo orientation lives in the workspace-root `e:\git\CLAUDE.md`;
|
|
||||||
> this file is `felhom-agent`-specific.
|
|
||||||
|
|
||||||
## What this repo is
|
## What this repo is
|
||||||
|
|
||||||
`felhom-agent` is the operator-tier **host agent** that runs on each Proxmox host and owns **all**
|
`felhom-agent` is the operator-tier **host agent** that runs on each Proxmox host and owns **all** Proxmox interaction: provision/restore guests, host storage, backup/restore orchestration, the hub control loop, and a narrow per-guest local API. It is the **most privilege-sensitive** component.
|
||||||
Proxmox interaction: provision/restore guests, host storage, backup/restore orchestration, the hub
|
|
||||||
control loop, and a narrow per-guest local API. It is the **most privilege-sensitive** component.
|
|
||||||
|
|
||||||
- It is the renamed former `proxmox-controller` repo.
|
- It is the renamed former `proxmox-controller` repo.
|
||||||
- **Distinct from `felhom-controller`** — that is the *in-guest* controller (Docker-only, no Proxmox
|
- **Distinct from `felhom-controller`** — that is the *in-guest* controller (Docker-only, no Proxmox creds). Do not confuse them.
|
||||||
creds). Do not confuse them.
|
|
||||||
- Control plane, not data plane: if the agent dies, apps keep serving; only management degrades.
|
- Control plane, not data plane: if the agent dies, apps keep serving; only management degrades.
|
||||||
|
|
||||||
## Build / run
|
## Build / run
|
||||||
|
|
||||||
- Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`).
|
- Module `gitea.dooplex.hu/admin/felhom-agent`; binary `felhom-agent` (`cmd/felhom-agent/`).
|
||||||
- **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks.
|
- **Pure Go stdlib + `golang.org/x/crypto` only** — no web frameworks.
|
||||||
- `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT
|
- `go.mod` directive **go 1.25.0**; dep `golang.org/x/crypto v0.52.0` (declares go 1.25, will NOT build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
|
||||||
build on Go 1.24). The **build server (192.168.0.180) runs go1.26.0** (upstream Go on PATH,
|
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`; `--version` flag. **Current: v0.3.1.** Bump on meaningful changes + add a CHANGELOG entry.
|
||||||
backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
|
|
||||||
- Version: `version` var in `cmd/felhom-agent/main.go`, overridable via `-ldflags "-X main.version=<v>"`;
|
|
||||||
`--version` flag. **Current: v0.3.1.** Bump on meaningful changes + add a CHANGELOG entry.
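The version override described above can be sketched as follows. This is a minimal illustration, not the repo's actual `main.go`: the `"dev"` default, the `versionLine` helper, and the output format are assumptions made for the example; only the `version` variable name, the `--version` flag, and the `-X main.version=<v>` linker flag come from the text.

```go
package main

import (
	"flag"
	"fmt"
)

// version is a package-level string so the linker can overwrite it, e.g.:
//
//	go build -ldflags "-X main.version=v0.3.1" ./cmd/felhom-agent
//
// The "dev" default is an assumption for this sketch.
var version = "dev"

// versionLine is a hypothetical helper that formats the version banner.
func versionLine() string {
	return "felhom-agent " + version
}

func main() {
	showVersion := flag.Bool("version", false, "print the version and exit")
	flag.Parse()
	if *showVersion {
		fmt.Println(versionLine())
		return
	}
	// ... normal startup would continue here ...
}
```

`-X` only works on package-level string variables that are not initialized from a function call, which is why the value is kept as a plain `var` rather than a `const`.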
|
|
||||||
|
|
||||||
## Layout
|
## Layout
|
||||||
|
|
||||||
@@ -38,15 +30,10 @@ internal/hub/ daemon: HostReport collector + Bearer client + resilient Lo
|
|||||||
|
|
||||||
## Proxmox model (the load-bearing rules)
|
## Proxmox model (the load-bearing rules)
|
||||||
|
|
||||||
- **API-first** via a scoped `FelhomAgent` token (16 privileges). Raw root-CLI is **fenced to
|
- **API-first** via a scoped `FelhomAgent` token (16 privileges). Raw root-CLI is **fenced to exactly 3 exceptions**: keyctl `pct create` (golden image), USB mount/fstab, SMART/sensors. `Client` never shells out; `Privileged` never makes HTTP calls (asserted by tests). Keep that fence.
|
||||||
exactly 3 exceptions**: keyctl `pct create` (golden image), USB mount/fstab, SMART/sensors.
|
- **Every mutating op is async** → returns a UPID → `WaitTask` asserts `exitstatus == "OK"`. A 200 on the POST is **not** success; authorization can fail at task execution, not the POST.
|
||||||
`Client` never shells out; `Privileged` never makes HTTP calls (asserted by tests). Keep that fence.
|
|
||||||
- **Every mutating op is async** → returns a UPID → `WaitTask` asserts `exitstatus == "OK"`.
|
|
||||||
A 200 on the POST is **not** success; authorization can fail at task execution, not the POST.
|
|
||||||
- **TLS:** SHA-256 leaf-cert pinning (the host serves a self-signed cert). No insecure default.
|
- **TLS:** SHA-256 leaf-cert pinning (the host serves a self-signed cert). No insecure default.
|
||||||
- **Privsep token gotcha:** a `--privsep 1` token's rights = intersection of the backing user's
|
- **Privsep token gotcha:** a `--privsep 1` token's rights = intersection of the backing user's perms AND the token's ACLs — so the role must be granted on **both** user and token, or every call 403s. (Token provisioning is out-of-band / human-run; the agent only consumes the token.)
|
||||||
perms AND the token's ACLs — so the role must be granted on **both** user and token, or every
|
|
||||||
call 403s. (Token provisioning is out-of-band / human-run; the agent only consumes the token.)
|
|
||||||
|
|
||||||
## Design + platform facts (read before designing)
|
## Design + platform facts (read before designing)
|
||||||
|
|
||||||
@@ -58,19 +45,13 @@ internal/hub/ daemon: HostReport collector + Bearer client + resilient Lo
|
|||||||
Built in slices, all on `main`:
|
Built in slices, all on `main`:
|
||||||
- **v0.1.0** slice 1 — scaffold + `internal/proxmox` + `internal/config`/`log` + `--selftest`.
|
- **v0.1.0** slice 1 — scaffold + `internal/proxmox` + `internal/config`/`log` + `--selftest`.
|
||||||
- **v0.2.0** slice 2 — `internal/authz` signed-op verifier.
|
- **v0.2.0** slice 2 — `internal/authz` signed-op verifier.
|
||||||
- **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a
|
- **v0.3.0** slice 3 — `internal/hub`: the first **daemon loop** (no-`--selftest` mode) posting a read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are **parsed-but-ignored** (slice 4).
|
||||||
read-only `HostReport` to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit
|
|
||||||
fields are **defined-but-empty** (slices 5/6); the envelope's desired-state/signed-ops fields are
|
|
||||||
**parsed-but-ignored** (slice 4).
|
|
||||||
- **v0.3.1** — slice-3 validation follow-ups.
|
- **v0.3.1** — slice-3 validation follow-ups.
|
||||||
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox
|
- **Next: slice 4 (reconcile + benign/destructive gate)** — the first slice that issues real Proxmox mutations. **Gated** on passing the live `--selftest=task` runbook first.
|
||||||
mutations. **Gated** on passing the live `--selftest=task` runbook first.
|
|
||||||
|
|
||||||
## Demo host (for live tests)
|
## Demo host (for live tests)
|
||||||
|
|
||||||
Node **`demo-felhom`**, API `https://192.168.0.162:8006`, PVE 9.2.2; leaf-cert SHA-256 fingerprint
|
Node **`demo-felhom`**, API `https://192.168.0.162:8006`, PVE 9.2.2; leaf-cert SHA-256 fingerprint starts `BA:7C:99:7D:45:D0…` (verify it still matches before a live run; the agent pins it). `pveum`/`pct` ops need `root@pam` on the PVE (SSH alias `felhom-pve`); that access is available to Claude Code.
|
||||||
starts `BA:7C:99:7D:45:D0…` (verify it still matches before a live run — the agent pins it).
|
|
||||||
`pveum`/`pct` ops need `root@pam` on the PVE (SSH alias `felhom-pve`); that access is available to Claude Code.
|
|
||||||
|
|
||||||
Selftest modes (run from the build server, pointed at the demo API):
|
Selftest modes (run from the build server, pointed at the demo API):
|
||||||
- `--selftest` / `--selftest=read` — read-only health checks.
|
- `--selftest` / `--selftest=read` — read-only health checks.
|
||||||
@@ -81,14 +62,16 @@ Selftest modes (run from the build server, pointed at the demo API):
|
|||||||
## Conventions
|
## Conventions
|
||||||
|
|
||||||
- Push to `main` directly; no feature branches.
|
- Push to `main` directly; no feature branches.
|
||||||
- `CHANGELOG.md` (repo root), newest on top, on every pushed change.
|
|
||||||
- `REPORT.md` (repo root) is **fully overwritten** each task with that task's report (cumulative
|
> **In every repository where you make a change, update both files in that repo:**
|
||||||
history lives in CHANGELOG).
|
> - **`CHANGELOG.md`** — a cumulative log of **all** changes; newest entry on top.
|
||||||
- Code quality: verify generated code for bugs/edge cases; add debug logging; **ask rather than
|
> - **`REPORT.md`** — **overwrite** with a summary of the **most recent** implementation (or significant validation/operational run) only; not cumulative.
|
||||||
guess** when you'd otherwise invent input/output.
|
>
|
||||||
|
> **Never write secrets** — tokens, passwords, private keys, API keys — into `CHANGELOG.md`, `REPORT.md`, or any committed file. Reference them as "stored out-of-band" instead.
|
||||||
|
|
||||||
|
- Code quality: verify generated code for bugs/edge cases; add debug logging; **ask rather than guess** when you'd otherwise invent input/output.
|
||||||
|
|
||||||
## Workflow & artifacts
|
## Workflow & artifacts
|
||||||
|
|
||||||
- Implement **`TASK.md` / `TASK-*.md`** specs (when placed as `TASK.md` or told to implement one),
|
- Implement **`TASK.md` / `TASK-*.md`** specs (when placed as `TASK.md` or told to implement one), then push + CHANGELOG + REPORT.md.
|
||||||
then push + CHANGELOG + REPORT.md.
|
- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
|
||||||
- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
|
|
||||||
|
|||||||
@@ -1,47 +1,34 @@
|
|||||||
# felhom-agent — latest task report
|
# REPORT — live `--selftest=task` validation on the demo host (2026-06-08)
|
||||||
|
|
||||||
> This file holds the report for the **most recent** change, fully overwritten each task.
|
> Overwrite-latest report (most recent significant run only). Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
||||||
> Cumulative history lives in [CHANGELOG.md](CHANGELOG.md).
|
|
||||||
|
|
||||||
## Task: slice-3 validation follow-ups — v0.3.1
|
## Outcome
|
||||||
|
|
||||||
Small fixes surfaced during slice-3 validation (agent half). Pushed to `main`; build/vet/test
|
**`--selftest=task` PASSED live against the demo Proxmox host.** The slice-1 gap — slice-1 mutating ops + `WaitTask` were unit-tested only, never run against a live host — is **closed**. The shared async `WaitTask` foundation (UPID poll → assert `exitstatus == "OK"`) is now validated live. **Slice 4 (reconcile) is unblocked.**
|
||||||
green locally (go1.26) and on the build server.
|
|
||||||
|
|
||||||
### §1 — `--selftest` usage string
|
## What ran
|
||||||
`selftestFlag.Set`'s error now reads `(want read|task|hub)` (was missing `hub`, which became a
|
|
||||||
valid mode in slice 3). Cosmetic.
|
|
||||||
|
|
||||||
### §2 — collector keeps run-status on a `GuestConfig` failure
|
Executed the live-validation runbook end-to-end on node `demo-felhom` (`https://192.168.0.162:8006`, PVE 9.2.x; `root@pam` via SSH alias `felhom-pve` for provisioning, agent run from the build server `192.168.0.180` on go1.26).
|
||||||
`internal/hub/collect.go` `collectGuests`: a per-guest `GuestConfig` error no longer forces
|
|
||||||
`status="unknown"`. The run-status from `ListLXC` is **preserved** (only `spec` is dropped — that's
|
|
||||||
the only thing actually unknown). An *empty* status is still normalized to `unknown`, so the wire
|
|
||||||
value is always `running|stopped|unknown` (matches the hub handler's empty→unknown defaulting).
|
|
||||||
Test renamed `TestCollect_GuestConfigFailureKeepsStatusOmitsSpec`, now asserting the preserved
|
|
||||||
`running` status **and** nil spec (not a hollow `!= "unknown"` check).
|
|
||||||
|
|
||||||
### §4 — cross-repo contract golden fixture (agent half)
|
1. **Provisioned the operator-tier token (Part A).** Created the `FelhomAgent` role with the full **16 privileges**, the `felhom-agent@pve` user, and a `--privsep 1` token `felhom-agent@pve!agent`. Granted the role on **both** the user **and** the token (the privsep intersection gotcha) — verified two ACL rows at `/`.
|
||||||
The host-report shape lives in two repos with nothing failing on drift (the hub ignores unknown
|
2. **Scratch guest (Part B).** Created stopped LXC **9999** (`felhom-selftest-scratch`), rootfs on `local-lvm` (lvmthin → snapshot-capable). Kept stopped for deterministic rollback. No stale `felhom-selftest` snapshot present.
|
||||||
fields). Locked it with a golden sample:
|
3. **Config + TLS (Part C).** Confirmed the demo host's current leaf-cert SHA-256 fingerprint still matches the pinned value (`BA:7C:99:7D:45:D0…`). Built the agent (v0.3.1) on the build server.
|
||||||
- `internal/hub/testdata/host-report.golden.json` — a populated report (host block, two guests:
|
4. **Read-only gate (Part D).** `--selftest=read` clean with the new token: PVE 9.2.2, node online, guest 9999 visible, storages listed.
|
||||||
one `running` with `spec`, one `stopped`; `cloudflared`; the four empty collections + `audit_tail`
|
5. **Live mutating run (Part E).** `--selftest=task -vmid 9999` — snapshot → rollback → delete-snapshot, each returning a real UPID that `WaitTask` polled to `exitstatus=OK`.
|
||||||
as `[]`).
|
|
||||||
- `TestHostReport_ContractMatchesGolden` — marshals a constructed `HostReport`, unmarshals the
|
|
||||||
golden, and compares **field-name key sets** at top level + `host` + `guests[0]`. A renamed/added/
|
|
||||||
removed json tag fails it.
|
|
||||||
|
|
||||||
**Caveat (called out):** this is a *duplicated* contract — the file must stay **byte-identical**
|
## Evidence
|
||||||
with `felhom-hub`'s `hub/internal/api/testdata/host-report.golden.json`. JSON can't carry a comment,
|
|
||||||
so the mandatory "keep byte-identical" note lives in the test file's doc comment in both repos
|
|
||||||
instead of a JSON header. When slices 5/6 add real `storage_targets`/`backups` fields, revisit
|
|
||||||
promoting this to a shared Go types module (the proper fix).
|
|
||||||
|
|
||||||
### Not touched (confirmed)
|
- **`exitstatus=OK`** on all three ops (a `200` on the POST is explicitly *not* treated as success — the `exitstatus` assertion is the point of the run).
|
||||||
The daemon's proxmox client timeout is already bounded: `proxmox.NewClient` defaults `HTTPTimeout`
|
- The task **UPIDs name the token actor** (`…:vzsnapshot:9999:felhom-agent@pve!agent:`, likewise `vzrollback` / `vzdelsnapshot`) — confirming the privsep token path was genuinely exercised, no privilege drift.
|
||||||
to 30s when zero, and `newProxmoxClient` leaves it zero. No change (was a "confirm" item).
|
- **Role:** all 16 privileges present (`VM.Snapshot`, `VM.Snapshot.Rollback`, `VM.Backup`, the `VM.Config.*` set, `VM.PowerMgmt`, `VM.Allocate/Audit`, `Datastore.*`, `Sys.Audit`, `SDN.Use`).
|
||||||
|
- **ACLs:** both `-user felhom-agent@pve` and `-token felhom-agent@pve!agent` carry `FelhomAgent` at `/`.
|
||||||
|
- **Post-state (Part F):** `felhom-selftest` snapshot created then cleaned (only `current` remains); guest left **stopped**, as started.
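The "UPIDs name the token actor" check above can be automated by splitting the UPID and reading its user field. The sketch assumes the Proxmox layout `UPID:node:pid:pstart:starttime:type:id:user:` (colon-separated, trailing colon); `parseUPID` and the `upid` struct are hypothetical names, and the sample UPID in `main` is made up for illustration, not taken from the run.

```go
package main

import (
	"fmt"
	"strings"
)

// upid holds the fields of a Proxmox task identifier used in this sketch.
type upid struct {
	Node string
	Type string // e.g. "vzsnapshot", "vzrollback", "vzdelsnapshot"
	ID   string // guest ID for guest-scoped tasks
	User string // acting user or token, e.g. "user@realm!tokenid"
}

// parseUPID splits "UPID:node:pid:pstart:starttime:type:id:user:" into its
// parts. Splitting on ":" yields nine fields, the last of which is empty
// because of the trailing colon.
func parseUPID(s string) (upid, error) {
	f := strings.Split(s, ":")
	if len(f) != 9 || f[0] != "UPID" || f[8] != "" {
		return upid{}, fmt.Errorf("malformed UPID %q", s)
	}
	return upid{Node: f[1], Type: f[5], ID: f[6], User: f[7]}, nil
}

func main() {
	// Made-up sample; the hex fields are placeholders.
	sample := "UPID:demo-felhom:0000A1B2:00C3D4E5:6842F000:vzsnapshot:9999:felhom-agent@pve!agent:"
	u, err := parseUPID(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("type=%s guest=%s actor=%s\n", u.Type, u.ID, u.User)
}
```

A selftest could then assert that `User` equals the expected token ID for each of the three operations, which would fail if a task ever ran under a different identity.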
|
||||||
|
|
||||||
### Verification
|
## Scope / not covered (by design)
|
||||||
`go build/vet/test ./...` green locally (go1.26) and on the build server (go1.26). Version 0.3.0 → 0.3.1.
|
|
||||||
|
|
||||||
### Repo state
|
- **Not validated live:** `Start`/`Stop`/`SetConfig` (reversible, low-risk; `SetConfig` is used by reconcile — an optional selftest extension could add them), `Vzdump` (already confirmed live in the phase1-2 spike), and `RestoreLXC` / provision-by-restore (deferred until the golden base image exists, ~slice 7).
|
||||||
Branch: `main` only. Dep unchanged (`golang.org/x/crypto v0.52.0`).
|
- The run used a **stopped** guest deliberately, to keep rollback deterministic (LXC snapshots carry no running-memory state; rollback of a running CT may error or stop the guest). Characterizing running-guest rollback is optional follow-up intel, not a slice-4 blocker.
|
||||||
|
|
||||||
|
## Credentials
|
||||||
|
|
||||||
|
The standing `FelhomAgent` operator token (`felhom-agent@pve!agent`) provisioned here is the one slice 4+ consumes — **not deleted**. Its secret is **stored out-of-band**, supplied to the agent via `FELHOM_AGENT_PROXMOX_TOKEN`; it is **not persisted to the repo** (the on-disk config holds only a placeholder). Scratch guest 9999 is retained (stopped) as the standing selftest target.
|
||||||
|
|||||||