
admin 1af21a6cac v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-
  ness, NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-08 23:56:20 +02:00

7.6 KiB


CLAUDE.md — `felhom-agent`

Place at the repo root (felhom-agent/CLAUDE.md). Loads when Claude Code touches this repo. Keep under ~200 lines. The cross-repo orientation lives in the workspace-root e:\git\CLAUDE.md; this file is felhom-agent-specific.

What this repo is

felhom-agent is the operator-tier host agent that runs on each Proxmox host and owns all Proxmox interaction: provision/restore guests, host storage, backup/restore orchestration, the hub control loop, and a narrow per-guest local API. It is the most privilege-sensitive component.

It is the renamed former proxmox-controller repo.
Distinct from felhom-controller — that is the in-guest controller (Docker-only, no Proxmox creds). Do not confuse them.
Control plane, not data plane: if the agent dies, apps keep serving; only management degrades.

Build / run

Module gitea.dooplex.hu/admin/felhom-agent; binary felhom-agent (cmd/felhom-agent/).
Pure Go stdlib + golang.org/x/crypto only — no web frameworks.
go.mod directive go 1.25.0; dep golang.org/x/crypto v0.52.0 (declares go 1.25, will NOT build on Go 1.24). The build server (192.168.0.180) runs go1.26.0 (upstream Go on PATH, backward-compatible). Build/run the agent there for live tests (same LAN as the demo host).
Version: version var in cmd/felhom-agent/main.go, overridable via -ldflags "-X main.version=<v>"; --version flag. Current: v0.4.0 (slice 4 complete: reconcile engine + reversibility gate). Bump on meaningful changes + add a CHANGELOG entry.
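The version override described above can be sketched as follows. This is a minimal illustration, not the repo's actual main.go: the `"dev"` default, the `versionLine` helper, and the output format are assumptions; only the `version` variable, the `--version` flag, and the `-X main.version=<v>` linker flag come from the text.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// version is the build-time version string. The "dev" default is an
// assumption for this sketch; a release build would override it with
//
//	go build -ldflags "-X main.version=v0.4.0" ./cmd/felhom-agent/
//
// The linker's -X flag only works on package-level string variables.
var version = "dev"

// versionLine formats what --version prints (hypothetical format).
func versionLine() string {
	return "felhom-agent " + version
}

func main() {
	// Go's flag package accepts both -version and --version.
	showVersion := flag.Bool("version", false, "print the version and exit")
	flag.Parse()
	if *showVersion {
		fmt.Println(versionLine())
		os.Exit(0)
	}
	// Selftest or daemon dispatch would follow here.
}
```

Keeping the value in a plain string variable means a version bump needs no code change at release time; the CHANGELOG entry and the linker flag carry it.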

Layout

cmd/felhom-agent/    main + flag handling + --selftest modes + the daemon entry
internal/config/     JSON config + FELHOM_AGENT_* env overlay; secrets redacted (Redacted())
internal/log/        slog setup
internal/proxmox/    API-first Client + fenced root-CLI Privileged + UPID WaitTask
internal/authz/      operator signed-op verifier (SSHSIG); durable FileNonceStore
internal/hub/        daemon: HostReport collector + Bearer client + resilient Loop

Proxmox model (the load-bearing rules)

API-first via a scoped FelhomAgent token (16 privileges). Raw root-CLI is fenced to exactly 3 exceptions: keyctl pct create (golden image), USB mount/fstab, SMART/sensors. Client never shells out; Privileged never makes HTTP calls (asserted by tests). Keep that fence.
Every mutating op is async → returns a UPID → WaitTask asserts exitstatus == "OK". A 200 on the POST is not success; authorization can fail at task execution, not the POST.
TLS: SHA-256 leaf-cert pinning (the host serves a self-signed cert). No insecure default.
Privsep token gotcha: a --privsep 1 token's rights = intersection of the backing user's perms AND the token's ACLs — so the role must be granted on both user and token, or every call 403s. (Token provisioning is out-of-band / human-run; the agent only consumes the token.)

Design + platform facts (read before designing)

Design doc: felhom.eu/documentation/architecture/03-host-agent.md (locked).
Platform facts: felhom.eu/documentation/proxmox-platform.md + tests/phase{0,1-2,3,4}-findings.md.

Current state

Built in slices, all on main:

v0.1.0 slice 1 — scaffold + internal/proxmox + internal/config/log + --selftest.
v0.2.0 slice 2 — internal/authz signed-op verifier.
v0.3.0 slice 3 — internal/hub: the first daemon loop (the mode that runs when no --selftest flag is given) posting a read-only HostReport to the hub (= the heartbeat). Report's storage/backup/restore/pbs/audit fields are defined-but-empty (slices 5/6); the envelope's desired-state/signed-ops fields are parsed-but-ignored (slice 4).
v0.3.1 — slice-3 validation follow-ups.
v0.3.2 — slice-4 pre-check: reversible SetConfig step added to --selftest=task; passed live on guest 9999. Findings: LXC description write is synchronous (empty UPID — dual-mode modeling confirmed); PVE appends a trailing \n to description on read (reconcile must normalize). First live VM.Config.* exercise.
v0.4.0-rc1 — slice-4 Phase A (structural): internal/reconcile — engine, per-guest serializer (§10), desired-state model + DesiredProvider seam, normalization layer (NormDescription promoted out of main.go), plan/diff engine (benign Start/Stop/SetConfig set), durable op journal + idempotency store. Wired into runDaemon sharing the queue. Runs live but unfed (EmptyProvider → zero mutations until slice 10).
v0.4.0 — slice-4 Phase B (security core): the benign/destructive classifier (provenance + data-bearing, not by verb; scratch/same-txn provenance is agent-internal, never hub-sourced), the reversibility gate (destructive → pending_signature unless a verified, role-scoped, action-bound operator signature), the signed-op consuming layer over internal/authz (role-scoping per doc 04 §4, op-to-action binding, idempotency-by-nonce, audit), and the crash-recovery consumer (Recover over InFlight(), resume-or-rollback). The gate fronts the queue's executor (every mutation passes it). Inert this slice — no destructive deltas served until slice 10; the destructive path is classified, gated, and adversarially tested but not wired to live execution. authz surface untouched.
Next: slice 5/6 (storage manifest, backup/restore) — the slices that fill the host-report's empty storage/backup collections and add the destructive executors the gate already guards.

Demo host (for live tests)

Node demo-felhom, API https://192.168.0.162:8006, PVE 9.2.2; leaf-cert SHA-256 fingerprint starts BA:7C:99:7D:45:D0… (verify it still matches before a live run, because the agent pins it). pveum/pct ops need root@pam on the PVE (SSH alias felhom-pve), which is available to Claude Code.

Selftest modes (run from the build server, pointed at the demo API):

--selftest / --selftest=read — read-only health checks.
--selftest=task -vmid N — reversible snapshot→rollback→delete on guest N (gated; never under bare --selftest).
--selftest=hub — one collect + report round-trip to the hub.
No flag → the daemon (poll loop); requires hub config.

Conventions

Push to main directly; no feature branches.

In every repository where you make a change, update both files in that repo:

CHANGELOG.md — a cumulative log of all changes; newest entry on top.

REPORT.md — overwrite with a summary of the most recent implementation (or significant validation/operational run) only; not cumulative.

Never write secrets — tokens, passwords, private keys, API keys — into CHANGELOG.md, REPORT.md, or any committed file. Reference them as "stored out-of-band" instead.

Code quality: verify generated code for bugs/edge cases; add debug logging; ask rather than guess when you'd otherwise invent input/output.

Workflow & artifacts

Implement TASK.md / TASK-*.md specs (when placed as TASK.md or told to implement one), then push + CHANGELOG + REPORT.md.
RUNBOOK-*.md — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
