admin 1af21a6cac v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

The security core of slice 4: hub-supplied intent is no longer trusted for
destructive change. The gate fronts the per-guest queue's executor, so every
mutation passes it. Reuses internal/authz for all crypto (surface untouched).

- Classifier (doc 03 §4): benign vs destructive by provenance + data-bearing-
  ness, NOT by verb. Destroy/overwrite of customer data is destructive unless
  agent-internal provenance (same-journaled-txn create, or agent-tagged scratch)
  makes it benign — and that provenance is journal-recorded, NEVER hub-sourced.
  Unknown op class fails safe to destructive.
- Reversibility gate: benign -> allowed unsigned; destructive -> requires a
  verified, role-scoped, action-bound operator signature, else pending_signature
  and never executed. Every decision audited (signal, never the guard).
- Signed-op consuming layer over authz.Verifier.Verify (locked pipeline
  untouched): role-scoping (doc 04 §4 — recovery=rotation only, operational=
  ordinary destructive + planned rotation) + op-to-action binding (op+host+
  guest+params must match the gated action).
- Signed-job orchestration: idempotency dedupe by nonce + journal-wrapped
  execution via an injected DestructiveExecutor (nil this slice — inert).
- Crash recovery (Note 1): Engine.Recover consumes the journal InFlight() set at
  startup (resume-or-rollback) — covers an op that crashed after the POST and
  before its terminal record, which idempotency dedupe alone cannot. Added
  TaskStatusOnce to the GuestAPI seam. Wired into daemon startup.
- Note 2: memory comparison canonicalized to MiB (desiredMemoryMiB) so a
  non-MiB-aligned MemoryBytes converges in one pass, not perpetual drift.
- Daemon: builds the verifier from config signers (none = nil verifier, the
  common slice-4 state), the gate (+SlogAudit), runs Recover before mutating.
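
The classifier and gate rules above can be read as the following minimal sketch. It is an illustration only, not the code in this commit: the `Op` fields, the `knownKinds` allowlist, and the op names are hypothetical, and signature verification is reduced to a boolean that the real gate would obtain from authz.Verifier.Verify plus role-scoping and binding checks.

```go
package main

import "fmt"

// Class is the classifier's verdict for one operation.
type Class int

const (
	Benign Class = iota
	Destructive
)

// Op is a hypothetical, simplified view of a queued mutation.
type Op struct {
	Kind         string // illustrative names such as "start" or "delete_volume"
	TouchesData  bool   // destroys or overwrites data
	AgentCreated bool   // provenance read from the agent's own journal, never from the hub
}

// knownKinds is an illustrative allowlist; anything else fails safe.
var knownKinds = map[string]bool{"start": true, "stop": true, "delete_volume": true}

// Classify decides by provenance and data-bearing-ness, not by verb.
func Classify(op Op) Class {
	if !knownKinds[op.Kind] {
		return Destructive // unknown op class fails safe to destructive
	}
	if op.TouchesData && !op.AgentCreated {
		return Destructive
	}
	return Benign
}

// Decision values mirror the outcomes named in the commit message.
type Decision string

const (
	Allowed          Decision = "allowed"
	PendingSignature Decision = "pending_signature"
)

// Gate lets benign ops through unsigned; a destructive op needs a verified
// signature, otherwise it is parked and never executed.
func Gate(op Op, signatureVerified bool) Decision {
	if Classify(op) == Benign || signatureVerified {
		return Allowed
	}
	return PendingSignature
}

func main() {
	fmt.Println(Gate(Op{Kind: "start"}, false))
	fmt.Println(Gate(Op{Kind: "delete_volume", TouchesData: true}, false))
	fmt.Println(Gate(Op{Kind: "delete_volume", TouchesData: true, AgentCreated: true}, false))
	fmt.Println(Gate(Op{Kind: "mystery"}, false))
}
```

Note that the same verb (`delete_volume` here) lands in either class depending on journal-recorded provenance, which is the point of classifying by provenance rather than by verb.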

Adversarial matrix proven against the REAL authz.Verifier with in-test-minted
SSHSIGs (framing replicated in reconcile's test binary; authz untouched, no
signing added to the verify-only package): unsigned job + unsigned desired-state
delta -> pending_signature; unknown signer/expired/replay-across-restart/wrong
host -> typed authz rejections; wrong guest/op/params -> binding_mismatch;
recovery key on ordinary destructive -> role_denied; hub-supplied scratch tag
ignored -> refused; valid+role+target+fresh nonce -> accepted then replay
rejected. Full module race-clean + vet-clean on the Linux build server.
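
As a rough illustration of the op-to-action binding that produces binding_mismatch in the matrix above, the sketch below compares a signed claim against the action about to run. The `Action` struct and its field types are assumptions made for this example; the real types in internal/ may differ.

```go
package main

import "fmt"

// Action is a hypothetical shape shared by the signed claim and the gated action.
type Action struct {
	Op     string
	Host   string
	Guest  int
	Params map[string]string
}

// sameParams compares two parameter maps key by key.
// A nil map and an empty map are treated as equal.
func sameParams(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// bindingMatches reports whether the signature covers exactly the action
// being gated: op, host, guest, and params must all agree.
func bindingMatches(signed, gated Action) bool {
	return signed.Op == gated.Op &&
		signed.Host == gated.Host &&
		signed.Guest == gated.Guest &&
		sameParams(signed.Params, gated.Params)
}

func main() {
	signed := Action{Op: "rollback", Host: "demo-felhom", Guest: 101,
		Params: map[string]string{"snapshot": "pre-upgrade"}}

	wrongGuest := signed
	wrongGuest.Guest = 102

	fmt.Println(bindingMatches(signed, signed))     // same action
	fmt.Println(bindingMatches(signed, wrongGuest)) // signature reused on another guest
}
```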

Inert this slice: no destructive deltas served until slice 10; the destructive
path is classified, gated, and tested but not wired to live execution.

CHECKPOINT: Phase B complete (slice 4 done). Awaiting validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-08 23:56:20 +02:00

cmd/felhom-agent

v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

2026-06-08 23:56:20 +02:00

configs

feat(hub): host-report client + collector + first daemon loop (slice 3, v0.3.0)

2026-06-08 16:20:09 +02:00

internal

v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

2026-06-08 23:56:20 +02:00

.gitignore

feat(agent): scaffold + proxmox interaction layer (slice 1)

2026-06-08 14:34:32 +02:00

CHANGELOG.md

v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

2026-06-08 23:56:20 +02:00

CLAUDE.md

v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

2026-06-08 23:56:20 +02:00

go.mod

feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)

2026-06-08 15:23:02 +02:00

go.sum

feat(authz): operator signed-op verifier + durable nonce store (slice 2, v0.2.0)

2026-06-08 15:23:02 +02:00

README.md

feat(agent): scaffold + proxmox interaction layer (slice 1)

2026-06-08 14:34:32 +02:00

REPORT.md

v0.4.0: slice 4 Phase B — reversibility gate + signed-op consuming layer

2026-06-08 23:56:20 +02:00

README.md

felhom-agent

The host agent for the Felhom platform — the operator-tier component that runs on each Proxmox host and owns all Proxmox interaction (provision/restore guests, host storage, backups, host+tunnel monitoring, hub control loop, per-guest local API). Design: felhom.eu/documentation/architecture/03-host-agent.md.

Status — slice 1 of N. This repo currently contains the project scaffold and the internal/proxmox interaction layer (the typed library every other module will call to talk to Proxmox), plus a runnable read-only --selftest. No reconcile loop, hub client, signing, or storage/backup orchestration yet — those are later slices.

Module: gitea.dooplex.hu/admin/felhom-agent · binary: felhom-agent · Go 1.24.

Layout

cmd/felhom-agent/      # entry point + --selftest (wiring only; no daemon loop yet)
internal/proxmox/      # the Proxmox interaction layer (API-first + fenced root-CLI)
internal/config/       # JSON config + env overrides (secrets never logged)
internal/log/          # slog setup
configs/agent.example.json

The `proxmox` package — model

Two backends, one fixed routing policy (the fence is structural — Client never shells out, Privileged never makes an HTTP call; asserted in routing_test.go):

Route	Backend	Used for
API (default)	`proxmox.Client`	everything the scoped FelhomAgent token can do
root-CLI (fenced)	`proxmox.Privileged`	the three proven OS-root exceptions only

Grounded entirely in the spike findings (felhom.eu/documentation/proxmox-platform.md, tests/phase{0,1-2,3}-findings.md). Every mutating API op is async: it returns a UPID, and the caller runs WaitTask until the task stops, then asserts exitstatus == "OK". Authorization failures can surface at task execution rather than at the HTTP POST (phase1-2 §1.3).

Public surface

Client (API):

Read: Version, Nodes, NodeStatus, ListLXC, GuestStatus, GuestConfig, ListStorage, NodeStorage, StorageContent.
Async mutating (return UPID): RestoreLXC (primary create path), Vzdump, Snapshot, Rollback, DeleteSnapshot, SetConfig, Start, Stop.
Tasks: WaitTask, TaskStatusOnce, TaskLogTail.
Errors: *APIError (parses the offending privilege from a 403), *TaskError (parses it from a failed task exitstatus).

Privileged (fenced root-CLI) — each method documents why it can't be the API:

CreateGoldenLXC — pct create with keyctl=1 (root@pam-only; the only root-fenced create — the per-customer path provisions by restore, which preserves keyctl).
MountUSBByUUID — host mount-by-UUID (not a Proxmox API op).
SMART, Sensors — hardware reads (not API-exposed).

API-vs-root routing table

See the table in internal/proxmox/doc.go. Summary: the entire guest lifecycle including restore is API-token-covered; OS-root is confined to golden-image keyctl create, host mounts, and SMART/sensors (phase3 §B3).

TLS trust

The host serves a self-signed cert. Verification is not blanket-disabled. Pick one in config: ca_file (PEM, full verify), fingerprint (SHA-256 of the host leaf cert — pinned exact-cert match; the /nodes API returns each node's ssl_fingerprint to pin), or the explicitly-named insecure_skip_verify (off by default; selftest-against-127.0.0.1 only).

Provisioning the token (out-of-band, operator side)

The agent only consumes a privilege-separated API token; role setup is a provisioning step. The role must be granted on both the user AND the token for the same path, or the intersection is empty and every call 403s (phase1-2 §1.2):

pveum role add FelhomAgent -privs "VM.Allocate VM.Audit VM.Config.Disk VM.Config.CPU \
  VM.Config.Memory VM.Config.Network VM.Config.Options VM.PowerMgmt VM.Snapshot \
  VM.Snapshot.Rollback VM.Backup Datastore.Allocate Datastore.AllocateSpace \
  Datastore.Audit Sys.Audit SDN.Use"          # 16 privileges, validated Phase 3 B3
pveum user add felhom-agent@pve
pveum user token add felhom-agent@pve agent --privsep 1   # capture the secret (shown once)
pveum acl modify / -user  'felhom-agent@pve'       -role FelhomAgent
pveum acl modify / -token 'felhom-agent@pve!agent' -role FelhomAgent

(VM.Config.CPUMemory is not a real privilege; SDN.Use is required for bridge use.)
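
To make the "intersection is empty" failure mode concrete, here is a toy model of how a privilege-separated token's effective privileges on a path are derived. It is a simplification for illustration and does not model Proxmox's ACL resolution (paths, propagation, groups); `effective` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// effective returns the privileges present on BOTH the user and the token
// for one path, which is what a privilege-separated token can actually use.
func effective(userPrivs, tokenPrivs []string) []string {
	onUser := make(map[string]bool, len(userPrivs))
	for _, p := range userPrivs {
		onUser[p] = true
	}
	var out []string
	for _, p := range tokenPrivs {
		if onUser[p] {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	role := []string{"VM.Audit", "VM.Snapshot", "Sys.Audit"}

	// Role granted on both user and token: everything in the role is usable.
	fmt.Println(effective(role, role))

	// Role granted on the token only: nothing is usable, so every call 403s.
	fmt.Println(len(effective(nil, role)))
}
```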

Run

go build ./...
# read-only health check against the host:
./felhom-agent --config configs/agent.example.json --selftest
# or via env (keeps the secret off disk):
FELHOM_AGENT_PROXMOX_TOKEN='felhom-agent@pve!agent=SECRET' \
FELHOM_AGENT_PROXMOX_NODE=demo-felhom \
FELHOM_AGENT_PROXMOX_ENDPOINT=https://192.168.0.162:8006 \
FELHOM_AGENT_PROXMOX_TLS_FINGERPRINT='BA:7C:...:CF' \
  ./felhom-agent --selftest

--selftest (read-only) loads config, builds the API client, and runs the read queries (version, nodes, node status, guests, storage), printing a short health report. It mutates nothing and says so cleanly if the token/endpoint isn't configured.

--selftest=task --vmid N (explicitly gated) exercises WaitTask on a reversible op (snapshot → rollback → delete-snapshot) against guest N. Default --selftest never mutates.
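
The environment overrides used in the Run commands could be layered over the JSON config roughly as follows. The variable names come from the commands above; the `Proxmox` struct, `applyEnv`, and the redacting `String` method are hypothetical and only illustrate "env wins over file, secret never logged".

```go
package main

import (
	"fmt"
	"os"
)

// Proxmox is a hypothetical config section; field names are illustrative.
type Proxmox struct {
	Endpoint       string
	Node           string
	Token          string
	TLSFingerprint string
}

// applyEnv overwrites a field only when its variable is set and non-empty.
// getenv is injected so the function can be exercised without touching the
// process environment.
func applyEnv(c *Proxmox, getenv func(string) string) {
	set := func(dst *string, key string) {
		if v := getenv(key); v != "" {
			*dst = v
		}
	}
	set(&c.Endpoint, "FELHOM_AGENT_PROXMOX_ENDPOINT")
	set(&c.Node, "FELHOM_AGENT_PROXMOX_NODE")
	set(&c.Token, "FELHOM_AGENT_PROXMOX_TOKEN")
	set(&c.TLSFingerprint, "FELHOM_AGENT_PROXMOX_TLS_FINGERPRINT")
}

// String redacts the token so printing or logging the config cannot leak it.
func (c Proxmox) String() string {
	token := "[unset]"
	if c.Token != "" {
		token = "[redacted]"
	}
	return fmt.Sprintf("endpoint=%s node=%s token=%s fingerprint=%s",
		c.Endpoint, c.Node, token, c.TLSFingerprint)
}

func main() {
	// Values as they might have been loaded from the JSON file.
	cfg := Proxmox{Endpoint: "https://pve.invalid:8006", Node: "from-file"}
	applyEnv(&cfg, os.Getenv)
	fmt.Println(cfg)
}
```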

Process model (proposed, not finalized — see 03 §3/§12)

Native Go binary, systemd service, non-root service user holding the scoped token, with a narrow sudoers allowlist for the three fenced ops. privileged.mode: "sudo" matches this; "direct" is for dev/CI where the agent is already root.

Test

go vet ./... && go test ./...

Unit tests use a mock HTTP transport + mock runner (no live host): UPID parse, WaitTask (running→OK / running→failed-403 / timeout / ctx-cancel), 403→privilege-named error, response decoding against the captured live shapes, and the API-vs-root routing fence.
