felhom.eu/documentation/tests/phase5-pbs-spike-findings.md
admin 94a236328b docs(spike): phase 5 PBS mechanism findings (DooPlex server ← N100 client)
Empirical PBS validation before the slice-6 Phase B spec. Records: PBS install on
Debian-13 DooPlex (trixie key ships in proxmox-archive-keyring, no standalone .gpg),
datastore + cert fingerprint, the PBS privsep gotcha (grant role on user AND token),
the encrypted pbs storage + key location (/etc/pve/priv/storage/<id>.enc), the snapshot
volid format + native fields (→ PBSSnapshot shape), restore-from-PBS works unchanged,
the verify mechanism (server-side; agent drives it remotely via the PBS API, result read
from snapshot verification.state), no operator-token privilege gap, and zero-knowledge
confirmed (server can't decrypt without the client key). PBS+datastore+storage left up
for Phase B; no secrets committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 16:26:57 +02:00


Phase 5 spike — PBS mechanism validation (DooPlex server ← N100 client)

Status: empirical findings from a live spike (2026-06-09). PBS was never validated in any prior spike (proxmox-platform.md §4.6). This establishes the real mechanisms before the slice-6 Phase B spec is written. No production data; probe → record → teardown. The PBS server + datastore + the N100's felhom-pbs storage are left in place for Phase B's live runbook to reuse.

Topology under test

  • PBS server: DooPlex 192.168.0.180 (Debian 13 trixie, separate box, backup-only — runs no guests). PBS proxmox-backup-server 4.2.1-1.
  • Client: the N100 demo-felhom 192.168.0.162 (PVE 9.2.2, proxmox-backup-client 4.2.0). It backs up to DooPlex and restores from it.

Pre-flight

| # | Check | Result |
|---|---|---|
| P1 | DooPlex can host PBS | Debian 13 trixie; proxmox-backup-server not in stock apt, so it needed the Proxmox PBS no-subscription repo (deb http://download.proxmox.com/debian/pbs trixie pbs-no-subscription). Surprise: there is no standalone proxmox-release-trixie.gpg (404; only bookworm/bullseye are published). The trixie key ships in the proxmox-archive-keyring package (key 24B30F06…0BFE778E). I copied that keyring from the N100 (PVE9/trixie already has it). 126 GB free on /. |
| P2 | N100 → DooPlex:8007 | reachable (8007 closed pre-install, open after). |
| P3 | N100 PBS client | proxmox-backup-client 4.2.0, PVE PBSPlugin.pm present. |

Stand-up (DooPlex)

  • S1 — proxmox-backup-manager datastore create felhom-spike /var/lib/pbs-spike → TASK OK. Services proxmox-backup-proxy + proxmox-backup active, listening on :8007. Server cert fingerprint: 3b:95:5a:fa:9e:0e:4a:54:f3:64:08:e5:a2:a2:6c:66:e9:86:44:64:40:8e:c2:f7:6e:41:d2:c2:1e:86:48:c4.
  • S2 — created PBS user felhom@pbs + API token felhom@pbs!n100, ACL DatastoreAdmin on /datastore/felhom-spike. ⚠️ PBS privsep gotcha (mirrors PVE): an API token's effective rights = token-ACL ∩ user-ACL. Granting only the token wasn't enough — pvesm add failed with "Cannot find datastore" until DatastoreAdmin was also granted to the felhom@pbs user. Phase B's enrollment must grant the role on both the user and the token.
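The privsep rule in S2 can be sketched as a plain set intersection. This is only an illustration of why a token-only grant yields nothing; the function name is hypothetical and the privilege set is an illustrative subset, not the full DatastoreAdmin role.

```python
def effective_privileges(user_privs: set, token_privs: set) -> set:
    """Privilege-separated token: effective rights = token-ACL ∩ user-ACL."""
    return user_privs & token_privs


# Illustrative subset of datastore privileges (assumption: not the full role).
role = {"Datastore.Audit", "Datastore.Backup", "Datastore.Verify"}

# Role granted on the token only: the user side is empty, so nothing survives.
token_only = effective_privileges(set(), role)

# Role granted on both the user and the token: the full role is effective.
both = effective_privileges(role, role)

print(sorted(token_only))  # empty: the datastore is invisible to the token
print(sorted(both))
```

The empty result in the token-only case is consistent with the "Cannot find datastore" failure: the token cannot see a datastore it holds no effective rights on.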

Adding as an encrypted PVE storage (N100)

  • A1 — pvesm add pbs felhom-pbs --server 192.168.0.180 --datastore felhom-spike --fingerprint <fp> --username 'felhom@pbs!n100' --password <token-secret> --encryption-key autogen --content backup. Resulting /etc/pve/storage.cfg:
    pbs: felhom-pbs
        datastore felhom-spike
        server 192.168.0.180
        content backup
        encryption-key 01:36:e9:fe:e1:ee:3d:7a:9d:bf:3d:63:d0:68:fd:24:45:b7:5f:bc:b6:82:bc:6d:d2:b4:7a:b0:1a:86:6d:a1
        fingerprint 3b:95:5a:fa:…:48:c4
        username felhom@pbs!n100
    
    Where the keys live on the box (the "live key on box"):
    • client encryption key → /etc/pve/priv/storage/felhom-pbs.enc (root:www-data 0600, 255 B). The encryption-key line in storage.cfg is only the key's fingerprint (01:36:e9:fe…), not the key.
    • PBS token secret → /etc/pve/priv/storage/felhom-pbs.pw (0600, 37 B).
  • A2 — the slice-5 agent observe (--selftest=storage) sees the target with the fingerprint-pinned durable_id exactly as designed: durable=192.168.0.180:felhom-spike#3b:95:5a:fa:…:48:c4, type=pbs, state=attached. No agent change needed for observation.
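How the durable_id in A2 relates to the storage.cfg stanza can be sketched as below. This is not the agent's code: the function names are hypothetical, and the only thing taken from the spike is the observed shape server:datastore#fingerprint. The fingerprint is kept elided exactly as in the stanza above.

```python
def parse_storage_stanza(text: str) -> dict:
    """Parse one storage.cfg stanza ('<type>: <id>' header, then 'key value' lines)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    stype, _, sid = lines[0].partition(":")
    entry = {"type": stype.strip(), "id": sid.strip()}
    for ln in lines[1:]:
        key, _, value = ln.strip().partition(" ")
        entry[key] = value.strip()
    return entry


def durable_id(entry: dict) -> str:
    """Fingerprint-pinned identity: <server>:<datastore>#<fingerprint>."""
    return f"{entry['server']}:{entry['datastore']}#{entry['fingerprint']}"


stanza = """
pbs: felhom-pbs
    datastore felhom-spike
    server 192.168.0.180
    content backup
    fingerprint 3b:95:5a:fa:…:48:c4
    username felhom@pbs!n100
"""

entry = parse_storage_stanza(stanza)
print(durable_id(entry))
```

Pinning the fingerprint into the identity means a server that is re-installed with a new certificate shows up as a different target rather than silently replacing the old one.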

Probes (B1–B6)

B1 — backup to PBS

vzdump 9001 --storage felhom-pbs --mode snapshot

  • PBS snapshot id ct/9001/2026-06-09T14:18:33Z (<type>/<id>/<RFC3339Z>); the underlying call is proxmox-backup-client backup … --repository felhom@pbs!n100@192.168.0.180:felhom-spike.
  • Encrypted client-side: --crypt-mode=encrypt, "Using encryption key from file descriptor", "Encryption key fingerprint: 01:36:e9:fe:e1:ee:3d:7a" (matches the storage key). Incremental/deduped ("reused 41 MiB"). ~19 s for ~1 GiB.
  • Surprise vs Phase A: vzdump chose stop mode for the (stopped) guest even though snapshot was requested (INFO: backup mode: stop). PVE picks the actual mode; the reported Backup.mode is what was requested. For a running guest on lvm-thin it would snapshot. (Still crash-consistent only — no fsfreeze, per slice 6.)
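One way to surface the requested-vs-actual mode is to read the "backup mode" line from the vzdump task log. The sketch below assumes the log is available as text; the function names and the first sample log line are illustrative, only the "INFO: backup mode: stop" line is quoted from the spike.

```python
import re
from typing import Optional

# Matches the line vzdump printed in B1, e.g. "INFO: backup mode: stop".
_MODE_LINE = re.compile(r"^INFO: backup mode: (\w+)\s*$", re.MULTILINE)


def actual_backup_mode(task_log: str) -> Optional[str]:
    """Return the mode PVE actually used, or None if the log has no mode line."""
    match = _MODE_LINE.search(task_log)
    return match.group(1) if match else None


def mode_report(requested: str, task_log: str) -> dict:
    """Pair the requested mode with the actual one and flag a mismatch."""
    actual = actual_backup_mode(task_log)
    return {
        "requested": requested,
        "actual": actual,
        "differs": actual is not None and actual != requested,
    }


sample_log = "\n".join([
    "INFO: starting new backup job: vzdump 9001 --storage felhom-pbs --mode snapshot",  # illustrative
    "INFO: backup mode: stop",  # quoted from B1
])

report = mode_report("snapshot", sample_log)
print(report)
```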

B2 — snapshot inventory → the PBSSnapshot wire shape

  • PVE volid (pvesm list felhom-pbs): felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z, format pbs-ct, type backup. This is the exact volid pct restore consumes (B3).
  • PBS native (proxmox-backup-client snapshot list --output-format json) per snapshot: backup-type (ct|vm), backup-id, backup-time (epoch int), size, owner (felhom@pbs!n100), protected (bool), fingerprint (the encryption-key fp), and files[] each with filename + size + crypt-mode (encrypt for data, sign-only for index.json). verification is ABSENT until a verify runs (see B4). Namespace: not shown → the default (root) namespace; a ns field appears only for non-root namespaces.
    • Proposed PBSSnapshot: namespace, backup_type, backup_id, backup_time, size_bytes, owner, protected, encrypted (derive from files[].crypt-mode), verify_state (ok|failed|none), verified_at/verify_upid.
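The proposed shape can be sketched as a dataclass built from one entry of the native snapshot list. This is a sketch under assumptions, not the Phase B spec: from_native and the helper methods are hypothetical names, the sample sizes and file names are made up, and verified_at is left out because the observed verification field carried only state and upid.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PBSSnapshot:
    namespace: str          # "" = default (root) namespace
    backup_type: str        # "ct" or "vm"
    backup_id: str
    backup_time: int        # epoch seconds
    size_bytes: int
    owner: str
    protected: bool
    encrypted: bool
    verify_state: str       # "ok" | "failed" | "none"
    verify_upid: Optional[str]

    def snapshot_id(self) -> str:
        """<type>/<id>/<RFC3339Z>, as seen in B1."""
        ts = datetime.fromtimestamp(self.backup_time, tz=timezone.utc)
        return f"{self.backup_type}/{self.backup_id}/{ts.strftime('%Y-%m-%dT%H:%M:%SZ')}"

    def volid(self, storage_id: str) -> str:
        """PVE volid, the form pct restore consumes."""
        return f"{storage_id}:backup/{self.snapshot_id()}"


def from_native(entry: dict) -> PBSSnapshot:
    files = entry.get("files", [])
    # Assumption: every file except the manifest (index.json*) must be 'encrypt'.
    modes = [f.get("crypt-mode") for f in files
             if not f.get("filename", "").startswith("index.json")]
    verification = entry.get("verification") or {}   # absent until a verify runs
    return PBSSnapshot(
        namespace=entry.get("ns", ""),
        backup_type=entry["backup-type"],
        backup_id=str(entry["backup-id"]),
        backup_time=int(entry["backup-time"]),
        size_bytes=int(entry.get("size", 0)),
        owner=entry.get("owner", ""),
        protected=bool(entry.get("protected", False)),
        encrypted=bool(modes) and all(m == "encrypt" for m in modes),
        verify_state=verification.get("state", "none"),
        verify_upid=verification.get("upid"),
    )


sample = {  # illustrative entry; sizes and file names are made up
    "backup-type": "ct", "backup-id": "9001",
    "backup-time": int(datetime(2026, 6, 9, 14, 18, 33, tzinfo=timezone.utc).timestamp()),
    "size": 1024, "owner": "felhom@pbs!n100", "protected": False,
    "files": [
        {"filename": "pct.conf.blob", "crypt-mode": "encrypt", "size": 1000},
        {"filename": "index.json.blob", "crypt-mode": "sign-only", "size": 24},
    ],
}
snap = from_native(sample)
print(snap.volid("felhom-pbs"), snap.encrypted, snap.verify_state)
```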

B3 — restore from PBS

pct restore 990001 'felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z' --storage local-lvm → restored + booted to running. The existing restore path works UNCHANGED against a pbs-sourced volid — same volid + --storage shape the agent's RestoreLXC already uses (ostemplate=<volid>, restore=1). No agent restore code change needed for PBS. PVE pulls + decrypts using the storage's .enc key automatically.
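The restore request shape mentioned in B3 can be sketched as a parameter dict. This is an assumption about what a RestoreLXC-style call sends, not the agent's actual code: restore_lxc_params is a hypothetical helper, and only ostemplate, restore=1 and the storage target are taken from the spike notes.

```python
def restore_lxc_params(new_vmid: int, volid: str, target_storage: str) -> dict:
    """Build the restore parameters; the same shape serves dir and pbs volids."""
    if ":backup/" not in volid:
        raise ValueError(f"not a backup volid: {volid!r}")
    return {
        "vmid": new_vmid,            # id of the guest to create
        "ostemplate": volid,         # the backup volid, pbs-sourced or not
        "restore": 1,                # restore instead of create-from-template
        "storage": target_storage,   # where the restored rootfs lands
    }


params = restore_lxc_params(
    990001, "felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z", "local-lvm"
)
print(params)
```

Nothing in the parameters names the encryption key; the key is resolved on the PVE side from the storage definition.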

B4 — verify mechanism (the big unknown — resolved)

  • proxmox-backup-client has NO verify subcommand — verify is server-side.
  • Triggers: server CLI proxmox-backup-manager verify <store> [--ignore-verified] [--outdated-after N] on DooPlex, OR the PBS API POST /api2/json/admin/datastore/<ds>/verify (whole datastore; per-snapshot params available).
  • The agent on the N100 CAN drive it remotely via the PBS API + token (no DooPlex shell needed). Proven: curl -X POST …/admin/datastore/felhom-spike/verify with header Authorization: PBSAPIToken=felhom@pbs!n100:<secret> returned a task UPID UPID:dooplex:…:verify:felhom\x2dspike:felhom@pbs!n100:. Needs Datastore.Verify (in DatastoreAdmin).
  • Result read-back: after verify, the snapshot's verification field appears: {"state":"ok","upid":"UPID:dooplex:…"} (read via snapshot list). So the agent triggers via API → polls/re-lists → reads verification.state (ok/failed). (Task-status polling needs the PBS node name — it's dooplex, embedded in the UPID; localhost returns exitstatus: unknown.)
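The pieces the agent needs for the remote trigger can be sketched without any network code: the token header, the verify URL, and the node name pulled from the UPID. All helper names are hypothetical. The task-status path is an assumption about the PBS API layout and should be checked against the API docs. The sample UPID is the elided one from the spike, kept as-is.

```python
from urllib.parse import quote


def pbs_auth_header(token_id: str, secret: str) -> dict:
    """Header form used in B4: Authorization: PBSAPIToken=<tokenid>:<secret>."""
    return {"Authorization": f"PBSAPIToken={token_id}:{secret}"}


def verify_url(server: str, datastore: str, port: int = 8007) -> str:
    """POST target that starts a datastore verify."""
    return f"https://{server}:{port}/api2/json/admin/datastore/{datastore}/verify"


def upid_node(upid: str) -> str:
    """The node name is the second colon-separated field of a UPID."""
    parts = upid.split(":")
    if len(parts) < 2 or parts[0] != "UPID" or not parts[1]:
        raise ValueError(f"not a UPID: {upid!r}")
    return parts[1]


def task_status_url(server: str, upid: str, port: int = 8007) -> str:
    """Assumed task-status path; uses the UPID's node rather than 'localhost'."""
    return (f"https://{server}:{port}/api2/json/nodes/{upid_node(upid)}"
            f"/tasks/{quote(upid, safe='')}/status")


# Elided UPID exactly as recorded above (raw string keeps the literal \x2d).
sample_upid = r"UPID:dooplex:…:verify:felhom\x2dspike:felhom@pbs!n100:"

print(verify_url("192.168.0.180", "felhom-spike"))
print(upid_node(sample_upid))
```

A real client would also pin the server certificate fingerprint on the TLS connection; that part is left out of this sketch.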

B5 — agent-token (felhom-agent@pve) privileges — no gap

Driven by the agent (operator token, not root@pam):

  • Backup to PBS (--selftest=backup): felhom-pbs:backup/ct/9001/2026-06-09T14:22:30Z, crash_consistent, success.
  • Restore from PBS (--selftest=restore-test): restored into scratch 990000, booted, verified running, torn down — pass.
  • The FelhomAgent role's existing Datastore.{Audit,Allocate,AllocateSpace} + VM.Backup suffice for both backup-to-PBS and restore-from-PBS. No role widening needed. (Two auth layers: the PVE operator token authorizes the vzdump/restore API call; the PBS token in storage.cfg authenticates PVE→PBS. The spike exercised both.)

B6 — zero-knowledge confirmed

  • All data files are crypt-mode=encrypt (B2); index.json is sign-only.
  • Without the key, an authenticated restore fails to decrypt: proxmox-backup-client restore … pct.conf.blob - (no --keyfile) → Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a.
  • With --keyfile /etc/pve/priv/storage/felhom-pbs.enc → decrypts (returns the guest config). The key is the only gate.
  • The PBS server holds no client key: find /etc/proxmox-backup /var/lib/pbs-spike for key material returns only the server's own csrf.key, never the client encryption key. So DooPlex can store + serve chunks but cannot read guest data. Zero-knowledge holds: the live key on the N100 is the irreducible residual (the operator/hub can't read the data).

Implications for the Phase B spec (flagged surprises vs the dir-storage assumptions)

  1. Enrollment must grant the PBS role on BOTH the user AND the token (PBS privsep), and add the pbs storage with --encryption-key autogen → the live key lands at /etc/pve/priv/storage/<id>.enc (the "live PBS key on the box", doc 03 §8). The hub holds only the recovery-code-wrapped escrow (out of scope here).
  2. Backup + restore need NO new code beyond targeting a pbs storage — Vzdump and RestoreLXC/pct restore work against pbs volids unchanged. The agent's LatestBackupVolID (StorageContent filter) already resolved the pbs volid.
  3. Verify is a NEW capability to build: a server-side op the agent triggers remotely via the PBS API (POST …/datastore/<ds>/verify) using the storage's token, then reads back verification.state from the snapshot list. This is the "lighter frequent integrity check" (§8) — it does NOT need the encryption key (ciphertext-level), unlike the full self-restore-test. Phase B needs a small PBS-API client (token auth, fingerprint pin) for verify + snapshot-list-with-verify-state; the existing proxmox.Client (PVE API) does not cover it.
  4. PBSSnapshot wire shape = the B2 fields; verify_state is the load-bearing one and is none until a verify runs.
  5. vzdump mode is PVE's choice (stop for stopped guests) — report requested-vs-actual if it matters, or read the actual mode from the task log.
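The trigger → re-list → read-back flow from implication 3 could look roughly like the loop below. It is a sketch with the PBS calls injected as callables, so nothing here is the real client: wait_for_verify_state and its parameters are hypothetical, and matching on the returned UPID is an added assumption to avoid reading a stale result from an earlier verify.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_verify_state(
    list_snapshots: Callable[[], Iterable[dict]],
    backup_type: str,
    backup_id: str,
    backup_time: int,
    expect_upid: Optional[str] = None,
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-list snapshots until the target carries a verification result.

    Returns 'ok' or 'failed' once present, or 'none' if it never appears.
    """
    target = (backup_type, backup_id, backup_time)
    for attempt in range(attempts):
        for snap in list_snapshots():
            key = (snap.get("backup-type"), snap.get("backup-id"), snap.get("backup-time"))
            verification = snap.get("verification")
            if key != target or not verification:
                continue
            if expect_upid is None or verification.get("upid") == expect_upid:
                return verification["state"]
        if attempt < attempts - 1:
            sleep(delay_s)
    return "none"


# Fake lister: the verification field appears on the third listing.
calls = {"n": 0}


def fake_list():
    calls["n"] += 1
    snap = {"backup-type": "ct", "backup-id": "9001", "backup-time": 1}
    if calls["n"] >= 3:
        snap["verification"] = {"state": "ok", "upid": "UPID:dooplex:…"}
    return [snap]


state = wait_for_verify_state(fake_list, "ct", "9001", 1,
                              expect_upid="UPID:dooplex:…", sleep=lambda s: None)
print(state, calls["n"])
```

Returning "none" on timeout keeps the result inside the verify_state vocabulary proposed in B2, so a caller never has to handle a fourth value.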

Teardown / left-in-place

  • Throwaway restore guest 990001 destroyed; agent restore-test scratch self-torn-down; pct list → no leftover guests. Agent config reverted (backup.local_backup_target → local). Token-secret temp files removed from both boxes.
  • Left in place for Phase B: the PBS server on DooPlex, the felhom-spike datastore (with two test snapshots of 9001), the felhom@pbs!n100 token + ACLs, and the N100's felhom-pbs encrypted storage (+ its .enc/.pw under /etc/pve/priv/storage/).
  • No secrets committed — the encryption key, token secret, and PBS password live only in /etc/pve/priv/storage/ (0600) on the N100; this doc references them by location/fingerprint only.