docs(spike): phase 5 PBS mechanism findings (DooPlex server ← N100 client)

Empirical PBS validation before the slice-6 Phase B spec. Records: PBS install on
Debian-13 DooPlex (trixie key ships in proxmox-archive-keyring, no standalone .gpg),
datastore + cert fingerprint, the PBS privsep gotcha (grant role on user AND token),
the encrypted pbs storage + key location (/etc/pve/priv/storage/<id>.enc), the snapshot
volid format + native fields (→ PBSSnapshot shape), restore-from-PBS works unchanged,
the verify mechanism (server-side; agent drives it remotely via the PBS API, result read
from snapshot verification.state), no operator-token privilege gap, and zero-knowledge
confirmed (server can't decrypt without the client key). PBS+datastore+storage left up
for Phase B; no secrets committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 16:26:57 +02:00
parent 41f2d2b5da
commit 94a236328b
@@ -0,0 +1,163 @@
# Phase 5 spike — PBS mechanism validation (DooPlex server ← N100 client)
> **Status: empirical findings from a live spike (2026-06-09).** PBS was never validated in
> any prior spike (proxmox-platform.md §4.6). This establishes the *real* mechanisms before
> the slice-6 **Phase B** spec is written. No production data; probe → record → teardown. The
> PBS server + datastore + the N100's `felhom-pbs` storage are **left in place** for Phase B's
> live runbook to reuse.
## Topology under test
- **PBS server: DooPlex** `192.168.0.180` (Debian 13 trixie, separate box, backup-only — runs
no guests). PBS `proxmox-backup-server 4.2.1-1`.
- **Client: the N100** `demo-felhom` `192.168.0.162` (PVE 9.2.2, `proxmox-backup-client 4.2.0`).
It backs up to DooPlex and restores from it.
## Pre-flight
| | Result |
|---|---|
| P1 DooPlex can host PBS | ✅ Debian 13 trixie; `proxmox-backup-server` **not** in stock apt — needed the Proxmox **PBS no-subscription** repo (`deb http://download.proxmox.com/debian/pbs trixie pbs-no-subscription`). **Surprise:** there is **no standalone `proxmox-release-trixie.gpg`** (404; only bookworm/bullseye are published) — the trixie key ships in the **`proxmox-archive-keyring`** package (key `24B30F06…0BFE778E`). I copied that keyring from the N100 (PVE9/trixie already has it). 126 GB free on `/`. |
| P2 N100 → DooPlex:8007 | ✅ reachable (8007 closed pre-install, open after). |
| P3 N100 PBS client | ✅ `proxmox-backup-client 4.2.0`, PVE `PBSPlugin.pm` present. |
## Stand-up (DooPlex)
- **S1** — `proxmox-backup-manager datastore create felhom-spike /var/lib/pbs-spike` → `TASK OK`.
Services `proxmox-backup-proxy` + `proxmox-backup` active, listening on `:8007`.
**Server cert fingerprint:** `3b:95:5a:fa:9e:0e:4a:54:f3:64:08:e5:a2:a2:6c:66:e9:86:44:64:40:8e:c2:f7:6e:41:d2:c2:1e:86:48:c4`.
- **S2** — created PBS user `felhom@pbs` + **API token** `felhom@pbs!n100`, ACL `DatastoreAdmin`
on `/datastore/felhom-spike`.
**⚠️ PBS privsep gotcha (mirrors PVE):** an API token's effective rights = token-ACL ∩
**user**-ACL. Granting only the token wasn't enough — `pvesm add` failed with *"Cannot find
datastore"* until `DatastoreAdmin` was **also** granted to the `felhom@pbs` user. Phase B's
enrollment must grant the role on **both** the user and the token.
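
The intersection rule above can be illustrated with a toy model. This is a sketch, not PBS code: the privilege names are real PBS privileges, but the `effective_privs` helper and the reduced privilege set are hypothetical and exist only to show why a token-only grant resolves to nothing.

```python
# Toy model of PBS privilege separation: a token can never exceed its user.
# Illustrative subset only, not the full DatastoreAdmin role.
DATASTORE_ADMIN = frozenset({"Datastore.Audit", "Datastore.Backup", "Datastore.Verify"})


def effective_privs(user_acl: frozenset, token_acl: frozenset) -> frozenset:
    """Effective token rights = token ACL intersected with the owning user's ACL."""
    return user_acl & token_acl


# Role granted to the token only: the user holds nothing, so the token gets nothing.
token_only = effective_privs(frozenset(), DATASTORE_ADMIN)

# Role granted to both the user and the token: the token gets the full role.
both = effective_privs(DATASTORE_ADMIN, DATASTORE_ADMIN)
```

With the token-only grant the effective set is empty, which is consistent with the *"Cannot find datastore"* failure seen in S2.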
## Adding as an encrypted PVE storage (N100)
- **A1** — `pvesm add pbs felhom-pbs --server 192.168.0.180 --datastore felhom-spike
--fingerprint <fp> --username 'felhom@pbs!n100' --password <token-secret> --encryption-key
autogen --content backup`. Resulting `/etc/pve/storage.cfg`:
```
pbs: felhom-pbs
	datastore felhom-spike
	server 192.168.0.180
	content backup
	encryption-key 01:36:e9:fe:e1:ee:3d:7a:9d:bf:3d:63:d0:68:fd:24:45:b7:5f:bc:b6:82:bc:6d:d2:b4:7a:b0:1a:86:6d:a1
	fingerprint 3b:95:5a:fa:…:48:c4
	username felhom@pbs!n100
```
**Where the keys live on the box (the "live key on box"):**
- **client encryption key** → `/etc/pve/priv/storage/felhom-pbs.enc` (root:www-data **0600**, 255 B).
The `encryption-key` line in storage.cfg is only the key's **fingerprint** (`01:36:e9:fe…`),
not the key.
- **PBS token secret** → `/etc/pve/priv/storage/felhom-pbs.pw` (0600, 37 B).
- **A2** — the slice-5 agent observe (`--selftest=storage`) sees the target with the
**fingerprint-pinned durable_id** exactly as designed:
`durable=192.168.0.180:felhom-spike#3b:95:5a:fa:…:48:c4`, `type=pbs`, `state=attached`.
No agent change needed for observation.
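
For reference, the observed `durable` value has the shape `<server>:<datastore>#<cert fingerprint>`. A minimal sketch of composing and splitting it, assuming exactly that shape; the function names are hypothetical and this is not the agent's actual implementation:

```python
def durable_id(server: str, datastore: str, fingerprint: str) -> str:
    """Compose `<server>:<datastore>#<fingerprint>` as observed in A2."""
    return f"{server}:{datastore}#{fingerprint}"


def split_durable_id(value: str) -> tuple[str, str, str]:
    """Inverse of durable_id. The fingerprint is everything after the first '#'.

    Assumption: the server part contains no ':' (an IPv4 address or hostname),
    so the first ':' separates server from datastore.
    """
    target, fingerprint = value.split("#", 1)
    server, datastore = target.split(":", 1)
    return server, datastore, fingerprint
```

Because the server certificate fingerprint is part of the id, a re-keyed or replaced PBS server would surface as a different target instead of silently matching the old one.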
## Probes (B1–B6)
### B1 — backup to PBS
`vzdump 9001 --storage felhom-pbs --mode snapshot` →
- PBS snapshot id **`ct/9001/2026-06-09T14:18:33Z`** (`<type>/<id>/<RFC3339Z>`); the underlying
call is `proxmox-backup-client backup … --repository felhom@pbs!n100@192.168.0.180:felhom-spike`.
- **Encrypted client-side**: `--crypt-mode=encrypt`, *"Using encryption key from file
descriptor"*, *"Encryption key fingerprint: 01:36:e9:fe:e1:ee:3d:7a"* (matches the storage
key). Incremental/deduped (*"reused 41 MiB"*). ~19 s for ~1 GiB.
- **Surprise vs Phase A:** vzdump **chose `stop` mode** for the (stopped) guest even though
`snapshot` was requested (`INFO: backup mode: stop`). PVE picks the actual mode; the
reported `Backup.mode` is what was *requested*. For a running guest on lvm-thin it would
snapshot. (Still crash-consistent only — no fsfreeze, per slice 6.)
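
The requested-versus-actual gap above can be closed by reading the mode line from the vzdump task log. A minimal sketch, assuming the log carries a line of the form `INFO: backup mode: <mode>` as seen in this spike; the `actual_backup_mode` helper is hypothetical:

```python
import re

# Matches the line vzdump printed in B1, e.g. "INFO: backup mode: stop".
_MODE_LINE = re.compile(r"^INFO: backup mode: (\w+)\s*$", re.MULTILINE)


def actual_backup_mode(task_log: str):
    """Return the mode vzdump reports in its log, or None if no such line exists."""
    match = _MODE_LINE.search(task_log)
    return match.group(1) if match else None


requested = "snapshot"
actual = actual_backup_mode("INFO: backup mode: stop\n")
mode_overridden = actual is not None and actual != requested
```

Returning `None` when the line is missing lets the caller fall back to reporting only the requested mode instead of guessing.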
### B2 — snapshot inventory → the `PBSSnapshot` wire shape
- **PVE volid** (`pvesm list felhom-pbs`): **`felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z`**,
format `pbs-ct`, type `backup`. This is the exact volid `pct restore` consumes (B3).
- **PBS native** (`proxmox-backup-client snapshot list --output-format json`) per snapshot:
`backup-type` (ct|vm), `backup-id`, `backup-time` (epoch int), `size`, `owner`
(`felhom@pbs!n100`), `protected` (bool), `fingerprint` (the encryption-key fp), and
`files[]` each with `filename` + `size` + **`crypt-mode`** (`encrypt` for data, `sign-only`
for `index.json`). **`verification` is ABSENT until a verify runs** (see B4). **Namespace**:
not shown → the default (root) namespace; a `ns` field appears only for non-root namespaces.
- → **Proposed `PBSSnapshot`**: `namespace`, `backup_type`, `backup_id`, `backup_time`,
`size_bytes`, `owner`, `protected`, `encrypted` (derive from `files[].crypt-mode`),
`verify_state` (ok|failed|none), `verified_at`/`verify_upid`.
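
The proposed shape can be sketched as a mapping from one native `snapshot list` entry. This is an illustration under stated assumptions, not the agent's code: the `from_native` and `volid` names are hypothetical, `verified_at` is omitted because the spike did not record where it would come from, the manifest is assumed to be the file whose name starts with `index.json`, and the volid form applies to the root namespace observed here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PBSSnapshot:
    namespace: str              # "" = default (root) namespace
    backup_type: str            # "ct" | "vm"
    backup_id: str
    backup_time: int            # epoch seconds, as returned by PBS
    size_bytes: int
    owner: str
    protected: bool
    encrypted: bool
    verify_state: str           # "ok" | "failed" | "none"
    verify_upid: Optional[str]

    def volid(self, storage: str) -> str:
        """PVE volid for a root-namespace snapshot: <storage>:backup/<type>/<id>/<RFC3339Z>."""
        ts = datetime.fromtimestamp(self.backup_time, tz=timezone.utc)
        stamp = ts.strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{storage}:backup/{self.backup_type}/{self.backup_id}/{stamp}"


def from_native(item: dict) -> PBSSnapshot:
    """Map one entry of `snapshot list --output-format json` to PBSSnapshot."""
    # The manifest is sign-only by design, so it is excluded from the check.
    data_files = [f for f in item.get("files", [])
                  if not f["filename"].startswith("index.json")]
    # `verification` is absent until a verify has run (B4).
    verification = item.get("verification") or {}
    return PBSSnapshot(
        namespace=item.get("ns", ""),
        backup_type=item["backup-type"],
        backup_id=item["backup-id"],
        backup_time=item["backup-time"],
        size_bytes=item.get("size", 0),
        owner=item.get("owner", ""),
        protected=item.get("protected", False),
        encrypted=bool(data_files)
        and all(f.get("crypt-mode") == "encrypt" for f in data_files),
        verify_state=verification.get("state", "none"),
        verify_upid=verification.get("upid"),
    )
```

Treating a snapshot as encrypted only when every data file is `encrypt` is the conservative choice: a single unencrypted archive flips the flag to false.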
### B3 — restore from PBS
`pct restore 990001 'felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z' --storage local-lvm` →
restored + booted to `running`. **The existing restore path works UNCHANGED against a
pbs-sourced volid** — same `volid` + `--storage` shape the agent's `RestoreLXC` already uses
(`ostemplate=<volid>`, `restore=1`). **No agent restore code change needed for PBS.** PVE pulls
+ decrypts using the storage's `.enc` key automatically.
### B4 — verify mechanism (the big unknown — resolved)
- **`proxmox-backup-client` has NO `verify` subcommand** — verify is **server-side**.
- Triggers: server CLI `proxmox-backup-manager verify <store> [--ignore-verified]
[--outdated-after N]` **on DooPlex**, OR the **PBS API**
`POST /api2/json/admin/datastore/<ds>/verify` (whole datastore; per-snapshot params available).
- **The agent on the N100 CAN drive it remotely** via the PBS API + token (no DooPlex shell
needed). Proven: `curl -X POST …/admin/datastore/felhom-spike/verify` with header
`Authorization: PBSAPIToken=felhom@pbs!n100:<secret>` returned a task UPID
`UPID:dooplex:…:verify:felhom\x2dspike:felhom@pbs!n100:`. Needs `Datastore.Verify` (in
`DatastoreAdmin`).
- **Result read-back:** after verify, the snapshot's **`verification` field** appears:
`{"state":"ok","upid":"UPID:dooplex:…"}` (read via `snapshot list`). So the agent triggers
via API → polls/re-lists → reads `verification.state` (`ok`/`failed`). (Task-status polling
needs the PBS **node name** — it's `dooplex`, embedded in the UPID; `localhost` returns
`exitstatus: unknown`.)
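
The trigger and read-back steps above can be sketched with the standard library. The code only builds the request and derives the task-status URL; it sends nothing. Assumptions to note: all function names are hypothetical, the node name is taken as the second colon-separated field of the UPID, the task-status path `/api2/json/nodes/<node>/tasks/<upid>/status` is my understanding of the PBS API and was not captured verbatim in this spike, and TLS fingerprint pinning is left out.

```python
import urllib.parse
import urllib.request


def verify_request(base_url: str, datastore: str, token_id: str,
                   secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST that starts a datastore verify."""
    store = urllib.parse.quote(datastore, safe="")
    return urllib.request.Request(
        f"{base_url}/api2/json/admin/datastore/{store}/verify",
        method="POST",
        headers={"Authorization": f"PBSAPIToken={token_id}:{secret}"},
    )


def upid_node(upid: str) -> str:
    """Extract the PBS node name embedded in a UPID (second ':' field)."""
    parts = upid.split(":")
    if len(parts) < 3 or parts[0] != "UPID":
        raise ValueError(f"not a UPID: {upid!r}")
    return parts[1]


def task_status_url(base_url: str, upid: str) -> str:
    """URL for polling the task; uses the real node name, not 'localhost'."""
    node = upid_node(upid)
    quoted = urllib.parse.quote(upid, safe="")
    return f"{base_url}/api2/json/nodes/{node}/tasks/{quoted}/status"
```

Deriving the node from the UPID avoids hard-coding `dooplex` and sidesteps the `exitstatus: unknown` result seen with `localhost`. A real client would additionally pin the server certificate fingerprint before sending anything.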
### B5 — agent-token (`felhom-agent@pve`) privileges — **no gap**
Driven by the agent (operator token, not root@pam):
- **Backup to PBS** (`--selftest=backup`): ✅ `felhom-pbs:backup/ct/9001/2026-06-09T14:22:30Z`,
crash_consistent, success.
- **Restore from PBS** (`--selftest=restore-test`): ✅ restored into scratch 990000, booted,
verified `running`, torn down — pass.
- The **FelhomAgent role's existing `Datastore.{Audit,Allocate,AllocateSpace}` + `VM.Backup`
suffice** for both backup-to-PBS and restore-from-PBS. **No role widening needed.** (Two auth
layers: the *PVE* operator token authorizes the vzdump/restore API call; the *PBS* token in
storage.cfg authenticates PVE→PBS. The spike exercised both.)
### B6 — zero-knowledge confirmed
- All data files are `crypt-mode=encrypt` (B2); `index.json` is `sign-only`.
- **Without the key**, an authenticated restore **fails to decrypt**:
`proxmox-backup-client restore … pct.conf.blob -` (no `--keyfile`) →
`Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a`.
- **With** `--keyfile /etc/pve/priv/storage/felhom-pbs.enc` → decrypts (returns the guest
config). The key is the *only* gate.
- **The PBS server holds no client key** — `find /etc/proxmox-backup /var/lib/pbs-spike` for
key material returns only the server's own `csrf.key`, never the client encryption key. So
DooPlex can store + serve chunks but cannot read guest data. Zero-knowledge holds: the live
key on the N100 is the irreducible residual (the operator/hub can't read the data).
## Implications for the Phase B spec (flagged surprises vs the dir-storage assumptions)
1. **Enrollment must grant the PBS role on BOTH the user AND the token** (PBS privsep), and add
the `pbs` storage with `--encryption-key autogen` → the live key lands at
`/etc/pve/priv/storage/<id>.enc` (the "live PBS key on the box", doc 03 §8). The hub holds
only the recovery-code-wrapped escrow (out of scope here).
2. **Backup + restore need NO new code** beyond targeting a `pbs` storage — `Vzdump` and
`RestoreLXC`/`pct restore` work against pbs volids unchanged. The agent's `LatestBackupVolID`
(StorageContent filter) already resolved the pbs volid.
3. **Verify is a NEW capability to build**: a server-side op the agent triggers **remotely via
the PBS API** (`POST …/datastore/<ds>/verify`) using the storage's token, then reads back
`verification.state` from the snapshot list. This is the "lighter frequent integrity check"
(§8); it does NOT need the encryption key (ciphertext-level), unlike the full
self-restore-test. Phase B needs a small PBS-API client (token auth, fingerprint pin) for verify +
snapshot-list-with-verify-state; the existing `proxmox.Client` (PVE API) does not cover it.
4. **`PBSSnapshot` wire shape** = the B2 fields; `verify_state` is the load-bearing one and is
`none` until a verify runs.
5. **vzdump mode** is PVE's choice (stop for stopped guests) — report requested-vs-actual if it
matters, or read the actual mode from the task log.
## Teardown / left-in-place
- Throwaway restore guest **990001 destroyed**; agent restore-test scratch self-torn-down;
`pct list` → **no leftover guests**. Agent config reverted (`backup.local_backup_target` →
`local`). Token-secret temp files removed from both boxes.
- **Left in place for Phase B:** the PBS server on DooPlex, the `felhom-spike` datastore (with
two test snapshots of 9001), the `felhom@pbs!n100` token + ACLs, and the N100's `felhom-pbs`
encrypted storage (+ its `.enc`/`.pw` under `/etc/pve/priv/storage/`).
- **No secrets committed** — the encryption key, token secret, and PBS password live only in
`/etc/pve/priv/storage/` (0600) on the N100; this doc references them by location/fingerprint
only.