admin 94a236328b docs(spike): phase 5 PBS mechanism findings (DooPlex server ← N100 client)
Empirical PBS validation before the slice-6 Phase B spec. Records: PBS install on
Debian-13 DooPlex (trixie key ships in proxmox-archive-keyring, no standalone .gpg),
datastore + cert fingerprint, the PBS privsep gotcha (grant role on user AND token),
the encrypted pbs storage + key location (/etc/pve/priv/storage/<id>.enc), the snapshot
volid format + native fields (→ PBSSnapshot shape), restore-from-PBS works unchanged,
the verify mechanism (server-side; agent drives it remotely via the PBS API, result read
from snapshot verification.state), no operator-token privilege gap, and zero-knowledge
confirmed (server can't decrypt without the client key). PBS+datastore+storage left up
for Phase B; no secrets committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 16:26:57 +02:00

# Phase 5 spike — PBS mechanism validation (DooPlex server ← N100 client)
> **Status: empirical findings from a live spike (2026-06-09).** PBS was never validated in
> any prior spike (proxmox-platform.md §4.6). This establishes the *real* mechanisms before
> the slice-6 **Phase B** spec is written. No production data; probe → record → teardown. The
> PBS server + datastore + the N100's `felhom-pbs` storage are **left in place** for Phase B's
> live runbook to reuse.
## Topology under test
- **PBS server: DooPlex** `192.168.0.180` (Debian 13 trixie, separate box, backup-only — runs
no guests). PBS `proxmox-backup-server 4.2.1-1`.
- **Client: the N100** `demo-felhom` `192.168.0.162` (PVE 9.2.2, `proxmox-backup-client 4.2.0`).
It backs up to DooPlex and restores from it.
## Pre-flight
| | Result |
|---|---|
| P1 DooPlex can host PBS | ✅ Debian 13 trixie; `proxmox-backup-server` **not** in stock apt — needed the Proxmox **PBS no-subscription** repo (`deb http://download.proxmox.com/debian/pbs trixie pbs-no-subscription`). **Surprise:** there is **no standalone `proxmox-release-trixie.gpg`** (404; only bookworm/bullseye are published) — the trixie key ships in the **`proxmox-archive-keyring`** package (key `24B30F06…0BFE778E`). I copied that keyring from the N100 (PVE9/trixie already has it). 126 GB free on `/`. |
| P2 N100 → DooPlex:8007 | ✅ reachable (8007 closed pre-install, open after). |
| P3 N100 PBS client | ✅ `proxmox-backup-client 4.2.0`, PVE `PBSPlugin.pm` present. |
## Stand-up (DooPlex)
- **S1** — `proxmox-backup-manager datastore create felhom-spike /var/lib/pbs-spike` → `TASK OK`.
Services `proxmox-backup-proxy` + `proxmox-backup` active, listening on `:8007`.
**Server cert fingerprint:** `3b:95:5a:fa:9e:0e:4a:54:f3:64:08:e5:a2:a2:6c:66:e9:86:44:64:40:8e:c2:f7:6e:41:d2:c2:1e:86:48:c4`.
- **S2** — created PBS user `felhom@pbs` + **API token** `felhom@pbs!n100`, ACL `DatastoreAdmin`
on `/datastore/felhom-spike`.
**⚠️ PBS privsep gotcha (mirrors PVE):** an API token's effective rights = token-ACL ∩
**user**-ACL. Granting only the token wasn't enough — `pvesm add` failed with *"Cannot find
datastore"* until `DatastoreAdmin` was **also** granted to the `felhom@pbs` user. Phase B's
enrollment must grant the role on **both** the user and the token.
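
The intersection rule can be illustrated with a toy model. This is a minimal sketch, assuming privileges are modelled as plain sets; `effective_privs` and the two-privilege role subset are hypothetical illustrations, not PBS code, and real PBS ACLs are also scoped by path.

```python
def effective_privs(user_privs: set, token_privs: set) -> set:
    """Rights a privilege-separated token can use: user ACL AND token ACL."""
    return user_privs & token_privs


# Illustrative subset of what a datastore role might carry (assumption).
role = {"Datastore.Audit", "Datastore.Verify"}

# Role granted on the token only: the user contributes nothing, so the
# token ends up with no usable rights (the failure seen in S2).
token_only = effective_privs(set(), role)

# Role granted on BOTH the user and the token: the token can use it.
both = effective_privs(role, role)

print(token_only)     # set()
print(sorted(both))
```

Under this model, an enrollment check only has to confirm that the intersection is non-empty for the datastore path before attempting `pvesm add`.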
## Adding as an encrypted PVE storage (N100)
- **A1** — `pvesm add pbs felhom-pbs --server 192.168.0.180 --datastore felhom-spike
--fingerprint <fp> --username 'felhom@pbs!n100' --password <token-secret> --encryption-key
autogen --content backup`. Resulting `/etc/pve/storage.cfg`:
```
pbs: felhom-pbs
        datastore felhom-spike
        server 192.168.0.180
        content backup
        encryption-key 01:36:e9:fe:e1:ee:3d:7a:9d:bf:3d:63:d0:68:fd:24:45:b7:5f:bc:b6:82:bc:6d:d2:b4:7a:b0:1a:86:6d:a1
        fingerprint 3b:95:5a:fa:…:48:c4
        username felhom@pbs!n100
```
**Where the keys live on the box (the "live key on box"):**
- **client encryption key** → `/etc/pve/priv/storage/felhom-pbs.enc` (root:www-data **0600**, 255 B).
The `encryption-key` line in storage.cfg is only the key's **fingerprint** (`01:36:e9:fe…`),
not the key.
- **PBS token secret** → `/etc/pve/priv/storage/felhom-pbs.pw` (0600, 37 B).
- **A2** — the slice-5 agent observe (`--selftest=storage`) sees the target with the
**fingerprint-pinned durable_id** exactly as designed:
`durable=192.168.0.180:felhom-spike#3b:95:5a:fa:…:48:c4`, `type=pbs`, `state=attached`.
No agent change needed for observation.
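
The `durable` value above has the shape `<server>:<datastore>#<fingerprint>`. As a rough sketch of how such an id could be built and split again (the helper names are hypothetical, the agent's real implementation is not shown in this doc, and lowercasing the fingerprint is an assumption):

```python
from typing import NamedTuple


class DurableID(NamedTuple):
    server: str
    datastore: str
    fingerprint: str


def format_durable_id(server: str, datastore: str, fingerprint: str) -> str:
    """Build '<server>:<datastore>#<fingerprint>'."""
    return f"{server}:{datastore}#{fingerprint.lower()}"


def parse_durable_id(value: str) -> DurableID:
    """Split a durable id back into its three parts."""
    # Split on '#' first: the fingerprint itself contains ':' separators.
    target, hash_sep, fingerprint = value.partition("#")
    # Split server/datastore on the LAST ':' of the left-hand part.
    server, colon_sep, datastore = target.rpartition(":")
    if not (hash_sep and colon_sep and server and datastore and fingerprint):
        raise ValueError(f"not a durable id: {value!r}")
    return DurableID(server, datastore, fingerprint.lower())
```

Because the certificate fingerprint is part of the id, a reinstalled PBS server at the same address with a new certificate would compare as a different target.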
## Probes (B1–B6)
### B1 — backup to PBS
`vzdump 9001 --storage felhom-pbs --mode snapshot` →
- PBS snapshot id **`ct/9001/2026-06-09T14:18:33Z`** (`<type>/<id>/<RFC3339Z>`); the underlying
command is `proxmox-backup-client backup … --repository felhom@pbs!n100@192.168.0.180:felhom-spike`.
- **Encrypted client-side**: `--crypt-mode=encrypt`, *"Using encryption key from file
descriptor"*, *"Encryption key fingerprint: 01:36:e9:fe:e1:ee:3d:7a"* (matches the storage
key). Incremental/deduped (*"reused 41 MiB"*). ~19 s for ~1 GiB.
- **Surprise vs Phase A:** vzdump **chose `stop` mode** for the (stopped) guest even though
`snapshot` was requested (`INFO: backup mode: stop`). PVE picks the actual mode; the
reported `Backup.mode` is what was *requested*. For a running guest on lvm-thin it would
snapshot. (Still crash-consistent only — no fsfreeze, per slice 6.)
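
Since the requested and actual modes can differ, the actual mode can be recovered from the task log. A minimal sketch, assuming the log is available as text and always carries the `INFO: backup mode: <mode>` line quoted above; the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the vzdump log line quoted in B1, e.g. "INFO: backup mode: stop".
_MODE_RE = re.compile(r"^INFO: backup mode: (\w+)\s*$", re.MULTILINE)


def actual_backup_mode(task_log: str) -> Optional[str]:
    """Return the mode vzdump reports it used, or None if the line is absent."""
    match = _MODE_RE.search(task_log)
    return match.group(1) if match else None


def mode_differs(requested: str, task_log: str) -> bool:
    """True when the log shows a mode other than the one requested."""
    actual = actual_backup_mode(task_log)
    return actual is not None and actual != requested


log = "INFO: backup mode: stop\n"
print(actual_backup_mode(log))        # stop
print(mode_differs("snapshot", log))  # True
```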
### B2 — snapshot inventory → the `PBSSnapshot` wire shape
- **PVE volid** (`pvesm list felhom-pbs`): **`felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z`**,
format `pbs-ct`, type `backup`. This is the exact volid `pct restore` consumes (B3).
- **PBS native** (`proxmox-backup-client snapshot list --output-format json`) per snapshot:
`backup-type` (ct|vm), `backup-id`, `backup-time` (epoch int), `size`, `owner`
(`felhom@pbs!n100`), `protected` (bool), `fingerprint` (the encryption-key fp), and
`files[]` each with `filename` + `size` + **`crypt-mode`** (`encrypt` for data, `sign-only`
for `index.json`). **`verification` is ABSENT until a verify runs** (see B4). **Namespace**:
not shown → the default (root) namespace; a `ns` field appears only for non-root namespaces.
- → **Proposed `PBSSnapshot`**: `namespace`, `backup_type`, `backup_id`, `backup_time`,
`size_bytes`, `owner`, `protected`, `encrypted` (derive from `files[].crypt-mode`),
`verify_state` (ok|failed|none), `verified_at`/`verify_upid`.
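
The proposed shape could be derived from one entry of the `snapshot list` JSON roughly as follows. This is a sketch, not the agent's type: the class layout, the `from_pbs` and `volid` helpers, and the rule that a snapshot counts as encrypted when every file other than the `index.json` manifest is `encrypt` are all assumptions. `verified_at` is left out because the `verification` object observed in B4 carries only `state` and `upid`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class PBSSnapshot:
    namespace: str             # "" for the default (root) namespace
    backup_type: str           # "ct" or "vm"
    backup_id: str
    backup_time: int           # epoch seconds
    size_bytes: int
    owner: str
    protected: bool
    encrypted: bool
    verify_state: str          # "ok", "failed" or "none"
    verify_upid: Optional[str]

    @classmethod
    def from_pbs(cls, raw: Dict[str, Any]) -> "PBSSnapshot":
        files = raw.get("files", [])
        # Assumed rule: ignore the manifest (sign-only) and require every
        # remaining file to be crypt-mode=encrypt.
        data = [f for f in files if not f["filename"].startswith("index.json")]
        encrypted = bool(data) and all(f.get("crypt-mode") == "encrypt" for f in data)
        # "verification" is absent until a verify has run.
        verification = raw.get("verification") or {}
        return cls(
            namespace=raw.get("ns", ""),
            backup_type=raw["backup-type"],
            backup_id=str(raw["backup-id"]),
            backup_time=int(raw["backup-time"]),
            size_bytes=int(raw.get("size", 0)),
            owner=raw.get("owner", ""),
            protected=bool(raw.get("protected", False)),
            encrypted=encrypted,
            verify_state=verification.get("state", "none"),
            verify_upid=verification.get("upid"),
        )

    def volid(self, storage: str) -> str:
        """PVE volid for a root-namespace snapshot: <storage>:backup/<type>/<id>/<RFC3339Z>."""
        stamp = datetime.fromtimestamp(self.backup_time, tz=timezone.utc)
        return (f"{storage}:backup/{self.backup_type}/{self.backup_id}/"
                f"{stamp.strftime('%Y-%m-%dT%H:%M:%SZ')}")
```

Keeping `verify_state` as a plain string with an explicit `none` value avoids conflating "never verified" with "verification failed".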
### B3 — restore from PBS
`pct restore 990001 'felhom-pbs:backup/ct/9001/2026-06-09T14:18:33Z' --storage local-lvm` →
restored + booted to `running`. **The existing restore path works UNCHANGED against a
pbs-sourced volid** — same `volid` + `--storage` shape the agent's `RestoreLXC` already uses
(`ostemplate=<volid>`, `restore=1`). **No agent restore code change needed for PBS.** PVE pulls
+ decrypts using the storage's `.enc` key automatically.
### B4 — verify mechanism (the big unknown — resolved)
- **`proxmox-backup-client` has NO `verify` subcommand** — verify is **server-side**.
- Triggers: server CLI `proxmox-backup-manager verify <store> [--ignore-verified]
[--outdated-after N]` **on DooPlex**, OR the **PBS API** `POST /api2/json/admin/datastore/
<ds>/verify` (whole datastore; per-snapshot params available).
- **The agent on the N100 CAN drive it remotely** via the PBS API + token (no DooPlex shell
needed). Proven: `curl -X POST …/admin/datastore/felhom-spike/verify` with header
`Authorization: PBSAPIToken=felhom@pbs!n100:<secret>` returned a task UPID
`UPID:dooplex:…:verify:felhom\x2dspike:felhom@pbs!n100:`. Needs `Datastore.Verify` (in
`DatastoreAdmin`).
- **Result read-back:** after verify, the snapshot's **`verification` field** appears:
`{"state":"ok","upid":"UPID:dooplex:…"}` (read via `snapshot list`). So the agent triggers
via API → polls/re-lists → reads `verification.state` (`ok`/`failed`). (Task-status polling
needs the PBS **node name** — it's `dooplex`, embedded in the UPID; `localhost` returns
`exitstatus: unknown`.)
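
The remote trigger can be sketched with the standard library alone. The snippet below only builds the request and the helpers around it; actually sending it is left out. The function names are hypothetical, the default port follows the `:8007` listener from S1, and the pin check assumes the storage fingerprint is the SHA-256 digest of the server's DER-encoded certificate (its 32-byte length suggests this, but treat it as an assumption).

```python
import hashlib
import urllib.parse
import urllib.request


def verify_request(server: str, datastore: str, token_id: str,
                   secret: str, port: int = 8007) -> urllib.request.Request:
    """Build (but do not send) the POST that starts a datastore verify."""
    ds = urllib.parse.quote(datastore, safe="")
    url = f"https://{server}:{port}/api2/json/admin/datastore/{ds}/verify"
    req = urllib.request.Request(url, data=b"", method="POST")
    # Same header form as the curl call in B4.
    req.add_header("Authorization", f"PBSAPIToken={token_id}:{secret}")
    return req


def node_from_upid(upid: str) -> str:
    """The PBS node name is the second ':'-separated field of the UPID."""
    parts = upid.split(":")
    if len(parts) < 2 or parts[0] != "UPID" or not parts[1]:
        raise ValueError(f"not a UPID: {upid!r}")
    return parts[1]


def fingerprint_matches(der_cert: bytes, pinned: str) -> bool:
    """Compare a DER certificate against a colon-separated pinned fingerprint."""
    # A real client would obtain der_cert from the TLS session, for example
    # via ssl.SSLSocket.getpeercert(binary_form=True), before trusting it.
    digest = hashlib.sha256(der_cert).hexdigest()
    return digest == pinned.replace(":", "").lower()
```

Extracting the node name from the returned UPID avoids hard-coding `dooplex` in the agent and sidesteps the `localhost` lookup that returned `exitstatus: unknown`.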
### B5 — agent-token (`felhom-agent@pve`) privileges — **no gap**
Driven by the agent (operator token, not root@pam):
- **Backup to PBS** (`--selftest=backup`): ✅ `felhom-pbs:backup/ct/9001/2026-06-09T14:22:30Z`,
crash_consistent, success.
- **Restore from PBS** (`--selftest=restore-test`): ✅ restored into scratch 990000, booted,
verified `running`, torn down — pass.
- The **FelhomAgent role's existing `Datastore.{Audit,Allocate,AllocateSpace}` + `VM.Backup`
suffice** for both backup-to-PBS and restore-from-PBS. **No role widening needed.** (Two auth
layers: the *PVE* operator token authorizes the vzdump/restore API call; the *PBS* token in
storage.cfg authenticates PVE→PBS. The spike exercised both.)
### B6 — zero-knowledge confirmed
- All data files are `crypt-mode=encrypt` (B2); `index.json` is `sign-only`.
- **Without the key**, an authenticated restore **fails to decrypt**:
`proxmox-backup-client restore … pct.conf.blob -` (no `--keyfile`) →
`Error: missing key - manifest was created with key 01:36:e9:fe:e1:ee:3d:7a`.
- **With** `--keyfile /etc/pve/priv/storage/felhom-pbs.enc` → decrypts (returns the guest
config). The key is the *only* gate.
- **The PBS server holds no client key** — `find /etc/proxmox-backup /var/lib/pbs-spike` for
key material returns only the server's own `csrf.key`, never the client encryption key. So
DooPlex can store and serve chunks but cannot read guest data. Zero-knowledge holds: the live
key on the N100 is the only residual exposure (the operator/hub can't read the data).
## Implications for the Phase B spec (flagged surprises vs the dir-storage assumptions)
1. **Enrollment must grant the PBS role on BOTH the user AND the token** (PBS privsep), and add
the `pbs` storage with `--encryption-key autogen` → the live key lands at
`/etc/pve/priv/storage/<id>.enc` (the "live PBS key on the box", doc 03 §8). The hub holds
only the recovery-code-wrapped escrow (out of scope here).
2. **Backup + restore need NO new code** beyond targeting a `pbs` storage — `Vzdump` and
`RestoreLXC`/`pct restore` work against pbs volids unchanged. The agent's `LatestBackupVolID`
(StorageContent filter) already resolved the pbs volid.
3. **Verify is a NEW capability to build**: a server-side op the agent triggers **remotely via
the PBS API** (`POST …/datastore/<ds>/verify`) using the storage's token, then reads back
`verification.state` from the snapshot list. This is the "lighter frequent integrity check"
(§8) — it does NOT need the encryption key (ciphertext-level), unlike the full self-restore-
test. Phase B needs a small PBS-API client (token auth, fingerprint pin) for verify +
snapshot-list-with-verify-state; the existing `proxmox.Client` (PVE API) does not cover it.
4. **`PBSSnapshot` wire shape** = the B2 fields; `verify_state` is the load-bearing one and is
`none` until a verify runs.
5. **vzdump mode** is PVE's choice (stop for stopped guests) — report requested-vs-actual if it
matters, or read the actual mode from the task log.
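
For item 3, the "trigger, re-list, read `verification.state`" loop might look like the sketch below. It assumes the caller supplies a `list_snapshots` callable returning the parsed `snapshot list` JSON; that callable, the function name, and the retry defaults are all hypothetical. Matching on the UPID is optional and guards against reading a result left over from an earlier verify run.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Snapshot = Dict[str, Any]


def wait_for_verify_state(
    list_snapshots: Callable[[], List[Snapshot]],
    backup_type: str,
    backup_id: str,
    backup_time: int,
    expect_upid: Optional[str] = None,
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-list until the snapshot carries a verification result.

    Returns "ok" or "failed" as reported by PBS, or "none" if no matching
    result appeared within the allowed attempts.
    """
    wanted = (backup_type, backup_id, backup_time)
    for attempt in range(attempts):
        for snap in list_snapshots():
            key = (snap.get("backup-type"), str(snap.get("backup-id")),
                   snap.get("backup-time"))
            if key != wanted:
                continue
            verification = snap.get("verification")
            if not verification:
                continue  # field is absent until a verify has run
            if expect_upid and verification.get("upid") != expect_upid:
                continue  # result belongs to a different verify task
            return verification["state"]
        if attempt < attempts - 1:
            sleep(delay_s)
    return "none"
```

Injecting `list_snapshots` and `sleep` keeps the loop independent of how the PBS API client is eventually built and makes it testable without a live datastore.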
## Teardown / left-in-place
- Throwaway restore guest **990001 destroyed**; agent restore-test scratch self-torn-down;
`pct list` → **no leftover guests**. Agent config reverted (`backup.local_backup_target` →
`local`). Token-secret temp files removed from both boxes.
- **Left in place for Phase B:** the PBS server on DooPlex, the `felhom-spike` datastore (with
two test snapshots of 9001), the `felhom@pbs!n100` token + ACLs, and the N100's `felhom-pbs`
encrypted storage (+ its `.enc`/`.pw` under `/etc/pve/priv/storage/`).
- **No secrets committed** — the encryption key, token secret, and PBS password live only in
`/etc/pve/priv/storage/` (0600) on the N100; this doc references them by location/fingerprint
only.