docs: Phase 2 capture side — REPORT/CONTEXT/README for v0.53.x recovery unit
REPORT overwritten (secret-free recovery unit: design, what shipped, golden deploy mechanism, live 9201 validation incl. NO_LEAK grep). CONTEXT dated entry. README: recovery-unit subsection + flagged the stale restic/secondary paths section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -1,60 +1,66 @@
# REPORT — felhom-controller v0.52.0 (Phase 1 GATE: deploy-side double-nest fix)
# REPORT — felhom-controller v0.53.1 (Phase 2: per-app recovery unit, capture side)

Completes the Model-A double-nest reconciliation deferred in v0.48.0 and half-fixed in v0.51.0.
v0.51.0 fixed the **backup-helper** side (`NamespaceRoot` provenance); this slice fixes the
**deploy/compose** side and locks the two together. Validated live on guest 9201 (demo-felhom).

Each app's on-drive backup is now a self-contained, recreatable **recovery unit** — and it is
**secret-free by design**. Built, unit-tested, shipped to `main`, and validated live on guest 9201.
(Phase 1, the deploy-side double-nest GATE, shipped earlier as v0.52.0 — see git history.)

This is the GATE for the larger per-app-recovery-unit / Tier-2 slice — Phases 2–4 build on a proven
single-nested, agreeing path layout. Per plan: Phase 1 shipped + validated first.
## The design decision that shaped Phase 2 (secret handling)
The recovery unit carries **no secrets, no data-keys, and not the Docker image**. This was decided after
reading the *actual* hub code (the controller README that implied the hub stores app.yaml is stale
pre-strip):
- The hub is deliberately **zero-knowledge** — it holds a per-host recovery-code-wrapped PBS key it
  cannot decrypt + non-secret config; **no per-app secrets**. Escrowing app secrets there would regress
  that posture, so it was rejected.
- `app.yaml` (encrypted) + the encryption key live on the **guest rootfs** (`local-lvm:vm-9201-disk-0`,
  confirmed via `pct config`) → already inside the **PBS whole-guest snapshot**; the external drive
  (`mp0` bind) is not. So the secret↔data split maps onto the tiers: **secrets ride PBS; bulk userdata
  rides the drive + (Phase 3) Tier 2.**
- Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml
  (live, else PBS); **regenerate nothing**. `data_key` is a fail-closed annotation, not a
  preserve/regenerate decision.

## Root cause (corrected from the spec's assumption)
The extra `felhom-data` segment was **not** built in controller `deploy.go` — `DeployStack` passes
`HDD_PATH` through verbatim, and the deploy storage dropdown only ever offers registered in-guest drives
(`GetSchedulableStoragePaths`), never the system-data path. The segment was hardcoded in the **app-catalog
compose templates** as `${HDD_PATH}/felhom-data/appdata/<app>`. On a Model-A in-guest drive the guest
mount `/mnt/<drive>` already IS the host's `<drive>/felhom-data` namespace, so that segment double-nested
to `<drive>/felhom-data/felhom-data/appdata/<app>` on disk — diverging from where the (v0.51.0) backup
helpers look: `AppDataDir(NamespaceRoot(HDD_PATH,true))` = `/mnt/<drive>/appdata/<app>`, single-nested.
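The divergence described above can be reproduced with plain path arithmetic. The following is a minimal sketch, not the controller's code: `namespaceRoot` and `appDataDir` are hypothetical stand-ins for `NamespaceRoot` and `AppDataDir`, and the behaviour of the non-in-guest branch is an assumption made only so the function is complete.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// namespaceRoot is a hypothetical stand-in for the controller's NamespaceRoot.
// On a Model-A in-guest drive the guest mount already is the host-side
// <drive>/felhom-data namespace, so nothing is appended.
// The else branch is an assumption, not taken from the report.
func namespaceRoot(hddPath string, inGuest bool) string {
	if inGuest {
		return hddPath
	}
	return path.Join(hddPath, "felhom-data")
}

// appDataDir is a hypothetical stand-in for AppDataDir.
func appDataDir(root, app string) string {
	return path.Join(root, "appdata", app)
}

// expandTemplate substitutes ${HDD_PATH} and <app> in a bind-mount template.
func expandTemplate(tmpl, hddPath, app string) string {
	s := strings.ReplaceAll(tmpl, "${HDD_PATH}", hddPath)
	return strings.ReplaceAll(s, "<app>", app)
}

func main() {
	const hdd, app = "/mnt/felhom-usb", "romm"
	old := expandTemplate("${HDD_PATH}/felhom-data/appdata/<app>", hdd, app)
	fixed := expandTemplate("${HDD_PATH}/appdata/<app>", hdd, app)
	backup := appDataDir(namespaceRoot(hdd, true), app)

	fmt.Println("old template:  ", old, "agrees with backup side:", old == backup)
	fmt.Println("fixed template:", fixed, "agrees with backup side:", fixed == backup)
}
```

With the old template the deploy side resolves to `/mnt/felhom-usb/felhom-data/appdata/romm` while the backup side resolves to `/mnt/felhom-usb/appdata/romm`; only the fixed template matches.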
## What shipped (v0.53.0 + v0.53.1)
- **Unit layout** (rooted at the existing `backups/primary/<app>/` — a deliberate low-churn choice, no
  risky dump-dir migration): `compose/` (docker-compose.yml + .felhom.yml + a **secret-stripped**
  app.yaml) + the existing `db-dumps/` + `volume-dumps/` + `manifest.json`. New helpers
  `RecoveryUnit{Path,ComposePath,ManifestPath}` (`internal/appbackup/paths.go`).
- **Secret-free manifest** (`internal/backup/recovery_unit.go`): app id, display name, controller
  version, timestamp, drive, namespace root, **image pins** (image NOT stored — re-pulled on restore),
  the **NAMES** of secret env vars (values never stored), `data_key` env-var names, an explicit
  `secret_source` note, captured config-file list, enumerated dumps, sha256 checksums.
- **Capture needs no secret access:** non-secret env is plaintext in app.yaml, so the capture excludes
  secret-named keys (plus a defensive `crypto.IsEncrypted` guard) and reads no secret value. New
  `StackDataProvider.GetStackRecoveryInfo` + `RecoveryInfo`, implemented by the main.go `stackAdapter`;
  `ParseComposeImages` extracts pins.
- **`data_key`**: `DeployField.DataKey` + `Metadata.DataKeyEnvVars()`; catalog `adventurelog/.felhom.yml`
  `SECRET_KEY` ("Titkosítási kulcs", Hungarian for "Encryption key") marked `data_key: true`.
- **Refresh cadence (v0.53.1):** capture runs from the daily DB dump AND the periodic `RefreshCache`
  (startup + every 5m), **idempotent** — content is built in memory and writes are skipped when the unit
  is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
- **Tests:** capture is secret-free (a secret in the source app.yaml never appears in the unit) +
  manifest structure + idempotency (unchanged → skip; config change → rewrite). `go build ./...` clean.
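The two properties the tests pin down (no secret value reaches the unit; an unchanged unit is not rewritten) can be sketched as follows. This is an illustration under stated assumptions, not the code in `recovery_unit.go`: the `captured` type, the `enc:` prefix standing in for the `crypto.IsEncrypted` guard, and the env-var names are all hypothetical, and the real skip check also compares the dump set and controller version, which this sketch omits.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// captured is a hypothetical, simplified view of what lands in the unit.
type captured struct {
	Env         map[string]string // non-secret env only
	SecretNames []string          // names only; values are never stored
}

// stripSecrets copies env without the secret-named keys and records their names.
// The "enc:" prefix check is a placeholder for a defensive is-encrypted guard;
// the real value encoding is not described in the report.
func stripSecrets(env map[string]string, secretKeys map[string]bool) captured {
	out := captured{Env: map[string]string{}}
	for k, v := range env {
		if secretKeys[k] || strings.HasPrefix(v, "enc:") {
			out.SecretNames = append(out.SecretNames, k)
			continue
		}
		out.Env[k] = v
	}
	sort.Strings(out.SecretNames)
	return out
}

// checksum hashes the captured content in a stable key order.
func checksum(c captured) string {
	keys := make([]string, 0, len(c.Env))
	for k := range c.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "env %s=%s\n", k, c.Env[k])
	}
	for _, n := range c.SecretNames {
		fmt.Fprintf(h, "secret %s\n", n)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsWrite reports whether the stored checksum no longer matches the content
// built in memory, i.e. whether the unit on the drive has to be rewritten.
func needsWrite(storedSum string, c captured) bool {
	return storedSum != checksum(c)
}

func main() {
	env := map[string]string{
		"DOMAIN":      "example.test",
		"HDD_PATH":    "/mnt/felhom-usb",
		"DB_PASSWORD": "not-a-real-secret", // hypothetical secret env var
	}
	unit := stripSecrets(env, map[string]bool{"DB_PASSWORD": true})
	sum := checksum(unit)
	fmt.Println("secret names recorded:", unit.SecretNames)
	fmt.Println("rewrite needed on unchanged input:", needsWrite(sum, unit))
}
```

Building the content in memory first means the comparison costs no disk write, which is the point of the skip on a spinning USB drive.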

## Changes
- **Catalog (`app-catalog-felhom.eu`, the behavioral fix):** exhaustive `grep` confirmed exactly four HDD
  templates carried the segment — `romm`, `nextcloud`, `immich`, `paperless-ngx`. All changed
  `${HDD_PATH}/felhom-data/appdata/<app>` → `${HDD_PATH}/appdata/<app>` (volume mounts + header comments).
- **Controller (test-only, no runtime change):** `internal/stacks/hddpath_agreement_test.go` resolves a
  compose's `${HDD_PATH}` bind mounts via the real deploy-side `ParseComposeHDDMounts` and asserts they
  are byte-identical to the backup-side `AppDataDir(NamespaceRoot(HDD_PATH,true))` — no doubled
  `felhom-data`, deploy↔backup locked so they can't drift again. `go test ./internal/stacks/...` and
  `./internal/appbackup/...` pass; `go build ./...` clean.
- **No controller image rebuild.** The controller passes `HDD_PATH` through unchanged and already
  resolved the single-nested path since v0.51.0, so no runtime change was needed. The controller in guest
  9201 stays `:0.51.0` (functionally current); v0.52.0 marks the catalog gate + test and rolls into the
  next image build (Phase 2). The controller is golden-image/bootstrap-managed — not rebaked for a no-op.
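The idea behind the agreement test can be sketched without the controller's own helpers. In the sketch below, `hostSide` and `agrees` are hypothetical names, the container-side paths are invented for illustration, and the check is a prefix match against the backup-side directory rather than the byte-identical comparison the real test performs via `ParseComposeHDDMounts`.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// hostSide expands ${HDD_PATH} in a "host:container" bind mount and returns
// the cleaned host path.
func hostSide(mount, hddPath string) string {
	expanded := os.Expand(mount, func(name string) string {
		if name == "HDD_PATH" {
			return hddPath
		}
		return ""
	})
	host, _, _ := strings.Cut(expanded, ":")
	return path.Clean(host)
}

// agrees returns an error for the first mount whose host path does not live
// inside the directory the backup side resolves for the app.
func agrees(mounts []string, hddPath, backupDir string) error {
	for _, m := range mounts {
		h := hostSide(m, hddPath)
		if h != backupDir && !strings.HasPrefix(h, backupDir+"/") {
			return fmt.Errorf("mount %q resolves to %s, outside %s", m, h, backupDir)
		}
	}
	return nil
}

func main() {
	const hdd = "/mnt/felhom-usb"
	backupDir := "/mnt/felhom-usb/appdata/romm" // what the backup helpers resolve

	fixed := []string{
		"${HDD_PATH}/appdata/romm/library:/romm/library",     // container paths are illustrative
		"${HDD_PATH}/appdata/romm/resources:/romm/resources", // container paths are illustrative
	}
	old := []string{"${HDD_PATH}/felhom-data/appdata/romm/library:/romm/library"}

	fmt.Println("fixed template:", agrees(fixed, hdd, backupDir))
	fmt.Println("old template:  ", agrees(old, hdd, backupDir))
}
```

Running both sides through one assertion is what keeps a future template edit from silently reintroducing the extra segment.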
## Deploy mechanism (resolved this session)
The controller in guest 9201 is **golden/bootstrap-managed**: `felhom-controller-bootstrap.service` runs
`/usr/local/sbin/felhom-controller-bootstrap.sh`, which `docker run`s the tag from
`/etc/felhom-controller-image` (gitea anon-pull, no login). Deploy = build+push tag → anon-pull → update
that tag file → `systemctl restart felhom-controller-bootstrap.service`. Data volume + encryption key
persist. (This is what "self-update handles version drift" refers to.)
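The in-guest half of that sequence can be written down as an ordered, dry-run plan. This is only a sketch: `deployPlan` is a hypothetical helper that prints commands and executes nothing, the image reference is a placeholder, and it assumes the tag file holds the full image reference on a single line, which the report does not state. The build+push step happens on the build host and is left out.

```go
package main

import "fmt"

const (
	tagFile = "/etc/felhom-controller-image"
	unit    = "felhom-controller-bootstrap.service"
)

// deployPlan returns the in-guest deploy steps, in order, as shell commands.
// It is a dry run: nothing is executed here.
func deployPlan(image string) []string {
	return []string{
		// 1. Anonymous pull of the freshly pushed tag.
		fmt.Sprintf("docker pull %s", image),
		// 2. Point the bootstrap script at the new tag (file format assumed).
		fmt.Sprintf("printf '%%s\\n' %s > %s", image, tagFile),
		// 3. Let the bootstrap unit recreate the container from the tag file.
		fmt.Sprintf("systemctl restart %s", unit),
	}
}

func main() {
	// Placeholder image reference; the real registry path is not in the report.
	for i, step := range deployPlan("registry.example/felhom-controller:0.53.1") {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

Keeping the pull ahead of the tag-file update means a failed pull leaves the running controller and its tag file untouched.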

## Live validation (guest 9201, demo-felhom, root@felhom-pve)
- **Only one live HDD app:** RomM (the others aren't drive-deployed). `HDD_PATH=/mnt/felhom-usb`; data was
  at the doubled `/mnt/felhom-usb/felhom-data/appdata/romm`.
- **Catalog fix delivered by the real mechanism:** the controller's periodic git-sync logged
  *"Sablonok frissítve — frissítve: immich, nextcloud, paperless-ngx, romm"* (Hungarian: "Templates
  updated", followed by the list of updated apps) and updated all four
  on-disk stack files to `${HDD_PATH}/appdata/<app>` (single-nest) — confirmed in
  `/opt/docker/stacks/*/docker-compose.yml`.
- **RomM migrated (stop → move → verify → redeploy, ordered, reversible `mv`, never delete-then-move):**
  captured RomM's runtime env from the running containers into a guest-only temp file (secrets never left
  the guest), `docker compose down` (named DB/redis volumes preserved), moved
  `/mnt/felhom-usb/felhom-data/appdata/romm` → `/mnt/felhom-usb/appdata/romm`, verified file count
  unchanged, then recreated with the fixed compose + captured env.
- **Result:** RomM binds the single-nest paths
  (`/mnt/felhom-usb/appdata/romm/{library,resources}`), all three containers healthy, DB connected (the
  captured creds worked), in-guest `HTTP/1.1 200 OK`, *"Application startup complete"*. Old namespace
  `/mnt/felhom-usb/felhom-data/appdata/` confirmed empty; backups already single-nested at
  `/mnt/felhom-usb/backups/primary`. Controller untouched & healthy throughout.
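The move-and-verify step of that migration can be sketched as a rename followed by a file-count comparison, with a rename back on mismatch so nothing is ever deleted. The migration itself was done by hand with `mv`; `moveVerified`, `countFiles`, and `demo` below are hypothetical helpers, and `os.Rename` only works when source and destination sit on the same filesystem, as they do under one drive mount.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// countFiles returns the number of regular files under root.
func countFiles(root string) (int, error) {
	n := 0
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			n++
		}
		return nil
	})
	return n, err
}

// moveVerified renames src to dst and checks that the file count is unchanged.
// On a failed check it renames dst back to src instead of deleting anything.
func moveVerified(src, dst string) error {
	before, err := countFiles(src)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return err
	}
	if err := os.Rename(src, dst); err != nil {
		return err
	}
	after, err := countFiles(dst)
	if err != nil || after != before {
		if rbErr := os.Rename(dst, src); rbErr != nil {
			return fmt.Errorf("verify failed and revert failed: %v", rbErr)
		}
		return fmt.Errorf("verify failed (before=%d after=%d, err=%v); move reverted", before, after, err)
	}
	return nil
}

// demo builds a throwaway tree shaped like the doubled layout, migrates it,
// and returns the file count found at the single-nested destination.
func demo() (int, error) {
	base, err := os.MkdirTemp("", "nest-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(base)
	src := filepath.Join(base, "felhom-data", "appdata", "romm")
	dst := filepath.Join(base, "appdata", "romm")
	if err := os.MkdirAll(filepath.Join(src, "library"), 0o755); err != nil {
		return 0, err
	}
	for _, name := range []string{"a.rom", "b.rom"} {
		if err := os.WriteFile(filepath.Join(src, "library", name), []byte("x"), 0o644); err != nil {
			return 0, err
		}
	}
	if err := moveVerified(src, dst); err != nil {
		return 0, err
	}
	return countFiles(dst)
}

func main() {
	n, err := demo()
	fmt.Println("files at destination:", n, "error:", err)
}
```

The containers must already be stopped before the rename, which is why the report orders the steps as stop, move, verify, redeploy.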
## Live validation (guest 9201, demo-felhom)
- Deployed v0.53.1; on startup `RefreshCache` captured units: **romm** (`images=3, secrets-referenced=3,
  data_keys=0`) and **actualbudget** (`images=1`, system-fallback path `…/sys_drive/felhom-data/…`).
- RomM unit on disk: `compose/{app.yaml,docker-compose.yml,.felhom.yml}` + `db-dumps/romm-mariadb.sql` +
  `manifest.json`. Manifest is secret-free (image pins + secret NAMES + `secret_source`); captured
  app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
- **Secret-leak grep against the three actual RomM secret values → `NO_LEAK`.** Idempotency confirmed
  (single capture log line; the 5m refresh skips).
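The leak check was a `grep` on the guest; an equivalent scan can be sketched in Go, assuming the secret values are available to the checker. `findLeaks`, `verdict`, and `demo` are hypothetical names, the manifest content is made up, and the sketch reads each file fully into memory, which would need streaming for large DB dumps.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// findLeaks returns the files under root that contain any of the secret values.
func findLeaks(root string, secrets [][]byte) ([]string, error) {
	var hits []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil
		}
		data, err := os.ReadFile(p)
		if err != nil {
			return err
		}
		for _, s := range secrets {
			if len(s) > 0 && bytes.Contains(data, s) {
				hits = append(hits, p)
				break
			}
		}
		return nil
	})
	return hits, err
}

// verdict condenses the scan result into a single line.
func verdict(hits []string) string {
	if len(hits) == 0 {
		return "NO_LEAK"
	}
	return fmt.Sprintf("LEAK in %d file(s)", len(hits))
}

// demo scans a throwaway unit twice: once holding only a secret's name,
// once after a file containing the secret's value has been added.
func demo() (string, string, error) {
	root, err := os.MkdirTemp("", "unit-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	secrets := [][]byte{[]byte("not-a-real-secret")}

	// Hypothetical manifest: lists the secret's name, never its value.
	manifest := []byte(`{"secret_names":["DB_PASSWORD"]}`)
	if err := os.WriteFile(filepath.Join(root, "manifest.json"), manifest, 0o644); err != nil {
		return "", "", err
	}
	hits, err := findLeaks(root, secrets)
	if err != nil {
		return "", "", err
	}
	clean := verdict(hits)

	leaky := []byte("DB_PASSWORD=not-a-real-secret\n")
	if err := os.WriteFile(filepath.Join(root, "leaky.env"), leaky, 0o644); err != nil {
		return "", "", err
	}
	hits, err = findLeaks(root, secrets)
	if err != nil {
		return "", "", err
	}
	return clean, verdict(hits), nil
}

func main() {
	clean, dirty, err := demo()
	fmt.Println(clean, "|", dirty, "| error:", err)
}
```

Searching for the values rather than the names matters here, because the manifest lists the secret names on purpose.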

## Gate outcome — PASSED
A drive-app lands single-nested AND the backup helpers resolve the identical path — proven live (not
REPORT-only): the deploy-resolver and the backup helper agree by test and by the live RomM binds, the
catalog fix propagated via real git-sync, and no doubled `felhom-data` remains. Cleared to start Phase 2.

## Not done (intentionally deferred to Phases 2–4)
Per-app recovery unit (`backup/<app>`), Tier 2 off-drive copy (auto-enabled, durable-id target, rootfs
headroom guard), secret preserve-vs-regenerate classing, FileBrowser scoping, deploy-UI DB-on-SSD note,
monitoring storage sort/descriptions. The README's backup-paths section still describes the stale
restic/secondary layout — to be rewritten when Tier 2 is built.
## Not done — Phase 2b (the immediate next increment)
The restore-from-unit **recreate** (write compose/config back → re-pull image from pins → recover
secrets from the guest's app.yaml, live or via PBS → restore DB+volumes+userdata → boot), the
**fail-closed `data_key` gate** (refuse + warn if an encrypted app's key is unrecoverable), and the
live **AdventureLog readable-data** validation (deploy with an encryption key → back up → recreate →
confirm data decrypts). The existing `RestoreApp` still does the live-guest volume-tar restore. The
README backup-paths section still describes the stale restic/secondary layout; it is to be rewritten
when Tier 2 (Phase 3) lands.
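Since the fail-closed `data_key` gate is not built yet, the following is only one possible shape for it, under the assumption that the manifest's `data_key` names are checked against whatever secrets were recovered from the guest's app.yaml. `checkDataKeys` and `errDataKeyUnrecoverable` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errDataKeyUnrecoverable marks a restore that must be refused.
var errDataKeyUnrecoverable = errors.New("data_key unrecoverable")

// checkDataKeys fails closed: if any env var annotated as a data_key has no
// recovered value, the restore is refused rather than run with a fresh key
// that could not decrypt the existing data.
func checkDataKeys(dataKeyNames []string, recovered map[string]string) error {
	var missing []string
	for _, name := range dataKeyNames {
		if recovered[name] == "" {
			missing = append(missing, name)
		}
	}
	if len(missing) == 0 {
		return nil
	}
	sort.Strings(missing)
	return fmt.Errorf("%w: %v", errDataKeyUnrecoverable, missing)
}

func main() {
	names := []string{"SECRET_KEY"} // the env var marked data_key: true for adventurelog

	ok := checkDataKeys(names, map[string]string{"SECRET_KEY": "recovered-value"})
	fmt.Println("key recovered:", ok)

	err := checkDataKeys(names, map[string]string{})
	fmt.Println("key missing:  ", err, "| refuse restore:", errors.Is(err, errDataKeyUnrecoverable))
}
```

An app with no `data_key` names, such as RomM with `data_keys=0`, passes the gate trivially.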