diff --git a/CONTEXT.md b/CONTEXT.md
index 506e6e2..70aa9c9 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -13,6 +13,23 @@ Last updated: 2026-06-12 (storage UX polish)
 > is tracked in `CHANGELOG.md`, `controller/README.md`, and the auto-memory `MEMORY.md`. Live version:
 > **v0.45.0**.
 >
+> **2026-06-13 — v0.53.0/v0.53.1 Phase 2: per-app recovery unit (capture side, SECRET-FREE):**
+> - Each app's `backups/primary//` becomes a self-contained recovery unit: `compose/`
+> (docker-compose.yml + .felhom.yml + **secret-stripped** app.yaml) + db-dumps/ + volume-dumps/ +
+> `manifest.json` (image pins, secret env-var NAMES, data_key names, checksums, secret_source note).
+> - **Secret-free by design.** Decided after reading the ACTUAL hub code: hub is zero-knowledge (no app
+> secrets); app.yaml + key live on the guest rootfs → in the PBS whole-guest snapshot. So the unit
+> stores no secret/data-key/image; restore recovers secrets from the guest's app.yaml (live/PBS),
+> regenerates nothing. `data_key` (DeployField.DataKey; AdventureLog SECRET_KEY marked) = fail-closed
+> restore annotation only.
+> - Capture needs no decryption (non-secret env is plaintext; excludes secret-named + encrypted keys).
+> Wired into RunDBDumps AND the periodic RefreshCache (idempotent checksum-skip → no USB thrash).
+> - **Deploy mechanism resolved:** controller in guest 9201 is golden/bootstrap-managed —
+> `felhom-controller-bootstrap.service` docker-runs the tag from `/etc/felhom-controller-image`
+> (gitea anon-pull). Deploy = build+push → anon-pull → update tag file → restart the service.
+> - **Live-validated (9201):** RomM unit captured (images=3, secrets=3, data_keys=0), secret-leak grep
+> = NO_LEAK. Next: Phase 2b restore-from-unit recreate + fail-closed gate + AdventureLog readable-data.
+>
 > **2026-06-13 — v0.52.0 Phase 1 GATE: deploy-side double-nest fix (catalog) + path-agreement test:**
 > - The `felhom-data` double-nest lived in the **app-catalog compose templates**
 > (`${HDD_PATH}/felhom-data/appdata/`), not in `deploy.go`. On a Model-A in-guest drive the mount
diff --git a/REPORT.md b/REPORT.md
index c1fe0e4..d4e9119 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -1,60 +1,66 @@
-# REPORT — felhom-controller v0.52.0 (Phase 1 GATE: deploy-side double-nest fix)
+# REPORT — felhom-controller v0.53.1 (Phase 2: per-app recovery unit, capture side)
 
-Completes the Model-A double-nest reconciliation deferred in v0.48.0 and half-fixed in v0.51.0.
-v0.51.0 fixed the **backup-helper** side (`NamespaceRoot` provenance); this slice fixes the
-**deploy/compose** side and locks the two together. Validated live on guest 9201 (demo-felhom).
+Each app's on-drive backup is now a self-contained, recreatable **recovery unit** — and it is
+**secret-free by design**. Built, unit-tested, shipped to `main`, and validated live on guest 9201.
+(Phase 1, the deploy-side double-nest GATE, shipped earlier as v0.52.0 — see git history.)
 
-This is the GATE for the larger per-app-recovery-unit / Tier-2 slice — Phases 2–4 build on a proven
-single-nested, agreeing path layout. Per plan: Phase 1 shipped + validated first.
+## The design decision that shaped Phase 2 (secret handling)
+The recovery unit carries **no secrets, no data-keys, and not the Docker image**. This was decided after
+reading the *actual* hub code (the controller README that implied the hub stores app.yaml is stale
+pre-strip):
+- The hub is deliberately **zero-knowledge** — it holds a per-host recovery-code-wrapped PBS key it
+ cannot decrypt + non-secret config; **no per-app secrets**. Escrowing app secrets there would regress
+ that posture, so it was rejected.
+- `app.yaml` (encrypted) + the encryption key live on the **guest rootfs** (`local-lvm:vm-9201-disk-0`,
+ confirmed via `pct config`) → already inside the **PBS whole-guest snapshot**; the external drive
+ (`mp0` bind) is not. So the secret↔data split maps onto the tiers: **secrets ride PBS; bulk userdata
+ rides the drive + (Phase 3) Tier 2.**
+- Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml
+ (live, else PBS); **regenerate nothing**. `data_key` is a fail-closed annotation, not a
+ preserve/regenerate decision.
 
-## Root cause (corrected from the spec's assumption)
-The extra `felhom-data` segment was **not** built in controller `deploy.go` — `DeployStack` passes
-`HDD_PATH` through verbatim, and the deploy storage dropdown only ever offers registered in-guest drives
-(`GetSchedulableStoragePaths`), never the system-data path. The segment was hardcoded in the **app-catalog
-compose templates** as `${HDD_PATH}/felhom-data/appdata/`. On a Model-A in-guest drive the guest
-mount `/mnt/` already IS the host's `/felhom-data` namespace, so that segment double-nested
-to `/felhom-data/felhom-data/appdata/` on disk — diverging from where the (v0.51.0) backup
-helpers look: `AppDataDir(NamespaceRoot(HDD_PATH,true))` = `/mnt//appdata/`, single-nested.
+## What shipped (v0.53.0 + v0.53.1)
+- **Unit layout** (rooted at the existing `backups/primary//` — a deliberate low-churn choice, no
+ risky dump-dir migration): `compose/` (docker-compose.yml + .felhom.yml + a **secret-stripped**
+ app.yaml) + the existing `db-dumps/` + `volume-dumps/` + `manifest.json`. New helpers
+ `RecoveryUnit{Path,ComposePath,ManifestPath}` (`internal/appbackup/paths.go`).
+- **Secret-free manifest** (`internal/backup/recovery_unit.go`): app id, display name, controller
+ version, timestamp, drive, namespace root, **image pins** (image NOT stored — re-pulled on restore),
+ the **NAMES** of secret env vars (values never stored), `data_key` env-var names, an explicit
+ `secret_source` note, captured config-file list, enumerated dumps, sha256 checksums.
+- **Capture needs no secret access:** non-secret env is plaintext in app.yaml, so the capture excludes
+ secret-named keys (plus a defensive `crypto.IsEncrypted` guard) and reads no secret value. New
+ `StackDataProvider.GetStackRecoveryInfo` + `RecoveryInfo`, implemented by the main.go `stackAdapter`;
+ `ParseComposeImages` extracts pins.
+- **`data_key`**: `DeployField.DataKey` + `Metadata.DataKeyEnvVars()`; catalog `adventurelog/.felhom.yml`
+ `SECRET_KEY` ("Titkosítási kulcs") marked `data_key: true`.
+- **Refresh cadence (v0.53.1):** capture runs from the daily DB dump AND the periodic `RefreshCache`
+ (startup + every 5m), **idempotent** — content is built in memory and writes are skipped when the unit
+ is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
+- **Tests:** capture is secret-free (a secret in the source app.yaml never appears in the unit) +
+ manifest structure + idempotency (unchanged → skip; config change → rewrite). `go build ./...` clean.
 
-## Changes
-- **Catalog (`app-catalog-felhom.eu`, the behavioral fix):** exhaustive `grep` confirmed exactly four HDD
- templates carried the segment — `romm`, `nextcloud`, `immich`, `paperless-ngx`. All changed
- `${HDD_PATH}/felhom-data/appdata/` → `${HDD_PATH}/appdata/` (volume mounts + header comments).
-- **Controller (test-only, no runtime change):** `internal/stacks/hddpath_agreement_test.go` resolves a
- compose's `${HDD_PATH}` bind mounts via the real deploy-side `ParseComposeHDDMounts` and asserts they
- are byte-identical to the backup-side `AppDataDir(NamespaceRoot(HDD_PATH,true))` — no doubled
- `felhom-data`, deploy↔backup locked so they can't drift again. `go test ./internal/stacks/...` and
- `./internal/appbackup/...` pass; `go build ./...` clean.
-- **No controller image rebuild.** The controller passes `HDD_PATH` through unchanged and already
- resolved the single-nested path since v0.51.0, so no runtime change was needed. The controller in guest
- 9201 stays `:0.51.0` (functionally current); v0.52.0 marks the catalog gate + test and rolls into the
- next image build (Phase 2). The controller is golden-image/bootstrap-managed — not rebaked for a no-op.
+## Deploy mechanism (resolved this session)
+The controller in guest 9201 is **golden/bootstrap-managed**: `felhom-controller-bootstrap.service` runs
+`/usr/local/sbin/felhom-controller-bootstrap.sh`, which `docker run`s the tag from
+`/etc/felhom-controller-image` (gitea anon-pull, no login). Deploy = build+push tag → anon-pull → update
+that tag file → `systemctl restart felhom-controller-bootstrap.service`. Data volume + encryption key
+persist. (This is what "self-update handles version drift" refers to.)
 
-## Live validation (guest 9201, demo-felhom, root@felhom-pve)
-- **Only one live HDD app:** RomM (the others aren't drive-deployed). `HDD_PATH=/mnt/felhom-usb`; data was
- at the doubled `/mnt/felhom-usb/felhom-data/appdata/romm`.
-- **Catalog fix delivered by the real mechanism:** the controller's periodic git-sync logged
- *"Sablonok frissítve — frissítve: immich, nextcloud, paperless-ngx, romm"* and updated all four
- on-disk stack files to `${HDD_PATH}/appdata/` (single-nest) — confirmed in
- `/opt/docker/stacks/*/docker-compose.yml`.
-- **RomM migrated (stop → move → verify → redeploy, ordered, reversible `mv`, never delete-then-move):**
- captured RomM's runtime env from the running containers into a guest-only temp file (secrets never left
- the guest), `docker compose down` (named DB/redis volumes preserved), moved
- `/mnt/felhom-usb/felhom-data/appdata/romm` → `/mnt/felhom-usb/appdata/romm`, verified file count
- unchanged, then recreated with the fixed compose + captured env.
-- **Result:** RomM binds the single-nest paths
- (`/mnt/felhom-usb/appdata/romm/{library,resources}`), all three containers healthy, DB connected (the
- captured creds worked), in-guest `HTTP/1.1 200 OK`, *"Application startup complete"*. Old namespace
- `/mnt/felhom-usb/felhom-data/appdata/` confirmed empty; backups already single-nested at
- `/mnt/felhom-usb/backups/primary`. Controller untouched & healthy throughout.
+## Live validation (guest 9201, demo-felhom)
+- Deployed v0.53.1; on startup `RefreshCache` captured units: **romm** (`images=3, secrets-referenced=3,
+ data_keys=0`) and **actualbudget** (`images=1`, system-fallback path `…/sys_drive/felhom-data/…`).
+- RomM unit on disk: `compose/{app.yaml,docker-compose.yml,.felhom.yml}` + `db-dumps/romm-mariadb.sql` +
+ `manifest.json`. Manifest is secret-free (image pins + secret NAMES + `secret_source`); captured
+ app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
+- **Secret-leak grep against the three actual RomM secret values → `NO_LEAK`.** Idempotency confirmed
+ (single capture log line; the 5m refresh skips).
 
-## Gate outcome — PASSED
-A drive-app lands single-nested AND the backup helpers resolve the identical path — proven live (not
-REPORT-only): the deploy-resolver and the backup helper agree by test and by the live RomM binds, the
-catalog fix propagated via real git-sync, and no doubled `felhom-data` remains. Cleared to start Phase 2.
-
-## Not done (intentionally deferred to Phases 2–4)
-Per-app recovery unit (`backup/`), Tier 2 off-drive copy (auto-enabled, durable-id target, rootfs
-headroom guard), secret preserve-vs-regenerate classing, FileBrowser scoping, deploy-UI DB-on-SSD note,
-monitoring storage sort/descriptions. The README's backup-paths section still describes the stale
-restic/secondary layout — to be rewritten when Tier 2 is built.
+## Not done — Phase 2b (the immediate next increment)
+The restore-from-unit **recreate** (write compose/config back → re-pull image from pins → recover
+secrets from the guest's app.yaml, live or via PBS → restore DB+volumes+userdata → boot), the
+**fail-closed `data_key` gate** (refuse + warn if an encrypted app's key is unrecoverable), and the
+live **AdventureLog readable-data** validation (deploy with an encryption key → back up → recreate →
+confirm data decrypts). The existing `RestoreApp` still does the live-guest volume-tar restore. The
+README backup-paths section still describes the stale restic/secondary layout — rewritten when Tier 2
+(Phase 3) lands.
diff --git a/controller/README.md b/controller/README.md
index 874ada7..429eb12 100644
--- a/controller/README.md
+++ b/controller/README.md
@@ -349,6 +349,35 @@ Path computation is centralized in `backup/paths.go` via the `FelhomDataDir = "f
 - `SecondaryInfraPath(drivePath)` → `/felhom-data/backups/secondary/_infra/`
 - `InfraBackupDir(mountPath)` → `/.felhom-infra-backup/` (**unchanged** — stays at drive root for DR scanner)
 
+> **⚠️ Stale:** the restic/secondary helpers above (`PrimaryResticRepoPath`, `SecondaryResticRepoPath`,
+> `AppSecondaryRsyncPath`, `SecondaryInfraPath`) describe the pre-strip layout — restic/cross-drive was
+> removed in slice 8C. This section is rewritten when Tier 2 (Phase 3) lands.
+
+#### Per-app recovery unit (Phase 2, v0.53.x) — SECRET-FREE
+
+Each app's `backups/primary//` is a self-contained, recreatable **recovery unit**:
+
+```
+backups/primary//
+├── compose/ docker-compose.yml + .felhom.yml + a SECRET-STRIPPED app.yaml
+├── db-dumps/ app-consistent DB dump(s)
+├── volume-dumps/ named-volume tars
+└── manifest.json image pins, secret env-var NAMES, data_key names, checksums, secret_source
+```
+
+- **Secret-free by design.** The unit stores **no secret value, no data-encrypting key, and not the
+ Docker image** — only the pinned image tag(s) (re-pulled on restore) and the *names* of the secret /
+ `data_key` env vars. Rationale: app.yaml + the encryption key live on the guest rootfs → already in
+ the PBS whole-guest snapshot, and the hub is deliberately zero-knowledge. Restore recovers the
+ original secrets from the guest's own app.yaml (live, or via PBS) and **regenerates nothing**; for a
+ `data_key` app it **fails closed** (refuse + warn) if the key can't be recovered.
+- Helpers: `RecoveryUnitPath` / `RecoveryUnitComposePath` / `RecoveryUnitManifestPath`
+ (`internal/appbackup/paths.go`). Capture: `Manager.CaptureRecoveryUnit` (`internal/backup/recovery_unit.go`),
+ run from the daily DB dump and the periodic `RefreshCache` (idempotent checksum-skip). The non-secret
+ env comes from `StackDataProvider.GetStackRecoveryInfo` (excludes secret-named + encrypted values, so
+ the capture never touches a secret). `data_key` fields are marked in `.felhom.yml`
+ (`DeployField.DataKey`).
+
 **Phase 1 — Database Dumps** (`internal/backup/dbdump.go`, scheduled 02:30)
 - **Auto-discovery** of PostgreSQL and MariaDB containers via `docker ps` + `docker inspect`