From 88ca1178aed1f7b058c0725932f0e70bbaf42bb5 Mon Sep 17 00:00:00 2001
From: kisfenyo
Date: Sat, 13 Jun 2026 13:29:33 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20Phase=203=20off-drive=20Tier=202=20?=
 =?UTF-8?q?=E2=80=94=20REPORT/CONTEXT/README=20for=20v0.55.0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

REPORT (Tier 2 engine + rootfs-headroom guard + live validation: happy path RomM->SSD off felhom-usb, refuse path 1G dummy -> honest "needs 2nd HDD", UI card). CONTEXT entry. README Tier 2 subsection.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 CONTEXT.md           |  13 +++++
 REPORT.md            | 132 +++++++++++++++----------------------------
 controller/README.md |  12 ++++
 3 files changed, 72 insertions(+), 85 deletions(-)

diff --git a/CONTEXT.md b/CONTEXT.md
index 8df12c7..1ab31be 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -13,6 +13,19 @@ Last updated: 2026-06-12 (storage UX polish)
 > is tracked in `CHANGELOG.md`, `controller/README.md`, and the auto-memory `MEMORY.md`. Live version:
 > **v0.45.0**.
 >
+> **2026-06-13 — v0.55.0 Phase 3: auto off-drive Tier 2 (rootfs-headroom guard):**
+> - `internal/backup/tier2.go`: rsync `-a --delete` of each HDD app's recovery unit + appdata → a
+>   DIFFERENT physical disk (`/backups/secondary//`). Auto target: prefer another registered
+>   drive (off-disk via `system.SamePhysicalDevice`), else internal SSD for SMALL units only.
+> - **Rootfs-headroom guard** (`tier2FitsHeadroom`, unit-tested): SSD = ~8G guest rootfs, so REFUSE
+>   unless the unit fits leaving reserve = max(2G, 20%) free; honest "needs 2nd HDD" status when nothing
+>   fits — never fills the rootfs. Status via surviving `settings.CrossDriveBackup`; "2. mentés" UI card
+>   now populated (`buildAppBackupRows`). Daily `tier2-backup` 03:30 + `POST /api/backup/tier2`.
+> - **Live-validated (9201):** happy path (RomM → SSD, off felhom-usb, 77KB, "[SSD: DB/config only]");
+>   refuse path (1G userdata dummy → REFUSED with honest msg, rootfs not filled); UI card shows
+>   "Sikeres → belső SSD (csak DB/konfiguráció)". Demo cleaned.
+> - Next: Phase 4 (FileBrowser scoping + deploy-UI DB-on-SSD note + monitoring sort).
+>
 > **2026-06-13 — v0.53.0/v0.53.1 Phase 2: per-app recovery unit (capture side, SECRET-FREE):**
 > - Each app's `backups/primary//` becomes a self-contained recovery unit: `compose/`
 >   (docker-compose.yml + .felhom.yml + **secret-stripped** app.yaml) + db-dumps/ + volume-dumps/ +
diff --git a/REPORT.md b/REPORT.md
index 36e7002..0dd59fe 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -1,91 +1,53 @@
-# REPORT — felhom-controller v0.54.0 (Phase 2: recovery unit — capture + restore)
+# REPORT — felhom-controller v0.55.0 (Phase 3: auto off-drive Tier 2)
 
-Each app's on-drive backup is a self-contained, recreatable **recovery unit** (secret-free), and restore
-now **recreates an app from its unit + the guest's own secrets** with a **fail-closed data-key gate**.
-Built, unit/integration-tested, shipped to `main`, deployed to guest 9201. (Phase 1, the deploy-side
-double-nest GATE, shipped as v0.52.0; Phase 2 capture side as v0.53.x — see git history.)
+Tier 2 makes an **off-drive copy** of each HDD app's recovery unit + bulk userdata to a **different
+physical disk** — the only off-drive protection browsable HDD userdata can get (PBS can't reach bind
+mounts). Auto-enabled, auto-targeted, and — crucially — it **refuses rather than fills** the small guest
+rootfs. Built, unit-tested, shipped, deployed, and live-validated on guest 9201. (Phases 1/2/2b shipped
+as v0.52–0.54 — see git history.)
 
-## The design decision that shaped Phase 2 (secret handling)
-The recovery unit carries **no secrets, no data-keys, and not the Docker image**. This was decided after
-reading the *actual* hub code (the controller README that implied the hub stores app.yaml is stale
-pre-strip):
-- The hub is deliberately **zero-knowledge** — it holds a per-host recovery-code-wrapped PBS key it
-  cannot decrypt + non-secret config; **no per-app secrets**. Escrowing app secrets there would regress
-  that posture, so it was rejected.
-- `app.yaml` (encrypted) + the encryption key live on the **guest rootfs** (`local-lvm:vm-9201-disk-0`,
-  confirmed via `pct config`) → already inside the **PBS whole-guest snapshot**; the external drive
-  (`mp0` bind) is not. So the secret↔data split maps onto the tiers: **secrets ride PBS; bulk userdata
-  rides the drive + (Phase 3) Tier 2.**
-- Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml
-  (live, else PBS); **regenerate nothing**. `data_key` is a fail-closed annotation, not a
-  preserve/regenerate decision.
+## What shipped
+- **Engine** (`internal/backup/tier2.go`, `RunTier2`/`RunAllTier2`): rsync `-a --delete` mirror of the
+  recovery unit (`backups/primary//`) and the app's `appdata//` → `/backups/secondary/
+  /`. restic is **not** revived — a plain, browsable mirror.
+- **Auto target selection:** prefer another registered user-data drive on a **different physical disk**
+  (can hold bulk userdata); else the internal SSD for **small units only**. Off-disk enforced by
+  `system.SamePhysicalDevice` (block-device identity — new exported helper, linux + non-linux stub),
+  re-checked before the copy (defense in depth).
+- **Rootfs-headroom guard (the safety):** the SSD target is the ~8 GB guest rootfs, so a size-aware
+  guard (`tier2FitsHeadroom`, unit-tested) **refuses** unless the unit fits leaving a reserve free
+  (`max(2 GB, 20% of total)`). When nothing fits, it records an **honest** "needs a 2nd HDD" status —
+  never silently no-ops, never endangers the rootfs.
+- **Status + UI:** results persist via the surviving `settings.CrossDriveBackup`. `buildAppBackupRows`
+  now **populates** the "2. mentés" card — real target ("belső SSD (csak DB/konfiguráció)" vs an external
+  drive) on success, or the honest no-target reason. Notifications via the surviving
+  `NotifyCrossDrive{Completed,Failed}` hooks.
+- **Scheduling + trigger:** daily `tier2-backup` (03:30, after the DB dump); manual `POST /api/backup/tier2`.
+- Fixed a stale pre-existing test (`TestBackupCopiesOnPath`, which still used the old
+  `felhom-data/backups/secondary` layout) to the Model-A in-guest layout Tier 2 actually uses.
 
-## What shipped (v0.53.0 + v0.53.1)
-- **Unit layout** (rooted at the existing `backups/primary//` — a deliberate low-churn choice, no
-  risky dump-dir migration): `compose/` (docker-compose.yml + .felhom.yml + a **secret-stripped**
-  app.yaml) + the existing `db-dumps/` + `volume-dumps/` + `manifest.json`. New helpers
-  `RecoveryUnit{Path,ComposePath,ManifestPath}` (`internal/appbackup/paths.go`).
-- **Secret-free manifest** (`internal/backup/recovery_unit.go`): app id, display name, controller
-  version, timestamp, drive, namespace root, **image pins** (image NOT stored — re-pulled on restore),
-  the **NAMES** of secret env vars (values never stored), `data_key` env-var names, an explicit
-  `secret_source` note, captured config-file list, enumerated dumps, sha256 checksums.
-- **Capture needs no secret access:** non-secret env is plaintext in app.yaml, so the capture excludes
-  secret-named keys (plus a defensive `crypto.IsEncrypted` guard) and reads no secret value. New
-  `StackDataProvider.GetStackRecoveryInfo` + `RecoveryInfo`, implemented by the main.go `stackAdapter`;
-  `ParseComposeImages` extracts pins.
-- **`data_key`**: `DeployField.DataKey` + `Metadata.DataKeyEnvVars()`; catalog `adventurelog/.felhom.yml`
-  `SECRET_KEY` ("Titkosítási kulcs") marked `data_key: true`.
-- **Refresh cadence (v0.53.1):** capture runs from the daily DB dump AND the periodic `RefreshCache`
-  (startup + every 5m), **idempotent** — content is built in memory and writes are skipped when the unit
-  is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
-- **Tests:** capture is secret-free (a secret in the source app.yaml never appears in the unit) +
-  manifest structure + idempotency (unchanged → skip; config change → rewrite). `go build ./...` clean.
+## Live validation (guest 9201)
+- **Happy path:** triggered Tier 2 → *"Tier 2 copied romm → /mnt/sys_drive/felhom-data/backups/secondary/
+  romm (77.1 KB) [SSD: DB/config only]"*. The recovery unit landed on the SSD, **off** the felhom-usb
+  source (block devices 2065 vs 64518 — off-disk confirmed), auto-picking the SSD (no 2nd drive).
+- **Refuse path (rootfs-headroom guard):** placed a 1 GB userdata dummy (SSD had 2.3 GB free) → Tier 2
+  **refused**: *"nincs elég hely a belső SSD-n — a nagy fájlok off-drive mentéséhez 2. meghajtó (vagy
+  távoli tárhely) szükséges"*, and did **not** copy the 1 GB to the rootfs. Removed the dummy; re-trigger
+  restored the successful small-unit copy.
+- **UI end-to-end:** the backups page "2. mentés" card renders *Sikeres → belső SSD (csak
+  DB/konfiguráció)* for RomM.
+- Demo left clean (dummy removed; RomM's intended small Tier 2 copy remains on the SSD).
 
-## Deploy mechanism (resolved this session)
-The controller in guest 9201 is **golden/bootstrap-managed**: `felhom-controller-bootstrap.service` runs
-`/usr/local/sbin/felhom-controller-bootstrap.sh`, which `docker run`s the tag from
-`/etc/felhom-controller-image` (gitea anon-pull, no login). Deploy = build+push tag → anon-pull → update
-that tag file → `systemctl restart felhom-controller-bootstrap.service`. Data volume + encryption key
-persist. (This is what "self-update handles version drift" refers to.)
-
-## Live validation (guest 9201, demo-felhom)
-- Deployed v0.53.1; on startup `RefreshCache` captured units: **romm** (`images=3, secrets-referenced=3,
-  data_keys=0`) and **actualbudget** (`images=1`, system-fallback path `…/sys_drive/felhom-data/…`).
-- RomM unit on disk: `compose/{app.yaml,docker-compose.yml,.felhom.yml}` + `db-dumps/romm-mariadb.sql` +
-  `manifest.json`. Manifest is secret-free (image pins + secret NAMES + `secret_source`); captured
-  app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
-- **Secret-leak grep against the three actual RomM secret values → `NO_LEAK`.** Idempotency confirmed
-  (single capture log line; the 5m refresh skips).
-
-## Phase 2b — restore-from-unit + fail-closed gate (v0.54.0)
-- **`reconcileRestoreSecrets`** (pure, exhaustively unit-tested): merges the unit's non-secret env with
-  the secrets recovered from the guest's live app.yaml. A missing/empty **data-encrypting key** aborts
-  the restore (a PBS whole-guest restore is required) — regenerating it would corrupt data. A missing
-  resettable secret is non-fatal (warn + proceed). **Regenerates nothing.**
-- **`RestoreFromRecoveryUnit`**: manifest → recover secrets from the guest → gate → restore named-volume
-  tars → recover the app definition from the unit → redeploy with the reconstructed env (re-pull pinned
-  image). Falls back to volume-only `RestoreApp` when no unit exists. Wired into `/backup/restore`.
-- Seams: `RecoverStackSecrets` / `RecreateStackFromUnit` (adapter, with `encKey` to decrypt the live
-  app.yaml); `stacks.RedeployFromEnv`. `isDebug` made nil-safe.
-- **Tests:** the gate (recovered / data-key-missing→refuse / empty-data-key→refuse / resettable-missing
-  →proceed, values used verbatim), the full orchestration (success→recreate-with-merged-env;
-  data-key-missing→refused, recreate never called), and `data_key` parsing from `.felhom.yml`.
-
-## Validation status
-- **Unit/integration-tested (authoritative):** the fail-closed gate, the restore orchestration, secret
-  reconciliation (regenerate-nothing), and the catalog→metadata `data_key` flow.
-- **Live-validated (guest 9201):** the capture side (v0.53.1, RomM — secret-free, NO_LEAK). For Phase 2b
-  on **AdventureLog** (a real data_key app): its unit manifest carries `data_key_env_vars: [SECRET_KEY]`
-  (catalog→manifest flow live); and with `SECRET_KEY` made unrecoverable, `POST /backup/restore`
-  **refused** with the exact fail-closed message **before any compose-up** (no side effects). The demo
-  has no dashboard password → the API is open (auth + CSRF skipped), driven via the public URL.
-- **One e2e not run — environment limit, not a code gap:** the full "deploy with data → restore →
-  confirm decrypts" — AdventureLog's images do not fit the **8 GB guest rootfs** (deploy hit "no space
-  left on device"). That is precisely the Phase 3 rootfs-headroom concern, now observed live.
-  Key-preservation is covered by the gate's verbatim-recovery unit test. Demo left clean (AdventureLog
-  reverted to not-deployed, no leftovers).
+## Notes / follow-ups
+- **Off-disk identity** uses block-device (`Stat_t.Dev`) equality — correct for the felhom layout
+  (external drive vs system rootfs). Two partitions on one physical disk would look "different"; the
+  agent's `DiskInfo.DurableID` is the stronger guarantee for that case (future hardening).
+- Non-HDD apps (data on the rootfs, already in PBS) are skipped by Tier 2; their "2. mentés" card shows
+  "Nincs 2." — cosmetically it could be hidden for non-HDD apps (Phase 4 polish).
+- The single-drive demo can only Tier 2 to the SSD (small units); a 2nd HDD would let bulk userdata copy
+  off-drive — the engine already prefers it when present.
 
 ## Still ahead
-Phase 3 (auto off-drive Tier 2 with rootfs-headroom guard) and Phase 4 (FileBrowser scoping + deploy-UI
-DB-on-SSD note + monitoring sort). The README backup-paths section still shows the stale restic/secondary
-layout — rewritten when Tier 2 lands.
+Phase 4: FileBrowser scoping (hide recovery units), deploy-UI "DB runs on the fast internal drive" note,
+monitoring storage-bar sort + descriptions. The README backup-paths section's stale restic/secondary
+text should be rewritten alongside.
diff --git a/controller/README.md b/controller/README.md
index 429eb12..0885ea1 100644
--- a/controller/README.md
+++ b/controller/README.md
@@ -378,6 +378,18 @@ backups/primary//
 the capture never touches a secret). `data_key` fields are marked in `.felhom.yml`
 (`DeployField.DataKey`).
 
+#### Tier 2 — off-drive copy (Phase 3, v0.55.x)
+
+For every HDD app, Tier 2 (`internal/backup/tier2.go`) rsync-mirrors the recovery unit
+(`backups/primary//`) + the app's `appdata//` to `/backups/secondary//` on a
+**different physical disk** — the only off-drive protection bind-mounted HDD userdata can get (PBS can't
+reach bind mounts). Auto-targeted: **prefer another registered user-data drive** (off-disk via
+`system.SamePhysicalDevice`); else the **internal SSD for small units only**, behind a size-aware
+**rootfs-headroom guard** (`tier2FitsHeadroom`) that **refuses rather than fills** the ~8 GB guest rootfs
+(reserve = `max(2 GB, 20%)`), recording an honest "needs a 2nd HDD" status. Status persists via
+`settings.CrossDriveBackup` and drives the "2. mentés" card. Runs daily (`tier2-backup`, 03:30) or via
+`POST /api/backup/tier2`. restic is **not** used — a plain browsable mirror.
+
 **Phase 1 — Database Dumps** (`internal/backup/dbdump.go`, scheduled 02:30)
 
 - **Auto-discovery** of PostgreSQL and MariaDB containers via `docker ps` + `docker inspect`