feat: backup safety — stop-before-dump, streaming restore, health check, per-app restic, infra configs (v0.34.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 08:56:48 +01:00
parent 783830a9d4
commit fb11c3b75a
8 changed files with 147 additions and 33 deletions
@@ -336,6 +336,7 @@ Path computation is centralized in `backup/paths.go` via the `FelhomDataDir = "f
 **Phase 1b — Docker Volume Dumps** (`internal/backup/backup.go`, runs after DB dumps)

 - Iterates all deployed stacks that have Docker named volumes (`GetDockerVolumes()`)
+- **v0.34.0:** Each stack is stopped before its volumes are dumped and restarted afterwards (`DumpAppVolumesSafe()`), which prevents inconsistent tars of live databases. Protected stacks (traefik, etc.) that reject `StopStack` are skipped with a warning.
 - For each volume: `docker run --rm -v <vol>:/vol:ro -v <dumpDir>:/out alpine tar cf /out/<vol>.tar -C /vol .`
 - 10-minute timeout per volume; warnings on failure (non-fatal)
 - Stale tars cleaned up (volumes that no longer exist)
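
The stop-before-dump behaviour described for Phase 1b can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/backup/backup.go`: the `StackController` interface, the `errProtected` sentinel, the `dumpVolumesSafe` helper, and the fake controller are all hypothetical names introduced here, and the sketch assumes that any `StopStack` rejection means "skip this stack".

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// StackController is a hypothetical interface; the real controller API may differ.
type StackController interface {
	StopStack(name string) error
	StartStack(name string) error
}

// errProtected is a hypothetical sentinel returned by stacks that refuse to stop.
var errProtected = errors.New("stack is protected")

// dumpVolumesSafe stops the stack, runs dump, and always attempts a restart.
// It reports skipped=true (and does not dump) when the stop request is rejected.
func dumpVolumesSafe(c StackController, stack string, dump func() error) (skipped bool, err error) {
	if stopErr := c.StopStack(stack); stopErr != nil {
		log.Printf("warning: skipping volume dump for %s: %v", stack, stopErr)
		return true, nil
	}
	// Restart even if the dump fails, so a failed backup never leaves the app down.
	defer func() {
		if startErr := c.StartStack(stack); startErr != nil {
			err = errors.Join(err, fmt.Errorf("restart %s: %w", stack, startErr))
		}
	}()
	return false, dump()
}

// fakeController records calls so the ordering can be inspected without Docker.
type fakeController struct {
	protected map[string]bool
	events    []string
}

func (f *fakeController) StopStack(name string) error {
	if f.protected[name] {
		return errProtected
	}
	f.events = append(f.events, "stop "+name)
	return nil
}

func (f *fakeController) StartStack(name string) error {
	f.events = append(f.events, "start "+name)
	return nil
}

func main() {
	c := &fakeController{protected: map[string]bool{"traefik": true}}
	for _, stack := range []string{"app", "traefik"} {
		skipped, err := dumpVolumesSafe(c, stack, func() error {
			c.events = append(c.events, "dump "+stack)
			return nil
		})
		fmt.Println(stack, "skipped:", skipped, "err:", err)
	}
	fmt.Println(c.events)
}
```

Deferring the restart keeps the "restarted after" guarantee even when the dump itself returns an error, which matches the non-fatal warning semantics listed above.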
@@ -347,12 +348,12 @@ Path computation is centralized in `backup/paths.go` via the `FelhomDataDir = "f
 - Apps are **grouped by drive** via `groupStacksByDrive()` — each drive's apps are backed up to that drive's restic repo
 - App drive resolution: `GetStackHDDPath()` (from `StackDataProvider`) → falls back to `SystemDataPath`
 - Auto-generated repository password (32 random bytes, base64url), shared across all repos, synced to hub
- **Paths included in every per-drive snapshot:**
+- **Paths included in each per-drive snapshot (v0.34.0: per-app scoped):**
  - Per-app DB dump dirs on that drive
  - Per-app Docker volume dump dirs (`volume-dumps/*.tar`)
  - Per-app HDD mount paths (user data)
-  - Stacks dir (compose.yml + app.yaml + .felhom.yml for all apps)
-  - `controller.yaml` (controller config)
+  - Per-app stack config dir (`<StacksDir>/<stackName>/` — only for stacks on this drive)
+  - `controller.yaml` — only on the system drive (not duplicated across all drives)
 - Auto-detects and unlocks stale locks (restic repo lock)
 - Weekly prune on Sundays with configurable retention (keep-daily, keep-weekly, keep-monthly)
 - Weekly integrity check (`restic check`) on Sunday 04:00 — checks **all** primary repos
@@ -377,7 +378,7 @@ data back up config + DB + user data + Docker volumes; apps without HDD back up
  - **restic** — Versioned, deduplicated, encrypted (shared repo across apps, not browsable)
 - Per-app configuration in settings.json: destination path, method, schedule (daily/weekly/manual)
 - **Pre-backup DB dump:** `DumpStackDB()` runs fresh pg_dump/mariadb-dump before each cross-drive backup; non-fatal on failure (wired via `DBDumper` interface to avoid circular imports)
- **Pre-backup volume dump (v0.33.0):** `DumpAppVolumes()` exports Docker named volumes to tar before each cross-drive backup (wired via `VolumeDumper` interface)
+- **Pre-backup volume dump (v0.33.0, safe stop/start v0.34.0):** `DumpAppVolumesSafe()` stops the stack, exports Docker named volumes to tar, restarts — wired via `VolumeDumper` interface
 - **Empty mounts allowed:** `RunAppBackup` accepts apps with no HDD mounts — the rsync
  mount loop simply doesn't execute, but DB + config copy still runs
 - **Drive-type-aware validation** (`ValidateDestination`):
@@ -440,16 +441,17 @@ appear in the restore dropdown with per-app snapshot filtering.
  - Config only: "Csak konfiguracio visszaallitasa"

 **Tier 1 restore** (`RestoreApp`):
- Stop app → resolve app's home drive → `restic restore <id> --target / --include <path>...` → populate Docker volumes from restored tars → restart app
+- Stop app → resolve app's home drive → `restic restore <id> --target / --include <path>...` → populate Docker volumes from restored tars → restart app → health check
 - Restore paths: config dir, DB dump dir, volume dump dir, HDD mounts
 - Docker volumes restored via `restoreDockerVolumes()`: `docker volume rm -f` → `docker volume create` → `docker run alpine tar xf`

 **Tier 2 restore** (`RestoreAppFromTier2`):
- Stop app → rsync config from `_config/` → rsync HDD data (single/multi-mount) → copy DB dumps from `_db/` → restore Docker volumes from `_volumes/` tars → restart app
+- Stop app → rsync config from `_config/` → rsync HDD data (single/multi-mount) → copy DB dumps from `_db/` (streaming `copyFile`) → restore Docker volumes from `_volumes/` tars → restart app → health check
 - Uses rsync `--delete` for config and HDD data to ensure exact mirror state
 - Single-mount apps: data directly in rsync dir (excluding `_*`); multi-mount: per-leaf subdirectories

 **Common:**
+- **v0.34.0:** Post-restore health check (`waitForHealthy`) refreshes container state via `docker ps` every 5s for up to 90s. If the app does not reach the running state, a warning is logged; the restore still returns success because the data has already been restored.
 - Running flag prevents concurrent backup/restore operations
 - Snapshot ID validated (8-64 lowercase hex, or special `tier2-rsync`)
 - Import from `.fab` bundle link shown in restore section for cross-system migration
@@ -970,7 +972,7 @@ After each backup cycle (including manual Tier 2 triggers via `OnCrossDriveCompl
 - `controller.yaml` (base64-encoded, full config including secrets)
 - `settings.json` (base64-encoded, backup prefs, storage paths, cross-drive configs)
 - Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths)
+- Deployed stacks manifest (app names, HDD paths), plus each stack's actual config files (`docker-compose.yml`, `app.yaml`, `.felhom.yml`), base64-encoded per stack (v0.34.0)
 - Restic passwords (primary + cross-drive, base64-encoded)

 This enables fully automated recovery when the system drive is replaced — the new controller pulls the snapshot from the Hub, auto-mounts surviving drives by UUID, and restores all applications.