From ea7105d468fba28e30066f1f1d1c5e25f75abfec Mon Sep 17 00:00:00 2001 From: kisfenyo Date: Thu, 19 Feb 2026 08:41:59 +0100 Subject: [PATCH] Fix data loss on container restart (v0.15.2) --- TASK.md | 391 +++++++++++++------------------------------------------- 1 file changed, 89 insertions(+), 302 deletions(-) diff --git a/TASK.md b/TASK.md index 5888195..5917135 100644 --- a/TASK.md +++ b/TASK.md @@ -1,339 +1,126 @@ -# TASK.md — Bug Fixes from v0.12.4–v0.13.1 Code Review +# TASK: Fix data loss on container restart (v0.15.2) -**Version:** v0.14.1 → v0.14.2 -**Type:** Bug fixes identified by thorough code review of today's changes (sessions 40–48) -**Scope:** Controller Go code — `internal/backup/`, `internal/web/` +Two pieces of data are lost when the felhom-controller container restarts: +1. Snapshot history delta stats (HOZZÁADOTT, ÚJ FÁJL, VÁLTOZOTT columns show 0) +2. DB validation info (ÉRVÉNYESÍTÉS column shows "–" instead of table counts) ---- +## Bug 1: Snapshot history delta stats lost on restart -## Bug Summary +### Root cause -| # | Severity | File | Description | -|---|----------|------|-------------| -| 1 | **HIGH** | `crossdrive.go` | rsync `--delete` destroys `_db/` and `_config/` directories on every single-mount run | -| 2 | **MEDIUM** | `backup.go` | Scheduled backups don't set `m.running` flag — restore/manual backup can overlap | -| 3 | **MEDIUM** | `crossdrive.go` | `ValidateDestination` silently succeeds when disk usage can't be read | -| 4 | **MEDIUM** | `backup.go` | Empty `systemDataPath` silently produces relative dump paths | +`LoadSnapshotHistory()` in `controller/internal/backup/backup.go` (~line 788) loads snapshots from restic repos on startup, but sets `HasStats: false` because restic's `snapshots` command only returns ID/time/paths/tags — NOT the delta stats (files_new, files_changed, data_added) which are only available in the `restic backup` output during the actual backup run. ---- - -## Bug 1 (HIGH): rsync `--delete` destroys `_db/` and `_config/` directories - -### File -`internal/backup/crossdrive.go`, function `runRsyncBackup`, lines 242–276 - -### Root Cause -When an app has exactly one HDD mount (`len(mounts) == 1`), rsync copies directly into `destDir/` with the `--delete` flag. The `_db/` and `_config/` subdirectories — created by the DB dump copy (lines 278–286) and config rsync (lines 288–303) from the **previous** backup run — don't exist in the source mount, so `--delete` removes them from the destination **before** the current run re-creates them. - -### Impact -1. **Brief window of incomplete backup** on every run — between rsync completion and the subsequent `_db`/`_config` copy steps, the backup destination is missing DB dumps and config -2. **Data loss if interrupted** — if the process is killed (OOM, power loss, context cancellation) between the rsync step and the copy steps, `_db/` and `_config/` are gone until the next successful complete run -3. **Wasted I/O** — every run deletes and re-creates these directories unnecessarily - -### Reproduction -1. Deploy Immich with a single HDD mount to `/mnt/hdd_1/storage/immich` -2. Configure cross-drive rsync backup to `/mnt/sys_drive` -3. Run backup twice -4. After first run: `ls /mnt/sys_drive/backups/secondary/immich/rsync/` shows `_db/`, `_config/`, and app data -5. During second run's rsync phase: `_db/` and `_config/` disappear from dest -6. After second run completes: `_db/` and `_config/` reappear (re-created) - -### Fix Instructions -In `runRsyncBackup()`, add an `--exclude` flag with pattern `_*` to the rsync command. This prevents `--delete` from touching any controller-managed directories (which all use underscore prefix: `_db`, `_config`, and any future ones) while still cleaning up stale user data. - -**Change the rsync command construction (around line 267) from:** -```go -cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete", - "--exclude", "backups/*.sql.gz", - "--exclude", "backups/*.sql", - "--exclude", "backups/*.dump", - src, dst) +The template in `controller/internal/web/templates/backups.html` renders 0 for `HasStats: false`: +```html +{{if .HasStats}}+{{.DataAdded}}{{else}}0{{end}} ``` -**To:** -```go -cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete", - "--exclude", "_*", - "--exclude", "backups/*.sql.gz", - "--exclude", "backups/*.sql", - "--exclude", "backups/*.dump", - src, dst) -``` +### Fix -**Why `_*` instead of listing `_db` and `_config` individually:** -- Future-proof — if we ever add another controller-managed subdirectory (e.g., `_metadata`), it's automatically protected -- The underscore prefix is a convention for controller-managed dirs; no real app data directory starts with `_` -- Simpler to maintain (one pattern vs growing list) +Persist the `snapshotHistory` ring buffer to a JSON file on disk so delta stats survive restarts. -**Why this is safe:** The `_db` and `_config` directories are managed exclusively by the controller (lines 278–303), never by the user's app. No user data directory starts with `_`. The underscore prefix convention was specifically chosen to avoid collision with app data paths. +**Implementation steps:** -**Note on `_infra`:** The `_infra/` directory lives at `/backups/secondary/_infra/` (a sibling of app directories, NOT inside any app's rsync target directory). So it's not affected by `--delete` regardless. The `_*` exclude is still good practice in case the directory structure changes later. +1. **Add persistence methods to Manager** in `controller/internal/backup/backup.go`: + - Add a `snapshotHistoryFile` field to `Manager` struct (set to e.g. `/snapshot-history.json`) + - Add `saveSnapshotHistory()` — marshals `snapshotHistory` to JSON, writes atomically (write to .tmp, rename) + - Add `loadSnapshotHistory()` — reads JSON file, unmarshals into `snapshotHistory` + - Call `saveSnapshotHistory()` at the end of `appendSnapshotRecord()` (called after each backup) -**For the multi-mount case** (`len(mounts) > 1`): rsync goes into per-leaf subdirectories, not the root `destDir`, so `_db/` and `_config/` are siblings (not children) of the rsync targets. The `--delete` flag only affects the target subtree, so they're safe in the multi-mount case. Adding the exclude universally doesn't hurt. +2. **Update `LoadSnapshotHistory()`** in same file: + - First try `loadSnapshotHistory()` from the JSON file + - If the file exists and loads successfully, merge with restic repo snapshots: + - Build a map of persisted records by SnapshotID + - For any restic snapshot NOT in the persisted map, add it with `HasStats: false` + - This preserves delta stats for recent snapshots while still picking up very old ones from restic + - If the file doesn't exist (first run), fall back to the current restic-only loading ---- +3. **Wire up the file path** in `NewManager()`: + - The data path can be derived from `cfg.Paths.SystemDataPath` or use a fixed path like `/opt/docker/felhom-controller/data/snapshot-history.json` + - The `data/` directory is a Docker volume, so it persists across container recreations + - Check what `config.Config` has available. See `controller/internal/config/config.go` for the config struct. The controller's data directory is typically `/opt/docker/felhom-controller/data/` which is a mounted volume (check `cmd/controller/main.go` for the data path). -## Bug 2 (MEDIUM): Scheduled backups don't set `m.running` flag +**Key file paths to check:** +- `controller/internal/backup/backup.go` — Manager struct, LoadSnapshotHistory, appendSnapshotRecord +- `controller/internal/config/config.go` — Config struct for data paths +- `controller/cmd/controller/main.go` — where NewManager is called, what paths are available -### File -`internal/backup/backup.go`, functions `RunBackup()` (line 272) and `RunDBDumps()` (line 181) +## Bug 2: DB validation missing after restart -### Root Cause -Only `RunFullBackup()` (line 454) sets `m.running = true` before running. The scheduler in `main.go` calls `RunBackup()` and `RunDBDumps()` directly (not via `RunFullBackup()`), so `m.running` is never set during nightly scheduled runs. +### Root cause -Code that checks `m.running`: -- `RestoreApp()` in `restore.go:25` — blocks if running -- `RunFullBackup()` in `backup.go:456` — blocks if running -- `IsRunning()` in `backup.go:496` — used by API and UI to show status - -### Impact -1. **UI shows "not running" during nightly backups** — the dashboard backup card and backup page don't indicate that a backup is in progress -2. **Restore can overlap with nightly backup** — a user triggering "Restore" at 03:05 while the nightly restic backup is running won't see a "backup in progress" error. Both operations compete for the restic repo lock; one will fail with an opaque error -3. **Manual "Backup now" can overlap with scheduled backup** — `RunFullBackup()` checks `m.running`, but since the scheduled `RunBackup()` doesn't set it, the manual trigger proceeds and both run concurrently - -### Fix Instructions -Add `m.running` guard to both `RunBackup()` and `RunDBDumps()`. The pattern is the same as in `RunFullBackup()`: - -**In `RunBackup()` (line 272), add at the beginning of the function:** -```go -func (m *Manager) RunBackup(ctx context.Context) error { - m.mu.Lock() - if m.running { - m.mu.Unlock() - return fmt.Errorf("backup already in progress") - } - m.running = true - m.mu.Unlock() - defer func() { - m.mu.Lock() - m.running = false - m.mu.Unlock() - }() - - start := time.Now() - // ... rest of existing function -``` - -**In `RunDBDumps()` (line 181), add the same guard:** -```go -func (m *Manager) RunDBDumps(ctx context.Context) error { - m.mu.Lock() - if m.running { - m.mu.Unlock() - return fmt.Errorf("backup already in progress") - } - m.running = true - m.mu.Unlock() - defer func() { - m.mu.Lock() - m.running = false - m.mu.Unlock() - }() - - start := time.Now() - // ... rest of existing function -``` - -**IMPORTANT:** `RunFullBackup()` calls `RunDBDumps()` then `RunBackup()` internally. After adding the guards above, `RunFullBackup()` would deadlock on itself (it sets `m.running=true`, then calls `RunDBDumps()` which also tries to set it). Fix by either: - -**Option A (recommended):** Extract the logic into internal methods that DON'T check the flag: -```go -// Public methods — set the running guard -func (m *Manager) RunDBDumps(ctx context.Context) error { - if err := m.acquireRunning(); err != nil { - return err - } - defer m.releaseRunning() - return m.runDBDumpsInternal(ctx) -} - -func (m *Manager) RunBackup(ctx context.Context) error { - if err := m.acquireRunning(); err != nil { - return err - } - defer m.releaseRunning() - return m.runBackupInternal(ctx) -} - -func (m *Manager) RunFullBackup(ctx context.Context) error { - if err := m.acquireRunning(); err != nil { - return err - } - defer m.releaseRunning() - - if err := m.runDBDumpsInternal(ctx); err != nil { - m.logger.Printf("[WARN] DB dump had errors, continuing with backup anyway") - } - return m.runBackupInternal(ctx) -} - -// Helper methods -func (m *Manager) acquireRunning() error { - m.mu.Lock() - defer m.mu.Unlock() - if m.running { - return fmt.Errorf("backup already in progress") - } - m.running = true - return nil -} - -func (m *Manager) releaseRunning() { - m.mu.Lock() - m.running = false - m.mu.Unlock() -} - -// Internal methods — no guard, caller must hold running flag -func (m *Manager) runDBDumpsInternal(ctx context.Context) error { - // ... current RunDBDumps body (without the running guard) -} - -func (m *Manager) runBackupInternal(ctx context.Context) error { - // ... current RunBackup body (without the running guard) -} -``` - -**Option B (simpler):** Only add the guard to `RunBackup()` and `RunDBDumps()`, but make `RunFullBackup()` NOT call them — instead inline the logic. This duplicates code but avoids the refactor. - -**Option A is recommended** — it's cleaner and ensures all entry points are guarded. - ---- - -## Bug 3 (MEDIUM): `ValidateDestination` silently succeeds when disk usage can't be read - -### File -`internal/backup/crossdrive.go`, function `ValidateDestination`, line 215 - -### Root Cause -```go -if di := system.GetDiskUsage(path); di != nil { - // Space checks only execute if di != nil - ... -} -return nil // ← Success even when di was nil (space unknown) -``` - -If `system.GetDiskUsage(path)` returns `nil` (e.g., unsupported filesystem like FUSE/NFS, permission issue, or a path on a virtual filesystem), all space checks are skipped and validation passes without any space verification. - -### Impact -A destination with zero free space could pass validation, leading to a backup failure mid-operation. The rsync or restic process would fail with "no space left on device", but only after partial work, leaving an incomplete backup. - -### Fix Instructions -After the `GetDiskUsage` check, add a warning log and optionally block: +In `GetFullStatus()` in `controller/internal/backup/backup.go` (~line 973-992), when `LastDBDump` is nil (always nil after restart), a synthetic `LastDBDump` is created from `DumpFiles`. But the synthesized `DumpResult` does NOT copy the `Validation` field from `DumpFileInfo`: ```go -di := system.GetDiskUsage(path) -if di == nil { - r.logger.Printf("[WARN] Cannot determine disk usage for %s — proceeding without space verification", path) - return nil -} +// Current code — BUG: missing Validation +results = append(results, DumpResult{ + DB: DiscoveredDB{StackName: f.StackName, DBType: f.DBType, ContainerName: f.StackName}, + FilePath: f.FileName, + Size: f.Size, +}) ``` -This preserves backward compatibility (doesn't block) but makes the situation visible in logs. If you want to be stricter, return an error instead: +The `DumpResult` struct has a `Validation DumpValidation` field (see `controller/internal/backup/dbdump.go:41`), but it's left zero-valued (`Valid: false, Error: ""`). The template then renders "–" for this case. + +Meanwhile, `ListDumpFiles()` in `dbdump.go:342` DOES call `ValidateDump()` on each file and populates `DumpFileInfo.Validation` correctly. The data is there — it's just not copied into the synthesized `DumpResult`. + +### Fix + +This is a one-line fix. In `GetFullStatus()` (~line 978), add `Validation: f.Validation` to the synthesized `DumpResult`: ```go -if di == nil { - return fmt.Errorf("destination %s: cannot determine disk usage", path) -} +results = append(results, DumpResult{ + DB: DiscoveredDB{StackName: f.StackName, DBType: f.DBType, ContainerName: f.StackName}, + FilePath: f.FileName, + Size: f.Size, + Validation: f.Validation, // <-- ADD THIS LINE +}) ``` -**Recommended:** Use the warning-only approach for now. A destination that can't report disk usage is unusual and worth logging, but blocking it could break setups with exotic filesystems (CIFS mounts, etc.). +**File:** `controller/internal/backup/backup.go`, inside `GetFullStatus()`, in the "Synthesize LastDBDump from DumpFiles" block. ---- +## Build & Deploy -## Bug 4 (MEDIUM): Empty `systemDataPath` produces relative dump paths +Version: **v0.15.2** -### File -`internal/backup/backup.go`, function `DumpStackDB`, line 570 - -### Root Cause -```go -drivePath := m.GetAppDrivePath(stackName) // Returns "" if stackProvider is nil AND systemDataPath is "" -dumpDir := AppDBDumpPath(drivePath, stackName) // AppDBDumpPath("", "mealie") = "backups/primary/mealie/db-dumps" (RELATIVE!) +### Build workflow +```bash +SSH=/c/Windows/System32/OpenSSH/ssh.exe +# 1. Commit & push +cd e:/git/deploy-felhom-compose +git add -A && git commit -m "v0.15.2: Fix snapshot stats and DB validation loss on restart" && git push +# 2. Build +$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller && git -C ~/git/deploy-felhom-compose pull && ./build.sh v0.15.2 --push" +# 3. Deploy +$SSH kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:v0.15.2 && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:v0.15.2|' docker-compose.yml && sudo docker compose up -d" +# 4. Verify +$SSH kisfenyo@192.168.0.162 "docker ps --filter name=felhom-controller --format '{{.Image}} {{.Status}}'" ``` -If `systemDataPath` is empty (misconfiguration or missing `system_data_path` in controller.yaml) and the stack has no HDD_PATH, `GetAppDrivePath()` returns an empty string. `AppDBDumpPath("", stackName)` then produces a **relative path** instead of an absolute one. The `DumpOne()` function creates directories relative to the process's working directory (typically `/opt/docker/felhom-controller/`), orphaning the dumps where nothing else looks for them. +### Compile check +Always run `go build ./...` in `controller/` before committing to ensure no compile errors. -The same issue exists in `RunDBDumps()` (line 212) and `RunBackup()` (line 308, 321). +## Documentation -### Impact -- DB dumps written to unexpected location -- Restic backup doesn't find them (looks in the expected absolute path) -- Backup appears to succeed but DB dumps are orphaned -- Restore would not have fresh DB dumps +Add a CHANGELOG.md entry at the top (under `## Changelog`). Read the first 30 lines to see the format, then insert a new entry. Example: -### Fix Instructions -Add a validation check in `GetAppDrivePath()`: +```markdown +### What was just completed (2026-02-19 session 52) +- **v0.15.2 — Fix data loss on container restart (2 bugs):** -```go -func (m *Manager) GetAppDrivePath(stackName string) string { - if m.stackProvider != nil { - if hddPath := m.stackProvider.GetStackHDDPath(stackName); hddPath != "" { - return hddPath - } - } - if m.systemDataPath == "" { - m.logger.Printf("[ERROR] systemDataPath is empty — cannot determine drive for %s", stackName) - } - return m.systemDataPath -} + **Bug 1:** Snapshot history delta stats (HOZZÁADOTT, ÚJ FÁJL, VÁLTOZOTT) showed 0 after container restart because restic doesn't store these stats — they were only in memory. Fixed by persisting the snapshot history ring buffer to `data/snapshot-history.json`. On startup, persisted stats are merged with restic repo snapshots. + + **Bug 2:** DB validation (ÉRVÉNYESÍTÉS column) showed "–" after restart because the synthesized `LastDBDump.Results` didn't copy `Validation` from `DumpFileInfo`. One-line fix: added `Validation: f.Validation` to the synthesized `DumpResult` in `GetFullStatus()`. + + **Files modified:** `internal/backup/backup.go` ``` -Also add a startup validation in `NewManager()`: +Update version in `C:\Users\User\.claude\projects\e--git\memory\MEMORY.md` to `v0.15.2`. -```go -func NewManager(cfg *config.Config, pinger *monitor.Pinger, sett *settings.Settings, logger *log.Logger) *Manager { - if cfg.Paths.SystemDataPath == "" { - logger.Printf("[WARN] SystemDataPath is empty in config — SSD-only apps will not have correct backup paths") - } - return &Manager{ - // ... existing fields - } -} -``` +## Verification -And in `DumpStackDB()`, add a guard before using the path: - -```go -drivePath := m.GetAppDrivePath(stackName) -if drivePath == "" || !filepath.IsAbs(drivePath) { - return fmt.Errorf("cannot determine absolute drive path for %s (systemDataPath not configured?)", stackName) -} -dumpDir := AppDBDumpPath(drivePath, stackName) -``` - ---- - -## Testing Checklist - -After fixing all bugs, verify: - -- [ ] **Bug 1:** Run cross-drive rsync backup twice for a single-mount app → `_db/` and `_config/` persist between runs (not deleted by `--delete`) -- [ ] **Bug 1:** Run cross-drive rsync backup for a multi-mount app → behavior unchanged -- [ ] **Bug 2:** During nightly scheduled backup, UI shows "Mentés folyamatban" on dashboard -- [ ] **Bug 2:** During nightly scheduled backup, "Visszaállítás" button shows "already in progress" error -- [ ] **Bug 2:** `RunFullBackup()` still works correctly (calls internal methods, doesn't deadlock) -- [ ] **Bug 2:** Scheduled `db-dump` and `backup` jobs don't deadlock with each other (they run at different times, but if one overruns, the next should get "already in progress" error) -- [ ] **Bug 3:** Cross-drive backup to a path where `GetDiskUsage` returns nil → logs a warning (not a silent pass) -- [ ] **Bug 4:** With `system_data_path: ""` in config → startup log warning + `DumpStackDB()` returns error instead of using relative path -- [ ] Build succeeds: `go build ./...` - ---- - -## Implementation Notes - -- **Do NOT change** any template files, CSS, or UI text — only Go backend files -- **Do NOT change** any function signatures that are part of the public API (other packages import them) -- **Do NOT change** the scheduler wiring in `main.go` — the fix should be in the backup package -- **Run `go vet ./...` and `go build ./...`** after all changes to verify no compilation errors -- Keep all log messages in English (UI text is Hungarian, but log messages are English) - ---- - -## Files to Modify - -| File | Bug(s) | Changes | -|------|--------|---------| -| `internal/backup/crossdrive.go` | 1, 3 | Add rsync `--exclude` flags; add nil check for GetDiskUsage | -| `internal/backup/backup.go` | 2, 4 | Extract `acquireRunning`/`releaseRunning` + internal methods; add systemDataPath validation | +After deploying v0.15.2: +1. Navigate to /backups — verify Pillanatképek shows actual values (if a backup has been run before the restart) +2. Restart the container: `sudo docker compose restart` +3. Refresh /backups — Pillanatképek should still show the stats from before restart +4. DB Adatbázisok table should show table counts (e.g., "42 tábla") in ÉRVÉNYESÍTÉS column, not "–"