Files

T

admin 54dc74a525 Bug Fixes from v0.12.4–v0.13.1 Code Review

2026-02-18 19:40:50 +01:00

14 KiB

Raw Blame History

TASK.md — Bug Fixes from v0.12.4–v0.13.1 Code Review

Version: v0.14.1 → v0.14.2 Type: Bug fixes identified by thorough code review of today's changes (sessions 40–48) Scope: Controller Go code — internal/backup/, internal/web/

Bug Summary

#	Severity	File	Description
1	HIGH	`crossdrive.go`	rsync `--delete` destroys `_db/` and `_config/` directories on every single-mount run
2	MEDIUM	`backup.go`	Scheduled backups don't set `m.running` flag — restore/manual backup can overlap
3	MEDIUM	`crossdrive.go`	`ValidateDestination` silently succeeds when disk usage can't be read
4	MEDIUM	`backup.go`	Empty `systemDataPath` silently produces relative dump paths

Bug 1 (HIGH): rsync `--delete` destroys `_db/` and `_config/` directories

File

internal/backup/crossdrive.go, function runRsyncBackup, lines 242–276

Root Cause

When an app has exactly one HDD mount (len(mounts) == 1), rsync copies directly into destDir/ with the --delete flag. The _db/ and _config/ subdirectories — created by the DB dump copy (lines 278–286) and config rsync (lines 288–303) from the previous backup run — don't exist in the source mount, so --delete removes them from the destination before the current run re-creates them.

Impact

Brief window of incomplete backup on every run — between rsync completion and the subsequent _db/_config copy steps, the backup destination is missing DB dumps and config
Data loss if interrupted — if the process is killed (OOM, power loss, context cancellation) between the rsync step and the copy steps, _db/ and _config/ are gone until the next successful complete run
Wasted I/O — every run deletes and re-creates these directories unnecessarily

Reproduction

Deploy Immich with a single HDD mount to /mnt/hdd_1/storage/immich
Configure cross-drive rsync backup to /mnt/sys_drive
Run backup twice
After first run: ls /mnt/sys_drive/backups/secondary/immich/rsync/ shows _db/, _config/, and app data
During second run's rsync phase: _db/ and _config/ disappear from dest
After second run completes: _db/ and _config/ reappear (re-created)

Fix Instructions

In runRsyncBackup(), add an --exclude flag with pattern _* to the rsync command. This prevents --delete from touching any controller-managed directories (which all use underscore prefix: _db, _config, and any future ones) while still cleaning up stale user data.

Change the rsync command construction (around line 267) from:

cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete",
    "--exclude", "backups/*.sql.gz",
    "--exclude", "backups/*.sql",
    "--exclude", "backups/*.dump",
    src, dst)

To:

cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete",
    "--exclude", "_*",
    "--exclude", "backups/*.sql.gz",
    "--exclude", "backups/*.sql",
    "--exclude", "backups/*.dump",
    src, dst)

Why _* instead of listing _db and _config individually:

Future-proof — if we ever add another controller-managed subdirectory (e.g., _metadata), it's automatically protected
The underscore prefix is a convention for controller-managed dirs; no real app data directory starts with _
Simpler to maintain (one pattern vs growing list)

Why this is safe: The _db and _config directories are managed exclusively by the controller (lines 278–303), never by the user's app. No user data directory starts with _. The underscore prefix convention was specifically chosen to avoid collision with app data paths.

Note on _infra: The _infra/ directory lives at <dest>/backups/secondary/_infra/ (a sibling of app directories, NOT inside any app's rsync target directory). So it's not affected by --delete regardless. The _* exclude is still good practice in case the directory structure changes later.

For the multi-mount case (len(mounts) > 1): rsync goes into per-leaf subdirectories, not the root destDir, so _db/ and _config/ are siblings (not children) of the rsync targets. The --delete flag only affects the target subtree, so they're safe in the multi-mount case. Adding the exclude universally doesn't hurt.

Bug 2 (MEDIUM): Scheduled backups don't set `m.running` flag

File

internal/backup/backup.go, functions RunBackup() (line 272) and RunDBDumps() (line 181)

Root Cause

Only RunFullBackup() (line 454) sets m.running = true before running. The scheduler in main.go calls RunBackup() and RunDBDumps() directly (not via RunFullBackup()), so m.running is never set during nightly scheduled runs.

Code that checks m.running:

RestoreApp() in restore.go:25 — blocks if running
RunFullBackup() in backup.go:456 — blocks if running
IsRunning() in backup.go:496 — used by API and UI to show status

Impact

UI shows "not running" during nightly backups — the dashboard backup card and backup page don't indicate that a backup is in progress
Restore can overlap with nightly backup — a user triggering "Restore" at 03:05 while the nightly restic backup is running won't see a "backup in progress" error. Both operations compete for the restic repo lock; one will fail with an opaque error
Manual "Backup now" can overlap with scheduled backup — RunFullBackup() checks m.running, but since the scheduled RunBackup() doesn't set it, the manual trigger proceeds and both run concurrently

Fix Instructions

Add m.running guard to both RunBackup() and RunDBDumps(). The pattern is the same as in RunFullBackup():

In RunBackup() (line 272), add at the beginning of the function:

func (m *Manager) RunBackup(ctx context.Context) error {
	m.mu.Lock()
	if m.running {
		m.mu.Unlock()
		return fmt.Errorf("backup already in progress")
	}
	m.running = true
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		m.running = false
		m.mu.Unlock()
	}()

	start := time.Now()
	// ... rest of existing function

In RunDBDumps() (line 181), add the same guard:

func (m *Manager) RunDBDumps(ctx context.Context) error {
	m.mu.Lock()
	if m.running {
		m.mu.Unlock()
		return fmt.Errorf("backup already in progress")
	}
	m.running = true
	m.mu.Unlock()
	defer func() {
		m.mu.Lock()
		m.running = false
		m.mu.Unlock()
	}()

	start := time.Now()
	// ... rest of existing function

IMPORTANT: RunFullBackup() calls RunDBDumps() then RunBackup() internally. After adding the guards above, RunFullBackup() would deadlock on itself (it sets m.running=true, then calls RunDBDumps() which also tries to set it). Fix by either:

Option A (recommended): Extract the logic into internal methods that DON'T check the flag:

// Public methods — set the running guard
func (m *Manager) RunDBDumps(ctx context.Context) error {
	if err := m.acquireRunning(); err != nil {
		return err
	}
	defer m.releaseRunning()
	return m.runDBDumpsInternal(ctx)
}

func (m *Manager) RunBackup(ctx context.Context) error {
	if err := m.acquireRunning(); err != nil {
		return err
	}
	defer m.releaseRunning()
	return m.runBackupInternal(ctx)
}

func (m *Manager) RunFullBackup(ctx context.Context) error {
	if err := m.acquireRunning(); err != nil {
		return err
	}
	defer m.releaseRunning()

	if err := m.runDBDumpsInternal(ctx); err != nil {
		m.logger.Printf("[WARN] DB dump had errors, continuing with backup anyway")
	}
	return m.runBackupInternal(ctx)
}

// Helper methods
func (m *Manager) acquireRunning() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.running {
		return fmt.Errorf("backup already in progress")
	}
	m.running = true
	return nil
}

func (m *Manager) releaseRunning() {
	m.mu.Lock()
	m.running = false
	m.mu.Unlock()
}

// Internal methods — no guard, caller must hold running flag
func (m *Manager) runDBDumpsInternal(ctx context.Context) error {
	// ... current RunDBDumps body (without the running guard)
}

func (m *Manager) runBackupInternal(ctx context.Context) error {
	// ... current RunBackup body (without the running guard)
}

Option B (simpler): Only add the guard to RunBackup() and RunDBDumps(), but make RunFullBackup() NOT call them — instead inline the logic. This duplicates code but avoids the refactor.

Option A is recommended — it's cleaner and ensures all entry points are guarded.

Bug 3 (MEDIUM): `ValidateDestination` silently succeeds when disk usage can't be read

File

internal/backup/crossdrive.go, function ValidateDestination, line 215

Root Cause

if di := system.GetDiskUsage(path); di != nil {
    // Space checks only execute if di != nil
    ...
}
return nil  // ← Success even when di was nil (space unknown)

If system.GetDiskUsage(path) returns nil (e.g., unsupported filesystem like FUSE/NFS, permission issue, or a path on a virtual filesystem), all space checks are skipped and validation passes without any space verification.

Impact

A destination with zero free space could pass validation, leading to a backup failure mid-operation. The rsync or restic process would fail with "no space left on device", but only after partial work, leaving an incomplete backup.

Fix Instructions

After the GetDiskUsage check, add a warning log and optionally block:

di := system.GetDiskUsage(path)
if di == nil {
    r.logger.Printf("[WARN] Cannot determine disk usage for %s — proceeding without space verification", path)
    return nil
}

This preserves backward compatibility (doesn't block) but makes the situation visible in logs. If you want to be stricter, return an error instead:

if di == nil {
    return fmt.Errorf("destination %s: cannot determine disk usage", path)
}

Recommended: Use the warning-only approach for now. A destination that can't report disk usage is unusual and worth logging, but blocking it could break setups with exotic filesystems (CIFS mounts, etc.).

Bug 4 (MEDIUM): Empty `systemDataPath` produces relative dump paths

File

internal/backup/backup.go, function DumpStackDB, line 570

Root Cause

drivePath := m.GetAppDrivePath(stackName)  // Returns "" if stackProvider is nil AND systemDataPath is ""
dumpDir := AppDBDumpPath(drivePath, stackName)  // AppDBDumpPath("", "mealie") = "backups/primary/mealie/db-dumps" (RELATIVE!)

If systemDataPath is empty (misconfiguration or missing system_data_path in controller.yaml) and the stack has no HDD_PATH, GetAppDrivePath() returns an empty string. AppDBDumpPath("", stackName) then produces a relative path instead of an absolute one. The DumpOne() function creates directories relative to the process's working directory (typically /opt/docker/felhom-controller/), orphaning the dumps where nothing else looks for them.

The same issue exists in RunDBDumps() (line 212) and RunBackup() (line 308, 321).

Impact

DB dumps written to unexpected location
Restic backup doesn't find them (looks in the expected absolute path)
Backup appears to succeed but DB dumps are orphaned
Restore would not have fresh DB dumps

Fix Instructions

Add a validation check in GetAppDrivePath():

func (m *Manager) GetAppDrivePath(stackName string) string {
	if m.stackProvider != nil {
		if hddPath := m.stackProvider.GetStackHDDPath(stackName); hddPath != "" {
			return hddPath
		}
	}
	if m.systemDataPath == "" {
		m.logger.Printf("[ERROR] systemDataPath is empty — cannot determine drive for %s", stackName)
	}
	return m.systemDataPath
}

Also add a startup validation in NewManager():

func NewManager(cfg *config.Config, pinger *monitor.Pinger, sett *settings.Settings, logger *log.Logger) *Manager {
	if cfg.Paths.SystemDataPath == "" {
		logger.Printf("[WARN] SystemDataPath is empty in config — SSD-only apps will not have correct backup paths")
	}
	return &Manager{
		// ... existing fields
	}
}

And in DumpStackDB(), add a guard before using the path:

drivePath := m.GetAppDrivePath(stackName)
if drivePath == "" || !filepath.IsAbs(drivePath) {
    return fmt.Errorf("cannot determine absolute drive path for %s (systemDataPath not configured?)", stackName)
}
dumpDir := AppDBDumpPath(drivePath, stackName)

Testing Checklist

After fixing all bugs, verify:

Bug 1: Run cross-drive rsync backup twice for a single-mount app → _db/ and _config/ persist between runs (not deleted by --delete)
Bug 1: Run cross-drive rsync backup for a multi-mount app → behavior unchanged
Bug 2: During nightly scheduled backup, UI shows "Mentés folyamatban" on dashboard
Bug 2: During nightly scheduled backup, "Visszaállítás" button shows "already in progress" error
Bug 2: RunFullBackup() still works correctly (calls internal methods, doesn't deadlock)
Bug 2: Scheduled db-dump and backup jobs don't deadlock with each other (they run at different times, but if one overruns, the next should get "already in progress" error)
Bug 3: Cross-drive backup to a path where GetDiskUsage returns nil → logs a warning (not a silent pass)
Bug 4: With system_data_path: "" in config → startup log warning + DumpStackDB() returns error instead of using relative path
Build succeeds: go build ./...

Implementation Notes

Do NOT change any template files, CSS, or UI text — only Go backend files
Do NOT change any function signatures that are part of the public API (other packages import them)
Do NOT change the scheduler wiring in main.go — the fix should be in the backup package
Run go vet ./... and go build ./... after all changes to verify no compilation errors
Keep all log messages in English (UI text is Hungarian, but log messages are English)

Files to Modify

File	Bug(s)	Changes
`internal/backup/crossdrive.go`	1, 3	Add rsync `--exclude` flags; add nil check for GetDiskUsage
`internal/backup/backup.go`	2, 4	Extract `acquireRunning`/`releaseRunning` + internal methods; add systemDataPath validation

14 KiB Raw Blame History Unescape Escape

TASK.md — Bug Fixes from v0.12.4–v0.13.1 Code Review

Bug Summary

Bug 1 (HIGH): rsync --delete destroys _db/ and _config/ directories

File

Root Cause

Impact

Reproduction

Fix Instructions

Bug 2 (MEDIUM): Scheduled backups don't set m.running flag

File

Root Cause

Impact

Fix Instructions

Bug 3 (MEDIUM): ValidateDestination silently succeeds when disk usage can't be read

File

Root Cause

Impact

Fix Instructions

Bug 4 (MEDIUM): Empty systemDataPath produces relative dump paths

File

Root Cause

Impact

Fix Instructions

Testing Checklist

Implementation Notes

Files to Modify

14 KiB

Raw Blame History

Bug 1 (HIGH): rsync `--delete` destroys `_db/` and `_config/` directories

Bug 2 (MEDIUM): Scheduled backups don't set `m.running` flag

Bug 3 (MEDIUM): `ValidateDestination` silently succeeds when disk usage can't be read

Bug 4 (MEDIUM): Empty `systemDataPath` produces relative dump paths