Files
deploy-felhom-compose/TASK.md
T

340 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# TASK.md — Bug Fixes from v0.12.4v0.13.1 Code Review
**Version:** v0.14.1 → v0.14.2
**Type:** Bug fixes identified by thorough code review of today's changes (sessions 4048)
**Scope:** Controller Go code — `internal/backup/`, `internal/web/`
---
## Bug Summary
| # | Severity | File | Description |
|---|----------|------|-------------|
| 1 | **HIGH** | `crossdrive.go` | rsync `--delete` destroys `_db/` and `_config/` directories on every single-mount run |
| 2 | **MEDIUM** | `backup.go` | Scheduled backups don't set `m.running` flag — restore/manual backup can overlap |
| 3 | **MEDIUM** | `crossdrive.go` | `ValidateDestination` silently succeeds when disk usage can't be read |
| 4 | **MEDIUM** | `backup.go` | Empty `systemDataPath` silently produces relative dump paths |
---
## Bug 1 (HIGH): rsync `--delete` destroys `_db/` and `_config/` directories
### File
`internal/backup/crossdrive.go`, function `runRsyncBackup`, lines 242276
### Root Cause
When an app has exactly one HDD mount (`len(mounts) == 1`), rsync copies directly into `destDir/` with the `--delete` flag. The `_db/` and `_config/` subdirectories — created by the DB dump copy (lines 278286) and config rsync (lines 288303) from the **previous** backup run — don't exist in the source mount, so `--delete` removes them from the destination **before** the current run re-creates them.
### Impact
1. **Brief window of incomplete backup** on every run — between rsync completion and the subsequent `_db`/`_config` copy steps, the backup destination is missing DB dumps and config
2. **Data loss if interrupted** — if the process is killed (OOM, power loss, context cancellation) between the rsync step and the copy steps, `_db/` and `_config/` are gone until the next successful complete run
3. **Wasted I/O** — every run deletes and re-creates these directories unnecessarily
### Reproduction
1. Deploy Immich with a single HDD mount to `/mnt/hdd_1/storage/immich`
2. Configure cross-drive rsync backup to `/mnt/sys_drive`
3. Run backup twice
4. After first run: `ls /mnt/sys_drive/backups/secondary/immich/rsync/` shows `_db/`, `_config/`, and app data
5. During second run's rsync phase: `_db/` and `_config/` disappear from dest
6. After second run completes: `_db/` and `_config/` reappear (re-created)
### Fix Instructions
In `runRsyncBackup()`, add an `--exclude` flag with pattern `_*` to the rsync command. This prevents `--delete` from touching any controller-managed directories (which all use underscore prefix: `_db`, `_config`, and any future ones) while still cleaning up stale user data.
**Change the rsync command construction (around line 267) from:**
```go
cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete",
"--exclude", "backups/*.sql.gz",
"--exclude", "backups/*.sql",
"--exclude", "backups/*.dump",
src, dst)
```
**To:**
```go
cmd := exec.CommandContext(ctx, "rsync", "-a", "--delete",
"--exclude", "_*",
"--exclude", "backups/*.sql.gz",
"--exclude", "backups/*.sql",
"--exclude", "backups/*.dump",
src, dst)
```
**Why `_*` instead of listing `_db` and `_config` individually:**
- Future-proof — if we ever add another controller-managed subdirectory (e.g., `_metadata`), it's automatically protected
- The underscore prefix is a convention for controller-managed dirs; no real app data directory starts with `_`
- Simpler to maintain (one pattern vs growing list)
**Why this is safe:** The `_db` and `_config` directories are managed exclusively by the controller (lines 278303), never by the user's app. No user data directory starts with `_`. The underscore prefix convention was specifically chosen to avoid collision with app data paths.
**Note on `_infra`:** The `_infra/` directory lives at `<dest>/backups/secondary/_infra/` (a sibling of app directories, NOT inside any app's rsync target directory). So it's not affected by `--delete` regardless. The `_*` exclude is still good practice in case the directory structure changes later.
**For the multi-mount case** (`len(mounts) > 1`): rsync goes into per-leaf subdirectories, not the root `destDir`, so `_db/` and `_config/` are siblings (not children) of the rsync targets. The `--delete` flag only affects the target subtree, so they're safe in the multi-mount case. Adding the exclude universally doesn't hurt.
---
## Bug 2 (MEDIUM): Scheduled backups don't set `m.running` flag
### File
`internal/backup/backup.go`, functions `RunBackup()` (line 272) and `RunDBDumps()` (line 181)
### Root Cause
Only `RunFullBackup()` (line 454) sets `m.running = true` before running. The scheduler in `main.go` calls `RunBackup()` and `RunDBDumps()` directly (not via `RunFullBackup()`), so `m.running` is never set during nightly scheduled runs.
Code that checks `m.running`:
- `RestoreApp()` in `restore.go:25` — blocks if running
- `RunFullBackup()` in `backup.go:456` — blocks if running
- `IsRunning()` in `backup.go:496` — used by API and UI to show status
### Impact
1. **UI shows "not running" during nightly backups** — the dashboard backup card and backup page don't indicate that a backup is in progress
2. **Restore can overlap with nightly backup** — a user triggering "Restore" at 03:05 while the nightly restic backup is running won't see a "backup in progress" error. Both operations compete for the restic repo lock; one will fail with an opaque error
3. **Manual "Backup now" can overlap with scheduled backup**`RunFullBackup()` checks `m.running`, but since the scheduled `RunBackup()` doesn't set it, the manual trigger proceeds and both run concurrently
### Fix Instructions
Add `m.running` guard to both `RunBackup()` and `RunDBDumps()`. The pattern is the same as in `RunFullBackup()`:
**In `RunBackup()` (line 272), add at the beginning of the function:**
```go
func (m *Manager) RunBackup(ctx context.Context) error {
m.mu.Lock()
if m.running {
m.mu.Unlock()
return fmt.Errorf("backup already in progress")
}
m.running = true
m.mu.Unlock()
defer func() {
m.mu.Lock()
m.running = false
m.mu.Unlock()
}()
start := time.Now()
// ... rest of existing function
```
**In `RunDBDumps()` (line 181), add the same guard:**
```go
func (m *Manager) RunDBDumps(ctx context.Context) error {
m.mu.Lock()
if m.running {
m.mu.Unlock()
return fmt.Errorf("backup already in progress")
}
m.running = true
m.mu.Unlock()
defer func() {
m.mu.Lock()
m.running = false
m.mu.Unlock()
}()
start := time.Now()
// ... rest of existing function
```
**IMPORTANT:** `RunFullBackup()` calls `RunDBDumps()` then `RunBackup()` internally. After adding the guards above, `RunFullBackup()` would deadlock on itself (it sets `m.running=true`, then calls `RunDBDumps()` which also tries to set it). Fix by either:
**Option A (recommended):** Extract the logic into internal methods that DON'T check the flag:
```go
// Public methods — set the running guard
func (m *Manager) RunDBDumps(ctx context.Context) error {
if err := m.acquireRunning(); err != nil {
return err
}
defer m.releaseRunning()
return m.runDBDumpsInternal(ctx)
}
func (m *Manager) RunBackup(ctx context.Context) error {
if err := m.acquireRunning(); err != nil {
return err
}
defer m.releaseRunning()
return m.runBackupInternal(ctx)
}
func (m *Manager) RunFullBackup(ctx context.Context) error {
if err := m.acquireRunning(); err != nil {
return err
}
defer m.releaseRunning()
if err := m.runDBDumpsInternal(ctx); err != nil {
m.logger.Printf("[WARN] DB dump had errors, continuing with backup anyway")
}
return m.runBackupInternal(ctx)
}
// Helper methods
func (m *Manager) acquireRunning() error {
m.mu.Lock()
defer m.mu.Unlock()
if m.running {
return fmt.Errorf("backup already in progress")
}
m.running = true
return nil
}
func (m *Manager) releaseRunning() {
m.mu.Lock()
m.running = false
m.mu.Unlock()
}
// Internal methods — no guard, caller must hold running flag
func (m *Manager) runDBDumpsInternal(ctx context.Context) error {
// ... current RunDBDumps body (without the running guard)
}
func (m *Manager) runBackupInternal(ctx context.Context) error {
// ... current RunBackup body (without the running guard)
}
```
**Option B (simpler):** Only add the guard to `RunBackup()` and `RunDBDumps()`, but make `RunFullBackup()` NOT call them — instead inline the logic. This duplicates code but avoids the refactor.
**Option A is recommended** — it's cleaner and ensures all entry points are guarded.
---
## Bug 3 (MEDIUM): `ValidateDestination` silently succeeds when disk usage can't be read
### File
`internal/backup/crossdrive.go`, function `ValidateDestination`, line 215
### Root Cause
```go
if di := system.GetDiskUsage(path); di != nil {
// Space checks only execute if di != nil
...
}
return nil // ← Success even when di was nil (space unknown)
```
If `system.GetDiskUsage(path)` returns `nil` (e.g., unsupported filesystem like FUSE/NFS, permission issue, or a path on a virtual filesystem), all space checks are skipped and validation passes without any space verification.
### Impact
A destination with zero free space could pass validation, leading to a backup failure mid-operation. The rsync or restic process would fail with "no space left on device", but only after partial work, leaving an incomplete backup.
### Fix Instructions
After the `GetDiskUsage` check, add a warning log and optionally block:
```go
di := system.GetDiskUsage(path)
if di == nil {
r.logger.Printf("[WARN] Cannot determine disk usage for %s — proceeding without space verification", path)
return nil
}
```
This preserves backward compatibility (doesn't block) but makes the situation visible in logs. If you want to be stricter, return an error instead:
```go
if di == nil {
return fmt.Errorf("destination %s: cannot determine disk usage", path)
}
```
**Recommended:** Use the warning-only approach for now. A destination that can't report disk usage is unusual and worth logging, but blocking it could break setups with exotic filesystems (CIFS mounts, etc.).
---
## Bug 4 (MEDIUM): Empty `systemDataPath` produces relative dump paths
### File
`internal/backup/backup.go`, function `DumpStackDB`, line 570
### Root Cause
```go
drivePath := m.GetAppDrivePath(stackName) // Returns "" if stackProvider is nil AND systemDataPath is ""
dumpDir := AppDBDumpPath(drivePath, stackName) // AppDBDumpPath("", "mealie") = "backups/primary/mealie/db-dumps" (RELATIVE!)
```
If `systemDataPath` is empty (misconfiguration or missing `system_data_path` in controller.yaml) and the stack has no HDD_PATH, `GetAppDrivePath()` returns an empty string. `AppDBDumpPath("", stackName)` then produces a **relative path** instead of an absolute one. The `DumpOne()` function creates directories relative to the process's working directory (typically `/opt/docker/felhom-controller/`), orphaning the dumps where nothing else looks for them.
The same issue exists in `RunDBDumps()` (line 212) and `RunBackup()` (line 308, 321).
### Impact
- DB dumps written to unexpected location
- Restic backup doesn't find them (looks in the expected absolute path)
- Backup appears to succeed but DB dumps are orphaned
- Restore would not have fresh DB dumps
### Fix Instructions
Add a validation check in `GetAppDrivePath()`:
```go
func (m *Manager) GetAppDrivePath(stackName string) string {
if m.stackProvider != nil {
if hddPath := m.stackProvider.GetStackHDDPath(stackName); hddPath != "" {
return hddPath
}
}
if m.systemDataPath == "" {
m.logger.Printf("[ERROR] systemDataPath is empty — cannot determine drive for %s", stackName)
}
return m.systemDataPath
}
```
Also add a startup validation in `NewManager()`:
```go
func NewManager(cfg *config.Config, pinger *monitor.Pinger, sett *settings.Settings, logger *log.Logger) *Manager {
if cfg.Paths.SystemDataPath == "" {
logger.Printf("[WARN] SystemDataPath is empty in config — SSD-only apps will not have correct backup paths")
}
return &Manager{
// ... existing fields
}
}
```
And in `DumpStackDB()`, add a guard before using the path:
```go
drivePath := m.GetAppDrivePath(stackName)
if drivePath == "" || !filepath.IsAbs(drivePath) {
return fmt.Errorf("cannot determine absolute drive path for %s (systemDataPath not configured?)", stackName)
}
dumpDir := AppDBDumpPath(drivePath, stackName)
```
---
## Testing Checklist
After fixing all bugs, verify:
- [ ] **Bug 1:** Run cross-drive rsync backup twice for a single-mount app → `_db/` and `_config/` persist between runs (not deleted by `--delete`)
- [ ] **Bug 1:** Run cross-drive rsync backup for a multi-mount app → behavior unchanged
- [ ] **Bug 2:** During nightly scheduled backup, UI shows "Mentés folyamatban" on dashboard
- [ ] **Bug 2:** During nightly scheduled backup, "Visszaállítás" button shows "already in progress" error
- [ ] **Bug 2:** `RunFullBackup()` still works correctly (calls internal methods, doesn't deadlock)
- [ ] **Bug 2:** Scheduled `db-dump` and `backup` jobs don't deadlock with each other (they run at different times, but if one overruns, the next should get "already in progress" error)
- [ ] **Bug 3:** Cross-drive backup to a path where `GetDiskUsage` returns nil → logs a warning (not a silent pass)
- [ ] **Bug 4:** With `system_data_path: ""` in config → startup log warning + `DumpStackDB()` returns error instead of using relative path
- [ ] Build succeeds: `go build ./...`
---
## Implementation Notes
- **Do NOT change** any template files, CSS, or UI text — only Go backend files
- **Do NOT change** any function signatures that are part of the public API (other packages import them)
- **Do NOT change** the scheduler wiring in `main.go` — the fix should be in the backup package
- **Run `go vet ./...` and `go build ./...`** after all changes to verify no compilation errors
- Keep all log messages in English (UI text is Hungarian, but log messages are English)
---
## Files to Modify
| File | Bug(s) | Changes |
|------|--------|---------|
| `internal/backup/crossdrive.go` | 1, 3 | Add rsync `--exclude` flags; add nil check for GetDiskUsage |
| `internal/backup/backup.go` | 2, 4 | Extract `acquireRunning`/`releaseRunning` + internal methods; add systemDataPath validation |