From 20b3a22c8832a2ddcc78b1cdd6b430d02f8eebed Mon Sep 17 00:00:00 2001 From: kisfenyo Date: Tue, 17 Feb 2026 20:46:50 +0100 Subject: [PATCH] Bug hunt and TASK.md for Sonnet --- TASK.md | 710 ++++++++++++++++++++++------ controller/README.md | 1045 +++++++++++++++++++++++++----------------- 2 files changed, 1190 insertions(+), 565 deletions(-) diff --git a/TASK.md b/TASK.md index b0e8062..65483a4 100644 --- a/TASK.md +++ b/TASK.md @@ -1,192 +1,616 @@ -# 0.12.2 - Bug 4: Restore Section ("Visszaállítás") Simplification — TASK.md +# TASK.md — Bug Fix Plan for Sonnet 4.5 -## Current State +## Prompt / Context Gathering -### Backup architecture (what goes where) +Before starting any fixes, read the following files to understand the project: -**Restic snapshots** (daily, on SSD): -- `/opt/docker/stacks/` — all Docker configs, compose files, app.yaml -- `/opt/docker/db-dumps/` — DB dump SQL files -- `/opt/docker/felhom-controller/controller.yaml` -- App HDD data paths (if backup enabled) — e.g., `/mnt/hdd_1/storage/immich/` +1. **`controller/README.md`** — Full architecture, module map, API endpoints, feature documentation +2. **`CHANGELOG.md`** — Recent changes (sessions 26-38, 2026-02-17) for context on what was implemented +3. **`controller/TASK.md`** (this file) — The bug list and fix instructions -**Cross-drive backups** (per-app, rsync/restic to secondary drive): -- Copies app user data between drives (e.g., HDD→SSD or SSD→HDD) -- Separate from restic — not part of snapshot restore +Then read each source file listed in a bug section before fixing it. -**Key detail:** `SnapshotInfo.Paths []string` from `restic snapshots --json` includes the backup source paths. Snapshots created BEFORE an app's HDD backup was enabled won't have that app's HDD paths listed. This can be used for filtering. - -### Current restore UI flow (backups.html Section 7) -1. **App dropdown** — only apps with `HasHDDData && BackupEnabled` (filters out auto-only apps) -2. **Snapshot dropdown** — loads ALL snapshots from `/api/backup/snapshots` (no filtering by app) -3. **Path display** — shows HDD paths that will be restored (technical, confusing for customers) -4. **Warning block** — "FELÜLÍRJA a jelenlegi adatait... NEM vonható vissza!... Javasoljuk az alkalmazás leállítását" -5. **Confirmation checkbox** — "Megértettem, visszaállítás saját felelősségre" -6. **Restore button** → POST `/backup/restore` with `stack_name` + `snapshot_id` - -### Current restore backend (`RestoreApp` in `restore.go`) -- Validates backup enabled + snapshot exists -- Resolves app's HDD paths via `GetStackHDDMounts()` -- Runs `restic restore {snapshot} --target / --include {path1} --include {path2}` -- Only restores HDD data — does NOT restore Docker configs or DB dumps +**After all fixes are complete**, update: +- `CHANGELOG.md` — Add a new entry at the top following the existing format (session 39, next version **v0.12.3**) +- `controller/README.md` — If any fix changes behavior, API, or architecture, update the relevant section --- -## Problems +## Fix Priority -1. **Snapshot selector shows ALL snapshots** — user can pick one that predates their app's backup setup → restore fails or does nothing -2. **Path display is confusing** — customers don't need to see `/mnt/hdd_1/storage/immich/` -3. **Warning is intimidating** — multiple strong statements, feels dangerous -4. **No auto-stop** — just "recommended" to stop the app, user can forget -5. **DB-only restore** is implicitly possible (pick old snapshot) but not clear what it does +Fixes are organized into 3 tiers: + +- **CRITICAL** — Data races, security vulnerabilities, data loss risks. Fix these first. +- **HIGH** — Logic errors, resource leaks, incorrect behavior. Fix these second. +- **MEDIUM** — Edge cases, code quality, minor issues. Fix if time allows. --- -## Proposed Design +## CRITICAL Fixes -### Simplified UI (3 elements instead of 6) +### C1. Data race on `m.lastDBDump.Results` mutation (backup/backup.go) -``` -┌─────────────────────────────────────────────────┐ -│ Visszaállítás │ -│ │ -│ Alkalmazás: [▼ Immich ] │ -│ Pillanatkép: [▼ 2026-02-17 03:00 (a3f2b1) ] │ -│ │ -│ ⚠ A visszaállítás felülírja az alkalmazás │ -│ jelenlegi adatait a kiválasztott mentés │ -│ állapotával. Az alkalmazás a folyamat során │ -│ automatikusan leáll és újraindul. │ -│ │ -│ ☐ Megértettem, visszaállítás indítása │ -│ │ -│ [ Visszaállítás indítása ] │ -└─────────────────────────────────────────────────┘ -``` +**File:** `internal/backup/backup.go`, around lines 556-570 -### Changes +**Problem:** In `RefreshCache()`, the code mutates `m.lastDBDump.Results[i].Validation` while holding NO lock. The mutex is only acquired later at line 574. Other goroutines reading `m.lastDBDump` via `GetFullStatus()` can see torn writes. -| # | What | Details | -|---|------|---------| -| 1 | **Filter snapshots by app** | When app selected, `/api/backup/snapshots?stack={name}` returns only snapshots whose `Paths` contain at least one of the app's HDD paths. Show snapshot time in human-friendly format, no raw IDs | -| 2 | **Remove path display** | Customer doesn't need to see mount paths. The restore handler knows what to include | -| 3 | **Single calm warning** | Replace 3-line warning + separate checkbox label with one concise block | -| 4 | **Auto-stop + restart** | Restore handler stops the app's containers before restore, restarts after. Eliminates "javasoljuk leállítását" advice | -| 5 | **Hide DB-only restore** | Not exposed in UI — support contacts Viktor for CLI-level `restic restore` | +**Fix:** Move the `m.lastDBDump.Results` mutation block (lines 556-570) inside the `m.mu.Lock()` section that starts at line 574. The re-validation loop should happen after acquiring the lock, before setting `m.cachedStatus`. --- -## Implementation Plan +### C2. SnapshotHistory reversed after unlock (backup/backup.go) -### Backend changes +**File:** `internal/backup/backup.go`, around lines 582-589 -**1. Filtered snapshots API** (`internal/web/api_router.go` or `router.go`) +**Problem:** `m.cachedStatus = status` is set at line 582 while holding the lock, but the snapshot history reversal at lines 587-589 happens AFTER `m.mu.Unlock()`. Since `status.SnapshotHistory` is a slice backed by the same array as `m.cachedStatus.SnapshotHistory` (created by `copy` which copies elements but shares nothing — wait, actually `copy` creates a new backing array, so this is safe for the array). However, the local `status` variable IS `m.cachedStatus` (assigned at line 582), so reversing `status.SnapshotHistory` also reverses `m.cachedStatus.SnapshotHistory` without the lock. -Extend `GET /api/backup/snapshots` with optional `?stack={name}` query param: +**Fix:** Move the reversal loop (lines 587-589) BEFORE `m.mu.Unlock()`, right after `copy(status.SnapshotHistory, m.snapshotHistory)` at line 581 and before assigning to `m.cachedStatus` at line 582. + +--- + +### C3. `SetStackProvider` write without synchronization (backup/backup.go) + +**File:** `internal/backup/backup.go`, line 413-415 + +**Problem:** `m.stackProvider = provider` is a direct field write with no mutex protection. The field is read by `resolveAppBackupPaths()` and `RefreshCache()` concurrently. + +**Fix:** Wrap the assignment in `m.mu.Lock()` / `m.mu.Unlock()`. + +--- + +### C4. `GetFullStatus` shallow-copies mutable pointers (backup/backup.go) + +**File:** `internal/backup/backup.go`, lines 619-620 + +**Problem:** `status.LastDBDump = m.lastDBDump` and `status.LastBackup = m.lastBackup` copy pointer values. Callers can mutate the referenced objects, creating races with the manager's internal state. + +**Fix:** Deep-copy these structs. If `lastDBDump` is non-nil, create a copy: `copyDump := *m.lastDBDump; status.LastDBDump = ©Dump`. Same for `lastBackup`. Also deep-copy the `Results` slice inside `lastDBDump` if present (since it's a slice of structs, a simple `copy` into a new slice suffices). + +--- + +### C5. `IsSystemDisk` uses 8-bit major mask (storage/safety_linux.go) + +**File:** `internal/storage/safety_linux.go`, lines 30-31 + +**Problem:** `major := (stat.Rdev >> 8) & 0xff` extracts only 8 bits. Linux dev_t uses a 12-bit major: bits 8-19 with bits 0-7 at positions 8-15 and bits 8-11 at positions 44-47 (for 64-bit). For the common case, the formula should use `unix.Major()` from `golang.org/x/sys/unix` or at minimum the standard extraction: `major = (dev & 0xfff00) >> 8`. Also, the function only compares major numbers — ALL SCSI disks share major 8, so this blocks formatting of any SCSI disk if root is also SCSI. + +**Fix:** Use `unix.Major(uint64(stat.Rdev))` and `unix.Minor(uint64(stat.Rdev))` from `golang.org/x/sys/unix` (already a dependency). Compare both major AND the disk portion of the minor to identify the physical disk. For SCSI (major 8), disks are distinguished by `minor / 16`. For NVMe (major 259), compare the whole-disk minor. The simplest correct approach: resolve both device paths to their parent disk name (e.g., via `partitionToParentDisk`) and compare those names instead of using device numbers. + +--- + +### C6. No `/dev/` prefix validation on `DevicePath` (storage/format_linux.go) + +**File:** `internal/storage/format_linux.go`, around lines 37-41 + +**Problem:** The `FormatAndMount` function accepts `DevicePath` from user JSON without validating it starts with `/dev/`. An attacker could pass arbitrary paths. + +**Fix:** At the start of `FormatAndMount`, validate: ```go -func (r *Router) snapshotsHandler(w http.ResponseWriter, req *http.Request) { - snapshots, err := r.backupMgr.ListSnapshots(50) - // ... existing error handling ... - - // Filter by stack if requested - if stackName := req.URL.Query().Get("stack"); stackName != "" { - hddMounts := r.stackProvider.GetStackHDDMounts(stackName) - if len(hddMounts) > 0 { - snapshots = filterSnapshotsByPaths(snapshots, hddMounts) - } - } - - writeJSON(w, http.StatusOK, apiResponse{OK: true, Data: snapshots}) +if !strings.HasPrefix(req.DevicePath, "/dev/") { + return fmt.Errorf("invalid device path: must start with /dev/") } +if strings.Contains(req.DevicePath, "..") { + return fmt.Errorf("invalid device path: must not contain ..") +} +``` -func filterSnapshotsByPaths(snapshots []backup.SnapshotInfo, requiredPaths []string) []backup.SnapshotInfo { - var filtered []backup.SnapshotInfo - for _, snap := range snapshots { - for _, req := range requiredPaths { - for _, sp := range snap.Paths { - if strings.HasPrefix(req, sp) || strings.HasPrefix(sp, req) { - filtered = append(filtered, snap) - goto next - } +--- + +### C7. Path traversal in `extractName` (api/router.go) + +**File:** `internal/api/router.go`, lines 769-771 + +**Problem:** `extractName()` extracts a stack name from the URL path but doesn't reject `..`, `/`, or `\` characters. This name is used in file paths and Docker commands. + +**Fix:** Add validation after extracting the name: +```go +func extractName(r *http.Request) string { + name := // ... existing extraction logic + if strings.ContainsAny(name, "/\\") || name == ".." || name == "." || name == "" { + return "" + } + return name +} +``` +Then ensure all callers handle empty string as an error (most already check `name == ""`). + +--- + +### C8. Path traversal in `TargetPath` (web/storage_handlers.go) + +**File:** `internal/web/storage_handlers.go`, around lines 374-417 + +**Problem:** Migration API accepts `TargetPath` from user input without validating it's a registered storage path. Allows writing to arbitrary filesystem locations. + +**Fix:** Validate `TargetPath` against the list of registered storage paths from settings: +```go +registeredPaths := s.settings.GetStoragePaths() // or equivalent +valid := false +for _, p := range registeredPaths { + if req.TargetPath == p.Path { + valid = true + break + } +} +if !valid { + http.Error(w, "invalid target path", http.StatusBadRequest) + return +} +``` + +--- + +### C9. Path traversal in `DestinationPath` (api/router.go) + +**File:** `internal/api/router.go`, in `saveCrossDriveConfig()` around lines 587-644 + +**Problem:** `DestinationPath` from user JSON not validated against registered storage paths. Allows cross-drive backup to write anywhere. + +**Fix:** Same pattern as C8 — validate `DestinationPath` against registered storage paths. Also validate it's not the same device as the source (this check may already exist in `crossdrive.go`, but the API should validate early). + +--- + +### C10. Path traversal in `ParseComposeHDDMounts` (stacks/delete.go) + +**File:** `internal/stacks/delete.go`, lines 243-251 + +**Problem:** After `${HDD_PATH}` substitution, the resulting path is checked with `strings.HasPrefix(hostPath, hddPath)` BEFORE `filepath.Clean`. An attacker could craft `${HDD_PATH}/../../etc/passwd` — the prefix check passes, but the resolved path escapes the HDD directory. + +**Fix:** Apply `filepath.Clean()` to the resolved path BEFORE the prefix check: +```go +resolved := filepath.Clean(hostPath) +if !strings.HasPrefix(resolved, filepath.Clean(hddPath) + string(filepath.Separator)) { + continue // skip paths that escape HDD directory +} +``` + +--- + +## HIGH Fixes + +### H1. `ValidateDump` reads entire file into memory (backup/dbdump.go) + +**File:** `internal/backup/dbdump.go`, line 251 + +**Problem:** `os.ReadFile(filePath)` loads entire SQL dump into memory. For large databases (500MB+), this causes massive memory allocation during every 5-minute cache refresh. + +**Fix:** Replace with `bufio.Scanner` reading line by line, or `io.LimitReader` to read only the first N bytes (e.g., 64KB for header) and last N bytes (for footer/completion markers). The validation only checks for structural markers like `CREATE TABLE`, `INSERT INTO`, completion tags — these are at the start and end of the file. + +--- + +### H2. `du` calls without timeout (backup/appdata.go) + +**File:** `internal/backup/appdata.go`, lines 136 and 150 + +**Problem:** `exec.Command("du", "-sh", path)` and `exec.Command("du", "-sb", path)` have no timeout. Can hang on slow/unmounted filesystems, blocking the web UI. + +**Fix:** Use `exec.CommandContext` with a 30-second timeout: +```go +ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) +defer cancel() +cmd := exec.CommandContext(ctx, "du", "-sh", path) +``` + +--- + +### H3. Double `du` invocation per mount (backup/appdata.go) + +**File:** `internal/backup/appdata.go`, lines 79-80 + +**Problem:** Both `appDirSizeBytes(mount)` and `appDirSizeHuman(mount)` run `du` independently. That's 2 subprocess calls per mount point. + +**Fix:** Create a single function `appDirSize(path string) (int64, string)` that runs `du -sb` once, parses the byte count, and formats the human string using `humanizeBytes()`. Replace both calls with this single function. + +--- + +### H4. Snapshot validation only checks first 100 (backup/restore.go) + +**File:** `internal/backup/restore.go`, line 22 + +**Problem:** `m.restic.ListSnapshots(100)` limits to 100 snapshots. Older snapshots can't be restored. + +**Fix:** Either: +- Pass 0 or a very large number to list all snapshots, OR +- Better: validate the snapshot ID format with a regex (`^[0-9a-f]{8,64}$`) and skip the existence check (let `restic restore` fail naturally with a clear error if the ID doesn't exist). This avoids listing all snapshots entirely. + +--- + +### H5. No pruning for cross-drive restic repos (backup/crossdrive.go) + +**File:** `internal/backup/crossdrive.go` + +**Problem:** Cross-drive restic backups accumulate snapshots forever. Main backups have pruning via `restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6`, but cross-drive repos don't. + +**Fix:** After a successful cross-drive restic backup, run a prune step similar to the main backup. Add a `pruneRepo(ctx, repoPath, pwFile string)` function that runs `restic forget --keep-daily 7 --keep-weekly 4 --prune` on the cross-drive repo. Call it after backup in `runResticBackup()`. + +--- + +### H6. Temp password file management (backup/crossdrive.go) + +**File:** `internal/backup/crossdrive.go`, lines 230-240 + +**Problem:** `defer os.Remove(pwFile.Name())` is registered before the file is closed. On some systems this can fail. Also, no `defer pwFile.Close()` — the explicit `Close()` at line 240 won't run if code panics between creation and close. + +**Fix:** Reorganize: +```go +pwFile, err := os.CreateTemp("", "felhom-crossdrive-pw-*") +if err != nil { return ... } +pwPath := pwFile.Name() +if _, err := pwFile.WriteString(password); err != nil { + pwFile.Close() + os.Remove(pwPath) + return ... +} +pwFile.Close() +defer os.Remove(pwPath) +``` + +--- + +### H7. `dirSizeBytes` swallows all walk errors (backup/crossdrive.go) + +**File:** `internal/backup/crossdrive.go`, lines 297-307 + +**Problem:** `filepath.Walk` callback returns `nil` on error, hiding permission/IO issues. + +**Fix:** Return the error (or at least log it): +```go +err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error { + if err != nil { + return err // propagate instead of swallowing + } + if !info.IsDir() { + total += info.Size() + } + return nil +}) +``` + +--- + +### H8. Non-atomic fstab write (storage/safety_linux.go) + +**File:** `internal/storage/safety_linux.go`, lines 91-102 + +**Problem:** `AppendFstabEntry` writes directly to `/etc/fstab`. A crash during write corrupts the file, potentially bricking the system on reboot. + +**Fix:** Read fstab, append entry, write to `/etc/fstab.tmp`, then `os.Rename("/etc/fstab.tmp", "/etc/fstab")`. This is the same atomic pattern used in settings.go. + +--- + +### H9. `IsDeviceMounted` naive prefix matching (storage/safety_linux.go) + +**File:** `internal/storage/safety_linux.go`, line 54 + +**Problem:** `strings.HasPrefix(fields[0], devicePath)` causes `/dev/sdb` to match `/dev/sdb1`, `/dev/sdb2`, etc. — and worse, `/dev/sdb` matches `/dev/sdba` (hypothetical). + +**Fix:** After the prefix check, verify the next character is either end-of-string or a digit (for partition suffixes): +```go +src := fields[0] +if src == devicePath || (strings.HasPrefix(src, devicePath) && len(src) > len(devicePath) && (src[len(devicePath)] >= '0' && src[len(devicePath)] <= '9' || src[len(devicePath)] == 'p')) { + return true, nil +} +``` +Or better: just check exact match for the device, and separately check if any partition of it is mounted. + +--- + +### H10. eMMC device mapping bug (storage/scan_linux.go) + +**File:** `internal/storage/scan_linux.go`, lines 161-176 + +**Problem:** `partitionToParentDisk` strips trailing digits to find the parent. For eMMC (`mmcblk0p1`), it strips `0p1` → `mmcblk` instead of correctly producing `mmcblk0`. The `p` separator between device number and partition number is not handled. + +**Fix:** Add eMMC/NVMe pattern detection: +```go +func partitionToParentDisk(partName string) string { + // Handle mmcblk0p1, nvme0n1p1 patterns (separator is "p" before partition number) + if idx := strings.LastIndex(partName, "p"); idx > 0 { + prefix := partName[:idx] + suffix := partName[idx+1:] + if len(suffix) > 0 && suffix[0] >= '0' && suffix[0] <= '9' { + // Check if prefix ends with a digit (e.g., mmcblk0, nvme0n1) + if prefix[len(prefix)-1] >= '0' && prefix[len(prefix)-1] <= '9' { + return prefix } } - next: } - return filtered + // Default: strip trailing digits (sda1 → sda) + return strings.TrimRight(partName, "0123456789") } ``` -**2. Auto-stop/restart in RestoreApp** (`internal/backup/restore.go`) +--- +### H11. Data race on `bytesCopied` in rsync error path (storage/migrate.go) + +**File:** `internal/storage/migrate.go`, around line 243 + +**Problem:** In `runRsync`, the `bytesCopied` variable is read (in the error return path) without holding the mutex, while a goroutine may be writing to it. + +**Fix:** Acquire the lock before reading `bytesCopied` in the error path, or use `atomic.Int64` instead of a mutex-protected variable. + +--- + +### H12. Goroutine leak in rsync stdout reader (storage/migrate.go) + +**File:** `internal/storage/migrate.go`, lines 214-237 + +**Problem:** A goroutine reads rsync's stdout. If the main function returns early (e.g., on context cancellation), the goroutine may block on `scanner.Scan()` forever if stdout isn't closed, or try to send on a closed channel. + +**Fix:** Ensure the goroutine exits cleanly: +1. Use `cmd.StdoutPipe()` and ensure it's closed on function exit +2. Use a `done` channel or context cancellation to signal the goroutine +3. Don't send on the progress channel after function returns — use a select with a done channel + +--- + +### H13. Path prefix match without separator (storage/migrate.go) + +**File:** `internal/storage/migrate.go`, lines 128-131 + +**Problem:** `strings.HasPrefix(stackMount, currentHDD)` causes `/mnt/hdd` to match `/mnt/hdd_backup/data`. Stacks on a similarly-named but different mount would be migrated incorrectly. + +**Fix:** Append a trailing separator: ```go -func (m *Manager) RestoreApp(stackName, snapshotID string) error { - // ... existing validation ... - - // Stop the app's containers before restore - if m.stackProvider != nil { - m.logger.Printf("[INFO] Stopping %s for restore...", stackName) - if err := m.stackProvider.StopStack(stackName); err != nil { - m.logger.Printf("[WARN] Could not stop %s: %v (proceeding anyway)", stackName, err) - } - } - - // Execute restore (existing logic) - if err := m.restic.RestoreAppData(snapshotID, hddMounts); err != nil { - // Try to restart even on failure - m.stackProvider.StartStack(stackName) - return err - } - - // Restart after successful restore - if m.stackProvider != nil { - m.logger.Printf("[INFO] Restarting %s after restore...", stackName) - m.stackProvider.StartStack(stackName) - } - - return nil +if strings.HasPrefix(stackMount, currentHDD + "/") || stackMount == currentHDD { +``` + +--- + +### H14. `DeleteStack` continues after failed `docker compose down` (stacks/delete.go) + +**File:** `internal/stacks/delete.go`, lines 88-92 + +**Problem:** If `docker compose down` fails, the function continues to delete stack files and HDD data. This can leave orphaned containers running while their config is deleted. + +**Fix:** Return the error from `docker compose down` failure. If the user wants to force-delete, that should be a separate explicit option. + +--- + +### H15. XSS in flash messages (web/handlers.go) + +**File:** `internal/web/handlers.go`, around lines 282-288 + +**Problem:** URL query parameters (`?success=...`, `?error=...`) are passed to template data. If templates render these with `{{.}}` instead of escaping, it's XSS. Go's `html/template` auto-escapes in HTML context, but check if any use `{{. | safeHTML}}` or JavaScript context. + +**Fix:** Verify all flash message rendering uses standard `{{.Success}}` / `{{.Error}}` in HTML context (which Go auto-escapes). If any template uses these values inside `