24 KiB
TASK.md — Bug Fix Plan for Sonnet 4.5
Prompt / Context Gathering
Before starting any fixes, read the following files to understand the project:
controller/README.md— Full architecture, module map, API endpoints, feature documentationCHANGELOG.md— Recent changes (sessions 26-38, 2026-02-17) for context on what was implementedcontroller/TASK.md(this file) — The bug list and fix instructions
Then read each source file listed in a bug section before fixing it.
After all fixes are complete, update:
CHANGELOG.md— Add a new entry at the top following the existing format (session 39, next version v0.12.3)controller/README.md— If any fix changes behavior, API, or architecture, update the relevant section
Fix Priority
Fixes are organized into 3 tiers:
- CRITICAL — Data races, security vulnerabilities, data loss risks. Fix these first.
- HIGH — Logic errors, resource leaks, incorrect behavior. Fix these second.
- MEDIUM — Edge cases, code quality, minor issues. Fix if time allows.
CRITICAL Fixes
C1. Data race on m.lastDBDump.Results mutation (backup/backup.go)
File: internal/backup/backup.go, around lines 556-570
Problem: In RefreshCache(), the code mutates m.lastDBDump.Results[i].Validation while holding NO lock. The mutex is only acquired later at line 574. Other goroutines reading m.lastDBDump via GetFullStatus() can see torn writes.
Fix: Move the m.lastDBDump.Results mutation block (lines 556-570) inside the m.mu.Lock() section that starts at line 574. The re-validation loop should happen after acquiring the lock, before setting m.cachedStatus.
C2. SnapshotHistory reversed after unlock (backup/backup.go)
File: internal/backup/backup.go, around lines 582-589
Problem: m.cachedStatus = status is set at line 582 while holding the lock, but the snapshot history reversal at lines 587-589 happens AFTER m.mu.Unlock(). Since status.SnapshotHistory is a slice backed by the same array as m.cachedStatus.SnapshotHistory (created by copy which copies elements but shares nothing — wait, actually copy creates a new backing array, so this is safe for the array). However, the local status variable IS m.cachedStatus (assigned at line 582), so reversing status.SnapshotHistory also reverses m.cachedStatus.SnapshotHistory without the lock.
Fix: Move the reversal loop (lines 587-589) BEFORE m.mu.Unlock(), right after copy(status.SnapshotHistory, m.snapshotHistory) at line 581 and before assigning to m.cachedStatus at line 582.
C3. SetStackProvider write without synchronization (backup/backup.go)
File: internal/backup/backup.go, line 413-415
Problem: m.stackProvider = provider is a direct field write with no mutex protection. The field is read by resolveAppBackupPaths() and RefreshCache() concurrently.
Fix: Wrap the assignment in m.mu.Lock() / m.mu.Unlock().
C4. GetFullStatus shallow-copies mutable pointers (backup/backup.go)
File: internal/backup/backup.go, lines 619-620
Problem: status.LastDBDump = m.lastDBDump and status.LastBackup = m.lastBackup copy pointer values. Callers can mutate the referenced objects, creating races with the manager's internal state.
Fix: Deep-copy these structs. If lastDBDump is non-nil, create a copy: copyDump := *m.lastDBDump; status.LastDBDump = ©Dump. Same for lastBackup. Also deep-copy the Results slice inside lastDBDump if present (since it's a slice of structs, a simple copy into a new slice suffices).
C5. IsSystemDisk uses 8-bit major mask (storage/safety_linux.go)
File: internal/storage/safety_linux.go, lines 30-31
Problem: major := (stat.Rdev >> 8) & 0xff extracts only 8 bits. Linux dev_t uses a 12-bit major: bits 8-19 with bits 0-7 at positions 8-15 and bits 8-11 at positions 44-47 (for 64-bit). For the common case, the formula should use unix.Major() from golang.org/x/sys/unix or at minimum the standard extraction: major = (dev & 0xfff00) >> 8. Also, the function only compares major numbers — ALL SCSI disks share major 8, so this blocks formatting of any SCSI disk if root is also SCSI.
Fix: Use unix.Major(uint64(stat.Rdev)) and unix.Minor(uint64(stat.Rdev)) from golang.org/x/sys/unix (already a dependency). Compare both major AND the disk portion of the minor to identify the physical disk. For SCSI (major 8), disks are distinguished by minor / 16. For NVMe (major 259), compare the whole-disk minor. The simplest correct approach: resolve both device paths to their parent disk name (e.g., via partitionToParentDisk) and compare those names instead of using device numbers.
C6. No /dev/ prefix validation on DevicePath (storage/format_linux.go)
File: internal/storage/format_linux.go, around lines 37-41
Problem: The FormatAndMount function accepts DevicePath from user JSON without validating it starts with /dev/. An attacker could pass arbitrary paths.
Fix: At the start of FormatAndMount, validate:
if !strings.HasPrefix(req.DevicePath, "/dev/") {
return fmt.Errorf("invalid device path: must start with /dev/")
}
if strings.Contains(req.DevicePath, "..") {
return fmt.Errorf("invalid device path: must not contain ..")
}
C7. Path traversal in extractName (api/router.go)
File: internal/api/router.go, lines 769-771
Problem: extractName() extracts a stack name from the URL path but doesn't reject .., /, or \ characters. This name is used in file paths and Docker commands.
Fix: Add validation after extracting the name:
func extractName(r *http.Request) string {
name := // ... existing extraction logic
if strings.ContainsAny(name, "/\\") || name == ".." || name == "." || name == "" {
return ""
}
return name
}
Then ensure all callers handle empty string as an error (most already check name == "").
C8. Path traversal in TargetPath (web/storage_handlers.go)
File: internal/web/storage_handlers.go, around lines 374-417
Problem: Migration API accepts TargetPath from user input without validating it's a registered storage path. Allows writing to arbitrary filesystem locations.
Fix: Validate TargetPath against the list of registered storage paths from settings:
registeredPaths := s.settings.GetStoragePaths() // or equivalent
valid := false
for _, p := range registeredPaths {
if req.TargetPath == p.Path {
valid = true
break
}
}
if !valid {
http.Error(w, "invalid target path", http.StatusBadRequest)
return
}
C9. Path traversal in DestinationPath (api/router.go)
File: internal/api/router.go, in saveCrossDriveConfig() around lines 587-644
Problem: DestinationPath from user JSON not validated against registered storage paths. Allows cross-drive backup to write anywhere.
Fix: Same pattern as C8 — validate DestinationPath against registered storage paths. Also validate it's not the same device as the source (this check may already exist in crossdrive.go, but the API should validate early).
C10. Path traversal in ParseComposeHDDMounts (stacks/delete.go)
File: internal/stacks/delete.go, lines 243-251
Problem: After ${HDD_PATH} substitution, the resulting path is checked with strings.HasPrefix(hostPath, hddPath) BEFORE filepath.Clean. An attacker could craft ${HDD_PATH}/../../etc/passwd — the prefix check passes, but the resolved path escapes the HDD directory.
Fix: Apply filepath.Clean() to the resolved path BEFORE the prefix check:
resolved := filepath.Clean(hostPath)
if !strings.HasPrefix(resolved, filepath.Clean(hddPath) + string(filepath.Separator)) {
continue // skip paths that escape HDD directory
}
HIGH Fixes
H1. ValidateDump reads entire file into memory (backup/dbdump.go)
File: internal/backup/dbdump.go, line 251
Problem: os.ReadFile(filePath) loads entire SQL dump into memory. For large databases (500MB+), this causes massive memory allocation during every 5-minute cache refresh.
Fix: Replace with bufio.Scanner reading line by line, or io.LimitReader to read only the first N bytes (e.g., 64KB for header) and last N bytes (for footer/completion markers). The validation only checks for structural markers like CREATE TABLE, INSERT INTO, completion tags — these are at the start and end of the file.
H2. du calls without timeout (backup/appdata.go)
File: internal/backup/appdata.go, lines 136 and 150
Problem: exec.Command("du", "-sh", path) and exec.Command("du", "-sb", path) have no timeout. Can hang on slow/unmounted filesystems, blocking the web UI.
Fix: Use exec.CommandContext with a 30-second timeout:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, "du", "-sh", path)
H3. Double du invocation per mount (backup/appdata.go)
File: internal/backup/appdata.go, lines 79-80
Problem: Both appDirSizeBytes(mount) and appDirSizeHuman(mount) run du independently. That's 2 subprocess calls per mount point.
Fix: Create a single function appDirSize(path string) (int64, string) that runs du -sb once, parses the byte count, and formats the human string using humanizeBytes(). Replace both calls with this single function.
H4. Snapshot validation only checks first 100 (backup/restore.go)
File: internal/backup/restore.go, line 22
Problem: m.restic.ListSnapshots(100) limits to 100 snapshots. Older snapshots can't be restored.
Fix: Either:
- Pass 0 or a very large number to list all snapshots, OR
- Better: validate the snapshot ID format with a regex (
^[0-9a-f]{8,64}$) and skip the existence check (letrestic restorefail naturally with a clear error if the ID doesn't exist). This avoids listing all snapshots entirely.
H5. No pruning for cross-drive restic repos (backup/crossdrive.go)
File: internal/backup/crossdrive.go
Problem: Cross-drive restic backups accumulate snapshots forever. Main backups have pruning via restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6, but cross-drive repos don't.
Fix: After a successful cross-drive restic backup, run a prune step similar to the main backup. Add a pruneRepo(ctx, repoPath, pwFile string) function that runs restic forget --keep-daily 7 --keep-weekly 4 --prune on the cross-drive repo. Call it after backup in runResticBackup().
H6. Temp password file management (backup/crossdrive.go)
File: internal/backup/crossdrive.go, lines 230-240
Problem: defer os.Remove(pwFile.Name()) is registered before the file is closed. On some systems this can fail. Also, no defer pwFile.Close() — the explicit Close() at line 240 won't run if code panics between creation and close.
Fix: Reorganize:
pwFile, err := os.CreateTemp("", "felhom-crossdrive-pw-*")
if err != nil { return ... }
pwPath := pwFile.Name()
if _, err := pwFile.WriteString(password); err != nil {
pwFile.Close()
os.Remove(pwPath)
return ...
}
pwFile.Close()
defer os.Remove(pwPath)
H7. dirSizeBytes swallows all walk errors (backup/crossdrive.go)
File: internal/backup/crossdrive.go, lines 297-307
Problem: filepath.Walk callback returns nil on error, hiding permission/IO issues.
Fix: Return the error (or at least log it):
err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {
if err != nil {
return err // propagate instead of swallowing
}
if !info.IsDir() {
total += info.Size()
}
return nil
})
H8. Non-atomic fstab write (storage/safety_linux.go)
File: internal/storage/safety_linux.go, lines 91-102
Problem: AppendFstabEntry writes directly to /etc/fstab. A crash during write corrupts the file, potentially bricking the system on reboot.
Fix: Read fstab, append entry, write to /etc/fstab.tmp, then os.Rename("/etc/fstab.tmp", "/etc/fstab"). This is the same atomic pattern used in settings.go.
H9. IsDeviceMounted naive prefix matching (storage/safety_linux.go)
File: internal/storage/safety_linux.go, line 54
Problem: strings.HasPrefix(fields[0], devicePath) causes /dev/sdb to match /dev/sdb1, /dev/sdb2, etc. — and worse, /dev/sdb matches /dev/sdba (hypothetical).
Fix: After the prefix check, verify the next character is either end-of-string or a digit (for partition suffixes):
src := fields[0]
if src == devicePath || (strings.HasPrefix(src, devicePath) && len(src) > len(devicePath) && (src[len(devicePath)] >= '0' && src[len(devicePath)] <= '9' || src[len(devicePath)] == 'p')) {
return true, nil
}
Or better: just check exact match for the device, and separately check if any partition of it is mounted.
H10. eMMC device mapping bug (storage/scan_linux.go)
File: internal/storage/scan_linux.go, lines 161-176
Problem: partitionToParentDisk strips trailing digits to find the parent. For eMMC (mmcblk0p1), it strips 0p1 → mmcblk instead of correctly producing mmcblk0. The p separator between device number and partition number is not handled.
Fix: Add eMMC/NVMe pattern detection:
func partitionToParentDisk(partName string) string {
// Handle mmcblk0p1, nvme0n1p1 patterns (separator is "p" before partition number)
if idx := strings.LastIndex(partName, "p"); idx > 0 {
prefix := partName[:idx]
suffix := partName[idx+1:]
if len(suffix) > 0 && suffix[0] >= '0' && suffix[0] <= '9' {
// Check if prefix ends with a digit (e.g., mmcblk0, nvme0n1)
if prefix[len(prefix)-1] >= '0' && prefix[len(prefix)-1] <= '9' {
return prefix
}
}
}
// Default: strip trailing digits (sda1 → sda)
return strings.TrimRight(partName, "0123456789")
}
H11. Data race on bytesCopied in rsync error path (storage/migrate.go)
File: internal/storage/migrate.go, around line 243
Problem: In runRsync, the bytesCopied variable is read (in the error return path) without holding the mutex, while a goroutine may be writing to it.
Fix: Acquire the lock before reading bytesCopied in the error path, or use atomic.Int64 instead of a mutex-protected variable.
H12. Goroutine leak in rsync stdout reader (storage/migrate.go)
File: internal/storage/migrate.go, lines 214-237
Problem: A goroutine reads rsync's stdout. If the main function returns early (e.g., on context cancellation), the goroutine may block on scanner.Scan() forever if stdout isn't closed, or try to send on a closed channel.
Fix: Ensure the goroutine exits cleanly:
- Use
cmd.StdoutPipe()and ensure it's closed on function exit - Use a
donechannel or context cancellation to signal the goroutine - Don't send on the progress channel after function returns — use a select with a done channel
H13. Path prefix match without separator (storage/migrate.go)
File: internal/storage/migrate.go, lines 128-131
Problem: strings.HasPrefix(stackMount, currentHDD) causes /mnt/hdd to match /mnt/hdd_backup/data. Stacks on a similarly-named but different mount would be migrated incorrectly.
Fix: Append a trailing separator:
if strings.HasPrefix(stackMount, currentHDD + "/") || stackMount == currentHDD {
H14. DeleteStack continues after failed docker compose down (stacks/delete.go)
File: internal/stacks/delete.go, lines 88-92
Problem: If docker compose down fails, the function continues to delete stack files and HDD data. This can leave orphaned containers running while their config is deleted.
Fix: Return the error from docker compose down failure. If the user wants to force-delete, that should be a separate explicit option.
H15. XSS in flash messages (web/handlers.go)
File: internal/web/handlers.go, around lines 282-288
Problem: URL query parameters (?success=..., ?error=...) are passed to template data. If templates render these with {{.}} instead of escaping, it's XSS. Go's html/template auto-escapes in HTML context, but check if any use {{. | safeHTML}} or JavaScript context.
Fix: Verify all flash message rendering uses standard {{.Success}} / {{.Error}} in HTML context (which Go auto-escapes). If any template uses these values inside <script> tags or onclick attributes, switch to JSON encoding or remove the pattern. Consider using session-based flash messages instead of URL params to prevent reflected XSS entirely.
H16. exec.Command("docker") without timeout (web/handlers.go)
File: internal/web/handlers.go, around lines 1247-1253
Problem: syncFileBrowserMounts() runs docker compose up -d without a context timeout. Can hang indefinitely.
Fix: Use exec.CommandContext with a 60-second timeout.
H17. SetNotificationPrefs stores caller's pointer (settings/settings.go)
File: internal/settings/settings.go, around line 219
Problem: s.data.NotificationPrefs = prefs stores the caller's pointer/struct directly. If the caller later modifies the struct, it mutates settings state without going through the mutex.
Fix: Deep-copy the prefs:
func (s *Settings) SetNotificationPrefs(prefs NotificationPrefs) {
s.mu.Lock()
defer s.mu.Unlock()
copy := prefs // value copy
copy.EnabledEvents = make([]string, len(prefs.EnabledEvents))
copy(copy.EnabledEvents, prefs.EnabledEvents)
s.data.NotificationPrefs = copy
s.saveLocked()
}
H18. wipefs error silently discarded (storage/format_linux.go)
File: internal/storage/format_linux.go, line 74
Problem: _ = exec.Command("wipefs", ...).Run() discards the error. If wipefs fails, stale filesystem signatures remain, causing mkfs.ext4 or mount to misbehave.
Fix: Check the error. If wipefs fails, log a warning but continue (some systems don't have wipefs). At minimum don't use _ =:
if err := exec.Command("wipefs", "-a", partPath).Run(); err != nil {
log.Printf("[WARN] wipefs failed on %s: %v (continuing)", partPath, err)
}
H19. Orphaned fstab entry on mount failure (storage/format_linux.go)
File: internal/storage/format_linux.go, lines 140-150
Problem: If fstab entry is written but the subsequent mount command fails, an orphaned entry remains in fstab. On next reboot, the system may hang waiting for a mount that doesn't work.
Fix: If mount fails after fstab write, remove the fstab entry as rollback. Read fstab, remove the line containing the UUID, write back atomically.
MEDIUM Fixes
M1. formatBytes duplicate in dbdump.go (backup/dbdump.go)
File: internal/backup/dbdump.go, lines 467-483
Problem: Duplicates humanizeBytes() from appdata.go.
Fix: Delete formatBytes() from dbdump.go and replace all calls with humanizeBytes() from appdata.go.
M2. Dead code .tmp suffix check (backup/dbdump.go)
File: internal/backup/dbdump.go, lines 331-332
Problem: Line 328 already filters for .sql suffix, so .tmp check is unreachable.
Fix: Remove the dead .tmp check. If intent was to skip .sql.tmp files, fix the filter order: check .tmp before .sql.
M3. sizeBytes() returns 0 for string types (storage/scan_linux.go)
File: internal/storage/scan_linux.go, lines 33-39
Problem: lsblk can return size as a string. The function only handles float64 from JSON unmarshaling.
Fix: Add string handling:
case string:
n, _ := strconv.ParseUint(v, 10, 64)
return n
M4. Missing PARTUUID= handling in fstab (storage/scan_linux.go)
File: internal/storage/scan_linux.go, lines 130-134
Problem: getSystemDiskNames only parses UUID= entries from fstab, missing PARTUUID= entries common on Raspberry Pi.
Fix: Add a PARTUUID= case that uses blkid -t PARTUUID=xxx -o device to resolve to a device path.
M5. Inconsistent rollback in migration (storage/migrate.go)
File: internal/storage/migrate.go, lines 160-163
Problem: If updateFn fails (the callback that updates app config to point to new path), the data has been copied but the config still points to old path. No cleanup of copied data on target.
Fix: If updateFn fails, attempt to clean up the target directory and return a clear error explaining the situation.
M6. Dead elapsed variable (storage/migrate.go)
File: internal/storage/migrate.go, around line 177
Problem: elapsed is calculated but never used.
Fix: Remove or use it in a log message.
M7. time.LoadLocation error silently discarded (web/handlers.go)
File: internal/web/handlers.go, lines 449-450, 541-542
Problem: loc, _ := time.LoadLocation("Europe/Budapest") — if tzdata is missing, loc is nil causing panic on time.Now().In(loc).
Fix: Handle the error:
loc, err := time.LoadLocation("Europe/Budapest")
if err != nil {
loc = time.UTC
log.Printf("[WARN] timezone Europe/Budapest not available: %v", err)
}
M8. "Run all" triggers manual-schedule backups (api/router.go)
File: internal/api/router.go, lines 701-703
Problem: The "run all" endpoint calls RunAllScheduled(ctx, "manual") or iterates all configs. If the intent of "manual" schedule was "only run when explicitly triggered per-app", running all defeats this.
Fix: Clarify intent. If "Run all" should only trigger daily+weekly configs, filter out manual ones. If manual should also be included in "run all", document this behavior.
M9. Background goroutines use context.Background() (api/router.go)
File: internal/api/router.go, lines 658-662, 693-704
Problem: Background goroutines for backup/restore use context.Background() and survive server shutdown, potentially corrupting data.
Fix: Use a server-level context that's cancelled on shutdown. Pass the server's context (from http.Server.BaseContext or a custom one) and derive goroutine contexts from it.
M10. filterSnapshotsByPaths imprecise prefix (api/router.go)
File: internal/api/router.go, lines 476-490
Problem: strings.HasPrefix(snapPath, reqPath) causes /mnt/hdd_1 to match /mnt/hdd_10/data.
Fix: Append separator: strings.HasPrefix(snapPath, reqPath + "/") || snapPath == reqPath.
M11. XSS in editStorageLabel innerHTML (templates/settings.html)
File: internal/web/templates/settings.html, around lines 280-296
Problem: cancelEditLabel function uses innerHTML to restore label content. If path or currentLabel contain HTML, it's XSS.
Fix: Use textContent instead of innerHTML:
el.textContent = currentLabel || path;
M12. GetNotificationPrefs leaks slice reference (settings/settings.go)
File: internal/settings/settings.go, around line 196
Problem: Returns direct reference to DefaultEnabledEvents slice. Caller can modify the global default.
Fix: Return a copy of the slice:
events := make([]string, len(DefaultEnabledEvents))
copy(events, DefaultEnabledEvents)
M13. CheckBackupDestination tier priority (system/mounts_linux.go)
Problem: Tier 4 (disk full) overwrites Tier 3 (system drive) warning. If a drive is both the system drive and >90% full, only the disk-full warning shows.
Fix: Accumulate warnings instead of overwriting. Return a slice of warnings, or use the highest severity.
M14. Hardcoded paths in backup.go
File: internal/backup/backup.go, lines 245-248, 524, 683
Problem: /opt/docker/felhom-controller/controller.yaml is hardcoded in 3 places.
Fix: Add a constant or derive from m.cfg.Paths.DataDir:
const controllerYAMLPath = "/opt/docker/felhom-controller/controller.yaml"
Or better: filepath.Join(m.cfg.Paths.DataDir, "controller.yaml").
Testing Notes
After making fixes:
- Build:
go build ./...from thecontroller/directory must succeed - Vet:
go vet ./...must pass - Key areas to manually verify:
- Storage format/mount still works (C5, C6, H8, H9, H18, H19)
- Backup cycle completes (C1-C4, H1-H7)
- Migration works (H11-H13, M5)
- Web UI loads without errors (H15, M11)
Post-Fix Checklist
- Add
CHANGELOG.mdentry (session 39, v0.12.3) listing all fixes - Update
controller/README.mdif any API behavior changed go build ./...passesgo vet ./...passes