18 KiB
TASK.md — Bug Fix Plan for Sonnet 4.5 (Session 2)
Prompt (copy-paste this into Claude Code)
You have a bug fix task. Read these files in order:
1. CLAUDE.md — project overview, workspace layout, build & deploy workflow
2. controller/README.md — full architecture, module map, feature docs
3. CHANGELOG.md — recent changes (including yesterday's bug fix session)
4. TASK.md — THIS FILE, the bug list with detailed fix instructions
Then fix ALL bugs listed in TASK.md, starting with CRITICAL, then HIGH, then MEDIUM.
Skip the "Previously Identified — Still Open" section (P1–P6) — those are lower priority.
For each bug:
- Read the source file first
- Apply the fix as described
- Move on to the next bug
After all fixes are done:
1. Run `go build ./...` and `go vet ./...` from the controller/ directory — fix any errors
2. Update CHANGELOG.md with a new entry at the top (session 40, v0.12.4)
3. Commit, build, and deploy following the workflow in CLAUDE.md:
- git add -A && git commit && git push
- Build on 192.168.0.180 with version 0.12.4
- Deploy on 192.168.0.162
- Verify with docker ps and docker logs
Context Gathering
Before starting any fixes, read the following files to understand the project:
CLAUDE.md— Project overview, workspace layout, build & deploy workflow (MANDATORY)controller/README.md— Full architecture, module map, API endpoints, feature documentationCHANGELOG.md— Recent changes for context on what was implemented (including yesterday's bug fixes)TASK.md(this file) — The bug list and fix instructions
Then read each source file listed in a bug section before fixing it.
After all fixes are complete, update:
CHANGELOG.md— Add a new entry at the top following the existing format (session 40, version v0.12.4)controller/README.md— If any fix changes behavior, API, or architecture, update the relevant section
Fix Priority
Fixes are organized into 3 tiers:
- CRITICAL — Data loss, panics, crashes. Fix these first.
- HIGH — Logic errors, resource leaks, incorrect behavior. Fix these second.
- MEDIUM — Edge cases, code quality, minor issues. Fix if time allows.
CRITICAL Fixes
C1. SetAppBackupBulk data loss + nil map panic (settings/settings.go)
File: internal/settings/settings.go, lines 271-282
Problem (two bugs in one function):
-
Data loss: Line 280
s.AppBackup = newMapreplaces the ENTIRE map with only the keys from the inputprefsparameter. Any stacks NOT in the input map are permanently deleted from the settings file, including theirCrossDrivebackup configurations.Example: Settings has
{"app1": {Enabled:true, CrossDrive:{...}}, "app2": {Enabled:false}}. Caller passesprefs = {"app1": false}(intending to disable only app1). After this function, app2's entire config is deleted. -
Nil map panic: Line 276 reads
s.AppBackup[name]but there is no nil check ons.AppBackupbefore the loop. Ifs.AppBackupis nil (new installation, empty settings), this panics.
Fix: Update the map IN PLACE instead of replacing it. Add nil guard:
func (s *Settings) SetAppBackupBulk(prefs map[string]bool) error {
s.mu.Lock()
defer s.mu.Unlock()
if s.AppBackup == nil {
s.AppBackup = make(map[string]AppBackupPrefs)
}
for name, enabled := range prefs {
existing := s.AppBackup[name]
existing.Enabled = enabled
s.AppBackup[name] = existing
}
return s.save()
}
Key change: Remove line 274 (newMap := make(...)) and line 280 (s.AppBackup = newMap). Iterate over s.AppBackup directly. The comment on line 270 says "Preserves existing CrossDrive configs" — the current code does NOT do that for stacks absent from the input.
C2. UpdateStackConfig nil Env map panic (stacks/deploy.go)
File: internal/stacks/deploy.go, line 245
Problem: appCfg.Env[key] = val panics with "assignment to entry in nil map" when appCfg.Env is nil. This happens when app.yaml exists but has no env: section (e.g., deployed: true with no env vars).
The fix already exists in UpdateOptionalConfig (lines 312-314) but is missing from UpdateStackConfig.
Fix: Add nil check after line 233 (after if appCfg == nil || !appCfg.Deployed check), before the loop:
if appCfg.Env == nil {
appCfg.Env = make(map[string]string)
}
C3. ValidateDump missing scanner.Err() check (backup/dbdump.go)
File: internal/backup/dbdump.go, after line 295
Problem: After the for scanner.Scan() { ... } loop (lines 269-295), scanner.Err() is never checked. If an I/O error occurs during scanning (disk error, permission issue, file truncated), the scanner silently stops iterating. The function continues with whatever partial data was read and may incorrectly mark the dump as valid (v.Valid = true at line 315). This means a corrupted or unreadable dump file could pass validation.
Fix: Add error check immediately after the for loop ends (after line 295, before line 296):
if err := scanner.Err(); err != nil {
v.Error = fmt.Sprintf("hiba az olvasás közben: %v", err)
log.Printf("[WARN] ValidateDump FAIL: %s — scanner error: %v", filePath, err)
return v
}
HIGH Fixes
H1. nextDailyRun uses Add(24h) — wrong across DST transitions (scheduler/scheduler.go)
File: internal/scheduler/scheduler.go, line 252
Problem: next = next.Add(24 * time.Hour) is used when today's scheduled time has passed. During a DST transition in Europe/Budapest:
- Spring forward (March): Adding 24h to "02:30 CET" gives "03:30 CEST" instead of "02:30 CEST" — the job runs 1 hour late
- Fall back (October): Adding 24h from "02:30 CEST" gives "01:30 CET" — the job runs 1 hour early
Fix: Replace line 252 with calendar-day arithmetic:
// If the time has already passed today, schedule for tomorrow
if !next.After(now) {
next = time.Date(now.Year(), now.Month(), now.Day()+1, hour, min, 0, 0, loc)
}
This uses time.Date which correctly handles DST by constructing the time in the target timezone for the specific calendar date.
H2. nextDailyRun calls time.LoadLocation on every iteration (scheduler/scheduler.go)
File: internal/scheduler/scheduler.go, lines 241-245
Problem: time.LoadLocation("Europe/Budapest") is called every time nextDailyRun runs — i.e., every single day for each daily job, and additionally on startup. This is both inefficient (filesystem read + parsing each time) and dangerous: if the system loses access to tzdata at runtime (e.g., container rebuild removing the package), the fallback to UTC happens silently with only a log line. Jobs would shift by 1-2 hours with no alert.
Fix: Cache the timezone at package level or in the Scheduler struct. Preferred approach — package-level sync.Once:
var (
budapestLoc *time.Location
budapestLocOnce sync.Once
)
func getBudapestLocation() *time.Location {
budapestLocOnce.Do(func() {
loc, err := time.LoadLocation("Europe/Budapest")
if err != nil {
log.Printf("[ERROR] Cannot load Europe/Budapest timezone: %v — using UTC", err)
loc = time.UTC
}
budapestLoc = loc
})
return budapestLoc
}
Then in nextDailyRun, replace lines 241-245 with: loc := getBudapestLocation().
H3. settings.save() leaks .tmp file on WriteFile failure (settings/settings.go)
File: internal/settings/settings.go, lines 138-139
Problem: If os.WriteFile(tmpPath, data, 0644) fails (e.g., disk full, permission denied), the function returns the error but does NOT clean up the .tmp file. Only the os.Rename failure path (line 143) calls os.Remove(tmpPath). Repeated failures accumulate orphaned .tmp files.
Fix: Add cleanup to the WriteFile error path:
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
os.Remove(tmpPath) // clean up partial file
return fmt.Errorf("writing tmp settings: %w", err)
}
H4. SetNotificationPrefs panics on nil input (settings/settings.go)
File: internal/settings/settings.go, line 220
Problem: copy := *prefs dereferences the pointer without checking for nil. If any caller passes nil (possible from malformed JSON decoding or future code changes), this panics.
Fix: Add nil guard at the start of the function (after line 217, before line 218):
func (s *Settings) SetNotificationPrefs(prefs *NotificationPrefs) error {
if prefs == nil {
return fmt.Errorf("notification preferences cannot be nil")
}
s.mu.Lock()
defer s.mu.Unlock()
// ... rest unchanged
}
H5. appDirSize ignores Sscanf return value (backup/appdata.go)
File: internal/backup/appdata.go, line 150
Problem: fmt.Sscanf(fields[0], "%d", &size) return value is not checked. If du -sb returns malformed output, size silently stays 0 and the function returns (0, "0 B") which looks like a successful zero-size measurement rather than a parse error.
Same pattern exists in stacks/delete.go line 289 (getDirSizeBytes).
Fix for appdata.go line 150:
var size int64
n, _ := fmt.Sscanf(fields[0], "%d", &size)
if n != 1 {
return 0, "?"
}
return size, humanizeBytes(size)
Fix for stacks/delete.go line 289:
var size int64
if n, _ := fmt.Sscanf(fields[0], "%d", &size); n != 1 {
return 0
}
return size
H6. getDirSizeBytes has no timeout (stacks/delete.go)
File: internal/stacks/delete.go, lines 280-293
Problem: exec.Command("du", "-sb", path) runs with no timeout. If the path is on a slow/unmounted filesystem, this hangs the delete handler indefinitely. The analogous function in backup/appdata.go already uses exec.CommandContext with a 30s timeout.
Fix:
func getDirSizeBytes(path string) int64 {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, "du", "-sb", path)
// ... rest unchanged
}
Add "context" and "time" to the import block if not already present.
H7. dbdump.go tmpFile not using defer Close (backup/dbdump.go)
File: internal/backup/dbdump.go, lines 178-190
Problem: After os.Create(tmpPath) at line 178, the file is closed explicitly at line 190 (tmpFile.Close()). But if cmd.Run() panics (rather than returning an error), the file handle leaks. The standard Go idiom is to use defer for resource cleanup.
Fix: Replace the explicit close with a deferred close:
tmpFile, err := os.Create(tmpPath)
if err != nil {
result.Error = fmt.Errorf("creating tmp file: %w", err)
result.Duration = time.Since(start)
return result
}
defer tmpFile.Close()
cmd.Stdout = tmpFile
// ... remove the explicit tmpFile.Close() at line 190
Note: The defer runs after cmd.Run(), so stdout is properly flushed by the subprocess exit. On Linux, closing an already-closed fd is a no-op, so this is safe even if the OS closes it implicitly.
H8. UpdateCrossDriveStatus comment contradicts implementation (settings/settings.go)
File: internal/settings/settings.go, line 324
Problem: The comment says "fn receives a pointer to the CrossDriveBackup (creates one if nil) and may mutate it" but the implementation at lines 332-333 does the opposite — it returns nil when CrossDrive is nil (does NOT create one).
Fix: Update the comment:
// UpdateCrossDriveStatus updates runtime status fields for a cross-drive backup in-place.
// fn receives a pointer to the CrossDriveBackup and may mutate it.
// If no cross-drive config exists for the stack, does nothing and returns nil.
MEDIUM Fixes
M1. notify/notifier.go custom containsBytes should use strings.Contains (notify/notifier.go)
File: internal/notify/notifier.go, lines 316-327
Problem: The contains() and containsBytes() functions are custom reimplementations of strings.Contains(). While technically correct (Go uses signed int for len() so the subtraction doesn't underflow), this is unnecessary complexity. The standard library function is well-tested, optimized, and immediately recognizable.
Fix: Replace both functions with:
func contains(s, substr string) bool {
return strings.Contains(s, substr)
}
Or better: remove the contains function entirely and use strings.Contains() directly at all call sites. Delete the containsBytes function.
Add "strings" to the import block if not already present.
M2. scheduler.Every() doesn't validate interval > 0 (scheduler/scheduler.go)
File: internal/scheduler/scheduler.go, lines 43-53
Problem: If interval is zero or negative, time.NewTicker(interval) panics at line 134 with "non-positive interval for NewTicker". While current callers always pass valid intervals, a future misconfiguration (e.g., a missing config value parsed as 0) would crash the controller at startup with no useful error message.
Fix: Add validation at the start of Every():
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc) {
if interval <= 0 {
s.logger.Printf("[ERROR] Periodic job %s has invalid interval %s — job not registered", name, interval)
return
}
// ... rest unchanged
}
M3. scheduler.executeJob doesn't set LastRun on panic (scheduler/scheduler.go)
File: internal/scheduler/scheduler.go, lines 186-193
Problem: When a job panics, the panic recovery defer at lines 186-193 sets job.LastErr but does NOT set job.LastRun. The normal path sets both at lines 203-205, but those lines are never reached during a panic. This means after a panic, the job status shows the panic error with a stale LastRun timestamp (or zero time if the job never succeeded).
Fix: Update the panic recovery to also set LastRun:
defer func() {
if r := recover(); r != nil {
s.mu.Lock()
job.LastErr = fmt.Errorf("panic: %v", r)
job.LastRun = time.Now()
s.mu.Unlock()
s.logger.Printf("[ERROR] Job %s panicked: %v", job.Name, r)
}
}()
M4. logPostStartStatus goroutine captures env slice by reference (stacks/manager.go)
File: internal/stacks/manager.go, around lines 630-651
Problem: The goroutine captures env []string from the function parameter. If the calling code reuses or appends to the slice after calling logPostStartStatus, the goroutine may read modified data due to Go's slice sharing semantics. While this is unlikely in current code (callers don't modify env after the call), it's a defensive programming issue.
Fix: Copy the slice at the start of the function:
func (m *Manager) logPostStartStatus(name, stackDir string, env []string) {
envCopy := make([]string, len(env))
copy(envCopy, env)
go func() {
time.Sleep(3 * time.Second)
output, err := m.composeExecCustomEnv(stackDir, envCopy, "ps", "-a", ...)
// ... rest uses envCopy instead of env
}()
}
M5. Multiple time.LoadLocation("Europe/Budapest") calls across the codebase
Files: internal/web/handlers.go (lines 450, 546), internal/web/funcmap.go (lines 15-18)
Problem: Same time.LoadLocation call duplicated in multiple places, each with its own error handling. The handlers.go calls were fixed with M7 from last session to fall back to UTC, but the repeated pattern is inefficient and error-prone.
Fix: Create a package-level helper in internal/web/ (e.g., in funcmap.go since it already loads the timezone):
var (
webTimezone *time.Location
webTimezoneOnce sync.Once
)
func getTimezone() *time.Location {
webTimezoneOnce.Do(func() {
loc, err := time.LoadLocation("Europe/Budapest")
if err != nil {
loc = time.UTC
}
webTimezone = loc
})
return webTimezone
}
Then replace all time.LoadLocation("Europe/Budapest") calls in the web package with getTimezone().
Previously Identified (from session 1) — Still Open
These were in the previous TASK.md but were not yet fixed:
P1. Missing PARTUUID= handling in fstab (storage/scan_linux.go)
File: internal/storage/scan_linux.go, around line 130
getSystemDiskNames only parses UUID= entries from fstab, missing PARTUUID= entries common on Raspberry Pi. Add a PARTUUID= case that uses blkid -t PARTUUID=xxx -o device to resolve to a device path.
P2. "Run all" triggers manual-schedule cross-drive backups (api/router.go)
If the "Run all" endpoint should only trigger daily+weekly configs, it should filter out manual ones. Clarify intent and filter accordingly.
P3. Background goroutines use context.Background() (api/router.go)
Background goroutines for backup/restore use context.Background() and survive server shutdown. Use a server-level context that's cancelled on shutdown.
P4. CheckBackupDestination tier priority (system/mounts_linux.go)
Tier 4 (disk full) overwrites Tier 3 (system drive) warning. Should accumulate warnings instead of overwriting.
P5. Hardcoded /opt/docker/felhom-controller/controller.yaml paths (backup/backup.go)
Should use a constant or derive from m.cfg.Paths.DataDir.
P6. Inconsistent rollback in migration (storage/migrate.go)
If updateFn callback fails after data copy, the copied data on target is not cleaned up.
Testing Notes
After making fixes:
- Build:
go build ./...from thecontroller/directory must succeed - Vet:
go vet ./...must pass - Key areas to manually verify:
- Settings save/load cycle (C1 — create a settings.json with cross-drive configs, call SetAppBackupBulk with a subset, verify missing stacks are preserved)
- Stack config update with empty env (C2 — deploy a stack, clear its env, try UpdateStackConfig)
- DB dump validation on large files (C3 — ensure scanner.Err is checked)
- Daily job scheduling around midnight (H1)
- Settings file operations under disk-full conditions (H3)
Post-Fix Checklist
- Add
CHANGELOG.mdentry listing all fixes - Update
controller/README.mdif any API behavior changed go build ./...passesgo vet ./...passes