admin/deploy-felhom-compose

Fork 0

Files

T

admin 731cca15a8 aded prompt

2026-02-18 07:44:06 +01:00

18 KiB

Raw Blame History

TASK.md — Bug Fix Plan for Sonnet 4.5 (Session 2)

Prompt (copy-paste this into Claude Code)

You have a bug fix task. Read these files in order:

1. CLAUDE.md — project overview, workspace layout, build & deploy workflow
2. controller/README.md — full architecture, module map, feature docs
3. CHANGELOG.md — recent changes (including yesterday's bug fix session)
4. TASK.md — THIS FILE, the bug list with detailed fix instructions

Then fix ALL bugs listed in TASK.md, starting with CRITICAL, then HIGH, then MEDIUM.
Skip the "Previously Identified — Still Open" section (P1–P6) — those are lower priority.

For each bug:
- Read the source file first
- Apply the fix as described
- Move on to the next bug

After all fixes are done:
1. Run `go build ./...` and `go vet ./...` from the controller/ directory — fix any errors
2. Update CHANGELOG.md with a new entry at the top (session 40, v0.12.4)
3. Commit, build, and deploy following the workflow in CLAUDE.md:
   - git add -A && git commit && git push
   - Build on 192.168.0.180 with version 0.12.4
   - Deploy on 192.168.0.162
   - Verify with docker ps and docker logs

Context Gathering

Before starting any fixes, read the following files to understand the project:

CLAUDE.md — Project overview, workspace layout, build & deploy workflow (MANDATORY)
controller/README.md — Full architecture, module map, API endpoints, feature documentation
CHANGELOG.md — Recent changes for context on what was implemented (including yesterday's bug fixes)
TASK.md (this file) — The bug list and fix instructions

Then read each source file listed in a bug section before fixing it.

After all fixes are complete, update:

CHANGELOG.md — Add a new entry at the top following the existing format (session 40, version v0.12.4)
controller/README.md — If any fix changes behavior, API, or architecture, update the relevant section

Fix Priority

Fixes are organized into 3 tiers:

CRITICAL — Data loss, panics, crashes. Fix these first.
HIGH — Logic errors, resource leaks, incorrect behavior. Fix these second.
MEDIUM — Edge cases, code quality, minor issues. Fix if time allows.

CRITICAL Fixes

C1. `SetAppBackupBulk` data loss + nil map panic (settings/settings.go)

File: internal/settings/settings.go, lines 271-282

Problem (two bugs in one function):

Data loss: Line 280 s.AppBackup = newMap replaces the ENTIRE map with only the keys from the input prefs parameter. Any stacks NOT in the input map are permanently deleted from the settings file, including their CrossDrive backup configurations.

Example: Settings has {"app1": {Enabled:true, CrossDrive:{...}}, "app2": {Enabled:false}}. Caller passes prefs = {"app1": false} (intending to disable only app1). After this function, app2's entire config is deleted.
Nil map panic: Line 276 reads s.AppBackup[name] but there is no nil check on s.AppBackup before the loop. If s.AppBackup is nil (new installation, empty settings), this panics.

Fix: Update the map IN PLACE instead of replacing it. Add nil guard:

func (s *Settings) SetAppBackupBulk(prefs map[string]bool) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.AppBackup == nil {
        s.AppBackup = make(map[string]AppBackupPrefs)
    }
    for name, enabled := range prefs {
        existing := s.AppBackup[name]
        existing.Enabled = enabled
        s.AppBackup[name] = existing
    }
    return s.save()
}

Key change: Remove line 274 (newMap := make(...)) and line 280 (s.AppBackup = newMap). Iterate over s.AppBackup directly. The comment on line 270 says "Preserves existing CrossDrive configs" — the current code does NOT do that for stacks absent from the input.

C2. `UpdateStackConfig` nil Env map panic (stacks/deploy.go)

File: internal/stacks/deploy.go, line 245

Problem: appCfg.Env[key] = val panics with "assignment to entry in nil map" when appCfg.Env is nil. This happens when app.yaml exists but has no env: section (e.g., deployed: true with no env vars).

The fix already exists in UpdateOptionalConfig (lines 312-314) but is missing from UpdateStackConfig.

Fix: Add nil check after line 233 (after if appCfg == nil || !appCfg.Deployed check), before the loop:

if appCfg.Env == nil {
    appCfg.Env = make(map[string]string)
}

C3. `ValidateDump` missing `scanner.Err()` check (backup/dbdump.go)

File: internal/backup/dbdump.go, after line 295

Problem: After the for scanner.Scan() { ... } loop (lines 269-295), scanner.Err() is never checked. If an I/O error occurs during scanning (disk error, permission issue, file truncated), the scanner silently stops iterating. The function continues with whatever partial data was read and may incorrectly mark the dump as valid (v.Valid = true at line 315). This means a corrupted or unreadable dump file could pass validation.

Fix: Add error check immediately after the for loop ends (after line 295, before line 296):

if err := scanner.Err(); err != nil {
    v.Error = fmt.Sprintf("hiba az olvasás közben: %v", err)
    log.Printf("[WARN] ValidateDump FAIL: %s — scanner error: %v", filePath, err)
    return v
}

HIGH Fixes

H1. `nextDailyRun` uses `Add(24h)` — wrong across DST transitions (scheduler/scheduler.go)

File: internal/scheduler/scheduler.go, line 252

Problem: next = next.Add(24 * time.Hour) is used when today's scheduled time has passed. During a DST transition in Europe/Budapest:

Spring forward (March): Adding 24h to "02:30 CET" gives "03:30 CEST" instead of "02:30 CEST" — the job runs 1 hour late
Fall back (October): Adding 24h from "02:30 CEST" gives "01:30 CET" — the job runs 1 hour early

Fix: Replace line 252 with calendar-day arithmetic:

// If the time has already passed today, schedule for tomorrow
if !next.After(now) {
    next = time.Date(now.Year(), now.Month(), now.Day()+1, hour, min, 0, 0, loc)
}

This uses time.Date which correctly handles DST by constructing the time in the target timezone for the specific calendar date.

H2. `nextDailyRun` calls `time.LoadLocation` on every iteration (scheduler/scheduler.go)

File: internal/scheduler/scheduler.go, lines 241-245

Problem: time.LoadLocation("Europe/Budapest") is called every time nextDailyRun runs — i.e., every single day for each daily job, and additionally on startup. This is both inefficient (filesystem read + parsing each time) and dangerous: if the system loses access to tzdata at runtime (e.g., container rebuild removing the package), the fallback to UTC happens silently with only a log line. Jobs would shift by 1-2 hours with no alert.

Fix: Cache the timezone at package level or in the Scheduler struct. Preferred approach — package-level sync.Once:

var (
    budapestLoc     *time.Location
    budapestLocOnce sync.Once
)

func getBudapestLocation() *time.Location {
    budapestLocOnce.Do(func() {
        loc, err := time.LoadLocation("Europe/Budapest")
        if err != nil {
            log.Printf("[ERROR] Cannot load Europe/Budapest timezone: %v — using UTC", err)
            loc = time.UTC
        }
        budapestLoc = loc
    })
    return budapestLoc
}

Then in nextDailyRun, replace lines 241-245 with: loc := getBudapestLocation().

H3. `settings.save()` leaks .tmp file on `WriteFile` failure (settings/settings.go)

File: internal/settings/settings.go, lines 138-139

Problem: If os.WriteFile(tmpPath, data, 0644) fails (e.g., disk full, permission denied), the function returns the error but does NOT clean up the .tmp file. Only the os.Rename failure path (line 143) calls os.Remove(tmpPath). Repeated failures accumulate orphaned .tmp files.

Fix: Add cleanup to the WriteFile error path:

if err := os.WriteFile(tmpPath, data, 0644); err != nil {
    os.Remove(tmpPath) // clean up partial file
    return fmt.Errorf("writing tmp settings: %w", err)
}

H4. `SetNotificationPrefs` panics on nil input (settings/settings.go)

File: internal/settings/settings.go, line 220

Problem: copy := *prefs dereferences the pointer without checking for nil. If any caller passes nil (possible from malformed JSON decoding or future code changes), this panics.

Fix: Add nil guard at the start of the function (after line 217, before line 218):

func (s *Settings) SetNotificationPrefs(prefs *NotificationPrefs) error {
    if prefs == nil {
        return fmt.Errorf("notification preferences cannot be nil")
    }
    s.mu.Lock()
    defer s.mu.Unlock()
    // ... rest unchanged
}

H5. `appDirSize` ignores `Sscanf` return value (backup/appdata.go)

File: internal/backup/appdata.go, line 150

Problem: fmt.Sscanf(fields[0], "%d", &size) return value is not checked. If du -sb returns malformed output, size silently stays 0 and the function returns (0, "0 B") which looks like a successful zero-size measurement rather than a parse error.

Same pattern exists in stacks/delete.go line 289 (getDirSizeBytes).

Fix for appdata.go line 150:

var size int64
n, _ := fmt.Sscanf(fields[0], "%d", &size)
if n != 1 {
    return 0, "?"
}
return size, humanizeBytes(size)

Fix for stacks/delete.go line 289:

var size int64
if n, _ := fmt.Sscanf(fields[0], "%d", &size); n != 1 {
    return 0
}
return size

H6. `getDirSizeBytes` has no timeout (stacks/delete.go)

File: internal/stacks/delete.go, lines 280-293

Problem: exec.Command("du", "-sb", path) runs with no timeout. If the path is on a slow/unmounted filesystem, this hangs the delete handler indefinitely. The analogous function in backup/appdata.go already uses exec.CommandContext with a 30s timeout.

Fix:

func getDirSizeBytes(path string) int64 {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    cmd := exec.CommandContext(ctx, "du", "-sb", path)
    // ... rest unchanged
}

Add "context" and "time" to the import block if not already present.

H7. `dbdump.go` tmpFile not using `defer Close` (backup/dbdump.go)

File: internal/backup/dbdump.go, lines 178-190

Problem: After os.Create(tmpPath) at line 178, the file is closed explicitly at line 190 (tmpFile.Close()). But if cmd.Run() panics (rather than returning an error), the file handle leaks. The standard Go idiom is to use defer for resource cleanup.

Fix: Replace the explicit close with a deferred close:

tmpFile, err := os.Create(tmpPath)
if err != nil {
    result.Error = fmt.Errorf("creating tmp file: %w", err)
    result.Duration = time.Since(start)
    return result
}
defer tmpFile.Close()

cmd.Stdout = tmpFile
// ... remove the explicit tmpFile.Close() at line 190

Note: The defer runs after cmd.Run(), so stdout is properly flushed by the subprocess exit. On Linux, closing an already-closed fd is a no-op, so this is safe even if the OS closes it implicitly.

H8. `UpdateCrossDriveStatus` comment contradicts implementation (settings/settings.go)

File: internal/settings/settings.go, line 324

Problem: The comment says "fn receives a pointer to the CrossDriveBackup (creates one if nil) and may mutate it" but the implementation at lines 332-333 does the opposite — it returns nil when CrossDrive is nil (does NOT create one).

Fix: Update the comment:

// UpdateCrossDriveStatus updates runtime status fields for a cross-drive backup in-place.
// fn receives a pointer to the CrossDriveBackup and may mutate it.
// If no cross-drive config exists for the stack, does nothing and returns nil.

MEDIUM Fixes

M1. `notify/notifier.go` custom `containsBytes` should use `strings.Contains` (notify/notifier.go)

File: internal/notify/notifier.go, lines 316-327

Problem: The contains() and containsBytes() functions are custom reimplementations of strings.Contains(). While technically correct (Go uses signed int for len() so the subtraction doesn't underflow), this is unnecessary complexity. The standard library function is well-tested, optimized, and immediately recognizable.

Fix: Replace both functions with:

func contains(s, substr string) bool {
    return strings.Contains(s, substr)
}

Or better: remove the contains function entirely and use strings.Contains() directly at all call sites. Delete the containsBytes function.

Add "strings" to the import block if not already present.

M2. `scheduler.Every()` doesn't validate interval > 0 (scheduler/scheduler.go)

File: internal/scheduler/scheduler.go, lines 43-53

Problem: If interval is zero or negative, time.NewTicker(interval) panics at line 134 with "non-positive interval for NewTicker". While current callers always pass valid intervals, a future misconfiguration (e.g., a missing config value parsed as 0) would crash the controller at startup with no useful error message.

Fix: Add validation at the start of Every():

func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc) {
    if interval <= 0 {
        s.logger.Printf("[ERROR] Periodic job %s has invalid interval %s — job not registered", name, interval)
        return
    }
    // ... rest unchanged
}

M3. `scheduler.executeJob` doesn't set `LastRun` on panic (scheduler/scheduler.go)

File: internal/scheduler/scheduler.go, lines 186-193

Problem: When a job panics, the panic recovery defer at lines 186-193 sets job.LastErr but does NOT set job.LastRun. The normal path sets both at lines 203-205, but those lines are never reached during a panic. This means after a panic, the job status shows the panic error with a stale LastRun timestamp (or zero time if the job never succeeded).

Fix: Update the panic recovery to also set LastRun:

defer func() {
    if r := recover(); r != nil {
        s.mu.Lock()
        job.LastErr = fmt.Errorf("panic: %v", r)
        job.LastRun = time.Now()
        s.mu.Unlock()
        s.logger.Printf("[ERROR] Job %s panicked: %v", job.Name, r)
    }
}()

M4. `logPostStartStatus` goroutine captures env slice by reference (stacks/manager.go)

File: internal/stacks/manager.go, around lines 630-651

Problem: The goroutine captures env []string from the function parameter. If the calling code reuses or appends to the slice after calling logPostStartStatus, the goroutine may read modified data due to Go's slice sharing semantics. While this is unlikely in current code (callers don't modify env after the call), it's a defensive programming issue.

Fix: Copy the slice at the start of the function:

func (m *Manager) logPostStartStatus(name, stackDir string, env []string) {
    envCopy := make([]string, len(env))
    copy(envCopy, env)
    go func() {
        time.Sleep(3 * time.Second)
        output, err := m.composeExecCustomEnv(stackDir, envCopy, "ps", "-a", ...)
        // ... rest uses envCopy instead of env
    }()
}

M5. Multiple `time.LoadLocation("Europe/Budapest")` calls across the codebase

Files: internal/web/handlers.go (lines 450, 546), internal/web/funcmap.go (lines 15-18)

Problem: Same time.LoadLocation call duplicated in multiple places, each with its own error handling. The handlers.go calls were fixed with M7 from last session to fall back to UTC, but the repeated pattern is inefficient and error-prone.

Fix: Create a package-level helper in internal/web/ (e.g., in funcmap.go since it already loads the timezone):

var (
    webTimezone     *time.Location
    webTimezoneOnce sync.Once
)

func getTimezone() *time.Location {
    webTimezoneOnce.Do(func() {
        loc, err := time.LoadLocation("Europe/Budapest")
        if err != nil {
            loc = time.UTC
        }
        webTimezone = loc
    })
    return webTimezone
}

Then replace all time.LoadLocation("Europe/Budapest") calls in the web package with getTimezone().

Previously Identified (from session 1) — Still Open

These were in the previous TASK.md but were not yet fixed:

P1. Missing `PARTUUID=` handling in fstab (storage/scan_linux.go)

File: internal/storage/scan_linux.go, around line 130

getSystemDiskNames only parses UUID= entries from fstab, missing PARTUUID= entries common on Raspberry Pi. Add a PARTUUID= case that uses blkid -t PARTUUID=xxx -o device to resolve to a device path.

P2. "Run all" triggers manual-schedule cross-drive backups (api/router.go)

If the "Run all" endpoint should only trigger daily+weekly configs, it should filter out manual ones. Clarify intent and filter accordingly.

P3. Background goroutines use `context.Background()` (api/router.go)

Background goroutines for backup/restore use context.Background() and survive server shutdown. Use a server-level context that's cancelled on shutdown.

P4. `CheckBackupDestination` tier priority (system/mounts_linux.go)

Tier 4 (disk full) overwrites Tier 3 (system drive) warning. Should accumulate warnings instead of overwriting.

P5. Hardcoded `/opt/docker/felhom-controller/controller.yaml` paths (backup/backup.go)

Should use a constant or derive from m.cfg.Paths.DataDir.

P6. Inconsistent rollback in migration (storage/migrate.go)

If updateFn callback fails after data copy, the copied data on target is not cleaned up.

Testing Notes

After making fixes:

Build: go build ./... from the controller/ directory must succeed
Vet: go vet ./... must pass
Key areas to manually verify:
- Settings save/load cycle (C1 — create a settings.json with cross-drive configs, call SetAppBackupBulk with a subset, verify missing stacks are preserved)
- Stack config update with empty env (C2 — deploy a stack, clear its env, try UpdateStackConfig)
- DB dump validation on large files (C3 — ensure scanner.Err is checked)
- Daily job scheduling around midnight (H1)
- Settings file operations under disk-full conditions (H3)

Post-Fix Checklist

Add CHANGELOG.md entry listing all fixes
Update controller/README.md if any API behavior changed
go build ./... passes
go vet ./... passes

18 KiB Raw Blame History Unescape Escape

TASK.md — Bug Fix Plan for Sonnet 4.5 (Session 2)

Prompt (copy-paste this into Claude Code)

Context Gathering

Fix Priority

CRITICAL Fixes

C1. SetAppBackupBulk data loss + nil map panic (settings/settings.go)

C2. UpdateStackConfig nil Env map panic (stacks/deploy.go)

C3. ValidateDump missing scanner.Err() check (backup/dbdump.go)

HIGH Fixes

H1. nextDailyRun uses Add(24h) — wrong across DST transitions (scheduler/scheduler.go)

H2. nextDailyRun calls time.LoadLocation on every iteration (scheduler/scheduler.go)

H3. settings.save() leaks .tmp file on WriteFile failure (settings/settings.go)

H4. SetNotificationPrefs panics on nil input (settings/settings.go)

H5. appDirSize ignores Sscanf return value (backup/appdata.go)

H6. getDirSizeBytes has no timeout (stacks/delete.go)

H7. dbdump.go tmpFile not using defer Close (backup/dbdump.go)

H8. UpdateCrossDriveStatus comment contradicts implementation (settings/settings.go)