Bug fixes: - Add applyEnvOverrides to LoadFromBytes (M05) - Set state=failed on compose-up failure in selfupdate (M16) - Clamp usableMB to min 0 in memory check (M22) - Remove "manual" schedule from triggerAllCrossBackups (M23) - Add mmcblk device handling for partition paths (M21) - Fix stripPartition for mmcblk devices (L25) - Fix TruncateStr for UTF-8 and negative maxLen (L05/L06) - Fix AllDone to return false for empty restore plans (L14) - Fix PushOnce to return actual errors (L39) - Restore pending events on save failure in DrainPendingEvents (M03) - Add duplicate check in AddStoragePath (M04) - Call CleanupTempMounts after drive scan (H13) - Log SetStep save errors (M25) Hardening: - Guard scheduler Start() against double-start (M14) - Acquire mutex in scheduler Stop() before reading cancel (L24) - Cap log lines parameter to 10000 (L31) - Require POST for logout (L32) - Use sync.Once for Server.Close() (L49) - Panic on crypto/rand.Read failure in setup CSRF (L40) - Validate Bearer token against Hub API key in CSRF (H16 fix) - Replace custom hasPrefix with strings.HasPrefix (L13) - Replace simpleHash with crc32.ChecksumIEEE (L48) Cleanup: - Remove dead imageName function (L02) - Remove dead detectHostIPViaRoute function (L03) - Rename shadowed copy variable to cp (L07) - Copy DefaultEnabledEvents in GetNotificationPrefs early return (L09) - Update BUGHUNT.md with comprehensive audit results Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
31 KiB
Bug Hunt Report — Comprehensive Controller Audit
Date: 2026-02-25
Scope: Full controller codebase (controller/ — all ~80 Go files, all packages)
Method: Systematic code review of every file + live testing on demo node (192.168.0.162)
Verified on: v0.30.2, Go 1.24.0
Priority Legend
| Priority | Meaning |
|---|---|
| P0 | Critical — Data race causing corruption, potential panic in production, security vulnerability |
| P1 | High — Logic error causing wrong behavior, resource leak, silent data loss |
| P2 | Medium — Suboptimal behavior, inconsistency, performance issue, minor bug |
| P3 | Low — Code smell, dead code, cosmetic issue, hardening |
Table of Contents
- P0 Critical Issues
- P1 High Issues
- P2 Medium Issues
- P3 Low Issues
- Standardization / Optimization Opportunities
- Fix Plan — Ordered by Implementation Phase
P0 Critical Issues
C01 — UpdateStackConfig passes encrypted env values to docker compose
Files: stacks/deploy.go:340-366
Category: Bug — Silent data corruption
UpdateStackConfig loads app.yaml via LoadAppConfig() (line 340) which returns encrypted ENC:... values for sensitive fields. It then updates the modified fields and calls composeExecWithEnv (line 366) passing the mixed map — old encrypted values + new plaintext values. Docker compose receives the encrypted ciphertext as literal environment variable values. Containers get ENC:H98iUGr... instead of the actual DB password.
Verified on demo: immich/app.yaml has DB_PASSWORD: ENC:H98i.... If a user changes optional config via the UI, the next docker compose up -d would inject the encrypted string as DB_PASSWORD.
Fix: Use LoadAppConfigDecrypted(stackDir, m.encKey) for the compose execution path, or call m.stackEnv(stackDir) instead of m.composeExecWithEnv.
C02 — DecryptMap silently swallows decryption errors
File: crypto/crypto.go:107-113
Category: Silent data corruption
When decryption fails (key rotated, data corrupted), DecryptMap silently returns the raw ENC:... ciphertext as the value. Applications receive garbled base64 as passwords/secrets, causing silent failures. No error is logged or returned.
Fix: Return an error from DecryptMap, or at minimum log a [WARN]. Update callers to handle the error.
C03 — appCfg.Deployed = false written outside lock in deploy goroutine
Files: stacks/deploy.go:311
Category: Data race
In runComposeDeploy, on failure, appCfg.Deployed = false is written at line 311 outside the m.mu.Lock() block (lines 302-309). The same appCfg pointer is shared with s.AppConfig in the stacks map. Any concurrent GetStack/GetStacks call reading AppConfig.Deployed is a data race.
Fix: Move line 311 (appCfg.Deployed = false) inside the lock block before line 309.
C04 — GetStack/GetStacks returns shallow copies with shared pointer fields
Files: stacks/manager.go:449-470
Category: Data race
GetStack does copy := *s — a shallow copy. Stack.AppConfig is a pointer, so the "copy" shares the AppConfig with the original. Any goroutine reading copy.AppConfig.Env while another modifies the original's AppConfig (e.g., deploy goroutine) is a data race. Same issue with Containers, HealthProbe, and Meta.DeployFields.
Fix: Implement deep copy for Stack struct. At minimum, deep-copy AppConfig, Containers, and HealthProbe.
C05 — Watchdog pathProbeState fields read/written without synchronization
File: monitor/watchdog.go:141-163 and throughout
Category: Data race
Check() reads and writes pathProbeState fields (lastProbeTime, consecutiveFailures, probeCount, lastStatus, etc.) without holding w.mu. Meanwhile, GetDebugStatus() reads these same fields while holding w.mu. SafeDisconnect() and Reconnect() also modify state concurrently. Multiple data races.
Fix: Either hold w.mu during all state modifications in Check(), or give each pathProbeState its own mutex.
C06 — Metrics collector cancel field has data race between Start and Stop
File: metrics/collector.go:Start/Stop
Category: Data race
Start() writes c.cancel without any lock. Stop() reads it without any lock. If called from different goroutines (the expected pattern), this is a data race.
Fix: Protect with c.mu or use sync.Once for start.
C07 — activeRawMount race between cleanup, mount, and set
File: web/storage_handlers.go:919-935
Category: Race condition
In storageAttachMountRawHandler, the old mount is cleaned up under diskJobMu (line 919-924), then the lock is released, then storage.MountRaw runs without lock (line 926), then the new mount path is set under lock (line 933-935). Two concurrent requests can both clean up and create mounts, orphaning the first.
Fix: Hold diskJobMu across the entire operation (cleanup + mount + set).
C08 — SetEncryptionKey on stacks Manager has no locking
File: stacks/manager.go:114-116
Category: Data race
SetEncryptionKey writes m.encKey without holding any lock. If called concurrently with MigrateEncryption, stackEnv, or SaveAppConfig (which read m.encKey), this is a data race.
Fix: Acquire m.mu.Lock() before setting m.encKey.
P1 High Issues
H01 — Restic lock detection never triggers (checks stdout, not stderr)
File: backup/restic.go:135-151
Category: Bug — backup recovery broken
Snapshot() uses cmd.Output() which returns stdout only. Restic writes lock errors to stderr. The lock-detection code at line 138 checks string(out) for "lock"/"locked" — but out is stdout, not stderr. The auto-unlock-and-retry logic likely never triggers.
Fix: Use cmd.CombinedOutput(), or extract stderr from *exec.ExitError:
if exitErr, ok := err.(*exec.ExitError); ok {
errStr = string(exitErr.Stderr)
}
H02 — activeDrives() returns disconnected/decommissioned drives
File: backup/backup.go:206-222
Category: Logic error
activeDrives() logs which drives are disconnected but still adds ALL drives to the result. Functions like RunIntegrityCheck, ListSnapshots, aggregateRepoStats will then operate on disconnected drives, causing timeouts.
Fix: Add continue after logging disconnected drives to skip them:
if m.settings != nil && (m.settings.IsDisconnected(d) || m.settings.IsDecommissioned(d)) {
disconnected = append(disconnected, d)
continue // don't add to drives
}
H03 — MigrateEncryption holds RLock while writing files
File: stacks/manager.go:124-156
Category: Concurrency — semantic error
MigrateEncryption() acquires m.mu.RLock() then calls SaveAppConfig() which writes to disk. A write operation should hold a write lock, not a read lock.
Fix: Use m.mu.Lock() instead of m.mu.RLock().
H04 — SaveAppConfig is not atomic
File: stacks/deploy.go:555 (uses os.WriteFile)
Category: Data integrity
SaveAppConfig writes directly to app.yaml. A crash mid-write corrupts the file. This violates the project's own pattern (settings.go uses write-to-tmp + rename).
Fix: Write to app.yaml.tmp, then os.Rename to app.yaml.
H05 — Deploy failure saves plaintext secrets to disk
File: stacks/deploy.go:311-312
Category: Security
On deploy failure, runComposeDeploy calls SaveAppConfig(stackDir, appCfg, nil, nil) — passing nil for encKey and sensitiveVars. The plaintext secrets from the deploy form are saved unencrypted to app.yaml.
Fix: Pass m.encKey and SensitiveEnvVars(&meta) to SaveAppConfig in the failure path.
H06 — No rate limiting on login attempts
File: web/auth.go:91-128
Category: Security
handleLogin has no rate limiting, lockout, or progressive delay. Bcrypt is slow but still allows ~5-10 attempts/sec. Note: demo node currently has no password set (Auth: no password configured), making this irrelevant there but critical for production deployments.
Fix: Add per-IP rate limiter or exponential backoff on failed attempts.
H07 — Template rendering writes directly to ResponseWriter
Files: web/server.go:393-398, setup/handlers.go:871-877
Category: Bug — corrupt HTTP responses
render() calls tmpl.ExecuteTemplate(w, ...) directly on the ResponseWriter. If the template partially writes then errors, the client receives corrupt HTML with a 200 status. The subsequent http.Error(500) is silently ignored because headers are already sent.
Fix: Render to a bytes.Buffer first, only write to w on success:
var buf bytes.Buffer
if err := s.tmpl.ExecuteTemplate(&buf, name, data); err != nil {
http.Error(w, "Internal error", 500)
return
}
w.Header().Set("Content-Type", "text/html; charset=utf-8")
buf.WriteTo(w)
H08 — Syncer.Stop() panics if called twice
File: sync/sync.go:114-116
Category: Panic
Stop() calls close(s.stopCh). A double-call panics with "close of closed channel".
Fix: Use sync.Once:
s.stopOnce.Do(func() { close(s.stopCh) })
H09 — Sync race: TriggerSync and ticker can run doSync concurrently
File: sync/sync.go:120-137, 164-167
Category: Race condition
TriggerSync checks s.syncing under lock, releases the lock, then calls doSync(). Between unlock and doSync acquiring the lock to set syncing=true, the ticker goroutine can also enter doSync(). Two concurrent syncs corrupt the git cache.
Fix: Set s.syncing = true inside TriggerSync() before releasing the lock.
H10 — Cloudflare HTTP requests ignore context — cannot be cancelled
File: cloudflare/client.go:73
Category: Resource leak
do() uses http.NewRequest() instead of http.NewRequestWithContext(). The ctx parameter passed to Sync() is never propagated to HTTP calls. During shutdown, in-flight Cloudflare API requests block until their 15s timeout.
Fix: Pass context through to do() and use http.NewRequestWithContext().
H11 — GetFullStatus returns shared pointers when no cache exists
File: backup/backup.go:1213-1228
Category: Data race
When m.cachedStatus == nil, the fallback returns a FullBackupStatus that directly references m.lastDBDump and m.lastBackup pointers. After m.mu is released, the caller reads these while backup operations may be mutating them.
Fix: Deep-copy m.lastDBDump and m.lastBackup in the no-cache path.
H12 — No HTTP request body size limits
Files: Multiple handlers in web/handler_debug.go, web/storage_handlers.go, api/router.go
Category: Security — DoS
Every json.NewDecoder(r.Body).Decode(&req) has no body size limit. A multi-GB request consumes all memory.
Fix: Add middleware: r.Body = http.MaxBytesReader(w, r.Body, 1<<20) for all POST handlers.
H13 — runDriveScan in setup never calls CleanupTempMounts
File: setup/scanner.go:266-279
Category: Resource leak
ScanDrivesForInfraBackups temporarily mounts partitions. runDriveScan stores results but never cleans up. Temp mounts remain forever.
Fix: Call CleanupTempMounts(results, s.logger) after storing scan results.
H14 — Snapshot retry reuses same context with diminished timeout
File: backup/restic.go:139-147
Category: Bug
After lock detection + unlock, the retry reuses the same ctx with the original 30-minute timeout. Time spent on the first attempt is subtracted. Also, unlockCmd.Run() ignores its error.
Fix: Check unlock error. Consider creating a fresh context for retry.
H15 — crossdrive rsync leaf-name collision detection is incomplete
File: backup/crossdrive.go:375-389
Category: Logic error
The collision check uses os.Stat(dstPath) which checks pre-existing directories from previous runs, not just current-iteration collisions. Mount index 0 skips collision check entirely. Pre-existing leftovers could cause data to be rsynced into wrong directories.
Fix: Use a seen map to track leaf names within the current iteration:
seen := make(map[string]bool)
for _, srcMount := range mounts {
leaf := filepath.Base(srcMount)
if seen[leaf] { leaf = fmt.Sprintf("%s_%d", leaf, i) }
seen[leaf] = true
// ...
}
H16 — CSRF Bearer bypass — token not validated
File: web/csrf.go:36-41
Category: Security
Any request with Authorization: Bearer <anything> skips CSRF entirely. The Bearer token is not validated against the actual API key.
Fix: Validate the Bearer token before skipping CSRF (check against s.settings.GetAPIKey()).
H17 — Potential nil dereference on crossDriveRunner
File: web/handlers.go:919
Category: Potential panic
s.crossDriveRunner.ValidateDestination(...) is called in buildAppBackupRows() without nil check. The field is set post-construction via SetCrossDriveRunner.
Fix: Add nil guard before access.
H18 — selftest.checkDockerSocket panics on empty output
File: selftest/selftest.go:99
Category: Potential panic
string(out[:len(out)-1]) panics with index-out-of-range if out is empty.
Fix: Use strings.TrimSpace(string(out)) instead.
H19 — stderrBuf data race in MigrateDrive
File: storage/migrate_drive.go:314-331
Category: Data race
A background goroutine writes to stderrBuf. After cmd.Wait() returns, stderrBuf.String() is read, but Wait() with piped stderr does NOT guarantee the reader goroutine has finished.
Fix: Use sync.WaitGroup to wait for the stderr goroutine before reading.
H20 — MountRaw UUID slice panics if UUID shorter than 8 chars
File: storage/attach_linux.go:53-58
Category: Potential panic
uuid[:8] panics if UUID is shorter than 8 characters.
Fix: if len(uuid) >= 8 { dirName = uuid[:8] } else { dirName = uuid }
H21 — removeBindFstabEntry uses loose string matching
File: storage/attach_linux.go:431-456
Category: Logic error
strings.Contains(line, targetMountPath) matches /mnt/hdd_10 when target is /mnt/hdd_1. Can remove unrelated fstab entries.
Fix: Parse fstab fields and compare second field exactly.
P2 Medium Issues
M01 — http.Get without timeout in Hub connectivity test
File: main.go:~691
Uses http.DefaultClient (no timeout) while Gitea test correctly uses 5s timeout.
Fix: client := &http.Client{Timeout: 5 * time.Second}
M02 — alertMgr.Refresh missing storagePaths arg at initial call
File: main.go:591
Initial refresh omits sett.GetStoragePaths() while all 3 other call sites pass it.
Fix: Add sett.GetStoragePaths() to the variadic args.
M03 — DrainPendingEvents risks duplicate delivery on save failure
File: settings/settings.go:801-814
Events cleared from memory before save. If save fails, events returned but still on disk — duplicated on restart.
Fix: Restore events if save fails, or save before clearing.
M04 — AddStoragePath allows duplicate paths
File: settings/settings.go:422-432
No duplicate check. Calling twice with same path creates duplicate entries.
Fix: Check existing paths before append.
M05 — LoadFromBytes skips env overrides
File: config/config.go:200-211
Unlike Load, LoadFromBytes doesn't call applyEnvOverrides.
Fix: Add applyEnvOverrides(cfg) call.
M06 — ListDumpFiles re-validates every SQL file on every cache refresh (5 min)
File: backup/dbdump.go:449-450
Reads and parses entire dump files every refresh. For large dumps, significant I/O.
Fix: Use DBValidationCache to skip re-validation when file hasn't changed.
M07 — copyStackDBDumps reads entire files into memory
File: backup/crossdrive.go:466-484
Uses os.ReadFile for DB dumps that can be hundreds of MB.
Fix: Use io.Copy with os.Open/os.Create for streaming.
M08 — backupDrive only records last drive's snapshot in history
File: backup/backup.go:369-437
Multi-drive setups only report the last drive's snapshot.
Fix: Record all successful snapshots.
M09 — SubdomainInUse does disk I/O under RLock
File: stacks/deploy.go:58-82
Calls LoadAppConfig() (disk read) for every stack while holding RLock.
Fix: Cache subdomain in memory Stack struct.
M10 — InjectMissingFields auto-fills subdomain without uniqueness check
File: stacks/deploy.go:676-690
Uses field.Default or meta.Subdomain without checking SubdomainInUse or reserved names.
Fix: Add validation.
M11 — ParseComposeHDDMounts uses fragile line-by-line YAML parsing
File: stacks/delete.go:392-454
Cannot distinguish top-level vs service-level volumes: key.
Fix: Use yaml.v3 to properly parse.
M12 — Backup path removal in RemoveStack has no path validation
File: stacks/delete.go:304-316
Accepts arbitrary backupPathsToRemove from caller, calls os.RemoveAll. No validation paths are actually backup paths.
Fix: Validate paths start with expected backup directory prefix.
M13 — getDirSizeHuman has no timeout
File: stacks/delete.go:457-468
exec.Command("du", "-sh", path) with no timeout. Can hang on NFS.
Fix: Add context.WithTimeout(30s) like getDirSizeBytes does.
M14 — Scheduler doesn't guard against double-start
File: scheduler/scheduler.go:99-116
Calling Start() twice overwrites ctx/cancel, leaking the first set of goroutines.
Fix: Guard with if s.ctx != nil { return }.
M15 — Jobs registered after Start() never execute
File: scheduler/scheduler.go:60-96 vs 99-116
Start() only launches goroutines for currently-registered jobs.
Fix: Document requirement, or auto-launch late-registered jobs.
M16 — selfupdate.performUpdate leaves "pending" state on compose-up failure
File: selfupdate/updater.go:396-410
State file stays "pending" when compose-up fails, creating contradictory API state.
Fix: Set state.Status = "failed" on compose-up failure.
M17 — selfupdate docker commands have no timeout
File: selfupdate/updater.go:495-502
docker pull and docker compose up -d can hang indefinitely.
Fix: Use exec.CommandContext with timeout.
M18 — SetBackupRunningCheck has no mutex protection
File: selfupdate/updater.go:66-68
Writes u.backupRunning without lock. DryRun() also reads it without lock.
Fix: Use u.mu.Lock().
M19 — goroutine leak in ProbeStoragePath on hung mounts
File: system/mounts_linux.go:258-289
syscall.Statfs on a dead mount blocks forever. Each probe leaks a goroutine.
Fix: Track outstanding probes per path, prevent duplicate probes.
M20 — rsync --info=progress2 carriage returns not handled
File: storage/format.go:31-53
Progress scanner splits on \n only; \r-delimited updates get concatenated.
Fix: Custom scanner splitting on both \r and \n.
M21 — mmcblk device partition path construction is wrong
File: storage/format_linux.go:119-122
Only NVMe gets p separator. mmcblk0 → mmcblk01 instead of mmcblk0p1.
Fix: Add || strings.Contains(req.DevicePath, "mmcblk").
M22 — Integer underflow in memory availability message
File: api/router.go:381-386
If reservedMB > totalMB, usableMB goes negative, blocking all app starts.
Fix: Clamp usableMB to minimum 0.
M23 — triggerAllCrossBackups runs "manual" schedule
File: api/router.go:907-929
"Manual" schedule meant for per-app on-demand, not bulk triggers.
Fix: Remove RunAllScheduled(ctx, "manual").
M24 — writeFreshConfig creates bare &settings.Settings{} without path
File: setup/handlers.go:668-693
On settings load failure, creates Settings with no file path. SetPasswordHash (which calls save) writes to empty path.
Fix: Create Settings with proper path and logger.
M25 — SetStep race condition between unlock and Save
File: setup/setup.go:99-106
State modified under lock, then saved after unlock. Another goroutine can modify between.
Fix: Use saveLocked() variant called while holding lock.
M26 — Setup CSRF relies on client-controlled cookie + form
File: setup/csrf.go:34-43
Both sides of the comparison are client-controlled. No server-side token storage.
Fix: Store CSRF token server-side in SetupState.
M27 — PushEvent in notifier spawns goroutine without tracking
File: notify/notifier.go
Goroutine leak potential — no WaitGroup or context for clean shutdown.
Fix: Track goroutines with sync.WaitGroup.
M28 — SafeDisconnect and Check can race on same storage path
File: monitor/watchdog.go:498-644
Both modify pathProbeState concurrently without coordination. Can double-stop stacks.
Fix: Use broader mutex or state-transition lock.
M29 — stackProvider reads in backup package lack mutex protection
File: backup/backup.go:171-181,184-203,478-485
SetStackProvider writes under mutex, but all reads are unprotected.
Fix: Ensure single-call-at-startup invariant is documented, or add read locks.
M30 — Multiple GetStacks() calls in same handler
File: web/handlers.go:214,233
stacksHandler calls GetStacks() 3 times — each may return different snapshots.
Fix: Call once, reuse result.
P3 Low Issues
L01 — Dead code: fileExists function in main.go (line 1071)
L02 — Dead code: imageName function in selfupdate/updater.go (line 231)
L03 — Dead code: detectHostIPViaRoute in setup/network.go (always returns "")
L04 — Dead code: handleRoot in setup/handlers.go (both if branches do same redirect)
L05 — TruncateStr operates on bytes, not runes — breaks multi-byte UTF-8
- File:
util/strings.go:6-12andstacks/manager.go:718-724
L06 — TruncateStr no guard for negative maxLen
L07 — Variable copy shadows builtin in settings.go:294
L08 — save() in settings.go does not guard nil logger (line 218)
L09 — GetNotificationPrefs early return shares global DefaultEnabledEvents slice
L10 — os.ExpandEnv in config.go may corrupt $ in YAML values
L11 — Key file written non-atomically (crypto.go:37)
L12 — ValidateDump uses global log.Printf instead of injected logger
L13 — hasPrefix reimplements strings.HasPrefix in restore_scan.go
L14 — AllDone returns true for empty Apps list (restore_scan.go:134)
L15 — RunIntegrityCheck only reports last error across drives
L16 — restic.Stats() casts uint64 to int64 unsafely
L17 — logPostStartStatus goroutine leak potential (no context/WaitGroup)
L18 — Pinger.send response body not drained before close
L19 — sampleContainers creates context.Background() instead of using parent ctx
L20 — Docker commands without timeout in healthcheck.go
L21 — Non-deterministic Cloudflare rule order due to map iteration
L22 — parseDailyTime accepts trailing garbage ("12:30:45")
L23 — Syncer.Start() can be called multiple times (goroutine leak)
L24 — Stop() on scheduler reads s.cancel without mutex
L25 — stripPartition doesn't handle mmcblk devices
L26 — sizeBytes() silently returns 0 on ParseUint error
L27 — Duplicate dirSizeHuman/dirSizeBytesHuman/fmtBytes functions
L28 — w.Write errors ignored in multiple HTTP handlers
L29 — json.Encode errors ignored in multiple handlers
L30 — r.ParseForm() errors ignored (multiple handlers)
L31 — No upper bound on log lines query parameter (DoS)
L32 — Logout accepts GET requests (CSRF)
L33 — Content-Type check uses exact match, fails for application/json; charset=utf-8
L34 — SyncFileBrowserMounts writes config files non-atomically
L35 — DriveMigrator adapter types have triplicated methods (main.go:840-1038)
L36 — Goroutines in main.go use time.Sleep without context cancellation check
L37 — hubPusher.Push() error silently ignored in multiple callbacks
L38 — Setup shutdown errors silently discarded (main.go:1136-1137)
L39 — PushOnce in report/pusher.go always returns nil (misleading return type)
L40 — Static fallback CSRF token ("fallback-csrf-token") in setup/csrf.go
L41 — extractName returns "" but most API callers don't check for empty
L42 — Hardcoded config path /opt/docker/felhom-controller/controller.yaml in 3 setup places
L43 — GetStatus() in backup returns raw shared pointers (not deep-copied like GetFullStatus)
L44 — Fstab entries accumulate extra blank lines from double-newline appends
L45 — logFileHashes called after write — shows identical src/dst hashes
L46 — LoadMetadata prints to stderr instead of using logger
L47 — LoadMetadata silently ignores YAML parse errors, returns defaults
L48 — simpleHash in alerts.go has 32-bit collision risk
L49 — Close() on auth SessionStore can panic if called twice
L50 — CSS path in setup/handlers.go always fails (dead code)
L51 — isError in setup uses string comparison instead of errors.Is
L52 — Password stored in plaintext in setup-state.json during setup
Standardization / Optimization Opportunities
S01 — Unify env var construction for compose commands
Three different code paths build env vars: stackEnv, composeExecWithEnv, composeExecCustomEnv(env=nil). DOMAIN can be appended twice. Should consolidate into single method.
S02 — Unify humanizeBytes / fmtBytes / dirSizeBytesHuman / dirSizeHuman
At least 4 separate implementations of byte-to-human-readable conversion across packages. Should be one function in util/.
S03 — Standardize atomic file writes
settings.go uses write-to-tmp + rename, but SaveAppConfig, SyncFileBrowserMounts, crypto key file, and setup state do not. All persistent file writes should use the same pattern.
S04 — Standardize HTTP client creation
Many places create &http.Client{Timeout: X} per call. Should have shared clients with sensible defaults.
S05 — Standardize exec.Command with context/timeout
Most exec.Command calls have no timeout. Should use exec.CommandContext with appropriate timeouts consistently.
S06 — Standardize error handling in HTTP handlers
Some handlers return JSON errors, some use http.Error, some ignore write errors. Should be consistent.
S07 — Extract common adapter pattern
stackAdapter, watchdogStackAdapter, driveMigrateStackAdapter in main.go share identical methods. Should use embedded base adapter.
S08 — Consolidate getDirSizeHuman + getDirSizeBytes into single call
Two du invocations for the same directory. Call getDirSizeBytes once and format.
S09 — Template rendering: always buffer before writing
Multiple render() functions write directly to ResponseWriter. All should use bytes.Buffer.
S10 — Add MaxBytesReader middleware
Rather than per-handler body limiting, add a global middleware with sensible default (1MB for JSON, larger for file uploads).
Fix Plan — Ordered by Implementation Phase
Phase 1 — Critical Data Races & Security (P0)
Estimate: ~2 hours
- C01 — Fix
UpdateStackConfigto use decrypted env for compose - C02 — Add error return to
DecryptMap(or at least log) - C03 — Move
appCfg.Deployed = falseinside lock block - C04 — Implement
Stack.DeepCopy(), use inGetStack/GetStacks - C05 — Add per-state mutex to
pathProbeStatein watchdog - C06 — Protect metrics collector
cancelwith mutex - C07 — Hold
diskJobMuacross entire raw mount operation - C08 — Add lock to
SetEncryptionKey
Phase 2 — High-Priority Bugs (P1)
Estimate: ~3 hours
- H01 — Fix restic lock detection to use stderr
- H02 — Filter disconnected drives in
activeDrives() - H03 — Change to write lock in
MigrateEncryption - H04 — Make
SaveAppConfigatomic (write-to-tmp + rename) - H05 — Pass encKey to SaveAppConfig on deploy failure
- H07 — Buffer template rendering (web + setup)
- H08 — Use sync.Once in Syncer.Stop()
- H09 — Fix sync race (set syncing=true before unlock)
- H10 — Thread context through Cloudflare HTTP calls
- H11 — Deep-copy in GetFullStatus no-cache path
- H14 — Fix restic retry context + check unlock error
- H15 — Fix crossdrive leaf collision with
seenmap - H17 — Add nil check for crossDriveRunner
- H18 — Use TrimSpace for selftest output
- H19 — WaitGroup for stderr goroutine in MigrateDrive
- H20 — Guard UUID slice length
Phase 3 — Security Hardening (P1-P2)
Estimate: ~1.5 hours
- H06 — Add login rate limiting
- H12 — Add MaxBytesReader middleware
- H16 — Validate Bearer token in CSRF middleware
- H21 — Fix fstab entry matching to use field parsing
- M12 — Validate backup paths in RemoveStack
- M22 — Clamp memory availability to min 0
- M26 — Server-side CSRF for setup wizard
Phase 4 — Medium Priority Fixes (P2)
Estimate: ~3 hours
- M01 — Add timeout to Hub connectivity test
- M02 — Add storagePaths to initial alert refresh
- M03 — Fix DrainPendingEvents to restore on save failure
- M04 — Add duplicate check in AddStoragePath
- M06 — Use validation cache to skip re-validation
- M07 — Stream file copies instead of reading into memory
- M08 — Record all drive snapshots in history
- M09 — Cache subdomain in Stack struct
- M14 — Guard scheduler double-start
- M16 — Set "failed" state on compose-up failure
- M17 — Add timeout to selfupdate docker commands
- M21 — Fix mmcblk partition path
- M23 — Remove "manual" from triggerAllCrossBackups
- M24 — Fix writeFreshConfig settings path
- M25 — Fix SetStep race (saveLocked)
- M30 — Single GetStacks() call per handler
Phase 5 — Standardization (S-items)
Estimate: ~2 hours
- S01 — Unify compose env construction
- S02 — Single
humanizeBytesin util/ - S03 — Atomic writes everywhere
- S04 — Shared HTTP clients
- S05 — exec.CommandContext everywhere
- S07 — Base adapter embedding
- S09 — Template buffer pattern
Phase 6 — Low Priority Cleanup (P3)
Estimate: ~1.5 hours
Remove dead code (L01-L04), fix UTF-8 truncation (L05-L06), address remaining code smells and minor issues.
Total estimated effort: ~13 hours across 6 phases
Phases 1-3 are the most important and should be done first. Phase 4-5 can be done incrementally. Phase 6 is best done opportunistically when touching related files.