Files
deploy-felhom-compose/BUGHUNT.md
T
admin 45f75a916c fix: P2+P3 bug fixes, hardening, and cleanup (18 files)
Bug fixes:
- Add applyEnvOverrides to LoadFromBytes (M05)
- Set state=failed on compose-up failure in selfupdate (M16)
- Clamp usableMB to min 0 in memory check (M22)
- Remove "manual" schedule from triggerAllCrossBackups (M23)
- Add mmcblk device handling for partition paths (M21)
- Fix stripPartition for mmcblk devices (L25)
- Fix TruncateStr for UTF-8 and negative maxLen (L05/L06)
- Fix AllDone to return false for empty restore plans (L14)
- Fix PushOnce to return actual errors (L39)
- Restore pending events on save failure in DrainPendingEvents (M03)
- Add duplicate check in AddStoragePath (M04)
- Call CleanupTempMounts after drive scan (H13)
- Log SetStep save errors (M25)

Hardening:
- Guard scheduler Start() against double-start (M14)
- Acquire mutex in scheduler Stop() before reading cancel (L24)
- Cap log lines parameter to 10000 (L31)
- Require POST for logout (L32)
- Use sync.Once for Server.Close() (L49)
- Panic on crypto/rand.Read failure in setup CSRF (L40)
- Validate Bearer token against Hub API key in CSRF (H16 fix)
- Replace custom hasPrefix with strings.HasPrefix (L13)
- Replace simpleHash with crc32.ChecksumIEEE (L48)

Cleanup:
- Remove dead imageName function (L02)
- Remove dead detectHostIPViaRoute function (L03)
- Rename shadowed copy variable to cp (L07)
- Copy DefaultEnabledEvents in GetNotificationPrefs early return (L09)
- Update BUGHUNT.md with comprehensive audit results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 13:47:52 +01:00

31 KiB

Bug Hunt Report — Comprehensive Controller Audit

Date: 2026-02-25 Scope: Full controller codebase (controller/ — all ~80 Go files, all packages) Method: Systematic code review of every file + live testing on demo node (192.168.0.162) Verified on: v0.30.2, Go 1.24.0


Priority Legend

Priority Meaning
P0 Critical — Data race causing corruption, potential panic in production, security vulnerability
P1 High — Logic error causing wrong behavior, resource leak, silent data loss
P2 Medium — Suboptimal behavior, inconsistency, performance issue, minor bug
P3 Low — Code smell, dead code, cosmetic issue, hardening

Table of Contents


P0 Critical Issues

C01 — UpdateStackConfig passes encrypted env values to docker compose

Files: stacks/deploy.go:340-366 Category: Bug — Silent data corruption

UpdateStackConfig loads app.yaml via LoadAppConfig() (line 340) which returns encrypted ENC:... values for sensitive fields. It then updates the modified fields and calls composeExecWithEnv (line 366) passing the mixed map — old encrypted values + new plaintext values. Docker compose receives the encrypted ciphertext as literal environment variable values. Containers get ENC:H98iUGr... instead of the actual DB password.

Verified on demo: immich/app.yaml has DB_PASSWORD: ENC:H98i.... If a user changes optional config via the UI, the next docker compose up -d would inject the encrypted string as DB_PASSWORD.

Fix: Use LoadAppConfigDecrypted(stackDir, m.encKey) for the compose execution path, or call m.stackEnv(stackDir) instead of m.composeExecWithEnv.


C02 — DecryptMap silently swallows decryption errors

File: crypto/crypto.go:107-113 Category: Silent data corruption

When decryption fails (key rotated, data corrupted), DecryptMap silently returns the raw ENC:... ciphertext as the value. Applications receive garbled base64 as passwords/secrets, causing silent failures. No error is logged or returned.

Fix: Return an error from DecryptMap, or at minimum log a [WARN]. Update callers to handle the error.


C03 — appCfg.Deployed = false written outside lock in deploy goroutine

Files: stacks/deploy.go:311 Category: Data race

In runComposeDeploy, on failure, appCfg.Deployed = false is written at line 311 outside the m.mu.Lock() block (lines 302-309). The same appCfg pointer is shared with s.AppConfig in the stacks map. Any concurrent GetStack/GetStacks call reading AppConfig.Deployed is a data race.

Fix: Move line 311 (appCfg.Deployed = false) inside the lock block before line 309.


C04 — GetStack/GetStacks returns shallow copies with shared pointer fields

Files: stacks/manager.go:449-470 Category: Data race

GetStack does copy := *s — a shallow copy. Stack.AppConfig is a pointer, so the "copy" shares the AppConfig with the original. Any goroutine reading copy.AppConfig.Env while another modifies the original's AppConfig (e.g., deploy goroutine) is a data race. Same issue with Containers, HealthProbe, and Meta.DeployFields.

Fix: Implement deep copy for Stack struct. At minimum, deep-copy AppConfig, Containers, and HealthProbe.


C05 — Watchdog pathProbeState fields read/written without synchronization

File: monitor/watchdog.go:141-163 and throughout Category: Data race

Check() reads and writes pathProbeState fields (lastProbeTime, consecutiveFailures, probeCount, lastStatus, etc.) without holding w.mu. Meanwhile, GetDebugStatus() reads these same fields while holding w.mu. SafeDisconnect() and Reconnect() also modify state concurrently. Multiple data races.

Fix: Either hold w.mu during all state modifications in Check(), or give each pathProbeState its own mutex.


C06 — Metrics collector cancel field has data race between Start and Stop

File: metrics/collector.go:Start/Stop Category: Data race

Start() writes c.cancel without any lock. Stop() reads it without any lock. If called from different goroutines (the expected pattern), this is a data race.

Fix: Protect with c.mu or use sync.Once for start.


C07 — activeRawMount race between cleanup, mount, and set

File: web/storage_handlers.go:919-935 Category: Race condition

In storageAttachMountRawHandler, the old mount is cleaned up under diskJobMu (line 919-924), then the lock is released, then storage.MountRaw runs without lock (line 926), then the new mount path is set under lock (line 933-935). Two concurrent requests can both clean up and create mounts, orphaning the first.

Fix: Hold diskJobMu across the entire operation (cleanup + mount + set).


C08 — SetEncryptionKey on stacks Manager has no locking

File: stacks/manager.go:114-116 Category: Data race

SetEncryptionKey writes m.encKey without holding any lock. If called concurrently with MigrateEncryption, stackEnv, or SaveAppConfig (which read m.encKey), this is a data race.

Fix: Acquire m.mu.Lock() before setting m.encKey.


P1 High Issues

H01 — Restic lock detection never triggers (checks stdout, not stderr)

File: backup/restic.go:135-151 Category: Bug — backup recovery broken

Snapshot() uses cmd.Output() which returns stdout only. Restic writes lock errors to stderr. The lock-detection code at line 138 checks string(out) for "lock"/"locked" — but out is stdout, not stderr. The auto-unlock-and-retry logic likely never triggers.

Fix: Use cmd.CombinedOutput(), or extract stderr from *exec.ExitError:

if exitErr, ok := err.(*exec.ExitError); ok {
    errStr = string(exitErr.Stderr)
}

H02 — activeDrives() returns disconnected/decommissioned drives

File: backup/backup.go:206-222 Category: Logic error

activeDrives() logs which drives are disconnected but still adds ALL drives to the result. Functions like RunIntegrityCheck, ListSnapshots, aggregateRepoStats will then operate on disconnected drives, causing timeouts.

Fix: Add continue after logging disconnected drives to skip them:

if m.settings != nil && (m.settings.IsDisconnected(d) || m.settings.IsDecommissioned(d)) {
    disconnected = append(disconnected, d)
    continue // don't add to drives
}

H03 — MigrateEncryption holds RLock while writing files

File: stacks/manager.go:124-156 Category: Concurrency — semantic error

MigrateEncryption() acquires m.mu.RLock() then calls SaveAppConfig() which writes to disk. A write operation should hold a write lock, not a read lock.

Fix: Use m.mu.Lock() instead of m.mu.RLock().


H04 — SaveAppConfig is not atomic

File: stacks/deploy.go:555 (uses os.WriteFile) Category: Data integrity

SaveAppConfig writes directly to app.yaml. A crash mid-write corrupts the file. This violates the project's own pattern (settings.go uses write-to-tmp + rename).

Fix: Write to app.yaml.tmp, then os.Rename to app.yaml.


H05 — Deploy failure saves plaintext secrets to disk

File: stacks/deploy.go:311-312 Category: Security

On deploy failure, runComposeDeploy calls SaveAppConfig(stackDir, appCfg, nil, nil) — passing nil for encKey and sensitiveVars. The plaintext secrets from the deploy form are saved unencrypted to app.yaml.

Fix: Pass m.encKey and SensitiveEnvVars(&meta) to SaveAppConfig in the failure path.


H06 — No rate limiting on login attempts

File: web/auth.go:91-128 Category: Security

handleLogin has no rate limiting, lockout, or progressive delay. Bcrypt is slow but still allows ~5-10 attempts/sec. Note: demo node currently has no password set (Auth: no password configured), making this irrelevant there but critical for production deployments.

Fix: Add per-IP rate limiter or exponential backoff on failed attempts.


H07 — Template rendering writes directly to ResponseWriter

Files: web/server.go:393-398, setup/handlers.go:871-877 Category: Bug — corrupt HTTP responses

render() calls tmpl.ExecuteTemplate(w, ...) directly on the ResponseWriter. If the template partially writes then errors, the client receives corrupt HTML with a 200 status. The subsequent http.Error(500) is silently ignored because headers are already sent.

Fix: Render to a bytes.Buffer first, only write to w on success:

var buf bytes.Buffer
if err := s.tmpl.ExecuteTemplate(&buf, name, data); err != nil {
    http.Error(w, "Internal error", 500)
    return
}
w.Header().Set("Content-Type", "text/html; charset=utf-8")
buf.WriteTo(w)

H08 — Syncer.Stop() panics if called twice

File: sync/sync.go:114-116 Category: Panic

Stop() calls close(s.stopCh). A double-call panics with "close of closed channel".

Fix: Use sync.Once:

s.stopOnce.Do(func() { close(s.stopCh) })

H09 — Sync race: TriggerSync and ticker can run doSync concurrently

File: sync/sync.go:120-137, 164-167 Category: Race condition

TriggerSync checks s.syncing under lock, releases the lock, then calls doSync(). Between unlock and doSync acquiring the lock to set syncing=true, the ticker goroutine can also enter doSync(). Two concurrent syncs corrupt the git cache.

Fix: Set s.syncing = true inside TriggerSync() before releasing the lock.


H10 — Cloudflare HTTP requests ignore context — cannot be cancelled

File: cloudflare/client.go:73 Category: Resource leak

do() uses http.NewRequest() instead of http.NewRequestWithContext(). The ctx parameter passed to Sync() is never propagated to HTTP calls. During shutdown, in-flight Cloudflare API requests block until their 15s timeout.

Fix: Pass context through to do() and use http.NewRequestWithContext().


H11 — GetFullStatus returns shared pointers when no cache exists

File: backup/backup.go:1213-1228 Category: Data race

When m.cachedStatus == nil, the fallback returns a FullBackupStatus that directly references m.lastDBDump and m.lastBackup pointers. After m.mu is released, the caller reads these while backup operations may be mutating them.

Fix: Deep-copy m.lastDBDump and m.lastBackup in the no-cache path.


H12 — No HTTP request body size limits

Files: Multiple handlers in web/handler_debug.go, web/storage_handlers.go, api/router.go Category: Security — DoS

Every json.NewDecoder(r.Body).Decode(&req) has no body size limit. A multi-GB request consumes all memory.

Fix: Add middleware: r.Body = http.MaxBytesReader(w, r.Body, 1<<20) for all POST handlers.


H13 — runDriveScan in setup never calls CleanupTempMounts

File: setup/scanner.go:266-279 Category: Resource leak

ScanDrivesForInfraBackups temporarily mounts partitions. runDriveScan stores results but never cleans up. Temp mounts remain forever.

Fix: Call CleanupTempMounts(results, s.logger) after storing scan results.


H14 — Snapshot retry reuses same context with diminished timeout

File: backup/restic.go:139-147 Category: Bug

After lock detection + unlock, the retry reuses the same ctx with the original 30-minute timeout. Time spent on the first attempt is subtracted. Also, unlockCmd.Run() ignores its error.

Fix: Check unlock error. Consider creating a fresh context for retry.


H15 — crossdrive rsync leaf-name collision detection is incomplete

File: backup/crossdrive.go:375-389 Category: Logic error

The collision check uses os.Stat(dstPath) which checks pre-existing directories from previous runs, not just current-iteration collisions. Mount index 0 skips collision check entirely. Pre-existing leftovers could cause data to be rsynced into wrong directories.

Fix: Use a seen map to track leaf names within the current iteration:

seen := make(map[string]bool)
for _, srcMount := range mounts {
    leaf := filepath.Base(srcMount)
    if seen[leaf] { leaf = fmt.Sprintf("%s_%d", leaf, i) }
    seen[leaf] = true
    // ...
}

H16 — CSRF Bearer bypass — token not validated

File: web/csrf.go:36-41 Category: Security

Any request with Authorization: Bearer <anything> skips CSRF entirely. The Bearer token is not validated against the actual API key.

Fix: Validate the Bearer token before skipping CSRF (check against s.settings.GetAPIKey()).


H17 — Potential nil dereference on crossDriveRunner

File: web/handlers.go:919 Category: Potential panic

s.crossDriveRunner.ValidateDestination(...) is called in buildAppBackupRows() without nil check. The field is set post-construction via SetCrossDriveRunner.

Fix: Add nil guard before access.


H18 — selftest.checkDockerSocket panics on empty output

File: selftest/selftest.go:99 Category: Potential panic

string(out[:len(out)-1]) panics with index-out-of-range if out is empty.

Fix: Use strings.TrimSpace(string(out)) instead.


H19 — stderrBuf data race in MigrateDrive

File: storage/migrate_drive.go:314-331 Category: Data race

A background goroutine writes to stderrBuf. After cmd.Wait() returns, stderrBuf.String() is read, but Wait() with piped stderr does NOT guarantee the reader goroutine has finished.

Fix: Use sync.WaitGroup to wait for the stderr goroutine before reading.


H20 — MountRaw UUID slice panics if UUID shorter than 8 chars

File: storage/attach_linux.go:53-58 Category: Potential panic

uuid[:8] panics if UUID is shorter than 8 characters.

Fix: if len(uuid) >= 8 { dirName = uuid[:8] } else { dirName = uuid }


H21 — removeBindFstabEntry uses loose string matching

File: storage/attach_linux.go:431-456 Category: Logic error

strings.Contains(line, targetMountPath) matches /mnt/hdd_10 when target is /mnt/hdd_1. Can remove unrelated fstab entries.

Fix: Parse fstab fields and compare second field exactly.


P2 Medium Issues

M01 — http.Get without timeout in Hub connectivity test

File: main.go:~691 Uses http.DefaultClient (no timeout) while Gitea test correctly uses 5s timeout. Fix: client := &http.Client{Timeout: 5 * time.Second}

M02 — alertMgr.Refresh missing storagePaths arg at initial call

File: main.go:591 Initial refresh omits sett.GetStoragePaths() while all 3 other call sites pass it. Fix: Add sett.GetStoragePaths() to the variadic args.

M03 — DrainPendingEvents risks duplicate delivery on save failure

File: settings/settings.go:801-814 Events cleared from memory before save. If save fails, events returned but still on disk — duplicated on restart. Fix: Restore events if save fails, or save before clearing.

M04 — AddStoragePath allows duplicate paths

File: settings/settings.go:422-432 No duplicate check. Calling twice with same path creates duplicate entries. Fix: Check existing paths before append.

M05 — LoadFromBytes skips env overrides

File: config/config.go:200-211 Unlike Load, LoadFromBytes doesn't call applyEnvOverrides. Fix: Add applyEnvOverrides(cfg) call.

M06 — ListDumpFiles re-validates every SQL file on every cache refresh (5 min)

File: backup/dbdump.go:449-450 Reads and parses entire dump files every refresh. For large dumps, significant I/O. Fix: Use DBValidationCache to skip re-validation when file hasn't changed.

M07 — copyStackDBDumps reads entire files into memory

File: backup/crossdrive.go:466-484 Uses os.ReadFile for DB dumps that can be hundreds of MB. Fix: Use io.Copy with os.Open/os.Create for streaming.

M08 — backupDrive only records last drive's snapshot in history

File: backup/backup.go:369-437 Multi-drive setups only report the last drive's snapshot. Fix: Record all successful snapshots.

M09 — SubdomainInUse does disk I/O under RLock

File: stacks/deploy.go:58-82 Calls LoadAppConfig() (disk read) for every stack while holding RLock. Fix: Cache subdomain in memory Stack struct.

M10 — InjectMissingFields auto-fills subdomain without uniqueness check

File: stacks/deploy.go:676-690 Uses field.Default or meta.Subdomain without checking SubdomainInUse or reserved names. Fix: Add validation.

M11 — ParseComposeHDDMounts uses fragile line-by-line YAML parsing

File: stacks/delete.go:392-454 Cannot distinguish top-level vs service-level volumes: key. Fix: Use yaml.v3 to properly parse.

M12 — Backup path removal in RemoveStack has no path validation

File: stacks/delete.go:304-316 Accepts arbitrary backupPathsToRemove from caller, calls os.RemoveAll. No validation paths are actually backup paths. Fix: Validate paths start with expected backup directory prefix.

M13 — getDirSizeHuman has no timeout

File: stacks/delete.go:457-468 exec.Command("du", "-sh", path) with no timeout. Can hang on NFS. Fix: Add context.WithTimeout(30s) like getDirSizeBytes does.

M14 — Scheduler doesn't guard against double-start

File: scheduler/scheduler.go:99-116 Calling Start() twice overwrites ctx/cancel, leaking the first set of goroutines. Fix: Guard with if s.ctx != nil { return }.

M15 — Jobs registered after Start() never execute

File: scheduler/scheduler.go:60-96 vs 99-116 Start() only launches goroutines for currently-registered jobs. Fix: Document requirement, or auto-launch late-registered jobs.

M16 — selfupdate.performUpdate leaves "pending" state on compose-up failure

File: selfupdate/updater.go:396-410 State file stays "pending" when compose-up fails, creating contradictory API state. Fix: Set state.Status = "failed" on compose-up failure.

M17 — selfupdate docker commands have no timeout

File: selfupdate/updater.go:495-502 docker pull and docker compose up -d can hang indefinitely. Fix: Use exec.CommandContext with timeout.

M18 — SetBackupRunningCheck has no mutex protection

File: selfupdate/updater.go:66-68 Writes u.backupRunning without lock. DryRun() also reads it without lock. Fix: Use u.mu.Lock().

M19 — goroutine leak in ProbeStoragePath on hung mounts

File: system/mounts_linux.go:258-289 syscall.Statfs on a dead mount blocks forever. Each probe leaks a goroutine. Fix: Track outstanding probes per path, prevent duplicate probes.

M20 — rsync --info=progress2 carriage returns not handled

File: storage/format.go:31-53 Progress scanner splits on \n only; \r-delimited updates get concatenated. Fix: Custom scanner splitting on both \r and \n.

M21 — mmcblk device partition path construction is wrong

File: storage/format_linux.go:119-122 Only NVMe gets p separator. mmcblk0mmcblk01 instead of mmcblk0p1. Fix: Add || strings.Contains(req.DevicePath, "mmcblk").

M22 — Integer underflow in memory availability message

File: api/router.go:381-386 If reservedMB > totalMB, usableMB goes negative, blocking all app starts. Fix: Clamp usableMB to minimum 0.

M23 — triggerAllCrossBackups runs "manual" schedule

File: api/router.go:907-929 "Manual" schedule meant for per-app on-demand, not bulk triggers. Fix: Remove RunAllScheduled(ctx, "manual").

M24 — writeFreshConfig creates bare &settings.Settings{} without path

File: setup/handlers.go:668-693 On settings load failure, creates Settings with no file path. SetPasswordHash (which calls save) writes to empty path. Fix: Create Settings with proper path and logger.

M25 — SetStep race condition between unlock and Save

File: setup/setup.go:99-106 State modified under lock, then saved after unlock. Another goroutine can modify between. Fix: Use saveLocked() variant called while holding lock.

File: setup/csrf.go:34-43 Both sides of the comparison are client-controlled. No server-side token storage. Fix: Store CSRF token server-side in SetupState.

M27 — PushEvent in notifier spawns goroutine without tracking

File: notify/notifier.go Goroutine leak potential — no WaitGroup or context for clean shutdown. Fix: Track goroutines with sync.WaitGroup.

M28 — SafeDisconnect and Check can race on same storage path

File: monitor/watchdog.go:498-644 Both modify pathProbeState concurrently without coordination. Can double-stop stacks. Fix: Use broader mutex or state-transition lock.

M29 — stackProvider reads in backup package lack mutex protection

File: backup/backup.go:171-181,184-203,478-485 SetStackProvider writes under mutex, but all reads are unprotected. Fix: Ensure single-call-at-startup invariant is documented, or add read locks.

M30 — Multiple GetStacks() calls in same handler

File: web/handlers.go:214,233 stacksHandler calls GetStacks() 3 times — each may return different snapshots. Fix: Call once, reuse result.


P3 Low Issues

L01 — Dead code: fileExists function in main.go (line 1071)

L02 — Dead code: imageName function in selfupdate/updater.go (line 231)

L03 — Dead code: detectHostIPViaRoute in setup/network.go (always returns "")

L04 — Dead code: handleRoot in setup/handlers.go (both if branches do same redirect)

L05 — TruncateStr operates on bytes, not runes — breaks multi-byte UTF-8

  • File: util/strings.go:6-12 and stacks/manager.go:718-724

L06 — TruncateStr no guard for negative maxLen

L07 — Variable copy shadows builtin in settings.go:294

L08 — save() in settings.go does not guard nil logger (line 218)

L09 — GetNotificationPrefs early return shares global DefaultEnabledEvents slice

L10 — os.ExpandEnv in config.go may corrupt $ in YAML values

L11 — Key file written non-atomically (crypto.go:37)

L12 — ValidateDump uses global log.Printf instead of injected logger

L13 — hasPrefix reimplements strings.HasPrefix in restore_scan.go

L14 — AllDone returns true for empty Apps list (restore_scan.go:134)

L15 — RunIntegrityCheck only reports last error across drives

L16 — restic.Stats() casts uint64 to int64 unsafely

L17 — logPostStartStatus goroutine leak potential (no context/WaitGroup)

L18 — Pinger.send response body not drained before close

L19 — sampleContainers creates context.Background() instead of using parent ctx

L20 — Docker commands without timeout in healthcheck.go

L21 — Non-deterministic Cloudflare rule order due to map iteration

L22 — parseDailyTime accepts trailing garbage ("12:30:45")

L23 — Syncer.Start() can be called multiple times (goroutine leak)

L24 — Stop() on scheduler reads s.cancel without mutex

L25 — stripPartition doesn't handle mmcblk devices

L26 — sizeBytes() silently returns 0 on ParseUint error

L27 — Duplicate dirSizeHuman/dirSizeBytesHuman/fmtBytes functions

L28 — w.Write errors ignored in multiple HTTP handlers

L29 — json.Encode errors ignored in multiple handlers

L30 — r.ParseForm() errors ignored (multiple handlers)

L31 — No upper bound on log lines query parameter (DoS)

L32 — Logout accepts GET requests (CSRF)

L33 — Content-Type check uses exact match, fails for application/json; charset=utf-8

L34 — SyncFileBrowserMounts writes config files non-atomically

L35 — DriveMigrator adapter types have triplicated methods (main.go:840-1038)

L36 — Goroutines in main.go use time.Sleep without context cancellation check

L37 — hubPusher.Push() error silently ignored in multiple callbacks

L38 — Setup shutdown errors silently discarded (main.go:1136-1137)

L39 — PushOnce in report/pusher.go always returns nil (misleading return type)

L40 — Static fallback CSRF token ("fallback-csrf-token") in setup/csrf.go

L41 — extractName returns "" but most API callers don't check for empty

L42 — Hardcoded config path /opt/docker/felhom-controller/controller.yaml in 3 setup places

L43 — GetStatus() in backup returns raw shared pointers (not deep-copied like GetFullStatus)

L44 — Fstab entries accumulate extra blank lines from double-newline appends

L45 — logFileHashes called after write — shows identical src/dst hashes

L46 — LoadMetadata prints to stderr instead of using logger

L47 — LoadMetadata silently ignores YAML parse errors, returns defaults

L48 — simpleHash in alerts.go has 32-bit collision risk

L49 — Close() on auth SessionStore can panic if called twice

L50 — CSS path in setup/handlers.go always fails (dead code)

L51 — isError in setup uses string comparison instead of errors.Is

L52 — Password stored in plaintext in setup-state.json during setup


Standardization / Optimization Opportunities

S01 — Unify env var construction for compose commands

Three different code paths build env vars: stackEnv, composeExecWithEnv, composeExecCustomEnv(env=nil). DOMAIN can be appended twice. Should consolidate into single method.

S02 — Unify humanizeBytes / fmtBytes / dirSizeBytesHuman / dirSizeHuman

At least 4 separate implementations of byte-to-human-readable conversion across packages. Should be one function in util/.

S03 — Standardize atomic file writes

settings.go uses write-to-tmp + rename, but SaveAppConfig, SyncFileBrowserMounts, crypto key file, and setup state do not. All persistent file writes should use the same pattern.

S04 — Standardize HTTP client creation

Many places create &http.Client{Timeout: X} per call. Should have shared clients with sensible defaults.

S05 — Standardize exec.Command with context/timeout

Most exec.Command calls have no timeout. Should use exec.CommandContext with appropriate timeouts consistently.

S06 — Standardize error handling in HTTP handlers

Some handlers return JSON errors, some use http.Error, some ignore write errors. Should be consistent.

S07 — Extract common adapter pattern

stackAdapter, watchdogStackAdapter, driveMigrateStackAdapter in main.go share identical methods. Should use embedded base adapter.

S08 — Consolidate getDirSizeHuman + getDirSizeBytes into single call

Two du invocations for the same directory. Call getDirSizeBytes once and format.

S09 — Template rendering: always buffer before writing

Multiple render() functions write directly to ResponseWriter. All should use bytes.Buffer.

S10 — Add MaxBytesReader middleware

Rather than per-handler body limiting, add a global middleware with sensible default (1MB for JSON, larger for file uploads).


Fix Plan — Ordered by Implementation Phase

Phase 1 — Critical Data Races & Security (P0)

Estimate: ~2 hours

  1. C01 — Fix UpdateStackConfig to use decrypted env for compose
  2. C02 — Add error return to DecryptMap (or at least log)
  3. C03 — Move appCfg.Deployed = false inside lock block
  4. C04 — Implement Stack.DeepCopy(), use in GetStack/GetStacks
  5. C05 — Add per-state mutex to pathProbeState in watchdog
  6. C06 — Protect metrics collector cancel with mutex
  7. C07 — Hold diskJobMu across entire raw mount operation
  8. C08 — Add lock to SetEncryptionKey

Phase 2 — High-Priority Bugs (P1)

Estimate: ~3 hours

  1. H01 — Fix restic lock detection to use stderr
  2. H02 — Filter disconnected drives in activeDrives()
  3. H03 — Change to write lock in MigrateEncryption
  4. H04 — Make SaveAppConfig atomic (write-to-tmp + rename)
  5. H05 — Pass encKey to SaveAppConfig on deploy failure
  6. H07 — Buffer template rendering (web + setup)
  7. H08 — Use sync.Once in Syncer.Stop()
  8. H09 — Fix sync race (set syncing=true before unlock)
  9. H10 — Thread context through Cloudflare HTTP calls
  10. H11 — Deep-copy in GetFullStatus no-cache path
  11. H14 — Fix restic retry context + check unlock error
  12. H15 — Fix crossdrive leaf collision with seen map
  13. H17 — Add nil check for crossDriveRunner
  14. H18 — Use TrimSpace for selftest output
  15. H19 — WaitGroup for stderr goroutine in MigrateDrive
  16. H20 — Guard UUID slice length

Phase 3 — Security Hardening (P1-P2)

Estimate: ~1.5 hours

  1. H06 — Add login rate limiting
  2. H12 — Add MaxBytesReader middleware
  3. H16 — Validate Bearer token in CSRF middleware
  4. H21 — Fix fstab entry matching to use field parsing
  5. M12 — Validate backup paths in RemoveStack
  6. M22 — Clamp memory availability to min 0
  7. M26 — Server-side CSRF for setup wizard

Phase 4 — Medium Priority Fixes (P2)

Estimate: ~3 hours

  1. M01 — Add timeout to Hub connectivity test
  2. M02 — Add storagePaths to initial alert refresh
  3. M03 — Fix DrainPendingEvents to restore on save failure
  4. M04 — Add duplicate check in AddStoragePath
  5. M06 — Use validation cache to skip re-validation
  6. M07 — Stream file copies instead of reading into memory
  7. M08 — Record all drive snapshots in history
  8. M09 — Cache subdomain in Stack struct
  9. M14 — Guard scheduler double-start
  10. M16 — Set "failed" state on compose-up failure
  11. M17 — Add timeout to selfupdate docker commands
  12. M21 — Fix mmcblk partition path
  13. M23 — Remove "manual" from triggerAllCrossBackups
  14. M24 — Fix writeFreshConfig settings path
  15. M25 — Fix SetStep race (saveLocked)
  16. M30 — Single GetStacks() call per handler

Phase 5 — Standardization (S-items)

Estimate: ~2 hours

  1. S01 — Unify compose env construction
  2. S02 — Single humanizeBytes in util/
  3. S03 — Atomic writes everywhere
  4. S04 — Shared HTTP clients
  5. S05 — exec.CommandContext everywhere
  6. S07 — Base adapter embedding
  7. S09 — Template buffer pattern

Phase 6 — Low Priority Cleanup (P3)

Estimate: ~1.5 hours

Remove dead code (L01-L04), fix UTF-8 truncation (L05-L06), address remaining code smells and minor issues.


Total estimated effort: ~13 hours across 6 phases

Phases 1-3 are the most important and should be done first. Phase 4-5 can be done incrementally. Phase 6 is best done opportunistically when touching related files.