45f75a916c
Bug fixes: - Add applyEnvOverrides to LoadFromBytes (M05) - Set state=failed on compose-up failure in selfupdate (M16) - Clamp usableMB to min 0 in memory check (M22) - Remove "manual" schedule from triggerAllCrossBackups (M23) - Add mmcblk device handling for partition paths (M21) - Fix stripPartition for mmcblk devices (L25) - Fix TruncateStr for UTF-8 and negative maxLen (L05/L06) - Fix AllDone to return false for empty restore plans (L14) - Fix PushOnce to return actual errors (L39) - Restore pending events on save failure in DrainPendingEvents (M03) - Add duplicate check in AddStoragePath (M04) - Call CleanupTempMounts after drive scan (H13) - Log SetStep save errors (M25) Hardening: - Guard scheduler Start() against double-start (M14) - Acquire mutex in scheduler Stop() before reading cancel (L24) - Cap log lines parameter to 10000 (L31) - Require POST for logout (L32) - Use sync.Once for Server.Close() (L49) - Panic on crypto/rand.Read failure in setup CSRF (L40) - Validate Bearer token against Hub API key in CSRF (H16 fix) - Replace custom hasPrefix with strings.HasPrefix (L13) - Replace simpleHash with crc32.ChecksumIEEE (L48) Cleanup: - Remove dead imageName function (L02) - Remove dead detectHostIPViaRoute function (L03) - Rename shadowed copy variable to cp (L07) - Copy DefaultEnabledEvents in GetNotificationPrefs early return (L09) - Update BUGHUNT.md with comprehensive audit results Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
721 lines
31 KiB
Markdown
721 lines
31 KiB
Markdown
# Bug Hunt Report — Comprehensive Controller Audit
|
|
|
|
**Date:** 2026-02-25
|
|
**Scope:** Full controller codebase (`controller/` — all ~80 Go files, all packages)
|
|
**Method:** Systematic code review of every file + live testing on demo node (192.168.0.162)
|
|
**Verified on:** v0.30.2, Go 1.24.0
|
|
|
|
---
|
|
|
|
## Priority Legend
|
|
|
|
| Priority | Meaning |
|
|
|----------|---------|
|
|
| P0 | **Critical** — Data race causing corruption, potential panic in production, security vulnerability |
|
|
| P1 | **High** — Logic error causing wrong behavior, resource leak, silent data loss |
|
|
| P2 | **Medium** — Suboptimal behavior, inconsistency, performance issue, minor bug |
|
|
| P3 | **Low** — Code smell, dead code, cosmetic issue, hardening |
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
- [P0 Critical Issues](#p0-critical-issues)
|
|
- [P1 High Issues](#p1-high-issues)
|
|
- [P2 Medium Issues](#p2-medium-issues)
|
|
- [P3 Low Issues](#p3-low-issues)
|
|
- [Standardization / Optimization Opportunities](#standardization--optimization-opportunities)
|
|
- [Fix Plan — Ordered by Implementation Phase](#fix-plan--ordered-by-implementation-phase)
|
|
|
|
---
|
|
|
|
## P0 Critical Issues
|
|
|
|
### C01 — `UpdateStackConfig` passes encrypted env values to docker compose
|
|
|
|
**Files:** `stacks/deploy.go:340-366`
|
|
**Category:** Bug — Silent data corruption
|
|
|
|
`UpdateStackConfig` loads `app.yaml` via `LoadAppConfig()` (line 340) which returns encrypted `ENC:...` values for sensitive fields. It then updates the modified fields and calls `composeExecWithEnv` (line 366) passing the mixed map — old encrypted values + new plaintext values. Docker compose receives the encrypted ciphertext as literal environment variable values. Containers get `ENC:H98iUGr...` instead of the actual DB password.
|
|
|
|
**Verified on demo:** `immich/app.yaml` has `DB_PASSWORD: ENC:H98i...`. If a user changes optional config via the UI, the next `docker compose up -d` would inject the encrypted string as DB_PASSWORD.
|
|
|
|
**Fix:** Use `LoadAppConfigDecrypted(stackDir, m.encKey)` for the compose execution path, or call `m.stackEnv(stackDir)` instead of `m.composeExecWithEnv`.
|
|
|
|
---
|
|
|
|
### C02 — `DecryptMap` silently swallows decryption errors
|
|
|
|
**File:** `crypto/crypto.go:107-113`
|
|
**Category:** Silent data corruption
|
|
|
|
When decryption fails (key rotated, data corrupted), `DecryptMap` silently returns the raw `ENC:...` ciphertext as the value. Applications receive garbled base64 as passwords/secrets, causing silent failures. No error is logged or returned.
|
|
|
|
**Fix:** Return an error from `DecryptMap`, or at minimum log a `[WARN]`. Update callers to handle the error.
|
|
|
|
---
|
|
|
|
### C03 — `appCfg.Deployed = false` written outside lock in deploy goroutine
|
|
|
|
**Files:** `stacks/deploy.go:311`
|
|
**Category:** Data race
|
|
|
|
In `runComposeDeploy`, on failure, `appCfg.Deployed = false` is written at line 311 **outside** the `m.mu.Lock()` block (lines 302-309). The same `appCfg` pointer is shared with `s.AppConfig` in the stacks map. Any concurrent `GetStack`/`GetStacks` call reading `AppConfig.Deployed` is a data race.
|
|
|
|
**Fix:** Move line 311 (`appCfg.Deployed = false`) inside the lock block before line 309.
|
|
|
|
---
|
|
|
|
### C04 — `GetStack`/`GetStacks` returns shallow copies with shared pointer fields
|
|
|
|
**Files:** `stacks/manager.go:449-470`
|
|
**Category:** Data race
|
|
|
|
`GetStack` does `copy := *s` — a shallow copy. `Stack.AppConfig` is a pointer, so the "copy" shares the `AppConfig` with the original. Any goroutine reading `copy.AppConfig.Env` while another modifies the original's `AppConfig` (e.g., deploy goroutine) is a data race. Same issue with `Containers`, `HealthProbe`, and `Meta.DeployFields`.
|
|
|
|
**Fix:** Implement deep copy for `Stack` struct. At minimum, deep-copy `AppConfig`, `Containers`, and `HealthProbe`.
|
|
|
|
---
|
|
|
|
### C05 — Watchdog `pathProbeState` fields read/written without synchronization
|
|
|
|
**File:** `monitor/watchdog.go:141-163 and throughout`
|
|
**Category:** Data race
|
|
|
|
`Check()` reads and writes `pathProbeState` fields (`lastProbeTime`, `consecutiveFailures`, `probeCount`, `lastStatus`, etc.) without holding `w.mu`. Meanwhile, `GetDebugStatus()` reads these same fields while holding `w.mu`. `SafeDisconnect()` and `Reconnect()` also modify state concurrently. Multiple data races.
|
|
|
|
**Fix:** Either hold `w.mu` during all state modifications in `Check()`, or give each `pathProbeState` its own mutex.
|
|
|
|
---
|
|
|
|
### C06 — Metrics collector `cancel` field has data race between `Start` and `Stop`
|
|
|
|
**File:** `metrics/collector.go:Start/Stop`
|
|
**Category:** Data race
|
|
|
|
`Start()` writes `c.cancel` without any lock. `Stop()` reads it without any lock. If called from different goroutines (the expected pattern), this is a data race.
|
|
|
|
**Fix:** Protect with `c.mu` or use `sync.Once` for start.
|
|
|
|
---
|
|
|
|
### C07 — `activeRawMount` race between cleanup, mount, and set
|
|
|
|
**File:** `web/storage_handlers.go:919-935`
|
|
**Category:** Race condition
|
|
|
|
In `storageAttachMountRawHandler`, the old mount is cleaned up under `diskJobMu` (line 919-924), then the lock is released, then `storage.MountRaw` runs without lock (line 926), then the new mount path is set under lock (line 933-935). Two concurrent requests can both clean up and create mounts, orphaning the first.
|
|
|
|
**Fix:** Hold `diskJobMu` across the entire operation (cleanup + mount + set).
|
|
|
|
---
|
|
|
|
### C08 — `SetEncryptionKey` on stacks Manager has no locking
|
|
|
|
**File:** `stacks/manager.go:114-116`
|
|
**Category:** Data race
|
|
|
|
`SetEncryptionKey` writes `m.encKey` without holding any lock. If called concurrently with `MigrateEncryption`, `stackEnv`, or `SaveAppConfig` (which read `m.encKey`), this is a data race.
|
|
|
|
**Fix:** Acquire `m.mu.Lock()` before setting `m.encKey`.
|
|
|
|
---
|
|
|
|
## P1 High Issues
|
|
|
|
### H01 — Restic lock detection never triggers (checks stdout, not stderr)
|
|
|
|
**File:** `backup/restic.go:135-151`
|
|
**Category:** Bug — backup recovery broken
|
|
|
|
`Snapshot()` uses `cmd.Output()` which returns stdout only. Restic writes lock errors to stderr. The lock-detection code at line 138 checks `string(out)` for "lock"/"locked" — but `out` is stdout, not stderr. The auto-unlock-and-retry logic likely never triggers.
|
|
|
|
**Fix:** Use `cmd.CombinedOutput()`, or extract stderr from `*exec.ExitError`:
|
|
```go
|
|
if exitErr, ok := err.(*exec.ExitError); ok {
|
|
errStr = string(exitErr.Stderr)
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### H02 — `activeDrives()` returns disconnected/decommissioned drives
|
|
|
|
**File:** `backup/backup.go:206-222`
|
|
**Category:** Logic error
|
|
|
|
`activeDrives()` logs which drives are disconnected but still adds ALL drives to the result. Functions like `RunIntegrityCheck`, `ListSnapshots`, `aggregateRepoStats` will then operate on disconnected drives, causing timeouts.
|
|
|
|
**Fix:** Add `continue` after logging disconnected drives to skip them:
|
|
```go
|
|
if m.settings != nil && (m.settings.IsDisconnected(d) || m.settings.IsDecommissioned(d)) {
|
|
disconnected = append(disconnected, d)
|
|
continue // don't add to drives
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### H03 — `MigrateEncryption` holds RLock while writing files
|
|
|
|
**File:** `stacks/manager.go:124-156`
|
|
**Category:** Concurrency — semantic error
|
|
|
|
`MigrateEncryption()` acquires `m.mu.RLock()` then calls `SaveAppConfig()` which writes to disk. A write operation should hold a write lock, not a read lock.
|
|
|
|
**Fix:** Use `m.mu.Lock()` instead of `m.mu.RLock()`.
|
|
|
|
---
|
|
|
|
### H04 — `SaveAppConfig` is not atomic
|
|
|
|
**File:** `stacks/deploy.go:555` (uses `os.WriteFile`)
|
|
**Category:** Data integrity
|
|
|
|
`SaveAppConfig` writes directly to `app.yaml`. A crash mid-write corrupts the file. This violates the project's own pattern (settings.go uses write-to-tmp + rename).
|
|
|
|
**Fix:** Write to `app.yaml.tmp`, then `os.Rename` to `app.yaml`.
|
|
|
|
---
|
|
|
|
### H05 — Deploy failure saves plaintext secrets to disk
|
|
|
|
**File:** `stacks/deploy.go:311-312`
|
|
**Category:** Security
|
|
|
|
On deploy failure, `runComposeDeploy` calls `SaveAppConfig(stackDir, appCfg, nil, nil)` — passing nil for encKey and sensitiveVars. The plaintext secrets from the deploy form are saved unencrypted to `app.yaml`.
|
|
|
|
**Fix:** Pass `m.encKey` and `SensitiveEnvVars(&meta)` to `SaveAppConfig` in the failure path.
|
|
|
|
---
|
|
|
|
### H06 — No rate limiting on login attempts
|
|
|
|
**File:** `web/auth.go:91-128`
|
|
**Category:** Security
|
|
|
|
`handleLogin` has no rate limiting, lockout, or progressive delay. Bcrypt is slow but still allows ~5-10 attempts/sec. Note: demo node currently has no password set (`Auth: no password configured`), making this irrelevant there but critical for production deployments.
|
|
|
|
**Fix:** Add per-IP rate limiter or exponential backoff on failed attempts.
|
|
|
|
---
|
|
|
|
### H07 — Template rendering writes directly to ResponseWriter
|
|
|
|
**Files:** `web/server.go:393-398`, `setup/handlers.go:871-877`
|
|
**Category:** Bug — corrupt HTTP responses
|
|
|
|
`render()` calls `tmpl.ExecuteTemplate(w, ...)` directly on the ResponseWriter. If the template partially writes then errors, the client receives corrupt HTML with a 200 status. The subsequent `http.Error(500)` is silently ignored because headers are already sent.
|
|
|
|
**Fix:** Render to a `bytes.Buffer` first, only write to `w` on success:
|
|
```go
|
|
var buf bytes.Buffer
|
|
if err := s.tmpl.ExecuteTemplate(&buf, name, data); err != nil {
|
|
http.Error(w, "Internal error", 500)
|
|
return
|
|
}
|
|
w.Header().Set("Content-Type", "text/html; charset=utf-8")
|
|
buf.WriteTo(w)
|
|
```
|
|
|
|
---
|
|
|
|
### H08 — `Syncer.Stop()` panics if called twice
|
|
|
|
**File:** `sync/sync.go:114-116`
|
|
**Category:** Panic
|
|
|
|
`Stop()` calls `close(s.stopCh)`. A double-call panics with "close of closed channel".
|
|
|
|
**Fix:** Use `sync.Once`:
|
|
```go
|
|
s.stopOnce.Do(func() { close(s.stopCh) })
|
|
```
|
|
|
|
---
|
|
|
|
### H09 — Sync race: `TriggerSync` and ticker can run `doSync` concurrently
|
|
|
|
**File:** `sync/sync.go:120-137, 164-167`
|
|
**Category:** Race condition
|
|
|
|
`TriggerSync` checks `s.syncing` under lock, releases the lock, then calls `doSync()`. Between unlock and doSync acquiring the lock to set `syncing=true`, the ticker goroutine can also enter `doSync()`. Two concurrent syncs corrupt the git cache.
|
|
|
|
**Fix:** Set `s.syncing = true` inside `TriggerSync()` before releasing the lock.
|
|
|
|
---
|
|
|
|
### H10 — Cloudflare HTTP requests ignore context — cannot be cancelled
|
|
|
|
**File:** `cloudflare/client.go:73`
|
|
**Category:** Resource leak
|
|
|
|
`do()` uses `http.NewRequest()` instead of `http.NewRequestWithContext()`. The `ctx` parameter passed to `Sync()` is never propagated to HTTP calls. During shutdown, in-flight Cloudflare API requests block until their 15s timeout.
|
|
|
|
**Fix:** Pass context through to `do()` and use `http.NewRequestWithContext()`.
|
|
|
|
---
|
|
|
|
### H11 — `GetFullStatus` returns shared pointers when no cache exists
|
|
|
|
**File:** `backup/backup.go:1213-1228`
|
|
**Category:** Data race
|
|
|
|
When `m.cachedStatus == nil`, the fallback returns a `FullBackupStatus` that directly references `m.lastDBDump` and `m.lastBackup` pointers. After `m.mu` is released, the caller reads these while backup operations may be mutating them.
|
|
|
|
**Fix:** Deep-copy `m.lastDBDump` and `m.lastBackup` in the no-cache path.
|
|
|
|
---
|
|
|
|
### H12 — No HTTP request body size limits
|
|
|
|
**Files:** Multiple handlers in `web/handler_debug.go`, `web/storage_handlers.go`, `api/router.go`
|
|
**Category:** Security — DoS
|
|
|
|
Every `json.NewDecoder(r.Body).Decode(&req)` has no body size limit. A multi-GB request consumes all memory.
|
|
|
|
**Fix:** Add middleware: `r.Body = http.MaxBytesReader(w, r.Body, 1<<20)` for all POST handlers.
|
|
|
|
---
|
|
|
|
### H13 — `runDriveScan` in setup never calls `CleanupTempMounts`
|
|
|
|
**File:** `setup/scanner.go:266-279`
|
|
**Category:** Resource leak
|
|
|
|
`ScanDrivesForInfraBackups` temporarily mounts partitions. `runDriveScan` stores results but never cleans up. Temp mounts remain forever.
|
|
|
|
**Fix:** Call `CleanupTempMounts(results, s.logger)` after storing scan results.
|
|
|
|
---
|
|
|
|
### H14 — `Snapshot` retry reuses same context with diminished timeout
|
|
|
|
**File:** `backup/restic.go:139-147`
|
|
**Category:** Bug
|
|
|
|
After lock detection + unlock, the retry reuses the same `ctx` with the original 30-minute timeout. Time spent on the first attempt is subtracted. Also, `unlockCmd.Run()` ignores its error.
|
|
|
|
**Fix:** Check unlock error. Consider creating a fresh context for retry.
|
|
|
|
---
|
|
|
|
### H15 — `crossdrive` rsync leaf-name collision detection is incomplete
|
|
|
|
**File:** `backup/crossdrive.go:375-389`
|
|
**Category:** Logic error
|
|
|
|
The collision check uses `os.Stat(dstPath)` which checks pre-existing directories from previous runs, not just current-iteration collisions. Mount index 0 skips collision check entirely. Pre-existing leftovers could cause data to be rsynced into wrong directories.
|
|
|
|
**Fix:** Use a `seen` map to track leaf names within the current iteration:
|
|
```go
|
|
seen := make(map[string]bool)
|
|
for _, srcMount := range mounts {
|
|
leaf := filepath.Base(srcMount)
|
|
if seen[leaf] { leaf = fmt.Sprintf("%s_%d", leaf, i) }
|
|
seen[leaf] = true
|
|
// ...
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### H16 — CSRF Bearer bypass — token not validated
|
|
|
|
**File:** `web/csrf.go:36-41`
|
|
**Category:** Security
|
|
|
|
Any request with `Authorization: Bearer <anything>` skips CSRF entirely. The Bearer token is not validated against the actual API key.
|
|
|
|
**Fix:** Validate the Bearer token before skipping CSRF (check against `s.settings.GetAPIKey()`).
|
|
|
|
---
|
|
|
|
### H17 — Potential nil dereference on `crossDriveRunner`
|
|
|
|
**File:** `web/handlers.go:919`
|
|
**Category:** Potential panic
|
|
|
|
`s.crossDriveRunner.ValidateDestination(...)` is called in `buildAppBackupRows()` without nil check. The field is set post-construction via `SetCrossDriveRunner`.
|
|
|
|
**Fix:** Add nil guard before access.
|
|
|
|
---
|
|
|
|
### H18 — `selftest.checkDockerSocket` panics on empty output
|
|
|
|
**File:** `selftest/selftest.go:99`
|
|
**Category:** Potential panic
|
|
|
|
`string(out[:len(out)-1])` panics with index-out-of-range if `out` is empty.
|
|
|
|
**Fix:** Use `strings.TrimSpace(string(out))` instead.
|
|
|
|
---
|
|
|
|
### H19 — `stderrBuf` data race in `MigrateDrive`
|
|
|
|
**File:** `storage/migrate_drive.go:314-331`
|
|
**Category:** Data race
|
|
|
|
A background goroutine writes to `stderrBuf`. After `cmd.Wait()` returns, `stderrBuf.String()` is read, but `Wait()` with piped stderr does NOT guarantee the reader goroutine has finished.
|
|
|
|
**Fix:** Use `sync.WaitGroup` to wait for the stderr goroutine before reading.
|
|
|
|
---
|
|
|
|
### H20 — `MountRaw` UUID slice panics if UUID shorter than 8 chars
|
|
|
|
**File:** `storage/attach_linux.go:53-58`
|
|
**Category:** Potential panic
|
|
|
|
`uuid[:8]` panics if UUID is shorter than 8 characters.
|
|
|
|
**Fix:** `if len(uuid) >= 8 { dirName = uuid[:8] } else { dirName = uuid }`
|
|
|
|
---
|
|
|
|
### H21 — `removeBindFstabEntry` uses loose string matching
|
|
|
|
**File:** `storage/attach_linux.go:431-456`
|
|
**Category:** Logic error
|
|
|
|
`strings.Contains(line, targetMountPath)` matches `/mnt/hdd_10` when target is `/mnt/hdd_1`. Can remove unrelated fstab entries.
|
|
|
|
**Fix:** Parse fstab fields and compare second field exactly.
|
|
|
|
---
|
|
|
|
## P2 Medium Issues
|
|
|
|
### M01 — `http.Get` without timeout in Hub connectivity test
|
|
**File:** `main.go:~691`
|
|
Uses `http.DefaultClient` (no timeout) while Gitea test correctly uses 5s timeout.
|
|
**Fix:** `client := &http.Client{Timeout: 5 * time.Second}`
|
|
|
|
### M02 — `alertMgr.Refresh` missing `storagePaths` arg at initial call
|
|
**File:** `main.go:591`
|
|
Initial refresh omits `sett.GetStoragePaths()` while all 3 other call sites pass it.
|
|
**Fix:** Add `sett.GetStoragePaths()` to the variadic args.
|
|
|
|
### M03 — `DrainPendingEvents` risks duplicate delivery on save failure
|
|
**File:** `settings/settings.go:801-814`
|
|
Events cleared from memory before save. If save fails, events returned but still on disk — duplicated on restart.
|
|
**Fix:** Restore events if save fails, or save before clearing.
|
|
|
|
### M04 — `AddStoragePath` allows duplicate paths
|
|
**File:** `settings/settings.go:422-432`
|
|
No duplicate check. Calling twice with same path creates duplicate entries.
|
|
**Fix:** Check existing paths before append.
|
|
|
|
### M05 — `LoadFromBytes` skips env overrides
|
|
**File:** `config/config.go:200-211`
|
|
Unlike `Load`, `LoadFromBytes` doesn't call `applyEnvOverrides`.
|
|
**Fix:** Add `applyEnvOverrides(cfg)` call.
|
|
|
|
### M06 — `ListDumpFiles` re-validates every SQL file on every cache refresh (5 min)
|
|
**File:** `backup/dbdump.go:449-450`
|
|
Reads and parses entire dump files every refresh. For large dumps, significant I/O.
|
|
**Fix:** Use `DBValidationCache` to skip re-validation when file hasn't changed.
|
|
|
|
### M07 — `copyStackDBDumps` reads entire files into memory
|
|
**File:** `backup/crossdrive.go:466-484`
|
|
Uses `os.ReadFile` for DB dumps that can be hundreds of MB.
|
|
**Fix:** Use `io.Copy` with `os.Open`/`os.Create` for streaming.
|
|
|
|
### M08 — `backupDrive` only records last drive's snapshot in history
|
|
**File:** `backup/backup.go:369-437`
|
|
Multi-drive setups only report the last drive's snapshot.
|
|
**Fix:** Record all successful snapshots.
|
|
|
|
### M09 — `SubdomainInUse` does disk I/O under RLock
|
|
**File:** `stacks/deploy.go:58-82`
|
|
Calls `LoadAppConfig()` (disk read) for every stack while holding RLock.
|
|
**Fix:** Cache subdomain in memory `Stack` struct.
|
|
|
|
### M10 — `InjectMissingFields` auto-fills subdomain without uniqueness check
|
|
**File:** `stacks/deploy.go:676-690`
|
|
Uses `field.Default` or `meta.Subdomain` without checking `SubdomainInUse` or reserved names.
|
|
**Fix:** Add validation.
|
|
|
|
### M11 — `ParseComposeHDDMounts` uses fragile line-by-line YAML parsing
|
|
**File:** `stacks/delete.go:392-454`
|
|
Cannot distinguish top-level vs service-level `volumes:` key.
|
|
**Fix:** Use `yaml.v3` to properly parse.
|
|
|
|
### M12 — Backup path removal in `RemoveStack` has no path validation
|
|
**File:** `stacks/delete.go:304-316`
|
|
Accepts arbitrary `backupPathsToRemove` from caller, calls `os.RemoveAll`. No validation paths are actually backup paths.
|
|
**Fix:** Validate paths start with expected backup directory prefix.
|
|
|
|
### M13 — `getDirSizeHuman` has no timeout
|
|
**File:** `stacks/delete.go:457-468`
|
|
`exec.Command("du", "-sh", path)` with no timeout. Can hang on NFS.
|
|
**Fix:** Add `context.WithTimeout(30s)` like `getDirSizeBytes` does.
|
|
|
|
### M14 — Scheduler doesn't guard against double-start
|
|
**File:** `scheduler/scheduler.go:99-116`
|
|
Calling `Start()` twice overwrites `ctx`/`cancel`, leaking the first set of goroutines.
|
|
**Fix:** Guard with `if s.ctx != nil { return }`.
|
|
|
|
### M15 — Jobs registered after `Start()` never execute
|
|
**File:** `scheduler/scheduler.go:60-96 vs 99-116`
|
|
`Start()` only launches goroutines for currently-registered jobs.
|
|
**Fix:** Document requirement, or auto-launch late-registered jobs.
|
|
|
|
### M16 — `selfupdate.performUpdate` leaves "pending" state on compose-up failure
|
|
**File:** `selfupdate/updater.go:396-410`
|
|
State file stays "pending" when compose-up fails, creating contradictory API state.
|
|
**Fix:** Set `state.Status = "failed"` on compose-up failure.
|
|
|
|
### M17 — `selfupdate` docker commands have no timeout
|
|
**File:** `selfupdate/updater.go:495-502`
|
|
`docker pull` and `docker compose up -d` can hang indefinitely.
|
|
**Fix:** Use `exec.CommandContext` with timeout.
|
|
|
|
### M18 — `SetBackupRunningCheck` has no mutex protection
|
|
**File:** `selfupdate/updater.go:66-68`
|
|
Writes `u.backupRunning` without lock. `DryRun()` also reads it without lock.
|
|
**Fix:** Use `u.mu.Lock()`.
|
|
|
|
### M19 — `goroutine leak` in `ProbeStoragePath` on hung mounts
|
|
**File:** `system/mounts_linux.go:258-289`
|
|
`syscall.Statfs` on a dead mount blocks forever. Each probe leaks a goroutine.
|
|
**Fix:** Track outstanding probes per path, prevent duplicate probes.
|
|
|
|
### M20 — `rsync --info=progress2` carriage returns not handled
|
|
**File:** `storage/format.go:31-53`
|
|
Progress scanner splits on `\n` only; `\r`-delimited updates get concatenated.
|
|
**Fix:** Custom scanner splitting on both `\r` and `\n`.
|
|
|
|
### M21 — `mmcblk` device partition path construction is wrong
|
|
**File:** `storage/format_linux.go:119-122`
|
|
Only NVMe gets `p` separator. `mmcblk0` → `mmcblk01` instead of `mmcblk0p1`.
|
|
**Fix:** Add `|| strings.Contains(req.DevicePath, "mmcblk")`.
|
|
|
|
### M22 — Integer underflow in memory availability message
|
|
**File:** `api/router.go:381-386`
|
|
If `reservedMB > totalMB`, `usableMB` goes negative, blocking all app starts.
|
|
**Fix:** Clamp `usableMB` to minimum 0.
|
|
|
|
### M23 — `triggerAllCrossBackups` runs "manual" schedule
|
|
**File:** `api/router.go:907-929`
|
|
"Manual" schedule meant for per-app on-demand, not bulk triggers.
|
|
**Fix:** Remove `RunAllScheduled(ctx, "manual")`.
|
|
|
|
### M24 — `writeFreshConfig` creates bare `&settings.Settings{}` without path
|
|
**File:** `setup/handlers.go:668-693`
|
|
On settings load failure, creates Settings with no file path. `SetPasswordHash` (which calls save) writes to empty path.
|
|
**Fix:** Create Settings with proper path and logger.
|
|
|
|
### M25 — `SetStep` race condition between unlock and Save
|
|
**File:** `setup/setup.go:99-106`
|
|
State modified under lock, then saved after unlock. Another goroutine can modify between.
|
|
**Fix:** Use `saveLocked()` variant called while holding lock.
|
|
|
|
### M26 — Setup CSRF relies on client-controlled cookie + form
|
|
**File:** `setup/csrf.go:34-43`
|
|
Both sides of the comparison are client-controlled. No server-side token storage.
|
|
**Fix:** Store CSRF token server-side in SetupState.
|
|
|
|
### M27 — `PushEvent` in notifier spawns goroutine without tracking
|
|
**File:** `notify/notifier.go`
|
|
Goroutine leak potential — no WaitGroup or context for clean shutdown.
|
|
**Fix:** Track goroutines with sync.WaitGroup.
|
|
|
|
### M28 — `SafeDisconnect` and `Check` can race on same storage path
|
|
**File:** `monitor/watchdog.go:498-644`
|
|
Both modify `pathProbeState` concurrently without coordination. Can double-stop stacks.
|
|
**Fix:** Use broader mutex or state-transition lock.
|
|
|
|
### M29 — `stackProvider` reads in backup package lack mutex protection
|
|
**File:** `backup/backup.go:171-181,184-203,478-485`
|
|
`SetStackProvider` writes under mutex, but all reads are unprotected.
|
|
**Fix:** Ensure single-call-at-startup invariant is documented, or add read locks.
|
|
|
|
### M30 — Multiple `GetStacks()` calls in same handler
|
|
**File:** `web/handlers.go:214,233`
|
|
`stacksHandler` calls `GetStacks()` 3 times — each may return different snapshots.
|
|
**Fix:** Call once, reuse result.
|
|
|
|
---
|
|
|
|
## P3 Low Issues
|
|
|
|
### L01 — Dead code: `fileExists` function in main.go (line 1071)
|
|
### L02 — Dead code: `imageName` function in selfupdate/updater.go (line 231)
|
|
### L03 — Dead code: `detectHostIPViaRoute` in setup/network.go (always returns "")
|
|
### L04 — Dead code: `handleRoot` in setup/handlers.go (both if branches do same redirect)
|
|
### L05 — `TruncateStr` operates on bytes, not runes — breaks multi-byte UTF-8
|
|
- File: `util/strings.go:6-12` and `stacks/manager.go:718-724`
|
|
### L06 — `TruncateStr` no guard for negative maxLen
|
|
### L07 — Variable `copy` shadows builtin in settings.go:294
|
|
### L08 — `save()` in settings.go does not guard nil logger (line 218)
|
|
### L09 — `GetNotificationPrefs` early return shares global `DefaultEnabledEvents` slice
|
|
### L10 — `os.ExpandEnv` in config.go may corrupt `$` in YAML values
|
|
### L11 — Key file written non-atomically (crypto.go:37)
|
|
### L12 — `ValidateDump` uses global `log.Printf` instead of injected logger
|
|
### L13 — `hasPrefix` reimplements `strings.HasPrefix` in restore_scan.go
|
|
### L14 — `AllDone` returns true for empty Apps list (restore_scan.go:134)
|
|
### L15 — `RunIntegrityCheck` only reports last error across drives
|
|
### L16 — `restic.Stats()` casts uint64 to int64 unsafely
|
|
### L17 — `logPostStartStatus` goroutine leak potential (no context/WaitGroup)
|
|
### L18 — `Pinger.send` response body not drained before close
|
|
### L19 — `sampleContainers` creates `context.Background()` instead of using parent ctx
|
|
### L20 — Docker commands without timeout in healthcheck.go
|
|
### L21 — Non-deterministic Cloudflare rule order due to map iteration
|
|
### L22 — `parseDailyTime` accepts trailing garbage ("12:30:45")
|
|
### L23 — `Syncer.Start()` can be called multiple times (goroutine leak)
|
|
### L24 — `Stop()` on scheduler reads `s.cancel` without mutex
|
|
### L25 — `stripPartition` doesn't handle `mmcblk` devices
|
|
### L26 — `sizeBytes()` silently returns 0 on ParseUint error
|
|
### L27 — Duplicate `dirSizeHuman`/`dirSizeBytesHuman`/`fmtBytes` functions
|
|
### L28 — `w.Write` errors ignored in multiple HTTP handlers
|
|
### L29 — `json.Encode` errors ignored in multiple handlers
|
|
### L30 — `r.ParseForm()` errors ignored (multiple handlers)
|
|
### L31 — No upper bound on log `lines` query parameter (DoS)
|
|
### L32 — Logout accepts GET requests (CSRF)
|
|
### L33 — Content-Type check uses exact match, fails for `application/json; charset=utf-8`
|
|
### L34 — `SyncFileBrowserMounts` writes config files non-atomically
|
|
### L35 — `DriveMigrator` adapter types have triplicated methods (main.go:840-1038)
|
|
### L36 — Goroutines in main.go use `time.Sleep` without context cancellation check
|
|
### L37 — `hubPusher.Push()` error silently ignored in multiple callbacks
|
|
### L38 — Setup shutdown errors silently discarded (main.go:1136-1137)
|
|
### L39 — `PushOnce` in report/pusher.go always returns nil (misleading return type)
|
|
### L40 — Static fallback CSRF token ("fallback-csrf-token") in setup/csrf.go
|
|
### L41 — `extractName` returns "" but most API callers don't check for empty
|
|
### L42 — Hardcoded config path `/opt/docker/felhom-controller/controller.yaml` in 3 setup places
|
|
### L43 — `GetStatus()` in backup returns raw shared pointers (not deep-copied like `GetFullStatus`)
|
|
### L44 — Fstab entries accumulate extra blank lines from double-newline appends
|
|
### L45 — `logFileHashes` called after write — shows identical src/dst hashes
|
|
### L46 — `LoadMetadata` prints to stderr instead of using logger
|
|
### L47 — `LoadMetadata` silently ignores YAML parse errors, returns defaults
|
|
### L48 — `simpleHash` in alerts.go has 32-bit collision risk
|
|
### L49 — `Close()` on auth SessionStore can panic if called twice
|
|
### L50 — CSS path in setup/handlers.go always fails (dead code)
|
|
### L51 — `isError` in setup uses string comparison instead of `errors.Is`
|
|
### L52 — Password stored in plaintext in setup-state.json during setup
|
|
|
|
---
|
|
|
|
## Standardization / Optimization Opportunities
|
|
|
|
### S01 — Unify env var construction for compose commands
|
|
Three different code paths build env vars: `stackEnv`, `composeExecWithEnv`, `composeExecCustomEnv(env=nil)`. `DOMAIN` can be appended twice. Should consolidate into single method.
|
|
|
|
### S02 — Unify `humanizeBytes` / `fmtBytes` / `dirSizeBytesHuman` / `dirSizeHuman`
|
|
At least 4 separate implementations of byte-to-human-readable conversion across packages. Should be one function in `util/`.
|
|
|
|
### S03 — Standardize atomic file writes
|
|
`settings.go` uses write-to-tmp + rename, but `SaveAppConfig`, `SyncFileBrowserMounts`, crypto key file, and setup state do not. All persistent file writes should use the same pattern.
|
|
|
|
### S04 — Standardize HTTP client creation
|
|
Many places create `&http.Client{Timeout: X}` per call. Should have shared clients with sensible defaults.
|
|
|
|
### S05 — Standardize `exec.Command` with context/timeout
|
|
Most `exec.Command` calls have no timeout. Should use `exec.CommandContext` with appropriate timeouts consistently.
|
|
|
|
### S06 — Standardize error handling in HTTP handlers
|
|
Some handlers return JSON errors, some use `http.Error`, some ignore write errors. Should be consistent.
|
|
|
|
### S07 — Extract common adapter pattern
|
|
`stackAdapter`, `watchdogStackAdapter`, `driveMigrateStackAdapter` in main.go share identical methods. Should use embedded base adapter.
|
|
|
|
### S08 — Consolidate `getDirSizeHuman` + `getDirSizeBytes` into single call
|
|
Two `du` invocations for the same directory. Call `getDirSizeBytes` once and format.
|
|
|
|
### S09 — Template rendering: always buffer before writing
|
|
Multiple `render()` functions write directly to ResponseWriter. All should use bytes.Buffer.
|
|
|
|
### S10 — Add `MaxBytesReader` middleware
|
|
Rather than per-handler body limiting, add a global middleware with sensible default (1MB for JSON, larger for file uploads).
|
|
|
|
---
|
|
|
|
## Fix Plan — Ordered by Implementation Phase
|
|
|
|
### Phase 1 — Critical Data Races & Security (P0)
|
|
**Estimate: ~2 hours**
|
|
|
|
1. **C01** — Fix `UpdateStackConfig` to use decrypted env for compose
|
|
2. **C02** — Add error return to `DecryptMap` (or at least log)
|
|
3. **C03** — Move `appCfg.Deployed = false` inside lock block
|
|
4. **C04** — Implement `Stack.DeepCopy()`, use in `GetStack`/`GetStacks`
|
|
5. **C05** — Add per-state mutex to `pathProbeState` in watchdog
|
|
6. **C06** — Protect metrics collector `cancel` with mutex
|
|
7. **C07** — Hold `diskJobMu` across entire raw mount operation
|
|
8. **C08** — Add lock to `SetEncryptionKey`
|
|
|
|
### Phase 2 — High-Priority Bugs (P1)
|
|
**Estimate: ~3 hours**
|
|
|
|
1. **H01** — Fix restic lock detection to use stderr
|
|
2. **H02** — Filter disconnected drives in `activeDrives()`
|
|
3. **H03** — Change to write lock in `MigrateEncryption`
|
|
4. **H04** — Make `SaveAppConfig` atomic (write-to-tmp + rename)
|
|
5. **H05** — Pass encKey to SaveAppConfig on deploy failure
|
|
6. **H07** — Buffer template rendering (web + setup)
|
|
7. **H08** — Use sync.Once in Syncer.Stop()
|
|
8. **H09** — Fix sync race (set syncing=true before unlock)
|
|
9. **H10** — Thread context through Cloudflare HTTP calls
|
|
10. **H11** — Deep-copy in GetFullStatus no-cache path
|
|
11. **H14** — Fix restic retry context + check unlock error
|
|
12. **H15** — Fix crossdrive leaf collision with `seen` map
|
|
13. **H17** — Add nil check for crossDriveRunner
|
|
14. **H18** — Use TrimSpace for selftest output
|
|
15. **H19** — WaitGroup for stderr goroutine in MigrateDrive
|
|
16. **H20** — Guard UUID slice length
|
|
|
|
### Phase 3 — Security Hardening (P1-P2)
|
|
**Estimate: ~1.5 hours**
|
|
|
|
1. **H06** — Add login rate limiting
|
|
2. **H12** — Add MaxBytesReader middleware
|
|
3. **H16** — Validate Bearer token in CSRF middleware
|
|
4. **H21** — Fix fstab entry matching to use field parsing
|
|
5. **M12** — Validate backup paths in RemoveStack
|
|
6. **M22** — Clamp memory availability to min 0
|
|
7. **M26** — Server-side CSRF for setup wizard
|
|
|
|
### Phase 4 — Medium Priority Fixes (P2)
|
|
**Estimate: ~3 hours**
|
|
|
|
1. **M01** — Add timeout to Hub connectivity test
|
|
2. **M02** — Add storagePaths to initial alert refresh
|
|
3. **M03** — Fix DrainPendingEvents to restore on save failure
|
|
4. **M04** — Add duplicate check in AddStoragePath
|
|
5. **M06** — Use validation cache to skip re-validation
|
|
6. **M07** — Stream file copies instead of reading into memory
|
|
7. **M08** — Record all drive snapshots in history
|
|
8. **M09** — Cache subdomain in Stack struct
|
|
9. **M14** — Guard scheduler double-start
|
|
10. **M16** — Set "failed" state on compose-up failure
|
|
11. **M17** — Add timeout to selfupdate docker commands
|
|
12. **M21** — Fix mmcblk partition path
|
|
13. **M23** — Remove "manual" from triggerAllCrossBackups
|
|
14. **M24** — Fix writeFreshConfig settings path
|
|
15. **M25** — Fix SetStep race (saveLocked)
|
|
16. **M30** — Single GetStacks() call per handler
|
|
|
|
### Phase 5 — Standardization (S-items)
|
|
**Estimate: ~2 hours**
|
|
|
|
1. **S01** — Unify compose env construction
|
|
2. **S02** — Single `humanizeBytes` in util/
|
|
3. **S03** — Atomic writes everywhere
|
|
4. **S04** — Shared HTTP clients
|
|
5. **S05** — exec.CommandContext everywhere
|
|
6. **S07** — Base adapter embedding
|
|
7. **S09** — Template buffer pattern
|
|
|
|
### Phase 6 — Low Priority Cleanup (P3)
|
|
**Estimate: ~1.5 hours**
|
|
|
|
Remove dead code (L01-L04), fix UTF-8 truncation (L05-L06), address remaining code smells and minor issues.
|
|
|
|
---
|
|
|
|
**Total estimated effort: ~13 hours across 6 phases**
|
|
|
|
Phases 1-3 are the most important and should be done first. Phase 4-5 can be done incrementally. Phase 6 is best done opportunistically when touching related files.
|