# Bug Hunt Report — Comprehensive Controller Audit **Date:** 2026-02-25 **Scope:** Full controller codebase (`controller/` — all ~80 Go files, all packages) **Method:** Systematic code review of every file + live testing on demo node (192.168.0.162) **Controller version:** v0.30.3 --- ## Executive Summary Reviewed all controller Go source files across 6 parallel deep-dive sessions covering: main/settings/config, backup (12 files), stacks (5 files), web/API/router (14 files), scheduler/monitor/metrics (10 files), and cloudflare/assets/storage/sync/setup/notify/report/recovery/selftest/crypto/util (32 files). **Live testing result:** Demo node is healthy — zero errors, zero warnings, 18 MiB RAM, 0.11% CPU, all subsystems operational. **Code review found 78 issues total:** | Severity | Count | |----------|-------| | CRITICAL | 3 | | HIGH | 12 | | MEDIUM | 26 | | LOW | 37 | --- ## Table of Contents - [CRITICAL Issues](#critical-issues) - [HIGH Issues](#high-issues) - [MEDIUM Issues](#medium-issues) - [LOW Issues](#low-issues) - [Fix Plan](#fix-plan) --- ## CRITICAL Issues ### C1: Watchdog mutex unlock/relock pattern — panic-unsafe **File:** `internal/monitor/watchdog.go:234-238` **Category:** concurrency In `handleConnectedProbe()`, the mutex is unlocked before calling `handleDisconnect`, then re-locked to satisfy a deferred unlock: ```go state.mu.Unlock() // line 235 w.handleDisconnect(sp, state, result) // line 236 state.mu.Lock() // line 237 — re-acquire for deferred Unlock ``` If `handleDisconnect` panics, line 237 never executes, but the `defer state.mu.Unlock()` still fires — unlocking an already-unlocked mutex causes a **double-panic** (`sync: unlock of unlocked mutex`), crashing the watchdog goroutine entirely. **Fix:** Refactor to update state inside the lock, then call `handleDisconnect` outside with an early return instead of re-locking. Or wrap `handleDisconnect` in a recover block. --- ### C2: SetGeoAppOverride nil pointer dereference **File:** `internal/settings/settings.go:885` **Category:** bug `SetGeoAppOverride(stackName string, override *GeoAppOverride)` does not check for `override == nil` before accessing its fields. Callers who pass nil to "clear" an override would crash. **Fix:** Add nil guard: `if override == nil { delete map entry; return }`. --- ### C3: SSD-only apps skip DB dump restoration during DR **File:** `internal/backup/restore_app_linux.go:156-161` **Category:** logic/bug `restoreDBDumps` returns `nil` early if `app.HDDPath == ""`. SSD-only apps (no HDD) have DB dumps under `systemDataPath`, but this path is never considered. During disaster recovery, SSD-only apps with databases (e.g., Mealie, Vikunja) would **lose their DB dump backups**. **Fix:** Fall back to `systemDataPath` when `HDDPath` is empty, mirroring the logic in `GetAppDrivePath()`. --- ## HIGH Issues ### H1: Double deploy race condition (TOCTOU) **File:** `internal/stacks/deploy.go:110-285` **Category:** concurrency `DeployStack` reads the stack via `GetStack` (RLock, releases), checks `existing.Deployed == false` on disk, then 169 lines later acquires the write lock to set `Deploying = true`. Two concurrent deploy calls for the same stack can both pass the check and launch parallel `docker compose up -d` goroutines. **Fix:** Atomically check-and-set `Deploying` under a single lock acquisition: ```go m.mu.Lock() if s.Deploying { m.mu.Unlock(); return error } s.Deploying = true m.mu.Unlock() ``` --- ### H2: Delete/Remove during active deploy **File:** `internal/stacks/delete.go:73, 226` **Category:** concurrency Neither `DeleteStack` nor `RemoveStack` checks `stack.Deploying`. A user could delete a stack while an async deploy goroutine is running, causing race between `docker compose down` and `docker compose up`, and the deploy goroutine would later modify a deleted stack. **Fix:** Add `if stack.Deploying { return error }` checks in both functions. --- ### H3: ScanStacks can overwrite Deployed flag during async deploy **File:** `internal/stacks/manager.go:232-248` **Category:** concurrency `ScanStacks` re-reads `app.yaml` from disk and overwrites the in-memory `Deployed` flag. If an async deploy fails and `runComposeDeploy` sets `Deployed=false` in memory, but disk still has the old state, `ScanStacks` could re-set `Deployed=true` creating a ghost deployment. **Fix:** Skip overwriting `Deployed` and `AppConfig` when `existing.Deploying == true`. --- ### H4: Shared `appCfg` pointer between main thread and async deploy goroutine **File:** `internal/stacks/deploy.go:283, 310` **Category:** concurrency The same `appCfg` pointer is stored in `s.AppConfig` (line 283) and passed to the async `runComposeDeploy` goroutine. The goroutine later mutates it (line 310: `appCfg.Deployed = false`). While the mutation happens under `m.mu.Lock`, readers who already got the pointer reference see mutation. **Fix:** Clone `appCfg` before passing to the goroutine. --- ### H5: `SyncFileBrowserMounts` has no concurrency guard **File:** `internal/web/handlers.go:1543-1589` **Category:** concurrency Called from 7 locations (4 via `go` goroutines). Two concurrent storage operations could trigger parallel file writes to `docker-compose.yml` and `config.yaml`, and concurrent `docker compose` restarts causing file corruption or Docker conflicts. **Fix:** Add a `sync.Mutex` for the entire SyncFileBrowserMounts operation, or use a debounced channel to coalesce concurrent calls. --- ### H6: PushEvent never records to event history **File:** `internal/notify/notifier.go:181-218` **Category:** bug `PushEvent()` fires notifications in a goroutine but never calls `recordHistory()`. Only `PushTestEventSync()` records. The debug event history page is blind to all real production events. **Fix:** Add `n.recordHistory(...)` calls in the PushEvent goroutine on both success and failure paths. --- ### H7: PushOnce returns nil for non-2xx HTTP responses **File:** `internal/report/pusher.go:194-224` **Category:** bug `PushOnce()` silently returns `nil` for 4xx/5xx responses. Callers believe the push succeeded. Compare with `Push()` and `PushInfraBackup()` which properly handle non-2xx. **Fix:** Return `fmt.Errorf("hub push-once: HTTP %d", resp.StatusCode)` for non-2xx. --- ### H8: tmpFile not closed/synced before rename in DB dump **File:** `internal/backup/dbdump.go:231-272` **Category:** data integrity `DumpOne` calls `os.Rename(tmpPath, finalPath)` while the file handle is still open (defer closes it later). Data may not be flushed to disk before rename. A crash after rename could leave a truncated/empty dump file. **Fix:** Close `tmpFile` explicitly (checking the error) before `os.Rename`. --- ### H9: Restic retry uses possibly-expired context **File:** `internal/backup/restic.go:142-156` **Category:** logic When a stale lock is detected, `Snapshot` calls `unlock` then retries with the same context (30-min timeout). If the first attempt consumed most of the timeout, the retry gets very little time and fails with a confusing timeout error. **Fix:** Create a fresh context with its own timeout for the retry. --- ### H10: SaveAppConfig silently stores secrets in plaintext on encryption failure **File:** `internal/stacks/deploy.go:542-549` **Category:** security When `crypto.Encrypt()` fails, the code silently falls through and saves the value unencrypted. No warning is logged. Passwords could be stored in plaintext on disk without any indication. **Fix:** Log the encryption failure, or return an error to prevent saving unencrypted secrets. --- ### H11: `settingsCrossBackupHandler` missing destination path validation **File:** `internal/web/handlers.go:947-994` **Category:** security The web form handler for cross-drive backup configuration does NOT validate the destination path against registered storage paths. The API handler (`router.go:823-837`) has explicit validation (C9 comment). A crafted form submission could set an arbitrary destination path. **Fix:** Add the same validation as the API handler: check `destPath` against `GetStoragePaths()`. --- ### H12: deepCopyStack incomplete — OptionalConfig and HealthCheck not deep-copied **File:** `internal/stacks/manager.go:474-517` **Category:** concurrency `deepCopyStack` copies `DeployFields` but not `Meta.OptionalConfig` (slice of slices) or `Meta.HealthCheck` (pointer). Mutations on the returned copy's OptionalConfig or HealthCheck would affect the original in the map. **Fix:** Deep-copy `OptionalConfig` including nested `Fields` slices, and clone `HealthCheck` pointer. --- ## MEDIUM Issues ### M1: Non-constant-time API key comparison **File:** `cmd/controller/main.go:820` **Category:** security API key comparison uses `==` instead of `subtle.ConstantTimeCompare`, making it vulnerable to timing attacks. **Fix:** Use `crypto/subtle.ConstantTimeCompare`. --- ### M2: `stackProvider` read without mutex protection **File:** `internal/backup/backup.go:172-176, 185-189, 481-482` **Category:** concurrency `SetStackProvider` writes under `m.mu`, but all reads of `m.stackProvider` happen without any lock. This is a Go data race per the memory model. In practice it's set once at startup, but the race detector would flag it. **Fix:** Use `sync.Once` or `atomic.Value` for the provider. --- ### M3: `DrainPendingEvents` silently loses events on save failure **File:** `internal/settings/settings.go:808-823` **Category:** logic If `s.save()` fails, events are drained from memory but never persisted. On restart, the events are lost. **Fix:** Only drain events from memory after successful save. --- ### M4: `SubdomainInUse` performs disk I/O under RLock **File:** `internal/stacks/deploy.go:58-82` **Category:** concurrency/performance Reads `app.yaml` from disk via `LoadAppConfig` for every deployed stack while holding `m.mu.RLock()`. Slow I/O blocks all writers. **Fix:** Collect stack dirs under the lock, release it, then perform disk I/O. --- ### M5: `MigrateEncryption` reads `encKey` without lock **File:** `internal/stacks/manager.go:123` **Category:** concurrency Checks `m.encKey == nil` before acquiring `m.mu.Lock()`. `SetEncryptionKey` writes `encKey` under the lock, creating a data race. **Fix:** Move the nil check inside the lock. --- ### M6: `MigrateEncryption` holds lock during all disk I/O **File:** `internal/stacks/manager.go:122-162` **Category:** concurrency/performance Holds `m.mu.Lock()` while performing `LoadAppConfig`, `LoadMetadata`, and `SaveAppConfig` for every stack. Blocks all operations during migration. **Fix:** Collect stacks needing migration under lock, then do I/O outside. (Acceptable if only runs at startup before scheduler.) --- ### M7: `executeAllRestores` accesses `plan.Apps` without lock **File:** `internal/web/handler_restore.go:115` **Category:** concurrency Directly accesses `plan.Apps` field without holding `restoreMu`. Concurrent reads from `apiRestoreStatus` could race. **Fix:** Use the plan's thread-safe accessor instead of direct field access. --- ### M8: No login rate limiting **File:** `internal/web/auth.go:91-128` **Category:** security No rate limiting on login attempts. While bcrypt is expensive, a persistent attacker could still brute-force weak passwords. **Fix:** Add per-IP rate limiting (e.g., 5 attempts per minute). --- ### M9: Route handling — `hasSuffix` can match empty stack names **File:** `internal/api/router.go:128` **Category:** bug Path `/stacks/deploy` matches `hasSuffix(path, "/deploy")`, yielding an empty stack name from `extractName`. The handler proceeds without validating the name is non-empty. **Fix:** Add non-empty name validation in handlers that use `extractName`. --- ### M10: Jobs registered after `Start()` never execute **File:** `internal/scheduler/scheduler.go:60-96` **Category:** logic `Start()` launches goroutines for existing jobs only. Jobs added via `Every()`/`Daily()` after `Start()` are appended but never get a goroutine. **Fix:** Either reject late registrations or launch the goroutine immediately if already started. --- ### M11: `findStoragePath` returns pointer to copy, not to settings data **File:** `internal/monitor/watchdog.go:890-897` **Category:** logic Returns `&sp` where `sp` is a range-loop copy. Mutations won't reflect in settings. All current callers are read-only so it's safe, but misleading. **Fix:** Return value type, or document it's a snapshot. --- ### M12: Set* methods on watchdog lack synchronization **File:** `internal/monitor/watchdog.go:113-125` **Category:** concurrency `SetAlertRefresh`, `SetHubReportPusher`, `SetRepoUnlocker` write fields without any mutex. If called after the scheduler starts, it's a data race. **Fix:** Document they must be called before Start(), or protect with mutex. --- ### M13: SQLite WAL mode pragma not verified **File:** `internal/metrics/store.go:26-36` **Category:** data integrity `PRAGMA journal_mode=WAL` is executed via `db.Exec()` but the returned mode is not verified. WAL could silently fail to engage on certain filesystems. **Fix:** Use `db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode)` and verify `mode == "wal"`. --- ### M14: `sampleContainers` uses `context.Background()` instead of cancellable context **File:** `internal/metrics/collector.go:100-101` **Category:** bug `docker stats` command ignores shutdown cancellation, potentially blocking for 30s after `Stop()` is called. **Fix:** Pass the collector's context through to `sampleContainers`. --- ### M15: Telemetry scan error silently swallowed **File:** `internal/metrics/telemetry.go:39-42` **Category:** bug Row scan failures are silently discarded with `continue`, masking data corruption or schema issues. **Fix:** Log the scan error before continuing. --- ### M16: `copyStackDBDumps` reads entire files into memory **File:** `internal/backup/crossdrive.go:478-484` **Category:** optimization Uses `os.ReadFile`/`os.WriteFile` for dump file copies. Large DB dumps (500MB+) cause full in-memory allocation. **Fix:** Use `io.Copy` with buffered I/O, or shell out to `cp`/`rsync`. --- ### M17: `restic stats` called twice per drive per cache refresh **File:** `internal/backup/backup.go:1070-1071` **Category:** optimization `aggregateRepoStats()` and `perDriveRepoStats()` both iterate drives and call `restic stats`. Each drives gets two subprocess invocations per refresh. **Fix:** Compute per-drive stats once, then aggregate from the results. --- ### M18: `ListDumpFiles` re-validates every dump file every 5 minutes **File:** `internal/backup/dbdump.go:448-451` **Category:** optimization `ValidateDump` opens and reads each `.sql` file line-by-line on every cache refresh, even for multi-hundred-MB dumps. Ignores the validation cache in `settings.json`. **Fix:** Pass the validation cache to `ListDumpFiles`, skip re-validation if file size+modtime match. --- ### M19: Naive stack name derivation from container names **File:** `internal/backup/dbdump.go:517-532` **Category:** logic `deriveStackName("my-cache")` strips "cache" suffix and returns "my" — potentially attributing DB dumps to wrong stacks. **Fix:** Cross-reference with actual deployed stack names instead of pure suffix stripping. --- ### M20: `syncInfraConfig` controller.yaml copy is not atomic **File:** `internal/backup/crossdrive.go:526-534` **Category:** data integrity Uses `os.ReadFile` + `os.WriteFile` directly. A crash mid-write corrupts the file. Compare with `local_infra.go` which uses `atomicWrite`. **Fix:** Use `atomicWrite` for consistency. --- ### M21: `http.Get` without timeout in main.go **File:** `cmd/controller/main.go:691` **Category:** bug Uses `http.Get` (no timeout). If the endpoint hangs, the goroutine leaks permanently. **Fix:** Use `&http.Client{Timeout: 10 * time.Second}`. --- ### M22: Assets syncer holds mutex during entire network operation **File:** `internal/assets/syncer.go:72-73` **Category:** concurrency `s.mu.Lock()` held for the entire `Sync()` call including HTTP downloads. `Status()` calls block for the entire sync duration (60+ seconds). **Fix:** Lock only to set a `running` flag, then unlock for actual work. --- ### M23: Git sync logs credentials in plaintext **File:** `internal/sync/sync.go:427` **Category:** security `runGitInDir` logs `strings.Join(args, " ")` which includes the repo URL with embedded credentials during `git clone`. The `maskRepoURL()` helper exists but is only used elsewhere. **Fix:** Apply `maskRepoURL()` in `runGitInDir()` before logging. --- ### M24: Path prefix check in storageAttachBrowseHandler may match sibling directories **File:** `internal/web/storage_handlers.go:953-958` **Category:** security `strings.HasPrefix(cleanPath, storage.RawMountBase)` — if `RawMountBase` doesn't end with `/`, a path like `/mnt/felhom-stage-evil/sensitive` would pass the check. **Fix:** Ensure prefix check includes trailing `/`: `strings.HasPrefix(cleanPath, storage.RawMountBase + "/")`. --- ### M25: Set* methods on Server lack synchronization **File:** `internal/web/server.go:114-172` **Category:** concurrency Multiple `Set*` methods write to Server fields without synchronization. The `SyncFileBrowserMounts` goroutine launched at startup could race with these calls. **Fix:** Document init order or add a barrier, or pass all deps via `NewServer()`. --- ### M26: `crypto.DecryptMap` uses global `log.Printf` **File:** `internal/crypto/crypto.go:112` **Category:** consistency Uses global `log` package instead of injected logger — inconsistent with rest of codebase. **Fix:** Add a `logger` parameter. --- ## LOW Issues ### L1: Dead code: unused `fileExists` in main.go (line 1071) ### L2: Multiple goroutines launched without `sync.WaitGroup` for shutdown (main.go: multiple locations) ### L3: `0` treated as "unset" for all int config fields (config.go:230-234) — prevents intentional zero values ### L4: Hardcoded metrics DB path ignores `cfg.Paths.DataDir` (main.go:138) ### L5: Hardcoded config path in infra backup functions (main.go:1047, 1148) ### L6: String concat instead of `filepath.Join` for path (main.go:83) ### L7: Double-clear of `running` map in CrossDriveRunner (crossdrive.go:86-110) — defer + manual clear ### L8: uint64 to int64 cast in restic stats — overflow for repos >8 EB (restic.go:293) ### L9: `deriveStackName` error parsing uses `string(out)` which may be empty (restic.go:136-142) ### L10: infraPaths duplicated in every drive's backup (backup.go:366-369) ### L11: TOCTOU race on fstab in `addDRFstabEntries` (restore_drives_linux.go:213-262) ### L12: Non-atomic fallback fstab write (restore_drives_linux.go:256-258) ### L13: 10-second timeout may truncate DB discovery results (backup.go:1077-1081) ### L14: `restoreUserData` copies files with hardcoded 0644 permissions (restore_app_linux.go:139-148) ### L15: Merged snapshot history not saved to disk after merge (backup.go:984-1053) ### L16: `RunIntegrityCheck` doesn't acquire the `running` flag (backup.go:571-622) ### L17: Empty drive path propagates to path builders (backup.go:177-180) ### L18: `ScanDrivesForBackups` picks first found backup, not freshest (restore_scan.go:248) ### L19: Throwaway `exec.Cmd` for logging in MariaDB dump (dbdump.go:213-223) ### L20: `ValidateDump` uses global `log` instead of passed logger (dbdump.go:300+) ### L21: `methodOrEmpty` never returns empty despite its name (healthprobe.go:315-323) ### L22: `aggregateState` doesn't account for `StatePaused` containers (manager.go:387-441) ### L23: `composeExecCustomEnv` double-loads env when env is nil (manager.go:688-694) ### L24: `logPostStartStatus` goroutine has no cancellation (manager.go:775-799) ### L25: `ParseComposeHDDMounts` naive YAML parsing can be fooled (delete.go:398-460) ### L26: `getDirSizeHuman` has no timeout (delete.go:463-474) ### L27: Delete/Remove calls both manual state update AND ScanStacks (delete.go:162-168) ### L28: `DeployField.Options` not deep-copied in `deepCopyStack` (manager.go:510-514) ### L29: Double WriteHeader in `serveCatchAll` (server.go:355-360) ### L30: CSRF token never rotated during session lifetime (auth.go:144-161) ### L31: Content-Type exact match is fragile (storage_handlers.go:1490-1497) ### L32: `AlertManager.Refresh` confusing variadic API (alerts.go:54) ### L33: `json` template function suppresses marshaling errors (funcmap.go:329-332) ### L34: Pinger `send()` always returns nil (pinger.go:91) ### L35: No SQLite connection pool limits (metrics/store.go:19-23) ### L36: No WAL checkpoint after Prune (metrics/store.go:303-322) ### L37: `AllCountries()` rebuilds sorted list on every call (cloudflare/countries.go:253-261) --- ## Fix Plan ### Phase 1: Critical & High Bugs (do first) These are bugs that could cause crashes, data loss, or security issues. | # | Issue | Effort | Impact | |---|-------|--------|--------| | C1 | Watchdog mutex unlock/relock panic safety | 30min | Prevents watchdog crash | | C2 | SetGeoAppOverride nil guard | 5min | Prevents nil panic | | C3 | Restore: SSD-only apps DB dump fallback | 30min | Fixes DR for SSD-only apps | | H1 | Double deploy race: atomic check-and-set | 30min | Prevents parallel deploys | | H2 | Delete/Remove: check Deploying flag | 10min | Prevents delete-during-deploy | | H3 | ScanStacks: skip Deployed overwrite during deploy | 15min | Prevents ghost deployments | | H4 | Clone appCfg before async deploy goroutine | 15min | Prevents shared pointer mutation | | H5 | SyncFileBrowserMounts: add mutex | 20min | Prevents concurrent file corruption | | H6 | PushEvent: add recordHistory calls | 20min | Fixes event history debug page | | H7 | PushOnce: return error for non-2xx | 5min | Fixes silent push failures | | H8 | DumpOne: close tmpFile before rename | 10min | Prevents truncated dump files | | H9 | Restic retry: fresh context | 10min | Prevents confusing timeout errors | | H10 | SaveAppConfig: log/error on encrypt failure | 10min | Prevents silent plaintext secrets | | H11 | settingsCrossBackupHandler: add path validation | 15min | Closes path traversal gap | | H12 | deepCopyStack: complete deep copy | 20min | Prevents shared slice mutation | **Estimated Phase 1 total: ~4 hours** ### Phase 2: Medium Security & Consistency | # | Issue | Effort | Impact | |---|-------|--------|--------| | M1 | Constant-time API key comparison | 5min | Timing attack prevention | | M8 | Login rate limiting | 1hr | Brute-force prevention | | M9 | Empty stack name validation | 10min | Prevents invalid API calls | | M23 | Mask git credentials in logs | 10min | Prevents credential leakage | | M24 | Path prefix check trailing slash | 5min | Closes path matching gap | | M21 | http.Get with timeout | 5min | Prevents goroutine leak | | M25 | Server Set* methods: document init order | 10min | Prevents future races | **Estimated Phase 2 total: ~2 hours** ### Phase 3: Medium Concurrency & Logic | # | Issue | Effort | Impact | |---|-------|--------|--------| | M2 | stackProvider: use sync.Once | 15min | Fixes data race | | M3 | DrainPendingEvents: drain after save | 10min | Prevents event loss | | M4 | SubdomainInUse: I/O outside lock | 30min | Reduces lock contention | | M5 | MigrateEncryption: encKey check inside lock | 5min | Fixes data race | | M7 | executeAllRestores: use thread-safe accessor | 10min | Fixes race condition | | M10 | Scheduler: reject/launch late registrations | 20min | Prevents silent job drops | | M12 | Watchdog Set* methods: document or protect | 10min | Prevents data race | | M13 | SQLite WAL verification | 10min | Confirms WAL engagement | | M14 | sampleContainers: pass cancellable context | 15min | Clean shutdown | | M15 | Telemetry scan error logging | 5min | Better diagnostics | | M22 | Assets syncer: unlock during network I/O | 20min | Unblocks status API | | M26 | crypto.DecryptMap: use injected logger | 10min | Logger consistency | **Estimated Phase 3 total: ~3 hours** ### Phase 4: Medium Optimization | # | Issue | Effort | Impact | |---|-------|--------|--------| | M16 | copyStackDBDumps: buffered I/O | 20min | Prevents OOM on large dumps | | M17 | restic stats: compute once per drive | 30min | Halves subprocess calls | | M18 | ListDumpFiles: use validation cache | 30min | Stops re-reading large dumps | | M19 | deriveStackName: cross-reference with stacks | 30min | Correct DB-to-stack mapping | | M20 | syncInfraConfig: atomic write | 10min | Consistent with local_infra.go | **Estimated Phase 4 total: ~2 hours** ### Phase 5: Low Priority (opportunistic) Address during related work or cleanup sprints. Key items: - **L2**: Add `sync.WaitGroup` for background goroutines (shutdown cleanup) - **L11/L12**: Atomic fstab writes in DR restore - **L14**: Preserve file permissions in restore - **L15**: Save merged snapshot history - **L22**: Handle `StatePaused` in `aggregateState` - **L25**: Replace naive YAML parsing with proper parser - **L35**: Set `db.SetMaxOpenConns(2)` for SQLite **Estimated Phase 5 total: ~4 hours (spread across multiple sessions)** --- ## Cross-Cutting Patterns to Standardize ### 1. Logger Inconsistency Several functions use global `log.Printf` instead of injected loggers: - `dbdump.go:ValidateDump` — global log - `crypto.go:DecryptMap` — global log - Fix: Add logger parameters for consistency ### 2. Atomic Write Inconsistency Some file operations use atomic write (tmp + rename), others don't: - `crossdrive.go:syncInfraConfig` — direct write - `restore_drives_linux.go` — fallback non-atomic write - Fix: Centralize atomic write helper, use everywhere ### 3. Context Propagation Several operations use `context.Background()` instead of the parent context: - `metrics/collector.go:sampleContainers` - `main.go:http.Get` - Fix: Thread context through for proper cancellation ### 4. Deep Copy Completeness `deepCopyStack` misses several nested slices/pointers: - `Meta.OptionalConfig`, `Meta.HealthCheck`, `DeployField.Options` - Fix: Audit all nested types and ensure complete deep copy ### 5. Error Swallowing Multiple locations silently discard errors: - `PushOnce` non-2xx responses - `PushEvent` delivery failures (not recorded) - `Pinger.send` always returns nil - `SaveAppConfig` encryption failures - Fix: At minimum log errors; return where possible --- ## Live Testing Summary (192.168.0.162) | Test | Result | |------|--------| | Container health | HEALTHY (v0.30.3, up, `(healthy)`) | | Log errors | CLEAN (zero errors/warnings) | | API health | OK (`{"ok":true}`) | | Disk space | OK (21% root, 1% data drives) | | Metrics DB | OK (7.2M DB, 1.4M WAL) | | Restic locks | CLEAN (none) | | Memory | 17.6 MiB / 15.4 GiB (0.11%) | | CPU | 0.11% | | Backup system | Working (4 DBs discovered, cache refresh OK) | No runtime issues detected. All findings are from code review — latent bugs and race conditions that require specific conditions to trigger.