# Bug Hunt Report — Comprehensive Controller Audit

**Date:** 2026-02-25
**Scope:** Full controller codebase (`controller/` — all ~80 Go files, all packages)
**Method:** Systematic code review of every file + live testing on demo node (192.168.0.162)
**Controller version:** v0.30.3

---

## Executive Summary

Reviewed all controller Go source files across 6 parallel deep-dive sessions covering: main/settings/config, backup (12 files), stacks (5 files), web/API/router (14 files), scheduler/monitor/metrics (10 files), and cloudflare/assets/storage/sync/setup/notify/report/recovery/selftest/crypto/util (32 files).

**Live testing result:** Demo node is healthy — zero errors, zero warnings, 18 MiB RAM, 0.11% CPU, all subsystems operational.

**Code review found 78 issues total:**

| Severity | Count |
|----------|-------|
| CRITICAL | 3 |
| HIGH     | 12 |
| MEDIUM   | 26 |
| LOW      | 37 |

---

## Table of Contents

- [CRITICAL Issues](#critical-issues)
- [HIGH Issues](#high-issues)
- [MEDIUM Issues](#medium-issues)
- [LOW Issues](#low-issues)
- [Fix Plan](#fix-plan)

---

## CRITICAL Issues

### C1: Watchdog mutex unlock/relock pattern — panic-unsafe

**File:** `internal/monitor/watchdog.go:234-238`
**Category:** concurrency

In `handleConnectedProbe()`, the mutex is unlocked before calling `handleDisconnect`, then re-locked to satisfy a deferred unlock:

```go
state.mu.Unlock()           // line 235
w.handleDisconnect(sp, state, result)  // line 236
state.mu.Lock()             // line 237 — re-acquire for deferred Unlock
```

If `handleDisconnect` panics, line 237 never executes, but the `defer state.mu.Unlock()` still fires — unlocking an already-unlocked mutex causes a **double-panic** (`sync: unlock of unlocked mutex`), crashing the watchdog goroutine entirely.

**Fix:** Refactor to update state inside the lock, then call `handleDisconnect` outside with an early return instead of re-locking. Or wrap `handleDisconnect` in a recover block.

---

### C2: SetGeoAppOverride nil pointer dereference

**File:** `internal/settings/settings.go:885`
**Category:** bug

`SetGeoAppOverride(stackName string, override *GeoAppOverride)` does not check for `override == nil` before accessing its fields. Callers who pass nil to "clear" an override would crash.

**Fix:** Add nil guard: `if override == nil { delete map entry; return }`.

---

### C3: SSD-only apps skip DB dump restoration during DR

**File:** `internal/backup/restore_app_linux.go:156-161`
**Category:** logic/bug

`restoreDBDumps` returns `nil` early if `app.HDDPath == ""`. SSD-only apps (no HDD) have DB dumps under `systemDataPath`, but this path is never considered. During disaster recovery, SSD-only apps with databases (e.g., Mealie, Vikunja) would **lose their DB dump backups**.

**Fix:** Fall back to `systemDataPath` when `HDDPath` is empty, mirroring the logic in `GetAppDrivePath()`.

---

## HIGH Issues

### H1: Double deploy race condition (TOCTOU)

**File:** `internal/stacks/deploy.go:110-285`
**Category:** concurrency

`DeployStack` reads the stack via `GetStack` (RLock, releases), checks `existing.Deployed == false` on disk, then 169 lines later acquires the write lock to set `Deploying = true`. Two concurrent deploy calls for the same stack can both pass the check and launch parallel `docker compose up -d` goroutines.

**Fix:** Atomically check-and-set `Deploying` under a single lock acquisition:
```go
m.mu.Lock()
if s.Deploying { m.mu.Unlock(); return error }
s.Deploying = true
m.mu.Unlock()
```

---

### H2: Delete/Remove during active deploy

**File:** `internal/stacks/delete.go:73, 226`
**Category:** concurrency

Neither `DeleteStack` nor `RemoveStack` checks `stack.Deploying`. A user could delete a stack while an async deploy goroutine is running, causing race between `docker compose down` and `docker compose up`, and the deploy goroutine would later modify a deleted stack.

**Fix:** Add `if stack.Deploying { return error }` checks in both functions.

---

### H3: ScanStacks can overwrite Deployed flag during async deploy

**File:** `internal/stacks/manager.go:232-248`
**Category:** concurrency

`ScanStacks` re-reads `app.yaml` from disk and overwrites the in-memory `Deployed` flag. If an async deploy fails and `runComposeDeploy` sets `Deployed=false` in memory, but disk still has the old state, `ScanStacks` could re-set `Deployed=true` creating a ghost deployment.

**Fix:** Skip overwriting `Deployed` and `AppConfig` when `existing.Deploying == true`.

---

### H4: Shared `appCfg` pointer between main thread and async deploy goroutine

**File:** `internal/stacks/deploy.go:283, 310`
**Category:** concurrency

The same `appCfg` pointer is stored in `s.AppConfig` (line 283) and passed to the async `runComposeDeploy` goroutine. The goroutine later mutates it (line 310: `appCfg.Deployed = false`). While the mutation happens under `m.mu.Lock`, readers who already got the pointer reference see mutation.

**Fix:** Clone `appCfg` before passing to the goroutine.

---

### H5: `SyncFileBrowserMounts` has no concurrency guard

**File:** `internal/web/handlers.go:1543-1589`
**Category:** concurrency

Called from 7 locations (4 via `go` goroutines). Two concurrent storage operations could trigger parallel file writes to `docker-compose.yml` and `config.yaml`, and concurrent `docker compose` restarts causing file corruption or Docker conflicts.

**Fix:** Add a `sync.Mutex` for the entire SyncFileBrowserMounts operation, or use a debounced channel to coalesce concurrent calls.

---

### H6: PushEvent never records to event history

**File:** `internal/notify/notifier.go:181-218`
**Category:** bug

`PushEvent()` fires notifications in a goroutine but never calls `recordHistory()`. Only `PushTestEventSync()` records. The debug event history page is blind to all real production events.

**Fix:** Add `n.recordHistory(...)` calls in the PushEvent goroutine on both success and failure paths.

---

### H7: PushOnce returns nil for non-2xx HTTP responses

**File:** `internal/report/pusher.go:194-224`
**Category:** bug

`PushOnce()` silently returns `nil` for 4xx/5xx responses. Callers believe the push succeeded. Compare with `Push()` and `PushInfraBackup()` which properly handle non-2xx.

**Fix:** Return `fmt.Errorf("hub push-once: HTTP %d", resp.StatusCode)` for non-2xx.

---

### H8: tmpFile not closed/synced before rename in DB dump

**File:** `internal/backup/dbdump.go:231-272`
**Category:** data integrity

`DumpOne` calls `os.Rename(tmpPath, finalPath)` while the file handle is still open (defer closes it later). Data may not be flushed to disk before rename. A crash after rename could leave a truncated/empty dump file.

**Fix:** Close `tmpFile` explicitly (checking the error) before `os.Rename`.

---

### H9: Restic retry uses possibly-expired context

**File:** `internal/backup/restic.go:142-156`
**Category:** logic

When a stale lock is detected, `Snapshot` calls `unlock` then retries with the same context (30-min timeout). If the first attempt consumed most of the timeout, the retry gets very little time and fails with a confusing timeout error.

**Fix:** Create a fresh context with its own timeout for the retry.

---

### H10: SaveAppConfig silently stores secrets in plaintext on encryption failure

**File:** `internal/stacks/deploy.go:542-549`
**Category:** security

When `crypto.Encrypt()` fails, the code silently falls through and saves the value unencrypted. No warning is logged. Passwords could be stored in plaintext on disk without any indication.

**Fix:** Log the encryption failure, or return an error to prevent saving unencrypted secrets.

---

### H11: `settingsCrossBackupHandler` missing destination path validation

**File:** `internal/web/handlers.go:947-994`
**Category:** security

The web form handler for cross-drive backup configuration does NOT validate the destination path against registered storage paths. The API handler (`router.go:823-837`) has explicit validation (C9 comment). A crafted form submission could set an arbitrary destination path.

**Fix:** Add the same validation as the API handler: check `destPath` against `GetStoragePaths()`.

---

### H12: deepCopyStack incomplete — OptionalConfig and HealthCheck not deep-copied

**File:** `internal/stacks/manager.go:474-517`
**Category:** concurrency

`deepCopyStack` copies `DeployFields` but not `Meta.OptionalConfig` (slice of slices) or `Meta.HealthCheck` (pointer). Mutations on the returned copy's OptionalConfig or HealthCheck would affect the original in the map.

**Fix:** Deep-copy `OptionalConfig` including nested `Fields` slices, and clone `HealthCheck` pointer.

---

## MEDIUM Issues

### M1: Non-constant-time API key comparison

**File:** `cmd/controller/main.go:820`
**Category:** security

API key comparison uses `==` instead of `subtle.ConstantTimeCompare`, making it vulnerable to timing attacks.

**Fix:** Use `crypto/subtle.ConstantTimeCompare`.

---

### M2: `stackProvider` read without mutex protection

**File:** `internal/backup/backup.go:172-176, 185-189, 481-482`
**Category:** concurrency

`SetStackProvider` writes under `m.mu`, but all reads of `m.stackProvider` happen without any lock. This is a Go data race per the memory model. In practice it's set once at startup, but the race detector would flag it.

**Fix:** Use `sync.Once` or `atomic.Value` for the provider.

---

### M3: `DrainPendingEvents` silently loses events on save failure

**File:** `internal/settings/settings.go:808-823`
**Category:** logic

If `s.save()` fails, events are drained from memory but never persisted. On restart, the events are lost.

**Fix:** Only drain events from memory after successful save.

---

### M4: `SubdomainInUse` performs disk I/O under RLock

**File:** `internal/stacks/deploy.go:58-82`
**Category:** concurrency/performance

Reads `app.yaml` from disk via `LoadAppConfig` for every deployed stack while holding `m.mu.RLock()`. Slow I/O blocks all writers.

**Fix:** Collect stack dirs under the lock, release it, then perform disk I/O.

---

### M5: `MigrateEncryption` reads `encKey` without lock

**File:** `internal/stacks/manager.go:123`
**Category:** concurrency

Checks `m.encKey == nil` before acquiring `m.mu.Lock()`. `SetEncryptionKey` writes `encKey` under the lock, creating a data race.

**Fix:** Move the nil check inside the lock.

---

### M6: `MigrateEncryption` holds lock during all disk I/O

**File:** `internal/stacks/manager.go:122-162`
**Category:** concurrency/performance

Holds `m.mu.Lock()` while performing `LoadAppConfig`, `LoadMetadata`, and `SaveAppConfig` for every stack. Blocks all operations during migration.

**Fix:** Collect stacks needing migration under lock, then do I/O outside. (Acceptable if only runs at startup before scheduler.)

---

### M7: `executeAllRestores` accesses `plan.Apps` without lock

**File:** `internal/web/handler_restore.go:115`
**Category:** concurrency

Directly accesses `plan.Apps` field without holding `restoreMu`. Concurrent reads from `apiRestoreStatus` could race.

**Fix:** Use the plan's thread-safe accessor instead of direct field access.

---

### M8: No login rate limiting

**File:** `internal/web/auth.go:91-128`
**Category:** security

No rate limiting on login attempts. While bcrypt is expensive, a persistent attacker could still brute-force weak passwords.

**Fix:** Add per-IP rate limiting (e.g., 5 attempts per minute).

---

### M9: Route handling — `hasSuffix` can match empty stack names

**File:** `internal/api/router.go:128`
**Category:** bug

Path `/stacks/deploy` matches `hasSuffix(path, "/deploy")`, yielding an empty stack name from `extractName`. The handler proceeds without validating the name is non-empty.

**Fix:** Add non-empty name validation in handlers that use `extractName`.

---

### M10: Jobs registered after `Start()` never execute

**File:** `internal/scheduler/scheduler.go:60-96`
**Category:** logic

`Start()` launches goroutines for existing jobs only. Jobs added via `Every()`/`Daily()` after `Start()` are appended but never get a goroutine.

**Fix:** Either reject late registrations or launch the goroutine immediately if already started.

---

### M11: `findStoragePath` returns pointer to copy, not to settings data

**File:** `internal/monitor/watchdog.go:890-897`
**Category:** logic

Returns `&sp` where `sp` is a range-loop copy. Mutations won't reflect in settings. All current callers are read-only so it's safe, but misleading.

**Fix:** Return value type, or document it's a snapshot.

---

### M12: Set* methods on watchdog lack synchronization

**File:** `internal/monitor/watchdog.go:113-125`
**Category:** concurrency

`SetAlertRefresh`, `SetHubReportPusher`, `SetRepoUnlocker` write fields without any mutex. If called after the scheduler starts, it's a data race.

**Fix:** Document they must be called before Start(), or protect with mutex.

---

### M13: SQLite WAL mode pragma not verified

**File:** `internal/metrics/store.go:26-36`
**Category:** data integrity

`PRAGMA journal_mode=WAL` is executed via `db.Exec()` but the returned mode is not verified. WAL could silently fail to engage on certain filesystems.

**Fix:** Use `db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode)` and verify `mode == "wal"`.

---

### M14: `sampleContainers` uses `context.Background()` instead of cancellable context

**File:** `internal/metrics/collector.go:100-101`
**Category:** bug

`docker stats` command ignores shutdown cancellation, potentially blocking for 30s after `Stop()` is called.

**Fix:** Pass the collector's context through to `sampleContainers`.

---

### M15: Telemetry scan error silently swallowed

**File:** `internal/metrics/telemetry.go:39-42`
**Category:** bug

Row scan failures are silently discarded with `continue`, masking data corruption or schema issues.

**Fix:** Log the scan error before continuing.

---

### M16: `copyStackDBDumps` reads entire files into memory

**File:** `internal/backup/crossdrive.go:478-484`
**Category:** optimization

Uses `os.ReadFile`/`os.WriteFile` for dump file copies. Large DB dumps (500MB+) cause full in-memory allocation.

**Fix:** Use `io.Copy` with buffered I/O, or shell out to `cp`/`rsync`.

---

### M17: `restic stats` called twice per drive per cache refresh

**File:** `internal/backup/backup.go:1070-1071`
**Category:** optimization

`aggregateRepoStats()` and `perDriveRepoStats()` both iterate drives and call `restic stats`. Each drives gets two subprocess invocations per refresh.

**Fix:** Compute per-drive stats once, then aggregate from the results.

---

### M18: `ListDumpFiles` re-validates every dump file every 5 minutes

**File:** `internal/backup/dbdump.go:448-451`
**Category:** optimization

`ValidateDump` opens and reads each `.sql` file line-by-line on every cache refresh, even for multi-hundred-MB dumps. Ignores the validation cache in `settings.json`.

**Fix:** Pass the validation cache to `ListDumpFiles`, skip re-validation if file size+modtime match.

---

### M19: Naive stack name derivation from container names

**File:** `internal/backup/dbdump.go:517-532`
**Category:** logic

`deriveStackName("my-cache")` strips "cache" suffix and returns "my" — potentially attributing DB dumps to wrong stacks.

**Fix:** Cross-reference with actual deployed stack names instead of pure suffix stripping.

---

### M20: `syncInfraConfig` controller.yaml copy is not atomic

**File:** `internal/backup/crossdrive.go:526-534`
**Category:** data integrity

Uses `os.ReadFile` + `os.WriteFile` directly. A crash mid-write corrupts the file. Compare with `local_infra.go` which uses `atomicWrite`.

**Fix:** Use `atomicWrite` for consistency.

---

### M21: `http.Get` without timeout in main.go

**File:** `cmd/controller/main.go:691`
**Category:** bug

Uses `http.Get` (no timeout). If the endpoint hangs, the goroutine leaks permanently.

**Fix:** Use `&http.Client{Timeout: 10 * time.Second}`.

---

### M22: Assets syncer holds mutex during entire network operation

**File:** `internal/assets/syncer.go:72-73`
**Category:** concurrency

`s.mu.Lock()` held for the entire `Sync()` call including HTTP downloads. `Status()` calls block for the entire sync duration (60+ seconds).

**Fix:** Lock only to set a `running` flag, then unlock for actual work.

---

### M23: Git sync logs credentials in plaintext

**File:** `internal/sync/sync.go:427`
**Category:** security

`runGitInDir` logs `strings.Join(args, " ")` which includes the repo URL with embedded credentials during `git clone`. The `maskRepoURL()` helper exists but is only used elsewhere.

**Fix:** Apply `maskRepoURL()` in `runGitInDir()` before logging.

---

### M24: Path prefix check in storageAttachBrowseHandler may match sibling directories

**File:** `internal/web/storage_handlers.go:953-958`
**Category:** security

`strings.HasPrefix(cleanPath, storage.RawMountBase)` — if `RawMountBase` doesn't end with `/`, a path like `/mnt/felhom-stage-evil/sensitive` would pass the check.

**Fix:** Ensure prefix check includes trailing `/`: `strings.HasPrefix(cleanPath, storage.RawMountBase + "/")`.

---

### M25: Set* methods on Server lack synchronization

**File:** `internal/web/server.go:114-172`
**Category:** concurrency

Multiple `Set*` methods write to Server fields without synchronization. The `SyncFileBrowserMounts` goroutine launched at startup could race with these calls.

**Fix:** Document init order or add a barrier, or pass all deps via `NewServer()`.

---

### M26: `crypto.DecryptMap` uses global `log.Printf`

**File:** `internal/crypto/crypto.go:112`
**Category:** consistency

Uses global `log` package instead of injected logger — inconsistent with rest of codebase.

**Fix:** Add a `logger` parameter.

---

## LOW Issues

### L1: Dead code: unused `fileExists` in main.go (line 1071)
### L2: Multiple goroutines launched without `sync.WaitGroup` for shutdown (main.go: multiple locations)
### L3: `0` treated as "unset" for all int config fields (config.go:230-234) — prevents intentional zero values
### L4: Hardcoded metrics DB path ignores `cfg.Paths.DataDir` (main.go:138)
### L5: Hardcoded config path in infra backup functions (main.go:1047, 1148)
### L6: String concat instead of `filepath.Join` for path (main.go:83)
### L7: Double-clear of `running` map in CrossDriveRunner (crossdrive.go:86-110) — defer + manual clear
### L8: uint64 to int64 cast in restic stats — overflow for repos >8 EB (restic.go:293)
### L9: `deriveStackName` error parsing uses `string(out)` which may be empty (restic.go:136-142)
### L10: infraPaths duplicated in every drive's backup (backup.go:366-369)
### L11: TOCTOU race on fstab in `addDRFstabEntries` (restore_drives_linux.go:213-262)
### L12: Non-atomic fallback fstab write (restore_drives_linux.go:256-258)
### L13: 10-second timeout may truncate DB discovery results (backup.go:1077-1081)
### L14: `restoreUserData` copies files with hardcoded 0644 permissions (restore_app_linux.go:139-148)
### L15: Merged snapshot history not saved to disk after merge (backup.go:984-1053)
### L16: `RunIntegrityCheck` doesn't acquire the `running` flag (backup.go:571-622)
### L17: Empty drive path propagates to path builders (backup.go:177-180)
### L18: `ScanDrivesForBackups` picks first found backup, not freshest (restore_scan.go:248)
### L19: Throwaway `exec.Cmd` for logging in MariaDB dump (dbdump.go:213-223)
### L20: `ValidateDump` uses global `log` instead of passed logger (dbdump.go:300+)
### L21: `methodOrEmpty` never returns empty despite its name (healthprobe.go:315-323)
### L22: `aggregateState` doesn't account for `StatePaused` containers (manager.go:387-441)
### L23: `composeExecCustomEnv` double-loads env when env is nil (manager.go:688-694)
### L24: `logPostStartStatus` goroutine has no cancellation (manager.go:775-799)
### L25: `ParseComposeHDDMounts` naive YAML parsing can be fooled (delete.go:398-460)
### L26: `getDirSizeHuman` has no timeout (delete.go:463-474)
### L27: Delete/Remove calls both manual state update AND ScanStacks (delete.go:162-168)
### L28: `DeployField.Options` not deep-copied in `deepCopyStack` (manager.go:510-514)
### L29: Double WriteHeader in `serveCatchAll` (server.go:355-360)
### L30: CSRF token never rotated during session lifetime (auth.go:144-161)
### L31: Content-Type exact match is fragile (storage_handlers.go:1490-1497)
### L32: `AlertManager.Refresh` confusing variadic API (alerts.go:54)
### L33: `json` template function suppresses marshaling errors (funcmap.go:329-332)
### L34: Pinger `send()` always returns nil (pinger.go:91)
### L35: No SQLite connection pool limits (metrics/store.go:19-23)
### L36: No WAL checkpoint after Prune (metrics/store.go:303-322)
### L37: `AllCountries()` rebuilds sorted list on every call (cloudflare/countries.go:253-261)

---

## Fix Plan

### Phase 1: Critical & High Bugs (do first)

These are bugs that could cause crashes, data loss, or security issues.

| # | Issue | Effort | Impact |
|---|-------|--------|--------|
| C1 | Watchdog mutex unlock/relock panic safety | 30min | Prevents watchdog crash |
| C2 | SetGeoAppOverride nil guard | 5min | Prevents nil panic |
| C3 | Restore: SSD-only apps DB dump fallback | 30min | Fixes DR for SSD-only apps |
| H1 | Double deploy race: atomic check-and-set | 30min | Prevents parallel deploys |
| H2 | Delete/Remove: check Deploying flag | 10min | Prevents delete-during-deploy |
| H3 | ScanStacks: skip Deployed overwrite during deploy | 15min | Prevents ghost deployments |
| H4 | Clone appCfg before async deploy goroutine | 15min | Prevents shared pointer mutation |
| H5 | SyncFileBrowserMounts: add mutex | 20min | Prevents concurrent file corruption |
| H6 | PushEvent: add recordHistory calls | 20min | Fixes event history debug page |
| H7 | PushOnce: return error for non-2xx | 5min | Fixes silent push failures |
| H8 | DumpOne: close tmpFile before rename | 10min | Prevents truncated dump files |
| H9 | Restic retry: fresh context | 10min | Prevents confusing timeout errors |
| H10 | SaveAppConfig: log/error on encrypt failure | 10min | Prevents silent plaintext secrets |
| H11 | settingsCrossBackupHandler: add path validation | 15min | Closes path traversal gap |
| H12 | deepCopyStack: complete deep copy | 20min | Prevents shared slice mutation |

**Estimated Phase 1 total: ~4 hours**

### Phase 2: Medium Security & Consistency

| # | Issue | Effort | Impact |
|---|-------|--------|--------|
| M1 | Constant-time API key comparison | 5min | Timing attack prevention |
| M8 | Login rate limiting | 1hr | Brute-force prevention |
| M9 | Empty stack name validation | 10min | Prevents invalid API calls |
| M23 | Mask git credentials in logs | 10min | Prevents credential leakage |
| M24 | Path prefix check trailing slash | 5min | Closes path matching gap |
| M21 | http.Get with timeout | 5min | Prevents goroutine leak |
| M25 | Server Set* methods: document init order | 10min | Prevents future races |

**Estimated Phase 2 total: ~2 hours**

### Phase 3: Medium Concurrency & Logic

| # | Issue | Effort | Impact |
|---|-------|--------|--------|
| M2 | stackProvider: use sync.Once | 15min | Fixes data race |
| M3 | DrainPendingEvents: drain after save | 10min | Prevents event loss |
| M4 | SubdomainInUse: I/O outside lock | 30min | Reduces lock contention |
| M5 | MigrateEncryption: encKey check inside lock | 5min | Fixes data race |
| M7 | executeAllRestores: use thread-safe accessor | 10min | Fixes race condition |
| M10 | Scheduler: reject/launch late registrations | 20min | Prevents silent job drops |
| M12 | Watchdog Set* methods: document or protect | 10min | Prevents data race |
| M13 | SQLite WAL verification | 10min | Confirms WAL engagement |
| M14 | sampleContainers: pass cancellable context | 15min | Clean shutdown |
| M15 | Telemetry scan error logging | 5min | Better diagnostics |
| M22 | Assets syncer: unlock during network I/O | 20min | Unblocks status API |
| M26 | crypto.DecryptMap: use injected logger | 10min | Logger consistency |

**Estimated Phase 3 total: ~3 hours**

### Phase 4: Medium Optimization

| # | Issue | Effort | Impact |
|---|-------|--------|--------|
| M16 | copyStackDBDumps: buffered I/O | 20min | Prevents OOM on large dumps |
| M17 | restic stats: compute once per drive | 30min | Halves subprocess calls |
| M18 | ListDumpFiles: use validation cache | 30min | Stops re-reading large dumps |
| M19 | deriveStackName: cross-reference with stacks | 30min | Correct DB-to-stack mapping |
| M20 | syncInfraConfig: atomic write | 10min | Consistent with local_infra.go |

**Estimated Phase 4 total: ~2 hours**

### Phase 5: Low Priority (opportunistic)

Address during related work or cleanup sprints. Key items:

- **L2**: Add `sync.WaitGroup` for background goroutines (shutdown cleanup)
- **L11/L12**: Atomic fstab writes in DR restore
- **L14**: Preserve file permissions in restore
- **L15**: Save merged snapshot history
- **L22**: Handle `StatePaused` in `aggregateState`
- **L25**: Replace naive YAML parsing with proper parser
- **L35**: Set `db.SetMaxOpenConns(2)` for SQLite

**Estimated Phase 5 total: ~4 hours (spread across multiple sessions)**

---

## Cross-Cutting Patterns to Standardize

### 1. Logger Inconsistency
Several functions use global `log.Printf` instead of injected loggers:
- `dbdump.go:ValidateDump` — global log
- `crypto.go:DecryptMap` — global log
- Fix: Add logger parameters for consistency

### 2. Atomic Write Inconsistency
Some file operations use atomic write (tmp + rename), others don't:
- `crossdrive.go:syncInfraConfig` — direct write
- `restore_drives_linux.go` — fallback non-atomic write
- Fix: Centralize atomic write helper, use everywhere

### 3. Context Propagation
Several operations use `context.Background()` instead of the parent context:
- `metrics/collector.go:sampleContainers`
- `main.go:http.Get`
- Fix: Thread context through for proper cancellation

### 4. Deep Copy Completeness
`deepCopyStack` misses several nested slices/pointers:
- `Meta.OptionalConfig`, `Meta.HealthCheck`, `DeployField.Options`
- Fix: Audit all nested types and ensure complete deep copy

### 5. Error Swallowing
Multiple locations silently discard errors:
- `PushOnce` non-2xx responses
- `PushEvent` delivery failures (not recorded)
- `Pinger.send` always returns nil
- `SaveAppConfig` encryption failures
- Fix: At minimum log errors; return where possible

---

## Live Testing Summary (192.168.0.162)

| Test | Result |
|------|--------|
| Container health | HEALTHY (v0.30.3, up, `(healthy)`) |
| Log errors | CLEAN (zero errors/warnings) |
| API health | OK (`{"ok":true}`) |
| Disk space | OK (21% root, 1% data drives) |
| Metrics DB | OK (7.2M DB, 1.4M WAL) |
| Restic locks | CLEAN (none) |
| Memory | 17.6 MiB / 15.4 GiB (0.11%) |
| CPU | 0.11% |
| Backup system | Working (4 DBs discovered, cache refresh OK) |

No runtime issues detected. All findings are from code review — latent bugs and race conditions that require specific conditions to trigger.