db83db383c
Critical: watchdog mutex panic safety, SetGeoAppOverride nil guard, SSD-only app DB restore fallback. High: double deploy race (atomic Deploying flag), delete/remove during deploy guard, ScanStacks overwrite protection, FileBrowser mount mutex, PushEvent history, PushOnce error handling, DB dump sync+close before rename, restic retry fresh context, encrypt failure logging, cross-backup path traversal validation, deepCopyStack completeness. Security: constant-time API key comparison, login rate limiting (5/min), git credential masking in logs, storage path prefix traversal fix. Concurrency: MigrateEncryption lock ordering, SubdomainInUse I/O outside lock, scheduler late-registered jobs, SQLite WAL verification, metrics shutdown context, telemetry scan error logging, asset sync lock scope. Optimization: streaming file copy for DB dumps, restic stats dedup, atomic infra config copy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
687 lines
26 KiB
Markdown
687 lines
26 KiB
Markdown
# Bug Hunt Report — Comprehensive Controller Audit
|
|
|
|
**Date:** 2026-02-25
|
|
**Scope:** Full controller codebase (`controller/` — all ~80 Go files, all packages)
|
|
**Method:** Systematic code review of every file + live testing on demo node (192.168.0.162)
|
|
**Controller version:** v0.30.3
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
Reviewed all controller Go source files across 6 parallel deep-dive sessions covering: main/settings/config, backup (12 files), stacks (5 files), web/API/router (14 files), scheduler/monitor/metrics (10 files), and cloudflare/assets/storage/sync/setup/notify/report/recovery/selftest/crypto/util (32 files).
|
|
|
|
**Live testing result:** Demo node is healthy — zero errors, zero warnings, 18 MiB RAM, 0.11% CPU, all subsystems operational.
|
|
|
|
**Code review found 78 issues total:**
|
|
|
|
| Severity | Count |
|
|
|----------|-------|
|
|
| CRITICAL | 3 |
|
|
| HIGH | 12 |
|
|
| MEDIUM | 26 |
|
|
| LOW | 37 |
|
|
|
|
---
|
|
|
|
## Table of Contents
|
|
|
|
- [CRITICAL Issues](#critical-issues)
|
|
- [HIGH Issues](#high-issues)
|
|
- [MEDIUM Issues](#medium-issues)
|
|
- [LOW Issues](#low-issues)
|
|
- [Fix Plan](#fix-plan)
|
|
|
|
---
|
|
|
|
## CRITICAL Issues
|
|
|
|
### C1: Watchdog mutex unlock/relock pattern — panic-unsafe
|
|
|
|
**File:** `internal/monitor/watchdog.go:234-238`
|
|
**Category:** concurrency
|
|
|
|
In `handleConnectedProbe()`, the mutex is unlocked before calling `handleDisconnect`, then re-locked to satisfy a deferred unlock:
|
|
|
|
```go
|
|
state.mu.Unlock() // line 235
|
|
w.handleDisconnect(sp, state, result) // line 236
|
|
state.mu.Lock() // line 237 — re-acquire for deferred Unlock
|
|
```
|
|
|
|
If `handleDisconnect` panics, line 237 never executes, but the `defer state.mu.Unlock()` still fires — unlocking an already-unlocked mutex causes a **double-panic** (`sync: unlock of unlocked mutex`), crashing the watchdog goroutine entirely.
|
|
|
|
**Fix:** Refactor to update state inside the lock, then call `handleDisconnect` outside with an early return instead of re-locking. Or wrap `handleDisconnect` in a recover block.
|
|
|
|
---
|
|
|
|
### C2: SetGeoAppOverride nil pointer dereference
|
|
|
|
**File:** `internal/settings/settings.go:885`
|
|
**Category:** bug
|
|
|
|
`SetGeoAppOverride(stackName string, override *GeoAppOverride)` does not check for `override == nil` before accessing its fields. Callers who pass nil to "clear" an override would crash.
|
|
|
|
**Fix:** Add nil guard: `if override == nil { delete map entry; return }`.
|
|
|
|
---
|
|
|
|
### C3: SSD-only apps skip DB dump restoration during DR
|
|
|
|
**File:** `internal/backup/restore_app_linux.go:156-161`
|
|
**Category:** logic/bug
|
|
|
|
`restoreDBDumps` returns `nil` early if `app.HDDPath == ""`. SSD-only apps (no HDD) have DB dumps under `systemDataPath`, but this path is never considered. During disaster recovery, SSD-only apps with databases (e.g., Mealie, Vikunja) would **lose their DB dump backups**.
|
|
|
|
**Fix:** Fall back to `systemDataPath` when `HDDPath` is empty, mirroring the logic in `GetAppDrivePath()`.
|
|
|
|
---
|
|
|
|
## HIGH Issues
|
|
|
|
### H1: Double deploy race condition (TOCTOU)
|
|
|
|
**File:** `internal/stacks/deploy.go:110-285`
|
|
**Category:** concurrency
|
|
|
|
`DeployStack` reads the stack via `GetStack` (RLock, releases), checks `existing.Deployed == false` on disk, then 169 lines later acquires the write lock to set `Deploying = true`. Two concurrent deploy calls for the same stack can both pass the check and launch parallel `docker compose up -d` goroutines.
|
|
|
|
**Fix:** Atomically check-and-set `Deploying` under a single lock acquisition:
|
|
```go
|
|
m.mu.Lock()
|
|
if s.Deploying { m.mu.Unlock(); return error }
|
|
s.Deploying = true
|
|
m.mu.Unlock()
|
|
```
|
|
|
|
---
|
|
|
|
### H2: Delete/Remove during active deploy
|
|
|
|
**File:** `internal/stacks/delete.go:73, 226`
|
|
**Category:** concurrency
|
|
|
|
Neither `DeleteStack` nor `RemoveStack` checks `stack.Deploying`. A user could delete a stack while an async deploy goroutine is running, causing race between `docker compose down` and `docker compose up`, and the deploy goroutine would later modify a deleted stack.
|
|
|
|
**Fix:** Add `if stack.Deploying { return error }` checks in both functions.
|
|
|
|
---
|
|
|
|
### H3: ScanStacks can overwrite Deployed flag during async deploy
|
|
|
|
**File:** `internal/stacks/manager.go:232-248`
|
|
**Category:** concurrency
|
|
|
|
`ScanStacks` re-reads `app.yaml` from disk and overwrites the in-memory `Deployed` flag. If an async deploy fails and `runComposeDeploy` sets `Deployed=false` in memory, but disk still has the old state, `ScanStacks` could re-set `Deployed=true` creating a ghost deployment.
|
|
|
|
**Fix:** Skip overwriting `Deployed` and `AppConfig` when `existing.Deploying == true`.
|
|
|
|
---
|
|
|
|
### H4: Shared `appCfg` pointer between main thread and async deploy goroutine
|
|
|
|
**File:** `internal/stacks/deploy.go:283, 310`
|
|
**Category:** concurrency
|
|
|
|
The same `appCfg` pointer is stored in `s.AppConfig` (line 283) and passed to the async `runComposeDeploy` goroutine. The goroutine later mutates it (line 310: `appCfg.Deployed = false`). While the mutation happens under `m.mu.Lock`, readers who already got the pointer reference see mutation.
|
|
|
|
**Fix:** Clone `appCfg` before passing to the goroutine.
|
|
|
|
---
|
|
|
|
### H5: `SyncFileBrowserMounts` has no concurrency guard
|
|
|
|
**File:** `internal/web/handlers.go:1543-1589`
|
|
**Category:** concurrency
|
|
|
|
Called from 7 locations (4 via `go` goroutines). Two concurrent storage operations could trigger parallel file writes to `docker-compose.yml` and `config.yaml`, and concurrent `docker compose` restarts causing file corruption or Docker conflicts.
|
|
|
|
**Fix:** Add a `sync.Mutex` for the entire SyncFileBrowserMounts operation, or use a debounced channel to coalesce concurrent calls.
|
|
|
|
---
|
|
|
|
### H6: PushEvent never records to event history
|
|
|
|
**File:** `internal/notify/notifier.go:181-218`
|
|
**Category:** bug
|
|
|
|
`PushEvent()` fires notifications in a goroutine but never calls `recordHistory()`. Only `PushTestEventSync()` records. The debug event history page is blind to all real production events.
|
|
|
|
**Fix:** Add `n.recordHistory(...)` calls in the PushEvent goroutine on both success and failure paths.
|
|
|
|
---
|
|
|
|
### H7: PushOnce returns nil for non-2xx HTTP responses
|
|
|
|
**File:** `internal/report/pusher.go:194-224`
|
|
**Category:** bug
|
|
|
|
`PushOnce()` silently returns `nil` for 4xx/5xx responses. Callers believe the push succeeded. Compare with `Push()` and `PushInfraBackup()` which properly handle non-2xx.
|
|
|
|
**Fix:** Return `fmt.Errorf("hub push-once: HTTP %d", resp.StatusCode)` for non-2xx.
|
|
|
|
---
|
|
|
|
### H8: tmpFile not closed/synced before rename in DB dump
|
|
|
|
**File:** `internal/backup/dbdump.go:231-272`
|
|
**Category:** data integrity
|
|
|
|
`DumpOne` calls `os.Rename(tmpPath, finalPath)` while the file handle is still open (defer closes it later). Data may not be flushed to disk before rename. A crash after rename could leave a truncated/empty dump file.
|
|
|
|
**Fix:** Close `tmpFile` explicitly (checking the error) before `os.Rename`.
|
|
|
|
---
|
|
|
|
### H9: Restic retry uses possibly-expired context
|
|
|
|
**File:** `internal/backup/restic.go:142-156`
|
|
**Category:** logic
|
|
|
|
When a stale lock is detected, `Snapshot` calls `unlock` then retries with the same context (30-min timeout). If the first attempt consumed most of the timeout, the retry gets very little time and fails with a confusing timeout error.
|
|
|
|
**Fix:** Create a fresh context with its own timeout for the retry.
|
|
|
|
---
|
|
|
|
### H10: SaveAppConfig silently stores secrets in plaintext on encryption failure
|
|
|
|
**File:** `internal/stacks/deploy.go:542-549`
|
|
**Category:** security
|
|
|
|
When `crypto.Encrypt()` fails, the code silently falls through and saves the value unencrypted. No warning is logged. Passwords could be stored in plaintext on disk without any indication.
|
|
|
|
**Fix:** Log the encryption failure, or return an error to prevent saving unencrypted secrets.
|
|
|
|
---
|
|
|
|
### H11: `settingsCrossBackupHandler` missing destination path validation
|
|
|
|
**File:** `internal/web/handlers.go:947-994`
|
|
**Category:** security
|
|
|
|
The web form handler for cross-drive backup configuration does NOT validate the destination path against registered storage paths. The API handler (`router.go:823-837`) has explicit validation (C9 comment). A crafted form submission could set an arbitrary destination path.
|
|
|
|
**Fix:** Add the same validation as the API handler: check `destPath` against `GetStoragePaths()`.
|
|
|
|
---
|
|
|
|
### H12: deepCopyStack incomplete — OptionalConfig and HealthCheck not deep-copied
|
|
|
|
**File:** `internal/stacks/manager.go:474-517`
|
|
**Category:** concurrency
|
|
|
|
`deepCopyStack` copies `DeployFields` but not `Meta.OptionalConfig` (slice of slices) or `Meta.HealthCheck` (pointer). Mutations on the returned copy's OptionalConfig or HealthCheck would affect the original in the map.
|
|
|
|
**Fix:** Deep-copy `OptionalConfig` including nested `Fields` slices, and clone `HealthCheck` pointer.
|
|
|
|
---
|
|
|
|
## MEDIUM Issues
|
|
|
|
### M1: Non-constant-time API key comparison
|
|
|
|
**File:** `cmd/controller/main.go:820`
|
|
**Category:** security
|
|
|
|
API key comparison uses `==` instead of `subtle.ConstantTimeCompare`, making it vulnerable to timing attacks.
|
|
|
|
**Fix:** Use `crypto/subtle.ConstantTimeCompare`.
|
|
|
|
---
|
|
|
|
### M2: `stackProvider` read without mutex protection
|
|
|
|
**File:** `internal/backup/backup.go:172-176, 185-189, 481-482`
|
|
**Category:** concurrency
|
|
|
|
`SetStackProvider` writes under `m.mu`, but all reads of `m.stackProvider` happen without any lock. This is a Go data race per the memory model. In practice it's set once at startup, but the race detector would flag it.
|
|
|
|
**Fix:** Use `sync.Once` or `atomic.Value` for the provider.
|
|
|
|
---
|
|
|
|
### M3: `DrainPendingEvents` silently loses events on save failure
|
|
|
|
**File:** `internal/settings/settings.go:808-823`
|
|
**Category:** logic
|
|
|
|
If `s.save()` fails, events are drained from memory but never persisted. On restart, the events are lost.
|
|
|
|
**Fix:** Only drain events from memory after successful save.
|
|
|
|
---
|
|
|
|
### M4: `SubdomainInUse` performs disk I/O under RLock
|
|
|
|
**File:** `internal/stacks/deploy.go:58-82`
|
|
**Category:** concurrency/performance
|
|
|
|
Reads `app.yaml` from disk via `LoadAppConfig` for every deployed stack while holding `m.mu.RLock()`. Slow I/O blocks all writers.
|
|
|
|
**Fix:** Collect stack dirs under the lock, release it, then perform disk I/O.
|
|
|
|
---
|
|
|
|
### M5: `MigrateEncryption` reads `encKey` without lock
|
|
|
|
**File:** `internal/stacks/manager.go:123`
|
|
**Category:** concurrency
|
|
|
|
Checks `m.encKey == nil` before acquiring `m.mu.Lock()`. `SetEncryptionKey` writes `encKey` under the lock, creating a data race.
|
|
|
|
**Fix:** Move the nil check inside the lock.
|
|
|
|
---
|
|
|
|
### M6: `MigrateEncryption` holds lock during all disk I/O
|
|
|
|
**File:** `internal/stacks/manager.go:122-162`
|
|
**Category:** concurrency/performance
|
|
|
|
Holds `m.mu.Lock()` while performing `LoadAppConfig`, `LoadMetadata`, and `SaveAppConfig` for every stack. Blocks all operations during migration.
|
|
|
|
**Fix:** Collect stacks needing migration under lock, then do I/O outside. (Acceptable if only runs at startup before scheduler.)
|
|
|
|
---
|
|
|
|
### M7: `executeAllRestores` accesses `plan.Apps` without lock
|
|
|
|
**File:** `internal/web/handler_restore.go:115`
|
|
**Category:** concurrency
|
|
|
|
Directly accesses `plan.Apps` field without holding `restoreMu`. Concurrent reads from `apiRestoreStatus` could race.
|
|
|
|
**Fix:** Use the plan's thread-safe accessor instead of direct field access.
|
|
|
|
---
|
|
|
|
### M8: No login rate limiting
|
|
|
|
**File:** `internal/web/auth.go:91-128`
|
|
**Category:** security
|
|
|
|
No rate limiting on login attempts. While bcrypt is expensive, a persistent attacker could still brute-force weak passwords.
|
|
|
|
**Fix:** Add per-IP rate limiting (e.g., 5 attempts per minute).
|
|
|
|
---
|
|
|
|
### M9: Route handling — `hasSuffix` can match empty stack names
|
|
|
|
**File:** `internal/api/router.go:128`
|
|
**Category:** bug
|
|
|
|
Path `/stacks/deploy` matches `hasSuffix(path, "/deploy")`, yielding an empty stack name from `extractName`. The handler proceeds without validating the name is non-empty.
|
|
|
|
**Fix:** Add non-empty name validation in handlers that use `extractName`.
|
|
|
|
---
|
|
|
|
### M10: Jobs registered after `Start()` never execute
|
|
|
|
**File:** `internal/scheduler/scheduler.go:60-96`
|
|
**Category:** logic
|
|
|
|
`Start()` launches goroutines for existing jobs only. Jobs added via `Every()`/`Daily()` after `Start()` are appended but never get a goroutine.
|
|
|
|
**Fix:** Either reject late registrations or launch the goroutine immediately if already started.
|
|
|
|
---
|
|
|
|
### M11: `findStoragePath` returns pointer to copy, not to settings data
|
|
|
|
**File:** `internal/monitor/watchdog.go:890-897`
|
|
**Category:** logic
|
|
|
|
Returns `&sp` where `sp` is a range-loop copy. Mutations won't reflect in settings. All current callers are read-only so it's safe, but misleading.
|
|
|
|
**Fix:** Return value type, or document it's a snapshot.
|
|
|
|
---
|
|
|
|
### M12: Set* methods on watchdog lack synchronization
|
|
|
|
**File:** `internal/monitor/watchdog.go:113-125`
|
|
**Category:** concurrency
|
|
|
|
`SetAlertRefresh`, `SetHubReportPusher`, `SetRepoUnlocker` write fields without any mutex. If called after the scheduler starts, it's a data race.
|
|
|
|
**Fix:** Document they must be called before Start(), or protect with mutex.
|
|
|
|
---
|
|
|
|
### M13: SQLite WAL mode pragma not verified
|
|
|
|
**File:** `internal/metrics/store.go:26-36`
|
|
**Category:** data integrity
|
|
|
|
`PRAGMA journal_mode=WAL` is executed via `db.Exec()` but the returned mode is not verified. WAL could silently fail to engage on certain filesystems.
|
|
|
|
**Fix:** Use `db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode)` and verify `mode == "wal"`.
|
|
|
|
---
|
|
|
|
### M14: `sampleContainers` uses `context.Background()` instead of cancellable context
|
|
|
|
**File:** `internal/metrics/collector.go:100-101`
|
|
**Category:** bug
|
|
|
|
`docker stats` command ignores shutdown cancellation, potentially blocking for 30s after `Stop()` is called.
|
|
|
|
**Fix:** Pass the collector's context through to `sampleContainers`.
|
|
|
|
---
|
|
|
|
### M15: Telemetry scan error silently swallowed
|
|
|
|
**File:** `internal/metrics/telemetry.go:39-42`
|
|
**Category:** bug
|
|
|
|
Row scan failures are silently discarded with `continue`, masking data corruption or schema issues.
|
|
|
|
**Fix:** Log the scan error before continuing.
|
|
|
|
---
|
|
|
|
### M16: `copyStackDBDumps` reads entire files into memory
|
|
|
|
**File:** `internal/backup/crossdrive.go:478-484`
|
|
**Category:** optimization
|
|
|
|
Uses `os.ReadFile`/`os.WriteFile` for dump file copies. Large DB dumps (500MB+) cause full in-memory allocation.
|
|
|
|
**Fix:** Use `io.Copy` with buffered I/O, or shell out to `cp`/`rsync`.
|
|
|
|
---
|
|
|
|
### M17: `restic stats` called twice per drive per cache refresh
|
|
|
|
**File:** `internal/backup/backup.go:1070-1071`
|
|
**Category:** optimization
|
|
|
|
`aggregateRepoStats()` and `perDriveRepoStats()` both iterate drives and call `restic stats`. Each drives gets two subprocess invocations per refresh.
|
|
|
|
**Fix:** Compute per-drive stats once, then aggregate from the results.
|
|
|
|
---
|
|
|
|
### M18: `ListDumpFiles` re-validates every dump file every 5 minutes
|
|
|
|
**File:** `internal/backup/dbdump.go:448-451`
|
|
**Category:** optimization
|
|
|
|
`ValidateDump` opens and reads each `.sql` file line-by-line on every cache refresh, even for multi-hundred-MB dumps. Ignores the validation cache in `settings.json`.
|
|
|
|
**Fix:** Pass the validation cache to `ListDumpFiles`, skip re-validation if file size+modtime match.
|
|
|
|
---
|
|
|
|
### M19: Naive stack name derivation from container names
|
|
|
|
**File:** `internal/backup/dbdump.go:517-532`
|
|
**Category:** logic
|
|
|
|
`deriveStackName("my-cache")` strips "cache" suffix and returns "my" — potentially attributing DB dumps to wrong stacks.
|
|
|
|
**Fix:** Cross-reference with actual deployed stack names instead of pure suffix stripping.
|
|
|
|
---
|
|
|
|
### M20: `syncInfraConfig` controller.yaml copy is not atomic
|
|
|
|
**File:** `internal/backup/crossdrive.go:526-534`
|
|
**Category:** data integrity
|
|
|
|
Uses `os.ReadFile` + `os.WriteFile` directly. A crash mid-write corrupts the file. Compare with `local_infra.go` which uses `atomicWrite`.
|
|
|
|
**Fix:** Use `atomicWrite` for consistency.
|
|
|
|
---
|
|
|
|
### M21: `http.Get` without timeout in main.go
|
|
|
|
**File:** `cmd/controller/main.go:691`
|
|
**Category:** bug
|
|
|
|
Uses `http.Get` (no timeout). If the endpoint hangs, the goroutine leaks permanently.
|
|
|
|
**Fix:** Use `&http.Client{Timeout: 10 * time.Second}`.
|
|
|
|
---
|
|
|
|
### M22: Assets syncer holds mutex during entire network operation
|
|
|
|
**File:** `internal/assets/syncer.go:72-73`
|
|
**Category:** concurrency
|
|
|
|
`s.mu.Lock()` held for the entire `Sync()` call including HTTP downloads. `Status()` calls block for the entire sync duration (60+ seconds).
|
|
|
|
**Fix:** Lock only to set a `running` flag, then unlock for actual work.
|
|
|
|
---
|
|
|
|
### M23: Git sync logs credentials in plaintext
|
|
|
|
**File:** `internal/sync/sync.go:427`
|
|
**Category:** security
|
|
|
|
`runGitInDir` logs `strings.Join(args, " ")` which includes the repo URL with embedded credentials during `git clone`. The `maskRepoURL()` helper exists but is only used elsewhere.
|
|
|
|
**Fix:** Apply `maskRepoURL()` in `runGitInDir()` before logging.
|
|
|
|
---
|
|
|
|
### M24: Path prefix check in storageAttachBrowseHandler may match sibling directories
|
|
|
|
**File:** `internal/web/storage_handlers.go:953-958`
|
|
**Category:** security
|
|
|
|
`strings.HasPrefix(cleanPath, storage.RawMountBase)` — if `RawMountBase` doesn't end with `/`, a path like `/mnt/felhom-stage-evil/sensitive` would pass the check.
|
|
|
|
**Fix:** Ensure prefix check includes trailing `/`: `strings.HasPrefix(cleanPath, storage.RawMountBase + "/")`.
|
|
|
|
---
|
|
|
|
### M25: Set* methods on Server lack synchronization
|
|
|
|
**File:** `internal/web/server.go:114-172`
|
|
**Category:** concurrency
|
|
|
|
Multiple `Set*` methods write to Server fields without synchronization. The `SyncFileBrowserMounts` goroutine launched at startup could race with these calls.
|
|
|
|
**Fix:** Document init order or add a barrier, or pass all deps via `NewServer()`.
|
|
|
|
---
|
|
|
|
### M26: `crypto.DecryptMap` uses global `log.Printf`
|
|
|
|
**File:** `internal/crypto/crypto.go:112`
|
|
**Category:** consistency
|
|
|
|
Uses global `log` package instead of injected logger — inconsistent with rest of codebase.
|
|
|
|
**Fix:** Add a `logger` parameter.
|
|
|
|
---
|
|
|
|
## LOW Issues
|
|
|
|
### L1: Dead code: unused `fileExists` in main.go (line 1071)
|
|
### L2: Multiple goroutines launched without `sync.WaitGroup` for shutdown (main.go: multiple locations)
|
|
### L3: `0` treated as "unset" for all int config fields (config.go:230-234) — prevents intentional zero values
|
|
### L4: Hardcoded metrics DB path ignores `cfg.Paths.DataDir` (main.go:138)
|
|
### L5: Hardcoded config path in infra backup functions (main.go:1047, 1148)
|
|
### L6: String concat instead of `filepath.Join` for path (main.go:83)
|
|
### L7: Double-clear of `running` map in CrossDriveRunner (crossdrive.go:86-110) — defer + manual clear
|
|
### L8: uint64 to int64 cast in restic stats — overflow for repos >8 EB (restic.go:293)
|
|
### L9: `deriveStackName` error parsing uses `string(out)` which may be empty (restic.go:136-142)
|
|
### L10: infraPaths duplicated in every drive's backup (backup.go:366-369)
|
|
### L11: TOCTOU race on fstab in `addDRFstabEntries` (restore_drives_linux.go:213-262)
|
|
### L12: Non-atomic fallback fstab write (restore_drives_linux.go:256-258)
|
|
### L13: 10-second timeout may truncate DB discovery results (backup.go:1077-1081)
|
|
### L14: `restoreUserData` copies files with hardcoded 0644 permissions (restore_app_linux.go:139-148)
|
|
### L15: Merged snapshot history not saved to disk after merge (backup.go:984-1053)
|
|
### L16: `RunIntegrityCheck` doesn't acquire the `running` flag (backup.go:571-622)
|
|
### L17: Empty drive path propagates to path builders (backup.go:177-180)
|
|
### L18: `ScanDrivesForBackups` picks first found backup, not freshest (restore_scan.go:248)
|
|
### L19: Throwaway `exec.Cmd` for logging in MariaDB dump (dbdump.go:213-223)
|
|
### L20: `ValidateDump` uses global `log` instead of passed logger (dbdump.go:300+)
|
|
### L21: `methodOrEmpty` never returns empty despite its name (healthprobe.go:315-323)
|
|
### L22: `aggregateState` doesn't account for `StatePaused` containers (manager.go:387-441)
|
|
### L23: `composeExecCustomEnv` double-loads env when env is nil (manager.go:688-694)
|
|
### L24: `logPostStartStatus` goroutine has no cancellation (manager.go:775-799)
|
|
### L25: `ParseComposeHDDMounts` naive YAML parsing can be fooled (delete.go:398-460)
|
|
### L26: `getDirSizeHuman` has no timeout (delete.go:463-474)
|
|
### L27: Delete/Remove calls both manual state update AND ScanStacks (delete.go:162-168)
|
|
### L28: `DeployField.Options` not deep-copied in `deepCopyStack` (manager.go:510-514)
|
|
### L29: Double WriteHeader in `serveCatchAll` (server.go:355-360)
|
|
### L30: CSRF token never rotated during session lifetime (auth.go:144-161)
|
|
### L31: Content-Type exact match is fragile (storage_handlers.go:1490-1497)
|
|
### L32: `AlertManager.Refresh` confusing variadic API (alerts.go:54)
|
|
### L33: `json` template function suppresses marshaling errors (funcmap.go:329-332)
|
|
### L34: Pinger `send()` always returns nil (pinger.go:91)
|
|
### L35: No SQLite connection pool limits (metrics/store.go:19-23)
|
|
### L36: No WAL checkpoint after Prune (metrics/store.go:303-322)
|
|
### L37: `AllCountries()` rebuilds sorted list on every call (cloudflare/countries.go:253-261)
|
|
|
|
---
|
|
|
|
## Fix Plan
|
|
|
|
### Phase 1: Critical & High Bugs (do first)
|
|
|
|
These are bugs that could cause crashes, data loss, or security issues.
|
|
|
|
| # | Issue | Effort | Impact |
|
|
|---|-------|--------|--------|
|
|
| C1 | Watchdog mutex unlock/relock panic safety | 30min | Prevents watchdog crash |
|
|
| C2 | SetGeoAppOverride nil guard | 5min | Prevents nil panic |
|
|
| C3 | Restore: SSD-only apps DB dump fallback | 30min | Fixes DR for SSD-only apps |
|
|
| H1 | Double deploy race: atomic check-and-set | 30min | Prevents parallel deploys |
|
|
| H2 | Delete/Remove: check Deploying flag | 10min | Prevents delete-during-deploy |
|
|
| H3 | ScanStacks: skip Deployed overwrite during deploy | 15min | Prevents ghost deployments |
|
|
| H4 | Clone appCfg before async deploy goroutine | 15min | Prevents shared pointer mutation |
|
|
| H5 | SyncFileBrowserMounts: add mutex | 20min | Prevents concurrent file corruption |
|
|
| H6 | PushEvent: add recordHistory calls | 20min | Fixes event history debug page |
|
|
| H7 | PushOnce: return error for non-2xx | 5min | Fixes silent push failures |
|
|
| H8 | DumpOne: close tmpFile before rename | 10min | Prevents truncated dump files |
|
|
| H9 | Restic retry: fresh context | 10min | Prevents confusing timeout errors |
|
|
| H10 | SaveAppConfig: log/error on encrypt failure | 10min | Prevents silent plaintext secrets |
|
|
| H11 | settingsCrossBackupHandler: add path validation | 15min | Closes path traversal gap |
|
|
| H12 | deepCopyStack: complete deep copy | 20min | Prevents shared slice mutation |
|
|
|
|
**Estimated Phase 1 total: ~4 hours**
|
|
|
|
### Phase 2: Medium Security & Consistency
|
|
|
|
| # | Issue | Effort | Impact |
|
|
|---|-------|--------|--------|
|
|
| M1 | Constant-time API key comparison | 5min | Timing attack prevention |
|
|
| M8 | Login rate limiting | 1hr | Brute-force prevention |
|
|
| M9 | Empty stack name validation | 10min | Prevents invalid API calls |
|
|
| M23 | Mask git credentials in logs | 10min | Prevents credential leakage |
|
|
| M24 | Path prefix check trailing slash | 5min | Closes path matching gap |
|
|
| M21 | http.Get with timeout | 5min | Prevents goroutine leak |
|
|
| M25 | Server Set* methods: document init order | 10min | Prevents future races |
|
|
|
|
**Estimated Phase 2 total: ~2 hours**
|
|
|
|
### Phase 3: Medium Concurrency & Logic
|
|
|
|
| # | Issue | Effort | Impact |
|
|
|---|-------|--------|--------|
|
|
| M2 | stackProvider: use sync.Once | 15min | Fixes data race |
|
|
| M3 | DrainPendingEvents: drain after save | 10min | Prevents event loss |
|
|
| M4 | SubdomainInUse: I/O outside lock | 30min | Reduces lock contention |
|
|
| M5 | MigrateEncryption: encKey check inside lock | 5min | Fixes data race |
|
|
| M7 | executeAllRestores: use thread-safe accessor | 10min | Fixes race condition |
|
|
| M10 | Scheduler: reject/launch late registrations | 20min | Prevents silent job drops |
|
|
| M12 | Watchdog Set* methods: document or protect | 10min | Prevents data race |
|
|
| M13 | SQLite WAL verification | 10min | Confirms WAL engagement |
|
|
| M14 | sampleContainers: pass cancellable context | 15min | Clean shutdown |
|
|
| M15 | Telemetry scan error logging | 5min | Better diagnostics |
|
|
| M22 | Assets syncer: unlock during network I/O | 20min | Unblocks status API |
|
|
| M26 | crypto.DecryptMap: use injected logger | 10min | Logger consistency |
|
|
|
|
**Estimated Phase 3 total: ~3 hours**
|
|
|
|
### Phase 4: Medium Optimization
|
|
|
|
| # | Issue | Effort | Impact |
|
|
|---|-------|--------|--------|
|
|
| M16 | copyStackDBDumps: buffered I/O | 20min | Prevents OOM on large dumps |
|
|
| M17 | restic stats: compute once per drive | 30min | Halves subprocess calls |
|
|
| M18 | ListDumpFiles: use validation cache | 30min | Stops re-reading large dumps |
|
|
| M19 | deriveStackName: cross-reference with stacks | 30min | Correct DB-to-stack mapping |
|
|
| M20 | syncInfraConfig: atomic write | 10min | Consistent with local_infra.go |
|
|
|
|
**Estimated Phase 4 total: ~2 hours**
|
|
|
|
### Phase 5: Low Priority (opportunistic)
|
|
|
|
Address during related work or cleanup sprints. Key items:
|
|
|
|
- **L2**: Add `sync.WaitGroup` for background goroutines (shutdown cleanup)
|
|
- **L11/L12**: Atomic fstab writes in DR restore
|
|
- **L14**: Preserve file permissions in restore
|
|
- **L15**: Save merged snapshot history
|
|
- **L22**: Handle `StatePaused` in `aggregateState`
|
|
- **L25**: Replace naive YAML parsing with proper parser
|
|
- **L35**: Set `db.SetMaxOpenConns(2)` for SQLite
|
|
|
|
**Estimated Phase 5 total: ~4 hours (spread across multiple sessions)**
|
|
|
|
---
|
|
|
|
## Cross-Cutting Patterns to Standardize
|
|
|
|
### 1. Logger Inconsistency
|
|
Several functions use global `log.Printf` instead of injected loggers:
|
|
- `dbdump.go:ValidateDump` — global log
|
|
- `crypto.go:DecryptMap` — global log
|
|
- Fix: Add logger parameters for consistency
|
|
|
|
### 2. Atomic Write Inconsistency
|
|
Some file operations use atomic write (tmp + rename), others don't:
|
|
- `crossdrive.go:syncInfraConfig` — direct write
|
|
- `restore_drives_linux.go` — fallback non-atomic write
|
|
- Fix: Centralize atomic write helper, use everywhere
|
|
|
|
### 3. Context Propagation
|
|
Several operations use `context.Background()` instead of the parent context:
|
|
- `metrics/collector.go:sampleContainers`
|
|
- `main.go:http.Get`
|
|
- Fix: Thread context through for proper cancellation
|
|
|
|
### 4. Deep Copy Completeness
|
|
`deepCopyStack` misses several nested slices/pointers:
|
|
- `Meta.OptionalConfig`, `Meta.HealthCheck`, `DeployField.Options`
|
|
- Fix: Audit all nested types and ensure complete deep copy
|
|
|
|
### 5. Error Swallowing
|
|
Multiple locations silently discard errors:
|
|
- `PushOnce` non-2xx responses
|
|
- `PushEvent` delivery failures (not recorded)
|
|
- `Pinger.send` always returns nil
|
|
- `SaveAppConfig` encryption failures
|
|
- Fix: At minimum log errors; return where possible
|
|
|
|
---
|
|
|
|
## Live Testing Summary (192.168.0.162)
|
|
|
|
| Test | Result |
|
|
|------|--------|
|
|
| Container health | HEALTHY (v0.30.3, up, `(healthy)`) |
|
|
| Log errors | CLEAN (zero errors/warnings) |
|
|
| API health | OK (`{"ok":true}`) |
|
|
| Disk space | OK (21% root, 1% data drives) |
|
|
| Metrics DB | OK (7.2M DB, 1.4M WAL) |
|
|
| Restic locks | CLEAN (none) |
|
|
| Memory | 17.6 MiB / 15.4 GiB (0.11%) |
|
|
| CPU | 0.11% |
|
|
| Backup system | Working (4 DBs discovered, cache refresh OK) |
|
|
|
|
No runtime issues detected. All findings are from code review — latent bugs and race conditions that require specific conditions to trigger.
|