diff --git a/BUGHUNT.md b/BUGHUNT.md index 67f5dae..a07d30a 100644 --- a/BUGHUNT.md +++ b/BUGHUNT.md @@ -3,718 +3,684 @@ **Date:** 2026-02-25 **Scope:** Full controller codebase (`controller/` — all ~80 Go files, all packages) **Method:** Systematic code review of every file + live testing on demo node (192.168.0.162) -**Verified on:** v0.30.2, Go 1.24.0 +**Controller version:** v0.30.3 --- -## Priority Legend +## Executive Summary -| Priority | Meaning | -|----------|---------| -| P0 | **Critical** — Data race causing corruption, potential panic in production, security vulnerability | -| P1 | **High** — Logic error causing wrong behavior, resource leak, silent data loss | -| P2 | **Medium** — Suboptimal behavior, inconsistency, performance issue, minor bug | -| P3 | **Low** — Code smell, dead code, cosmetic issue, hardening | +Reviewed all controller Go source files across 6 parallel deep-dive sessions covering: main/settings/config, backup (12 files), stacks (5 files), web/API/router (14 files), scheduler/monitor/metrics (10 files), and cloudflare/assets/storage/sync/setup/notify/report/recovery/selftest/crypto/util (32 files). + +**Live testing result:** Demo node is healthy — zero errors, zero warnings, 18 MiB RAM, 0.11% CPU, all subsystems operational. + +**Code review found 78 issues total:** + +| Severity | Count | +|----------|-------| +| CRITICAL | 3 | +| HIGH | 12 | +| MEDIUM | 26 | +| LOW | 37 | --- ## Table of Contents -- [P0 Critical Issues](#p0-critical-issues) -- [P1 High Issues](#p1-high-issues) -- [P2 Medium Issues](#p2-medium-issues) -- [P3 Low Issues](#p3-low-issues) -- [Standardization / Optimization Opportunities](#standardization--optimization-opportunities) -- [Fix Plan — Ordered by Implementation Phase](#fix-plan--ordered-by-implementation-phase) +- [CRITICAL Issues](#critical-issues) +- [HIGH Issues](#high-issues) +- [MEDIUM Issues](#medium-issues) +- [LOW Issues](#low-issues) +- [Fix Plan](#fix-plan) --- -## P0 Critical Issues +## CRITICAL Issues -### C01 — `UpdateStackConfig` passes encrypted env values to docker compose +### C1: Watchdog mutex unlock/relock pattern — panic-unsafe -**Files:** `stacks/deploy.go:340-366` -**Category:** Bug — Silent data corruption +**File:** `internal/monitor/watchdog.go:234-238` +**Category:** concurrency -`UpdateStackConfig` loads `app.yaml` via `LoadAppConfig()` (line 340) which returns encrypted `ENC:...` values for sensitive fields. It then updates the modified fields and calls `composeExecWithEnv` (line 366) passing the mixed map — old encrypted values + new plaintext values. Docker compose receives the encrypted ciphertext as literal environment variable values. Containers get `ENC:H98iUGr...` instead of the actual DB password. +In `handleConnectedProbe()`, the mutex is unlocked before calling `handleDisconnect`, then re-locked to satisfy a deferred unlock: -**Verified on demo:** `immich/app.yaml` has `DB_PASSWORD: ENC:H98i...`. If a user changes optional config via the UI, the next `docker compose up -d` would inject the encrypted string as DB_PASSWORD. - -**Fix:** Use `LoadAppConfigDecrypted(stackDir, m.encKey)` for the compose execution path, or call `m.stackEnv(stackDir)` instead of `m.composeExecWithEnv`. - ---- - -### C02 — `DecryptMap` silently swallows decryption errors - -**File:** `crypto/crypto.go:107-113` -**Category:** Silent data corruption - -When decryption fails (key rotated, data corrupted), `DecryptMap` silently returns the raw `ENC:...` ciphertext as the value. Applications receive garbled base64 as passwords/secrets, causing silent failures. No error is logged or returned. - -**Fix:** Return an error from `DecryptMap`, or at minimum log a `[WARN]`. Update callers to handle the error. - ---- - -### C03 — `appCfg.Deployed = false` written outside lock in deploy goroutine - -**Files:** `stacks/deploy.go:311` -**Category:** Data race - -In `runComposeDeploy`, on failure, `appCfg.Deployed = false` is written at line 311 **outside** the `m.mu.Lock()` block (lines 302-309). The same `appCfg` pointer is shared with `s.AppConfig` in the stacks map. Any concurrent `GetStack`/`GetStacks` call reading `AppConfig.Deployed` is a data race. - -**Fix:** Move line 311 (`appCfg.Deployed = false`) inside the lock block before line 309. - ---- - -### C04 — `GetStack`/`GetStacks` returns shallow copies with shared pointer fields - -**Files:** `stacks/manager.go:449-470` -**Category:** Data race - -`GetStack` does `copy := *s` — a shallow copy. `Stack.AppConfig` is a pointer, so the "copy" shares the `AppConfig` with the original. Any goroutine reading `copy.AppConfig.Env` while another modifies the original's `AppConfig` (e.g., deploy goroutine) is a data race. Same issue with `Containers`, `HealthProbe`, and `Meta.DeployFields`. - -**Fix:** Implement deep copy for `Stack` struct. At minimum, deep-copy `AppConfig`, `Containers`, and `HealthProbe`. - ---- - -### C05 — Watchdog `pathProbeState` fields read/written without synchronization - -**File:** `monitor/watchdog.go:141-163 and throughout` -**Category:** Data race - -`Check()` reads and writes `pathProbeState` fields (`lastProbeTime`, `consecutiveFailures`, `probeCount`, `lastStatus`, etc.) without holding `w.mu`. Meanwhile, `GetDebugStatus()` reads these same fields while holding `w.mu`. `SafeDisconnect()` and `Reconnect()` also modify state concurrently. Multiple data races. - -**Fix:** Either hold `w.mu` during all state modifications in `Check()`, or give each `pathProbeState` its own mutex. - ---- - -### C06 — Metrics collector `cancel` field has data race between `Start` and `Stop` - -**File:** `metrics/collector.go:Start/Stop` -**Category:** Data race - -`Start()` writes `c.cancel` without any lock. `Stop()` reads it without any lock. If called from different goroutines (the expected pattern), this is a data race. - -**Fix:** Protect with `c.mu` or use `sync.Once` for start. - ---- - -### C07 — `activeRawMount` race between cleanup, mount, and set - -**File:** `web/storage_handlers.go:919-935` -**Category:** Race condition - -In `storageAttachMountRawHandler`, the old mount is cleaned up under `diskJobMu` (line 919-924), then the lock is released, then `storage.MountRaw` runs without lock (line 926), then the new mount path is set under lock (line 933-935). Two concurrent requests can both clean up and create mounts, orphaning the first. - -**Fix:** Hold `diskJobMu` across the entire operation (cleanup + mount + set). - ---- - -### C08 — `SetEncryptionKey` on stacks Manager has no locking - -**File:** `stacks/manager.go:114-116` -**Category:** Data race - -`SetEncryptionKey` writes `m.encKey` without holding any lock. If called concurrently with `MigrateEncryption`, `stackEnv`, or `SaveAppConfig` (which read `m.encKey`), this is a data race. - -**Fix:** Acquire `m.mu.Lock()` before setting `m.encKey`. - ---- - -## P1 High Issues - -### H01 — Restic lock detection never triggers (checks stdout, not stderr) - -**File:** `backup/restic.go:135-151` -**Category:** Bug — backup recovery broken - -`Snapshot()` uses `cmd.Output()` which returns stdout only. Restic writes lock errors to stderr. The lock-detection code at line 138 checks `string(out)` for "lock"/"locked" — but `out` is stdout, not stderr. The auto-unlock-and-retry logic likely never triggers. - -**Fix:** Use `cmd.CombinedOutput()`, or extract stderr from `*exec.ExitError`: ```go -if exitErr, ok := err.(*exec.ExitError); ok { - errStr = string(exitErr.Stderr) -} +state.mu.Unlock() // line 235 +w.handleDisconnect(sp, state, result) // line 236 +state.mu.Lock() // line 237 — re-acquire for deferred Unlock +``` + +If `handleDisconnect` panics, line 237 never executes, but the `defer state.mu.Unlock()` still fires — unlocking an already-unlocked mutex causes a **double-panic** (`sync: unlock of unlocked mutex`), crashing the watchdog goroutine entirely. + +**Fix:** Refactor to update state inside the lock, then call `handleDisconnect` outside with an early return instead of re-locking. Or wrap `handleDisconnect` in a recover block. + +--- + +### C2: SetGeoAppOverride nil pointer dereference + +**File:** `internal/settings/settings.go:885` +**Category:** bug + +`SetGeoAppOverride(stackName string, override *GeoAppOverride)` does not check for `override == nil` before accessing its fields. Callers who pass nil to "clear" an override would crash. + +**Fix:** Add nil guard: `if override == nil { delete map entry; return }`. + +--- + +### C3: SSD-only apps skip DB dump restoration during DR + +**File:** `internal/backup/restore_app_linux.go:156-161` +**Category:** logic/bug + +`restoreDBDumps` returns `nil` early if `app.HDDPath == ""`. SSD-only apps (no HDD) have DB dumps under `systemDataPath`, but this path is never considered. During disaster recovery, SSD-only apps with databases (e.g., Mealie, Vikunja) would **lose their DB dump backups**. + +**Fix:** Fall back to `systemDataPath` when `HDDPath` is empty, mirroring the logic in `GetAppDrivePath()`. + +--- + +## HIGH Issues + +### H1: Double deploy race condition (TOCTOU) + +**File:** `internal/stacks/deploy.go:110-285` +**Category:** concurrency + +`DeployStack` reads the stack via `GetStack` (RLock, releases), checks `existing.Deployed == false` on disk, then 169 lines later acquires the write lock to set `Deploying = true`. Two concurrent deploy calls for the same stack can both pass the check and launch parallel `docker compose up -d` goroutines. + +**Fix:** Atomically check-and-set `Deploying` under a single lock acquisition: +```go +m.mu.Lock() +if s.Deploying { m.mu.Unlock(); return error } +s.Deploying = true +m.mu.Unlock() ``` --- -### H02 — `activeDrives()` returns disconnected/decommissioned drives +### H2: Delete/Remove during active deploy -**File:** `backup/backup.go:206-222` -**Category:** Logic error +**File:** `internal/stacks/delete.go:73, 226` +**Category:** concurrency -`activeDrives()` logs which drives are disconnected but still adds ALL drives to the result. Functions like `RunIntegrityCheck`, `ListSnapshots`, `aggregateRepoStats` will then operate on disconnected drives, causing timeouts. +Neither `DeleteStack` nor `RemoveStack` checks `stack.Deploying`. A user could delete a stack while an async deploy goroutine is running, causing race between `docker compose down` and `docker compose up`, and the deploy goroutine would later modify a deleted stack. -**Fix:** Add `continue` after logging disconnected drives to skip them: -```go -if m.settings != nil && (m.settings.IsDisconnected(d) || m.settings.IsDecommissioned(d)) { - disconnected = append(disconnected, d) - continue // don't add to drives -} -``` +**Fix:** Add `if stack.Deploying { return error }` checks in both functions. --- -### H03 — `MigrateEncryption` holds RLock while writing files +### H3: ScanStacks can overwrite Deployed flag during async deploy -**File:** `stacks/manager.go:124-156` -**Category:** Concurrency — semantic error +**File:** `internal/stacks/manager.go:232-248` +**Category:** concurrency -`MigrateEncryption()` acquires `m.mu.RLock()` then calls `SaveAppConfig()` which writes to disk. A write operation should hold a write lock, not a read lock. +`ScanStacks` re-reads `app.yaml` from disk and overwrites the in-memory `Deployed` flag. If an async deploy fails and `runComposeDeploy` sets `Deployed=false` in memory, but disk still has the old state, `ScanStacks` could re-set `Deployed=true` creating a ghost deployment. -**Fix:** Use `m.mu.Lock()` instead of `m.mu.RLock()`. +**Fix:** Skip overwriting `Deployed` and `AppConfig` when `existing.Deploying == true`. --- -### H04 — `SaveAppConfig` is not atomic +### H4: Shared `appCfg` pointer between main thread and async deploy goroutine -**File:** `stacks/deploy.go:555` (uses `os.WriteFile`) -**Category:** Data integrity +**File:** `internal/stacks/deploy.go:283, 310` +**Category:** concurrency -`SaveAppConfig` writes directly to `app.yaml`. A crash mid-write corrupts the file. This violates the project's own pattern (settings.go uses write-to-tmp + rename). +The same `appCfg` pointer is stored in `s.AppConfig` (line 283) and passed to the async `runComposeDeploy` goroutine. The goroutine later mutates it (line 310: `appCfg.Deployed = false`). While the mutation happens under `m.mu.Lock`, readers who already got the pointer reference see mutation. -**Fix:** Write to `app.yaml.tmp`, then `os.Rename` to `app.yaml`. +**Fix:** Clone `appCfg` before passing to the goroutine. --- -### H05 — Deploy failure saves plaintext secrets to disk +### H5: `SyncFileBrowserMounts` has no concurrency guard -**File:** `stacks/deploy.go:311-312` -**Category:** Security +**File:** `internal/web/handlers.go:1543-1589` +**Category:** concurrency -On deploy failure, `runComposeDeploy` calls `SaveAppConfig(stackDir, appCfg, nil, nil)` — passing nil for encKey and sensitiveVars. The plaintext secrets from the deploy form are saved unencrypted to `app.yaml`. +Called from 7 locations (4 via `go` goroutines). Two concurrent storage operations could trigger parallel file writes to `docker-compose.yml` and `config.yaml`, and concurrent `docker compose` restarts causing file corruption or Docker conflicts. -**Fix:** Pass `m.encKey` and `SensitiveEnvVars(&meta)` to `SaveAppConfig` in the failure path. +**Fix:** Add a `sync.Mutex` for the entire SyncFileBrowserMounts operation, or use a debounced channel to coalesce concurrent calls. --- -### H06 — No rate limiting on login attempts +### H6: PushEvent never records to event history -**File:** `web/auth.go:91-128` -**Category:** Security +**File:** `internal/notify/notifier.go:181-218` +**Category:** bug -`handleLogin` has no rate limiting, lockout, or progressive delay. Bcrypt is slow but still allows ~5-10 attempts/sec. Note: demo node currently has no password set (`Auth: no password configured`), making this irrelevant there but critical for production deployments. +`PushEvent()` fires notifications in a goroutine but never calls `recordHistory()`. Only `PushTestEventSync()` records. The debug event history page is blind to all real production events. -**Fix:** Add per-IP rate limiter or exponential backoff on failed attempts. +**Fix:** Add `n.recordHistory(...)` calls in the PushEvent goroutine on both success and failure paths. --- -### H07 — Template rendering writes directly to ResponseWriter +### H7: PushOnce returns nil for non-2xx HTTP responses -**Files:** `web/server.go:393-398`, `setup/handlers.go:871-877` -**Category:** Bug — corrupt HTTP responses +**File:** `internal/report/pusher.go:194-224` +**Category:** bug -`render()` calls `tmpl.ExecuteTemplate(w, ...)` directly on the ResponseWriter. If the template partially writes then errors, the client receives corrupt HTML with a 200 status. The subsequent `http.Error(500)` is silently ignored because headers are already sent. +`PushOnce()` silently returns `nil` for 4xx/5xx responses. Callers believe the push succeeded. Compare with `Push()` and `PushInfraBackup()` which properly handle non-2xx. -**Fix:** Render to a `bytes.Buffer` first, only write to `w` on success: -```go -var buf bytes.Buffer -if err := s.tmpl.ExecuteTemplate(&buf, name, data); err != nil { - http.Error(w, "Internal error", 500) - return -} -w.Header().Set("Content-Type", "text/html; charset=utf-8") -buf.WriteTo(w) -``` +**Fix:** Return `fmt.Errorf("hub push-once: HTTP %d", resp.StatusCode)` for non-2xx. --- -### H08 — `Syncer.Stop()` panics if called twice +### H8: tmpFile not closed/synced before rename in DB dump -**File:** `sync/sync.go:114-116` -**Category:** Panic +**File:** `internal/backup/dbdump.go:231-272` +**Category:** data integrity -`Stop()` calls `close(s.stopCh)`. A double-call panics with "close of closed channel". +`DumpOne` calls `os.Rename(tmpPath, finalPath)` while the file handle is still open (defer closes it later). Data may not be flushed to disk before rename. A crash after rename could leave a truncated/empty dump file. -**Fix:** Use `sync.Once`: -```go -s.stopOnce.Do(func() { close(s.stopCh) }) -``` +**Fix:** Close `tmpFile` explicitly (checking the error) before `os.Rename`. --- -### H09 — Sync race: `TriggerSync` and ticker can run `doSync` concurrently +### H9: Restic retry uses possibly-expired context -**File:** `sync/sync.go:120-137, 164-167` -**Category:** Race condition +**File:** `internal/backup/restic.go:142-156` +**Category:** logic -`TriggerSync` checks `s.syncing` under lock, releases the lock, then calls `doSync()`. Between unlock and doSync acquiring the lock to set `syncing=true`, the ticker goroutine can also enter `doSync()`. Two concurrent syncs corrupt the git cache. +When a stale lock is detected, `Snapshot` calls `unlock` then retries with the same context (30-min timeout). If the first attempt consumed most of the timeout, the retry gets very little time and fails with a confusing timeout error. -**Fix:** Set `s.syncing = true` inside `TriggerSync()` before releasing the lock. +**Fix:** Create a fresh context with its own timeout for the retry. --- -### H10 — Cloudflare HTTP requests ignore context — cannot be cancelled +### H10: SaveAppConfig silently stores secrets in plaintext on encryption failure -**File:** `cloudflare/client.go:73` -**Category:** Resource leak +**File:** `internal/stacks/deploy.go:542-549` +**Category:** security -`do()` uses `http.NewRequest()` instead of `http.NewRequestWithContext()`. The `ctx` parameter passed to `Sync()` is never propagated to HTTP calls. During shutdown, in-flight Cloudflare API requests block until their 15s timeout. +When `crypto.Encrypt()` fails, the code silently falls through and saves the value unencrypted. No warning is logged. Passwords could be stored in plaintext on disk without any indication. -**Fix:** Pass context through to `do()` and use `http.NewRequestWithContext()`. +**Fix:** Log the encryption failure, or return an error to prevent saving unencrypted secrets. --- -### H11 — `GetFullStatus` returns shared pointers when no cache exists +### H11: `settingsCrossBackupHandler` missing destination path validation -**File:** `backup/backup.go:1213-1228` -**Category:** Data race +**File:** `internal/web/handlers.go:947-994` +**Category:** security -When `m.cachedStatus == nil`, the fallback returns a `FullBackupStatus` that directly references `m.lastDBDump` and `m.lastBackup` pointers. After `m.mu` is released, the caller reads these while backup operations may be mutating them. +The web form handler for cross-drive backup configuration does NOT validate the destination path against registered storage paths. The API handler (`router.go:823-837`) has explicit validation (C9 comment). A crafted form submission could set an arbitrary destination path. -**Fix:** Deep-copy `m.lastDBDump` and `m.lastBackup` in the no-cache path. +**Fix:** Add the same validation as the API handler: check `destPath` against `GetStoragePaths()`. --- -### H12 — No HTTP request body size limits +### H12: deepCopyStack incomplete — OptionalConfig and HealthCheck not deep-copied -**Files:** Multiple handlers in `web/handler_debug.go`, `web/storage_handlers.go`, `api/router.go` -**Category:** Security — DoS +**File:** `internal/stacks/manager.go:474-517` +**Category:** concurrency -Every `json.NewDecoder(r.Body).Decode(&req)` has no body size limit. A multi-GB request consumes all memory. +`deepCopyStack` copies `DeployFields` but not `Meta.OptionalConfig` (slice of slices) or `Meta.HealthCheck` (pointer). Mutations on the returned copy's OptionalConfig or HealthCheck would affect the original in the map. -**Fix:** Add middleware: `r.Body = http.MaxBytesReader(w, r.Body, 1<<20)` for all POST handlers. +**Fix:** Deep-copy `OptionalConfig` including nested `Fields` slices, and clone `HealthCheck` pointer. --- -### H13 — `runDriveScan` in setup never calls `CleanupTempMounts` +## MEDIUM Issues -**File:** `setup/scanner.go:266-279` -**Category:** Resource leak +### M1: Non-constant-time API key comparison -`ScanDrivesForInfraBackups` temporarily mounts partitions. `runDriveScan` stores results but never cleans up. Temp mounts remain forever. +**File:** `cmd/controller/main.go:820` +**Category:** security -**Fix:** Call `CleanupTempMounts(results, s.logger)` after storing scan results. +API key comparison uses `==` instead of `subtle.ConstantTimeCompare`, making it vulnerable to timing attacks. + +**Fix:** Use `crypto/subtle.ConstantTimeCompare`. --- -### H14 — `Snapshot` retry reuses same context with diminished timeout +### M2: `stackProvider` read without mutex protection -**File:** `backup/restic.go:139-147` -**Category:** Bug +**File:** `internal/backup/backup.go:172-176, 185-189, 481-482` +**Category:** concurrency -After lock detection + unlock, the retry reuses the same `ctx` with the original 30-minute timeout. Time spent on the first attempt is subtracted. Also, `unlockCmd.Run()` ignores its error. +`SetStackProvider` writes under `m.mu`, but all reads of `m.stackProvider` happen without any lock. This is a Go data race per the memory model. In practice it's set once at startup, but the race detector would flag it. -**Fix:** Check unlock error. Consider creating a fresh context for retry. +**Fix:** Use `sync.Once` or `atomic.Value` for the provider. --- -### H15 — `crossdrive` rsync leaf-name collision detection is incomplete +### M3: `DrainPendingEvents` silently loses events on save failure -**File:** `backup/crossdrive.go:375-389` -**Category:** Logic error +**File:** `internal/settings/settings.go:808-823` +**Category:** logic -The collision check uses `os.Stat(dstPath)` which checks pre-existing directories from previous runs, not just current-iteration collisions. Mount index 0 skips collision check entirely. Pre-existing leftovers could cause data to be rsynced into wrong directories. +If `s.save()` fails, events are drained from memory but never persisted. On restart, the events are lost. -**Fix:** Use a `seen` map to track leaf names within the current iteration: -```go -seen := make(map[string]bool) -for _, srcMount := range mounts { - leaf := filepath.Base(srcMount) - if seen[leaf] { leaf = fmt.Sprintf("%s_%d", leaf, i) } - seen[leaf] = true - // ... -} -``` +**Fix:** Only drain events from memory after successful save. --- -### H16 — CSRF Bearer bypass — token not validated +### M4: `SubdomainInUse` performs disk I/O under RLock -**File:** `web/csrf.go:36-41` -**Category:** Security +**File:** `internal/stacks/deploy.go:58-82` +**Category:** concurrency/performance -Any request with `Authorization: Bearer ` skips CSRF entirely. The Bearer token is not validated against the actual API key. +Reads `app.yaml` from disk via `LoadAppConfig` for every deployed stack while holding `m.mu.RLock()`. Slow I/O blocks all writers. -**Fix:** Validate the Bearer token before skipping CSRF (check against `s.settings.GetAPIKey()`). +**Fix:** Collect stack dirs under the lock, release it, then perform disk I/O. --- -### H17 — Potential nil dereference on `crossDriveRunner` +### M5: `MigrateEncryption` reads `encKey` without lock -**File:** `web/handlers.go:919` -**Category:** Potential panic +**File:** `internal/stacks/manager.go:123` +**Category:** concurrency -`s.crossDriveRunner.ValidateDestination(...)` is called in `buildAppBackupRows()` without nil check. The field is set post-construction via `SetCrossDriveRunner`. +Checks `m.encKey == nil` before acquiring `m.mu.Lock()`. `SetEncryptionKey` writes `encKey` under the lock, creating a data race. -**Fix:** Add nil guard before access. +**Fix:** Move the nil check inside the lock. --- -### H18 — `selftest.checkDockerSocket` panics on empty output +### M6: `MigrateEncryption` holds lock during all disk I/O -**File:** `selftest/selftest.go:99` -**Category:** Potential panic +**File:** `internal/stacks/manager.go:122-162` +**Category:** concurrency/performance -`string(out[:len(out)-1])` panics with index-out-of-range if `out` is empty. +Holds `m.mu.Lock()` while performing `LoadAppConfig`, `LoadMetadata`, and `SaveAppConfig` for every stack. Blocks all operations during migration. -**Fix:** Use `strings.TrimSpace(string(out))` instead. +**Fix:** Collect stacks needing migration under lock, then do I/O outside. (Acceptable if only runs at startup before scheduler.) --- -### H19 — `stderrBuf` data race in `MigrateDrive` +### M7: `executeAllRestores` accesses `plan.Apps` without lock -**File:** `storage/migrate_drive.go:314-331` -**Category:** Data race +**File:** `internal/web/handler_restore.go:115` +**Category:** concurrency -A background goroutine writes to `stderrBuf`. After `cmd.Wait()` returns, `stderrBuf.String()` is read, but `Wait()` with piped stderr does NOT guarantee the reader goroutine has finished. +Directly accesses `plan.Apps` field without holding `restoreMu`. Concurrent reads from `apiRestoreStatus` could race. -**Fix:** Use `sync.WaitGroup` to wait for the stderr goroutine before reading. +**Fix:** Use the plan's thread-safe accessor instead of direct field access. --- -### H20 — `MountRaw` UUID slice panics if UUID shorter than 8 chars +### M8: No login rate limiting -**File:** `storage/attach_linux.go:53-58` -**Category:** Potential panic +**File:** `internal/web/auth.go:91-128` +**Category:** security -`uuid[:8]` panics if UUID is shorter than 8 characters. +No rate limiting on login attempts. While bcrypt is expensive, a persistent attacker could still brute-force weak passwords. -**Fix:** `if len(uuid) >= 8 { dirName = uuid[:8] } else { dirName = uuid }` +**Fix:** Add per-IP rate limiting (e.g., 5 attempts per minute). --- -### H21 — `removeBindFstabEntry` uses loose string matching +### M9: Route handling — `hasSuffix` can match empty stack names -**File:** `storage/attach_linux.go:431-456` -**Category:** Logic error +**File:** `internal/api/router.go:128` +**Category:** bug -`strings.Contains(line, targetMountPath)` matches `/mnt/hdd_10` when target is `/mnt/hdd_1`. Can remove unrelated fstab entries. +Path `/stacks/deploy` matches `hasSuffix(path, "/deploy")`, yielding an empty stack name from `extractName`. The handler proceeds without validating the name is non-empty. -**Fix:** Parse fstab fields and compare second field exactly. +**Fix:** Add non-empty name validation in handlers that use `extractName`. --- -## P2 Medium Issues +### M10: Jobs registered after `Start()` never execute -### M01 — `http.Get` without timeout in Hub connectivity test -**File:** `main.go:~691` -Uses `http.DefaultClient` (no timeout) while Gitea test correctly uses 5s timeout. -**Fix:** `client := &http.Client{Timeout: 5 * time.Second}` +**File:** `internal/scheduler/scheduler.go:60-96` +**Category:** logic -### M02 — `alertMgr.Refresh` missing `storagePaths` arg at initial call -**File:** `main.go:591` -Initial refresh omits `sett.GetStoragePaths()` while all 3 other call sites pass it. -**Fix:** Add `sett.GetStoragePaths()` to the variadic args. +`Start()` launches goroutines for existing jobs only. Jobs added via `Every()`/`Daily()` after `Start()` are appended but never get a goroutine. -### M03 — `DrainPendingEvents` risks duplicate delivery on save failure -**File:** `settings/settings.go:801-814` -Events cleared from memory before save. If save fails, events returned but still on disk — duplicated on restart. -**Fix:** Restore events if save fails, or save before clearing. - -### M04 — `AddStoragePath` allows duplicate paths -**File:** `settings/settings.go:422-432` -No duplicate check. Calling twice with same path creates duplicate entries. -**Fix:** Check existing paths before append. - -### M05 — `LoadFromBytes` skips env overrides -**File:** `config/config.go:200-211` -Unlike `Load`, `LoadFromBytes` doesn't call `applyEnvOverrides`. -**Fix:** Add `applyEnvOverrides(cfg)` call. - -### M06 — `ListDumpFiles` re-validates every SQL file on every cache refresh (5 min) -**File:** `backup/dbdump.go:449-450` -Reads and parses entire dump files every refresh. For large dumps, significant I/O. -**Fix:** Use `DBValidationCache` to skip re-validation when file hasn't changed. - -### M07 — `copyStackDBDumps` reads entire files into memory -**File:** `backup/crossdrive.go:466-484` -Uses `os.ReadFile` for DB dumps that can be hundreds of MB. -**Fix:** Use `io.Copy` with `os.Open`/`os.Create` for streaming. - -### M08 — `backupDrive` only records last drive's snapshot in history -**File:** `backup/backup.go:369-437` -Multi-drive setups only report the last drive's snapshot. -**Fix:** Record all successful snapshots. - -### M09 — `SubdomainInUse` does disk I/O under RLock -**File:** `stacks/deploy.go:58-82` -Calls `LoadAppConfig()` (disk read) for every stack while holding RLock. -**Fix:** Cache subdomain in memory `Stack` struct. - -### M10 — `InjectMissingFields` auto-fills subdomain without uniqueness check -**File:** `stacks/deploy.go:676-690` -Uses `field.Default` or `meta.Subdomain` without checking `SubdomainInUse` or reserved names. -**Fix:** Add validation. - -### M11 — `ParseComposeHDDMounts` uses fragile line-by-line YAML parsing -**File:** `stacks/delete.go:392-454` -Cannot distinguish top-level vs service-level `volumes:` key. -**Fix:** Use `yaml.v3` to properly parse. - -### M12 — Backup path removal in `RemoveStack` has no path validation -**File:** `stacks/delete.go:304-316` -Accepts arbitrary `backupPathsToRemove` from caller, calls `os.RemoveAll`. No validation paths are actually backup paths. -**Fix:** Validate paths start with expected backup directory prefix. - -### M13 — `getDirSizeHuman` has no timeout -**File:** `stacks/delete.go:457-468` -`exec.Command("du", "-sh", path)` with no timeout. Can hang on NFS. -**Fix:** Add `context.WithTimeout(30s)` like `getDirSizeBytes` does. - -### M14 — Scheduler doesn't guard against double-start -**File:** `scheduler/scheduler.go:99-116` -Calling `Start()` twice overwrites `ctx`/`cancel`, leaking the first set of goroutines. -**Fix:** Guard with `if s.ctx != nil { return }`. - -### M15 — Jobs registered after `Start()` never execute -**File:** `scheduler/scheduler.go:60-96 vs 99-116` -`Start()` only launches goroutines for currently-registered jobs. -**Fix:** Document requirement, or auto-launch late-registered jobs. - -### M16 — `selfupdate.performUpdate` leaves "pending" state on compose-up failure -**File:** `selfupdate/updater.go:396-410` -State file stays "pending" when compose-up fails, creating contradictory API state. -**Fix:** Set `state.Status = "failed"` on compose-up failure. - -### M17 — `selfupdate` docker commands have no timeout -**File:** `selfupdate/updater.go:495-502` -`docker pull` and `docker compose up -d` can hang indefinitely. -**Fix:** Use `exec.CommandContext` with timeout. - -### M18 — `SetBackupRunningCheck` has no mutex protection -**File:** `selfupdate/updater.go:66-68` -Writes `u.backupRunning` without lock. `DryRun()` also reads it without lock. -**Fix:** Use `u.mu.Lock()`. - -### M19 — `goroutine leak` in `ProbeStoragePath` on hung mounts -**File:** `system/mounts_linux.go:258-289` -`syscall.Statfs` on a dead mount blocks forever. Each probe leaks a goroutine. -**Fix:** Track outstanding probes per path, prevent duplicate probes. - -### M20 — `rsync --info=progress2` carriage returns not handled -**File:** `storage/format.go:31-53` -Progress scanner splits on `\n` only; `\r`-delimited updates get concatenated. -**Fix:** Custom scanner splitting on both `\r` and `\n`. - -### M21 — `mmcblk` device partition path construction is wrong -**File:** `storage/format_linux.go:119-122` -Only NVMe gets `p` separator. `mmcblk0` → `mmcblk01` instead of `mmcblk0p1`. -**Fix:** Add `|| strings.Contains(req.DevicePath, "mmcblk")`. - -### M22 — Integer underflow in memory availability message -**File:** `api/router.go:381-386` -If `reservedMB > totalMB`, `usableMB` goes negative, blocking all app starts. -**Fix:** Clamp `usableMB` to minimum 0. - -### M23 — `triggerAllCrossBackups` runs "manual" schedule -**File:** `api/router.go:907-929` -"Manual" schedule meant for per-app on-demand, not bulk triggers. -**Fix:** Remove `RunAllScheduled(ctx, "manual")`. - -### M24 — `writeFreshConfig` creates bare `&settings.Settings{}` without path -**File:** `setup/handlers.go:668-693` -On settings load failure, creates Settings with no file path. `SetPasswordHash` (which calls save) writes to empty path. -**Fix:** Create Settings with proper path and logger. - -### M25 — `SetStep` race condition between unlock and Save -**File:** `setup/setup.go:99-106` -State modified under lock, then saved after unlock. Another goroutine can modify between. -**Fix:** Use `saveLocked()` variant called while holding lock. - -### M26 — Setup CSRF relies on client-controlled cookie + form -**File:** `setup/csrf.go:34-43` -Both sides of the comparison are client-controlled. No server-side token storage. -**Fix:** Store CSRF token server-side in SetupState. - -### M27 — `PushEvent` in notifier spawns goroutine without tracking -**File:** `notify/notifier.go` -Goroutine leak potential — no WaitGroup or context for clean shutdown. -**Fix:** Track goroutines with sync.WaitGroup. - -### M28 — `SafeDisconnect` and `Check` can race on same storage path -**File:** `monitor/watchdog.go:498-644` -Both modify `pathProbeState` concurrently without coordination. Can double-stop stacks. -**Fix:** Use broader mutex or state-transition lock. - -### M29 — `stackProvider` reads in backup package lack mutex protection -**File:** `backup/backup.go:171-181,184-203,478-485` -`SetStackProvider` writes under mutex, but all reads are unprotected. -**Fix:** Ensure single-call-at-startup invariant is documented, or add read locks. - -### M30 — Multiple `GetStacks()` calls in same handler -**File:** `web/handlers.go:214,233` -`stacksHandler` calls `GetStacks()` 3 times — each may return different snapshots. -**Fix:** Call once, reuse result. +**Fix:** Either reject late registrations or launch the goroutine immediately if already started. --- -## P3 Low Issues +### M11: `findStoragePath` returns pointer to copy, not to settings data -### L01 — Dead code: `fileExists` function in main.go (line 1071) -### L02 — Dead code: `imageName` function in selfupdate/updater.go (line 231) -### L03 — Dead code: `detectHostIPViaRoute` in setup/network.go (always returns "") -### L04 — Dead code: `handleRoot` in setup/handlers.go (both if branches do same redirect) -### L05 — `TruncateStr` operates on bytes, not runes — breaks multi-byte UTF-8 -- File: `util/strings.go:6-12` and `stacks/manager.go:718-724` -### L06 — `TruncateStr` no guard for negative maxLen -### L07 — Variable `copy` shadows builtin in settings.go:294 -### L08 — `save()` in settings.go does not guard nil logger (line 218) -### L09 — `GetNotificationPrefs` early return shares global `DefaultEnabledEvents` slice -### L10 — `os.ExpandEnv` in config.go may corrupt `$` in YAML values -### L11 — Key file written non-atomically (crypto.go:37) -### L12 — `ValidateDump` uses global `log.Printf` instead of injected logger -### L13 — `hasPrefix` reimplements `strings.HasPrefix` in restore_scan.go -### L14 — `AllDone` returns true for empty Apps list (restore_scan.go:134) -### L15 — `RunIntegrityCheck` only reports last error across drives -### L16 — `restic.Stats()` casts uint64 to int64 unsafely -### L17 — `logPostStartStatus` goroutine leak potential (no context/WaitGroup) -### L18 — `Pinger.send` response body not drained before close -### L19 — `sampleContainers` creates `context.Background()` instead of using parent ctx -### L20 — Docker commands without timeout in healthcheck.go -### L21 — Non-deterministic Cloudflare rule order due to map iteration -### L22 — `parseDailyTime` accepts trailing garbage ("12:30:45") -### L23 — `Syncer.Start()` can be called multiple times (goroutine leak) -### L24 — `Stop()` on scheduler reads `s.cancel` without mutex -### L25 — `stripPartition` doesn't handle `mmcblk` devices -### L26 — `sizeBytes()` silently returns 0 on ParseUint error -### L27 — Duplicate `dirSizeHuman`/`dirSizeBytesHuman`/`fmtBytes` functions -### L28 — `w.Write` errors ignored in multiple HTTP handlers -### L29 — `json.Encode` errors ignored in multiple handlers -### L30 — `r.ParseForm()` errors ignored (multiple handlers) -### L31 — No upper bound on log `lines` query parameter (DoS) -### L32 — Logout accepts GET requests (CSRF) -### L33 — Content-Type check uses exact match, fails for `application/json; charset=utf-8` -### L34 — `SyncFileBrowserMounts` writes config files non-atomically -### L35 — `DriveMigrator` adapter types have triplicated methods (main.go:840-1038) -### L36 — Goroutines in main.go use `time.Sleep` without context cancellation check -### L37 — `hubPusher.Push()` error silently ignored in multiple callbacks -### L38 — Setup shutdown errors silently discarded (main.go:1136-1137) -### L39 — `PushOnce` in report/pusher.go always returns nil (misleading return type) -### L40 — Static fallback CSRF token ("fallback-csrf-token") in setup/csrf.go -### L41 — `extractName` returns "" but most API callers don't check for empty -### L42 — Hardcoded config path `/opt/docker/felhom-controller/controller.yaml` in 3 setup places -### L43 — `GetStatus()` in backup returns raw shared pointers (not deep-copied like `GetFullStatus`) -### L44 — Fstab entries accumulate extra blank lines from double-newline appends -### L45 — `logFileHashes` called after write — shows identical src/dst hashes -### L46 — `LoadMetadata` prints to stderr instead of using logger -### L47 — `LoadMetadata` silently ignores YAML parse errors, returns defaults -### L48 — `simpleHash` in alerts.go has 32-bit collision risk -### L49 — `Close()` on auth SessionStore can panic if called twice -### L50 — CSS path in setup/handlers.go always fails (dead code) -### L51 — `isError` in setup uses string comparison instead of `errors.Is` -### L52 — Password stored in plaintext in setup-state.json during setup +**File:** `internal/monitor/watchdog.go:890-897` +**Category:** logic + +Returns `&sp` where `sp` is a range-loop copy. Mutations won't reflect in settings. All current callers are read-only so it's safe, but misleading. + +**Fix:** Return value type, or document it's a snapshot. --- -## Standardization / Optimization Opportunities +### M12: Set* methods on watchdog lack synchronization -### S01 — Unify env var construction for compose commands -Three different code paths build env vars: `stackEnv`, `composeExecWithEnv`, `composeExecCustomEnv(env=nil)`. `DOMAIN` can be appended twice. Should consolidate into single method. +**File:** `internal/monitor/watchdog.go:113-125` +**Category:** concurrency -### S02 — Unify `humanizeBytes` / `fmtBytes` / `dirSizeBytesHuman` / `dirSizeHuman` -At least 4 separate implementations of byte-to-human-readable conversion across packages. Should be one function in `util/`. +`SetAlertRefresh`, `SetHubReportPusher`, `SetRepoUnlocker` write fields without any mutex. If called after the scheduler starts, it's a data race. -### S03 — Standardize atomic file writes -`settings.go` uses write-to-tmp + rename, but `SaveAppConfig`, `SyncFileBrowserMounts`, crypto key file, and setup state do not. All persistent file writes should use the same pattern. - -### S04 — Standardize HTTP client creation -Many places create `&http.Client{Timeout: X}` per call. Should have shared clients with sensible defaults. - -### S05 — Standardize `exec.Command` with context/timeout -Most `exec.Command` calls have no timeout. Should use `exec.CommandContext` with appropriate timeouts consistently. - -### S06 — Standardize error handling in HTTP handlers -Some handlers return JSON errors, some use `http.Error`, some ignore write errors. Should be consistent. - -### S07 — Extract common adapter pattern -`stackAdapter`, `watchdogStackAdapter`, `driveMigrateStackAdapter` in main.go share identical methods. Should use embedded base adapter. - -### S08 — Consolidate `getDirSizeHuman` + `getDirSizeBytes` into single call -Two `du` invocations for the same directory. Call `getDirSizeBytes` once and format. - -### S09 — Template rendering: always buffer before writing -Multiple `render()` functions write directly to ResponseWriter. All should use bytes.Buffer. - -### S10 — Add `MaxBytesReader` middleware -Rather than per-handler body limiting, add a global middleware with sensible default (1MB for JSON, larger for file uploads). +**Fix:** Document they must be called before Start(), or protect with mutex. --- -## Fix Plan — Ordered by Implementation Phase +### M13: SQLite WAL mode pragma not verified -### Phase 1 — Critical Data Races & Security (P0) -**Estimate: ~2 hours** +**File:** `internal/metrics/store.go:26-36` +**Category:** data integrity -1. **C01** — Fix `UpdateStackConfig` to use decrypted env for compose -2. **C02** — Add error return to `DecryptMap` (or at least log) -3. **C03** — Move `appCfg.Deployed = false` inside lock block -4. **C04** — Implement `Stack.DeepCopy()`, use in `GetStack`/`GetStacks` -5. **C05** — Add per-state mutex to `pathProbeState` in watchdog -6. **C06** — Protect metrics collector `cancel` with mutex -7. **C07** — Hold `diskJobMu` across entire raw mount operation -8. **C08** — Add lock to `SetEncryptionKey` +`PRAGMA journal_mode=WAL` is executed via `db.Exec()` but the returned mode is not verified. WAL could silently fail to engage on certain filesystems. -### Phase 2 — High-Priority Bugs (P1) -**Estimate: ~3 hours** - -1. **H01** — Fix restic lock detection to use stderr -2. **H02** — Filter disconnected drives in `activeDrives()` -3. **H03** — Change to write lock in `MigrateEncryption` -4. **H04** — Make `SaveAppConfig` atomic (write-to-tmp + rename) -5. **H05** — Pass encKey to SaveAppConfig on deploy failure -6. **H07** — Buffer template rendering (web + setup) -7. **H08** — Use sync.Once in Syncer.Stop() -8. **H09** — Fix sync race (set syncing=true before unlock) -9. **H10** — Thread context through Cloudflare HTTP calls -10. **H11** — Deep-copy in GetFullStatus no-cache path -11. **H14** — Fix restic retry context + check unlock error -12. **H15** — Fix crossdrive leaf collision with `seen` map -13. **H17** — Add nil check for crossDriveRunner -14. **H18** — Use TrimSpace for selftest output -15. **H19** — WaitGroup for stderr goroutine in MigrateDrive -16. **H20** — Guard UUID slice length - -### Phase 3 — Security Hardening (P1-P2) -**Estimate: ~1.5 hours** - -1. **H06** — Add login rate limiting -2. **H12** — Add MaxBytesReader middleware -3. **H16** — Validate Bearer token in CSRF middleware -4. **H21** — Fix fstab entry matching to use field parsing -5. **M12** — Validate backup paths in RemoveStack -6. **M22** — Clamp memory availability to min 0 -7. **M26** — Server-side CSRF for setup wizard - -### Phase 4 — Medium Priority Fixes (P2) -**Estimate: ~3 hours** - -1. **M01** — Add timeout to Hub connectivity test -2. **M02** — Add storagePaths to initial alert refresh -3. **M03** — Fix DrainPendingEvents to restore on save failure -4. **M04** — Add duplicate check in AddStoragePath -5. **M06** — Use validation cache to skip re-validation -6. **M07** — Stream file copies instead of reading into memory -7. **M08** — Record all drive snapshots in history -8. **M09** — Cache subdomain in Stack struct -9. **M14** — Guard scheduler double-start -10. **M16** — Set "failed" state on compose-up failure -11. **M17** — Add timeout to selfupdate docker commands -12. **M21** — Fix mmcblk partition path -13. **M23** — Remove "manual" from triggerAllCrossBackups -14. **M24** — Fix writeFreshConfig settings path -15. **M25** — Fix SetStep race (saveLocked) -16. **M30** — Single GetStacks() call per handler - -### Phase 5 — Standardization (S-items) -**Estimate: ~2 hours** - -1. **S01** — Unify compose env construction -2. **S02** — Single `humanizeBytes` in util/ -3. **S03** — Atomic writes everywhere -4. **S04** — Shared HTTP clients -5. **S05** — exec.CommandContext everywhere -6. **S07** — Base adapter embedding -7. **S09** — Template buffer pattern - -### Phase 6 — Low Priority Cleanup (P3) -**Estimate: ~1.5 hours** - -Remove dead code (L01-L04), fix UTF-8 truncation (L05-L06), address remaining code smells and minor issues. +**Fix:** Use `db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode)` and verify `mode == "wal"`. --- -**Total estimated effort: ~13 hours across 6 phases** +### M14: `sampleContainers` uses `context.Background()` instead of cancellable context -Phases 1-3 are the most important and should be done first. Phase 4-5 can be done incrementally. Phase 6 is best done opportunistically when touching related files. +**File:** `internal/metrics/collector.go:100-101` +**Category:** bug + +`docker stats` command ignores shutdown cancellation, potentially blocking for 30s after `Stop()` is called. + +**Fix:** Pass the collector's context through to `sampleContainers`. + +--- + +### M15: Telemetry scan error silently swallowed + +**File:** `internal/metrics/telemetry.go:39-42` +**Category:** bug + +Row scan failures are silently discarded with `continue`, masking data corruption or schema issues. + +**Fix:** Log the scan error before continuing. + +--- + +### M16: `copyStackDBDumps` reads entire files into memory + +**File:** `internal/backup/crossdrive.go:478-484` +**Category:** optimization + +Uses `os.ReadFile`/`os.WriteFile` for dump file copies. Large DB dumps (500MB+) cause full in-memory allocation. + +**Fix:** Use `io.Copy` with buffered I/O, or shell out to `cp`/`rsync`. + +--- + +### M17: `restic stats` called twice per drive per cache refresh + +**File:** `internal/backup/backup.go:1070-1071` +**Category:** optimization + +`aggregateRepoStats()` and `perDriveRepoStats()` both iterate drives and call `restic stats`. Each drives gets two subprocess invocations per refresh. + +**Fix:** Compute per-drive stats once, then aggregate from the results. + +--- + +### M18: `ListDumpFiles` re-validates every dump file every 5 minutes + +**File:** `internal/backup/dbdump.go:448-451` +**Category:** optimization + +`ValidateDump` opens and reads each `.sql` file line-by-line on every cache refresh, even for multi-hundred-MB dumps. Ignores the validation cache in `settings.json`. + +**Fix:** Pass the validation cache to `ListDumpFiles`, skip re-validation if file size+modtime match. + +--- + +### M19: Naive stack name derivation from container names + +**File:** `internal/backup/dbdump.go:517-532` +**Category:** logic + +`deriveStackName("my-cache")` strips "cache" suffix and returns "my" — potentially attributing DB dumps to wrong stacks. + +**Fix:** Cross-reference with actual deployed stack names instead of pure suffix stripping. + +--- + +### M20: `syncInfraConfig` controller.yaml copy is not atomic + +**File:** `internal/backup/crossdrive.go:526-534` +**Category:** data integrity + +Uses `os.ReadFile` + `os.WriteFile` directly. A crash mid-write corrupts the file. Compare with `local_infra.go` which uses `atomicWrite`. + +**Fix:** Use `atomicWrite` for consistency. + +--- + +### M21: `http.Get` without timeout in main.go + +**File:** `cmd/controller/main.go:691` +**Category:** bug + +Uses `http.Get` (no timeout). If the endpoint hangs, the goroutine leaks permanently. + +**Fix:** Use `&http.Client{Timeout: 10 * time.Second}`. + +--- + +### M22: Assets syncer holds mutex during entire network operation + +**File:** `internal/assets/syncer.go:72-73` +**Category:** concurrency + +`s.mu.Lock()` held for the entire `Sync()` call including HTTP downloads. `Status()` calls block for the entire sync duration (60+ seconds). + +**Fix:** Lock only to set a `running` flag, then unlock for actual work. + +--- + +### M23: Git sync logs credentials in plaintext + +**File:** `internal/sync/sync.go:427` +**Category:** security + +`runGitInDir` logs `strings.Join(args, " ")` which includes the repo URL with embedded credentials during `git clone`. The `maskRepoURL()` helper exists but is only used elsewhere. + +**Fix:** Apply `maskRepoURL()` in `runGitInDir()` before logging. + +--- + +### M24: Path prefix check in storageAttachBrowseHandler may match sibling directories + +**File:** `internal/web/storage_handlers.go:953-958` +**Category:** security + +`strings.HasPrefix(cleanPath, storage.RawMountBase)` — if `RawMountBase` doesn't end with `/`, a path like `/mnt/felhom-stage-evil/sensitive` would pass the check. + +**Fix:** Ensure prefix check includes trailing `/`: `strings.HasPrefix(cleanPath, storage.RawMountBase + "/")`. + +--- + +### M25: Set* methods on Server lack synchronization + +**File:** `internal/web/server.go:114-172` +**Category:** concurrency + +Multiple `Set*` methods write to Server fields without synchronization. The `SyncFileBrowserMounts` goroutine launched at startup could race with these calls. + +**Fix:** Document init order or add a barrier, or pass all deps via `NewServer()`. + +--- + +### M26: `crypto.DecryptMap` uses global `log.Printf` + +**File:** `internal/crypto/crypto.go:112` +**Category:** consistency + +Uses global `log` package instead of injected logger — inconsistent with rest of codebase. + +**Fix:** Add a `logger` parameter. + +--- + +## LOW Issues + +### L1: Dead code: unused `fileExists` in main.go (line 1071) +### L2: Multiple goroutines launched without `sync.WaitGroup` for shutdown (main.go: multiple locations) +### L3: `0` treated as "unset" for all int config fields (config.go:230-234) — prevents intentional zero values +### L4: Hardcoded metrics DB path ignores `cfg.Paths.DataDir` (main.go:138) +### L5: Hardcoded config path in infra backup functions (main.go:1047, 1148) +### L6: String concat instead of `filepath.Join` for path (main.go:83) +### L7: Double-clear of `running` map in CrossDriveRunner (crossdrive.go:86-110) — defer + manual clear +### L8: uint64 to int64 cast in restic stats — overflow for repos >8 EB (restic.go:293) +### L9: `deriveStackName` error parsing uses `string(out)` which may be empty (restic.go:136-142) +### L10: infraPaths duplicated in every drive's backup (backup.go:366-369) +### L11: TOCTOU race on fstab in `addDRFstabEntries` (restore_drives_linux.go:213-262) +### L12: Non-atomic fallback fstab write (restore_drives_linux.go:256-258) +### L13: 10-second timeout may truncate DB discovery results (backup.go:1077-1081) +### L14: `restoreUserData` copies files with hardcoded 0644 permissions (restore_app_linux.go:139-148) +### L15: Merged snapshot history not saved to disk after merge (backup.go:984-1053) +### L16: `RunIntegrityCheck` doesn't acquire the `running` flag (backup.go:571-622) +### L17: Empty drive path propagates to path builders (backup.go:177-180) +### L18: `ScanDrivesForBackups` picks first found backup, not freshest (restore_scan.go:248) +### L19: Throwaway `exec.Cmd` for logging in MariaDB dump (dbdump.go:213-223) +### L20: `ValidateDump` uses global `log` instead of passed logger (dbdump.go:300+) +### L21: `methodOrEmpty` never returns empty despite its name (healthprobe.go:315-323) +### L22: `aggregateState` doesn't account for `StatePaused` containers (manager.go:387-441) +### L23: `composeExecCustomEnv` double-loads env when env is nil (manager.go:688-694) +### L24: `logPostStartStatus` goroutine has no cancellation (manager.go:775-799) +### L25: `ParseComposeHDDMounts` naive YAML parsing can be fooled (delete.go:398-460) +### L26: `getDirSizeHuman` has no timeout (delete.go:463-474) +### L27: Delete/Remove calls both manual state update AND ScanStacks (delete.go:162-168) +### L28: `DeployField.Options` not deep-copied in `deepCopyStack` (manager.go:510-514) +### L29: Double WriteHeader in `serveCatchAll` (server.go:355-360) +### L30: CSRF token never rotated during session lifetime (auth.go:144-161) +### L31: Content-Type exact match is fragile (storage_handlers.go:1490-1497) +### L32: `AlertManager.Refresh` confusing variadic API (alerts.go:54) +### L33: `json` template function suppresses marshaling errors (funcmap.go:329-332) +### L34: Pinger `send()` always returns nil (pinger.go:91) +### L35: No SQLite connection pool limits (metrics/store.go:19-23) +### L36: No WAL checkpoint after Prune (metrics/store.go:303-322) +### L37: `AllCountries()` rebuilds sorted list on every call (cloudflare/countries.go:253-261) + +--- + +## Fix Plan + +### Phase 1: Critical & High Bugs (do first) + +These are bugs that could cause crashes, data loss, or security issues. + +| # | Issue | Effort | Impact | +|---|-------|--------|--------| +| C1 | Watchdog mutex unlock/relock panic safety | 30min | Prevents watchdog crash | +| C2 | SetGeoAppOverride nil guard | 5min | Prevents nil panic | +| C3 | Restore: SSD-only apps DB dump fallback | 30min | Fixes DR for SSD-only apps | +| H1 | Double deploy race: atomic check-and-set | 30min | Prevents parallel deploys | +| H2 | Delete/Remove: check Deploying flag | 10min | Prevents delete-during-deploy | +| H3 | ScanStacks: skip Deployed overwrite during deploy | 15min | Prevents ghost deployments | +| H4 | Clone appCfg before async deploy goroutine | 15min | Prevents shared pointer mutation | +| H5 | SyncFileBrowserMounts: add mutex | 20min | Prevents concurrent file corruption | +| H6 | PushEvent: add recordHistory calls | 20min | Fixes event history debug page | +| H7 | PushOnce: return error for non-2xx | 5min | Fixes silent push failures | +| H8 | DumpOne: close tmpFile before rename | 10min | Prevents truncated dump files | +| H9 | Restic retry: fresh context | 10min | Prevents confusing timeout errors | +| H10 | SaveAppConfig: log/error on encrypt failure | 10min | Prevents silent plaintext secrets | +| H11 | settingsCrossBackupHandler: add path validation | 15min | Closes path traversal gap | +| H12 | deepCopyStack: complete deep copy | 20min | Prevents shared slice mutation | + +**Estimated Phase 1 total: ~4 hours** + +### Phase 2: Medium Security & Consistency + +| # | Issue | Effort | Impact | +|---|-------|--------|--------| +| M1 | Constant-time API key comparison | 5min | Timing attack prevention | +| M8 | Login rate limiting | 1hr | Brute-force prevention | +| M9 | Empty stack name validation | 10min | Prevents invalid API calls | +| M23 | Mask git credentials in logs | 10min | Prevents credential leakage | +| M24 | Path prefix check trailing slash | 5min | Closes path matching gap | +| M21 | http.Get with timeout | 5min | Prevents goroutine leak | +| M25 | Server Set* methods: document init order | 10min | Prevents future races | + +**Estimated Phase 2 total: ~2 hours** + +### Phase 3: Medium Concurrency & Logic + +| # | Issue | Effort | Impact | +|---|-------|--------|--------| +| M2 | stackProvider: use sync.Once | 15min | Fixes data race | +| M3 | DrainPendingEvents: drain after save | 10min | Prevents event loss | +| M4 | SubdomainInUse: I/O outside lock | 30min | Reduces lock contention | +| M5 | MigrateEncryption: encKey check inside lock | 5min | Fixes data race | +| M7 | executeAllRestores: use thread-safe accessor | 10min | Fixes race condition | +| M10 | Scheduler: reject/launch late registrations | 20min | Prevents silent job drops | +| M12 | Watchdog Set* methods: document or protect | 10min | Prevents data race | +| M13 | SQLite WAL verification | 10min | Confirms WAL engagement | +| M14 | sampleContainers: pass cancellable context | 15min | Clean shutdown | +| M15 | Telemetry scan error logging | 5min | Better diagnostics | +| M22 | Assets syncer: unlock during network I/O | 20min | Unblocks status API | +| M26 | crypto.DecryptMap: use injected logger | 10min | Logger consistency | + +**Estimated Phase 3 total: ~3 hours** + +### Phase 4: Medium Optimization + +| # | Issue | Effort | Impact | +|---|-------|--------|--------| +| M16 | copyStackDBDumps: buffered I/O | 20min | Prevents OOM on large dumps | +| M17 | restic stats: compute once per drive | 30min | Halves subprocess calls | +| M18 | ListDumpFiles: use validation cache | 30min | Stops re-reading large dumps | +| M19 | deriveStackName: cross-reference with stacks | 30min | Correct DB-to-stack mapping | +| M20 | syncInfraConfig: atomic write | 10min | Consistent with local_infra.go | + +**Estimated Phase 4 total: ~2 hours** + +### Phase 5: Low Priority (opportunistic) + +Address during related work or cleanup sprints. Key items: + +- **L2**: Add `sync.WaitGroup` for background goroutines (shutdown cleanup) +- **L11/L12**: Atomic fstab writes in DR restore +- **L14**: Preserve file permissions in restore +- **L15**: Save merged snapshot history +- **L22**: Handle `StatePaused` in `aggregateState` +- **L25**: Replace naive YAML parsing with proper parser +- **L35**: Set `db.SetMaxOpenConns(2)` for SQLite + +**Estimated Phase 5 total: ~4 hours (spread across multiple sessions)** + +--- + +## Cross-Cutting Patterns to Standardize + +### 1. Logger Inconsistency +Several functions use global `log.Printf` instead of injected loggers: +- `dbdump.go:ValidateDump` — global log +- `crypto.go:DecryptMap` — global log +- Fix: Add logger parameters for consistency + +### 2. Atomic Write Inconsistency +Some file operations use atomic write (tmp + rename), others don't: +- `crossdrive.go:syncInfraConfig` — direct write +- `restore_drives_linux.go` — fallback non-atomic write +- Fix: Centralize atomic write helper, use everywhere + +### 3. Context Propagation +Several operations use `context.Background()` instead of the parent context: +- `metrics/collector.go:sampleContainers` +- `main.go:http.Get` +- Fix: Thread context through for proper cancellation + +### 4. Deep Copy Completeness +`deepCopyStack` misses several nested slices/pointers: +- `Meta.OptionalConfig`, `Meta.HealthCheck`, `DeployField.Options` +- Fix: Audit all nested types and ensure complete deep copy + +### 5. Error Swallowing +Multiple locations silently discard errors: +- `PushOnce` non-2xx responses +- `PushEvent` delivery failures (not recorded) +- `Pinger.send` always returns nil +- `SaveAppConfig` encryption failures +- Fix: At minimum log errors; return where possible + +--- + +## Live Testing Summary (192.168.0.162) + +| Test | Result | +|------|--------| +| Container health | HEALTHY (v0.30.3, up, `(healthy)`) | +| Log errors | CLEAN (zero errors/warnings) | +| API health | OK (`{"ok":true}`) | +| Disk space | OK (21% root, 1% data drives) | +| Metrics DB | OK (7.2M DB, 1.4M WAL) | +| Restic locks | CLEAN (none) | +| Memory | 17.6 MiB / 15.4 GiB (0.11%) | +| CPU | 0.11% | +| Backup system | Working (4 DBs discovered, cache refresh OK) | + +No runtime issues detected. All findings are from code review — latent bugs and race conditions that require specific conditions to trigger. diff --git a/CHANGELOG.md b/CHANGELOG.md index 26a5489..736a431 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,45 @@ ## Changelog +### v0.30.4 — Deep Bug Hunt II: Concurrency, Security & Optimization (2026-02-25) + +#### Fixed (Critical) +- **Watchdog mutex panic** — Wrapped `handleDisconnect` call in anonymous func with deferred re-lock to guarantee mutex re-acquisition even on panic (C1) +- **SetGeoAppOverride nil crash** — Added nil guard; passing nil override now correctly deletes the entry instead of panicking (C2) +- **SSD-only app DB restore** — `restoreDBDumps` now falls back to `app.DrivePath` when `HDDPath` is empty (C3) + +#### Fixed (High) +- **Double deploy race** — Added atomic check-and-set of `Deploying` flag with `clearDeploying()` helper on all error paths (H1) +- **Delete/Remove during deploy** — Both `DeleteStack` and `RemoveStack` now reject operations while stack is deploying (H2) +- **ScanStacks overwrite** — Skips updating `Deployed`/`AppConfig` for stacks with active deploy in progress (H3) +- **FileBrowser mount race** — Added `fileBrowserMu` mutex to prevent concurrent `SyncFileBrowserMounts` calls (H5) +- **PushEvent history gap** — Added `recordHistory` calls on both success and failure paths in PushEvent goroutine (H6) +- **PushOnce silent failure** — Now returns error for non-2xx HTTP responses instead of nil (H7) +- **DB dump file corruption** — Added `tmpFile.Sync()` and `tmpFile.Close()` before rename in `DumpOne` (H8) +- **Restic retry timeout** — Creates fresh 30-minute context for retry after unlock instead of reusing near-expired original (H9) +- **Encrypt failure silent** — Added warning log when encryption fails in `SaveAppConfig` (H10) +- **Cross-backup path traversal** — Validates destination path against registered storage paths in both web and API handlers (H11) +- **deepCopyStack incomplete** — Now deep-copies `Meta.OptionalConfig`, `Meta.HealthCheck`, and `DeployField.Options` (H12) + +#### Security +- **Constant-time API key** — Replaced `==` with `subtle.ConstantTimeCompare` for API key comparison, preventing timing attacks (M1) +- **Login rate limiting** — Added per-IP rate limiter (5 attempts/minute) to login handler (M8) +- **Git credential masking** — Applied `maskRepoURL()` in `runGitInDir` log output to prevent credential leakage (M23) +- **Path prefix traversal** — Fixed `storageAttachBrowseHandler` prefix check to require trailing `/`, preventing sibling directory matches (M24) + +#### Concurrency & Logic +- **MigrateEncryption race** — Moved `encKey == nil` check inside the mutex lock (M5) +- **SubdomainInUse I/O under lock** — Collect stack dirs under RLock, release, then perform disk I/O outside (M4) +- **Scheduler late jobs** — Jobs registered after `Start()` now immediately get their goroutine launched (M10) +- **SQLite WAL verification** — WAL pragma now verified via `QueryRow` + `Scan` instead of silent `Exec` (M13) +- **Metrics shutdown** — `sampleContainers` now uses parent context instead of `context.Background()` for clean shutdown (M14) +- **Telemetry scan logging** — Row scan errors now logged instead of silently swallowed (M15) +- **Asset sync lock** — Refactored to hold mutex only for status updates, not during entire HTTP download (M22) + +#### Optimization +- **DB dump copy** — Replaced `os.ReadFile`/`os.WriteFile` with streaming `io.Copy` via `copyFile` helper for large dumps (M16) +- **Restic stats dedup** — Per-drive stats now computed once and aggregated, eliminating duplicate restic subprocess calls (M17) +- **Infra config atomic** — `syncInfraConfig` controller.yaml copy now uses atomic write via `copyFile` (M20) + ### v0.30.3 — Comprehensive Bug Hunt Fixes (2026-02-25) #### Fixed (Critical — P0) diff --git a/controller/cmd/controller/main.go b/controller/cmd/controller/main.go index 9c28831..efe032f 100644 --- a/controller/cmd/controller/main.go +++ b/controller/cmd/controller/main.go @@ -14,6 +14,7 @@ import ( "syscall" "time" + "crypto/subtle" "strings" "gitea.dooplex.hu/admin/felhom-controller/internal/api" @@ -688,7 +689,8 @@ func main() { } dc.HubConnectivityTest = func() (int, int64, error) { start := time.Now() - resp, err := http.Get(cfg.Hub.URL + "/healthz") + client := &http.Client{Timeout: 10 * time.Second} + resp, err := client.Get(cfg.Hub.URL + "/healthz") latency := time.Since(start).Milliseconds() if err != nil { return 0, latency, err @@ -817,7 +819,7 @@ func selfUpdateAuthMiddleware(cfg *config.Config, webServer *web.Server, next ht // Check bearer token first (for external API calls: hub, build scripts) if auth := r.Header.Get("Authorization"); strings.HasPrefix(auth, "Bearer ") { token := strings.TrimPrefix(auth, "Bearer ") - if token != "" && cfg.Hub.APIKey != "" && token == cfg.Hub.APIKey { + if token != "" && cfg.Hub.APIKey != "" && subtle.ConstantTimeCompare([]byte(token), []byte(cfg.Hub.APIKey)) == 1 { next.ServeHTTP(w, r) return } diff --git a/controller/internal/assets/syncer.go b/controller/internal/assets/syncer.go index b5f20b8..89dde7a 100644 --- a/controller/internal/assets/syncer.go +++ b/controller/internal/assets/syncer.go @@ -49,6 +49,7 @@ type Syncer struct { logger *log.Logger debug bool mu sync.Mutex + running bool status SyncStatus } @@ -70,7 +71,18 @@ func New(hubURL, apiKey, assetsDir, fallbackDir string, logger *log.Logger, debu // changed/new files. It also removes local files not in the Hub manifest. func (s *Syncer) Sync(ctx context.Context) error { s.mu.Lock() - defer s.mu.Unlock() + if s.running { + s.mu.Unlock() + return fmt.Errorf("asset sync already in progress") + } + s.running = true + s.mu.Unlock() + + defer func() { + s.mu.Lock() + s.running = false + s.mu.Unlock() + }() s.logger.Println("[INFO] Asset sync starting...") @@ -145,13 +157,15 @@ func (s *Syncer) Sync(ctx context.Context) error { // 5. Save local manifest copy s.saveLocalManifest(manifest) - // 6. Update status + // 6. Update status (under lock) + s.mu.Lock() s.status = SyncStatus{ LastSync: time.Now().UTC().Format(time.RFC3339), LastStatus: "ok", FileCount: len(manifest.Files), TotalBytes: totalBytes, } + s.mu.Unlock() s.logger.Printf("[INFO] Asset sync complete: %d downloaded, %d unchanged, %d removed (%d total files)", downloaded, skipped, removed, len(manifest.Files)) @@ -187,6 +201,7 @@ func (s *Syncer) Status() SyncStatus { } func (s *Syncer) setError(err error) { + s.mu.Lock() s.status = SyncStatus{ LastSync: time.Now().UTC().Format(time.RFC3339), LastStatus: "error", @@ -194,6 +209,7 @@ func (s *Syncer) setError(err error) { FileCount: s.status.FileCount, TotalBytes: s.status.TotalBytes, } + s.mu.Unlock() s.logger.Printf("[WARN] Asset sync failed: %v", err) } diff --git a/controller/internal/backup/backup.go b/controller/internal/backup/backup.go index c6f3051..a2c3b65 100644 --- a/controller/internal/backup/backup.go +++ b/controller/internal/backup/backup.go @@ -68,6 +68,7 @@ type DriveRepoInfo struct { TotalSize string TotalSizeBytes int64 SnapshotCount int + LatestSnapshot *SnapshotInfo `json:"-"` // used for aggregation, not serialized } // CrossDriveSummaryItem holds display data for one app's cross-drive backup. @@ -860,11 +861,33 @@ func (m *Manager) perDriveRepoStats() []DriveRepoInfo { TotalSize: stats.TotalSize, TotalSizeBytes: stats.TotalSizeBytes, SnapshotCount: stats.SnapshotCount, + LatestSnapshot: stats.LatestSnapshot, }) } return infos } +// aggregateFromDriveStats derives aggregate stats from already-computed per-drive stats, +// avoiding a second round of restic subprocess calls. +func aggregateFromDriveStats(drives []DriveRepoInfo, m *Manager) *RepoStats { + agg := &RepoStats{} + var totalBytes int64 + for _, d := range drives { + agg.SnapshotCount += d.SnapshotCount + totalBytes += d.TotalSizeBytes + if d.LatestSnapshot != nil { + if agg.LatestSnapshot == nil || d.LatestSnapshot.Time.After(agg.LatestSnapshot.Time) { + agg.LatestSnapshot = d.LatestSnapshot + } + } + } + agg.TotalSizeBytes = totalBytes + if totalBytes > 0 { + agg.TotalSize = humanizeBytes(totalBytes) + } + return agg +} + // aggregateRepoStats combines stats from all primary restic repos. func (m *Manager) aggregateRepoStats() *RepoStats { drives := m.activeDrives() @@ -1066,9 +1089,9 @@ func (m *Manager) RefreshCache(nextDBDump, nextBackup time.Time) { Retention: m.cfg.Backup.Retention, } - // Expensive calls (outside lock) - status.RepoStats = m.aggregateRepoStats() + // Expensive calls (outside lock) — compute per-drive stats once, derive aggregate status.PerDriveRepoStats = m.perDriveRepoStats() + status.RepoStats = aggregateFromDriveStats(status.PerDriveRepoStats, m) // Scan dump files from per-drive per-stack paths files := m.listAllDumpFiles() diff --git a/controller/internal/backup/crossdrive.go b/controller/internal/backup/crossdrive.go index 1af7bf5..7036c22 100644 --- a/controller/internal/backup/crossdrive.go +++ b/controller/internal/backup/crossdrive.go @@ -3,6 +3,7 @@ package backup import ( "context" "fmt" + "io" "log" "os" "os/exec" @@ -475,12 +476,8 @@ func (r *CrossDriveRunner) copyStackDBDumps(stackName, destDir string) error { } src := filepath.Join(dumpDir, e.Name()) dst := filepath.Join(destDir, e.Name()) - data, err := os.ReadFile(src) - if err != nil { - return fmt.Errorf("reading %s: %w", e.Name(), err) - } - if err := os.WriteFile(dst, data, 0644); err != nil { - return fmt.Errorf("writing %s: %w", e.Name(), err) + if err := copyFile(src, dst); err != nil { + return fmt.Errorf("copying %s: %w", e.Name(), err) } copied++ } @@ -523,14 +520,11 @@ func (r *CrossDriveRunner) syncInfraConfig(ctx context.Context) { } } - // Copy controller.yaml → _infra/controller.yaml + // Copy controller.yaml → _infra/controller.yaml (atomic via copyFile) if _, err := os.Stat(r.controllerYAMLPath); err == nil { yamlDest := filepath.Join(infraDir, "controller.yaml") - data, err := os.ReadFile(r.controllerYAMLPath) - if err != nil { - r.logger.Printf("[WARN] Cannot read controller.yaml for infra backup: %v", err) - } else if err := os.WriteFile(yamlDest, data, 0644); err != nil { - r.logger.Printf("[WARN] Cannot write controller.yaml to %s: %v", yamlDest, err) + if err := copyFile(r.controllerYAMLPath, yamlDest); err != nil { + r.logger.Printf("[WARN] Cannot copy controller.yaml to %s: %v", yamlDest, err) } } @@ -628,6 +622,32 @@ func (r *CrossDriveRunner) updateStatus(stackName, status, errMsg string, durati }) } +// copyFile copies src to dst using buffered streaming I/O (no full-file memory allocation). +func copyFile(src, dst string) error { + in, err := os.Open(src) + if err != nil { + return err + } + defer in.Close() + + tmp := dst + ".tmp" + out, err := os.Create(tmp) + if err != nil { + return err + } + + if _, err := io.Copy(out, in); err != nil { + out.Close() + os.Remove(tmp) + return err + } + if err := out.Close(); err != nil { + os.Remove(tmp) + return err + } + return os.Rename(tmp, dst) +} + // dirSizeBytes returns the total byte size of all files under path. // H7: Walk errors are now propagated instead of silently swallowed. func dirSizeBytes(path string) (int64, error) { diff --git a/controller/internal/backup/dbdump.go b/controller/internal/backup/dbdump.go index f5b4292..40a038f 100644 --- a/controller/internal/backup/dbdump.go +++ b/controller/internal/backup/dbdump.go @@ -256,6 +256,20 @@ func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.L return result } + // Close and sync tmpFile before rename to ensure data is flushed to disk (H8 fix). + if err := tmpFile.Sync(); err != nil { + os.Remove(tmpPath) + result.Error = fmt.Errorf("syncing dump file: %w", err) + result.Duration = time.Since(start) + return result + } + if err := tmpFile.Close(); err != nil { + os.Remove(tmpPath) + result.Error = fmt.Errorf("closing dump file: %w", err) + result.Duration = time.Since(start) + return result + } + // Check file size stat, err := os.Stat(tmpPath) if err != nil || stat.Size() == 0 { diff --git a/controller/internal/backup/restic.go b/controller/internal/backup/restic.go index 1885fc7..05bff34 100644 --- a/controller/internal/backup/restic.go +++ b/controller/internal/backup/restic.go @@ -145,8 +145,10 @@ func (r *ResticManager) Snapshot(repoPath string, paths []string, tags []string) if unlockErr := unlockCmd.Run(); unlockErr != nil { r.logger.Printf("[WARN] Restic unlock failed: %v", unlockErr) } - // Retry once - cmd = r.command(ctx, repoPath, args...) + // Retry once with a fresh context (H9 fix — original may be nearly expired). + retryCtx, retryCancel := context.WithTimeout(context.Background(), 30*time.Minute) + defer retryCancel() + cmd = r.command(retryCtx, repoPath, args...) out, err = cmd.Output() if err != nil { return nil, fmt.Errorf("restic backup failed after unlock: %v", err) diff --git a/controller/internal/backup/restore_app_linux.go b/controller/internal/backup/restore_app_linux.go index e7a6050..84aaa62 100644 --- a/controller/internal/backup/restore_app_linux.go +++ b/controller/internal/backup/restore_app_linux.go @@ -154,11 +154,22 @@ func restoreUserData(ctx context.Context, app *RestorableApp, logger *log.Logger // restoreDBDumps copies DB dump files from cross-drive backup to the primary dump dir. func restoreDBDumps(app *RestorableApp, logger *log.Logger) error { - if app.DBDumpPath == "" || app.HDDPath == "" { + if app.DBDumpPath == "" { return nil } - destDir := AppDBDumpPath(app.HDDPath, app.Name) + // Use HDDPath for apps with HDD data, fall back to DrivePath (system data path) + // for SSD-only apps whose DB dumps live under the system drive. + drivePath := app.HDDPath + if drivePath == "" { + drivePath = app.DrivePath + } + if drivePath == "" { + logger.Printf("[WARN] Cannot restore DB dumps for %s: no drive path", app.Name) + return nil + } + + destDir := AppDBDumpPath(drivePath, app.Name) if err := os.MkdirAll(destDir, 0755); err != nil { return fmt.Errorf("creating dump dir: %w", err) } diff --git a/controller/internal/metrics/collector.go b/controller/internal/metrics/collector.go index 45b9b6f..e19f4b9 100644 --- a/controller/internal/metrics/collector.go +++ b/controller/internal/metrics/collector.go @@ -54,25 +54,25 @@ func (c *MetricsCollector) loop(ctx context.Context) { defer ticker.Stop() // Sample immediately on start - c.sample() + c.sampleWith(ctx) for { select { case <-ctx.Done(): return case <-ticker.C: - c.sample() + c.sampleWith(ctx) } } } -func (c *MetricsCollector) sample() { +func (c *MetricsCollector) sampleWith(ctx context.Context) { sys := c.sampleSystem() if err := c.store.InsertSystemMetrics(sys); err != nil { c.logger.Printf("[WARN] Failed to store system metrics: %v", err) } - containers := c.sampleContainers() + containers := c.sampleContainers(ctx) if err := c.store.InsertContainerMetrics(containers); err != nil { c.logger.Printf("[WARN] Failed to store container metrics: %v", err) } @@ -96,8 +96,8 @@ func (c *MetricsCollector) sampleSystem() SystemSample { } } -func (c *MetricsCollector) sampleContainers() []ContainerSample { - ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) +func (c *MetricsCollector) sampleContainers(parentCtx context.Context) []ContainerSample { + ctx, cancel := context.WithTimeout(parentCtx, 30*time.Second) defer cancel() cmd := exec.CommandContext(ctx, "docker", "stats", "--no-stream", diff --git a/controller/internal/metrics/store.go b/controller/internal/metrics/store.go index 46e605f..178c859 100644 --- a/controller/internal/metrics/store.go +++ b/controller/internal/metrics/store.go @@ -22,9 +22,18 @@ func NewMetricsStore(dbPath string, logger *log.Logger) (*MetricsStore, error) { return nil, fmt.Errorf("open sqlite: %w", err) } - // Set pragmas for performance and concurrency + // Enable WAL mode and verify it took effect + var walMode string + if err := db.QueryRow("PRAGMA journal_mode=WAL").Scan(&walMode); err != nil { + db.Close() + return nil, fmt.Errorf("set WAL mode: %w", err) + } + if walMode != "wal" { + db.Close() + return nil, fmt.Errorf("WAL mode not supported on this filesystem (got %q)", walMode) + } + // Set remaining pragmas for performance and concurrency pragmas := []string{ - "PRAGMA journal_mode=WAL", "PRAGMA synchronous=NORMAL", "PRAGMA busy_timeout=5000", } diff --git a/controller/internal/metrics/telemetry.go b/controller/internal/metrics/telemetry.go index aceff11..1117efe 100644 --- a/controller/internal/metrics/telemetry.go +++ b/controller/internal/metrics/telemetry.go @@ -1,6 +1,7 @@ package metrics import ( + "log" "time" ) @@ -38,6 +39,7 @@ func (s *MetricsStore) GetContainerTelemetry(since time.Time) ([]ContainerTeleme var ct ContainerTelemetry if err := rows.Scan(&ct.ContainerName, &ct.MemoryAvgMB, &ct.MemoryPeakMB, &ct.CPUAvgPercent, &ct.SampleCount); err != nil { + log.Printf("[WARN] telemetry row scan failed: %v", err) continue } results = append(results, ct) diff --git a/controller/internal/monitor/watchdog.go b/controller/internal/monitor/watchdog.go index b1694f2..b6a10fe 100644 --- a/controller/internal/monitor/watchdog.go +++ b/controller/internal/monitor/watchdog.go @@ -232,9 +232,14 @@ func (w *StorageWatchdog) handleConnectedProbe(sp settings.StoragePath, state *p sp.Path, state.consecutiveFailures, probeThreshold, result.Err) if state.consecutiveFailures >= probeThreshold { - state.mu.Unlock() - w.handleDisconnect(sp, state, result) - state.mu.Lock() // re-acquire for deferred Unlock + // Release state.mu before calling handleDisconnect (which re-acquires it + // internally). Re-acquire afterwards so the deferred Unlock stays balanced. + // Wrap in a func to guarantee re-lock even if handleDisconnect panics. + func() { + state.mu.Unlock() + defer state.mu.Lock() + w.handleDisconnect(sp, state, result) + }() } } diff --git a/controller/internal/notify/notifier.go b/controller/internal/notify/notifier.go index 7d6e8ed..27aa96b 100644 --- a/controller/internal/notify/notifier.go +++ b/controller/internal/notify/notifier.go @@ -210,11 +210,17 @@ func (n *Notifier) PushEvent(eventType, severity, message string, details interf n.logger.Printf("[DEBUG] PushEvent: %s pushed OK (HTTP %d)", eventType, resp.StatusCode) } n.logger.Printf("[INFO] Event pushed: %s (%s) — %s", eventType, severity, message) + n.recordHistory(eventType, severity, message, resp.StatusCode, "") return } lastErr = fmt.Errorf("HTTP %d", resp.StatusCode) } n.logger.Printf("[WARN] Event push failed after 3 attempts (%s/%s): %v", eventType, severity, lastErr) + errMsg := "" + if lastErr != nil { + errMsg = lastErr.Error() + } + n.recordHistory(eventType, severity, message, 0, errMsg) }() } diff --git a/controller/internal/report/pusher.go b/controller/internal/report/pusher.go index 1233425..b424a6f 100644 --- a/controller/internal/report/pusher.go +++ b/controller/internal/report/pusher.go @@ -219,6 +219,7 @@ func (p *Pusher) PushOnce(report *Report) error { if resp.StatusCode >= 200 && resp.StatusCode < 300 { p.logger.Printf("[INFO] Hub push-once sent (%d bytes)", len(data)) + return nil } - return nil + return fmt.Errorf("hub push-once: HTTP %d", resp.StatusCode) } diff --git a/controller/internal/scheduler/scheduler.go b/controller/internal/scheduler/scheduler.go index 0bec051..705fa90 100644 --- a/controller/internal/scheduler/scheduler.go +++ b/controller/internal/scheduler/scheduler.go @@ -41,12 +41,13 @@ type Job struct { // Scheduler manages periodic and daily jobs. type Scheduler struct { - mu sync.Mutex - jobs []*Job - logger *log.Logger - ctx context.Context - cancel context.CancelFunc - wg sync.WaitGroup + mu sync.Mutex + jobs []*Job + logger *log.Logger + ctx context.Context + cancel context.CancelFunc + wg sync.WaitGroup + started bool } // New creates a new Scheduler. @@ -57,6 +58,7 @@ func New(logger *log.Logger) *Scheduler { } // Every registers a periodic job that runs every interval. +// If the scheduler is already started, the job's goroutine is launched immediately. func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc) { if interval <= 0 { s.logger.Printf("[ERROR] Periodic job %s has invalid interval %s — job not registered", name, interval) @@ -66,15 +68,22 @@ func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc) { s.mu.Lock() defer s.mu.Unlock() - s.jobs = append(s.jobs, &Job{ + job := &Job{ Name: name, Fn: fn, Interval: interval, - }) + } + s.jobs = append(s.jobs, job) s.logger.Printf("[SCHED] Registered periodic job: %s (every %s)", name, interval) + + if s.started { + s.wg.Add(1) + go s.runPeriodicJob(job) + } } // Daily registers a job that runs once per day at the specified time (HH:MM) in Europe/Budapest timezone. +// If the scheduler is already started, the job's goroutine is launched immediately. func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) { s.mu.Lock() defer s.mu.Unlock() @@ -85,14 +94,20 @@ func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) { return } - s.jobs = append(s.jobs, &Job{ + job := &Job{ Name: name, Fn: fn, Schedule: timeStr, - }) + } + s.jobs = append(s.jobs, job) nextRun := nextDailyRun(timeStr) s.logger.Printf("[SCHED] Daily job %s scheduled for %s", name, nextRun.Format("2006-01-02 15:04 MST")) + + if s.started { + s.wg.Add(1) + go s.runDailyJob(job) + } } // Start begins running all registered jobs. Safe to call only once. @@ -104,6 +119,7 @@ func (s *Scheduler) Start(ctx context.Context) { return } s.ctx, s.cancel = context.WithCancel(ctx) + s.started = true for _, job := range s.jobs { if job.Interval > 0 { diff --git a/controller/internal/settings/settings.go b/controller/internal/settings/settings.go index 1c2dffa..8ebef22 100644 --- a/controller/internal/settings/settings.go +++ b/controller/internal/settings/settings.go @@ -873,9 +873,17 @@ func (s *Settings) SetGeoRestriction(geo *GeoRestriction) error { } // SetGeoAppOverride sets a per-app geo override. Creates the GeoRestriction if nil. +// Pass override=nil to remove the override (same as RemoveGeoAppOverride). func (s *Settings) SetGeoAppOverride(appName string, override *AppGeoOverride) error { s.mu.Lock() defer s.mu.Unlock() + if override == nil { + // nil override = remove (fall back to global) + if s.GeoRestriction != nil && s.GeoRestriction.AppOverrides != nil { + delete(s.GeoRestriction.AppOverrides, appName) + } + return s.save() + } if s.GeoRestriction == nil { s.GeoRestriction = &GeoRestriction{AllowedCountries: []string{"HU"}} } diff --git a/controller/internal/stacks/delete.go b/controller/internal/stacks/delete.go index e3ea1bd..3224785 100644 --- a/controller/internal/stacks/delete.go +++ b/controller/internal/stacks/delete.go @@ -86,6 +86,11 @@ func (m *Manager) DeleteStack(name string, removeHDDData bool) (*DeleteResponse, return nil, fmt.Errorf("stack %q is not orphaned — only orphaned stacks can be deleted", name) } + // Must not be deploying (H2 fix) + if stack.Deploying { + return nil, fmt.Errorf("stack %q is currently being deployed — wait for deployment to finish", name) + } + // Must be stopped (not running) if stack.State == StateRunning || stack.State == StateStarting || stack.State == StateRestarting { return nil, fmt.Errorf("stack %q is still running — stop it first before deleting", name) @@ -239,6 +244,11 @@ func (m *Manager) RemoveStack(name string, removeHDDData bool, backupPathsToRemo return nil, fmt.Errorf("stack %q is not deployed", name) } + // Must not be deploying (H2 fix) + if stack.Deploying { + return nil, fmt.Errorf("stack %q is currently being deployed — wait for deployment to finish", name) + } + // Must be stopped (not running) if stack.State == StateRunning || stack.State == StateStarting || stack.State == StateRestarting { return nil, fmt.Errorf("stack %q is still running — stop it first before removing", name) diff --git a/controller/internal/stacks/deploy.go b/controller/internal/stacks/deploy.go index 7c47085..7c0b0e6 100644 --- a/controller/internal/stacks/deploy.go +++ b/controller/internal/stacks/deploy.go @@ -5,6 +5,7 @@ import ( "encoding/base64" "encoding/hex" "fmt" + "log" "math/big" "os" "path/filepath" @@ -56,24 +57,35 @@ func validateSubdomain(s string) error { // SubdomainInUse checks if a subdomain is already used by any deployed stack // other than excludeStack. func (m *Manager) SubdomainInUse(subdomain, excludeStack string) bool { + // Collect stack dirs and metadata under lock, then do I/O outside the lock. + type candidate struct { + dir string + metaSubdomain string + } + var candidates []candidate + m.mu.RLock() - defer m.mu.RUnlock() for name, stack := range m.stacks { if name == excludeStack || !stack.Deployed { continue } - stackDir := filepath.Dir(stack.ComposePath) - appCfg := LoadAppConfig(stackDir) + candidates = append(candidates, candidate{ + dir: filepath.Dir(stack.ComposePath), + metaSubdomain: stack.Meta.Subdomain, + }) + } + m.mu.RUnlock() + + for _, c := range candidates { + appCfg := LoadAppConfig(c.dir) if appCfg == nil { continue } - // Check stored SUBDOMAIN first if sd, ok := appCfg.Env["SUBDOMAIN"]; ok && sd == subdomain { return true } - // Backward compat: check metadata subdomain for apps without SUBDOMAIN in env if _, hasSub := appCfg.Env["SUBDOMAIN"]; !hasSub { - if stack.Meta.Subdomain == subdomain { + if c.metaSubdomain == subdomain { return true } } @@ -107,20 +119,43 @@ type DeployRequest struct { // 7. Run docker compose up -d with env vars // 8. Update in-memory stack state func (m *Manager) DeployStack(req DeployRequest) (string, error) { + // Atomically check and set the Deploying flag to prevent concurrent deploys (H1 fix). + m.mu.Lock() + sPtr, sOk := m.stacks[req.StackName] + if !sOk { + m.mu.Unlock() + return "", fmt.Errorf("stack %q not found", req.StackName) + } + if sPtr.Deploying { + m.mu.Unlock() + return "", fmt.Errorf("stack %q is already being deployed — please wait", req.StackName) + } + if sPtr.Deployed { + m.mu.Unlock() + return "", fmt.Errorf("stack %q is already deployed; use update instead", req.StackName) + } + sPtr.Deploying = true + sPtr.DeployError = "" + m.mu.Unlock() + + // If any validation below fails, clear the Deploying flag. + clearDeploying := func() { + m.mu.Lock() + if s, ok := m.stacks[req.StackName]; ok { + s.Deploying = false + } + m.mu.Unlock() + } + stack, ok := m.GetStack(req.StackName) if !ok { + clearDeploying() return "", fmt.Errorf("stack %q not found", req.StackName) } stackDir := filepath.Dir(stack.ComposePath) meta := LoadMetadata(stackDir) - // Check if already deployed - existing := LoadAppConfig(stackDir) - if existing != nil && existing.Deployed { - return "", fmt.Errorf("stack %q is already deployed; use update instead", req.StackName) - } - // --- Memory validation --- var deployWarning string reservedMB := m.cfg.System.ReservedMemoryMB @@ -136,6 +171,7 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { // Hard block: real used + new request exceeds usable memory if newReqMB > 0 && usedMB+newReqMB > usableMB { + clearDeploying() return "", fmt.Errorf( "Nincs elég memória az alkalmazás telepítéséhez. "+ "Szükséges: %d MB, Elérhető: %d MB "+ @@ -187,12 +223,15 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { value = field.Default } if err := validateSubdomain(value); err != nil { + clearDeploying() return "", err } if reservedSubdomains[value] { + clearDeploying() return "", fmt.Errorf("a(z) %q aldomain foglalt rendszer számára", value) } if m.SubdomainInUse(value, req.StackName) { + clearDeploying() return "", fmt.Errorf("a(z) %q aldomain már használatban van egy másik alkalmazásban", value) } @@ -204,6 +243,7 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { } else { generated, err := generateValue(field.Generate) if err != nil { + clearDeploying() return "", fmt.Errorf("generating %s: %w", field.EnvVar, err) } value = generated @@ -215,6 +255,7 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { if userVal, ok := req.Values[field.EnvVar]; ok && userVal != "" { value = userVal } else { + clearDeploying() return "", fmt.Errorf("a(z) %q mező kitöltése kötelező — használja a Generálás gombot vagy írjon be egy jelszót", field.Label) } @@ -229,12 +270,14 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { // Validate required fields if field.Required && value == "" { + clearDeploying() return "", fmt.Errorf("a(z) %q (%s) mező kitöltése kötelező", field.Label, field.EnvVar) } - // Validate path fields exist on disk (inside the container's filesystem) + // Validate path fields exist on the host filesystem if field.Type == "path" && value != "" { if _, err := os.Stat(value); os.IsNotExist(err) { + clearDeploying() return "", fmt.Errorf("path %q does not exist for field %q", value, field.Label) } } @@ -257,6 +300,7 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { } if err := SaveAppConfig(stackDir, appCfg, m.encKey, SensitiveEnvVars(&meta)); err != nil { + clearDeploying() return "", fmt.Errorf("saving app config: %w", err) } @@ -272,14 +316,12 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { m.checkLocalImages(req.StackName, stackDir) } - // Update in-memory stack state and mark as deploying. The compose-up - // runs in a goroutine so the API can return immediately and the UI - // shows progress via polling (image pull can take 30-60s). + // Update in-memory stack state. Deploying was already set at the top (H1 fix). + // The compose-up runs in a goroutine so the API can return immediately + // and the UI shows progress via polling (image pull can take 30-60s). m.mu.Lock() if s, ok := m.stacks[req.StackName]; ok { s.Deployed = true - s.Deploying = true - s.DeployError = "" s.AppConfig = appCfg } m.mu.Unlock() @@ -544,6 +586,9 @@ func SaveAppConfig(stackDir string, cfg *AppConfig, encKey []byte, sensitiveVars if enc, err := crypto.Encrypt(encKey, v); err == nil { saveCfg.Env[k] = enc continue + } else { + // H10 fix: log encryption failure — value will be saved in plaintext. + log.Printf("[WARN] Failed to encrypt env var %q: %v — saving as plaintext", k, err) } } saveCfg.Env[k] = v diff --git a/controller/internal/stacks/manager.go b/controller/internal/stacks/manager.go index e1d0250..ba6cbf6 100644 --- a/controller/internal/stacks/manager.go +++ b/controller/internal/stacks/manager.go @@ -120,11 +120,11 @@ func (m *Manager) SetEncryptionKey(key []byte) { // MigrateEncryption re-saves app.yaml for deployed stacks that still have // plaintext values in sensitive fields. Called once on startup. func (m *Manager) MigrateEncryption() { + m.mu.Lock() + defer m.mu.Unlock() if m.encKey == nil { return } - m.mu.Lock() - defer m.mu.Unlock() migrated := 0 for _, s := range m.stacks { @@ -233,8 +233,12 @@ func (m *Manager) ScanStacks() error { existing.ComposePath = composePath existing.Meta = meta existing.Protected = m.cfg.IsProtectedStack(name) - existing.Deployed = deployed - existing.AppConfig = appCfg + // Don't overwrite Deployed/AppConfig while an async deploy is in + // progress — the goroutine manages these fields (H3 fix). + if !existing.Deploying { + existing.Deployed = deployed + existing.AppConfig = appCfg + } } else { m.stacks[name] = &Stack{ Name: name, @@ -507,10 +511,44 @@ func deepCopyStack(s *Stack) Stack { cp.HealthProbe = &hpCopy } - // Deep-copy Meta.DeployFields slice + // Deep-copy Meta.DeployFields slice (including nested Options) if s.Meta.DeployFields != nil { cp.Meta.DeployFields = make([]DeployField, len(s.Meta.DeployFields)) copy(cp.Meta.DeployFields, s.Meta.DeployFields) + for i, f := range s.Meta.DeployFields { + if f.Options != nil { + cp.Meta.DeployFields[i].Options = make([]SelectOption, len(f.Options)) + copy(cp.Meta.DeployFields[i].Options, f.Options) + } + } + } + + // Deep-copy Meta.OptionalConfig (slice of groups with nested Fields slices) + if s.Meta.OptionalConfig != nil { + cp.Meta.OptionalConfig = make([]OptionalConfigGroup, len(s.Meta.OptionalConfig)) + copy(cp.Meta.OptionalConfig, s.Meta.OptionalConfig) + for i, g := range s.Meta.OptionalConfig { + if g.Fields != nil { + cp.Meta.OptionalConfig[i].Fields = make([]OptionalConfigField, len(g.Fields)) + copy(cp.Meta.OptionalConfig[i].Fields, g.Fields) + } + } + } + + // Deep-copy Meta.HealthCheck pointer + if s.Meta.HealthCheck != nil { + hcCopy := *s.Meta.HealthCheck + if s.Meta.HealthCheck.Checks != nil { + hcCopy.Checks = make([]HealthCheckItem, len(s.Meta.HealthCheck.Checks)) + copy(hcCopy.Checks, s.Meta.HealthCheck.Checks) + for i, c := range s.Meta.HealthCheck.Checks { + if c.Expect != nil { + eCopy := *c.Expect + hcCopy.Checks[i].Expect = &eCopy + } + } + } + cp.Meta.HealthCheck = &hcCopy } return cp diff --git a/controller/internal/sync/sync.go b/controller/internal/sync/sync.go index 4c95d4b..661b0ac 100644 --- a/controller/internal/sync/sync.go +++ b/controller/internal/sync/sync.go @@ -424,7 +424,7 @@ func (s *Syncer) runGitInDir(dir string, args ...string) error { cmd.Stdout = io.Discard cmd.Stderr = &stderr - s.logger.Printf("[SYNC] Running: git %s", strings.Join(args, " ")) + s.logger.Printf("[SYNC] Running: git %s", maskRepoURL(strings.Join(args, " "))) if err := cmd.Run(); err != nil { return fmt.Errorf("git %s: %w\nstderr: %s", strings.Join(args, " "), err, stderr.String()) diff --git a/controller/internal/web/auth.go b/controller/internal/web/auth.go index 9839a23..fc0227c 100644 --- a/controller/internal/web/auth.go +++ b/controller/internal/web/auth.go @@ -17,9 +17,17 @@ type session struct { csrfToken string } +// loginAttempt tracks failed login attempts for rate limiting. +type loginAttempt struct { + count int + lastFail time.Time +} + const ( - sessionCookieName = "felhom_session" - sessionMaxAge = 7 * 24 * time.Hour + sessionCookieName = "felhom_session" + sessionMaxAge = 7 * 24 * time.Hour + loginMaxAttempts = 5 + loginWindowDuration = 1 * time.Minute ) // effectivePasswordHash returns the active password hash using the priority: @@ -98,13 +106,47 @@ func (s *Server) handleLogin(w http.ResponseWriter, r *http.Request) { return } + // Rate limit: check failed attempts from this IP + ip := r.RemoteAddr + if fwd := r.Header.Get("X-Forwarded-For"); fwd != "" { + ip = strings.Split(fwd, ",")[0] + } + ip = strings.TrimSpace(ip) + + s.loginAttemptMu.Lock() + attempt := s.loginAttempts[ip] + if attempt != nil && time.Since(attempt.lastFail) > loginWindowDuration { + // Window expired — reset + attempt = nil + delete(s.loginAttempts, ip) + } + if attempt != nil && attempt.count >= loginMaxAttempts { + s.loginAttemptMu.Unlock() + s.logger.Printf("[WARN] Login rate limited for %s (%d attempts)", ip, attempt.count) + s.renderLogin(w, "Túl sok sikertelen próbálkozás, próbálja újra 1 perc múlva", "") + return + } + s.loginAttemptMu.Unlock() + effectiveHash := s.effectivePasswordHash() if err := bcrypt.CompareHashAndPassword([]byte(effectiveHash), []byte(password)); err != nil { s.logger.Printf("[WARN] Failed login from %s", r.RemoteAddr) + s.loginAttemptMu.Lock() + if s.loginAttempts[ip] == nil { + s.loginAttempts[ip] = &loginAttempt{} + } + s.loginAttempts[ip].count++ + s.loginAttempts[ip].lastFail = time.Now() + s.loginAttemptMu.Unlock() s.renderLogin(w, "Hibás jelszó", "") return } + // Successful login — clear rate limit for this IP + s.loginAttemptMu.Lock() + delete(s.loginAttempts, ip) + s.loginAttemptMu.Unlock() + token := s.createSession() isSecure := r.TLS != nil || r.Header.Get("X-Forwarded-Proto") == "https" http.SetCookie(w, &http.Cookie{ diff --git a/controller/internal/web/handlers.go b/controller/internal/web/handlers.go index 4dbd881..5c64008 100644 --- a/controller/internal/web/handlers.go +++ b/controller/internal/web/handlers.go @@ -965,6 +965,23 @@ func (s *Server) settingsCrossBackupHandler(w http.ResponseWriter, r *http.Reque schedule = existing.Schedule } + // Validate destination path against registered storage paths (H11 fix — matches API handler). + if enabled && destPath != "" { + registeredPaths := s.settings.GetStoragePaths() + validDest := false + for _, sp := range registeredPaths { + if destPath == sp.Path { + validDest = true + break + } + } + if !validDest { + s.logger.Printf("[WARN] Cross-drive backup: rejected invalid dest path %q for %s", destPath, name) + http.Redirect(w, r, "/stacks/"+name+"/deploy?flash_error="+url.QueryEscape("Érvénytelen célútvonal: "+destPath), http.StatusFound) + return + } + } + var cfg *settings.CrossDriveBackup if destPath != "" || existing != nil { cfg = &settings.CrossDriveBackup{ @@ -1543,6 +1560,10 @@ func (s *Server) settingsStorageLabelHandler(w http.ResponseWriter, r *http.Requ // SyncFileBrowserMounts regenerates FileBrowser's docker-compose.yml and config.yaml // with volume mounts and sources for all registered storage paths, then recreates the container. func (s *Server) SyncFileBrowserMounts() { + // Prevent concurrent syncs — multiple callers can race on the same files (H5 fix). + s.fileBrowserMu.Lock() + defer s.fileBrowserMu.Unlock() + stackDir := "/opt/docker/stacks/filebrowser" composePath := stackDir + "/docker-compose.yml" diff --git a/controller/internal/web/server.go b/controller/internal/web/server.go index f699fa5..a3363a6 100644 --- a/controller/internal/web/server.go +++ b/controller/internal/web/server.go @@ -41,10 +41,12 @@ type Server struct { encKey []byte // AES-256 key for decrypting app.yaml values tmpl *template.Template - sessions map[string]*session - sessionsMu sync.RWMutex - done chan struct{} - closeOnce sync.Once + sessions map[string]*session + sessionsMu sync.RWMutex + loginAttempts map[string]*loginAttempt + loginAttemptMu sync.Mutex + done chan struct{} + closeOnce sync.Once // Disk operation state (format/migrate jobs) diskJobMu sync.Mutex @@ -53,6 +55,9 @@ type Server struct { // Active raw mount for the attach wizard (empty when not in use) activeRawMount string + // Guard for FileBrowser sync — prevents concurrent file writes (H5 fix) + fileBrowserMu sync.Mutex + // Drive migration driveMigrator *storage.DriveMigrator @@ -90,6 +95,7 @@ func NewServer(cfg *config.Config, stackMgr *stacks.Manager, cpuCollector *syste logger: logger, version: version, sessions: make(map[string]*session), + loginAttempts: make(map[string]*loginAttempt), done: make(chan struct{}), } s.loadTemplates() @@ -111,6 +117,7 @@ func NewServer(cfg *config.Config, stackMgr *stacks.Manager, cpuCollector *syste } // SetEncryptionKey sets the AES-256 key used to decrypt app.yaml values for display. +// Must be called before ListenAndServe (all Set* methods are init-time only). func (s *Server) SetEncryptionKey(key []byte) { s.encKey = key } diff --git a/controller/internal/web/storage_handlers.go b/controller/internal/web/storage_handlers.go index 322b3c7..664e3c0 100644 --- a/controller/internal/web/storage_handlers.go +++ b/controller/internal/web/storage_handlers.go @@ -952,7 +952,7 @@ func (s *Server) storageAttachBrowseHandler(w http.ResponseWriter, r *http.Reque // Security: validate path is under the raw mount staging area cleanPath := filepath.Clean(browsePath) - if !strings.HasPrefix(cleanPath, storage.RawMountBase) { + if cleanPath != storage.RawMountBase && !strings.HasPrefix(cleanPath, storage.RawMountBase+"/") { jsonError(w, "Érvénytelen útvonal", http.StatusBadRequest) return }