863 lines
32 KiB
Markdown
863 lines
32 KiB
Markdown
# Bug Hunt Report — DR Implementation (TASK.md + TASK2.md)
|
||
|
||
**Date:** 2026-02-19
|
||
**Scope:** All code implemented in TASK.md (docker-setup.sh v5.0) and TASK2.md (DR phases 1-4)
|
||
**Purpose:** Detailed fix instructions for Sonnet to implement
|
||
|
||
---
|
||
|
||
## Priority Legend
|
||
|
||
| Priority | Meaning |
|
||
|----------|---------|
|
||
| **P0** | Must fix before production — crashes, data loss, security |
|
||
| **P1** | Should fix before production — logic errors, race conditions |
|
||
| **P2** | Fix soon — defensive improvements, consistency |
|
||
| **P3** | Nice to have — code quality, edge cases |
|
||
|
||
---
|
||
|
||
## AREA 1: Controller — handler_restore.go (Race Conditions)
|
||
|
||
All restore handlers read `s.restorePlan` without holding `s.restoreMu`, causing nil pointer panics and double-restore races. This is the single most critical area.
|
||
|
||
### BUG 1.1 — P0: restorePageHandler reads restorePlan without lock
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 14-28
|
||
**Impact:** Nil pointer dereference panic if `clearRestoreMode()` runs between nil-check (line 14) and field access (lines 24-28).
|
||
|
||
**Current code:**
|
||
```go
|
||
func (s *Server) restorePageHandler(w http.ResponseWriter, r *http.Request) {
|
||
if s.restorePlan == nil { // line 14: UNLOCKED
|
||
http.Redirect(w, r, "/", http.StatusFound)
|
||
return
|
||
}
|
||
data := map[string]interface{}{
|
||
"CustomerID": s.restorePlan.CustomerID, // line 24: UNLOCKED
|
||
"Timestamp": s.restorePlan.Timestamp, // line 25
|
||
"Apps": s.restorePlan.GetApps(), // line 26
|
||
"Drives": s.restorePlan.Drives, // line 27: direct ref
|
||
"PlanStatus": s.restorePlan.Status, // line 28: UNLOCKED
|
||
}
|
||
```
|
||
|
||
**Fix:** Hold `restoreMu.RLock()` from the nil-check through all field reads. Copy everything under lock, then release before rendering:
|
||
```go
|
||
func (s *Server) restorePageHandler(w http.ResponseWriter, r *http.Request) {
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
if plan == nil {
|
||
s.restoreMu.RUnlock()
|
||
http.Redirect(w, r, "/", http.StatusFound)
|
||
return
|
||
}
|
||
// Snapshot all needed fields under lock
|
||
customerID := plan.CustomerID
|
||
timestamp := plan.Timestamp
|
||
apps := plan.GetApps()
|
||
drives := make([]backup.DriveInfo, len(plan.Drives))
|
||
copy(drives, plan.Drives)
|
||
status := plan.Status
|
||
s.restoreMu.RUnlock()
|
||
|
||
data := map[string]interface{}{
|
||
"Title": "Katasztrófa utáni visszaállítás",
|
||
"CustomerName": s.cfg.Customer.Name,
|
||
"Domain": s.cfg.Customer.Domain,
|
||
"Version": s.version,
|
||
"CustomerID": customerID,
|
||
"Timestamp": timestamp,
|
||
"Apps": apps,
|
||
"Drives": drives,
|
||
"PlanStatus": status,
|
||
}
|
||
s.render(w, "restore", data)
|
||
}
|
||
```
|
||
|
||
Note: `plan.GetApps()` already holds its own internal `rp.mu.RLock()` — this is fine, nested RLocks are safe. However, `plan.Status` and `plan.Drives` need the outer `restoreMu` to prevent nil dereference.
|
||
|
||
---
|
||
|
||
### BUG 1.2 — P0: apiRestoreStatus reads restorePlan without lock
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 35-42
|
||
**Impact:** Nil pointer panic if `clearRestoreMode()` sets `s.restorePlan = nil` between lines 36 and 42.
|
||
|
||
**Fix:** Same pattern — hold restoreMu.RLock:
|
||
```go
|
||
func (s *Server) apiRestoreStatus(w http.ResponseWriter, r *http.Request) {
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
if plan == nil {
|
||
s.restoreMu.RUnlock()
|
||
jsonError(w, "not in restore mode", http.StatusBadRequest)
|
||
return
|
||
}
|
||
snapshot := plan.Snapshot()
|
||
s.restoreMu.RUnlock()
|
||
|
||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||
json.NewEncoder(w).Encode(snapshot)
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 1.3 — P0: apiRestoreAll has double-restore race + unlocked writes
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 46-63
|
||
**Impact:** Two concurrent POST requests both pass the `Status == "restoring"` check and spawn two parallel goroutines restoring the same apps, causing data corruption.
|
||
|
||
Lines 51 and 56 read/write `s.restorePlan.Status` without any lock.
|
||
|
||
**Fix:** Hold restoreMu.RLock for the nil check, then use the plan's internal mu.Lock for the status check-and-set:
|
||
```go
|
||
func (s *Server) apiRestoreAll(w http.ResponseWriter, r *http.Request) {
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
s.restoreMu.RUnlock()
|
||
if plan == nil {
|
||
jsonError(w, "not in restore mode", http.StatusBadRequest)
|
||
return
|
||
}
|
||
|
||
// Atomic check-and-set under plan's internal lock
|
||
plan.mu.Lock()
|
||
if plan.Status == "restoring" {
|
||
plan.mu.Unlock()
|
||
jsonError(w, "restore already in progress", http.StatusConflict)
|
||
return
|
||
}
|
||
plan.Status = "restoring"
|
||
plan.mu.Unlock()
|
||
|
||
go s.executeAllRestores()
|
||
|
||
jsonResponse(w, map[string]interface{}{
|
||
"ok": true,
|
||
"message": "Visszaállítás elindítva",
|
||
})
|
||
}
|
||
```
|
||
|
||
**Important:** This requires `RestorePlan.mu` to be exported OR adding a helper method. The simplest approach: add a `SetStatus(status string) bool` method to RestorePlan that returns false if already "restoring":
|
||
|
||
```go
|
||
// Add to restore_scan.go:
|
||
// TryStartRestore atomically sets status to "restoring" if currently "pending".
|
||
// Returns false if already restoring.
|
||
func (rp *RestorePlan) TryStartRestore() bool {
|
||
rp.mu.Lock()
|
||
defer rp.mu.Unlock()
|
||
if rp.Status == "restoring" {
|
||
return false
|
||
}
|
||
rp.Status = "restoring"
|
||
return true
|
||
}
|
||
|
||
// SetStatus sets the overall plan status.
|
||
func (rp *RestorePlan) SetStatus(status string) {
|
||
rp.mu.Lock()
|
||
defer rp.mu.Unlock()
|
||
rp.Status = status
|
||
}
|
||
|
||
// GetStatus returns the current plan status.
|
||
func (rp *RestorePlan) GetStatus() string {
|
||
rp.mu.RLock()
|
||
defer rp.mu.RUnlock()
|
||
return rp.Status
|
||
}
|
||
```
|
||
|
||
Then `apiRestoreAll` becomes:
|
||
```go
|
||
func (s *Server) apiRestoreAll(w http.ResponseWriter, r *http.Request) {
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
s.restoreMu.RUnlock()
|
||
if plan == nil {
|
||
jsonError(w, "not in restore mode", http.StatusBadRequest)
|
||
return
|
||
}
|
||
if !plan.TryStartRestore() {
|
||
jsonError(w, "restore already in progress", http.StatusConflict)
|
||
return
|
||
}
|
||
go s.executeAllRestores()
|
||
jsonResponse(w, map[string]interface{}{
|
||
"ok": true,
|
||
"message": "Visszaállítás elindítva",
|
||
})
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 1.4 — P0: apiRestoreSkip reads restorePlan without lock
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 66-70
|
||
**Impact:** Same nil pointer race as other handlers.
|
||
|
||
**Fix:**
|
||
```go
|
||
func (s *Server) apiRestoreSkip(w http.ResponseWriter, r *http.Request) {
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
s.restoreMu.RUnlock()
|
||
if plan == nil {
|
||
jsonError(w, "not in restore mode", http.StatusBadRequest)
|
||
return
|
||
}
|
||
|
||
s.logger.Println("[INFO] User skipped DR restore — entering normal mode")
|
||
s.clearRestoreMode()
|
||
|
||
jsonResponse(w, map[string]interface{}{
|
||
"ok": true,
|
||
"message": "Visszaállítás kihagyva",
|
||
})
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 1.5 — P1: executeAllRestores writes Status/Apps without lock
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 82-125
|
||
**Impact:**
|
||
- Line 85: `s.restorePlan.Apps` accessed without lock (could be nil if cleared)
|
||
- Line 86: `&s.restorePlan.Apps[i]` — pointer into slice that could be invalidated
|
||
- Line 107: `s.restorePlan.Status = "done"` — unlocked write
|
||
|
||
**Fix:** Use the helper methods from BUG 1.3. Also guard against restorePlan being nil:
|
||
```go
|
||
func (s *Server) executeAllRestores() {
|
||
s.logger.Println("[INFO] Starting DR restore for all apps")
|
||
|
||
s.restoreMu.RLock()
|
||
plan := s.restorePlan
|
||
s.restoreMu.RUnlock()
|
||
if plan == nil {
|
||
s.logger.Println("[WARN] Restore plan cleared before execution could start")
|
||
return
|
||
}
|
||
|
||
for i := range plan.Apps {
|
||
app := &plan.Apps[i]
|
||
if app.Status != "pending" {
|
||
continue
|
||
}
|
||
|
||
plan.UpdateApp(app.Name, "restoring", "")
|
||
s.logger.Printf("[INFO] Restoring app %s (%s)", app.Name, app.DisplayName)
|
||
|
||
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
|
||
err := backup.RestoreAppFromBackup(ctx, app, s.cfg.Paths.StacksDir, s.logger)
|
||
cancel()
|
||
|
||
if err != nil {
|
||
plan.UpdateApp(app.Name, "failed", err.Error())
|
||
s.logger.Printf("[ERROR] Restore failed for %s: %v", app.Name, err)
|
||
} else {
|
||
plan.UpdateApp(app.Name, "done", "")
|
||
s.logger.Printf("[INFO] Restore completed for %s", app.Name)
|
||
}
|
||
}
|
||
|
||
plan.SetStatus("done") // <-- use new helper instead of direct write
|
||
s.logger.Println("[INFO] All app restores completed")
|
||
|
||
// Re-scan stacks so dashboard picks up restored apps
|
||
if s.stackMgr != nil {
|
||
if err := s.stackMgr.ScanStacks(); err != nil {
|
||
s.logger.Printf("[WARN] Post-restore stack scan failed: %v", err)
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
Also **delete the no-op goroutine** at lines 117-124 (see BUG 1.6).
|
||
|
||
---
|
||
|
||
### BUG 1.6 — P2: No-op auto-clear goroutine (dead code)
|
||
|
||
**File:** `controller/internal/web/handler_restore.go` lines 117-124
|
||
**Impact:** Spawns a goroutine that sleeps 5s and does nothing. Confusing code.
|
||
|
||
**Fix:** Delete lines 117-124 entirely. The user clicks "continue to dashboard" to clear restore mode — this is already implemented and correct.
|
||
|
||
---
|
||
|
||
## AREA 2: Controller — restore_scan.go
|
||
|
||
### BUG 2.1 — P0: dirIsEmpty() returns true on read errors
|
||
|
||
**File:** `controller/internal/backup/restore_scan.go` lines 234-237
|
||
**Impact:** If `os.ReadDir()` fails (permission denied, I/O error), `dirIsEmpty()` returns `true` because `err != nil` satisfies the condition. This causes the scan to skip valid DB dump directories that are merely unreadable, silently losing backup data.
|
||
|
||
**Current code:**
|
||
```go
|
||
func dirIsEmpty(path string) bool {
|
||
entries, err := os.ReadDir(path)
|
||
return err != nil || len(entries) == 0
|
||
}
|
||
```
|
||
|
||
**Fix:** Only return true for genuinely empty directories, return false on errors (assume not empty):
|
||
```go
|
||
func dirIsEmpty(path string) bool {
|
||
entries, err := os.ReadDir(path)
|
||
if err != nil {
|
||
return false // assume non-empty on error — safer for backup detection
|
||
}
|
||
return len(entries) == 0
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 2.2 — P1: Snapshot() returns direct slice references (shallow copy)
|
||
|
||
**File:** `controller/internal/backup/restore_scan.go` lines 71-80
|
||
**Impact:** The returned map holds direct references to `rp.Apps` and `rp.Drives`. While the RLock is held during `json.Encode`, if the encoding is ever separated from the lock (e.g., by future code changes), modifications to Apps/Drives would corrupt the snapshot. Inconsistent with `GetApps()` which properly copies.
|
||
|
||
**Fix:** Make defensive copies like `GetApps()` does:
|
||
```go
|
||
func (rp *RestorePlan) Snapshot() map[string]interface{} {
|
||
rp.mu.RLock()
|
||
defer rp.mu.RUnlock()
|
||
|
||
apps := make([]RestorableApp, len(rp.Apps))
|
||
copy(apps, rp.Apps)
|
||
|
||
drives := make([]DriveInfo, len(rp.Drives))
|
||
copy(drives, rp.Drives)
|
||
|
||
return map[string]interface{}{
|
||
"ok": true,
|
||
"status": rp.Status,
|
||
"apps": apps,
|
||
"drives": drives,
|
||
}
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## AREA 3: Controller — infra_backup.go (Silent Failures)
|
||
|
||
### BUG 3.1 — P0: BuildInfraBackup silently drops file read errors
|
||
|
||
**File:** `controller/internal/report/infra_backup.go` lines 55-67
|
||
**Impact:** If controller.yaml, settings.json, or the restic password file can't be read (permissions, moved, disk error), the `if err == nil` pattern silently stores an empty string. The Hub receives an infra backup with missing critical data, and disaster recovery will fail silently (no config, no settings, no password → broken restore).
|
||
|
||
**Current code:**
|
||
```go
|
||
if data, err := os.ReadFile(controllerYAMLPath); err == nil {
|
||
ib.ControllerConfigB64 = base64.StdEncoding.EncodeToString(data)
|
||
}
|
||
// same for settings.json and restic password
|
||
```
|
||
|
||
**Fix:** Log warnings for missing files so the operator knows the backup is incomplete. Return error for controller.yaml (truly critical):
|
||
```go
|
||
// Read and encode controller.yaml (critical — fail if unreadable)
|
||
data, err := os.ReadFile(controllerYAMLPath)
|
||
if err != nil {
|
||
return nil, fmt.Errorf("reading controller config %s: %w", controllerYAMLPath, err)
|
||
}
|
||
ib.ControllerConfigB64 = base64.StdEncoding.EncodeToString(data)
|
||
|
||
// Read and encode settings.json (important but non-fatal)
|
||
if data, err := os.ReadFile(settingsPath); err == nil {
|
||
ib.SettingsJSONB64 = base64.StdEncoding.EncodeToString(data)
|
||
} else if !os.IsNotExist(err) {
|
||
// Log warning: file exists but unreadable
|
||
// (Note: BuildInfraBackup doesn't have a logger param — either add one,
|
||
// or add a Warnings []string field to InfraBackup, or wrap the error)
|
||
}
|
||
|
||
// Read primary restic password (important but non-fatal)
|
||
if data, err := os.ReadFile(resticPasswordFile); err == nil {
|
||
ib.ResticPassword = base64.StdEncoding.EncodeToString(data)
|
||
} else if !os.IsNotExist(err) {
|
||
// Same: log or accumulate warning
|
||
}
|
||
```
|
||
|
||
**Simplest approach:** Add a `logger *log.Logger` parameter to `BuildInfraBackup` (already passed in the caller `pushInfraBackup` in main.go, so just thread it through). Then log warnings:
|
||
|
||
```go
|
||
func BuildInfraBackup(
|
||
customerID, domain, version string,
|
||
controllerYAMLPath string,
|
||
settingsPath string,
|
||
resticPasswordFile string,
|
||
systemDataPath string,
|
||
sett *settings.Settings,
|
||
stackProvider backup.StackDataProvider,
|
||
logger *log.Logger, // <-- ADD THIS
|
||
) (*InfraBackup, error) {
|
||
```
|
||
|
||
In `pushInfraBackup` in main.go, update the call to pass `logger`.
|
||
|
||
---
|
||
|
||
### BUG 3.2 — P2: CrossDrivePassword encoding inconsistency
|
||
|
||
**File:** `controller/internal/report/infra_backup.go` lines 70-72
|
||
**Impact:** `ResticPassword` is base64-encoded (line 66) but `CrossDrivePassword` is stored raw (line 71). While the restore code in main.go handles both correctly (decodes ResticPassword from base64, uses CrossDrivePassword as-is), this inconsistency could cause bugs if someone adds base64 decoding to CrossDrivePassword in the future.
|
||
|
||
**Fix (optional — low priority):** Add a comment explaining the asymmetry:
|
||
```go
|
||
// Cross-drive password is stored as plain text (not base64) because it's
|
||
// already a string in settings, unlike ResticPassword which comes from a file.
|
||
if pw := sett.GetCrossDriveResticPassword(); pw != "" {
|
||
ib.CrossDrivePassword = pw
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## AREA 4: Controller — main.go DR Wiring
|
||
|
||
### BUG 4.1 — P1: Cross-drive password may be lost on settings reload
|
||
|
||
**File:** `controller/cmd/controller/main.go` lines 80-89
|
||
**Impact:** `restorePasswordsFromHub()` sets the cross-drive password on `sett` (line 80), which also saves it to settings.json. Then `restoreSettingsFromHub()` (line 83) overwrites settings.json with the Hub backup version. Then `sett` is replaced with a fresh load (line 87). The password is preserved IF the Hub backup's settings.json already contains it. But if the explicit `ib.CrossDrivePassword` field was updated after the settings.json was last saved, the password is lost.
|
||
|
||
**Fix:** Move password restoration AFTER settings reload:
|
||
```go
|
||
// Restore settings.json from Hub backup (do this FIRST)
|
||
restoreSettingsFromHub(ib, cfg, logger)
|
||
|
||
// Re-load settings (now from restored file)
|
||
if restoredSett, loadErr := settings.Load(settingsPath, logger); loadErr == nil {
|
||
sett = restoredSett
|
||
logger.Println("[INFO] Settings reloaded after Hub restore")
|
||
}
|
||
|
||
// Restore restic passwords (AFTER settings reload so cross-drive password persists)
|
||
restorePasswordsFromHub(ib, cfg, sett, logger)
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 4.2 — P1: No nil check after ScanDrivesForBackups
|
||
|
||
**File:** `controller/cmd/controller/main.go` lines 124-127
|
||
**Impact:** If `ScanDrivesForBackups()` were to return nil (currently it doesn't, but it could in a future change), lines 125-127 would panic with nil pointer dereference.
|
||
|
||
**Fix:** Add nil guard:
|
||
```go
|
||
restorePlan = backup.ScanDrivesForBackups(drivePaths, infraStacks, logger)
|
||
if restorePlan != nil {
|
||
restorePlan.CustomerID = ib.CustomerID
|
||
restorePlan.Domain = ib.Domain
|
||
restorePlan.Timestamp = ib.Timestamp
|
||
logger.Printf("[INFO] DR restore plan ready: %d apps to restore", len(restorePlan.Apps))
|
||
} else {
|
||
logger.Println("[WARN] ScanDrivesForBackups returned nil — no restore plan created")
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 4.3 — P2: os.MkdirAll error ignored for restic password dir
|
||
|
||
**File:** `controller/cmd/controller/main.go` line 624
|
||
**Impact:** If directory creation fails, the subsequent WriteFile fails with a confusing "no such file or directory" error.
|
||
|
||
**Fix:**
|
||
```go
|
||
dir := filepath.Dir(cfg.Backup.ResticPasswordFile)
|
||
if err := os.MkdirAll(dir, 0700); err != nil {
|
||
logger.Printf("[WARN] Failed to create restic password directory %s: %v", dir, err)
|
||
} else if err := os.WriteFile(cfg.Backup.ResticPasswordFile, decoded, 0600); err == nil {
|
||
logger.Println("[INFO] Primary restic password restored from Hub")
|
||
} else {
|
||
logger.Printf("[WARN] Failed to write restic password file: %v", err)
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 4.4 — P3: restoreSettingsFromHub doesn't ensure data directory exists
|
||
|
||
**File:** `controller/cmd/controller/main.go` lines 652-653
|
||
**Impact:** On a truly fresh deployment, if `cfg.Paths.DataDir` doesn't exist yet, WriteFile will fail.
|
||
|
||
**Fix:** Add `os.MkdirAll` before write:
|
||
```go
|
||
func restoreSettingsFromHub(ib *report.InfraBackup, cfg *config.Config, logger *log.Logger) {
|
||
if ib.SettingsJSONB64 == "" {
|
||
return
|
||
}
|
||
decoded, err := base64.StdEncoding.DecodeString(ib.SettingsJSONB64)
|
||
if err != nil {
|
||
logger.Printf("[WARN] Failed to decode settings from Hub: %v", err)
|
||
return
|
||
}
|
||
if err := os.MkdirAll(cfg.Paths.DataDir, 0755); err != nil {
|
||
logger.Printf("[WARN] Failed to create data directory: %v", err)
|
||
return
|
||
}
|
||
settingsPath := filepath.Join(cfg.Paths.DataDir, "settings.json")
|
||
// ... rest unchanged
|
||
```
|
||
|
||
---
|
||
|
||
## AREA 5: Hub — store.go and handler.go
|
||
|
||
### BUG 5.1 — P2: Unchecked json.Unmarshal errors throughout store.go
|
||
|
||
**File:** `hub/internal/store/store.go` lines 126, 217, 286, 336, 502
|
||
**Impact:** If stored JSON is corrupt, `json.Unmarshal` errors are silently ignored. Functions proceed with zero-valued structs, returning incorrect data.
|
||
|
||
**Affected locations:**
|
||
- Line 126: `json.Unmarshal([]byte(eventsJSON), &events)` in GetNotificationPrefs
|
||
- Line 217: `json.Unmarshal(reportJSON, &parsed)` in SaveReport
|
||
- Line 286: `json.Unmarshal([]byte(c.ReportJSON), &report)` in GetCustomers
|
||
- Line 336: `json.Unmarshal([]byte(c.ReportJSON), &report)` in GetCustomer
|
||
- Line 502: `json.Unmarshal([]byte(reportJSON), &report)` in parseDiskSummary
|
||
|
||
**Fix for each:** Add error logging (not fatal — corrupted data shouldn't crash the hub):
|
||
|
||
Line 126:
|
||
```go
|
||
if err := json.Unmarshal([]byte(eventsJSON), &events); err != nil {
|
||
s.logger.Printf("[WARN] Corrupt enabled_events JSON for %s: %v", customerID, err)
|
||
}
|
||
```
|
||
|
||
Line 217:
|
||
```go
|
||
if err := json.Unmarshal(reportJSON, &parsed); err != nil {
|
||
s.logger.Printf("[WARN] Cannot parse report fields for denormalization: %v", err)
|
||
}
|
||
```
|
||
|
||
Lines 286, 336, 502: Same pattern — log warning, continue with defaults.
|
||
|
||
**Note for Sonnet:** `parseDiskSummary` (line 502) doesn't have access to a logger. Either add a logger parameter or just add `// ignore parse errors — show "–"` comment. The function already returns "–" on failure.
|
||
|
||
---
|
||
|
||
### BUG 5.2 — P2: GetInfraBackupMeta swallows unmarshal error
|
||
|
||
**File:** `hub/internal/store/store.go` line 450
|
||
**Impact:** If the stored infra backup JSON is corrupt, the function returns a meta struct with StackCount=0, DiskCount=0 — misleading the dashboard into showing "no stacks, no disks" instead of an error.
|
||
|
||
**Current code:**
|
||
```go
|
||
if json.Unmarshal([]byte(backupJSON), &parsed) == nil {
|
||
```
|
||
|
||
**Fix:** Log on unmarshal failure:
|
||
```go
|
||
if err := json.Unmarshal([]byte(backupJSON), &parsed); err != nil {
|
||
s.logger.Printf("[WARN] Failed to parse infra backup metadata for %s: %v", customerID, err)
|
||
} else {
|
||
meta.StackCount = len(parsed.DeployedStacks)
|
||
meta.DiskCount = len(parsed.DiskLayout.Mounts)
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 5.3 — P3: No customer_id content validation on ingest endpoints
|
||
|
||
**File:** `hub/internal/api/handler.go` lines 88, 226, 313, 348
|
||
**Impact:** Customer IDs with special characters, extremely long strings, or SQL injection attempts are accepted and stored. While Go's SQLite driver uses parameterized queries (safe from SQL injection), excessively long IDs waste storage and could be used in denial-of-service.
|
||
|
||
**Fix (optional):** Add a simple validation helper and use it in all handlers:
|
||
```go
|
||
func isValidCustomerID(id string) bool {
|
||
if len(id) > 128 || len(id) == 0 {
|
||
return false
|
||
}
|
||
for _, c := range id {
|
||
if !((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.') {
|
||
return false
|
||
}
|
||
}
|
||
return true
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## AREA 6: docker-setup.sh
|
||
|
||
### BUG 6.1 — P0: YAML syntax error — networks: on same line as last volume
|
||
|
||
**File:** `scripts/docker-setup.sh` line ~1297 (in install_filebrowser)
|
||
**Impact:** When `volume_lines` is populated (drives discovered), the `networks:` key ends up on the same line as the last volume entry, producing invalid YAML. Docker Compose fails → FileBrowser doesn't deploy.
|
||
|
||
The heredoc line likely looks like:
|
||
```
|
||
${volume_lines} networks:
|
||
```
|
||
|
||
When volume_lines has content ending with a newline, the output is correct. **However**, if `volume_lines` is empty (no drives found), the output becomes:
|
||
```yaml
|
||
volumes:
|
||
- filebrowser_data:/home/filebrowser/data
|
||
networks:
|
||
```
|
||
which IS correct. And when `volume_lines` has entries, each ends with `$'\n'`, so it also works.
|
||
|
||
**Verification needed:** Read the actual heredoc to confirm whether this is a real bug or false alarm. The agent may have been wrong about the indentation.
|
||
|
||
**Action:** Sonnet should read the `install_filebrowser` function and verify the heredoc YAML output is correct by checking that `volume_lines` entries end with proper newlines and the `networks:` key starts on a new line. If it doesn't, add `$'\n'` before `networks:`.
|
||
|
||
---
|
||
|
||
### BUG 6.2 — P0: Wizard prompts user even in --dry-run mode
|
||
|
||
**File:** `scripts/docker-setup.sh` — `run_config_wizard()` function (around line 1456)
|
||
**Impact:** The DRY_RUN check is at the END of the wizard, after all interactive `read` prompts. User runs `--dry-run` expecting a non-interactive preview but gets prompted for customer ID, domain, password, etc.
|
||
|
||
Also, after the wizard returns in dry-run, `WIZ_*` variables are empty, so downstream functions (`install_cloudflare_tunnel`, `install_filebrowser`, `install_controller`) see empty values and skip their dry-run output.
|
||
|
||
**Fix:** Move the `if [[ "$DRY_RUN" == true ]]` check to the TOP of `run_config_wizard()`, right after the step log line and the banner. Set dummy wizard values for downstream functions:
|
||
|
||
```bash
|
||
run_config_wizard() {
|
||
local step_num=5
|
||
[[ "$SELF_SIGNED_CERT" == true ]] && ((step_num++))
|
||
log_step "${step_num}/$(get_total_steps) - Running configuration wizard..."
|
||
|
||
echo ""
|
||
echo -e "${BOLD}${CYAN}===========================================================${NC}"
|
||
echo -e "${BOLD}${CYAN} Felhom Controller Configuration Wizard${NC}"
|
||
echo -e "${BOLD}${CYAN}===========================================================${NC}"
|
||
echo ""
|
||
|
||
if [[ "$DRY_RUN" == true ]]; then
|
||
echo -e "${CYAN}[DRY-RUN]${NC} Would run interactive wizard and generate controller.yaml"
|
||
# Set dummy values so downstream functions produce correct dry-run output
|
||
WIZ_CUSTOMER_ID="${CUSTOMER_ID:-demo-felhom}"
|
||
WIZ_CUSTOMER_NAME="${WIZ_CUSTOMER_ID}"
|
||
WIZ_DOMAIN="${BASE_DOMAIN:-homeserver.local}"
|
||
WIZ_EMAIL="${ACME_EMAIL:-admin@example.com}"
|
||
WIZ_CF_TUNNEL_TOKEN="" # Empty = no tunnel in dry-run
|
||
WIZ_CF_API_TOKEN="${CF_DNS_API_TOKEN:-}"
|
||
WIZ_SYSTEM_DATA_PATH="/mnt/sys_drive"
|
||
WIZ_PASSWORD_HASH='<would-be-generated>'
|
||
WIZ_SESSION_SECRET='<would-be-generated>'
|
||
return
|
||
fi
|
||
|
||
# ... rest of interactive prompts unchanged
|
||
```
|
||
|
||
**Also:** Remove the old DRY_RUN check at the end of the function (around line 1553-1557).
|
||
|
||
---
|
||
|
||
### BUG 6.3 — P1: Unquoted Cloudflare Tunnel token in docker-compose env
|
||
|
||
**File:** `scripts/docker-setup.sh` line ~1055 (in install_cloudflare_tunnel)
|
||
**Impact:** Tokens containing `$`, backticks, or other shell/Docker special characters will be misinterpreted.
|
||
|
||
**Fix:** Quote the value:
|
||
```bash
|
||
# Change:
|
||
- TUNNEL_TOKEN=${WIZ_CF_TUNNEL_TOKEN}
|
||
# To:
|
||
- TUNNEL_TOKEN="${WIZ_CF_TUNNEL_TOKEN}"
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 6.4 — P1: htpasswd hash extraction with tr -d ':\n' may include junk
|
||
|
||
**File:** `scripts/docker-setup.sh` line ~1514
|
||
**Impact:** `htpasswd -bnBC 10 "" "$password"` outputs `:<hash>\n`. The `tr -d ':\n'` removes ALL colons and newlines, which works. But if the htpasswd command fails silently, the hash is empty and no error is raised.
|
||
|
||
**Fix:** Use `cut -d: -f2` instead of `tr -d` for cleaner extraction, and validate the result:
|
||
```bash
|
||
if command -v htpasswd &>/dev/null; then
|
||
WIZ_PASSWORD_HASH=$(htpasswd -bnBC 10 "" "$wiz_password" 2>/dev/null | cut -d: -f2)
|
||
if [[ ! "$WIZ_PASSWORD_HASH" =~ ^\$2[aby]\$ ]]; then
|
||
log_warn "htpasswd failed — trying Python fallback"
|
||
WIZ_PASSWORD_HASH=""
|
||
fi
|
||
fi
|
||
if [[ -z "$WIZ_PASSWORD_HASH" ]] && command -v python3 &>/dev/null; then
|
||
WIZ_PASSWORD_HASH=$(python3 -c "import bcrypt; print(bcrypt.hashpw(b'${wiz_password}', bcrypt.gensalt(10)).decode())" 2>/dev/null || echo "")
|
||
fi
|
||
if [[ -z "$WIZ_PASSWORD_HASH" ]]; then
|
||
log_warn "Could not hash password — dashboard will prompt on first visit"
|
||
fi
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 6.5 — P1: grep without -F for literal path matching
|
||
|
||
**File:** `scripts/docker-setup.sh` line ~1255 (in install_filebrowser)
|
||
**Impact:** Paths like `/mnt/data[1]` or `/mnt/test.dir` are treated as regex patterns in grep, causing false matches or missed matches.
|
||
|
||
**Fix:**
|
||
```bash
|
||
# Change:
|
||
if ! echo "$volume_lines" | grep -q "${WIZ_SYSTEM_DATA_PATH}"; then
|
||
# To:
|
||
if ! echo "$volume_lines" | grep -qF "${WIZ_SYSTEM_DATA_PATH}"; then
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 6.6 — P2: Spaces in mount point names not quoted in YAML
|
||
|
||
**File:** `scripts/docker-setup.sh` lines ~1238-1248 (in install_filebrowser)
|
||
**Impact:** Mount points with spaces (e.g., `/mnt/my drive/`) produce invalid YAML volume mappings.
|
||
|
||
**Fix:** Quote the path in volume_lines:
|
||
```bash
|
||
# Change:
|
||
volume_lines+=" - ${mp%/}:/srv/${name}"$'\n'
|
||
# To:
|
||
volume_lines+=" - \"${mp%/}:/srv/${name}\""$'\n'
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 6.7 — P2: No post-wizard validation of required fields
|
||
|
||
**File:** `scripts/docker-setup.sh` (end of `run_config_wizard()`, before YAML generation)
|
||
**Impact:** User pressing Enter through all prompts without CLI flags generates config with dummy defaults like `demo-felhom` and `homeserver.local`. These produce a non-functional deployment with no clear error.
|
||
|
||
**Fix:** Add validation before writing controller.yaml:
|
||
```bash
|
||
# After all read prompts, before writing YAML:
|
||
if [[ -z "$WIZ_CUSTOMER_ID" ]] || [[ "$WIZ_CUSTOMER_ID" == "demo-felhom" ]]; then
|
||
log_error "Customer ID is required (not the default 'demo-felhom')"
|
||
exit 1
|
||
fi
|
||
if [[ -z "$WIZ_DOMAIN" ]] || [[ "$WIZ_DOMAIN" == "homeserver.local" ]]; then
|
||
log_error "A real domain is required (not 'homeserver.local')"
|
||
exit 1
|
||
fi
|
||
```
|
||
|
||
---
|
||
|
||
## AREA 7: Template — restore.html (Lower Priority)
|
||
|
||
### BUG 7.1 — P2: Error messages inserted via innerHTML without escaping
|
||
|
||
**File:** `controller/internal/web/templates/restore.html` line ~286
|
||
**Impact:** Error messages from `RestoreAppFromBackup` are concatenated into HTML via innerHTML. While these errors come from Go code (not user input), it's a bad practice. An error containing `<` would break the HTML rendering.
|
||
|
||
**Fix:** Use `textContent` for the error portion:
|
||
```javascript
|
||
// Instead of:
|
||
html += ' <span style="font-size:.8rem;color:var(--danger)">(' + app.error.substring(0, 60) + ')</span>';
|
||
|
||
// Do:
|
||
// After setting cell.innerHTML for the safe part, create the error span with textContent:
|
||
if (app.error) {
|
||
var errSpan = document.createElement('span');
|
||
errSpan.style.cssText = 'font-size:.8rem;color:var(--danger)';
|
||
errSpan.textContent = ' (' + app.error.substring(0, 60) + ')';
|
||
cell.querySelector('.status-' + app.status).appendChild(errSpan);
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
### BUG 7.2 — P3: Polling continues silently on network errors
|
||
|
||
**File:** `controller/internal/web/templates/restore.html` lines ~258-274
|
||
**Impact:** If the server restarts during a restore, the `.catch(function() {})` silently swallows all errors and polling continues every 2 seconds forever. User sees a frozen UI with no error indication.
|
||
|
||
**Fix:** Add error handling and retry limit:
|
||
```javascript
|
||
var pollErrors = 0;
|
||
function pollStatus() {
|
||
fetch('/api/restore/status')
|
||
.then(function(resp) {
|
||
if (!resp.ok) throw new Error('HTTP ' + resp.status);
|
||
return resp.json();
|
||
})
|
||
.then(function(data) {
|
||
pollErrors = 0; // reset on success
|
||
if (!data.ok) return;
|
||
updateTable(data.apps || []);
|
||
updateProgress(data.apps || []);
|
||
if (data.status === 'done') {
|
||
clearInterval(polling);
|
||
polling = null;
|
||
planStatus = 'done';
|
||
updateActions();
|
||
}
|
||
})
|
||
.catch(function(err) {
|
||
pollErrors++;
|
||
console.error('Poll error:', err);
|
||
if (pollErrors >= 10) {
|
||
clearInterval(polling);
|
||
polling = null;
|
||
// Show error to user
|
||
var actions = document.getElementById('dr-actions');
|
||
if (actions) {
|
||
actions.innerHTML = '<p style="color:var(--danger)">Kapcsolat megszakadt. <a href="/restore">Oldal frissítése</a></p>';
|
||
}
|
||
}
|
||
});
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## Summary Table
|
||
|
||
| ID | Priority | Area | File | Line(s) | Description |
|
||
|----|----------|------|------|---------|-------------|
|
||
| 1.1 | P0 | Handler | handler_restore.go | 14-28 | restorePageHandler unlocked reads → nil panic |
|
||
| 1.2 | P0 | Handler | handler_restore.go | 35-42 | apiRestoreStatus unlocked reads → nil panic |
|
||
| 1.3 | P0 | Handler | handler_restore.go | 46-63 | apiRestoreAll double-restore race + unlocked writes |
|
||
| 1.4 | P0 | Handler | handler_restore.go | 66-70 | apiRestoreSkip unlocked read → nil panic |
|
||
| 1.5 | P1 | Handler | handler_restore.go | 82-125 | executeAllRestores unlocked reads/writes |
|
||
| 1.6 | P2 | Handler | handler_restore.go | 117-124 | No-op goroutine (dead code) |
|
||
| 2.1 | P0 | Scan | restore_scan.go | 234-237 | dirIsEmpty returns true on read error |
|
||
| 2.2 | P1 | Scan | restore_scan.go | 71-80 | Snapshot() shallow copy of slices |
|
||
| 3.1 | P0 | Infra | infra_backup.go | 55-67 | Silent file read failures in BuildInfraBackup |
|
||
| 3.2 | P2 | Infra | infra_backup.go | 70-72 | CrossDrivePassword encoding inconsistency |
|
||
| 4.1 | P1 | Main | main.go | 80-89 | Cross-drive password lost on settings reload |
|
||
| 4.2 | P1 | Main | main.go | 124-127 | No nil check after ScanDrivesForBackups |
|
||
| 4.3 | P2 | Main | main.go | 624 | os.MkdirAll error ignored |
|
||
| 4.4 | P3 | Main | main.go | 652-653 | Missing directory creation for settings |
|
||
| 5.1 | P2 | Hub | store.go | 126+ | Unchecked json.Unmarshal errors |
|
||
| 5.2 | P2 | Hub | store.go | 450 | GetInfraBackupMeta swallows error |
|
||
| 5.3 | P3 | Hub | handler.go | 88+ | No customer_id validation |
|
||
| 6.1 | P0 | Script | docker-setup.sh | ~1297 | YAML networks: indentation (verify) |
|
||
| 6.2 | P0 | Script | docker-setup.sh | ~1456 | Wizard prompts in --dry-run mode |
|
||
| 6.3 | P1 | Script | docker-setup.sh | ~1055 | Unquoted CF tunnel token |
|
||
| 6.4 | P1 | Script | docker-setup.sh | ~1514 | htpasswd hash extraction fragile |
|
||
| 6.5 | P1 | Script | docker-setup.sh | ~1255 | grep without -F for paths |
|
||
| 6.6 | P2 | Script | docker-setup.sh | ~1238 | Spaces in mount paths unquoted |
|
||
| 6.7 | P2 | Script | docker-setup.sh | wizard | No post-wizard field validation |
|
||
| 7.1 | P2 | Template | restore.html | ~286 | innerHTML with unescaped error text |
|
||
| 7.2 | P3 | Template | restore.html | ~258 | Silent polling failure |
|
||
|
||
**New methods needed in restore_scan.go:** `TryStartRestore() bool`, `SetStatus(string)`, `GetStatus() string`
|
||
|
||
**Recommended fix order:**
|
||
1. Bugs 1.1-1.4, 2.1 (P0 Go — crash/data-loss prevention)
|
||
2. Bugs 3.1, 6.2 (P0 — silent data loss, broken dry-run)
|
||
3. Bug 6.1 (P0 — verify YAML, fix if needed)
|
||
4. Bugs 1.5, 2.2, 4.1, 4.2 (P1 Go — race conditions, logic)
|
||
5. Bugs 6.3-6.5 (P1 Script — security, correctness)
|
||
6. Everything else (P2-P3)
|