Files
deploy-felhom-compose/TASK.md
T

361 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# TASK.md — Hub Dashboard Bugs + Backup Validation Fix
## Overview
Three bugs identified from the live hub.felhom.eu and controller backup page:
1. **Hub main page shows DOWN** despite the detail page showing STATUS: OK
2. **Hub report history timestamps show 00:00:00** instead of actual times
3. **Backup page shows "Hiba" for all DB validations** with no tooltip detail
Bugs 1 and 2 share the same root cause (timestamp parsing). Bug 3 is in the controller.
---
## Bug 1 & 2: Hub timestamp parsing failure
**Repository:** `felhom.eu``hub/`
### Root cause
The hub's SQLite store parses `received_at` timestamps with a single format:
```go
c.ReceivedAt, _ = time.Parse("2006-01-02 15:04:05", receivedAt)
```
The parse error is silently discarded (`_`). When the format doesn't match what the
`modernc.org/sqlite` driver returns, `ReceivedAt` becomes Go's zero time (`0001-01-01 00:00:00`).
**Consequences:**
- `time.Since(zeroTime)` ≈ 740,000+ hours → `TimeSinceReport > 1 hour`**OverallStatus = "down"**
- `zeroTime.Format("15:04:05")`**"00:00:00"** in report history
- Detail page health status shows OK because that comes from the report JSON payload, not the timestamp
The `modernc.org/sqlite` driver may return datetime strings in various formats depending on
how the value was stored and the SQLite version:
- `2026-02-16 14:30:00` (what we expect)
- `2026-02-16T14:30:00Z` (ISO 8601 / RFC3339-ish)
- `2026-02-16 14:30:00+00:00` (with timezone offset)
- `2026-02-16 14:30:00.123456` (with fractional seconds)
### Fix: `hub/internal/store/store.go`
**Step 1:** Add a robust timestamp parser function at the bottom of store.go:
```go
// parseSQLiteTime tries multiple formats that modernc.org/sqlite may return.
func parseSQLiteTime(s string) time.Time {
formats := []string{
"2006-01-02 15:04:05", // SQLite datetime('now')
"2006-01-02T15:04:05Z", // RFC3339 without fractional
time.RFC3339, // 2006-01-02T15:04:05Z07:00
time.RFC3339Nano, // with fractional seconds
"2006-01-02 15:04:05+00:00", // with explicit UTC offset
"2006-01-02 15:04:05.999999999", // with fractional, no TZ
}
for _, f := range formats {
if t, err := time.Parse(f, s); err == nil {
return t
}
}
// Last resort: if string is non-empty, log it for debugging
if s != "" {
log.Printf("[WARN] Could not parse timestamp: %q", s)
}
return time.Time{} // zero time
}
```
Note: Add `"log"` to the import block if not already present.
**Step 2:** Replace ALL occurrences of `time.Parse("2006-01-02 15:04:05", receivedAt)` in store.go.
There are **three** locations:
1. **`GetCustomers()`** — in the `for rows.Next()` loop:
```go
// BEFORE:
c.ReceivedAt, _ = time.Parse("2006-01-02 15:04:05", receivedAt)
// AFTER:
c.ReceivedAt = parseSQLiteTime(receivedAt)
```
2. **`GetCustomer()`** — after `row.Scan`:
```go
// BEFORE:
c.ReceivedAt, _ = time.Parse("2006-01-02 15:04:05", receivedAt)
// AFTER:
c.ReceivedAt = parseSQLiteTime(receivedAt)
```
3. **`GetCustomerHistory()`** — in the `for rows.Next()` loop:
```go
// BEFORE:
c.ReceivedAt, _ = time.Parse("2006-01-02 15:04:05", receivedAt)
// AFTER:
c.ReceivedAt = parseSQLiteTime(receivedAt)
```
**Step 3 (optional diagnostic):** Temporarily add a log line in `SaveReport` to see what format
SQLite actually stores/returns. This can be removed after verifying the fix:
```go
// Add after the INSERT in SaveReport, before return:
// Debug: check what format SQLite returns
var dbTime string
s.db.QueryRow("SELECT received_at FROM reports WHERE customer_id = ? ORDER BY id DESC LIMIT 1", customerID).Scan(&dbTime)
s.logger.Printf("[DEBUG] SQLite received_at raw value: %q", dbTime)
```
### Verify
After rebuilding and deploying the hub:
1. Wait for the next controller report push (or trigger manually)
2. Check hub.felhom.eu — status should show **OK** (green), not DOWN
3. Click into customer detail — "Last report: X min ago" should show a reasonable value
4. Report History timestamps should show actual times like `14:36:32`, not `00:00:00`
5. Check hub pod logs for any `[WARN] Could not parse timestamp` messages (should be none)
### Post-fix grep
```bash
grep -rn 'time.Parse("2006-01-02 15:04:05"' hub/internal/store/store.go
# Should return 0 results — all replaced with parseSQLiteTime()
```
---
## Bug 3: Backup page shows "Hiba" for all DB validations
**Repository:** `deploy-felhom-compose``controller/`
### Symptoms
- All 3 databases (immich, paperless, romm) show "Hiba" in the Érvényesítés column
- The Állapot column shows "OK" (dump succeeded)
- No tooltip text on hover (meaning `Validation.Error` is empty)
- Dump files are valid — headers are correct, sizes are reasonable (43.2 MB / 319.6 KB / 38.7 KB)
### Analysis
The template condition for "Hiba" in the `LastDBDump` path is:
```html
{{if .Error}} → shows "" (dump failed)
{{else if .Validation.Valid}} → shows "X tábla" (validation passed)
{{else}} → shows "Hiba" (THIS IS WHAT WE SEE)
```
"Hiba" with empty tooltip means `Validation.Valid == false` AND `Validation.Error == ""`.
This is the **zero-value** of `DumpValidation{}` — meaning validation was never assigned.
The code in `DumpOne()` calls `ValidateDump()` and the code in `ListDumpFiles()` also calls
`ValidateDump()`. Both paths should populate the Validation field. Yet the UI shows zero-value.
**Most likely cause:** The `lastDBDump` state was populated by an older code version (before
validation was wired), OR there's a race condition where `RefreshCache` captures `lastDBDump`
mid-construction, OR the validation ran but hit an unexpected issue (permissions, encoding).
### Diagnostic step (run on demo-felhom FIRST)
Before applying fixes, check the controller logs to understand what happened:
```bash
# Check the last DB dump run
sudo journalctl -u felhom-controller --since "2026-02-16 00:00" | grep -iE "db dump|table|valid|dump:"
# Check if there was a controller restart
sudo journalctl -u felhom-controller --since "2026-02-16 00:00" | grep -iE "starting|version|shutdown"
# Check if the old bash systemd timer is ALSO running (double-dump conflict!)
systemctl is-active backup-db-dump.timer
systemctl list-timers | grep backup
```
**IMPORTANT:** If `backup-db-dump.timer` is still active, it will race with the controller's
built-in `db-dump` scheduler job. Both write to the same directory. The bash script overwrites
files directly (no `.tmp` + rename), which could corrupt the file mid-validation. **Disable it:**
```bash
sudo systemctl stop backup-db-dump.timer
sudo systemctl disable backup-db-dump.timer
```
### Fix 1: Add debug logging to `ValidateDump`
**File:** `controller/internal/backup/dbdump.go`, function `ValidateDump`
Add a log parameter and diagnostic output so we can see what's happening:
```go
// BEFORE:
func ValidateDump(filePath string, dbType DBType) DumpValidation {
// AFTER:
func ValidateDump(filePath string, dbType DBType) DumpValidation {
log.Printf("[DEBUG] ValidateDump: %s (type=%s)", filePath, dbType)
```
And at the end, before `return v`:
```go
v.Valid = true
log.Printf("[DEBUG] ValidateDump OK: %s — %d tables, header found", filePath, tableCount)
return v
}
```
Also add logging to the error paths:
After `v.Error = "dump file too small (< 100 bytes)"`:
```go
log.Printf("[WARN] ValidateDump FAIL: %s — %s", filePath, v.Error)
```
After `v.Error = fmt.Sprintf("read failed: %v", err)`:
```go
log.Printf("[WARN] ValidateDump FAIL: %s — %s", filePath, v.Error)
```
After `v.Error = "... dump missing comment header"`:
```go
log.Printf("[WARN] ValidateDump FAIL: %s — %s", filePath, v.Error)
```
After `v.Error = "no CREATE TABLE statements found"`:
```go
log.Printf("[WARN] ValidateDump FAIL: %s — %s (header was found, scanned %d lines)", filePath, v.Error, len(strings.Split(content, "\n")))
```
Note: Import `"log"` at the top of the file if not already imported (use the standard `log`
package, not the `*log.Logger` parameter — this is a quick debug addition. Can be cleaned up later.)
### Fix 2: Template guard against zero-value Validation
Even with debug logging, we should make the template resilient to zero-value Validation.
The "Hiba" label with no explanation is a bad UX.
**File:** `controller/internal/web/templates/backups.html`
In the `LastDBDump` section, change the Érvényesítés (validation) column:
```html
<!-- BEFORE: -->
{{if .Error}}
<span class="validation-badge validation-na"></span>
{{else if .Validation.Valid}}
<span class="validation-badge validation-ok">{{.Validation.TableCount}} tábla</span>
{{else}}
<span class="validation-badge validation-fail" title="{{.Validation.Error}}">Hiba</span>
{{end}}
<!-- AFTER: -->
{{if .Error}}
<span class="validation-badge validation-na"></span>
{{else if .Validation.Valid}}
<span class="validation-badge validation-ok">{{.Validation.TableCount}} tábla</span>
{{else if .Validation.Error}}
<span class="validation-badge validation-fail" title="{{.Validation.Error}}">Hiba</span>
{{else}}
<span class="validation-badge validation-na" title="Az érvényesítés nem futott le"></span>
{{end}}
```
This ensures:
- If validation passed → green badge with table count
- If validation failed with a reason → red "Hiba" with tooltip
- If validation never ran (zero-value) → gray "" with explanatory tooltip
### Fix 3: Re-validate on cache refresh (belt-and-suspenders)
Since `RefreshCache` already calls `ListDumpFiles()` which runs `ValidateDump()` per file,
the `DumpFiles` fallback always has fresh validation. The issue is only in the `LastDBDump`
path when in-memory results have stale/missing validation.
Add a cross-check: if `LastDBDump` results have zero-value Validation but the file exists,
re-validate it. Add this in `RefreshCache`, after the existing code:
**File:** `controller/internal/backup/backup.go`, function `RefreshCache`
After the line `status.DumpFiles = files` and before the lock section, add:
```go
// Cross-check: if LastDBDump results have empty validation but files exist,
// re-validate from disk. This handles controller restarts and race conditions.
if m.lastDBDump != nil {
fileValidation := make(map[string]DumpValidation) // keyed by filename
for _, f := range files {
fileValidation[f.FileName] = f.Validation
}
for i, r := range m.lastDBDump.Results {
if !r.Validation.Valid && r.Validation.Error == "" && r.FilePath != "" {
filename := filepath.Base(r.FilePath)
if fv, ok := fileValidation[filename]; ok {
m.lastDBDump.Results[i].Validation = fv
m.logger.Printf("[INFO] Re-validated %s from disk: valid=%v tables=%d",
filename, fv.Valid, fv.TableCount)
}
}
}
}
```
Note: Add `"path/filepath"` to imports if not already present.
This runs every 5 minutes (same cadence as the cache refresh) and will automatically
heal any stale validation state in `lastDBDump` by cross-referencing the fresh
`ListDumpFiles` results.
### Fix 4: Disable conflicting systemd timer (manual step)
If the diagnostic step above reveals that `backup-db-dump.timer` is still active:
```bash
sudo systemctl stop backup-db-dump.timer
sudo systemctl disable backup-db-dump.timer
# Optionally verify:
systemctl list-timers | grep backup
# Should show nothing
```
The controller's built-in `db-dump` scheduler job at 02:30 replaces this timer entirely.
Having both run simultaneously can corrupt dump files mid-write.
### Verify
After deploying fixes:
1. Wait for cache refresh (5 minutes) or trigger a manual backup ("Mentés most")
2. Check `/backups` page — validation column should show "X tábla" for all databases
3. Check controller logs for `[DEBUG] ValidateDump` lines confirming validation ran
4. Verify no `[WARN] ValidateDump FAIL` lines in logs
---
## Post-fix checklist
### Hub (felhom.eu repo → hub/)
- [ ] `grep -rn 'time.Parse("2006-01-02 15:04:05"' hub/internal/store/` → 0 results
- [ ] `parseSQLiteTime` function exists in store.go
- [ ] `go build ./cmd/hub/` succeeds
- [ ] `go vet ./...` passes
- [ ] Build new image, deploy to k3s
- [ ] hub.felhom.eu shows OK status for demo-felhom
- [ ] Report history shows real timestamps
### Controller (deploy-felhom-compose repo → controller/)
- [ ] Template has 4-branch validation check (Valid / Error / zero-value guard)
- [ ] `RefreshCache` has cross-check re-validation logic
- [ ] `ValidateDump` has debug logging
- [ ] `backup-db-dump.timer` is disabled on demo-felhom
- [ ] `go build ./cmd/controller/` succeeds
- [ ] `go vet ./...` passes
- [ ] Build, deploy to demo-felhom
- [ ] Backup page shows table counts, not "Hiba"
- [ ] Controller logs show `[DEBUG] ValidateDump OK` entries
### Version bumps
- Hub: bump to next patch version
- Controller: include in v0.6.1 release (alongside the code review fixes from the other TASK.md)