diff --git a/TASK.md b/TASK.md index bd0fb9b..31fe913 100644 --- a/TASK.md +++ b/TASK.md @@ -1,257 +1,889 @@ -# TASK.md — Controller Refactoring: Template Split, Server Decomposition, Domain Rename +# TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups -> **Goal:** Improve code organization for maintainability and Claude Code efficiency. -> No new features — purely structural refactoring + one config rename. -> -> **Version bump:** v0.3.0 (structural refactor milestone) +> Version bump target: **v0.4.0** +> Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3) --- -## Task 1: Split templates.go → go:embed (HIGH PRIORITY) +## Overview -The `templates.go` file contains ALL HTML templates and CSS as Go string constants. -The file itself says: *"As the UI grows, switch to go:embed for easier editing."* -With 7 templates + full CSS, it's time. +Implement two major features in felhom-controller: -### 1.1 — Create template files directory +1. **Phase 2** — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration +2. **Phase 3** — Database dump engine, restic backup snapshots, dashboard status display -Create `controller/internal/web/templates/` with individual files: +Both phases share the scheduler infrastructure. Implement in order. -``` -internal/web/templates/ -├── layout.html ← from layoutTmpl const -├── dashboard.html ← from dashboardTmpl const -├── stacks.html ← from stacksTmpl const (the stacks list page, NOT the old dashboard list) -├── login.html ← from loginTmpl const -├── logs.html ← from logsTmpl const -├── deploy.html ← from deployTmpl const -├── app_info.html ← from appInfoTmpl const -└── style.css ← from cssTemplate const -``` +--- -Each `.html` file should contain ONLY the template content (the `{{define "name"}}...{{end}}` block). -Keep the existing template names (`layout_start`, `layout_end`, `dashboard`, `stacks`, `login`, etc.). 
+## Phase 2A — Scheduler (`internal/scheduler/`) -### 1.2 — Create embed.go +### Why first -Create `controller/internal/web/embed.go`: +main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min). +Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups. +All need a centralized, logged, observable job runner. Build it once, use everywhere. + +### Design: `internal/scheduler/scheduler.go` ```go -package web +package scheduler -import "embed" +type JobFunc func(ctx context.Context) error -//go:embed templates/*.html templates/*.css -var templateFS embed.FS -``` - -### 1.3 — Update template loading in server.go (or the new funcmap.go, see Task 2) - -Replace the current `loadTemplates()` method that parses the `allTemplates` const: - -```go -func (s *Server) loadTemplates() { - funcMap := template.FuncMap{ /* ... existing funcs ... */ } - - s.tmpl = template.Must( - template.New("").Funcs(funcMap).ParseFS(templateFS, "templates/*.html"), - ) +type Job struct { + Name string + Fn JobFunc + Interval time.Duration // for periodic jobs (every N) + Schedule string // for daily jobs ("02:30", "03:00") — mutually exclusive with Interval + LastRun time.Time + LastErr error + Running bool } + +type Scheduler struct { + jobs []*Job + logger *log.Logger + ctx context.Context + cancel context.CancelFunc + wg sync.WaitGroup +} + +func New(logger *log.Logger) *Scheduler +func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc) +func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) // "02:30" format, Europe/Budapest timezone +func (s *Scheduler) Start(ctx context.Context) +func (s *Scheduler) Stop() +func (s *Scheduler) GetJobs() []Job // for dashboard/API display (copy, not pointer) ``` -CSS serving: Instead of the `cssTemplate` const, read from `templateFS`: +### Interval jobs (`Every`) + +- Spawns a goroutine with `time.Ticker` +- Logs `[SCHED] Running job: ` at start, `[SCHED] Job completed (took 
Xs)` or `[SCHED] Job failed: (took Xs)` at end +- Updates `LastRun`, `LastErr`, `Running` fields (mutex-protected) +- Respects ctx.Done() for shutdown +- **Quiet mode for high-frequency jobs:** Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level. + +### Daily jobs (`Daily`) + +- Parses `timeStr` as "HH:MM" in `Europe/Budapest` timezone +- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed) +- After each run, sleeps until the next day's scheduled time +- Uses `time.After()` or `time.Timer`, NOT `time.Ticker` (handles DST transitions correctly) +- Same logging pattern as interval jobs +- Logs `[SCHED] Daily job scheduled for ` on registration + +### Edge cases + +- If `Daily` timeStr is invalid → log error at registration, don't start the job +- If a job panics → recover, log `[ERROR] Job panicked: `, mark as failed +- If a job is already running when the next tick fires → skip, log `[WARN] Job still running, skipping` +- Graceful shutdown: `Stop()` cancels context, `wg.Wait()` with 30s timeout for running jobs to finish + +### Integration in main.go + +Replace the two ad-hoc goroutines with: ```go -func (s *Server) serveCSSHandler(w http.ResponseWriter, r *http.Request) { - data, err := templateFS.ReadFile("templates/style.css") - if err != nil { - http.Error(w, "CSS not found", 500) - return +sched := scheduler.New(logger) + +// Existing periodic tasks (move from ad-hoc goroutines) +sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error { + return stackMgr.RefreshStatus() +}) +sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error { + return stackMgr.ScanStacks() +}) + +// Phase 2: System health ping (added below) +// Phase 3: DB dump, backup (added below) + +sched.Start(ctx) +defer sched.Stop() +``` + +Delete the two existing goroutines in main.go after migrating to the scheduler. 
+ +--- + +## Phase 2B — CPU & Temperature Metrics (`internal/system/`) + +### Current state + +`info.go` defines `SystemInfo` struct with memory + disk fields. +`info_linux.go` reads `/proc/meminfo` and `syscall.Statfs`. +`info_other.go` provides stubs for non-Linux. + +### New fields in `SystemInfo` + +```go +// Add to SystemInfo struct in info.go: +CPUPercent float64 `json:"cpu_percent"` // 0-100, averaged across all cores +LoadAvg1 float64 `json:"load_avg_1"` // 1-minute load average +LoadAvg5 float64 `json:"load_avg_5"` // 5-minute load average +LoadAvg15 float64 `json:"load_avg_15"` // 15-minute load average +TemperatureCelsius float64 `json:"temperature_celsius"` // CPU/SoC temperature +TemperatureSource string `json:"temperature_source"` // e.g. "thermal_zone0", "x86_pkg_temp" +``` + +### CPU measurement approach + +**Do NOT block `GetInfo()` with a delta calculation.** + +Use a lightweight `CPUCollector` that runs in a background goroutine: + +```go +// internal/system/cpu_linux.go (build tag: linux) + +type CPUCollector struct { + mu sync.RWMutex + cpuPercent float64 + sampleRate time.Duration // default: 5 seconds + cancel context.CancelFunc +} + +func NewCPUCollector(sampleRate time.Duration) *CPUCollector +func (c *CPUCollector) Start(ctx context.Context) +func (c *CPUCollector) Stop() +func (c *CPUCollector) CPUPercent() float64 // returns latest sample +``` + +How it works: +1. Reads `/proc/stat` first line: `cpu ` +2. Sleeps `sampleRate` (5s) +3. Reads again, computes delta: `busy = delta(user+nice+system+irq+softirq+steal)`, `total = busy + delta(idle+iowait)` +4. `cpuPercent = (busy / total) * 100` +5. Stores result, loops + +Parsing `/proc/stat`: +``` +cpu 1234 56 789 45678 123 45 67 0 0 0 +``` +Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). +Sum all = total. idle + iowait = idle_total. busy = total - idle_total. 
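The delta calculation above can be sketched as two small pure functions. The names `cpuSample`, `parseCPULine`, and `cpuPercent` are illustrative assumptions, not existing identifiers; a real `CPUCollector` would call them from its sampling loop and store the result under its mutex.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative jiffies taken from the aggregate "cpu" line.
type cpuSample struct {
	busy, total uint64
}

// parseCPULine parses the first line of /proc/stat. It reads the eight
// fields named in the spec: user nice system idle iowait irq softirq steal.
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 9 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var v [8]uint64
	for i := range v {
		n, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("parse field %d of /proc/stat: %w", i+1, err)
		}
		v[i] = n
	}
	var total uint64
	for _, n := range v {
		total += n
	}
	idle := v[3] + v[4] // idle + iowait
	return cpuSample{busy: total - idle, total: total}, nil
}

// cpuPercent returns busy time as a percentage of elapsed time between two
// samples. It returns 0 if the counters did not advance.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0
	}
	return float64(cur.busy-prev.busy) / float64(cur.total-prev.total) * 100
}
```

Keeping the parsing separate from the goroutine makes the arithmetic testable with fixed strings, without reading `/proc` at all.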
+ +**IMPORTANT: Inside a Docker container, `/proc/stat` reflects the HOST CPU** (unless CPU cgroups are applied with limits). So the controller's own `/proc/stat` works. + +### Load average + +Read from `/proc/loadavg` (instant, no delta needed): +``` +0.15 0.10 0.05 1/234 56789 +``` +First three fields are 1/5/15 minute load averages. Parse with `fmt.Sscanf`. + +Add `readLoadAvg(info *SystemInfo)` in `info_linux.go`. + +### Temperature + +Read from `/sys/class/thermal/thermal_zone*/temp`: + +**IMPORTANT**: The controller runs in a Docker container. `/sys` is NOT available by default. We mount the host's `/sys` at `/host/sys` inside the container (see docker-compose.yml changes below). + +```go +// internal/system/info_linux.go — add readTemperature(info *SystemInfo) +``` + +Algorithm: +1. Try `/host/sys/class/thermal/thermal_zone*/temp` first (Docker mount) +2. Fallback to `/sys/class/thermal/thermal_zone*/temp` (native/development) +3. For each zone, also read the `type` file for the label +4. Pick the highest temperature (usually `thermal_zone0` or `x86_pkg_temp`) +5. Value is in millidegrees Celsius → divide by 1000.0 +6. Store the zone type as `TemperatureSource` +7. If no thermal zones found: try `/host/sys/class/hwmon/hwmon*/temp1_input` as fallback (same millidegree format) +8. If nothing found: leave fields as zero (dashboard hides temperature when 0) + +### `GetInfo()` signature change + +```go +// Current: +func GetInfo(hddPath string) SystemInfo +// New: +func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo +``` + +Inside `GetInfo()`: +1. Existing: `readMemInfo(&info)`, `readDiskUsage(...)` — unchanged +2. New: `readLoadAvg(&info)` +3. New: `readTemperature(&info)` +4. New: `if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }` + +The `info_other.go` stub accepts the parameter but ignores it (returns empty SystemInfo as before). 
+ +### CPU collector lifecycle + +Started in `main.go`: + +```go +cpuCollector := system.NewCPUCollector(5 * time.Second) +cpuCollector.Start(ctx) +defer cpuCollector.Stop() +``` + +Passed to `web.NewServer()` and `api.NewRouter()` which pass it to `system.GetInfo()` calls. + +### Dashboard display + +Extend the existing system info bar in `dashboard.html`: + +Current layout: +``` +| Memória | ████████░░ 72% | SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% | +``` + +New layout: +``` +| Memória | ████████░░ 72% | CPU | ██░░░░░░░░ 15% | Hőmérséklet | 52°C | +| SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% | +``` + +Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component. + +Temperature display: +- Show as text "52°C" with colored dot (green/yellow/red) +- Green: < 60°C +- Yellow: 60-75°C +- Red: > 75°C +- If temperature is 0 (unavailable): hide entirely + +CPU progress bar: +- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85% + +Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1" + +--- + +## Phase 2C — Healthchecks.io Ping Integration (`internal/monitor/`) + +### Design: `internal/monitor/pinger.go` + +```go +package monitor + +type Pinger struct { + baseURL string // e.g. 
"https://status.felhom.eu" + httpClient *http.Client + logger *log.Logger + enabled bool +} + +func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger + +// Ping sends a success signal with optional diagnostic body +func (p *Pinger) Ping(uuid string, body string) error + +// Fail sends a failure signal with diagnostic body +func (p *Pinger) Fail(uuid string, body string) error + +// Start sends a "job started" signal (for duration tracking) +func (p *Pinger) Start(uuid string) error +``` + +### HTTP protocol + +- Success: `POST {baseURL}/ping/{uuid}` with body as request body +- Failure: `POST {baseURL}/ping/{uuid}/fail` with body +- Start: `POST {baseURL}/ping/{uuid}/start` +- Timeout: 10 seconds +- Retry: 3 attempts with 2s backoff between retries +- If `uuid` is empty or starts with "CHANGEME" → skip silently (log at debug level only) +- If `enabled` is false → skip all pings +- **Never let ping failures affect the main operation** — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows. + +### Design: `internal/monitor/healthcheck.go` + +```go +// RunHealthCheck runs system checks and returns a diagnostic report. +type HealthReport struct { + Status string // "ok", "warn", "fail" + Issues []string // critical problems + Warnings []string // non-critical warnings + Info []string // informational items + Timestamp time.Time +} + +func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport +func (r *HealthReport) FormatMessage() string // human-readable summary for HC ping body +``` + +Checks to run (replicating backup-healthcheck.sh logic in Go): +1. **Disk usage**: Read from `system.GetInfo()`. Compare against thresholds (`disk_warn_percent`, `disk_crit_percent`). +2. **Memory usage**: Same source. Warn if above `memory_warn_percent`. +3. **CPU usage**: From collector. Warn if above `cpu_warn_percent`. +4. **Temperature**: From `system.GetInfo()`. 
Warn if above `temperature_warn_celsius`. +5. **Docker health**: Verify Docker daemon is reachable by running `docker info` (quick exec check). +6. **Protected containers**: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state. + +Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok". + +### Scheduler integration + +```go +// In main.go: +pinger := monitor.NewPinger(&cfg.Monitoring, logger) +healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth + +// Parse system_health_interval (default "5m") +healthInterval, _ := time.ParseDuration(cfg.Monitoring.SystemHealthInterval) + +sched.Every("system-health", healthInterval, func(ctx context.Context) error { + report := monitor.RunHealthCheck(cfg, cpuCollector) + body := report.FormatMessage() + + if report.Status == "fail" { + pinger.Fail(healthUUID, body) + } else { + pinger.Ping(healthUUID, body) } - w.Header().Set("Content-Type", "text/css; charset=utf-8") - w.Header().Set("Cache-Control", "public, max-age=3600") - w.Write(data) + return nil // never fail the scheduler job due to ping errors +}) +``` + +### Config changes + +Add to `MonitoringConfig`: +```go +SystemHealthInterval string `yaml:"system_health_interval"` +``` + +Default in `applyDefaults()`: `"5m"` + +--- + +## Phase 3A — Database Dump Engine (`internal/backup/dbdump.go`) + +### Approach: Auto-discover from running Docker containers + +Replicates the proven logic from `backup-db-dump.sh` in Go: + +```go +package backup + +type DBType string +const ( + DBTypePostgres DBType = "postgres" + DBTypeMariaDB DBType = "mariadb" +) + +type DiscoveredDB struct { + ContainerName string + ContainerID string + DBType DBType + DBUser string + DBName string + StackName string // derived from container name +} + +type DumpResult struct { + DB DiscoveredDB + FilePath string + Size int64 + Duration time.Duration + Error error +} + +func DiscoverDatabases(ctx context.Context, logger 
*log.Logger) ([]DiscoveredDB, error) +func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult +func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult +``` + +### Discovery logic + +Run `docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running`. + +For each running container, check image name: +- Contains `postgres` → DBTypePostgres +- Contains `mariadb` or `mysql` → DBTypeMariaDB + +Then for each DB container, get env vars via: +`docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container>` + +Parse env vars: +- **PostgreSQL**: `POSTGRES_USER` (default: "postgres"), `POSTGRES_DB` (default: same as POSTGRES_USER) +- **MariaDB**: `MYSQL_ROOT_PASSWORD` (or `MARIADB_ROOT_PASSWORD`), `MYSQL_DATABASE` (or `MARIADB_DATABASE`) + +Derive stack name from container name by stripping common DB suffixes: +- `paperless-ngx-postgres` → `paperless-ngx` +- `romm-db` → `romm` +- `immich-postgres` → `immich` +- Logic: split on `-`, check if last segment is a known suffix (`postgres`, `db`, `mariadb`, `mysql`, `database`, `redis`, `cache`), if so remove it + +### Dump execution + +**PostgreSQL:** +```bash +docker exec <container> pg_dump -U <db_user> -d <db_name> --clean --if-exists --no-owner --no-privileges +``` + +**MariaDB:** +```bash +docker exec <container> mariadb-dump -u root -p<root_password> --single-transaction --routines --triggers <db_name> +``` + +**IMPORTANT: Use `docker exec` to run dump commands INSIDE the DB container.** Do NOT use pg_dump/mysqldump from the controller container: version mismatches between the controller's client and the DB server will cause failures. 
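The stack-name rule from the discovery logic above (drop the last `-` segment when it is a known DB suffix) might be implemented along these lines. `deriveStackName` and `dbSuffixes` are hypothetical names; the suffix list is copied from the spec, and trimming a leading `/` is an extra assumption for names that come from `docker inspect` rather than `docker ps`.

```go
package main

import "strings"

// dbSuffixes lists the trailing name segments that mark a database or
// cache sidecar, as enumerated in the spec.
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// deriveStackName strips one known DB suffix from a container name.
// Names without a hyphen or without a known suffix are returned unchanged.
func deriveStackName(containerName string) string {
	name := strings.TrimPrefix(containerName, "/")
	idx := strings.LastIndex(name, "-")
	if idx <= 0 {
		return name
	}
	if dbSuffixes[strings.ToLower(name[idx+1:])] {
		return name[:idx]
	}
	return name
}
```

Only the last segment is removed, so a name such as `paperless-ngx-postgres` keeps its inner hyphen and yields `paperless-ngx`.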
+ +Output handling: +- Use `os/exec.Command("docker", "exec", ...)` with `cmd.Stdout` piped to a temp file +- Write to `{dumpDir}/{stackName}-{dbtype}.sql.tmp` during dump +- Rename `.tmp` → `.sql` on success only +- Delete `.tmp` on failure +- Set 5-minute timeout per dump via `context.WithTimeout` + +### Gotchas and edge cases + +- **MariaDB password from container env:** Never log the password. Use `docker inspect` to read `MYSQL_ROOT_PASSWORD` or `MARIADB_ROOT_PASSWORD`. +- **Empty/zero-size dumps:** Check dump file size after writing. If 0 bytes → treat as failure. +- **Dump file naming:** `{stackName}-{dbtype}.sql` (e.g., `paperless-ngx-postgres.sql`). Overwrite previous dump each run (restic handles versioning). +- **Old tmp cleanup:** Delete `.tmp` files older than 1 hour on each run (leftover from crashed dumps). +- **Skip infrastructure DBs:** Don't dump databases from protected stacks (if any have DBs in the future). +- **Container not running:** If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump). + +### Dump directory + +`/srv/backups/db-dumps/` — configured in `controller.yaml` as `paths.db_dump_dir`. +Already mounted in docker-compose.yml via `/srv/backups:/srv/backups`. + +The user does NOT see this directory (not in FileBrowser, not on HDD). 
+ +--- + +## Phase 3B — Restic Integration (`internal/backup/restic.go`) + +### Design + +```go +type ResticManager struct { + repoPath string + passwordFile string + logger *log.Logger + customerID string + cacheDir string +} + +func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager + +func (r *ResticManager) EnsureInitialized() error +func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error) +func (r *ResticManager) Prune(retention config.RetentionConfig) error +func (r *ResticManager) Check() error +func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error) +func (r *ResticManager) Stats() (*RepoStats, error) + +type SnapshotResult struct { + SnapshotID string + FilesNew int + FilesChanged int + DataAdded string // human-readable + Duration time.Duration +} + +type SnapshotInfo struct { + ID string + Time time.Time + Paths []string + Tags []string +} + +type RepoStats struct { + TotalSize string + SnapshotCount int + LatestSnapshot *SnapshotInfo } ``` -Register this handler for `/static/style.css` in `ServeHTTP` (replace the current inline CSS serving). +### Restic commands (all via `os/exec`) -### 1.4 — Delete old string constants +All commands set these env vars: +```go +cmd.Env = append(os.Environ(), + "RESTIC_REPOSITORY="+r.repoPath, + "RESTIC_PASSWORD_FILE="+r.passwordFile, + "RESTIC_CACHE_DIR="+r.cacheDir, +) +``` -Remove from `templates.go`: -- `const allTemplates = ...` -- `const layoutTmpl = ...` -- `const dashboardTmpl = ...` -- `const stacksTmpl = ...` -- `const loginTmpl = ...` -- `const logsTmpl = ...` -- `const deployTmpl = ...` -- `const appInfoTmpl = ...` -- `const cssTemplate = ...` +**`RESTIC_CACHE_DIR`** must be set to `/opt/docker/felhom-controller/data/restic-cache` (inside the controller-data Docker volume). Without this, restic defaults to `~/.cache/restic` which may not persist across container restarts. 
-After this, `templates.go` should either be empty (delete it) or contain only the -felhom logo SVG constant if that's still embedded as a string (keep that one — it's small). +**Init** (idempotent): +- Check if `{repoPath}/config` file exists → if so, already initialized, skip +- Otherwise: `restic init` -### 1.5 — Verify the build +**Snapshot:** +```bash +restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \ + --tag felhom --tag --host +``` -- `go build ./cmd/controller/` must succeed -- `go:embed` requires Go 1.16+ (we're on 1.22, fine) -- Templates are still compiled into the binary — zero runtime file dependencies (same as before) -- Verify that the HTML files actually include the `{{define "name"}}...{{end}}` wrappers - (ParseFS needs them to register template names) +What gets backed up (v1): +- `/opt/docker/stacks/` — compose files, .felhom.yml, app.yaml (deploy configs with secrets) +- `/srv/backups/db-dumps/` — SQL dumps (from the DB dump step) +- `/opt/docker/felhom-controller/controller.yaml` — controller config -### Important notes +**NOT backed up in v1:** +- HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy +- Docker volumes directly — critical data covered by DB dumps -- The `` in layout.html already exists, - so CSS loading via the `/static/style.css` route should already work — just make sure - the handler reads from embed.FS instead of serving the const. -- The felhom logo SVG can stay as a Go const (it's small) or move to `templates/felhom-logo.svg` - and be served from embed.FS too. Either approach is fine. 
+Parse snapshot output (restic `backup` with `--json` writes JSON lines to stdout; the summary is the last line): +```json +{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"} +``` + +**Prune:** +```bash +restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune +``` + +**Check:** +```bash +restic check +``` + +**Latest snapshot:** +```bash +restic snapshots --latest 1 --json +``` +Returns JSON array with snapshot objects. + +**Stats (repo size):** +```bash +restic stats --json +``` + +### Password auto-generation + +On startup, `EnsureInitialized()` checks if the password file exists. If not: +1. Generate 32 random bytes, base64url-encode +2. Write to `r.passwordFile` (the controller-data volume path) +3. Log `[INFO] Generated new restic repository password at <path>` +4. Log `[WARN] Save this password externally: losing it means losing access to ALL backups` + +### Gotchas + +- **restic is already in the Docker image** (Dockerfile installs it). No additional setup. +- **Locking:** Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error; add `restic unlock` to the error handling path with a log warning. +- **Timeout:** 30-minute timeout for snapshot operations, enforced with a context deadline. +- **Large repos:** First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates). +- **restic JSON output:** Use `--json` for machine-parseable output. With `--json`, restic writes its JSON messages to **stdout**, not stderr. For `backup --json`, the object with `message_type: "summary"` is the last JSON line on stdout, so parse that line and treat stderr as error output only. 
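Extracting the summary object from the stdout of `restic backup --json` could be done with a small line scanner like the one below. This is a sketch, not the controller's actual parser: `backupSummary` and `parseBackupSummary` are hypothetical names, and only the fields shown in the example JSON above are mapped.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
)

// backupSummary mirrors the fields from the summary example in the spec.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    int64  `json:"data_added"`
	SnapshotID   string `json:"snapshot_id"`
}

// parseBackupSummary scans captured stdout line by line and returns the
// last message whose message_type is "summary". Lines that are not valid
// JSON objects, and other message types, are skipped.
func parseBackupSummary(stdout []byte) (*backupSummary, error) {
	var found *backupSummary
	for _, line := range bytes.Split(stdout, []byte("\n")) {
		line = bytes.TrimSpace(line)
		if len(line) == 0 {
			continue
		}
		var msg backupSummary
		if err := json.Unmarshal(line, &msg); err != nil {
			continue
		}
		if msg.MessageType == "summary" {
			found = &msg
		}
	}
	if found == nil {
		return nil, errors.New("no summary message in restic output")
	}
	return found, nil
}
```

The byte count in `DataAdded` can then be formatted into the human-readable `DataAdded string` of `SnapshotResult` at the call site.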
--- -## Task 2: Split server.go into focused files (MEDIUM PRIORITY) +## Phase 3C — Backup Orchestrator (`internal/backup/backup.go`) -Currently `server.go` handles: Server struct, auth/sessions, page handlers, template FuncMap, -asset serving, and HTTP routing. Split into: +### Design -### 2.1 — Create `auth.go` +```go +type Manager struct { + cfg *config.Config + restic *ResticManager + logger *log.Logger + pinger *monitor.Pinger -Move from `server.go` to `internal/web/auth.go`: -- `type session struct` -- `const sessionCookieName`, `const sessionMaxAge` -- `RequireAuth()` middleware method -- `loginHandler()`, `loginPostHandler()`, `logoutHandler()` -- `createSession()`, `isValidSession()`, `cleanupSessions()` -- `renderLogin()` helper + mu sync.Mutex + lastDBDump *DBDumpStatus + lastBackup *BackupStatus +} -### 2.2 — Create `handlers.go` +type DBDumpStatus struct { + LastRun time.Time + Results []DumpResult + Success bool + Duration time.Duration +} -Move from `server.go` to `internal/web/handlers.go`: -- `baseData()` helper -- `dashboardHandler()` -- `stacksHandler()` -- `deployPageHandler()` -- `deployPagePostHandler()` (if it exists as separate handler) -- `appDetailHandler()` -- `logsPageHandler()` +type BackupStatus struct { + LastRun time.Time + Snapshot *SnapshotResult + Success bool + Duration time.Duration + RepoStats *RepoStats +} -### 2.3 — Create `funcmap.go` +func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager +func (m *Manager) RunDBDumps(ctx context.Context) error +func (m *Manager) RunBackup(ctx context.Context) error +func (m *Manager) RunFullBackup(ctx context.Context) error // dumps + snapshot + optional prune +func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus) +func (m *Manager) GetRepoStats() (*RepoStats, error) +``` -Move from `server.go` to `internal/web/funcmap.go`: -- The entire `template.FuncMap` definition from `loadTemplates()` -- Extract it as a standalone function: `func (s *Server) 
templateFuncMap() template.FuncMap` -- Then `loadTemplates()` becomes a clean 3-liner calling `templateFuncMap()` + `ParseFS` +### Full backup flow (daily scheduled) -### 2.4 — Keep in server.go +1. **DB dumps:** `DiscoverDatabases()` → `DumpAll()` → update `lastDBDump` status +2. Ping Healthchecks for DB dump result: `pinger.Ping/Fail(dbDumpUUID, summary)` +3. **Restic snapshot:** `restic.EnsureInitialized()` → `restic.Snapshot(paths, tags)` +4. **Prune (weekly):** Check day of week against `prune_schedule` config. If match → `restic.Prune(retention)` + `restic.Check()` +5. Ping Healthchecks for backup result: `pinger.Ping/Fail(backupUUID, summary)` +6. Update `lastBackup` status -After the split, `server.go` should contain only: -- `type Server struct` -- `func NewServer()` -- `func (s *Server) loadTemplates()` (now a 3-liner) -- `func (s *Server) ServeHTTP()` (HTTP routing dispatch) -- `func (s *Server) render()` helper -- Static file/asset serving handlers (`serveStaticFile`, `serveCSSHandler`, `serveLogoHandler`) +### Scheduler integration -### 2.5 — Verify the split +```go +// In main.go: +backupMgr := backup.NewManager(cfg, pinger, logger) -All files are in `package web` — no import changes needed within the package. -The `Server` struct and all its methods are accessible across files in the same package. +if cfg.Backup.Enabled { + sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error { + return backupMgr.RunDBDumps(ctx) + }) -Run `go build ./cmd/controller/` to verify everything compiles. 
+ sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error { + return backupMgr.RunBackup(ctx) + }) +} +``` + +### Dashboard display + +Add "Biztonsági mentés" (Backup) section to `dashboard.html`: + +``` +╔══════════════════════════════════════════╗ +║ 🛡️ Biztonsági mentés ║ +╠══════════════════════════════════════════╣ +║ Utolsó mentés: 2026-02-15 03:01 ✅ ║ +║ Adatbázisok: 2 mentve (12.3 MB) ║ +║ Tároló méret: 45.2 MB (23 pillanatkép) ║ +║ Következő: ma 03:00 ║ +║ ║ +║ [Mentés most] ║ +╚══════════════════════════════════════════╝ +``` + +Hungarian labels: +- "Biztonsági mentés" = Backup +- "Utolsó mentés" = Last backup +- "Adatbázisok" = Databases +- "mentve" = backed up +- "Tároló méret" = Repository size +- "pillanatkép" = snapshot(s) +- "Következő" = Next +- "Mentés most" = Backup now + +Status colors: +- Green ✅: Last backup successful and less than `backup_max_age_hours` old +- Yellow ⚠️: Last backup successful but older than expected +- Red ❌: Last backup failed or no backups exist yet +- Gray: Backup not configured (`backup.enabled: false`) + +If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured). + +### API endpoints + +Add to `api/router.go`: + +``` +GET /api/backup/status → backup manager status + repo stats +POST /api/backup/run → trigger immediate full backup (async) +``` + +`POST /api/backup/run` starts the backup in a background goroutine, returns immediately with `{"ok": true, "message": "Mentés elindítva"}`. The dashboard can poll `/api/backup/status` to track progress. 
--- -## Task 3: Rename controller domain from `dashboard.*` to `felhom.*` (LOW PRIORITY) - -### 3.1 — Update controller's docker-compose.yml - -In `controller/docker-compose.yml`, change the Traefik label: +## Docker-compose.yml final state ```yaml -# OLD: -- "traefik.http.routers.controller.rule=Host(`dashboard.${DOMAIN}`)" -# NEW: -- "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)" +services: + felhom-controller: + image: gitea.dooplex.hu/admin/felhom-controller:latest + container_name: felhom-controller + restart: unless-stopped + ports: + - "8080:8080" + volumes: + # Docker socket — required for compose operations + DB dumps (docker exec) + - /var/run/docker.sock:/var/run/docker.sock:ro + # Controller config + - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro + # Controller persistent data (sessions, restic cache, restic password) + - controller-data:/opt/docker/felhom-controller/data + # Stack compose files (read + write for git sync) + - /opt/docker/stacks:/opt/docker/stacks + # Backup directories (restic repo + db dumps) + - /srv/backups:/srv/backups + # HDD mount (if available, for monitoring disk usage) + - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro + # Host /sys — for CPU temperature reading (read-only) + - /sys:/host/sys:ro + environment: + - TZ=Europe/Budapest + labels: + - "traefik.enable=true" + - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)" + - "traefik.http.routers.controller.entrypoints=websecure" + - "traefik.http.routers.controller.tls=true" + - "traefik.http.services.controller.loadbalancer.server.port=8080" + - "traefik.docker.network=traefik-public" + - "felhom.managed=true" + - "felhom.component=controller" + networks: + - traefik-public + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"] + interval: 30s + timeout: 5s + start_period: 10s + retries: 3 + +volumes: + controller-data: + +networks: + 
traefik-public: + external: true ``` -### 3.2 — Update docker-setup.sh - -In `controller/scripts/docker-setup.sh`, update `print_summary()` output: -- Any reference to `dashboard.${BASE_DOMAIN}` → `felhom.${BASE_DOMAIN}` - -Also check install_controller() if it generates compose files or prints URLs. - -### 3.3 — Update controller.yaml.example - -If there's any reference to `dashboard.*` in the example config or comments, update to `felhom.*`. - -### 3.4 — Update documentation - -In CLAUDE.md build/deploy workflow sections, update any `dashboard.` references to `felhom.`. - -### 3.5 — Cloudflare Tunnel public hostname (MANUAL — not code) - -**Reminder for Viktor:** After deploying, manually update the Cloudflare Tunnel -public hostname in the Zero Trust dashboard: -- Old: `dashboard.demo-felhom.eu` → Traefik -- New: `felhom.demo-felhom.eu` → Traefik - -### 3.6 — Pi-hole DNS (MANUAL — not code) - -**Reminder for Viktor:** If there's a pi-hole local DNS record for `dashboard.demo-felhom.eu`, -update it to `felhom.demo-felhom.eu` (or rely on the wildcard `*.demo-felhom.eu` record). +Changes from current: +1. **Added:** `/sys:/host/sys:ro` — for temperature reading +2. 
**Removed:** dedicated restic-password bind mount (password now in controller-data volume) --- -## Task 4: Update documentation +## Config changes summary -### 4.1 — README.md +### `controller.yaml.example` updates -- Update directory structure to show `internal/web/templates/` directory -- Update "Tech stack" section: "Templates: go:embed HTML files" instead of "Go string constants" -- Mention the `felhom.*` subdomain for the controller -- Update file tree showing the new files (embed.go, auth.go, handlers.go, funcmap.go) +```yaml +monitoring: + system_health_interval: "5m" # NEW field -### 4.2 — CLAUDE.md +backup: + restic_password_file: "/opt/docker/felhom-controller/data/restic-password" # CHANGED default path +``` -- Update workspace layout to reflect the new `internal/web/` file structure -- Update "Tech stack" section -- Update any `dashboard.*` references to `felhom.*` -- Note the go:embed pattern for future template additions +### `config.go` updates -### 4.3 — CONTEXT.md - -- Add session entry documenting the refactoring -- Note: templates moved from Go string constants to go:embed HTML files -- Note: server.go split into auth.go, handlers.go, funcmap.go -- Note: controller domain changed from dashboard.* to felhom.* -- Update version to v0.3.0 - -### 4.4 — BUILDING.md - -- Update the structure check in build.sh verification section if needed - (the `internal/web/templates/` directory should exist now) +- Add `SystemHealthInterval string` to `MonitoringConfig` +- Default: `"5m"` in `applyDefaults()` +- Change `restic_password_file` default from `/opt/docker/felhom-controller/restic-password` to `/opt/docker/felhom-controller/data/restic-password` +- Add env override: `FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL` --- -## Implementation order +## Implementation Order -1. **Task 1** (templates.go → go:embed) — do this first, biggest impact -2. **Task 2** (server.go split) — do this second, leverages the cleaner templates -3. 
**Task 3** (domain rename) — small, do last -4. **Task 4** (docs) — update after all code changes +### Step 1: Scheduler +1. Create `internal/scheduler/scheduler.go` +2. Implement `Every()` and `Daily()` with logging, panic recovery, skip-if-running +3. Migrate the two existing goroutines from `main.go` to scheduler +4. **Build and verify** — behavior should be identical, logs should show `[SCHED]` entries -## Verification checklist +### Step 2: CPU & Temperature metrics +1. Create `internal/system/cpu_linux.go` + `cpu_other.go` (build tags) +2. Add `readLoadAvg()` and `readTemperature()` to `info_linux.go` +3. Extend `SystemInfo` struct in `info.go` +4. Update `GetInfo()` signature in all files to accept `*CPUCollector` +5. Start CPUCollector in `main.go`, pass to web server and API router +6. Update `docker-compose.yml` — add `/sys:/host/sys:ro` +7. Update `dashboard.html` — show CPU, load, temperature +8. Update `style.css` if needed for new display elements +9. **Build, deploy, verify** — new metrics visible on dashboard -- [ ] `go build ./cmd/controller/` compiles successfully -- [ ] All 7 HTML templates render correctly (login, dashboard, stacks, deploy, app_info, logs, layout) -- [ ] CSS loads at `/static/style.css` -- [ ] Felhom logo SVG loads at `/static/felhom-logo.svg` -- [ ] App logos/screenshots still serve from `/assets/` -- [ ] Auth (login/logout/session) works unchanged -- [ ] Stack operations (start/stop/deploy) work unchanged -- [ ] Controller accessible at `felhom.demo-felhom.eu` (after CF tunnel update) -- [ ] No broken links or template errors in browser console -- [ ] Build + push via build.sh works -- [ ] Deploy on demo-felhom works \ No newline at end of file +### Step 3: Healthchecks pinger + health checks +1. Create `internal/monitor/pinger.go` +2. Create `internal/monitor/healthcheck.go` +3. Add `system_health_interval` to config +4. Add system health ping job to scheduler in `main.go` +5. 
**Build, deploy** — check controller logs for health check runs + +### Step 4: Database dump engine +1. Create `internal/backup/dbdump.go` +2. Implement discovery + dump functions +3. Wire up `RunDBDumps` temporarily to a test endpoint or manual scheduler trigger for testing +4. **Build, deploy, verify** — dumps should appear in `/srv/backups/db-dumps/` for paperless-ngx-postgres + +### Step 5: Restic integration +1. Create `internal/backup/restic.go` +2. Implement init, snapshot, prune, check, stats +3. Auto-generate restic password if missing +4. Update docker-compose.yml (remove restic-password bind mount) +5. **Build, deploy, verify** — repo initialized, password generated + +### Step 6: Backup orchestrator + dashboard +1. Create `internal/backup/backup.go` +2. Wire up scheduler daily jobs (DB dump + backup) +3. Add API endpoints (`/api/backup/status`, `/api/backup/run`) +4. Add backup status section to `dashboard.html` +5. Add "Mentés most" button +6. **Build, deploy, verify full flow** + +### Step 7: Documentation & cleanup +1. Update `README.md` — Phase 2 and 3 checked off, new module descriptions +2. Update `CONTEXT.md` with session summary +3. Update `CLAUDE.md` if workflow changes +4. 
Version bump in build: `v0.4.0` + +--- + +## Verification Checklist + +After deployment, verify each item: + +- [ ] `docker ps` shows controller healthy +- [ ] Dashboard loads with CPU %, load average, temperature displayed +- [ ] Temperature shows realistic value (30-60°C idle for N100) +- [ ] CPU % updates (not stuck at 0) +- [ ] `/api/system/info` returns all new fields (cpu_percent, load_avg_*, temperature_*) +- [ ] Scheduler logs show `[SCHED]` entries for all registered jobs +- [ ] If HC UUIDs configured: pings visible in status.felhom.eu dashboard +- [ ] DB dump discovers paperless-ngx postgres container +- [ ] Dump file exists: `/srv/backups/db-dumps/paperless-ngx-postgres.sql` +- [ ] Restic repo initialized: `/srv/backups/restic-repo/config` exists +- [ ] Restic password auto-generated: `/opt/docker/felhom-controller/data/restic-password` exists +- [ ] "Mentés most" button triggers backup successfully +- [ ] Dashboard shows backup status section with last backup time +- [ ] All existing features still work (start/stop/deploy/update/logs/auth) + +--- + +## New files to create + +``` +internal/scheduler/scheduler.go +internal/monitor/pinger.go +internal/monitor/healthcheck.go +internal/backup/dbdump.go +internal/backup/restic.go +internal/backup/backup.go +internal/system/cpu_linux.go +internal/system/cpu_other.go +``` + +## Existing files to modify + +``` +internal/system/info.go — new SystemInfo fields +internal/system/info_linux.go — readLoadAvg(), readTemperature(), GetInfo() signature +internal/system/info_other.go — GetInfo() signature update +internal/config/config.go — SystemHealthInterval, updated defaults +internal/api/router.go — backup endpoints, cpuCollector parameter +internal/web/server.go — accept cpuCollector, backupMgr +internal/web/handlers.go — pass cpuCollector/backupMgr to dashboard +internal/web/templates/dashboard.html — CPU/temp bars, backup status section +internal/web/templates/style.css — styles for new elements 
+cmd/controller/main.go — scheduler, cpuCollector, pinger, backupMgr wiring +controller/docker-compose.yml — /sys mount, remove restic-password mount +configs/controller.yaml.example — new fields, updated defaults +``` + +--- + +## Manual steps after deployment (for Viktor) + +1. **Verify /sys mount:** `docker exec felhom-controller ls /host/sys/class/thermal/` — should show thermal_zone directories +2. **Healthchecks setup:** Create project + 3 checks in status.felhom.eu for demo-felhom: + - `system-health` (period: 10m, grace: 10m) + - `db-dump` (period: 24h, grace: 1h) + - `backup` (period: 24h, grace: 1h) +3. **Update controller.yaml:** Add the three ping UUIDs +4. **Verify restic password:** `docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password` +5. **Test restore procedure:** + ```bash + docker exec felhom-controller restic -r /srv/backups/restic-repo \ + --password-file /opt/docker/felhom-controller/data/restic-password snapshots + ``` +6. **Save restic password externally** — losing it means losing access to all backups \ No newline at end of file