# TASK.md - Phase 2: Monitoring & Health + Phase 3: Backups

> Version bump target: **v0.4.0**
> Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)

---

## Overview

Implement two major features in felhom-controller:

1. **Phase 2**: Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
2. **Phase 3**: Database dump engine, restic backup snapshots, dashboard status display

Both phases share the scheduler infrastructure. Implement in order.

---

## Phase 2A: Scheduler (`internal/scheduler/`)

### Why first

main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min). Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups. All need a centralized, logged, observable job runner. Build it once, use everywhere.

### Design: `internal/scheduler/scheduler.go`

```go
package scheduler

type JobFunc func(ctx context.Context) error

type Job struct {
    Name     string
    Fn       JobFunc
    Interval time.Duration // for periodic jobs (every N)
    Schedule string        // for daily jobs ("02:30", "03:00"); mutually exclusive with Interval
    LastRun  time.Time
    LastErr  error
    Running  bool
}

type Scheduler struct {
    jobs   []*Job
    logger *log.Logger
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job // for dashboard/API display (copy, not pointer)
```

### Interval jobs (`Every`)

- Spawns a goroutine with `time.Ticker`
- Logs `[SCHED] Running job: ` at start, `[SCHED] Job completed (took Xs)` or `[SCHED] Job failed: (took Xs)` at end
- Updates `LastRun`, `LastErr`, `Running` fields (mutex-protected)
- Respects ctx.Done() for shutdown
- **Quiet mode for high-frequency jobs:** Jobs that
run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.

### Daily jobs (`Daily`)

- Parses `timeStr` as "HH:MM" in `Europe/Budapest` timezone
- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
- After each run, sleeps until the next day's scheduled time
- Uses `time.After()` or `time.Timer`, NOT `time.Ticker` (handles DST transitions correctly)
- Same logging pattern as interval jobs
- Logs `[SCHED] Daily job scheduled for ` on registration

### Edge cases

- If `Daily` timeStr is invalid → log error at registration, don't start the job
- If a job panics → recover, log `[ERROR] Job panicked: `, mark as failed
- If a job is already running when the next tick fires → skip, log `[WARN] Job still running, skipping`
- Graceful shutdown: `Stop()` cancels context, `wg.Wait()` with 30s timeout for running jobs to finish

### Integration in main.go

Replace the two ad-hoc goroutines with:

```go
sched := scheduler.New(logger)

// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
    return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
    return stackMgr.ScanStacks()
})

// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)

sched.Start(ctx)
defer sched.Stop()
```

Delete the two existing goroutines in main.go after migrating to the scheduler.

---

## Phase 2B: CPU & Temperature Metrics (`internal/system/`)

### Current state

`info.go` defines `SystemInfo` struct with memory + disk fields. `info_linux.go` reads `/proc/meminfo` and `syscall.Statfs`. `info_other.go` provides stubs for non-Linux.
### New fields in `SystemInfo`

```go
// Add to SystemInfo struct in info.go:
CPUPercent         float64 `json:"cpu_percent"`         // 0-100, averaged across all cores
LoadAvg1           float64 `json:"load_avg_1"`          // 1-minute load average
LoadAvg5           float64 `json:"load_avg_5"`          // 5-minute load average
LoadAvg15          float64 `json:"load_avg_15"`         // 15-minute load average
TemperatureCelsius float64 `json:"temperature_celsius"` // CPU/SoC temperature
TemperatureSource  string  `json:"temperature_source"`  // e.g. "thermal_zone0", "x86_pkg_temp"
```

### CPU measurement approach

**Do NOT block `GetInfo()` with a delta calculation.** Use a lightweight `CPUCollector` that runs in a background goroutine:

```go
// internal/system/cpu_linux.go (build tag: linux)
type CPUCollector struct {
    mu         sync.RWMutex
    cpuPercent float64
    sampleRate time.Duration // default: 5 seconds
    cancel     context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64 // returns latest sample
```

How it works:

1. Reads `/proc/stat` first line: `cpu `
2. Sleeps `sampleRate` (5s)
3. Reads again, computes delta: `busy = delta(user+nice+system+irq+softirq+steal)`, `total = busy + delta(idle+iowait)`
4. `cpuPercent = (busy / total) * 100`
5. Stores result, loops

Parsing `/proc/stat`:

```
cpu 1234 56 789 45678 123 45 67 0 0 0
```

Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Sum fields 1-8 = total; the trailing guest and guest_nice columns are already counted in user and nice, so leave them out. idle + iowait = idle_total. busy = total - idle_total.

**IMPORTANT: Inside a Docker container, `/proc/stat` reflects the HOST CPU** (unless CPU cgroups are applied with limits). So the controller's own `/proc/stat` works.

### Load average

Read from `/proc/loadavg` (instant, no delta needed):

```
0.15 0.10 0.05 1/234 56789
```

First three fields are 1/5/15 minute load averages. Parse with `fmt.Sscanf`.
Add `readLoadAvg(info *SystemInfo)` in `info_linux.go`.

### Temperature

Read from `/sys/class/thermal/thermal_zone*/temp`:

**IMPORTANT**: The controller runs in a Docker container. `/sys` is NOT available by default. We mount the host's `/sys` at `/host/sys` inside the container (see docker-compose.yml changes below).

```go
// internal/system/info_linux.go: add readTemperature(info *SystemInfo)
```

Algorithm:

1. Try `/host/sys/class/thermal/thermal_zone*/temp` first (Docker mount)
2. Fallback to `/sys/class/thermal/thermal_zone*/temp` (native/development)
3. For each zone, also read the `type` file for the label
4. Pick the highest temperature (usually `thermal_zone0` or `x86_pkg_temp`)
5. Value is in millidegrees Celsius → divide by 1000.0
6. Store the zone type as `TemperatureSource`
7. If no thermal zones found: try `/host/sys/class/hwmon/hwmon*/temp1_input` as fallback (same millidegree format)
8. If nothing found: leave fields as zero (dashboard hides temperature when 0)

### `GetInfo()` signature change

```go
// Current:
func GetInfo(hddPath string) SystemInfo

// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo
```

Inside `GetInfo()`:

1. Existing: `readMemInfo(&info)`, `readDiskUsage(...)` (unchanged)
2. New: `readLoadAvg(&info)`
3. New: `readTemperature(&info)`
4. New: `if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }`

The `info_other.go` stub accepts the parameter but ignores it (returns empty SystemInfo as before).

### CPU collector lifecycle

Started in `main.go`:

```go
cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()
```

Passed to `web.NewServer()` and `api.NewRouter()` which pass it to `system.GetInfo()` calls.
### Dashboard display

Extend the existing system info bar in `dashboard.html`:

Current layout:

```
| Memória | ████████░░ 72% | SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```

New layout:

```
| Memória | ████████░░ 72% | CPU | ██░░░░░░░░ 15% | Hőmérséklet | 52°C |
| SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```

("Memória" = Memory, "Hőmérséklet" = Temperature.)

Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.

Temperature display:

- Show as text "52°C" with colored dot (green/yellow/red)
- Green: < 60°C
- Yellow: 60-75°C
- Red: > 75°C
- If temperature is 0 (unavailable): hide entirely

CPU progress bar:

- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%

Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"

---

## Phase 2C: Healthchecks.io Ping Integration (`internal/monitor/`)

### Design: `internal/monitor/pinger.go`

```go
package monitor

type Pinger struct {
    baseURL    string // e.g. "https://status.felhom.eu"
    httpClient *http.Client
    logger     *log.Logger
    enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error
```

### HTTP protocol

- Success: `POST {baseURL}/ping/{uuid}` with body as request body
- Failure: `POST {baseURL}/ping/{uuid}/fail` with body
- Start: `POST {baseURL}/ping/{uuid}/start`
- Timeout: 10 seconds
- Retry: 3 attempts with 2s backoff between retries
- If `uuid` is empty or starts with "CHANGEME" → skip silently (log at debug level only)
- If `enabled` is false → skip all pings
- **Never let ping failures affect the main operation**: log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.

### Design: `internal/monitor/healthcheck.go`

```go
// HealthReport is the diagnostic report produced by RunHealthCheck.
type HealthReport struct {
    Status    string   // "ok", "warn", "fail"
    Issues    []string // critical problems
    Warnings  []string // non-critical warnings
    Info      []string // informational items
    Timestamp time.Time
}

// RunHealthCheck runs system checks and returns a diagnostic report.
func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string // human-readable summary for HC ping body
```

Checks to run (replicating backup-healthcheck.sh logic in Go):

1. **Disk usage**: Read from `system.GetInfo()`. Compare against thresholds (`disk_warn_percent`, `disk_crit_percent`).
2. **Memory usage**: Same source. Warn if above `memory_warn_percent`.
3. **CPU usage**: From collector. Warn if above `cpu_warn_percent`.
4. **Temperature**: From `system.GetInfo()`. Warn if above `temperature_warn_celsius`.
5.
   **Docker health**: Verify Docker daemon is reachable by running `docker info` (quick exec check).
6. **Protected containers**: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.

Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".

### Scheduler integration

```go
// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth

// Parse system_health_interval (default "5m").
// A zero or invalid interval would make time.NewTicker panic, so fall back.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
    logger.Printf("[WARN] Invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
    healthInterval = 5 * time.Minute
}

sched.Every("system-health", healthInterval, func(ctx context.Context) error {
    report := monitor.RunHealthCheck(cfg, cpuCollector)
    body := report.FormatMessage()
    if report.Status == "fail" {
        pinger.Fail(healthUUID, body)
    } else {
        pinger.Ping(healthUUID, body)
    }
    return nil // never fail the scheduler job due to ping errors
})
```

### Config changes

Add to `MonitoringConfig`:

```go
SystemHealthInterval string `yaml:"system_health_interval"`
```

Default in `applyDefaults()`: `"5m"`

---

## Phase 3A: Database Dump Engine (`internal/backup/dbdump.go`)

### Approach: Auto-discover from running Docker containers

Replicates the proven logic from `backup-db-dump.sh` in Go:

```go
package backup

type DBType string

const (
    DBTypePostgres DBType = "postgres"
    DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
    ContainerName string
    ContainerID   string
    DBType        DBType
    DBUser        string
    DBName        string
    StackName     string // derived from container name
}

type DumpResult struct {
    DB       DiscoveredDB
    FilePath string
    Size     int64
    Duration time.Duration
    Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult
```

### Discovery logic

Run `docker ps --format
'{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running`. For each running container, check image name:

- Contains `postgres` → DBTypePostgres
- Contains `mariadb` or `mysql` → DBTypeMariaDB

Then for each DB container, get env vars via: `docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}'`

Parse env vars:

- **PostgreSQL**: `POSTGRES_USER` (default: "postgres"), `POSTGRES_DB` (default: same as POSTGRES_USER)
- **MariaDB**: `MYSQL_ROOT_PASSWORD` (or `MARIADB_ROOT_PASSWORD`), `MYSQL_DATABASE` (or `MARIADB_DATABASE`)

Derive stack name from container name by stripping common DB suffixes:

- `paperless-ngx-postgres` → `paperless-ngx`
- `romm-db` → `romm`
- `immich-postgres` → `immich`
- Logic: split on `-`, check if last segment is a known suffix (`postgres`, `db`, `mariadb`, `mysql`, `database`, `redis`, `cache`), if so remove it

### Dump execution

**PostgreSQL:**

```bash
docker exec pg_dump -U -d --clean --if-exists --no-owner --no-privileges
```

**MariaDB:**

```bash
docker exec mariadb-dump -u root -p --single-transaction --routines --triggers
```

**IMPORTANT: Use `docker exec` to run dump commands INSIDE the DB container.** Do NOT use pg_dump/mysqldump from the controller container: version mismatches between the controller's client and the DB server will cause failures.

Output handling:

- Use `os/exec.Command("docker", "exec", ...)` with `cmd.Stdout` piped to a temp file
- Write to `{dumpDir}/{stackName}-{dbtype}.sql.tmp` during dump
- Rename `.tmp` → `.sql` on success only
- Delete `.tmp` on failure
- Set 5-minute timeout per dump via `context.WithTimeout`

### Gotchas and edge cases

- **MariaDB password from container env:** Never log the password. Use `docker inspect` to read `MYSQL_ROOT_PASSWORD` or `MARIADB_ROOT_PASSWORD`.
- **Empty/zero-size dumps:** Check dump file size after writing. If 0 bytes → treat as failure.
- **Dump file naming:** `{stackName}-{dbtype}.sql` (e.g., `paperless-ngx-postgres.sql`). Overwrite previous dump each run (restic handles versioning).
- **Old tmp cleanup:** Delete `.tmp` files older than 1 hour on each run (leftover from crashed dumps).
- **Skip infrastructure DBs:** Don't dump databases from protected stacks (if any have DBs in the future).
- **Container not running:** If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).

### Dump directory

`/srv/backups/db-dumps/`, configured in `controller.yaml` as `paths.db_dump_dir`. Already mounted in docker-compose.yml via `/srv/backups:/srv/backups`. The user does NOT see this directory (not in FileBrowser, not on HDD).

---

## Phase 3B: Restic Integration (`internal/backup/restic.go`)

### Design

```go
type ResticManager struct {
    repoPath     string
    passwordFile string
    logger       *log.Logger
    customerID   string
    cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager
func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
    SnapshotID   string
    FilesNew     int
    FilesChanged int
    DataAdded    string // human-readable
    Duration     time.Duration
}

type SnapshotInfo struct {
    ID    string
    Time  time.Time
    Paths []string
    Tags  []string
}

type RepoStats struct {
    TotalSize      string
    SnapshotCount  int
    LatestSnapshot *SnapshotInfo
}
```

### Restic commands (all via `os/exec`)

All commands set these env vars:

```go
cmd.Env = append(os.Environ(),
    "RESTIC_REPOSITORY="+r.repoPath,
    "RESTIC_PASSWORD_FILE="+r.passwordFile,
    "RESTIC_CACHE_DIR="+r.cacheDir,
)
```

**`RESTIC_CACHE_DIR`** must be set to `/opt/docker/felhom-controller/data/restic-cache` (inside the controller-data Docker volume).
Without this, restic defaults to `~/.cache/restic` which may not persist across container restarts.

**Init** (idempotent):

- Check if `{repoPath}/config` file exists → if so, already initialized, skip
- Otherwise: `restic init`

**Snapshot:**

```bash
restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
  --tag felhom --tag --host
```

What gets backed up (v1):

- `/opt/docker/stacks/`: compose files, .felhom.yml, app.yaml (deploy configs with secrets)
- `/srv/backups/db-dumps/`: SQL dumps (from the DB dump step)
- `/opt/docker/felhom-controller/controller.yaml`: controller config

**NOT backed up in v1:**

- HDD app data (Immich photos, Paperless documents): too large, needs separate strategy
- Docker volumes directly: critical data covered by DB dumps

Parse snapshot output (restic `backup` with `--json` writes JSON lines to stdout; the final line with `message_type` `"summary"` carries the result):

```json
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
```

**Prune:**

```bash
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```

**Check:**

```bash
restic check
```

**Latest snapshot:**

```bash
restic snapshots --latest 1 --json
```

Returns JSON array with snapshot objects.

**Stats (repo size):**

```bash
restic stats --json
```

### Password auto-generation

On startup, `EnsureInitialized()` checks if the password file exists. If not:

1. Generate 32 random bytes, base64url-encode
2. Write to `r.passwordFile` (the controller-data volume path)
3. Log `[INFO] Generated new restic repository password at `
4. Log `[WARN] Save this password externally: losing it means losing access to ALL backups`

### Gotchas

- **restic is already in the Docker image** (Dockerfile installs it). No additional setup.
- **Locking:** Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations.
  If a stale lock exists (controller crashed mid-backup), restic will error; add `restic unlock` to the error handling path with a log warning.
- **Timeout:** 30-minute timeout for snapshot operations. Pass a context with that deadline to the command.
- **Large repos:** First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
- **restic JSON output:** Use `--json` for machine-parseable output. With `--json`, restic writes its JSON messages to **stdout**, not stderr. For `backup --json`, the summary JSON object with `message_type: "summary"` is on stdout. Parse the last JSON line from stdout.

---

## Phase 3C: Backup Orchestrator (`internal/backup/backup.go`)

### Design

```go
type Manager struct {
    cfg        *config.Config
    restic     *ResticManager
    logger     *log.Logger
    pinger     *monitor.Pinger
    mu         sync.Mutex
    lastDBDump *DBDumpStatus
    lastBackup *BackupStatus
}

type DBDumpStatus struct {
    LastRun  time.Time
    Results  []DumpResult
    Success  bool
    Duration time.Duration
}

type BackupStatus struct {
    LastRun   time.Time
    Snapshot  *SnapshotResult
    Success   bool
    Duration  time.Duration
    RepoStats *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)
```

### Full backup flow (daily scheduled)

1. **DB dumps:** `DiscoverDatabases()` → `DumpAll()` → update `lastDBDump` status
2. Ping Healthchecks for DB dump result: `pinger.Ping/Fail(dbDumpUUID, summary)`
3. **Restic snapshot:** `restic.EnsureInitialized()` → `restic.Snapshot(paths, tags)`
4.
   **Prune (weekly):** Check day of week against `prune_schedule` config. If match → `restic.Prune(retention)` + `restic.Check()`
5. Ping Healthchecks for backup result: `pinger.Ping/Fail(backupUUID, summary)`
6. Update `lastBackup` status

### Scheduler integration

```go
// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)

if cfg.Backup.Enabled {
    sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
        return backupMgr.RunDBDumps(ctx)
    })
    sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
        return backupMgr.RunBackup(ctx)
    })
}
```

### Dashboard display

Add "Biztonsági mentés" (Backup) section to `dashboard.html`:

```
╔══════════════════════════════════════════╗
║  🛡️ Biztonsági mentés                    ║
╠══════════════════════════════════════════╣
║  Utolsó mentés:  2026-02-15 03:01 ✅     ║
║  Adatbázisok:    2 mentve (12.3 MB)      ║
║  Tároló méret:   45.2 MB (23 pillanatkép)║
║  Következő:      ma 03:00                ║
║                                          ║
║  [Mentés most]                           ║
╚══════════════════════════════════════════╝
```

Hungarian labels:

- "Biztonsági mentés" = Backup
- "Utolsó mentés" = Last backup
- "Adatbázisok" = Databases
- "mentve" = backed up
- "Tároló méret" = Repository size
- "pillanatkép" = snapshot(s)
- "Következő" = Next
- "Mentés most" = Backup now

Status colors:

- Green ✅: Last backup successful and less than `backup_max_age_hours` old
- Yellow ⚠️: Last backup successful but older than expected
- Red ❌: Last backup failed or no backups exist yet
- Gray: Backup not configured (`backup.enabled: false`)

If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).

### API endpoints

Add to `api/router.go`:

```
GET  /api/backup/status  → backup manager status + repo stats
POST /api/backup/run     → trigger immediate full backup (async)
```

`POST /api/backup/run` starts the backup in a background goroutine, returns immediately with `{"ok": true, "message": "Mentés elindítva"}` ("Backup started"). The dashboard can poll `/api/backup/status` to track progress.
---

## Docker-compose.yml final state

```yaml
services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket: required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys: for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3

volumes:
  controller-data:

networks:
  traefik-public:
    external: true
```

Changes from current:

1. **Added:** `/sys:/host/sys:ro` for temperature reading
2.
   **Removed:** dedicated restic-password bind mount (password now in controller-data volume)

---

## Config changes summary

### `controller.yaml.example` updates

```yaml
monitoring:
  system_health_interval: "5m" # NEW field

backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password" # CHANGED default path
```

### `config.go` updates

- Add `SystemHealthInterval string` to `MonitoringConfig`
- Default: `"5m"` in `applyDefaults()`
- Change `restic_password_file` default from `/opt/docker/felhom-controller/restic-password` to `/opt/docker/felhom-controller/data/restic-password`
- Add env override: `FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL`

---

## Implementation Order

### Step 1: Scheduler

1. Create `internal/scheduler/scheduler.go`
2. Implement `Every()` and `Daily()` with logging, panic recovery, skip-if-running
3. Migrate the two existing goroutines from `main.go` to scheduler
4. **Build and verify**: behavior should be identical, logs should show `[SCHED]` entries

### Step 2: CPU & Temperature metrics

1. Create `internal/system/cpu_linux.go` + `cpu_other.go` (build tags)
2. Add `readLoadAvg()` and `readTemperature()` to `info_linux.go`
3. Extend `SystemInfo` struct in `info.go`
4. Update `GetInfo()` signature in all files to accept `*CPUCollector`
5. Start CPUCollector in `main.go`, pass to web server and API router
6. Update `docker-compose.yml`: add `/sys:/host/sys:ro`
7. Update `dashboard.html`: show CPU, load, temperature
8. Update `style.css` if needed for new display elements
9. **Build, deploy, verify**: new metrics visible on dashboard

### Step 3: Healthchecks pinger + health checks

1. Create `internal/monitor/pinger.go`
2. Create `internal/monitor/healthcheck.go`
3. Add `system_health_interval` to config
4. Add system health ping job to scheduler in `main.go`
5. **Build, deploy**: check controller logs for health check runs

### Step 4: Database dump engine

1. Create `internal/backup/dbdump.go`
2.
   Implement discovery + dump functions
3. Wire up `RunDBDumps` temporarily to a test endpoint or manual scheduler trigger for testing
4. **Build, deploy, verify**: dumps should appear in `/srv/backups/db-dumps/` for paperless-ngx-postgres

### Step 5: Restic integration

1. Create `internal/backup/restic.go`
2. Implement init, snapshot, prune, check, stats
3. Auto-generate restic password if missing
4. Update docker-compose.yml (remove restic-password bind mount)
5. **Build, deploy, verify**: repo initialized, password generated

### Step 6: Backup orchestrator + dashboard

1. Create `internal/backup/backup.go`
2. Wire up scheduler daily jobs (DB dump + backup)
3. Add API endpoints (`/api/backup/status`, `/api/backup/run`)
4. Add backup status section to `dashboard.html`
5. Add "Mentés most" button
6. **Build, deploy, verify full flow**

### Step 7: Documentation & cleanup

1. Update `README.md`: Phase 2 and 3 checked off, new module descriptions
2. Update `CONTEXT.md` with session summary
3. Update `CLAUDE.md` if workflow changes
4.
   Version bump in build: `v0.4.0`

---

## Verification Checklist

After deployment, verify each item:

- [ ] `docker ps` shows controller healthy
- [ ] Dashboard loads with CPU %, load average, temperature displayed
- [ ] Temperature shows realistic value (30-60°C idle for N100)
- [ ] CPU % updates (not stuck at 0)
- [ ] `/api/system/info` returns all new fields (cpu_percent, load_avg_*, temperature_*)
- [ ] Scheduler logs show `[SCHED]` entries for all registered jobs
- [ ] If HC UUIDs configured: pings visible in status.felhom.eu dashboard
- [ ] DB dump discovers paperless-ngx postgres container
- [ ] Dump file exists: `/srv/backups/db-dumps/paperless-ngx-postgres.sql`
- [ ] Restic repo initialized: `/srv/backups/restic-repo/config` exists
- [ ] Restic password auto-generated: `/opt/docker/felhom-controller/data/restic-password` exists
- [ ] "Mentés most" button triggers backup successfully
- [ ] Dashboard shows backup status section with last backup time
- [ ] All existing features still work (start/stop/deploy/update/logs/auth)

---

## New files to create

```
internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go
```

## Existing files to modify

```
internal/system/info.go                - new SystemInfo fields
internal/system/info_linux.go          - readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go          - GetInfo() signature update
internal/config/config.go              - SystemHealthInterval, updated defaults
internal/api/router.go                 - backup endpoints, cpuCollector parameter
internal/web/server.go                 - accept cpuCollector, backupMgr
internal/web/handlers.go               - pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html  - CPU/temp bars, backup status section
internal/web/templates/style.css       - styles for new elements
cmd/controller/main.go                 - scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml          - /sys mount, remove restic-password mount
configs/controller.yaml.example        - new fields, updated defaults
```

---

## Manual steps after deployment (for Viktor)

1. **Verify /sys mount:** `docker exec felhom-controller ls /host/sys/class/thermal/` should show thermal_zone directories
2. **Healthchecks setup:** Create project + 3 checks in status.felhom.eu for demo-felhom:
   - `system-health` (period: 10m, grace: 10m)
   - `db-dump` (period: 24h, grace: 1h)
   - `backup` (period: 24h, grace: 1h)
3. **Update controller.yaml:** Add the three ping UUIDs
4. **Verify restic password:** `docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password`
5. **Test restore procedure:** start by listing the snapshots, which confirms the repository opens with the stored password:
   ```bash
   docker exec felhom-controller restic -r /srv/backups/restic-repo \
     --password-file /opt/docker/felhom-controller/data/restic-password snapshots
   ```
6. **Save restic password externally:** losing it means losing access to all backups