# TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups

> Version bump target: **v0.4.0**
> Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)

---

## Overview

Implement two major features in felhom-controller:

1. **Phase 2** — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
2. **Phase 3** — Database dump engine, restic backup snapshots, dashboard status display

Both phases share the scheduler infrastructure. Implement in order.

---

## Phase 2A — Scheduler (`internal/scheduler/`)

### Why first

main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min).
Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups.
All need a centralized, logged, observable job runner. Build it once, use everywhere.

### Design: `internal/scheduler/scheduler.go`

```go
package scheduler

type JobFunc func(ctx context.Context) error

type Job struct {
    Name     string
    Fn       JobFunc
    Interval time.Duration    // for periodic jobs (every N)
    Schedule string           // for daily jobs ("02:30", "03:00") — mutually exclusive with Interval
    LastRun  time.Time
    LastErr  error
    Running  bool
}

type Scheduler struct {
    mu     sync.Mutex         // guards the mutable Job fields (LastRun, LastErr, Running)
    jobs   []*Job
    logger *log.Logger
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc)  // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job  // for dashboard/API display (copy, not pointer)
```

### Interval jobs (`Every`)

- Spawns a goroutine with `time.Ticker`
- Logs `[SCHED] Running job: <n>` at start, `[SCHED] Job <n> completed (took Xs)` or `[SCHED] Job <n> failed: <err> (took Xs)` at end
- Updates `LastRun`, `LastErr`, `Running` fields (mutex-protected)
- Respects ctx.Done() for shutdown
- **Quiet mode for high-frequency jobs:** Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.

### Daily jobs (`Daily`)

- Parses `timeStr` as "HH:MM" in `Europe/Budapest` timezone
- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
- After each run, sleeps until the next day's scheduled time
- Uses `time.After()` or `time.Timer`, NOT `time.Ticker`: a fixed 24h tick drifts away from the wall-clock time across DST transitions, so recompute the next occurrence after every run
- Same logging pattern as interval jobs
- Logs `[SCHED] Daily job <n> scheduled for <next_time>` on registration

### Edge cases

- If `Daily` timeStr is invalid → log error at registration, don't start the job
- If a job panics → recover, log `[ERROR] Job <n> panicked: <err>`, mark as failed
- If a job is already running when the next tick fires → skip, log `[WARN] Job <n> still running, skipping`
- Graceful shutdown: `Stop()` cancels context, `wg.Wait()` with 30s timeout for running jobs to finish

### Integration in main.go

Replace the two ad-hoc goroutines with:

```go
sched := scheduler.New(logger)

// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
    return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
    return stackMgr.ScanStacks()
})

// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)

sched.Start(ctx)
defer sched.Stop()
```

Delete the two existing goroutines in main.go after migrating to the scheduler.

---

## Phase 2B — CPU & Temperature Metrics (`internal/system/`)

### Current state

`info.go` defines `SystemInfo` struct with memory + disk fields.
`info_linux.go` reads `/proc/meminfo` and `syscall.Statfs`.
`info_other.go` provides stubs for non-Linux.

### New fields in `SystemInfo`

```go
// Add to SystemInfo struct in info.go:
CPUPercent          float64 `json:"cpu_percent"`           // 0-100, averaged across all cores
LoadAvg1            float64 `json:"load_avg_1"`            // 1-minute load average
LoadAvg5            float64 `json:"load_avg_5"`            // 5-minute load average
LoadAvg15           float64 `json:"load_avg_15"`           // 15-minute load average
TemperatureCelsius  float64 `json:"temperature_celsius"`   // CPU/SoC temperature
TemperatureSource   string  `json:"temperature_source"`    // e.g. "thermal_zone0", "x86_pkg_temp"
```

### CPU measurement approach

**Do NOT block `GetInfo()` with a delta calculation.**

Use a lightweight `CPUCollector` that runs in a background goroutine:

```go
// internal/system/cpu_linux.go (build tag: linux)

type CPUCollector struct {
    mu          sync.RWMutex
    cpuPercent  float64
    sampleRate  time.Duration  // default: 5 seconds
    cancel      context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64  // returns latest sample
```

How it works:
1. Reads `/proc/stat` first line: `cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>`
2. Sleeps `sampleRate` (5s)
3. Reads again, computes delta: `busy = delta(user+nice+system+irq+softirq+steal)`, `total = busy + delta(idle+iowait)`
4. `cpuPercent = (busy / total) * 100`
5. Stores result, loops

Parsing `/proc/stat`:
```
cpu  1234 56 789 45678 123 45 67 0 0 0
```
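Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Ignore any further fields (guest, guest_nice); guest time is already counted in user and nice.
Sum fields 1-8 = total. idle + iowait = idle_total. busy = total - idle_total.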
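
The delta calculation can be sketched as follows. `cpuSample`, `parseCPULine` and `cpuPercent` are hypothetical names, and the sketch assumes a kernel that reports at least the eight fields listed above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative jiffies from one read of the aggregate "cpu" line.
type cpuSample struct{ busy, total uint64 }

// parseCPULine parses the first line of /proc/stat, using fields 1-8 only.
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 9 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var v [8]uint64 // user nice system idle iowait irq softirq steal
	var total uint64
	for i := range v {
		n, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i+1, err)
		}
		v[i] = n
		total += n
	}
	idle := v[3] + v[4] // idle + iowait
	return cpuSample{busy: total - idle, total: total}, nil
}

// cpuPercent returns the busy share between two samples, in the range 0-100.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0 // no elapsed time or counters went backwards
	}
	return float64(cur.busy-prev.busy) / float64(cur.total-prev.total) * 100
}
```

The collector loop would keep the previous sample, sleep for `sampleRate`, read again, and store the `cpuPercent` result under its write lock.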

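**IMPORTANT: Inside a Docker container, `/proc/stat` still reflects the HOST CPU.** The file is not namespaced, and cgroup CPU limits do not change its contents (only an overlay such as lxcfs would). So the controller can read its own `/proc/stat` without any extra mount.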

### Load average

Read from `/proc/loadavg` (instant, no delta needed):
```
0.15 0.10 0.05 1/234 56789
```
First three fields are 1/5/15 minute load averages. Parse with `fmt.Sscanf`.

Add `readLoadAvg(info *SystemInfo)` in `info_linux.go`.
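
As a rough illustration of the `fmt.Sscanf` parsing, assuming the file content has already been read into a string (`parseLoadAvg` is a hypothetical helper that `readLoadAvg` could call):

```go
package main

import "fmt"

// parseLoadAvg extracts the 1/5/15 minute load averages from the content
// of /proc/loadavg. Trailing fields (run queue, last PID) are ignored.
func parseLoadAvg(content string) (l1, l5, l15 float64, err error) {
	_, err = fmt.Sscanf(content, "%f %f %f", &l1, &l5, &l15)
	return l1, l5, l15, err
}
```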

### Temperature

Read from `/sys/class/thermal/thermal_zone*/temp`:

**IMPORTANT**: The controller runs in a Docker container. `/sys` is NOT available by default. We mount the host's `/sys` at `/host/sys` inside the container (see docker-compose.yml changes below).

```go
// internal/system/info_linux.go — add readTemperature(info *SystemInfo)
```

Algorithm:
1. Try `/host/sys/class/thermal/thermal_zone*/temp` first (Docker mount)
2. Fallback to `/sys/class/thermal/thermal_zone*/temp` (native/development)
3. For each zone, also read the `type` file for the label
4. Pick the highest temperature (usually `thermal_zone0` or `x86_pkg_temp`)
5. Value is in millidegrees Celsius → divide by 1000.0
6. Store the zone type as `TemperatureSource` (fall back to the zone name, e.g. `thermal_zone0`, if the `type` file is unreadable)
7. If no thermal zones found: try `/host/sys/class/hwmon/hwmon*/temp1_input` as fallback (same millidegree format)
8. If nothing found: leave fields as zero (dashboard hides temperature when 0)

### `GetInfo()` signature change

```go
// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo
```

Inside `GetInfo()`:
1. Existing: `readMemInfo(&info)`, `readDiskUsage(...)` — unchanged
2. New: `readLoadAvg(&info)`
3. New: `readTemperature(&info)`
4. New: `if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }`

The `info_other.go` stub accepts the parameter but ignores it (returns empty SystemInfo as before).

### CPU collector lifecycle

Started in `main.go`:

```go
cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()
```

Passed to `web.NewServer()` and `api.NewRouter()` which pass it to `system.GetInfo()` calls.

### Dashboard display

Extend the existing system info bar in `dashboard.html`:

Current layout:
```
| Memória | ████████░░ 72% |  SSD | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |
```

New layout:
```
| Memória | ████████░░ 72% |  CPU | ██░░░░░░░░ 15% |  Hőmérséklet | 52°C |
| SSD     | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |
```

Alternatively, if horizontal space is tight, keep the current dashboard's row structure and fit CPU and temperature into it. Either way, reuse the existing progress bar component.

Temperature display:
- Show as text "52°C" with colored dot (green/yellow/red)
- Green: < 60°C
- Yellow: 60-75°C
- Red: > 75°C
- If temperature is 0 (unavailable): hide entirely

CPU progress bar:
- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%

Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"

---

## Phase 2C — Healthchecks.io Ping Integration (`internal/monitor/`)

### Design: `internal/monitor/pinger.go`

```go
package monitor

type Pinger struct {
    baseURL    string          // e.g. "https://status.felhom.eu"
    httpClient *http.Client
    logger     *log.Logger
    enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error
```

### HTTP protocol

- Success: `POST {baseURL}/ping/{uuid}` with body as request body
- Failure: `POST {baseURL}/ping/{uuid}/fail` with body
- Start: `POST {baseURL}/ping/{uuid}/start`
- Timeout: 10 seconds
- Retry: 3 attempts with 2s backoff between retries
- If `uuid` is empty or starts with "CHANGEME" → skip silently (log at debug level only)
- If `enabled` is false → skip all pings
- **Never let ping failures affect the main operation** — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.

### Design: `internal/monitor/healthcheck.go`

```go
// HealthReport is the diagnostic result of a single health check run.
type HealthReport struct {
    Status    string   // "ok", "warn", "fail"
    Issues    []string // critical problems
    Warnings  []string // non-critical warnings
    Info      []string // informational items
    Timestamp time.Time
}

// RunHealthCheck runs system checks and returns a diagnostic report.
func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string  // human-readable summary for HC ping body
```

Checks to run (replicating backup-healthcheck.sh logic in Go):
1. **Disk usage**: Read from `system.GetInfo()`. Compare against thresholds (`disk_warn_percent`, `disk_crit_percent`).
2. **Memory usage**: Same source. Warn if above `memory_warn_percent`.
3. **CPU usage**: From collector. Warn if above `cpu_warn_percent`.
4. **Temperature**: From `system.GetInfo()`. Warn if above `temperature_warn_celsius`.
5. **Docker health**: Verify Docker daemon is reachable by running `docker info` (quick exec check).
6. **Protected containers**: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.

Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".
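
The status rule can be sketched as below, using a reduced `healthReport` type as a stand-in for `HealthReport`. The message layout in `formatMessage` is an assumption; the spec only requires a human-readable summary.

```go
package main

import (
	"fmt"
	"strings"
)

// healthReport is a trimmed-down stand-in for HealthReport (hypothetical).
type healthReport struct {
	Issues   []string
	Warnings []string
}

// status derives the overall state: any issue wins over any warning.
func (r healthReport) status() string {
	switch {
	case len(r.Issues) > 0:
		return "fail"
	case len(r.Warnings) > 0:
		return "warn"
	default:
		return "ok"
	}
}

// formatMessage renders one line per finding for the ping body.
func (r healthReport) formatMessage() string {
	var b strings.Builder
	fmt.Fprintf(&b, "status: %s\n", r.status())
	for _, s := range r.Issues {
		fmt.Fprintf(&b, "FAIL: %s\n", s)
	}
	for _, s := range r.Warnings {
		fmt.Fprintf(&b, "WARN: %s\n", s)
	}
	return b.String()
}
```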

### Scheduler integration

```go
// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth

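// Parse system_health_interval (default "5m"). Fall back on a bad or
// non-positive value, since a zero interval makes time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
    logger.Printf("[WARN] Invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
    healthInterval = 5 * time.Minute
}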

sched.Every("system-health", healthInterval, func(ctx context.Context) error {
    report := monitor.RunHealthCheck(cfg, cpuCollector)
    body := report.FormatMessage()

    if report.Status == "fail" {
        pinger.Fail(healthUUID, body)
    } else {
        pinger.Ping(healthUUID, body)
    }
    return nil  // never fail the scheduler job due to ping errors
})
```

### Config changes

Add to `MonitoringConfig`:
```go
SystemHealthInterval string `yaml:"system_health_interval"`
```

Default in `applyDefaults()`: `"5m"`

---

## Phase 3A — Database Dump Engine (`internal/backup/dbdump.go`)

### Approach: Auto-discover from running Docker containers

Replicates the proven logic from `backup-db-dump.sh` in Go:

```go
package backup

type DBType string
const (
    DBTypePostgres DBType = "postgres"
    DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
    ContainerName string
    ContainerID   string
    DBType        DBType
    DBUser        string
    DBName        string
    StackName     string  // derived from container name
}

type DumpResult struct {
    DB       DiscoveredDB
    FilePath string
    Size     int64
    Duration time.Duration
    Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult
```

### Discovery logic

Run `docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running`.

For each running container, check image name:
- Contains `postgres` → DBTypePostgres
- Contains `mariadb` or `mysql` → DBTypeMariaDB

Then for each DB container, get env vars via:
`docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'`

Parse env vars:
- **PostgreSQL**: `POSTGRES_USER` (default: "postgres"), `POSTGRES_DB` (default: same as POSTGRES_USER)
- **MariaDB**: `MYSQL_ROOT_PASSWORD` (or `MARIADB_ROOT_PASSWORD`), `MYSQL_DATABASE` (or `MARIADB_DATABASE`)
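
A sketch of the env parsing with the defaults listed above. `parseEnv`, `postgresTarget` and `firstSet` are hypothetical helpers; the input is assumed to be the raw `docker inspect` output, one `KEY=value` per line.

```go
package main

import "strings"

// parseEnv turns "KEY=value" lines into a map. Lines without "=" are ignored.
func parseEnv(output string) map[string]string {
	env := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		key, value, found := strings.Cut(line, "=")
		if found && key != "" {
			env[key] = value
		}
	}
	return env
}

// postgresTarget applies the PostgreSQL defaults: user "postgres",
// database named after the user.
func postgresTarget(env map[string]string) (user, db string) {
	user = env["POSTGRES_USER"]
	if user == "" {
		user = "postgres"
	}
	db = env["POSTGRES_DB"]
	if db == "" {
		db = user
	}
	return user, db
}

// firstSet returns the first non-empty value among keys, for the
// MYSQL_* / MARIADB_* variable pairs.
func firstSet(env map[string]string, keys ...string) string {
	for _, k := range keys {
		if v := env[k]; v != "" {
			return v
		}
	}
	return ""
}
```

`strings.Cut` splits on the first `=` only, so values that themselves contain `=` (common in generated passwords) survive intact.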

Derive stack name from container name by stripping common DB suffixes:
- `paperless-ngx-postgres` → `paperless-ngx`
- `romm-db` → `romm`
- `immich-postgres` → `immich`
- Logic: split on `-`, check if last segment is a known suffix (`postgres`, `db`, `mariadb`, `mysql`, `database`, `redis`, `cache`), if so remove it
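
The suffix-stripping rule is small enough to sketch directly (`stackName` and `dbSuffixes` are illustrative names):

```go
package main

import "strings"

// dbSuffixes lists the trailing segments that mark a helper container.
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// stackName strips one known suffix from the end of a container name.
// Names without a "-" or without a known suffix are returned unchanged.
func stackName(container string) string {
	i := strings.LastIndex(container, "-")
	if i <= 0 {
		return container
	}
	if dbSuffixes[container[i+1:]] {
		return container[:i]
	}
	return container
}
```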

### Dump execution

**PostgreSQL:**
```bash
docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges
```

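**MariaDB** (MySQL images and older MariaDB images ship `mysqldump` instead of `mariadb-dump`; fall back to it when `mariadb-dump` is missing):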
```bash
docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>
```

**IMPORTANT: Use `docker exec` to run dump commands INSIDE the DB container.** Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.

Output handling:
- Use `exec.CommandContext(ctx, "docker", "exec", ...)` (so the timeout below actually kills the process) with `cmd.Stdout` piped to a temp file
- Write to `{dumpDir}/{stackName}-{dbtype}.sql.tmp` during dump
- Rename `.tmp` → `.sql` on success only
- Delete `.tmp` on failure
- Set 5-minute timeout per dump via `context.WithTimeout`

### Gotchas and edge cases

- **MariaDB password from container env:** Never log the password, and that includes the assembled `docker exec` argument list, which contains `-p<password>`. Use `docker inspect` to read `MYSQL_ROOT_PASSWORD` or `MARIADB_ROOT_PASSWORD`.
- **Empty/zero-size dumps:** Check dump file size after writing. If 0 bytes → treat as failure.
- **Dump file naming:** `{stackName}-{dbtype}.sql` (e.g., `paperless-ngx-postgres.sql`). Overwrite previous dump each run (restic handles versioning).
- **Old tmp cleanup:** Delete `.tmp` files older than 1 hour on each run (leftover from crashed dumps).
- **Skip infrastructure DBs:** Don't dump databases from protected stacks (if any have DBs in the future).
- **Container not running:** If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).

### Dump directory

`/srv/backups/db-dumps/` — configured in `controller.yaml` as `paths.db_dump_dir`.
Already mounted in docker-compose.yml via `/srv/backups:/srv/backups`.

The user does NOT see this directory (not in FileBrowser, not on HDD).

---

## Phase 3B — Restic Integration (`internal/backup/restic.go`)

### Design

```go
type ResticManager struct {
    repoPath     string
    passwordFile string
    logger       *log.Logger
    customerID   string
    cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager

func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
    SnapshotID   string
    FilesNew     int
    FilesChanged int
    DataAdded    string        // human-readable
    Duration     time.Duration
}

type SnapshotInfo struct {
    ID       string
    Time     time.Time
    Paths    []string
    Tags     []string
}

type RepoStats struct {
    TotalSize      string
    SnapshotCount  int
    LatestSnapshot *SnapshotInfo
}
```

### Restic commands (all via `os/exec`)

All commands set these env vars:
```go
cmd.Env = append(os.Environ(),
    "RESTIC_REPOSITORY="+r.repoPath,
    "RESTIC_PASSWORD_FILE="+r.passwordFile,
    "RESTIC_CACHE_DIR="+r.cacheDir,
)
```

**`RESTIC_CACHE_DIR`** must be set to `/opt/docker/felhom-controller/data/restic-cache` (inside the controller-data Docker volume). Without this, restic defaults to `~/.cache/restic` which may not persist across container restarts.

**Init** (idempotent):
- Check if `{repoPath}/config` file exists → if so, already initialized, skip
- Otherwise: `restic init`

**Snapshot:**
```bash
restic backup --json /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
    --tag felhom --tag <customerID> --host <customerID>
```

What gets backed up (v1):
- `/opt/docker/stacks/` — compose files, .felhom.yml, app.yaml (deploy configs with secrets)
- `/srv/backups/db-dumps/` — SQL dumps (from the DB dump step)
- `/opt/docker/felhom-controller/controller.yaml` — controller config

**NOT backed up in v1:**
- HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
- Docker volumes directly — critical data covered by DB dumps

Parse snapshot output (restic `backup` with `--json` writes JSON lines to stdout):
```json
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
```
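
A sketch of picking the summary object out of the captured JSON lines. The struct covers only the fields shown in the example above; `backupSummary` and `parseSummary` are hypothetical names, and the captured output is assumed to be available as a byte slice.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"errors"
)

// backupSummary mirrors the subset of restic's summary message used here.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    int64  `json:"data_added"` // bytes
	SnapshotID   string `json:"snapshot_id"`
}

// parseSummary scans JSON lines and returns the last "summary" message.
// Lines that are not valid JSON are skipped.
func parseSummary(output []byte) (*backupSummary, error) {
	var found *backupSummary
	sc := bufio.NewScanner(bytes.NewReader(output))
	for sc.Scan() {
		var msg backupSummary
		if err := json.Unmarshal(sc.Bytes(), &msg); err != nil {
			continue
		}
		if msg.MessageType == "summary" {
			found = &msg
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if found == nil {
		return nil, errors.New("no summary message in restic output")
	}
	return found, nil
}
```

`DataAdded` arrives as a byte count, so `SnapshotResult.DataAdded` would need a separate formatting step to become human-readable.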

**Prune:**
```bash
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```

**Check:**
```bash
restic check
```

**Latest snapshot:**
```bash
restic snapshots --latest 1 --json
```
Returns JSON array with snapshot objects.

**Stats (repo size):**
```bash
restic stats --json
```

### Password auto-generation

On startup, `EnsureInitialized()` checks if the password file exists. If not:
1. Generate 32 random bytes, base64url-encode
2. Write to `r.passwordFile` (the controller-data volume path)
3. Log `[INFO] Generated new restic repository password at <path>`
4. Log `[WARN] Save this password externally — losing it means losing access to ALL backups`

### Gotchas

- **restic is already in the Docker image** (Dockerfile installs it). No additional setup.
- **Locking:** Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error — add `restic unlock` to the error handling path with a log warning.
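- **Timeout:** 30-minute timeout for snapshot operations, applied via `context.WithTimeout` and `exec.CommandContext`. Check `ctx.Err()` for `context.DeadlineExceeded` to report a timeout distinctly from other failures.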
- **Large repos:** First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
- **restic JSON output:** Use `--json` for machine-parseable output. restic with `--json` sends its JSON lines to **stdout**, not stderr. For `backup --json`, the object with `message_type: "summary"` is on stdout; parse the last JSON line from stdout.

---

## Phase 3C — Backup Orchestrator (`internal/backup/backup.go`)

### Design

```go
type Manager struct {
    cfg    *config.Config
    restic *ResticManager
    logger *log.Logger
    pinger *monitor.Pinger

    mu         sync.Mutex
    lastDBDump *DBDumpStatus
    lastBackup *BackupStatus
}

type DBDumpStatus struct {
    LastRun   time.Time
    Results   []DumpResult
    Success   bool
    Duration  time.Duration
}

type BackupStatus struct {
    LastRun    time.Time
    Snapshot   *SnapshotResult
    Success    bool
    Duration   time.Duration
    RepoStats  *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error  // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)
```

### Full backup flow (daily scheduled)

1. **DB dumps:** `DiscoverDatabases()` → `DumpAll()` → update `lastDBDump` status
2. Ping Healthchecks for DB dump result: `pinger.Ping/Fail(dbDumpUUID, summary)`
3. **Restic snapshot:** `restic.EnsureInitialized()` → `restic.Snapshot(paths, tags)`
4. **Prune (weekly):** Check day of week against `prune_schedule` config. If match → `restic.Prune(retention)` + `restic.Check()`
5. Ping Healthchecks for backup result: `pinger.Ping/Fail(backupUUID, summary)`
6. Update `lastBackup` status

### Scheduler integration

```go
// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)

if cfg.Backup.Enabled {
    sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
        return backupMgr.RunDBDumps(ctx)
    })

    sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
        return backupMgr.RunBackup(ctx)
    })
}
```

### Dashboard display

Add "Biztonsági mentés" (Backup) section to `dashboard.html`:

```
╔══════════════════════════════════════════╗
║  🛡️ Biztonsági mentés                   ║
╠══════════════════════════════════════════╣
║  Utolsó mentés: 2026-02-15 03:01 ✅      ║
║  Adatbázisok: 2 mentve (12.3 MB)         ║
║  Tároló méret: 45.2 MB (23 pillanatkép)   ║
║  Következő: ma 03:00                      ║
║                                           ║
║  [Mentés most]                            ║
╚══════════════════════════════════════════╝
```

Hungarian labels:
- "Biztonsági mentés" = Backup
- "Utolsó mentés" = Last backup
- "Adatbázisok" = Databases
- "mentve" = backed up
- "Tároló méret" = Repository size
- "pillanatkép" = snapshot(s)
- "Következő" = Next
- "Mentés most" = Backup now

Status colors:
- Green ✅: Last backup successful and less than `backup_max_age_hours` old
- Yellow ⚠️: Last backup successful but older than expected
- Red ❌: Last backup failed or no backups exist yet
- Gray: Backup not configured (`backup.enabled: false`)

If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).

### API endpoints

Add to `api/router.go`:

```
GET  /api/backup/status    → backup manager status + repo stats
POST /api/backup/run       → trigger immediate full backup (async)
```

`POST /api/backup/run` starts the backup in a background goroutine and returns immediately with `{"ok": true, "message": "Mentés elindítva"}` ("Backup started"). The dashboard can poll `/api/backup/status` to track progress.

---

## Docker-compose.yml final state

```yaml
services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket — required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys — for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3

volumes:
  controller-data:

networks:
  traefik-public:
    external: true
```

Changes from current:
1. **Added:** `/sys:/host/sys:ro` — for temperature reading
2. **Removed:** dedicated restic-password bind mount (password now in controller-data volume)

---

## Config changes summary

### `controller.yaml.example` updates

```yaml
monitoring:
  system_health_interval: "5m"    # NEW field

backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password"  # CHANGED default path
```

### `config.go` updates

- Add `SystemHealthInterval string` to `MonitoringConfig`
- Default: `"5m"` in `applyDefaults()`
- Change `restic_password_file` default from `/opt/docker/felhom-controller/restic-password` to `/opt/docker/felhom-controller/data/restic-password`
- Add env override: `FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL`

---

## Implementation Order

### Step 1: Scheduler
1. Create `internal/scheduler/scheduler.go`
2. Implement `Every()` and `Daily()` with logging, panic recovery, skip-if-running
3. Migrate the two existing goroutines from `main.go` to scheduler
4. **Build and verify** — behavior should be identical, logs should show `[SCHED]` entries

### Step 2: CPU & Temperature metrics
1. Create `internal/system/cpu_linux.go` + `cpu_other.go` (build tags)
2. Add `readLoadAvg()` and `readTemperature()` to `info_linux.go`
3. Extend `SystemInfo` struct in `info.go`
4. Update `GetInfo()` signature in all files to accept `*CPUCollector`
5. Start CPUCollector in `main.go`, pass to web server and API router
6. Update `docker-compose.yml` — add `/sys:/host/sys:ro`
7. Update `dashboard.html` — show CPU, load, temperature
8. Update `style.css` if needed for new display elements
9. **Build, deploy, verify** — new metrics visible on dashboard

### Step 3: Healthchecks pinger + health checks
1. Create `internal/monitor/pinger.go`
2. Create `internal/monitor/healthcheck.go`
3. Add `system_health_interval` to config
4. Add system health ping job to scheduler in `main.go`
5. **Build, deploy** — check controller logs for health check runs

### Step 4: Database dump engine
1. Create `internal/backup/dbdump.go`
2. Implement discovery + dump functions
3. Wire up `RunDBDumps` temporarily to a test endpoint or manual scheduler trigger for testing
4. **Build, deploy, verify** — dumps should appear in `/srv/backups/db-dumps/` for paperless-ngx-postgres

### Step 5: Restic integration
1. Create `internal/backup/restic.go`
2. Implement init, snapshot, prune, check, stats
3. Auto-generate restic password if missing
4. Update docker-compose.yml (remove restic-password bind mount)
5. **Build, deploy, verify** — repo initialized, password generated

### Step 6: Backup orchestrator + dashboard
1. Create `internal/backup/backup.go`
2. Wire up scheduler daily jobs (DB dump + backup)
3. Add API endpoints (`/api/backup/status`, `/api/backup/run`)
4. Add backup status section to `dashboard.html`
5. Add "Mentés most" button
6. **Build, deploy, verify full flow**

### Step 7: Documentation & cleanup
1. Update `README.md` — Phase 2 and 3 checked off, new module descriptions
2. Update `CONTEXT.md` with session summary
3. Update `CLAUDE.md` if workflow changes
4. Version bump in build: `v0.4.0`

---

## Verification Checklist

After deployment, verify each item:

- [ ] `docker ps` shows controller healthy
- [ ] Dashboard loads with CPU %, load average, temperature displayed
- [ ] Temperature shows realistic value (30-60°C idle for N100)
- [ ] CPU % updates (not stuck at 0)
- [ ] `/api/system/info` returns all new fields (cpu_percent, load_avg_*, temperature_*)
- [ ] Scheduler logs show `[SCHED]` entries for all registered jobs
- [ ] If HC UUIDs configured: pings visible in status.felhom.eu dashboard
- [ ] DB dump discovers paperless-ngx postgres container
- [ ] Dump file exists: `/srv/backups/db-dumps/paperless-ngx-postgres.sql`
- [ ] Restic repo initialized: `/srv/backups/restic-repo/config` exists
- [ ] Restic password auto-generated: `/opt/docker/felhom-controller/data/restic-password` exists
- [ ] "Mentés most" button triggers backup successfully
- [ ] Dashboard shows backup status section with last backup time
- [ ] All existing features still work (start/stop/deploy/update/logs/auth)

---

## New files to create

```
internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go
```

## Existing files to modify

```
internal/system/info.go              — new SystemInfo fields
internal/system/info_linux.go        — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go        — GetInfo() signature update
internal/config/config.go            — SystemHealthInterval, updated defaults
internal/api/router.go               — backup endpoints, cpuCollector parameter
internal/web/server.go               — accept cpuCollector, backupMgr
internal/web/handlers.go             — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css     — styles for new elements
cmd/controller/main.go               — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml        — /sys mount, remove restic-password mount
configs/controller.yaml.example      — new fields, updated defaults
```

---

## Manual steps after deployment (for Viktor)

1. **Verify /sys mount:** `docker exec felhom-controller ls /host/sys/class/thermal/` — should show thermal_zone directories
2. **Healthchecks setup:** Create project + 3 checks in status.felhom.eu for demo-felhom:
   - `system-health` (period: 10m, grace: 10m)
   - `db-dump` (period: 24h, grace: 1h)
   - `backup` (period: 24h, grace: 1h)
3. **Update controller.yaml:** Add the three ping UUIDs
4. **Verify restic password:** `docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password`
5. **Test repository access** (lists snapshots; a full restore test would additionally run `restic restore`):
   ```bash
   docker exec felhom-controller restic -r /srv/backups/restic-repo \
     --password-file /opt/docker/felhom-controller/data/restic-password snapshots
   ```
6. **Save restic password externally** — losing it means losing access to all backups