TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups

Version bump target: v0.4.0

Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)


Overview

Implement two major features in felhom-controller:

  1. Phase 2 — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
  2. Phase 3 — Database dump engine, restic backup snapshots, dashboard status display

Both phases share the scheduler infrastructure. Implement in order.


Phase 2A — Scheduler (internal/scheduler/)

Why first

main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min). Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups. All need a centralized, logged, observable job runner. Build it once, use everywhere.

Design: internal/scheduler/scheduler.go

package scheduler

type JobFunc func(ctx context.Context) error

type Job struct {
    Name     string
    Fn       JobFunc
    Interval time.Duration    // for periodic jobs (every N)
    Schedule string           // for daily jobs ("02:30", "03:00") — mutually exclusive with Interval
    LastRun  time.Time
    LastErr  error
    Running  bool
}

type Scheduler struct {
    mu     sync.Mutex // guards LastRun, LastErr, Running on every job
    jobs   []*Job
    logger *log.Logger
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc)  // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job  // for dashboard/API display (copy, not pointer)

Interval jobs (Every)

  • Spawns a goroutine with time.Ticker
  • Logs [SCHED] Running job: <n> at start, [SCHED] Job <n> completed (took Xs) or [SCHED] Job <n> failed: <err> (took Xs) at end
  • Updates LastRun, LastErr, Running fields (mutex-protected)
  • Respects ctx.Done() for shutdown
  • Quiet mode for high-frequency jobs: Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.

Daily jobs (Daily)

  • Parses timeStr as "HH:MM" in Europe/Budapest timezone
  • On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
  • After each run, sleeps until the next day's scheduled time
  • Uses time.After() or time.Timer, NOT time.Ticker (handles DST transitions correctly)
  • Same logging pattern as interval jobs
  • Logs [SCHED] Daily job <n> scheduled for <next_time> on registration
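The next-run calculation described above can be sketched as follows. This is a minimal illustration, not the final scheduler code: nextOccurrence and nextOccurrenceString are hypothetical helper names, and the blank import of time/tzdata is an assumption that the container image may lack a system zone database.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // assumption: embed zone data in case the image has no /usr/share/zoneinfo
)

// nextOccurrence (hypothetical helper) returns the first instant strictly
// after now whose wall-clock time in loc equals timeStr ("HH:MM").
func nextOccurrence(now time.Time, timeStr string, loc *time.Location) (time.Time, error) {
	hm, err := time.Parse("15:04", timeStr)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid daily schedule %q: %w", timeStr, err)
	}
	local := now.In(loc)
	next := time.Date(local.Year(), local.Month(), local.Day(), hm.Hour(), hm.Minute(), 0, 0, loc)
	if !next.After(local) {
		// Build tomorrow from calendar fields instead of adding 24h, so a
		// 23- or 25-hour DST day still lands on the same wall-clock time.
		next = time.Date(local.Year(), local.Month(), local.Day()+1, hm.Hour(), hm.Minute(), 0, 0, loc)
	}
	return next, nil
}

// nextOccurrenceString is a hypothetical string-based wrapper, convenient
// for log lines and quick checks.
func nextOccurrenceString(nowRFC3339, timeStr, zone string) (string, error) {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return "", err
	}
	now, err := time.Parse(time.RFC3339, nowRFC3339)
	if err != nil {
		return "", err
	}
	next, err := nextOccurrence(now, timeStr, loc)
	if err != nil {
		return "", err
	}
	return next.Format(time.RFC3339), nil
}
```

A Daily job would then wait on time.NewTimer(time.Until(next)) and recompute next after every run. A schedule that falls inside the skipped hour of a spring DST change is resolved by time.Date to a nearby valid instant, so such times are best avoided in config.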

Edge cases

  • If Daily timeStr is invalid → log error at registration, don't start the job
  • If a job panics → recover, log [ERROR] Job <n> panicked: <err>, mark as failed
  • If a job is already running when the next tick fires → skip, log [WARN] Job <n> still running, skipping
  • Graceful shutdown: Stop() cancels context, wg.Wait() with 30s timeout for running jobs to finish
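The panic and overlap rules above could be combined in one small wrapper. The sketch below is an assumption about how that might look: runGuard, runSafely and runOnce are hypothetical names, and the job function omits the ctx parameter of JobFunc for brevity.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// runGuard (hypothetical) tracks whether a job is in flight so that an
// overlapping tick can be skipped instead of queued.
type runGuard struct {
	running atomic.Bool
}

// tryStart reports true only for the caller that flips the flag.
func (g *runGuard) tryStart() bool { return g.running.CompareAndSwap(false, true) }

// finish marks the job as idle again.
func (g *runGuard) finish() { g.running.Store(false) }

// runSafely calls fn and converts a panic into an ordinary error, so one
// misbehaving job cannot take the whole controller down.
func runSafely(name string, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", name, r)
		}
	}()
	return fn()
}

// runOnce applies skip-if-running plus panic recovery. The first result
// is false when the tick was skipped because a previous run is active.
func runOnce(g *runGuard, name string, fn func() error) (bool, error) {
	if !g.tryStart() {
		return false, nil
	}
	defer g.finish()
	return true, runSafely(name, fn)
}
```

The scheduler loop would log the [WARN] skip message when the first result is false, and record the returned error in LastErr otherwise.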

Integration in main.go

Replace the two ad-hoc goroutines with:

sched := scheduler.New(logger)

// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
    return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
    return stackMgr.ScanStacks()
})

// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)

sched.Start(ctx)
defer sched.Stop()

Delete the two existing goroutines in main.go after migrating to the scheduler.


Phase 2B — CPU & Temperature Metrics (internal/system/)

Current state

info.go defines SystemInfo struct with memory + disk fields. info_linux.go reads /proc/meminfo and syscall.Statfs. info_other.go provides stubs for non-Linux.

New fields in SystemInfo

// Add to SystemInfo struct in info.go:
CPUPercent          float64 `json:"cpu_percent"`           // 0-100, averaged across all cores
LoadAvg1            float64 `json:"load_avg_1"`            // 1-minute load average
LoadAvg5            float64 `json:"load_avg_5"`            // 5-minute load average
LoadAvg15           float64 `json:"load_avg_15"`           // 15-minute load average
TemperatureCelsius  float64 `json:"temperature_celsius"`   // CPU/SoC temperature
TemperatureSource   string  `json:"temperature_source"`    // e.g. "thermal_zone0", "x86_pkg_temp"

CPU measurement approach

Do NOT block GetInfo() with a delta calculation.

Use a lightweight CPUCollector that runs in a background goroutine:

// internal/system/cpu_linux.go (build tag: linux)

type CPUCollector struct {
    mu          sync.RWMutex
    cpuPercent  float64
    sampleRate  time.Duration  // default: 5 seconds
    cancel      context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64  // returns latest sample

How it works:

  1. Reads /proc/stat first line: cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>
  2. Sleeps sampleRate (5s)
  3. Reads again, computes delta: busy = delta(user+nice+system+irq+softirq+steal), total = busy + delta(idle+iowait)
  4. cpuPercent = (busy / total) * 100
  5. Stores result, loops

Parsing /proc/stat:

cpu  1234 56 789 45678 123 45 67 0 0 0

Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Sum all = total. idle + iowait = idle_total. busy = total - idle_total.
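As an illustration of the parsing and delta steps, here is a minimal sketch. cpuSample, parseCPULine and cpuPercent are hypothetical names; the collector itself would call them on two consecutive reads of /proc/stat.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample (hypothetical) holds cumulative jiffies from one /proc/stat read.
type cpuSample struct {
	busy  uint64
	total uint64
}

// parseCPULine parses the aggregate "cpu ..." line and uses the first
// eight counters: user nice system idle iowait irq softirq steal.
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 9 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var v [8]uint64
	for i := range v {
		n, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i+1, err)
		}
		v[i] = n
	}
	var total uint64
	for _, n := range v {
		total += n
	}
	idleTotal := v[3] + v[4] // idle + iowait
	return cpuSample{busy: total - idleTotal, total: total}, nil
}

// cpuPercent returns busy time as a percentage of elapsed time between
// two samples, or 0 when no time has passed or counters went backwards.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0
	}
	return float64(cur.busy-prev.busy) / float64(cur.total-prev.total) * 100
}
```

For the sample line above this yields total = 47992 and busy = 2191; a single sample says nothing about current load, which is why the collector needs the delta.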

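IMPORTANT: /proc/stat is not namespaced, so inside a Docker container it still reports host-wide CPU counters, regardless of any cgroup CPU limits (the exception is a runtime that overlays /proc, for example with lxcfs). The controller can therefore read its own /proc/stat to measure host CPU usage.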

Load average

Read from /proc/loadavg (instant, no delta needed):

0.15 0.10 0.05 1/234 56789

First three fields are 1/5/15 minute load averages. Parse with fmt.Sscanf.

Add readLoadAvg(info *SystemInfo) in info_linux.go.
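A possible shape for the parsing part, assuming the file content has already been read. parseLoadAvg and readLoadAvgFile are hypothetical helpers; readLoadAvg(info) would copy the three values into SystemInfo.

```go
package main

import (
	"fmt"
	"os"
)

// parseLoadAvg extracts the 1/5/15-minute averages from the content of
// /proc/loadavg. Trailing fields (run queue, last PID) are ignored.
func parseLoadAvg(content string) (l1, l5, l15 float64, err error) {
	if _, err = fmt.Sscanf(content, "%f %f %f", &l1, &l5, &l15); err != nil {
		return 0, 0, 0, fmt.Errorf("parse loadavg %q: %w", content, err)
	}
	return l1, l5, l15, nil
}

// readLoadAvgFile reads and parses the given file, normally /proc/loadavg.
func readLoadAvgFile(path string) (l1, l5, l15 float64, err error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, 0, 0, err
	}
	return parseLoadAvg(string(raw))
}
```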

Temperature

Read from /sys/class/thermal/thermal_zone*/temp:

IMPORTANT: The controller runs in a Docker container. /sys is NOT available by default. We mount the host's /sys at /host/sys inside the container (see docker-compose.yml changes below).

// internal/system/info_linux.go — add readTemperature(info *SystemInfo)

Algorithm:

  1. Try /host/sys/class/thermal/thermal_zone*/temp first (Docker mount)
  2. Fallback to /sys/class/thermal/thermal_zone*/temp (native/development)
  3. For each zone, also read the type file for the label
  4. Pick the highest temperature (usually thermal_zone0 or x86_pkg_temp)
  5. Value is in millidegrees Celsius → divide by 1000.0
  6. Store the zone type as TemperatureSource
  7. If no thermal zones found: try /host/sys/class/hwmon/hwmon*/temp1_input as fallback (same millidegree format)
  8. If nothing found: leave fields as zero (dashboard hides temperature when 0)
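Steps 1 to 6 and 8 of the algorithm might look roughly like this. The names zoneReading, readThermalZones, hottest and readTemperatureFrom are hypothetical, and the hwmon fallback from step 7 is left out to keep the sketch short.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// zoneReading (hypothetical) is one thermal zone value in millidegrees.
type zoneReading struct {
	Source string
	Milli  int64
}

// readThermalZones collects every readable thermal_zone*/temp under
// sysRoot ("/host/sys" or "/sys"), labelled with the zone's type file.
func readThermalZones(sysRoot string) []zoneReading {
	paths, _ := filepath.Glob(filepath.Join(sysRoot, "class/thermal/thermal_zone*/temp"))
	var out []zoneReading
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		milli, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			continue
		}
		dir := filepath.Dir(p)
		source := filepath.Base(dir) // fallback label, e.g. "thermal_zone0"
		if t, err := os.ReadFile(filepath.Join(dir, "type")); err == nil {
			if label := strings.TrimSpace(string(t)); label != "" {
				source = label
			}
		}
		out = append(out, zoneReading{Source: source, Milli: milli})
	}
	return out
}

// hottest picks the highest reading and converts it to degrees Celsius.
func hottest(zones []zoneReading) (celsius float64, source string, ok bool) {
	for _, z := range zones {
		c := float64(z.Milli) / 1000.0
		if !ok || c > celsius {
			celsius, source, ok = c, z.Source, true
		}
	}
	return celsius, source, ok
}

// readTemperatureFrom tries each root in order and returns the first hit.
func readTemperatureFrom(roots ...string) (float64, string, bool) {
	for _, root := range roots {
		if c, src, ok := hottest(readThermalZones(root)); ok {
			return c, src, true
		}
	}
	return 0, "", false
}
```

readTemperature(info) could call readTemperatureFrom("/host/sys", "/sys") and leave both fields at zero when the last result is false.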

GetInfo() signature change

// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo

Inside GetInfo():

  1. Existing: readMemInfo(&info), readDiskUsage(...) — unchanged
  2. New: readLoadAvg(&info)
  3. New: readTemperature(&info)
  4. New: if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }

The info_other.go stub accepts the parameter but ignores it (returns empty SystemInfo as before).

CPU collector lifecycle

Started in main.go:

cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()

Passed to web.NewServer() and api.NewRouter() which pass it to system.GetInfo() calls.

Dashboard display

Extend the existing system info bar in dashboard.html:

Current layout:

| Memória | ████████░░ 72% |  SSD | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |

New layout:

| Memória | ████████░░ 72% |  CPU | ██░░░░░░░░ 15% |  Hőmérséklet | 52°C |
| SSD     | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |

Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.

Temperature display:

  • Show as text "52°C" with colored dot (green/yellow/red)
  • Green: < 60°C
  • Yellow: 60-75°C
  • Red: > 75°C
  • If temperature is 0 (unavailable): hide entirely

CPU progress bar:

  • Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%

Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"
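If the thresholds are evaluated in Go (for example as template helpers) rather than in the browser, they reduce to a few comparisons. The function names and the returned class names below are assumptions for illustration only.

```go
package main

import "fmt"

// tempLevel maps a temperature to a display level. "hidden" stands for
// the "unavailable" case where the reading is 0.
func tempLevel(celsius float64) string {
	switch {
	case celsius <= 0:
		return "hidden"
	case celsius < 60:
		return "green"
	case celsius <= 75:
		return "yellow"
	default:
		return "red"
	}
}

// usageLevel applies the shared memory/disk/CPU colour scheme.
func usageLevel(percent float64) string {
	switch {
	case percent < 70:
		return "green"
	case percent <= 85:
		return "yellow"
	default:
		return "red"
	}
}

// loadLabel renders the small text line shown under the CPU bar.
func loadLabel(l1, l5, l15 float64) string {
	return fmt.Sprintf("Load: %.1f / %.1f / %.1f", l1, l5, l15)
}
```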


Phase 2C — Healthchecks.io Ping Integration (internal/monitor/)

Design: internal/monitor/pinger.go

package monitor

type Pinger struct {
    baseURL    string          // e.g. "https://status.felhom.eu"
    httpClient *http.Client
    logger     *log.Logger
    enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error

HTTP protocol

  • Success: POST {baseURL}/ping/{uuid} with body as request body
  • Failure: POST {baseURL}/ping/{uuid}/fail with body
  • Start: POST {baseURL}/ping/{uuid}/start
  • Timeout: 10 seconds
  • Retry: 3 attempts with 2s backoff between retries
  • If uuid is empty or starts with "CHANGEME" → skip silently (log at debug level only)
  • If enabled is false → skip all pings
  • Never let ping failures affect the main operation — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.
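The protocol rules above could translate into something like the following sketch. shouldSkip, pingURL, newPingClient and send are hypothetical helpers; Ping, Fail and Start would be thin wrappers that pick the URL suffix and log, rather than propagate, any error.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// shouldSkip reports whether a UUID is unset or still a placeholder.
func shouldSkip(uuid string) bool {
	return uuid == "" || strings.HasPrefix(uuid, "CHANGEME")
}

// pingURL builds {baseURL}/ping/{uuid}[/suffix]; suffix is "", "fail" or "start".
func pingURL(baseURL, uuid, suffix string) string {
	u := strings.TrimRight(baseURL, "/") + "/ping/" + uuid
	if suffix != "" {
		u += "/" + suffix
	}
	return u
}

// newPingClient returns a client with the 10-second timeout from the spec.
func newPingClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}

// send POSTs body to url, trying up to 3 times with a 2s pause in between.
// A fresh request is built per attempt because a body reader is consumed.
func send(ctx context.Context, client *http.Client, url, body string) error {
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader(body))
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err == nil {
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		if attempt < 3 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(2 * time.Second):
			}
		}
	}
	return fmt.Errorf("ping failed after 3 attempts: %w", lastErr)
}
```

Keeping URL construction and the skip rule as pure functions makes them easy to unit-test without a network.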

Design: internal/monitor/healthcheck.go

// RunHealthCheck runs system checks and returns a diagnostic report.
type HealthReport struct {
    Status    string   // "ok", "warn", "fail"
    Issues    []string // critical problems
    Warnings  []string // non-critical warnings
    Info      []string // informational items
    Timestamp time.Time
}

func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string  // human-readable summary for HC ping body

Checks to run (replicating backup-healthcheck.sh logic in Go):

  1. Disk usage: Read from system.GetInfo(). Compare against thresholds (disk_warn_percent, disk_crit_percent).
  2. Memory usage: Same source. Warn if above memory_warn_percent.
  3. CPU usage: From collector. Warn if above cpu_warn_percent.
  4. Temperature: From system.GetInfo(). Warn if above temperature_warn_celsius.
  5. Docker health: Verify Docker daemon is reachable by running docker info (quick exec check).
  6. Protected containers: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.

Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".
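The status rule is small enough to show in full. The message layout in formatMessage is purely an assumption, since the spec only asks for a human-readable summary; both function names are hypothetical.

```go
package main

import "strings"

// deriveStatus applies the rule: any issue is "fail", otherwise any
// warning is "warn", otherwise "ok".
func deriveStatus(issues, warnings []string) string {
	switch {
	case len(issues) > 0:
		return "fail"
	case len(warnings) > 0:
		return "warn"
	default:
		return "ok"
	}
}

// formatMessage renders one line per finding (assumed layout).
func formatMessage(status string, issues, warnings []string) string {
	lines := []string{"status: " + status}
	for _, i := range issues {
		lines = append(lines, "FAIL: "+i)
	}
	for _, w := range warnings {
		lines = append(lines, "WARN: "+w)
	}
	return strings.Join(lines, "\n")
}
```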

Scheduler integration

// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth

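// Parse system_health_interval (default "5m"). Fall back on a bad value,
// because a zero or negative interval would make time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
    logger.Printf("[WARN] invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
    healthInterval = 5 * time.Minute
}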

sched.Every("system-health", healthInterval, func(ctx context.Context) error {
    report := monitor.RunHealthCheck(cfg, cpuCollector)
    body := report.FormatMessage()

    if report.Status == "fail" {
        pinger.Fail(healthUUID, body)
    } else {
        pinger.Ping(healthUUID, body)
    }
    return nil  // never fail the scheduler job due to ping errors
})

Config changes

Add to MonitoringConfig:

SystemHealthInterval string `yaml:"system_health_interval"`

Default in applyDefaults(): "5m"


Phase 3A — Database Dump Engine (internal/backup/dbdump.go)

Approach: Auto-discover from running Docker containers

Replicates the proven logic from backup-db-dump.sh in Go:

package backup

type DBType string
const (
    DBTypePostgres DBType = "postgres"
    DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
    ContainerName string
    ContainerID   string
    DBType        DBType
    DBUser        string
    DBName        string
    StackName     string  // derived from container name
}

type DumpResult struct {
    DB       DiscoveredDB
    FilePath string
    Size     int64
    Duration time.Duration
    Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult

Discovery logic

Run docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running.

For each running container, check image name:

  • Contains postgres → DBTypePostgres
  • Contains mariadb or mysql → DBTypeMariaDB

Then for each DB container, get env vars via: docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'

Parse env vars:

  • PostgreSQL: POSTGRES_USER (default: "postgres"), POSTGRES_DB (default: same as POSTGRES_USER)
  • MariaDB: MYSQL_ROOT_PASSWORD (or MARIADB_ROOT_PASSWORD), MYSQL_DATABASE (or MARIADB_DATABASE)
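Image matching and env parsing could be sketched as below. All function names are hypothetical, and the plain string return of detectDBType stands in for the DBType constants defined earlier.

```go
package main

import "strings"

// detectDBType classifies a container image name; "" means "not a database".
func detectDBType(image string) string {
	img := strings.ToLower(image)
	switch {
	case strings.Contains(img, "postgres"):
		return "postgres"
	case strings.Contains(img, "mariadb"), strings.Contains(img, "mysql"):
		return "mariadb"
	}
	return ""
}

// parseEnv turns the KEY=VALUE lines printed by docker inspect into a map.
// Only the first "=" splits, so values may themselves contain "=".
func parseEnv(output string) map[string]string {
	env := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		key, value, found := strings.Cut(strings.TrimSuffix(line, "\r"), "=")
		if found && key != "" {
			env[key] = value
		}
	}
	return env
}

// postgresTarget applies the defaults: user "postgres", db same as user.
func postgresTarget(env map[string]string) (user, db string) {
	user = env["POSTGRES_USER"]
	if user == "" {
		user = "postgres"
	}
	db = env["POSTGRES_DB"]
	if db == "" {
		db = user
	}
	return user, db
}

// firstNonEmpty returns the first set value among keys, for pairs such
// as MYSQL_DATABASE / MARIADB_DATABASE.
func firstNonEmpty(env map[string]string, keys ...string) string {
	for _, k := range keys {
		if v := env[k]; v != "" {
			return v
		}
	}
	return ""
}
```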

Derive stack name from container name by stripping common DB suffixes:

  • paperless-ngx-postgres → paperless-ngx
  • romm-db → romm
  • immich-postgres → immich
  • Logic: split on -, check if last segment is a known suffix (postgres, db, mariadb, mysql, database, redis, cache), if so remove it
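The suffix rule can be written as a single lookup on the last dash-separated segment. deriveStackName and dbSuffixes are hypothetical names; trimming a leading "/" is an extra assumption for names that come from docker inspect.

```go
package main

import "strings"

// dbSuffixes lists the trailing segments treated as "this is the DB
// sidecar of a stack", taken from the rule above.
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// deriveStackName strips one known suffix from a container name and
// returns the name unchanged when no suffix matches.
func deriveStackName(container string) string {
	name := strings.TrimPrefix(container, "/")
	i := strings.LastIndex(name, "-")
	if i <= 0 {
		return name
	}
	if dbSuffixes[strings.ToLower(name[i+1:])] {
		return name[:i]
	}
	return name
}
```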

Dump execution

PostgreSQL:

docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges

MariaDB:

docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>

IMPORTANT: Use docker exec to run dump commands INSIDE the DB container. Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.

Output handling:

  • Use os/exec.Command("docker", "exec", ...) with cmd.Stdout piped to a temp file
  • Write to {dumpDir}/{stackName}-{dbtype}.sql.tmp during dump
  • Rename .tmp → .sql on success only
  • Delete .tmp on failure
  • Set 5-minute timeout per dump via context.WithTimeout
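A sketch of the temp-file handling under those rules. dumpToFile is a hypothetical helper that takes the full command line as argv; in the controller, DumpOne would pass "docker", "exec", ... and derive the timeout from the job context instead of context.Background().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

// dumpToFile runs argv with stdout redirected to finalPath+".tmp" and
// renames the file into place only if the command succeeded and wrote
// at least one byte. It returns the size of the finished dump.
func dumpToFile(timeout time.Duration, finalPath string, argv ...string) (int64, error) {
	if len(argv) == 0 {
		return 0, errors.New("no command given")
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	tmpPath := finalPath + ".tmp"
	f, err := os.Create(tmpPath)
	if err != nil {
		return 0, err
	}

	cmd := exec.CommandContext(ctx, argv[0], argv[1:]...)
	cmd.Stdout = f
	runErr := cmd.Run()
	closeErr := f.Close()
	info, statErr := os.Stat(tmpPath)

	switch {
	case runErr != nil:
		err = fmt.Errorf("dump command failed: %w", runErr)
	case closeErr != nil:
		err = closeErr
	case statErr != nil:
		err = statErr
	case info.Size() == 0:
		err = errors.New("dump produced 0 bytes")
	}
	if err != nil {
		os.Remove(tmpPath) // never leave a partial dump behind
		return 0, err
	}
	if err := os.Rename(tmpPath, finalPath); err != nil {
		os.Remove(tmpPath)
		return 0, err
	}
	return info.Size(), nil
}
```

Because the rename happens last, a reader of the .sql file (restic, in this design) never sees a half-written dump.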

Gotchas and edge cases

  • MariaDB password from container env: Never log the password. Use docker inspect to read MYSQL_ROOT_PASSWORD or MARIADB_ROOT_PASSWORD.
  • Empty/zero-size dumps: Check dump file size after writing. If 0 bytes → treat as failure.
  • Dump file naming: {stackName}-{dbtype}.sql (e.g., paperless-ngx-postgres.sql). Overwrite previous dump each run (restic handles versioning).
  • Old tmp cleanup: Delete .tmp files older than 1 hour on each run (leftover from crashed dumps).
  • Skip infrastructure DBs: Don't dump databases from protected stacks (if any have DBs in the future).
  • Container not running: If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).

Dump directory

/srv/backups/db-dumps/ — configured in controller.yaml as paths.db_dump_dir. Already mounted in docker-compose.yml via /srv/backups:/srv/backups.

The user does NOT see this directory (not in FileBrowser, not on HDD).


Phase 3B — Restic Integration (internal/backup/restic.go)

Design

type ResticManager struct {
    repoPath     string
    passwordFile string
    logger       *log.Logger
    customerID   string
    cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager

func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
    SnapshotID   string
    FilesNew     int
    FilesChanged int
    DataAdded    string        // human-readable
    Duration     time.Duration
}

type SnapshotInfo struct {
    ID       string
    Time     time.Time
    Paths    []string
    Tags     []string
}

type RepoStats struct {
    TotalSize      string
    SnapshotCount  int
    LatestSnapshot *SnapshotInfo
}

Restic commands (all via os/exec)

All commands set these env vars:

cmd.Env = append(os.Environ(),
    "RESTIC_REPOSITORY="+r.repoPath,
    "RESTIC_PASSWORD_FILE="+r.passwordFile,
    "RESTIC_CACHE_DIR="+r.cacheDir,
)

RESTIC_CACHE_DIR must be set to /opt/docker/felhom-controller/data/restic-cache (inside the controller-data Docker volume). Without this, restic defaults to ~/.cache/restic which may not persist across container restarts.

Init (idempotent):

  • Check if {repoPath}/config file exists → if so, already initialized, skip
  • Otherwise: restic init
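One way to express the idempotent init, assuming a local repository directory. resticEnv, repoInitialized and ensureInitialized are hypothetical free functions here; in the design above they would be methods on ResticManager.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// resticEnv returns a copy of base plus the three variables that every
// restic invocation needs.
func resticEnv(base []string, repoPath, passwordFile, cacheDir string) []string {
	env := append([]string{}, base...)
	return append(env,
		"RESTIC_REPOSITORY="+repoPath,
		"RESTIC_PASSWORD_FILE="+passwordFile,
		"RESTIC_CACHE_DIR="+cacheDir,
	)
}

// repoInitialized checks for the repository's config file.
func repoInitialized(repoPath string) bool {
	info, err := os.Stat(filepath.Join(repoPath, "config"))
	return err == nil && info.Mode().IsRegular()
}

// ensureInitialized runs "restic init" only when the repo is missing.
func ensureInitialized(repoPath string, env []string) error {
	if repoInitialized(repoPath) {
		return nil
	}
	cmd := exec.Command("restic", "init")
	cmd.Env = env
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("restic init: %w: %s", err, out)
	}
	return nil
}
```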

Snapshot:

restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
    --tag felhom --tag <customerID> --host <customerID>

What gets backed up (v1):

  • /opt/docker/stacks/ — compose files, .felhom.yml, app.yaml (deploy configs with secrets)
  • /srv/backups/db-dumps/ — SQL dumps (from the DB dump step)
  • /opt/docker/felhom-controller/controller.yaml — controller config

NOT backed up in v1:

  • HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
  • Docker volumes directly — critical data covered by DB dumps

Parse snapshot output (restic backup with --json writes JSON lines to stdout; the last one is the summary):

{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
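A minimal parser for that output, assuming the JSON lines were captured from stdout as noted under Gotchas. backupSummary and parseBackupSummary are hypothetical names; only the fields shown in the sample line are decoded.

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"strings"
)

// backupSummary mirrors the fields of the summary line used by SnapshotResult.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    int64  `json:"data_added"` // bytes
	SnapshotID   string `json:"snapshot_id"`
}

// parseBackupSummary scans JSON lines and returns the last object whose
// message_type is "summary". Lines that are not valid JSON are skipped.
func parseBackupSummary(stdout string) (*backupSummary, error) {
	var found *backupSummary
	sc := bufio.NewScanner(strings.NewReader(stdout))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long status lines
	for sc.Scan() {
		var msg backupSummary
		if err := json.Unmarshal(sc.Bytes(), &msg); err != nil {
			continue
		}
		if msg.MessageType == "summary" {
			m := msg
			found = &m
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if found == nil {
		return nil, errors.New("no summary line in restic output")
	}
	return found, nil
}
```

DataAdded arrives as a byte count, so the human-readable string in SnapshotResult would be formatted from it afterwards.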

Prune:

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

Check:

restic check

Latest snapshot:

restic snapshots --latest 1 --json

Returns JSON array with snapshot objects.
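Decoding that array could look like the sketch below. snapshotInfo mirrors the SnapshotInfo type above with JSON tags added as an assumption about the field names in restic's output; parseLatestSnapshot is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// snapshotInfo carries the subset of snapshot fields the dashboard needs.
type snapshotInfo struct {
	ID    string    `json:"id"`
	Time  time.Time `json:"time"`
	Paths []string  `json:"paths"`
	Tags  []string  `json:"tags"`
}

// parseLatestSnapshot decodes the JSON array and returns the newest
// entry, or nil without an error when the repository has no snapshots.
func parseLatestSnapshot(data []byte) (*snapshotInfo, error) {
	var snaps []snapshotInfo
	if err := json.Unmarshal(data, &snaps); err != nil {
		return nil, fmt.Errorf("decode snapshots: %w", err)
	}
	if len(snaps) == 0 {
		return nil, nil
	}
	latest := &snaps[0]
	for i := range snaps {
		if snaps[i].Time.After(latest.Time) {
			latest = &snaps[i]
		}
	}
	return latest, nil
}
```

Returning nil for an empty array lets the dashboard distinguish "no backups exist yet" from a restic failure.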

Stats (repo size):

restic stats --json

Password auto-generation

On startup, EnsureInitialized() checks if the password file exists. If not:

  1. Generate 32 random bytes, base64url-encode
  2. Write to r.passwordFile (the controller-data volume path)
  3. Log [INFO] Generated new restic repository password at <path>
  4. Log [WARN] Save this password externally — losing it means losing access to ALL backups
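Steps 1 and 2 might be implemented along these lines. generatePassword and ensurePasswordFile are hypothetical names; opening with O_EXCL and mode 0600 is an added precaution so an existing password is never overwritten and the file is not world-readable.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"io/fs"
	"os"
)

// generatePassword returns 32 random bytes, base64url-encoded without padding.
func generatePassword() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// ensurePasswordFile creates path with a fresh password if it does not
// exist yet. created is false when an existing file was left untouched.
func ensurePasswordFile(path string) (created bool, err error) {
	pw, err := generatePassword()
	if err != nil {
		return false, err
	}
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if errors.Is(err, fs.ErrExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	if _, err := f.WriteString(pw); err != nil {
		f.Close()
		os.Remove(path)
		return false, err
	}
	return true, f.Close()
}
```

The two log lines from steps 3 and 4 would be emitted only when created is true.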

Gotchas

  • restic is already in the Docker image (Dockerfile installs it). No additional setup.
  • Locking: Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error — add restic unlock to the error handling path with a log warning.
  • Timeout: 30-minute timeout for snapshot operations, enforced through a context deadline (context.WithTimeout passed to exec.CommandContext).
  • Large repos: First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
  • restic JSON output: Use --json for machine-parseable output. With --json, restic writes its JSON to stdout, not stderr. For backup --json, the summary object with message_type: "summary" is the last JSON line on stdout, so parse that line.


Phase 3C — Backup Orchestrator (internal/backup/backup.go)

Design

type Manager struct {
    cfg       *config.Config
    restic    *ResticManager
    logger    *log.Logger
    pinger    *monitor.Pinger

    mu         sync.Mutex
    lastDBDump  *DBDumpStatus
    lastBackup  *BackupStatus
}

type DBDumpStatus struct {
    LastRun   time.Time
    Results   []DumpResult
    Success   bool
    Duration  time.Duration
}

type BackupStatus struct {
    LastRun    time.Time
    Snapshot   *SnapshotResult
    Success    bool
    Duration   time.Duration
    RepoStats  *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error  // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)

Full backup flow (daily scheduled)

  1. DB dumps: DiscoverDatabases() → DumpAll() → update lastDBDump status
  2. Ping Healthchecks for DB dump result: pinger.Ping/Fail(dbDumpUUID, summary)
  3. Restic snapshot: restic.EnsureInitialized() → restic.Snapshot(paths, tags)
  4. Prune (weekly): Check day of week against prune_schedule config. If match → restic.Prune(retention) + restic.Check()
  5. Ping Healthchecks for backup result: pinger.Ping/Fail(backupUUID, summary)
  6. Update lastBackup status

Scheduler integration

// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)

if cfg.Backup.Enabled {
    sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
        return backupMgr.RunDBDumps(ctx)
    })

    sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
        return backupMgr.RunBackup(ctx)
    })
}

Dashboard display

Add "Biztonsági mentés" (Backup) section to dashboard.html:

╔══════════════════════════════════════════╗
║  🛡️ Biztonsági mentés                   ║
╠══════════════════════════════════════════╣
║  Utolsó mentés: 2026-02-15 03:01 ✅      ║
║  Adatbázisok: 2 mentve (12.3 MB)         ║
║  Tároló méret: 45.2 MB (23 pillanatkép)   ║
║  Következő: ma 03:00                      ║
║                                           ║
║  [Mentés most]                            ║
╚══════════════════════════════════════════╝

Hungarian labels:

  • "Biztonsági mentés" = Backup
  • "Utolsó mentés" = Last backup
  • "Adatbázisok" = Databases
  • "mentve" = backed up
  • "Tároló méret" = Repository size
  • "pillanatkép" = snapshot(s)
  • "Következő" = Next
  • "Mentés most" = Backup now

Status colors:

  • Green ✅: Last backup successful and less than backup_max_age_hours old
  • Yellow ⚠️: Last backup successful but older than expected
  • Red ❌: Last backup failed or no backups exist yet
  • Gray: Backup not configured (backup.enabled: false)
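The colour rules above reduce to one ordered switch. backupStatusColor is a hypothetical helper, and passing the age in hours as a float is an assumption made to keep the sketch free of imports.

```go
package main

// backupStatusColor maps backup state to the dashboard colour.
// enabled: backup.enabled from config; hasRun: at least one backup was
// attempted; lastSuccess: outcome of the most recent attempt.
func backupStatusColor(enabled, hasRun, lastSuccess bool, ageHours, maxAgeHours float64) string {
	switch {
	case !enabled:
		return "gray"
	case !hasRun || !lastSuccess:
		return "red"
	case ageHours < maxAgeHours:
		return "green"
	default:
		return "yellow"
	}
}
```

Checking enabled first matters: a disabled backup should stay gray even when an old failed run is still recorded.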

If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).

API endpoints

Add to api/router.go:

GET  /api/backup/status    → backup manager status + repo stats
POST /api/backup/run       → trigger immediate full backup (async)

POST /api/backup/run starts the backup in a background goroutine, returns immediately with {"ok": true, "message": "Mentés elindítva"}. The dashboard can poll /api/backup/status to track progress.


Docker-compose.yml final state

services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket — required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys — for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3

volumes:
  controller-data:

networks:
  traefik-public:
    external: true

Changes from current:

  1. Added: /sys:/host/sys:ro — for temperature reading
  2. Removed: dedicated restic-password bind mount (password now in controller-data volume)

Config changes summary

controller.yaml.example updates

monitoring:
  system_health_interval: "5m"    # NEW field

backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password"  # CHANGED default path

config.go updates

  • Add SystemHealthInterval string to MonitoringConfig
  • Default: "5m" in applyDefaults()
  • Change restic_password_file default from /opt/docker/felhom-controller/restic-password to /opt/docker/felhom-controller/data/restic-password
  • Add env override: FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL

Implementation Order

Step 1: Scheduler

  1. Create internal/scheduler/scheduler.go
  2. Implement Every() and Daily() with logging, panic recovery, skip-if-running
  3. Migrate the two existing goroutines from main.go to scheduler
  4. Build and verify — behavior should be identical, logs should show [SCHED] entries

Step 2: CPU & Temperature metrics

  1. Create internal/system/cpu_linux.go + cpu_other.go (build tags)
  2. Add readLoadAvg() and readTemperature() to info_linux.go
  3. Extend SystemInfo struct in info.go
  4. Update GetInfo() signature in all files to accept *CPUCollector
  5. Start CPUCollector in main.go, pass to web server and API router
  6. Update docker-compose.yml — add /sys:/host/sys:ro
  7. Update dashboard.html — show CPU, load, temperature
  8. Update style.css if needed for new display elements
  9. Build, deploy, verify — new metrics visible on dashboard

Step 3: Healthchecks pinger + health checks

  1. Create internal/monitor/pinger.go
  2. Create internal/monitor/healthcheck.go
  3. Add system_health_interval to config
  4. Add system health ping job to scheduler in main.go
  5. Build, deploy — check controller logs for health check runs

Step 4: Database dump engine

  1. Create internal/backup/dbdump.go
  2. Implement discovery + dump functions
  3. Wire up RunDBDumps temporarily to a test endpoint or manual scheduler trigger for testing
  4. Build, deploy, verify — dumps should appear in /srv/backups/db-dumps/ for paperless-ngx-postgres

Step 5: Restic integration

  1. Create internal/backup/restic.go
  2. Implement init, snapshot, prune, check, stats
  3. Auto-generate restic password if missing
  4. Update docker-compose.yml (remove restic-password bind mount)
  5. Build, deploy, verify — repo initialized, password generated

Step 6: Backup orchestrator + dashboard

  1. Create internal/backup/backup.go
  2. Wire up scheduler daily jobs (DB dump + backup)
  3. Add API endpoints (/api/backup/status, /api/backup/run)
  4. Add backup status section to dashboard.html
  5. Add "Mentés most" button
  6. Build, deploy, verify full flow

Step 7: Documentation & cleanup

  1. Update README.md — Phase 2 and 3 checked off, new module descriptions
  2. Update CONTEXT.md with session summary
  3. Update CLAUDE.md if workflow changes
  4. Version bump in build: v0.4.0

Verification Checklist

After deployment, verify each item:

  • docker ps shows controller healthy
  • Dashboard loads with CPU %, load average, temperature displayed
  • Temperature shows realistic value (30-60°C idle for N100)
  • CPU % updates (not stuck at 0)
  • /api/system/info returns all new fields (cpu_percent, load_avg_*, temperature_*)
  • Scheduler logs show [SCHED] entries for all registered jobs
  • If HC UUIDs configured: pings visible in status.felhom.eu dashboard
  • DB dump discovers paperless-ngx postgres container
  • Dump file exists: /srv/backups/db-dumps/paperless-ngx-postgres.sql
  • Restic repo initialized: /srv/backups/restic-repo/config exists
  • Restic password auto-generated: /opt/docker/felhom-controller/data/restic-password exists
  • "Mentés most" button triggers backup successfully
  • Dashboard shows backup status section with last backup time
  • All existing features still work (start/stop/deploy/update/logs/auth)

New files to create

internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go

Existing files to modify

internal/system/info.go              — new SystemInfo fields
internal/system/info_linux.go        — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go        — GetInfo() signature update
internal/config/config.go            — SystemHealthInterval, updated defaults
internal/api/router.go               — backup endpoints, cpuCollector parameter
internal/web/server.go               — accept cpuCollector, backupMgr
internal/web/handlers.go             — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css     — styles for new elements
cmd/controller/main.go               — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml        — /sys mount, remove restic-password mount
configs/controller.yaml.example      — new fields, updated defaults

Manual steps after deployment (for Viktor)

  1. Verify /sys mount: docker exec felhom-controller ls /host/sys/class/thermal/ — should show thermal_zone directories
  2. Healthchecks setup: Create project + 3 checks in status.felhom.eu for demo-felhom:
    • system-health (period: 10m, grace: 10m)
    • db-dump (period: 24h, grace: 1h)
    • backup (period: 24h, grace: 1h)
  3. Update controller.yaml: Add the three ping UUIDs
  4. Verify restic password: docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password
  5. Test restore procedure (start by confirming the repository opens and lists its snapshots):
    docker exec felhom-controller restic -r /srv/backups/restic-repo \
      --password-file /opt/docker/felhom-controller/data/restic-password snapshots
    
  6. Save restic password externally — losing it means losing access to all backups