admin 8a988c5998 TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups

2026-02-15 11:01:30 +01:00

TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups

Version bump target: v0.4.0

Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)

Overview

Implement two major features in felhom-controller:

Phase 2 — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
Phase 3 — Database dump engine, restic backup snapshots, dashboard status display

Both phases share the scheduler infrastructure. Implement in order.

Phase 2A — Scheduler (`internal/scheduler/`)

Why first

main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min). Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups. All need a centralized, logged, observable job runner. Build it once, use everywhere.

Design: `internal/scheduler/scheduler.go`

package scheduler

import (
    "context"
    "log"
    "sync"
    "time"
)

// JobFunc is the unit of work the scheduler runs.
type JobFunc func(ctx context.Context) error

type Job struct {
    Name     string
    Fn       JobFunc
    Interval time.Duration    // for periodic jobs (every N)
    Schedule string           // for daily jobs ("02:30", "03:00"); mutually exclusive with Interval
    LastRun  time.Time
    LastErr  error
    Running  bool
}

type Scheduler struct {
    mu     sync.Mutex         // guards LastRun, LastErr, Running on every job
    jobs   []*Job
    logger *log.Logger
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc)  // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job  // for dashboard/API display (copy, not pointer)

Interval jobs (`Every`)

Spawns a goroutine with time.Ticker
Logs [SCHED] Running job: <n> at start, [SCHED] Job <n> completed (took Xs) or [SCHED] Job <n> failed: <err> (took Xs) at end
Updates LastRun, LastErr, Running fields (mutex-protected)
Respects ctx.Done() for shutdown
Quiet mode for high-frequency jobs: Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.

Daily jobs (`Daily`)

Parses timeStr as "HH:MM" in Europe/Budapest timezone
On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
After each run, sleeps until the next day's scheduled time
Uses time.After() or time.Timer, NOT time.Ticker (handles DST transitions correctly)
Same logging pattern as interval jobs
Logs [SCHED] Daily job <n> scheduled for <next_time> on registration
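The "next occurrence" calculation above can be sketched as follows. This is a minimal illustration, not the final scheduler code: `nextDailyRun` and `describeNext` are hypothetical names, and the blank import of `time/tzdata` is an assumption that the container image may not ship a zoneinfo database.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // assumption: embed zone data in case the image has no zoneinfo
)

// nextDailyRun returns the first occurrence of hhmm ("HH:MM") in loc that is
// strictly after now. It uses calendar arithmetic (day+1) instead of adding
// 24h, so the wall-clock time stays fixed across DST changes.
func nextDailyRun(now time.Time, hhmm string, loc *time.Location) (time.Time, error) {
	t, err := time.Parse("15:04", hhmm)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid daily schedule %q: %w", hhmm, err)
	}
	local := now.In(loc)
	next := time.Date(local.Year(), local.Month(), local.Day(), t.Hour(), t.Minute(), 0, 0, loc)
	if !next.After(local) {
		// Already passed today: same wall-clock time tomorrow.
		next = time.Date(local.Year(), local.Month(), local.Day()+1, t.Hour(), t.Minute(), 0, 0, loc)
	}
	return next, nil
}

// describeNext is a demo helper (hypothetical): it parses an RFC 3339 "now",
// resolves the schedule in Europe/Budapest and formats the result.
func describeNext(nowRFC3339, hhmm string) string {
	loc, err := time.LoadLocation("Europe/Budapest")
	if err != nil {
		return "error: " + err.Error()
	}
	now, err := time.Parse(time.RFC3339, nowRFC3339)
	if err != nil {
		return "error: " + err.Error()
	}
	next, err := nextDailyRun(now, hhmm, loc)
	if err != nil {
		return "error: " + err.Error()
	}
	return next.Format("2006-01-02 15:04 MST")
}

func main() {
	fmt.Println(describeNext("2026-02-15T01:00:00+01:00", "02:30")) // later today
	fmt.Println(describeNext("2026-02-15T03:00:00+01:00", "02:30")) // tomorrow
	fmt.Println(describeNext("2026-02-15T03:00:00+01:00", "25:99")) // rejected at registration
}
```

In the job loop, the result would feed a one-shot timer such as `time.NewTimer(time.Until(next))`, recomputed after every run.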

Edge cases

If Daily timeStr is invalid → log error at registration, don't start the job
If a job panics → recover, log [ERROR] Job <n> panicked: <err>, mark as failed
If a job is already running when the next tick fires → skip, log [WARN] Job <n> still running, skipping
Graceful shutdown: Stop() cancels context, wg.Wait() with 30s timeout for running jobs to finish
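The panic-recovery and skip-if-running rules could look roughly like this. It is a sketch under assumptions: `guard`, `safeRun`, `tick` and `errSkipped` are illustrative names, and the job function takes no context here to keep the example short (the real `JobFunc` receives one).

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errSkipped is returned when the previous run has not finished yet.
var errSkipped = errors.New("still running, skipped")

// guard tracks whether a job is currently executing.
type guard struct {
	mu      sync.Mutex
	running bool
}

// tryStart marks the job as running and reports false if it already was.
func (g *guard) tryStart() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.running {
		return false
	}
	g.running = true
	return true
}

// finish clears the running flag.
func (g *guard) finish() {
	g.mu.Lock()
	g.running = false
	g.mu.Unlock()
}

// safeRun calls fn and converts a panic into an ordinary error, so one bad
// job cannot take down the scheduler goroutine.
func safeRun(name string, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", name, r)
		}
	}()
	return fn()
}

// tick is what a ticker or timer callback would invoke for each firing.
func tick(name string, g *guard, fn func() error) error {
	if !g.tryStart() {
		log.Printf("[WARN] Job %s still running, skipping", name)
		return errSkipped
	}
	defer g.finish()
	return safeRun(name, fn)
}

func main() {
	g := &guard{}
	fmt.Println(tick("ok", g, func() error { return nil }))
	fmt.Println(tick("boom", g, func() error { panic("disk on fire") }))

	g.tryStart() // simulate a run that is still in progress
	fmt.Println(tick("ok", g, func() error { return nil }))
}
```

Keeping the guard separate from the recovery wrapper means the running flag is always cleared, even when the job panics.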

Integration in main.go

Replace the two ad-hoc goroutines with:

sched := scheduler.New(logger)

// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
    return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
    return stackMgr.ScanStacks()
})

// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)

sched.Start(ctx)
defer sched.Stop()

Delete the two existing goroutines in main.go after migrating to the scheduler.

Phase 2B — CPU & Temperature Metrics (`internal/system/`)

Current state

info.go defines SystemInfo struct with memory + disk fields. info_linux.go reads /proc/meminfo and syscall.Statfs. info_other.go provides stubs for non-Linux.

New fields in `SystemInfo`

// Add to SystemInfo struct in info.go:
CPUPercent          float64 `json:"cpu_percent"`           // 0-100, averaged across all cores
LoadAvg1            float64 `json:"load_avg_1"`            // 1-minute load average
LoadAvg5            float64 `json:"load_avg_5"`            // 5-minute load average
LoadAvg15           float64 `json:"load_avg_15"`           // 15-minute load average
TemperatureCelsius  float64 `json:"temperature_celsius"`   // CPU/SoC temperature
TemperatureSource   string  `json:"temperature_source"`    // e.g. "thermal_zone0", "x86_pkg_temp"

CPU measurement approach

Do NOT block GetInfo() with a delta calculation.

Use a lightweight CPUCollector that runs in a background goroutine:

// internal/system/cpu_linux.go (build tag: linux)

type CPUCollector struct {
    mu          sync.RWMutex
    cpuPercent  float64
    sampleRate  time.Duration  // default: 5 seconds
    cancel      context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64  // returns latest sample

How it works:

Reads /proc/stat first line: cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>
Sleeps sampleRate (5s)
Reads again, computes delta: busy = delta(user+nice+system+irq+softirq+steal), total = busy + delta(idle+iowait)
cpuPercent = (busy / total) * 100
Stores result, loops

Parsing /proc/stat:

cpu  1234 56 789 45678 123 45 67 0 0 0

Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Sum all = total. idle + iowait = idle_total. busy = total - idle_total.
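A small sketch of the parsing and delta arithmetic described above. `cpuSample`, `parseCPULine` and `cpuPercent` are illustrative names; the sample lines in `main` are made up for the demonstration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds the cumulative busy and total jiffies from one read.
type cpuSample struct {
	busy, total uint64
}

// parseCPULine parses the aggregate "cpu ..." line of /proc/stat.
// Only the first eight counters (user..steal) are used.
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 9 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var vals [8]uint64
	var total uint64
	for i := range vals {
		v, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i+1, err)
		}
		vals[i] = v
		total += v
	}
	idleTotal := vals[3] + vals[4] // idle + iowait
	return cpuSample{busy: total - idleTotal, total: total}, nil
}

// cpuPercent returns the busy share between two samples, in the range 0-100.
// It returns 0 if the counters did not advance (or went backwards).
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0
	}
	dBusy := cur.busy - prev.busy
	dTotal := cur.total - prev.total
	return 100 * float64(dBusy) / float64(dTotal)
}

func main() {
	prev, _ := parseCPULine("cpu  1234 56 789 45678 123 45 67 0 0 0")
	cur, _ := parseCPULine("cpu  1334 56 809 46058 123 45 67 0 0 0")
	fmt.Printf("busy=%d total=%d\n", prev.busy, prev.total)
	fmt.Printf("cpu: %.1f%%\n", cpuPercent(prev, cur))
}
```

The collector goroutine would keep the previous sample, sleep for `sampleRate`, take a new one, and store the result under its mutex.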

IMPORTANT: /proc/stat is not namespaced, so inside a Docker container it reflects the HOST CPU, even when cgroup CPU limits are applied (unless a tool such as lxcfs overlays /proc). So the controller's own /proc/stat works.

Load average

Read from /proc/loadavg (instant, no delta needed):

0.15 0.10 0.05 1/234 56789

First three fields are 1/5/15 minute load averages. Parse with fmt.Sscanf.

Add readLoadAvg(info *SystemInfo) in info_linux.go.
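For illustration, the `fmt.Sscanf` parse might look like this. `parseLoadAvg` is a hypothetical helper, and this `readLoadAvg` takes a path and returns values so the example stays self-contained, which differs from the `readLoadAvg(info *SystemInfo)` signature above.

```go
package main

import (
	"fmt"
	"os"
)

// parseLoadAvg extracts the 1/5/15 minute averages from the contents of
// /proc/loadavg. Trailing fields (run queue, last PID) are ignored.
func parseLoadAvg(s string) (l1, l5, l15 float64, err error) {
	_, err = fmt.Sscanf(s, "%f %f %f", &l1, &l5, &l15)
	return l1, l5, l15, err
}

// readLoadAvg reads and parses the given file (normally /proc/loadavg).
func readLoadAvg(path string) (l1, l5, l15 float64, err error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, 0, 0, err
	}
	return parseLoadAvg(string(raw))
}

func main() {
	l1, l5, l15, err := parseLoadAvg("0.15 0.10 0.05 1/234 56789\n")
	fmt.Println(l1, l5, l15, err)

	// On non-Linux systems this simply reports the read error.
	if _, _, _, err := readLoadAvg("/proc/loadavg"); err != nil {
		fmt.Println("loadavg unavailable:", err)
	}
}
```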

Temperature

Read from /sys/class/thermal/thermal_zone*/temp:

IMPORTANT: The controller runs in a Docker container. The container's own /sys is a restricted view (read-only, with some paths masked), so do not rely on it for host sensors. We mount the host's /sys at /host/sys inside the container (see docker-compose.yml changes below).

// internal/system/info_linux.go — add readTemperature(info *SystemInfo)

Algorithm:

Try /host/sys/class/thermal/thermal_zone*/temp first (Docker mount)
Fallback to /sys/class/thermal/thermal_zone*/temp (native/development)
For each zone, also read the type file for the label
Pick the highest temperature (usually thermal_zone0 or x86_pkg_temp)
Value is in millidegrees Celsius → divide by 1000.0
Store the zone type as TemperatureSource
If no thermal zones found: try /host/sys/class/hwmon/hwmon*/temp1_input as fallback (same millidegree format)
If nothing found: leave fields as zero (dashboard hides temperature when 0)
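The lookup order above could be sketched like this. The helper names (`reading`, `collect`, `hottest`) are illustrative, the function returns a value instead of filling `SystemInfo`, and using the hwmon `name` file as the source label is an assumption, since the spec only names the `type` file for thermal zones.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// reading is one sensor value in degrees Celsius plus a label for its source.
type reading struct {
	celsius float64
	source  string
}

// collect reads every file matching pattern as millidegrees Celsius.
// The label comes from labelFile next to the sensor, or else the directory name.
func collect(pattern, labelFile string) []reading {
	paths, _ := filepath.Glob(pattern)
	var out []reading
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		milli, err := strconv.Atoi(strings.TrimSpace(string(raw)))
		if err != nil {
			continue
		}
		dir := filepath.Dir(p)
		label := filepath.Base(dir)
		if b, err := os.ReadFile(filepath.Join(dir, labelFile)); err == nil {
			if s := strings.TrimSpace(string(b)); s != "" {
				label = s
			}
		}
		out = append(out, reading{celsius: float64(milli) / 1000.0, source: label})
	}
	return out
}

// hottest picks the highest reading; ok is false when there are none.
func hottest(rs []reading) (best reading, ok bool) {
	if len(rs) == 0 {
		return reading{}, false
	}
	best = rs[0]
	for _, r := range rs[1:] {
		if r.celsius > best.celsius {
			best = r
		}
	}
	return best, true
}

// readTemperature tries the Docker mount, then native /sys, then hwmon.
func readTemperature() (reading, bool) {
	for _, root := range []string{"/host/sys", "/sys"} {
		if r, ok := hottest(collect(root+"/class/thermal/thermal_zone*/temp", "type")); ok {
			return r, true
		}
	}
	return hottest(collect("/host/sys/class/hwmon/hwmon*/temp1_input", "name"))
}

func main() {
	if r, ok := readTemperature(); ok {
		fmt.Printf("%.1f°C from %s\n", r.celsius, r.source)
	} else {
		fmt.Println("no temperature sensor found; fields stay zero")
	}
}
```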

`GetInfo()` signature change

// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo

Inside GetInfo():

Existing: readMemInfo(&info), readDiskUsage(...) — unchanged
New: readLoadAvg(&info)
New: readTemperature(&info)
New: if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }

The info_other.go stub accepts the parameter but ignores it (returns empty SystemInfo as before).

CPU collector lifecycle

Started in main.go:

cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()

Passed to web.NewServer() and api.NewRouter() which pass it to system.GetInfo() calls.

Dashboard display

Extend the existing system info bar in dashboard.html:

Current layout:

| Memória | ████████░░ 72% |  SSD | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |

New layout:

| Memória | ████████░░ 72% |  CPU | ██░░░░░░░░ 15% |  Hőmérséklet | 52°C |
| SSD     | ██████░░░░ 55% |  HDD | ████░░░░░░ 38% |

Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.

Temperature display:

Show as text "52°C" with colored dot (green/yellow/red)
Green: < 60°C
Yellow: 60-75°C
Red: > 75°C
If temperature is 0 (unavailable): hide entirely

CPU progress bar:

Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%

Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"

Phase 2C — Healthchecks.io Ping Integration (`internal/monitor/`)

Design: `internal/monitor/pinger.go`

package monitor

type Pinger struct {
    baseURL    string          // e.g. "https://status.felhom.eu"
    httpClient *http.Client
    logger     *log.Logger
    enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error

HTTP protocol

Success: POST {baseURL}/ping/{uuid} with body as request body
Failure: POST {baseURL}/ping/{uuid}/fail with body
Start: POST {baseURL}/ping/{uuid}/start
Timeout: 10 seconds
Retry: 3 attempts with 2s backoff between retries
If uuid is empty or starts with "CHANGEME" → skip silently (log at debug level only)
If enabled is false → skip all pings
Never let ping failures affect the main operation — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.
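A rough sketch of the request and retry loop described above. The `pinger` type, its field names and `pingURL` are illustrative rather than the final `Pinger` API, and `main` talks to a local `httptest` server (an assumption for the demo) instead of the real Healthchecks instance.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

type pinger struct {
	baseURL  string
	client   *http.Client
	attempts int           // total tries, e.g. 3
	backoff  time.Duration // pause between tries, e.g. 2s
}

// pingURL builds {baseURL}/ping/{uuid}{suffix}; suffix is "", "/fail" or "/start".
func pingURL(baseURL, uuid, suffix string) string {
	return strings.TrimRight(baseURL, "/") + "/ping/" + uuid + suffix
}

// send POSTs body to the check. Unconfigured UUIDs are skipped silently.
func (p *pinger) send(uuid, suffix, body string) error {
	if uuid == "" || strings.HasPrefix(uuid, "CHANGEME") {
		return nil
	}
	url := pingURL(p.baseURL, uuid, suffix)
	var lastErr error
	for attempt := 1; attempt <= p.attempts; attempt++ {
		resp, err := p.client.Post(url, "text/plain; charset=utf-8", strings.NewReader(body))
		if err == nil {
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
			if resp.StatusCode >= 200 && resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		if attempt < p.attempts {
			time.Sleep(p.backoff)
		}
	}
	return fmt.Errorf("ping %s failed after %d attempts: %w", url, p.attempts, lastErr)
}

func main() {
	hits := 0
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits++
		if hits == 1 { // first attempt fails, the retry succeeds
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		fmt.Println("server got", r.Method, r.URL.Path)
	}))
	defer srv.Close()

	p := &pinger{
		baseURL:  srv.URL,
		client:   &http.Client{Timeout: 10 * time.Second},
		attempts: 3,
		backoff:  10 * time.Millisecond, // shortened for the demo
	}
	fmt.Println("result:", p.send("1234-abcd", "/fail", "disk at 95%"))
}
```

The caller logs a non-nil result as a warning and still returns nil from the scheduler job, as required above.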

Design: `internal/monitor/healthcheck.go`

// RunHealthCheck runs system checks and returns a diagnostic report.
type HealthReport struct {
    Status    string   // "ok", "warn", "fail"
    Issues    []string // critical problems
    Warnings  []string // non-critical warnings
    Info      []string // informational items
    Timestamp time.Time
}

func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string  // human-readable summary for HC ping body

Checks to run (replicating backup-healthcheck.sh logic in Go):

Disk usage: Read from system.GetInfo(). Compare against thresholds (disk_warn_percent, disk_crit_percent).
Memory usage: Same source. Warn if above memory_warn_percent.
CPU usage: From collector. Warn if above cpu_warn_percent.
Temperature: From system.GetInfo(). Warn if above temperature_warn_celsius.
Docker health: Verify Docker daemon is reachable by running docker info (quick exec check).
Protected containers: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.

Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".
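The status roll-up can be expressed in a few lines. This is only a sketch: `healthReport`, `check` and `finalize` are illustrative names, and treating a zero threshold as "not configured" is an assumption.

```go
package main

import "fmt"

type healthReport struct {
	Status   string
	Issues   []string
	Warnings []string
}

// check records an issue when value exceeds crit, else a warning when it
// exceeds warn. A threshold of 0 is treated as "not configured".
func (r *healthReport) check(label string, value, warn, crit float64, unit string) {
	switch {
	case crit > 0 && value > crit:
		r.Issues = append(r.Issues, fmt.Sprintf("%s critical: %.0f%s (limit %.0f%s)", label, value, unit, crit, unit))
	case warn > 0 && value > warn:
		r.Warnings = append(r.Warnings, fmt.Sprintf("%s high: %.0f%s (limit %.0f%s)", label, value, unit, warn, unit))
	}
}

// finalize applies the rule: any issue is "fail", only warnings is "warn".
func (r *healthReport) finalize() string {
	switch {
	case len(r.Issues) > 0:
		r.Status = "fail"
	case len(r.Warnings) > 0:
		r.Status = "warn"
	default:
		r.Status = "ok"
	}
	return r.Status
}

func main() {
	r := &healthReport{}
	r.check("disk", 92, 80, 90, "%")
	r.check("temperature", 71, 70, 0, "°C")
	fmt.Println(r.finalize(), r.Issues, r.Warnings)
}
```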

Scheduler integration

// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth

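// Parse system_health_interval (default "5m"); fall back to 5m if invalid,
// because a zero interval would make time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
    logger.Printf("[WARN] invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
    healthInterval = 5 * time.Minute
}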

sched.Every("system-health", healthInterval, func(ctx context.Context) error {
    report := monitor.RunHealthCheck(cfg, cpuCollector)
    body := report.FormatMessage()

    if report.Status == "fail" {
        pinger.Fail(healthUUID, body)
    } else {
        pinger.Ping(healthUUID, body)
    }
    return nil  // never fail the scheduler job due to ping errors
})

Config changes

Add to MonitoringConfig:

SystemHealthInterval string `yaml:"system_health_interval"`

Default in applyDefaults(): "5m"

Phase 3A — Database Dump Engine (`internal/backup/dbdump.go`)

Approach: Auto-discover from running Docker containers

Replicates the proven logic from backup-db-dump.sh in Go:

package backup

type DBType string
const (
    DBTypePostgres DBType = "postgres"
    DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
    ContainerName string
    ContainerID   string
    DBType        DBType
    DBUser        string
    DBName        string
    StackName     string  // derived from container name
}

type DumpResult struct {
    DB       DiscoveredDB
    FilePath string
    Size     int64
    Duration time.Duration
    Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult

Discovery logic

Run docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running.

For each running container, check image name:

Contains postgres → DBTypePostgres
Contains mariadb or mysql → DBTypeMariaDB

Then for each DB container, get env vars via: docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'

Parse env vars:

PostgreSQL: POSTGRES_USER (default: "postgres"), POSTGRES_DB (default: same as POSTGRES_USER)
MariaDB: MYSQL_ROOT_PASSWORD (or MARIADB_ROOT_PASSWORD), MYSQL_DATABASE (or MARIADB_DATABASE)

Derive stack name from container name by stripping common DB suffixes:

paperless-ngx-postgres → paperless-ngx
romm-db → romm
immich-postgres → immich
Logic: split on -, check if last segment is a known suffix (postgres, db, mariadb, mysql, database, redis, cache), if so remove it
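The image matching, env parsing and suffix stripping above might be sketched as follows. All function names are illustrative; DB types are plain strings here instead of the `DBType` constants to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// dbSuffixes are the trailing name segments treated as "this is the DB container".
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// deriveStackName strips one known DB suffix from the container name.
func deriveStackName(container string) string {
	i := strings.LastIndex(container, "-")
	if i <= 0 {
		return container
	}
	if dbSuffixes[container[i+1:]] {
		return container[:i]
	}
	return container
}

// detectDBType maps an image name to "postgres" or "mariadb".
func detectDBType(image string) (string, bool) {
	img := strings.ToLower(image)
	switch {
	case strings.Contains(img, "postgres"):
		return "postgres", true
	case strings.Contains(img, "mariadb"), strings.Contains(img, "mysql"):
		return "mariadb", true
	}
	return "", false
}

// parseEnv turns the KEY=VALUE lines printed by docker inspect into a map.
func parseEnv(lines string) map[string]string {
	env := make(map[string]string)
	for _, line := range strings.Split(lines, "\n") {
		if k, v, ok := strings.Cut(strings.TrimSpace(line), "="); ok {
			env[k] = v
		}
	}
	return env
}

// postgresCreds applies the documented defaults for user and database.
func postgresCreds(env map[string]string) (user, db string) {
	user = env["POSTGRES_USER"]
	if user == "" {
		user = "postgres"
	}
	db = env["POSTGRES_DB"]
	if db == "" {
		db = user
	}
	return user, db
}

func main() {
	for _, name := range []string{"paperless-ngx-postgres", "romm-db", "immich-postgres", "traefik"} {
		fmt.Printf("%s -> %s\n", name, deriveStackName(name))
	}
	t, _ := detectDBType("postgres:16-alpine")
	user, db := postgresCreds(parseEnv("POSTGRES_USER=paperless\nPATH=/usr/bin\n"))
	fmt.Println(t, user, db)
}
```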

Dump execution

PostgreSQL:

docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges

MariaDB:

docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>

IMPORTANT: Use docker exec to run dump commands INSIDE the DB container. Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.

Output handling:

Use exec.CommandContext(ctx, "docker", "exec", ...) from os/exec (not plain exec.Command, so the per-dump timeout actually kills the process) with cmd.Stdout piped to a temp file
Write to {dumpDir}/{stackName}-{dbtype}.sql.tmp during dump
Rename .tmp → .sql on success only
Delete .tmp on failure
Set 5-minute timeout per dump via context.WithTimeout
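The temp-file handling could be structured like this. It is a sketch under assumptions: `writeDump`, `producer`, `dockerExecDump` and `demo` are hypothetical names, and the demo uses fake producers so it runs without Docker.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"time"
)

// producer writes one dump to w and should stop when ctx is cancelled.
type producer func(ctx context.Context, w io.Writer) error

// writeDump streams into dest+".tmp" and renames it to dest only when the
// producer succeeded and wrote at least one byte. The temp file is removed
// on any failure.
func writeDump(ctx context.Context, dest string, produce producer) (int64, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	tmp := dest + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return 0, err
	}
	err = produce(ctx, f)
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	var size int64
	if err == nil {
		var st os.FileInfo
		if st, err = os.Stat(tmp); err == nil {
			if size = st.Size(); size == 0 {
				err = errors.New("dump is empty")
			}
		}
	}
	if err == nil {
		err = os.Rename(tmp, dest)
	}
	if err != nil {
		os.Remove(tmp)
		return 0, err
	}
	return size, nil
}

// dockerExecDump runs a dump command inside the DB container via docker exec.
func dockerExecDump(container string, args ...string) producer {
	return func(ctx context.Context, w io.Writer) error {
		cmd := exec.CommandContext(ctx, "docker", append([]string{"exec", container}, args...)...)
		var stderr bytes.Buffer
		cmd.Stdout, cmd.Stderr = w, &stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("%w: %s", err, strings.TrimSpace(stderr.String()))
		}
		return nil
	}
}

// demo exercises the failure and success paths in a throwaway directory.
func demo() error {
	dir, err := os.MkdirTemp("", "dbdump-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	dest := filepath.Join(dir, "paperless-ngx-postgres.sql")

	_, err = writeDump(context.Background(), dest, func(ctx context.Context, w io.Writer) error {
		io.WriteString(w, "partial output")
		return errors.New("pg_dump exited with status 1")
	})
	if left, _ := filepath.Glob(filepath.Join(dir, "*")); err == nil || len(left) != 0 {
		return fmt.Errorf("failed dump left files behind: %v", left)
	}
	size, err := writeDump(context.Background(), dest, func(ctx context.Context, w io.Writer) error {
		_, werr := io.WriteString(w, "CREATE TABLE t();\n")
		return werr
	})
	if err != nil || size == 0 {
		return fmt.Errorf("successful dump not stored: size=%d err=%v", size, err)
	}
	_, err = os.Stat(dest)
	return err
}

func main() {
	fmt.Println("demo result:", demo())
	_ = dockerExecDump("paperless-ngx-postgres", "pg_dump", "-U", "paperless", "-d", "paperless")
}
```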

Gotchas and edge cases

MariaDB password from container env: Never log the password. Use docker inspect to read MYSQL_ROOT_PASSWORD or MARIADB_ROOT_PASSWORD.
Empty/zero-size dumps: Check dump file size after writing. If 0 bytes → treat as failure.
Dump file naming: {stackName}-{dbtype}.sql (e.g., paperless-ngx-postgres.sql). Overwrite previous dump each run (restic handles versioning).
Old tmp cleanup: Delete .tmp files older than 1 hour on each run (leftover from crashed dumps).
Skip infrastructure DBs: Don't dump databases from protected stacks (if any have DBs in the future).
Container not running: If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).

Dump directory

/srv/backups/db-dumps/ — configured in controller.yaml as paths.db_dump_dir. Already mounted in docker-compose.yml via /srv/backups:/srv/backups.

The user does NOT see this directory (not in FileBrowser, not on HDD).

Phase 3B — Restic Integration (`internal/backup/restic.go`)

Design

type ResticManager struct {
    repoPath     string
    passwordFile string
    logger       *log.Logger
    customerID   string
    cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager

func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
    SnapshotID   string
    FilesNew     int
    FilesChanged int
    DataAdded    string        // human-readable
    Duration     time.Duration
}

type SnapshotInfo struct {
    ID       string
    Time     time.Time
    Paths    []string
    Tags     []string
}

type RepoStats struct {
    TotalSize      string
    SnapshotCount  int
    LatestSnapshot *SnapshotInfo
}

Restic commands (all via `os/exec`)

All commands set these env vars:

cmd.Env = append(os.Environ(),
    "RESTIC_REPOSITORY="+r.repoPath,
    "RESTIC_PASSWORD_FILE="+r.passwordFile,
    "RESTIC_CACHE_DIR="+r.cacheDir,
)

RESTIC_CACHE_DIR must be set to /opt/docker/felhom-controller/data/restic-cache (inside the controller-data Docker volume). Without this, restic defaults to ~/.cache/restic which may not persist across container restarts.

Init (idempotent):

Check if {repoPath}/config file exists → if so, already initialized, skip
Otherwise: restic init

Snapshot:

restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
    --tag felhom --tag <customerID> --host <customerID> --json

What gets backed up (v1):

/opt/docker/stacks/ — compose files, .felhom.yml, app.yaml (deploy configs with secrets)
/srv/backups/db-dumps/ — SQL dumps (from the DB dump step)
/opt/docker/felhom-controller/controller.yaml — controller config

NOT backed up in v1:

HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
Docker volumes directly — critical data covered by DB dumps

Parse snapshot output (restic backup with --json writes JSON lines to stdout; the last line is the summary):

{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
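Extracting the summary from restic's JSON lines on stdout might look like the sketch below. `backupSummary` and `parseSummary` are illustrative names, and only the fields shown in the example line above are decoded.

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// backupSummary mirrors the fields of the "summary" message used here.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    uint64 `json:"data_added"` // bytes
	SnapshotID   string `json:"snapshot_id"`
}

// parseSummary scans JSON lines and returns the last one whose message_type
// is "summary". Status lines and non-JSON lines are skipped.
func parseSummary(stdout string) (*backupSummary, error) {
	sc := bufio.NewScanner(strings.NewReader(stdout))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long lines
	var found *backupSummary
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var s backupSummary
		if err := json.Unmarshal([]byte(line), &s); err != nil {
			continue
		}
		if s.MessageType == "summary" {
			found = &s
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if found == nil {
		return nil, errors.New("no summary line in restic output")
	}
	return found, nil
}

func main() {
	out := `{"message_type":"status","percent_done":0.5}
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,"snapshot_id":"abc123"}
`
	s, err := parseSummary(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("snapshot %s: %d new, %d changed, %d bytes added\n",
		s.SnapshotID, s.FilesNew, s.FilesChanged, s.DataAdded)
}
```

`DataAdded` arrives as a byte count, so the human-readable string in `SnapshotResult` would be formatted from it afterwards.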

Prune:

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

Check:

restic check

Latest snapshot:

restic snapshots --latest 1 --json

Returns JSON array with snapshot objects.

Stats (repo size):

restic stats --json

Password auto-generation

On startup, EnsureInitialized() checks if the password file exists. If not:

Generate 32 random bytes, base64url-encode
Write to r.passwordFile (the controller-data volume path)
Log [INFO] Generated new restic repository password at <path>
Log [WARN] Save this password externally — losing it means losing access to ALL backups
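One way to sketch the password generation step. `generatePassword` and `ensurePasswordFile` are hypothetical helpers; the unpadded base64url variant and the 0600 file mode are assumptions, since the spec only says "base64url-encode".

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// generatePassword returns 32 random bytes, base64url-encoded without padding.
func generatePassword() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// ensurePasswordFile creates the password file if it does not exist yet.
// An existing file is never overwritten. created reports whether a new
// password was written.
func ensurePasswordFile(path string) (created bool, err error) {
	if _, err := os.Stat(path); err == nil {
		return false, nil
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	pw, err := generatePassword()
	if err != nil {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return false, err
	}
	// O_EXCL guards against a second process creating the file in between.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return false, err
	}
	if _, err := f.WriteString(pw); err != nil {
		f.Close()
		os.Remove(path)
		return false, err
	}
	return true, f.Close()
}

func main() {
	dir, err := os.MkdirTemp("", "restic-pw-demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "data", "restic-password")
	first, err1 := ensurePasswordFile(path)
	second, err2 := ensurePasswordFile(path)
	fmt.Println(first, err1)  // first call creates the file
	fmt.Println(second, err2) // second call leaves it untouched
}
```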

Gotchas

restic is already in the Docker image (Dockerfile installs it). No additional setup.
Locking: Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error — add restic unlock to the error handling path with a log warning.
Timeout: 30-minute timeout for snapshot operations, enforced through a context deadline on the restic process (exec.CommandContext) so a hung run is killed.
Large repos: First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
restic JSON output: Use --json for machine-parseable output. With --json, restic writes JSON lines to stdout; for backup --json, the summary object with message_type: "summary" is the last JSON line on stdout. Parse that line from stdout and treat stderr as diagnostics only.

Phase 3C — Backup Orchestrator (`internal/backup/backup.go`)

Design

type Manager struct {
    cfg       *config.Config
    restic    *ResticManager
    logger    *log.Logger
    pinger    *monitor.Pinger

    mu         sync.Mutex
    lastDBDump  *DBDumpStatus
    lastBackup  *BackupStatus
}

type DBDumpStatus struct {
    LastRun   time.Time
    Results   []DumpResult
    Success   bool
    Duration  time.Duration
}

type BackupStatus struct {
    LastRun    time.Time
    Snapshot   *SnapshotResult
    Success    bool
    Duration   time.Duration
    RepoStats  *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error  // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)

Full backup flow (daily scheduled)

DB dumps: DiscoverDatabases() → DumpAll() → update lastDBDump status
Ping Healthchecks for DB dump result: pinger.Ping/Fail(dbDumpUUID, summary)
Restic snapshot: restic.EnsureInitialized() → restic.Snapshot(paths, tags)
Prune (weekly): Check day of week against prune_schedule config. If match → restic.Prune(retention) + restic.Check()
Ping Healthchecks for backup result: pinger.Ping/Fail(backupUUID, summary)
Update lastBackup status

Scheduler integration

// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)

if cfg.Backup.Enabled {
    sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
        return backupMgr.RunDBDumps(ctx)
    })

    sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
        return backupMgr.RunBackup(ctx)
    })
}

Dashboard display

Add "Biztonsági mentés" (Backup) section to dashboard.html:

╔══════════════════════════════════════════╗
║  🛡️ Biztonsági mentés                   ║
╠══════════════════════════════════════════╣
║  Utolsó mentés: 2026-02-15 03:01 ✅      ║
║  Adatbázisok: 2 mentve (12.3 MB)         ║
║  Tároló méret: 45.2 MB (23 pillanatkép)   ║
║  Következő: ma 03:00                      ║
║                                           ║
║  [Mentés most]                            ║
╚══════════════════════════════════════════╝

Hungarian labels:

"Biztonsági mentés" = Backup
"Utolsó mentés" = Last backup
"Adatbázisok" = Databases
"mentve" = backed up
"Tároló méret" = Repository size
"pillanatkép" = snapshot(s)
"Következő" = Next
"Mentés most" = Backup now

Status colors:

Green ✅: Last backup successful and less than backup_max_age_hours old
Yellow ⚠️: Last backup successful but older than expected
Red ❌: Last backup failed or no backups exist yet
Gray: Backup not configured (backup.enabled: false)

If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).

API endpoints

Add to api/router.go:

GET  /api/backup/status    → backup manager status + repo stats
POST /api/backup/run       → trigger immediate full backup (async)

POST /api/backup/run starts the backup in a background goroutine, returns immediately with {"ok": true, "message": "Mentés elindítva"}. The dashboard can poll /api/backup/status to track progress.

Docker-compose.yml final state

services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket — required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys — for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3

volumes:
  controller-data:

networks:
  traefik-public:
    external: true

Changes from current:

Added: /sys:/host/sys:ro — for temperature reading
Removed: dedicated restic-password bind mount (password now in controller-data volume)

Config changes summary

`controller.yaml.example` updates

monitoring:
  system_health_interval: "5m"    # NEW field

backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password"  # CHANGED default path

`config.go` updates

Add SystemHealthInterval string to MonitoringConfig
Default: "5m" in applyDefaults()
Change restic_password_file default from /opt/docker/felhom-controller/restic-password to /opt/docker/felhom-controller/data/restic-password
Add env override: FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL

Implementation Order

Step 1: Scheduler

Create internal/scheduler/scheduler.go
Implement Every() and Daily() with logging, panic recovery, skip-if-running
Migrate the two existing goroutines from main.go to scheduler
Build and verify — behavior should be identical, logs should show [SCHED] entries

Step 2: CPU & Temperature metrics

Create internal/system/cpu_linux.go + cpu_other.go (build tags)
Add readLoadAvg() and readTemperature() to info_linux.go
Extend SystemInfo struct in info.go
Update GetInfo() signature in all files to accept *CPUCollector
Start CPUCollector in main.go, pass to web server and API router
Update docker-compose.yml — add /sys:/host/sys:ro
Update dashboard.html — show CPU, load, temperature
Update style.css if needed for new display elements
Build, deploy, verify — new metrics visible on dashboard

Step 3: Healthchecks pinger + health checks

Create internal/monitor/pinger.go
Create internal/monitor/healthcheck.go
Add system_health_interval to config
Add system health ping job to scheduler in main.go
Build, deploy — check controller logs for health check runs

Step 4: Database dump engine

Create internal/backup/dbdump.go
Implement discovery + dump functions
Wire up RunDBDumps temporarily to a test endpoint or manual scheduler trigger for testing
Build, deploy, verify — dumps should appear in /srv/backups/db-dumps/ for paperless-ngx-postgres

Step 5: Restic integration

Create internal/backup/restic.go
Implement init, snapshot, prune, check, stats
Auto-generate restic password if missing
Update docker-compose.yml (remove restic-password bind mount)
Build, deploy, verify — repo initialized, password generated

Step 6: Backup orchestrator + dashboard

Create internal/backup/backup.go
Wire up scheduler daily jobs (DB dump + backup)
Add API endpoints (/api/backup/status, /api/backup/run)
Add backup status section to dashboard.html
Add "Mentés most" button
Build, deploy, verify full flow

Step 7: Documentation & cleanup

Update README.md — Phase 2 and 3 checked off, new module descriptions
Update CONTEXT.md with session summary
Update CLAUDE.md if workflow changes
Version bump in build: v0.4.0

Verification Checklist

After deployment, verify each item:

docker ps shows controller healthy
Dashboard loads with CPU %, load average, temperature displayed
Temperature shows realistic value (30-60°C idle for N100)
CPU % updates (not stuck at 0)
/api/system/info returns all new fields (cpu_percent, load_avg_*, temperature_*)
Scheduler logs show [SCHED] entries for all registered jobs
If HC UUIDs configured: pings visible in status.felhom.eu dashboard
DB dump discovers paperless-ngx postgres container
Dump file exists: /srv/backups/db-dumps/paperless-ngx-postgres.sql
Restic repo initialized: /srv/backups/restic-repo/config exists
Restic password auto-generated: /opt/docker/felhom-controller/data/restic-password exists
"Mentés most" button triggers backup successfully
Dashboard shows backup status section with last backup time
All existing features still work (start/stop/deploy/update/logs/auth)

New files to create

internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go

Existing files to modify

internal/system/info.go              — new SystemInfo fields
internal/system/info_linux.go        — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go        — GetInfo() signature update
internal/config/config.go            — SystemHealthInterval, updated defaults
internal/api/router.go               — backup endpoints, cpuCollector parameter
internal/web/server.go               — accept cpuCollector, backupMgr
internal/web/handlers.go             — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css     — styles for new elements
cmd/controller/main.go               — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml        — /sys mount, remove restic-password mount
configs/controller.yaml.example      — new fields, updated defaults

Manual steps after deployment (for Viktor)

Verify /sys mount: docker exec felhom-controller ls /host/sys/class/thermal/ — should show thermal_zone directories
Healthchecks setup: Create project + 3 checks in status.felhom.eu for demo-felhom:
- system-health (period: 10m, grace: 10m)
- db-dump (period: 24h, grace: 1h)
- backup (period: 24h, grace: 1h)
Update controller.yaml: Add the three ping UUIDs
Verify restic password: docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password

Test restore procedure (start by confirming the repo opens and lists snapshots):

docker exec felhom-controller restic -r /srv/backups/restic-repo \
  --password-file /opt/docker/felhom-controller/data/restic-password snapshots

Save restic password externally — losing it means losing access to all backups
