TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups
Version bump target: v0.4.0
Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)
Overview
Implement two major features in felhom-controller:
- Phase 2 — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
- Phase 3 — Database dump engine, restic backup snapshots, dashboard status display
Both phases share the scheduler infrastructure. Implement in order.
Phase 2A — Scheduler (internal/scheduler/)
Why first
main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min). Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups. All need a centralized, logged, observable job runner. Build it once, use everywhere.
Design: internal/scheduler/scheduler.go
package scheduler
type JobFunc func(ctx context.Context) error
type Job struct {
Name string
Fn JobFunc
Interval time.Duration // for periodic jobs (every N)
Schedule string // for daily jobs ("02:30", "03:00") — mutually exclusive with Interval
LastRun time.Time
LastErr error
Running bool
}
type Scheduler struct {
jobs []*Job
logger *log.Logger
ctx context.Context
cancel context.CancelFunc
wg sync.WaitGroup
}
func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job // for dashboard/API display (copy, not pointer)
Interval jobs (Every)
- Spawns a goroutine with time.Ticker
- Logs [SCHED] Running job: <n> at start, and [SCHED] Job <n> completed (took Xs) or [SCHED] Job <n> failed: <err> (took Xs) at end
- Updates LastRun, LastErr, Running fields (mutex-protected)
- Respects ctx.Done() for shutdown
- Quiet mode for high-frequency jobs: Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.
Daily jobs (Daily)
- Parses timeStr as "HH:MM" in Europe/Budapest timezone
- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
- After each run, sleeps until the next day's scheduled time
- Uses time.After() or time.Timer, NOT time.Ticker (handles DST transitions correctly)
- Same logging pattern as interval jobs
- Logs [SCHED] Daily job <n> scheduled for <next_time> on registration
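The "next occurrence" rule above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: nextDailyRun and nextDailyRunString are hypothetical helper names, and embedding time/tzdata is an assumption made so that Europe/Budapest resolves even in a minimal container image.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // assumption: embed the zone database so LoadLocation works without system tzdata
)

// nextDailyRun returns the first instant strictly after now at which the
// wall clock in loc reads hhmm ("HH:MM"). Hypothetical helper.
func nextDailyRun(now time.Time, hhmm string, loc *time.Location) (time.Time, error) {
	t, err := time.Parse("15:04", hhmm)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid daily schedule %q: %w", hhmm, err)
	}
	local := now.In(loc)
	next := time.Date(local.Year(), local.Month(), local.Day(), t.Hour(), t.Minute(), 0, 0, loc)
	if !next.After(local) {
		// Build tomorrow from calendar fields instead of adding 24h, so a
		// 23h or 25h DST day still lands on the same wall-clock time.
		next = time.Date(local.Year(), local.Month(), local.Day()+1, t.Hour(), t.Minute(), 0, 0, loc)
	}
	return next, nil
}

// nextDailyRunString is a convenience wrapper for experimenting:
// RFC 3339 timestamp in, RFC 3339 timestamp (Budapest offset) out.
func nextDailyRunString(nowRFC3339, hhmm string) (string, error) {
	loc, err := time.LoadLocation("Europe/Budapest")
	if err != nil {
		return "", err
	}
	now, err := time.Parse(time.RFC3339, nowRFC3339)
	if err != nil {
		return "", err
	}
	next, err := nextDailyRun(now, hhmm, loc)
	if err != nil {
		return "", err
	}
	return next.Format(time.RFC3339), nil
}
```

A daily job loop would then wait on time.NewTimer(time.Until(next)) and recompute next after every run, which is why a fixed-period time.Ticker is not suitable here.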
Edge cases
- If Daily timeStr is invalid → log error at registration, don't start the job
- If a job panics → recover, log [ERROR] Job <n> panicked: <err>, mark as failed
- If a job is already running when the next tick fires → skip, log [WARN] Job <n> still running, skipping
- Graceful shutdown: Stop() cancels context, wg.Wait() with 30s timeout for running jobs to finish
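The panic-recovery and skip-if-running rules could be combined in one wrapper around each job invocation. The sketch below is an assumption about how that might look: the job type, tick, runOnce and errStillRunning are illustrative names, and okJob and panickyJob exist only to exercise the wrapper.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// JobFunc mirrors the signature from the scheduler design above.
type JobFunc func(ctx context.Context) error

// job is a hypothetical internal record for one registered job.
type job struct {
	name    string
	fn      JobFunc
	mu      sync.Mutex
	running bool
	lastErr error
}

// errStillRunning signals that a tick was skipped because the previous run
// has not finished yet.
var errStillRunning = errors.New("still running, skipping")

// runOnce executes the job unless it is already running. A panic inside the
// job is converted into an ordinary error so the scheduler goroutine survives.
func (j *job) runOnce(ctx context.Context) (err error) {
	j.mu.Lock()
	if j.running {
		j.mu.Unlock()
		return errStillRunning
	}
	j.running = true
	j.mu.Unlock()

	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", j.name, r)
		}
		j.mu.Lock()
		j.running = false
		j.lastErr = err
		j.mu.Unlock()
	}()
	return j.fn(ctx)
}

// tick is what a ticker loop would call on every interval.
func (j *job) tick() error { return j.runOnce(context.Background()) }

// Demo jobs, only for trying the wrapper out.
func okJob(ctx context.Context) error      { return nil }
func panickyJob(ctx context.Context) error { panic("boom") }
```

Because the deferred function writes to the named result err, the caller sees a panic as a normal failure and can log it with the same "failed" path as any other error.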
Integration in main.go
Replace the two ad-hoc goroutines with:
sched := scheduler.New(logger)
// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
	return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
	return stackMgr.ScanStacks()
})
// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)
sched.Start(ctx)
defer sched.Stop()
Delete the two existing goroutines in main.go after migrating to the scheduler.
Phase 2B — CPU & Temperature Metrics (internal/system/)
Current state
info.go defines SystemInfo struct with memory + disk fields.
info_linux.go reads /proc/meminfo and syscall.Statfs.
info_other.go provides stubs for non-Linux.
New fields in SystemInfo
// Add to SystemInfo struct in info.go:
CPUPercent float64 `json:"cpu_percent"` // 0-100, averaged across all cores
LoadAvg1 float64 `json:"load_avg_1"` // 1-minute load average
LoadAvg5 float64 `json:"load_avg_5"` // 5-minute load average
LoadAvg15 float64 `json:"load_avg_15"` // 15-minute load average
TemperatureCelsius float64 `json:"temperature_celsius"` // CPU/SoC temperature
TemperatureSource string `json:"temperature_source"` // e.g. "thermal_zone0", "x86_pkg_temp"
CPU measurement approach
Do NOT block GetInfo() with a delta calculation.
Use a lightweight CPUCollector that runs in a background goroutine:
// internal/system/cpu_linux.go (build tag: linux)
type CPUCollector struct {
mu sync.RWMutex
cpuPercent float64
sampleRate time.Duration // default: 5 seconds
cancel context.CancelFunc
}
func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64 // returns latest sample
How it works:
- Reads /proc/stat first line: cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>
- Sleeps sampleRate (5s)
- Reads again, computes delta: busy = delta(user+nice+system+irq+softirq+steal), total = busy + delta(idle+iowait)
- cpuPercent = (busy / total) * 100
- Stores result, loops
Parsing /proc/stat:
cpu 1234 56 789 45678 123 45 67 0 0 0
Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Sum all = total. idle + iowait = idle_total. busy = total - idle_total.
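The parsing and delta arithmetic can be sketched like this. It is an illustration under assumptions: cpuSample, parseCPULine and cpuPercent are hypothetical names, and only the first eight counters are summed, as described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative counters taken from the aggregate "cpu" line.
type cpuSample struct {
	busy  uint64
	total uint64
}

// parseCPULine parses the first line of /proc/stat. It sums the first eight
// fields (user..steal) and treats idle+iowait as idle time.
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 9 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var v [8]uint64
	for i := range v {
		n, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i+1, err)
		}
		v[i] = n
	}
	var total uint64
	for _, n := range v {
		total += n
	}
	idle := v[3] + v[4] // idle + iowait
	return cpuSample{busy: total - idle, total: total}, nil
}

// cpuPercent returns the busy share between two samples, in the range 0-100.
// It returns 0 when no time has elapsed or the counters went backwards.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0
	}
	return float64(cur.busy-prev.busy) / float64(cur.total-prev.total) * 100
}
```

For the sample line above, the eight counters sum to 47992 and idle+iowait is 45801, so busy is 2191. A collector loop would keep the previous sample, sleep for sampleRate, and store cpuPercent(prev, cur) under the mutex.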
IMPORTANT: Inside a Docker container, /proc/stat reflects the HOST CPU; cgroup CPU limits on the container do not change what it reports. So the controller's own /proc/stat works and no extra mount is needed.
Load average
Read from /proc/loadavg (instant, no delta needed):
0.15 0.10 0.05 1/234 56789
First three fields are 1/5/15 minute load averages. Parse with fmt.Sscanf.
Add readLoadAvg(info *SystemInfo) in info_linux.go.
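A minimal sketch of the fmt.Sscanf parsing, assuming the file content has already been read into a string. parseLoadAvg is a hypothetical helper that readLoadAvg could call.

```go
package main

import "fmt"

// parseLoadAvg extracts the 1/5/15 minute load averages from the content of
// /proc/loadavg. The trailing fields (running/total tasks, last PID) are
// ignored because the format string stops after three floats.
func parseLoadAvg(content string) (l1, l5, l15 float64, err error) {
	if _, err = fmt.Sscanf(content, "%f %f %f", &l1, &l5, &l15); err != nil {
		return 0, 0, 0, fmt.Errorf("parse loadavg: %w", err)
	}
	return l1, l5, l15, nil
}
```

readLoadAvg would then read the file with os.ReadFile("/proc/loadavg") and copy the three values into LoadAvg1, LoadAvg5 and LoadAvg15, leaving them at zero on error.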
Temperature
Read from /sys/class/thermal/thermal_zone*/temp:
IMPORTANT: The controller runs in a Docker container. /sys is NOT available by default. We mount the host's /sys at /host/sys inside the container (see docker-compose.yml changes below).
// internal/system/info_linux.go — add readTemperature(info *SystemInfo)
Algorithm:
- Try /host/sys/class/thermal/thermal_zone*/temp first (Docker mount)
- Fallback to /sys/class/thermal/thermal_zone*/temp (native/development)
- For each zone, also read the type file for the label
- Pick the highest temperature (usually thermal_zone0 or x86_pkg_temp)
- Value is in millidegrees Celsius → divide by 1000.0
- Store the zone type as TemperatureSource
- If no thermal zones found: try /host/sys/class/hwmon/hwmon*/temp1_input as fallback (same millidegree format)
- If nothing found: leave fields as zero (dashboard hides temperature when 0)
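The thermal-zone part of the algorithm might look like the sketch below. The function names are hypothetical, the hwmon fallback is left out for brevity, and falling back to the directory name when the type file is unreadable is an assumption.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMilliCelsius converts the content of a sysfs temp file
// (millidegrees Celsius, usually with a trailing newline) to degrees.
func parseMilliCelsius(raw string) (float64, error) {
	n, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(n) / 1000.0, nil
}

// hottestZone scans <sysRoot>/class/thermal/thermal_zone*/temp and returns
// the highest reading plus that zone's type label. ok is false when no zone
// could be read.
func hottestZone(sysRoot string) (celsius float64, source string, ok bool) {
	paths, _ := filepath.Glob(filepath.Join(sysRoot, "class/thermal/thermal_zone*/temp"))
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		c, err := parseMilliCelsius(string(raw))
		if err != nil {
			continue
		}
		if !ok || c > celsius {
			label := filepath.Base(filepath.Dir(p)) // assumption: fall back to "thermal_zoneN"
			if t, err := os.ReadFile(filepath.Join(filepath.Dir(p), "type")); err == nil {
				if s := strings.TrimSpace(string(t)); s != "" {
					label = s
				}
			}
			celsius, source, ok = c, label, true
		}
	}
	return celsius, source, ok
}

// readTemperature tries the Docker mount first, then the native path.
// It returns zero values when nothing is found.
func readTemperature() (float64, string) {
	for _, root := range []string{"/host/sys", "/sys"} {
		if c, src, ok := hottestZone(root); ok {
			return c, src
		}
	}
	return 0, ""
}
```

Taking the root directory as a parameter keeps the scan testable against a temporary directory tree that mimics sysfs.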
GetInfo() signature change
// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo
Inside GetInfo():
- Existing: readMemInfo(&info), readDiskUsage(...) (unchanged)
- New: readLoadAvg(&info)
- New: readTemperature(&info)
- New: if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }
The info_other.go stub accepts the parameter but ignores it (returns empty SystemInfo as before).
CPU collector lifecycle
Started in main.go:
cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()
Passed to web.NewServer() and api.NewRouter() which pass it to system.GetInfo() calls.
Dashboard display
Extend the existing system info bar in dashboard.html:
Current layout:
| Memória | ████████░░ 72% | SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
New layout:
| Memória | ████████░░ 72% | CPU | ██░░░░░░░░ 15% | Hőmérséklet | 52°C |
| SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.
Temperature display:
- Show as text "52°C" with colored dot (green/yellow/red)
- Green: < 60°C
- Yellow: 60-75°C
- Red: > 75°C
- If temperature is 0 (unavailable): hide entirely
CPU progress bar:
- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%
Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"
Phase 2C — Healthchecks.io Ping Integration (internal/monitor/)
Design: internal/monitor/pinger.go
package monitor
type Pinger struct {
baseURL string // e.g. "https://status.felhom.eu"
httpClient *http.Client
logger *log.Logger
enabled bool
}
func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger
// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error
// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error
// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error
HTTP protocol
- Success: POST {baseURL}/ping/{uuid} with body as request body
- Failure: POST {baseURL}/ping/{uuid}/fail with body
- Start: POST {baseURL}/ping/{uuid}/start
- Timeout: 10 seconds
- Retry: 3 attempts with 2s backoff between retries
- If uuid is empty or starts with "CHANGEME" → skip silently (log at debug level only)
- If enabled is false → skip all pings
- Never let ping failures affect the main operation: log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.
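The protocol rules above could be sketched roughly as follows. The helper names (shouldSkip, pingURL, newClient, post) are hypothetical, and treating any non-2xx response as a failed attempt is an assumption rather than something the spec states.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// shouldSkip reports whether a ping must be silently skipped.
func shouldSkip(enabled bool, uuid string) bool {
	return !enabled || uuid == "" || strings.HasPrefix(uuid, "CHANGEME")
}

// pingURL builds {baseURL}/ping/{uuid}[/suffix]; suffix is "", "fail" or "start".
func pingURL(baseURL, uuid, suffix string) string {
	u := strings.TrimRight(baseURL, "/") + "/ping/" + uuid
	if suffix != "" {
		u += "/" + suffix
	}
	return u
}

// newClient returns an HTTP client with the 10 second timeout from the spec.
func newClient() *http.Client {
	return &http.Client{Timeout: 10 * time.Second}
}

// post sends body to url, trying up to 3 times with a 2s pause in between.
func post(ctx context.Context, client *http.Client, url, body string) error {
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader(body))
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		if attempt < 3 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(2 * time.Second):
			}
		}
	}
	return fmt.Errorf("ping failed after 3 attempts: %w", lastErr)
}
```

Ping, Fail and Start would each call shouldSkip first, then post with the matching suffix. The caller logs a returned error as a warning and carries on.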
Design: internal/monitor/healthcheck.go
// RunHealthCheck runs system checks and returns a diagnostic report.
type HealthReport struct {
Status string // "ok", "warn", "fail"
Issues []string // critical problems
Warnings []string // non-critical warnings
Info []string // informational items
Timestamp time.Time
}
func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string // human-readable summary for HC ping body
Checks to run (replicating backup-healthcheck.sh logic in Go):
- Disk usage: Read from system.GetInfo(). Compare against thresholds (disk_warn_percent, disk_crit_percent).
- Memory usage: Same source. Warn if above memory_warn_percent.
- CPU usage: From collector. Warn if above cpu_warn_percent.
- Temperature: From system.GetInfo(). Warn if above temperature_warn_celsius.
- Docker health: Verify Docker daemon is reachable by running docker info (quick exec check).
- Protected containers: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.
Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".
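The status rule can be written as a small pure function. The message layout produced by formatMessage below is purely an assumption for illustration; the spec only asks for a human-readable summary.

```go
package main

import "strings"

// deriveStatus applies the rule: any issue means "fail", otherwise any
// warning means "warn", otherwise "ok".
func deriveStatus(issues, warnings []string) string {
	switch {
	case len(issues) > 0:
		return "fail"
	case len(warnings) > 0:
		return "warn"
	default:
		return "ok"
	}
}

// formatMessage renders a plain-text body for the Healthchecks ping.
// The line prefixes are hypothetical.
func formatMessage(issues, warnings, info []string) string {
	var b strings.Builder
	b.WriteString("status: " + deriveStatus(issues, warnings) + "\n")
	for _, s := range issues {
		b.WriteString("ISSUE: " + s + "\n")
	}
	for _, s := range warnings {
		b.WriteString("WARN: " + s + "\n")
	}
	for _, s := range info {
		b.WriteString("INFO: " + s + "\n")
	}
	return b.String()
}
```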
Scheduler integration
// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth
// Parse system_health_interval (default "5m"). Fall back to the default on a
// missing or invalid value: a zero interval would make time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
	healthInterval = 5 * time.Minute
}
sched.Every("system-health", healthInterval, func(ctx context.Context) error {
	report := monitor.RunHealthCheck(cfg, cpuCollector)
	body := report.FormatMessage()
	if report.Status == "fail" {
		pinger.Fail(healthUUID, body)
	} else {
		pinger.Ping(healthUUID, body)
	}
	return nil // never fail the scheduler job due to ping errors
})
Config changes
Add to MonitoringConfig:
SystemHealthInterval string `yaml:"system_health_interval"`
Default in applyDefaults(): "5m"
Phase 3A — Database Dump Engine (internal/backup/dbdump.go)
Approach: Auto-discover from running Docker containers
Replicates the proven logic from backup-db-dump.sh in Go:
package backup
type DBType string
const (
DBTypePostgres DBType = "postgres"
DBTypeMariaDB DBType = "mariadb"
)
type DiscoveredDB struct {
ContainerName string
ContainerID string
DBType DBType
DBUser string
DBName string
StackName string // derived from container name
}
type DumpResult struct {
DB DiscoveredDB
FilePath string
Size int64
Duration time.Duration
Error error
}
func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult
Discovery logic
Run docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running.
For each running container, check image name:
- Contains postgres → DBTypePostgres
- Contains mariadb or mysql → DBTypeMariaDB
Then for each DB container, get env vars via:
docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'
Parse env vars:
- PostgreSQL: POSTGRES_USER (default: "postgres"), POSTGRES_DB (default: same as POSTGRES_USER)
- MariaDB: MYSQL_ROOT_PASSWORD (or MARIADB_ROOT_PASSWORD), MYSQL_DATABASE (or MARIADB_DATABASE)
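Applying these defaults to the docker inspect output could look like the sketch below. The function names are hypothetical; the input is assumed to be the KEY=VALUE lines printed by the inspect template.

```go
package main

import "strings"

// parseEnv turns "KEY=VALUE" lines into a map. Lines without "=" are ignored,
// and only the first "=" splits, so values may contain "=" themselves.
func parseEnv(lines []string) map[string]string {
	env := make(map[string]string, len(lines))
	for _, l := range lines {
		if k, v, ok := strings.Cut(l, "="); ok {
			env[k] = v
		}
	}
	return env
}

// postgresCreds applies the defaults: user falls back to "postgres",
// database falls back to the user name.
func postgresCreds(env map[string]string) (user, db string) {
	user = env["POSTGRES_USER"]
	if user == "" {
		user = "postgres"
	}
	db = env["POSTGRES_DB"]
	if db == "" {
		db = user
	}
	return user, db
}

// mariaDBName prefers MYSQL_DATABASE and falls back to MARIADB_DATABASE.
func mariaDBName(env map[string]string) string {
	if v := env["MYSQL_DATABASE"]; v != "" {
		return v
	}
	return env["MARIADB_DATABASE"]
}
```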
Derive stack name from container name by stripping common DB suffixes:
- paperless-ngx-postgres → paperless-ngx
- romm-db → romm
- immich-postgres → immich
- Logic: split on -, check if last segment is a known suffix (postgres, db, mariadb, mysql, database, redis, cache), if so remove it
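The suffix-stripping rule is small enough to show in full. stackNameFromContainer is a hypothetical name; the suffix list is the one given above.

```go
package main

import "strings"

// dbSuffixes lists the trailing name segments that mark a database or cache
// sidecar container.
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// stackNameFromContainer removes the last "-" separated segment when it is a
// known suffix. Names without a hyphen, or with an unknown last segment, are
// returned unchanged.
func stackNameFromContainer(name string) string {
	i := strings.LastIndex(name, "-")
	if i <= 0 {
		return name
	}
	if dbSuffixes[name[i+1:]] {
		return name[:i]
	}
	return name
}
```

Only the final segment is inspected, so a stack whose own name contains a hyphen (paperless-ngx) is preserved.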
Dump execution
PostgreSQL:
docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges
MariaDB:
docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>
IMPORTANT: Use docker exec to run dump commands INSIDE the DB container. Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.
Output handling:
- Use os/exec.Command("docker", "exec", ...) with cmd.Stdout piped to a temp file
- Write to {dumpDir}/{stackName}-{dbtype}.sql.tmp during dump
- Rename .tmp → .sql on success only
- Delete .tmp on failure
- Set 5-minute timeout per dump via context.WithTimeout (pass the context through exec.CommandContext so the timeout actually stops the command)
Gotchas and edge cases
- MariaDB password from container env: Never log the password. Use docker inspect to read MYSQL_ROOT_PASSWORD or MARIADB_ROOT_PASSWORD.
- Empty/zero-size dumps: Check dump file size after writing. If 0 bytes → treat as failure.
- Dump file naming: {stackName}-{dbtype}.sql (e.g., paperless-ngx-postgres.sql). Overwrite previous dump each run (restic handles versioning).
- Old tmp cleanup: Delete .tmp files older than 1 hour on each run (leftover from crashed dumps).
- Skip infrastructure DBs: Don't dump databases from protected stacks (if any have DBs in the future).
- Container not running: If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).
Dump directory
/srv/backups/db-dumps/ — configured in controller.yaml as paths.db_dump_dir.
Already mounted in docker-compose.yml via /srv/backups:/srv/backups.
The user does NOT see this directory (not in FileBrowser, not on HDD).
Phase 3B — Restic Integration (internal/backup/restic.go)
Design
type ResticManager struct {
repoPath string
passwordFile string
logger *log.Logger
customerID string
cacheDir string
}
func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager
func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)
type SnapshotResult struct {
SnapshotID string
FilesNew int
FilesChanged int
DataAdded string // human-readable
Duration time.Duration
}
type SnapshotInfo struct {
ID string
Time time.Time
Paths []string
Tags []string
}
type RepoStats struct {
TotalSize string
SnapshotCount int
LatestSnapshot *SnapshotInfo
}
Restic commands (all via os/exec)
All commands set these env vars:
cmd.Env = append(os.Environ(),
	"RESTIC_REPOSITORY="+r.repoPath,
	"RESTIC_PASSWORD_FILE="+r.passwordFile,
	"RESTIC_CACHE_DIR="+r.cacheDir,
)
RESTIC_CACHE_DIR must be set to /opt/docker/felhom-controller/data/restic-cache (inside the controller-data Docker volume). Without this, restic defaults to ~/.cache/restic which may not persist across container restarts.
Init (idempotent):
- Check if {repoPath}/config file exists → if so, already initialized, skip
- Otherwise: restic init
Snapshot:
restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
--tag felhom --tag <customerID> --host <customerID>
What gets backed up (v1):
- /opt/docker/stacks/: compose files, .felhom.yml, app.yaml (deploy configs with secrets)
- /srv/backups/db-dumps/: SQL dumps (from the DB dump step)
- /opt/docker/felhom-controller/controller.yaml: controller config
NOT backed up in v1:
- HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
- Docker volumes directly — critical data covered by DB dumps
Parse snapshot output (restic backup with --json writes JSON lines to stdout; the last one is the summary):
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
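Picking the summary out of the captured stdout might be sketched as below. The struct and function names are hypothetical, and only the fields shown in the sample line are decoded. Lines that fail to parse are skipped on the assumption that they are not part of the JSON stream.

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"strings"
)

// backupSummary holds the fields of the "summary" message used by SnapshotResult.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    int64  `json:"data_added"` // bytes
	SnapshotID   string `json:"snapshot_id"`
}

// parseBackupSummary scans the JSON lines printed by "restic backup --json"
// and returns the last line whose message_type is "summary".
func parseBackupSummary(stdout string) (*backupSummary, error) {
	var found *backupSummary
	sc := bufio.NewScanner(strings.NewReader(stdout))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long status lines
	for sc.Scan() {
		var s backupSummary
		if err := json.Unmarshal(sc.Bytes(), &s); err != nil {
			continue // not a JSON line, ignore
		}
		if s.MessageType == "summary" {
			found = &s
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if found == nil {
		return nil, errors.New("no summary line in restic output")
	}
	return found, nil
}
```

DataAdded arrives as a byte count, so the human-readable DataAdded string in SnapshotResult would be formatted from it afterwards.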
Prune:
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
Check:
restic check
Latest snapshot:
restic snapshots --latest 1 --json
Returns JSON array with snapshot objects.
Stats (repo size):
restic stats --json
Password auto-generation
On startup, EnsureInitialized() checks if the password file exists. If not:
- Generate 32 random bytes, base64url-encode
- Write to r.passwordFile (the controller-data volume path)
- Log [INFO] Generated new restic repository password at <path>
- Log [WARN] Save this password externally - losing it means losing access to ALL backups
Gotchas
- restic is already in the Docker image (Dockerfile installs it). No additional setup.
- Locking: Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error. Add restic unlock to the error handling path with a log warning.
- Timeout: 30-minute timeout for snapshot operations, enforced through a context deadline.
- Large repos: First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
- restic JSON output: Use --json for machine-parseable output. With --json, restic writes its JSON to stdout. For backup --json, the summary object with message_type: "summary" is on stdout, so parse the last JSON line from stdout.
Phase 3C — Backup Orchestrator (internal/backup/backup.go)
Design
type Manager struct {
cfg *config.Config
restic *ResticManager
logger *log.Logger
pinger *monitor.Pinger
mu sync.Mutex
lastDBDump *DBDumpStatus
lastBackup *BackupStatus
}
type DBDumpStatus struct {
LastRun time.Time
Results []DumpResult
Success bool
Duration time.Duration
}
type BackupStatus struct {
LastRun time.Time
Snapshot *SnapshotResult
Success bool
Duration time.Duration
RepoStats *RepoStats
}
func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)
Full backup flow (daily scheduled)
- DB dumps: DiscoverDatabases() → DumpAll() → update lastDBDump status
- Ping Healthchecks for DB dump result: pinger.Ping/Fail(dbDumpUUID, summary)
- Restic snapshot: restic.EnsureInitialized() → restic.Snapshot(paths, tags)
- Prune (weekly): Check day of week against prune_schedule config. If match → restic.Prune(retention) + restic.Check()
- Ping Healthchecks for backup result: pinger.Ping/Fail(backupUUID, summary)
- Update lastBackup status
Scheduler integration
// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)
if cfg.Backup.Enabled {
	sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
		return backupMgr.RunDBDumps(ctx)
	})
	sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
		return backupMgr.RunBackup(ctx)
	})
}
Dashboard display
Add "Biztonsági mentés" (Backup) section to dashboard.html:
╔══════════════════════════════════════════╗
║ 🛡️ Biztonsági mentés ║
╠══════════════════════════════════════════╣
║ Utolsó mentés: 2026-02-15 03:01 ✅ ║
║ Adatbázisok: 2 mentve (12.3 MB) ║
║ Tároló méret: 45.2 MB (23 pillanatkép) ║
║ Következő: ma 03:00 ║
║ ║
║ [Mentés most] ║
╚══════════════════════════════════════════╝
Hungarian labels:
- "Biztonsági mentés" = Backup
- "Utolsó mentés" = Last backup
- "Adatbázisok" = Databases
- "mentve" = backed up
- "Tároló méret" = Repository size
- "pillanatkép" = snapshot(s)
- "Következő" = Next
- "Mentés most" = Backup now
Status colors:
- Green ✅: Last backup successful and less than backup_max_age_hours old
- Yellow ⚠️: Last backup successful but older than expected
- Red ❌: Last backup failed or no backups exist yet
- Gray: Backup not configured (backup.enabled: false)
If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).
API endpoints
Add to api/router.go:
GET /api/backup/status → backup manager status + repo stats
POST /api/backup/run → trigger immediate full backup (async)
POST /api/backup/run starts the backup in a background goroutine and returns immediately with {"ok": true, "message": "Mentés elindítva"} (the message means "Backup started"). The dashboard can poll /api/backup/status to track progress.
Docker-compose.yml final state
services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket: required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys: for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3
volumes:
  controller-data:
networks:
  traefik-public:
    external: true
Changes from current:
- Added: /sys:/host/sys:ro (for temperature reading)
- Removed: dedicated restic-password bind mount (password now in controller-data volume)
Config changes summary
controller.yaml.example updates
monitoring:
  system_health_interval: "5m" # NEW field
backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password" # CHANGED default path
config.go updates
- Add SystemHealthInterval string to MonitoringConfig
- Default: "5m" in applyDefaults()
- Change restic_password_file default from /opt/docker/felhom-controller/restic-password to /opt/docker/felhom-controller/data/restic-password
- Add env override: FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL
Implementation Order
Step 1: Scheduler
- Create internal/scheduler/scheduler.go
- Implement Every() and Daily() with logging, panic recovery, skip-if-running
- Migrate the two existing goroutines from main.go to scheduler
- Build and verify: behavior should be identical, logs should show [SCHED] entries
Step 2: CPU & Temperature metrics
- Create internal/system/cpu_linux.go + cpu_other.go (build tags)
- Add readLoadAvg() and readTemperature() to info_linux.go
- Extend SystemInfo struct in info.go
- Update GetInfo() signature in all files to accept *CPUCollector
- Start CPUCollector in main.go, pass to web server and API router
- Update docker-compose.yml: add /sys:/host/sys:ro
- Update dashboard.html: show CPU, load, temperature
- Update style.css if needed for new display elements
- Build, deploy, verify: new metrics visible on dashboard
Step 3: Healthchecks pinger + health checks
- Create internal/monitor/pinger.go
- Create internal/monitor/healthcheck.go
- Add system_health_interval to config
- Add system health ping job to scheduler in main.go
- Build, deploy, then check controller logs for health check runs
Step 4: Database dump engine
- Create internal/backup/dbdump.go
- Implement discovery + dump functions
- Wire up RunDBDumps temporarily to a test endpoint or manual scheduler trigger for testing
- Build, deploy, verify: dumps should appear in /srv/backups/db-dumps/ for paperless-ngx-postgres
Step 5: Restic integration
- Create internal/backup/restic.go
- Implement init, snapshot, prune, check, stats
- Auto-generate restic password if missing
- Update docker-compose.yml (remove restic-password bind mount)
- Build, deploy, verify: repo initialized, password generated
Step 6: Backup orchestrator + dashboard
- Create internal/backup/backup.go
- Wire up scheduler daily jobs (DB dump + backup)
- Add API endpoints (/api/backup/status, /api/backup/run)
- Add backup status section to dashboard.html
- Add "Mentés most" button
- Build, deploy, verify full flow
Step 7: Documentation & cleanup
- Update README.md: Phase 2 and 3 checked off, new module descriptions
- Update CONTEXT.md with session summary
- Update CLAUDE.md if workflow changes
- Version bump in build: v0.4.0
Verification Checklist
After deployment, verify each item:
- docker ps shows controller healthy
- Dashboard loads with CPU %, load average, temperature displayed
- Temperature shows realistic value (30-60°C idle for N100)
- CPU % updates (not stuck at 0)
- /api/system/info returns all new fields (cpu_percent, load_avg_1, load_avg_5, load_avg_15, temperature_celsius, temperature_source)
- Scheduler logs show [SCHED] entries for all registered jobs
- If HC UUIDs configured: pings visible in status.felhom.eu dashboard
- DB dump discovers paperless-ngx postgres container
- Dump file exists: /srv/backups/db-dumps/paperless-ngx-postgres.sql
- Restic repo initialized: /srv/backups/restic-repo/config exists
- Restic password auto-generated: /opt/docker/felhom-controller/data/restic-password exists
- "Mentés most" button triggers backup successfully
- Dashboard shows backup status section with last backup time
- All existing features still work (start/stop/deploy/update/logs/auth)
New files to create
internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go
Existing files to modify
internal/system/info.go — new SystemInfo fields
internal/system/info_linux.go — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go — GetInfo() signature update
internal/config/config.go — SystemHealthInterval, updated defaults
internal/api/router.go — backup endpoints, cpuCollector parameter
internal/web/server.go — accept cpuCollector, backupMgr
internal/web/handlers.go — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css — styles for new elements
cmd/controller/main.go — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml — /sys mount, remove restic-password mount
configs/controller.yaml.example — new fields, updated defaults
Manual steps after deployment (for Viktor)
- Verify /sys mount: docker exec felhom-controller ls /host/sys/class/thermal/ (should show thermal_zone directories)
- Healthchecks setup: Create project + 3 checks in status.felhom.eu for demo-felhom:
  - system-health (period: 10m, grace: 10m)
  - db-dump (period: 24h, grace: 1h)
  - backup (period: 24h, grace: 1h)
- Update controller.yaml: Add the three ping UUIDs
- Verify restic password: docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password
- Test restore procedure: docker exec felhom-controller restic -r /srv/backups/restic-repo --password-file /opt/docker/felhom-controller/data/restic-password snapshots
- Save restic password externally: losing it means losing access to all backups