# TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups
> Version bump target: **v0.4.0**
> Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)
---
## Overview
Implement two major features in felhom-controller:
1. **Phase 2** — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
2. **Phase 3** — Database dump engine, restic backup snapshots, dashboard status display
Both phases share the scheduler infrastructure. Implement in order.
---
## Phase 2A — Scheduler (`internal/scheduler/`)
### Why first
main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min).
Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups.
All need a centralized, logged, observable job runner. Build it once, use everywhere.
### Design: `internal/scheduler/scheduler.go`
```go
package scheduler

type JobFunc func(ctx context.Context) error

type Job struct {
	Name     string
	Fn       JobFunc
	Interval time.Duration // for periodic jobs (every N)
	Schedule string        // for daily jobs ("02:30", "03:00"); mutually exclusive with Interval
	LastRun  time.Time
	LastErr  error
	Running  bool
}

type Scheduler struct {
	jobs   []*Job
	logger *log.Logger
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job // for dashboard/API display (copy, not pointer)
```
### Interval jobs (`Every`)
- Spawns a goroutine with `time.Ticker`
- Logs `[SCHED] Running job: <n>` at start, `[SCHED] Job <n> completed (took Xs)` or `[SCHED] Job <n> failed: <err> (took Xs)` at end
- Updates `LastRun`, `LastErr`, `Running` fields (mutex-protected)
- Respects ctx.Done() for shutdown
- **Quiet mode for high-frequency jobs:** Jobs that run every <=30 seconds should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.
### Daily jobs (`Daily`)
- Parses `timeStr` as "HH:MM" in `Europe/Budapest` timezone
- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
- After each run, sleeps until the next day's scheduled time
- Uses `time.After()` or `time.Timer`, NOT `time.Ticker` (handles DST transitions correctly)
- Same logging pattern as interval jobs
- Logs `[SCHED] Daily job <n> scheduled for <next_time>` on registration
### Edge cases
- If `Daily` timeStr is invalid → log error at registration, don't start the job
- If a job panics → recover, log `[ERROR] Job <n> panicked: <err>`, mark as failed
- If a job is already running when the next tick fires → skip, log `[WARN] Job <n> still running, skipping`
- Graceful shutdown: `Stop()` cancels context, `wg.Wait()` with 30s timeout for running jobs to finish
### Integration in main.go
Replace the two ad-hoc goroutines with:
```go
sched := scheduler.New(logger)
// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
	return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
	return stackMgr.ScanStacks()
})
// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)
sched.Start(ctx)
defer sched.Stop()
```
Delete the two existing goroutines in main.go after migrating to the scheduler.
---
## Phase 2B — CPU & Temperature Metrics (`internal/system/`)
### Current state
`info.go` defines `SystemInfo` struct with memory + disk fields.
`info_linux.go` reads `/proc/meminfo` and `syscall.Statfs`.
`info_other.go` provides stubs for non-Linux.
### New fields in `SystemInfo`
```go
// Add to SystemInfo struct in info.go:
CPUPercent float64 `json:"cpu_percent"` // 0-100, averaged across all cores
LoadAvg1 float64 `json:"load_avg_1"` // 1-minute load average
LoadAvg5 float64 `json:"load_avg_5"` // 5-minute load average
LoadAvg15 float64 `json:"load_avg_15"` // 15-minute load average
TemperatureCelsius float64 `json:"temperature_celsius"` // CPU/SoC temperature
TemperatureSource string `json:"temperature_source"` // e.g. "thermal_zone0", "x86_pkg_temp"
```
### CPU measurement approach
**Do NOT block `GetInfo()` with a delta calculation.**
Use a lightweight `CPUCollector` that runs in a background goroutine:
```go
// internal/system/cpu_linux.go (build tag: linux)
type CPUCollector struct {
	mu         sync.RWMutex
	cpuPercent float64
	sampleRate time.Duration // default: 5 seconds
	cancel     context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64 // returns latest sample
```
How it works:
1. Reads `/proc/stat` first line: `cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>`
2. Sleeps `sampleRate` (5s)
3. Reads again, computes delta: `busy = delta(user+nice+system+irq+softirq+steal)`, `total = busy + delta(idle+iowait)`
4. `cpuPercent = (busy / total) * 100`
5. Stores result, loops
Parsing `/proc/stat`:
```
cpu 1234 56 789 45678 123 45 67 0 0 0
```
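Split by whitespace. The first eight fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8). Ignore the trailing guest/guest_nice fields; the kernel already counts guest time in user and nice.
Sum of the eight fields = total. idle + iowait = idle_total. busy = total - idle_total.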
**IMPORTANT: Inside a Docker container, `/proc/stat` reflects the HOST CPU.** This file is not namespaced, and cgroup CPU limits do not change its contents (only an overlay such as lxcfs would). So the controller's own `/proc/stat` works.
### Load average
Read from `/proc/loadavg` (instant, no delta needed):
```
0.15 0.10 0.05 1/234 56789
```
First three fields are 1/5/15 minute load averages. Parse with `fmt.Sscanf`.
Add `readLoadAvg(info *SystemInfo)` in `info_linux.go`.
### Temperature
Read from `/sys/class/thermal/thermal_zone*/temp`:
**IMPORTANT**: The controller runs in a Docker container. `/sys` is NOT available by default. We mount the host's `/sys` at `/host/sys` inside the container (see docker-compose.yml changes below).
```go
// internal/system/info_linux.go — add readTemperature(info *SystemInfo)
```
Algorithm:
1. Try `/host/sys/class/thermal/thermal_zone*/temp` first (Docker mount)
2. Fallback to `/sys/class/thermal/thermal_zone*/temp` (native/development)
3. For each zone, also read the `type` file for the label
4. Pick the highest temperature (usually `thermal_zone0` or `x86_pkg_temp`)
5. Value is in millidegrees Celsius → divide by 1000.0
6. Store the zone type as `TemperatureSource`
7. If no thermal zones found: try `/host/sys/class/hwmon/hwmon*/temp1_input` as fallback (same millidegree format)
8. If nothing found: leave fields as zero (dashboard hides temperature when 0)
### `GetInfo()` signature change
```go
// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo
```
Inside `GetInfo()`:
1. Existing: `readMemInfo(&info)`, `readDiskUsage(...)` — unchanged
2. New: `readLoadAvg(&info)`
3. New: `readTemperature(&info)`
4. New: `if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }`
The `info_other.go` stub accepts the parameter but ignores it (returns empty SystemInfo as before).
### CPU collector lifecycle
Started in `main.go`:
```go
cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()
```
Passed to `web.NewServer()` and `api.NewRouter()` which pass it to `system.GetInfo()` calls.
### Dashboard display
Extend the existing system info bar in `dashboard.html`:
Current layout:
```
| Memória | ████████░░ 72% | SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```
New layout:
```
| Memória | ████████░░ 72% | CPU | ██░░░░░░░░ 15% | Hőmérséklet | 52°C |
| SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```
Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.
Temperature display:
- Show as text "52°C" with colored dot (green/yellow/red)
- Green: < 60°C
- Yellow: 60-75°C
- Red: > 75°C
- If temperature is 0 (unavailable): hide entirely
CPU progress bar:
- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%
Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"
---
## Phase 2C — Healthchecks.io Ping Integration (`internal/monitor/`)
### Design: `internal/monitor/pinger.go`
```go
package monitor
type Pinger struct {
	baseURL    string // e.g. "https://status.felhom.eu"
	httpClient *http.Client
	logger     *log.Logger
	enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error
```
### HTTP protocol
- Success: `POST {baseURL}/ping/{uuid}` with body as request body
- Failure: `POST {baseURL}/ping/{uuid}/fail` with body
- Start: `POST {baseURL}/ping/{uuid}/start`
- Timeout: 10 seconds
- Retry: 3 attempts with 2s backoff between retries
- If `uuid` is empty or starts with "CHANGEME" → skip silently (log at debug level only)
- If `enabled` is false → skip all pings
- **Never let ping failures affect the main operation** — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.
### Design: `internal/monitor/healthcheck.go`
```go
type HealthReport struct {
	Status    string   // "ok", "warn", "fail"
	Issues    []string // critical problems
	Warnings  []string // non-critical warnings
	Info      []string // informational items
	Timestamp time.Time
}

// RunHealthCheck runs system checks and returns a diagnostic report.
func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string // human-readable summary for HC ping body
```
Checks to run (replicating backup-healthcheck.sh logic in Go):
1. **Disk usage**: Read from `system.GetInfo()`. Compare against thresholds (`disk_warn_percent`, `disk_crit_percent`).
2. **Memory usage**: Same source. Warn if above `memory_warn_percent`.
3. **CPU usage**: From collector. Warn if above `cpu_warn_percent`.
4. **Temperature**: From `system.GetInfo()`. Warn if above `temperature_warn_celsius`.
5. **Docker health**: Verify Docker daemon is reachable by running `docker info` (quick exec check).
6. **Protected containers**: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.
Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".
### Scheduler integration
```go
// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth
// Parse system_health_interval (default "5m"). Fall back to 5m on a bad value,
// because a zero or negative interval makes time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
	logger.Printf("[WARN] invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
	healthInterval = 5 * time.Minute
}
sched.Every("system-health", healthInterval, func(ctx context.Context) error {
	report := monitor.RunHealthCheck(cfg, cpuCollector)
	body := report.FormatMessage()
	if report.Status == "fail" {
		pinger.Fail(healthUUID, body)
	} else {
		pinger.Ping(healthUUID, body)
	}
	return nil // never fail the scheduler job due to ping errors
})
```
### Config changes
Add to `MonitoringConfig`:
```go
SystemHealthInterval string `yaml:"system_health_interval"`
```
Default in `applyDefaults()`: `"5m"`
---
## Phase 3A — Database Dump Engine (`internal/backup/dbdump.go`)
### Approach: Auto-discover from running Docker containers
Replicates the proven logic from `backup-db-dump.sh` in Go:
```go
package backup

type DBType string

const (
	DBTypePostgres DBType = "postgres"
	DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
	ContainerName string
	ContainerID   string
	DBType        DBType
	DBUser        string
	DBName        string
	StackName     string // derived from container name
}

type DumpResult struct {
	DB       DiscoveredDB
	FilePath string
	Size     int64
	Duration time.Duration
	Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult
```
### Discovery logic
Run `docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running`.
For each running container, check image name:
- Contains `postgres` → DBTypePostgres
- Contains `mariadb` or `mysql` → DBTypeMariaDB
Then for each DB container, get env vars via:
`docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'`
Parse env vars:
- **PostgreSQL**: `POSTGRES_USER` (default: "postgres"), `POSTGRES_DB` (default: same as POSTGRES_USER)
- **MariaDB**: `MYSQL_ROOT_PASSWORD` (or `MARIADB_ROOT_PASSWORD`), `MYSQL_DATABASE` (or `MARIADB_DATABASE`)
Derive stack name from container name by stripping common DB suffixes:
- `paperless-ngx-postgres` → `paperless-ngx`
- `romm-db` → `romm`
- `immich-postgres` → `immich`
- Logic: split on `-`, check if last segment is a known suffix (`postgres`, `db`, `mariadb`, `mysql`, `database`, `redis`, `cache`), if so remove it
### Dump execution
**PostgreSQL:**
```bash
docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges
```
**MariaDB:**
```bash
docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>
```
**IMPORTANT: Use `docker exec` to run dump commands INSIDE the DB container.** Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.
Output handling:
- Use `os/exec.Command("docker", "exec", ...)` with `cmd.Stdout` piped to a temp file
- Write to `{dumpDir}/{stackName}-{dbtype}.sql.tmp` during dump
- Rename `.tmp` → `.sql` on success only
- Delete `.tmp` on failure
- Set 5-minute timeout per dump via `context.WithTimeout`
### Gotchas and edge cases
- **MariaDB password from container env:** Never log the password. Use `docker inspect` to read `MYSQL_ROOT_PASSWORD` or `MARIADB_ROOT_PASSWORD`.
- **Empty/zero-size dumps:** Check dump file size after writing. If 0 bytes → treat as failure.
- **Dump file naming:** `{stackName}-{dbtype}.sql` (e.g., `paperless-ngx-postgres.sql`). Overwrite previous dump each run (restic handles versioning).
- **Old tmp cleanup:** Delete `.tmp` files older than 1 hour on each run (leftover from crashed dumps).
- **Skip infrastructure DBs:** Don't dump databases from protected stacks (if any have DBs in the future).
- **Container not running:** If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).
### Dump directory
`/srv/backups/db-dumps/` — configured in `controller.yaml` as `paths.db_dump_dir`.
Already mounted in docker-compose.yml via `/srv/backups:/srv/backups`.
The user does NOT see this directory (not in FileBrowser, not on HDD).
---
## Phase 3B — Restic Integration (`internal/backup/restic.go`)
### Design
```go
type ResticManager struct {
	repoPath     string
	passwordFile string
	logger       *log.Logger
	customerID   string
	cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager
func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
	SnapshotID   string
	FilesNew     int
	FilesChanged int
	DataAdded    string // human-readable
	Duration     time.Duration
}

type SnapshotInfo struct {
	ID    string
	Time  time.Time
	Paths []string
	Tags  []string
}

type RepoStats struct {
	TotalSize      string
	SnapshotCount  int
	LatestSnapshot *SnapshotInfo
}
```
### Restic commands (all via `os/exec`)
All commands set these env vars:
```go
cmd.Env = append(os.Environ(),
	"RESTIC_REPOSITORY="+r.repoPath,
	"RESTIC_PASSWORD_FILE="+r.passwordFile,
	"RESTIC_CACHE_DIR="+r.cacheDir,
)
```
**`RESTIC_CACHE_DIR`** must be set to `/opt/docker/felhom-controller/data/restic-cache` (inside the controller-data Docker volume). Without this, restic defaults to `~/.cache/restic` which may not persist across container restarts.
**Init** (idempotent):
- Check if `{repoPath}/config` file exists → if so, already initialized, skip
- Otherwise: `restic init`
**Snapshot:**
```bash
restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
--tag felhom --tag <customerID> --host <customerID>
```
What gets backed up (v1):
- `/opt/docker/stacks/` — compose files, .felhom.yml, app.yaml (deploy configs with secrets)
- `/srv/backups/db-dumps/` — SQL dumps (from the DB dump step)
- `/opt/docker/felhom-controller/controller.yaml` — controller config
**NOT backed up in v1:**
- HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
- Docker volumes directly — critical data covered by DB dumps
Parse snapshot output (restic `backup` with `--json` writes JSON lines to stdout; the summary is the line with `message_type: "summary"`):
```json
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
```
**Prune:**
```bash
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```
**Check:**
```bash
restic check
```
**Latest snapshot:**
```bash
restic snapshots --latest 1 --json
```
Returns JSON array with snapshot objects.
**Stats (repo size):**
```bash
restic stats --json
```
### Password auto-generation
On startup, `EnsureInitialized()` checks if the password file exists. If not:
1. Generate 32 random bytes, base64url-encode
2. Write to `r.passwordFile` (the controller-data volume path)
3. Log `[INFO] Generated new restic repository password at <path>`
4. Log `[WARN] Save this password externally — losing it means losing access to ALL backups`
### Gotchas
- **restic is already in the Docker image** (Dockerfile installs it). No additional setup.
- **Locking:** Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error — add `restic unlock` to the error handling path with a log warning.
- **Timeout:** 30-minute timeout for snapshot operations, enforced through a context deadline (`context.WithTimeout`). Treat a deadline-exceeded error as a failed backup.
- **Large repos:** First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
- **restic JSON output:** Use `--json` for machine-parseable output. With `--json`, restic writes its JSON lines to **stdout**, not stderr. For `backup --json`, the summary is the JSON object with `message_type: "summary"`; parse the last JSON line from stdout.
---
## Phase 3C — Backup Orchestrator (`internal/backup/backup.go`)
### Design
```go
type Manager struct {
	cfg        *config.Config
	restic     *ResticManager
	logger     *log.Logger
	pinger     *monitor.Pinger
	mu         sync.Mutex
	lastDBDump *DBDumpStatus
	lastBackup *BackupStatus
}

type DBDumpStatus struct {
	LastRun  time.Time
	Results  []DumpResult
	Success  bool
	Duration time.Duration
}

type BackupStatus struct {
	LastRun   time.Time
	Snapshot  *SnapshotResult
	Success   bool
	Duration  time.Duration
	RepoStats *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)
```
### Full backup flow (daily scheduled)
1. **DB dumps:** `DiscoverDatabases()` → `DumpAll()` → update `lastDBDump` status
2. Ping Healthchecks for DB dump result: `pinger.Ping/Fail(dbDumpUUID, summary)`
3. **Restic snapshot:** `restic.EnsureInitialized()` → `restic.Snapshot(paths, tags)`
4. **Prune (weekly):** Check day of week against `prune_schedule` config. If match → `restic.Prune(retention)` + `restic.Check()`
5. Ping Healthchecks for backup result: `pinger.Ping/Fail(backupUUID, summary)`
6. Update `lastBackup` status
### Scheduler integration
```go
// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)
if cfg.Backup.Enabled {
	sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
		return backupMgr.RunDBDumps(ctx)
	})
	sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
		return backupMgr.RunBackup(ctx)
	})
}
```
### Dashboard display
Add "Biztonsági mentés" (Backup) section to `dashboard.html`:
```
╔══════════════════════════════════════════╗
║ 🛡️ Biztonsági mentés ║
╠══════════════════════════════════════════╣
║ Utolsó mentés: 2026-02-15 03:01 ✅ ║
║ Adatbázisok: 2 mentve (12.3 MB) ║
║ Tároló méret: 45.2 MB (23 pillanatkép) ║
║ Következő: ma 03:00 ║
║ ║
║ [Mentés most] ║
╚══════════════════════════════════════════╝
```
Hungarian labels:
- "Biztonsági mentés" = Backup
- "Utolsó mentés" = Last backup
- "Adatbázisok" = Databases
- "mentve" = backed up
- "Tároló méret" = Repository size
- "pillanatkép" = snapshot(s)
- "Következő" = Next
- "Mentés most" = Backup now
Status colors:
- Green ✅: Last backup successful and less than `backup_max_age_hours` old
- Yellow ⚠️: Last backup successful but older than expected
- Red ❌: Last backup failed or no backups exist yet
- Gray: Backup not configured (`backup.enabled: false`)
If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).
### API endpoints
Add to `api/router.go`:
```
GET /api/backup/status → backup manager status + repo stats
POST /api/backup/run → trigger immediate full backup (async)
```
`POST /api/backup/run` starts the backup in a background goroutine, returns immediately with `{"ok": true, "message": "Mentés elindítva"}`. The dashboard can poll `/api/backup/status` to track progress.
---
## Docker-compose.yml final state
```yaml
services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket: required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys: for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3
volumes:
  controller-data:
networks:
  traefik-public:
    external: true
```
Changes from current:
1. **Added:** `/sys:/host/sys:ro` — for temperature reading
2. **Removed:** dedicated restic-password bind mount (password now in controller-data volume)
---
## Config changes summary
### `controller.yaml.example` updates
```yaml
monitoring:
  system_health_interval: "5m" # NEW field
backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password" # CHANGED default path
```
### `config.go` updates
- Add `SystemHealthInterval string` to `MonitoringConfig`
- Default: `"5m"` in `applyDefaults()`
- Change `restic_password_file` default from `/opt/docker/felhom-controller/restic-password` to `/opt/docker/felhom-controller/data/restic-password`
- Add env override: `FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL`
---
## Implementation Order
### Step 1: Scheduler
1. Create `internal/scheduler/scheduler.go`
2. Implement `Every()` and `Daily()` with logging, panic recovery, skip-if-running
3. Migrate the two existing goroutines from `main.go` to scheduler
4. **Build and verify** — behavior should be identical, logs should show `[SCHED]` entries
### Step 2: CPU & Temperature metrics
1. Create `internal/system/cpu_linux.go` + `cpu_other.go` (build tags)
2. Add `readLoadAvg()` and `readTemperature()` to `info_linux.go`
3. Extend `SystemInfo` struct in `info.go`
4. Update `GetInfo()` signature in all files to accept `*CPUCollector`
5. Start CPUCollector in `main.go`, pass to web server and API router
6. Update `docker-compose.yml` — add `/sys:/host/sys:ro`
7. Update `dashboard.html` — show CPU, load, temperature
8. Update `style.css` if needed for new display elements
9. **Build, deploy, verify** — new metrics visible on dashboard
### Step 3: Healthchecks pinger + health checks
1. Create `internal/monitor/pinger.go`
2. Create `internal/monitor/healthcheck.go`
3. Add `system_health_interval` to config
4. Add system health ping job to scheduler in `main.go`
5. **Build, deploy** — check controller logs for health check runs
### Step 4: Database dump engine
1. Create `internal/backup/dbdump.go`
2. Implement discovery + dump functions
3. Wire up `RunDBDumps` temporarily to a test endpoint or manual scheduler trigger for testing
4. **Build, deploy, verify** — dumps should appear in `/srv/backups/db-dumps/` for paperless-ngx-postgres
### Step 5: Restic integration
1. Create `internal/backup/restic.go`
2. Implement init, snapshot, prune, check, stats
3. Auto-generate restic password if missing
4. Update docker-compose.yml (remove restic-password bind mount)
5. **Build, deploy, verify** — repo initialized, password generated
### Step 6: Backup orchestrator + dashboard
1. Create `internal/backup/backup.go`
2. Wire up scheduler daily jobs (DB dump + backup)
3. Add API endpoints (`/api/backup/status`, `/api/backup/run`)
4. Add backup status section to `dashboard.html`
5. Add "Mentés most" button
6. **Build, deploy, verify full flow**
### Step 7: Documentation & cleanup
1. Update `README.md` — Phase 2 and 3 checked off, new module descriptions
2. Update `CONTEXT.md` with session summary
3. Update `CLAUDE.md` if workflow changes
4. Version bump in build: `v0.4.0`
---
## Verification Checklist
After deployment, verify each item:
- [ ] `docker ps` shows controller healthy
- [ ] Dashboard loads with CPU %, load average, temperature displayed
- [ ] Temperature shows realistic value (30-60°C idle for N100)
- [ ] CPU % updates (not stuck at 0)
- [ ] `/api/system/info` returns all new fields (cpu_percent, load_avg_*, temperature_*)
- [ ] Scheduler logs show `[SCHED]` entries for all registered jobs
- [ ] If HC UUIDs configured: pings visible in status.felhom.eu dashboard
- [ ] DB dump discovers paperless-ngx postgres container
- [ ] Dump file exists: `/srv/backups/db-dumps/paperless-ngx-postgres.sql`
- [ ] Restic repo initialized: `/srv/backups/restic-repo/config` exists
- [ ] Restic password auto-generated: `/opt/docker/felhom-controller/data/restic-password` exists
- [ ] "Mentés most" button triggers backup successfully
- [ ] Dashboard shows backup status section with last backup time
- [ ] All existing features still work (start/stop/deploy/update/logs/auth)
---
## New files to create
```
internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go
```
## Existing files to modify
```
internal/system/info.go — new SystemInfo fields
internal/system/info_linux.go — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go — GetInfo() signature update
internal/config/config.go — SystemHealthInterval, updated defaults
internal/api/router.go — backup endpoints, cpuCollector parameter
internal/web/server.go — accept cpuCollector, backupMgr
internal/web/handlers.go — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css — styles for new elements
cmd/controller/main.go — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml — /sys mount, remove restic-password mount
configs/controller.yaml.example — new fields, updated defaults
```
---
## Manual steps after deployment (for Viktor)
1. **Verify /sys mount:** `docker exec felhom-controller ls /host/sys/class/thermal/` — should show thermal_zone directories
2. **Healthchecks setup:** Create project + 3 checks in status.felhom.eu for demo-felhom:
- `system-health` (period: 10m, grace: 10m)
- `db-dump` (period: 24h, grace: 1h)
- `backup` (period: 24h, grace: 1h)
3. **Update controller.yaml:** Add the three ping UUIDs
4. **Verify restic password:** `docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password`
5. **Test restore procedure:**
```bash
docker exec felhom-controller restic -r /srv/backups/restic-repo \
--password-file /opt/docker/felhom-controller/data/restic-password snapshots
```
6. **Save restic password externally** — losing it means losing access to all backups