# TASK.md — Phase 2: Monitoring & Health + Phase 3: Backups

> Version bump target: **v0.4.0**
> Priority: Phase 2 first (scheduler + metrics are prerequisites for Phase 3)

---

## Overview

Implement two major features in felhom-controller:

1. **Phase 2** — Scheduler, CPU/temperature metrics, Healthchecks.io ping integration
2. **Phase 3** — Database dump engine, restic backup snapshots, dashboard status display

Both phases share the scheduler infrastructure. Implement in order.

---

## Phase 2A — Scheduler (`internal/scheduler/`)

### Why first

main.go currently has two ad-hoc goroutines (status refresh every 30s, stack scan every 2min).
Phase 2 adds system health pings. Phase 3 adds daily DB dumps and backups.
All need a centralized, logged, observable job runner. Build it once, use everywhere.

### Design: `internal/scheduler/scheduler.go`

```go
package scheduler

import (
	"context"
	"log"
	"sync"
	"time"
)

// JobFunc is the unit of work; a non-nil error marks the run as failed.
type JobFunc func(ctx context.Context) error

type Job struct {
	Name     string
	Fn       JobFunc
	Interval time.Duration // for periodic jobs (every N)
	Schedule string        // for daily jobs ("02:30", "03:00") — mutually exclusive with Interval
	LastRun  time.Time
	LastErr  error
	Running  bool
}

type Scheduler struct {
	mu     sync.Mutex // protects LastRun, LastErr and Running on every job
	jobs   []*Job
	logger *log.Logger
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func New(logger *log.Logger) *Scheduler
func (s *Scheduler) Every(name string, interval time.Duration, fn JobFunc)
func (s *Scheduler) Daily(name string, timeStr string, fn JobFunc) // "02:30" format, Europe/Budapest timezone
func (s *Scheduler) Start(ctx context.Context)
func (s *Scheduler) Stop()
func (s *Scheduler) GetJobs() []Job // for dashboard/API display (copy, not pointer)
```

### Interval jobs (`Every`)

- Spawns a goroutine with `time.Ticker`
- Logs `[SCHED] Running job: <n>` at start, `[SCHED] Job <n> completed (took Xs)` or `[SCHED] Job <n> failed: <err> (took Xs)` at end
- Updates `LastRun`, `LastErr`, `Running` fields (mutex-protected)
- Respects `ctx.Done()` for shutdown
- **Quiet mode for high-frequency jobs:** Jobs that run every 30 seconds or less should only log at debug level on success (avoid log spam). Failures always log at WARN/ERROR level.

### Daily jobs (`Daily`)

- Parses `timeStr` as "HH:MM" in `Europe/Budapest` timezone
- On start, calculates duration until next occurrence (today if not yet passed, tomorrow if passed)
- After each run, sleeps until the next day's scheduled time
- Uses `time.After()` or `time.Timer`, NOT `time.Ticker`: recomputing the next wall-clock occurrence after every run keeps the job on schedule across DST transitions, whereas a fixed 24h ticker would drift by an hour
- Same logging pattern as interval jobs
- Logs `[SCHED] Daily job <n> scheduled for <next_time>` on registration

### Edge cases

- If `Daily` timeStr is invalid → log error at registration, don't start the job
- If a job panics → recover, log `[ERROR] Job <n> panicked: <err>`, mark as failed
- If a job is already running when the next tick fires → skip, log `[WARN] Job <n> still running, skipping`
- Graceful shutdown: `Stop()` cancels context, `wg.Wait()` with 30s timeout for running jobs to finish

### Integration in main.go

Replace the two ad-hoc goroutines with:

```go
sched := scheduler.New(logger)

// Existing periodic tasks (move from ad-hoc goroutines)
sched.Every("status-refresh", 30*time.Second, func(ctx context.Context) error {
	return stackMgr.RefreshStatus()
})
sched.Every("stack-scan", 2*time.Minute, func(ctx context.Context) error {
	return stackMgr.ScanStacks()
})

// Phase 2: System health ping (added below)
// Phase 3: DB dump, backup (added below)

sched.Start(ctx)
defer sched.Stop()
```

Delete the two existing goroutines in main.go after migrating to the scheduler.

---

## Phase 2B — CPU & Temperature Metrics (`internal/system/`)

### Current state

`info.go` defines `SystemInfo` struct with memory + disk fields.
`info_linux.go` reads `/proc/meminfo` and `syscall.Statfs`.
`info_other.go` provides stubs for non-Linux.

### New fields in `SystemInfo`

```go
// Add to SystemInfo struct in info.go:
CPUPercent         float64 `json:"cpu_percent"`         // 0-100, averaged across all cores
LoadAvg1           float64 `json:"load_avg_1"`          // 1-minute load average
LoadAvg5           float64 `json:"load_avg_5"`          // 5-minute load average
LoadAvg15          float64 `json:"load_avg_15"`         // 15-minute load average
TemperatureCelsius float64 `json:"temperature_celsius"` // CPU/SoC temperature
TemperatureSource  string  `json:"temperature_source"`  // e.g. "thermal_zone0", "x86_pkg_temp"
```

### CPU measurement approach

**Do NOT block `GetInfo()` with a delta calculation.**

Use a lightweight `CPUCollector` that runs in a background goroutine:

```go
// internal/system/cpu_linux.go (build tag: linux)

type CPUCollector struct {
	mu         sync.RWMutex
	cpuPercent float64
	sampleRate time.Duration // default: 5 seconds
	cancel     context.CancelFunc
}

func NewCPUCollector(sampleRate time.Duration) *CPUCollector
func (c *CPUCollector) Start(ctx context.Context)
func (c *CPUCollector) Stop()
func (c *CPUCollector) CPUPercent() float64 // returns latest sample
```

How it works:
1. Reads `/proc/stat` first line: `cpu <user> <nice> <system> <idle> <iowait> <irq> <softirq> <steal>`
2. Sleeps `sampleRate` (5s)
3. Reads again, computes delta: `busy = delta(user+nice+system+irq+softirq+steal)`, `total = busy + delta(idle+iowait)`
4. `cpuPercent = (busy / total) * 100` (if `total` is 0, keep the previous value)
5. Stores result, loops

Parsing `/proc/stat`:
```
cpu 1234 56 789 45678 123 45 67 0 0 0
```
Split by whitespace. Fields after "cpu" are: user(1) nice(2) system(3) idle(4) iowait(5) irq(6) softirq(7) steal(8).
Sum fields 1-8 = total (skip the trailing `guest` and `guest_nice` columns, which are already counted in `user` and `nice`). idle + iowait = idle_total. busy = total - idle_total.

**IMPORTANT: Inside a Docker container, `/proc/stat` reflects the HOST CPU.** The file is not namespaced, so this holds even when cgroup CPU limits are applied (only an LXCFS-style overlay would change it). So the controller's own `/proc/stat` works.

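As a rough illustration of the delta calculation, the two pure functions below parse the aggregate `cpu` line and turn two samples into a percentage. The type and function names are assumptions made for this sketch, and reading the file plus the 5-second loop are left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative jiffies from the aggregate "cpu" line of /proc/stat.
type cpuSample struct {
	busy, total uint64
}

// parseCPULine parses a line such as "cpu 1234 56 789 45678 123 45 67 0 0 0".
// Only the first eight counters are summed; guest time is already part of user/nice.
func parseCPULine(line string) (cpuSample, error) {
	f := strings.Fields(line)
	if len(f) < 9 || f[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var s cpuSample
	for i := 1; i <= 8; i++ {
		v, err := strconv.ParseUint(f[i], 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i, err)
		}
		s.total += v
		if i != 4 && i != 5 { // 4 = idle, 5 = iowait
			s.busy += v
		}
	}
	return s, nil
}

// cpuPercent returns busy time as a percentage of elapsed time between two samples.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.busy < prev.busy {
		return 0 // no ticks elapsed or counters went backwards
	}
	return float64(cur.busy-prev.busy) / float64(cur.total-prev.total) * 100
}
```

Keeping the parsing separate from file access makes it easy to unit-test on non-Linux development machines. This sketch returns 0 when no ticks elapsed; the collector may prefer to keep its previous value instead.
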
### Load average

Read from `/proc/loadavg` (instant, no delta needed):
```
0.15 0.10 0.05 1/234 56789
```
First three fields are 1/5/15 minute load averages. Parse with `fmt.Sscanf`.

Add `readLoadAvg(info *SystemInfo)` in `info_linux.go`.

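A minimal sketch of the `fmt.Sscanf` parsing step, assuming the file content has already been read into a string. `parseLoadAvg` is a hypothetical helper that `readLoadAvg` could call.

```go
package main

import "fmt"

// parseLoadAvg extracts the 1/5/15-minute averages from /proc/loadavg content.
// Sscanf stops once the format is satisfied, so the trailing fields are ignored.
func parseLoadAvg(content string) (l1, l5, l15 float64, err error) {
	if _, err = fmt.Sscanf(content, "%f %f %f", &l1, &l5, &l15); err != nil {
		return 0, 0, 0, fmt.Errorf("parse loadavg: %w", err)
	}
	return l1, l5, l15, nil
}
```
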
### Temperature

Read from `/sys/class/thermal/thermal_zone*/temp`:

**IMPORTANT**: The controller runs in a Docker container. `/sys` is NOT available by default. We mount the host's `/sys` at `/host/sys` inside the container (see docker-compose.yml changes below).

```go
// internal/system/info_linux.go — add readTemperature(info *SystemInfo)
```

Algorithm:
1. Try `/host/sys/class/thermal/thermal_zone*/temp` first (Docker mount)
2. Fallback to `/sys/class/thermal/thermal_zone*/temp` (native/development)
3. For each zone, also read the `type` file for the label
4. Pick the highest temperature (usually `thermal_zone0` or `x86_pkg_temp`)
5. Value is in millidegrees Celsius → divide by 1000.0
6. Store the zone type as `TemperatureSource`
7. If no thermal zones found: try `/host/sys/class/hwmon/hwmon*/temp1_input` as fallback (same millidegree format)
8. If nothing found: leave fields as zero (dashboard hides temperature when 0)

### `GetInfo()` signature change

```go
// Current:
func GetInfo(hddPath string) SystemInfo
// New:
func GetInfo(hddPath string, cpuCollector *CPUCollector) SystemInfo
```

Inside `GetInfo()`:
1. Existing: `readMemInfo(&info)`, `readDiskUsage(...)` — unchanged
2. New: `readLoadAvg(&info)`
3. New: `readTemperature(&info)`
4. New: `if cpuCollector != nil { info.CPUPercent = cpuCollector.CPUPercent() }`

The `info_other.go` stub accepts the parameter but ignores it (returns empty SystemInfo as before).

### CPU collector lifecycle

Started in `main.go`:

```go
cpuCollector := system.NewCPUCollector(5 * time.Second)
cpuCollector.Start(ctx)
defer cpuCollector.Stop()
```

Passed to `web.NewServer()` and `api.NewRouter()` which pass it to `system.GetInfo()` calls.

### Dashboard display

Extend the existing system info bar in `dashboard.html`:

Current layout:
```
| Memória | ████████░░ 72% | SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```

New layout:
```
| Memória | ████████░░ 72% | CPU | ██░░░░░░░░ 15% | Hőmérséklet | 52°C |
| SSD | ██████░░░░ 55% | HDD | ████░░░░░░ 38% |
```

Or, if horizontal space is tight, keep the two-row layout from the current dashboard and add CPU + temperature to the same row structure. Use the same progress bar component.

Temperature display:
- Show as text "52°C" with colored dot (green/yellow/red)
- Green: < 60°C
- Yellow: 60-75°C
- Red: > 75°C
- If temperature is 0 (unavailable): hide entirely

CPU progress bar:
- Same color scheme as memory/disk: green < 70%, yellow 70-85%, red > 85%

Load average: Show as small text below CPU bar: "Load: 0.3 / 0.2 / 0.1"

---

## Phase 2C — Healthchecks.io Ping Integration (`internal/monitor/`)

### Design: `internal/monitor/pinger.go`

```go
package monitor

type Pinger struct {
	baseURL    string // e.g. "https://status.felhom.eu"
	httpClient *http.Client
	logger     *log.Logger
	enabled    bool
}

func NewPinger(cfg *config.MonitoringConfig, logger *log.Logger) *Pinger

// Ping sends a success signal with optional diagnostic body
func (p *Pinger) Ping(uuid string, body string) error

// Fail sends a failure signal with diagnostic body
func (p *Pinger) Fail(uuid string, body string) error

// Start sends a "job started" signal (for duration tracking)
func (p *Pinger) Start(uuid string) error
```

### HTTP protocol

- Success: `POST {baseURL}/ping/{uuid}` with body as request body
- Failure: `POST {baseURL}/ping/{uuid}/fail` with body
- Start: `POST {baseURL}/ping/{uuid}/start`
- Timeout: 10 seconds
- Retry: 3 attempts with 2s backoff between retries
- If `uuid` is empty or starts with "CHANGEME" → skip silently (log at debug level only)
- If `enabled` is false → skip all pings
- **Never let ping failures affect the main operation** — log a warning on HTTP error, but always return nil from the calling job. Ping errors must not break backup/health flows.

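The URL rules and the retry loop above might be sketched as two free functions. Both names are hypothetical; the real `Pinger` methods would wrap them and add logging and the `enabled` check.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// pingURL builds the endpoint for a signal ("" = success, "fail", "start").
// ok is false when the UUID is empty or still a CHANGEME placeholder.
func pingURL(baseURL, uuid, signal string) (url string, ok bool) {
	if uuid == "" || strings.HasPrefix(uuid, "CHANGEME") {
		return "", false
	}
	url = strings.TrimRight(baseURL, "/") + "/ping/" + uuid
	if signal != "" {
		url += "/" + signal
	}
	return url, true
}

// sendPing POSTs body to url, trying up to 3 times with a 2s pause between attempts.
func sendPing(ctx context.Context, client *http.Client, url, body string) error {
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader(body))
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < 300 {
				return nil
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
		if attempt < 3 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(2 * time.Second):
			}
		}
	}
	return fmt.Errorf("ping failed after 3 attempts: %w", lastErr)
}
```

The 10-second timeout would come from the client, for example `&http.Client{Timeout: 10 * time.Second}`. Callers log the returned error as a warning and carry on.
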
### Design: `internal/monitor/healthcheck.go`

```go
type HealthReport struct {
	Status    string   // "ok", "warn", "fail"
	Issues    []string // critical problems
	Warnings  []string // non-critical warnings
	Info      []string // informational items
	Timestamp time.Time
}

// RunHealthCheck runs system checks and returns a diagnostic report.
func RunHealthCheck(cfg *config.Config, cpuCollector *system.CPUCollector) *HealthReport
func (r *HealthReport) FormatMessage() string // human-readable summary for HC ping body
```

Checks to run (replicating backup-healthcheck.sh logic in Go):
1. **Disk usage**: Read from `system.GetInfo()`. Compare against thresholds (`disk_warn_percent`, `disk_crit_percent`).
2. **Memory usage**: Same source. Warn if above `memory_warn_percent`.
3. **CPU usage**: From collector. Warn if above `cpu_warn_percent`.
4. **Temperature**: From `system.GetInfo()`. Warn if above `temperature_warn_celsius`.
5. **Docker health**: Verify Docker daemon is reachable by running `docker info` (quick exec check).
6. **Protected containers**: Verify protected stacks are running (traefik, cloudflared, felhom-controller) by checking container state.

Any issue → Status = "fail". Only warnings → Status = "warn". All clear → Status = "ok".

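The status rule is small enough to show in full. The sketch below also includes an assumed helper for the two-threshold disk check; both function names are illustrative and not taken from the codebase.

```go
package main

import "fmt"

// deriveStatus maps collected findings to the report status:
// any issue wins over warnings, warnings win over "ok".
func deriveStatus(issues, warnings []string) string {
	switch {
	case len(issues) > 0:
		return "fail"
	case len(warnings) > 0:
		return "warn"
	default:
		return "ok"
	}
}

// checkPercent appends a finding for value to issues or warnings depending on
// which threshold it exceeds, and returns the updated slices.
func checkPercent(label string, value, warnAt, critAt float64, issues, warnings []string) ([]string, []string) {
	switch {
	case value >= critAt:
		issues = append(issues, fmt.Sprintf("%s at %.0f%% (critical >= %.0f%%)", label, value, critAt))
	case value >= warnAt:
		warnings = append(warnings, fmt.Sprintf("%s at %.0f%% (warning >= %.0f%%)", label, value, warnAt))
	}
	return issues, warnings
}
```
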
### Scheduler integration

```go
// In main.go:
pinger := monitor.NewPinger(&cfg.Monitoring, logger)
healthUUID := cfg.Monitoring.PingUUIDs.SystemHealth

// Parse system_health_interval (default "5m"). Fall back on a bad value,
// because a non-positive interval would make time.NewTicker panic.
healthInterval, err := time.ParseDuration(cfg.Monitoring.SystemHealthInterval)
if err != nil || healthInterval <= 0 {
	logger.Printf("[WARN] invalid system_health_interval %q, using 5m", cfg.Monitoring.SystemHealthInterval)
	healthInterval = 5 * time.Minute
}

sched.Every("system-health", healthInterval, func(ctx context.Context) error {
	report := monitor.RunHealthCheck(cfg, cpuCollector)
	body := report.FormatMessage()

	// Ping errors are ignored on purpose; the pinger logs them itself.
	if report.Status == "fail" {
		_ = pinger.Fail(healthUUID, body)
	} else {
		_ = pinger.Ping(healthUUID, body)
	}
	return nil // never fail the scheduler job due to ping errors
})
```

### Config changes

Add to `MonitoringConfig`:
```go
SystemHealthInterval string `yaml:"system_health_interval"`
```

Default in `applyDefaults()`: `"5m"`

---

## Phase 3A — Database Dump Engine (`internal/backup/dbdump.go`)

### Approach: Auto-discover from running Docker containers

Replicates the proven logic from `backup-db-dump.sh` in Go:

```go
package backup

type DBType string
const (
	DBTypePostgres DBType = "postgres"
	DBTypeMariaDB  DBType = "mariadb"
)

type DiscoveredDB struct {
	ContainerName string
	ContainerID   string
	DBType        DBType
	DBUser        string
	DBName        string
	StackName     string // derived from container name
}

type DumpResult struct {
	DB       DiscoveredDB
	FilePath string
	Size     int64
	Duration time.Duration
	Error    error
}

func DiscoverDatabases(ctx context.Context, logger *log.Logger) ([]DiscoveredDB, error)
func DumpAll(ctx context.Context, dbs []DiscoveredDB, dumpDir string, logger *log.Logger) []DumpResult
func DumpOne(ctx context.Context, db DiscoveredDB, dumpDir string, logger *log.Logger) DumpResult
```

### Discovery logic

Run `docker ps --format '{{.ID}}\t{{.Names}}\t{{.Image}}' --filter status=running`.

For each running container, check image name:
- Contains `postgres` → DBTypePostgres
- Contains `mariadb` or `mysql` → DBTypeMariaDB

Then for each DB container, get env vars via:
`docker inspect <id> --format '{{range .Config.Env}}{{println .}}{{end}}'`

Parse env vars:
- **PostgreSQL**: `POSTGRES_USER` (default: "postgres"), `POSTGRES_DB` (default: same as POSTGRES_USER)
- **MariaDB**: `MYSQL_ROOT_PASSWORD` (or `MARIADB_ROOT_PASSWORD`), `MYSQL_DATABASE` (or `MARIADB_DATABASE`)

Derive stack name from container name by stripping common DB suffixes:
- `paperless-ngx-postgres` → `paperless-ngx`
- `romm-db` → `romm`
- `immich-postgres` → `immich`
- Logic: split on `-`, check if last segment is a known suffix (`postgres`, `db`, `mariadb`, `mysql`, `database`, `redis`, `cache`), if so remove it

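The suffix rule and the PostgreSQL defaults could be expressed as the following pure functions. `stackNameFromContainer` and `postgresCreds` are illustrative names; the `env` slice is assumed to hold the `KEY=value` lines printed by `docker inspect`.

```go
package main

import "strings"

// dbSuffixes lists the trailing name segments that mark a database sidecar.
var dbSuffixes = map[string]bool{
	"postgres": true, "db": true, "mariadb": true, "mysql": true,
	"database": true, "redis": true, "cache": true,
}

// stackNameFromContainer strips one known DB suffix from the container name.
func stackNameFromContainer(name string) string {
	name = strings.TrimPrefix(name, "/") // harmless if the name has no leading slash
	i := strings.LastIndex(name, "-")
	if i <= 0 {
		return name
	}
	if dbSuffixes[strings.ToLower(name[i+1:])] {
		return name[:i]
	}
	return name
}

// postgresCreds applies the defaults: user "postgres", database = user.
func postgresCreds(env []string) (user, db string) {
	user = "postgres"
	for _, kv := range env {
		k, v, _ := strings.Cut(kv, "=")
		switch k {
		case "POSTGRES_USER":
			if v != "" {
				user = v
			}
		case "POSTGRES_DB":
			db = v
		}
	}
	if db == "" {
		db = user
	}
	return user, db
}
```
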
### Dump execution

**PostgreSQL:**
```bash
docker exec <container> pg_dump -U <user> -d <db> --clean --if-exists --no-owner --no-privileges
```

**MariaDB:**
```bash
docker exec <container> mariadb-dump -u root -p<password> --single-transaction --routines --triggers <db>
```

**IMPORTANT: Use `docker exec` to run dump commands INSIDE the DB container.** Do NOT use pg_dump/mysqldump from the controller container — version mismatches between the controller's client and the DB server will cause failures.

Output handling:
- Use `exec.CommandContext(ctx, "docker", "exec", ...)` from `os/exec` with `cmd.Stdout` piped to a temp file, so the timeout below actually kills a hung dump
- Write to `{dumpDir}/{stackName}-{dbtype}.sql.tmp` during dump
- Rename `.tmp` → `.sql` on success only
- Delete `.tmp` on failure
- Set 5-minute timeout per dump via `context.WithTimeout`

### Gotchas and edge cases

- **MariaDB password from container env:** Never log the password. Use `docker inspect` to read `MYSQL_ROOT_PASSWORD` or `MARIADB_ROOT_PASSWORD`.
- **Empty/zero-size dumps:** Check dump file size after writing. If 0 bytes → treat as failure.
- **Dump file naming:** `{stackName}-{dbtype}.sql` (e.g., `paperless-ngx-postgres.sql`). Overwrite previous dump each run (restic handles versioning).
- **Old tmp cleanup:** Delete `.tmp` files older than 1 hour on each run (leftover from crashed dumps).
- **Skip infrastructure DBs:** Don't dump databases from protected stacks (if any have DBs in the future).
- **Container not running:** If a DB container was discovered but is no longer running by dump time → skip with warning (container may have been stopped between discovery and dump).

### Dump directory

`/srv/backups/db-dumps/` — configured in `controller.yaml` as `paths.db_dump_dir`.
Already mounted in docker-compose.yml via `/srv/backups:/srv/backups`.

The user does NOT see this directory (not in FileBrowser, not on HDD).

---

## Phase 3B — Restic Integration (`internal/backup/restic.go`)

### Design

```go
type ResticManager struct {
	repoPath     string
	passwordFile string
	logger       *log.Logger
	customerID   string
	cacheDir     string
}

func NewResticManager(cfg *config.Config, logger *log.Logger) *ResticManager

func (r *ResticManager) EnsureInitialized() error
func (r *ResticManager) Snapshot(paths []string, tags []string) (*SnapshotResult, error)
func (r *ResticManager) Prune(retention config.RetentionConfig) error
func (r *ResticManager) Check() error
func (r *ResticManager) LatestSnapshot() (*SnapshotInfo, error)
func (r *ResticManager) Stats() (*RepoStats, error)

type SnapshotResult struct {
	SnapshotID   string
	FilesNew     int
	FilesChanged int
	DataAdded    string // human-readable
	Duration     time.Duration
}

type SnapshotInfo struct {
	ID    string
	Time  time.Time
	Paths []string
	Tags  []string
}

type RepoStats struct {
	TotalSize      string
	SnapshotCount  int
	LatestSnapshot *SnapshotInfo
}
```

### Restic commands (all via `os/exec`)

All commands set these env vars:
```go
cmd.Env = append(os.Environ(),
	"RESTIC_REPOSITORY="+r.repoPath,
	"RESTIC_PASSWORD_FILE="+r.passwordFile,
	"RESTIC_CACHE_DIR="+r.cacheDir,
)
```

**`RESTIC_CACHE_DIR`** must be set to `/opt/docker/felhom-controller/data/restic-cache` (inside the controller-data Docker volume). Without this, restic defaults to `~/.cache/restic` which may not persist across container restarts.

**Init** (idempotent):
- Check if `{repoPath}/config` file exists → if so, already initialized, skip
- Otherwise: `restic init`

**Snapshot:**
```bash
restic backup /opt/docker/stacks /srv/backups/db-dumps /opt/docker/felhom-controller/controller.yaml \
  --tag felhom --tag <customerID> --host <customerID>
```

What gets backed up (v1):
- `/opt/docker/stacks/` — compose files, .felhom.yml, app.yaml (deploy configs with secrets)
- `/srv/backups/db-dumps/` — SQL dumps (from the DB dump step)
- `/opt/docker/felhom-controller/controller.yaml` — controller config

**NOT backed up in v1:**
- HDD app data (Immich photos, Paperless documents) — too large, needs separate strategy
- Docker volumes directly — critical data covered by DB dumps

Parse snapshot output (restic `backup` with `--json` writes JSON lines to stdout; the summary is the line with `message_type: "summary"`):
```json
{"message_type":"summary","files_new":5,"files_changed":2,"data_added":12345678,...,"snapshot_id":"abc123"}
```

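Extracting the summary from the captured stdout might look like this. The struct covers only the fields shown in the example line, and `parseBackupSummary` is a hypothetical helper that `Snapshot()` could call on the command output.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"errors"
)

// backupSummary mirrors the fields of the "summary" message shown above.
type backupSummary struct {
	MessageType  string `json:"message_type"`
	FilesNew     int    `json:"files_new"`
	FilesChanged int    `json:"files_changed"`
	DataAdded    int64  `json:"data_added"`
	SnapshotID   string `json:"snapshot_id"`
}

// parseBackupSummary scans JSON lines and returns the last "summary" message.
// Lines that are not valid JSON, or that carry another message_type, are skipped.
func parseBackupSummary(stdout []byte) (*backupSummary, error) {
	sc := bufio.NewScanner(bytes.NewReader(stdout))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long status lines
	var found *backupSummary
	for sc.Scan() {
		var msg backupSummary
		if err := json.Unmarshal(sc.Bytes(), &msg); err != nil {
			continue
		}
		if msg.MessageType == "summary" {
			m := msg
			found = &m
		}
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if found == nil {
		return nil, errors.New("no summary message in restic output")
	}
	return found, nil
}
```

`DataAdded` is a byte count here; converting it to the human-readable string in `SnapshotResult` is left to the caller.
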
**Prune:**
```bash
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```

**Check:**
```bash
restic check
```

**Latest snapshot:**
```bash
restic snapshots --latest 1 --json
```
Returns JSON array with snapshot objects.

**Stats (repo size):**
```bash
restic stats --json
```

### Password auto-generation

On startup, `EnsureInitialized()` checks if the password file exists. If not:
1. Generate 32 random bytes, base64url-encode
2. Write to `r.passwordFile` (the controller-data volume path)
3. Log `[INFO] Generated new restic repository password at <path>`
4. Log `[WARN] Save this password externally — losing it means losing access to ALL backups`

### Gotchas

- **restic is already in the Docker image** (Dockerfile installs it). No additional setup.
- **Locking:** Restic handles repo locking internally. The scheduler's "skip if running" prevents concurrent operations. If a stale lock exists (controller crashed mid-backup), restic will error — add `restic unlock` to the error handling path with a log warning.
- **Timeout:** 30-minute timeout for snapshot operations. Enforce it with a context deadline.
- **Large repos:** First snapshot may be large (all stack configs + dumps). Subsequent snapshots are incremental (restic deduplicates).
- **restic JSON output:** Use `--json` for machine-parseable output. restic writes the JSON messages to **stdout**. For `backup --json`, the summary object with `message_type: "summary"` is the last JSON line on stdout, so parse stdout line by line and keep that object.

---

## Phase 3C — Backup Orchestrator (`internal/backup/backup.go`)

### Design

```go
type Manager struct {
	cfg    *config.Config
	restic *ResticManager
	logger *log.Logger
	pinger *monitor.Pinger

	mu         sync.Mutex
	lastDBDump *DBDumpStatus
	lastBackup *BackupStatus
}

type DBDumpStatus struct {
	LastRun  time.Time
	Results  []DumpResult
	Success  bool
	Duration time.Duration
}

type BackupStatus struct {
	LastRun   time.Time
	Snapshot  *SnapshotResult
	Success   bool
	Duration  time.Duration
	RepoStats *RepoStats
}

func NewManager(cfg *config.Config, pinger *monitor.Pinger, logger *log.Logger) *Manager
func (m *Manager) RunDBDumps(ctx context.Context) error
func (m *Manager) RunBackup(ctx context.Context) error
func (m *Manager) RunFullBackup(ctx context.Context) error // dumps + snapshot + optional prune
func (m *Manager) GetStatus() (*DBDumpStatus, *BackupStatus)
func (m *Manager) GetRepoStats() (*RepoStats, error)
```

### Full backup flow (daily scheduled)

1. **DB dumps:** `DiscoverDatabases()` → `DumpAll()` → update `lastDBDump` status
2. Ping Healthchecks for DB dump result: `pinger.Ping/Fail(dbDumpUUID, summary)`
3. **Restic snapshot:** `restic.EnsureInitialized()` → `restic.Snapshot(paths, tags)`
4. **Prune (weekly):** Check day of week against `prune_schedule` config. If match → `restic.Prune(retention)` + `restic.Check()`
5. Ping Healthchecks for backup result: `pinger.Ping/Fail(backupUUID, summary)`
6. Update `lastBackup` status

### Scheduler integration

```go
// In main.go:
backupMgr := backup.NewManager(cfg, pinger, logger)

if cfg.Backup.Enabled {
	sched.Daily("db-dump", cfg.Backup.DBDumpSchedule, func(ctx context.Context) error {
		return backupMgr.RunDBDumps(ctx)
	})

	sched.Daily("backup", cfg.Backup.ResticSchedule, func(ctx context.Context) error {
		return backupMgr.RunBackup(ctx)
	})
}
```

### Dashboard display

Add "Biztonsági mentés" (Backup) section to `dashboard.html`:

```
╔══════════════════════════════════════════╗
║  🛡️ Biztonsági mentés                    ║
╠══════════════════════════════════════════╣
║  Utolsó mentés:  2026-02-15 03:01 ✅     ║
║  Adatbázisok:    2 mentve (12.3 MB)      ║
║  Tároló méret:   45.2 MB (23 pillanatkép)║
║  Következő:      ma 03:00                ║
║                                          ║
║  [Mentés most]                           ║
╚══════════════════════════════════════════╝
```

Hungarian labels:
- "Biztonsági mentés" = Backup
- "Utolsó mentés" = Last backup
- "Adatbázisok" = Databases
- "mentve" = backed up
- "Tároló méret" = Repository size
- "pillanatkép" = snapshot(s)
- "Következő" = Next
- "Mentés most" = Backup now

Status colors:
- Green ✅: Last backup successful and less than `backup_max_age_hours` old
- Yellow ⚠️: Last backup successful but older than expected
- Red ❌: Last backup failed or no backups exist yet
- Gray: Backup not configured (`backup.enabled: false`)

If backup is disabled in config → show "Biztonsági mentés nincs beállítva" (Backup not configured).

### API endpoints

Add to `api/router.go`:

```
GET  /api/backup/status → backup manager status + repo stats
POST /api/backup/run    → trigger immediate full backup (async)
```

`POST /api/backup/run` starts the backup in a background goroutine, returns immediately with `{"ok": true, "message": "Mentés elindítva"}`. The dashboard can poll `/api/backup/status` to track progress.

---

## Docker-compose.yml final state

```yaml
services:
  felhom-controller:
    image: gitea.dooplex.hu/admin/felhom-controller:latest
    container_name: felhom-controller
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      # Docker socket — required for compose operations + DB dumps (docker exec)
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Controller config
      - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
      # Controller persistent data (sessions, restic cache, restic password)
      - controller-data:/opt/docker/felhom-controller/data
      # Stack compose files (read + write for git sync)
      - /opt/docker/stacks:/opt/docker/stacks
      # Backup directories (restic repo + db dumps)
      - /srv/backups:/srv/backups
      # HDD mount (if available, for monitoring disk usage)
      - ${HDD_PATH:-/mnt/hdd_placeholder}:${HDD_PATH:-/mnt/hdd_placeholder}:ro
      # Host /sys — for CPU temperature reading (read-only)
      - /sys:/host/sys:ro
    environment:
      - TZ=Europe/Budapest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.controller.rule=Host(`felhom.${DOMAIN}`)"
      - "traefik.http.routers.controller.entrypoints=websecure"
      - "traefik.http.routers.controller.tls=true"
      - "traefik.http.services.controller.loadbalancer.server.port=8080"
      - "traefik.docker.network=traefik-public"
      - "felhom.managed=true"
      - "felhom.component=controller"
    networks:
      - traefik-public
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3

volumes:
  controller-data:

networks:
  traefik-public:
    external: true
```

Changes from current:
1. **Added:** `/sys:/host/sys:ro` — for temperature reading
2. **Removed:** dedicated restic-password bind mount (password now in controller-data volume)

---

## Config changes summary

### `controller.yaml.example` updates

```yaml
monitoring:
  system_health_interval: "5m" # NEW field

backup:
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password" # CHANGED default path
```

### `config.go` updates

- Add `SystemHealthInterval string` to `MonitoringConfig`
- Default: `"5m"` in `applyDefaults()`
- Change `restic_password_file` default from `/opt/docker/felhom-controller/restic-password` to `/opt/docker/felhom-controller/data/restic-password`
- Add env override: `FELHOM_MONITORING_SYSTEM_HEALTH_INTERVAL`

---

## Implementation Order

### Step 1: Scheduler
1. Create `internal/scheduler/scheduler.go`
2. Implement `Every()` and `Daily()` with logging, panic recovery, skip-if-running
3. Migrate the two existing goroutines from `main.go` to scheduler
4. **Build and verify** — behavior should be identical, logs should show `[SCHED]` entries

### Step 2: CPU & Temperature metrics
1. Create `internal/system/cpu_linux.go` + `cpu_other.go` (build tags)
2. Add `readLoadAvg()` and `readTemperature()` to `info_linux.go`
3. Extend `SystemInfo` struct in `info.go`
4. Update `GetInfo()` signature in all files to accept `*CPUCollector`
5. Start CPUCollector in `main.go`, pass to web server and API router
6. Update `docker-compose.yml` — add `/sys:/host/sys:ro`
7. Update `dashboard.html` — show CPU, load, temperature
8. Update `style.css` if needed for new display elements
9. **Build, deploy, verify** — new metrics visible on dashboard

### Step 3: Healthchecks pinger + health checks
1. Create `internal/monitor/pinger.go`
2. Create `internal/monitor/healthcheck.go`
3. Add `system_health_interval` to config
4. Add system health ping job to scheduler in `main.go`
5. **Build, deploy** — check controller logs for health check runs

### Step 4: Database dump engine
1. Create `internal/backup/dbdump.go`
2. Implement discovery + dump functions
3. Wire up `RunDBDumps` temporarily to a test endpoint or manual scheduler trigger for testing
4. **Build, deploy, verify** — dumps should appear in `/srv/backups/db-dumps/` for paperless-ngx-postgres

### Step 5: Restic integration
1. Create `internal/backup/restic.go`
2. Implement init, snapshot, prune, check, stats
3. Auto-generate restic password if missing
4. Update docker-compose.yml (remove restic-password bind mount)
5. **Build, deploy, verify** — repo initialized, password generated

### Step 6: Backup orchestrator + dashboard
1. Create `internal/backup/backup.go`
2. Wire up scheduler daily jobs (DB dump + backup)
3. Add API endpoints (`/api/backup/status`, `/api/backup/run`)
4. Add backup status section to `dashboard.html`
5. Add "Mentés most" button
6. **Build, deploy, verify full flow**

### Step 7: Documentation & cleanup
1. Update `README.md` — Phase 2 and 3 checked off, new module descriptions
2. Update `CONTEXT.md` with session summary
3. Update `CLAUDE.md` if workflow changes
4. Version bump in build: `v0.4.0`

---

## Verification Checklist

After deployment, verify each item:

- [ ] `docker ps` shows controller healthy
- [ ] Dashboard loads with CPU %, load average, temperature displayed
- [ ] Temperature shows realistic value (30-60°C idle for N100)
- [ ] CPU % updates (not stuck at 0)
- [ ] `/api/system/info` returns all new fields (cpu_percent, load_avg_*, temperature_*)
- [ ] Scheduler logs show `[SCHED]` entries for all registered jobs
- [ ] If HC UUIDs configured: pings visible in status.felhom.eu dashboard
- [ ] DB dump discovers paperless-ngx postgres container
- [ ] Dump file exists: `/srv/backups/db-dumps/paperless-ngx-postgres.sql`
- [ ] Restic repo initialized: `/srv/backups/restic-repo/config` exists
- [ ] Restic password auto-generated: `/opt/docker/felhom-controller/data/restic-password` exists
- [ ] "Mentés most" button triggers backup successfully
- [ ] Dashboard shows backup status section with last backup time
- [ ] All existing features still work (start/stop/deploy/update/logs/auth)

---

## New files to create

```
internal/scheduler/scheduler.go
internal/monitor/pinger.go
internal/monitor/healthcheck.go
internal/backup/dbdump.go
internal/backup/restic.go
internal/backup/backup.go
internal/system/cpu_linux.go
internal/system/cpu_other.go
```

## Existing files to modify

```
internal/system/info.go               — new SystemInfo fields
internal/system/info_linux.go         — readLoadAvg(), readTemperature(), GetInfo() signature
internal/system/info_other.go         — GetInfo() signature update
internal/config/config.go             — SystemHealthInterval, updated defaults
internal/api/router.go                — backup endpoints, cpuCollector parameter
internal/web/server.go                — accept cpuCollector, backupMgr
internal/web/handlers.go              — pass cpuCollector/backupMgr to dashboard
internal/web/templates/dashboard.html — CPU/temp bars, backup status section
internal/web/templates/style.css      — styles for new elements
cmd/controller/main.go                — scheduler, cpuCollector, pinger, backupMgr wiring
controller/docker-compose.yml         — /sys mount, remove restic-password mount
configs/controller.yaml.example       — new fields, updated defaults
```

---

## Manual steps after deployment (for Viktor)

1. **Verify /sys mount:** `docker exec felhom-controller ls /host/sys/class/thermal/` — should show thermal_zone directories
2. **Healthchecks setup:** Create project + 3 checks in status.felhom.eu for demo-felhom:
   - `system-health` (period: 10m, grace: 10m)
   - `db-dump` (period: 24h, grace: 1h)
   - `backup` (period: 24h, grace: 1h)
3. **Update controller.yaml:** Add the three ping UUIDs
4. **Verify restic password:** `docker exec felhom-controller cat /opt/docker/felhom-controller/data/restic-password`
5. **Test restore procedure:**
   ```bash
   docker exec felhom-controller restic -r /srv/backups/restic-repo \
     --password-file /opt/docker/felhom-controller/data/restic-password snapshots
   ```
6. **Save restic password externally** — losing it means losing access to all backups