From 1596e86e692b342dc032b2ddaac70139c6d5484b Mon Sep 17 00:00:00 2001
From: kisfenyo
Date: Sun, 15 Feb 2026 11:20:30 +0100
Subject: [PATCH] docs: update CONTEXT.md and README.md for v0.4.0

Co-Authored-By: Claude Opus 4.6
---
 CONTEXT.md           | 83 +++++++++++++++++++++++++++++++++++++-------
 controller/README.md | 64 +++++++++++++++++++++++++---------
 2 files changed, 118 insertions(+), 29 deletions(-)

diff --git a/CONTEXT.md b/CONTEXT.md
index 8b99c80..42a6a73 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -7,7 +7,7 @@
 >
 > Ask Claude Code: "Please update CONTEXT.md with what we did today"

-Last updated: 2026-02-15 (session 9)
+Last updated: 2026-02-15 (session 10)

 ---

@@ -22,13 +22,65 @@ Last updated: 2026-02-15 (session 9)
 ## Current project state

 ### felhom-controller (this repo)
-- **Version:** v0.3.0
+- **Version:** v0.4.0
 - **Phase 1:** ✅ COMPLETE — Stack Manager + Deploy Flow
+- **Phase 2:** ✅ COMPLETE — Monitoring & Health (scheduler, CPU/temp, healthchecks.io pings)
+- **Phase 3:** ✅ COMPLETE — Backups (DB dumps, restic integration, manual trigger)
 - **First app deployed:** Paperless-ngx on demo-felhom.eu (2026-02-13)
 - **Running on:** demo-felhom (N100 mini PC) at 192.168.0.162:8080
-- **All Phase 1 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth
+- **All Phase 1-3 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth, monitoring, backups

-### What was just completed (2026-02-15 session 9)
+### What was just completed (2026-02-15 session 10)
+- **v0.4.0 — Monitoring & Health + Backups (Phase 2 & 3):**
+  - **Central job scheduler** (`internal/scheduler/scheduler.go`):
+    - Replaces ad-hoc goroutines in main.go with a unified scheduler
+    - `Every(name, interval, fn)` for periodic jobs, `Daily(name, timeStr, fn)` for scheduled tasks
+    - Panic recovery, skip-if-running, quiet mode for high-frequency jobs (≤30s)
+    - Daily jobs use `Europe/Budapest` timezone with `time.Timer` for DST correctness
+    - Graceful shutdown with 30s timeout for running jobs
+  - **CPU usage collector** (`internal/system/cpu_linux.go`):
+    - Background goroutine samples `/proc/stat` every 5s, computes delta-based CPU %
+    - Platform stubs for non-Linux in `cpu_other.go`
+  - **Temperature & load metrics** (`internal/system/info_linux.go`):
+    - Reads `/proc/loadavg` for 1/5/15 min load averages
+    - Reads thermal zones from `/host/sys/class/thermal/` (Docker mount) with `/sys/` fallback
+    - Handles millidegree values, picks highest zone, with hwmon fallback
+  - **Healthchecks.io pinger** (`internal/monitor/pinger.go`):
+    - HTTP ping client for Healthchecks.io-compatible endpoints
+    - POST to `/ping/{uuid}` (success), `/fail` (failure), `/start` (started)
+    - 10s timeout, 3 retries with 2s backoff, skips CHANGEME UUIDs
+  - **System health checks** (`internal/monitor/healthcheck.go`):
+    - Checks disk, memory, CPU, temperature, Docker reachability, protected containers
+    - Returns HealthReport with status "ok"/"warn"/"fail" + formatted message for pings
+  - **Database dump engine** (`internal/backup/dbdump.go`):
+    - Auto-discovers PostgreSQL/MariaDB containers via `docker ps` + `docker inspect`
+    - Dumps via `docker exec pg_dump`/`mariadb-dump` with 5min timeout
+    - Atomic writes (`.tmp` → `.sql`), empty file detection, stale temp cleanup
+  - **Restic integration** (`internal/backup/restic.go`):
+    - Auto-generates repository password (32 random bytes, base64url)
+    - Init, snapshot (JSON output), prune, check, stats, latest snapshot
+    - Stale lock detection with automatic unlock + retry
+  - **Backup orchestrator** (`internal/backup/backup.go`):
+    - DB dumps + restic snapshots, weekly prune on Sundays
+    - Thread-safe running flag, Healthchecks.io pings with results
+    - `RunFullBackup()` for manual trigger (sequential: dumps → snapshot)
+  - **Wiring updates:**
+    - `main.go`: scheduler-based job registration, cpuCollector lifecycle, pinger + backupMgr init
+    - `api/router.go`: `GET /api/backup/status`, `POST /api/backup/run`
+    - `web/server.go` + `handlers.go`: pass cpuCollector to GetInfo(), backup status on dashboard
+    - `funcmap.go`: `tempColor`, `fmtTemp`, `fmtLoad` template functions
+  - **Dashboard UI enhancements:**
+    - CPU usage bar with load average display below
+    - Temperature with colored indicator dot (green/yellow/red at 60°/75°C)
+    - Backup status card: last run time, DB count, repo size/snapshots
+    - "Mentés most" button triggers manual backup via API
+  - **Config updates:**
+    - `controller.yaml.example`: added `system_health_interval`, `hdd_path`, `system.reserved_memory_mb`
+    - `docker-compose.yml`: added `/sys:/host/sys:ro` mount for temperature reading
+    - `restic_password_file` default changed to `data/` subdir (auto-generated in named volume)
+- **Controller version:** v0.4.0 — deployed and verified on demo-felhom.eu
+
+### What was previously completed (2026-02-15 session 9)
 - **v0.3.0 — Structural refactoring (templates + server split + domain rename):**
   - **Templates: go:embed migration** — moved all 7 HTML templates + CSS from Go string constants to individual files in `internal/web/templates/`. Created `embed.go` with `//go:embed` directive. Template loading now uses `ParseFS()` instead of `Parse()`. CSS served from embed.FS via `ReadFile()`. Zero runtime file dependencies — still compiled into the binary.
   - **Server decomposition** — split monolithic `server.go` (540 lines) into focused files:
@@ -190,14 +242,15 @@ Last updated: 2026-02-15 (session 9)
 7. Documentation: restart vs up -d for image updates

 ### What's next (priorities)
-1. **Test orphan delete flow** — try deleting the orphaned filebrowser stack via the UI
-2. Add `app_info` + `optional_config` to more apps (start with Immich, Mealie, Vaultwarden)
-3. Deploy a second app (e.g., ActualBudget — simplest, or Immich — tests HDD + secrets) to validate all .felhom.yml files
-4. Add app screenshots to the asset pipeline (romm-screenshot-1.webp etc.)
-5. Test on Raspberry Pi (pi-customer-1)
-6. Add `paths.hdd_path` to demo-felhom controller.yaml to enable HDD bar
-7. Phase 2 continued: CPU/temperature metrics, Healthchecks.io pings
-8. Phase 3: Backup system (DB dumps + restic)
+1. **Configure Healthchecks.io UUIDs** on demo-felhom.eu (replace CHANGEME in controller.yaml)
+2. **Test backup flow** — trigger manual backup via dashboard, verify restic repo + DB dumps
+3. **Test orphan delete flow** — try deleting the orphaned filebrowser stack via the UI
+4. Add `app_info` + `optional_config` to more apps (start with Immich, Mealie, Vaultwarden)
+5. Deploy a second app (e.g., ActualBudget — simplest, or Immich — tests HDD + secrets)
+6. Add app screenshots to the asset pipeline (romm-screenshot-1.webp etc.)
+7. Test on Raspberry Pi (pi-customer-1)
+8. Add `paths.hdd_path` to demo-felhom controller.yaml to enable HDD bar
+9. Phase 4: Self-update mechanism

 ## Architecture decisions

@@ -222,6 +275,12 @@ Last updated: 2026-02-15 (session 9)
 | Orphan = deployed but not in catalog | Safe lifecycle: remove from catalog → mark orphaned → user deletes via UI |
 | FileBrowser as infra (not catalog) | Needed even after apps deleted (user browses HDD data); deployed by setup script |
 | Protected HDD paths | Safety net: never delete top-level HDD dirs (media, storage, Dokumentumok, appdata) |
+| Central scheduler (not ad-hoc goroutines) | Single place to register/monitor all periodic tasks, graceful shutdown, skip-if-running |
+| CPU sampling via background goroutine | /proc/stat delta needs two readings — collector runs every 5s, GetInfo() reads cached value |
+| Temperature from /host/sys (Docker mount) | Container can't read host /sys directly — mount /sys:/host/sys:ro, try /host/sys first |
+| Restic password auto-generated | No manual setup needed — generated on first backup run, stored in named volume |
+| DB discovery via docker inspect | No config needed — discovers postgres/mariadb containers by image name + env vars |
+| Backup orchestrator with running flag | Prevents concurrent backups, supports both scheduled and manual trigger |

 ## Key file locations on demo-felhom

diff --git a/controller/README.md b/controller/README.md
index 7ef7d40..e3531a4 100644
--- a/controller/README.md
+++ b/controller/README.md
@@ -24,7 +24,7 @@ controller generates secrets, saves app.yaml, runs `docker compose up -d`, and t
 with Traefik routing and health checks. The dashboard correctly shows real-time container
 states including health substatus (starting → healthy → running).

-Current version: **v0.3.0**
+Current version: **v0.4.0**

 ### What works
 - Dashboard with live container state (green/orange/yellow/red)
@@ -47,6 +47,14 @@ Current version: **v0.3.0**
 - Clickable app cards on dashboard and applications pages (navigate to info page)
 - Memory bar with two-segment visualization on deploy page (committed vs new app allocation)
 - Deployment progress UI: 3-step progress panel with real-time health polling (config → containers → health check)
+- CPU usage bar with load average display (1/5/15 min)
+- Temperature display with colored indicator dot (thermal zone reading)
+- Central job scheduler replacing ad-hoc goroutines (periodic + daily jobs)
+- Healthchecks.io-compatible system health pings with retry logic
+- Database auto-discovery and dump (PostgreSQL/MariaDB via docker exec)
+- Restic backup with auto-password generation, snapshot, prune, stats
+- Backup status card on dashboard with manual "Mentés most" trigger button
+- Backup API endpoints: status query and manual trigger

 ### Known issues / next priorities
 - Cloudflare Tunnel + Traefik TLS: paperless.demo-felhom.eu works locally but shows "Not secure" (certificate chain not fully validated through tunnel)
@@ -101,10 +109,21 @@ controller/
 │   ├── sync/
 │   │   └── sync.go           # Git sync: clone/pull app catalog, content-hash copy
 │   ├── api/router.go         # REST API endpoints
+│   ├── scheduler/
+│   │   └── scheduler.go      # Central job scheduler (Every, Daily, skip-if-running)
 │   ├── system/
 │   │   ├── info.go           # SystemInfo struct
-│   │   ├── info_linux.go     # Linux: /proc/meminfo + statfs
-│   │   └── info_other.go     # Non-Linux stub
+│   │   ├── info_linux.go     # Linux: /proc/meminfo + statfs + loadavg + temperature
+│   │   ├── info_other.go     # Non-Linux stub
+│   │   ├── cpu_linux.go      # CPU collector (background /proc/stat sampling)
+│   │   └── cpu_other.go      # CPU collector stub (non-Linux)
+│   ├── monitor/
+│   │   ├── pinger.go         # Healthchecks.io HTTP ping client
+│   │   └── healthcheck.go    # System health checks (disk, mem, CPU, temp, Docker)
+│   ├── backup/
+│   │   ├── backup.go         # Backup orchestrator (DB dumps + restic + prune)
+│   │   ├── dbdump.go         # Database auto-discovery + dump (pg_dump, mariadb-dump)
+│   │   └── restic.go         # Restic operations (init, snapshot, prune, stats)
 │   └── web/
 │       ├── server.go         # HTTP server, routing, static file serving
 │       ├── auth.go           # Session auth, login/logout handlers
@@ -135,12 +154,12 @@ controller/
 | **Config** | `internal/config/` | ✅ Done | Load & validate controller.yaml, env overrides |
 | **Stacks** | `internal/stacks/` | ✅ Done | Compose operations, scanning, metadata, deploy flow |
 | **API** | `internal/api/` | ✅ Done | REST endpoints (stacks, deploy, rescan, system info, health) |
-| **System** | `internal/system/` | ✅ Done | System resource info (RAM, disk usage) for dashboard & API |
+| **System** | `internal/system/` | ✅ Done | System resource info (RAM, disk, CPU, temperature, load) |
 | **Web** | `internal/web/` | ✅ Done | Hungarian dashboard, auth, deploy pages, asset serving |
 | **Sync** | `internal/sync/` | ✅ Done | Git-based app catalog sync (clone/pull, content-hash copy) |
-| **Backup** | `internal/backup/` | 📲 Phase 3 | DB dumps, restic snapshots, restore |
-| **Monitor** | `internal/monitor/` | 📲 Phase 2 | Health checks, Healthchecks pings, system metrics |
-| **Scheduler** | `internal/scheduler/` | 📲 Phase 2 | Cron-like job runner for all periodic tasks |
+| **Scheduler** | `internal/scheduler/` | ✅ Done | Central job scheduler (periodic + daily, skip-if-running) |
+| **Monitor** | `internal/monitor/` | ✅ Done | Healthchecks.io pings, system health checks |
+| **Backup** | `internal/backup/` | ✅ Done | DB auto-discovery + dump, restic snapshots, prune, manual trigger |

 ## Stack Management

@@ -352,7 +371,7 @@ docker compose up -d

 | Node | Hardware | Domain | IP | Status |
 |------|----------|--------|----|--------|
-| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | ✅ Controller v0.2.11 + Paperless-ngx running |
+| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | ✅ Controller v0.4.0 + Paperless-ngx running |
 | pi-customer-1 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | — | 📲 Not yet tested |

 ### First deployment log (Paperless-ngx on demo-felhom)
@@ -385,7 +404,9 @@ docker compose up -d
 | POST | `/api/stacks/{name}/optional-config` | Yes | Update optional config env vars |
 | GET | `/api/stacks/{name}/logs` | Yes | Container logs (add `?raw=1` for plain text) |
 | POST | `/api/stacks/rescan` | Yes | Trigger manual stack discovery |
-| GET | `/api/system/info` | Yes | System resource usage (RAM, disk, HDD) |
+| GET | `/api/system/info` | Yes | System resource usage (RAM, disk, CPU, temp, load) |
+| GET | `/api/backup/status` | Yes | Backup status (last run, DB dump count, repo stats) |
+| POST | `/api/backup/run` | Yes | Trigger manual backup (DB dumps + restic snapshot) |

 ## Status & Roadmap

@@ -412,20 +433,29 @@ docker compose up -d
 - [x] Alphabetically sorted stack display
 - [x] Deploy page doubles as read-only config viewer for deployed apps

-### Phase 2 — Monitoring & Health
+### Phase 2 — Monitoring & Health ✅ COMPLETE
 - [x] System metrics on dashboard (RAM, SSD, HDD usage bars)
 - [x] `/api/system/info` endpoint with live resource data
 - [x] Pre-deploy memory validation (mem_request hard block, mem_limit soft warning)
 - [x] Memory summary bar on deploy page
-- [ ] CPU and temperature metrics
-- [ ] Healthchecks.io ping integration
+- [x] CPU usage collector (background /proc/stat sampling, 5s interval)
+- [x] CPU usage bar on dashboard with load average display
+- [x] Temperature reading from /sys/class/thermal (with /host/sys Docker mount)
+- [x] Temperature display with colored indicator dot (green/yellow/red)
+- [x] Central job scheduler (replaces ad-hoc goroutines)
+- [x] Healthchecks.io-compatible HTTP pinger with retry logic
+- [x] System health checks (disk, memory, CPU, temp, Docker, protected containers)
 - [ ] Customer notifications (email/Telegram)

-### Phase 3 — Backups
-- [ ] DB dump engine (PostgreSQL, MariaDB/MySQL, SQLite)
-- [ ] Restic integration (snapshot, prune, check)
-- [ ] Backup status on dashboard
-- [ ] Manual backup trigger from UI
+### Phase 3 — Backups ✅ COMPLETE
+- [x] DB auto-discovery (PostgreSQL/MariaDB containers via docker inspect)
+- [x] DB dump engine (pg_dump/mariadb-dump via docker exec, atomic writes)
+- [x] Restic integration (auto-init, snapshot, prune, check, stats)
+- [x] Restic password auto-generation (no manual setup needed)
+- [x] Backup orchestrator (DB dumps + restic + weekly prune)
+- [x] Backup status on dashboard (last run, DB count, repo stats)
+- [x] Manual backup trigger from UI ("Mentés most" button)
+- [x] `GET /api/backup/status` and `POST /api/backup/run` endpoints
 - [ ] Restore workflow

 ### Phase 4 — Git Sync & Updates