diff --git a/CONTEXT.md b/CONTEXT.md
index 3a767ad..b424cb2 100644
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -7,7 +7,7 @@
 >
 > Ask Claude Code: "Please update CONTEXT.md with what we did today"

-Last updated: 2026-02-16 (session 17)
+Last updated: 2026-02-16 (session 18)

 ---

@@ -22,7 +22,7 @@ Last updated: 2026-02-16 (session 17)
 ## Current project state

 ### felhom-controller (this repo)
-- **Version:** v0.5.4
+- **Version:** v0.6.0
 - **Phase 1:** ✅ COMPLETE — Stack Manager + Deploy Flow
 - **Phase 2:** ✅ COMPLETE — Monitoring & Health (scheduler, CPU/temp, healthchecks.io pings)
 - **Phase 3:** ✅ COMPLETE — Backups (DB dumps, restic integration, manual trigger, **dedicated backup page**)
@@ -31,7 +31,40 @@ Last updated: 2026-02-16 (session 17)
 - **Running on:** demo-felhom (N100 mini PC) at 192.168.0.162:8080
 - **All Phase 1-4 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth, monitoring, backups, backup detail page, system monitoring page

-### What was just completed (2026-02-16 session 17)
+### What was just completed (2026-02-16 session 18)
+- **v0.6.0 — Healthcheck Implementation + Central Push + Hub Dashboard:**
+  - **Part 1 — Healthcheck enhancements (controller-side):**
+    - Added `heartbeat` ping — lightweight "I'm alive" signal every 5 min (no logic, just ping)
+    - Added `backup_integrity` ping — weekly `restic check` on Sunday 04:00, pings healthchecks with result
+    - Added `Heartbeat` and `BackupIntegrity` fields to `PingUUIDsConfig`
+    - Added `RunIntegrityCheck()` to backup Manager (calls restic Check(), updates lastCheckTime/lastCheckOK, pings)
+    - Updated `controller.yaml.example` with new monitoring ping_uuids
+    - Created `monitoring/DEPRECATED.md` for legacy bash monitoring scripts
+  - **Part 2 — Central hub reporting (controller-side):**
+    - New `internal/report/` package: types.go (Report struct), builder.go (BuildReport), pusher.go (HTTP push)
+    - Report builder gathers data from all subsystems: system info (via metrics.GetStaticInfo + system.GetInfo), container stats (via metricsStore.QueryContainerSummary), backup status (via backupMgr.GetFullStatus), health (via monitor.RunHealthCheck), stacks (via stackMgr.GetStacks)
+    - Report pusher: POST JSON to hub with Bearer token auth, 3 retries with 5s backoff, never fails caller
+    - Added `HubConfig` to config.go (enabled, url, api_key, push_interval)
+    - Wired hub reporting into scheduler (configurable interval, default 15m)
+    - Hub reporting disabled by default (hub.enabled: false)
+  - **Part 3 — Hub service (felhom.eu repo, new `hub/` subfolder):**
+    - Full Go service: `cmd/hub/main.go`, `internal/api/handler.go`, `internal/store/store.go`, `internal/web/server.go`
+    - SQLite store with WAL mode, auto-migration, denormalized fields for fast queries
+    - REST API: POST /api/v1/report (Bearer token auth), GET /api/v1/customers, GET /api/v1/customers/{id}, GET /api/v1/customers/{id}/history
+    - Dark theme dashboard (English): multi-customer overview table with status indicators, customer detail page with system/storage/containers/backup/health sections
+    - Color coding: green (OK, <30min), yellow (warn or 30-60min), red (fail or >60min)
+    - K8s manifest: Deployment + Service + Ingress for hub.felhom.eu in felhom-system namespace
+    - Dockerfile, Makefile, hub.yaml.example config
+    - 90-day report retention with daily auto-prune
+  - **Controller version:** v0.6.0 — deployed and verified on demo-felhom.eu (9 scheduler jobs, all new jobs registered)
+  - **Manual steps remaining for Viktor (Part 4 of TASK.md):**
+    - Create 5 healthcheck checks on status.felhom.eu (heartbeat, system-health, db-dump, backup, backup-integrity)
+    - Update controller.yaml on demo-felhom with real UUIDs
+    - Build and deploy felhom-hub to k3s cluster
+    - Configure hub.felhom.eu DNS in Cloudflare
+    - Enable hub reporting on demo-felhom controller.yaml
+
+### What was previously completed (2026-02-16 session 17)
 - **v0.5.4 — Monitoring Page Frontend Fixes (4 bugs, frontend-only):**
   - **Bug 1: Tooltip "Invalid Date"** — `items[0].parsed.x` unreliable across Chart.js versions. Fixed tooltip callback to use `items[0].raw.x` (direct {x,y} data access) with `parsed.x` as fallback.
   - **Bug 2: Charts fill full width regardless of data density** — `setChartXBounds()` setting `min/max` at runtime was ignored because the scale was created without them. Fixed by including `min: now - defaultRangeMs, max: now` in the initial `chartOpts()` options. Now "7 nap" shows full 7-day x-axis with data clustered on the right.
@@ -336,15 +369,19 @@ Last updated: 2026-02-16 (session 17)
 7. Documentation: restart vs up -d for image updates

 ### What's next (priorities)
-1. **Configure Healthchecks.io UUIDs** on demo-felhom.eu (replace CHANGEME in controller.yaml)
+1. **Manual steps for v0.6.0** — Viktor needs to:
+   - Create 5 healthcheck checks on status.felhom.eu with correct periods/grace
+   - Update controller.yaml on demo-felhom with real UUIDs
+   - Build + deploy felhom-hub to k3s (`cd hub && make docker-push`, `kubectl apply -f manifests/hub.yaml`)
+   - Configure hub.felhom.eu DNS in Cloudflare
+   - Enable hub reporting on demo-felhom (`hub.enabled: true`, `hub.api_key: `)
 2. **Test backup flow** — trigger manual backup via dashboard, verify restic repo + DB dumps
-3. **Test orphan delete flow** — try deleting the orphaned filebrowser stack via the UI
+3. **Test backup integrity check** — wait for Sunday 04:00 or manually trigger
 4. Add `app_info` + `optional_config` to more apps (start with Immich, Mealie, Vaultwarden)
 5. Deploy a second app (e.g., ActualBudget — simplest, or Immich — tests HDD + secrets)
-6. Add app screenshots to the asset pipeline (romm-screenshot-1.webp etc.)
-7. Test on Raspberry Pi (pi-customer-1)
-8. Add `paths.hdd_path` to demo-felhom controller.yaml to enable HDD bar
-9. Phase 4: Self-update mechanism
+6. Test on Raspberry Pi (pi-customer-1)
+7. Phase 4: Self-update mechanism
+8. v0.6.1: Hub alerting (webhook to Healthchecks for stale customers)

 ## Architecture decisions

@@ -411,7 +448,7 @@ Last updated: 2026-02-16 (session 17)
 |------------|--------|-------|
 | deploy-felhom-compose | Active | This repo. Controller code + deploy scripts |
 | app-catalog-felhom.eu | Active | 10 app templates, all with .felhom.yml metadata + memory limits |
-| felhom.eu | Stable | Website live, SEO indexed, email working |
+| felhom.eu | Active | Website + hub/ subfolder (felhom-hub service) + k8s manifests |
 | homelab-manifests | Stable | k3s cluster running (dooplex.hu services) |
 | misc-scripts | Utility | collect-repo.sh, backup helpers |

diff --git a/controller/README.md b/controller/README.md
index e3531a4..18e1d4b 100644
--- a/controller/README.md
+++ b/controller/README.md
@@ -24,7 +24,7 @@ controller generates secrets, saves app.yaml, runs `docker compose up -d`, and t
 with Traefik routing and health checks. The dashboard correctly shows real-time container states including health substatus (starting → healthy → running).

-Current version: **v0.4.0**
+Current version: **v0.6.0**

 ### What works
 - Dashboard with live container state (green/orange/yellow/red)
@@ -55,6 +55,10 @@ Current version: **v0.4.0**
 - Restic backup with auto-password generation, snapshot, prune, stats
 - Backup status card on dashboard with manual "Mentés most" trigger button
 - Backup API endpoints: status query and manual trigger
+- SQLite metrics store (system + container metrics, 60s collection, 30-day retention)
+- Heartbeat ping (5-minute "I'm alive" signal to Healthchecks)
+- Weekly backup integrity check (restic check, Sunday 04:00)
+- Central hub reporting (periodic JSON push to felhom-hub service)

 ### Known issues / next priorities
 - Cloudflare Tunnel + Traefik TLS: paperless.demo-felhom.eu works locally but shows "Not secure" (certificate chain not fully validated through tunnel)
@@ -87,10 +91,10 @@ Current version: **v0.4.0**
 │ │ └──────────┘ │ │
 │ └────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────────────────────┘
-        │ pings                        │ git pull
-        ▼                              ▼
-   status.felhom.eu              gitea.dooplex.hu
-   (Healthchecks on k3s)         (stack definitions)
+        │ pings              │ JSON push            │ git pull
+        ▼                    ▼                      ▼
+   status.felhom.eu     hub.felhom.eu          gitea.dooplex.hu
+   (Healthchecks)       (central dashboard)    (stack definitions)
 ```

 ## Repository Layout
@@ -121,9 +125,18 @@ controller/
 │   │   ├── pinger.go  # Healthchecks.io HTTP ping client
 │   │   └── healthcheck.go  # System health checks (disk, mem, CPU, temp, Docker)
 │   ├── backup/
-│   │   ├── backup.go  # Backup orchestrator (DB dumps + restic + prune)
+│   │   ├── backup.go  # Backup orchestrator (DB dumps + restic + prune + integrity)
 │   │   ├── dbdump.go  # Database auto-discovery + dump (pg_dump, mariadb-dump)
-│   │   └── restic.go  # Restic operations (init, snapshot, prune, stats)
+│   │   └── restic.go  # Restic operations (init, snapshot, prune, check, stats)
+│   ├── metrics/
+│   │   ├── store.go  # SQLite metrics storage (system + container, downsampled queries)
+│   │   ├── collector.go  # Background collector (60s interval, system + docker stats)
+│   │   ├── types.go  # SystemSample, ContainerSample, StaticSystemInfo structs
+│   │   └── sysinfo.go  # Host-level static info (/proc, /etc)
+│   ├── report/
+│   │   ├── types.go  # Hub report JSON payload definitions
+│   │   ├── builder.go  # Builds report from system/stacks/backup/metrics state
+│   │   └── pusher.go  # HTTP POST to central hub (retry, Bearer auth)
 │   └── web/
 │       ├── server.go  # HTTP server, routing, static file serving
 │       ├── auth.go  # Session auth, login/logout handlers
@@ -159,6 +172,8 @@ controller/
 | **Sync** | `internal/sync/` | ✅ Done | Git-based app catalog sync (clone/pull, content-hash copy) |
 | **Scheduler** | `internal/scheduler/` | ✅ Done | Central job scheduler (periodic + daily, skip-if-running) |
 | **Monitor** | `internal/monitor/` | ✅ Done | Healthchecks.io pings, system health checks |
+| **Metrics** | `internal/metrics/` | ✅ Done | SQLite time-series store, system + container collection |
+| **Report** | `internal/report/` | ✅ Done | Central hub push (JSON report builder + HTTP pusher) |
 | **Backup** | `internal/backup/` | ✅ Done | DB auto-discovery + dump, restic snapshots, prune, manual trigger |

 ## Stack Management
@@ -371,7 +386,7 @@ docker compose up -d

 | Node | Hardware | Domain | IP | Status |
 |------|----------|--------|----|--------|
-| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | ✅ Controller v0.4.0 + Paperless-ngx running |
+| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | ✅ Controller v0.6.0 + Paperless-ngx running |
 | pi-customer-1 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | — | 📲 Not yet tested |

 ### First deployment log (Paperless-ngx on demo-felhom)
@@ -445,6 +460,9 @@ docker compose up -d
 - [x] Central job scheduler (replaces ad-hoc goroutines)
 - [x] Healthchecks.io-compatible HTTP pinger with retry logic
 - [x] System health checks (disk, memory, CPU, temp, Docker, protected containers)
+- [x] Heartbeat ping (5-minute "I'm alive" signal)
+- [x] SQLite metrics store (system + container metrics, 60s collection, 30-day prune)
+- [x] Backup integrity check (weekly restic check with Healthchecks ping)
 - [ ] Customer notifications (email/Telegram)

 ### Phase 3 — Backups ✅ COMPLETE
@@ -472,10 +490,13 @@ docker compose up -d
 - [ ] Health-based rollback mechanism
 - [ ] Config export/import

-### Phase 6 — Central Management (future)
-- [ ] API authentication for remote management
-- [ ] Central dashboard on k3s querying all customer controllers
+### Phase 6 — Central Management (in progress)
+- [x] Central hub reporting (controller → hub JSON push with Bearer auth)
+- [x] Hub report builder (system, stacks, backup, health, containers, metrics)
+- [x] Hub service (felhom-hub: REST API + SQLite + dark-theme dashboard)
+- [x] K8s manifests for hub deployment on k3s
 - [ ] Fleet-wide update management
+- [ ] Customer notifications (email/Telegram)

 ## Related Repositories

@@ -483,5 +504,5 @@ docker compose up -d
 |------------|---------|
 | [deploy-felhom-compose](https://gitea.dooplex.hu/admin/deploy-felhom-compose) | This repo — controller + deploy scripts |
 | [app-catalog-felhom.eu](https://gitea.dooplex.hu/admin/app-catalog-felhom.eu) | Docker Compose templates + .felhom.yml metadata |
-| [felhom.eu](https://gitea.dooplex.hu/admin/felhom.eu) | Website + app assets + felhom infra manifests |
+| [felhom.eu](https://gitea.dooplex.hu/admin/felhom.eu) | Website + app assets + felhom infra manifests (incl. felhom-hub) |
 | [homelab-manifests](https://gitea.dooplex.hu/admin/homelab-manifests) | k3s cluster manifests (dooplex.hu) |
\ No newline at end of file