deploy-felhom-compose/CONTEXT.md
2026-02-16 17:29:11 +01:00

# CONTEXT.md — Project Memory
> This file serves as persistent project memory across Claude Code sessions.
> It replaces the auto-generated "Memory" from the claude.ai Project.
> **Update this file at the end of each working session** with current state,
> recent decisions, and anything the next session needs to know.
>
> Ask Claude Code: "Please update CONTEXT.md with what we did today"
Last updated: 2026-02-16 (session 22)
---
## About Viktor (project owner)
- Works at Magyar Telekom (Budapest), building Felhom as a side business
- Felhom: managed home-server service for Hungarian households
- Technical but prefers pragmatic solutions over over-engineering
- Runs all infrastructure on Gitea (gitea.dooplex.hu), k3s cluster for management
- Customer deployments use Docker Compose (not Kubernetes) for simplicity
## Current project state
### felhom-controller (this repo)
- **Version:** v0.7.0
- **Phase 1:** ✅ COMPLETE — Stack Manager + Deploy Flow
- **Phase 2:** ✅ COMPLETE — Monitoring & Health (scheduler, CPU/temp, healthchecks.io pings)
- **Phase 3:** ✅ COMPLETE — Backups (DB dumps, restic integration, manual trigger, **dedicated backup page**)
- **Phase 4:** ✅ COMPLETE — Monitoring Page with Metrics Store (SQLite, Chart.js, system + container metrics)
- **Phase 5:** ✅ COMPLETE — Authentication, Persistence & Settings Page (settings.json, password change, session management)
- **First app deployed:** Paperless-ngx on demo-felhom.eu (2026-02-13)
- **Running on:** demo-felhom (N100 mini PC) at 192.168.0.162:8080
- **All Phase 1-5 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth, monitoring, backups, backup detail page, system monitoring page, settings page
### What was just completed (2026-02-16 session 22)
- **v0.7.0 — Phase 1: Authentication, Persistence & Settings Page:**
- **New `internal/settings/settings.go`:** Shared persistence layer via `settings.json` in the data directory. Atomic writes (tmp + rename), thread-safe with `sync.RWMutex`. Stores password hash overrides and DB validation cache. Graceful handling if file doesn't exist.
- **Auth improvements:**
- Password resolution priority: `settings.json` → `controller.yaml` → none (open dashboard)
- Startup logs which source is active: `Auth: using password from settings.json/controller.yaml/no password configured`
- Session duration extended to 7 days (was 24h)
- `?next=` redirect after session expiry — returns user to the page they were on
- Flash messages on login page (green info box, used after password change)
- Conditional logout link — hidden when auth is disabled (no password configured)
- `invalidateAllSessions()` method for password change flow
- **New Settings page (`/settings`):**
- "Rendszer konfiguráció" section: read-only display of controller.yaml values (customer ID/name/domain, git repo/sync interval, backup enabled/schedule, monitoring, healthchecks URL, hub status, controller version)
- "Jelszó módosítás" section: form with current password, new password, confirm — validates min 8 chars, match check, bcrypt comparison
- Password saved to `settings.json`, all sessions invalidated, redirect to login with flash message
- Only shown if auth is enabled; otherwise shows info message to contact operator
- **Sidebar update:**
- "Beállítások" menu item with ⚙ icon pinned to bottom (above version/logout)
- Version and logout link separated from nav links
- Logout link conditionally shown only when auth is enabled
- **DB validation persistence:**
- After each successful dump, validation results saved to `settings.json` (`db_validations` map keyed by filename)
- Cached data survives container restarts
- `DBValidationCache` struct with `validated_at`, `table_count`, `has_header`, `error`
- **10 files changed** (2 new: settings.go, settings.html; 8 modified: main.go, backup.go, auth.go, handlers.go, server.go, layout.html, login.html, style.css)
- **Deployed:** Controller v0.7.0 to demo-felhom.eu, verified healthy
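
The `settings.json` persistence from session 22 (atomic tmp + rename writes, `sync.RWMutex`, graceful handling of a missing file) could look roughly like the sketch below. The `Settings` fields, the `Store` type and `demoRoundTrip` are illustrative assumptions, not the actual contents of `internal/settings/settings.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// Settings is an assumed, minimal subset of what settings.json may hold.
type Settings struct {
	PasswordHash string `json:"password_hash,omitempty"`
}

// Store guards the in-memory copy with an RWMutex and persists it atomically.
type Store struct {
	mu   sync.RWMutex
	path string
	data Settings
}

// Load reads the file; a missing file is not an error and yields defaults.
func Load(path string) (*Store, error) {
	s := &Store{path: path}
	raw, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return s, nil
	}
	if err != nil {
		return nil, err
	}
	if err := json.Unmarshal(raw, &s.data); err != nil {
		return nil, err
	}
	return s, nil
}

// SetPasswordHash writes to a temp file first, then renames it over the target,
// so a crash mid-write never leaves a half-written settings.json behind.
func (s *Store) SetPasswordHash(hash string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := s.data
	next.PasswordHash = hash
	raw, err := json.MarshalIndent(next, "", "  ")
	if err != nil {
		return err
	}
	tmp := s.path + ".tmp"
	if err := os.WriteFile(tmp, raw, 0o600); err != nil {
		return err
	}
	if err := os.Rename(tmp, s.path); err != nil {
		return err
	}
	s.data = next
	return nil
}

// PasswordHash returns the current hash under a read lock.
func (s *Store) PasswordHash() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.data.PasswordHash
}

// demoRoundTrip saves a hash in a temp directory and reloads it from disk.
func demoRoundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "settings-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "settings.json")
	s, err := Load(path) // the file does not exist yet
	if err != nil {
		return "", err
	}
	if err := s.SetPasswordHash("example-hash"); err != nil {
		return "", err
	}
	reloaded, err := Load(path)
	if err != nil {
		return "", err
	}
	return reloaded.PasswordHash(), nil
}

func main() {
	hash, err := demoRoundTrip()
	fmt.Println(hash, err)
}
```

The rename is only atomic when the temp file lives on the same filesystem as the target, which is why the sketch writes it next to `settings.json`.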
### What was previously completed (2026-02-16 session 21)
- **v0.6.3 — Bug fixes from v0.6.2 code scan (4 minor fixes):**
- **Bug 1:** `--hdd-path` in `docker-setup.sh` now uses `require_arg` validation like all other flags. Previously, `--hdd-path` as the last argument without a value would crash with a cryptic bash error under `set -u` instead of a friendly message.
- **Bug 2:** `stackAction()` in `layout.html` now receives `event` as an explicit parameter instead of relying on the deprecated implicit `window.event`. All 10 onclick call sites in `dashboard.html` and `stacks.html` updated to pass `event` as first argument.
- **Bug 3:** Page `<title>` now has an em dash separator: `"Vezérlőpult — Felhom.eu"` instead of `"VezérlőpultFelhom.eu"`.
- **Bug 4:** `nextPruneLabel()` in `funcmap.go` now returns `"ma"` (Hungarian for "today") on Sunday before 4am, consistent with the `nextRunLabel` function. Previously returned the date in `"2006-01-02"` format.
- **Deployed:** Controller v0.6.3 to demo-felhom.eu, verified healthy
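
As an illustration of the Bug 4 fix, here is a minimal sketch of a label function that returns `"ma"` when the next Sunday 04:00 run falls on the current day. It assumes the prune runs at 04:00 on Sundays and that other days show the date as `2006-01-02`. The real `nextPruneLabel()` in `funcmap.go` may differ, and `labelAt` is a hypothetical helper for trying it out.

```go
package main

import (
	"fmt"
	"time"
)

// nextPruneLabel returns "ma" when the next Sunday 04:00 run is later today,
// otherwise the date of that run.
func nextPruneLabel(now time.Time) string {
	daysUntilSunday := (7 - int(now.Weekday())) % 7 // Sunday is weekday 0
	next := time.Date(now.Year(), now.Month(), now.Day()+daysUntilSunday,
		4, 0, 0, 0, now.Location())
	if !next.After(now) {
		next = next.AddDate(0, 0, 7) // today's run already passed
	}
	if next.Year() == now.Year() && next.YearDay() == now.YearDay() {
		return "ma"
	}
	return next.Format("2006-01-02")
}

// labelAt evaluates the label for a given UTC date and hour (demo helper).
func labelAt(year, month, day, hour int) string {
	return nextPruneLabel(time.Date(year, time.Month(month), day, hour, 0, 0, 0, time.UTC))
}

func main() {
	fmt.Println(labelAt(2026, 2, 15, 3)) // Sunday, before 04:00
	fmt.Println(labelAt(2026, 2, 15, 5)) // Sunday, after 04:00
	fmt.Println(labelAt(2026, 2, 16, 12)) // Monday
}
```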
### What was previously completed (2026-02-16 session 20)
- **Hub Dashboard Bugs + Backup Validation Fix (3 bugs):**
- **Bug 1&2 (Hub repo, felhom-hub v0.1.2):** Hub timestamp parsing failure — `time.Parse` with single hardcoded format silently failed for formats returned by `modernc.org/sqlite`. Added `parseSQLiteTime()` that tries 6 common formats. Fixed: hub main page showing DOWN despite OK status, and report history timestamps showing 00:00:00.
- **Bug 3 (Controller repo, v0.6.2):** Backup page showing "Hiba" for all DB validations — zero-value `DumpValidation{}` (never assigned) hit the `{{else}}` branch in template. Three fixes:
- Template: 4-branch guard (Valid → OK / Error → Hiba / zero-value → "" with tooltip)
- Debug logging: Added `[DEBUG]` and `[WARN]` log lines to all `ValidateDump()` code paths
- Re-validation: `RefreshCache()` now cross-checks `lastDBDump` results against fresh `ListDumpFiles()` validation, healing stale in-memory state
- **Deployed:** Hub v0.1.2 to k3s, Controller v0.6.2 to demo-felhom
- **Verified:** Controller logs show `ValidateDump OK` for all 3 databases (immich: 60 tables, paperless: 67 tables, romm: 14 tables)
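
The multi-format parsing behind `parseSQLiteTime()` can be sketched as a loop over candidate layouts. The six layouts below are an assumption chosen for illustration; the list actually used in felhom-hub is not recorded here.

```go
package main

import (
	"fmt"
	"time"
)

// sqliteLayouts is an assumed list of timestamp layouts a SQLite driver may return.
var sqliteLayouts = []string{
	time.RFC3339Nano,
	"2006-01-02 15:04:05-07:00",
	"2006-01-02 15:04:05 -0700 MST",
	"2006-01-02T15:04:05",
	"2006-01-02 15:04:05",
	"2006-01-02",
}

// parseSQLiteTime tries each layout in order and returns the first successful parse.
// Unlike a single hardcoded time.Parse call, an unknown format yields an explicit error
// instead of a silently ignored zero time.
func parseSQLiteTime(s string) (time.Time, error) {
	for _, layout := range sqliteLayouts {
		if ts, err := time.Parse(layout, s); err == nil {
			return ts, nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized timestamp format: %q", s)
}

func main() {
	for _, s := range []string{"2026-02-16 17:29:11", "2026-02-16T17:29:11Z", "garbage"} {
		ts, err := parseSQLiteTime(s)
		fmt.Println(ts, err)
	}
}
```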
### What was previously completed (2026-02-16 session 19)
- **v0.6.1 — Code Review Bugfixes (7 fixes):**
- **Fix 1:** `http.NotFound(w, nil)` → pass actual `*http.Request` in `deployHandler` and `appDetailHandler`
- **Fix 2:** Dashboard running/stopped counts now computed from the filtered `deployedStacks` set (was counting ALL stacks including non-deployed)
- **Fix 3:** Session cookie `Secure` flag now dynamic based on `r.TLS != nil || X-Forwarded-Proto == "https"`. `SameSite` changed from `Strict` to `Lax` (Strict breaks Cloudflare Tunnel redirects)
- **Fix 4:** Removed misleading `subtle.ConstantTimeCompare` from `isValidSession()` (map lookup already leaks timing; comparing token to itself is meaningless). Removed unused `token` field from `session` struct. Removed `crypto/subtle` import.
- **Fix 5:** Replaced `time.Tick()` (goroutine leak) with proper `time.NewTicker` + `done` channel in `cleanupSessions()`. Added `Close()` method to Server. Added `done chan struct{}` to Server struct.
- **Fix 6:** Added `http.MaxBytesReader(w, req.Body, 1<<20)` (1MB limit) to `deployStack`, `updateOptionalConfig`, `deleteStack` API handlers via `limitBody()` helper.
- **Fix 7:** Cached `time.LoadLocation("Europe/Budapest")` once at top of `templateFuncMap()`, removed 5 per-function `LoadLocation` calls (timeAgo, fmtTime, fmtTimeShort, nextRunLabel, nextPruneLabel).
- **Post-fix verification:** All 4 grep checks pass (0 results for `NotFound(w,nil)`, `ConstantTimeCompare`, `time.Tick(`, `Secure:.*true`). `go vet ./...` clean.
- **Controller version:** v0.6.1 — deployed and verified on demo-felhom.eu
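
Fix 5 above replaces `time.Tick()` with a stoppable ticker. A minimal sketch of that pattern follows, assuming a session map keyed by token with expiry times; the field names and the `demoCleanup` helper are hypothetical, and only `cleanupSessions()`, `Close()` and the `done` channel come from the notes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Server holds sessions plus the done channel used to stop the cleanup goroutine.
type Server struct {
	mu       sync.Mutex
	sessions map[string]time.Time // token -> expiry (assumed layout)
	done     chan struct{}
	wg       sync.WaitGroup
}

// NewServer starts the background cleanup loop.
func NewServer(interval time.Duration) *Server {
	s := &Server{sessions: make(map[string]time.Time), done: make(chan struct{})}
	s.wg.Add(1)
	go s.cleanupSessions(interval)
	return s
}

// cleanupSessions purges expired sessions on every tick until done is closed.
func (s *Server) cleanupSessions(interval time.Duration) {
	defer s.wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // releases the ticker, unlike time.Tick
	for {
		select {
		case now := <-ticker.C:
			s.mu.Lock()
			for token, expiry := range s.sessions {
				if now.After(expiry) {
					delete(s.sessions, token)
				}
			}
			s.mu.Unlock()
		case <-s.done:
			return
		}
	}
}

// Close stops the cleanup goroutine and waits for it to exit. Call it once.
func (s *Server) Close() {
	close(s.done)
	s.wg.Wait()
}

func (s *Server) addSession(token string, expiry time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[token] = expiry
}

func (s *Server) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.sessions)
}

// demoCleanup adds one expired and one valid session and returns how many survive.
func demoCleanup() int {
	s := NewServer(5 * time.Millisecond)
	s.addSession("expired", time.Now().Add(-time.Minute))
	s.addSession("valid", time.Now().Add(time.Hour))
	time.Sleep(100 * time.Millisecond)
	s.Close()
	return s.count()
}

func main() {
	fmt.Println("sessions left:", demoCleanup())
}
```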
### What was previously completed (2026-02-16 session 18)
- **v0.6.0 — Healthcheck Implementation + Central Push + Hub Dashboard:**
- **Part 1 — Healthcheck enhancements (controller-side):**
- Added `heartbeat` ping — lightweight "I'm alive" signal every 5 min (no logic, just ping)
- Added `backup_integrity` ping — weekly `restic check` on Sunday 04:00, pings healthchecks with result
- Added `Heartbeat` and `BackupIntegrity` fields to `PingUUIDsConfig`
- Added `RunIntegrityCheck()` to backup Manager (calls restic Check(), updates lastCheckTime/lastCheckOK, pings)
- Updated `controller.yaml.example` with new monitoring ping_uuids
- Created `monitoring/DEPRECATED.md` for legacy bash monitoring scripts
- **Part 2 — Central hub reporting (controller-side):**
- New `internal/report/` package: types.go (Report struct), builder.go (BuildReport), pusher.go (HTTP push)
- Report builder gathers data from all subsystems: system info (via metrics.GetStaticInfo + system.GetInfo), container stats (via metricsStore.QueryContainerSummary), backup status (via backupMgr.GetFullStatus), health (via monitor.RunHealthCheck), stacks (via stackMgr.GetStacks)
- Report pusher: POST JSON to hub with Bearer token auth, 3 retries with 5s backoff, never fails caller
- Added `HubConfig` to config.go (enabled, url, api_key, push_interval)
- Wired hub reporting into scheduler (configurable interval, default 15m)
- Hub reporting disabled by default (hub.enabled: false)
- **Part 3 — Hub service (felhom.eu repo, new `hub/` subfolder):**
- Full Go service: `cmd/hub/main.go`, `internal/api/handler.go`, `internal/store/store.go`, `internal/web/server.go`
- SQLite store with WAL mode, auto-migration, denormalized fields for fast queries
- REST API: POST /api/v1/report (Bearer token auth), GET /api/v1/customers, GET /api/v1/customers/{id}, GET /api/v1/customers/{id}/history
- Dark theme dashboard (English): multi-customer overview table with status indicators, customer detail page with system/storage/containers/backup/health sections
- Color coding: green (OK, <30min), yellow (warn or 30-60min), red (fail or >60min)
- K8s manifest: Deployment + Service + Ingress for hub.felhom.eu in felhom-system namespace
- Dockerfile, Makefile, hub.yaml.example config
- 90-day report retention with daily auto-prune
- **Controller version:** v0.6.0 — deployed and verified on demo-felhom.eu (9 scheduler jobs, all new jobs registered)
- **Manual steps remaining for Viktor (Part 4 of TASK.md):**
- Create 5 healthcheck checks on status.felhom.eu (heartbeat, system-health, db-dump, backup, backup-integrity)
- Update controller.yaml on demo-felhom with real UUIDs
- Build and deploy felhom-hub to k3s cluster
- Configure hub.felhom.eu DNS in Cloudflare
- Enable hub reporting on demo-felhom controller.yaml
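
The report pusher from Part 2 (POST JSON with Bearer token auth, retries with backoff, never failing the caller) might be sketched as follows. The `Report` fields, the `Pusher` type and the reading of "3 retries" as three attempts in total are assumptions; `demoPush` uses a local test server that fails twice before accepting the report.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// Report is a placeholder for the real report struct in internal/report/types.go.
type Report struct {
	CustomerID string `json:"customer_id"`
	Status     string `json:"status"`
}

// Pusher posts reports to the hub (field names are illustrative).
type Pusher struct {
	URL, APIKey string
	Attempts    int
	Backoff     time.Duration
	Client      *http.Client
}

// Push reports how many attempts were made and whether delivery succeeded.
// It returns no error, so a failing hub can never break the caller.
func (p *Pusher) Push(r Report) (int, bool) {
	body, err := json.Marshal(r)
	if err != nil {
		return 0, false
	}
	for attempt := 1; attempt <= p.Attempts; attempt++ {
		req, err := http.NewRequest(http.MethodPost, p.URL, bytes.NewReader(body))
		if err != nil {
			return attempt, false
		}
		req.Header.Set("Authorization", "Bearer "+p.APIKey)
		req.Header.Set("Content-Type", "application/json")
		resp, err := p.Client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode/100 == 2 {
				return attempt, true
			}
		}
		if attempt < p.Attempts {
			time.Sleep(p.Backoff)
		}
	}
	return p.Attempts, false
}

// demoPush runs the pusher against a server that fails the first two requests.
func demoPush() (int, bool) {
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt32(&calls, 1)
		switch {
		case r.Header.Get("Authorization") != "Bearer secret":
			w.WriteHeader(http.StatusUnauthorized)
		case n < 3:
			w.WriteHeader(http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}))
	defer srv.Close()
	p := &Pusher{URL: srv.URL + "/api/v1/report", APIKey: "secret",
		Attempts: 3, Backoff: time.Millisecond, Client: srv.Client()}
	return p.Push(Report{CustomerID: "demo", Status: "ok"})
}

func main() {
	attempts, ok := demoPush()
	fmt.Println(attempts, ok)
}
```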
### What was previously completed (2026-02-16 session 17)
- **v0.5.4 — Monitoring Page Frontend Fixes (4 bugs, frontend-only):**
- **Bug 1: Tooltip "Invalid Date"** — `items[0].parsed.x` unreliable across Chart.js versions. Fixed tooltip callback to use `items[0].raw.x` (direct {x,y} data access) with `parsed.x` as fallback.
- **Bug 2: Charts fill full width regardless of data density** — `setChartXBounds()` setting `min/max` at runtime was ignored because the scale was created without them. Fixed by including `min: now - defaultRangeMs, max: now` in the initial `chartOpts()` options. Now "7 nap" shows full 7-day x-axis with data clustered on the right.
- **Bug 3: Sysinfo values not consistently right-aligned** — `.sysinfo-grid` used `auto-fill` creating variable-width cells. Fixed to `1fr 1fr` (fixed 2-column). Added `align-items: baseline`, `gap: 1rem`, `white-space: nowrap` on labels, `font-weight: 600` + `word-break: break-word` on values. Removed redundant `<style>` block from monitoring.html (styles now in style.css).
- **Bug 4: Charts overflow on mobile** — Added `min-width: 0` on `.chart-box` (critical CSS grid fix), `overflow: hidden` + `max-width: 100%` on `.chart-wrap` and `.chart-wrap-bar`, `max-width: 100%` on canvas.
- **Controller version:** v0.5.4 — deployed and verified on demo-felhom.eu
### What was previously completed (2026-02-16 session 16)
- **v0.5.1 — Monitoring Page Bugfixes:**
- **Bug 1: Hostname** — `os.Hostname()` returns the container ID inside Docker. Fixed by mounting `/etc/hostname:/host/etc/hostname:ro` and reading it first in `sysinfo.go`. Now shows `demo-felhom`.
- **Bug 2: Tooltip timestamps** — Chart.js tooltip callback used `items[0].parsed.x` (category index 0,1,2...) instead of `items[0].label` (actual timestamp). Index 0 worked by accident (`0 || label` falls through), but all other points showed 1970-01-01.
- **Bug 3+4: Default range + empty charts** — Default range was `24h` but new system had only minutes of data. Changed to `1h` default for both system and container detail charts. Moved `active` class to "1 óra" button.
- **Controller version:** v0.5.1 — deployed and verified on demo-felhom.eu
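
The hostname fix in Bug 1 amounts to preferring the bind-mounted host file over `os.Hostname()`. A sketch, assuming the mount path `/host/etc/hostname` from the notes; the function name `readHostname` and the `"unknown"` fallback are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readHostname prefers the host's hostname file (bind-mounted into the container)
// and falls back to os.Hostname(), which inside Docker returns the container ID.
func readHostname(hostFile string) string {
	if raw, err := os.ReadFile(hostFile); err == nil {
		if name := strings.TrimSpace(string(raw)); name != "" {
			return name
		}
	}
	name, err := os.Hostname()
	if err != nil || name == "" {
		return "unknown"
	}
	return name
}

// demoHostname writes a fake hostname file to a temp dir and reads it back.
func demoHostname() string {
	dir, err := os.MkdirTemp("", "hostname-demo")
	if err != nil {
		return ""
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "hostname")
	if err := os.WriteFile(path, []byte("demo-felhom\n"), 0o644); err != nil {
		return ""
	}
	return readHostname(path)
}

func main() {
	fmt.Println(demoHostname())
	fmt.Println(readHostname("/host/etc/hostname")) // falls back when not mounted
}
```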
### What was previously completed (2026-02-16 session 15)
- **v0.5.0 — Backup Bugfixes + Monitoring Page with Metrics Store:**
- **Task 1: Fixed "Helyi mentés" showing "" after restart** — `GetFullStatus()` now synthesizes `LastBackup` from `SnapshotHistory` and `LastDBDump` from `DumpFiles` on disk when the in-memory values are nil (e.g., after controller restart). Dashboard handler also updated to use `GetFullStatus()` instead of `GetStatus()` for consistent behavior.
- **Task 2: Verified backup page caching** — Already implemented in v0.4.7 (`RefreshCache`, scheduler job, `AfterBackup` callback). No changes needed.
- **Task 3: New Monitoring Page ("Rendszermonitor")** — Full system monitoring subsystem:
- **SQLite metrics store** (`internal/metrics/store.go`, `types.go`): WAL-mode SQLite via `modernc.org/sqlite` (pure Go, no CGO). Stores system metrics (CPU%, memory, temperature, load) and container metrics (CPU%, memory, net/block I/O) with timestamp. Downsampled queries via bucket-based `GROUP BY` for Chart.js. 30-day auto-prune via daily scheduler job at 04:00.
- **Metrics collector** (`internal/metrics/collector.go`): Background goroutine collects system + container metrics every 60 seconds. System data from `system.GetInfo()`, container data from `docker stats --no-stream` with tab-separated format parsing.
- **System info provider** (`internal/metrics/sysinfo.go`, `sysinfo_other.go`): Reads hostname, OS, kernel, CPU model/cores, uptime from `/proc` filesystem. Linux-specific with build-tag fallback for cross-compilation.
- **REST API endpoints** (4 new routes in `router.go`): `GET /api/metrics/system` (time-series with range presets), `GET /api/metrics/containers/summary` (current stats), `GET /api/metrics/containers/{name}` (per-container time-series), `GET /api/metrics/sysinfo` (static system info).
- **Monitoring page template** (`monitoring.html`): 5 sections — System Overview (sysinfo via API), System Metrics Charts (4 line charts: CPU, Memory, Temperature, Load in 2×2 grid), Container Resources (2 horizontal bar charts: CPU% and Memory), Per-container Detail (click to expand with historical charts), Storage (server-rendered progress bars). Time range selectors (1h/6h/24h/7d/30d). Auto-refresh every 60s.
- **Chart.js 4.4.7** embedded locally (offline environments, ~200KB UMD), dark theme configuration matching site design.
- **CSS**: ~100 lines added for monitoring page (`.monitor-card`, `.charts-grid`, `.chart-box`, `.container-charts-row`, `.storage-bars`, responsive rules).
- **Wiring**: 4th sidebar nav item "Rendszermonitor", metrics DB path in named volume (`data/metrics.db`), `/etc/os-release:/host/etc/os-release:ro` volume mount in docker-compose.yml, Dockerfile updated to `golang:1.24-bookworm` (required by `modernc.org/sqlite`), `go.mod` upgraded to `go 1.24.0`.
- **Controller version:** v0.5.0 — deployed and verified on demo-felhom.eu (metrics collecting, 16 containers reporting, sysinfo showing Intel N100 correctly)
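
The metrics store downsamples in SQL with a bucket-based `GROUP BY`. The same idea is shown below in plain Go for illustration: an in-memory stand-in, not the query used in `store.go`. Timestamps are integer-divided by the bucket width and the values in each bucket are averaged. The `Point` type is an assumption and the input is expected to be sorted by time.

```go
package main

import "fmt"

// Point is one sample: a Unix timestamp in seconds and a value (for example CPU %).
type Point struct {
	Unix  int64
	Value float64
}

// downsample averages all points that fall into the same bucket of bucketSeconds
// and emits one point per bucket, stamped with the bucket start time.
func downsample(points []Point, bucketSeconds int64) []Point {
	if bucketSeconds <= 0 {
		return points
	}
	var out []Point
	var sum float64
	var n int
	var current int64
	flush := func() {
		if n > 0 {
			out = append(out, Point{Unix: current * bucketSeconds, Value: sum / float64(n)})
		}
	}
	for _, p := range points {
		bucket := p.Unix / bucketSeconds
		if n > 0 && bucket != current {
			flush()
			sum, n = 0, 0
		}
		current = bucket
		sum += p.Value
		n++
	}
	flush()
	return out
}

func main() {
	raw := []Point{{0, 10}, {30, 20}, {60, 30}, {90, 50}}
	fmt.Println(downsample(raw, 60))
}
```

A wider bucket for longer ranges keeps the number of points sent to Chart.js roughly constant regardless of the selected time range.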
### What was previously completed (2026-02-16 session 14)
- **v0.4.7 — Protected Stack Detail Pages + Backup Page Caching:**
- **Protected stacks clickable** — `data-href` gating changed from `{{if not .Protected}}` to `{{if .Meta.Slug}}` on both `stacks.html` and `dashboard.html`. Protected stacks with `.felhom.yml` (i.e. a slug) are now clickable, linking to `/apps/{slug}`. Stacks without `.felhom.yml` remain non-clickable.
- **"Részletek" button for protected stacks** — Protected stack action section in `stacks.html` now shows a "Részletek" link when the stack has a slug, next to the restart button.
- **FileBrowser `.felhom.yml` resources** — Added `resources` section (mem_request: 128M, mem_limit: 256M, pi_compatible: true, needs_hdd: true) to both `install_filebrowser()` in `docker-setup.sh` and manually on the demo node. FileBrowser detail page now shows memory/Pi/HDD badges.
- **Backup page caching** — `GetFullStatus()` no longer runs expensive subprocess calls (restic stats, docker inspect, disk listing) on every page load. Instead, a new `RefreshCache()` method runs these in the background:
- Every 5 minutes via `backup-cache` scheduler job
- After each successful backup via `AfterBackup` callback
- On startup via a goroutine (non-blocking)
- `GetFullStatus()` returns the cached `FullBackupStatus` instantly, updating only dynamic fields (running flag, next run times, snapshot history). Falls back to a minimal status if cache hasn't populated yet.
- **Controller version:** v0.4.7 — deployed and verified on demo-felhom.eu
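
The caching split described above keeps the expensive collection in `RefreshCache()` and lets `GetFullStatus()` return the cached copy with only dynamic fields updated. A rough sketch under those assumptions: the struct fields and the injected `collect` function are hypothetical stand-ins for the restic and docker subprocess calls.

```go
package main

import (
	"fmt"
	"sync"
)

// FullBackupStatus is a trimmed-down stand-in for the real status struct.
type FullBackupStatus struct {
	RepoSize      string
	SnapshotCount int
	Running       bool
}

// Manager caches the expensive parts of the status.
type Manager struct {
	mu      sync.RWMutex
	cache   *FullBackupStatus
	running bool
	collect func() FullBackupStatus // expensive: subprocess calls in the real code
}

// RefreshCache runs the expensive collection outside the lock and swaps in the result.
func (m *Manager) RefreshCache() {
	st := m.collect()
	m.mu.Lock()
	m.cache = &st
	m.mu.Unlock()
}

// GetFullStatus returns instantly: the cached copy plus dynamic fields,
// or a minimal status if the cache has not been populated yet.
func (m *Manager) GetFullStatus() FullBackupStatus {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if m.cache == nil {
		return FullBackupStatus{Running: m.running}
	}
	st := *m.cache
	st.Running = m.running
	return st
}

// demoCache counts how often the expensive collector runs across several page loads.
func demoCache() (collectCalls int, before, after string) {
	m := &Manager{collect: func() FullBackupStatus {
		collectCalls++
		return FullBackupStatus{RepoSize: "1.2 GiB", SnapshotCount: 2}
	}}
	before = m.GetFullStatus().RepoSize // cache still empty
	m.RefreshCache()
	for i := 0; i < 3; i++ {
		after = m.GetFullStatus().RepoSize
	}
	return collectCalls, before, after
}

func main() {
	fmt.Println(demoCache())
}
```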
### What was previously completed (2026-02-16 session 13)
- **v0.4.6 — MariaDB Validation Fix + Dashboard & Protected Stack UX:**
- **Bugfix: MariaDB dump validation false positive** — MariaDB 11.4+ prepends `/*M!999999\- enable the sandbox mode */` before the dump header comment. `ValidateDump()` now scans the first 10 lines for the expected header pattern instead of just checking line 1. Accepts `-- MariaDB dump`, `-- MySQL dump`, `-- mysqldump` for MariaDB and `-- PostgreSQL database dump` for PostgreSQL.
- **Dashboard shows deployed apps only** — `dashboardHandler()` filters to deployed + protected stacks only. Non-deployed apps remain on the Alkalmazások page. Section heading changed to "Telepített alkalmazások". `TotalCount` stat card still shows all 52 apps.
- **Protected stack restart button** — Protected stacks (traefik, cloudflared, felhom-controller, filebrowser) now show an "Újraindítás" restart button when operational, on both dashboard (compact ↻) and Alkalmazások page (full button). "Védett" / "Védett rendszerkomponens" badge still shown.
- **API protection guard** — Centralized guard in `actionStack()` blocks all actions except `restart` on protected stacks (HTTP 403). Defense-in-depth: `StopStack()` and `DeleteStack()` retain their own guards.
- **FileBrowser `.felhom.yml`** — `install_filebrowser()` in `docker-setup.sh` now creates `.felhom.yml` with `subdomain: files` metadata, so the controller shows the `files.DOMAIN ↗` URL link. Manually created on demo node.
- **Controller version:** v0.4.6 — deployed and verified on demo-felhom.eu
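
The MariaDB validation fix scans the first 10 lines of a dump for a known header instead of checking only line 1. A minimal sketch of that check: the header prefixes are the ones listed above, while the function name `hasDumpHeader` and the `"mariadb"` / `"postgres"` type keys are assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// headerPrefixes maps a database type to the header comments accepted for it.
var headerPrefixes = map[string][]string{
	"mariadb":  {"-- MariaDB dump", "-- MySQL dump", "-- mysqldump"},
	"postgres": {"-- PostgreSQL database dump"},
}

// hasDumpHeader reports whether one of the first 10 lines starts with an expected header.
// Scanning several lines tolerates the sandbox-mode comment MariaDB 11.4+ puts first.
func hasDumpHeader(dump, dbType string) bool {
	sc := bufio.NewScanner(strings.NewReader(dump))
	for line := 0; line < 10 && sc.Scan(); line++ {
		text := strings.TrimSpace(sc.Text())
		for _, prefix := range headerPrefixes[dbType] {
			if strings.HasPrefix(text, prefix) {
				return true
			}
		}
	}
	return false
}

func main() {
	mariadb := "/*M!999999\\- enable the sandbox mode */\n-- MariaDB dump 10.19\n"
	fmt.Println(hasDumpHeader(mariadb, "mariadb"))
	fmt.Println(hasDumpHeader("--\n-- PostgreSQL database dump\n--\n", "postgres"))
}
```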
### What was previously completed (2026-02-16 session 12)
- **v0.4.5 — Dedicated Backup Page ("Biztonsági mentés"):**
- **New `/backups` page** with full backup system visibility — 5 sections:
1. **Status overview cards**: Local backup status (green/gray), remote placeholder (gray), DB count, repo size
2. **Schedule section**: DB dump/restic/prune schedule with next-run times, last backup time + duration, retention policy, "Mentés most" button
3. **Database table**: Lists all discovered DBs with type badge (PostgreSQL/MariaDB), dump file size, last dump time, validation (table count), status
4. **Snapshot history table**: Last 20 snapshots with ID, time, data added, files new/changed
5. **Repository info card**: Path, size, snapshot count, integrity check status, backed-up paths list, remote copy placeholder
- **Backend extensions:**
- `SnapshotRecord` type + ring buffer (20 entries) in Manager for per-snapshot stats
- `DumpValidation` — scans dump files for CREATE TABLE statements, validates header and file size
- `ValidateDump()` runs after each successful dump in `DumpOne()`
- `ListDumpFiles()` scans dump directory for existing `.sql` files (fallback when in-memory results empty)
- `ListSnapshots()` on ResticManager — returns all snapshots from restic (newest first)
- `GetFullStatus()` on Manager — single call returns everything the page needs
- `LoadSnapshotHistory()` populates history from restic on startup (without delta stats)
- Restic check result tracking (`lastCheckTime`, `lastCheckOK`)
- `NextDailyRun()` exported from scheduler for next-run time calculation
- **Server wiring:**
- `Server` struct now holds `*scheduler.Scheduler`
- `NewServer()` accepts scheduler parameter
- `/backups` route + `backupsHandler()` in handlers.go
- **New template functions** (`funcmap.go`): `timeAgo`, `fmtTime`, `fmtTimeShort`, `dbTypeLabel`, `nextRunLabel`, `pruneLabel`, `nextPruneLabel`, `fmtDuration`, `fmtBytes`, `shortID`
- **Navigation**: Sidebar now has 3 items (Vezérlőpult, Alkalmazások, Biztonsági mentés)
- **Dashboard**: Backup card title is now a clickable link to `/backups`
- **Auto-refresh**: Page polls `/api/backup/status` every 3s during backup-in-progress, reloads when complete
- **CSS**: Full dark-theme styles for schedule card, database table, snapshot table, repository card, validation badges, DB type badges, empty state
- **Controller version:** v0.4.5 — deployed and verified on demo-felhom.eu (2 historical snapshots loaded)
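
The 20-entry ring buffer for `SnapshotRecord` can be pictured as a fixed array with a write index that wraps around. The sketch below is one way to build it; the record fields and method names are illustrative, not taken from the backup Manager.

```go
package main

import "fmt"

const historySize = 20

// SnapshotRecord carries per-snapshot stats (fields are assumed for the example).
type SnapshotRecord struct {
	ID       string
	FilesNew int
}

// SnapshotHistory is a fixed-size ring buffer: once full, each Add overwrites the oldest entry.
type SnapshotHistory struct {
	buf   [historySize]SnapshotRecord
	next  int // index the next Add will write to
	count int // number of valid entries, at most historySize
}

// Add stores a record, overwriting the oldest one when the buffer is full.
func (h *SnapshotHistory) Add(r SnapshotRecord) {
	h.buf[h.next] = r
	h.next = (h.next + 1) % historySize
	if h.count < historySize {
		h.count++
	}
}

// NewestFirst returns the stored records ordered from newest to oldest.
func (h *SnapshotHistory) NewestFirst() []SnapshotRecord {
	out := make([]SnapshotRecord, 0, h.count)
	for i := 1; i <= h.count; i++ {
		idx := (h.next - i + historySize) % historySize
		out = append(out, h.buf[idx])
	}
	return out
}

func main() {
	var h SnapshotHistory
	for i := 1; i <= 25; i++ {
		h.Add(SnapshotRecord{ID: fmt.Sprintf("snap-%02d", i), FilesNew: i})
	}
	records := h.NewestFirst()
	fmt.Println(len(records), records[0].ID, records[len(records)-1].ID)
}
```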
### What was previously completed (2026-02-15 session 11)
- **v0.4.1 — App Filtering + Bugfixes:**
- **Filter bar on Alkalmazások page**: Four pill-shaped filter buttons (Mind/Futó/Leállítva/Telepíthető) with live count badges computed from DOM. Filters stack cards via `display: none`, updates URL with `?filter=running` via `history.replaceState`. Reads filter from URL on page load for deep-linking support.
- **New `filterCategory` template function** (`funcmap.go`): Maps container state + deployed flag to filter categories (running/stopped/available). Each stack card gets a `data-filter-state` attribute for client-side filtering.
- **Clickable dashboard stat cards**: Stat cards (Futó/Leállítva/Összes) changed from `<div>` to `<a>` with `href` linking to `/stacks?filter=running`, `/stacks?filter=stopped`, `/stacks` respectively. Hover effect with translateY + box-shadow.
- **docker-compose.yml synced to demo node**: Fixed the stale compose file that still had `dashboard.${DOMAIN}` Traefik label (from pre-v0.3.0). Now uses correct `felhom.${DOMAIN}` label + `/sys:/host/sys:ro` mount.
- **Controller version:** v0.4.1 — deployed and verified on demo-felhom.eu
- **Remaining manual tasks for Viktor (Task 2 & 3 from TASK.md):**
- Verify `felhom.demo-felhom.eu` resolves correctly (Cloudflare Tunnel public hostname may need updating from `dashboard.*` to `felhom.*`)
- Update Pi-hole local DNS if applicable
- Enable backup in `controller.yaml` on demo node (`backup.enabled: true`)
- Create `/srv/backups` directories on demo node
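
The `filterCategory` template function maps container state plus the deployed flag to one of three filter categories. A minimal sketch, assuming the container state is reported as the string `"running"` when up; the real state values in `funcmap.go` may be richer.

```go
package main

import "fmt"

// filterCategory returns the value used for the data-filter-state attribute:
// "available" for apps that are not deployed, otherwise "running" or "stopped".
func filterCategory(state string, deployed bool) string {
	if !deployed {
		return "available"
	}
	if state == "running" {
		return "running"
	}
	return "stopped"
}

func main() {
	fmt.Println(filterCategory("running", true))
	fmt.Println(filterCategory("exited", true))
	fmt.Println(filterCategory("", false))
}
```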
### What was previously completed (2026-02-15 session 10)
- **v0.4.0 — Monitoring & Health + Backups (Phase 2 & 3):**
- **Central job scheduler** (`internal/scheduler/scheduler.go`):
- Replaces ad-hoc goroutines in main.go with a unified scheduler
- `Every(name, interval, fn)` for periodic jobs, `Daily(name, timeStr, fn)` for scheduled tasks
- Panic recovery, skip-if-running, quiet mode for high-frequency jobs (≤30s)
- Daily jobs use `Europe/Budapest` timezone with `time.Timer` for DST correctness
- Graceful shutdown with 30s timeout for running jobs
- **CPU usage collector** (`internal/system/cpu_linux.go`):
- Background goroutine samples `/proc/stat` every 5s, computes delta-based CPU %
- Platform stubs for non-Linux in `cpu_other.go`
- **Temperature & load metrics** (`internal/system/info_linux.go`):
- Reads `/proc/loadavg` for 1/5/15 min load averages
- Reads thermal zones from `/host/sys/class/thermal/` (Docker mount) with `/sys/` fallback
- Handles millidegree values, picks highest zone, with hwmon fallback
- **Healthchecks.io pinger** (`internal/monitor/pinger.go`):
- HTTP ping client for Healthchecks.io-compatible endpoints
- POST to `/ping/{uuid}` (success), `/fail` (failure), `/start` (started)
- 10s timeout, 3 retries with 2s backoff, skips CHANGEME UUIDs
- **System health checks** (`internal/monitor/healthcheck.go`):
- Checks disk, memory, CPU, temperature, Docker reachability, protected containers
- Returns HealthReport with status "ok"/"warn"/"fail" + formatted message for pings
- **Database dump engine** (`internal/backup/dbdump.go`):
- Auto-discovers PostgreSQL/MariaDB containers via `docker ps` + `docker inspect`
- Dumps via `docker exec pg_dump`/`mariadb-dump` with 5min timeout
- Atomic writes (`.tmp` → `.sql`), empty file detection, stale temp cleanup
- **Restic integration** (`internal/backup/restic.go`):
- Auto-generates repository password (32 random bytes, base64url)
- Init, snapshot (JSON output), prune, check, stats, latest snapshot
- Stale lock detection with automatic unlock + retry
- **Backup orchestrator** (`internal/backup/backup.go`):
- DB dumps + restic snapshots, weekly prune on Sundays
- Thread-safe running flag, Healthchecks.io pings with results
- `RunFullBackup()` for manual trigger (sequential: dumps → snapshot)
- **Wiring updates:**
- `main.go`: scheduler-based job registration, cpuCollector lifecycle, pinger + backupMgr init
- `api/router.go`: `GET /api/backup/status`, `POST /api/backup/run`
- `web/server.go` + `handlers.go`: pass cpuCollector to GetInfo(), backup status on dashboard
- `funcmap.go`: `tempColor`, `fmtTemp`, `fmtLoad` template functions
- **Dashboard UI enhancements:**
- CPU usage bar with load average display below
- Temperature with colored indicator dot (green/yellow/red at 60°/75°C)
- Backup status card: last run time, DB count, repo size/snapshots
- "Mentés most" button triggers manual backup via API
- **Config updates:**
- `controller.yaml.example`: added `system_health_interval`, `hdd_path`, `system.reserved_memory_mb`
- `docker-compose.yml`: added `/sys:/host/sys:ro` mount for temperature reading
- `restic_password_file` default changed to `data/` subdir (auto-generated in named volume)
- **Controller version:** v0.4.0 — deployed and verified on demo-felhom.eu
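
The delta-based CPU percentage from the CPU usage collector can be illustrated with two samples of the aggregate `cpu` line from `/proc/stat`. This is a sketch under assumptions: it counts `idle + iowait` as idle and sums only the first eight counters (guest time is already included in user time). The actual `cpu_linux.go` may weigh the fields differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPULine extracts idle and total jiffies from the aggregate "cpu" line.
func parseCPULine(line string) (idle, total uint64, err error) {
	fields := strings.Fields(line)
	if len(fields) < 5 || fields[0] != "cpu" {
		return 0, 0, fmt.Errorf("not an aggregate cpu line: %q", line)
	}
	values := fields[1:]
	if len(values) > 8 {
		values = values[:8] // user nice system idle iowait irq softirq steal
	}
	for i, f := range values {
		v, perr := strconv.ParseUint(f, 10, 64)
		if perr != nil {
			return 0, 0, perr
		}
		total += v
		if i == 3 || i == 4 { // idle and iowait
			idle += v
		}
	}
	return idle, total, nil
}

// cpuPercent computes the busy share between two samples of the cpu line.
func cpuPercent(prev, cur string) (float64, error) {
	idle0, total0, err := parseCPULine(prev)
	if err != nil {
		return 0, err
	}
	idle1, total1, err := parseCPULine(cur)
	if err != nil {
		return 0, err
	}
	if total1 <= total0 || idle1 < idle0 {
		return 0, nil // counters did not advance or were reset
	}
	totalDelta := float64(total1 - total0)
	idleDelta := float64(idle1 - idle0)
	return (totalDelta - idleDelta) / totalDelta * 100, nil
}

func main() {
	pct, err := cpuPercent("cpu 100 0 100 800 0 0 0 0", "cpu 150 0 150 900 0 0 0 0")
	fmt.Println(pct, err)
}
```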
### What was previously completed (2026-02-15 session 9)
- **v0.3.0 — Structural refactoring (templates + server split + domain rename):**
- **Templates: go:embed migration** — moved all 7 HTML templates + CSS from Go string constants to individual files in `internal/web/templates/`. Created `embed.go` with `//go:embed` directive. Template loading now uses `ParseFS()` instead of `Parse()`. CSS served from embed.FS via `ReadFile()`. Zero runtime file dependencies — still compiled into the binary.
- **Server decomposition** — split monolithic `server.go` (540 lines) into focused files:
- `auth.go`: session struct, auth middleware, login/logout handlers, session management
- `handlers.go`: page handlers (dashboard, stacks, logs, deploy, app detail)
- `funcmap.go`: template FuncMap with 14 custom functions
- `server.go`: Server struct, NewServer, loadTemplates (3-liner), ServeHTTP routing, render helper, static file serving
- **Domain rename** — controller subdomain changed from `dashboard.*` to `felhom.*` in Traefik labels and setup script
- **Documentation updated** — CLAUDE.md, README.md, CONTEXT.md all reflect new file structure
- **Reminder for Viktor:** Update Cloudflare Tunnel public hostname (`dashboard.demo-felhom.eu` → `felhom.demo-felhom.eu`) and Pi-hole DNS if needed
- **Controller version:** v0.3.0
### What was previously completed (2026-02-15 session 8)
- **FileBrowser as infrastructure service:**
- Created `scripts/hdd-setup.sh` (adapted from deploy-portainer) — sets up HDD folder structure with `Dokumentumok` user dir
- Created `scripts/docker-setup.sh` (adapted from deploy-portainer) — installs Docker, Traefik, FileBrowser as infra services
- Added `filebrowser` to protected stacks in `controller.yaml.example`
- Removed `templates/filebrowser/` from app-catalog-felhom.eu (no longer a catalog app)
- **Orphan stack detection and deletion:**
- Added `Orphaned` field to Stack struct + `getCatalogTemplateSlugs()` helper
- Orphan detection in `ScanStacks()` — deployed stacks with no matching catalog template marked as orphaned
- New `delete.go`: `DeleteStack()` (compose down + HDD cleanup + dir removal), `GetStackHDDData()`, `parseComposeHDDMounts()`
- Safety: protected HDD paths (root, media, storage, Dokumentumok, appdata) can never be deleted
- New API endpoints: `DELETE /api/stacks/{name}` and `GET /api/stacks/{name}/hdd-data`
- UI: orange "Elavult" badge on orphaned stacks, "Törlés" button, delete confirmation modal
- Modal shows HDD data paths/sizes, checkbox for "Felhasználói adatok törlése a merevlemezről"
- Hides "Frissítés" and "Részletek" buttons for orphaned stacks
- **Verified:** 1 orphaned stack detected on startup (filebrowser — now infra, removed from catalog)
- **Controller version:** v0.2.15
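
The safety rule for HDD cleanup (root, media, storage, Dokumentumok and appdata can never be deleted) could be enforced with a path check like the one below. It assumes the protected names are top-level directories under the HDD mount and treats anything outside the mount as protected too. The function name and the `/mnt/hdd` path are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// protectedDirs are top-level HDD directories that must never be removed.
var protectedDirs = []string{"media", "storage", "Dokumentumok", "appdata"}

// isProtectedHDDPath reports whether deleting target must be refused.
// The HDD root itself, the protected top-level folders and any path that
// escapes the HDD root are all treated as protected.
func isProtectedHDDPath(hddRoot, target string) bool {
	root := filepath.Clean(hddRoot)
	rel, err := filepath.Rel(root, filepath.Clean(target))
	if err != nil || rel == "." || rel == ".." ||
		strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return true
	}
	for _, name := range protectedDirs {
		if rel == name {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/mnt/hdd", "/mnt/hdd/media/", "/mnt/hdd/appdata/romm", "/mnt/hdd/../etc"} {
		fmt.Println(p, isProtectedHDDPath("/mnt/hdd", p))
	}
}
```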
### Previously completed (2026-02-14 session 7)
- **Fixed YAML parse error in romm `.felhom.yml`** (app-catalog repo):
- Root cause: Hungarian opening quote `„` (U+201E) paired with ASCII `"` (0x22) inside YAML double-quoted strings terminated the string prematurely
- Affected lines: `help_text` for IGDB Client Secret and SteamGridDB API Key fields
- Fix: escaped inner ASCII double quotes with `\"` in the YAML strings
- This caused `LoadMetadata()` to silently fail and return empty defaults for ALL romm metadata (tagline, resources, category — everything)
- **Added error logging to `LoadMetadata()`** in `metadata.go`:
- `[ERROR]` log on YAML parse failure (was silently swallowed — critical bug)
- Temporary `[DEBUG]` log used for diagnosis, then removed
- **Fixed deploy command in CLAUDE.md**:
- `sed` pattern now targets only `image:` lines (was matching service name too, breaking YAML)
- Added `sudo` for both sed and docker compose (directory is root-owned)
- **Controller version:** v0.2.14
### Previously completed (2026-02-14 session 6)
- **Bug fix: App info logo SVG rendering** — `.app-info-logo` CSS in `templates.go`:
- Added `min-width`, `min-height`, `max-width`, `max-height: 80px` and `overflow: hidden`
- Prevents SVG images with explicit dimensions or no viewBox from overflowing container
- Logo now reliably renders at 80x80 regardless of SVG intrinsic size
- **Controller version:** v0.2.12
### Previously completed (2026-02-14 session 5)
- **App detail/info pages** — new feature:
- New route: `GET /apps/{slug}` renders a full info page (was redirect to deploy page)
- Hero section with logo, tagline, resource badges
- Screenshots section (graceful — hidden via `onerror` if assets don't exist)
- Info cards: use cases, first steps, prerequisites, default credentials, docs link
- Optional config form with AJAX save (POST `/api/stacks/{name}/optional-config`)
- New `.felhom.yml` fields: `app_info` (tagline, use_cases, first_steps, prerequisites, default_creds, docs_url) and `optional_config` (groups of env var fields)
- New structs in `metadata.go`: `AppInfo`, `OptionalConfigGroup`, `OptionalConfigField`
- `UpdateOptionalConfig` in `deploy.go`: saves optional env vars to `app.yaml`, restarts deployed stacks with `docker compose up -d` to pick up new env vars
- Navigation updated: stack cards on dashboard/stacks pages now link to `/apps/{slug}`, deploy page has "Részletek" link back to info page
- **RoMM metadata updated** (app-catalog repo):
- Full `app_info` section: tagline, 5 use cases, 6 first steps, 3 prerequisites, default creds, docs URL
- 6 optional config fields for metadata providers: IGDB (client_id + secret), SteamGridDB, ScreenScraper (user + password), MobyGames
- docker-compose.yml updated with SCREENSCRAPER_USER, SCREENSCRAPER_PASSWORD, MOBYGAMES_API_KEY env vars
- Display name fixed: "ROMM" → "RomM"
- **Controller version:** v0.2.11
### Previously completed (2026-02-14 session 4)
- **Fixed deploy race condition** in `internal/stacks/deploy.go`:
- In-memory `Deployed` flag now set BEFORE `docker compose up -d` (compose up can take 30-60s for image pulls)
- On failure: both in-memory state and disk (app.yaml) are reverted
- Eliminates stale "Telepítés" button during long compose operations
- **Added `checkBeforeDeploy()` JS guard** in `internal/web/templates.go`:
- Telepítés buttons on Vezérlőpult and Alkalmazások pages now fetch live state from `/api/stacks/{name}` before navigating
- If app is already deployed (e.g., another tab deployed it), shows alert and reloads page instead of navigating to deploy form
- Catches stale UI state gracefully
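
The ordering fix above can be sketched as follows. This is a minimal illustration, not the code in `internal/stacks/deploy.go`: the `Stack` type, `Deploy` signature, and the injected `persist`/`composeUp` callbacks are hypothetical, chosen so the example runs without Docker or a real `app.yaml`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrComposeFailed stands in for any error from `docker compose up -d`.
var ErrComposeFailed = errors.New("compose up failed")

// Stack is a hypothetical in-memory record for one app stack.
type Stack struct {
	mu       sync.Mutex
	deployed bool
}

// IsDeployed is what a dashboard handler would read while a deploy runs.
func (s *Stack) IsDeployed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.deployed
}

func (s *Stack) setDeployed(v bool) {
	s.mu.Lock()
	s.deployed = v
	s.mu.Unlock()
}

// Deploy sets the in-memory flag and persists it BEFORE the slow compose
// step, then reverts both if compose fails.
func (s *Stack) Deploy(persist func(deployed bool) error, composeUp func() error) error {
	s.mu.Lock()
	if s.deployed {
		s.mu.Unlock()
		return errors.New("already deployed")
	}
	s.deployed = true
	s.mu.Unlock()

	if err := persist(true); err != nil {
		s.setDeployed(false)
		return fmt.Errorf("save deploy state: %w", err)
	}
	if err := composeUp(); err != nil {
		s.setDeployed(false)
		_ = persist(false) // best-effort revert of the on-disk state
		return fmt.Errorf("compose up: %w", err)
	}
	return nil
}

func main() {
	disk := false
	persist := func(d bool) error { disk = d; return nil }

	ok := &Stack{}
	err := ok.Deploy(persist, func() error {
		fmt.Println("during compose up, deployed =", ok.IsDeployed())
		return nil
	})
	fmt.Println("success:", err, ok.IsDeployed(), disk)

	bad := &Stack{}
	err = bad.Deploy(persist, func() error { return ErrComposeFailed })
	fmt.Println("failure:", err, bad.IsDeployed(), disk)
}
```

The lock is released before the compose step, so readers are never blocked for the 30-60s an image pull can take.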
### Previously completed (2026-02-14 session 3)
- **Enhanced debug logging** across all stack operations in `internal/stacks/`:
- **Operation timing**: All stack ops (start, stop, restart, update, deploy) now log elapsed time
- **Post-start container state check**: Async goroutine after start/restart/update/deploy
- **Image pull detection**: Checks local images before deploy/update (debug level)
- **GetLogs/ScanStacks improvements**: Byte count logging, deployed/available counts
- All verbose checks gated on `cfg.Logging.Level == "debug"`; timing always at INFO
- **UI improvements** in `internal/web/templates.go` and `server.go`:
- **Memory bar fix on deploy page**: Bar segments now always visible (min-width: 3px), new app segment uses translucent green with distinct border for clear visual separation from committed memory
- **Clickable app cards**: Cards on Vezérlőpult and Alkalmazások pages are now clickable (navigates to deploy/detail page). Uses `data-href` attribute + delegated click handler. Protected stacks excluded. Actions area (buttons, state labels) excluded from click-to-navigate
- **Live-scrolling logs**: Logs page now auto-refreshes every 3s via AJAX polling (`?raw=1` returns plain text). Fixed-height container (70vh) with auto-scroll to bottom. Pulsing green "Élő" indicator. Pause/resume toggle ("Szüneteltetés"/"Folytatás"). User scroll position preserved when scrolled up to read history
- **Deployment progress UI**: Deploy button no longer shows alert+redirect immediately. Instead shows 3-step progress panel: config saved → containers starting → app initializing. Polls `GET /api/stacks/{name}` every 3s to track actual container health state. Handles running (auto-redirect), starting (keep polling), unhealthy (warning), exited (error), and 120s timeout. Shows elapsed time counter
- **Mealie healthcheck fix** (app-catalog-felhom.eu):
- `wget --spider` replaced with Python TCP socket check — mealie image doesn't include wget
- `start_period` increased to 60s (DB migrations take ~40s on first start)
- **Healthcheck audit**: filebrowser (Alpine, has BusyBox wget — OK), stirling-pdf (Ubuntu, has wget — OK)
### Previously completed (2026-02-15 session 2)
- **Phase 4: Git Sync + App Catalog Audit** — major milestone
- **Git sync module** (`internal/sync/sync.go`):
- Clones/pulls app-catalog-felhom.eu repo to local cache on startup
- Periodic sync based on `git.sync_interval` (default 15m)
- Copies `docker-compose.yml` + `.felhom.yml` to stacks dir (never overwrites `app.yaml`/`.env`)
- SHA-256 content comparison — only writes changed files
- Triggers `ScanStacks()` after sync so dashboard updates immediately
- Uses `os/exec` git CLI — no Go git library dependency
- **Manual sync button** ("Sablonok frissítése") on Alkalmazások page:
- `POST /api/sync` endpoint with 30s debounce
- Toast notification shows result (success/failure/what changed)
- Auto-reloads page if new apps or updates detected
- **Sync status** added to `/api/system/info` (last_sync, last_status, syncing flag)
- **.felhom.yml files created for all 10 apps** (paperless-ngx already had one):
- actualbudget, docmost, filebrowser, homebox, immich, mealie, romm, stirling-pdf, vaultwarden
- All follow the same format: display_name, description, category, subdomain, resources, deploy_fields
- **Docker Compose templates audited and fixed** for all 10 apps:
- Fixed `{{DOMAIN}}` → `${DOMAIN}` syntax in homebox, mealie, romm, stirling-pdf
- Fixed `{{HDD_PATH}}` → `${HDD_PATH}` in romm
- Added `deploy.resources.limits.memory` to all services across all templates
- Added `TZ=Europe/Budapest` to all sidecar services (postgres, redis, mariadb)
- Added healthcheck to romm main service
- Added `romm-redis` `condition: service_healthy` (was `service_started`)
- Standardized header comment blocks across all templates
- **Documentation updated**: app-catalog README, CLAUDE.md, CONTEXT.md
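
The "only write changed files, never touch `app.yaml`/`.env`" rule of the sync module can be sketched as below. This is an assumption-laden illustration rather than the contents of `internal/sync/sync.go`: `templateFiles`, `copyIfChanged`, `syncStack`, and `demo` are hypothetical names, and the demo uses temp directories in place of the git cache and stacks dir.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// templateFiles are the only files the sync step copies; app.yaml and .env
// are not listed, so they can never be overwritten.
var templateFiles = []string{"docker-compose.yml", ".felhom.yml"}

// copyIfChanged writes src over dst only when the SHA-256 digests differ.
func copyIfChanged(src, dst string) (bool, error) {
	data, err := os.ReadFile(src)
	if err != nil {
		return false, err
	}
	old, err := os.ReadFile(dst)
	switch {
	case err == nil:
		if sha256.Sum256(old) == sha256.Sum256(data) {
			return false, nil // identical content, skip the write
		}
	case !errors.Is(err, fs.ErrNotExist):
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return false, err
	}
	return true, os.WriteFile(dst, data, 0o644)
}

// syncStack copies the template files of one app and returns what changed.
func syncStack(cacheDir, stackDir string) ([]string, error) {
	var changed []string
	for _, name := range templateFiles {
		src := filepath.Join(cacheDir, name)
		if _, err := os.Stat(src); errors.Is(err, fs.ErrNotExist) {
			continue
		}
		wrote, err := copyIfChanged(src, filepath.Join(stackDir, name))
		if err != nil {
			return changed, err
		}
		if wrote {
			changed = append(changed, name)
		}
	}
	return changed, nil
}

// demo runs two syncs against temp dirs and returns what each one wrote,
// plus the content of the local app.yaml afterwards.
func demo() ([]string, []string, string, error) {
	root, err := os.MkdirTemp("", "sync-sketch")
	if err != nil {
		return nil, nil, "", err
	}
	defer os.RemoveAll(root)
	cache, stack := filepath.Join(root, "cache"), filepath.Join(root, "stack")
	files := map[string]string{
		filepath.Join(cache, "docker-compose.yml"): "services: {}\n",
		filepath.Join(cache, "app.yaml"):           "from catalog\n",
		filepath.Join(stack, "app.yaml"):           "local deploy config\n",
	}
	for path, content := range files {
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return nil, nil, "", err
		}
		if err := os.WriteFile(path, []byte(content), 0o644); err != nil {
			return nil, nil, "", err
		}
	}
	first, err := syncStack(cache, stack)
	if err != nil {
		return nil, nil, "", err
	}
	second, err := syncStack(cache, stack)
	if err != nil {
		return nil, nil, "", err
	}
	local, err := os.ReadFile(filepath.Join(stack, "app.yaml"))
	return first, second, string(local), err
}

func main() {
	first, second, local, err := demo()
	fmt.Println(first, second, local, err)
}
```

A non-empty result from `syncStack` is the natural signal for deciding whether a rescan or a page reload is worth triggering.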
### Previously completed (2026-02-15 session 1)
- **Memory validation during deployment**:
- Pre-deploy memory check: compares `mem_request` sum against usable system RAM
- Hard block if requests exceed usable memory (total - 384MB reserved)
- Soft warning if `mem_limit` sum exceeds total RAM (overcommit OK for limits)
- `ParseMemoryMB()` supports "500M", "1G", "1.5G", "1024" formats
- `CommittedMemory()` sums requests/limits across all deployed stacks
- Memory summary bar shown on deploy page before user clicks deploy
- `system.reserved_memory_mb` configurable in controller.yaml (default: 384)
- **Display: `~` prefix on mem_request** in UI badges (display-only, exact value stored)
- **Felhom.eu logo** replaced text logos in sidebar and login page with actual SVG logo
- Logo SVG embedded as Go string constant, served at `/static/felhom-logo.svg`
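
The memory rules above can be sketched as follows. The function names `ParseMemoryMB` and the 384 MB reserve come from these notes; the bodies, the `Verdict` type, and `CheckDeploy` are illustrative assumptions, including the reading of a bare number such as "1024" as megabytes.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// ParseMemoryMB converts "500M", "1G", "1.5G" or a bare "1024"
// (assumed to be MB) into whole megabytes.
func ParseMemoryMB(s string) (int, error) {
	num := strings.ToUpper(strings.TrimSpace(s))
	mult := 1.0
	switch {
	case strings.HasSuffix(num, "G"):
		mult, num = 1024, strings.TrimSuffix(num, "G")
	case strings.HasSuffix(num, "M"):
		num = strings.TrimSuffix(num, "M")
	}
	v, err := strconv.ParseFloat(num, 64)
	if err != nil || v < 0 || math.IsNaN(v) || math.IsInf(v, 0) {
		return 0, fmt.Errorf("invalid memory value %q", s)
	}
	return int(v*mult + 0.5), nil
}

// Verdict is a hypothetical result type for the pre-deploy check.
type Verdict string

const (
	Allow Verdict = "allow"
	Warn  Verdict = "warn"
	Block Verdict = "block"
)

// CheckDeploy applies the two rules: a hard block when the request sum
// exceeds usable RAM (total minus reserved), and a soft warning when the
// limit sum exceeds total RAM. Both sums include the app being deployed.
func CheckDeploy(totalMB, reservedMB, requestSumMB, limitSumMB int) Verdict {
	switch {
	case requestSumMB > totalMB-reservedMB:
		return Block
	case limitSumMB > totalMB:
		return Warn
	default:
		return Allow
	}
}

func main() {
	for _, s := range []string{"500M", "1G", "1.5G", "1024", "lots"} {
		mb, err := ParseMemoryMB(s)
		fmt.Println(s, mb, err)
	}
	fmt.Println(CheckDeploy(8192, 384, 7900, 7900))
	fmt.Println(CheckDeploy(8192, 384, 4000, 9000))
	fmt.Println(CheckDeploy(8192, 384, 4000, 6000))
}
```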
### Previously completed (2026-02-14)
- **System info bar on Vezérlőpult dashboard**: RAM, SSD, and optional HDD usage
- Progress bars with color coding (green < 70%, yellow 70-85%, red > 85%)
- New `internal/system` package reads `/proc/meminfo` + `syscall.Statfs`
- Platform-specific: Linux impl + non-Linux stub (build tags)
- Hungarian labels: "Memória", "SSD tárhely", "Külső HDD"
- **Docker Compose memory limits** on paperless-ngx template:
- paperless-webserver: 768M, postgres: 256M, redis: 128M
- Added `mem_limit` field to `.felhom.yml` ResourceHints (total: 1152M)
- **`/api/system/info` endpoint** now returns live system metrics (was customer info)
- **Config**: Added `paths.hdd_path` for external HDD monitoring
- Controller image builds via build.sh, pushes to Gitea container registry
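
A rough sketch of how the RAM bar above could be derived from `/proc/meminfo`. The helper names are hypothetical, and computing "used" as `MemTotal - MemAvailable` is an assumption, not something these notes confirm about the `internal/system` package. The sample text keeps the example runnable on non-Linux hosts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo extracts MemTotal and MemAvailable (both in kB) from the
// text of /proc/meminfo.
func parseMeminfo(text string) (totalKB, availKB uint64, err error) {
	found := 0
	for _, line := range strings.Split(text, "\n") {
		key, rest, ok := strings.Cut(line, ":")
		if !ok || (key != "MemTotal" && key != "MemAvailable") {
			continue
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, perr := strconv.ParseUint(fields[0], 10, 64)
		if perr != nil {
			return 0, 0, fmt.Errorf("parse %s: %w", key, perr)
		}
		if key == "MemTotal" {
			totalKB = v
		} else {
			availKB = v
		}
		found++
	}
	if found != 2 || totalKB == 0 {
		return 0, 0, fmt.Errorf("MemTotal/MemAvailable not found")
	}
	return totalKB, availKB, nil
}

// usedPercent treats everything that is not "available" as used.
func usedPercent(totalKB, availKB uint64) float64 {
	if totalKB == 0 || availKB > totalKB {
		return 0
	}
	return float64(totalKB-availKB) / float64(totalKB) * 100
}

// barColor maps a usage percentage to the dashboard thresholds:
// green below 70%, yellow from 70% to 85%, red above 85%.
func barColor(pct float64) string {
	switch {
	case pct < 70:
		return "green"
	case pct <= 85:
		return "yellow"
	default:
		return "red"
	}
}

func main() {
	sample := "MemTotal:       16000000 kB\n" +
		"MemFree:         1000000 kB\n" +
		"MemAvailable:    4000000 kB\n"
	total, avail, err := parseMeminfo(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	pct := usedPercent(total, avail)
	fmt.Printf("%.1f%% used, bar is %s\n", pct, barColor(pct))
}
```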
### Previously completed (2026-02-13)
- Built the entire felhom-controller from scratch (Go, no frameworks)
- Debugged and fixed 7 issues during first real deployment:
1. Password validation (empty passwords accepted)
2. In-memory Deployed flag not updating after deploy
3. Health-aware state parsing (starting/unhealthy detection)
4. Random card ordering (Go map iteration)
5. "Részletek" button redirect for deployed apps
6. Paperless OCR language installation (LANGUAGES vs LANGUAGE env var)
7. Documentation: restart vs up -d for image updates
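
Issue 4 (random card ordering) comes from Go's unspecified map iteration order. A minimal sketch of the usual remedy, collecting the keys and sorting them before rendering, is shown below; the `Stack` type and `sortedSlugs` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Stack is a hypothetical stand-in for the controller's stack record.
type Stack struct {
	DisplayName string
	Deployed    bool
}

// sortedSlugs returns the map keys in a stable, alphabetical order so the
// dashboard renders cards identically on every page load.
func sortedSlugs(stacks map[string]Stack) []string {
	slugs := make([]string, 0, len(stacks))
	for slug := range stacks {
		slugs = append(slugs, slug)
	}
	sort.Strings(slugs)
	return slugs
}

func main() {
	stacks := map[string]Stack{
		"paperless-ngx": {DisplayName: "Paperless-ngx", Deployed: true},
		"immich":        {DisplayName: "Immich"},
		"mealie":        {DisplayName: "Mealie"},
	}
	for _, slug := range sortedSlugs(stacks) {
		fmt.Println(slug, stacks[slug].DisplayName, stacks[slug].Deployed)
	}
}
```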
### What's next (priorities)
1. **Manual steps for v0.6.0** — Viktor needs to:
- Create 5 healthcheck checks on status.felhom.eu with correct periods/grace
- Update controller.yaml on demo-felhom with real UUIDs
- Build + deploy felhom-hub to k3s (`cd hub && make docker-push`, `kubectl apply -f manifests/hub.yaml`)
- Configure hub.felhom.eu DNS in Cloudflare
- Enable hub reporting on demo-felhom (`hub.enabled: true`, `hub.api_key: <key>`)
2. **Test backup flow** — trigger manual backup via dashboard, verify restic repo + DB dumps
3. **Test backup integrity check** — wait for Sunday 04:00 or manually trigger
4. Add `app_info` + `optional_config` to more apps (start with Immich, Mealie, Vaultwarden)
5. Deploy a second app (e.g., ActualBudget — simplest, or Immich — tests HDD + secrets)
6. Test on Raspberry Pi (pi-customer-1)
7. Phase 4: Self-update mechanism
8. v0.6.1: Hub alerting (webhook to Healthchecks for stale customers)
## Architecture decisions
| Decision | Rationale |
|----------|-----------|
| Go stdlib for web (no Gin/Echo) | Minimal dependencies, single binary, easy to embed templates |
| Templates as go:embed HTML/CSS files | Zero runtime file dependencies (compiled into binary), but each template is a separate editable file |
| Docker Compose for customers (not k8s) | Simpler troubleshooting, customers don't need k8s knowledge |
| k3s for management infra only | Viktor's own services (gitea, monitoring, website) run on k3s |
| Cloudflare Tunnel for remote access | No port forwarding needed, works behind any NAT |
| app.yaml per stack | Separates deploy config from compose files, survives git pulls |
| Password fields require explicit input | Prevents accidental empty-password deployments |
| Health-aware state from Docker Status field | Docker's State says "running" even for unhealthy containers |
| Memory limits via deploy.resources.limits | Prevents runaway containers; ~50% headroom over expected usage |
| System info from /proc/meminfo + statfs | No external dependencies, cheap to read on each page load |
| mem_request vs mem_limit (K8s-inspired) | Requests = expected usage (hard block), limits = peak (overcommit OK) |
| 384MB reserved for system | Prevents deploying apps that would starve the OS/controller |
| Logo SVG embedded as Go constant | Same approach as CSS/HTML — zero external file deps |
| Git sync via os/exec git CLI | No Go git library needed, git is in the container image |
| SHA-256 for content comparison | Only copy changed files, avoid unnecessary disk writes |
| 30s debounce on manual sync | Prevents spamming the git server |
| Orphan = deployed but not in catalog | Safe lifecycle: remove from catalog → mark orphaned → user deletes via UI |
| FileBrowser as infra (not catalog) | Needed even after apps deleted (user browses HDD data); deployed by setup script |
| Protected HDD paths | Safety net: never delete top-level HDD dirs (media, storage, Dokumentumok, appdata) |
| Central scheduler (not ad-hoc goroutines) | Single place to register/monitor all periodic tasks, graceful shutdown, skip-if-running |
| CPU sampling via background goroutine | /proc/stat delta needs two readings — collector runs every 5s, GetInfo() reads cached value |
| Temperature from /host/sys (Docker mount) | Container can't read host /sys directly — mount /sys:/host/sys:ro, try /host/sys first |
| Restic password auto-generated | No manual setup needed — generated on first backup run, stored in named volume |
| DB discovery via docker inspect | No config needed — discovers postgres/mariadb containers by image name + env vars |
| Backup orchestrator with running flag | Prevents concurrent backups, supports both scheduled and manual trigger |
| modernc.org/sqlite (pure Go) | No CGO/gcc needed in Docker build stage — keeps `CGO_ENABLED=0` static binary |
| Chart.js embedded locally | Customer hardware may not have internet — CDN not reliable for offline environments |
| Metrics downsampling via SQL | Bucket-based AVG in GROUP BY keeps Chart.js responsive with up to 30 days of data |
| 60s metrics collection interval | Good balance of resolution vs. storage — ~44K rows/month for system metrics |
| /etc/os-release mounted read-only | Container can't read host OS info directly — mount to /host/etc/os-release:ro |
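
The "two readings" point in the CPU sampling row can be illustrated with a small sketch. It assumes the standard field order of the aggregate `cpu` line in `/proc/stat` (user, nice, system, idle, iowait, irq, softirq, steal) and counts idle plus iowait as idle time; the type and function names are hypothetical, and the real collector may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative jiffies from one reading of /proc/stat.
type cpuSample struct {
	idle, total uint64
}

// parseCPULine parses the aggregate "cpu ..." line of /proc/stat.
func parseCPULine(line string) (cpuSample, error) {
	f := strings.Fields(line)
	if len(f) < 5 || f[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("not an aggregate cpu line: %q", line)
	}
	var s cpuSample
	for i, field := range f[1:] {
		if i >= 8 {
			break // guest and guest_nice are already included in user and nice
		}
		v, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("field %d: %w", i+1, err)
		}
		s.total += v
		if i == 3 || i == 4 { // idle and iowait
			s.idle += v
		}
	}
	return s, nil
}

// cpuPercent returns the busy share between two samples. A single sample
// is meaningless on its own because the counters are cumulative since boot.
func cpuPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.idle < prev.idle {
		return 0
	}
	dTotal := cur.total - prev.total
	dIdle := cur.idle - prev.idle
	if dIdle > dTotal {
		return 0
	}
	return 100 * float64(dTotal-dIdle) / float64(dTotal)
}

func main() {
	// In a collector these two lines would be read a few seconds apart.
	prev, err := parseCPULine("cpu  100 0 100 800 0 0 0 0 0 0")
	if err != nil {
		fmt.Println(err)
		return
	}
	cur, err := parseCPULine("cpu  150 0 150 900 0 0 0 0 0 0")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("cpu busy: %.1f%%\n", cpuPercent(prev, cur))
}
```

A background collector would keep the previous sample and store the latest percentage behind a mutex, so a page load only reads the cached value.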
## Key file locations on demo-felhom
```
/opt/docker/felhom-controller/ # Controller compose + config
├── controller.yaml # Customer config (domain, auth, paths)
├── docker-compose.yml # Controller's own compose
└── .env # DOMAIN=demo-felhom.eu
/opt/docker/stacks/ # All app stacks
├── traefik/ # Reverse proxy (protected)
├── cloudflared/ # Tunnel (protected)
├── paperless-ngx/ # First deployed app ✅
│ ├── docker-compose.yml
│ ├── .felhom.yml # App metadata
│ └── app.yaml # Deploy config (env vars, locked fields)
└── whoami/ # Test stack (not deployed)
/mnt/hdd_placeholder/storage/ # HDD storage for apps
└── paperless/
├── consume/ # Drop files here for OCR
├── media/ # Processed documents
└── export/ # Backup exports
```
## Related repositories and their state
| Repository | Status | Notes |
|------------|--------|-------|
| deploy-felhom-compose | Active | This repo. Controller code + deploy scripts |
| app-catalog-felhom.eu | Active | 10 app templates, all with .felhom.yml metadata + memory limits |
| felhom.eu | Active | Website + hub/ subfolder (felhom-hub service) + k8s manifests |
| homelab-manifests | Stable | k3s cluster running (dooplex.hu services) |
| misc-scripts | Utility | collect-repo.sh, backup helpers |
## Gotchas & lessons learned
- `docker compose restart` ≠ `docker compose up -d`: restart doesn't pick up new images
- Go maps have random iteration order — always sort slices before displaying
- Docker `.State`="running" doesn't mean healthy — check `.Status` for "(health: starting)" / "(unhealthy)"
- Paperless-ngx needs `PAPERLESS_OCR_LANGUAGES` (plural) to install language packs, `PAPERLESS_OCR_LANGUAGE` (singular) to select
- In-memory Deployed flag must be set BEFORE `docker compose up -d` (not after) — compose can take 30-60s for image pulls, during which the UI would show a stale "Telepítés" button
- Cloudflare Tunnel handles *.demo-felhom.eu → Traefik handles Host()-based routing to containers
- BIOS "AC Power Recovery" must be enabled on N100 for auto-restart after power outage
- `docker compose up -d` returns exit 0 even when containers immediately crash-loop — need post-start status check to detect this
- When logging env vars for debugging, only log keys (not values) to avoid leaking secrets in log files
- Mealie image (`ghcr.io/mealie-recipes/mealie`) doesn't include wget/curl — use Python TCP socket check for healthcheck
- Mealie DB migrations on first start take ~40s (alembic) — use `start_period: 60s` to avoid premature unhealthy status
- Alpine-based images (filebrowser, vaultwarden) have wget via BusyBox — healthchecks with `wget --spider` work fine
- Deploy `sed` command to update image version must target only the `image:` line. A naive `sed 's|name:OLD|name:NEW|'` also matches the service name line (e.g., `felhom-controller:` → `felhom-controller:0.2.12`), breaking YAML. Use `sudo sed -i 's|image:.*felhom-controller:[^ ]*|image: ...felhom-controller:NEW|'` or a similarly scoped pattern
- Hungarian quotation marks `„”` in YAML: the opening `„` (U+201E) is safe inside a YAML double-quoted string, but the closing mark must NOT be ASCII `"` (0x22), because that terminates the YAML string. Use a `\"` escape or the Unicode `”` (U+201D). This caused a silent parse failure for the entire `.felhom.yml` file
- Never silently swallow parse errors — always log them. Silent failures make debugging impossible (took a dedicated debug session to find a simple quoting issue)
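
The `.State` versus `.Status` gotcha above can be sketched as a small classifier. The function name and the returned labels are hypothetical. The matched substrings are the ones Docker prints in the status column, e.g. `Up 10 seconds (health: starting)` or `Up 3 minutes (unhealthy)`.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveState refines Docker's coarse State with the health hint that
// only appears in the human-readable Status string.
func effectiveState(state, status string) string {
	if state != "running" {
		return state // exited, restarting, paused, ... are passed through
	}
	switch {
	case strings.Contains(status, "(health: starting)"):
		return "starting"
	case strings.Contains(status, "(unhealthy)"):
		return "unhealthy"
	default:
		return "running" // healthy, or no healthcheck defined
	}
}

func main() {
	samples := [][2]string{
		{"running", "Up 10 seconds (health: starting)"},
		{"running", "Up 3 minutes (unhealthy)"},
		{"running", "Up 2 hours (healthy)"},
		{"running", "Up 2 hours"},
		{"exited", "Exited (1) 5 seconds ago"},
	}
	for _, s := range samples {
		fmt.Printf("%-8s %-34s -> %s\n", s[0], s[1], effectiveState(s[0], s[1]))
	}
}
```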