diff --git a/TASK.md b/TASK.md index 32c71e4..b9e8f67 100644 --- a/TASK.md +++ b/TASK.md @@ -1,720 +1,341 @@ -# TASK.md — v0.6.0: Healthcheck Implementation + Central Push + Multi-Customer Dashboard +# TASK.md — Code Review Bugfixes (v0.6.1) -> **Version:** v0.6.0 -> **Depends on:** v0.5.4 (current) -> **Repo:** `deploy-felhom-compose` (controller/ subfolder) -> **Build:** `~/build/felhom-controller/build.sh 0.6.0 --push` -> **Deploy target:** demo-felhom.eu (N100) + k3s cluster (dooplex.hu) +## Overview + +Fix bugs and logic issues identified during the v0.6.0 code review. All changes are in the `controller/` subtree. +No new features — only correctness, safety, and quality fixes. + +After all fixes: bump version to **v0.6.1**, commit, build, push, deploy, verify. --- -## Context +## Fix 1: `http.NotFound(w, nil)` — pass request, not nil -The controller already has health monitoring infrastructure built in v0.4.0: -- `internal/monitor/pinger.go` — Healthchecks.io-compatible HTTP ping client (success/fail/start, retries) -- `internal/monitor/healthcheck.go` — System health checks (disk, memory, CPU, temp, Docker, protected containers) -- Scheduler jobs in `main.go`: `system-health` (every 5m), `db-dump` (daily), `backup` (daily) -- Backup manager already calls `pinger.Ping()`/`pinger.Fail()` after each operation +**Files:** `internal/web/handlers.go` -**Problem:** The demo-felhom Healthchecks project has **zero checks created** (screenshot confirms empty project at `status.felhom.eu/projects/.../checks/`). The `controller.yaml` on demo-felhom has all `CHANGEME` placeholder UUIDs. Nothing is actually pinging. +**Problem:** Two handlers discard the `*http.Request` parameter as `_`, then call `http.NotFound(w, nil)`. While Go's current stdlib doesn't dereference the request in `NotFound`, this is incorrect and will break if middleware wraps it. 
-Additionally, there are legacy bash scripts (`backup-healthcheck.sh`, `monitoring-setup.sh`) from the pre-controller era that duplicate functionality now built into the controller. These should be deprecated in favor of controller-native pings. +**Changes:** -**This version has two major parts:** -1. **Prerequisite:** Get healthchecks actually working on demo-felhom (create checks, configure UUIDs, verify pings) -2. **New feature:** Central push from customer controllers to k3s + multi-customer overview dashboard +In `deployHandler`, change signature and call: +```go +// BEFORE: +func (s *Server) deployHandler(w http.ResponseWriter, _ *http.Request, name string) { + ... + if err != nil { + http.NotFound(w, nil) ---- - -## Part 0: Healthcheck Ping Design (controller.yaml schema update) - -### Current ping types (already implemented in code) - -| Ping | Schedule | Source | What it proves | -|------|----------|--------|----------------| -| `system_health` | Every 5 min | `monitor.RunHealthCheck()` | Server alive, Docker running, disks OK, protected containers up, CPU/mem/temp within thresholds | -| `db_dump` | Daily 02:30 | `backup.RunDBDumps()` | Database dumps completed successfully | -| `backup` | Daily 03:00 | `backup.RunBackup()` | Restic snapshot completed successfully | - -### New ping types to add - -| Ping | Schedule | Source | What it proves | -|------|----------|--------|----------------| -| `backup_integrity` | Weekly (Sunday 04:00) | New: `backup.RunIntegrityCheck()` | Restic repo passes `restic check` — data is not corrupted | -| `heartbeat` | Every 5 min | New: lightweight HTTP POST, no logic | Controller process is alive (distinct from `system_health` which does heavy checks and could fail due to a bug while the controller itself is fine) | - -### Revised `controller.yaml` monitoring section - -```yaml -monitoring: - enabled: true - healthchecks_base: "https://status.felhom.eu" - ping_uuids: - heartbeat: "" # NEW — every 1 min, controller alive - 
system_health: "" # existing — every 5 min, comprehensive check - db_dump: "" # existing — daily after db dumps - backup: "" # existing — daily after restic snapshot - backup_integrity: "" # NEW — weekly after restic check - system_health_interval: "5m" - health_check_schedule: "06:00" - thresholds: - disk_warn_percent: 80 - disk_crit_percent: 90 - backup_max_age_hours: 36 - cpu_warn_percent: 90 - memory_warn_percent: 85 - temperature_warn_celsius: 75 +// AFTER: +func (s *Server) deployHandler(w http.ResponseWriter, r *http.Request, name string) { + ... + if err != nil { + http.NotFound(w, r) ``` -> **Note:** Empty string and "CHANGEME..." UUIDs are both skipped by the pinger (already implemented). This means any check can be left unconfigured — the controller just skips it silently. +In `appDetailHandler`, same fix: +```go +// BEFORE: +func (s *Server) appDetailHandler(w http.ResponseWriter, _ *http.Request, slug string) { + ... + if found == nil { + http.NotFound(w, nil) -### Healthchecks check configuration (to be created manually on status.felhom.eu) +// AFTER: +func (s *Server) appDetailHandler(w http.ResponseWriter, r *http.Request, slug string) { + ... + if found == nil { + http.NotFound(w, r) +``` -For each customer project, create these checks: - -| Check name | Period | Grace | Tags | -|-----------|--------|-------|------| -| `heartbeat` | 5 minutes | 10 minutes | `heartbeat` | -| `system-health` | 5 minutes | 10 minutes | `system`, `health` | -| `db-dump` | 1 day (02:30 CET) | 30 minutes | `backup`, `db` | -| `backup` | 1 day (03:00 CET) | 60 minutes | `backup`, `restic` | -| `backup-integrity` | 7 days | 24 hours | `backup`, `integrity` | +**Verify:** Grep for `NotFound(w, nil)` — should return 0 results after fix. 
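The grep above only shows that the `nil` call sites are gone. A minimal sketch of a behavioral check, assuming a hypothetical stand-in handler `notFoundWithRequest` with the same `(w, r, name)` shape as `deployHandler` (the route `/stacks/...` is also an assumption):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// notFoundWithRequest is a hypothetical stand-in for the corrected handlers:
// the request parameter is named and forwarded instead of being discarded.
func notFoundWithRequest(w http.ResponseWriter, r *http.Request, name string) {
	http.NotFound(w, r)
}

// statusFor sends a synthetic GET through the handler and returns the status code.
func statusFor(name string) int {
	req := httptest.NewRequest(http.MethodGet, "/stacks/"+name, nil)
	rec := httptest.NewRecorder()
	notFoundWithRequest(rec, req, name)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("missing-stack")) // prints 404
}
```

Because the handler now receives a real request, any wrapper that reads `r` before or after the `NotFound` call gets a non-nil value.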
--- -## Part 1: Controller-side healthcheck implementation +## Fix 2: Dashboard running/stopped counts don't match displayed stacks -### Task 1.1: Add heartbeat ping +**File:** `internal/web/handlers.go`, `dashboardHandler` -**Files:** `cmd/controller/main.go` +**Problem:** The `running`/`stopped` stat counters iterate over ALL stacks (including non-deployed ones), but the dashboard only displays deployed + protected stacks. The numbers don't match what the user sees. -Add a new scheduler job — the simplest possible ping, no health check logic: +**Fix:** Compute the counts from the same filtered set (`deployedStacks`), not from `stackList`. Move the filter loop first, then count from the filtered result. ```go -// Heartbeat — lightweight "I'm alive" signal -sched.Every("heartbeat", 5*time.Minute, func(ctx context.Context) error { - pinger.Ping(cfg.Monitoring.PingUUIDs.Heartbeat, "") - return nil +func (s *Server) dashboardHandler(w http.ResponseWriter, _ *http.Request) { + stackList := s.stackMgr.GetStacks() + + // Filter to deployed + protected stacks first + var deployedStacks []stacks.Stack + for _, st := range stackList { + if st.Deployed || st.Protected { + deployedStacks = append(deployedStacks, st) + } + } + + // Count from the DISPLAYED set only + running, stopped := 0, 0 + for _, st := range deployedStacks { + switch st.State { + case stacks.StateRunning, stacks.StateStarting, stacks.StateUnhealthy, stacks.StateRestarting: + running++ + case stacks.StateStopped, stacks.StateExited: + stopped++ + } + } + + // ... rest unchanged, but use deployedStacks for display ... + data["Stacks"] = deployedStacks + data["RunningCount"] = running + data["StoppedCount"] = stopped + data["TotalCount"] = len(stackList) // keep this as total catalog size +``` + +**Verify:** Deploy, open dashboard. Count of green + red badges on cards should match the stat numbers. 
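The badge check above can also be expressed as a small test. This is a sketch only: `State` and `Stack` here are simplified, hypothetical stand-ins for the types in the `stacks` package, and `displayed` and `countStates` are illustrative helper names, not existing functions.

```go
package main

import "fmt"

// State and Stack are simplified stand-ins (assumptions) for the real
// stacks package types; only the fields the dashboard uses are modeled.
type State string

const (
	StateRunning State = "running"
	StateStopped State = "stopped"
	StateExited  State = "exited"
)

type Stack struct {
	Name      string
	Deployed  bool
	Protected bool
	State     State
}

// displayed returns only the stacks the dashboard renders (deployed or protected).
func displayed(all []Stack) []Stack {
	var out []Stack
	for _, st := range all {
		if st.Deployed || st.Protected {
			out = append(out, st)
		}
	}
	return out
}

// countStates tallies running and stopped stacks in the given slice.
func countStates(list []Stack) (running, stopped int) {
	for _, st := range list {
		switch st.State {
		case StateRunning:
			running++
		case StateStopped, StateExited:
			stopped++
		}
	}
	return running, stopped
}

func main() {
	all := []Stack{
		{Name: "traefik", Protected: true, State: StateRunning},
		{Name: "immich", Deployed: true, State: StateStopped},
		{Name: "nextcloud", State: StateStopped}, // catalog entry, never deployed
	}
	running, stopped := countStates(displayed(all))
	fmt.Println(running, stopped, len(all)) // prints: 1 1 3
}
```

Counting over `all` instead of `displayed(all)` would report 2 stopped, which is the mismatch this fix removes.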
+ +--- + +## Fix 3: `Secure: true` cookie blocks HTTP login + +**File:** `internal/web/auth.go`, `handleLogin` + +**Problem:** The session cookie has `Secure: true` hardcoded. When accessing via plain HTTP (e.g., `http://192.168.0.162:8080` during local setup), the browser silently drops the cookie and never sends it back, making login impossible with no visible error. + +**Fix:** Set `Secure` dynamically based on the incoming request: + +```go +// BEFORE: +http.SetCookie(w, &http.Cookie{ + Name: sessionCookieName, + Value: token, + Path: "/", + MaxAge: int(sessionMaxAge.Seconds()), + HttpOnly: true, + SameSite: http.SameSiteStrictMode, + Secure: true, +}) + +// AFTER: +isSecure := r.TLS != nil || r.Header.Get("X-Forwarded-Proto") == "https" +http.SetCookie(w, &http.Cookie{ + Name: sessionCookieName, + Value: token, + Path: "/", + MaxAge: int(sessionMaxAge.Seconds()), + HttpOnly: true, + SameSite: http.SameSiteLaxMode, // Lax needed: Strict can break redirects through CF tunnel + Secure: isSecure, }) ``` -**Files:** `internal/config/config.go` +Note: Also change `SameSiteStrictMode` → `SameSiteLaxMode`. Strict mode can cause issues when users arrive via Cloudflare Tunnel redirects (the cookie won't be sent on the first navigation from an external link). -Add `Heartbeat` field to `PingUUIDsConfig`: +**Verify:** Access `http://192.168.0.162:8080` in browser, log in — should work. Also verify HTTPS login still works via `https://vezerlo.demo-felhom.eu`. + +--- + +## Fix 4: Remove misleading `subtle.ConstantTimeCompare` in session check + +**File:** `internal/web/auth.go`, `isValidSession` + +**Problem:** The map lookup `s.sessions[token]` is not constant-time, so any timing signal has already leaked before the comparison runs. The subsequent `ConstantTimeCompare` compares the token to itself (the session was just fetched by that key), so it always returns 1 and adds no security. It's misleading to keep it. 
+ +**Fix:** Simplify: ```go -type PingUUIDsConfig struct { - Heartbeat string `yaml:"heartbeat"` - DBDump string `yaml:"db_dump"` - Backup string `yaml:"backup"` - SystemHealth string `yaml:"system_health"` - BackupIntegrity string `yaml:"backup_integrity"` // new -} -``` - -### Task 1.2: Add backup integrity check - -**Files:** `internal/backup/restic.go` - -Add a `Check()` method (may already exist as part of prune logic — verify first): - -```go -// Check runs `restic check` to verify repository integrity. -func (r *ResticRunner) Check() error { - args := []string{"check", "--repo", r.repo, "--json"} - // ... standard exec with password file, timeout 30 min -} -``` - -**Files:** `internal/backup/backup.go` - -Add `RunIntegrityCheck()`: - -```go -// RunIntegrityCheck runs restic check and pings healthchecks with the result. -func (m *Manager) RunIntegrityCheck(ctx context.Context) error { - err := m.restic.Check() - uuid := m.cfg.Monitoring.PingUUIDs.BackupIntegrity - if err != nil { - m.pinger.Fail(uuid, fmt.Sprintf("restic check failed: %v", err)) - return err +// BEFORE: +func (s *Server) isValidSession(token string) bool { + s.sessionsMu.RLock() + defer s.sessionsMu.RUnlock() + sess, ok := s.sessions[token] + if !ok || time.Now().After(sess.expiresAt) { + return false } - m.pinger.Ping(uuid, "restic check passed") - return nil + return subtle.ConstantTimeCompare([]byte(sess.token), []byte(token)) == 1 +} + +// AFTER: +func (s *Server) isValidSession(token string) bool { + s.sessionsMu.RLock() + defer s.sessionsMu.RUnlock() + sess, ok := s.sessions[token] + return ok && time.Now().Before(sess.expiresAt) } ``` -**Files:** `cmd/controller/main.go` - -Register the weekly job: +Also: the `token` field in the `session` struct is now unused (it duplicates the map key). Remove it: ```go -if cfg.Backup.Enabled && backupMgr != nil { - // ... existing daily jobs ... 
+// BEFORE: +type session struct { + token string + expiresAt time.Time +} - // Weekly integrity check — Sunday 04:00 - sched.Daily("backup-integrity", "04:00", func(ctx context.Context) error { - if time.Now().Weekday() != time.Sunday { - return nil // skip non-Sundays +// AFTER: +type session struct { + expiresAt time.Time +} +``` + +And update `createSession`: +```go +// BEFORE: +s.sessions[token] = &session{token: token, expiresAt: time.Now().Add(sessionMaxAge)} + +// AFTER: +s.sessions[token] = &session{expiresAt: time.Now().Add(sessionMaxAge)} +``` + +After these changes, remove the `"crypto/subtle"` import if no longer used. + +**Verify:** Log in, navigate around — session should work. Log out — should redirect to login. + +--- + +## Fix 5: `cleanupSessions` goroutine leak + +**File:** `internal/web/auth.go` + +**Problem:** `time.Tick()` creates a ticker that can never be stopped (and, before Go 1.23, could never be garbage-collected). The cleanup goroutine therefore runs forever, even during shutdown. + +**Fix:** This one is lower priority since the controller runs as a long-lived process, but the fix is simple. Since we don't currently pass a context to `NewServer`, use a `done` channel on the server: + +Add a `done` channel to the Server struct: +```go +type Server struct { + // ... existing fields ... + done chan struct{} +} +``` + +Initialize it in `NewServer`: +```go +func NewServer(...) *Server { + s := &Server{ + // ... existing ... + done: make(chan struct{}), + } + s.loadTemplates() + go s.cleanupSessions() + return s +} +``` + +Rewrite `cleanupSessions`: +```go +func (s *Server) cleanupSessions() { + ticker := time.NewTicker(15 * time.Minute) + defer ticker.Stop() + for { + select { + case <-s.done: + return + case <-ticker.C: + s.sessionsMu.Lock() + now := time.Now() + for t, sess := range s.sessions { + if now.After(sess.expiresAt) { + delete(s.sessions, t) + } + } + s.sessionsMu.Unlock() } - return backupMgr.RunIntegrityCheck(ctx) - }) -} -``` - -> **Note on scheduler:** `Daily()` fires every day at the given time. 
To make it weekly, check the weekday inside the function. If you prefer, add a `Weekly()` method to the scheduler — but the weekday check is simpler and consistent with how prune already works. - -### Task 1.3: Update example config - -**Files:** `controller/configs/controller.yaml.example` - -Update the `monitoring.ping_uuids` section to include `heartbeat` and `backup_integrity` fields. Add comments explaining each. - -### Task 1.4: Deprecation note for bash monitoring scripts - -The following files in `deploy-felhom-compose/monitoring/` are **superseded** by the controller's built-in monitoring: - -- `backup-healthcheck.sh` → replaced by `internal/monitor/healthcheck.go` (scheduler: `system-health`) -- `monitoring-setup.sh` → no longer needed (controller reads `controller.yaml` directly) -- `monitoring.conf.template` → replaced by `controller.yaml` monitoring section -- `backup-healthcheck.service` / `.timer` → replaced by controller's scheduler - -**Action:** Add a `DEPRECATED.md` in `deploy-felhom-compose/monitoring/` explaining that these scripts are kept for reference only and should not be used on nodes running felhom-controller v0.4.0+. Do NOT delete the files yet — they may be needed if a customer is still on a pre-controller setup. - -### Verification (Part 1) - -After building and deploying v0.6.0 to demo-felhom: - -1. Check controller logs: `docker logs felhom-controller --since 5m | grep -i "ping\|health\|heartbeat"` -2. Verify pings arrive at `status.felhom.eu` — all 5 checks should show green within 10 minutes -3. Test failure: `docker stop traefik`, wait 5 min, check that `system-health` goes red (protected container missing) -4. 
Restart traefik: `docker start traefik`, verify recovery - ---- - -## Part 2: Central push to k3s (customer → operator reporting) - -### Architecture - -``` -┌─────────────────────────┐ HTTPS POST /api/v1/report -│ Customer controller │────────────────────────────────────────┐ -│ (demo-felhom.eu) │ every 15 min (configurable) │ -└─────────────────────────┘ ▼ - ┌─────────────────────────────┐ -┌─────────────────────────┐ HTTPS POST │ felhom-hub │ -│ Customer controller │────────────────────────▶│ (k3s pod on dooplex.hu) │ -│ (customer-2) │ │ │ -└─────────────────────────┘ │ - Receives reports │ - │ - Stores in SQLite │ - │ - Serves dashboard │ - │ - Alerts on stale reports │ - └─────────────────────────────┘ - hub.felhom.eu -``` - -### Task 2.1: Define the report payload - -The controller pushes a JSON summary every 15 minutes. This is **not** raw metrics — it's an aggregated health summary. - -```json -{ - "version": 1, - "customer_id": "demo-felhom", - "customer_name": "Demo Ügyfél", - "controller_version": "0.6.0", - "timestamp": "2026-02-16T12:00:00Z", - "system": { - "hostname": "demo-felhom", - "os": "Debian GNU/Linux 13 (trixie)", - "kernel": "6.12.69+deb13-amd64", - "cpu_model": "Intel N100", - "cpu_cores": 4, - "uptime_seconds": 345600, - "cpu_percent": 12.5, - "memory_total_mb": 15872, - "memory_used_mb": 4200, - "memory_percent": 26.5, - "temperature_celsius": 48.0, - "load_avg_1": 0.45, - "load_avg_5": 0.38, - "load_avg_15": 0.32 - }, - "storage": [ - { "mount": "/", "total_gb": 476.0, "used_gb": 28.5, "percent": 6.0 }, - { "mount": "/mnt/hdd_1", "total_gb": 931.0, "used_gb": 120.3, "percent": 12.9 } - ], - "containers": { - "total": 16, - "running": 14, - "stopped": 2, - "unhealthy": 0, - "list": [ - { "name": "paperless-ngx-webserver-1", "state": "running", "cpu_percent": 2.1, "memory_mb": 350 }, - { "name": "traefik", "state": "running", "cpu_percent": 0.3, "memory_mb": 45 } - ] - }, - "backup": { - "enabled": true, - "last_db_dump": 
"2026-02-16T02:30:15Z", - "last_snapshot": "2026-02-16T03:02:45Z", - "snapshot_count": 42, - "repo_size_mb": 2048, - "last_integrity_check": "2026-02-09T04:00:00Z", - "integrity_ok": true - }, - "health": { - "status": "ok", - "issues": [], - "warnings": ["Disk /mnt/hdd_1 at 82%"] - }, - "stacks": { - "deployed": ["paperless-ngx", "immich", "jellyfin"], - "available": ["nextcloud", "vaultwarden", "home-assistant"], - "updates_available": 1 - } -} -``` - -### Task 2.2: Implement report builder in the controller - -**New file:** `controller/internal/report/builder.go` - -```go -package report - -// Report is the JSON payload pushed to the central hub. -type Report struct { - Version int `json:"version"` - CustomerID string `json:"customer_id"` - CustomerName string `json:"customer_name"` - ControllerVersion string `json:"controller_version"` - Timestamp time.Time `json:"timestamp"` - System SystemReport `json:"system"` - Storage []StorageReport `json:"storage"` - Containers ContainerReport `json:"containers"` - Backup BackupReport `json:"backup"` - Health HealthReport `json:"health"` - Stacks StacksReport `json:"stacks"` -} - -// BuildReport collects current state from all subsystems and returns a Report. -func BuildReport(cfg *config.Config, stackMgr *stacks.Manager, - backupMgr *backup.Manager, cpuCollector *system.CPUCollector, - pinger *monitor.Pinger, version string) *Report { - // Gather system info from system.GetInfo() - // Gather container info from stackMgr - // Gather backup info from backupMgr.GetFullStatus() - // Gather health from monitor.RunHealthCheck() - // Gather stack list from stackMgr.GetStacks() - // Return assembled Report -} -``` - -This function should call existing methods — **do not duplicate logic**. Use the same data sources the dashboard and monitoring page already use. 
- -### Task 2.3: Implement report pusher in the controller - -**New file:** `controller/internal/report/pusher.go` - -```go -package report - -// Pusher sends reports to the central hub. -type Pusher struct { - hubURL string - apiKey string - httpClient *http.Client - logger *log.Logger - enabled bool -} - -// Push sends a report to the hub. Returns nil on success. -// Retries 3 times with 5s backoff. Never returns error to caller -// (push failures should not affect controller operation). -func (p *Pusher) Push(report *Report) error { - // JSON marshal - // POST to hubURL + "/api/v1/report" - // Header: Authorization: Bearer - // Header: Content-Type: application/json - // Retry on failure - // Log but don't propagate errors -} -``` - -### Task 2.4: Add hub configuration to controller.yaml - -**Files:** `internal/config/config.go`, `controller/configs/controller.yaml.example` - -```yaml -# --- Central hub (operator dashboard) --- -hub: - enabled: false # Enable central reporting - url: "https://hub.felhom.eu" # Hub API endpoint - api_key: "" # Shared secret for authentication - push_interval: "15m" # How often to push reports -``` - -```go -type HubConfig struct { - Enabled bool `yaml:"enabled"` - URL string `yaml:"url"` - APIKey string `yaml:"api_key"` - PushInterval string `yaml:"push_interval"` -} -``` - -Add `Hub HubConfig `yaml:"hub"`` to the main `Config` struct. 
- -### Task 2.5: Wire the pusher into main.go - -```go -// --- Central hub reporting --- -if cfg.Hub.Enabled && cfg.Hub.URL != "" { - pushInterval, err := time.ParseDuration(cfg.Hub.PushInterval) - if err != nil { - pushInterval = 15 * time.Minute } - pusher := report.NewPusher(&cfg.Hub, logger) - sched.Every("hub-report", pushInterval, func(ctx context.Context) error { - r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, pinger, version) - return pusher.Push(r) - }) - logger.Printf("[INFO] Hub reporting enabled (every %s to %s)", pushInterval, cfg.Hub.URL) } ``` -### Verification (Part 2) +Add a `Close` method (called from main during shutdown, optional for now): +```go +func (s *Server) Close() { + close(s.done) +} +``` -1. Set `hub.enabled: true` and `hub.url` to a temporary endpoint (e.g., `https://webhook.site/...`) in demo-felhom's `controller.yaml` -2. Restart controller, check logs for "Hub reporting enabled" -3. Wait 15 min (or set `push_interval: "1m"` for testing), verify JSON arrives at the endpoint -4. Validate JSON structure matches the spec above -5. Reset `push_interval` to `"15m"` after testing +**Verify:** Build succeeds, controller starts without errors. --- -## Part 3: Hub service on k3s (operator side) +## Fix 6: Add `http.MaxBytesReader` to API POST endpoints -### Overview +**File:** `internal/api/router.go` -The hub is a lightweight Go service deployed on Viktor's k3s cluster in the `felhom-system` namespace. It receives reports from customer controllers, stores them in SQLite, and serves an English-language dashboard for Viktor. +**Problem:** `json.NewDecoder(req.Body).Decode(&body)` has no size limit. A malicious or accidental large POST could exhaust memory. 
-**Domain:** `hub.felhom.eu` (Nginx Ingress, cert-manager TLS) -**Namespace:** `felhom-system` (alongside Healthchecks and other felhom infra) -**Code:** `felhom.eu` repo on Gitea, `hub/` subfolder +**Fix:** Add a helper and use it in all handlers that decode JSON bodies (`deployStack`, `updateOptionalConfig`, `deleteStack`): -### Task 3.1: Hub service (subfolder in felhom.eu repository) - -The hub lives in the existing `felhom.eu` repository on Gitea as a `hub/` subfolder. It's deployed to the k3s cluster in the `felhom-system` namespace (alongside Healthchecks and other felhom infra). K8s manifests go in the `homelab-manifests` repo as usual. - -**Structure (inside felhom.eu repo):** - -``` -hub/ -├── cmd/hub/main.go # Entry point -├── internal/ -│ ├── api/ -│ │ └── handler.go # POST /api/v1/report, GET /api/v1/customers -│ ├── store/ -│ │ └── store.go # SQLite: save reports, query latest per customer -│ └── web/ -│ ├── server.go # Dashboard HTTP server -│ ├── templates/ -│ │ ├── dashboard.html # Multi-customer overview (English) -│ │ ├── customer.html # Single customer detail (English) -│ │ └── style.css # Dark theme matching felhom.eu -│ └── embed.go -├── configs/ -│ └── hub.yaml.example -├── Dockerfile -├── Makefile -└── go.mod +Add helper at the bottom of `router.go`: +```go +// limitBody wraps the request body with a size limit (default 1MB). +func limitBody(w http.ResponseWriter, req *http.Request) { + req.Body = http.MaxBytesReader(w, req.Body, 1<<20) // 1MB +} ``` -K8s manifests in `felhom.eu/manifests/` (alongside healthchecks.yaml, webpage.yaml, etc.): -``` -manifests/hub.yaml # Deployment, Service, Ingress, PVC +Then at the start of each handler that reads the body: +```go +func (r *Router) deployStack(w http.ResponseWriter, req *http.Request, name string) { + limitBody(w, req) + // ... existing json decode ... +} + +func (r *Router) updateOptionalConfig(w http.ResponseWriter, req *http.Request, name string) { + limitBody(w, req) + // ... 
+} + +func (r *Router) deleteStack(w http.ResponseWriter, req *http.Request, name string) { + limitBody(w, req) + // ... +} ``` -### Task 3.2: Hub API endpoints - -| Method | Path | Auth | Description | -|--------|------|------|-------------| -| `POST` | `/api/v1/report` | Bearer token | Receive customer report (JSON body) | -| `GET` | `/api/v1/customers` | Session/Basic | List all customers with latest status | -| `GET` | `/api/v1/customers/{id}` | Session/Basic | Get latest report for a customer | -| `GET` | `/api/v1/customers/{id}/history` | Session/Basic | Get report history (last 24h/7d/30d) | -| `GET` | `/` | Session/Basic | Dashboard HTML page | -| `GET` | `/customers/{id}` | Session/Basic | Customer detail HTML page | - -**Authentication:** -- Report ingest: Bearer token (shared secret per customer, or a single hub-wide key for simplicity) -- Dashboard: Basic auth or simple password (Viktor only) — reuse the same bcrypt approach as the controller - -### Task 3.3: Hub SQLite schema - -```sql -CREATE TABLE IF NOT EXISTS reports ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - customer_id TEXT NOT NULL, - received_at DATETIME NOT NULL DEFAULT (datetime('now')), - report_json TEXT NOT NULL, -- Full JSON payload - -- Denormalized fields for fast queries: - health_status TEXT, -- "ok", "warn", "fail" - cpu_percent REAL, - memory_percent REAL, - container_total INTEGER, - container_running INTEGER, - backup_last_snapshot DATETIME, - controller_version TEXT -); - -CREATE INDEX IF NOT EXISTS idx_reports_customer ON reports(customer_id, received_at DESC); - --- Prune old reports: keep 30 days of history --- Run daily: DELETE FROM reports WHERE received_at < datetime('now', '-30 days'); -``` - -### Task 3.4: Hub dashboard UI (English) - -**Overview page (`/`):** - -A table/grid showing all customers at a glance: - -| Customer | Status | Last seen | CPU | Memory | Disk | Containers | Last backup | Version | 
-|----------|--------|-----------|-----|--------|------|------------|-------------|---------| -| 🟢 Demo Ügyfél | OK | 2 min ago | 12% | 26% | 6%/13% | 14/16 | 3h ago | 0.6.0 | -| 🟡 Kovács Péter | WARN | 18 min ago | 45% | 78% | 82% ⚠️ | 8/8 | 4h ago | 0.5.4 | -| 🔴 Nagy Anna | DOWN | 2h ago | – | – | – | – | 26h ago ⚠️ | 0.5.4 | - -**Color coding:** -- 🟢 Green: last seen < 30 min AND health = "ok" -- 🟡 Yellow: last seen < 30 min AND health = "warn", OR last seen 30-60 min -- 🔴 Red: last seen > 60 min OR health = "fail" - -**Customer detail page (`/customers/{id}`):** - -- Last report timestamp -- Full system info section (same layout as controller's monitoring page) -- Container list with CPU/memory -- Backup status details -- Health issues/warnings -- Report history (collapsible list, last 24h) - -**Design:** English language. Dark theme matching felhom.eu / the controller dashboard. Use the same CSS variables and fonts. - -### Task 3.5: Hub Kubernetes manifests - -**File:** `felhom.eu/manifests/hub.yaml` (alongside `healthchecks.yaml`, `webpage.yaml`, etc.) 
-```yaml -# Namespace: felhom-system (shared with healthchecks and other felhom infra) -# Deployment: 1 replica, 64Mi-256Mi memory -# Service: ClusterIP port 8080 -# PVC: 1Gi for SQLite (Longhorn) -# Ingress: hub.felhom.eu via nginx-internal, cert-manager TLS -# Auth: same geo-restriction as other dooplex.hu services (HU only) -``` - -**ConfigMap** for `hub.yaml` config: -```yaml -auth: - password_hash: "" # bcrypt hash, same approach as controller -api: - report_api_key: "" # Bearer token for report ingest -retention: - max_days: 90 # Keep 90 days of report history - prune_schedule: "04:30" # Daily prune -alerting: - stale_threshold: "30m" # Alert if customer not seen for 30 min -``` - -### Task 3.6: Alerting (optional, future enhancement) - -When a customer is "stale" (no report for > 30 min), the hub could: -- Send a webhook to Healthchecks (one "customer-X-reporting" check per customer) -- Send email via Resend -- Push to Telegram - -For v0.6.0 scope: just show the status on the dashboard. Alerting can be added in v0.6.1. +**Verify:** Build succeeds. Normal deploy/config/delete still works (payloads are tiny). --- -## Part 4: Manual steps for Viktor (demo-felhom setup) +## Fix 7: Cache `time.LoadLocation` in template funcmap -These steps must be done by Viktor manually — Claude Code cannot access status.felhom.eu or the demo-felhom server. +**File:** `internal/web/funcmap.go` -### 4.1: Create Healthchecks checks on status.felhom.eu +**Problem:** At least 5 template functions call `time.LoadLocation("Europe/Budapest")` on every render. Go does not cache the result, so each call reloads and re-parses the time zone data. -1. Log into `status.felhom.eu` -2. Open the "demo-felhom" project -3. Create 5 checks with the settings from the table in Part 0 -4. 
Copy the ping UUIDs for each check +**Fix:** Load once at the top of `templateFuncMap` and capture in the closures: -### 4.2: Update controller.yaml on demo-felhom +```go +func (s *Server) templateFuncMap() template.FuncMap { + loc, err := time.LoadLocation("Europe/Budapest") + if err != nil { + loc = time.UTC + } -SSH into demo-felhom and update `/opt/docker/felhom-controller/controller.yaml`: - -```yaml -monitoring: - enabled: true - healthchecks_base: "https://status.felhom.eu" - ping_uuids: - heartbeat: "" - system_health: "" - db_dump: "" - backup: "" - backup_integrity: "" - system_health_interval: "5m" - health_check_schedule: "06:00" - thresholds: - disk_warn_percent: 80 - disk_crit_percent: 90 - backup_max_age_hours: 36 - cpu_warn_percent: 90 - memory_warn_percent: 85 - temperature_warn_celsius: 75 + return template.FuncMap{ + // ... in every function that currently calls time.LoadLocation, + // replace with the captured `loc` variable. + // Remove the per-function `loc, _ := time.LoadLocation(...)` lines. + // Example: + "timeAgo": func(t time.Time) string { + if t.IsZero() { return "–" } + now := time.Now().In(loc) + d := now.Sub(t.In(loc)) + // ... rest unchanged ... + }, + // Apply same pattern to: fmtTime, fmtTimeShort, nextRunLabel, nextPruneLabel + } +} ``` -### 4.3: Restart controller +There are 5 functions that need this change: `timeAgo`, `fmtTime`, `fmtTimeShort`, `nextRunLabel`, `nextPruneLabel`. -```bash -cd /opt/docker/felhom-controller -docker compose pull -docker compose up -d -docker logs -f felhom-controller --since 1m -``` - -### 4.4: Verify pings - -Wait 5 minutes, then check `status.felhom.eu` — all 5 checks should be green. 
- -### 4.5: Deploy hub to k3s (after Part 3 is built) - -```bash -# Build and push hub image (from felhom.eu repo, hub/ subfolder) -cd hub && make docker-push - -# Apply k8s manifests (from felhom.eu repo, manifests/ folder) -kubectl apply -f manifests/hub.yaml - -# Configure hub.felhom.eu DNS in Cloudflare -# Update demo-felhom controller.yaml with hub config -``` +**Verify:** Build succeeds. Dashboard/backup page timestamps still display correctly in Budapest time. --- -## Implementation order +## Post-fix checklist -1. **Part 1** (controller-side, in `deploy-felhom-compose` repo): - - Task 1.1: Heartbeat ping (5 min) - - Task 1.2: Backup integrity check (20 min) - - Task 1.3: Update example config (5 min) - - Task 1.4: Deprecation note for bash scripts (5 min) - -2. **Part 4.1–4.4** (Viktor manual: create checks, configure UUIDs, verify) - -3. **Part 2** (controller-side, report push): - - Task 2.1: Report payload types (10 min) - - Task 2.2: Report builder (30 min) - - Task 2.3: Report pusher (15 min) - - Task 2.4: Hub config in controller.yaml (10 min) - - Task 2.5: Wire into main.go (5 min) - -4. **Part 3** (hub in `felhom.eu` repo, k8s manifests in `homelab-manifests`): - - Task 3.1: Project scaffold in `hub/` subfolder (10 min) - - Task 3.2: API handlers (30 min) - - Task 3.3: SQLite store (20 min) - - Task 3.4: Dashboard UI — English (60 min) - - Task 3.5: K8s manifests in `felhom.eu/manifests/` (20 min) - -5. 
**Part 4.5** (Viktor manual: deploy hub, wire everything) - ---- - -## Files to modify (controller repo) - -``` -controller/cmd/controller/main.go — heartbeat job, integrity job, hub pusher -controller/internal/config/config.go — PingUUIDsConfig + HubConfig -controller/internal/backup/backup.go — RunIntegrityCheck() -controller/internal/backup/restic.go — Check() method (verify/add) -controller/internal/report/builder.go — NEW: report assembly -controller/internal/report/pusher.go — NEW: HTTP push client -controller/internal/report/types.go — NEW: Report struct definitions -controller/configs/controller.yaml.example — updated monitoring + new hub section -monitoring/DEPRECATED.md — NEW: deprecation notice for bash scripts -``` - -## Files to create (hub — in felhom.eu repo) - -``` -hub/cmd/hub/main.go -hub/internal/api/handler.go -hub/internal/store/store.go -hub/internal/web/server.go -hub/internal/web/templates/dashboard.html -hub/internal/web/templates/customer.html -hub/internal/web/templates/style.css -hub/internal/web/embed.go -hub/configs/hub.yaml.example -hub/Dockerfile -hub/Makefile -hub/go.mod -hub/README.md -``` - -## Files to create (k8s manifests — in felhom.eu repo) - -``` -manifests/hub.yaml -``` - ---- - -## Verification checklist - -- [ ] Heartbeat ping arrives every 5 min at status.felhom.eu -- [ ] System health ping arrives every 5 min with diagnostic body -- [ ] DB dump ping arrives daily at ~02:30 -- [ ] Backup ping arrives daily at ~03:00 -- [ ] Backup integrity ping arrives weekly on Sunday ~04:00 -- [ ] Stopping a protected container triggers system-health FAIL -- [ ] Controller logs show "Hub reporting enabled" when hub.enabled=true -- [ ] Hub receives JSON reports from controller -- [ ] Hub dashboard shows demo-felhom with green status -- [ ] Hub dashboard shows "last seen: X min ago" updating correctly -- [ ] Hub shows red status when controller is stopped for > 60 min -- [ ] Hub SQLite prunes old reports automatically -- [ ] All UUIDs 
are configurable (empty/CHANGEME = silently skipped) - ---- - -## CONTEXT.md update (after completion) - -Add to "What was just completed" section: - -``` -### What was just completed (session N) -- **v0.6.0 — Healthcheck Implementation + Central Push + Hub Dashboard:** - - **Healthcheck pings fully operational:** 5 check types (heartbeat, system-health, db-dump, backup, backup-integrity) configured on demo-felhom, all pinging status.felhom.eu - - **Backup integrity check:** Weekly `restic check` with Healthchecks ping - - **Central hub reporting:** Controller pushes JSON health summary every 15 min to hub.felhom.eu - - **felhom-hub service:** New Go service in felhom.eu repo (`hub/` subfolder), k8s manifests in `felhom.eu/manifests/hub.yaml`, deployed on k3s in felhom-system namespace, SQLite storage, English multi-customer dashboard - - **Deprecated:** Legacy bash monitoring scripts (backup-healthcheck.sh, monitoring-setup.sh) superseded by controller-native monitoring -``` - -Also update the repository distinction in CONTEXT.md: - -``` -## Repository & manifest layout - -- **homelab-manifests** — Viktor's personal k3s apps (*.dooplex.hu): mon-system, servarr, pihole, etc. -- **felhom.eu** — Everything felhom-related: - - `website/` — felhom.eu public website HTML - - `manifests/` — k8s manifests for felhom infra in felhom-system namespace (webpage, healthchecks, contact-mailer, umami, hub, felhom.secret) - - `hub/` — felhom-hub Go service (central multi-customer dashboard) -- **deploy-felhom-compose** — Customer-side: felhom-controller code, deploy scripts, monitoring scripts -- **app-catalog-felhom.eu** — Docker Compose templates for customer apps -``` \ No newline at end of file +1. `grep -rn 'NotFound(w, nil)' internal/` → 0 results +2. `grep -rn 'subtle.ConstantTimeCompare' internal/` → 0 results (unless used elsewhere) +3. `grep -rn 'time.Tick(' internal/` → 0 results +4. `grep -rn 'Secure:.*true' internal/web/auth.go` → 0 results (now dynamic) +5. 
Build: `go build ./cmd/controller/` succeeds with no errors +6. `go vet ./...` passes +7. Version bump in build to v0.6.1 +8. Commit, push, build, deploy, verify on demo-felhom.eu \ No newline at end of file
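
Item 4 of the checklist only proves the hardcoded `Secure: true` is gone; it does not show the flag is computed correctly. A minimal sketch of a behavioral check, assuming the `isSecure` expression from Fix 3 is factored into a hypothetical helper `isSecureRequest` (the helper, `secureFor`, and the `/login` path are illustrative assumptions, not existing code):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// isSecureRequest is a hypothetical helper wrapping the expression from Fix 3.
func isSecureRequest(r *http.Request) bool {
	return r.TLS != nil || r.Header.Get("X-Forwarded-Proto") == "https"
}

// secureFor builds a synthetic login request and reports whether the
// session cookie would be marked Secure for it.
func secureFor(directTLS bool, forwardedProto string) bool {
	req := httptest.NewRequest(http.MethodPost, "/login", nil)
	if directTLS {
		req.TLS = &tls.ConnectionState{} // simulate a direct TLS connection
	}
	if forwardedProto != "" {
		req.Header.Set("X-Forwarded-Proto", forwardedProto)
	}
	return isSecureRequest(req)
}

func main() {
	fmt.Println(secureFor(false, ""))      // plain HTTP on the LAN: false
	fmt.Println(secureFor(false, "https")) // behind a TLS-terminating proxy: true
	fmt.Println(secureFor(true, ""))       // direct TLS: true
}
```

Note that `X-Forwarded-Proto` is a client-controllable header when the controller is reached directly; here it only affects whether the client's own cookie is marked `Secure`, so trusting it is a reasonable trade-off for this fix.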