diff --git a/TASK.md b/TASK.md index d82d17c..32c71e4 100644 --- a/TASK.md +++ b/TASK.md @@ -1,351 +1,720 @@ -# TASK.md — v0.5.4: Monitoring Page Frontend Fixes +# TASK.md — v0.6.0: Healthcheck Implementation + Central Push + Multi-Customer Dashboard -> Version bump: **v0.5.4** -> Scope: Frontend-only — all changes in `monitoring.html` and `style.css` -> No Go code changes needed. +> **Version:** v0.6.0 +> **Depends on:** v0.5.4 (current) +> **Repo:** `deploy-felhom-compose` (controller/ subfolder) +> **Build:** `~/build/felhom-controller/build.sh 0.6.0 --push` +> **Deploy target:** demo-felhom.eu (N100) + k3s cluster (dooplex.hu) --- -## IMPORTANT: Build & Validation +## Context -Build must happen in `~/build/felhom-controller/`, NOT in the git repo: -```bash -cd ~/build/felhom-controller -git -C ~/git/deploy-felhom-compose pull -./build.sh 0.5.2 --push -``` +The controller already has health monitoring infrastructure built in v0.4.0: +- `internal/monitor/pinger.go` — Healthchecks.io-compatible HTTP ping client (success/fail/start, retries) +- `internal/monitor/healthcheck.go` — System health checks (disk, memory, CPU, temp, Docker, protected containers) +- Scheduler jobs in `main.go`: `system-health` (every 5m), `db-dump` (daily), `backup` (daily) +- Backup manager already calls `pinger.Ping()`/`pinger.Fail()` after each operation -**Never run `go build` inside `~/git/deploy-felhom-compose/controller/`.** +**Problem:** The demo-felhom Healthchecks project has **zero checks created** (screenshot confirms empty project at `status.felhom.eu/projects/.../checks/`). The `controller.yaml` on demo-felhom has all `CHANGEME` placeholder UUIDs. Nothing is actually pinging. -After deployment, validate all 4 fixes by: -1. Opening https://felhom.demo-felhom.eu/monitoring in browser -2. Opening the browser Developer Tools (F12) → Console tab -3. 
Checking each item below +Additionally, there are legacy bash scripts (`backup-healthcheck.sh`, `monitoring-setup.sh`) from the pre-controller era that duplicate functionality now built into the controller. These should be deprecated in favor of controller-native pings. -If you cannot access the browser, validate by reading the deployed HTML source: -```bash -ssh kisfenyo@192.168.0.162 "docker exec felhom-controller cat /app/templates/monitoring.html" | head -50 -``` +**This version has two major parts:** +1. **Prerequisite:** Get healthchecks actually working on demo-felhom (create checks, configure UUIDs, verify pings) +2. **New feature:** Central push from customer controllers to k3s + multi-customer overview dashboard --- -## Bug 1: Tooltip shows "Invalid Date" +## Part 0: Healthcheck Ping Design (controller.yaml schema update) -### Root cause +### Current ping types (already implemented in code) -The tooltip callback uses `items[0].parsed.x` which should return a numeric timestamp on a Chart.js linear axis. However, depending on the Chart.js version/build, `parsed.x` may return something unexpected (undefined, wrong type) causing `new Date()` to produce "Invalid Date". +| Ping | Schedule | Source | What it proves | +|------|----------|--------|----------------| +| `system_health` | Every 5 min | `monitor.RunHealthCheck()` | Server alive, Docker running, disks OK, protected containers up, CPU/mem/temp within thresholds | +| `db_dump` | Daily 02:30 | `backup.RunDBDumps()` | Database dumps completed successfully | +| `backup` | Daily 03:00 | `backup.RunBackup()` | Restic snapshot completed successfully | -### Diagnosis step +### New ping types to add -Before fixing, add a temporary console.log to confirm what `parsed.x` actually returns. 
In `monitoring.html`, in the tooltip callback inside `chartOpts()`:
+| Ping | Schedule | Source | What it proves |
+|------|----------|--------|----------------|
+| `backup_integrity` | Weekly (Sunday 04:00) | New: `backup.RunIntegrityCheck()` | Restic repo passes `restic check`: data is not corrupted |
+| `heartbeat` | Every 5 min | New: lightweight HTTP POST, no logic | Controller process is alive (distinct from `system_health`, which does heavy checks and could fail due to a bug while the controller itself is fine) |
-```javascript
-title: function(items) {
-    if (!items.length) return '';
-    console.log('[tooltip debug]', 'parsed.x:', items[0].parsed.x, typeof items[0].parsed.x, 'raw:', items[0].raw);
-    return formatTimestamp(items[0].parsed.x);
+### Revised `controller.yaml` monitoring section
+
+```yaml
+monitoring:
+  enabled: true
+  healthchecks_base: "https://status.felhom.eu"
+  ping_uuids:
+    heartbeat: ""          # NEW: every 5 min, controller alive
+    system_health: ""      # existing: every 5 min, comprehensive check
+    db_dump: ""            # existing: daily after db dumps
+    backup: ""             # existing: daily after restic snapshot
+    backup_integrity: ""   # NEW: weekly after restic check
+  system_health_interval: "5m"
+  health_check_schedule: "06:00"
+  thresholds:
+    disk_warn_percent: 80
+    disk_crit_percent: 90
+    backup_max_age_hours: 36
+    cpu_warn_percent: 90
+    memory_warn_percent: 85
+    temperature_warn_celsius: 75
+```
+
+> **Note:** Empty-string and "CHANGEME..." UUIDs are both skipped by the pinger (already implemented). Any check can therefore be left unconfigured, and the controller skips it silently.
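The skip rule in the note above can be pictured with a short sketch. It assumes the pinger treats an empty string and any value beginning with `CHANGEME` as unconfigured. `isConfigured` and the sample UUID are hypothetical names used only for illustration; the real check is whatever `internal/monitor/pinger.go` already implements.

```go
package main

import (
	"fmt"
	"strings"
)

// isConfigured reports whether a ping UUID should actually be pinged.
// Assumption: "" and any value starting with "CHANGEME" count as
// unconfigured; the controller's real rule may differ in detail.
func isConfigured(uuid string) bool {
	u := strings.TrimSpace(uuid)
	return u != "" && !strings.HasPrefix(u, "CHANGEME")
}

func main() {
	// Made-up values: one real-looking UUID, one empty, one placeholder.
	checks := []struct{ name, uuid string }{
		{"heartbeat", "11111111-2222-3333-4444-555555555555"},
		{"db_dump", ""},
		{"backup", "CHANGEME-backup-uuid"},
	}
	for _, c := range checks {
		if !isConfigured(c.uuid) {
			fmt.Printf("%s: skipped (not configured)\n", c.name)
			continue
		}
		fmt.Printf("%s: would ping\n", c.name)
	}
}
```

Keeping the rule in one helper means every scheduler job can call the pinger unconditionally, which matches how the heartbeat job in Task 1.1 is written.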
+ +### Healthchecks check configuration (to be created manually on status.felhom.eu) + +For each customer project, create these checks: + +| Check name | Period | Grace | Tags | +|-----------|--------|-------|------| +| `heartbeat` | 5 minutes | 10 minutes | `heartbeat` | +| `system-health` | 5 minutes | 10 minutes | `system`, `health` | +| `db-dump` | 1 day (02:30 CET) | 30 minutes | `backup`, `db` | +| `backup` | 1 day (03:00 CET) | 60 minutes | `backup`, `restic` | +| `backup-integrity` | 7 days | 24 hours | `backup`, `integrity` | + +--- + +## Part 1: Controller-side healthcheck implementation + +### Task 1.1: Add heartbeat ping + +**Files:** `cmd/controller/main.go` + +Add a new scheduler job — the simplest possible ping, no health check logic: + +```go +// Heartbeat — lightweight "I'm alive" signal +sched.Every("heartbeat", 5*time.Minute, func(ctx context.Context) error { + pinger.Ping(cfg.Monitoring.PingUUIDs.Heartbeat, "") + return nil +}) +``` + +**Files:** `internal/config/config.go` + +Add `Heartbeat` field to `PingUUIDsConfig`: + +```go +type PingUUIDsConfig struct { + Heartbeat string `yaml:"heartbeat"` + DBDump string `yaml:"db_dump"` + Backup string `yaml:"backup"` + SystemHealth string `yaml:"system_health"` + BackupIntegrity string `yaml:"backup_integrity"` // new } ``` -Deploy, hover over a data point, check browser console. 
Possible findings: -- `parsed.x` is `undefined` → Chart.js isn't finding the x value from `{x,y}` data -- `parsed.x` is a very small number (like an index) → linear scale isn't applied -- `parsed.x` is correct ms timestamp → bug is in `formatTimestamp` +### Task 1.2: Add backup integrity check -### Fix +**Files:** `internal/backup/restic.go` -Replace the tooltip callback with a more robust approach that accesses the raw data point directly: +Add a `Check()` method (may already exist as part of prune logic — verify first): -```javascript -callbacks: { - title: function(items) { - if (!items.length) return ''; - // Access raw {x, y} data point directly — most reliable across Chart.js versions - var raw = items[0].raw; - if (raw && typeof raw === 'object' && raw.x) { - return formatTimestamp(raw.x); - } - // Fallback: try parsed.x - if (items[0].parsed && items[0].parsed.x) { - return formatTimestamp(items[0].parsed.x); - } - return ''; +```go +// Check runs `restic check` to verify repository integrity. +func (r *ResticRunner) Check() error { + args := []string{"check", "--repo", r.repo, "--json"} + // ... standard exec with password file, timeout 30 min +} +``` + +**Files:** `internal/backup/backup.go` + +Add `RunIntegrityCheck()`: + +```go +// RunIntegrityCheck runs restic check and pings healthchecks with the result. +func (m *Manager) RunIntegrityCheck(ctx context.Context) error { + err := m.restic.Check() + uuid := m.cfg.Monitoring.PingUUIDs.BackupIntegrity + if err != nil { + m.pinger.Fail(uuid, fmt.Sprintf("restic check failed: %v", err)) + return err } + m.pinger.Ping(uuid, "restic check passed") + return nil } ``` -After deploying and verifying, remove the console.log line. +**Files:** `cmd/controller/main.go` -### Verification -- Hover over any data point on any chart → tooltip title shows formatted date like "2026. 02. 16. 
11:30" -- Verify on CPU, Memory, Temperature, Load charts -- Verify on container detail charts too (same `chartOpts` function is shared) +Register the weekly job: ---- +```go +if cfg.Backup.Enabled && backupMgr != nil { + // ... existing daily jobs ... -## Bug 2: Charts fill full width regardless of data density - -### Root cause - -`setChartXBounds()` sets `chart.options.scales.x.min/max` after chart initialization. Chart.js may not pick up dynamically added `min`/`max` properties if they weren't present in the options during initialization. The scale was created without `min`/`max`, and adding them at runtime may be ignored. - -### Diagnosis step - -Add console.log in `loadSystemMetrics()` after setting bounds and updating: - -```javascript -allCharts.forEach(function(c) { setChartXBounds(c, systemRange); }); -updateLineChart(chartCPU, timestamps, d.cpu); -console.log('[bounds debug] range:', systemRange, - 'options.min:', chartCPU.options.scales.x.min, - 'options.max:', chartCPU.options.scales.x.max, - 'scale.min:', chartCPU.scales.x.min, - 'scale.max:', chartCPU.scales.x.max); -``` - -Select "7 nap", check console. If `options.min/max` are set correctly but `scales.x.min/max` show the data extent, then Chart.js is ignoring the runtime-added properties. - -### Fix - -Include `min` and `max` in the initial chart options so Chart.js registers them from creation. Then dynamic updates work. 
- -**Step 1**: Modify `chartOpts()` to include initial min/max: - -```javascript -function chartOpts(yLabel, beginAtZero) { - var now = Date.now(); - var defaultRangeMs = parseRangeMs('1h'); // match default systemRange - return { - responsive: true, - maintainAspectRatio: false, - animation: {duration: 300}, - plugins: { - legend: {display: false}, - tooltip: { - backgroundColor: '#1c2128', - titleColor: '#e6edf3', - bodyColor: '#8b949e', - borderColor: '#30363d', - borderWidth: 1, - callbacks: { - title: function(items) { - if (!items.length) return ''; - var raw = items[0].raw; - if (raw && typeof raw === 'object' && raw.x) { - return formatTimestamp(raw.x); - } - if (items[0].parsed && items[0].parsed.x) { - return formatTimestamp(items[0].parsed.x); - } - return ''; - } - } - } - }, - scales: { - x: { - type: 'linear', - min: now - defaultRangeMs, - max: now, - grid: {color: 'rgba(48,54,61,0.5)'}, - ticks: { - color: '#8b949e', - maxTicksLimit: 8, - callback: function(v) { - return formatTimeLabel(v); - } - } - }, - y: { - grid: {color: 'rgba(48,54,61,0.5)'}, - ticks: {color: '#8b949e'}, - beginAtZero: beginAtZero !== false, - title: {display: !!yLabel, text: yLabel || '', color: '#6e7681', font: {size: 11}} - } + // Weekly integrity check — Sunday 04:00 + sched.Daily("backup-integrity", "04:00", func(ctx context.Context) error { + if time.Now().Weekday() != time.Sunday { + return nil // skip non-Sundays } - }; + return backupMgr.RunIntegrityCheck(ctx) + }) } ``` -Key change: `min: now - defaultRangeMs, max: now` are present from creation. +> **Note on scheduler:** `Daily()` fires every day at the given time. To make it weekly, check the weekday inside the function. If you prefer, add a `Weekly()` method to the scheduler — but the weekday check is simpler and consistent with how prune already works. -**Step 2**: `setChartXBounds()` stays the same — it updates existing properties. 
+### Task 1.3: Update example config -**Step 3**: Same fix for container detail charts — `initDetailCharts()` uses the same `chartOpts()` so it gets min/max automatically. +**Files:** `controller/configs/controller.yaml.example` -### Verification -- Select "7 nap" → x-axis spans 7 full days (Feb 9 to Feb 16), data appears as a small cluster on the far right -- Select "1 óra" → data fills most of the chart width -- Select "24 óra" → data fills proportional to collection time -- X-axis labels for 7d show dates (02.09 .. 02.16), not times -- X-axis labels for 1h/6h/24h show times (10:00, 11:00, etc.) +Update the `monitoring.ping_uuids` section to include `heartbeat` and `backup_integrity` fields. Add comments explaining each. + +### Task 1.4: Deprecation note for bash monitoring scripts + +The following files in `deploy-felhom-compose/monitoring/` are **superseded** by the controller's built-in monitoring: + +- `backup-healthcheck.sh` → replaced by `internal/monitor/healthcheck.go` (scheduler: `system-health`) +- `monitoring-setup.sh` → no longer needed (controller reads `controller.yaml` directly) +- `monitoring.conf.template` → replaced by `controller.yaml` monitoring section +- `backup-healthcheck.service` / `.timer` → replaced by controller's scheduler + +**Action:** Add a `DEPRECATED.md` in `deploy-felhom-compose/monitoring/` explaining that these scripts are kept for reference only and should not be used on nodes running felhom-controller v0.4.0+. Do NOT delete the files yet — they may be needed if a customer is still on a pre-controller setup. + +### Verification (Part 1) + +After building and deploying v0.6.0 to demo-felhom: + +1. Check controller logs: `docker logs felhom-controller --since 5m | grep -i "ping\|health\|heartbeat"` +2. Verify pings arrive at `status.felhom.eu` — all 5 checks should show green within 10 minutes +3. Test failure: `docker stop traefik`, wait 5 min, check that `system-health` goes red (protected container missing) +4. 
Restart traefik: `docker start traefik`, verify recovery --- -## Bug 3: System overview values not consistently right-aligned +## Part 2: Central push to k3s (customer → operator reporting) -### Root cause +### Architecture -`.sysinfo-row` uses `display: flex; justify-content: space-between` which does push values to the right of each cell. But `.sysinfo-grid` uses `repeat(auto-fill, minmax(280px, 1fr))` which creates varying cell widths — values don't align to a consistent edge across columns. +``` +┌─────────────────────────┐ HTTPS POST /api/v1/report +│ Customer controller │────────────────────────────────────────┐ +│ (demo-felhom.eu) │ every 15 min (configurable) │ +└─────────────────────────┘ ▼ + ┌─────────────────────────────┐ +┌─────────────────────────┐ HTTPS POST │ felhom-hub │ +│ Customer controller │────────────────────────▶│ (k3s pod on dooplex.hu) │ +│ (customer-2) │ │ │ +└─────────────────────────┘ │ - Receives reports │ + │ - Stores in SQLite │ + │ - Serves dashboard │ + │ - Alerts on stale reports │ + └─────────────────────────────┘ + hub.felhom.eu +``` -The ` ``` -The mobile rule `@media(max-width: 768px) { .sysinfo-grid { grid-template-columns: 1fr; } }` already exists and stays — collapses to single column on mobile. +This function should call existing methods — **do not duplicate logic**. Use the same data sources the dashboard and monitoring page already use. -### Verification -- Values are consistently right-aligned within each cell -- "Debian GNU/Linux 13 (trixie)" and "6.12.69+deb13-amd64" align to the right edge -- Both grid columns have equal width -- Long values wrap without breaking layout +### Task 2.3: Implement report pusher in the controller + +**New file:** `controller/internal/report/pusher.go` + +```go +package report + +// Pusher sends reports to the central hub. +type Pusher struct { + hubURL string + apiKey string + httpClient *http.Client + logger *log.Logger + enabled bool +} + +// Push sends a report to the hub. 
Retries up to 3 times with a 5s backoff.
+// Always returns nil: push failures are logged, never propagated,
+// so they cannot affect controller operation.
+func (p *Pusher) Push(report *Report) error {
+    // JSON marshal
+    // POST to hubURL + "/api/v1/report"
+    // Header: Authorization: Bearer
+    // Header: Content-Type: application/json
+    // Retry on failure
+    // Log but don't propagate errors
+    return nil
+}
+```
+
+### Task 2.4: Add hub configuration to controller.yaml
+
+**Files:** `internal/config/config.go`, `controller/configs/controller.yaml.example`
+
+```yaml
+# --- Central hub (operator dashboard) ---
+hub:
+  enabled: false                  # Enable central reporting
+  url: "https://hub.felhom.eu"    # Hub API endpoint
+  api_key: ""                     # Shared secret for authentication
+  push_interval: "15m"            # How often to push reports
+```
+
+```go
+type HubConfig struct {
+    Enabled      bool   `yaml:"enabled"`
+    URL          string `yaml:"url"`
+    APIKey       string `yaml:"api_key"`
+    PushInterval string `yaml:"push_interval"`
+}
+```
+
+Add `Hub HubConfig `yaml:"hub"`` to the main `Config` struct.
+
+### Task 2.5: Wire the pusher into main.go
+
+```go
+// --- Central hub reporting ---
+if cfg.Hub.Enabled && cfg.Hub.URL != "" {
+    pushInterval, err := time.ParseDuration(cfg.Hub.PushInterval)
+    if err != nil {
+        pushInterval = 15 * time.Minute // fall back to the default on a bad value
+    }
+    pusher := report.NewPusher(&cfg.Hub, logger)
+    sched.Every("hub-report", pushInterval, func(ctx context.Context) error {
+        r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, pinger, version)
+        return pusher.Push(r)
+    })
+    logger.Printf("[INFO] Hub reporting enabled (every %s to %s)", pushInterval, cfg.Hub.URL)
+}
+```
+
+### Verification (Part 2)
+
+1. Set `hub.enabled: true` and `hub.url` to a temporary endpoint (e.g., `https://webhook.site/...`) in demo-felhom's `controller.yaml`
+2. Restart controller, check logs for "Hub reporting enabled"
+3. Wait 15 min (or set `push_interval: "1m"` for testing), verify JSON arrives at the endpoint
+4. 
Validate JSON structure matches the spec above
+5. Reset `push_interval` to `"15m"` after testing
 ---
-## Bug 4: Charts overflow their container on mobile
+## Part 3: Hub service on k3s (operator side)
-### Root cause
+### Overview
-`.chart-wrap` has `position: relative; height: 180px` but no overflow or width constraint. CSS grid children default to `min-width: auto`, preventing them from shrinking below their content width. Chart.js canvas may render wider than the parent on narrow screens.
+The hub is a lightweight Go service deployed on Viktor's k3s cluster in the `felhom-system` namespace. It receives reports from customer controllers, stores them in SQLite, and serves an English-language dashboard for Viktor.
-### Fix
+**Domain:** `hub.felhom.eu` (Nginx Ingress, cert-manager TLS)
+**Namespace:** `felhom-system` (alongside Healthchecks and other felhom infra)
+**Code:** `felhom.eu` repo on Gitea, `hub/` subfolder
-**In `style.css`**, update these rules:
+### Task 3.1: Hub service (subfolder in felhom.eu repository)
-```css
-.chart-box {
-    background: var(--bg-secondary);
-    border-radius: 8px;
-    padding: .75rem;
-    border: 1px solid rgba(48, 54, 61, 0.5);
-    min-width: 0; /* Allow grid children to shrink: critical fix */
-    overflow: hidden;
-}
-.chart-wrap {
-    position: relative;
-    height: 180px;
-    overflow: hidden;
-    max-width: 100%;
-}
-.chart-wrap canvas {
-    max-width: 100%;
-}
-.chart-wrap-bar {
-    position: relative;
-    height: 250px;
-    overflow: hidden;
-    max-width: 100%;
-}
+The hub lives in the existing `felhom.eu` repository on Gitea as a `hub/` subfolder. It is deployed to the k3s cluster in the `felhom-system` namespace (alongside Healthchecks and other felhom infra). K8s manifests go in the same repository's `manifests/` folder.
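For orientation, the ingest side of the hub might look roughly like the sketch below: a handler for `POST /api/v1/report` that checks the Bearer token sent by the controller's pusher and rejects everything else. The `report` struct with a single `customer_id` field, the names `ingestHandler` and `post`, and the chosen status codes are assumptions made for illustration, not the hub's actual API.

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// report is a hypothetical minimal payload; the real schema is larger.
type report struct {
	CustomerID string `json:"customer_id"`
}

// ingestHandler accepts POST requests that carry "Authorization: Bearer <apiKey>"
// and a JSON body with a non-empty customer_id, then hands the report to save.
func ingestHandler(apiKey string, save func(report) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		auth := r.Header.Get("Authorization")
		token := strings.TrimPrefix(auth, "Bearer ")
		// Constant-time comparison avoids leaking the key through timing.
		if !strings.HasPrefix(auth, "Bearer ") ||
			subtle.ConstantTimeCompare([]byte(token), []byte(apiKey)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		var rep report
		if err := json.NewDecoder(r.Body).Decode(&rep); err != nil || rep.CustomerID == "" {
			http.Error(w, "bad report", http.StatusBadRequest)
			return
		}
		if err := save(rep); err != nil {
			http.Error(w, "store failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// post sends one in-memory request to the handler and returns the status code.
func post(h http.Handler, auth, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/report", strings.NewReader(body))
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := ingestHandler("secret", func(report) error { return nil })
	fmt.Println("valid key:", post(h, "Bearer secret", `{"customer_id":"demo"}`))
	fmt.Println("wrong key:", post(h, "Bearer nope", `{"customer_id":"demo"}`))
}
```

Passing `save` in as a function keeps the handler independent of the SQLite store, so it can be exercised with `httptest` before the store exists.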
+ +**Structure (inside felhom.eu repo):** + +``` +hub/ +├── cmd/hub/main.go # Entry point +├── internal/ +│ ├── api/ +│ │ └── handler.go # POST /api/v1/report, GET /api/v1/customers +│ ├── store/ +│ │ └── store.go # SQLite: save reports, query latest per customer +│ └── web/ +│ ├── server.go # Dashboard HTTP server +│ ├── templates/ +│ │ ├── dashboard.html # Multi-customer overview (English) +│ │ ├── customer.html # Single customer detail (English) +│ │ └── style.css # Dark theme matching felhom.eu +│ └── embed.go +├── configs/ +│ └── hub.yaml.example +├── Dockerfile +├── Makefile +└── go.mod ``` -Also add `.chart-box-half` update: -```css -.chart-box-half { - flex: 1; - min-width: 0; /* Same fix for flex containers */ -} +K8s manifests in `felhom.eu/manifests/` (alongside healthchecks.yaml, webpage.yaml, etc.): +``` +manifests/hub.yaml # Deployment, Service, Ingress, PVC ``` -Key additions: -- `min-width: 0` on `.chart-box` — **the critical CSS grid fix**: prevents grid children from forcing the grid wider than the viewport -- `overflow: hidden` on `.chart-wrap` and `.chart-wrap-bar` — clips any canvas overflow -- `max-width: 100%` on `.chart-wrap` and canvas -- `min-width: 0` on `.chart-box-half` — same fix for the flex-based container charts +### Task 3.2: Hub API endpoints -### Verification -- Open monitoring page at 375px width (browser devtools responsive mode) -- All four system metric charts fit within the screen -- Container bar charts fit within the screen -- No horizontal scrollbar appears -- Charts remain interactive (hover/click works) +| Method | Path | Auth | Description | +|--------|------|------|-------------| +| `POST` | `/api/v1/report` | Bearer token | Receive customer report (JSON body) | +| `GET` | `/api/v1/customers` | Session/Basic | List all customers with latest status | +| `GET` | `/api/v1/customers/{id}` | Session/Basic | Get latest report for a customer | +| `GET` | `/api/v1/customers/{id}/history` | Session/Basic | Get report history 
(last 24h/7d/30d) | +| `GET` | `/` | Session/Basic | Dashboard HTML page | +| `GET` | `/customers/{id}` | Session/Basic | Customer detail HTML page | + +**Authentication:** +- Report ingest: Bearer token (shared secret per customer, or a single hub-wide key for simplicity) +- Dashboard: Basic auth or simple password (Viktor only) — reuse the same bcrypt approach as the controller + +### Task 3.3: Hub SQLite schema + +```sql +CREATE TABLE IF NOT EXISTS reports ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + customer_id TEXT NOT NULL, + received_at DATETIME NOT NULL DEFAULT (datetime('now')), + report_json TEXT NOT NULL, -- Full JSON payload + -- Denormalized fields for fast queries: + health_status TEXT, -- "ok", "warn", "fail" + cpu_percent REAL, + memory_percent REAL, + container_total INTEGER, + container_running INTEGER, + backup_last_snapshot DATETIME, + controller_version TEXT +); + +CREATE INDEX IF NOT EXISTS idx_reports_customer ON reports(customer_id, received_at DESC); + +-- Prune old reports: keep 30 days of history +-- Run daily: DELETE FROM reports WHERE received_at < datetime('now', '-30 days'); +``` + +### Task 3.4: Hub dashboard UI (English) + +**Overview page (`/`):** + +A table/grid showing all customers at a glance: + +| Customer | Status | Last seen | CPU | Memory | Disk | Containers | Last backup | Version | +|----------|--------|-----------|-----|--------|------|------------|-------------|---------| +| 🟢 Demo Ügyfél | OK | 2 min ago | 12% | 26% | 6%/13% | 14/16 | 3h ago | 0.6.0 | +| 🟡 Kovács Péter | WARN | 18 min ago | 45% | 78% | 82% ⚠️ | 8/8 | 4h ago | 0.5.4 | +| 🔴 Nagy Anna | DOWN | 2h ago | – | – | – | – | 26h ago ⚠️ | 0.5.4 | + +**Color coding:** +- 🟢 Green: last seen < 30 min AND health = "ok" +- 🟡 Yellow: last seen < 30 min AND health = "warn", OR last seen 30-60 min +- 🔴 Red: last seen > 60 min OR health = "fail" + +**Customer detail page (`/customers/{id}`):** + +- Last report timestamp +- Full system info section (same layout as 
controller's monitoring page) +- Container list with CPU/memory +- Backup status details +- Health issues/warnings +- Report history (collapsible list, last 24h) + +**Design:** English language. Dark theme matching felhom.eu / the controller dashboard. Use the same CSS variables and fonts. + +### Task 3.5: Hub Kubernetes manifests + +**File:** `felhom.eu/manifests/hub.yaml` (alongside `healthchecks.yaml`, `webpage.yaml`, etc.) + +```yaml +# Namespace: felhom-system (shared with healthchecks and other felhom infra) +# Deployment: 1 replica, 64Mi-256Mi memory +# Service: ClusterIP port 8080 +# PVC: 1Gi for SQLite (Longhorn) +# Ingress: hub.felhom.eu via nginx-internal, cert-manager TLS +# Auth: same geo-restriction as other dooplex.hu services (HU only) +``` + +**ConfigMap** for `hub.yaml` config: +```yaml +auth: + password_hash: "" # bcrypt hash, same approach as controller +api: + report_api_key: "" # Bearer token for report ingest +retention: + max_days: 90 # Keep 90 days of report history + prune_schedule: "04:30" # Daily prune +alerting: + stale_threshold: "30m" # Alert if customer not seen for 30 min +``` + +### Task 3.6: Alerting (optional, future enhancement) + +When a customer is "stale" (no report for > 30 min), the hub could: +- Send a webhook to Healthchecks (one "customer-X-reporting" check per customer) +- Send email via Resend +- Push to Telegram + +For v0.6.0 scope: just show the status on the dashboard. Alerting can be added in v0.6.1. + +--- + +## Part 4: Manual steps for Viktor (demo-felhom setup) + +These steps must be done by Viktor manually — Claude Code cannot access status.felhom.eu or the demo-felhom server. + +### 4.1: Create Healthchecks checks on status.felhom.eu + +1. Log into `status.felhom.eu` +2. Open the "demo-felhom" project +3. Create 5 checks with the settings from the table in Part 0 +4. 
Copy the ping UUIDs for each check
+
+### 4.2: Update controller.yaml on demo-felhom
+
+SSH into demo-felhom and update `/opt/docker/felhom-controller/controller.yaml`:
+
+```yaml
+monitoring:
+  enabled: true
+  healthchecks_base: "https://status.felhom.eu"
+  ping_uuids:
+    heartbeat: ""
+    system_health: ""
+    db_dump: ""
+    backup: ""
+    backup_integrity: ""
+  system_health_interval: "5m"
+  health_check_schedule: "06:00"
+  thresholds:
+    disk_warn_percent: 80
+    disk_crit_percent: 90
+    backup_max_age_hours: 36
+    cpu_warn_percent: 90
+    memory_warn_percent: 85
+    temperature_warn_celsius: 75
+```
+
+### 4.3: Restart controller
+
+```bash
+cd /opt/docker/felhom-controller
+docker compose pull
+docker compose up -d
+docker logs -f felhom-controller --since 1m
+```
+
+### 4.4: Verify pings
+
+Wait 5 minutes, then check `status.felhom.eu`: `heartbeat` and `system-health` should be green. `db-dump`, `backup`, and `backup-integrity` turn green only after their first scheduled run.
+
+### 4.5: Deploy hub to k3s (after Part 3 is built)
+
+```bash
+# Build and push hub image (from felhom.eu repo, hub/ subfolder)
+cd hub && make docker-push
+
+# Apply k8s manifests (from felhom.eu repo, manifests/ folder)
+kubectl apply -f manifests/hub.yaml
+
+# Configure hub.felhom.eu DNS in Cloudflare
+# Update demo-felhom controller.yaml with hub config
+```
 ---
 ## Implementation order
-1. Edit `style.css`: sysinfo alignment + chart overflow fixes
-2. Edit `monitoring.html`: remove `