feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)

Replace external Healthchecks.io with Hub-native monitoring. New events table + /api/v1/event endpoint for structured events from controllers. Staleness checker (60s) detects unresponsive nodes. Backup deadline checker (daily 05:00) catches missed backups. Notification dispatcher sends operator (English) + customer (Hungarian) emails via Resend with per-event cooldowns. Event timeline on customer page, dashboard badges. Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 18:53:24 +01:00
parent b4cb92e09f
commit 3217cb4751
16 changed files with 1319 additions and 64 deletions
@@ -2,9 +2,9 @@

 **Central operator dashboard for monitoring and managing Felhom customer deployments.**

-A lightweight Go service that receives periodic reports from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery.
+A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. It also serves as the infrastructure backup store for disaster recovery, runs event-based dead man's switch monitoring, and dispatches notifications.

-**Current version: v0.2.2**
+**Current version: v0.3.0**

 ---

@@ -72,14 +72,29 @@ The infra-backup payload contains everything needed to restore a customer deploy
 4. Controller uses disk UUIDs to auto-mount surviving drives
 5. Controller restores apps from local backups on those drives

+### Events
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |
+
+Events are the primary monitoring mechanism. Each event carries `customer_id`, `event_type`, `severity`, `message`, `details_json`, and `source`. The per-customer API key is validated against the `customer_id` in the payload. Events are stored in the `events` table and pruned automatically.
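
As a rough illustration of the payload described above, the following Go sketch decodes an event and applies the checks the README mentions (severity set, allowed type, customer ID matching the API key). The struct layout, JSON tags, the `validateEvent` helper, the `"controller"` source value, and the two-entry `allowedTypes` map are all assumptions made for the example; the real Hub allows 27 types, which are not listed in this diff.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Event mirrors the fields named in the README. The JSON tags and Go types
// are assumptions; the Hub's actual struct may differ.
type Event struct {
	CustomerID string          `json:"customer_id"`
	EventType  string          `json:"event_type"`
	Severity   string          `json:"severity"`
	Message    string          `json:"message"`
	Details    json.RawMessage `json:"details_json,omitempty"`
	Source     string          `json:"source"`
}

// validSeverity holds the three severities named in the README.
var validSeverity = map[string]bool{"info": true, "warning": true, "error": true}

// allowedTypes is a placeholder subset for illustration only; the full list
// of 27 allowed types is not part of this document.
var allowedTypes = map[string]bool{"backup_failed": true, "disk_warning": true}

// validateEvent checks an event against the customer ID that the presented
// API key belongs to (keyCustomerID is a hypothetical parameter name).
func validateEvent(ev Event, keyCustomerID string) error {
	switch {
	case ev.CustomerID == "":
		return errors.New("customer_id is required")
	case ev.CustomerID != keyCustomerID:
		return errors.New("customer_id does not match API key")
	case !allowedTypes[ev.EventType]:
		return fmt.Errorf("event_type %q is not allowed", ev.EventType)
	case !validSeverity[ev.Severity]:
		return fmt.Errorf("severity %q is not allowed", ev.Severity)
	}
	return nil
}

func main() {
	raw := []byte(`{"customer_id":"acme","event_type":"backup_failed",` +
		`"severity":"error","message":"backup job failed","source":"controller"}`)
	var ev Event
	if err := json.Unmarshal(raw, &ev); err != nil {
		panic(err)
	}
	fmt.Println("key for acme: ", validateEvent(ev, "acme"))
	fmt.Println("key for other:", validateEvent(ev, "other"))
}
```

Checking the customer ID before the event type means a key presented for the wrong customer is rejected without revealing which event types exist.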
+
+**Hub-generated events** (source="hub"):
+- `node_stale` / `node_down` / `node_recovered` — dead man's switch from staleness checker (every 60s)
+- `expected_backup_missed` / `expected_dbdump_missed` — backup deadline checker (daily at 05:00 Budapest)
+
 ### Notifications

 | Method | Path | Description |
 |--------|------|-------------|
-| `POST` | `/api/v1/notify` | Controller sends event notification (backup_failed, disk_warning, etc.) |
-| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences |
+| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
+| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |

-Notifications are sent via Resend.com email API.
+Notifications are dispatched automatically when events are processed:
+- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
+- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
+- Email delivery via Resend.com API
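
The cooldown behaviour described above can be sketched as a small gate keyed by channel, customer, and event type. This is a minimal in-memory illustration only: the `cooldowns` type, its `allow` method, and the key format are hypothetical, and the Hub may well derive cooldowns from the `notification_log` table instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Cooldown windows taken from the README: 1h for the operator channel,
// 6h as the customer default (overridable via cooldown_hours).
const (
	operatorCooldown        = time.Hour
	defaultCustomerCooldown = 6 * time.Hour
)

// demoStart is an arbitrary fixed instant used by the demo below.
var demoStart = time.Date(2026, 2, 20, 12, 0, 0, 0, time.UTC)

// cooldowns remembers when each channel:customer:eventType key last sent.
type cooldowns struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
}

func newCooldowns() *cooldowns {
	return &cooldowns{lastSent: make(map[string]time.Time)}
}

// allow reports whether a notification may be sent at time now. When it
// returns true it also records now as the latest send time for the key.
func (c *cooldowns) allow(channel, customerID, eventType string, window time.Duration, now time.Time) bool {
	key := channel + ":" + customerID + ":" + eventType
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastSent[key]; ok && now.Sub(last) < window {
		return false // still inside the cooldown window: skip
	}
	c.lastSent[key] = now
	return true
}

func main() {
	c := newCooldowns()
	for _, offset := range []time.Duration{0, 30 * time.Minute, 61 * time.Minute} {
		now := demoStart.Add(offset)
		sent := c.allow("operator", "acme", "node_down", operatorCooldown, now)
		fmt.Printf("t+%v operator send=%v\n", offset, sent)
	}
}
```

In the demo, the attempt 30 minutes after the first send is suppressed, and the attempt at 61 minutes goes through. Because the channel is part of the key, an operator email never consumes the customer's cooldown for the same event.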

 ### Customer Config Retrieval

@@ -93,7 +108,7 @@ Config retrieval uses a separate per-customer retrieval password (not the API ke

 | Method | Path | Description |
 |--------|------|-------------|
-| `GET` | `/healthz` | Health check (no auth required) |
+| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
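
A health check of this kind could look roughly like the sketch below. It assumes the database handle is a `*sql.DB` (which provides `PingContext`); the `pinger` interface, `healthzHandler`, the 2-second timeout, and the `stubDB`/`statusOf` demo helpers are illustrative names, not the Hub's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the one method the handler needs; *sql.DB satisfies it.
type pinger interface {
	PingContext(ctx context.Context) error
}

// healthzHandler answers 200 when the database responds to a ping and
// 503 otherwise. The 2s timeout is an assumed value.
func healthzHandler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// stubDB stands in for a real database in the demo.
type stubDB struct{ err error }

func (s stubDB) PingContext(context.Context) error { return s.err }

var errDown = errors.New("database unreachable")

// statusOf runs a handler against an in-memory request and returns the
// HTTP status code it wrote (demo helper only).
func statusOf(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusOf(healthzHandler(stubDB{})))             // 200
	fmt.Println(statusOf(healthzHandler(stubDB{err: errDown}))) // 503
}
```

Returning 503 rather than 200 with an error body lets a Kubernetes readiness or liveness probe react to a broken database without parsing the response.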

 ## Web Dashboard

@@ -101,13 +116,13 @@ Protected by bcrypt password + session cookie (7-day expiry).

 ### Pages

-- **Dashboard (`/`)** — Fleet overview table showing all customers with live status. Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
+- **Dashboard (`/`)** — Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
 - **Customers (`/configs`)** — Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
 - **Unified Customer Detail (`/customers/{id}`)** — Single page per customer combining config management and live monitoring. Adapts content based on available data:
-  - **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, credentials, setup commands, YAML preview, controller update, notifications, history
+  - **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
  - **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
  - **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
-- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides
+- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice

 ### Customer States

@@ -144,9 +159,10 @@ SQLite with WAL mode. Tables:
 | Table | Purpose |
 |-------|---------|
 | `reports` | Full JSON reports with denormalized fields for dashboard queries |
+| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
 | `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
-| `customer_notifications` | Email + enabled event types per customer |
-| `notification_log` | Send/skip/fail history for notifications |
+| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
+| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
 | `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |

 Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.
@@ -164,6 +180,8 @@ api:
 notifications:
  resend_api_key: ""          # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
+  operator_email: ""          # Operator alert recipient
+  operator_enabled: true      # Enable operator email notifications

 retention:
  max_days: 90
@@ -195,16 +213,26 @@ Runs on k3s (Kubernetes) in the `felhom-system` namespace:
 ```bash
 # Build and push
 cd hub/
-make VERSION=0.2.2 docker docker-push
+make VERSION=0.3.0 docker docker-push

 # Deploy
-kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.2.2
+kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.3.0
 kubectl rollout status -n felhom-system deploy/hub

 # Check
 kubectl logs -n felhom-system -l app=hub --tail 20
 ```

+## Background Services
+
+| Service | Schedule | Description |
+|---------|----------|-------------|
+| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
+| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
+| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
+| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
+| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
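
The staleness thresholds in the table above can be sketched as a pure classification step plus a transition check, so that an event is emitted only when a node changes state and not on every 60-second tick. The `nodeState` type and the `classify` and `transitionEvent` functions are hypothetical names; only the thresholds and event names come from the table.

```go
package main

import (
	"fmt"
	"time"
)

type nodeState string

const (
	stateOK    nodeState = "ok"
	stateStale nodeState = "stale"
	stateDown  nodeState = "down"
)

// Thresholds from the Background Services table.
const (
	staleAfter = 30 * time.Minute
	downAfter  = 60 * time.Minute
)

// classify maps the time since the last report to a node state.
func classify(sinceLastReport time.Duration) nodeState {
	switch {
	case sinceLastReport > downAfter:
		return stateDown
	case sinceLastReport > staleAfter:
		return stateStale
	default:
		return stateOK
	}
}

// transitionEvent returns the event type to emit when a node moves from
// prev to next, or "" when nothing should be emitted.
func transitionEvent(prev, next nodeState) string {
	if prev == next {
		return ""
	}
	switch next {
	case stateDown:
		return "node_down"
	case stateOK:
		return "node_recovered"
	case stateStale:
		if prev == stateOK {
			return "node_stale"
		}
	}
	return ""
}

func main() {
	prev := stateOK
	ages := []time.Duration{10 * time.Minute, 45 * time.Minute, 90 * time.Minute, 5 * time.Minute}
	for _, age := range ages {
		next := classify(age)
		if ev := transitionEvent(prev, next); ev != "" {
			fmt.Printf("last report %v ago: emit %s\n", age, ev)
		}
		prev = next
	}
}
```

The demo walks one node through ok, stale, down, and back to ok, emitting `node_stale`, `node_down`, and `node_recovered` once each. The previous state would need to be persisted per customer for this to survive a Hub restart, which the sketch does not cover.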
+
 ## Dependencies

 - `golang.org/x/crypto` — bcrypt for password hashing