feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)

Replace external Healthchecks.io with Hub-native monitoring. New events
table + /api/v1/event endpoint for structured events from controllers.
Staleness checker (60s) detects unresponsive nodes. Backup deadline
checker (daily 05:00) catches missed backups. Notification dispatcher
sends operator (English) + customer (Hungarian) emails via Resend with
per-event cooldowns. Event timeline on customer page, dashboard badges.
Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 18:53:24 +01:00
parent b4cb92e09f
commit 3217cb4751
16 changed files with 1319 additions and 64 deletions
+41 -13
@@ -2,9 +2,9 @@
**Central operator dashboard for monitoring and managing Felhom customer deployments.**
A lightweight Go service that receives periodic reports from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery.
A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery, and provides event-based dead man's switch monitoring and notification dispatch.
**Current version: v0.2.2**
**Current version: v0.3.0**
---
@@ -72,14 +72,29 @@ The infra-backup payload contains everything needed to restore a customer deploy
4. Controller uses disk UUIDs to auto-mount surviving drives
5. Controller restores apps from local backups on those drives
### Events
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |
Events are the primary monitoring mechanism. Each event has: customer_id, event_type, severity, message, details_json, source. Per-customer API keys are validated against the customer_id in the payload. Stored in the `events` table with automatic pruning.
**Hub-generated events** (source="hub"):
- `node_stale` / `node_down` / `node_recovered` — dead man's switch from staleness checker (every 60s)
- `expected_backup_missed` / `expected_dbdump_missed` — backup deadline checker (daily at 05:00 Budapest)
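As a rough illustration of the event fields and the key check described above, the payload could be modelled as follows. This is a minimal sketch, not the Hub's actual code: the `Event` struct, the `validate` helper, the JSON tag names, and the sample values (`cust-1`, `source: "controller"`) are assumptions. The 27 allowed event types are not listed here, so the sketch does not check `event_type` against them.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Event mirrors the fields listed in the README. The JSON tag names are
// assumed to match those field names; the real wire format may differ.
type Event struct {
	CustomerID  string          `json:"customer_id"`
	EventType   string          `json:"event_type"`
	Severity    string          `json:"severity"`
	Message     string          `json:"message"`
	DetailsJSON json.RawMessage `json:"details_json,omitempty"`
	Source      string          `json:"source"`
}

// validate is a hypothetical helper. keyCustomerID is the customer that the
// presented API key belongs to; it must match the customer_id in the payload.
func validate(e Event, keyCustomerID string) error {
	if e.CustomerID == "" || e.EventType == "" {
		return errors.New("customer_id and event_type are required")
	}
	if e.CustomerID != keyCustomerID {
		return errors.New("API key does not belong to customer_id")
	}
	switch e.Severity {
	case "info", "warning", "error":
		return nil
	default:
		return fmt.Errorf("invalid severity %q", e.Severity)
	}
}

func main() {
	// Illustrative payload only; the values are made up for this sketch.
	raw := []byte(`{"customer_id":"cust-1","event_type":"backup_failed",` +
		`"severity":"error","message":"backup failed","source":"controller"}`)

	var e Event
	if err := json.Unmarshal(raw, &e); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("valid:", validate(e, "cust-1") == nil)
}
```

Using `json.RawMessage` for `details_json` keeps the details opaque, so the sketch does not have to assume a schema for them.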
### Notifications
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Controller sends event notification (backup_failed, disk_warning, etc.) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences |
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |
Notifications are sent via Resend.com email API.
Notifications are dispatched automatically when events are processed:
- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API
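The per-key cooldown described above can be sketched as a small gate keyed by `customer:eventType`. This is an assumption-laden illustration: the `cooldowns` type, `allow` method, and in-memory map are hypothetical, and the Hub may instead derive the last send time from `notification_log`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// operatorCooldown is the 1h operator-channel cooldown from the README.
const operatorCooldown = time.Hour

// epoch is an arbitrary fixed instant used by the example in main.
var epoch = time.Date(2026, 2, 20, 12, 0, 0, 0, time.UTC)

// cooldowns remembers when each "customer:eventType" key last sent.
// Hypothetical in-memory store; state is lost on restart.
type cooldowns struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func newCooldowns() *cooldowns {
	return &cooldowns{last: make(map[string]time.Time)}
}

// allow reports whether a notification may be sent at now, and records the
// send time if it may. A send is blocked while less than window has passed
// since the previous recorded send for the same key.
func (c *cooldowns) allow(customerID, eventType string, window time.Duration, now time.Time) bool {
	key := customerID + ":" + eventType
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.last[key]; ok && now.Sub(t) < window {
		return false
	}
	c.last[key] = now
	return true
}

func main() {
	c := newCooldowns()
	fmt.Println(c.allow("cust-1", "node_down", operatorCooldown, epoch))                     // true
	fmt.Println(c.allow("cust-1", "node_down", operatorCooldown, epoch.Add(30*time.Minute))) // false
	fmt.Println(c.allow("cust-1", "node_down", operatorCooldown, epoch.Add(61*time.Minute))) // true
}
```

The customer channel would use the same gate with the customer's own `cooldown_hours` (default 6h) as the window.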
### Customer Config Retrieval
@@ -93,7 +108,7 @@ Config retrieval uses a separate per-customer retrieval password (not the API ke
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required) |
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
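A health check that returns 503 when the database ping fails might look like the sketch below. The `pinger` interface, `healthz` constructor, and `fakeDB` stand-in are hypothetical names for illustration; `*sql.DB` satisfies `pinger` through its `PingContext` method.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pinger is the one method the handler needs. *sql.DB satisfies it.
type pinger interface {
	PingContext(ctx context.Context) error
}

// errPing is a sample failure used by the example in main.
var errPing = errors.New("ping failed")

// healthz returns 200 when the database answers a ping and 503 otherwise.
func healthz(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := db.PingContext(r.Context()); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// fakeDB stands in for the SQLite handle so the sketch runs without a database.
type fakeDB struct{ err error }

func (f fakeDB) PingContext(context.Context) error { return f.err }

// statusFor runs the handler once against db and returns the HTTP status code.
func statusFor(db pinger) int {
	rec := httptest.NewRecorder()
	healthz(db)(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(fakeDB{}))             // 200
	fmt.Println(statusFor(fakeDB{err: errPing})) // 503
}
```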
## Web Dashboard
@@ -101,13 +116,13 @@ Protected by bcrypt password + session cookie (7-day expiry).
### Pages
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status. Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Customers (`/configs`)** — Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
- **Unified Customer Detail (`/customers/{id}`)** — Single page per customer combining config management and live monitoring. Adapts content based on available data:
- **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, credentials, setup commands, YAML preview, controller update, notifications, history
- **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
- **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
- **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice
### Customer States
@@ -144,9 +159,10 @@ SQLite with WAL mode. Tables:
| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email + enabled event types per customer |
| `notification_log` | Send/skip/fail history for notifications |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |
Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.
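The prune schedule and retention cutoff can be sketched as two small time calculations. This is an illustration only: `nextRun`, `cutoff`, and `sample` are hypothetical names, and the Hub's scheduler may work differently. The blank `time/tzdata` import embeds the zone database so that `Europe/Budapest` resolves even on minimal container images.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed zone data so LoadLocation works without system tzdata
)

// sample is a fixed instant (18:53:24 +01:00 on 2026-02-20) for the example.
var sample = time.Date(2026, 2, 20, 17, 53, 24, 0, time.UTC)

// nextRun returns the first instant strictly after now that falls on
// hour:minute wall-clock time in loc.
func nextRun(now time.Time, hour, minute int, loc *time.Location) time.Time {
	local := now.In(loc)
	run := time.Date(local.Year(), local.Month(), local.Day(), hour, minute, 0, 0, loc)
	if !run.After(local) {
		// Today's slot has passed; time.Date normalizes the day overflow.
		run = time.Date(local.Year(), local.Month(), local.Day()+1, hour, minute, 0, 0, loc)
	}
	return run
}

// cutoff returns the instant before which rows are old enough to prune.
func cutoff(now time.Time, maxDays int) time.Time {
	return now.AddDate(0, 0, -maxDays)
}

func main() {
	loc, err := time.LoadLocation("Europe/Budapest")
	if err != nil {
		fmt.Println("load location:", err)
		return
	}
	fmt.Println("next prune:   ", nextRun(sample, 4, 30, loc))
	fmt.Println("delete before:", cutoff(sample, 90))
}
```

Computing the slot in the target location, rather than adding 24 hours to the previous run, keeps the job at 04:30 local time across daylight saving changes.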
@@ -164,6 +180,8 @@ api:
notifications:
resend_api_key: "" # Resend.com API key for email
from_email: "monitoring@felhom.eu"
operator_email: "" # Operator alert recipient
operator_enabled: true # Enable operator email notifications
retention:
max_days: 90
@@ -195,16 +213,26 @@ Runs on k3s (Kubernetes) in the `felhom-system` namespace:
```bash
# Build and push
cd hub/
make VERSION=0.2.2 docker docker-push
make VERSION=0.3.0 docker docker-push
# Deploy
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.2.2
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.3.0
kubectl rollout status -n felhom-system deploy/hub
# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```
## Background Services
| Service | Schedule | Description |
|---------|----------|-------------|
| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
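The staleness thresholds in the table above can be read as a small state machine over the age of the last report. The sketch below is an assumption about how that might be structured, not the Hub's implementation: `classify`, `transitionEvent`, and the state names `ok`/`stale`/`down` are hypothetical. Only the event type names and the 30 and 60 minute thresholds come from the table.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds from the Background Services table.
const (
	staleAfter = 30 * time.Minute
	downAfter  = 60 * time.Minute
)

// classify maps the age of the most recent report to a node state.
func classify(age time.Duration) string {
	switch {
	case age > downAfter:
		return "down"
	case age > staleAfter:
		return "stale"
	default:
		return "ok"
	}
}

// transitionEvent returns the event type to emit when a node moves from
// prev to next, or "" when the state is unchanged. Emitting only on a state
// change keeps a 60s checker from producing one event per tick.
func transitionEvent(prev, next string) string {
	if prev == next {
		return ""
	}
	switch next {
	case "stale":
		return "node_stale"
	case "down":
		return "node_down"
	case "ok":
		return "node_recovered"
	default:
		return ""
	}
}

func main() {
	// Simulated report ages seen on successive checks for a single node.
	ages := []time.Duration{5 * time.Minute, 45 * time.Minute, 90 * time.Minute, time.Minute}
	prev := "ok"
	for _, age := range ages {
		next := classify(age)
		if ev := transitionEvent(prev, next); ev != "" {
			fmt.Println(age, ev)
		}
		prev = next
	}
}
```

Running this prints one line each for `node_stale`, `node_down`, and `node_recovered`. The first check (5 minutes) emits nothing because the node is still `ok`.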
## Dependencies
- `golang.org/x/crypto` — bcrypt for password hashing