feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)

Replace external Healthchecks.io with Hub-native monitoring. New events
table + /api/v1/event endpoint for structured events from controllers.
Staleness checker (60s) detects unresponsive nodes. Backup deadline
checker (daily 05:00) catches missed backups. Notification dispatcher
sends operator (English) + customer (Hungarian) emails via Resend with
per-event cooldowns. Event timeline on customer page, dashboard badges.
Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 18:53:24 +01:00
parent b4cb92e09f
commit 3217cb4751
16 changed files with 1319 additions and 64 deletions
@@ -1,5 +1,55 @@
# Felhom Hub — Changelog
## v0.3.0 (2026-02-20)
**Hub Monitoring Takeover — Event System, Dead Man's Switch, Notifications**
Replaces external Healthchecks.io with a Hub-native event system. The Hub becomes the single source of truth for all customer monitoring, event tracking, dead man's switch alerting, and notification delivery.
### Phase 1 — Event System
- **`events` table** in SQLite: stores all events with customer_id, event_type, severity, message, details_json, source, timestamp
- **Indexes**: `idx_events_customer_created` (customer + time DESC), `idx_events_type` (type + time DESC)
- **Store methods**: `SaveEvent`, `GetRecentEvents`, `GetEventsByType`, `GetLatestEventByType`, `GetAllRecentEvents`, `CountEventsBySeverity`, `PruneEvents`, `GetActiveCustomerIDs`
- **`POST /api/v1/event`** endpoint: accepts structured events from controllers, validates event_type against 27 allowed types, validates severity (info/warning/error), stores in DB
- **Enhanced auth**: `checkAuthCustomer()` validates that a per-customer API key matches the customer_id in the payload; the global key bypasses the ownership check
- **Prune**: events pruned alongside reports at 04:30 Budapest time
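
As a rough illustration of the validation step described for `POST /api/v1/event`, a minimal sketch might look like the following. The `Event` struct, its JSON tags, the `validateEvent` name, and the non-empty `customer_id` check are assumptions made for this example. The allow-list holds only a few type names that appear in this changelog, not the full set of 27, and whether the endpoint accepts each of them is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Event mirrors some of the columns listed for the events table.
// The Go field names and JSON tags are assumptions for illustration.
type Event struct {
	CustomerID string `json:"customer_id"`
	EventType  string `json:"event_type"`
	Severity   string `json:"severity"`
	Message    string `json:"message"`
}

// allowedTypes is a partial, hypothetical allow-list built from type
// names mentioned in this changelog; the real list has 27 entries.
var allowedTypes = map[string]bool{
	"node_stale": true, "node_down": true, "node_recovered": true,
	"backup_completed": true, "db_dump_completed": true,
	"expected_backup_missed": true, "expected_dbdump_missed": true,
	"test": true,
}

// allowedSeverities lists the three severities named in the changelog.
var allowedSeverities = map[string]bool{"info": true, "warning": true, "error": true}

// validateEvent rejects events with a missing customer, an unknown
// event type, or a severity outside info/warning/error.
func validateEvent(e Event) error {
	if e.CustomerID == "" {
		return errors.New("customer_id is required") // assumed rule
	}
	if !allowedTypes[e.EventType] {
		return fmt.Errorf("unknown event_type %q", e.EventType)
	}
	if !allowedSeverities[e.Severity] {
		return fmt.Errorf("invalid severity %q", e.Severity)
	}
	return nil
}

func main() {
	ok := Event{CustomerID: "demo", EventType: "backup_completed", Severity: "info"}
	bad := Event{CustomerID: "demo", EventType: "backup_completed", Severity: "critical"}
	fmt.Println("ok event:", validateEvent(ok))
	fmt.Println("bad event:", validateEvent(bad))
}
```

Using maps as allow-lists keeps the check a single lookup per field and makes adding a new event type a one-line change.
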
### Phase 2 — Dead Man's Switch
- **Staleness checker** (`internal/monitor/staleness.go`): runs every 60s, detects when controllers stop reporting
  - ok→stale (>30min): inserts `node_stale` warning event
  - any→down (>60min): inserts `node_down` error event
  - stale/down→ok: inserts `node_recovered` info event
  - Skips blocked customers, no false alerts on startup
- **Backup deadline checker** (`internal/monitor/deadline.go`): runs daily at 05:00 Budapest
  - Detects missing `backup_completed` events since midnight → inserts `expected_backup_missed` error
  - Detects missing `db_dump_completed` events → inserts `expected_dbdump_missed` error
  - Grace: skips customers with `node_down` state
- **`scheduleDaily()`** helper: goroutine that sleeps until target time (Europe/Budapest), runs function, loops
- **`/healthz`** enhanced: returns 503 if SQLite Ping fails
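
The staleness transitions listed above can be sketched as a small state machine. This is an illustration only, not the contents of `staleness.go`: the `classify` and `transitionEvent` names and the string state values are hypothetical. Treating the first observation of a node as a silent baseline is just one assumed way to avoid alerts on startup.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds taken from the changelog entry.
const (
	staleAfter = 30 * time.Minute
	downAfter  = 60 * time.Minute
)

// classify maps the time since a controller's last report to a state.
func classify(silence time.Duration) string {
	switch {
	case silence > downAfter:
		return "down"
	case silence > staleAfter:
		return "stale"
	default:
		return "ok"
	}
}

// transitionEvent returns the event type to insert when a node moves
// from prev to next, or "" when no event should be emitted.
// prev == "" means the node has not been seen since startup (assumed
// convention): the state is recorded without raising an alert.
func transitionEvent(prev, next string) string {
	switch {
	case prev == "" || prev == next:
		return ""
	case next == "down":
		return "node_down" // any→down
	case next == "stale" && prev == "ok":
		return "node_stale" // ok→stale
	case next == "ok":
		return "node_recovered" // stale/down→ok
	}
	return ""
}

func main() {
	prev := "" // unknown at startup
	for _, silence := range []time.Duration{
		5 * time.Minute, 45 * time.Minute, 90 * time.Minute, time.Minute,
	} {
		next := classify(silence)
		fmt.Printf("silence=%v state=%s event=%q\n", silence, next, transitionEvent(prev, next))
		prev = next
	}
}
```

Keeping classification separate from transition handling means the 60s loop only has to store the previous state per customer and insert an event when `transitionEvent` returns a non-empty type.
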
### Phase 3 — Notification System
- **Dispatcher** (`internal/notify/dispatcher.go`): processes events and sends emails via Resend API
- **Operator channel**: English emails to operator for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event_type, respects customer preferences (enabled_events, cooldown_hours), blocked customers skipped
- **Test bypass**: `test` event type skips cooldown/preferences, sends directly to customer email
- **Email templates** (`internal/notify/templates.go`): operator (concise English), customer (Hungarian per event type with complete message table)
- **Cooldown tracking**: in-memory maps with per-customer:eventType granularity
- **`customer_notifications` table**: added `cooldown_hours` column (default 6)
- **`notification_log` table**: added `channel` column (operator/customer)
- Wired into `/api/v1/event` handler and staleness/deadline checkers
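
To make the in-memory cooldown tracking concrete, here is a minimal sketch of a map keyed by `customer:eventType`. The `cooldowns` type, its `allow` method, and the `sampleStart` instant are hypothetical names for this example. Allowing a send exactly at the end of the window is an assumption, as is the choice not to reset the timer when a notification is suppressed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// operatorCooldown is the 1h operator cooldown from the changelog.
const operatorCooldown = time.Hour

// sampleStart is an arbitrary fixed instant used by the demo below.
var sampleStart = time.Date(2026, 2, 20, 12, 0, 0, 0, time.UTC)

// cooldowns remembers when each "customer:eventType" key last sent.
type cooldowns struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
}

func newCooldowns() *cooldowns {
	return &cooldowns{lastSent: make(map[string]time.Time)}
}

// allow reports whether a notification may be sent at time now and,
// if so, records now as the last send time for that key.
// Suppressed attempts do not extend the cooldown window.
func (c *cooldowns) allow(customerID, eventType string, window time.Duration, now time.Time) bool {
	key := customerID + ":" + eventType
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastSent[key]; ok && now.Sub(last) < window {
		return false
	}
	c.lastSent[key] = now
	return true
}

func main() {
	c := newCooldowns()
	fmt.Println(c.allow("demo", "node_down", operatorCooldown, sampleStart))
	fmt.Println(c.allow("demo", "node_down", operatorCooldown, sampleStart.Add(10*time.Minute)))
	fmt.Println(c.allow("demo", "node_stale", operatorCooldown, sampleStart.Add(10*time.Minute)))
	fmt.Println(c.allow("demo", "node_down", operatorCooldown, sampleStart.Add(operatorCooldown)))
}
```

Because the map lives in memory, cooldown state is lost on restart; the customer channel could pass its own window (for example `cooldown_hours` converted to a duration) to the same method.
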
### Phase 4 — Hub UI
- **Events section** on customer detail page: last 50 events, severity filter buttons (All/Errors/Warnings/Info), colored severity badges
- **Dashboard badges**: error+warning count in last 24h per customer, clickable to customer events
- **Notification log**: shows channel column (operator/customer) in customer detail page
- **Config form**: Monitoring UUIDs section marked as "Legacy" with deprecation notice, collapsed by default
### Phase 6 — Config Cleanup
- **`controller.yaml.default`**: `monitoring.ping_uuids` section commented out (deprecated)
- **`buildConfigJSON`**: only writes `ping_uuids` to config JSON if user explicitly provides UUID values (new configs get none)
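
One way to get the "only write `ping_uuids` when provided" behaviour is to rely on `omitempty` in `encoding/json`. The sketch below is an assumption about shape, not the actual `buildConfigJSON`: the `config` and `monitoring` structs, the `customer_id` field, and the `buildConfig` name are hypothetical, and only the `monitoring.ping_uuids` key comes from the changelog.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// monitoring and config are hypothetical shapes for illustration.
type monitoring struct {
	PingUUIDs map[string]string `json:"ping_uuids,omitempty"`
}

type config struct {
	CustomerID string      `json:"customer_id"`
	Monitoring *monitoring `json:"monitoring,omitempty"`
}

// buildConfig emits the monitoring section only when at least one
// non-empty UUID was supplied, so new configs carry no ping_uuids.
func buildConfig(customerID string, uuids map[string]string) ([]byte, error) {
	cfg := config{CustomerID: customerID}
	kept := make(map[string]string)
	for name, id := range uuids {
		if id != "" {
			kept[name] = id
		}
	}
	if len(kept) > 0 {
		cfg.Monitoring = &monitoring{PingUUIDs: kept}
	}
	return json.Marshal(cfg)
}

func main() {
	fresh, _ := buildConfig("demo", nil)
	legacy, _ := buildConfig("demo", map[string]string{"backup": "abc"})
	fmt.Println(string(fresh))
	fmt.Println(string(legacy))
}
```

A nil pointer with `omitempty` drops the whole `monitoring` object, which avoids leaving an empty `"monitoring": {}` stub in freshly generated configs.
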
---
## v0.2.2 (2026-02-20)
**Config Hash Comparison**