feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)
Replace external Healthchecks.io with Hub-native monitoring. New events table + /api/v1/event endpoint for structured events from controllers. Staleness checker (60s) detects unresponsive nodes. Backup deadline checker (daily 05:00) catches missed backups. Notification dispatcher sends operator (English) + customer (Hungarian) emails via Resend with per-event cooldowns. Event timeline on customer page, dashboard badges. Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,5 +1,55 @@
# Felhom Hub — Changelog
## v0.3.0 (2026-02-20)
**Hub Monitoring Takeover — Event System, Dead Man's Switch, Notifications**
Replaces external Healthchecks.io with a Hub-native event system. The Hub becomes the single source of truth for all customer monitoring, event tracking, dead man's switch alerting, and notification delivery.
### Phase 1 — Event System
- **`events` table** in SQLite: stores every event with `customer_id`, `event_type`, `severity`, `message`, `details_json`, `source`, and `timestamp`
- **Indexes**: `idx_events_customer_created` (customer + time DESC), `idx_events_type` (type + time DESC)
- **Store methods**: `SaveEvent`, `GetRecentEvents`, `GetEventsByType`, `GetLatestEventByType`, `GetAllRecentEvents`, `CountEventsBySeverity`, `PruneEvents`, `GetActiveCustomerIDs`
- **`POST /api/v1/event`** endpoint: accepts structured events from controllers, validates `event_type` against 27 allowed types, validates `severity` (info/warning/error), and stores the event in the DB
- **Enhanced auth**: `checkAuthCustomer()` verifies that a per-customer API key matches the `customer_id` in the payload; the global key bypasses the ownership check
- **Prune**: events are pruned alongside reports at 04:30 Budapest time
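
The validation step of `POST /api/v1/event` can be sketched roughly as follows. This is a minimal illustration, not the Hub's actual handler: the `Event` struct, its JSON field names, and `parseEvent` are hypothetical, and the allow-list contains only the event types named in this changelog rather than the full 27.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Event mirrors the columns listed for the events table. The JSON field
// names are assumptions, not the Hub's confirmed wire format.
type Event struct {
	CustomerID string          `json:"customer_id"`
	EventType  string          `json:"event_type"`
	Severity   string          `json:"severity"`
	Message    string          `json:"message"`
	Details    json.RawMessage `json:"details,omitempty"`
	Source     string          `json:"source"`
}

// allowedTypes lists only the event types named in this changelog;
// the real allow-list has 27 entries.
var allowedTypes = map[string]bool{
	"node_stale": true, "node_down": true, "node_recovered": true,
	"backup_completed": true, "db_dump_completed": true,
	"expected_backup_missed": true, "expected_dbdump_missed": true,
	"test": true,
}

var allowedSeverities = map[string]bool{"info": true, "warning": true, "error": true}

// parseEvent (hypothetical name) decodes a request body and applies the
// event_type and severity checks before the event would be stored.
func parseEvent(body []byte) (Event, error) {
	var ev Event
	if err := json.Unmarshal(body, &ev); err != nil {
		return Event{}, fmt.Errorf("invalid JSON: %w", err)
	}
	if ev.CustomerID == "" {
		return Event{}, errors.New("customer_id is required")
	}
	if !allowedTypes[ev.EventType] {
		return Event{}, fmt.Errorf("unknown event_type %q", ev.EventType)
	}
	if !allowedSeverities[ev.Severity] {
		return Event{}, fmt.Errorf("invalid severity %q", ev.Severity)
	}
	return ev, nil
}

func main() {
	_, err := parseEvent([]byte(`{"customer_id":"c1","event_type":"reboot","severity":"info"}`))
	fmt.Println(err) // prints: unknown event_type "reboot"
}
```

Rejecting unknown types at the edge keeps the `events` table limited to values the checkers and the dispatcher know how to interpret.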
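The ownership rule behind `checkAuthCustomer()` (a per-customer key is valid only for its own `customer_id`, while the global key is valid for any) might look like the sketch below. The `keyStore` type and `authorize` method are hypothetical stand-ins, and the constant-time comparison is an assumption about good practice rather than something the changelog states.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// keyStore is a hypothetical in-memory stand-in for wherever the Hub
// keeps its API keys.
type keyStore struct {
	globalKey    string
	customerKeys map[string]string // customer_id -> API key
}

// sameKey compares two keys in constant time (for equal-length inputs)
// to avoid leaking how many leading bytes matched.
func sameKey(a, b string) bool {
	return subtle.ConstantTimeCompare([]byte(a), []byte(b)) == 1
}

// authorize reports whether the presented key may submit events for
// customerID. The global key bypasses the ownership check.
func (s keyStore) authorize(presented, customerID string) bool {
	if presented == "" {
		return false
	}
	if s.globalKey != "" && sameKey(presented, s.globalKey) {
		return true
	}
	want, ok := s.customerKeys[customerID]
	return ok && want != "" && sameKey(presented, want)
}

func main() {
	s := keyStore{
		globalKey:    "global-secret",
		customerKeys: map[string]string{"alice": "key-a", "bob": "key-b"},
	}
	fmt.Println(s.authorize("key-a", "alice"))       // true: key belongs to alice
	fmt.Println(s.authorize("key-a", "bob"))         // false: alice's key, bob's payload
	fmt.Println(s.authorize("global-secret", "bob")) // true: global key bypass
}
```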
### Phase 2 — Dead Man's Switch
- **Staleness checker** (`internal/monitor/staleness.go`): runs every 60s, detects when controllers stop reporting
  - ok→stale (>30min): inserts `node_stale` warning event
  - any→down (>60min): inserts `node_down` error event
  - stale/down→ok: inserts `node_recovered` info event
  - Skips blocked customers; raises no false alerts on startup
- **Backup deadline checker** (`internal/monitor/deadline.go`): runs daily at 05:00 Budapest time
  - Detects missing `backup_completed` events since midnight → inserts `expected_backup_missed` error event
  - Detects missing `db_dump_completed` events → inserts `expected_dbdump_missed` error event
  - Grace: skips customers in the `node_down` state
- **`scheduleDaily()`** helper: goroutine that sleeps until the target time (Europe/Budapest), runs the function, and loops
- **`/healthz`** enhanced: returns 503 if SQLite Ping fails
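
The staleness transitions listed above amount to a small state machine. The sketch below shows one way to express it; `nodeState`, `classify`, and `transitionEvent` are hypothetical names, and the blocked-customer and startup handling is deliberately left out because the changelog does not say how it works.

```go
package main

import (
	"fmt"
	"time"
)

type nodeState string

const (
	stateOK    nodeState = "ok"
	stateStale nodeState = "stale"
	stateDown  nodeState = "down"
)

// classify maps the time since the last controller report to a state,
// using the 30 and 60 minute thresholds from the changelog.
func classify(sinceLastReport time.Duration) nodeState {
	switch {
	case sinceLastReport > 60*time.Minute:
		return stateDown
	case sinceLastReport > 30*time.Minute:
		return stateStale
	default:
		return stateOK
	}
}

// transitionEvent returns the event type to insert when a node moves
// from prev to next, or "" when nothing should be emitted.
func transitionEvent(prev, next nodeState) string {
	if prev == next {
		return ""
	}
	switch next {
	case stateDown:
		return "node_down" // any -> down
	case stateStale:
		if prev == stateOK {
			return "node_stale" // ok -> stale
		}
		return "" // down -> stale is not listed in the changelog
	case stateOK:
		return "node_recovered" // stale/down -> ok
	}
	return ""
}

func main() {
	prev := stateOK
	ages := []time.Duration{10 * time.Minute, 45 * time.Minute, 90 * time.Minute, 2 * time.Minute}
	for _, age := range ages {
		next := classify(age)
		if ev := transitionEvent(prev, next); ev != "" {
			fmt.Printf("%v since last report: %s -> %s, emit %s\n", age, prev, next, ev)
		}
		prev = next
	}
}
```

Emitting events only on transitions, rather than on every 60s tick, means a node that stays down produces one `node_down` event instead of one per minute.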
### Phase 3 — Notification System
- **Dispatcher** (`internal/notify/dispatcher.go`): processes events and sends emails via the Resend API
- **Operator channel**: English emails to the operator for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per `event_type`; respects customer preferences (`enabled_events`, `cooldown_hours`) and skips blocked customers
- **Test bypass**: the `test` event type skips cooldown/preferences and sends directly to the customer email
- **Email templates** (`internal/notify/templates.go`): operator (concise English), customer (Hungarian per event type with complete message table)
- **Cooldown tracking**: in-memory maps with per-customer:eventType granularity
- **`customer_notifications` table**: added `cooldown_hours` column (default 6)
- **`notification_log` table**: added `channel` column (operator/customer)
- Wired into the `/api/v1/event` handler and the staleness/deadline checkers
### Phase 4 — Hub UI
- **Events section** on customer detail page: last 50 events, severity filter buttons (All/Errors/Warnings/Info), colored severity badges
- **Dashboard badges**: per-customer error+warning count for the last 24h; clicking a badge opens that customer's events
- **Notification log**: shows the `channel` column (operator/customer) on the customer detail page
- **Config form**: Monitoring UUIDs section marked as "Legacy" with deprecation notice, collapsed by default
### Phase 6 — Config Cleanup
- **`controller.yaml.default`**: `monitoring.ping_uuids` section commented out (deprecated)
- **`buildConfigJSON`**: writes `ping_uuids` to the config JSON only if the user explicitly provides UUID values (new configs get none)
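
One way to get the "only write `ping_uuids` when values are provided" behavior is to leave the field unset and rely on `omitempty`. The sketch below assumes a config shape with a `monitoring.ping_uuids` map; the struct layout, the `customer_id` field, and the `buildConfig` name are illustrative and not taken from the Hub's `buildConfigJSON`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// monitoring and config are assumed shapes; only the monitoring.ping_uuids
// path is taken from the changelog.
type monitoring struct {
	PingUUIDs map[string]string `json:"ping_uuids,omitempty"`
}

type config struct {
	CustomerID string      `json:"customer_id"`
	Monitoring *monitoring `json:"monitoring,omitempty"`
}

// buildConfig keeps only non-blank UUIDs and omits the whole monitoring
// section when none remain, so new configs carry no legacy keys.
func buildConfig(customerID string, uuids map[string]string) ([]byte, error) {
	cfg := config{CustomerID: customerID}
	kept := make(map[string]string)
	for name, value := range uuids {
		if strings.TrimSpace(value) != "" {
			kept[name] = value
		}
	}
	if len(kept) > 0 {
		cfg.Monitoring = &monitoring{PingUUIDs: kept}
	}
	return json.Marshal(cfg)
}

func main() {
	fresh, _ := buildConfig("c1", nil)
	fmt.Println(string(fresh)) // prints: {"customer_id":"c1"}

	legacy, _ := buildConfig("c1", map[string]string{"backup": "abc-123"})
	fmt.Println(string(legacy)) // prints: {"customer_id":"c1","monitoring":{"ping_uuids":{"backup":"abc-123"}}}
}
```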
---
## v0.2.2 (2026-02-20)
**Config Hash Comparison**