feat: Hub monitoring takeover — event push system + config cleanup (v0.21.0)

Replace external Healthchecks.io with Hub-native event system. Controller now pushes structured events via POST /api/v1/event with typed detail structs. Hub handles dead man's switch, notification dispatch, and cooldowns.

Phase 5: PushEvent() core method, 21 event types, expanded notification settings (11 toggles), Hub connection monitoring on dashboard, alerts.

Phase 6: Deprecation log for ping UUIDs, pinger kept for transition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 18:53:21 +01:00
parent 55abe401ee
commit 8aebbb8902
13 changed files with 722 additions and 318 deletions
@@ -4,7 +4,7 @@

 A single, lightweight Go container that replaces Portainer + scattered systemd scripts with a unified, Hungarian-language web dashboard for managing Docker Compose stacks, backups, storage, monitoring, and notifications on customer hardware.

-**Current version: v0.20.0**
+**Current version: v0.21.0**

 ---

@@ -509,16 +509,9 @@ Backup destination validation (`CheckBackupDestination`) has tiered checks:
 - Disk >95% full → critical/blocked
 - Disk >90% full → warning
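The two disk tiers above can be read as a small classifier. The following is a minimal sketch under assumptions, not the actual `CheckBackupDestination` code: the `DiskLevel` type and the `classifyDisk` name are hypothetical, and only the 90% and 95% thresholds come from the list.

```go
package main

// DiskLevel is a hypothetical severity tier for a backup destination's disk usage.
type DiskLevel string

const (
	DiskOK       DiskLevel = "ok"
	DiskWarning  DiskLevel = "warning"
	DiskCritical DiskLevel = "critical" // the backup is blocked at this tier
)

// classifyDisk maps a usage percentage to a tier using the documented
// thresholds: above 95% is critical/blocked, above 90% is a warning.
func classifyDisk(usedPercent float64) DiskLevel {
	switch {
	case usedPercent > 95:
		return DiskCritical
	case usedPercent > 90:
		return DiskWarning
	default:
		return DiskOK
	}
}
```

Checking the stricter threshold first keeps the cases mutually exclusive without needing range expressions.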

-#### Healthchecks.io Integration (`internal/monitor/pinger.go`)
+#### Healthchecks.io Integration (deprecated)

-Five ping UUIDs for external monitoring:
- **Heartbeat**: every 5 min (simple "I'm alive")
- **System Health**: periodic health check results
- **DB Dump**: after nightly database dumps
- **Backup**: after nightly restic backup
- **Backup Integrity**: weekly `restic check` result
-
-3-attempt retry with 2-second backoff. Pinger never fails the caller.
+Legacy pinger (`internal/monitor/pinger.go`) still runs for backward compatibility but is no longer the primary monitoring mechanism. Monitoring is now handled by the Hub event system (see [Notifications](#5-notifications)). A deprecation log is emitted on startup if ping UUIDs are configured.

 #### Metrics Store (`internal/metrics/`)

@@ -535,48 +528,66 @@ Full-page system monitor at `/monitoring`:
 - **System Metrics Charts**: 4 line charts (CPU, Memory, Temperature, Load) in 2x2 grid
 - **Container Resources**: horizontal bar charts (CPU% and Memory per container)
 - **Per-container Detail**: click-to-expand historical charts
- **Remote Monitoring Status**: shows Healthchecks ping UUID configuration
+- **Hub Connection Status**: shows Hub URL, customer ID, connection state (connected/unreachable), last successful push, last error

 Chart.js 4.4.7 embedded locally (works in offline environments), dark theme matching site design.

 #### Alert System (`internal/web/alerts.go`)

 State-based alerts displayed on all pages:
- Sources: health issues, missing ping UUIDs, backup disabled
+- Sources: health issues, Hub connection status, backup disabled, storage disconnected, update available
+- Hub alerts: `hub-disabled` (warning) when the Hub is not enabled, `hub-unreachable` (error) when the last push failed and no push has succeeded in the past 30 min
 - Sorted by severity (error > warning > info), capped at 5 visible
- Refreshed every 5 min + on startup
- Monitoring page suppresses ping-related alerts (shown in dedicated table instead)
+- Refreshed every 5 min + on startup + on storage state changes
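The ordering and cap described in this list could look roughly like the sketch below. It is illustrative only: the `Alert` struct, `visibleAlerts`, and `rank` are hypothetical names, and the real `internal/web/alerts.go` may be organized differently.

```go
package main

import "sort"

// Alert is a hypothetical shape for a dashboard alert.
type Alert struct {
	ID       string
	Severity string // "error", "warning", or "info"
	Message  string
}

// severityRank orders severities so that errors sort first.
var severityRank = map[string]int{"error": 0, "warning": 1, "info": 2}

// maxVisibleAlerts is the documented cap on alerts shown per page.
const maxVisibleAlerts = 5

// rank returns the sort position for a severity; unknown values sort last.
func rank(severity string) int {
	if r, ok := severityRank[severity]; ok {
		return r
	}
	return len(severityRank)
}

// visibleAlerts returns a sorted copy of alerts (error > warning > info),
// truncated to the first five entries. The input slice is left untouched.
func visibleAlerts(alerts []Alert) []Alert {
	out := make([]Alert, len(alerts))
	copy(out, alerts)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].Severity) < rank(out[j].Severity)
	})
	if len(out) > maxVisibleAlerts {
		out = out[:maxVisibleAlerts]
	}
	return out
}
```

A stable sort is used so that alerts of equal severity keep the order in which their sources produced them.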

 ---

 ### 5. Notifications

-#### Email Delivery
+#### Hub Event System (`internal/notify/notifier.go`)

-The controller relays notifications through the central hub, which sends emails via the Resend API:
-1. Controller detects event (health degradation, backup failure, etc.)
-2. Non-blocking POST to hub's `/api/v1/notify` with event details
-3. Hub checks customer notification preferences
-4. Hub sends Hungarian-language email via Resend
+The controller pushes structured events to the Hub's `/api/v1/event` endpoint. The Hub handles notification dispatch, cooldown management, and dead man's switch detection.
+
+**Core method:** `PushEvent(eventType, severity, message, details)` runs in its own goroutine so it never blocks the caller, and retries twice with a 3-second backoff.

 #### Event Types

-| Event | Trigger |
-|-------|---------|
-| `disk_warning` | Disk usage crosses warning/critical threshold |
-| `backup_failed` | Nightly backup or DB dump fails |
-| `update_available` | New app version detected in catalog |
-| `security_update` | Critical security update available |
+| Event Type | Severity | Trigger |
+|------------|----------|---------|
+| `backup_completed` | info | Nightly restic backup succeeds |
+| `backup_failed` | error | Nightly restic backup fails |
+| `db_dump_completed` | info | Nightly database dumps succeed |
+| `db_dump_failed` | error | Nightly database dumps fail |
+| `backup_integrity_ok` | info | Weekly `restic check` passes |
+| `backup_integrity_failed` | error | Weekly `restic check` fails |
+| `crossdrive_completed` | info | Cross-drive secondary backup succeeds |
+| `crossdrive_failed` | error | Cross-drive secondary backup fails |
+| `health_degraded` | warning | Health status degrades (ok→warn) |
+| `health_critical` | error | Health status critical (any→fail) |
+| `health_recovered` | info | Health status recovers (fail/warn→ok) |
+| `disk_warning` | warning | Disk usage crosses 90% |
+| `disk_critical` | error | Disk usage crosses 95% |
+| `storage_disconnected` | error | Storage drive physically removed |
+| `storage_reconnected` | info | Storage drive reconnected |
+| `controller_started` | info | Controller process starts |
+| `controller_updated` | info/error | Self-update success or failure |
+| `app_deployed` | info | New app deployed via API |
+| `app_removed` | info | App removed via API |
+| `disaster_recovery_started` | warning | DR restore begins |
+| `disaster_recovery_completed` | info/error | DR restore finishes (success/partial) |

-#### Cooldown System
+Each event carries typed detail structs (e.g., `BackupDetails`, `DiskDetails`, `HealthDetails`) serialized as JSON.

-Per-event-type cooldown (default 6 hours, configurable) prevents notification spam. Only notifies on **status degradation** (ok→warn, ok→fail, warn→fail), not on repeated same-status checks.
+#### Default Enabled Events
+
+Events the customer receives notifications for (configurable in settings):
+`backup_failed`, `db_dump_failed`, `disk_warning`, `disk_critical`, `storage_disconnected`, `node_down`, `health_critical`, `expected_backup_missed`, `expected_dbdump_missed`

 #### Preference Sync

-Notification preferences (email, enabled events, cooldown) are:
+Notification preferences (email, enabled events, cooldown hours) are:
 - Stored locally in `settings.json`
- Synced to hub on save and on controller startup
+- Synced to Hub on save and on controller startup via `POST /api/v1/preferences`
 - Hub sync failure doesn't block local save

 ---
@@ -776,7 +787,7 @@ Periodic JSON push (default every 15 min) to the central felhom-hub service:
 - Stacks: deployed apps with versions and states
 - Config hash: SHA256 of `controller.yaml` for Hub-side config comparison

-Bearer token authentication, 3-attempt retry with 5-second backoff.
+Bearer token authentication, 3-attempt retry with 5-second backoff. Push status tracked via `PushStatus` struct (LastAttempt, LastSuccess, LastError, consecutive failures) — used by the monitoring page and alert system to show Hub connection health.

 #### Infrastructure Backup to Hub (`internal/report/infra_backup.go`)

@@ -792,11 +803,14 @@ This enables fully automated recovery when the system drive is replaced — the
 #### Hub Dashboard

 The hub service (separate Go app in the `felhom.eu` repo) provides:
- Multi-customer overview table with status indicators
- Customer detail page with system/storage/containers/backup/health sections
+- Multi-customer overview table with status indicators and event count badges
+- Customer detail page with system/storage/containers/backup/health/events sections
+- Event timeline: last 50 events with severity filter, colored badges, source tracking
+- Dead man's switch: staleness detection (30min stale, 60min down), missed backup detection (daily at 05:00)
+- Notification dispatch: operator (English) + customer (Hungarian) emails via Resend with per-event cooldowns
 - Infra backup status per customer (last sync, stack count, disk count)
 - Color coding: green (<30min), yellow (30-60min), red (>60min since last report)
- 90-day report retention with daily prune
+- 90-day report + event retention with daily prune at 04:30 Budapest time

 ### 9. Disaster Recovery