feat: Hub monitoring takeover — event push system + config cleanup (v0.21.0)
Replace external Healthchecks.io with Hub-native event system. Controller now pushes structured events via POST /api/v1/event with typed detail structs. Hub handles dead man's switch, notification dispatch, and cooldowns.

Phase 5: PushEvent() core method, 21 event types, expanded notification settings (11 toggles), Hub connection monitoring on dashboard, alerts.
Phase 6: Deprecation log for ping UUIDs, pinger kept for transition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+48
-34
@@ -4,7 +4,7 @@
A single, lightweight Go container that replaces Portainer + scattered systemd scripts with a unified, Hungarian-language web dashboard for managing Docker Compose stacks, backups, storage, monitoring, and notifications on customer hardware.

**Current version: v0.20.0**
**Current version: v0.21.0**

---
@@ -509,16 +509,9 @@ Backup destination validation (`CheckBackupDestination`) has tiered checks:
- Disk >95% full → critical/blocked
- Disk >90% full → warning
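The two disk tiers above can be sketched as a small classifier. This is only an illustration: the function name `classifyDiskUsage` and the returned level strings are assumptions, not the actual code behind `CheckBackupDestination`.

```go
package main

// classifyDiskUsage maps a disk usage percentage to a check level.
// Hypothetical helper: the name and the level strings are assumptions.
func classifyDiskUsage(usedPercent float64) string {
	switch {
	case usedPercent > 95:
		return "critical" // the backup destination is treated as blocked
	case usedPercent > 90:
		return "warning"
	default:
		return "ok"
	}
}
```

Both thresholds are strict here, so exactly 95% still counts as a warning and exactly 90% as ok.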
#### Healthchecks.io Integration (`internal/monitor/pinger.go`)
#### Healthchecks.io Integration (deprecated)

Five ping UUIDs for external monitoring:
- **Heartbeat**: every 5 min (simple "I'm alive")
- **System Health**: periodic health check results
- **DB Dump**: after nightly database dumps
- **Backup**: after nightly restic backup
- **Backup Integrity**: weekly `restic check` result

3-attempt retry with 2-second backoff. Pinger never fails the caller.
Legacy pinger (`internal/monitor/pinger.go`) still runs for backward compatibility but is no longer the primary monitoring mechanism. Monitoring is now handled by the Hub event system (see [Notifications](#5-notifications)). A deprecation log is emitted on startup if ping UUIDs are configured.

#### Metrics Store (`internal/metrics/`)
@@ -535,48 +528,66 @@ Full-page system monitor at `/monitoring`:
- **System Metrics Charts**: 4 line charts (CPU, Memory, Temperature, Load) in 2x2 grid
- **Container Resources**: horizontal bar charts (CPU% and Memory per container)
- **Per-container Detail**: click-to-expand historical charts
- **Remote Monitoring Status**: shows Healthchecks ping UUID configuration
- **Hub Connection Status**: shows Hub URL, customer ID, connection state (connected/unreachable), last successful push, last error

Chart.js 4.4.7 embedded locally (works in offline environments), dark theme matching site design.

#### Alert System (`internal/web/alerts.go`)

State-based alerts displayed on all pages:
- Sources: health issues, missing ping UUIDs, backup disabled
- Sources: health issues, Hub connection status, backup disabled, storage disconnected, update available
- Hub alerts: `hub-disabled` (warning) when Hub not enabled, `hub-unreachable` (error) when last push failed and no success in 30 min
- Sorted by severity (error > warning > info), capped at 5 visible
- Refreshed every 5 min + on startup
- Monitoring page suppresses ping-related alerts (shown in dedicated table instead)
- Refreshed every 5 min + on startup + on storage state changes
---

### 5. Notifications

#### Email Delivery
#### Hub Event System (`internal/notify/notifier.go`)

The controller relays notifications through the central hub, which sends emails via the Resend API:
1. Controller detects event (health degradation, backup failure, etc.)
2. Non-blocking POST to hub's `/api/v1/notify` with event details
3. Hub checks customer notification preferences
4. Hub sends Hungarian-language email via Resend
The controller pushes structured events to the Hub's `/api/v1/event` endpoint. The Hub handles notification dispatch, cooldown management, and dead man's switch detection.

**Core method:** `PushEvent(eventType, severity, message, details)` runs in a non-blocking goroutine with 2 retries and a 3s backoff, so it never blocks the caller.
#### Event Types

| Event | Trigger |
|-------|---------|
| `disk_warning` | Disk usage crosses warning/critical threshold |
| `backup_failed` | Nightly backup or DB dump fails |
| `update_available` | New app version detected in catalog |
| `security_update` | Critical security update available |

| Event Type | Severity | Trigger |
|------------|----------|---------|
| `backup_completed` | info | Nightly restic backup succeeds |
| `backup_failed` | error | Nightly restic backup fails |
| `db_dump_completed` | info | Nightly database dumps succeed |
| `db_dump_failed` | error | Nightly database dumps fail |
| `backup_integrity_ok` | info | Weekly `restic check` passes |
| `backup_integrity_failed` | error | Weekly `restic check` fails |
| `crossdrive_completed` | info | Cross-drive secondary backup succeeds |
| `crossdrive_failed` | error | Cross-drive secondary backup fails |
| `health_degraded` | warning | Health status degrades (ok→warn) |
| `health_critical` | error | Health status critical (any→fail) |
| `health_recovered` | info | Health status recovers (fail/warn→ok) |
| `disk_warning` | warning | Disk usage crosses 90% |
| `disk_critical` | error | Disk usage crosses 95% |
| `storage_disconnected` | error | Storage drive physically removed |
| `storage_reconnected` | info | Storage drive reconnected |
| `controller_started` | info | Controller process starts |
| `controller_updated` | info/error | Self-update success or failure |
| `app_deployed` | info | New app deployed via API |
| `app_removed` | info | App removed via API |
| `disaster_recovery_started` | warning | DR restore begins |
| `disaster_recovery_completed` | info/error | DR restore finishes (success/partial) |

#### Cooldown System
Each event carries typed detail structs (e.g., `BackupDetails`, `DiskDetails`, `HealthDetails`) serialized as JSON.
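To make the idea of typed detail structs concrete, here is a hedged sketch of how one might be embedded in an event payload. Only the struct names `BackupDetails` and `DiskDetails` come from the text; their fields, the JSON keys, the `eventPayload` envelope, and `encodeEvent` are illustrative assumptions.

```go
package main

import "encoding/json"

// DiskDetails is named in the text; these fields are illustrative guesses.
type DiskDetails struct {
	MountPoint  string  `json:"mount_point"`
	UsedPercent float64 `json:"used_percent"`
}

// BackupDetails is named in the text; these fields are illustrative guesses.
type BackupDetails struct {
	Repository      string `json:"repository"`
	DurationSeconds int    `json:"duration_seconds"`
	Error           string `json:"error,omitempty"`
}

// eventPayload is a hypothetical envelope around the typed details.
type eventPayload struct {
	Type     string `json:"type"`
	Severity string `json:"severity"`
	Message  string `json:"message"`
	Details  any    `json:"details"`
}

// encodeEvent serializes an event together with its typed details.
func encodeEvent(eventType, severity, message string, details any) ([]byte, error) {
	return json.Marshal(eventPayload{
		Type:     eventType,
		Severity: severity,
		Message:  message,
		Details:  details,
	})
}
```

Using one struct per event family keeps call sites type-checked while the envelope stays generic, so a receiver can decode `details` according to the event type.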
Per-event-type cooldown (default 6 hours, configurable) prevents notification spam. Only notifies on **status degradation** (ok→warn, ok→fail, warn→fail), not on repeated same-status checks.
#### Default Enabled Events

Events the customer receives notifications for (configurable in settings):
`backup_failed`, `db_dump_failed`, `disk_warning`, `disk_critical`, `storage_disconnected`, `node_down`, `health_critical`, `expected_backup_missed`, `expected_dbdump_missed`

#### Preference Sync

Notification preferences (email, enabled events, cooldown) are:
Notification preferences (email, enabled events, cooldown hours) are:
- Stored locally in `settings.json`
- Synced to hub on save and on controller startup
- Synced to Hub on save and on controller startup via `POST /api/v1/preferences`
- Hub sync failure doesn't block local save
---
@@ -776,7 +787,7 @@ Periodic JSON push (default every 15 min) to the central felhom-hub service:
- Stacks: deployed apps with versions and states
- Config hash: SHA256 of `controller.yaml` for Hub-side config comparison

Bearer token authentication, 3-attempt retry with 5-second backoff.
Bearer token authentication, 3-attempt retry with 5-second backoff. Push status is tracked via the `PushStatus` struct (LastAttempt, LastSuccess, LastError, consecutive failures), which the monitoring page and alert system use to show Hub connection health.
#### Infrastructure Backup to Hub (`internal/report/infra_backup.go`)
@@ -792,11 +803,14 @@ This enables fully automated recovery when the system drive is replaced — the
#### Hub Dashboard

The hub service (separate Go app in the `felhom.eu` repo) provides:
- Multi-customer overview table with status indicators
- Customer detail page with system/storage/containers/backup/health sections
- Multi-customer overview table with status indicators and event count badges
- Customer detail page with system/storage/containers/backup/health/events sections
- Event timeline: last 50 events with severity filter, colored badges, source tracking
- Dead man's switch: staleness detection (30min stale, 60min down), missed backup detection (daily at 05:00)
- Notification dispatch: operator (English) + customer (Hungarian) emails via Resend with per-event cooldowns
- Infra backup status per customer (last sync, stack count, disk count)
- Color coding: green (<30min), yellow (30-60min), red (>60min since last report)
- 90-day report retention with daily prune
- 90-day report + event retention with daily prune at 04:30 Budapest time
### 9. Disaster Recovery