feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)

Replace external Healthchecks.io with Hub-native monitoring. New events
table + /api/v1/event endpoint for structured events from controllers.
Staleness checker (60s) detects unresponsive nodes. Backup deadline
checker (daily 05:00) catches missed backups. Notification dispatcher
sends operator (English) + customer (Hungarian) emails via Resend with
per-event cooldowns. Event timeline on customer page, dashboard badges.
Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-02-20 18:53:24 +01:00
parent b4cb92e09f
commit 3217cb4751
16 changed files with 1319 additions and 64 deletions
+50
View File
@@ -1,5 +1,55 @@
# Felhom Hub — Changelog
## v0.3.0 (2026-02-20)
**Hub Monitoring Takeover — Event System, Dead Man's Switch, Notifications**
Replaces external Healthchecks.io with a Hub-native event system. The Hub becomes the single source of truth for all customer monitoring, event tracking, dead man's switch alerting, and notification delivery.
### Phase 1 — Event System
- **`events` table** in SQLite: stores all events with customer_id, event_type, severity, message, details_json, source, timestamp
- **Indexes**: `idx_events_customer_created` (customer + time DESC), `idx_events_type` (type + time DESC)
- **Store methods**: `SaveEvent`, `GetRecentEvents`, `GetEventsByType`, `GetLatestEventByType`, `GetAllRecentEvents`, `CountEventsBySeverity`, `PruneEvents`, `GetActiveCustomerIDs`
- **`POST /api/v1/event`** endpoint: accepts structured events from controllers, validates event_type against 27 allowed types, validates severity (info/warning/error), stores in DB
- **Enhanced auth**: `checkAuthCustomer()` validates per-customer API keys match the customer_id in payload; global key bypasses ownership check
- **Prune**: events pruned alongside reports at 04:30 Budapest time
### Phase 2 — Dead Man's Switch
- **Staleness checker** (`internal/monitor/staleness.go`): runs every 60s, detects when controllers stop reporting
- ok→stale (>30min): inserts `node_stale` warning event
- any→down (>60min): inserts `node_down` error event
- stale/down→ok: inserts `node_recovered` info event
- Skips blocked customers, no false alerts on startup
- **Backup deadline checker** (`internal/monitor/deadline.go`): runs daily at 05:00 Budapest
- Detects missing `backup_completed` events since midnight → inserts `expected_backup_missed` error
- Detects missing `db_dump_completed` events → inserts `expected_dbdump_missed` error
- Grace: skips customers with `node_down` state
- **`scheduleDaily()`** helper: goroutine that sleeps until target time (Europe/Budapest), runs function, loops
- **`/healthz`** enhanced: returns 503 if SQLite Ping fails
### Phase 3 — Notification System
- **Dispatcher** (`internal/notify/dispatcher.go`): processes events and sends emails via Resend API
- **Operator channel**: English emails to operator for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event_type, respects customer preferences (enabled_events, cooldown_hours), blocked customers skipped
- **Test bypass**: `test` event type skips cooldown/preferences, sends directly to customer email
- **Email templates** (`internal/notify/templates.go`): operator (concise English), customer (Hungarian per event type with complete message table)
- **Cooldown tracking**: in-memory maps with per-customer:eventType granularity
- **`customer_notifications` table**: added `cooldown_hours` column (default 6)
- **`notification_log` table**: added `channel` column (operator/customer)
- Wired into `/api/v1/event` handler and staleness/deadline checkers
### Phase 4 — Hub UI
- **Events section** on customer detail page: last 50 events, severity filter buttons (All/Errors/Warnings/Info), colored severity badges
- **Dashboard badges**: error+warning count in last 24h per customer, clickable to customer events
- **Notification log**: shows channel column (operator/customer) in customer detail page
- **Config form**: Monitoring UUIDs section marked as "Legacy" with deprecation notice, collapsed by default
### Phase 6 — Config Cleanup
- **`controller.yaml.default`**: `monitoring.ping_uuids` section commented out (deprecated)
- **`buildConfigJSON`**: only writes `ping_uuids` to config JSON if user explicitly provides UUID values (new configs get none)
---
## v0.2.2 (2026-02-20)
**Config Hash Comparison**
+41 -13
View File
@@ -2,9 +2,9 @@
**Central operator dashboard for monitoring and managing Felhom customer deployments.**
A lightweight Go service that receives periodic reports from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery.
A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery, event-based dead man's switch monitoring, and notification dispatch.
**Current version: v0.2.2**
**Current version: v0.3.0**
---
@@ -72,14 +72,29 @@ The infra-backup payload contains everything needed to restore a customer deploy
4. Controller uses disk UUIDs to auto-mount surviving drives
5. Controller restores apps from local backups on those drives
### Events
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |
Events are the primary monitoring mechanism. Each event has: customer_id, event_type, severity, message, details_json, source. Per-customer API keys are validated against the customer_id in the payload. Stored in the `events` table with automatic pruning.
**Hub-generated events** (source="hub"):
- `node_stale` / `node_down` / `node_recovered` — dead man's switch from staleness checker (every 60s)
- `expected_backup_missed` / `expected_dbdump_missed` — backup deadline checker (daily at 05:00 Budapest)
### Notifications
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Controller sends event notification (backup_failed, disk_warning, etc.) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences |
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |
Notifications are sent via Resend.com email API.
Notifications are dispatched automatically when events are processed:
- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API
### Customer Config Retrieval
@@ -93,7 +108,7 @@ Config retrieval uses a separate per-customer retrieval password (not the API ke
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required) |
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
## Web Dashboard
@@ -101,13 +116,13 @@ Protected by bcrypt password + session cookie (7-day expiry).
### Pages
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status. Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Customers (`/configs`)** — Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
- **Unified Customer Detail (`/customers/{id}`)** — Single page per customer combining config management and live monitoring. Adapts content based on available data:
- **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, credentials, setup commands, YAML preview, controller update, notifications, history
- **Managed + reporting:** Full view — config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
- **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
- **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice
### Customer States
@@ -144,9 +159,10 @@ SQLite with WAL mode. Tables:
| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email + enabled event types per customer |
| `notification_log` | Send/skip/fail history for notifications |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |
Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.
@@ -164,6 +180,8 @@ api:
notifications:
resend_api_key: "" # Resend.com API key for email
from_email: "monitoring@felhom.eu"
operator_email: "" # Operator alert recipient
operator_enabled: true # Enable operator email notifications
retention:
max_days: 90
@@ -195,16 +213,26 @@ Runs on k3s (Kubernetes) in the `felhom-system` namespace:
```bash
# Build and push
cd hub/
make VERSION=0.2.2 docker docker-push
make VERSION=0.3.0 docker docker-push
# Deploy
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.2.2
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.3.0
kubectl rollout status -n felhom-system deploy/hub
# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```
## Background Services
| Service | Schedule | Description |
|---------|----------|-------------|
| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
## Dependencies
- `golang.org/x/crypto` — bcrypt for password hashing
+87 -18
View File
@@ -13,6 +13,8 @@ import (
"time"
"gitea.dooplex.hu/admin/felhom-hub/internal/api"
"gitea.dooplex.hu/admin/felhom-hub/internal/monitor"
"gitea.dooplex.hu/admin/felhom-hub/internal/notify"
"gitea.dooplex.hu/admin/felhom-hub/internal/store"
"gitea.dooplex.hu/admin/felhom-hub/internal/web"
"gopkg.in/yaml.v3"
@@ -32,8 +34,10 @@ type Config struct {
ReportAPIKey string `yaml:"report_api_key"`
} `yaml:"api"`
Notifications struct {
ResendAPIKey string `yaml:"resend_api_key"`
FromEmail string `yaml:"from_email"`
ResendAPIKey string `yaml:"resend_api_key"`
FromEmail string `yaml:"from_email"`
OperatorEmail string `yaml:"operator_email"`
OperatorEnabled bool `yaml:"operator_enabled"`
} `yaml:"notifications"`
Retention struct {
MaxDays int `yaml:"max_days"`
@@ -119,6 +123,18 @@ func main() {
templateProvider = templateFetcher
}
apiHandler := api.New(dataStore, cfg.API.ReportAPIKey, cfg.Notifications.ResendAPIKey, cfg.Notifications.FromEmail, templateProvider, logger)
// Initialize notification dispatcher
dispatcher := notify.NewDispatcher(
dataStore,
cfg.Notifications.ResendAPIKey,
cfg.Notifications.FromEmail,
cfg.Notifications.OperatorEmail,
cfg.Notifications.OperatorEnabled,
logger,
)
apiHandler.SetDispatcher(dispatcher)
webServer := web.New(dataStore, cfg.Auth.PasswordHash, cfg.API.ReportAPIKey, staleThreshold, logger)
webServer.SetTemplateFetcher(templateFetcher)
@@ -127,6 +143,11 @@ func main() {
// Health check endpoint — bypasses auth (for k8s probes)
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
if err := dataStore.Ping(); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("db unhealthy"))
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("ok"))
})
@@ -165,10 +186,34 @@ func main() {
}
webServer.SetVersionChecker(versionChecker)
// Prune on startup, then daily at configured time (default 04:30)
if cfg.Retention.MaxDays > 0 {
go pruneLoop(ctx, dataStore, cfg.Retention.MaxDays, logger)
pruneAll(dataStore, cfg.Retention.MaxDays, logger)
go scheduleDaily(ctx, "prune", cfg.Retention.PruneSchedule, func() {
pruneAll(dataStore, cfg.Retention.MaxDays, logger)
}, logger)
}
// Staleness checker — runs every 60s
stalenessChecker := monitor.NewStalenessChecker(dataStore, staleThreshold, dispatcher.ProcessEvent, logger)
go func() {
ticker := time.NewTicker(60 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
stalenessChecker.Check()
}
}
}()
// Backup deadline checker — runs daily at 05:00 Budapest
go scheduleDaily(ctx, "deadline-check", "05:00", func() {
monitor.CheckBackupDeadlines(dataStore, stalenessChecker, dispatcher.ProcessEvent, logger)
}, logger)
// Signal handling
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
@@ -244,28 +289,52 @@ func loadConfig(path string, logger *log.Logger) *Config {
return cfg
}
func pruneLoop(ctx context.Context, s *store.Store, maxDays int, logger *log.Logger) {
// Prune once on startup
if deleted, err := s.Prune(maxDays); err != nil {
logger.Printf("[WARN] Prune failed: %v", err)
} else if deleted > 0 {
logger.Printf("[INFO] Pruned %d old report rows", deleted)
// scheduleDaily runs fn once daily at the given "HH:MM" time in Europe/Budapest.
// It blocks until ctx is cancelled.
func scheduleDaily(ctx context.Context, name, timeStr string, fn func(), logger *log.Logger) {
budapest, err := time.LoadLocation("Europe/Budapest")
if err != nil {
budapest = time.FixedZone("CET", 3600)
}
// Then daily
ticker := time.NewTicker(24 * time.Hour)
defer ticker.Stop()
hour, min := parseHM(timeStr)
for {
now := time.Now().In(budapest)
next := time.Date(now.Year(), now.Month(), now.Day(), hour, min, 0, 0, budapest)
if !next.After(now) {
next = next.Add(24 * time.Hour)
}
delay := time.Until(next)
logger.Printf("[INFO] %s: next run at %s (in %s)", name, next.Format("2006-01-02 15:04 MST"), delay.Round(time.Second))
select {
case <-ctx.Done():
return
case <-ticker.C:
if deleted, err := s.Prune(maxDays); err != nil {
logger.Printf("[WARN] Prune failed: %v", err)
} else if deleted > 0 {
logger.Printf("[INFO] Pruned %d old report rows", deleted)
}
case <-time.After(delay):
fn()
}
}
}
// parseHM parses "HH:MM" into hour and minute. Returns 0, 0 on invalid input.
func parseHM(s string) (int, int) {
var h, m int
if _, err := fmt.Sscanf(s, "%d:%d", &h, &m); err != nil {
return 0, 0
}
return h, m
}
func pruneAll(s *store.Store, maxDays int, logger *log.Logger) {
if deleted, err := s.Prune(maxDays); err != nil {
logger.Printf("[WARN] Prune reports failed: %v", err)
} else if deleted > 0 {
logger.Printf("[INFO] Pruned %d old report rows", deleted)
}
if deleted, err := s.PruneEvents(maxDays); err != nil {
logger.Printf("[WARN] Prune events failed: %v", err)
} else if deleted > 0 {
logger.Printf("[INFO] Pruned %d old event rows", deleted)
}
}
+149 -11
View File
@@ -12,6 +12,7 @@ import (
"time"
"gitea.dooplex.hu/admin/felhom-hub/internal/configgen"
"gitea.dooplex.hu/admin/felhom-hub/internal/notify"
"gitea.dooplex.hu/admin/felhom-hub/internal/store"
)
@@ -29,6 +30,7 @@ type Handler struct {
logger *log.Logger
httpClient *http.Client
templateProvider ConfigTemplateProvider
dispatcher *notify.Dispatcher
}
// New creates a new API handler.
@@ -44,23 +46,40 @@ func New(store *store.Store, apiKey, resendAPIKey, fromEmail string, templatePro
}
}
// SetDispatcher sets the notification dispatcher for event-triggered emails.
func (h *Handler) SetDispatcher(d *notify.Dispatcher) {
h.dispatcher = d
}
// checkAuth verifies the Bearer token against the global API key or a per-customer API key.
// Returns true if authorized.
func (h *Handler) checkAuth(r *http.Request) bool {
_, _, ok := h.checkAuthCustomer(r)
return ok
}
// checkAuthCustomer verifies the Bearer token and returns the authenticated customer identity.
// For per-customer keys: returns (customerID, false, true).
// For global key: returns ("", true, true) — caller must allow any customer_id.
// On failure: returns ("", false, false).
func (h *Handler) checkAuthCustomer(r *http.Request) (customerID string, isGlobal bool, ok bool) {
auth := r.Header.Get("Authorization")
if !strings.HasPrefix(auth, "Bearer ") {
return false
return "", false, false
}
token := strings.TrimPrefix(auth, "Bearer ")
// Check global key first
if h.apiKey != "" && subtle.ConstantTimeCompare([]byte(token), []byte(h.apiKey)) == 1 {
return true
return "", true, true
}
// Check per-customer key
cfg, err := h.store.GetCustomerConfigByAPIKey(token)
return err == nil && cfg != nil
if err != nil || cfg == nil {
return "", false, false
}
return cfg.CustomerID, false, true
}
// ServeHTTP routes API requests.
@@ -70,6 +89,8 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
switch {
case r.Method == http.MethodPost && path == "/report":
h.handleReport(w, r)
case r.Method == http.MethodPost && path == "/event":
h.handleEvent(w, r)
case r.Method == http.MethodPost && path == "/notify":
h.handleNotify(w, r)
case r.Method == http.MethodPost && path == "/infra-backup":
@@ -97,7 +118,8 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
}
func (h *Handler) handleReport(w http.ResponseWriter, r *http.Request) {
if !h.checkAuth(r) {
authCustomerID, isGlobal, ok := h.checkAuthCustomer(r)
if !ok {
http.Error(w, "Unauthorized", http.StatusUnauthorized)
return
}
@@ -117,6 +139,12 @@ func (h *Handler) handleReport(w http.ResponseWriter, r *http.Request) {
return
}
// Validate customer_id matches authenticated customer (unless global key)
if !isGlobal && authCustomerID != payload.CustomerID {
http.Error(w, "Forbidden: customer_id mismatch", http.StatusForbidden)
return
}
if err := h.store.SaveReport(payload.CustomerID, body); err != nil {
h.logger.Printf("[ERROR] Failed to save report from %s: %v", payload.CustomerID, err)
http.Error(w, "Internal error", http.StatusInternalServerError)
@@ -128,6 +156,114 @@ func (h *Handler) handleReport(w http.ResponseWriter, r *http.Request) {
w.Write([]byte(`{"status":"ok"}`))
}
// allowedEventTypes lists all valid event_type values the Hub accepts.
var allowedEventTypes = map[string]bool{
// Controller-pushed events
"controller_started": true,
"controller_updated": true,
"backup_completed": true,
"backup_failed": true,
"db_dump_completed": true,
"db_dump_failed": true,
"backup_integrity_ok": true,
"backup_integrity_failed": true,
"crossdrive_completed": true,
"crossdrive_failed": true,
"storage_disconnected": true,
"storage_reconnected": true,
"disk_warning": true,
"disk_critical": true,
"health_degraded": true,
"health_critical": true,
"health_recovered": true,
"app_deployed": true,
"app_removed": true,
"disaster_recovery_started": true,
"disaster_recovery_completed": true,
// Hub-generated events
"node_stale": true,
"node_down": true,
"node_recovered": true,
"expected_backup_missed": true,
"expected_dbdump_missed": true,
// Special
"test": true,
}
// handleEvent processes structured events from controllers (new endpoint, replaces /notify for updated controllers).
func (h *Handler) handleEvent(w http.ResponseWriter, r *http.Request) {
authCustomerID, isGlobal, ok := h.checkAuthCustomer(r)
if !ok {
http.Error(w, "Unauthorized", http.StatusUnauthorized)
return
}
body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
if err != nil {
http.Error(w, "Bad request", http.StatusBadRequest)
return
}
var payload struct {
CustomerID string `json:"customer_id"`
EventType string `json:"event_type"`
Severity string `json:"severity"`
Message string `json:"message"`
Details json.RawMessage `json:"details"`
}
if err := json.Unmarshal(body, &payload); err != nil {
http.Error(w, "Invalid JSON", http.StatusBadRequest)
return
}
if payload.CustomerID == "" || payload.EventType == "" {
http.Error(w, "customer_id and event_type are required", http.StatusBadRequest)
return
}
// Validate customer_id matches authenticated customer (unless global key)
if !isGlobal && authCustomerID != payload.CustomerID {
http.Error(w, "Forbidden: customer_id mismatch", http.StatusForbidden)
return
}
// Validate event_type
if !allowedEventTypes[payload.EventType] {
http.Error(w, fmt.Sprintf("Invalid event_type: %s", payload.EventType), http.StatusBadRequest)
return
}
// Validate/default severity
switch payload.Severity {
case "info", "warning", "error":
default:
payload.Severity = "info"
}
// Store details as JSON string
detailsStr := "{}"
if len(payload.Details) > 0 && string(payload.Details) != "null" {
detailsStr = string(payload.Details)
}
_, err = h.store.SaveEvent(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, detailsStr, "controller")
if err != nil {
h.logger.Printf("[ERROR] Failed to save event from %s: %v", payload.CustomerID, err)
http.Error(w, "Internal error", http.StatusInternalServerError)
return
}
h.logger.Printf("[INFO] Event from %s: %s (%s) — %s", payload.CustomerID, payload.EventType, payload.Severity, payload.Message)
// Dispatch notifications (non-blocking)
if h.dispatcher != nil {
go h.dispatcher.ProcessEvent(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, detailsStr, "controller")
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"ok":true}`))
}
func (h *Handler) handleCustomers(w http.ResponseWriter, r *http.Request) {
customers, err := h.store.GetCustomers()
if err != nil {
@@ -258,7 +394,7 @@ func (h *Handler) handleNotify(w http.ResponseWriter, r *http.Request) {
// Check if customer is blocked
if h.store.IsCustomerBlocked(payload.CustomerID) {
h.logger.Printf("[INFO] Notification suppressed for blocked customer %s", payload.CustomerID)
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "customer blocked")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "customer blocked", "customer")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok","sent":false,"reason":"blocked"}`))
return
@@ -275,7 +411,7 @@ func (h *Handler) handleNotify(w http.ResponseWriter, r *http.Request) {
// Check if customer has email configured and event type is enabled
if prefs == nil || prefs.Email == "" {
h.logger.Printf("[INFO] No email configured for %s, skipping notification", payload.CustomerID)
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "no email configured")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "no email configured", "customer")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok","sent":false,"reason":"no_email"}`))
return
@@ -291,7 +427,7 @@ func (h *Handler) handleNotify(w http.ResponseWriter, r *http.Request) {
}
if !eventEnabled {
h.logger.Printf("[INFO] Event %s not enabled for %s, skipping", payload.EventType, payload.CustomerID)
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "event not enabled")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "event not enabled", "customer")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok","sent":false,"reason":"event_disabled"}`))
return
@@ -300,7 +436,7 @@ func (h *Handler) handleNotify(w http.ResponseWriter, r *http.Request) {
// Send email via Resend API
if h.resendAPIKey == "" {
h.logger.Printf("[WARN] Resend API key not configured, cannot send notification email")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "resend api key not configured")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "skipped", "resend api key not configured", "customer")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok","sent":false,"reason":"no_api_key"}`))
return
@@ -310,13 +446,13 @@ func (h *Handler) handleNotify(w http.ResponseWriter, r *http.Request) {
sendErr := h.sendResendEmail(prefs.Email, subject, emailBody)
if sendErr != nil {
h.logger.Printf("[ERROR] Failed to send notification email to %s: %v", prefs.Email, sendErr)
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "failed", sendErr.Error())
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "failed", sendErr.Error(), "customer")
http.Error(w, "Failed to send email", http.StatusInternalServerError)
return
}
h.logger.Printf("[INFO] Notification email sent to %s for %s/%s", prefs.Email, payload.CustomerID, payload.EventType)
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "sent", "")
h.store.LogNotification(payload.CustomerID, payload.EventType, payload.Severity, payload.Message, "sent", "", "customer")
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok","sent":true}`))
@@ -339,13 +475,14 @@ func (h *Handler) handleSavePreferences(w http.ResponseWriter, r *http.Request)
CustomerID string `json:"customer_id"`
Email string `json:"email"`
EnabledEvents []string `json:"enabled_events"`
CooldownHours int `json:"cooldown_hours"`
}
if err := json.Unmarshal(body, &payload); err != nil || payload.CustomerID == "" {
http.Error(w, "Invalid payload: customer_id required", http.StatusBadRequest)
return
}
if err := h.store.SaveNotificationPrefs(payload.CustomerID, payload.Email, payload.EnabledEvents); err != nil {
if err := h.store.SaveNotificationPrefs(payload.CustomerID, payload.Email, payload.EnabledEvents, payload.CooldownHours); err != nil {
h.logger.Printf("[ERROR] Failed to save notification prefs for %s: %v", payload.CustomerID, err)
http.Error(w, "Internal error", http.StatusInternalServerError)
return
@@ -503,6 +640,7 @@ func formatNotificationEmail(customerID, eventType, severity, message, details s
severityLabel := map[string]string{
"info": "Információ",
"warning": "Figyelmeztetés",
"error": "Hiba",
"critical": "Kritikus",
}
label := severityLabel[severity]
+86
View File
@@ -0,0 +1,86 @@
package monitor
import (
"log"
"time"
"gitea.dooplex.hu/admin/felhom-hub/internal/store"
)
// budapest returns the Europe/Budapest timezone (cached).
var budapest *time.Location
func init() {
var err error
budapest, err = time.LoadLocation("Europe/Budapest")
if err != nil {
// Fallback: UTC+1 (CET base; DST handled by OS if available)
budapest = time.FixedZone("CET", 3600)
}
}
// CheckBackupDeadlines checks whether active customers had their expected
// daily backups and DB dumps. Runs once daily at 05:00 Budapest time.
//
// For each active customer, it checks for backup_completed and db_dump_completed
// events since Budapest midnight. If neither success nor failure events exist,
// it inserts expected_backup_missed / expected_dbdump_missed events.
//
// Customers whose nodes are "down" (no report in >1h) are skipped — they
// already have staleness events.
func CheckBackupDeadlines(s *store.Store, staleness *StalenessChecker, onEvent EventNotifyFunc, logger *log.Logger) {
customerIDs, err := s.GetActiveCustomerIDs()
if err != nil {
logger.Printf("[WARN] Deadline check: failed to get active customers: %v", err)
return
}
// Budapest midnight today
now := time.Now().In(budapest)
midnightBudapest := time.Date(now.Year(), now.Month(), now.Day(), 0, 0, 0, 0, budapest)
sinceUTC := midnightBudapest.UTC()
var backupMissed, dbdumpMissed, skipped int
for _, id := range customerIDs {
// Skip nodes that are down — they already have staleness events
if staleness != nil && staleness.GetState(id) == "down" {
skipped++
continue
}
// Check blocked
if s.IsCustomerBlocked(id) {
continue
}
// Check backup_completed / backup_failed since midnight
backupOK, _ := s.GetEventsByType(id, "backup_completed", sinceUTC)
backupFailed, _ := s.GetEventsByType(id, "backup_failed", sinceUTC)
if len(backupOK) == 0 && len(backupFailed) == 0 {
msg := "No backup completed or failed since midnight"
if _, err := s.SaveEvent(id, "expected_backup_missed", "error", msg, "{}", "hub"); err != nil {
logger.Printf("[WARN] Failed to save expected_backup_missed for %s: %v", id, err)
} else if onEvent != nil {
onEvent(id, "expected_backup_missed", "error", msg, "{}", "hub")
}
backupMissed++
}
// Check db_dump_completed / db_dump_failed since midnight
dumpOK, _ := s.GetEventsByType(id, "db_dump_completed", sinceUTC)
dumpFailed, _ := s.GetEventsByType(id, "db_dump_failed", sinceUTC)
if len(dumpOK) == 0 && len(dumpFailed) == 0 {
msg := "No DB dump completed or failed since midnight"
if _, err := s.SaveEvent(id, "expected_dbdump_missed", "error", msg, "{}", "hub"); err != nil {
logger.Printf("[WARN] Failed to save expected_dbdump_missed for %s: %v", id, err)
} else if onEvent != nil {
onEvent(id, "expected_dbdump_missed", "error", msg, "{}", "hub")
}
dbdumpMissed++
}
}
logger.Printf("[INFO] Deadline check: %d customers, %d backup missed, %d dbdump missed, %d skipped (down)",
len(customerIDs), backupMissed, dbdumpMissed, skipped)
}
+183
View File
@@ -0,0 +1,183 @@
package monitor
import (
"fmt"
"log"
"sync"
"time"
"gitea.dooplex.hu/admin/felhom-hub/internal/store"
)
// EventNotifyFunc is called after a hub-generated event is saved,
// to trigger notification dispatch. Keeps monitor decoupled from notify.
type EventNotifyFunc func(customerID, eventType, severity, message, detailsJSON, source string)
// StalenessChecker monitors customer report freshness and generates
// node_stale / node_down / node_recovered events on state transitions.
type StalenessChecker struct {
store *store.Store
threshold time.Duration // "stale" after this duration (default 30m)
downAfter time.Duration // "down" after this duration (2x threshold)
logger *log.Logger
onEvent EventNotifyFunc
mu sync.Mutex
states map[string]string // customerID → "ok" | "stale" | "down"
}
// NewStalenessChecker creates a checker and initializes state from current data.
// No events are generated during initialization (binding #12).
// onEvent is called after each hub-generated event is saved (may be nil).
func NewStalenessChecker(s *store.Store, threshold time.Duration, onEvent EventNotifyFunc, logger *log.Logger) *StalenessChecker {
sc := &StalenessChecker{
store: s,
threshold: threshold,
downAfter: 2 * threshold,
logger: logger,
onEvent: onEvent,
states: make(map[string]string),
}
// Seed states from current report timestamps — no events on init
customers, err := s.GetCustomers()
if err != nil {
logger.Printf("[WARN] Staleness checker: failed to seed states: %v", err)
return sc
}
var okCount, staleCount, downCount int
for _, c := range customers {
if s.IsCustomerBlocked(c.CustomerID) {
continue
}
age := time.Since(c.ReceivedAt)
switch {
case age > sc.downAfter:
sc.states[c.CustomerID] = "down"
downCount++
case age > sc.threshold:
sc.states[c.CustomerID] = "stale"
staleCount++
default:
sc.states[c.CustomerID] = "ok"
okCount++
}
}
logger.Printf("[INFO] Staleness checker initialized: %d ok, %d stale, %d down", okCount, staleCount, downCount)
return sc
}
// Check evaluates all customers and emits events on state transitions.
// Should be called periodically (every 60s).
func (sc *StalenessChecker) Check() {
customers, err := sc.store.GetCustomers()
if err != nil {
sc.logger.Printf("[WARN] Staleness check failed: %v", err)
return
}
sc.mu.Lock()
defer sc.mu.Unlock()
// Track which customers are still present (to clean up removed ones)
seen := make(map[string]bool, len(customers))
for _, c := range customers {
seen[c.CustomerID] = true
if sc.store.IsCustomerBlocked(c.CustomerID) {
delete(sc.states, c.CustomerID)
continue
}
age := time.Since(c.ReceivedAt)
var newState string
switch {
case age > sc.downAfter:
newState = "down"
case age > sc.threshold:
newState = "stale"
default:
newState = "ok"
}
oldState := sc.states[c.CustomerID]
if oldState == "" {
// New customer — set state without event
sc.states[c.CustomerID] = newState
continue
}
if oldState == newState {
continue
}
// State transition — emit event
sc.states[c.CustomerID] = newState
sc.emitTransition(c.CustomerID, oldState, newState, age)
}
// Clean up customers that no longer have reports
for id := range sc.states {
if !seen[id] {
delete(sc.states, id)
}
}
}
// GetState returns the current staleness state for a customer.
func (sc *StalenessChecker) GetState(customerID string) string {
sc.mu.Lock()
defer sc.mu.Unlock()
s := sc.states[customerID]
if s == "" {
return "unknown"
}
return s
}
func (sc *StalenessChecker) emitTransition(customerID, oldState, newState string, age time.Duration) {
var eventType, severity, message string
switch {
case newState == "stale":
eventType = "node_stale"
severity = "warning"
message = "No report received for " + formatDuration(age)
case newState == "down":
eventType = "node_down"
severity = "error"
message = "No report received for " + formatDuration(age)
case newState == "ok" && (oldState == "stale" || oldState == "down"):
eventType = "node_recovered"
severity = "info"
message = "Reports resumed (was " + oldState + " for " + formatDuration(age) + ")"
default:
return
}
sc.logger.Printf("[INFO] Staleness: %s %s → %s (%s)", customerID, oldState, newState, eventType)
if _, err := sc.store.SaveEvent(customerID, eventType, severity, message, "{}", "hub"); err != nil {
sc.logger.Printf("[WARN] Failed to save staleness event for %s: %v", customerID, err)
return
}
if sc.onEvent != nil {
sc.onEvent(customerID, eventType, severity, message, "{}", "hub")
}
}
func formatDuration(d time.Duration) string {
if d < time.Hour {
return fmt.Sprintf("%dm", int(d.Minutes()))
}
h := int(d.Hours())
m := int(d.Minutes()) % 60
if m == 0 {
return fmt.Sprintf("%dh", h)
}
return fmt.Sprintf("%dh%dm", h, m)
}
+201
View File
@@ -0,0 +1,201 @@
package notify
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"sync"
"time"
"gitea.dooplex.hu/admin/felhom-hub/internal/store"
)
// Dispatcher routes events to operator and/or customer email channels.
// Cooldowns are in-memory (lost on restart, acceptable).
type Dispatcher struct {
store *store.Store
resendAPIKey string
fromEmail string
operatorEmail string
operatorOn bool
httpClient *http.Client
logger *log.Logger
mu sync.Mutex
opCooldowns map[string]time.Time // "customerID:eventType" → last operator notify
custCooldowns map[string]time.Time // "customerID:eventType" → last customer notify
}
// NewDispatcher creates a new notification dispatcher.
func NewDispatcher(s *store.Store, resendAPIKey, fromEmail, operatorEmail string, operatorOn bool, logger *log.Logger) *Dispatcher {
return &Dispatcher{
store: s,
resendAPIKey: resendAPIKey,
fromEmail: fromEmail,
operatorEmail: operatorEmail,
operatorOn: operatorOn,
httpClient: &http.Client{Timeout: 10 * time.Second},
logger: logger,
opCooldowns: make(map[string]time.Time),
custCooldowns: make(map[string]time.Time),
}
}
// ProcessEvent evaluates an event and sends notifications as appropriate.
// Safe to call from goroutines.
func (d *Dispatcher) ProcessEvent(customerID, eventType, severity, message, detailsJSON, source string) {
if d.resendAPIKey == "" {
return
}
// "test" bypass — send directly to customer email, skip prefs/cooldown
if eventType == "test" {
d.sendTestEmail(customerID)
return
}
// Only warning and error severity trigger notifications
if severity != "warning" && severity != "error" {
return
}
// Operator channel
d.processOperator(customerID, eventType, severity, message, detailsJSON, source)
// Customer channel
d.processCustomer(customerID, eventType, severity, message, detailsJSON, source)
}
func (d *Dispatcher) sendTestEmail(customerID string) {
prefs, err := d.store.GetNotificationPrefs(customerID)
if err != nil || prefs.Email == "" {
d.logger.Printf("[WARN] Test email: no email configured for %s", customerID)
return
}
subject := "[Felhom] Teszt értesítés"
body := "Kedves Ügyfél!\n\nEz egy teszt értesítés a Felhom monitoring rendszerből.\nAz értesítések megfelelően működnek.\n\nÜdvözlettel,\nFelhom.eu monitoring"
if err := d.sendEmail(prefs.Email, subject, body); err != nil {
d.logger.Printf("[ERROR] Test email to %s failed: %v", prefs.Email, err)
d.store.LogNotification(customerID, "test", "info", "Teszt értesítés", "failed", err.Error(), "customer")
return
}
d.logger.Printf("[INFO] Test email sent to %s for %s", prefs.Email, customerID)
d.store.LogNotification(customerID, "test", "info", "Teszt értesítés", "sent", "", "customer")
}
func (d *Dispatcher) processOperator(customerID, eventType, severity, message, detailsJSON, source string) {
if !d.operatorOn || d.operatorEmail == "" {
return
}
cooldownKey := customerID + ":" + eventType
d.mu.Lock()
if last, ok := d.opCooldowns[cooldownKey]; ok && time.Since(last) < 1*time.Hour {
d.mu.Unlock()
return
}
d.opCooldowns[cooldownKey] = time.Now()
d.mu.Unlock()
subject, body := FormatOperatorEmail(customerID, eventType, severity, message, detailsJSON)
if err := d.sendEmail(d.operatorEmail, subject, body); err != nil {
d.logger.Printf("[ERROR] Operator email failed for %s/%s: %v", customerID, eventType, err)
d.store.LogNotification(customerID, eventType, severity, message, "failed", err.Error(), "operator")
return
}
d.logger.Printf("[INFO] Operator email sent for %s/%s", customerID, eventType)
d.store.LogNotification(customerID, eventType, severity, message, "sent", "", "operator")
}
func (d *Dispatcher) processCustomer(customerID, eventType, severity, message, detailsJSON, source string) {
// Check if customer is blocked
if d.store.IsCustomerBlocked(customerID) {
return
}
// Load preferences
prefs, err := d.store.GetNotificationPrefs(customerID)
if err != nil || prefs.Email == "" {
return
}
// Check if event type is enabled
if !isEventEnabled(prefs.EnabledEvents, eventType) {
return
}
// Customer cooldown (from prefs, default 6h)
cooldownHours := prefs.CooldownHours
if cooldownHours <= 0 {
cooldownHours = 6
}
cooldownDur := time.Duration(cooldownHours) * time.Hour
cooldownKey := customerID + ":" + eventType
d.mu.Lock()
if last, ok := d.custCooldowns[cooldownKey]; ok && time.Since(last) < cooldownDur {
d.mu.Unlock()
return
}
d.custCooldowns[cooldownKey] = time.Now()
d.mu.Unlock()
subject, body := FormatCustomerEmail(customerID, eventType, severity, message, detailsJSON)
if err := d.sendEmail(prefs.Email, subject, body); err != nil {
d.logger.Printf("[ERROR] Customer email failed for %s/%s: %v", customerID, eventType, err)
d.store.LogNotification(customerID, eventType, severity, message, "failed", err.Error(), "customer")
return
}
d.logger.Printf("[INFO] Customer email sent to %s for %s/%s", prefs.Email, customerID, eventType)
d.store.LogNotification(customerID, eventType, severity, message, "sent", "", "customer")
}
func (d *Dispatcher) sendEmail(to, subject, textBody string) error {
payload := map[string]interface{}{
"from": d.fromEmail,
"to": []string{to},
"subject": subject,
"text": textBody,
}
jsonData, err := json.Marshal(payload)
if err != nil {
return fmt.Errorf("marshaling email payload: %w", err)
}
req, err := http.NewRequest("POST", "https://api.resend.com/emails", bytes.NewReader(jsonData))
if err != nil {
return fmt.Errorf("creating request: %w", err)
}
req.Header.Set("Authorization", "Bearer "+d.resendAPIKey)
req.Header.Set("Content-Type", "application/json")
resp, err := d.httpClient.Do(req)
if err != nil {
return fmt.Errorf("sending request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
respBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1024))
return fmt.Errorf("resend API returned %d: %s", resp.StatusCode, string(respBody))
}
return nil
}
func isEventEnabled(enabledEvents []string, eventType string) bool {
for _, e := range enabledEvents {
if e == eventType {
return true
}
}
return false
}
+153
View File
@@ -0,0 +1,153 @@
package notify
import (
"fmt"
"time"
)
// budapest timezone for formatting.
var budapest *time.Location
func init() {
var err error
budapest, err = time.LoadLocation("Europe/Budapest")
if err != nil {
budapest = time.FixedZone("CET", 3600)
}
}
// ──────────────────────────────────────────────────────────────────────
// Operator email — concise, English
// ──────────────────────────────────────────────────────────────────────
// FormatOperatorEmail returns (subject, textBody) for the operator channel.
func FormatOperatorEmail(customerID, eventType, severity, message, detailsJSON string) (string, string) {
icon := "⚠️"
if severity == "error" {
icon = "🔴"
}
subject := fmt.Sprintf("[Felhom] %s %s: %s", icon, customerID, eventType)
now := time.Now().In(budapest).Format("2006-01-02 15:04 MST")
body := fmt.Sprintf(`Customer: %s
Event: %s
Severity: %s
Time: %s
Message: %s`, customerID, eventType, severity, now, message)
if detailsJSON != "" && detailsJSON != "{}" {
body += fmt.Sprintf("\nDetails: %s", detailsJSON)
}
body += fmt.Sprintf("\n\nDashboard: https://hub.felhom.eu/customers/%s", customerID)
return subject, body
}
// ──────────────────────────────────────────────────────────────────────
// Customer email — Hungarian, friendly
// ──────────────────────────────────────────────────────────────────────
// customerMessages maps event_type → Hungarian customer message.
var customerMessages = map[string]string{
// Backup events
"backup_completed": "A biztonsági mentés sikeresen elkészült.",
"backup_failed": "A biztonsági mentés sikertelen! Kérjük, ellenőrizd a rendszert.",
"db_dump_completed": "Az adatbázis mentés sikeresen elkészült.",
"db_dump_failed": "Az adatbázis mentés sikertelen!",
"backup_integrity_ok": "A mentés integritás ellenőrzés sikeres.",
"backup_integrity_failed": "A mentés integritás ellenőrzés hibát talált!",
"crossdrive_completed": "A másodlagos mentés sikeresen elkészült.",
"crossdrive_failed": "A másodlagos mentés sikertelen!",
// Disk events
"disk_warning": "A lemezterület 90% felett van — kérjük, szabadíts fel helyet.",
"disk_critical": "A lemezterület kritikusan magas (95%+) — azonnali beavatkozás szükséges!",
// Storage events
"storage_disconnected": "Egy meghajtó leválasztva — a mentések szünetelhetnek.",
"storage_reconnected": "A meghajtó újra csatlakoztatva.",
// Staleness events (Hub-generated)
"node_stale": "A szerver nem küldött jelentést az elmúlt időszakban.",
"node_down": "A szerver nem elérhető!",
"node_recovered": "A szerver újra elérhető.",
// Health events
"health_degraded": "A rendszer állapota romlott.",
"health_critical": "A rendszer állapota kritikus!",
"health_recovered": "A rendszer állapota helyreállt.",
// Controller events
"controller_started": "A vezérlő elindult.",
"controller_updated": "A vezérlő frissítve lett.",
// Deadline events (Hub-generated)
"expected_backup_missed": "A mai biztonsági mentés nem készült el a határidőig!",
"expected_dbdump_missed": "A mai adatbázis mentés nem készült el a határidőig!",
// App lifecycle events
"app_deployed": "Alkalmazás telepítve.",
"app_removed": "Alkalmazás eltávolítva.",
// Disaster recovery events
"disaster_recovery_started": "Katasztrófa helyreállítás elindítva.",
"disaster_recovery_completed": "Katasztrófa helyreállítás befejezve.",
// Test
"test": "Ez egy teszt értesítés.",
}
// severityLabels maps severity to Hungarian labels.
var severityLabels = map[string]string{
"info": "Információ",
"warning": "Figyelmeztetés",
"error": "Hiba",
}
// FormatCustomerEmail returns (subject, textBody) for the customer channel.
func FormatCustomerEmail(customerID, eventType, severity, message, detailsJSON string) (string, string) {
label := severityLabels[severity]
if label == "" {
label = severity
}
// Use the per-event-type Hungarian message if available, otherwise fall back to message
hunMessage := customerMessages[eventType]
if hunMessage == "" {
hunMessage = message
}
subject := fmt.Sprintf("[Felhom] %s: %s", label, hunMessage)
now := time.Now().In(budapest).Format("2006-01-02 15:04")
body := fmt.Sprintf(`Kedves Ügyfél!
A Felhom rendszered a következő értesítést küldte:
%s
Részletek:
- Szerver: %s
- Időpont: %s
- Szint: %s
- Típus: %s`, hunMessage, customerID, now, label, eventType)
if message != "" && message != hunMessage {
body += fmt.Sprintf("\n- Üzenet: %s", message)
}
if detailsJSON != "" && detailsJSON != "{}" {
body += fmt.Sprintf("\n- Megjegyzés: %s", detailsJSON)
}
body += `
Ha kérdésed van, vedd fel a kapcsolatot az üzemeltetővel.
Üdvözlettel,
Felhom.eu monitoring`
return subject, body
}
+225 -12
View File
@@ -121,6 +121,33 @@ func (s *Store) migrate() error {
// v0.2.1: add status column to customer_configs (idempotent)
s.db.Exec("ALTER TABLE customer_configs ADD COLUMN status TEXT NOT NULL DEFAULT 'active'")
// v0.3.0: events table for hub-native monitoring
_, err = s.db.Exec(`
CREATE TABLE IF NOT EXISTS events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
customer_id TEXT NOT NULL,
event_type TEXT NOT NULL,
severity TEXT NOT NULL DEFAULT 'info',
message TEXT NOT NULL DEFAULT '',
details_json TEXT NOT NULL DEFAULT '{}',
source TEXT NOT NULL DEFAULT 'controller',
created_at DATETIME NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_events_customer_created
ON events(customer_id, created_at DESC);
CREATE INDEX IF NOT EXISTS idx_events_type
ON events(event_type, created_at DESC);
`)
if err != nil {
return err
}
// v0.3.0: add cooldown_hours to customer_notifications (idempotent)
s.db.Exec("ALTER TABLE customer_notifications ADD COLUMN cooldown_hours INTEGER DEFAULT 6")
// v0.3.0: add channel column to notification_log (idempotent)
s.db.Exec("ALTER TABLE notification_log ADD COLUMN channel TEXT NOT NULL DEFAULT 'customer'")
return nil
}
@@ -129,15 +156,17 @@ type NotificationPrefs struct {
CustomerID string
Email string
EnabledEvents []string
CooldownHours int
}
// GetNotificationPrefs returns notification preferences for a customer.
func (s *Store) GetNotificationPrefs(customerID string) (*NotificationPrefs, error) {
var email, eventsJSON string
var cooldownHours int
err := s.db.QueryRow(
"SELECT email, enabled_events FROM customer_notifications WHERE customer_id = ?",
"SELECT email, enabled_events, COALESCE(cooldown_hours, 6) FROM customer_notifications WHERE customer_id = ?",
customerID,
).Scan(&email, &eventsJSON)
).Scan(&email, &eventsJSON, &cooldownHours)
if err != nil {
if err == sql.ErrNoRows {
return nil, nil
@@ -150,34 +179,46 @@ func (s *Store) GetNotificationPrefs(customerID string) (*NotificationPrefs, err
s.logger.Printf("[WARN] Corrupt enabled_events JSON for %s: %v", customerID, err)
}
if cooldownHours <= 0 {
cooldownHours = 6
}
return &NotificationPrefs{
CustomerID: customerID,
Email: email,
EnabledEvents: events,
CooldownHours: cooldownHours,
}, nil
}
// SaveNotificationPrefs creates or updates notification preferences for a customer.
func (s *Store) SaveNotificationPrefs(customerID, email string, enabledEvents []string) error {
func (s *Store) SaveNotificationPrefs(customerID, email string, enabledEvents []string, cooldownHours int) error {
eventsJSON, _ := json.Marshal(enabledEvents)
if cooldownHours <= 0 {
cooldownHours = 6
}
_, err := s.db.Exec(`
INSERT INTO customer_notifications (customer_id, email, enabled_events, updated_at)
VALUES (?, ?, ?, datetime('now'))
INSERT INTO customer_notifications (customer_id, email, enabled_events, cooldown_hours, updated_at)
VALUES (?, ?, ?, ?, datetime('now'))
ON CONFLICT(customer_id) DO UPDATE SET
email = excluded.email,
enabled_events = excluded.enabled_events,
cooldown_hours = excluded.cooldown_hours,
updated_at = datetime('now')`,
customerID, email, string(eventsJSON),
customerID, email, string(eventsJSON), cooldownHours,
)
return err
}
// LogNotification records a notification attempt.
func (s *Store) LogNotification(customerID, eventType, severity, message, status, errorMsg string) error {
func (s *Store) LogNotification(customerID, eventType, severity, message, status, errorMsg, channel string) error {
if channel == "" {
channel = "customer"
}
_, err := s.db.Exec(`
INSERT INTO notification_log (customer_id, event_type, severity, message, status, error_message)
VALUES (?, ?, ?, ?, ?, ?)`,
customerID, eventType, severity, message, status, errorMsg,
INSERT INTO notification_log (customer_id, event_type, severity, message, status, error_message, channel)
VALUES (?, ?, ?, ?, ?, ?, ?)`,
customerID, eventType, severity, message, status, errorMsg, channel,
)
return err
}
@@ -189,13 +230,14 @@ type NotificationLogEntry struct {
Message string
Status string // "sent", "skipped", "failed"
ErrorMessage string
Channel string // "operator" or "customer"
CreatedAt time.Time
}
// GetRecentNotifications returns the most recent notification log entries for a customer.
func (s *Store) GetRecentNotifications(customerID string, limit int) ([]NotificationLogEntry, error) {
rows, err := s.db.Query(`
SELECT event_type, severity, message, status, COALESCE(error_message, ''), created_at
SELECT event_type, severity, message, status, COALESCE(error_message, ''), COALESCE(channel, 'customer'), created_at
FROM notification_log
WHERE customer_id = ?
ORDER BY created_at DESC
@@ -209,7 +251,7 @@ func (s *Store) GetRecentNotifications(customerID string, limit int) ([]Notifica
for rows.Next() {
var e NotificationLogEntry
var createdAt, errorMsg string
if err := rows.Scan(&e.EventType, &e.Severity, &e.Message, &e.Status, &errorMsg, &createdAt); err != nil {
if err := rows.Scan(&e.EventType, &e.Severity, &e.Message, &e.Status, &errorMsg, &e.Channel, &createdAt); err != nil {
return nil, err
}
e.CreatedAt = parseSQLiteTime(createdAt)
@@ -658,6 +700,177 @@ func (s *Store) UpdateRetrievalPassword(customerID, newPassword string) error {
return err
}
// --- Event system ---
// Event represents a single event record.
type Event struct {
ID int64
CustomerID string
EventType string
Severity string // "info", "warning", "error"
Message string
DetailsJSON string // raw JSON
Source string // "controller" or "hub"
CreatedAt time.Time
}
// SaveEvent inserts a new event and returns its ID.
func (s *Store) SaveEvent(customerID, eventType, severity, message, detailsJSON, source string) (int64, error) {
if detailsJSON == "" {
detailsJSON = "{}"
}
if source == "" {
source = "controller"
}
res, err := s.db.Exec(`
INSERT INTO events (customer_id, event_type, severity, message, details_json, source)
VALUES (?, ?, ?, ?, ?, ?)`,
customerID, eventType, severity, message, detailsJSON, source,
)
if err != nil {
return 0, err
}
return res.LastInsertId()
}
// GetRecentEvents returns the most recent events for a customer, newest first.
func (s *Store) GetRecentEvents(customerID string, limit int) ([]Event, error) {
rows, err := s.db.Query(`
SELECT id, customer_id, event_type, severity, message, details_json, source, created_at
FROM events
WHERE customer_id = ?
ORDER BY created_at DESC
LIMIT ?`, customerID, limit)
if err != nil {
return nil, err
}
defer rows.Close()
return scanEvents(rows)
}
// GetEventsByType returns events of a specific type for a customer since a given time.
func (s *Store) GetEventsByType(customerID, eventType string, since time.Time) ([]Event, error) {
rows, err := s.db.Query(`
SELECT id, customer_id, event_type, severity, message, details_json, source, created_at
FROM events
WHERE customer_id = ? AND event_type = ? AND created_at >= ?
ORDER BY created_at DESC`,
customerID, eventType, since.UTC().Format("2006-01-02 15:04:05"))
if err != nil {
return nil, err
}
defer rows.Close()
return scanEvents(rows)
}
// GetLatestEventByType returns the most recent event of a given type for a customer.
func (s *Store) GetLatestEventByType(customerID, eventType string) (*Event, error) {
var e Event
var createdAt string
err := s.db.QueryRow(`
SELECT id, customer_id, event_type, severity, message, details_json, source, created_at
FROM events
WHERE customer_id = ? AND event_type = ?
ORDER BY created_at DESC
LIMIT 1`, customerID, eventType,
).Scan(&e.ID, &e.CustomerID, &e.EventType, &e.Severity, &e.Message, &e.DetailsJSON, &e.Source, &createdAt)
if err == sql.ErrNoRows {
return nil, nil
}
if err != nil {
return nil, err
}
e.CreatedAt = parseSQLiteTime(createdAt)
return &e, nil
}
// GetAllRecentEvents returns the most recent events across all customers.
func (s *Store) GetAllRecentEvents(limit int) ([]Event, error) {
rows, err := s.db.Query(`
SELECT id, customer_id, event_type, severity, message, details_json, source, created_at
FROM events
ORDER BY created_at DESC
LIMIT ?`, limit)
if err != nil {
return nil, err
}
defer rows.Close()
return scanEvents(rows)
}
// CountEventsBySeverity returns a count of events per severity for a customer since a given time.
func (s *Store) CountEventsBySeverity(customerID string, since time.Time) (map[string]int, error) {
rows, err := s.db.Query(`
SELECT severity, COUNT(*) FROM events
WHERE customer_id = ? AND created_at >= ?
GROUP BY severity`,
customerID, since.UTC().Format("2006-01-02 15:04:05"))
if err != nil {
return nil, err
}
defer rows.Close()
counts := make(map[string]int)
for rows.Next() {
var sev string
var count int
if err := rows.Scan(&sev, &count); err != nil {
return nil, err
}
counts[sev] = count
}
return counts, rows.Err()
}
// PruneEvents deletes events older than the given number of days.
func (s *Store) PruneEvents(maxDays int) (int64, error) {
cutoff := time.Now().AddDate(0, 0, -maxDays).UTC().Format("2006-01-02 15:04:05")
res, err := s.db.Exec("DELETE FROM events WHERE created_at < ?", cutoff)
if err != nil {
return 0, err
}
return res.RowsAffected()
}
// GetActiveCustomerIDs returns customer IDs from customer_configs where status is 'active'.
func (s *Store) GetActiveCustomerIDs() ([]string, error) {
rows, err := s.db.Query("SELECT customer_id FROM customer_configs WHERE status = 'active'")
if err != nil {
return nil, err
}
defer rows.Close()
var ids []string
for rows.Next() {
var id string
if err := rows.Scan(&id); err != nil {
return nil, err
}
ids = append(ids, id)
}
return ids, rows.Err()
}
// Ping verifies the database is accessible.
func (s *Store) Ping() error {
var n int
return s.db.QueryRow("SELECT 1").Scan(&n)
}
func scanEvents(rows *sql.Rows) ([]Event, error) {
var events []Event
for rows.Next() {
var e Event
var createdAt string
if err := rows.Scan(&e.ID, &e.CustomerID, &e.EventType, &e.Severity, &e.Message, &e.DetailsJSON, &e.Source, &createdAt); err != nil {
return nil, err
}
e.CreatedAt = parseSQLiteTime(createdAt)
events = append(events, e)
}
return events, rows.Err()
}
// parseSQLiteTime tries multiple formats that modernc.org/sqlite may return.
func parseSQLiteTime(s string) time.Time {
formats := []string{
+12 -2
View File
@@ -223,12 +223,14 @@ func (s *Server) handleCustomerUnified(w http.ResponseWriter, r *http.Request, c
}
}
// History, notifications, infra backup
// History, notifications, events, infra backup
var history []store.CustomerSummary
var notifPrefs *store.NotificationPrefs
var recentNotifs []store.NotificationLogEntry
var infraMeta *store.InfraBackupMeta
var infraBackupAge string
var events []store.Event
var eventCounts map[string]int
if customer != nil {
history, _ = s.store.GetCustomerHistory(customerID, 24*time.Hour)
@@ -238,6 +240,8 @@ func (s *Server) handleCustomerUnified(w http.ResponseWriter, r *http.Request, c
if infraMeta != nil {
infraBackupAge = timeAgo(infraMeta.UpdatedAt)
}
events, _ = s.store.GetRecentEvents(customerID, 50)
eventCounts, _ = s.store.CountEventsBySeverity(customerID, time.Now().Add(-24*time.Hour))
}
type pageData struct {
@@ -270,6 +274,9 @@ func (s *Server) handleCustomerUnified(w http.ResponseWriter, r *http.Request, c
RecentNotifications []store.NotificationLogEntry
History []store.CustomerSummary
Events []store.Event
EventCounts map[string]int // severity → count (last 24h)
Flash string
ActiveNav string
}
@@ -304,6 +311,9 @@ func (s *Server) handleCustomerUnified(w http.ResponseWriter, r *http.Request, c
RecentNotifications: recentNotifs,
History: history,
Events: events,
EventCounts: eventCounts,
Flash: r.URL.Query().Get("flash"),
ActiveNav: "configs",
}
@@ -697,7 +707,7 @@ func buildConfigJSON(r *http.Request) string {
overrides["git"] = git
}
// Monitoring UUIDs
// Monitoring UUIDs (legacy — only written if user explicitly provides values)
uuids := make(map[string]interface{})
for _, key := range []string{"heartbeat", "system_health", "db_dump", "backup", "backup_integrity"} {
if v := strings.TrimSpace(r.FormValue("uuid_" + key)); v != "" {
+6 -6
View File
@@ -80,12 +80,12 @@ backup:
monitoring:
enabled: true
healthchecks_base: "https://status.felhom.eu"
ping_uuids:
heartbeat: "" # Every 5 min — controller process alive
system_health: "" # Every 5 min — comprehensive system check
db_dump: "" # Daily — after database dumps
backup: "" # Daily — after restic snapshot
backup_integrity: "" # Weekly (Sunday) — restic check
# ping_uuids: (deprecated — monitoring is now handled by the Hub event system)
# heartbeat: ""
# system_health: ""
# db_dump: ""
# backup: ""
# backup_integrity: ""
system_health_interval: "5m"
health_check_schedule: "06:00"
thresholds:
+15
View File
@@ -40,6 +40,13 @@ func New(store *store.Store, passwordHash, apiKey string, staleThreshold time.Du
b, _ := json.Marshal(v)
return template.JS(b)
},
"add": func(a, b int) int { return a + b },
"mapGet": func(m map[string]int, key string) int {
if m == nil {
return 0
}
return m[key]
},
}
tmpl := template.Must(template.New("").Funcs(funcMap).ParseFS(templateFS, "templates/*.html"))
@@ -232,6 +239,8 @@ func (s *Server) handleDashboard(w http.ResponseWriter, r *http.Request) {
store.CustomerSummary
OverallStatus string // "ok", "warn", "down", "pending"
BackupAge string
EventErrors int
EventWarnings int
}
// Build map of report customers keyed by ID
@@ -266,6 +275,12 @@ func (s *Server) handleDashboard(w http.ResponseWriter, r *http.Request) {
dc.BackupAge = ""
}
// Event counts (last 24h)
if counts, err := s.store.CountEventsBySeverity(c.CustomerID, time.Now().Add(-24*time.Hour)); err == nil {
dc.EventErrors = counts["error"]
dc.EventWarnings = counts["warning"]
}
data = append(data, dc)
}
+3 -2
View File
@@ -96,8 +96,9 @@
</div>
</details>
<details class="card" {{if index .Overrides "monitoring"}}open{{end}}>
<summary><h2 style="display:inline">Monitoring UUIDs</h2></summary>
<details class="card">
<summary><h2 style="display:inline">Monitoring UUIDs</h2> <span class="severity-badge severity-warning" style="font-size: 0.7em; vertical-align: middle;">Legacy</span></summary>
<p class="text-muted" style="margin: 0.5rem 0;">Healthchecks ping UUIDs are deprecated. Monitoring is now handled natively by the Hub event system. These fields are kept for backward compatibility with older controllers.</p>
<div class="form-grid" style="margin-top: 1rem;">
{{$uuids := ""}}
{{with .Overrides}}{{with index . "monitoring"}}{{with index . "ping_uuids"}}{{$uuids = .}}{{end}}{{end}}{{end}}
@@ -403,6 +403,61 @@
{{end}}
</section>
<!-- Events -->
<section class="card">
<h2>Events
{{if .EventCounts}}
{{with mapGet .EventCounts "error"}}<span class="severity-badge severity-error">{{.}} error{{if gt . 1}}s{{end}}</span>{{end}}
{{with mapGet .EventCounts "warning"}}<span class="severity-badge severity-warning">{{.}} warning{{if gt . 1}}s{{end}}</span>{{end}}
{{end}}
<span class="text-muted" style="font-size: 0.7em; font-weight: normal;"> (last 24h)</span>
</h2>
{{if .Events}}
<div style="margin-bottom: 0.5rem;">
<button class="btn btn-sm btn-outline event-filter active" data-filter="all">All</button>
<button class="btn btn-sm btn-outline event-filter" data-filter="error">Errors</button>
<button class="btn btn-sm btn-outline event-filter" data-filter="warning">Warnings</button>
<button class="btn btn-sm btn-outline event-filter" data-filter="info">Info</button>
</div>
<table class="history-table" id="events-table">
<thead>
<tr>
<th>Time</th>
<th>Severity</th>
<th>Type</th>
<th>Message</th>
<th>Source</th>
</tr>
</thead>
<tbody>
{{range .Events}}
<tr data-severity="{{.Severity}}">
<td title="{{.CreatedAt.Format "2006-01-02 15:04:05"}}">{{.CreatedAt.Format "Jan 02 15:04"}}</td>
<td><span class="severity-badge severity-{{.Severity}}">{{.Severity}}</span></td>
<td><code>{{.EventType}}</code></td>
<td>{{.Message}}</td>
<td>{{.Source}}</td>
</tr>
{{end}}
</tbody>
</table>
<script>
document.querySelectorAll('.event-filter').forEach(btn => {
btn.addEventListener('click', function() {
document.querySelectorAll('.event-filter').forEach(b => b.classList.remove('active'));
this.classList.add('active');
const filter = this.dataset.filter;
document.querySelectorAll('#events-table tbody tr').forEach(row => {
row.style.display = (filter === 'all' || row.dataset.severity === filter) ? '' : 'none';
});
});
});
</script>
{{else}}
<p class="text-muted">No events recorded yet.</p>
{{end}}
</section>
<!-- Notifications -->
<section class="card">
<h2>Notifications</h2>
@@ -424,6 +479,7 @@
<thead>
<tr>
<th>Time</th>
<th>Channel</th>
<th>Event</th>
<th>Status</th>
<th>Message</th>
@@ -433,6 +489,7 @@
{{range .RecentNotifications}}
<tr>
<td>{{.CreatedAt.Format "Jan 02 15:04"}}</td>
<td><span class="status-badge status-badge-{{.Channel}}">{{.Channel}}</span></td>
<td>{{.EventType}}</td>
<td><span class="status-badge status-badge-{{.Status}}">{{.Status}}</span></td>
<td>{{.Message}}</td>
@@ -28,6 +28,7 @@
<tr>
<th>Customer</th>
<th>Status</th>
<th>Events</th>
<th>Last Seen</th>
<th>CPU</th>
<th>Memory</th>
@@ -49,6 +50,7 @@
{{if eq .OverallStatus "ok"}}OK{{else if eq .OverallStatus "warn"}}WARN{{else if eq .OverallStatus "disabled"}}PAUSED{{else if eq .OverallStatus "pending"}}PENDING{{else}}DOWN{{end}}
</span>
</td>
<td>{{if eq .OverallStatus "pending"}}—{{else}}{{if gt (add .EventErrors .EventWarnings) 0}}{{if gt .EventErrors 0}}<span class="severity-badge severity-error">{{.EventErrors}}</span>{{end}}{{if gt .EventWarnings 0}}<span class="severity-badge severity-warning">{{.EventWarnings}}</span>{{end}}{{else}}<span class="text-muted"></span>{{end}}{{end}}</td>
<td>{{if eq .OverallStatus "pending"}}—{{else}}{{timeAgo .ReceivedAt}}{{end}}</td>
<td>{{if eq .OverallStatus "pending"}}—{{else}}{{formatFloat .CPUPercent}}%{{end}}</td>
<td>{{if eq .OverallStatus "pending"}}—{{else}}{{formatFloat .MemoryPercent}}%{{end}}</td>
+49
View File
@@ -564,6 +564,55 @@ code {
color: var(--text-muted);
}
/* Severity badges */
.severity-badge {
display: inline-block;
padding: 0.15em 0.5em;
border-radius: 4px;
font-size: 0.8em;
font-weight: 600;
line-height: 1.4;
}
.severity-error {
background: rgba(239, 68, 68, 0.15);
color: #ef4444;
}
.severity-warning {
background: rgba(245, 158, 11, 0.15);
color: #f59e0b;
}
.severity-info {
background: rgba(59, 130, 246, 0.15);
color: #3b82f6;
}
/* Event filter buttons */
.event-filter {
font-size: 0.8em;
padding: 0.2em 0.6em;
cursor: pointer;
}
.event-filter.active {
background: var(--accent);
color: #fff;
border-color: var(--accent);
}
/* Notification channel badges */
.status-badge-operator {
background: rgba(139, 92, 246, 0.15);
color: #8b5cf6;
}
.status-badge-customer {
background: rgba(59, 130, 246, 0.15);
color: #3b82f6;
}
/* Responsive */
@media (max-width: 768px) {
.container { padding: 1rem; }