admin 3217cb4751 feat: Hub monitoring takeover — event system, dead man's switch, notifications (v0.3.0)

Replace external Healthchecks.io with Hub-native monitoring. New events
table + /api/v1/event endpoint for structured events from controllers.
Staleness checker (60s) detects unresponsive nodes. Backup deadline
checker (daily 05:00) catches missed backups. Notification dispatcher
sends operator (English) + customer (Hungarian) emails via Resend with
per-event cooldowns. Event timeline on customer page, dashboard badges.
Config form deprecates Monitoring UUIDs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-20 18:53:24 +01:00

README.md

felhom-hub

Central operator dashboard for monitoring and managing Felhom customer deployments.

A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. It also stores infrastructure backups for disaster recovery, runs event-based dead man's switch monitoring, and dispatches notifications.

Current version: v0.3.0

Architecture

   Customer nodes                             Central Hub (k3s)
┌─────────────────┐                     ┌────────────────────────┐
│ felhom-controller│──── JSON push ────▶│  felhom-hub            │
│ (every 15 min)   │    (Bearer auth)   │                        │
│                  │                     │  ┌─────────────────┐   │
│ POST /api/v1/    │                     │  │ API Handler     │   │
│   report         │                     │  │ (ingest reports, │   │
│   infra-backup   │◀── config push ────│  │  infra backups,  │   │
│   notify         │    (YAML body)     │  │  config push)    │   │
│                  │                     │  └────────┬────────┘   │
└─────────────────┘                     │           │             │
                                        │  ┌────────▼────────┐   │
   Operator browser                     │  │ SQLite Store    │   │
┌─────────────────┐                     │  │ (reports,       │   │
│ Web Dashboard   │◀── HTML pages ──────│  │  infra_backups, │   │
│ (hub.felhom.eu) │    (bcrypt auth)    │  │  configs,       │   │
└─────────────────┘                     │  │  notifications) │   │
                                        │  └─────────────────┘   │
                                        │                        │
                                        │  ┌─────────────────┐   │
                                        │  │ Web Dashboard   │   │
                                        │  │ (unified customer│   │
                                        │  │  management)     │   │
                                        │  └─────────────────┘   │
                                        └────────────────────────┘

API Endpoints

All API endpoints require an Authorization: Bearer <api_key> header, except /healthz (no auth) and /api/v1/config/{customer_id} (which uses a retrieval password instead). Auth accepts either the global report_api_key or a per-customer API key (generated when a customer config is created).
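As a rough illustration of the auth scheme above, a Go client could attach the key as shown here. The helper name `newHubRequest` and the one-field JSON body are made up for this sketch; the real report schema is not shown in this README.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// newHubRequest (hypothetical helper) builds a POST request carrying the
// Bearer token the Hub expects. Nothing is sent over the network here.
func newHubRequest(baseURL, path, apiKey string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+path, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder key and payload, for illustration only.
	body := []byte(`{"customer_id":"demo"}`)
	req, err := newHubRequest("https://hub.felhom.eu", "/api/v1/report", "example-key", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Authorization"))
}
```

The same helper would work with either the global key or a per-customer key, since both travel in the same header.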

Report Ingest

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/v1/report` | Controller pushes periodic status report |
| `GET` | `/api/v1/customers` | List all customers with latest report summary |
| `GET` | `/api/v1/customers/{id}` | Get latest full report for a customer |
| `GET` | `/api/v1/customers/{id}/history?period=7d` | Get report history |

Infrastructure Backup (Disaster Recovery)

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/v1/infra-backup` | Controller pushes infrastructure snapshot |
| `GET` | `/api/v1/infra-backup/{customer_id}` | Fresh controller pulls backup for restore |

The infra-backup payload contains everything needed to restore a customer deployment:

- controller.yaml (base64, full config including secrets)
- settings.json (base64, backup preferences, storage paths)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths, display names)
- Restic passwords (primary + cross-drive, for encrypted backup access)

Disaster recovery flow:

1. Customer's system drive fails → replaced with fresh Debian install
2. docker-setup.sh deploys controller with Hub details (customer_id + API key)
3. Controller detects fresh deployment → calls GET /api/v1/infra-backup/{customer_id}
4. Controller uses disk UUIDs to auto-mount surviving drives
5. Controller restores apps from local backups on those drives
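The restore step that follows the GET call can be sketched as decoding the two base64 files from the payload. This is only an illustration: the JSON field names `controller_yaml` and `settings_json` are assumptions, as the README lists the payload contents but not its schema.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// infraBackup covers two of the payload items listed above.
// The JSON field names are assumptions made for this sketch.
type infraBackup struct {
	ControllerYAML string `json:"controller_yaml"` // base64-encoded file
	SettingsJSON   string `json:"settings_json"`   // base64-encoded file
}

// decodeBackup parses a payload and returns the decoded file contents.
func decodeBackup(raw []byte) (controllerYAML, settingsJSON []byte, err error) {
	var b infraBackup
	if err = json.Unmarshal(raw, &b); err != nil {
		return nil, nil, fmt.Errorf("parse payload: %w", err)
	}
	if controllerYAML, err = base64.StdEncoding.DecodeString(b.ControllerYAML); err != nil {
		return nil, nil, fmt.Errorf("decode controller.yaml: %w", err)
	}
	if settingsJSON, err = base64.StdEncoding.DecodeString(b.SettingsJSON); err != nil {
		return nil, nil, fmt.Errorf("decode settings.json: %w", err)
	}
	return controllerYAML, settingsJSON, nil
}

func main() {
	// Build a tiny stand-in payload instead of calling the Hub.
	payload, _ := json.Marshal(infraBackup{
		ControllerYAML: base64.StdEncoding.EncodeToString([]byte("hub: {}")),
		SettingsJSON:   base64.StdEncoding.EncodeToString([]byte("{}")),
	})
	cfg, settings, err := decodeBackup(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("controller.yaml: %q, settings.json: %q\n", cfg, settings)
}
```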

Events

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |

Events are the primary monitoring mechanism. Each event has the fields customer_id, event_type, severity, message, details_json, and source. Per-customer API keys are validated against the customer_id in the payload. Events are stored in the events table and pruned automatically.

Hub-generated events (source="hub"):

- node_stale / node_down / node_recovered: dead man's switch from the staleness checker (every 60s)
- expected_backup_missed / expected_dbdump_missed: backup deadline checker (daily at 05:00 Budapest)
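The event fields and the per-customer key check described in this section could be modelled roughly as follows. This is a sketch, not the Hub's code: the struct layout, the `validateEvent` name, and the convention that an empty key owner means the global key are all assumptions. The 27 allowed event types are not listed in this README, so the sketch does not check them.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// event mirrors the six fields named above. Treating details_json as raw
// JSON is an assumption of this sketch.
type event struct {
	CustomerID string          `json:"customer_id"`
	EventType  string          `json:"event_type"`
	Severity   string          `json:"severity"`
	Message    string          `json:"message"`
	Details    json.RawMessage `json:"details_json,omitempty"`
	Source     string          `json:"source"`
}

// validateEvent checks required fields, the severity value, and that a
// per-customer key only submits events for its own customer.
// keyCustomerID is "" when the global report_api_key was used (assumption).
func validateEvent(e event, keyCustomerID string) error {
	if e.CustomerID == "" || e.EventType == "" {
		return errors.New("customer_id and event_type are required")
	}
	switch e.Severity {
	case "info", "warning", "error":
	default:
		return fmt.Errorf("invalid severity %q", e.Severity)
	}
	if keyCustomerID != "" && keyCustomerID != e.CustomerID {
		return errors.New("API key does not match customer_id")
	}
	return nil
}

func main() {
	// "example_event" is a placeholder, not one of the real event types.
	raw := []byte(`{"customer_id":"cust1","event_type":"example_event","severity":"warning","message":"demo","source":"controller"}`)
	var e event
	if err := json.Unmarshal(raw, &e); err != nil {
		panic(err)
	}
	fmt.Println("own key:", validateEvent(e, "cust1"))
	fmt.Println("other customer's key:", validateEvent(e, "cust2"))
}
```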

Notifications

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |

Notifications are dispatched automatically when events are processed:

- Operator channel: English emails for warning/error events, 1h cooldown per customer:eventType
- Customer channel: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API
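The per-event cooldown can be pictured as a map from a customer:eventType key to the time of the last send. The sketch below keeps that map in memory for illustration; the type and method names are hypothetical, and the Hub may well derive this from its notification_log table instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownTracker remembers when each customer:eventType pair was last notified.
type cooldownTracker struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
}

func newCooldownTracker() *cooldownTracker {
	return &cooldownTracker{lastSent: make(map[string]time.Time)}
}

// allow reports whether a notification may be sent at time now and, if so,
// records now as the last send time. Denied attempts are not recorded.
func (c *cooldownTracker) allow(customerID, eventType string, cooldown time.Duration, now time.Time) bool {
	key := customerID + ":" + eventType
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastSent[key]; ok && now.Sub(last) < cooldown {
		return false
	}
	c.lastSent[key] = now
	return true
}

// sampleStart is an arbitrary reference time for the demo.
var sampleStart = time.Date(2026, time.February, 20, 12, 0, 0, 0, time.UTC)

func main() {
	c := newCooldownTracker()
	operatorCooldown := time.Hour // the 1h operator cooldown mentioned above
	fmt.Println(c.allow("cust1", "node_down", operatorCooldown, sampleStart))
	fmt.Println(c.allow("cust1", "node_down", operatorCooldown, sampleStart.Add(30*time.Minute)))
	fmt.Println(c.allow("cust1", "node_down", operatorCooldown, sampleStart.Add(time.Hour)))
}
```

Keying on both customer and event type means a flood of one event type does not silence a different alert for the same customer.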

Customer Config Retrieval

| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/api/v1/config/{customer_id}` | Download generated controller.yaml (auth: `X-Retrieval-Password` header) |

Config retrieval uses a separate per-customer retrieval password (not the API key). The Hub generates a complete controller.yaml by deep-merging controller.yaml.example (periodically fetched from the Gitea repo) with customer-specific overrides (identity, infrastructure tokens, hub API key, session secret).
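The deep-merge step can be illustrated on plain Go maps of the kind a YAML document is commonly decoded into. This is a generic recursive merge written for this sketch, not the Hub's actual implementation, and the keys in the example are invented.

```go
package main

import "fmt"

// deepMerge returns a new map holding base with override applied on top.
// When both sides have a nested map under the same key, the two are merged
// recursively; otherwise the override value replaces the base value.
// Nested maps that override does not touch are shared with base, not copied.
func deepMerge(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		if ov, ok := v.(map[string]any); ok {
			if bv, ok := out[k].(map[string]any); ok {
				out[k] = deepMerge(bv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}

func main() {
	// Stand-ins for the template and the customer overrides (invented keys).
	template := map[string]any{
		"hub":     map[string]any{"url": "https://hub.felhom.eu", "api_key": ""},
		"logging": "info",
	}
	overrides := map[string]any{
		"hub": map[string]any{"api_key": "per-customer-key"},
	}
	fmt.Println(deepMerge(template, overrides))
}
```

Merging key by key rather than replacing whole sections is what lets new defaults in the template reach existing customers without discarding their overrides.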

Health

| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
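A health endpoint with this behaviour might look like the following sketch. The `pinger` interface, the fake database, and the "ok" response body are assumptions made so the example runs without a real SQLite file; a `*sql.DB` from `database/sql` has a matching `Ping() error` method.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pinger is the single method the handler needs from the database handle.
type pinger interface{ Ping() error }

// healthzHandler answers 200 when the ping succeeds and 503 when it fails.
func healthzHandler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := db.Ping(); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// fakeDB stands in for the real database in this sketch.
type fakeDB struct{ err error }

func (f fakeDB) Ping() error { return f.err }

var errDown = errors.New("database down")

// probe runs the handler once against an in-memory recorder and returns
// the HTTP status code it produced.
func probe(db pinger) int {
	rec := httptest.NewRecorder()
	healthzHandler(db)(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println("healthy:", probe(fakeDB{}))
	fmt.Println("ping fails:", probe(fakeDB{err: errDown}))
}
```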

Web Dashboard

The dashboard is protected by a bcrypt-hashed password and a session cookie (7-day expiry).

Pages

- Dashboard (/): Fleet overview table showing all customers with live status and event count badges (errors and warnings in the last 24h). Config-only customers (no reports yet) appear as "PENDING" with a gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- Customers (/configs): Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers are shown grayed-out with a BLOCKED badge.
- Unified Customer Detail (/customers/{id}): Single page per customer combining config management and live monitoring. Adapts content based on available data:
  - Managed + reporting: Full view with config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
  - Managed + no reports yet: Config info, credentials, setup commands, "Waiting for first report" indicator
  - Manual (report-only): System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- Config Form (/configs/new, /configs/{id}/edit): Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. The legacy Monitoring UUIDs section is collapsed by default and carries a deprecation notice.

Customer States

| State | Dashboard | Customers List | Detail Page |
| --- | --- | --- | --- |
| Active + reporting | Shown with live status | MANAGED + status badge | Full unified view |
| Active + no reports | Shown as PENDING (gray) | MANAGED + no status | Config + "waiting for report" |
| Manual (report-only) | Shown with live status | MANUAL + status badge | Reports + "Create Config" button |
| Blocked | Hidden | Shown grayed-out, BLOCKED badge | Blocked banner + Unblock button |

Customer Actions

| Action | Description |
| --- | --- |
| Block/Unblock | Toggle blocked status: blocked customers are hidden from the dashboard and notifications are suppressed, but reports are still accepted and stored |
| Push Config | Generate YAML from Hub config and POST it to the controller's `/api/config/apply` endpoint (requires controller URL from reports) |
| Create Config | Auto-create a managed config from a manual customer's report data, then redirect to edit form |
| Trigger Update | Instruct controller to self-update to the latest version |
| Delete | Remove customer config (customer reappears as manual if reports continue) |

Status Logic

- OK (green): report < 30 min old, health = ok
- WARN (yellow): 30-60 min stale or health = warn
- DOWN (red): > 60 min stale or health = fail
- DISABLED (gray): controller monitoring paused
- PENDING (gray): config exists but no reports received yet
- BLOCKED (gray): customer blocked by operator
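The three report-driven statuses above can be written as a small function. This sketch assumes the worst condition wins when age and health disagree, and it leaves out DISABLED, PENDING and BLOCKED because they depend on customer state rather than on the report. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// reportStatus maps report age and reported health to OK, WARN or DOWN.
// It assumes health is one of "ok", "warn" or "fail"; any other value is
// treated like "ok" here, which is a simplification of this sketch.
func reportStatus(age time.Duration, health string) string {
	switch {
	case age > 60*time.Minute || health == "fail":
		return "DOWN"
	case age >= 30*time.Minute || health == "warn":
		return "WARN"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(reportStatus(10*time.Minute, "ok"))
	fmt.Println(reportStatus(45*time.Minute, "ok"))
	fmt.Println(reportStatus(10*time.Minute, "fail"))
	fmt.Println(reportStatus(2*time.Hour, "ok"))
}
```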

Data Storage

SQLite with WAL mode. Tables:

| Table | Purpose |
| --- | --- |
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |

Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.

Configuration

# hub.yaml
auth:
  password_hash: ""           # bcrypt hash for dashboard login (empty = no auth)

api:
  report_api_key: ""          # Bearer token for API auth

notifications:
  resend_api_key: ""          # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
  operator_email: ""          # Operator alert recipient
  operator_enabled: true      # Enable operator email notifications

retention:
  max_days: 90
  prune_schedule: "04:30"

alerting:
  stale_threshold: "30m"      # Customer considered stale after this duration

registry:
  image: "gitea.dooplex.hu/admin/felhom-controller"
  username: ""                # Gitea registry credentials
  token: ""
  check_interval: "30m"      # How often to check for new controller versions
  template_interval: "1h"    # How often to refresh controller.yaml.example

server:
  listen: ":8080"
  data_dir: "/data"           # SQLite database location

Deployment

Runs on k3s (Kubernetes) in the felhom-system namespace:

- PVC: 1GB Longhorn volume for SQLite database
- Resources: 64Mi-256Mi memory, 50m-500m CPU
- Ingress: hub.felhom.eu with TLS (cert-manager)
- Geo-restriction: Hungary only (nginx annotation)

# Build and push
cd hub/
make VERSION=0.3.0 docker docker-push

# Deploy
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.3.0
kubectl rollout status -n felhom-system deploy/hub

# Check
kubectl logs -n felhom-system -l app=hub --tail 20

Background Services

| Service | Schedule | Description |
| --- | --- | --- |
| Staleness checker | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| Backup deadline checker | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| Report/event prune | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| Registry version check | Every 30min | Checks Gitea registry for new controller image tags |
| Template refresh | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
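For the two daily jobs, the next run time in Budapest local time can be computed as in the sketch below. This is one possible approach, not the Hub's scheduler; the function names and the sample timestamp are chosen for the example. Building the next day with `time.Date` instead of adding 24 hours keeps the wall-clock time stable across daylight-saving changes.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the zone database so Europe/Budapest always resolves
)

// mustBudapest loads the Budapest time zone or panics.
func mustBudapest() *time.Location {
	loc, err := time.LoadLocation("Europe/Budapest")
	if err != nil {
		panic(err)
	}
	return loc
}

// nextDailyRun returns the first instant strictly after now that falls on
// hour:minute wall-clock time in loc.
func nextDailyRun(now time.Time, hour, minute int, loc *time.Location) time.Time {
	local := now.In(loc)
	next := time.Date(local.Year(), local.Month(), local.Day(), hour, minute, 0, 0, loc)
	if !next.After(local) {
		// Today's slot has passed; time.Date normalizes the day overflow.
		next = time.Date(local.Year(), local.Month(), local.Day()+1, hour, minute, 0, 0, loc)
	}
	return next
}

// sampleNow is an arbitrary reference instant for the demo.
var sampleNow = time.Date(2026, time.February, 20, 18, 53, 24, 0, time.UTC)

func main() {
	loc := mustBudapest()
	fmt.Println("next prune:", nextDailyRun(sampleNow, 4, 30, loc))
	fmt.Println("next backup deadline check:", nextDailyRun(sampleNow, 5, 0, loc))
}
```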

Dependencies

- golang.org/x/crypto: bcrypt for password hashing
- gopkg.in/yaml.v3: YAML config parsing
- modernc.org/sqlite: Pure Go SQLite (no CGo)