admin 11428659d1 hub v0.3.1: Config diff display + pull config
Replace broken SHA256 hash comparison with value-based YAML comparison.
Add "Show Diff" button showing per-key differences in a color-coded table.
Add "Pull Config" to import controller's current config into the Hub.
New endpoints: GET /customers/{id}/config-diff, POST /customers/{id}/pull-config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 19:26:53 +01:00


# felhom-hub
**Central operator dashboard for monitoring and managing Felhom customer deployments.**
A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. It also stores infrastructure backups for disaster recovery, runs event-based dead man's switch monitoring, and dispatches notifications.
**Current version: v0.3.1**
---
## Architecture
```
Customer nodes                           Central Hub (k3s)
┌───────────────────┐                    ┌──────────────────────────┐
│ felhom-controller │──── JSON push ────▶│ felhom-hub               │
│ (every 15 min)    │    (Bearer auth)   │                          │
│                   │                    │  ┌───────────────────┐   │
│ POST /api/v1/     │                    │  │ API Handler       │   │
│   report          │                    │  │ (ingest reports,  │   │
│   infra-backup    │◀── config push ────│  │  infra backups,   │   │
│   notify          │    (YAML body)     │  │  config push)     │   │
│                   │                    │  └─────────┬─────────┘   │
└───────────────────┘                    │            │             │
                                         │  ┌─────────▼─────────┐   │
Operator browser                         │  │ SQLite Store      │   │
┌───────────────────┐                    │  │ (reports,         │   │
│ Web Dashboard     │◀── HTML pages ─────│  │  infra_backups,   │   │
│ (hub.felhom.eu)   │    (bcrypt auth)   │  │  configs,         │   │
└───────────────────┘                    │  │  notifications)   │   │
                                         │  └───────────────────┘   │
                                         │                          │
                                         │  ┌───────────────────┐   │
                                         │  │ Web Dashboard     │   │
                                         │  │ (unified customer │   │
                                         │  │  management)      │   │
                                         │  └───────────────────┘   │
                                         └──────────────────────────┘
```
## API Endpoints
All API endpoints require `Authorization: Bearer <api_key>`, except `/healthz` (no auth) and `/api/v1/config/{customer_id}` (which uses a retrieval password instead). Bearer auth accepts both the global `report_api_key` and per-customer API keys, which are generated when a customer config is created.
### Report Ingest
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/report` | Controller pushes periodic status report |
| `GET` | `/api/v1/customers` | List all customers with latest report summary |
| `GET` | `/api/v1/customers/{id}` | Get latest full report for a customer |
| `GET` | `/api/v1/customers/{id}/history?period=7d` | Get report history |
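
A controller-side push to the report endpoint might look roughly like the sketch below. The helper name `newReportRequest` and the payload fields are hypothetical, since the report schema is not documented here; only the path and the Bearer header come from the tables above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newReportRequest builds a POST /api/v1/report request with Bearer auth.
// The payload shape is hypothetical; the real report schema is not shown here.
func newReportRequest(hubURL, apiKey string, payload map[string]any) (*http.Request, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, hubURL+"/api/v1/report", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Stand-in for the Hub that only checks the Authorization header.
	hub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer demo-key" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer hub.Close()

	req, err := newReportRequest(hub.URL, "demo-key", map[string]any{"customer_id": "demo"})
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```

The same request shape would apply to the other Bearer-protected endpoints; only the path and body change.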
### Infrastructure Backup (Disaster Recovery)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/infra-backup` | Controller pushes infrastructure snapshot |
| `GET` | `/api/v1/infra-backup/{customer_id}` | Fresh controller pulls backup for restore |

The infra-backup payload contains everything needed to restore a customer deployment:
- `controller.yaml` (base64, full config including secrets)
- `settings.json` (base64, backup preferences, storage paths)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths, display names)
- Restic passwords (primary + cross-drive, for encrypted backup access)

**Disaster recovery flow:**
1. Customer's system drive fails → replaced with fresh Debian install
2. `docker-setup.sh` deploys controller with Hub details (customer_id + API key)
3. Controller detects fresh deployment → calls `GET /api/v1/infra-backup/{customer_id}`
4. Controller uses disk UUIDs to auto-mount surviving drives
5. Controller restores apps from local backups on those drives
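
Step 3 of the flow above implies that a fresh controller has to unpack the base64-encoded files from the backup payload. The sketch below shows one way to do that. The JSON field names (`controller_yaml`, `settings_json`) and the function name are assumptions; the README does not specify the wire format.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// infraBackup mirrors part of the payload described above.
// The JSON field names are assumptions; the README does not specify them.
type infraBackup struct {
	CustomerID     string `json:"customer_id"`
	ControllerYAML string `json:"controller_yaml"` // base64-encoded controller.yaml
	SettingsJSON   string `json:"settings_json"`   // base64-encoded settings.json
}

// decodeInfraBackup parses the JSON body and base64-decodes both files.
func decodeInfraBackup(body []byte) (controllerYAML, settingsJSON []byte, err error) {
	var b infraBackup
	if err = json.Unmarshal(body, &b); err != nil {
		return nil, nil, fmt.Errorf("parse backup: %w", err)
	}
	if controllerYAML, err = base64.StdEncoding.DecodeString(b.ControllerYAML); err != nil {
		return nil, nil, fmt.Errorf("decode controller.yaml: %w", err)
	}
	if settingsJSON, err = base64.StdEncoding.DecodeString(b.SettingsJSON); err != nil {
		return nil, nil, fmt.Errorf("decode settings.json: %w", err)
	}
	return controllerYAML, settingsJSON, nil
}

func main() {
	// Build a sample payload the way a controller might have pushed it.
	body, err := json.Marshal(infraBackup{
		CustomerID:     "demo",
		ControllerYAML: base64.StdEncoding.EncodeToString([]byte("customer_id: demo\n")),
		SettingsJSON:   base64.StdEncoding.EncodeToString([]byte(`{"backup":true}`)),
	})
	if err != nil {
		panic(err)
	}
	cfg, settings, err := decodeInfraBackup(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s%s\n", cfg, settings)
}
```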
### Events
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |

Events are the primary monitoring mechanism. Each event has the fields `customer_id`, `event_type`, `severity`, `message`, `details_json`, and `source`. When a per-customer API key is used, it is validated against the `customer_id` in the payload. Events are stored in the `events` table and pruned automatically.

**Hub-generated events** (source="hub"):
- `node_stale` / `node_down` / `node_recovered` — dead man's switch from staleness checker (every 60s)
- `expected_backup_missed` / `expected_dbdump_missed` — backup deadline checker (daily at 05:00 Budapest)
### Notifications
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |

Notifications are dispatched automatically when events are processed:
- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API
### Customer Config Retrieval
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/config/{customer_id}` | Download generated controller.yaml (auth: `X-Retrieval-Password` header) |

Config retrieval uses a separate per-customer retrieval password, not the API key. The Hub generates a complete `controller.yaml` by deep-merging `controller.yaml.example` (periodically fetched from the Gitea repo) with customer-specific overrides (identity, infrastructure tokens, hub API key, session secret).
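
As an illustration of the deep-merge step, the sketch below merges two already-parsed YAML documents represented as `map[string]any`. It assumes override values win and nested maps are merged key by key; how the Hub actually treats lists or null values is not stated here. The key names in the demo are made up.

```go
package main

import "fmt"

// deepMerge returns a new map containing base with override applied on top.
// Nested maps are merged recursively; any other value in override replaces
// the value in base. Nested maps that only exist in base are shared, not copied.
func deepMerge(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base)+len(override))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		baseMap, baseIsMap := out[k].(map[string]any)
		overrideMap, overrideIsMap := v.(map[string]any)
		if baseIsMap && overrideIsMap {
			out[k] = deepMerge(baseMap, overrideMap)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical template and overrides, not the real controller.yaml layout.
	template := map[string]any{
		"hub":     map[string]any{"url": "https://hub.felhom.eu", "api_key": ""},
		"logging": map[string]any{"level": "info"},
	}
	overrides := map[string]any{
		"customer": map[string]any{"id": "demo"},
		"hub":      map[string]any{"api_key": "per-customer-key"},
	}
	merged := deepMerge(template, overrides)
	fmt.Println(merged["hub"])
	fmt.Println(merged["customer"])
}
```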
### Health
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
## Web Dashboard
Protected by bcrypt password + session cookie (7-day expiry).
### Pages
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Customers (`/configs`)** — Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
- **Unified Customer Detail (`/customers/{id}`)** — Single page per customer combining config management and live monitoring. Adapts content based on available data:
  - **Managed + reporting:** Full view with config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
  - **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
  - **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice
### Customer States
| State | Dashboard | Customers List | Detail Page |
|-------|-----------|----------------|-------------|
| **Active + reporting** | Shown with live status | MANAGED + status badge | Full unified view |
| **Active + no reports** | Shown as PENDING (gray) | MANAGED + no status | Config + "waiting for report" |
| **Manual (report-only)** | Shown with live status | MANUAL + status badge | Reports + "Create Config" button |
| **Blocked** | Hidden | Shown grayed-out, BLOCKED badge | Blocked banner + Unblock button |
### Customer Actions
| Action | Description |
|--------|-------------|
| **Block/Unblock** | Toggle blocked status — blocked customers are hidden from dashboard and notifications are suppressed, but reports are still accepted and stored |
| **Push Config** | Generate YAML from Hub config and POST it to the controller's `/api/config/apply` endpoint (requires controller URL from reports) |
| **Pull Config** | Import controller's current config into Hub — fetches live YAML via `GET /api/config`, extracts identity and override fields, updates Hub's stored config |
| **Show Diff** | Compare Hub-generated config with controller's live config — shows per-key differences in a color-coded table (value-based comparison, ignores key ordering and volatile fields) |
| **Create Config** | Auto-create a managed config from a manual customer's report data, then redirect to edit form |
| **Trigger Update** | Instruct controller to self-update to the latest version |
| **Delete** | Remove customer config (customer reappears as manual if reports continue) |
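
The value-based comparison behind **Show Diff** can be illustrated by flattening both configs into dotted keys and comparing the values per key. The sketch below works on already-parsed YAML (`map[string]any`). The function names, the dotted-key format, and the choice of which keys count as volatile are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// keyDiff is one row of the diff table; an empty side means the key is absent there.
type keyDiff struct {
	Key, Hub, Live string
}

// flatten turns nested maps into dotted keys with stringified values,
// so key ordering in the YAML source no longer matters.
func flatten(prefix string, m map[string]any, out map[string]string) {
	for k, v := range m {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if sub, ok := v.(map[string]any); ok {
			flatten(key, sub, out)
			continue
		}
		out[key] = fmt.Sprint(v)
	}
}

// diffConfigs compares two parsed configs by value and skips ignored keys.
func diffConfigs(hub, live map[string]any, ignore map[string]bool) []keyDiff {
	h, l := map[string]string{}, map[string]string{}
	flatten("", hub, h)
	flatten("", live, l)
	keys := map[string]bool{}
	for k := range h {
		keys[k] = true
	}
	for k := range l {
		keys[k] = true
	}
	var diffs []keyDiff
	for k := range keys {
		hv, inHub := h[k]
		lv, inLive := l[k]
		if ignore[k] || (inHub == inLive && hv == lv) {
			continue
		}
		diffs = append(diffs, keyDiff{Key: k, Hub: hv, Live: lv})
	}
	sort.Slice(diffs, func(i, j int) bool { return diffs[i].Key < diffs[j].Key })
	return diffs
}

func main() {
	// Hypothetical keys; session_secret is treated as volatile for this demo.
	hub := map[string]any{"customer": map[string]any{"id": "demo", "domain": "a.example"}, "session_secret": "x"}
	live := map[string]any{"session_secret": "y", "customer": map[string]any{"domain": "b.example", "id": "demo"}}
	for _, d := range diffConfigs(hub, live, map[string]bool{"session_secret": true}) {
		fmt.Printf("%s: hub=%q live=%q\n", d.Key, d.Hub, d.Live)
	}
}
```

Comparing parsed values rather than file hashes means that reordered keys or reformatted YAML do not show up as differences.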
### Status Logic
- **OK (green):** report < 30 min old, health = ok
- **WARN (yellow):** 30-60 min stale or health = warn
- **DOWN (red):** > 60 min stale or health = fail
- **DISABLED (gray):** controller monitoring paused
- **PENDING (gray):** config exists but no reports received yet
- **BLOCKED (gray):** customer blocked by operator
## Data Storage
SQLite with WAL mode. Tables:
| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |

Retention is configurable (default 90 days). Reports and events older than the retention period are pruned daily at 04:30 Budapest time.
## Configuration
```yaml
# hub.yaml
auth:
  password_hash: ""        # bcrypt hash for dashboard login (empty = no auth)
api:
  report_api_key: ""       # Bearer token for API auth
notifications:
  resend_api_key: ""       # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
  operator_email: ""       # Operator alert recipient
  operator_enabled: true   # Enable operator email notifications
retention:
  max_days: 90
  prune_schedule: "04:30"
alerting:
  stale_threshold: "30m"   # Customer considered stale after this duration
registry:
  image: "gitea.dooplex.hu/admin/felhom-controller"
  username: ""             # Gitea registry credentials
  token: ""
  check_interval: "30m"    # How often to check for new controller versions
  template_interval: "1h"  # How often to refresh controller.yaml.example
server:
  listen: ":8080"
  data_dir: "/data"        # SQLite database location
```
## Deployment
Runs on k3s (Kubernetes) in the `felhom-system` namespace:
- **PVC:** 1GB Longhorn volume for SQLite database
- **Resources:** 64Mi-256Mi memory, 50m-500m CPU
- **Ingress:** `hub.felhom.eu` with TLS (cert-manager)
- **Geo-restriction:** Hungary only (nginx annotation)
```bash
# Build and push
cd hub/
make VERSION=0.3.1 docker docker-push
# Deploy
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.3.1
kubectl rollout status -n felhom-system deploy/hub
# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```
## Background Services
| Service | Schedule | Description |
|---------|----------|-------------|
| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
## Dependencies
- `golang.org/x/crypto` — bcrypt for password hashing
- `gopkg.in/yaml.v3` — YAML config parsing
- `modernc.org/sqlite` — Pure Go SQLite (no CGo)