felhom.eu/hub/README.md
2026-02-27 09:23:18 +01:00

# felhom-hub
**Central operator dashboard for monitoring and managing Felhom customer deployments.**
A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. Also serves as the infrastructure backup store for disaster recovery, event-based dead man's switch monitoring, and notification dispatch.
**Current version: v0.6.3**
---
## Architecture
```
Customer nodes                            Central Hub (k3s)
┌───────────────────┐                     ┌──────────────────────────┐
│ felhom-controller │──── JSON push ─────▶│ felhom-hub               │
│ (every 15 min)    │    (Bearer auth)    │                          │
│                   │                     │  ┌────────────────────┐  │
│ POST /api/v1/     │                     │  │ API Handler        │  │
│   report          │                     │  │ (ingest reports,   │  │
│   infra-backup    │◀─── config push ────│  │  infra backups,    │  │
│   notify          │     (YAML body)     │  │  config push,      │  │
│                   │                     │  │  asset serving)    │  │
│ GET /api/v1/      │                     │  └─────────┬──────────┘  │
│   assets/*        │◀── asset download ──│            │             │
└───────────────────┘    (Bearer auth)    │  ┌─────────▼──────────┐  │
                                          │  │ SQLite Store       │  │
Operator browser                          │  │ (reports,          │  │
┌───────────────────┐                     │  │  assets,           │  │
│ Web Dashboard     │◀──── HTML pages ────│  │  infra_backups,    │  │
│ (hub.felhom.eu)   │    (bcrypt auth)    │  │  configs,          │  │
└───────────────────┘                     │  │  notifications)    │  │
                                          │  └────────────────────┘  │
                                          │                          │
                                          │  ┌────────────────────┐  │
                                          │  │ Asset Manager      │  │
                                          │  │ (PVC storage,      │  │
                                          │  │  SHA-256 manifest, │  │
                                          │  │  file serving)     │  │
                                          │  └────────────────────┘  │
                                          │                          │
                                          │  ┌────────────────────┐  │
                                          │  │ Web Dashboard      │  │
                                          │  │ (unified customer  │  │
                                          │  │  management)       │  │
                                          │  └────────────────────┘  │
                                          └──────────────────────────┘
```
## API Endpoints
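All API endpoints require `Authorization: Bearer <api_key>`, with two exceptions: `/healthz` needs no auth, and the retrieval endpoints `/api/v1/config/{customer_id}` and `/api/v1/recovery/{customer_id}` use the `X-Retrieval-Password` header instead. Bearer auth accepts both the global `report_api_key` and per-customer API keys (generated when creating customer configs).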
### Report Ingest
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/report` | Controller pushes periodic status report (controller v0.28.0+ includes the `app_telemetry` field) |
| `GET` | `/api/v1/customers` | List all customers with latest report summary |
| `GET` | `/api/v1/customers/{id}` | Get latest full report for a customer |
| `GET` | `/api/v1/customers/{id}/history?period=7d` | Get report history |
The `POST /api/v1/report` handler (Hub v0.4.0+) automatically parses the optional `app_telemetry` JSON array from the request body and stores it in the `app_telemetry` / `app_log_issues` tables. Older controllers that omit the `app_telemetry` key continue to work unchanged.
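A minimal sketch of how an optional array can be decoded so that older payloads still parse. The `report` struct is an assumption that models only two fields; the telemetry entries are kept as raw JSON because their schema is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report models only the fields needed for this sketch (assumed names).
type report struct {
	CustomerID   string            `json:"customer_id"`
	AppTelemetry []json.RawMessage `json:"app_telemetry"` // stays nil when the key is absent
}

// parseReport decodes a report body; a missing app_telemetry key is not an error.
func parseReport(body []byte) (report, error) {
	var r report
	if err := json.Unmarshal(body, &r); err != nil {
		return report{}, fmt.Errorf("decode report: %w", err)
	}
	return r, nil
}

func main() {
	oldBody := []byte(`{"customer_id":"example"}`)
	newBody := []byte(`{"customer_id":"example","app_telemetry":[{"app":"demo"}]}`)
	for _, body := range [][]byte{oldBody, newBody} {
		r, err := parseReport(body)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d telemetry entries\n", r.CustomerID, len(r.AppTelemetry))
	}
}
```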
### Infrastructure Backup (Disaster Recovery)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/infra-backup` | Controller pushes infrastructure snapshot |
| `GET` | `/api/v1/infra-backup/{customer_id}` | Fresh controller pulls backup for restore |
The infra-backup payload contains everything needed to restore a customer deployment:
- `controller.yaml` (base64, full config including secrets)
- `settings.json` (base64, backup preferences, storage paths)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths, display names)
- Restic passwords (primary + cross-drive, for encrypted backup access)
**Disaster recovery flow:**
1. Customer's system drive fails → replaced with fresh Debian install
2. `docker-setup.sh` deploys controller with minimal config (domain only)
3. Controller enters setup wizard → user chooses restore from local drive or Hub
4. For Hub restore: calls `GET /api/v1/recovery/{customer_id}` (gets config + infra backup)
5. Controller uses disk UUIDs to auto-mount surviving drives
6. Controller restores apps from local backups on those drives
### Recovery (Disaster Recovery)
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/recovery/{customer_id}` | Combined recovery: returns generated controller.yaml + infra backup in one response |
Auth: `X-Retrieval-Password` header (same per-customer password as config retrieval). Response:
```json
{
  "customer_id": "example",
  "config_yaml": "customer:\n id: example\n ...",
  "infra_backup": { ... },
  "has_infra_backup": true
}
```
If no infra backup exists yet, `infra_backup` is null and `has_infra_backup` is false.
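For illustration, a client-side call to the recovery endpoint might look like the sketch below. `fetchRecovery` and `recoveryResponse` are hypothetical names. `newStubHub` is a local stand-in server used only to make the example runnable; it does not reflect how the Hub implements the endpoint, and the 401 status for a wrong password is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// recoveryResponse mirrors the JSON fields shown above.
type recoveryResponse struct {
	CustomerID     string          `json:"customer_id"`
	ConfigYAML     string          `json:"config_yaml"`
	InfraBackup    json.RawMessage `json:"infra_backup"`
	HasInfraBackup bool            `json:"has_infra_backup"`
}

// fetchRecovery requests the combined recovery payload for one customer.
func fetchRecovery(client *http.Client, baseURL, customerID, password string) (*recoveryResponse, error) {
	endpoint := baseURL + "/api/v1/recovery/" + url.PathEscape(customerID)
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Retrieval-Password", password)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("recovery request failed: %s", resp.Status)
	}
	var out recoveryResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newStubHub is a stand-in for the Hub (assumption: 401 on a wrong password).
func newStubHub(password string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Retrieval-Password") != password {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(recoveryResponse{
			CustomerID: "example",
			ConfigYAML: "customer:\n  id: example\n",
		})
	}))
}

func main() {
	srv := newStubHub("example-password")
	defer srv.Close()
	rec, err := fetchRecovery(srv.Client(), srv.URL, "example", "example-password")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.CustomerID, rec.HasInfraBackup)
}
```

A caller would check `HasInfraBackup` before attempting to use `InfraBackup`, since the field is null for customers that have not pushed a snapshot yet.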
### Report Response
The `POST /api/v1/report` response includes `customer_blocked: true` when the customer's status is "blocked". Controllers use this flag to detect that they have been blocked and enter limited mode after a grace period.
### Events
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |
Events are the primary monitoring mechanism. Each event carries `customer_id`, `event_type`, `severity`, `message`, `details_json`, and `source`. Per-customer API keys are validated against the `customer_id` in the payload. Events are stored in the `events` table with automatic pruning.
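The payload checks above could be sketched as follows. The `event` struct and `validateEvent` are hypothetical. The list of 27 allowed event types is not reproduced in this README, so the sketch validates only severity and key ownership, and `example_event` is a placeholder type name.

```go
package main

import (
	"errors"
	"fmt"
)

// event carries the fields listed above (JSON names taken from the text).
type event struct {
	CustomerID  string `json:"customer_id"`
	EventType   string `json:"event_type"`
	Severity    string `json:"severity"`
	Message     string `json:"message"`
	DetailsJSON string `json:"details_json"`
	Source      string `json:"source"`
}

// validateEvent checks an incoming event. keyCustomerID is the customer bound
// to the presented API key, or "" when the global key was used.
func validateEvent(e event, keyCustomerID string) error {
	if e.CustomerID == "" || e.EventType == "" {
		return errors.New("customer_id and event_type are required")
	}
	switch e.Severity {
	case "info", "warning", "error":
		// allowed
	default:
		return fmt.Errorf("invalid severity %q", e.Severity)
	}
	if keyCustomerID != "" && keyCustomerID != e.CustomerID {
		return fmt.Errorf("API key for %q cannot post events for %q", keyCustomerID, e.CustomerID)
	}
	return nil
}

func main() {
	e := event{CustomerID: "example", EventType: "example_event", Severity: "warning", Source: "controller"}
	fmt.Println(validateEvent(e, "example")) // key belongs to the same customer
	fmt.Println(validateEvent(e, "other"))   // key belongs to a different customer
}
```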
**Hub-generated events** (source="hub"):
- `node_stale` / `node_down` / `node_recovered` — dead man's switch from staleness checker (every 60s)
- `expected_backup_missed` / `expected_dbdump_missed` — backup deadline checker (daily at 05:00 Budapest)
### Notifications
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |
Notifications are dispatched automatically when events are processed:
- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API
### Customer Config Retrieval
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/config/{customer_id}` | Download generated controller.yaml (auth: `X-Retrieval-Password` header) |
Config retrieval uses a separate per-customer retrieval password (not the API key). Retrieval passwords are auto-generated as **Hungarian word passphrases** (e.g., `alma-kerék-madár-felhő`) for easy phone-based entry during disaster recovery. The Hub generates a complete `controller.yaml` by deep-merging `controller.yaml.example` (periodically fetched from the Gitea repo) with customer-specific overrides (identity, infrastructure tokens, hub API key, session secret).
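The deep-merge step can be sketched on plain maps, as produced by decoding YAML into `map[string]any`. `deepMerge` is an illustrative function, not the code in `internal/configgen`, and apart from `customer.id` the keys in the example are hypothetical.

```go
package main

import "fmt"

// deepMerge returns a new map containing base with override applied on top.
// Nested maps are merged key by key; any other value in override replaces the
// base value. Untouched nested maps are shared with base, not copied.
func deepMerge(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		if ov, ok := v.(map[string]any); ok {
			if bv, ok := out[k].(map[string]any); ok {
				out[k] = deepMerge(bv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}

func main() {
	// Stand-in for the decoded controller.yaml.example template.
	template := map[string]any{
		"customer": map[string]any{"id": "", "name": ""},
		"hub":      map[string]any{"report_interval": "15m", "api_key": ""},
	}
	// Stand-in for customer-specific overrides.
	overrides := map[string]any{
		"customer": map[string]any{"id": "example"},
		"hub":      map[string]any{"api_key": "per-customer-key"},
	}
	merged := deepMerge(template, overrides)
	fmt.Println(merged["customer"], merged["hub"])
}
```

Template defaults that a customer does not override (here `report_interval`) survive the merge, which is what allows a refreshed template to add new keys without touching stored customer data.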
### Assets
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/assets/manifest` | JSON manifest of all assets with SHA-256 checksums |
| `GET` | `/api/v1/assets/file/{filename}` | Download a single asset file (logo, screenshot) |
Assets are stored on the Hub PVC at `<dataDir>/assets/`. On startup, the Hub seeds this directory from the Docker image (`/usr/share/felhom/assets-seed/`), updating any file whose SHA-256 checksum differs. The manifest includes filename, size, and SHA-256 hash for each file, which controllers use for efficient change detection.
**Asset types served:** `{slug}-logo.svg`, `{slug}-logo.png`, `{slug}-screenshot-{N}.webp`
The asset manager (`internal/assets/`) scans the assets directory on startup, builds an in-memory manifest, and serves files with appropriate Content-Type and cache headers. Both endpoints require Bearer token auth (global or per-customer API key).
### Health
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |
## Web Dashboard
Protected by bcrypt password + session cookie (7-day expiry).
### Authentication & Session Model (`internal/web/server.go`)
- Login generates a **cryptographically random 64-char hex session token** stored server-side in a `map[string]*hubSession` (+ `sync.RWMutex`). The old literal `hub_session=authenticated` cookie is gone.
- Each session also stores a **per-session CSRF token** (separate 64-char hex random value).
- Cookie attributes: `SameSite=Lax`, `Secure` (when TLS), `HttpOnly`, 7-day `Max-Age`.
- `RequireAuth` middleware validates the session token with `subtle.ConstantTimeCompare` and redirects to `/login` on failure.
- `CleanupSessions(ctx)` goroutine runs hourly to purge expired sessions.
### CSRF Protection (`internal/web/server.go`)
Synchronizer-token CSRF protection on all browser POST/DELETE/PATCH operations:
- CSRF validation block runs at the top of `ServeHTTP` before routing.
- Skipped when: no session cookie present (API/Basic-Auth path); or safe methods (GET/HEAD/OPTIONS).
- Token read from `_csrf` form field or `X-CSRF-Token` request header.
- On failure: JSON `{"ok":false,"error":"CSRF token missing or invalid"}` for `/api/` paths; HTTP 403 text otherwise.
- Template delivery: `csrfToken(r)` and `csrfField(r)` helpers inject `CSRFToken` and `CSRFField` into every render data struct via `configs.go`. Templates use `{{.CSRFField}}` in forms and `csrfHeaders()` JS helper for fetch calls.
### Pages
- **Dashboard (`/`)** — Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Customers (`/configs`)** — Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
- **Fleet App Analytics (`/apps`)** — Fleet-wide app telemetry overview (v0.4.0+). Shows all deployed apps across all customers with deployment count, avg/P95 memory, catalog estimate/limit accuracy indicators, and 24h error/warning badge counts. Sortable columns (deployments/memory/errors), 24h/7d/30d time period selector.
- **App Detail (`/apps/{name}`)** — Per-app drill-down page with Chart.js memory trend (avg + peak lines, catalog limit dashed line), per-customer breakdown table, and known log issues table (severity, message, occurrence count, affected customers, first/last seen). Shows suggested mem_limit from P95×1.2 rounded to 32 MB.
- **Unified Customer Detail (`/customers/{id}`)** — Single page per customer combining config management and live monitoring. Auto-refresh toggle (localStorage-persisted, enabled by default) replaces the previous hardcoded 60s meta-refresh. Adapts content based on available data:
  - **Managed + reporting:** Full view with config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
  - **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
  - **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- **Config Form (`/configs/new`, `/configs/{id}/edit`)** — Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice. CF API token requires **Zone DNS:Edit** (ACME) and **Zone WAF:Edit** (geo-restriction) permissions.
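As a worked example of the suggested `mem_limit` shown on the App Detail page (P95 × 1.2, rounded to 32 MB), the arithmetic might be sketched like this. Rounding up to the next multiple of 32 is an assumption; the text does not say whether the Hub rounds up or to the nearest step. `suggestMemLimitMB` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// suggestMemLimitMB applies 20% headroom to the P95 memory usage and rounds
// the result up to a multiple of 32 MB (rounding direction is assumed).
func suggestMemLimitMB(p95MB float64) int {
	const step = 32.0
	target := p95MB * 1.2
	return int(math.Ceil(target/step)) * int(step)
}

func main() {
	// 100 MB P95 -> 120 MB with headroom -> 128 MB after rounding.
	fmt.Println(suggestMemLimitMB(100))
	// 200 MB P95 -> 240 MB with headroom -> 256 MB after rounding.
	fmt.Println(suggestMemLimitMB(200))
}
```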
### Customer States
| State | Dashboard | Customers List | Detail Page |
|-------|-----------|----------------|-------------|
| **Active + reporting** | Shown with live status | MANAGED + status badge | Full unified view |
| **Active + no reports** | Shown as PENDING (gray) | MANAGED + no status | Config + "waiting for report" |
| **Manual (report-only)** | Shown with live status | MANUAL + status badge | Reports + "Create Config" button |
| **Blocked** | Hidden | Shown grayed-out, BLOCKED badge | Blocked banner + Unblock button |
### Customer Actions
| Action | Description |
|--------|-------------|
| **Block/Unblock** | Toggle blocked status — blocked customers are hidden from dashboard and notifications are suppressed, but reports are still accepted and stored |
| **Push Config** | Generate YAML from Hub config and POST it to the controller's `/api/config/apply` endpoint (requires controller URL from reports) |
| **Pull Config** | Import controller's current config into Hub — fetches live YAML via `GET /api/config`, extracts identity and override fields, updates Hub's stored config |
| **Show Diff** | Compare Hub-generated config with controller's live config — shows per-key differences in a color-coded table (value-based comparison, ignores key ordering and volatile fields) |
| **Create Config** | Auto-create a managed config from a manual customer's report data, then redirect to edit form |
| **Trigger Update** | Instruct controller to self-update to the latest version |
| **Delete** | Remove customer config (customer reappears as manual if reports continue) |
### Status Logic
- **OK (green):** report < 30 min old, health = ok
- **WARN (yellow):** 30-60 min stale or health = warn
- **DOWN (red):** > 60 min stale or health = fail
- **DISABLED (gray):** controller monitoring paused
- **PENDING (gray):** config exists but no reports received yet
- **BLOCKED (gray):** customer blocked by operator
## Data Storage
SQLite with WAL mode. Tables:
| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |
Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.
### PVC Asset Storage
App assets (logos, screenshots, branding) are stored on the PVC at `<dataDir>/assets/`. On every startup, the Hub compares SHA-256 checksums between the image seed (`/usr/share/felhom/assets-seed/`) and the PVC, updating any changed files. This means redeploying the Hub image with updated assets automatically propagates changes without PVC deletion.
A manual "Refresh Assets from Image" button is available on the **Configuration** page (`/configuration`) for triggering a re-seed + manifest rebuild on demand.
## Configuration
```yaml
# hub.yaml
auth:
  password_hash: ""          # bcrypt hash for dashboard login (empty = no auth)
api:
  report_api_key: ""         # Bearer token for API auth
notifications:
  resend_api_key: ""         # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
  operator_email: ""         # Operator alert recipient
  operator_enabled: true     # Enable operator email notifications
retention:
  max_days: 90
  prune_schedule: "04:30"
alerting:
  stale_threshold: "30m"     # Customer considered stale after this duration
registry:
  image: "gitea.dooplex.hu/admin/felhom-controller"
  username: ""               # Gitea registry credentials
  token: ""
  check_interval: "30m"      # How often to check for new controller versions
  template_interval: "1h"    # How often to refresh controller.yaml.example
server:
  listen: ":8080"
  data_dir: "/data"          # SQLite database location
```
## Deployment
Runs on k3s (Kubernetes) in the `felhom-system` namespace:
- **PVC:** 1GB Longhorn volume for SQLite database + app assets
- **Resources:** 64Mi-256Mi memory, 50m-500m CPU
- **Ingress:** `hub.felhom.eu` with TLS (cert-manager)
- **Geo-restriction:** Hungary only (nginx annotation)
```bash
# Build and push (on 192.168.0.180)
cd ~/build/felhom-hub
./build.sh v0.6.3 --push
# Build script auto-syncs app assets from website/assets/ into the image
# Deploy (ArgoCD managed — update manifests/hub.yaml image tag, commit+push)
git pull && kubectl apply -f manifests/hub.yaml
# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```
**Note:** `kubectl set image` alone does NOT persist — ArgoCD reverts it. Always update `manifests/hub.yaml` and apply.
The Dockerfile includes `COPY assets/ /usr/share/felhom/assets-seed/` which bakes app assets into the image as a seed for the PVC. The build script copies `*-logo.svg`, `*-logo.png`, and `*-screenshot-*.webp` from the website repo's `assets/` directory.
## Background Services
| Service | Schedule | Description |
|---------|----------|-------------|
| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
| **Asset seeding** | On startup | Compares SHA-256 checksums and updates changed assets from Docker image seed |
## Internal Packages
| Package | Purpose |
|---------|---------|
| `internal/api` | REST API handler (report ingest, config, events, assets, notifications) |
| `internal/web` | Web dashboard (session auth, customer management, fleet overview) |
| `internal/assets` | PVC asset manager (manifest generation, SHA-256 checksums, file serving, image seed) |
| `internal/configgen` | Shared YAML config generation (deep-merge template + customer overrides) |
## Dependencies
- `golang.org/x/crypto` — bcrypt for password hashing
- `gopkg.in/yaml.v3` — YAML config parsing
- `modernc.org/sqlite` — Pure Go SQLite (no CGo)