# felhom-hub

**Central operator dashboard for monitoring and managing Felhom customer deployments.**

A lightweight Go service that receives periodic reports and structured events from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. It also stores infrastructure backups for disaster recovery, runs event-based dead man's switch monitoring, and dispatches notifications.

**Current version: v0.5.0**

---

## Architecture

```
Customer nodes                         Central Hub (k3s)
┌─────────────────┐                    ┌────────────────────────┐
│felhom-controller│──── JSON push ────▶│  felhom-hub            │
│ (every 15 min)  │    (Bearer auth)   │                        │
│                 │                    │  ┌─────────────────┐   │
│ POST /api/v1/   │                    │  │ API Handler     │   │
│   report        │                    │  │ (ingest reports,│   │
│   infra-backup  │◀── config push ────│  │  infra backups, │   │
│   notify        │    (YAML body)     │  │  config push,   │   │
│                 │                    │  │  asset serving) │   │
│ GET /api/v1/    │                    │  └────────┬────────┘   │
│   assets/*      │◀── asset download ─│           │            │
└─────────────────┘    (Bearer auth)   │  ┌────────▼────────┐   │
                                       │  │ SQLite Store    │   │
Operator browser                       │  │ (reports,       │   │
┌─────────────────┐                    │  │  assets,        │   │
│  Web Dashboard  │◀── HTML pages ─────│  │  infra_backups, │   │
│ (hub.felhom.eu) │    (bcrypt auth)   │  │  configs,       │   │
└─────────────────┘                    │  │  notifications) │   │
                                       │  └─────────────────┘   │
                                       │                        │
                                       │  ┌─────────────────┐   │
                                       │  │ Asset Manager   │   │
                                       │  │(PVC storage,    │   │
                                       │  │ SHA-256 manifest│   │
                                       │  │ file serving)   │   │
                                       │  └─────────────────┘   │
                                       │                        │
                                       │  ┌─────────────────┐   │
                                       │  │ Web Dashboard   │   │
                                       │  │(unified customer│   │
                                       │  │ management)     │   │
                                       │  └─────────────────┘   │
                                       └────────────────────────┘
```

## API Endpoints

All API endpoints require `Authorization: Bearer <api_key>`, with three exceptions: `/healthz` needs no auth, and `/api/v1/config/{customer_id}` and `/api/v1/recovery/{customer_id}` authenticate with the `X-Retrieval-Password` header instead. Bearer auth accepts both the global `report_api_key` and per-customer API keys (generated when creating customer configs).

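As a rough illustration of the Bearer scheme, the following Go sketch builds a report request the way a controller might. The helper name `newReportRequest`, the placeholder key, and the one-field JSON body are assumptions for the example; the real report schema is not shown in this README.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// newReportRequest builds a POST /api/v1/report request carrying the Bearer
// token. The body is treated as opaque JSON bytes (hypothetical helper).
func newReportRequest(hubURL, apiKey string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, hubURL+"/api/v1/report", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The key and body below are placeholders, not real values.
	req, err := newReportRequest("https://hub.felhom.eu", "example-key", []byte(`{"customer_id":"example"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Authorization"))
}
```

The same header works with either the global key or a per-customer key; the Hub decides which one matched.
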
### Report Ingest

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/report` | Controller pushes periodic status report (controller v0.28.0+ includes the `app_telemetry` field) |
| `GET` | `/api/v1/customers` | List all customers with latest report summary |
| `GET` | `/api/v1/customers/{id}` | Get latest full report for a customer |
| `GET` | `/api/v1/customers/{id}/history?period=7d` | Get report history |

The `POST /api/v1/report` handler (Hub v0.4.0+) parses the optional `app_telemetry` JSON array from the request body and stores it in the `app_telemetry` and `app_log_issues` tables. Older controllers that send no `app_telemetry` key continue to work unchanged.

### Infrastructure Backup (Disaster Recovery)

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/infra-backup` | Controller pushes infrastructure snapshot |
| `GET` | `/api/v1/infra-backup/{customer_id}` | Fresh controller pulls backup for restore |

The infra-backup payload contains everything needed to restore a customer deployment:
- `controller.yaml` (base64, full config including secrets)
- `settings.json` (base64, backup preferences, storage paths)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths, display names)
- Restic passwords (primary + cross-drive, for encrypted backup access)

**Disaster recovery flow:**
1. Customer's system drive fails → replaced with fresh Debian install
2. `docker-setup.sh` deploys controller with minimal config (domain only)
3. Controller enters setup wizard → user chooses restore from local drive or Hub
4. For a Hub restore, the controller calls `GET /api/v1/recovery/{customer_id}` (returns config + infra backup)
5. Controller uses disk UUIDs to auto-mount surviving drives
6. Controller restores apps from local backups on those drives

### Recovery (Disaster Recovery)

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/recovery/{customer_id}` | Combined recovery: returns generated controller.yaml + infra backup in one response |

Auth: `X-Retrieval-Password` header (same per-customer password as config retrieval). Response:
```json
{
  "customer_id": "example",
  "config_yaml": "customer:\n id: example\n ...",
  "infra_backup": { ... },
  "has_infra_backup": true
}
```
If no infra backup exists yet, `infra_backup` is null and `has_infra_backup` is false.

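A controller-side decoder for this response could look roughly like the Go sketch below. The type and function names are hypothetical, and `infra_backup` is kept as raw JSON because its inner schema is not documented here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// recoveryResponse mirrors the four fields shown in the example response.
type recoveryResponse struct {
	CustomerID     string          `json:"customer_id"`
	ConfigYAML     string          `json:"config_yaml"`
	InfraBackup    json.RawMessage `json:"infra_backup"` // schema not shown in this README
	HasInfraBackup bool            `json:"has_infra_backup"`
}

// parseRecovery decodes a recovery response body.
func parseRecovery(data []byte) (recoveryResponse, error) {
	var r recoveryResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	body := []byte(`{"customer_id":"example","config_yaml":"customer:\n id: example\n","infra_backup":null,"has_infra_backup":false}`)
	r, err := parseRecovery(body)
	if err != nil {
		panic(err)
	}
	// Branch on the explicit flag rather than on the raw infra_backup bytes.
	if !r.HasInfraBackup {
		fmt.Println("no infra backup yet for", r.CustomerID, "- restoring config only")
	}
}
```

Checking `has_infra_backup` instead of inspecting `infra_backup` keeps the caller independent of how a JSON `null` is represented after decoding.
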
### Report Response

The `POST /api/v1/report` response includes `customer_blocked: true` when the customer's status is "blocked". Controllers use this flag to detect that they are blocked and enter limited mode after a grace period.

### Events

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/event` | Controller pushes structured event (27 allowed types, severity: info/warning/error) |

Events are the primary monitoring mechanism. Each event has the fields `customer_id`, `event_type`, `severity`, `message`, `details_json`, and `source`. Per-customer API keys are validated against the `customer_id` in the payload. Events are stored in the `events` table with automatic pruning.

**Hub-generated events** (source="hub"):
- `node_stale` / `node_down` / `node_recovered`: dead man's switch from staleness checker (every 60s)
- `expected_backup_missed` / `expected_dbdump_missed`: backup deadline checker (daily at 05:00 Budapest)

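The event shape can be sketched in Go as follows. This is an illustration under assumptions: the JSON keys are taken to match the field names listed above, and only the severity is validated because the 27 allowed event types are not enumerated in this README.

```go
package main

import (
	"errors"
	"fmt"
)

// event carries the six documented fields; JSON keys are assumed to match.
type event struct {
	CustomerID  string `json:"customer_id"`
	EventType   string `json:"event_type"`
	Severity    string `json:"severity"`
	Message     string `json:"message"`
	DetailsJSON string `json:"details_json"`
	Source      string `json:"source"`
}

// validateEvent checks the required identifiers and the severity value.
// The allow-list of event types is deliberately left out (not documented here).
func validateEvent(e event) error {
	if e.CustomerID == "" || e.EventType == "" {
		return errors.New("customer_id and event_type are required")
	}
	switch e.Severity {
	case "info", "warning", "error":
		return nil
	default:
		return fmt.Errorf("unknown severity %q", e.Severity)
	}
}

func main() {
	e := event{CustomerID: "example", EventType: "node_stale", Severity: "warning", Source: "hub"}
	fmt.Println(validateEvent(e))
}
```
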
### Notifications

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Legacy notification relay (kept for backward compatibility) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences (email, enabled_events, cooldown_hours) |

Notifications are dispatched automatically when events are processed:
- **Operator channel**: English emails for warning/error events, 1h cooldown per customer:eventType
- **Customer channel**: Hungarian emails per event type, respects customer preferences and cooldown (default 6h)
- Email delivery via Resend.com API

### Customer Config Retrieval

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/config/{customer_id}` | Download generated controller.yaml (auth: `X-Retrieval-Password` header) |

Config retrieval uses a separate per-customer retrieval password (not the API key). Retrieval passwords are auto-generated as **Hungarian word passphrases** (e.g., `alma-kerék-madár-felhő`) for easy phone-based entry during disaster recovery. The Hub generates a complete `controller.yaml` by deep-merging `controller.yaml.example` (periodically fetched from the Gitea repo) with customer-specific overrides (identity, infrastructure tokens, hub API key, session secret).

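The deep-merge step can be sketched on already-parsed maps. This Go example is a generic recursive merge, not the code in `internal/configgen`; apart from `customer.id`, the keys in `main` are made up for the illustration.

```go
package main

import "fmt"

// deepMerge returns a new map in which override values win over base values.
// Nested maps present on both sides are merged recursively; anything else in
// override replaces the base value. The inputs are not modified.
func deepMerge(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		if ov, ok := v.(map[string]any); ok {
			if bv, ok := out[k].(map[string]any); ok {
				out[k] = deepMerge(bv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical template and overrides.
	template := map[string]any{
		"customer": map[string]any{"id": "", "domain": "example.invalid"},
		"hub":      map[string]any{"api_key": ""},
	}
	overrides := map[string]any{
		"customer": map[string]any{"id": "example"},
	}
	merged := deepMerge(template, overrides)
	fmt.Println(merged["customer"])
}
```

Template keys that the overrides do not mention survive the merge, which is what lets new defaults in `controller.yaml.example` reach every generated config.
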
### Assets

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/api/v1/assets/manifest` | JSON manifest of all assets with SHA-256 checksums |
| `GET` | `/api/v1/assets/file/{filename}` | Download a single asset file (logo, screenshot) |

Assets are stored on the Hub PVC at `<dataDir>/assets/`. On every startup, they are seeded or updated from the Docker image (`/usr/share/felhom/assets-seed/`) by SHA-256 comparison. The manifest includes filename, size, and SHA-256 hash for each file, which controllers use for efficient change detection.

**Asset types served:** `{slug}-logo.svg`, `{slug}-logo.png`, `{slug}-screenshot-{N}.webp`

The asset manager (`internal/assets/`) scans the assets directory on startup, builds an in-memory manifest, and serves files with appropriate Content-Type and cache headers. Both endpoints require Bearer token auth (global or per-customer API key).

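On the controller side, change detection against the manifest could work roughly as in the Go sketch below. The entry type, its JSON keys, and the helper names are assumptions; the README only states that each entry carries a filename, a size, and a SHA-256 hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// assetEntry is one manifest row (JSON keys are assumed, not confirmed).
type assetEntry struct {
	Filename string `json:"filename"`
	Size     int64  `json:"size"`
	SHA256   string `json:"sha256"`
}

// hashBytes returns the lowercase hex SHA-256 of b.
func hashBytes(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// staleAssets returns, sorted, the manifest filenames whose hash is missing
// from or different in the local cache (filename -> hex SHA-256).
func staleAssets(manifest []assetEntry, local map[string]string) []string {
	var out []string
	for _, e := range manifest {
		if local[e.Filename] != e.SHA256 {
			out = append(out, e.Filename)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	logo := []byte("<svg/>")
	manifest := []assetEntry{
		{Filename: "demo-logo.svg", Size: int64(len(logo)), SHA256: hashBytes(logo)},
		{Filename: "demo-screenshot-1.webp", Size: 3, SHA256: hashBytes([]byte("new"))},
	}
	local := map[string]string{
		"demo-logo.svg":          hashBytes(logo),
		"demo-screenshot-1.webp": hashBytes([]byte("old")),
	}
	// Only files listed here would need GET /api/v1/assets/file/{filename}.
	fmt.Println(staleAssets(manifest, local))
}
```
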
### Health

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required, returns 503 if SQLite ping fails) |

## Web Dashboard

Protected by bcrypt password + session cookie (7-day expiry).

### Authentication & Session Model (`internal/web/server.go`)

- Login generates a **cryptographically random 64-char hex session token** stored server-side in a `map[string]*hubSession` (+ `sync.RWMutex`). This replaces the old literal `hub_session=authenticated` cookie.
- Each session also stores a **per-session CSRF token** (separate 64-char hex random value).
- Cookie attributes: `SameSite=Lax`, `Secure` (when TLS), `HttpOnly`, 7-day `Max-Age`.
- `RequireAuth` middleware validates the session token with `subtle.ConstantTimeCompare` and redirects to `/login` on failure.
- `CleanupSessions(ctx)` goroutine runs hourly to purge expired sessions.

### CSRF Protection (`internal/web/server.go`)

Synchronizer-token CSRF protection on all browser POST/DELETE/PATCH operations:

- CSRF validation block runs at the top of `ServeHTTP` before routing.
- Skipped when: no session cookie present (API/Basic-Auth path); or safe methods (GET/HEAD/OPTIONS).
- Token read from `_csrf` form field or `X-CSRF-Token` request header.
- On failure: JSON `{"ok":false,"error":"CSRF token missing or invalid"}` for `/api/` paths; HTTP 403 text otherwise.
- Template delivery: `csrfToken(r)` and `csrfField(r)` helpers inject `CSRFToken` and `CSRFField` into every render data struct via `configs.go`. Templates use `{{.CSRFField}}` in forms and `csrfHeaders()` JS helper for fetch calls.

### Pages

- **Dashboard (`/`)**: Fleet overview table showing all customers with live status and event count badges (error+warning in last 24h). Config-only customers (no reports yet) appear as "PENDING" with gray badge. Blocked customers are hidden. Auto-refreshes every 60 seconds.
- **Customers (`/configs`)**: Customer management list. Shows all customers (both managed and manual), their status, controller version, and config type (MANAGED/MANUAL). Blocked customers shown grayed-out with BLOCKED badge.
- **Fleet App Analytics (`/apps`)**: Fleet-wide app telemetry overview (v0.4.0+). Shows all deployed apps across all customers with deployment count, avg/P95 memory, catalog estimate/limit accuracy indicators, and 24h error/warning badge counts. Sortable columns (deployments/memory/errors), 24h/7d/30d time period selector.
- **App Detail (`/apps/{name}`)**: Per-app drill-down page with Chart.js memory trend (avg + peak lines, catalog limit dashed line), per-customer breakdown table, and known log issues table (severity, message, occurrence count, affected customers, first/last seen). Shows suggested mem_limit from P95×1.2 rounded to 32 MB.
- **Unified Customer Detail (`/customers/{id}`)**: Single page per customer combining config management and live monitoring. Adapts content based on available data:
  - **Managed + reporting:** Full view with config info, system metrics, storage, containers, backup status, events timeline (last 50, severity filter), credentials, setup commands, YAML preview, controller update, notifications (with channel column), history
  - **Managed + no reports yet:** Config info, credentials, setup commands, "Waiting for first report" indicator
  - **Manual (report-only):** System metrics, storage, containers, backup, with "Create Config" button to convert to managed
- **Config Form (`/configs/new`, `/configs/{id}/edit`)**: Create/edit customer configurations with identity, infrastructure tokens, and monitoring overrides. Legacy Monitoring UUIDs section collapsed by default with deprecation notice.

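The suggested `mem_limit` on the App Detail page is described as P95×1.2 rounded to 32 MB. A worked sketch of that arithmetic in Go follows; it assumes rounding *up* to the next multiple of 32 MB, which the README does not state explicitly, and the function name is hypothetical.

```go
package main

import "fmt"

// suggestMemLimitMB adds 20% headroom to the P95 memory (in MB) and rounds
// the result up to the next multiple of 32 MB. Rounding up is an assumption.
func suggestMemLimitMB(p95MB int) int {
	const step = 32
	withHeadroom := (p95MB*6 + 4) / 5 // ceil(p95MB * 1.2) in integer math
	return (withHeadroom + step - 1) / step * step
}

func main() {
	// P95 of 100 MB -> 120 MB with headroom -> next multiple of 32 is 128 MB.
	fmt.Println(suggestMemLimitMB(100))
}
```

Integer arithmetic avoids floating-point edge cases exactly at a 32 MB boundary, such as a P95 of 160 MB that should land on 192 MB rather than 224 MB.
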
### Customer States

| State | Dashboard | Customers List | Detail Page |
|-------|-----------|----------------|-------------|
| **Active + reporting** | Shown with live status | MANAGED + status badge | Full unified view |
| **Active + no reports** | Shown as PENDING (gray) | MANAGED + no status | Config + "waiting for report" |
| **Manual (report-only)** | Shown with live status | MANUAL + status badge | Reports + "Create Config" button |
| **Blocked** | Hidden | Shown grayed-out, BLOCKED badge | Blocked banner + Unblock button |

### Customer Actions

| Action | Description |
|--------|-------------|
| **Block/Unblock** | Toggle blocked status. Blocked customers are hidden from the dashboard and their notifications are suppressed, but reports are still accepted and stored |
| **Push Config** | Generate YAML from Hub config and POST it to the controller's `/api/config/apply` endpoint (requires controller URL from reports) |
| **Pull Config** | Import controller's current config into Hub: fetches live YAML via `GET /api/config`, extracts identity and override fields, updates Hub's stored config |
| **Show Diff** | Compare Hub-generated config with controller's live config. Shows per-key differences in a color-coded table (value-based comparison, ignores key ordering and volatile fields) |
| **Create Config** | Auto-create a managed config from a manual customer's report data, then redirect to edit form |
| **Trigger Update** | Instruct controller to self-update to the latest version |
| **Delete** | Remove customer config (customer reappears as manual if reports continue) |

### Status Logic

- **OK (green):** report < 30 min old, health = ok
- **WARN (yellow):** 30-60 min stale or health = warn
- **DOWN (red):** > 60 min stale or health = fail
- **DISABLED (gray):** controller monitoring paused
- **PENDING (gray):** config exists but no reports received yet
- **BLOCKED (gray):** customer blocked by operator

## Data Storage

SQLite with WAL mode. Tables:

| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `events` | Structured events from controllers and Hub (type, severity, message, details, source) |
| `app_telemetry`, `app_log_issues` | Per-app telemetry and log issues parsed from the `app_telemetry` array in reports (Hub v0.4.0+) |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email, enabled event types, cooldown hours per customer |
| `notification_log` | Send/skip/fail history for notifications with channel (operator/customer) |
| `customer_configs` | Pre-configured customer settings, retrieval passwords, per-customer API keys, status (active/blocked) |

Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.

### PVC Asset Storage

App assets (logos, screenshots, branding) are stored on the PVC at `<dataDir>/assets/`. On every startup, the Hub compares SHA-256 checksums between the image seed (`/usr/share/felhom/assets-seed/`) and the PVC, updating any changed files. This means redeploying the Hub image with updated assets automatically propagates changes without PVC deletion.

A manual "Refresh Assets from Image" button is available on the **Configuration** page (`/configuration`) for triggering a re-seed + manifest rebuild on demand.

## Configuration

```yaml
# hub.yaml
auth:
  password_hash: ""  # bcrypt hash for dashboard login (empty = no auth)

api:
  report_api_key: ""  # Bearer token for API auth

notifications:
  resend_api_key: ""  # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
  operator_email: ""  # Operator alert recipient
  operator_enabled: true  # Enable operator email notifications

retention:
  max_days: 90
  prune_schedule: "04:30"

alerting:
  stale_threshold: "30m"  # Customer considered stale after this duration

registry:
  image: "gitea.dooplex.hu/admin/felhom-controller"
  username: ""  # Gitea registry credentials
  token: ""
  check_interval: "30m"  # How often to check for new controller versions
  template_interval: "1h"  # How often to refresh controller.yaml.example

server:
  listen: ":8080"
  data_dir: "/data"  # SQLite database location
```

## Deployment

Runs on k3s (Kubernetes) in the `felhom-system` namespace:
- **PVC:** 1GB Longhorn volume for SQLite database + app assets
- **Resources:** 64Mi-256Mi memory, 50m-500m CPU
- **Ingress:** `hub.felhom.eu` with TLS (cert-manager)
- **Geo-restriction:** Hungary only (nginx annotation)

```bash
# Build and push (on 192.168.0.180)
cd ~/build/felhom-hub
./build.sh v0.3.8 --push
# Build script auto-syncs app assets from website/assets/ into the image

# Deploy (ArgoCD managed: update manifests/hub.yaml image tag, commit+push)
git pull && kubectl apply -f manifests/hub.yaml

# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```

**Note:** `kubectl set image` alone does NOT persist because ArgoCD reverts it. Always update `manifests/hub.yaml` and apply.

The Dockerfile includes `COPY assets/ /usr/share/felhom/assets-seed/` which bakes app assets into the image as a seed for the PVC. The build script copies `*-logo.svg`, `*-logo.png`, and `*-screenshot-*.webp` from the website repo's `assets/` directory.

## Background Services

| Service | Schedule | Description |
|---------|----------|-------------|
| **Staleness checker** | Every 60s | Detects controllers that stopped reporting. Generates `node_stale` (>30min), `node_down` (>60min), `node_recovered` events |
| **Backup deadline checker** | Daily 05:00 Budapest | Detects missing backup/db-dump events since midnight. Generates `expected_backup_missed`, `expected_dbdump_missed` events |
| **Report/event prune** | Daily 04:30 Budapest | Deletes reports and events older than retention period (default 90 days) |
| **Registry version check** | Every 30min | Checks Gitea registry for new controller image tags |
| **Template refresh** | Every 1h | Fetches latest `controller.yaml.example` from Gitea |
| **Asset seeding** | On startup | Compares SHA-256 checksums and updates changed assets from Docker image seed |

## Internal Packages

| Package | Purpose |
|---------|---------|
| `internal/api` | REST API handler (report ingest, config, events, assets, notifications) |
| `internal/web` | Web dashboard (session auth, customer management, fleet overview) |
| `internal/assets` | PVC asset manager (manifest generation, SHA-256 checksums, file serving, image seed) |
| `internal/configgen` | Shared YAML config generation (deep-merge template + customer overrides) |

## Dependencies

- `golang.org/x/crypto`: bcrypt for password hashing
- `gopkg.in/yaml.v3`: YAML config parsing
- `modernc.org/sqlite`: Pure Go SQLite (no CGo)