felhom.eu/hub/README.md
admin 41e313bf36 hub v0.1.7: Infrastructure backup endpoints for disaster recovery
Add infra-backup push/pull API for controller DR:
- POST /api/v1/infra-backup — controller pushes infrastructure snapshot
- GET /api/v1/infra-backup/{customer_id} — fresh controller pulls backup
- infra_backups SQLite table with per-customer snapshots
- Customer detail page shows infra backup status card
- README.md with full API docs and DR flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:17:12 +01:00


# felhom-hub
**Central operator dashboard for monitoring and managing Felhom customer deployments.**

A lightweight Go service that receives periodic reports from felhom-controller instances, stores them in SQLite, and provides a web dashboard for fleet monitoring. It also serves as the infrastructure backup store for disaster recovery.

**Current version: v0.1.7**

---
## Architecture
```
Customer nodes                           Central Hub (k3s)
┌───────────────────┐                    ┌────────────────────────┐
│ felhom-controller │──── JSON push ────▶│ felhom-hub             │
│ (every 15 min)    │   (Bearer auth)    │                        │
│                   │                    │  ┌──────────────────┐  │
│ POST /api/v1/     │                    │  │ API Handler      │  │
│   report          │                    │  │ (ingest reports, │  │
│   infra-backup    │                    │  │  infra backups)  │  │
│   notify          │                    │  └────────┬─────────┘  │
│                   │                    │           │            │
└───────────────────┘                    │  ┌────────▼─────────┐  │
                                         │  │ SQLite Store     │  │
Operator browser                         │  │ (reports,        │  │
┌───────────────────┐                    │  │  infra_backups,  │  │
│ Web Dashboard     │◀── HTML pages ─────│  │  notifications)  │  │
│ (hub.felhom.eu)   │   (bcrypt auth)    │  └──────────────────┘  │
└───────────────────┘                    │                        │
                                         │  ┌──────────────────┐  │
                                         │  │ Web Dashboard    │  │
                                         │  │ (multi-customer  │  │
                                         │  │  overview)       │  │
                                         │  └──────────────────┘  │
                                         └────────────────────────┘
```
## API Endpoints
All API endpoints require `Authorization: Bearer <report_api_key>` (except `/healthz`).
### Report Ingest
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/report` | Controller pushes periodic status report |
| `GET` | `/api/v1/customers` | List all customers with latest report summary |
| `GET` | `/api/v1/customers/{id}` | Get latest full report for a customer |
| `GET` | `/api/v1/customers/{id}/history?period=7d` | Get report history |
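
A minimal Go sketch of how a client could build the report push with the required Bearer header. The base URL, the API key value, and the payload field are placeholders chosen for illustration; the real report schema is not documented in this README.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// newReportRequest builds a POST /api/v1/report request with Bearer auth.
// The body is passed through unchanged because the report schema is not
// documented here.
func newReportRequest(baseURL, apiKey string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/report", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical payload: the field name is a placeholder, not the real schema.
	payload := []byte(`{"customer_id":"example-customer"}`)

	// Assumption: the API is served from the same host as the dashboard.
	req, err := newReportRequest("https://hub.felhom.eu", "example-key", payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```

The `GET` endpoints in the table take the same `Authorization` header; only the method and path change.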
### Infrastructure Backup (Disaster Recovery)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/infra-backup` | Controller pushes infrastructure snapshot |
| `GET` | `/api/v1/infra-backup/{customer_id}` | Fresh controller pulls backup for restore |

The infra-backup payload contains everything needed to restore a customer deployment:
- `controller.yaml` (base64, full config including secrets)
- `settings.json` (base64, backup preferences, storage paths)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths, display names)
- Restic passwords (primary + cross-drive, for encrypted backup access)

**Disaster recovery flow:**
1. Customer's system drive fails → replaced with fresh Debian install
2. `docker-setup.sh` deploys controller with Hub details (customer_id + API key)
3. Controller detects fresh deployment → calls `GET /api/v1/infra-backup/{customer_id}`
4. Controller uses disk UUIDs to auto-mount surviving drives
5. Controller restores apps from local backups on those drives
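
Step 3 of the flow could look roughly like the Go sketch below: fetch the snapshot for a customer, then base64-decode the embedded files. The JSON field names (`controller_yaml`, `settings_json`) and the helper names are assumptions made for illustration; the actual wire format is not specified in this README.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// infraBackup mirrors two of the payload entries listed above.
// The JSON field names are assumptions, not the documented format.
type infraBackup struct {
	ControllerYAML string `json:"controller_yaml"` // base64-encoded controller.yaml
	SettingsJSON   string `json:"settings_json"`   // base64-encoded settings.json
}

// infraBackupURL builds the pull URL for one customer.
func infraBackupURL(baseURL, customerID string) string {
	return baseURL + "/api/v1/infra-backup/" + url.PathEscape(customerID)
}

// pullInfraBackup fetches the latest snapshot using Bearer auth.
func pullInfraBackup(client *http.Client, baseURL, customerID, apiKey string) (*infraBackup, error) {
	req, err := http.NewRequest(http.MethodGet, infraBackupURL(baseURL, customerID), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("hub returned %s", resp.Status)
	}
	var b infraBackup
	if err := json.NewDecoder(resp.Body).Decode(&b); err != nil {
		return nil, err
	}
	return &b, nil
}

// decodeFile turns one base64 payload entry back into file bytes.
func decodeFile(encoded string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(encoded)
}

func main() {
	// Offline demonstration of the decode step with made-up file content.
	encoded := base64.StdEncoding.EncodeToString([]byte("customer_id: example\n"))
	raw, err := decodeFile(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", raw)
}
```

Because the snapshot carries secrets and Restic passwords, a real client would want to write the decoded files with restrictive permissions.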
### Notifications
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/notify` | Controller sends event notification (backup_failed, disk_warning, etc.) |
| `POST` | `/api/v1/preferences` | Controller syncs customer notification preferences |

Notifications are sent via Resend.com email API.
### Health
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/healthz` | Health check (no auth required) |
## Web Dashboard
Protected by bcrypt password + session cookie (7-day expiry).
- **Customer overview table:** status indicators (OK/WARN/DOWN), CPU/memory %, disk usage, container counts, backup age, controller version
- **Customer detail page:** system info, storage bars, container table, notification preferences, notification log, 24h history graphs
- **Auto-refresh:** 60-second cycle
- **Status logic:**
  - Green: report < 30 min old, health = ok
  - Yellow: 30-60 min stale or health = warn
  - Red: > 60 min stale or health = fail
## Data Storage
SQLite with WAL mode. Tables:
| Table | Purpose |
|-------|---------|
| `reports` | Full JSON reports with denormalized fields for dashboard queries |
| `infra_backups` | Per-customer infrastructure snapshots for disaster recovery |
| `customer_notifications` | Email + enabled event types per customer |
| `notification_log` | Send/skip/fail history for notifications |

Retention: configurable (default 90 days), daily prune at 04:30 Budapest time.
## Configuration
```yaml
# hub.yaml
auth:
  password_hash: ""          # bcrypt hash for dashboard login (empty = no auth)
api:
  report_api_key: ""         # Bearer token for API auth
notifications:
  resend_api_key: ""         # Resend.com API key for email
  from_email: "monitoring@felhom.eu"
retention:
  max_days: 90
  prune_schedule: "04:30"
alerting:
  stale_threshold: "30m"     # Customer considered stale after this duration
server:
  listen: ":8080"
  data_dir: "/data"          # SQLite database location
```
## Deployment
Runs on k3s (Kubernetes) in the `felhom-system` namespace:
- **PVC:** 1GB Longhorn volume for SQLite database
- **Resources:** 64Mi-256Mi memory, 50m-500m CPU
- **Ingress:** `hub.felhom.eu` with TLS (cert-manager)
- **Geo-restriction:** Hungary only (nginx annotation)
```bash
# Build and push
cd hub/
make VERSION=0.2.0 docker docker-push
# Deploy
kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:v0.2.0
kubectl rollout status -n felhom-system deploy/hub
# Check
kubectl logs -n felhom-system -l app=hub --tail 20
```
## Dependencies
- `golang.org/x/crypto` — bcrypt for password hashing
- `gopkg.in/yaml.v3` — YAML config parsing
- `modernc.org/sqlite` — Pure Go SQLite (no CGo)