# felhom-controller

**Central management container for Felhom home servers.**

A single, lightweight Go container that replaces Portainer + scattered systemd scripts with a unified, Hungarian-language web dashboard for managing Docker Compose stacks, backups, storage, monitoring, and notifications on customer hardware.

**Current version: v0.12.7**

---

## Table of Contents

- [Architecture](#architecture)
- [Features](#features)
  - [App Management](#1-app-management)
  - [Backup System](#2-backup-system)
  - [Storage Management](#3-storage-management)
  - [Monitoring & Health](#4-monitoring--health)
  - [Notifications](#5-notifications)
  - [Update Management](#6-update-management)
  - [Authentication & Settings](#7-authentication--settings)
  - [Central Hub](#8-central-hub-reporting)
- [Repository Layout](#repository-layout)
- [Configuration](#configuration)
- [REST API](#rest-api)
- [Build & Deploy](#build--deploy)
- [Roadmap](#roadmap)

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Customer Hardware (N100 mini PC / Raspberry Pi)                │
│                                                                 │
│  ┌──────────┐   ┌────────────────────────────────────────────┐  │
│  │ Traefik  │   │  felhom-controller (privileged container)  │  │
│  │ (reverse │──▶│                                            │  │
│  │  proxy)  │   │  ┌──────────┐  ┌─────────────────────────┐ │  │
│  └──────────┘   │  │ Web UI   │  │ Stack Manager           │ │  │
│                 │  │ (HU dash │  │ (compose ops, git sync, │ │  │
│  ┌──────────┐   │  │  board)  │  │ deploy, delete, update) │ │  │
│  │cloudflared│  │  └──────────┘  └─────────────────────────┘ │  │
│  │ (tunnel) │   │  ┌──────────┐  ┌─────────────────────────┐ │  │
│  └──────────┘   │  │ Backup   │  │ Storage Manager         │ │  │
│                 │  │ (3-layer │  │ (disk scan, format,     │ │  │
│  ┌──────────┐   │  │  restic) │  │ mount, migrate)         │ │  │
│  │   App    │   │  └──────────┘  └─────────────────────────┘ │  │
│  │  stacks  │   │  ┌──────────┐  ┌─────────────────────────┐ │  │
│  │ (docker  │   │  │Scheduler │  │ Monitor & Metrics       │ │  │
│  │ compose) │   │  │(cron-like│  │ (health, pings, SQLite  │ │  │
│  └──────────┘   │  │  jobs)   │  │ time-series, Chart.js)  │ │  │
│                 │  └──────────┘  └─────────────────────────┘ │  │
│                 │  ┌──────────┐  ┌─────────────────────────┐ │  │
│                 │  │ Notify   │  │ REST API + Hub Reporter │ │  │
│                 │  │ (email)  │  │ (JSON push to hub)      │ │  │
│                 │  └──────────┘  └─────────────────────────┘ │  │
│                 └────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
        │ pings                 │ JSON push             │ git pull
        ▼                       ▼                       ▼
  status.felhom.eu         hub.felhom.eu         gitea.dooplex.hu
  (Healthchecks)        (central dashboard)     (stack definitions)
```

### Key Architecture Decisions

- **Pure Go, no frameworks**: stdlib `net/http` + `html/template`. Only external deps: `bcrypt`, `yaml.v3`, `modernc.org/sqlite` (pure Go, no CGO).
- **Privileged container**: required for disk operations (format, mount, fstab), `/dev` access, and Docker socket control.
- **`/host-dev` indirection**: Docker overrides `/dev` with a tmpfs. The host's `/dev` is mounted at `/host-dev` to access block devices.
- **`StackDataProvider` interface**: breaks the circular import between the backup and stacks packages. Implemented by `stackAdapter` in `main.go`.
- **Atomic file writes**: all persistent state (`settings.json`, `app.yaml`) is written to a `.tmp` file and then moved into place with `os.Rename` for crash safety.
- **`go:embed` templates**: all HTML/CSS/JS is compiled into the binary. No runtime file dependencies.
- **Europe/Budapest timezone**: all scheduled jobs, timestamps, and UI labels use the Hungarian timezone.

### Module Map

| Module | Path | Responsibility |
|--------|------|----------------|
| **Config** | `internal/config/` | YAML loader, validation, `FELHOM_*` env overrides |
| **Settings** | `internal/settings/` | Runtime-mutable `settings.json` (passwords, backup prefs, storage paths, notifications) |
| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow |
| **Sync** | `internal/sync/` | Git-based app catalog sync (clone/pull, content-hash copy) |
| **Backup** | `internal/backup/` | 3-layer backup: DB dumps → restic snapshots → cross-drive copies, restore |
| **Storage** | `internal/storage/` | Disk scanning (`lsblk`), partitioning (`sfdisk`), formatting (`mkfs.ext4`), mounting, data migration (`rsync`) |
| **System** | `internal/system/` | System info (`/proc`), CPU collector, mount points, disk usage, FS info |
| **Monitor** | `internal/monitor/` | Healthchecks.io pinger, system health checks |
| **Metrics** | `internal/metrics/` | SQLite time-series store, system + container metric collection |
| **Scheduler** | `internal/scheduler/` | Central job scheduler (periodic + daily, skip-if-running, panic recovery) |
| **Notify** | `internal/notify/` | Email notifications via hub relay, preference sync, per-event cooldowns |
| **Report** | `internal/report/` | Hub report builder + HTTP pusher (system, stacks, backup, health) |
| **API** | `internal/api/` | REST JSON endpoints |
| **Web** | `internal/web/` | Hungarian dashboard, auth, page handlers, template functions, alerts |

---

## Features

### 1. App Management

The controller manages Docker Compose stacks through a complete lifecycle: catalog sync, first-time deployment, runtime operations, and deletion.

#### Git Sync (`internal/sync/`)

The app catalog lives in a separate Git repository. The controller:

- Shallow-clones the catalog on startup
- Periodically fetches updates (configurable, default 15 min)
- Copies only `docker-compose.yml` and `.felhom.yml` to the stacks directory
- **Never overwrites** `app.yaml` or `.env` (user secrets are safe)
- Uses SHA-256 content hashing, so it only writes files that actually changed
- Triggers a stack rescan after sync so the dashboard updates immediately
- Supports manual sync via the "Sablonok frissitese" button or `POST /api/sync`

#### First-Time Deploy Flow

1. Customer sees the app card with a "Telepites" button
2. Deploy page shows auto-filled fields (domain), auto-generated secrets (DB passwords, hex keys), and user-configurable inputs (admin password, language, storage path)
3. `checkBeforeDeploy()` JS guard fetches live state first (prevents double-deploy from another tab)
4. **Memory validation** checks `mem_request` against available RAM:
   - `usable_memory = total_ram - reserved_memory_mb` (default 384 MB reserved)
   - Hard block if requests exceed usable memory
   - Soft warning if limits exceed total RAM (overcommit OK)
5. Controller generates secrets, saves `app.yaml`, and sets the in-memory `Deployed` flag **before** `docker compose up -d` (avoids stale UI during slow image pulls); it reverts the flag on failure
6. 3-step progress panel polls `GET /api/stacks/{name}` every 3s: config saved → containers starting → health check passed
7. Post-deploy: locked fields (DB_PASSWORD, etc.) become read-only

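The memory validation in step 4 can be sketched as below. The type and function names are hypothetical, and the assumption that requests and limits arrive as already-summed totals (existing apps plus the new one) is mine; the formula and the hard-block versus soft-warning split come from the list above.

```go
package main

import "fmt"

// memoryVerdict is a hypothetical result type for the deploy-time check.
type memoryVerdict struct {
	Blocked bool // hard block: requests exceed usable memory
	Warning bool // soft warning: limits exceed total RAM (overcommit is allowed)
}

// checkMemory applies usable_memory = total_ram - reserved_memory_mb.
// requestedMB and limitMB are assumed to be totals across all deployed
// apps plus the app about to be deployed.
func checkMemory(totalMB, reservedMB, requestedMB, limitMB int) memoryVerdict {
	usable := totalMB - reservedMB
	return memoryVerdict{
		Blocked: requestedMB > usable,
		Warning: limitMB > totalMB,
	}
}

func main() {
	// 8 GB machine with the default 384 MB reserved leaves 7808 MB usable.
	v := checkMemory(8192, 384, 7000, 9000)
	fmt.Printf("blocked=%v warning=%v\n", v.Blocked, v.Warning)
}
```

With these inputs the deploy is allowed (7000 MB requested is below 7808 MB usable) but flagged, because the combined limits exceed physical RAM.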
#### App Info Pages

Each app can define rich metadata in `.felhom.yml`:

- `app_info`: tagline, use_cases, first_steps, prerequisites, default_creds, docs_url
- `optional_config`: groups of post-deploy configurable env vars (e.g., API keys for metadata providers)
- `resources`: mem_request, mem_limit, pi_compatible, needs_hdd

The `/apps/{slug}` page renders a hero section, screenshots, a setup guide, and an optional config form.

#### Stack Operations

| Operation | What it does |
|-----------|-------------|
| Start | `docker compose up -d` |
| Stop | `docker compose stop` (blocked for protected stacks) |
| Restart | `docker compose restart` |
| Update | `docker compose pull` + `docker compose up -d` |
| Delete | `docker compose down --rmi local --volumes` + optional HDD data cleanup |

**Protected stacks** (traefik, cloudflared, felhom-controller) cannot be stopped or deleted from the UI. Restart is allowed.

**Orphan detection**: Deployed stacks with no matching catalog template are marked as orphaned with an "Elavult" badge and can be safely deleted.

#### Container State Display

| State | Color | Label | Meaning |
|-------|-------|-------|---------|
| Running + healthy | Green | "Fut" | All containers running and healthy |
| Running + starting | Orange | "Indulas..." | Healthcheck not yet passed |
| Running + unhealthy | Yellow | "Nem egeszseges" | Healthcheck failing |
| Stopped/exited | Red | "Leallitva" | All containers stopped |
| Restarting | Yellow | "Ujrainditas..." | Restart loop |
| Not deployed | Gray | "Nincs telepitve" | Compose file exists, not deployed |

---

### 2. Backup System

The backup system implements a **3-2-1 backup architecture**. Each tier is a **complete, self-sufficient backup**: any single tier can fully restore an app.

| Tier | Contents | Location | Can fully restore? |
|------|----------|----------|--------------------|
| **1. Nightly restic** | DB + Config + User data | Same drive as app | Yes (but not after a drive failure) |
| **2. Cross-drive** | DB + Config + User data | Different physical device | Yes |
| **3. Remote** | Everything | Cloud / remote server | Future |

**Key principles:**

- User data backup is **mandatory**: every app with HDD bind mounts is included automatically. There is no per-app toggle.
- Each tier includes **everything** needed to restore: DB dumps, config, and user data. No tier depends on another tier's data.
- The `AppBackupPrefs.Enabled` field in `settings.json` is legacy and not read by any code.

#### Tier 1: Nightly Backup (mandatory, same drive)

The nightly backup has two phases that run sequentially:

**Phase 1: Database Dumps** (`internal/backup/dbdump.go`, scheduled 02:30)

- **Auto-discovery** of PostgreSQL and MariaDB containers via `docker ps` + `docker inspect`
- Dumps via `docker exec pg_dump` / `docker exec mariadb-dump` with a 5-minute timeout
- Atomic writes (`.tmp` → `.sql`) to prevent corruption
- **Validation** after each dump: checks file size and header presence, and counts `CREATE TABLE` statements
- Results are cached in `settings.json`, so they survive container restarts

**Phase 2: Restic Snapshot** (`internal/backup/restic.go`, scheduled 03:00)

- Auto-generated repository password (32 random bytes, base64url), synced to hub
- **Paths included in every snapshot:**
  - Stacks dir (all compose.yml + app.yaml + .felhom.yml)
  - DB dump dir (all `.sql` dump files from Phase 1)
  - `controller.yaml` (controller config)
  - **ALL deployed apps' HDD mount paths**, discovered via `resolveAppBackupPaths()`, which iterates `ListDeployedStacks()` without consulting any `Enabled` flag
- Auto-detects and unlocks stale locks (restic repo lock)
- Weekly prune on Sundays with configurable retention (keep-daily, keep-weekly, keep-monthly)
- Weekly integrity check (`restic check`) on Sunday 04:00

**Protects against:** accidental deletion and data corruption; also allows point-in-time rollback.
Does NOT protect against drive failure (the backup is on the same physical drive).

#### Tier 2: Cross-Drive Backup (opt-in, different device) (`internal/backup/crossdrive.go`)

**Complete backup** to a different physical drive: DB dumps + config + user data.

- **Two methods:**
  - **rsync**: simple mirror with `--delete` (fast, no versioning, **browsable** on disk)
  - **restic**: versioned, deduplicated, encrypted (shared repo across apps, not browsable)
- Per-app configuration in `settings.json`: destination path, method, schedule (daily/weekly/manual)
- **Pre-backup DB dump:** `DumpStackDB()` runs a fresh pg_dump/mariadb-dump before each cross-drive backup; non-fatal on failure (wired via the `DBDumper` interface to avoid circular imports)
- **Drive-type-aware validation** (`ValidateDestination`):

  | Destination type | Space checks |
  |-----------------|--------------|
  | External mount (different device than `/`) | Block if <100 MB free |
  | System drive (same device as `/`) | Require ≥10 GB free AND <90% used; logged warning |

- **Rsync destination layout** (complete, can restore the app independently):

  ```
  backups/rsync/<app>/
    _db/          ← DB dump files (stackName_postgres.sql, etc.)
    _config/      ← compose.yml, app.yaml, .felhom.yml
    <user data>   ← HDD mount contents (single mount: flat; multi-mount: leaf subfolders)
  ```

- DB dump files are excluded from the user data rsync (`--exclude backups/*.sql.gz/sql/dump`) to avoid duplicating app-internal dumps
- `_` prefix directories prevent collision with user data
- **Restic backup paths:** include HDD mounts + config dir + DB dump dir (deduplication handles overlap)
- Safety guards: destination ≠ source, path-overlap check, writable check
- **Chained execution:** runs immediately after the nightly restic backup; daily apps every night, weekly apps on Sundays
- Per-app concurrency lock prevents overlapping runs
- Status (last_run, duration, size, error) is persisted to `settings.json`

**Protects against:** primary drive failure, drive theft/damage.

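The drive-type-aware space rules from the `ValidateDestination` table above can be sketched as follows. This is not the actual implementation: `diskStats` and `checkSpace` are hypothetical names, and treating MB/GB as binary units is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	minFreeExternal uint64 = 100 << 20 // 100 MB (assumed to mean MiB)
	minFreeSystem   uint64 = 10 << 30  // 10 GB (assumed to mean GiB)
	maxUsedSystem          = 90.0      // percent
)

// diskStats is a hypothetical input type; real code would fill it from statfs.
type diskStats struct {
	OnSystemDrive bool // destination shares a block device with "/"
	FreeBytes     uint64
	TotalBytes    uint64
}

// checkSpace returns an error when the destination must be blocked.
// warn is true whenever the destination sits on the system drive, so the
// caller can log a warning even if the backup is allowed.
func checkSpace(d diskStats) (warn bool, err error) {
	if d.TotalBytes == 0 {
		return d.OnSystemDrive, errors.New("unknown filesystem size")
	}
	if !d.OnSystemDrive {
		if d.FreeBytes < minFreeExternal {
			return false, errors.New("less than 100 MB free on external destination")
		}
		return false, nil
	}
	usedPct := 100 * float64(d.TotalBytes-d.FreeBytes) / float64(d.TotalBytes)
	if d.FreeBytes < minFreeSystem || usedPct >= maxUsedSystem {
		return true, errors.New("system drive needs at least 10 GB free and under 90% usage")
	}
	return true, nil
}

func main() {
	warn, err := checkSpace(diskStats{OnSystemDrive: true, FreeBytes: 20 << 30, TotalBytes: 100 << 30})
	fmt.Println(warn, err)
}
```

The stricter system-drive rule reflects that a backup filling `/` would affect every service on the machine, not just the backup job.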
#### Tier 3: Remote Backup (future)

Complete offsite backup for disaster recovery. Not yet implemented.

#### Restore (`internal/backup/restore.go`)

All deployed apps appear in the restore dropdown, because every app has restic snapshot data (stacks dir + DB dumps are always backed up).

| App type | Config restored | DB restored | User data restored |
|----------|----------------|------------|-------------------|
| Has HDD data | ✓ | ✓ | ✓ (always, backup is mandatory) |
| DB only, no HDD | ✓ | ✓ | n/a |
| No DB, no HDD | ✓ | - | n/a |

- **Snapshot API** returns ALL snapshots unfiltered. Older snapshots still allow config+DB restore; `RestoreApp` extracts whatever paths are available
- **Restore type info** shown per-app when selected in dropdown (Hungarian banners):
  - Has HDD: "Teljes visszaállítás: adatbázis + konfiguráció + felhasználói adatok"
  - Has DB, no HDD: "Adatbázis és konfiguráció visszaállítása"
  - No DB, no HDD: "Csak konfiguráció visszaállítása"
- **Execution flow:** stop app → `restic restore <id> --target / --include <path>...` → restart app
- Running flag prevents concurrent backup/restore operations
- Snapshot ID is validated (8–64 lowercase hex characters)

**Note:** Restore currently uses Tier 1 (primary restic repo) only. Restoring from Tier 2 (cross-drive) is a future enhancement.

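The snapshot ID check and the restore command line described above can be sketched like this. `validSnapshotID` and `restoreArgs` are hypothetical names; the 8–64 lowercase hex rule and the `restic restore <id> --target / --include <path>` shape come from this section.

```go
package main

import (
	"fmt"
	"regexp"
)

// snapshotIDPattern accepts 8 to 64 lowercase hexadecimal characters.
var snapshotIDPattern = regexp.MustCompile(`^[0-9a-f]{8,64}$`)

func validSnapshotID(id string) bool {
	return snapshotIDPattern.MatchString(id)
}

// restoreArgs (hypothetical helper) builds the argument list for
// "restic restore", rejecting IDs that fail validation before anything
// reaches the command line.
func restoreArgs(id string, paths []string) ([]string, error) {
	if !validSnapshotID(id) {
		return nil, fmt.Errorf("invalid snapshot ID %q", id)
	}
	args := []string{"restore", id, "--target", "/"}
	for _, p := range paths {
		args = append(args, "--include", p)
	}
	return args, nil
}

func main() {
	args, err := restoreArgs("1a2b3c4d", []string{"/opt/docker/stacks/demo"})
	fmt.Println(args, err)
}
```

Validating the ID against a strict pattern means user input from the restore form cannot smuggle extra flags into the restic invocation.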
#### Backup Page UI (`internal/web/templates/backups.html`)

Unified per-app status table with expandable rows showing **per-tier** backup status:

**Status dot per app:**

| Dot color | Meaning |
|-----------|---------|
| Green | Fully covered: cross-drive configured and last run OK |
| Yellow | Warning: no second copy, or last backup failed, or disk space issue |
| Red | Cross-drive destination blocked or inaccessible |
| Gray (auto) | No user data: only config/DB backup (automatic) |

**Per-app backup tiers:**

- **1. mentés** (Tier 1, always present): Auto badge + "helyi" + last run + contents (e.g., "DB + Konfig + Adatok")
- **2. mentés** (Tier 2, only for apps with HDD data), one of:
  - Configured: method (rsync/restic) + destination + schedule + last run + status + contents + browsable indicator (📁 for rsync) + action buttons
  - Not configured: "✓ 1. mentés auto" + "⚠ Nincs 2. másolat" + settings link

**Backup contents per app** (shown per tier):

- Apps with DB + HDD: "DB + Konfig + Adatok"
- Apps with DB only: "DB + Konfig"
- Apps with HDD, no DB: "Konfig + Adatok"
- Apps with neither: "Konfig"

**Other sections:**

- Schedule overview with next run times for DB dump, restic, prune
- Snapshot history table (last 20 snapshots with ID, time, files new/changed, data added)
- Repository info card (path, size, snapshot count, encryption key with show/copy)
- Restore section: app dropdown → snapshot dropdown → restore type info → confirmation checkbox → execute

---

### 3. Storage Management

The storage subsystem handles the full lifecycle of external storage: detection, initialization, path registration, and data migration.

#### Disk Scanning (`internal/storage/scan.go`)

- `ScanDisks()` uses `lsblk -J -b` for block device enumeration
- System disk detection via host fstab parsing (`/host-fstab`) + UUID resolution via `blkid`
- Partitions are enriched with filesystem type, UUID, and label from direct `blkid` probing (Docker containers have an incomplete udev cache)
- Returns `AvailableDisks` (non-system, non-loop, non-CDROM) and `SystemDisks` separately
- Handles NVMe (`nvme0n1p1`), SCSI (`sdb1`), and eMMC (`mmcblk0p1`) naming

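The three naming schemes listed above differ in how a partition name maps back to its parent disk. The sketch below shows one way to handle that; `parentDisk` and both patterns are illustrative assumptions, not code from `scan.go`.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// NVMe and eMMC devices end in a digit, so the kernel inserts "p"
	// before the partition number: nvme0n1p1, mmcblk0p1.
	pSuffixed = regexp.MustCompile(`^(nvme\d+n\d+|mmcblk\d+)p\d+$`)
	// SCSI/SATA/USB disks append the partition number directly: sdb1.
	sdStyle = regexp.MustCompile(`^(sd[a-z]+)\d+$`)
)

// parentDisk (hypothetical helper) returns the whole-disk name for a
// partition name, or the input unchanged when it is not a recognized
// partition.
func parentDisk(name string) string {
	if m := pSuffixed.FindStringSubmatch(name); m != nil {
		return m[1]
	}
	if m := sdStyle.FindStringSubmatch(name); m != nil {
		return m[1]
	}
	return name
}

func main() {
	for _, n := range []string{"nvme0n1p1", "sdb1", "mmcblk0p1", "sda"} {
		fmt.Printf("%s -> %s\n", n, parentDisk(n))
	}
}
```

In practice `lsblk -J` already reports the parent/child tree, so a helper like this is mainly useful when matching fstab or `blkid` output back to a disk.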
#### Disk Initialization Wizard (`internal/storage/format.go`)

A step-by-step UI at `/settings/storage/init`:

1. **Scan**: lists available disks with model, size, partition info
2. **Select**: user picks a disk and enters a mount name (e.g., `hdd_1`)
3. **Confirm**: user types "FORMAZAS" to confirm the destructive operation
4. **Format pipeline**: `wipefs` → `sfdisk` (GPT) → `mkfs.ext4` → `blkid` UUID → backup fstab → append UUID-based fstab entry → mount → `findmnt` verification → `chown 1000:1000` → create `storage/` and `Dokumentumok/` subdirectories
5. Auto-registers the new storage path in `settings.json`
6. Smart partition detection: skips repartitioning for existing empty partitions

Safety guards: system disk detection, mount path conflict check, confirmation required, progress channel for real-time UI feedback.

#### Storage Path Registry (`internal/settings/settings.go`)

Multiple external storage paths are supported, each with:

- **Label**: Human-readable name (editable inline)
- **Default flag**: New deploys use this path by default
- **Schedulable flag**: Path appears in deploy dropdown
- **Auto-discovery**: On startup, scans deployed apps' `HDD_PATH` values and registers unknown paths
- Thread-safe CRUD: Add, Remove, SetDefault, SetSchedulable, SetLabel

#### Data Migration (`internal/storage/migrate.go`)

Move app data between storage paths (e.g., SSD → HDD, HDD → new HDD):

1. Validate: stack exists, deployed, has HDD data, target differs from source
2. Estimate total size, check free space on target
3. Stop the application
4. `rsync -a --info=progress2` per mount path with real-time progress parsing
5. Update `app.yaml` HDD_PATH to the new location
6. Start the application
7. **Rollback on failure**: reverts config, restarts on old storage

Progress UI at `/stacks/{name}/migrate` with byte counter and percentage.

#### Stale Data Cleanup

After migration, the deploy page detects leftover data on previous storage paths:

- Shows path, size, and a delete button
- Two-step confirmation required
- Protected paths (storage root, media, Dokumentumok, appdata) cannot be deleted

#### FileBrowser Mount Sync

When storage paths are added or removed, `syncFileBrowserMounts()` auto-regenerates FileBrowser's `docker-compose.yml` with volume mounts for all registered paths, then recreates the container.

---

### 4. Monitoring & Health

#### System Health Checks (`internal/monitor/healthcheck.go`)

`RunHealthCheck()` evaluates multiple subsystems and returns a `HealthReport` with status (`ok`/`warn`/`fail`):

| Check | Warning | Critical |
|-------|---------|----------|
| Disk usage (SSD/HDD) | >= 90% | >= 95% |
| Memory | available < 512 MB | available < 256 MB |
| CPU temperature | >= 75 °C | >= 85 °C |
| Docker daemon | - | unreachable |
| Protected containers | - | not running |
| Storage paths | not a mount point (data on SSD) | path inaccessible, disk >= 95% |

Backup destination validation (`CheckBackupDestination`) has tiered checks:

- Path doesn't exist → critical/blocked
- Not writable → critical/blocked
- Same block device as root → warning (data on system drive)
- Disk >95% full → critical/blocked
- Disk >90% full → warning

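The numeric thresholds from the health check table above map onto the three report statuses roughly as in this sketch. All identifiers here are hypothetical; the real `RunHealthCheck()` also covers Docker, protected containers, and storage paths, which are omitted.

```go
package main

import "fmt"

type status string

const (
	statusOK   status = "ok"
	statusWarn status = "warn"
	statusFail status = "fail"
)

// atLeast classifies metrics where higher values are worse.
func atLeast(v, warn, fail float64) status {
	switch {
	case v >= fail:
		return statusFail
	case v >= warn:
		return statusWarn
	}
	return statusOK
}

func classifyDisk(usedPct float64) status { return atLeast(usedPct, 90, 95) }
func classifyTemp(celsius float64) status { return atLeast(celsius, 75, 85) }

// classifyMemory works on available memory, where lower values are worse.
func classifyMemory(availableMB int) status {
	switch {
	case availableMB < 256:
		return statusFail
	case availableMB < 512:
		return statusWarn
	}
	return statusOK
}

// worst returns the most severe status, which becomes the overall report status.
func worst(all ...status) status {
	rank := map[status]int{statusOK: 0, statusWarn: 1, statusFail: 2}
	out := statusOK
	for _, s := range all {
		if rank[s] > rank[out] {
			out = s
		}
	}
	return out
}

func main() {
	overall := worst(classifyDisk(91.5), classifyMemory(2048), classifyTemp(62))
	fmt.Println(overall) // prints "warn": one warning degrades the whole report
}
```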
#### Healthchecks.io Integration (`internal/monitor/pinger.go`)

Five ping UUIDs for external monitoring:

- **Heartbeat**: every 5 min (simple "I'm alive")
- **System Health**: periodic health check results
- **DB Dump**: after nightly database dumps
- **Backup**: after nightly restic backup
- **Backup Integrity**: weekly `restic check` result

Pings use a 3-attempt retry with 2-second backoff. The pinger never fails the caller.

#### Metrics Store (`internal/metrics/`)

- **SQLite with WAL mode** for concurrent reads during collection
- **System metrics**: CPU%, memory (total/used/available), temperature, load average, collected every 60 seconds
- **Container metrics**: CPU%, memory, network I/O, block I/O per container
- Downsampled queries for chart time ranges (1h, 6h, 24h, 7d, 30d)
- 30-day auto-prune via daily scheduler job

#### Monitoring Page

Full-page system monitor at `/monitoring`:

- **System Overview**: hostname, OS, kernel, CPU model/cores, uptime
- **System Metrics Charts**: 4 line charts (CPU, Memory, Temperature, Load) in 2x2 grid
- **Container Resources**: horizontal bar charts (CPU% and Memory per container)
- **Per-container Detail**: click-to-expand historical charts
- **Remote Monitoring Status**: shows Healthchecks ping UUID configuration

Chart.js 4.4.7 is embedded locally (works in offline environments), with a dark theme matching the site design.

#### Alert System (`internal/web/alerts.go`)

State-based alerts displayed on all pages:

- Sources: health issues, missing ping UUIDs, backup disabled
- Sorted by severity (error > warning > info), capped at 5 visible
- Refreshed every 5 min + on startup
- Monitoring page suppresses ping-related alerts (shown in a dedicated table instead)

---

### 5. Notifications

#### Email Delivery

The controller relays notifications through the central hub, which sends emails via the Resend API:

1. Controller detects event (health degradation, backup failure, etc.)
2. Non-blocking POST to hub's `/api/v1/notify` with event details
3. Hub checks customer notification preferences
4. Hub sends Hungarian-language email via Resend

#### Event Types

| Event | Trigger |
|-------|---------|
| `disk_warning` | Disk usage crosses warning/critical threshold |
| `backup_failed` | Nightly backup or DB dump fails |
| `update_available` | New app version detected in catalog |
| `security_update` | Critical security update available |

#### Cooldown System

Per-event-type cooldown (default 6 hours, configurable) prevents notification spam. Only notifies on **status degradation** (ok→warn, ok→fail, warn→fail), not on repeated same-status checks.

#### Preference Sync

Notification preferences (email, enabled events, cooldown) are:

- Stored locally in `settings.json`
- Synced to hub on save and on controller startup
- Saved locally even if the hub sync fails

---

### 6. Update Management

#### App Catalog Sync

- Periodic `git fetch` + `git reset --hard` of the app catalog repo
- Content-hash comparison prevents unnecessary file writes
- Post-sync stack rescan detects new/changed apps immediately

#### Planned Update Classifications

| Marker | Behavior |
|--------|----------|
| No marker | Optional: shown on dashboard, customer clicks "Update" |
| `UPDATE_REQUIRED=true` | Mandatory: auto-applied during next update window |
| `UPDATE_SECURITY=true` | Critical: applied immediately |

---

### 7. Authentication & Settings

#### Session Auth (`internal/web/auth.go`)

- bcrypt password verification with configurable source priority: `settings.json` → `controller.yaml` → no auth (open access)
- 7-day session duration with random 32-byte hex tokens
- `?next=` redirect after login preserves the page the user was visiting
- Session cleanup every 15 minutes
- All sessions invalidated on password change
- Conditional logout link (hidden when auth is disabled)

#### Settings Persistence (`internal/settings/settings.go`)

Runtime-mutable settings in `settings.json` (separate from infrastructure config):

| Section | Contents |
|---------|----------|
| `password_hash` | bcrypt hash override |
| `notifications` | email, enabled events, cooldown hours |
| `db_validations` | per-DB dump validation results (survives restarts) |
| `app_backup` | per-app map: enabled flag, cross-drive config (method, dest, schedule, runtime status) |
| `storage_paths` | registered paths with label, default flag, schedulable flag |
| `cross_drive_restic_password` | auto-generated restic password for cross-drive repos |

All public methods use `sync.RWMutex`. File writes are atomic (`.tmp` + rename).

#### Settings Page (`/settings`)

Four sections:

1. **System config**: read-only display of `controller.yaml` values
2. **Password change**: current + new + confirm, min 8 chars
3. **Storage paths**: add/remove, edit labels, set default, toggle schedulable, per-path app list with sizes
4. **Notifications**: email, event checkboxes, cooldown hours, test email button

---

### 8. Central Hub Reporting

#### Report Push (`internal/report/`)

Periodic JSON push (default every 15 min) to the central felhom-hub service:

- System: hostname, OS, CPU, memory, disk usage, uptime
- Containers: running/stopped counts, per-container CPU/memory
- Backup: last run, success, repo stats, snapshot count, restic password (for disaster recovery)
- Health: current status, issues, warnings
- Stacks: deployed apps with versions and states

Requests use Bearer token authentication and a 3-attempt retry with 5-second backoff.

#### Hub Dashboard

The hub service (separate Go app in the `felhom.eu` repo) provides:

- Multi-customer overview table with status indicators
- Customer detail page with system/storage/containers/backup/health sections
- Color coding: green (<30min), yellow (30-60min), red (>60min since last report)
- 90-day report retention with daily prune

---

## Repository Layout

```
controller/
├── cmd/controller/main.go              # Entry point, wires all 14 modules
├── internal/
│   ├── config/config.go                # YAML loader, validation, env overrides
│   ├── settings/settings.go            # Runtime settings (JSON, atomic writes, RWMutex)
│   ├── stacks/
│   │   ├── manager.go                  # Stack scanning, compose ops, container status
│   │   ├── metadata.go                 # Parse .felhom.yml app metadata
│   │   ├── deploy.go                   # First-deploy: secret gen, app.yaml, compose up
│   │   └── delete.go                   # Stack deletion + HDD data cleanup
│   ├── sync/sync.go                    # Git sync: clone/pull app catalog, content-hash copy
│   ├── storage/
│   │   ├── scan.go, scan_linux.go      # Disk detection via lsblk + blkid
│   │   ├── format.go, format_linux.go  # Partition, format, mount pipeline
│   │   ├── safety.go, safety_linux.go  # System disk detection, mount guards, fstab ops
│   │   ├── migrate.go                  # App data migration (rsync with progress)
│   │   └── *_other.go                  # Non-Linux stubs for cross-compilation
│   ├── backup/
│   │   ├── backup.go                   # Orchestrator (dumps + restic + cross-drive chain)
│   │   ├── dbdump.go                   # DB auto-discovery + dump (pg_dump, mariadb-dump)
│   │   ├── restic.go                   # Restic operations (init, snapshot, prune, check)
│   │   ├── appdata.go                  # StackDataProvider interface, app data discovery
│   │   ├── crossdrive.go               # Per-app backup to secondary storage (rsync/restic)
│   │   └── restore.go                  # Per-app restore with auto stop/restart
│   ├── api/router.go                   # REST API endpoints (~30 routes)
│   ├── scheduler/scheduler.go          # Central job scheduler (Every, Daily)
│   ├── system/
│   │   ├── info.go, info_linux.go      # RAM, disk, CPU, temperature, load average
│   │   ├── cpu_linux.go                # Background /proc/stat sampling
│   │   └── mounts_linux.go             # Mount points, disk usage, FS info, backup dest checks
│   ├── monitor/
│   │   ├── pinger.go                   # Healthchecks.io HTTP ping client
│   │   └── healthcheck.go              # System health checks (disk, mem, CPU, temp, Docker)
│   ├── metrics/
│   │   ├── store.go                    # SQLite time-series (WAL mode, downsampled queries)
│   │   ├── collector.go                # Background collector (60s, system + docker stats)
│   │   └── sysinfo.go                  # Static system info (/proc, /etc)
│   ├── notify/notifier.go              # Email relay to hub, preference sync, cooldowns
│   ├── report/
│   │   ├── builder.go                  # Hub report builder (all subsystems → JSON)
│   │   └── pusher.go                   # HTTP POST to hub (retry, Bearer auth)
│   └── web/
│       ├── server.go                   # HTTP server, routing, static files
│       ├── auth.go                     # Session auth, login/logout, session cleanup
│       ├── handlers.go                 # Page handlers (dashboard, stacks, deploy, backups, etc.)
│       ├── storage_handlers.go         # Storage API handlers (scan, format, migrate, cleanup)
│       ├── alerts.go                   # State-based alert generation
│       ├── funcmap.go                  # Template functions (state colors, Hungarian formatting)
│       ├── embed.go                    # go:embed for templates + Chart.js
│       └── templates/                  # 12 HTML files + style.css (Hungarian UI)
├── configs/
│   ├── controller.yaml.example         # Full config reference
│   └── example-felhom-metadata.yml     # .felhom.yml format reference
├── Dockerfile                          # Multi-stage: Go 1.24 builder + debian-slim runtime
├── docker-compose.yml                  # Controller's own compose (privileged, /mnt rshared)
└── go.mod                              # Go 1.24, deps: bcrypt, yaml.v3, modernc.org/sqlite
```

---

## Configuration
|
||
|
||
### Controller config (`controller.yaml`)
|
||
|
||
Single YAML file per customer, infrastructure-only. Does **not** contain app-specific config.
|
||
|
||
Key sections:
|
||
```yaml
customer:
  name: "Demo Felhom"
  node_id: "demo-felhom"

paths:
  stacks_dir: "/opt/docker/stacks"
  data_dir: "/opt/docker/felhom-controller/data"
  db_dump_dir: "/srv/backups/db-dumps"
  restic_repo: "/srv/backups/restic-repo"

git:
  repo_url: "https://gitea.dooplex.hu/admin/app-catalog-felhom.eu.git"
  sync_interval: "15m"

backup:
  enabled: true
  db_dump_time: "02:30"
  restic_time: "03:00"
  retention: { keep_daily: 7, keep_weekly: 4, keep_monthly: 6 }

monitoring:
  health_interval: "5m"
  ping_uuids:
    heartbeat: "uuid-here"
    system_health: "uuid-here"
    db_dump: "uuid-here"
    backup: "uuid-here"
    backup_integrity: "uuid-here"

hub:
  enabled: true
  url: "https://hub.felhom.eu"
  api_key: "bearer-token-here"

system:
  reserved_memory_mb: 384 # RAM reserved for OS + controller
```
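
The `backup.retention` keys line up with the `--keep-daily`, `--keep-weekly`, and `--keep-monthly` options of `restic forget`. The sketch below shows one way such a policy could be turned into a command line. It is an illustration only: the `Retention` type and `ForgetArgs` helper are hypothetical names, and whether the controller prunes in the same invocation (`--prune`) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// Retention mirrors the backup.retention keys from controller.yaml.
// The type itself is hypothetical and not taken from the controller source.
type Retention struct {
	KeepDaily, KeepWeekly, KeepMonthly int
}

// ForgetArgs builds the argument list for `restic forget`.
// Policies set to zero are left out of the command line.
func ForgetArgs(repo string, r Retention) []string {
	args := []string{"-r", repo, "forget", "--prune"} // --prune is an assumption
	add := func(flag string, n int) {
		if n > 0 {
			args = append(args, flag, strconv.Itoa(n))
		}
	}
	add("--keep-daily", r.KeepDaily)
	add("--keep-weekly", r.KeepWeekly)
	add("--keep-monthly", r.KeepMonthly)
	return args
}

func main() {
	r := Retention{KeepDaily: 7, KeepWeekly: 4, KeepMonthly: 6}
	fmt.Println(ForgetArgs("/srv/backups/restic-repo", r))
}
```

The resulting slice can be passed to `exec.Command("restic", args...)`.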

Individual values can be overridden with environment variables, for example `FELHOM_LOGGING_LEVEL=debug` or `FELHOM_HUB_ENABLED=false`.

### Runtime settings (`settings.json`)

Managed automatically by the controller. Contains password hash overrides, notification preferences, per-app backup configs, the storage path registry, and the DB validation cache. All writes are atomic.

### Per-app config (`app.yaml`)

Generated automatically during deployment. Contains the env vars, the list of locked fields, and the deploy timestamp. Secret fields are locked (read-only after the first deploy).

---

## Scheduler Jobs

| Job | Type | When | Purpose |
|-----|------|------|---------|
| status-refresh | periodic | 30s | Refresh container states |
| stack-scan | periodic | 2m | Rescan stacks directory |
| heartbeat | periodic | 5m | Ping Healthchecks "I'm alive" |
| system-health | periodic | configurable | Health checks + alert refresh |
| backup-cache | periodic | 5m | Refresh backup status cache |
| hub-report | periodic | 15m | Push report to central hub |
| db-dump | daily | 02:30 | Database dumps |
| backup | daily | 03:00 | Restic backup → cross-drive chain |
| backup-integrity | daily | Sun 04:00 | Restic check |
| metrics-prune | daily | 04:00 | Delete metrics older than 30 days |

All daily jobs run in the Europe/Budapest timezone. A skip-if-running guard prevents concurrent execution, and every job recovers from panics.
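
The skip-if-running guard and the panic recovery can be sketched as a thin wrapper around each job. This is a hedged illustration of the idea, not the controller's scheduler: the `Job` type and its `Trigger` method are hypothetical names.

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// Job wraps a function with a skip-if-running guard and panic recovery.
type Job struct {
	Name    string
	Run     func()
	running atomic.Bool
}

// Trigger runs the job unless a previous run is still in progress.
// It reports whether the job was started; a panic inside Run is logged
// and does not propagate to the caller.
func (j *Job) Trigger() (started bool) {
	if !j.running.CompareAndSwap(false, true) {
		log.Printf("job %s: previous run still active, skipping", j.Name)
		return false
	}
	started = true
	defer j.running.Store(false) // always release the guard
	defer func() {
		if r := recover(); r != nil {
			log.Printf("job %s: recovered from panic: %v", j.Name, r)
		}
	}()
	j.Run()
	return
}

func main() {
	job := &Job{Name: "backup-cache", Run: func() { panic("boom") }}
	// The first run panics and is recovered; the guard is released,
	// so the second trigger starts again instead of being skipped.
	fmt.Println(job.Trigger(), job.Trigger())
}
```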

---

## REST API

### Stack Operations

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check (no auth) |
| GET | `/api/stacks` | List all stacks |
| GET | `/api/stacks/{name}` | Stack details |
| POST | `/api/stacks/{name}/deploy` | First-time deploy |
| POST | `/api/stacks/{name}/start` | Start stack |
| POST | `/api/stacks/{name}/stop` | Stop stack |
| POST | `/api/stacks/{name}/restart` | Restart stack |
| POST | `/api/stacks/{name}/update` | Pull + recreate |
| POST | `/api/stacks/{name}/optional-config` | Update optional env vars |
| GET | `/api/stacks/{name}/logs` | Container logs (`?raw=1` for plain text) |
| GET | `/api/stacks/{name}/hdd-data` | HDD data paths + sizes |
| DELETE | `/api/stacks/{name}` | Delete stack |
| POST | `/api/sync` | Trigger catalog sync |
| GET | `/api/system/info` | System info + sync status |

### Backup & Restore

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/backup/status` | Full backup status |
| POST | `/api/backup/run` | Trigger manual backup |
| GET | `/api/backup/snapshots` | List snapshots (`?stack={name}` for filtering) |
| POST | `/api/stacks/{name}/cross-backup` | Save cross-drive config |
| POST | `/api/stacks/{name}/cross-backup/run` | Trigger cross-drive backup |
| GET | `/api/stacks/{name}/cross-backup/status` | Cross-drive status |
| POST | `/api/backup/cross-drive/run-all` | Run all scheduled cross-drive backups |

### Storage

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/storage/scan` | Scan available disks |
| POST | `/api/storage/init` | Format and mount a disk |
| GET | `/api/storage/init/status` | Format progress |
| POST | `/api/storage/migrate` | Start app data migration |
| GET | `/api/storage/migrate/status` | Migration progress |

### Metrics

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/metrics/system` | System metrics time-series (`?range=` one of `1h`, `6h`, `24h`, `7d`, `30d`) |
| GET | `/api/metrics/containers/summary` | Current container stats |
| GET | `/api/metrics/containers/{name}` | Per-container time-series |
| GET | `/api/metrics/sysinfo` | Static system info |

Response format: `{"ok": true/false, "data": ..., "error": "...", "message": "..."}`
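
A client can decode this envelope once and unmarshal `data` separately per endpoint. The sketch below assumes the envelope applies to the JSON endpoints listed above. The `Envelope` type and `Decode` helper are hypothetical names, and the sample payload in `main` is invented for illustration, since the shape of `data` is not specified here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Envelope matches the documented response format.
type Envelope struct {
	OK      bool            `json:"ok"`
	Data    json.RawMessage `json:"data,omitempty"`
	Error   string          `json:"error,omitempty"`
	Message string          `json:"message,omitempty"`
}

// Decode parses body, converts ok=false into a Go error, and
// unmarshals the data field into out when out is non-nil.
func Decode(body []byte, out any) error {
	var env Envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return fmt.Errorf("decode envelope: %w", err)
	}
	if !env.OK {
		if env.Error == "" {
			return errors.New("request failed")
		}
		return errors.New(env.Error)
	}
	if out == nil || len(env.Data) == 0 {
		return nil
	}
	return json.Unmarshal(env.Data, out)
}

func main() {
	// Hypothetical payload; real field names inside data may differ.
	body := []byte(`{"ok": true, "data": {"status": "healthy"}}`)
	var data map[string]string
	if err := Decode(body, &data); err != nil {
		panic(err)
	}
	fmt.Println(data["status"])
}
```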

---

## Build & Deploy

### Build

```bash
# On build server (192.168.0.180)
cd ~/build/felhom-controller
git -C ~/git/deploy-felhom-compose pull
./build.sh v0.12.2 --push
```

### Deploy on customer node

```bash
# On customer node (e.g., 192.168.0.162)
cd /opt/docker/felhom-controller
sudo docker pull gitea.dooplex.hu/admin/felhom-controller:v0.12.2
sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:v0.12.2|' docker-compose.yml
sudo docker compose up -d
```

**Important:** Always use `docker compose up -d`, not `docker compose restart`. A restart reuses the existing container and does not pick up the new image.

### Docker Requirements

The controller container needs:

- `privileged: true` (disk operations)
- Docker socket mount (`/var/run/docker.sock`)
- `/mnt` mount with `propagation: rshared` (container mounts visible to host)
- `/dev` mounted as `/host-dev` (block device access)
- `/etc/fstab` mounted as `/host-fstab` (persistent mount config)

See `docker-compose.yml` for the full volume configuration.

---

## Roadmap

### Completed

- [x] Stack management with deploy flow and memory validation
- [x] Git-based app catalog sync
- [x] Central job scheduler
- [x] System monitoring with SQLite metrics and Chart.js charts
- [x] Healthchecks.io integration (5 ping types)
- [x] 3-layer backup system (DB dumps + restic + cross-drive)
- [x] Per-app backup restore with auto stop/restart
- [x] Storage management (scan, format, mount, registry)
- [x] App data migration between storage paths
- [x] Central hub reporting
- [x] Email notifications via hub relay
- [x] Settings persistence and password management
- [x] Dashboard alert system

### In Progress / Planned

- [ ] Update classification and auto-apply (optional/required/security markers)
- [ ] Self-update mechanism with health-based rollback
- [ ] Docker volume backup (`/var/lib/docker/volumes:ro`)
- [ ] Raspberry Pi testing (pi-customer-1)
- [ ] Cross-drive restic pruning (unbounded snapshot growth)
- [ ] CSRF protection on POST endpoints
- [ ] Login rate limiting

---

## Test Environments

| Node | Hardware | Domain | Status |
|------|----------|--------|--------|
| demo-felhom | Acemagic GK3PLUS N100, 16 GB RAM, 512 GB SSD + 1 TB HDD | demo-felhom.eu | Controller v0.12.2 running |
| pi-customer-1 | Raspberry Pi 3B+, 1 GB RAM, 32 GB SD | pi-customer-1.local | Not yet tested |

## Related Repositories

| Repository | Purpose |
|------------|---------|
| [deploy-felhom-compose](https://gitea.dooplex.hu/admin/deploy-felhom-compose) | This repo: controller + deploy scripts |
| [app-catalog-felhom.eu](https://gitea.dooplex.hu/admin/app-catalog-felhom.eu) | Docker Compose templates + .felhom.yml metadata |
| [felhom.eu](https://gitea.dooplex.hu/admin/felhom.eu) | Website + app assets + felhom-hub service |