# felhom-controller
**Central management container for Felhom home servers.**
A single, lightweight Go container that replaces Portainer + scattered systemd scripts with a unified, Hungarian-language web dashboard for managing Docker Compose stacks, backups, storage, monitoring, and notifications on customer hardware.
**Current version: v0.12.7**
---
## Table of Contents
- [Architecture](#architecture)
- [Features](#features)
- [App Management](#1-app-management)
- [Backup System](#2-backup-system)
- [Storage Management](#3-storage-management)
- [Monitoring & Health](#4-monitoring--health)
- [Notifications](#5-notifications)
- [Update Management](#6-update-management)
- [Authentication & Settings](#7-authentication--settings)
- [Central Hub](#8-central-hub-reporting)
- [Repository Layout](#repository-layout)
- [Configuration](#configuration)
- [REST API](#rest-api)
- [Build & Deploy](#build--deploy)
- [Roadmap](#roadmap)
---
## Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│  Customer Hardware (N100 mini PC / Raspberry Pi)                 │
│                                                                  │
│  ┌───────────┐   ┌────────────────────────────────────────────┐  │
│  │  Traefik  │   │  felhom-controller (privileged container)  │  │
│  │ (reverse  │──▶│                                            │  │
│  │  proxy)   │   │ ┌──────────┐  ┌─────────────────────────┐  │  │
│  └───────────┘   │ │  Web UI  │  │ Stack Manager           │  │  │
│                  │ │ (HU dash │  │ (compose ops, git sync, │  │  │
│  ┌───────────┐   │ │  board)  │  │ deploy, delete, update) │  │  │
│  │cloudflared│   │ └──────────┘  └─────────────────────────┘  │  │
│  │ (tunnel)  │   │ ┌──────────┐  ┌─────────────────────────┐  │  │
│  └───────────┘   │ │  Backup  │  │ Storage Manager         │  │  │
│                  │ │ (3-layer │  │ (disk scan, format,     │  │  │
│  ┌───────────┐   │ │ restic)  │  │ mount, migrate)         │  │  │
│  │    App    │   │ └──────────┘  └─────────────────────────┘  │  │
│  │  stacks   │   │ ┌──────────┐  ┌─────────────────────────┐  │  │
│  │  (docker  │   │ │Scheduler │  │ Monitor & Metrics       │  │  │
│  │ compose)  │   │ │(cron-like│  │ (health, pings, SQLite  │  │  │
│  └───────────┘   │ │  jobs)   │  │ time-series, Chart.js)  │  │  │
│                  │ └──────────┘  └─────────────────────────┘  │  │
│                  │ ┌──────────┐  ┌─────────────────────────┐  │  │
│                  │ │  Notify  │  │ REST API + Hub Reporter │  │  │
│                  │ │ (email)  │  │ (JSON push to hub)      │  │  │
│                  │ └──────────┘  └─────────────────────────┘  │  │
│                  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
         │ pings                │ JSON push            │ git pull
         ▼                      ▼                      ▼
 status.felhom.eu         hub.felhom.eu        gitea.dooplex.hu
  (Healthchecks)       (central dashboard)    (stack definitions)
```
### Key Architecture Decisions
- **Pure Go, no frameworks** — stdlib `net/http` + `html/template`. Only external deps: `bcrypt`, `yaml.v3`, `modernc.org/sqlite` (pure Go, no CGO).
- **Privileged container** — Required for disk operations (format, mount, fstab), `/dev` access, and Docker socket control.
- **`/host-dev` indirection** — Docker overrides `/dev` with a tmpfs. The host's `/dev` is mounted at `/host-dev` to access block devices.
- **`StackDataProvider` interface** — Breaks circular import between backup and stacks packages. Implemented by `stackAdapter` in `main.go`.
- **Atomic file writes** — All persistent state (`settings.json`, `app.yaml`) written to `.tmp` then `os.Rename` for crash safety.
- **`go:embed` templates** — All HTML/CSS/JS compiled into the binary. No runtime file dependencies.
- **Europe/Budapest timezone** — All scheduled jobs, timestamps, and UI labels use Hungarian timezone.
### Module Map
| Module | Path | Responsibility |
|--------|------|----------------|
| **Config** | `internal/config/` | YAML loader, validation, `FELHOM_*` env overrides |
| **Settings** | `internal/settings/` | Runtime-mutable `settings.json` (passwords, backup prefs, storage paths, notifications) |
| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow |
| **Sync** | `internal/sync/` | Git-based app catalog sync (clone/pull, content-hash copy) |
| **Backup** | `internal/backup/` | 3-layer backup: DB dumps → restic snapshots → cross-drive copies, restore |
| **Storage** | `internal/storage/` | Disk scanning (`lsblk`), partitioning (`sfdisk`), formatting (`mkfs.ext4`), mounting, data migration (`rsync`) |
| **System** | `internal/system/` | System info (`/proc`), CPU collector, mount points, disk usage, FS info |
| **Monitor** | `internal/monitor/` | Healthchecks.io pinger, system health checks |
| **Metrics** | `internal/metrics/` | SQLite time-series store, system + container metric collection |
| **Scheduler** | `internal/scheduler/` | Central job scheduler (periodic + daily, skip-if-running, panic recovery) |
| **Notify** | `internal/notify/` | Email notifications via hub relay, preference sync, per-event cooldowns |
| **Report** | `internal/report/` | Hub report builder + HTTP pusher (system, stacks, backup, health) |
| **API** | `internal/api/` | REST JSON endpoints |
| **Web** | `internal/web/` | Hungarian dashboard, auth, page handlers, template functions, alerts |
---
## Features
### 1. App Management
The controller manages Docker Compose stacks through a complete lifecycle: catalog sync, first-time deployment, runtime operations, and deletion.
#### Git Sync (`internal/sync/`)
The app catalog lives in a separate Git repository. The controller:
- Shallow-clones the catalog on startup
- Periodically fetches updates (configurable, default 15 min)
- Copies only `docker-compose.yml` and `.felhom.yml` to the stacks directory
- **Never overwrites** `app.yaml` or `.env` (user secrets are safe)
- Uses SHA-256 content hashing — only writes files that actually changed
- Triggers stack rescan after sync so the dashboard updates immediately
- Manual sync via "Sablonok frissitese" button or `POST /api/sync`
#### First-Time Deploy Flow
1. Customer sees app card with "Telepites" button
2. Deploy page shows auto-filled fields (domain), auto-generated secrets (DB passwords, hex keys), and user-configurable inputs (admin password, language, storage path)
3. `checkBeforeDeploy()` JS guard fetches live state first (prevents double-deploy from another tab)
4. **Memory validation** checks `mem_request` against available RAM:
- `usable_memory = total_ram - reserved_memory_mb` (default 384MB reserved)
- Hard block if requests exceed usable memory
- Soft warning if limits exceed total RAM (overcommit OK)
5. Controller generates secrets, saves `app.yaml`, sets in-memory `Deployed` flag **before** `docker compose up -d` (avoids stale UI during slow image pulls), reverts on failure
6. 3-step progress panel polls `GET /api/stacks/{name}` every 3s: config saved → containers starting → health check passed
7. Post-deploy: locked fields (DB_PASSWORD, etc.) become read-only
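
The memory validation in step 4 can be expressed as a small pure function. This is a sketch under assumptions: the type and function names are hypothetical, and `requestedMB` and `limitMB` are taken to be the totals (already deployed apps plus the new one), which the text above does not spell out.

```go
package main

import "fmt"

// memoryVerdict is a hypothetical result type for the deploy-time check.
type memoryVerdict struct {
	UsableMB int  // total RAM minus the reserved amount
	Blocked  bool // hard block: requests exceed usable memory
	Warning  bool // soft warning: limits exceed total RAM (overcommit)
}

// checkMemory applies the two rules: requests are compared with usable
// memory, limits are compared with total RAM.
func checkMemory(totalMB, reservedMB, requestedMB, limitMB int) memoryVerdict {
	usable := totalMB - reservedMB
	return memoryVerdict{
		UsableMB: usable,
		Blocked:  requestedMB > usable,
		Warning:  limitMB > totalMB,
	}
}

func main() {
	// A 1 GB Raspberry Pi with the default 384 MB reservation.
	fmt.Printf("%+v\n", checkMemory(1024, 384, 700, 900))
	// A 16 GB mini PC where limits overcommit but requests fit.
	fmt.Printf("%+v\n", checkMemory(16384, 384, 2048, 20000))
}
```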
#### App Info Pages
Each app can define rich metadata in `.felhom.yml`:
- `app_info`: tagline, use_cases, first_steps, prerequisites, default_creds, docs_url
- `optional_config`: groups of post-deploy configurable env vars (e.g., API keys for metadata providers)
- `resources`: mem_request, mem_limit, pi_compatible, needs_hdd
The `/apps/{slug}` page renders hero section, screenshots, setup guide, and optional config form.
#### Stack Operations
| Operation | What it does |
|-----------|-------------|
| Start | `docker compose up -d` |
| Stop | `docker compose stop` (blocked for protected stacks) |
| Restart | `docker compose restart` |
| Update | `docker compose pull` + `docker compose up -d` |
| Delete | `docker compose down --rmi local --volumes` + optional HDD data cleanup |
**Protected stacks** (traefik, cloudflared, felhom-controller) cannot be stopped or deleted from the UI. Restart is allowed.
**Orphan detection**: Deployed stacks with no matching catalog template are marked as orphaned with an "Elavult" badge and can be safely deleted.
#### Container State Display
| State | Color | Label | Meaning |
|-------|-------|-------|---------|
| Running + healthy | Green | "Fut" | All containers running and healthy |
| Running + starting | Orange | "Indulas..." | Healthcheck not yet passed |
| Running + unhealthy | Yellow | "Nem egeszseges" | Healthcheck failing |
| Stopped/exited | Red | "Leallitva" | All containers stopped |
| Restarting | Yellow | "Ujrainditas..." | Restart loop |
| Not deployed | Gray | "Nincs telepitve" | Compose file exists, not deployed |
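
The table maps naturally onto a single switch. The sketch below is illustrative: `stateBadge` and its string inputs are hypothetical, and treating a running container without a healthcheck as green is an assumption of this example.

```go
package main

import "fmt"

// stateBadge (hypothetical) returns the badge color and Hungarian label
// for a stack, following the state table above.
func stateBadge(deployed bool, state, health string) (color, label string) {
	switch {
	case !deployed:
		return "gray", "Nincs telepitve"
	case state == "restarting":
		return "yellow", "Ujrainditas..."
	case state == "running" && health == "starting":
		return "orange", "Indulas..."
	case state == "running" && health == "unhealthy":
		return "yellow", "Nem egeszseges"
	case state == "running":
		// Assumption: running with no healthcheck counts as healthy.
		return "green", "Fut"
	default:
		return "red", "Leallitva"
	}
}

func main() {
	color, label := stateBadge(true, "running", "starting")
	fmt.Println(color, label)
}
```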
---
### 2. Backup System
The backup system implements a **3-2-1 backup architecture**:
| Rule | What | Where | Status |
|------|------|-------|--------|
| **1. Nightly backup** | DB dumps + config + ALL user data | Same drive as app | Mandatory, automatic |
| **2. Cross-drive backup** | User data copy to secondary drive | Different physical device | Opt-in per app |
| **3. Remote backup** | Offsite copy for disaster recovery | Cloud / remote server | Future |
**Key principle:** User data backup is **mandatory** — every app with HDD bind mounts
is included in the nightly restic snapshot automatically. There is no per-app toggle.
The `AppBackupPrefs.Enabled` field in settings.json is legacy and not read by any code.
#### Rule 1: Nightly Backup (mandatory, same drive)
The nightly backup has two phases that run sequentially:
**Phase 1 — Database Dumps** (`internal/backup/dbdump.go`, scheduled 02:30)
- **Auto-discovery** of PostgreSQL and MariaDB containers via `docker ps` + `docker inspect`
- Dumps via `docker exec pg_dump` / `docker exec mariadb-dump` with 5-minute timeout
- Atomic writes (`.tmp` → `.sql`) to prevent corruption
- **Validation** after each dump: checks file size, header presence, counts `CREATE TABLE`
- Results cached in `settings.json` surviving container restarts
**Phase 2 — Restic Snapshot** (`internal/backup/restic.go`, scheduled 03:00)
- Auto-generated repository password (32 random bytes, base64url), synced to hub
- **Paths included in every snapshot:**
- Stacks dir (all compose.yml + app.yaml + .felhom.yml)
- DB dump dir (all `.sql` dump files from Phase 1)
- `controller.yaml` (controller config)
- **ALL deployed apps' HDD mount paths** — discovered via `resolveAppBackupPaths()` which iterates `ListDeployedStacks()`, no `Enabled` flag
- Auto-detects and unlocks stale locks (restic repo lock)
- Weekly prune on Sundays with configurable retention (keep-daily, keep-weekly, keep-monthly)
- Weekly integrity check (`restic check`) on Sunday 04:00
**What this protects against:** accidental deletion, data corruption, point-in-time rollback.
Does NOT protect against drive failure (backup is on the same physical drive).
#### Rule 2: Cross-Drive Backup (opt-in, different device) (`internal/backup/crossdrive.go`)
Copies user data to a **different physical drive**, providing the second copy for 3-2-1.
- **Two methods:**
- **rsync** — Simple mirror with `--delete` (fast, no versioning)
- **restic** — Versioned, deduplicated, encrypted (shared repo across apps, auto-generated password)
- Per-app configuration in settings.json: destination path, method, schedule (daily/weekly/manual)
- **Pre-backup DB dump:** `DumpStackDB()` runs fresh pg_dump/mariadb-dump before each cross-drive backup to ensure DB state matches user data; non-fatal on failure (wired via `DBDumper` interface to avoid circular imports)
- **Drive-type-aware validation** (`ValidateDestination`):
| Destination type | Space checks |
|-----------------|--------------|
| External mount (different device than `/`) | Block if <100 MB free |
| System drive (same device as `/`) | Require ≥10 GB free AND <90% used; logged warning |
- **Rsync destination layout:**
- Single mount: `backups/rsync/<app>/` (flat, no extra nesting)
- Multiple mounts: `backups/rsync/<app>/<leaf>/` per mount; duplicate leaf names get `_N` suffix
- DB dump files excluded (`--exclude backups/*.sql.gz/sql/dump`) — already handled by pg_dump
- Safety guards: destination ≠ source, path-overlap check, writable check
- **Chained execution:** runs immediately after nightly restic — daily apps every night, weekly apps on Sundays
- Per-app concurrency lock prevents overlapping runs
- Status (last_run, duration, size, error) persisted to settings.json
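
The drive-type-aware space rules can be sketched as one function. This is not the actual `ValidateDestination` code: the function name is hypothetical, and MB/GB are interpreted as binary units (MiB/GiB) for the sake of the example.

```go
package main

import "fmt"

const (
	minExternalFree  uint64 = 100 << 20 // 100 MiB (assumed binary units)
	minSystemFree    uint64 = 10 << 30  // 10 GiB
	maxSystemUsedPct        = 90.0
)

// checkDestinationSpace (hypothetical) applies the external-drive rule or
// the stricter system-drive rule, depending on where the destination lives.
func checkDestinationSpace(onSystemDrive bool, freeBytes, totalBytes uint64) error {
	if totalBytes == 0 || freeBytes > totalBytes {
		return fmt.Errorf("invalid filesystem stats")
	}
	if !onSystemDrive {
		if freeBytes < minExternalFree {
			return fmt.Errorf("external drive has less than 100 MB free")
		}
		return nil
	}
	usedPct := float64(totalBytes-freeBytes) / float64(totalBytes) * 100
	if freeBytes < minSystemFree || usedPct >= maxSystemUsedPct {
		return fmt.Errorf("system drive needs >=10 GB free and <90%% used (%.0f%% used)", usedPct)
	}
	return nil
}

func main() {
	fmt.Println(checkDestinationSpace(false, 50<<20, 1<<40))  // external, nearly full
	fmt.Println(checkDestinationSpace(true, 100<<30, 500<<30)) // system drive, 80% used
}
```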
**What this protects against:** primary drive failure, drive theft/damage.
#### Rule 3: Remote Backup (future)
Offsite backup for disaster recovery. Not yet implemented.
#### Restore (`internal/backup/restore.go`)
All deployed apps appear in the restore dropdown — every app has restic snapshot data
(stacks dir + DB dumps are always backed up).
| App type | Config restored | DB restored | User data restored |
|----------|----------------|------------|-------------------|
| Has HDD data | ✓ | ✓ | ✓ (always — backup is mandatory) |
| DB only, no HDD | ✓ | ✓ | n/a |
| No DB, no HDD | ✓ | — | n/a |
- **Snapshot API** returns ALL snapshots unfiltered — older snapshots (pre-mandatory HDD backup) still allow config+DB restore; `RestoreApp` extracts whatever paths are available
- **Restore type info** shown per-app when selected in dropdown (Hungarian banners):
- Has HDD: "Teljes visszaállítás: adatbázis + konfiguráció + felhasználói adatok"
- Has DB, no HDD: "Adatbázis és konfiguráció visszaállítása"
- No DB, no HDD: "Csak konfiguráció visszaállítása"
- **Execution flow:** stop app → `restic restore <id> --target / --include <path>...` → restart app
- Running flag prevents concurrent backup/restore operations
- Snapshot ID validated (8 to 64 lowercase hex characters)
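
A snapshot ID check of this kind is typically a single regular expression. The sketch below assumes that anything between restic's 8-character short ID and the full 64-character ID is accepted; `validSnapshotID` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// snapshotIDRe accepts 8 to 64 lowercase hexadecimal characters.
var snapshotIDRe = regexp.MustCompile(`^[0-9a-f]{8,64}$`)

// validSnapshotID (hypothetical) guards the ID before it is passed to
// `restic restore` as a command-line argument.
func validSnapshotID(id string) bool {
	return snapshotIDRe.MatchString(id)
}

func main() {
	for _, id := range []string{"4bba301e", "4BBA301E", "latest; rm -rf /"} {
		fmt.Printf("%q valid: %v\n", id, validSnapshotID(id))
	}
}
```

Validating the ID before building the command line is what keeps user-supplied input from being interpreted as extra restic flags.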
#### Backup Page UI (`internal/web/templates/backups.html`)
Unified per-app status table with expandable rows showing 3 backup layers per app:
**Status dot per app:**
| Dot color | Meaning |
|-----------|---------|
| Green | Fully covered — cross-drive configured and last run OK |
| Yellow | Warning — no second copy, or last backup failed, or disk space issue |
| Red | Cross-drive destination blocked or inaccessible |
| Gray (auto) | No user data — only config/DB backup (automatic) |
**Three backup layers per app row:**
1. **Adatbázis mentés** — Auto badge + last run timestamp + status
2. **Konfiguráció** — Auto badge + last restic snapshot timestamp + status
3. **Felhasználói adatok** — one of:
- Cross-drive configured: method + destination + schedule + last run + status + "Futtatás most" button
- HDD data, no cross-drive: "✓ Helyi mentés auto" (green) + "⚠ Nincs 2. másolat" (yellow) + settings link
- No HDD data: "— (nincs HDD adat)" (muted)
**Other sections:**
- Schedule overview with next run times for DB dump, restic, prune
- Snapshot history table (last 20 snapshots with ID, time, files new/changed, data added)
- Repository info card (path, size, snapshot count, encryption key with show/copy)
- Restore section: app dropdown → snapshot dropdown → restore type info → confirmation checkbox → execute
---
### 3. Storage Management
The storage subsystem handles the full lifecycle of external storage: detection, initialization, path registration, and data migration.
#### Disk Scanning (`internal/storage/scan.go`)
- `ScanDisks()` uses `lsblk -J -b` for block device enumeration
- System disk detection via host fstab parsing (`/host-fstab`) + UUID resolution via `blkid`
- Partitions enriched with filesystem type, UUID, and label from direct `blkid` probing (Docker containers have incomplete udev cache)
- Returns `AvailableDisks` (non-system, non-loop, non-CDROM) and `SystemDisks` separately
- Handles NVMe (`nvme0n1p1`), SCSI (`sdb1`), and eMMC (`mmcblk0p1`) naming
#### Disk Initialization Wizard (`internal/storage/format.go`)
A step-by-step UI at `/settings/storage/init`:
1. **Scan** — Lists available disks with model, size, partition info
2. **Select** — User picks a disk and enters a mount name (e.g., `hdd_1`)
3. **Confirm** — User types "FORMAZAS" to confirm destructive operation
4. **Format pipeline**: `wipefs` → `sfdisk` (GPT) → `mkfs.ext4` → `blkid` UUID → backup fstab → append UUID-based fstab entry → mount → `findmnt` verification → `chown 1000:1000` → create `storage/` and `Dokumentumok/` subdirectories
5. Auto-registers new storage path in settings.json
6. Smart partition detection: skips repartitioning for existing empty partitions
Safety guards: system disk detection, mount path conflict check, confirmation required, progress channel for real-time UI feedback.
#### Storage Path Registry (`internal/settings/settings.go`)
Multiple external storage paths supported with:
- **Label**: Human-readable name (editable inline)
- **Default flag**: New deploys use this path by default
- **Schedulable flag**: Path appears in deploy dropdown
- **Auto-discovery**: On startup, scans deployed apps' `HDD_PATH` values and registers unknown paths
- Thread-safe CRUD: Add, Remove, SetDefault, SetSchedulable, SetLabel
#### Data Migration (`internal/storage/migrate.go`)
Move app data between storage paths (e.g., SSD → HDD, HDD → new HDD):
1. Validate: stack exists, deployed, has HDD data, target differs from source
2. Estimate total size, check free space on target
3. Stop the application
4. `rsync -a --info=progress2` per mount path with real-time progress parsing
5. Update `app.yaml` HDD_PATH to new location
6. Start the application
7. **Rollback on failure**: reverts config, restarts on old storage
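
Progress parsing for step 4 could look roughly like the following. It assumes the usual `rsync --info=progress2` line shape (byte count with thousands separators, then a percentage); the function name and the exact sample line are illustrative, not taken from the controller.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// progressRe matches the leading "<bytes> <percent>%" part of a progress line.
var progressRe = regexp.MustCompile(`^\s*([0-9,]+)\s+([0-9]{1,3})%`)

// parseProgress (hypothetical) extracts transferred bytes and percentage.
// ok is false for lines that are not progress updates.
func parseProgress(line string) (transferred int64, percent int, ok bool) {
	m := progressRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	n, err := strconv.ParseInt(strings.ReplaceAll(m[1], ",", ""), 10, 64)
	if err != nil {
		return 0, 0, false
	}
	p, err := strconv.Atoi(m[2])
	if err != nil {
		return 0, 0, false
	}
	return n, p, true
}

func main() {
	line := "    1,238,099  42%  146.38kB/s    0:00:08 (xfr#5, to-chk=169/396)"
	n, p, ok := parseProgress(line)
	fmt.Println(n, p, ok)
}
```

rsync rewrites the progress line in place using carriage returns, so a reader of its stdout would need to split on `\r` as well as `\n` before calling a parser like this.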
Progress UI at `/stacks/{name}/migrate` with byte counter and percentage.
#### Stale Data Cleanup
After migration, the deploy page detects leftover data on previous storage paths:
- Shows path, size, and a delete button
- Two-step confirmation required
- Protected paths (storage root, media, Dokumentumok, appdata) cannot be deleted
#### FileBrowser Mount Sync
When storage paths are added or removed, `syncFileBrowserMounts()` auto-regenerates FileBrowser's `docker-compose.yml` with volume mounts for all registered paths, then recreates the container.
---
### 4. Monitoring & Health
#### System Health Checks (`internal/monitor/healthcheck.go`)
`RunHealthCheck()` evaluates multiple subsystems and returns a `HealthReport` with status (`ok`/`warn`/`fail`):
| Check | Warning | Critical |
|-------|---------|----------|
| Disk usage (SSD/HDD) | >= 90% | >= 95% |
| Memory | available < 512 MB | available < 256 MB |
| CPU temperature | >= 75 °C | >= 85 °C |
| Docker daemon | — | unreachable |
| Protected containers | — | not running |
| Storage paths | not a mount point (data on SSD) | path inaccessible, disk >= 95% |
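
The numeric rows of the table reduce to two threshold comparisons: one where higher values are worse (disk usage, temperature) and one where lower values are worse (available memory). A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// level is a hypothetical stand-in for the ok/warn/fail status.
type level int

const (
	ok level = iota
	warn
	fail
)

func (l level) String() string { return [...]string{"ok", "warn", "fail"}[l] }

// classifyAtLeast is for metrics where higher is worse (disk %, CPU °C).
func classifyAtLeast(value, warnAt, failAt float64) level {
	switch {
	case value >= failAt:
		return fail
	case value >= warnAt:
		return warn
	}
	return ok
}

// classifyBelow is for metrics where lower is worse (available memory).
func classifyBelow(value, warnBelow, failBelow float64) level {
	switch {
	case value < failBelow:
		return fail
	case value < warnBelow:
		return warn
	}
	return ok
}

func main() {
	fmt.Println("disk 92%:", classifyAtLeast(92, 90, 95))
	fmt.Println("mem 300 MB:", classifyBelow(300, 512, 256))
	fmt.Println("cpu 60 C:", classifyAtLeast(60, 75, 85))
}
```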
Backup destination validation (`CheckBackupDestination`) has tiered checks:
- Path doesn't exist → critical/blocked
- Not writable → critical/blocked
- Same block device as root → warning (data on system drive)
- Disk >95% full → critical/blocked
- Disk >90% full → warning
#### Healthchecks.io Integration (`internal/monitor/pinger.go`)
Five ping UUIDs for external monitoring:
- **Heartbeat**: every 5 min (simple "I'm alive")
- **System Health**: periodic health check results
- **DB Dump**: after nightly database dumps
- **Backup**: after nightly restic backup
- **Backup Integrity**: weekly `restic check` result
3-attempt retry with 2-second backoff. Pinger never fails the caller.
#### Metrics Store (`internal/metrics/`)
- **SQLite with WAL mode** for concurrent reads during collection
- **System metrics**: CPU%, memory (total/used/available), temperature, load average — collected every 60 seconds
- **Container metrics**: CPU%, memory, network I/O, block I/O per container
- Downsampled queries for chart time ranges (1h, 6h, 24h, 7d, 30d)
- 30-day auto-prune via daily scheduler job
#### Monitoring Page
Full-page system monitor at `/monitoring`:
- **System Overview**: hostname, OS, kernel, CPU model/cores, uptime
- **System Metrics Charts**: 4 line charts (CPU, Memory, Temperature, Load) in 2x2 grid
- **Container Resources**: horizontal bar charts (CPU% and Memory per container)
- **Per-container Detail**: click-to-expand historical charts
- **Remote Monitoring Status**: shows Healthchecks ping UUID configuration
Chart.js 4.4.7 embedded locally (works in offline environments), dark theme matching site design.
#### Alert System (`internal/web/alerts.go`)
State-based alerts displayed on all pages:
- Sources: health issues, missing ping UUIDs, backup disabled
- Sorted by severity (error > warning > info), capped at 5 visible
- Refreshed every 5 min + on startup
- Monitoring page suppresses ping-related alerts (shown in dedicated table instead)
---
### 5. Notifications
#### Email Delivery
The controller relays notifications through the central hub, which sends emails via the Resend API:
1. Controller detects event (health degradation, backup failure, etc.)
2. Non-blocking POST to hub's `/api/v1/notify` with event details
3. Hub checks customer notification preferences
4. Hub sends Hungarian-language email via Resend
#### Event Types
| Event | Trigger |
|-------|---------|
| `disk_warning` | Disk usage crosses warning/critical threshold |
| `backup_failed` | Nightly backup or DB dump fails |
| `update_available` | New app version detected in catalog |
| `security_update` | Critical security update available |
#### Cooldown System
Per-event-type cooldown (default 6 hours, configurable) prevents notification spam. Only notifies on **status degradation** (ok→warn, ok→fail, warn→fail), not on repeated same-status checks.
#### Preference Sync
Notification preferences (email, enabled events, cooldown) are:
- Stored locally in `settings.json`
- Synced to hub on save and on controller startup
- Hub sync failure doesn't block local save
---
### 6. Update Management
#### App Catalog Sync
- Periodic `git fetch` + `git reset --hard` of the app catalog repo
- Content-hash comparison prevents unnecessary file writes
- Post-sync stack rescan detects new/changed apps immediately
#### Planned Update Classifications
| Marker | Behavior |
|--------|----------|
| No marker | Optional — shown on dashboard, customer clicks "Update" |
| `UPDATE_REQUIRED=true` | Mandatory — auto-applied during next update window |
| `UPDATE_SECURITY=true` | Critical — applied immediately |
---
### 7. Authentication & Settings
#### Session Auth (`internal/web/auth.go`)
- bcrypt password verification with configurable source priority: `settings.json` → `controller.yaml` → no auth (open access)
- 7-day session duration with random 32-byte hex tokens
- `?next=` redirect after login preserves the page the user was visiting
- Session cleanup every 15 minutes
- All sessions invalidated on password change
- Conditional logout link (hidden when auth is disabled)
#### Settings Persistence (`internal/settings/settings.go`)
Runtime-mutable settings in `settings.json` (separate from infrastructure config):
| Section | Contents |
|---------|----------|
| `password_hash` | bcrypt hash override |
| `notifications` | email, enabled events, cooldown hours |
| `db_validations` | per-DB dump validation results (survives restarts) |
| `app_backup` | per-app map: enabled flag, cross-drive config (method, dest, schedule, runtime status) |
| `storage_paths` | registered paths with label, default flag, schedulable flag |
| `cross_drive_restic_password` | auto-generated restic password for cross-drive repos |
All public methods use `sync.RWMutex`. File writes are atomic (`.tmp` + rename).
#### Settings Page (`/settings`)
Four sections:
1. **System config** — read-only display of `controller.yaml` values
2. **Password change** — current + new + confirm, min 8 chars
3. **Storage paths** — add/remove, edit labels, set default, toggle schedulable, per-path app list with sizes
4. **Notifications** — email, event checkboxes, cooldown hours, test email button
---
### 8. Central Hub Reporting
#### Report Push (`internal/report/`)
Periodic JSON push (default every 15 min) to the central felhom-hub service:
- System: hostname, OS, CPU, memory, disk usage, uptime
- Containers: running/stopped counts, per-container CPU/memory
- Backup: last run, success, repo stats, snapshot count, restic password (for disaster recovery)
- Health: current status, issues, warnings
- Stacks: deployed apps with versions and states
Bearer token authentication, 3-attempt retry with 5-second backoff.
#### Hub Dashboard
The hub service (separate Go app in the `felhom.eu` repo) provides:
- Multi-customer overview table with status indicators
- Customer detail page with system/storage/containers/backup/health sections
- Color coding: green (<30min), yellow (30-60min), red (>60min since last report)
- 90-day report retention with daily prune
---
## Repository Layout
```
controller/
├── cmd/controller/main.go # Entry point, wires all 14 modules
├── internal/
│ ├── config/config.go # YAML loader, validation, env overrides
│ ├── settings/settings.go # Runtime settings (JSON, atomic writes, RWMutex)
│ ├── stacks/
│ │ ├── manager.go # Stack scanning, compose ops, container status
│ │ ├── metadata.go # Parse .felhom.yml app metadata
│ │ ├── deploy.go # First-deploy: secret gen, app.yaml, compose up
│ │ └── delete.go # Stack deletion + HDD data cleanup
│ ├── sync/sync.go # Git sync: clone/pull app catalog, content-hash copy
│ ├── storage/
│ │ ├── scan.go, scan_linux.go # Disk detection via lsblk + blkid
│ │ ├── format.go, format_linux.go # Partition, format, mount pipeline
│ │ ├── safety.go, safety_linux.go # System disk detection, mount guards, fstab ops
│ │ ├── migrate.go # App data migration (rsync with progress)
│ │ └── *_other.go # Non-Linux stubs for cross-compilation
│ ├── backup/
│ │ ├── backup.go # Orchestrator (dumps + restic + cross-drive chain)
│ │ ├── dbdump.go # DB auto-discovery + dump (pg_dump, mariadb-dump)
│ │ ├── restic.go # Restic operations (init, snapshot, prune, check)
│ │ ├── appdata.go # StackDataProvider interface, app data discovery
│ │ ├── crossdrive.go # Per-app backup to secondary storage (rsync/restic)
│ │ └── restore.go # Per-app restore with auto stop/restart
│ ├── api/router.go # REST API endpoints (~30 routes)
│ ├── scheduler/scheduler.go # Central job scheduler (Every, Daily)
│ ├── system/
│ │ ├── info.go, info_linux.go # RAM, disk, CPU, temperature, load average
│ │ ├── cpu_linux.go # Background /proc/stat sampling
│ │ └── mounts_linux.go # Mount points, disk usage, FS info, backup dest checks
│ ├── monitor/
│ │ ├── pinger.go # Healthchecks.io HTTP ping client
│ │ └── healthcheck.go # System health checks (disk, mem, CPU, temp, Docker)
│ ├── metrics/
│ │ ├── store.go # SQLite time-series (WAL mode, downsampled queries)
│ │ ├── collector.go # Background collector (60s, system + docker stats)
│ │ └── sysinfo.go # Static system info (/proc, /etc)
│ ├── notify/notifier.go # Email relay to hub, preference sync, cooldowns
│ ├── report/
│ │ ├── builder.go # Hub report builder (all subsystems → JSON)
│ │ └── pusher.go # HTTP POST to hub (retry, Bearer auth)
│ └── web/
│ ├── server.go # HTTP server, routing, static files
│ ├── auth.go # Session auth, login/logout, session cleanup
│ ├── handlers.go # Page handlers (dashboard, stacks, deploy, backups, etc.)
│ ├── storage_handlers.go # Storage API handlers (scan, format, migrate, cleanup)
│ ├── alerts.go # State-based alert generation
│ ├── funcmap.go # Template functions (state colors, Hungarian formatting)
│ ├── embed.go # go:embed for templates + Chart.js
│ └── templates/ # 12 HTML files + style.css (Hungarian UI)
├── configs/
│ ├── controller.yaml.example # Full config reference
│ └── example-felhom-metadata.yml # .felhom.yml format reference
├── Dockerfile # Multi-stage: Go 1.24 builder + debian-slim runtime
├── docker-compose.yml # Controller's own compose (privileged, /mnt rshared)
└── go.mod # Go 1.24, deps: bcrypt, yaml.v3, modernc.org/sqlite
```
---
## Configuration
### Controller config (`controller.yaml`)
Single YAML file per customer, infrastructure-only. Does **not** contain app-specific config.
Key sections:
```yaml
customer:
name: "Demo Felhom"
node_id: "demo-felhom"
paths:
stacks_dir: "/opt/docker/stacks"
data_dir: "/opt/docker/felhom-controller/data"
db_dump_dir: "/srv/backups/db-dumps"
restic_repo: "/srv/backups/restic-repo"
git:
repo_url: "https://gitea.dooplex.hu/admin/app-catalog-felhom.eu.git"
sync_interval: "15m"
backup:
enabled: true
db_dump_time: "02:30"
restic_time: "03:00"
retention: { keep_daily: 7, keep_weekly: 4, keep_monthly: 6 }
monitoring:
health_interval: "5m"
ping_uuids:
heartbeat: "uuid-here"
system_health: "uuid-here"
db_dump: "uuid-here"
backup: "uuid-here"
backup_integrity: "uuid-here"
hub:
enabled: true
url: "https://hub.felhom.eu"
api_key: "bearer-token-here"
system:
reserved_memory_mb: 384 # RAM reserved for OS + controller
```
Environment variable overrides: `FELHOM_LOGGING_LEVEL=debug`, `FELHOM_HUB_ENABLED=false`, etc.
### Runtime settings (`settings.json`)
Auto-managed by the controller. Contains password hash overrides, notification preferences, per-app backup configs, storage path registry, DB validation cache. All writes are atomic.
### Per-app config (`app.yaml`)
Auto-generated during deployment. Contains env vars, locked fields list, deploy timestamp. Secret fields are locked (read-only after first deploy).
---
## Scheduler Jobs
| Job | Type | When | Purpose |
|-----|------|------|---------|
| status-refresh | periodic | 30s | Refresh container states |
| stack-scan | periodic | 2m | Rescan stacks directory |
| heartbeat | periodic | 5m | Ping Healthchecks "I'm alive" |
| system-health | periodic | configurable | Health checks + alert refresh |
| backup-cache | periodic | 5m | Refresh backup status cache |
| hub-report | periodic | 15m | Push report to central hub |
| db-dump | daily | 02:30 | Database dumps |
| backup | daily | 03:00 | Restic backup → cross-drive chain |
| backup-integrity | daily | Sun 04:00 | Restic check |
| metrics-prune | daily | 04:00 | Delete metrics older than 30 days |
All daily jobs use Europe/Budapest timezone. Skip-if-running prevents concurrent execution. Panic recovery in all jobs.
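
Skip-if-running and panic recovery can be combined in one job wrapper. The sketch below is illustrative only (the `job` type is hypothetical) and uses `atomic.Bool`, available since Go 1.19.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// job is a hypothetical scheduler entry.
type job struct {
	name    string
	running atomic.Bool
	fn      func()
}

// run executes the job unless a previous run is still in progress.
// A panic inside fn is converted into an error instead of crashing
// the scheduler goroutine.
func (j *job) run() (ran bool, err error) {
	if !j.running.CompareAndSwap(false, true) {
		return false, nil // skip: previous run still active
	}
	defer j.running.Store(false)
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", j.name, r)
		}
	}()
	ran = true
	j.fn()
	return ran, nil
}

func main() {
	j := &job{name: "backup", fn: func() { panic("disk vanished") }}
	ran, err := j.run()
	fmt.Println(ran, err)

	j.fn = func() {}
	ran, err = j.run() // the flag was reset, so the job runs again
	fmt.Println(ran, err)
}
```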
---
## REST API
### Stack Operations
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check (no auth) |
| GET | `/api/stacks` | List all stacks |
| GET | `/api/stacks/{name}` | Stack details |
| POST | `/api/stacks/{name}/deploy` | First-time deploy |
| POST | `/api/stacks/{name}/start` | Start stack |
| POST | `/api/stacks/{name}/stop` | Stop stack |
| POST | `/api/stacks/{name}/restart` | Restart stack |
| POST | `/api/stacks/{name}/update` | Pull + recreate |
| POST | `/api/stacks/{name}/optional-config` | Update optional env vars |
| GET | `/api/stacks/{name}/logs` | Container logs (`?raw=1` for plain text) |
| GET | `/api/stacks/{name}/hdd-data` | HDD data paths + sizes |
| DELETE | `/api/stacks/{name}` | Delete stack |
| POST | `/api/sync` | Trigger catalog sync |
| GET | `/api/system/info` | System info + sync status |
### Backup & Restore
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/backup/status` | Full backup status |
| POST | `/api/backup/run` | Trigger manual backup |
| GET | `/api/backup/snapshots` | List snapshots (`?stack={name}` for filtering) |
| POST | `/api/stacks/{name}/cross-backup` | Save cross-drive config |
| POST | `/api/stacks/{name}/cross-backup/run` | Trigger cross-drive backup |
| GET | `/api/stacks/{name}/cross-backup/status` | Cross-drive status |
| POST | `/api/backup/cross-drive/run-all` | Run all scheduled cross-drive backups |
### Storage
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/storage/scan` | Scan available disks |
| POST | `/api/storage/init` | Format and mount a disk |
| GET | `/api/storage/init/status` | Format progress |
| POST | `/api/storage/migrate` | Start app data migration |
| GET | `/api/storage/migrate/status` | Migration progress |
### Metrics
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/metrics/system` | System metrics time-series (`?range=1h\|6h\|24h\|7d\|30d`) |
| GET | `/api/metrics/containers/summary` | Current container stats |
| GET | `/api/metrics/containers/{name}` | Per-container time-series |
| GET | `/api/metrics/sysinfo` | Static system info |
Response format: `{"ok": true/false, "data": ..., "error": "...", "message": "..."}`
---
## Build & Deploy
### Build
```bash
# On build server (192.168.0.180)
cd ~/build/felhom-controller
git -C ~/git/deploy-felhom-compose pull
./build.sh v0.12.2 --push
```
### Deploy on customer node
```bash
# On customer node (e.g., 192.168.0.162)
cd /opt/docker/felhom-controller
sudo docker pull gitea.dooplex.hu/admin/felhom-controller:v0.12.2
sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:v0.12.2|' docker-compose.yml
sudo docker compose up -d
```
**Important:** Always use `docker compose up -d`, NOT `docker compose restart` — restart doesn't pick up new images.
### Docker Requirements
The controller container needs:
- `privileged: true` (disk operations)
- Docker socket mount (`/var/run/docker.sock`)
- `/mnt` mount with `propagation: rshared` (container mounts visible to host)
- `/dev` mounted as `/host-dev` (block device access)
- `/etc/fstab` mounted as `/host-fstab` (persistent mount config)
See `docker-compose.yml` for the full volume configuration.
---
## Roadmap
### Completed
- [x] Stack management with deploy flow and memory validation
- [x] Git-based app catalog sync
- [x] Central job scheduler
- [x] System monitoring with SQLite metrics and Chart.js charts
- [x] Healthchecks.io integration (5 ping types)
- [x] 3-layer backup system (DB dumps + restic + cross-drive)
- [x] Per-app backup restore with auto stop/restart
- [x] Storage management (scan, format, mount, registry)
- [x] App data migration between storage paths
- [x] Central hub reporting
- [x] Email notifications via hub relay
- [x] Settings persistence and password management
- [x] Dashboard alert system
### In Progress / Planned
- [ ] Update classification and auto-apply (optional/required/security markers)
- [ ] Self-update mechanism with health-based rollback
- [ ] Docker volume backup (`/var/lib/docker/volumes:ro`)
- [ ] Raspberry Pi testing (pi-customer-1)
- [ ] Cross-drive restic pruning (unbounded snapshot growth)
- [ ] CSRF protection on POST endpoints
- [ ] Login rate limiting
---
## Test Environments
| Node | Hardware | Domain | Status |
|------|----------|--------|--------|
| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | Controller v0.12.2 running |
| pi-customer-1 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | Not yet tested |
## Related Repositories
| Repository | Purpose |
|------------|---------|
| [deploy-felhom-compose](https://gitea.dooplex.hu/admin/deploy-felhom-compose) | This repo — controller + deploy scripts |
| [app-catalog-felhom.eu](https://gitea.dooplex.hu/admin/app-catalog-felhom.eu) | Docker Compose templates + .felhom.yml metadata |
| [felhom.eu](https://gitea.dooplex.hu/admin/felhom.eu) | Website + app assets + felhom-hub service |