admin 8aebbb8902 feat: Hub monitoring takeover — event push system + config cleanup (v0.21.0)
Replace external Healthchecks.io with Hub-native event system. Controller
now pushes structured events via POST /api/v1/event with typed detail
structs. Hub handles dead man's switch, notification dispatch, and cooldowns.

Phase 5: PushEvent() core method, 21 event types, expanded notification
settings (11 toggles), Hub connection monitoring on dashboard, alerts.
Phase 6: Deprecation log for ping UUIDs, pinger kept for transition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 18:53:21 +01:00

# felhom-controller
**Central management container for Felhom home servers.**
A single, lightweight Go container that replaces Portainer + scattered systemd scripts with a unified, Hungarian-language web dashboard for managing Docker Compose stacks, backups, storage, monitoring, and notifications on customer hardware.
**Current version: v0.21.0**
---
## Table of Contents
- [Architecture](#architecture)
- [Features](#features)
  - [App Management](#1-app-management)
  - [Backup System](#2-backup-system)
  - [Storage Management](#3-storage-management)
  - [Monitoring & Health](#4-monitoring--health)
  - [Notifications](#5-notifications)
  - [Update Management](#6-update-management)
  - [Authentication & Settings](#7-authentication--settings)
  - [Central Hub](#8-central-hub-reporting)
- [Repository Layout](#repository-layout)
- [Configuration](#configuration)
- [REST API](#rest-api)
- [Build & Deploy](#build--deploy)
- [Roadmap](#roadmap)
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│  Customer Hardware (N100 mini PC / Raspberry Pi)                │
│                                                                 │
│  ┌──────────┐   ┌────────────────────────────────────────────┐  │
│  │ Traefik  │   │ felhom-controller (privileged container)   │  │
│  │ (reverse │──▶│                                            │  │
│  │  proxy)  │   │  ┌──────────┐   ┌─────────────────────────┐│  │
│  └──────────┘   │  │  Web UI  │   │ Stack Manager           ││  │
│                 │  │ (HU dash │   │ (compose ops, git sync, ││  │
│  ┌──────────┐   │  │  board)  │   │ deploy, delete, update) ││  │
│  │cloudflared│  │  └──────────┘   └─────────────────────────┘│  │
│  │ (tunnel) │   │  ┌──────────┐   ┌─────────────────────────┐│  │
│  └──────────┘   │  │  Backup  │   │ Storage Manager         ││  │
│                 │  │ (3-layer │   │ (disk scan, format,     ││  │
│  ┌──────────┐   │  │  restic) │   │ mount, migrate)         ││  │
│  │   App    │   │  └──────────┘   └─────────────────────────┘│  │
│  │  stacks  │   │  ┌──────────┐   ┌─────────────────────────┐│  │
│  │ (docker  │   │  │Scheduler │   │ Monitor & Metrics       ││  │
│  │ compose) │   │  │(cron-like│   │ (health, pings, SQLite  ││  │
│  └──────────┘   │  │  jobs)   │   │ time-series, Chart.js)  ││  │
│                 │  └──────────┘   └─────────────────────────┘│  │
│                 │  ┌──────────┐   ┌─────────────────────────┐│  │
│                 │  │  Notify  │   │ REST API + Hub Reporter ││  │
│                 │  │ (email)  │   │ (JSON push to hub)      ││  │
│                 │  └──────────┘   └─────────────────────────┘│  │
│                 └────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
        │ pings                │ JSON push             │ git pull
        ▼                      ▼                       ▼
 status.felhom.eu        hub.felhom.eu          gitea.dooplex.hu
  (Healthchecks)      (central dashboard)      (stack definitions)
```
### Key Architecture Decisions
- **Pure Go, no frameworks** — stdlib `net/http` + `html/template`. Only external deps: `bcrypt`, `yaml.v3`, `modernc.org/sqlite` (pure Go, no CGO).
- **Privileged container** — Required for disk operations (format, mount, fstab), `/dev` access, and Docker socket control.
- **`/host-dev` indirection** — Docker overrides `/dev` with a tmpfs. The host's `/dev` is mounted at `/host-dev` to access block devices.
- **`StackDataProvider` interface** — Breaks circular import between backup and stacks packages. Implemented by `stackAdapter` in `main.go`. Provides `GetStackHDDPath()` for per-drive backup routing.
- **Atomic file writes** — All persistent state (`settings.json`, `app.yaml`) written to `.tmp` then `os.Rename` for crash safety.
- **`go:embed` templates** — All HTML/CSS/JS compiled into the binary. No runtime file dependencies.
- **Europe/Budapest timezone** — All scheduled jobs, timestamps, and UI labels use Hungarian timezone.
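
The atomic-write decision above can be illustrated with a minimal sketch. The helper name `writeFileAtomic` and the extra `Sync` call are assumptions for illustration; the controller's actual implementation may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic (hypothetical helper) writes data to "<path>.tmp" and then
// renames it over path. On POSIX filesystems a rename within one directory is
// atomic, so readers see either the old file or the complete new one.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp := path + ".tmp"
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	_, werr := f.Write(data)
	if werr == nil {
		// Flush to stable storage before the rename. This is an added
		// precaution in this sketch, not something the README states.
		werr = f.Sync()
	}
	if cerr := f.Close(); werr == nil {
		werr = cerr
	}
	if werr != nil {
		os.Remove(tmp) // best-effort cleanup of the partial temp file
		return werr
	}
	return os.Rename(tmp, path)
}

func main() {
	dir, err := os.MkdirTemp("", "felhom-demo-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "settings.json")
	if err := writeFileAtomic(target, []byte(`{"ok":true}`), 0o600); err != nil {
		panic(err)
	}
	got, err := os.ReadFile(target)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(got))
}
```

Because the temp file lives next to the target, the rename never crosses a filesystem boundary, which is what keeps it atomic.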
### Module Map
| Module | Path | Responsibility |
|--------|------|----------------|
| **Config** | `internal/config/` | YAML loader, validation, `FELHOM_*` env overrides |
| **Settings** | `internal/settings/` | Runtime-mutable `settings.json` (passwords, backup prefs, storage paths, notifications) |
| **Stacks** | `internal/stacks/` | Compose operations, scanning, `.felhom.yml` metadata, deploy/delete flow |
| **Sync** | `internal/sync/` | Git-based app catalog sync (clone/pull, content-hash copy) |
| **Backup** | `internal/backup/` | Per-drive 3-layer backup: DB dumps → restic snapshots → cross-drive copies, restore |
| **Storage** | `internal/storage/` | Disk scanning (`lsblk`), partitioning (`sfdisk`), formatting (`mkfs.ext4`), mounting, data migration (`rsync`) |
| **System** | `internal/system/` | System info (`/proc`), CPU collector, mount points, disk usage, FS info |
| **Monitor** | `internal/monitor/` | Legacy Healthchecks.io pinger (deprecated), system health checks, storage watchdog |
| **Metrics** | `internal/metrics/` | SQLite time-series store, system + container metric collection |
| **Scheduler** | `internal/scheduler/` | Central job scheduler (periodic + daily, skip-if-running, panic recovery) |
| **SelfUpdate** | `internal/selfupdate/` | Version checking (registry), update trigger, state persistence, startup verification |
| **Notify** | `internal/notify/` | Email notifications via hub relay, preference sync, per-event cooldowns |
| **Report** | `internal/report/` | Hub report builder + HTTP pusher (system, stacks, backup, health) |
| **API** | `internal/api/` | REST JSON endpoints |
| **Web** | `internal/web/` | Hungarian dashboard, auth, page handlers, template functions, alerts |
---
## Features
### 1. App Management
The controller manages Docker Compose stacks through a complete lifecycle: catalog sync, first-time deployment, runtime operations, and deletion.
#### Git Sync (`internal/sync/`)
The app catalog lives in a separate Git repository. The controller:
- Shallow-clones the catalog on startup
- Periodically fetches updates (configurable, default 15 min)
- Copies only `docker-compose.yml` and `.felhom.yml` to the stacks directory
- **Never overwrites** `app.yaml` or `.env` (user secrets are safe)
- Uses SHA-256 content hashing — only writes files that actually changed
- Triggers stack rescan after sync so the dashboard updates immediately
- **Post-sync hook**: auto-injects missing deploy fields (new secrets, domains) into existing `app.yaml` for stacks whose templates were updated (see Missing Field Injection below)
- Manual sync via "Sablonok frissitese" button or `POST /api/sync`
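
The content-hash copy described above could look roughly like the following sketch. The function names `contentChanged` and `syncFile` are hypothetical; only the SHA-256 comparison and the "write only when changed" behaviour come from the list above.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// contentChanged compares the SHA-256 digests of the existing and incoming bytes.
func contentChanged(existing, incoming []byte) bool {
	return sha256.Sum256(existing) != sha256.Sum256(incoming)
}

// syncFile writes incoming to dst only when dst is missing or its content
// differs. It reports whether a write happened.
func syncFile(dst string, incoming []byte) (bool, error) {
	existing, err := os.ReadFile(dst)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		// First sync for this file: continue to the write below.
	case err != nil:
		return false, err
	case !contentChanged(existing, incoming):
		return false, nil
	}
	return true, os.WriteFile(dst, incoming, 0o644)
}

func main() {
	dir, err := os.MkdirTemp("", "felhom-sync-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "docker-compose.yml")
	versions := []string{"services: {}\n", "services: {}\n", "services:\n  app: {}\n"}
	for _, content := range versions {
		wrote, err := syncFile(dst, []byte(content))
		if err != nil {
			panic(err)
		}
		fmt.Println("wrote:", wrote)
	}
}
```

In this sketch the second pass is a no-op because the bytes are identical, so file modification times only move when a template really changed.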
#### First-Time Deploy Flow
1. Customer sees app card with "Telepites" button
2. Deploy page shows auto-filled fields (domain), auto-generated secrets (DB passwords, hex keys, base64 keys), and user-configurable inputs (admin password, language, storage path)
3. `checkBeforeDeploy()` JS guard fetches live state first (prevents double-deploy from another tab)
4. **Memory validation** checks `mem_request` against available RAM:
   - `usable_memory = total_ram - reserved_memory_mb` (default 384 MB reserved)
   - Hard block if requests exceed usable memory
   - Soft warning if limits exceed total RAM (overcommit OK)
5. Controller generates secrets, saves `app.yaml`, sets in-memory `Deployed` flag **before** `docker compose up -d` (avoids stale UI during slow image pulls), reverts on failure
6. 3-step progress panel polls `GET /api/stacks/{name}` every 3s: config saved → containers starting → health check passed
7. Post-deploy: locked fields (DB_PASSWORD, etc.) become read-only
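
The memory validation in step 4 reduces to two comparisons. The sketch below assumes the request and limit figures are sums in megabytes over the app being deployed plus whatever is already deployed; the README does not say how they are aggregated, and `checkMemory` is a hypothetical name.

```go
package main

import "fmt"

// memoryVerdict is the outcome of the pre-deploy memory check.
type memoryVerdict struct {
	UsableMB int
	Blocked  bool // hard block: requests exceed usable memory
	Warning  bool // soft warning: limits exceed total RAM (overcommit)
}

// checkMemory applies usable_memory = total_ram - reserved_memory_mb.
// All values are in megabytes (an assumption of this sketch).
func checkMemory(totalRAMMB, reservedMB, requestsMB, limitsMB int) memoryVerdict {
	usable := totalRAMMB - reservedMB
	return memoryVerdict{
		UsableMB: usable,
		Blocked:  requestsMB > usable,
		Warning:  limitsMB > totalRAMMB,
	}
}

func main() {
	// 8 GiB machine with the default 384 MB reserve.
	v := checkMemory(8192, 384, 6144, 9216)
	fmt.Printf("usable=%d MB blocked=%v warning=%v\n", v.UsableMB, v.Blocked, v.Warning)
}
```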
#### App Info Pages
Each app can define rich metadata in `.felhom.yml`:
- `app_info`: tagline, use_cases, first_steps, prerequisites, default_creds, docs_url
- `optional_config`: groups of post-deploy configurable env vars (e.g., API keys for metadata providers)
- `resources`: mem_request, mem_limit, pi_compatible, needs_hdd
The `/apps/{slug}` page renders hero section, screenshots, setup guide, and optional config form.
#### Stack Operations
| Operation | What it does |
|-----------|-------------|
| Start | `docker compose up -d` |
| Stop | `docker compose stop` (blocked for protected stacks) |
| Restart | `docker compose restart` |
| Update | `docker compose pull` + `docker compose up -d` |
| Remove | `docker compose down --volumes` + remove `app.yaml` + optional HDD/backup cleanup; template preserved for redeploy |
| Delete | `docker compose down --rmi local --volumes` + optional HDD data cleanup (orphaned stacks only) |
**Remove vs Delete**: "Eltávolítás" (Remove) is for deployed catalog stacks — it reverts the stack to "Nincs telepítve" state while keeping the template for easy redeployment. "Törlés" (Delete) is for orphaned stacks — it removes the entire stack directory including templates. Both require stopping the stack first.
**Remove modal** shows three sections: (1) always-removed items (Docker volumes, app.yaml, cross-drive schedule), (2) optional HDD data deletion with reimport warning, (3) optional backup data deletion (DB dumps + cross-drive rsync) with restic retention note.
**Protected stacks** (traefik, cloudflared, felhom-controller) cannot be stopped, removed, or deleted from the UI. Restart is allowed.
**Orphan detection**: Deployed stacks with no matching catalog template are marked as orphaned with an "Elavult" badge and can be safely deleted.
#### Missing Field Injection (`deploy.go`)
When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felhom.yml`), existing deployed apps need the new field in their `app.yaml`. The controller handles this automatically:
- **On startup**: `InjectMissingFields()` runs for all deployed stacks
- **After sync**: the post-sync hook runs for stacks whose templates were updated
- For each deployed stack, compares `.felhom.yml` `deploy_fields` against `app.yaml` env vars
- Missing `secret` fields: auto-generated using the field's generator spec (`password:N`, `hex:N`, `base64key:N`)
- Missing `domain` fields: filled with the customer's configured domain
- Other field types (e.g., `text`, `select`): logged as warning for manual configuration
- Locked fields are added to the locked list automatically
**Generator types**: `password:N` (alphanumeric), `hex:N` (hex-encoded random bytes), `base64key:N` (`base64:` + N random bytes base64-encoded, for Laravel APP_KEY etc.), `static:VALUE` (literal value).
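
As an illustration of the generator specs, here is a minimal sketch. It assumes that `N` means output characters for `password` and random bytes for `hex` and `base64key`; the README does not pin this down, and `generateSecret` is a hypothetical name.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"math/big"
	"strconv"
	"strings"
)

const alphanumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// randomBytes returns n bytes from the OS CSPRNG.
func randomBytes(n int) ([]byte, error) {
	buf := make([]byte, n)
	_, err := rand.Read(buf)
	return buf, err
}

// generateSecret resolves a "kind:arg" generator spec to a value.
func generateSecret(spec string) (string, error) {
	kind, arg, ok := strings.Cut(spec, ":")
	if !ok {
		return "", fmt.Errorf("invalid generator spec %q", spec)
	}
	if kind == "static" {
		return arg, nil // literal value, may itself contain ':'
	}
	n, err := strconv.Atoi(arg)
	if err != nil || n <= 0 {
		return "", fmt.Errorf("invalid length in spec %q", spec)
	}
	switch kind {
	case "password":
		out := make([]byte, n)
		limit := big.NewInt(int64(len(alphanumeric)))
		for i := range out {
			idx, err := rand.Int(rand.Reader, limit) // unbiased index
			if err != nil {
				return "", err
			}
			out[i] = alphanumeric[idx.Int64()]
		}
		return string(out), nil
	case "hex":
		buf, err := randomBytes(n)
		if err != nil {
			return "", err
		}
		return hex.EncodeToString(buf), nil
	case "base64key":
		buf, err := randomBytes(n)
		if err != nil {
			return "", err
		}
		return "base64:" + base64.StdEncoding.EncodeToString(buf), nil
	}
	return "", fmt.Errorf("unknown generator %q", kind)
}

func main() {
	for _, spec := range []string{"password:24", "hex:16", "base64key:32", "static:hu"} {
		v, err := generateSecret(spec)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-14s -> %d chars\n", spec, len(v))
	}
}
```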
#### Container State Display
| State | Color | Label | Meaning |
|-------|-------|-------|---------|
| Running + healthy | Green | "Fut" | All containers running and healthy |
| Running + starting | Orange | "Indulas..." | Healthcheck not yet passed |
| Running + unhealthy | Yellow | "Nem egeszseges" | Healthcheck failing |
| Stopped/exited | Red | "Leallitva" | All containers stopped |
| Restarting | Yellow | "Ujrainditas..." | Restart loop |
| Not deployed | Gray | "Nincs telepitve" | Compose file exists, not deployed |
---
### 2. Backup System
The backup system implements a **3-2-1 backup architecture**. Each tier is a **complete,
self-sufficient backup** — any single tier can fully restore an app.
| Tier | Contents | Location | Can fully restore? |
|------|----------|----------|--------------------|
| **1. Nightly restic** | DB + Config + User data | Same drive as app | Yes (not against drive failure) |
| **2. Cross-drive** | DB + Config + User data | Different physical device | Yes |
| **3. Remote** | Everything | Cloud / remote server | Future |
**Key principles:**
- User data backup is **mandatory** — every app with HDD bind mounts is included
automatically. There is no per-app toggle.
- Each tier includes **everything** needed to restore: DB dumps, config, and user data.
No tier depends on another tier's data.
- **Tier 2 is configurable for ALL apps** — not just apps with HDD data. Non-HDD apps
back up config + DB dumps to the secondary drive (small but protects against drive failure).
- The `AppBackupPrefs.Enabled` field in settings.json is legacy and not read by any code.
**Per-app Tier 2 contents by app type:**
| App type | Tier 2 contents | Example |
|----------|----------------|---------|
| HDD + DB | Config + DB + User data | Immich, Paperless-ngx |
| HDD, no DB | Config + User data | — |
| DB, no HDD | Config + DB | Mealie, Vikunja |
| Config only | Config | Gokapi, Homepage |
#### Tier 1: Nightly Backup (mandatory, same drive)
The nightly backup has two phases that run sequentially. All paths are **per-drive** — each physical drive gets its own restic repo and per-app DB dump directories.
**Drive layout (v0.14.1):**
```
<drive>/
├── appdata/<app>/ ← app user data
└── backups/
└── primary/
├── restic/ ← one restic repo per drive (all apps on this drive)
└── <app>/db-dumps/ ← per-app DB dump files
```
Path computation is centralized in `backup/paths.go`:
- `PrimaryResticRepoPath(drivePath)` → `<drive>/backups/primary/restic/`
- `AppDBDumpPath(drivePath, stackName)` → `<drive>/backups/primary/<stack>/db-dumps/`
- `AppDataDir(drivePath, stackName)` → `<drive>/appdata/<stack>/`
- `SecondaryInfraPath(drivePath)` → `<drive>/backups/secondary/_infra/`
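
A sketch of what these helpers might look like, using the function names and layouts listed above. The bodies are an assumption; the real `backup/paths.go` may differ, for example in how trailing slashes are handled.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// PrimaryResticRepoPath returns the per-drive Tier 1 restic repository.
func PrimaryResticRepoPath(drivePath string) string {
	return filepath.Join(drivePath, "backups", "primary", "restic")
}

// AppDBDumpPath returns the per-app DB dump directory on the app's home drive.
func AppDBDumpPath(drivePath, stackName string) string {
	return filepath.Join(drivePath, "backups", "primary", stackName, "db-dumps")
}

// AppDataDir returns the app's user data directory.
func AppDataDir(drivePath, stackName string) string {
	return filepath.Join(drivePath, "appdata", stackName)
}

// SecondaryInfraPath returns the Tier 2 infrastructure config mirror.
func SecondaryInfraPath(drivePath string) string {
	return filepath.Join(drivePath, "backups", "secondary", "_infra")
}

func main() {
	drive := "/mnt/hdd_1" // example mount name from the init wizard
	fmt.Println(PrimaryResticRepoPath(drive))
	fmt.Println(AppDBDumpPath(drive, "immich"))
	fmt.Println(AppDataDir(drive, "immich"))
	fmt.Println(SecondaryInfraPath(drive))
}
```

Keeping every path in one file means a layout change touches a single place rather than every backup and restore call site.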
**Phase 1 — Database Dumps** (`internal/backup/dbdump.go`, scheduled 02:30)
- **Auto-discovery** of PostgreSQL and MariaDB containers via `docker ps` + `docker inspect`
- Dumps via `docker exec pg_dump` / `docker exec mariadb-dump` with 5-minute timeout
- Dumps are written to the app's **home drive**: `AppDBDumpPath(appDrive, stackName)`
- Atomic writes (`.tmp` → `.sql`) to prevent corruption
- **Validation** after each dump: checks file size, header presence, counts `CREATE TABLE`
- Results are cached in `settings.json`, so they survive container restarts
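
The post-dump validation can be sketched as below. The header markers and the 512-byte header window are assumptions based on what `pg_dump` and `mariadb-dump` typically emit. The sketch works on bytes in memory, whereas the controller presumably reads the dump file; `validateDump` is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
)

// dumpReport summarises the three checks named above.
type dumpReport struct {
	SizeOK   bool // dump is at least minSize bytes
	HeaderOK bool // a known dump banner appears near the top
	Tables   int  // number of CREATE TABLE statements
}

// dumpHeaders lists banner strings assumed to appear in plain-text dumps.
var dumpHeaders = [][]byte{
	[]byte("PostgreSQL database dump"),
	[]byte("MariaDB dump"),
	[]byte("MySQL dump"),
}

// validateDump inspects the dump content without parsing SQL.
func validateDump(data []byte, minSize int) dumpReport {
	head := data
	if len(head) > 512 {
		head = head[:512] // only look for the banner near the top
	}
	r := dumpReport{
		SizeOK: len(data) >= minSize,
		Tables: bytes.Count(data, []byte("CREATE TABLE")),
	}
	for _, h := range dumpHeaders {
		if bytes.Contains(head, h) {
			r.HeaderOK = true
			break
		}
	}
	return r
}

func main() {
	dump := []byte("--\n-- PostgreSQL database dump\n--\n" +
		"CREATE TABLE users (id int);\nCREATE TABLE notes (id int);\n")
	fmt.Printf("%+v\n", validateDump(dump, 32))
}
```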
**Phase 2 — Restic Snapshot** (`internal/backup/restic.go`, scheduled 03:00)
- Apps are **grouped by drive** via `groupStacksByDrive()` — each drive's apps are backed up to that drive's restic repo
- App drive resolution: `GetStackHDDPath()` (from `StackDataProvider`) → falls back to `SystemDataPath`
- Auto-generated repository password (32 random bytes, base64url), shared across all repos, synced to hub
- **Paths included in every per-drive snapshot:**
  - Per-app DB dump dirs on that drive
  - Per-app HDD mount paths (user data)
  - Stacks dir (compose.yml + app.yaml + .felhom.yml for all apps)
  - `controller.yaml` (controller config)
- Auto-detects and removes stale restic repository locks
- Weekly prune on Sundays with configurable retention (keep-daily, keep-weekly, keep-monthly)
- Weekly integrity check (`restic check`) on Sunday 04:00 — checks **all** primary repos
**Protects against:** accidental deletion and data corruption, and allows point-in-time rollback.
Does NOT protect against drive failure (the backup is on the same physical drive).
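
The per-drive grouping used in Phase 2 might be sketched as follows. The `stackInfo` type and the example paths are hypothetical; only the function name and the fallback to `SystemDataPath` come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// stackInfo is a hypothetical stand-in for what StackDataProvider exposes.
type stackInfo struct {
	Name    string
	HDDPath string // empty when the app has no HDD data
}

// groupStacksByDrive maps each drive path to the stacks whose data lives on it.
// Stacks without an HDD path fall back to the system data path.
func groupStacksByDrive(stacks []stackInfo, systemDataPath string) map[string][]string {
	groups := make(map[string][]string)
	for _, s := range stacks {
		drive := s.HDDPath
		if drive == "" {
			drive = systemDataPath
		}
		groups[drive] = append(groups[drive], s.Name)
	}
	return groups
}

func main() {
	groups := groupStacksByDrive([]stackInfo{
		{Name: "immich", HDDPath: "/mnt/hdd_1"},
		{Name: "mealie"},
		{Name: "paperless-ngx", HDDPath: "/mnt/hdd_1"},
	}, "/srv/felhom") // hypothetical SystemDataPath

	drives := make([]string, 0, len(groups))
	for d := range groups {
		drives = append(drives, d)
	}
	sort.Strings(drives) // map iteration order is random; sort for stable output
	for _, d := range drives {
		fmt.Println(d, groups[d])
	}
}
```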
#### Tier 2: Cross-Drive Backup (opt-in, different device) (`internal/backup/crossdrive.go`)
**Complete backup** to a different physical drive. Available for **all apps** — apps with HDD
data back up config + DB + user data; apps without HDD back up config + DB dumps only.
- **Auto-enable for small apps (v0.14.1):** Apps without HDD mounts (config-only, DB-only) are
automatically configured for daily rsync Tier 2 when ≥2 storage paths are registered.
`AutoEnableSmallApps()` runs at the start of each nightly backup cycle. Never overwrites
existing user-configured cross-drive settings (even disabled ones).
- **Infrastructure config backup (v0.14.1):** `syncInfraConfig()` rsyncs the stacks directory
and `controller.yaml` to `<dest>/backups/secondary/_infra/` on every secondary destination
drive. Runs before per-app backups. Cross-drive restic also includes infra paths.
- **Two methods:**
- **rsync** — Simple mirror with `--delete` (fast, no versioning, **browsable** on disk)
- **restic** — Versioned, deduplicated, encrypted (shared repo across apps, not browsable)
- Per-app configuration in settings.json: destination path, method, schedule (daily/weekly/manual)
- **Pre-backup DB dump:** `DumpStackDB()` runs fresh pg_dump/mariadb-dump before each cross-drive backup; non-fatal on failure (wired via `DBDumper` interface to avoid circular imports)
- **Empty mounts allowed:** `RunAppBackup` accepts apps with no HDD mounts — the rsync
mount loop simply doesn't execute, but DB + config copy still runs
- **Drive-type-aware validation** (`ValidateDestination`):
| Destination type | Space checks |
|-----------------|--------------|
| External mount (different device than `/`) | Block if <100 MB free |
| System drive (same device as `/`) | Require ≥10 GB free AND <90% used; logged warning |
- **Secondary drive layout (v0.14.1):**
```
<dest-drive>/backups/secondary/
├── _infra/ ← infrastructure config mirror (v0.14.1)
│ ├── controller.yaml
│ └── stacks/ ← full stacks dir (all app configs)
├── <app>/rsync/ ← per-app rsync mirror
│ ├── _db/ ← DB dump files
│ ├── _config/ ← compose.yml, app.yaml, .felhom.yml
│ └── <user data> ← HDD mount contents (if app has HDD data)
└── restic/ ← shared restic repo (all cross-drive apps)
```
- DB dump files read from **per-app home drive** path (`AppDBDumpPath`)
- `_` prefix directories prevent collision with user data
- For non-HDD apps, only `_db/` and `_config/` are present (no user data directory)
- **Restic backup paths:** includes HDD mounts (if any) + config dir + per-app DB dump dir from home drive + stacks dir + controller.yaml (infra, v0.14.1)
- Safety guards: destination ≠ source, path-overlap check (HDD mounts only), writable check
- **Chained execution:** runs immediately after nightly restic — daily apps every night, weekly apps on Sundays
- Per-app concurrency lock prevents overlapping runs
- Status (last_run, duration, size, error) persisted to settings.json
**Protects against:** primary drive failure, drive theft/damage.
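
The drive-type-aware space checks from the `ValidateDestination` table can be sketched as a pure function. It assumes binary units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes) and that the caller has already determined whether the destination shares a device with `/`; `checkDestinationSpace` is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	mib = 1 << 20
	gib = 1 << 30
)

// checkDestinationSpace applies the space rules from the table above.
// A nil result means the destination passes the space checks.
func checkDestinationSpace(onSystemDrive bool, freeBytes, totalBytes uint64) error {
	if totalBytes == 0 || freeBytes > totalBytes {
		return errors.New("invalid filesystem statistics")
	}
	if !onSystemDrive {
		if freeBytes < 100*mib {
			return errors.New("external destination has less than 100 MB free")
		}
		return nil
	}
	if freeBytes < 10*gib {
		return errors.New("system drive has less than 10 GB free")
	}
	usedPct := float64(totalBytes-freeBytes) / float64(totalBytes) * 100
	if usedPct >= 90 {
		return fmt.Errorf("system drive is %.0f%% used (limit 90%%)", usedPct)
	}
	return nil
}

func main() {
	fmt.Println(checkDestinationSpace(false, 50*mib, 1000*gib))
	fmt.Println(checkDestinationSpace(true, 20*gib, 100*gib))
	fmt.Println(checkDestinationSpace(true, 11*gib, 200*gib))
}
```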
#### Tier 3: Remote Backup (future)
Complete offsite backup for disaster recovery. Not yet implemented.
Placeholder shown in UI ("3. mentés — Hamarosan").
#### Restore (`internal/backup/restore.go`)
All deployed apps appear in the restore dropdown — every app has restic snapshot data
(stacks dir + DB dumps are always backed up).
| App type | Config restored | DB restored | User data restored |
|----------|----------------|------------|-------------------|
| Has HDD data | Yes | Yes | Yes (always — backup is mandatory) |
| DB only, no HDD | Yes | Yes | n/a |
| No DB, no HDD | Yes | — | n/a |
- **Snapshot API** returns ALL snapshots unfiltered — older snapshots still allow config+DB restore; `RestoreApp` extracts whatever paths are available
- **Restore type info** shown per-app when selected in dropdown (Hungarian banners):
  - Has HDD: "Teljes visszaállitas: adatbazis + konfiguracio + felhasznaloi adatok"
  - Has DB, no HDD: "Adatbazis es konfiguracio visszaallitasa"
  - No DB, no HDD: "Csak konfiguracio visszaallitasa"
- **Execution flow:** stop app → resolve app's home drive → `restic restore <id> --target / --include <path>...` from per-drive repo → restart app
- Restic repo resolved via `PrimaryResticRepoPath(appDrivePath)`
- DB dumps restored from `AppDBDumpPath(appDrivePath, stackName)`
- Running flag prevents concurrent backup/restore operations
- Snapshot ID validated (8-64 lowercase hex)
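
The snapshot ID rule (8 to 64 lowercase hex characters) maps directly onto a regular expression. This is a sketch; the controller may validate differently, and `validSnapshotID` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// snapshotIDPattern accepts 8 to 64 lowercase hexadecimal characters,
// covering both short and full restic snapshot IDs.
var snapshotIDPattern = regexp.MustCompile(`^[0-9a-f]{8,64}$`)

// validSnapshotID reports whether id is safe to pass to `restic restore`.
func validSnapshotID(id string) bool {
	return snapshotIDPattern.MatchString(id)
}

func main() {
	for _, id := range []string{"1a2b3c4d", "1A2B3C4D", "latest", "abc"} {
		fmt.Printf("%-10s %v\n", id, validSnapshotID(id))
	}
}
```

Rejecting anything outside this pattern also keeps shell metacharacters and option-like strings out of the restic command line.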
**Note:** Restore currently uses Tier 1 (primary restic repo on app's home drive) only.
Restoring from Tier 2 (cross-drive) is a future enhancement.
#### Backup Page UI (`internal/web/templates/backups.html`)
Unified per-app status table with expandable rows showing **per-tier** backup status:
**Status dot per app:**
| Dot color | Meaning |
|-----------|---------|
| Green | 2+ tiers configured with successful backups + destination healthy |
| Yellow | Only 1 tier, or Tier 2 failing, or Tier 2 configured but never run |
| Red | Tier 2 destination blocked or inaccessible |
Every app starts as yellow (1 tier only). Green requires Tier 2 configured with successful backup.
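
The status-dot rules can be expressed as a small decision function. The field names are hypothetical, and giving red precedence over yellow when several conditions hold is an assumption; the table does not specify the order.

```go
package main

import "fmt"

// tier2Status is a hypothetical summary of an app's cross-drive backup state.
type tier2Status struct {
	Configured         bool
	DestinationBlocked bool // destination blocked or inaccessible
	HasRun             bool
	LastRunOK          bool
}

// statusDot returns the dot color for one app. Tier 1 is always present,
// so the result depends only on Tier 2.
func statusDot(t tier2Status) string {
	switch {
	case !t.Configured:
		return "yellow" // only one tier
	case t.DestinationBlocked:
		return "red"
	case !t.HasRun || !t.LastRunOK:
		return "yellow" // configured but never run, or failing
	default:
		return "green"
	}
}

func main() {
	fmt.Println(statusDot(tier2Status{}))
	fmt.Println(statusDot(tier2Status{Configured: true, HasRun: true, LastRunOK: true}))
	fmt.Println(statusDot(tier2Status{Configured: true, DestinationBlocked: true}))
}
```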
**Per-app backup tiers (3 rows per app):**
- **1. mentes** (Tier 1, always present) — Auto badge + "helyi" + last run + contents (e.g., "DB + Konfig + Adatok")
- **2. mentes** (Tier 2, configurable for ALL apps) — one of:
- Configured: method (rsync/restic) + destination + schedule + last run + status + contents + browsable indicator (folder icon for rsync) + action buttons
- Not configured: "1. mentes auto" + "Nincs 2. masolat" + settings link
- **3. mentes** (Tier 3, placeholder) — grayed out "Hamarosan" + "tavoli (offsite)" + future note
**Backup contents per app** (shown per tier):
- Apps with DB + HDD: "DB + Konfig + Adatok"
- Apps with DB only: "DB + Konfig"
- Apps with HDD, no DB: "Konfig + Adatok"
- Apps with neither: "Konfig"
**Deploy page** shows cross-drive (Tier 2) configuration form for **all deployed apps**,
not just those with HDD data. Non-HDD apps can configure destination, method, and schedule.
**Other sections:**
- Schedule overview with next run times for DB dump, restic, prune
- Snapshot history table (last 20 snapshots aggregated from all per-drive repos, sorted by time)
- Storage overview card (total size across repos, snapshot count, DB dump count/size, encryption key with show/copy)
- Restore section: app dropdown → snapshot dropdown → restore type info → confirmation checkbox → execute
---
### 3. Storage Management
The storage subsystem handles the full lifecycle of external storage: detection, initialization, path registration, and data migration.
#### Disk Scanning (`internal/storage/scan.go`)
- `ScanDisks()` uses `lsblk -J -b` for block device enumeration
- System disk detection via host fstab parsing (`/host-fstab`) + UUID resolution via `blkid`
- Partitions enriched with filesystem type, UUID, and label from direct `blkid` probing (Docker containers have incomplete udev cache)
- Returns `AvailableDisks` (non-system, non-loop, non-CDROM) and `SystemDisks` separately
- Handles NVMe (`nvme0n1p1`), SCSI (`sdb1`), and eMMC (`mmcblk0p1`) naming
#### Disk Initialization Wizard (`internal/storage/format.go`)
A step-by-step UI at `/settings/storage/init`:
1. **Scan** — Lists available disks with model, size, partition info
2. **Select** — User picks a disk and enters a mount name (e.g., `hdd_1`)
3. **Confirm** — User types "FORMAZAS" to confirm destructive operation
4. **Format pipeline**: `wipefs` → `sfdisk` (GPT) → `mkfs.ext4` → `blkid` UUID → backup fstab → append UUID-based fstab entry → mount → `findmnt` verification → `chown 1000:1000` → create `appdata/`, `backups/`, and `Dokumentumok/` subdirectories
5. Auto-registers new storage path in settings.json
6. Smart partition detection: skips repartitioning for existing empty partitions
Safety guards: system disk detection, mount path conflict check, confirmation required, progress channel for real-time UI feedback.
#### Attach Existing Drive Wizard (`internal/storage/attach.go`)
A step-by-step UI at `/settings/storage/attach` for drives that already have a filesystem (e.g., a previously used ext4 drive). Unlike the init wizard, this does **not** format the drive — existing data is preserved.
**Problem solved:** Mounting a whole drive at `/mnt/<name>` would mix existing user data with the controller's directory structure (`storage/`, `Dokumentumok/`, backup repos). The bind-mount approach isolates the controller's working directory from other data on the drive.
1. **Scan** — Lists available disks, filtered to partitions that have an existing filesystem (FSType != "")
2. **Mount raw** — Partition is mounted read-only at a hidden staging path (`/mnt/.felhom-raw/<label>`)
3. **Browse** — Directory browser shows the drive's contents. User can navigate and create a new folder (e.g., `felhom_data`)
4. **Configure** — User enters a mount name and display label. Warning: mount path is immutable until detached
5. **Finalize** — Bind-mounts the selected subfolder at `/mnt/<name>`. Two fstab entries are created (both with `nofail`):
   - Raw mount: `UUID=<uuid> /mnt/.felhom-raw/<x> <fstype> defaults,nofail,noatime 0 2`
   - Bind mount: `/mnt/.felhom-raw/<x>/<subfolder> /mnt/<name> none bind,nofail 0 0`
6. Sets permissions (`chown 1000:1000`), creates `storage/` and `Dokumentumok/` subdirectories
7. Auto-registers the storage path in settings.json + syncs FileBrowser mounts
Cancel at any point cleans up the temporary raw mount. The bind mount path (`/mnt/<name>`) is a real mount point, so all existing code (disk usage, IsMountPoint checks, etc.) works unchanged.
#### Storage Path Registry (`internal/settings/settings.go`)
Multiple external storage paths supported with:
- **Label**: Human-readable name (editable inline)
- **Default flag**: New deploys use this path by default
- **Schedulable flag**: Path appears in deploy dropdown
- **Disconnected state**: `Disconnected`, `DisconnectedAt`, `StoppedStacks` — set by watchdog or safe-disconnect API, cleared on reconnect
- **Auto-discovery**: On startup, scans deployed apps' `HDD_PATH` values and registers unknown paths
- Thread-safe CRUD: Add, Remove, SetDefault, SetSchedulable, SetLabel, SetDisconnected, ClearDisconnected
#### Data Migration (`internal/storage/migrate.go`)
Move app data between storage paths (e.g., SSD → HDD, HDD → new HDD):
1. Validate: stack exists, deployed, has HDD data, target differs from source
2. Estimate total size, check free space on target
3. Stop the application
4. `rsync -a --info=progress2` per mount path with real-time progress parsing
5. Update `app.yaml` HDD_PATH to new location
6. Start the application
7. **Rollback on failure**: reverts config, restarts on old storage
Progress UI at `/stacks/{name}/migrate` with byte counter and percentage.
#### Stale Data Cleanup
After migration, the deploy page detects leftover data on previous storage paths:
- Shows path, size, and a delete button
- Two-step confirmation required
- Protected paths (appdata, backups, media, Dokumentumok) cannot be deleted
#### FileBrowser Mount Sync
When storage paths are added or removed, `syncFileBrowserMounts()` auto-regenerates FileBrowser's `docker-compose.yml` with volume mounts for all registered paths, then recreates the container.
#### Storage Watchdog (`internal/monitor/watchdog.go`)
Continuously monitors registered storage paths for disconnection/reconnection (primarily USB drives):
- **Probe loop**: `ProbeStoragePath()` calls `syscall.Statfs()` with 3-second timeout in a goroutine. Runs every 5s per connected path, 30s per disconnected path.
- **Debouncing**: 3 consecutive probe failures required before declaring a drive disconnected (prevents false positives from transient I/O).
- **Disconnect reaction** (automatic, ~15s detection):
  1. Stops all deployed stacks whose `HDD_PATH` is under the disconnected drive (skips protected stacks)
  2. Persists `Disconnected`, `DisconnectedAt`, `StoppedStacks` to `settings.json`
  3. Lazy-unmounts stale VFS entries (`umount -l`); for attach-wizard drives, it unmounts the bind mount first, then the raw mount
  4. Fires alert refresh (red banner on all pages), notification (`storage_disconnected`), and immediate hub report push
- **Auto-reconnect** (for UUID-based fstab entries):
  1. Checks `/host-dev/disk/by-uuid/<uuid>` for device reappearance
  2. Cleans stale mounts, then `mount -T /host-fstab <path>` (raw + bind for attach-wizard drives)
  3. Verifies with a post-mount probe
  4. Runs `restic unlock` if stale lock files exist
  5. Validates `StoppedStacks` (filters to actually-stopped stacks), clears `Disconnected` flag
  6. Fires alert refresh, notification (`storage_reconnected`), hub report push
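
The probe-with-timeout and debouncing behaviour of the watchdog can be sketched as below. The probe is passed in as a function so the example stays portable; in the controller it would wrap `syscall.Statfs()` as described above. The names `probeWithTimeout` and `debouncer` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errProbeTimeout = errors.New("probe timed out")

// probeWithTimeout runs probe in a goroutine and gives up after timeout.
// A hung mount can block a stat call indefinitely, so the caller must not
// wait on it directly.
func probeWithTimeout(probe func() error, timeout time.Duration) error {
	done := make(chan error, 1) // buffered: a late probe can still finish and exit
	go func() { done <- probe() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errProbeTimeout
	}
}

// debouncer declares a drive disconnected only after `threshold`
// consecutive probe failures.
type debouncer struct {
	threshold int
	failures  int
}

// observe records one probe result and reports whether the threshold is reached.
func (d *debouncer) observe(err error) bool {
	if err == nil {
		d.failures = 0
		return false
	}
	d.failures++
	return d.failures >= d.threshold
}

func main() {
	d := &debouncer{threshold: 3}
	ioErr := errors.New("input/output error")
	for i, result := range []error{nil, ioErr, ioErr, ioErr} {
		result := result
		err := probeWithTimeout(func() error { return result }, 3*time.Second)
		fmt.Printf("probe %d: disconnected=%v\n", i+1, d.observe(err))
	}
}
```

With a 5-second probe interval and a threshold of 3, this matches the roughly 15-second detection time quoted above.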
**Safe disconnect UI** (manual, Settings page):
- "Leválasztás" button shown for USB drives (detected via sysfs symlink path containing `/usb`)
- Confirmation dialog lists affected apps
- Flow: stop apps → `sync` → `umount` (fallback `umount -l`) → mark disconnected → notification
- Disconnected card: dashed border, red badge, timestamp, stopped apps list, "Csatlakoztatás" (reconnect) button
- After reconnect: "Alkalmazások indítása" button to restart auto-stopped stacks
**USB detection** (`system.IsUSBDevice`): Reads `/host/sys/block/<disk>` symlink — if target path contains `/usb`, it's a USB device. The `removable` sysfs flag is unreliable for USB HDDs (returns 0).
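
A minimal sketch of this check, assuming the sysfs directory is passed in (for example `/host/sys/block`). The function names are hypothetical and the sample link targets are only illustrative of typical sysfs paths.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// targetLooksUSB reports whether a sysfs link target passes through a USB bus.
func targetLooksUSB(linkTarget string) bool {
	return strings.Contains(linkTarget, "/usb")
}

// isUSBDevice reads the <sysBlockDir>/<disk> symlink and inspects its target.
func isUSBDevice(sysBlockDir, disk string) (bool, error) {
	target, err := os.Readlink(filepath.Join(sysBlockDir, disk))
	if err != nil {
		return false, err
	}
	return targetLooksUSB(target), nil
}

func main() {
	// Illustrative link targets, not read from a real system.
	usb := "../devices/pci0000:00/0000:00:14.0/usb2/2-1/2-1:1.0/host2/target2:0:0/2:0:0:0/block/sdb"
	sata := "../devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0/block/sda"
	fmt.Println(targetLooksUSB(usb), targetLooksUSB(sata))
}
```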
**Backup guards**: Nightly DB dumps, restic snapshots, and cross-drive backups all skip disconnected drives with WARN log (not treated as failures).
**UI integration**: Disconnected drives show with hatched red bars on dashboard, monitoring, and backup pages. Per-app backup rows show "Meghajtó leválasztva" badge. Health check emits warnings for disconnected paths.
---
### 4. Monitoring & Health
#### System Health Checks (`internal/monitor/healthcheck.go`)
`RunHealthCheck()` evaluates multiple subsystems and returns a `HealthReport` with status (`ok`/`warn`/`fail`):
| Check | Warning | Critical |
|-------|---------|----------|
| Disk usage (SSD/HDD) | >= 90% | >= 95% |
| Memory | available < 512MB | available < 256MB |
| CPU temperature | >= 75C | >= 85C |
| Docker daemon | — | unreachable |
| Protected containers | — | not running |
| Storage paths | not a mount point (data on SSD), drive disconnected | path inaccessible, disk >= 95% |
Backup destination validation (`CheckBackupDestination`) has tiered checks:
- Path doesn't exist → critical/blocked
- Not writable → critical/blocked
- Same block device as root → warning (data on system drive)
- Disk >95% full → critical/blocked
- Disk >90% full → warning
#### Healthchecks.io Integration (deprecated)
Legacy pinger (`internal/monitor/pinger.go`) still runs for backward compatibility but is no longer the primary monitoring mechanism. Monitoring is now handled by the Hub event system (see [Notifications](#5-notifications)). A deprecation log is emitted on startup if ping UUIDs are configured.
#### Metrics Store (`internal/metrics/`)
- **SQLite with WAL mode** for concurrent reads during collection
- **System metrics**: CPU%, memory (total/used/available), temperature, load average — collected every 60 seconds
- **Container metrics**: CPU%, memory, network I/O, block I/O per container
- Downsampled queries for chart time ranges (1h, 6h, 24h, 7d, 30d)
- 30-day auto-prune via daily scheduler job
#### Monitoring Page
Full-page system monitor at `/monitoring`:
- **System Overview**: hostname, OS, kernel, CPU model/cores, uptime
- **System Metrics Charts**: 4 line charts (CPU, Memory, Temperature, Load) in 2x2 grid
- **Container Resources**: horizontal bar charts (CPU% and Memory per container)
- **Per-container Detail**: click-to-expand historical charts
- **Hub Connection Status**: shows Hub URL, customer ID, connection state (connected/unreachable), last successful push, last error
Chart.js 4.4.7 embedded locally (works in offline environments), dark theme matching site design.
#### Alert System (`internal/web/alerts.go`)
State-based alerts displayed on all pages:
- Sources: health issues, Hub connection status, backup disabled, storage disconnected, update available
- Hub alerts: `hub-disabled` (warning) when Hub not enabled, `hub-unreachable` (error) when last push failed and no success in 30 min
- Sorted by severity (error > warning > info), capped at 5 visible
- Refreshed every 5 min + on startup + on storage state changes
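
The severity ordering and the cap of 5 visible alerts could be implemented along these lines. The `alert` type and the treatment of unknown severities are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a hypothetical minimal alert record.
type alert struct {
	ID       string
	Severity string // "error", "warning" or "info"
}

// severityRank orders error before warning before info; unknown values sort last.
func severityRank(s string) int {
	switch s {
	case "error":
		return 0
	case "warning":
		return 1
	case "info":
		return 2
	}
	return 3
}

// visibleAlerts returns at most limit alerts, most severe first.
// The sort is stable, so alerts of equal severity keep their input order.
func visibleAlerts(all []alert, limit int) []alert {
	sorted := append([]alert(nil), all...) // copy: do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return severityRank(sorted[i].Severity) < severityRank(sorted[j].Severity)
	})
	if len(sorted) > limit {
		sorted = sorted[:limit]
	}
	return sorted
}

func main() {
	alerts := []alert{
		{"update-available", "info"},
		{"hub-disabled", "warning"},
		{"hub-unreachable", "error"},
	}
	for _, a := range visibleAlerts(alerts, 5) {
		fmt.Println(a.Severity, a.ID)
	}
}
```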
---
### 5. Notifications
#### Hub Event System (`internal/notify/notifier.go`)
The controller pushes structured events to the Hub's `/api/v1/event` endpoint. The Hub handles notification dispatch, cooldown management, and dead man's switch detection.
**Core method:** `PushEvent(eventType, severity, message, details)` — non-blocking goroutine, 2 retries with 3s backoff, never blocks the caller.
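
The fire-and-forget behaviour of `PushEvent` can be sketched as follows. The JSON field names, the function signature, and the reading of "2 retries" as up to three attempts in total are assumptions. Only the endpoint path, the non-blocking goroutine, and the 3 s backoff come from the description above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Event mirrors the PushEvent arguments; the JSON field names are assumptions.
type Event struct {
	Type     string `json:"type"`
	Severity string `json:"severity"`
	Message  string `json:"message"`
	Details  any    `json:"details,omitempty"`
}

var errTransient = errors.New("hub unreachable")

// sendWithRetry calls send once and then up to `retries` more times, sleeping
// `backoff` between attempts. It returns the number of attempts made.
func sendWithRetry(send func() error, retries int, backoff time.Duration) (int, error) {
	var err error
	for attempt := 1; ; attempt++ {
		if err = send(); err == nil || attempt > retries {
			return attempt, err
		}
		time.Sleep(backoff)
	}
}

// PushEvent returns immediately; delivery happens in a background goroutine,
// so a slow or unreachable Hub never blocks the caller.
func PushEvent(client *http.Client, hubURL string, ev Event) {
	go func() {
		body, err := json.Marshal(ev)
		if err != nil {
			log.Printf("event %s: marshal failed: %v", ev.Type, err)
			return
		}
		_, err = sendWithRetry(func() error {
			resp, err := client.Post(hubURL+"/api/v1/event", "application/json", bytes.NewReader(body))
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			if resp.StatusCode >= 300 {
				return fmt.Errorf("hub returned %s", resp.Status)
			}
			return nil
		}, 2, 3*time.Second)
		if err != nil {
			log.Printf("event %s: giving up: %v", ev.Type, err)
		}
	}()
}

func main() {
	// Demonstrate the retry loop with a fake sender that succeeds on the third try.
	calls := 0
	attempts, err := sendWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}, 2, 10*time.Millisecond)
	fmt.Println(attempts, err)
}
```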
#### Event Types
| Event Type | Severity | Trigger |
|------------|----------|---------|
| `backup_completed` | info | Nightly restic backup succeeds |
| `backup_failed` | error | Nightly restic backup fails |
| `db_dump_completed` | info | Nightly database dumps succeed |
| `db_dump_failed` | error | Nightly database dumps fail |
| `backup_integrity_ok` | info | Weekly `restic check` passes |
| `backup_integrity_failed` | error | Weekly `restic check` fails |
| `crossdrive_completed` | info | Cross-drive secondary backup succeeds |
| `crossdrive_failed` | error | Cross-drive secondary backup fails |
| `health_degraded` | warning | Health status degrades (ok→warn) |
| `health_critical` | error | Health status critical (any→fail) |
| `health_recovered` | info | Health status recovers (fail/warn→ok) |
| `disk_warning` | warning | Disk usage crosses 90% |
| `disk_critical` | error | Disk usage crosses 95% |
| `storage_disconnected` | error | Storage drive physically removed |
| `storage_reconnected` | info | Storage drive reconnected |
| `controller_started` | info | Controller process starts |
| `controller_updated` | info/error | Self-update success or failure |
| `app_deployed` | info | New app deployed via API |
| `app_removed` | info | App removed via API |
| `disaster_recovery_started` | warning | DR restore begins |
| `disaster_recovery_completed` | info/error | DR restore finishes (success/partial) |
Each event carries typed detail structs (e.g., `BackupDetails`, `DiskDetails`, `HealthDetails`) serialized as JSON.
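
As an illustration of how a typed detail struct might travel inside an event, here is a small sketch. Only the name `DiskDetails` comes from this README; its fields, the `Event` envelope, and every JSON key are assumptions made for the example, not the actual wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical envelope for POST /api/v1/event.
type Event struct {
	Type     string `json:"type"`
	Severity string `json:"severity"`
	Message  string `json:"message"`
	Details  any    `json:"details,omitempty"` // one of the typed detail structs
}

// DiskDetails: the struct name appears in the README, the fields are illustrative.
type DiskDetails struct {
	MountPoint  string  `json:"mount_point"`
	UsedPercent float64 `json:"used_percent"`
}

// encodeEvent serializes an event, including its typed details, to JSON.
func encodeEvent(e Event) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	body, err := encodeEvent(Event{
		Type:     "disk_warning",
		Severity: "warning",
		Message:  "Disk usage crossed 90%",
		Details:  DiskDetails{MountPoint: "/mnt/hdd_1", UsedPercent: 91.5},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```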
#### Default Enabled Events
Events the customer receives notifications for (configurable in settings):
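`backup_failed`, `db_dump_failed`, `disk_warning`, `disk_critical`, `storage_disconnected`, `node_down`, `health_critical`, `expected_backup_missed`, `expected_dbdump_missed`

`node_down`, `expected_backup_missed`, and `expected_dbdump_missed` are not listed in the controller event table above; they appear to correspond to the Hub's own dead man's switch checks (staleness and missed-backup detection).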
#### Preference Sync
Notification preferences (email, enabled events, cooldown hours) are:
- Stored locally in `settings.json`
- Synced to Hub on save and on controller startup via `POST /api/v1/preferences`
- Hub sync failure doesn't block local save
---
### 6. Update Management
#### App Catalog Sync
- Periodic `git fetch` + `git reset --hard` of the app catalog repo
- Content-hash comparison prevents unnecessary file writes
- Post-sync stack rescan detects new/changed apps immediately
#### Planned Update Classifications
| Marker | Behavior |
|--------|----------|
| No marker | Optional — shown on dashboard, customer clicks "Update" |
| `UPDATE_REQUIRED=true` | Mandatory — auto-applied during next update window |
| `UPDATE_SECURITY=true` | Critical — applied immediately |
#### Controller Self-Update (`internal/selfupdate/`)
The controller can update itself — a Watchtower-style pull-and-restart mechanism for a single container. Replaces manual SSH-based `docker pull + sed + docker compose up -d` with a one-click Settings page button or scheduled auto-update.
##### How It Works
```
1. Check Gitea Docker Registry V2 API for new image tags
2. Compare highest semver tag with current Version (set at build time via ldflags)
3. If newer version exists → pull image → update compose file → docker compose up -d
4. Current container is replaced by Docker → new container starts with new version
5. On startup, new container reads update-state.json → marks update success/failure
```
##### Design Philosophy
- **No automatic rollback**: follows the Watchtower pattern (24k+ GitHub stars, no rollback). Docker's `restart: unless-stopped` policy is the crash safety net, and the Hub's dead man's switch detects when the controller goes down (as of v0.21.0 this replaces Healthchecks.io).
- **Audit state file** — `update-state.json` in the data volume records every update attempt (previous version, target version, initiator, result). Operators can SSH in and revert using `PreviousImage` from this file.
- **Backup-aware** — refuses to start an update while a backup is in progress (`backupRunning()` guard).
##### Package Structure
| File | Purpose |
|------|---------|
| `version.go` | `ParseVersion("X.Y.Z")` → `Version{Major,Minor,Patch}`, `Compare()` returns -1/0/1. Hand-rolled, no external deps. Rejects "dev" and "latest". |
| `state.go` | `UpdateState` struct persisted as JSON. `LoadState()`, `SaveState()` (atomic: `.tmp` + rename), `ClearState()`. Status values: `"pending"`, `"success"`, `"failed"`. |
| `updater.go` | Core `Updater` struct. Registry check via HTTP GET to `gitea.dooplex.hu/v2/admin/felhom-controller/tags/list` with Basic Auth (git username/token). Update trigger: `docker pull` → compose file regex replace → `docker compose up -d`. Thread-safe with `sync.Mutex`. |
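
A hand-rolled parser along the lines of the `version.go` row might look like the sketch below. The names `ParseVersion`, `Version`, and `Compare` come from the table; the body is a guess at one reasonable implementation, and accepting a leading `v` is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a plain X.Y.Z triple.
type Version struct{ Major, Minor, Patch int }

// ParseVersion accepts "X.Y.Z" (optionally prefixed with "v", an assumption)
// and rejects anything else, including "dev" and "latest".
func ParseVersion(s string) (Version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("invalid version %q", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil || v < 0 {
			return Version{}, fmt.Errorf("invalid version %q", s)
		}
		n[i] = v
	}
	return Version{Major: n[0], Minor: n[1], Patch: n[2]}, nil
}

// Compare returns -1, 0, or 1 when v is older than, equal to, or newer than o.
func (v Version) Compare(o Version) int {
	a := [3]int{v.Major, v.Minor, v.Patch}
	b := [3]int{o.Major, o.Minor, o.Patch}
	for i := range a {
		switch {
		case a[i] < b[i]:
			return -1
		case a[i] > b[i]:
			return 1
		}
	}
	return 0
}

func main() {
	current, _ := ParseVersion("0.9.3")
	latest, _ := ParseVersion("0.21.0")
	fmt.Println(latest.Compare(current)) // numeric comparison: 21 > 9
	_, err := ParseVersion("dev")
	fmt.Println(err != nil)
}
```

Comparing components as integers matters here: a plain string comparison would rank `0.9.3` above `0.21.0`.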
##### Update Trigger Flow
1. **Guard checks:** concurrent update lock, dev version check, backup running check, compose file accessible
2. Write `update-state.json` with status `"pending"` (audit trail)
3. `docker pull <image>:<targetVersion>`
4. Read compose file → replace image tag via regexp → atomic write (`.tmp` + rename)
5. `docker compose -f /opt/docker/felhom-controller/docker-compose.yml -p felhom-controller up -d`
6. Docker kills the current container, starts the new one
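
Step 4 (the image tag rewrite) could be approximated as below. The function name `replaceImageTag` and the exact pattern are assumptions; the real regexp in `updater.go` is not shown here. The atomic write and the `docker` invocations are left out of the sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceImageTag rewrites the tag on every "image: <image>:<tag>" line for
// the given image. newTag is expected to be a plain version string without
// "$", which the regexp replacement syntax would otherwise interpret.
func replaceImageTag(compose, image, newTag string) (string, bool) {
	re := regexp.MustCompile(`(?m)^([ \t]*image:[ \t]*` + regexp.QuoteMeta(image) + `):\S+`)
	if !re.MatchString(compose) {
		return compose, false // refuse to guess if the image line is missing
	}
	return re.ReplaceAllString(compose, "${1}:"+newTag), true
}

func main() {
	compose := "services:\n" +
		"  controller:\n" +
		"    image: gitea.dooplex.hu/admin/felhom-controller:0.20.0\n" +
		"    restart: unless-stopped\n"
	out, ok := replaceImageTag(compose, "gitea.dooplex.hu/admin/felhom-controller", "0.21.0")
	fmt.Println(ok)
	fmt.Print(out)
}
```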
##### Startup Verification
Called once from `main.go` before the scheduler starts:
1. Load `update-state.json` — if missing or status != `"pending"`, nothing to do
2. Compare running `Version` with `state.TargetVersion`
3. **Match** → mark `"success"`, notify via hub
4. **Mismatch** → mark `"failed"`, notify via hub
5. No rollback attempt — operator reverts manually if needed
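
The decision logic in the steps above is small enough to sketch directly. `UpdateState` and `TargetVersion` are named in this README; the `Status` field name and the `verifyStartup` helper are assumptions, and loading, saving, and Hub notification are omitted.

```go
package main

import "fmt"

// UpdateState mirrors the audit file in spirit; only the fields needed here.
type UpdateState struct {
	Status        string // "pending", "success", or "failed"
	TargetVersion string
}

// verifyStartup resolves a pending update against the running version.
// It returns the new status and whether the state was changed.
func verifyStartup(st *UpdateState, running string) (string, bool) {
	if st == nil || st.Status != "pending" {
		return "", false // no state file or nothing pending: nothing to do
	}
	if running == st.TargetVersion {
		st.Status = "success"
	} else {
		st.Status = "failed" // no rollback is attempted
	}
	return st.Status, true
}

func main() {
	st := &UpdateState{Status: "pending", TargetVersion: "0.21.0"}
	fmt.Println(verifyStartup(st, "0.21.0"))
	fmt.Println(verifyStartup(st, "0.21.0")) // already resolved: no change
}
```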
##### Auto-Update Scheduling
Two separate scheduler jobs prevent interference with backups:
| Job | Type | Default | Purpose |
|-----|------|---------|---------|
| `selfupdate-check` | `sched.Every` | 6h | Check registry, cache result (for UI). Never triggers update. |
| `selfupdate-auto` | `sched.Daily` | 04:30 | If auto-update enabled + update available + backup not running → trigger. |
The auto-update time (`config.SelfUpdate.AutoUpdateTime`, default `"04:30"`) is deliberately placed after the backup window (02:30 to roughly 04:00) to avoid collisions. The `backupRunning()` guard is the hard safety check: if a backup is still running at 04:30, the update is skipped and retried the next day.
An initial version check fires 30s after startup so the Settings page shows version info quickly.
##### Compose File Access
The controller needs write access to its own `docker-compose.yml`. This is achieved via Docker volume mount ordering:
```yaml
volumes:
  # 1. Directory mount: gives access to compose file + .env
  - /opt/docker/felhom-controller:/opt/docker/felhom-controller
  # 2. Read-only override: prevents accidental config writes
  - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro
  # 3. Named volume override: persistent data in Docker-managed volume
  - controller-data:/opt/docker/felhom-controller/data
```
##### API Endpoints
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | `/api/selfupdate/status` | Session or API key | Current status (cached, no network call) |
| POST | `/api/selfupdate/check` | Session or API key | Force registry check, return result |
| POST | `/api/selfupdate/update` | Session or API key | Trigger update (async, returns immediately) |
Self-update endpoints accept either session auth (for UI) or hub API key as bearer token (for external triggering from build scripts or hub). This enables the post-v0.16.0 deploy workflow:
```bash
# After building + pushing new image:
curl -s -X POST https://felhom.demo-felhom.eu/api/selfupdate/update \
-H "Authorization: Bearer <HUB_API_KEY>"
```
##### Settings Page UI
The "Verzió és frissítés" card on the Settings page (`/settings`) shows:
- Current version and latest available version
- "Frissítés elérhető" (update available) badge
- Last check time and any errors
- Auto-update status with configured time
- Last update result (success/failed/pending)
- **Buttons:** "Frissítés keresése" (check) + "Frissítés telepítése" (apply)
After triggering an update, the page polls `/api/health` every 3s and reloads when the new container responds.
A global info-level alert ("Új controller verzió elérhető") appears on all pages when an update is available, linking to the Settings page.
##### Configuration
```yaml
self_update:
  enabled: true
  check_interval: "6h"                               # How often to check registry
  image: "gitea.dooplex.hu/admin/felhom-controller"  # Default
  auto_update: false                                 # Set true for unattended updates
  auto_update_time: "04:30"                          # When to auto-apply (after backups)
  health_timeout_seconds: 60                         # Reserved for future use
```
##### Edge Cases
| Scenario | Behavior |
|----------|----------|
| `Version == "dev"` | `ParseVersion` returns error → no updates reported, trigger refused |
| Registry unreachable | Log warning, return error in check result. No crash. |
| No registry credentials | Return error "Registry hitelesítő adatok hiányoznak" |
| Compose file not writable | Refuse update before doing anything |
| Backup running | Refuse with "Mentés fut, próbálja később" |
| Concurrent update | Mutex prevents duplicates: "Frissítés már folyamatban" |
| Bad update (crash loop) | Docker restarts container. State file stays "pending". Operator SSH-reverts using `PreviousImage`. |
| Corrupt state file | Treated as "no pending update", logged, deleted |
---
### 7. Authentication & Settings
#### Session Auth (`internal/web/auth.go`)
- bcrypt password verification with configurable source priority: `settings.json` → `controller.yaml` → no auth (open access)
- 7-day session duration with random 32-byte hex tokens
- `?next=` redirect after login preserves the page the user was visiting
- Session cleanup every 15 minutes
- All sessions invalidated on password change
- Conditional logout link (hidden when auth is disabled)
#### Settings Persistence (`internal/settings/settings.go`)
Runtime-mutable settings in `settings.json` (separate from infrastructure config):
| Section | Contents |
|---------|----------|
| `password_hash` | bcrypt hash override |
| `notifications` | email, enabled events, cooldown hours |
| `db_validations` | per-DB dump validation results (survives restarts) |
| `app_backup` | per-app map: enabled flag, cross-drive config (method, dest, schedule, runtime status) |
| `storage_paths` | registered paths with label, default flag, schedulable flag, disconnected state |
| `cross_drive_restic_password` | auto-generated restic password for cross-drive repos |
All public methods use `sync.RWMutex`. File writes are atomic (`.tmp` + rename).
#### Settings Page (`/settings`)
Five sections:
1. **System config** — read-only display of `controller.yaml` values
2. **Version & update** — current/latest version, check/update buttons, auto-update status, last update result
3. **Storage paths** — add/remove, edit labels, set default, toggle schedulable, per-path app list with sizes, safe disconnect/reconnect for USB drives
4. **Password change** — current + new + confirm, min 8 chars
5. **Notifications** — email, event checkboxes, cooldown hours, test email button
---
### 8. Central Hub Reporting
#### Report Push (`internal/report/`)
Periodic JSON push (default every 15 min) to the central felhom-hub service:
- System: hostname, OS, CPU, memory, disk usage, uptime
- Containers: running/stopped counts, per-container CPU/memory
- Backup: last run, success, repo stats, snapshot count, restic password (for disaster recovery)
- Health: current status, issues, warnings
- Stacks: deployed apps with versions and states
- Config hash: SHA256 of `controller.yaml` for Hub-side config comparison
Bearer token authentication, 3-attempt retry with 5-second backoff. Push status tracked via `PushStatus` struct (LastAttempt, LastSuccess, LastError, consecutive failures) — used by the monitoring page and alert system to show Hub connection health.
#### Infrastructure Backup to Hub (`internal/report/infra_backup.go`)
After each backup cycle, the controller pushes a full infrastructure snapshot to the Hub for disaster recovery. This snapshot includes:
- `controller.yaml` (base64-encoded, full config including secrets)
- `settings.json` (base64-encoded, backup prefs, storage paths, cross-drive configs)
- Disk layout (UUIDs, labels, mount points, fstab options, bind-mount topology)
- Deployed stacks manifest (app names, HDD paths)
- Restic passwords (primary + cross-drive, base64-encoded)
This enables fully automated recovery when the system drive is replaced — the new controller pulls the snapshot from the Hub, auto-mounts surviving drives by UUID, and restores all applications.
#### Hub Dashboard
The hub service (separate Go app in the `felhom.eu` repo) provides:
- Multi-customer overview table with status indicators and event count badges
- Customer detail page with system/storage/containers/backup/health/events sections
- Event timeline: last 50 events with severity filter, colored badges, source tracking
- Dead man's switch: staleness detection (30min stale, 60min down), missed backup detection (daily at 05:00)
- Notification dispatch: operator (English) + customer (Hungarian) emails via Resend with per-event cooldowns
- Infra backup status per customer (last sync, stack count, disk count)
- Color coding: green (<30min), yellow (30-60min), red (>60min since last report)
- 90-day report + event retention with daily prune at 04:30 Budapest time
### 9. Disaster Recovery
When a system drive fails and is replaced, the controller can automatically restore the full deployment:
```
1. docker-setup.sh deploys fresh controller (Hub enabled, customer_id configured)
2. Controller detects empty data dir → fresh deployment
3. Controller pulls infra backup from Hub → gets disk layout, passwords, configs
4. Controller scans block devices for UUIDs matching stored disk layout
5. Controller mounts surviving drives (e.g., HDD with backups)
6. Controller scans mounted drives for local backup data (_infra/ + rsync copies)
7. Controller auto-restores stack configs → apps appear in dashboard
8. User opens dashboard → "Visszaállítás" (Restore) wizard
9. User confirms → sequential restore: rsync first, restic fallback, DB import
10. Apps restored and running
```
**Backup sources (priority order):**
1. **Rsync copies** (cross-drive, plain files, no password needed) — fastest, most reliable
2. **Restic snapshots** (encrypted, needs password from Hub) — comprehensive but slower
**Fallback:** If the Hub is unreachable, the controller can still detect backups on already-mounted drives (manual mount or pre-existing fstab entries).
---
## Repository Layout
```
controller/
├── cmd/controller/main.go # Entry point, wires all 14 modules
├── internal/
│ ├── config/config.go # YAML loader, validation, env overrides
│ ├── settings/settings.go # Runtime settings (JSON, atomic writes, RWMutex)
│ ├── stacks/
│ │ ├── manager.go # Stack scanning, compose ops, container status
│ │ ├── metadata.go # Parse .felhom.yml app metadata
│ │ ├── deploy.go # First-deploy: secret gen, app.yaml, compose up; missing field injection
│ │ └── delete.go # Stack deletion/removal + HDD/backup data cleanup
│ ├── sync/sync.go # Git sync: clone/pull app catalog, content-hash copy
│ ├── storage/
│ │ ├── scan.go, scan_linux.go # Disk detection via lsblk + blkid
│ │ ├── format.go, format_linux.go # Partition, format, mount pipeline
│ │ ├── attach.go, attach_linux.go # Attach existing FS drive (raw mount + bind mount)
│ │ ├── safety.go, safety_linux.go # System disk detection, mount guards, fstab ops
│ │ ├── migrate.go # App data migration (rsync with progress)
│ │ └── *_other.go # Non-Linux stubs for cross-compilation
│ ├── backup/
│ │ ├── backup.go # Orchestrator (per-drive dumps + restic + cross-drive chain)
│ │ ├── paths.go # Per-drive path helpers (PrimaryResticRepoPath, AppDBDumpPath, etc.)
│ │ ├── dbdump.go # DB auto-discovery + dump (pg_dump, mariadb-dump)
│ │ ├── restic.go # Restic operations (init, snapshot, prune, check) — repoPath as param
│ │ ├── appdata.go # StackDataProvider interface, app data discovery
│ │ ├── crossdrive.go # Per-app backup to secondary storage (rsync/restic)
│ │ ├── restore.go # Per-app restore from per-drive repo
│ │ ├── restore_scan.go # DR: scan drives for backup data, build restore plan
│ │ ├── restore_app_linux.go # DR: per-app restore (rsync config/data + docker compose up)
│ │ └── restore_drives_linux.go # DR: auto-mount drives by UUID from Hub infra backup
│ ├── api/router.go # REST API endpoints (~30 routes)
│ ├── scheduler/scheduler.go # Central job scheduler (Every, Daily)
│ ├── system/
│ │ ├── info.go, info_linux.go # RAM, disk, CPU, temperature, load average
│ │ ├── cpu_linux.go # Background /proc/stat sampling
│ │ └── mounts_linux.go # Mount points, disk usage, FS info, backup dest checks, storage probing, USB detection
│ ├── monitor/
│ │ ├── pinger.go # Healthchecks.io HTTP ping client (deprecated in v0.21.0, kept for transition)
│ │ ├── healthcheck.go # System health checks (disk, mem, CPU, temp, Docker)
│ │ └── watchdog.go # Storage watchdog (probe, disconnect/reconnect, safe eject)
│ ├── metrics/
│ │ ├── store.go # SQLite time-series (WAL mode, downsampled queries)
│ │ ├── collector.go # Background collector (60s, system + docker stats)
│ │ └── sysinfo.go # Static system info (/proc, /etc)
│ ├── selfupdate/
│ │ ├── version.go # Semver parsing + comparison (hand-rolled)
│ │ ├── state.go # Update audit state (JSON, atomic writes)
│ │ └── updater.go # Registry check, update trigger, startup verify
│ ├── notify/notifier.go # Hub event push (PushEvent), preference sync
│ ├── report/
│ │ ├── builder.go # Hub report builder (all subsystems → JSON)
│ │ ├── pusher.go # HTTP POST to hub (retry, Bearer auth)
│ │ ├── infra_backup.go # Push infrastructure snapshot to Hub after each backup cycle
│ │ └── infra_pull.go # DR: pull infra backup from Hub for fresh deployment
│ └── web/
│ ├── server.go # HTTP server, routing, static files
│ ├── auth.go # Session auth, login/logout, session cleanup
│ ├── handlers.go # Page handlers (dashboard, stacks, deploy, backups, etc.)
│ ├── handler_restore.go # DR: restore page handler + APIs (scan, restore all, skip)
│ ├── storage_handlers.go # Storage API handlers (scan, format, attach, migrate, cleanup, disconnect/reconnect)
│ ├── alerts.go # State-based alert generation
│ ├── funcmap.go # Template functions (state colors, Hungarian formatting)
│ ├── embed.go # go:embed for templates + Chart.js
│ └── templates/ # 13 HTML files + style.css (Hungarian UI)
├── configs/
│ ├── controller.yaml.example # Full config reference
│ └── example-felhom-metadata.yml # .felhom.yml format reference
├── Dockerfile # Multi-stage: Go 1.24 builder + debian-slim runtime
├── docker-compose.yml # Controller's own compose (privileged, /mnt rshared)
└── go.mod # Go 1.24, deps: bcrypt, yaml.v3, modernc.org/sqlite
```
---
## Configuration
### Controller config (`controller.yaml`)
Single YAML file per customer, infrastructure-only. Does **not** contain app-specific config.
Key sections:
```yaml
customer:
  name: "Demo Felhom"
  id: "demo-felhom"
paths:
  stacks_dir: "/opt/docker/stacks"
  data_dir: "/opt/docker/felhom-controller/data"
  system_data_path: "/mnt/sys_drive"  # NVMe/system drive, fallback for apps without HDD
git:
  repo_url: "https://gitea.dooplex.hu/admin/app-catalog-felhom.eu.git"
  sync_interval: "15m"
# Per-drive backup paths are computed automatically:
#   <drive>/backups/primary/restic/           - restic repo per drive
#   <drive>/backups/primary/<app>/db-dumps/   - DB dumps per app
#   <drive>/backups/secondary/                - cross-drive rsync + restic
backup:
  enabled: true
  restic_password_file: "/opt/docker/felhom-controller/data/restic-password"
  db_dump_schedule: "02:30"
  restic_schedule: "03:00"
  retention: { keep_daily: 7, keep_weekly: 4, keep_monthly: 6 }
monitoring:
  health_interval: "5m"
  ping_uuids:  # Deprecated since v0.21.0 (Hub event system); pinger kept for transition
    heartbeat: "uuid-here"
    system_health: "uuid-here"
    db_dump: "uuid-here"
    backup: "uuid-here"
    backup_integrity: "uuid-here"
hub:
  enabled: true
  url: "https://hub.felhom.eu"
  api_key: "bearer-token-here"
system:
  reserved_memory_mb: 384  # RAM reserved for OS + controller
```
Environment variable overrides: `FELHOM_LOGGING_LEVEL=debug`, `FELHOM_HUB_ENABLED=false`, etc.
### Runtime settings (`settings.json`)
Auto-managed by the controller. Contains password hash overrides, notification preferences, per-app backup configs, storage path registry, DB validation cache. All writes are atomic.
### Per-app config (`app.yaml`)
Auto-generated during deployment. Contains env vars, locked fields list, deploy timestamp. Secret fields are locked (read-only after first deploy). Missing fields from updated templates are auto-injected on startup and after sync (see Missing Field Injection).
---
## Scheduler Jobs
| Job | Type | When | Purpose |
|-----|------|------|---------|
| status-refresh | periodic | 30s | Refresh container states |
| stack-scan | periodic | 2m | Rescan stacks directory |
| heartbeat | periodic | 5m | Ping Healthchecks "I'm alive" |
| system-health | periodic | configurable | Health checks + alert refresh |
| backup-cache | periodic | 5m | Refresh backup status cache |
| hub-report | periodic | 15m | Push report to central hub |
| db-dump | daily | 02:30 | Database dumps |
| backup | daily | 03:00 | Restic backup → cross-drive chain |
| backup-integrity | daily | Sun 04:00 | Restic check |
| metrics-prune | daily | 04:00 | Delete metrics older than 30 days |
| selfupdate-check | periodic | 6h | Check registry for new version (cache for UI) |
| selfupdate-auto | daily | 04:30 | Auto-update if enabled + backup not running |
All daily jobs use Europe/Budapest timezone. Skip-if-running prevents concurrent execution. Panic recovery in all jobs.
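
Skip-if-running and panic recovery can be combined in one small wrapper, sketched below. `Job` and `Trigger` are hypothetical names; the real scheduler in `internal/scheduler/scheduler.go` may structure this differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Job wraps a function with a running flag. Use it through a pointer,
// since the atomic flag must not be copied.
type Job struct {
	Name    string
	Run     func()
	running atomic.Bool
}

// Trigger runs the job unless a previous run is still active.
// It returns false when the run was skipped. A panic inside Run is
// recovered and logged so one failing job cannot take the process down.
func (j *Job) Trigger() (started bool) {
	if !j.running.CompareAndSwap(false, true) {
		return false // skip-if-running
	}
	defer j.running.Store(false) // runs last, also after a recovered panic
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("job %q panicked: %v\n", j.Name, r)
		}
	}()
	started = true
	j.Run()
	return started
}

func main() {
	job := &Job{Name: "backup-cache", Run: func() { panic("boom") }}
	fmt.Println(job.Trigger()) // panic is recovered, run counted as started
	fmt.Println(job.Trigger()) // flag was reset, so the job can run again
}
```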
---
## REST API
### Stack Operations
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check (no auth) |
| GET | `/api/stacks` | List all stacks |
| GET | `/api/stacks/{name}` | Stack details |
| POST | `/api/stacks/{name}/deploy` | First-time deploy |
| POST | `/api/stacks/{name}/start` | Start stack |
| POST | `/api/stacks/{name}/stop` | Stop stack |
| POST | `/api/stacks/{name}/restart` | Restart stack |
| POST | `/api/stacks/{name}/update` | Pull + recreate |
| POST | `/api/stacks/{name}/optional-config` | Update optional env vars |
| GET | `/api/stacks/{name}/logs` | Container logs (`?raw=1` for plain text) |
| GET | `/api/stacks/{name}/hdd-data` | HDD data paths + sizes |
| GET | `/api/stacks/{name}/backup-data` | Backup data paths + sizes (DB dumps, cross-drive rsync) |
| POST | `/api/stacks/{name}/remove` | Remove deployed stack (revert to "not deployed") |
| DELETE | `/api/stacks/{name}` | Delete orphaned stack |
| POST | `/api/sync` | Trigger catalog sync |
| GET | `/api/system/info` | System info + sync status |
### Backup & Restore
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/backup/status` | Full backup status |
| POST | `/api/backup/run` | Trigger manual backup |
| GET | `/api/backup/snapshots` | List snapshots (`?stack={name}` for filtering) |
| POST | `/api/stacks/{name}/cross-backup` | Save cross-drive config |
| POST | `/api/stacks/{name}/cross-backup/run` | Trigger cross-drive backup |
| GET | `/api/stacks/{name}/cross-backup/status` | Cross-drive status |
| POST | `/api/backup/cross-drive/run-all` | Run all scheduled cross-drive backups |
### Storage
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/storage/scan` | Scan available disks |
| POST | `/api/storage/init` | Format and mount a disk |
| GET | `/api/storage/init/status` | Format progress |
| POST | `/api/storage/attach/mount-raw` | Temp-mount partition for browsing |
| GET | `/api/storage/attach/browse?path=` | List directories on raw mount |
| POST | `/api/storage/attach/mkdir` | Create folder on raw mount |
| POST | `/api/storage/attach` | Finalize attach (bind mount + fstab) |
| GET | `/api/storage/attach/status` | Attach progress |
| POST | `/api/storage/attach/cancel` | Cleanup temp raw mount |
| POST | `/api/storage/migrate` | Start app data migration |
| GET | `/api/storage/migrate/status` | Migration progress |
| POST | `/api/storage/disconnect` | Safe disconnect (stop apps, unmount) |
| POST | `/api/storage/reconnect` | Reconnect disconnected drive |
| POST | `/api/storage/restart-apps` | Restart auto-stopped apps |
| GET | `/api/storage/status` | All storage paths with connection state |
### Self-Update
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/selfupdate/status` | Update status (cached check result + last state) |
| POST | `/api/selfupdate/check` | Force registry check |
| POST | `/api/selfupdate/update` | Trigger self-update (async) |
Self-update endpoints accept session auth OR `Authorization: Bearer <hub_api_key>` for external triggering.
### Config Management
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/config/apply` | Apply new controller.yaml from Hub (atomic write) |
| GET | `/api/config/hash` | Get SHA256 hash of current controller.yaml |
Config endpoints accept session auth OR `Authorization: Bearer <hub_api_key>` (same as self-update). The `/api/config/apply` endpoint:
- Accepts raw YAML body (the generated config from Hub)
- Validates YAML is parseable before writing
- Atomic write: writes to `.tmp` then `os.Rename` for crash safety
- Does NOT reload config — restart required to apply changes
- Returns `{"ok": true, "message": "Config applied. Restart controller to apply changes."}`
### Metrics
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/metrics/system` | System metrics time-series (`?range=1h\|6h\|24h\|7d\|30d`) |
| GET | `/api/metrics/containers/summary` | Current container stats |
| GET | `/api/metrics/containers/{name}` | Per-container time-series |
| GET | `/api/metrics/sysinfo` | Static system info |
Response format: `{"ok": true/false, "data": ..., "error": "...", "message": "..."}`
---
## Build & Deploy
### Build
```bash
# On build server (192.168.0.180)
cd ~/build/felhom-controller
git -C ~/git/deploy-felhom-compose pull
./build.sh v0.20.0 --push
```
### Deploy on customer node
**Option A: Self-Update API (v0.16.0+)**
After building and pushing the new image, trigger the controller's self-update endpoint:
```bash
curl -s -X POST https://felhom.demo-felhom.eu/api/selfupdate/update \
-H "Authorization: Bearer <HUB_API_KEY>"
```
The controller pulls the new image, updates its own compose file, and runs `docker compose up -d` to replace itself. The Settings page also has a "Frissítés telepítése" button for manual triggering.
**Option B: Manual SSH (pre-v0.16.0 or fallback)**
```bash
# On customer node (e.g., 192.168.0.162)
cd /opt/docker/felhom-controller
sudo docker pull gitea.dooplex.hu/admin/felhom-controller:<VERSION>
sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:<VERSION>|' docker-compose.yml
sudo docker compose up -d
```
**Important:** Always use `docker compose up -d`, NOT `docker compose restart` — restart doesn't pick up new images.
### Docker Requirements
The controller container needs:
- `privileged: true` (disk operations)
- Docker socket mount (`/var/run/docker.sock`)
- `/mnt` mount with `propagation: rshared` (container mounts visible to host)
- `/dev` mounted as `/host-dev` (block device access)
- `/etc/fstab` mounted as `/host-fstab` (persistent mount config)
See `docker-compose.yml` for the full volume configuration.
---
## Roadmap
### Completed
- [x] Stack management with deploy flow and memory validation
- [x] Git-based app catalog sync
- [x] Central job scheduler
- [x] System monitoring with SQLite metrics and Chart.js charts
- [x] Healthchecks.io integration (5 ping types)
- [x] 3-layer backup system (DB dumps + restic + cross-drive)
- [x] Per-app backup restore with auto stop/restart
- [x] Storage management (scan, format, mount, registry)
- [x] Attach existing drive wizard (v0.15.0) — bind-mount subfolder from pre-formatted drive, directory browser
- [x] App data migration between storage paths
- [x] Storage watchdog (v0.17.0) — USB disconnect detection (~15s), auto-stop apps, auto-remount on reconnect, safe eject UI
- [x] Central hub reporting
- [x] Email notifications via hub relay
- [x] Settings persistence and password management
- [x] Dashboard alert system
- [x] Per-drive backup architecture (v0.14.0) — per-drive restic repos, per-app DB dumps, path helpers
- [x] Cross-drive restic pruning (v0.14.0)
- [x] Auto Tier 2 for small apps (v0.14.1) — auto-enable daily rsync for non-HDD apps when ≥2 drives
- [x] Infrastructure config in cross-drive backup (v0.14.1) — stacks dir + controller.yaml in `_infra/` + restic
- [x] Disaster recovery (v0.15.5) — Hub-based infra backup, auto-mount by UUID, restore UI with full-page takeover
- [x] Controller self-update (v0.16.0) — Watchtower-style pull + restart, Settings page UI, API key auth, auto-update scheduling
- [x] Hub-managed config (v0.20.0) — Config apply endpoint (`POST /api/config/apply`), config hash in reports for sync comparison
### In Progress / Planned
- [ ] Update classification and auto-apply (optional/required/security markers)
- [ ] Docker volume backup (`/var/lib/docker/volumes:ro`)
- [ ] Raspberry Pi testing (pi-customer-1)
- [ ] CSRF protection on POST endpoints
- [ ] Login rate limiting
---
## Test Environments
| Node | Hardware | Domain | Status |
|------|----------|--------|--------|
| demo-felhom | Acemagic GK3PLUS N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | Controller v0.20.0 |
| pi-customer-1 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | Not yet tested |
## Related Repositories
| Repository | Purpose |
|------------|---------|
| [deploy-felhom-compose](https://gitea.dooplex.hu/admin/deploy-felhom-compose) | This repo — controller + deploy scripts |
| [app-catalog-felhom.eu](https://gitea.dooplex.hu/admin/app-catalog-felhom.eu) | Docker Compose templates + .felhom.yml metadata |
| [felhom.eu](https://gitea.dooplex.hu/admin/felhom.eu) | Website + app assets + felhom-hub service |