CONTEXT.md — Project Memory

This file serves as persistent project memory across Claude Code sessions. It replaces the auto-generated "Memory" from the claude.ai Project. Update this file at the end of each working session with current state, recent decisions, and anything the next session needs to know.

Ask Claude Code: "Please update CONTEXT.md with what we did today"

Last updated: 2026-02-15 (session 10)


About Viktor (project owner)

  • Works at Magyar Telekom (Budapest), building Felhom as a side business
  • Felhom: managed home-server service for Hungarian households
  • Technical but prefers pragmatic solutions over over-engineering
  • Hosts code on self-hosted Gitea (gitea.dooplex.hu); runs a k3s cluster for management infrastructure
  • Customer deployments use Docker Compose (not Kubernetes) for simplicity

Current project state

felhom-controller (this repo)

  • Version: v0.4.0
  • Phase 1: COMPLETE — Stack Manager + Deploy Flow
  • Phase 2: COMPLETE — Monitoring & Health (scheduler, CPU/temp, healthchecks.io pings)
  • Phase 3: COMPLETE — Backups (DB dumps, restic integration, manual trigger)
  • First app deployed: Paperless-ngx on demo-felhom.eu (2026-02-13)
  • Running on: demo-felhom (N100 mini PC) at 192.168.0.162:8080
  • All Phase 1-3 features working: deploy, start/stop/restart/update, logs, health-aware states, auth, monitoring, backups

What was just completed (2026-02-15 session 10)

  • v0.4.0 — Monitoring & Health + Backups (Phase 2 & 3):
    • Central job scheduler (internal/scheduler/scheduler.go):
      • Replaces ad-hoc goroutines in main.go with a unified scheduler
      • Every(name, interval, fn) for periodic jobs, Daily(name, timeStr, fn) for scheduled tasks
      • Panic recovery, skip-if-running, quiet mode for high-frequency jobs (≤30s)
      • Daily jobs use Europe/Budapest timezone with time.Timer for DST correctness
      • Graceful shutdown with 30s timeout for running jobs
    • CPU usage collector (internal/system/cpu_linux.go):
      • Background goroutine samples /proc/stat every 5s, computes delta-based CPU %
      • Platform stubs for non-Linux in cpu_other.go
    • Temperature & load metrics (internal/system/info_linux.go):
      • Reads /proc/loadavg for 1/5/15 min load averages
      • Reads thermal zones from /host/sys/class/thermal/ (Docker mount) with /sys/ fallback
      • Handles millidegree values, picks highest zone, with hwmon fallback
    • Healthchecks.io pinger (internal/monitor/pinger.go):
      • HTTP ping client for Healthchecks.io-compatible endpoints
      • POST to /ping/{uuid} (success), /fail (failure), /start (started)
      • 10s timeout, 3 retries with 2s backoff, skips CHANGEME UUIDs
    • System health checks (internal/monitor/healthcheck.go):
      • Checks disk, memory, CPU, temperature, Docker reachability, protected containers
      • Returns HealthReport with status "ok"/"warn"/"fail" + formatted message for pings
    • Database dump engine (internal/backup/dbdump.go):
      • Auto-discovers PostgreSQL/MariaDB containers via docker ps + docker inspect
      • Dumps via docker exec pg_dump/mariadb-dump with 5min timeout
      • Atomic writes (.tmp.sql), empty file detection, stale temp cleanup
    • Restic integration (internal/backup/restic.go):
      • Auto-generates repository password (32 random bytes, base64url)
      • Init, snapshot (JSON output), prune, check, stats, latest snapshot
      • Stale lock detection with automatic unlock + retry
    • Backup orchestrator (internal/backup/backup.go):
      • DB dumps + restic snapshots, weekly prune on Sundays
      • Thread-safe running flag, Healthchecks.io pings with results
      • RunFullBackup() for manual trigger (sequential: dumps → snapshot)
    • Wiring updates:
      • main.go: scheduler-based job registration, cpuCollector lifecycle, pinger + backupMgr init
      • api/router.go: GET /api/backup/status, POST /api/backup/run
      • web/server.go + handlers.go: pass cpuCollector to GetInfo(), backup status on dashboard
      • funcmap.go: tempColor, fmtTemp, fmtLoad template functions
    • Dashboard UI enhancements:
      • CPU usage bar with load average display below
      • Temperature with colored indicator dot (green/yellow/red at 60°/75°C)
      • Backup status card: last run time, DB count, repo size/snapshots
      • "Mentés most" button triggers manual backup via API
    • Config updates:
      • controller.yaml.example: added system_health_interval, hdd_path, system.reserved_memory_mb
      • docker-compose.yml: added /sys:/host/sys:ro mount for temperature reading
      • restic_password_file default changed to data/ subdir (auto-generated in named volume)
  • Controller version: v0.4.0 — deployed and verified on demo-felhom.eu
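
The skip-if-running and panic-recovery behaviour described for the scheduler can be sketched roughly as follows. This is a minimal illustration, not the code in internal/scheduler/scheduler.go: the Job type and its Run method are hypothetical names, and timers, quiet mode and graceful shutdown are left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Job is a hypothetical wrapper (not the type used in scheduler.go) that
// adds skip-if-running and panic recovery around a plain function.
type Job struct {
	Name    string
	Fn      func()
	running atomic.Bool
}

// Run executes Fn once. It returns false when a previous run is still in
// progress, and true when Fn was invoked (even if it panicked).
func (j *Job) Run() (ran bool) {
	if !j.running.CompareAndSwap(false, true) {
		return false // skip-if-running
	}
	defer j.running.Store(false) // runs last: always clear the flag
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("[ERROR] job %q panicked: %v\n", j.Name, r)
		}
	}()
	ran = true
	j.Fn()
	return ran
}

func main() {
	started, release := make(chan struct{}), make(chan struct{})
	slow := &Job{Name: "backup", Fn: func() { close(started); <-release }}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); slow.Run() }()
	<-started
	fmt.Println("overlapping run executed:", slow.Run())
	close(release)
	wg.Wait()

	boom := &Job{Name: "boom", Fn: func() { panic("oops") }}
	boom.Run()
	fmt.Println("runs again after panic:", boom.Run())
}
```

A compare-and-swap on an atomic flag keeps the check and the claim in one step, so two triggers arriving at the same moment cannot both start the job.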

What was previously completed (2026-02-15 session 9)

  • v0.3.0 — Structural refactoring (templates + server split + domain rename):
    • Templates: go:embed migration — moved all 7 HTML templates + CSS from Go string constants to individual files in internal/web/templates/. Created embed.go with //go:embed directive. Template loading now uses ParseFS() instead of Parse(). CSS served from embed.FS via ReadFile(). Zero runtime file dependencies — still compiled into the binary.
    • Server decomposition — split monolithic server.go (540 lines) into focused files:
      • auth.go: session struct, auth middleware, login/logout handlers, session management
      • handlers.go: page handlers (dashboard, stacks, logs, deploy, app detail)
      • funcmap.go: template FuncMap with 14 custom functions
      • server.go: Server struct, NewServer, loadTemplates (3-liner), ServeHTTP routing, render helper, static file serving
    • Domain rename — controller subdomain changed from dashboard.* to felhom.* in Traefik labels and setup script
    • Documentation updated — CLAUDE.md, README.md, CONTEXT.md all reflect new file structure
    • Reminder for Viktor: Update Cloudflare Tunnel public hostname (dashboard.demo-felhom.eu → felhom.demo-felhom.eu) and Pi-hole DNS if needed
  • Controller version: v0.3.0

What was previously completed (2026-02-15 session 8)

  • FileBrowser as infrastructure service:
    • Created scripts/hdd-setup.sh (adapted from deploy-portainer) — sets up HDD folder structure with Dokumentumok user dir
    • Created scripts/docker-setup.sh (adapted from deploy-portainer) — installs Docker, Traefik, FileBrowser as infra services
    • Added filebrowser to protected stacks in controller.yaml.example
    • Removed templates/filebrowser/ from app-catalog-felhom.eu (no longer a catalog app)
  • Orphan stack detection and deletion:
    • Added Orphaned field to Stack struct + getCatalogTemplateSlugs() helper
    • Orphan detection in ScanStacks() — deployed stacks with no matching catalog template marked as orphaned
    • New delete.go: DeleteStack() (compose down + HDD cleanup + dir removal), GetStackHDDData(), parseComposeHDDMounts()
    • Safety: protected HDD paths (root, media, storage, Dokumentumok, appdata) can never be deleted
    • New API endpoints: DELETE /api/stacks/{name} and GET /api/stacks/{name}/hdd-data
    • UI: orange "Elavult" badge on orphaned stacks, "Törlés" button, delete confirmation modal
    • Modal shows HDD data paths/sizes, checkbox for "Felhasználói adatok törlése a merevlemezről"
    • Hides "Frissítés" and "Részletek" buttons for orphaned stacks
  • Verified: 1 orphaned stack detected on startup (filebrowser — now infra, removed from catalog)
  • Controller version: v0.2.15
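
The protected-path rule could be checked along these lines. This is a sketch under assumptions: isDeletable and protectedDirs are hypothetical names, the directory list is copied from the note above, and the real check in delete.go may be structured differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// protectedDirs lists the top-level HDD directories named in the notes.
var protectedDirs = []string{"media", "storage", "Dokumentumok", "appdata"}

// isDeletable reports whether target may be removed. It must live inside
// hddRoot and must be neither the root itself nor a protected top-level dir.
func isDeletable(hddRoot, target string) bool {
	rel, err := filepath.Rel(filepath.Clean(hddRoot), filepath.Clean(target))
	if err != nil || rel == "." || rel == ".." ||
		strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return false // the root itself, or a path outside the HDD
	}
	for _, d := range protectedDirs {
		if rel == d {
			return false
		}
	}
	return true
}

func main() {
	root := "/mnt/hdd"
	for _, p := range []string{
		"/mnt/hdd",
		"/mnt/hdd/storage",
		"/mnt/hdd/storage/paperless",
		"/mnt/hdd/storage/../media",
		"/etc",
	} {
		fmt.Printf("%-28s deletable=%v\n", p, isDeletable(root, p))
	}
}
```

Cleaning both paths before comparing means a target such as storage/../media is resolved to the protected directory it really points at.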

Previously completed (2026-02-14 session 7)

  • Fixed YAML parse error in romm .felhom.yml (app-catalog repo):
    • Root cause: Hungarian opening quote (U+201E) paired with ASCII " (0x22) inside YAML double-quoted strings terminated the string prematurely
    • Affected lines: help_text for IGDB Client Secret and SteamGridDB API Key fields
    • Fix: escaped inner ASCII double quotes with \" in the YAML strings
    • This caused LoadMetadata() to silently fail and return empty defaults for ALL romm metadata (tagline, resources, category — everything)
  • Added error logging to LoadMetadata() in metadata.go:
    • [ERROR] log on YAML parse failure (was silently swallowed — critical bug)
    • Temporary [DEBUG] log used for diagnosis, then removed
  • Fixed deploy command in CLAUDE.md:
    • sed pattern now targets only image: lines (was matching service name too, breaking YAML)
    • Added sudo for both sed and docker compose (directory is root-owned)
  • Controller version: v0.2.14
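
The same scoping idea as the fixed sed pattern, expressed in Go for illustration only (the actual deploy command stays a sed one-liner). The bumpImage helper and the registry.example host are made up for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageLine matches only "image:" lines that reference felhom-controller.
// Group 1 keeps the indentation and registry prefix; the tag follows it.
var imageLine = regexp.MustCompile(`(?m)^([ \t]*image:[ \t]*\S*felhom-controller):\S+`)

// bumpImage replaces the image tag and leaves every other line untouched,
// including the service name line "felhom-controller:".
func bumpImage(compose, newTag string) string {
	return imageLine.ReplaceAllString(compose, "${1}:"+newTag)
}

func main() {
	compose := "services:\n" +
		"  felhom-controller:\n" +
		"    image: registry.example/felhom-controller:0.2.13\n"
	fmt.Print(bumpImage(compose, "0.2.14"))
}
```

Anchoring the pattern on the image: key is what keeps the service name line out of the match.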

Previously completed (2026-02-14 session 6)

  • Bug fix: App info logo SVG rendering (.app-info-logo CSS in templates.go):
    • Added min-width, min-height, max-width, max-height: 80px and overflow: hidden
    • Prevents SVG images with explicit dimensions or no viewBox from overflowing container
    • Logo now reliably renders at 80x80 regardless of SVG intrinsic size
  • Controller version: v0.2.12

Previously completed (2026-02-14 session 5)

  • App detail/info pages — new feature:
    • New route: GET /apps/{slug} renders a full info page (was redirect to deploy page)
    • Hero section with logo, tagline, resource badges
    • Screenshots section (graceful — hidden via onerror if assets don't exist)
    • Info cards: use cases, first steps, prerequisites, default credentials, docs link
    • Optional config form with AJAX save (POST /api/stacks/{name}/optional-config)
    • New .felhom.yml fields: app_info (tagline, use_cases, first_steps, prerequisites, default_creds, docs_url) and optional_config (groups of env var fields)
    • New structs in metadata.go: AppInfo, OptionalConfigGroup, OptionalConfigField
    • UpdateOptionalConfig in deploy.go: saves optional env vars to app.yaml, restarts deployed stacks with docker compose up -d to pick up new env vars
    • Navigation updated: stack cards on dashboard/stacks pages now link to /apps/{slug}, deploy page has "Részletek" link back to info page
  • RoMM metadata updated (app-catalog repo):
    • Full app_info section: tagline, 5 use cases, 6 first steps, 3 prerequisites, default creds, docs URL
    • 6 optional config fields for metadata providers: IGDB (client_id + secret), SteamGridDB, ScreenScraper (user + password), MobyGames
    • docker-compose.yml updated with SCREENSCRAPER_USER, SCREENSCRAPER_PASSWORD, MOBYGAMES_API_KEY env vars
    • Display name fixed: "ROMM" → "RomM"
  • Controller version: v0.2.11

Previously completed (2026-02-14 session 4)

  • Fixed deploy race condition in internal/stacks/deploy.go:
    • In-memory Deployed flag now set BEFORE docker compose up -d (compose up can take 30-60s for image pulls)
    • On failure: both in-memory state and disk (app.yaml) are reverted
    • Eliminates stale "Telepítés" button during long compose operations
  • Added checkBeforeDeploy() JS guard in internal/web/templates.go:
    • Telepítés buttons on Vezérlőpult and Alkalmazások pages now fetch live state from /api/stacks/{name} before navigating
    • If app is already deployed (e.g., another tab deployed it), shows alert and reloads page instead of navigating to deploy form
    • Catches stale UI state gracefully

Previously completed (2026-02-14 session 3)

  • Enhanced debug logging across all stack operations in internal/stacks/:
    • Operation timing: All stack ops (start, stop, restart, update, deploy) now log elapsed time
    • Post-start container state check: Async goroutine after start/restart/update/deploy
    • Image pull detection: Checks local images before deploy/update (debug level)
    • GetLogs/ScanStacks improvements: Byte count logging, deployed/available counts
    • All verbose checks gated on cfg.Logging.Level == "debug"; timing always at INFO
  • UI improvements in internal/web/templates.go and server.go:
    • Memory bar fix on deploy page: Bar segments now always visible (min-width: 3px), new app segment uses translucent green with distinct border for clear visual separation from committed memory
    • Clickable app cards: Cards on Vezérlőpult and Alkalmazások pages are now clickable (navigates to deploy/detail page). Uses data-href attribute + delegated click handler. Protected stacks excluded. Actions area (buttons, state labels) excluded from click-to-navigate
    • Live-scrolling logs: Logs page now auto-refreshes every 3s via AJAX polling (?raw=1 returns plain text). Fixed-height container (70vh) with auto-scroll to bottom. Pulsing green "Élő" indicator. Pause/resume toggle ("Szüneteltetés"/"Folytatás"). User scroll position preserved when scrolled up to read history
    • Deployment progress UI: Deploy button no longer shows alert+redirect immediately. Instead shows 3-step progress panel: config saved → containers starting → app initializing. Polls GET /api/stacks/{name} every 3s to track actual container health state. Handles running (auto-redirect), starting (keep polling), unhealthy (warning), exited (error), and 120s timeout. Shows elapsed time counter
  • Mealie healthcheck fix (app-catalog-felhom.eu):
    • wget --spider replaced with Python TCP socket check — mealie image doesn't include wget
    • start_period increased to 60s (DB migrations take ~40s on first start)
  • Healthcheck audit: filebrowser (Alpine, has BusyBox wget — OK), stirling-pdf (Ubuntu, has wget — OK)

Previously completed (2026-02-15 session 2)

  • Phase 4: Git Sync + App Catalog Audit — major milestone
  • Git sync module (internal/sync/sync.go):
    • Clones/pulls app-catalog-felhom.eu repo to local cache on startup
    • Periodic sync based on git.sync_interval (default 15m)
    • Copies docker-compose.yml + .felhom.yml to stacks dir (never overwrites app.yaml/.env)
    • SHA-256 content comparison — only writes changed files
    • Triggers ScanStacks() after sync so dashboard updates immediately
    • Uses os/exec git CLI — no Go git library dependency
  • Manual sync button ("Sablonok frissítése") on Alkalmazások page:
    • POST /api/sync endpoint with 30s debounce
    • Toast notification shows result (success/failure/what changed)
    • Auto-reloads page if new apps or updates detected
  • Sync status added to /api/system/info (last_sync, last_status, syncing flag)
  • .felhom.yml files created for all 10 apps (paperless-ngx already had one):
    • actualbudget, docmost, filebrowser, homebox, immich, mealie, romm, stirling-pdf, vaultwarden
    • All follow the same format: display_name, description, category, subdomain, resources, deploy_fields
  • Docker Compose templates audited and fixed for all 10 apps:
    • Fixed {{DOMAIN}} → ${DOMAIN} syntax in homebox, mealie, romm, stirling-pdf
    • Fixed {{HDD_PATH}} → ${HDD_PATH} in romm
    • Added deploy.resources.limits.memory to all services across all templates
    • Added TZ=Europe/Budapest to all sidecar services (postgres, redis, mariadb)
    • Added healthcheck to romm main service
    • Added romm-redis condition: service_healthy (was service_started)
    • Standardized header comment blocks across all templates
  • Documentation updated: app-catalog README, CLAUDE.md, CONTEXT.md

Previously completed (2026-02-15 session 1)

  • Memory validation during deployment:
    • Pre-deploy memory check: compares mem_request sum against usable system RAM
    • Hard block if requests exceed usable memory (total - 384MB reserved)
    • Soft warning if mem_limit sum exceeds total RAM (overcommit OK for limits)
    • ParseMemoryMB() supports "500M", "1G", "1.5G", "1024" formats
    • CommittedMemory() sums requests/limits across all deployed stacks
    • Memory summary bar shown on deploy page before user clicks deploy
    • system.reserved_memory_mb configurable in controller.yaml (default: 384)
  • Display: ~ prefix on mem_request in UI badges (display-only, exact value stored)
  • Felhom.eu logo: replaced the text logos in the sidebar and login page with the actual SVG logo
    • Logo SVG embedded as Go string constant, served at /static/felhom-logo.svg
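
A rough sketch of the memory parsing and the hard-block check follows. Two assumptions are made that the notes do not spell out: a bare number is read as megabytes, and 1G is taken as 1024 MB. parseMemoryMB and fitsInMemory are illustrative names rather than the controller's actual functions.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parseMemoryMB converts "500M", "1G", "1.5G" or a bare "1024" to whole
// megabytes. Assumption: bare numbers are MB and 1G equals 1024 MB.
func parseMemoryMB(s string) (int, error) {
	v := strings.ToUpper(strings.TrimSpace(s))
	mult := 1.0
	switch {
	case strings.HasSuffix(v, "G"):
		mult, v = 1024, strings.TrimSuffix(v, "G")
	case strings.HasSuffix(v, "M"):
		v = strings.TrimSuffix(v, "M")
	}
	f, err := strconv.ParseFloat(v, 64)
	if err != nil || f < 0 || math.IsNaN(f) || math.IsInf(f, 0) {
		return 0, fmt.Errorf("invalid memory value %q", s)
	}
	return int(f * mult), nil
}

// fitsInMemory applies the hard block: already committed requests plus the
// new request must not exceed total RAM minus the reserved amount.
func fitsInMemory(totalMB, reservedMB, committedMB, requestMB int) bool {
	return committedMB+requestMB <= totalMB-reservedMB
}

func main() {
	for _, in := range []string{"500M", "1G", "1.5G", "1024", "lots"} {
		mb, err := parseMemoryMB(in)
		fmt.Printf("%-5s -> %d MB (err: %v)\n", in, mb, err)
	}
	fmt.Println("fits:", fitsInMemory(8192, 384, 7000, 768))
}
```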

Previously completed (2026-02-14)

  • System info bar on Vezérlőpult dashboard: RAM, SSD, and optional HDD usage
    • Progress bars with color coding (green < 70%, yellow 70-85%, red > 85%)
    • New internal/system package reads /proc/meminfo + syscall.Statfs
    • Platform-specific: Linux impl + non-Linux stub (build tags)
    • Hungarian labels: "Memória", "SSD tárhely", "Külső HDD"
  • Docker Compose memory limits on paperless-ngx template:
    • paperless-webserver: 768M, postgres: 256M, redis: 128M
    • Added mem_limit field to .felhom.yml ResourceHints (total: 1152M)
  • /api/system/info endpoint now returns live system metrics (was customer info)
  • Config: Added paths.hdd_path for external HDD monitoring
  • Controller image builds via build.sh, pushes to Gitea container registry

Previously completed (2026-02-13)

  • Built the entire felhom-controller from scratch (Go, no frameworks)
  • Debugged and fixed 7 issues during first real deployment:
    1. Password validation (empty passwords accepted)
    2. In-memory Deployed flag not updating after deploy
    3. Health-aware state parsing (starting/unhealthy detection)
    4. Random card ordering (Go map iteration)
    5. "Részletek" button redirect for deployed apps
    6. Paperless OCR language installation (LANGUAGES vs LANGUAGE env var)
    7. Documentation: restart vs up -d for image updates
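
The random card ordering (item 4) comes from iterating a Go map directly. A minimal sketch of the usual remedy, with a hypothetical sortedNames helper:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedNames returns the map keys in a stable, alphabetical order so the
// UI renders cards the same way on every page load.
func sortedNames(stacks map[string]bool) []string {
	names := make([]string, 0, len(stacks))
	for name := range stacks {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	stacks := map[string]bool{"paperless-ngx": true, "immich": false, "mealie": false}
	fmt.Println(sortedNames(stacks))
}
```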

What's next (priorities)

  1. Configure Healthchecks.io UUIDs on demo-felhom.eu (replace CHANGEME in controller.yaml)
  2. Test backup flow — trigger manual backup via dashboard, verify restic repo + DB dumps
  3. Test orphan delete flow — try deleting the orphaned filebrowser stack via the UI
  4. Add app_info + optional_config to more apps (start with Immich, Mealie, Vaultwarden)
  5. Deploy a second app (e.g., ActualBudget — simplest, or Immich — tests HDD + secrets)
  6. Add app screenshots to the asset pipeline (romm-screenshot-1.webp etc.)
  7. Test on Raspberry Pi (pi-customer-1)
  8. Add paths.hdd_path to demo-felhom controller.yaml to enable HDD bar
  9. Phase 4: Self-update mechanism

Architecture decisions

| Decision | Rationale |
|---|---|
| Go stdlib for web (no Gin/Echo) | Minimal dependencies, single binary, easy to embed templates |
| Templates as go:embed HTML/CSS files | Zero runtime file dependencies (compiled into binary), but each template is a separate editable file |
| Docker Compose for customers (not k8s) | Simpler troubleshooting, customers don't need k8s knowledge |
| k3s for management infra only | Viktor's own services (gitea, monitoring, website) run on k3s |
| Cloudflare Tunnel for remote access | No port forwarding needed, works behind any NAT |
| app.yaml per stack | Separates deploy config from compose files, survives git pulls |
| Password fields require explicit input | Prevents accidental empty-password deployments |
| Health-aware state from Docker Status field | Docker's State says "running" even for unhealthy containers |
| Memory limits via deploy.resources.limits | Prevents runaway containers; ~50% headroom over expected usage |
| System info from /proc/meminfo + statfs | No external dependencies, cheap to read on each page load |
| mem_request vs mem_limit (K8s-inspired) | Requests = expected usage (hard block), limits = peak (overcommit OK) |
| 384MB reserved for system | Prevents deploying apps that would starve the OS/controller |
| Logo SVG embedded as Go constant | Same approach as CSS/HTML: zero external file deps |
| Git sync via os/exec git CLI | No Go git library needed, git is in the container image |
| SHA-256 for content comparison | Only copy changed files, avoid unnecessary disk writes |
| 30s debounce on manual sync | Prevents spamming the git server |
| Orphan = deployed but not in catalog | Safe lifecycle: remove from catalog → mark orphaned → user deletes via UI |
| FileBrowser as infra (not catalog) | Needed even after apps deleted (user browses HDD data); deployed by setup script |
| Protected HDD paths | Safety net: never delete top-level HDD dirs (media, storage, Dokumentumok, appdata) |
| Central scheduler (not ad-hoc goroutines) | Single place to register/monitor all periodic tasks, graceful shutdown, skip-if-running |
| CPU sampling via background goroutine | /proc/stat delta needs two readings; collector runs every 5s, GetInfo() reads cached value |
| Temperature from /host/sys (Docker mount) | Container can't read host /sys directly; mount /sys:/host/sys:ro, try /host/sys first |
| Restic password auto-generated | No manual setup needed; generated on first backup run, stored in named volume |
| DB discovery via docker inspect | No config needed; discovers postgres/mariadb containers by image name + env vars |
| Backup orchestrator with running flag | Prevents concurrent backups, supports both scheduled and manual trigger |
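
To make the "delta needs two readings" point concrete, here is a sketch of turning two /proc/stat samples into a CPU percentage. It assumes the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) and counts iowait as idle time, which is a choice of this sketch; cpuSample, parseCPULine and usagePercent are hypothetical names, not the API of cpu_linux.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds idle and total jiffies from the aggregate "cpu" line.
type cpuSample struct{ idle, total uint64 }

// parseCPULine parses the first line of /proc/stat.
func parseCPULine(line string) (cpuSample, error) {
	f := strings.Fields(line)
	if len(f) < 5 || f[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("not an aggregate cpu line: %q", line)
	}
	var s cpuSample
	for i, v := range f[1:] {
		if i >= 8 {
			break // guest and guest_nice are already counted in user and nice
		}
		n, err := strconv.ParseUint(v, 10, 64)
		if err != nil {
			return cpuSample{}, err
		}
		s.total += n
		if i == 3 || i == 4 { // idle and iowait
			s.idle += n
		}
	}
	return s, nil
}

// usagePercent computes the busy share between two samples.
func usagePercent(prev, cur cpuSample) float64 {
	dTotal := float64(cur.total) - float64(prev.total)
	if dTotal <= 0 {
		return 0 // no time elapsed, or counters went backwards
	}
	dIdle := float64(cur.idle) - float64(prev.idle)
	return (dTotal - dIdle) / dTotal * 100
}

func main() {
	prev, _ := parseCPULine("cpu  100 0 50 800 50 0 0 0 0 0")
	cur, _ := parseCPULine("cpu  150 0 75 1200 75 0 0 0 0 0")
	fmt.Printf("CPU usage: %.1f%%\n", usagePercent(prev, cur))
}
```

A single reading only gives counters accumulated since boot, which is why a background collector that keeps the previous sample is needed for a live figure.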

Key file locations on demo-felhom

/opt/docker/felhom-controller/         # Controller compose + config
  ├── controller.yaml                  # Customer config (domain, auth, paths)
  ├── docker-compose.yml               # Controller's own compose
  └── .env                             # DOMAIN=demo-felhom.eu

/opt/docker/stacks/                    # All app stacks
  ├── traefik/                         # Reverse proxy (protected)
  ├── cloudflared/                     # Tunnel (protected)
  ├── paperless-ngx/                   # First deployed app ✅
  │   ├── docker-compose.yml
  │   ├── .felhom.yml                  # App metadata
  │   └── app.yaml                     # Deploy config (env vars, locked fields)
  └── whoami/                          # Test stack (not deployed)

/mnt/hdd_placeholder/storage/          # HDD storage for apps
  └── paperless/
      ├── consume/                     # Drop files here for OCR
      ├── media/                       # Processed documents
      └── export/                      # Backup exports

| Repository | Status | Notes |
|---|---|---|
| deploy-felhom-compose | Active | This repo. Controller code + deploy scripts |
| app-catalog-felhom.eu | Active | 10 app templates, all with .felhom.yml metadata + memory limits |
| felhom.eu | Stable | Website live, SEO indexed, email working |
| homelab-manifests | Stable | k3s cluster running (dooplex.hu services) |
| misc-scripts | Utility | collect-repo.sh, backup helpers |

Gotchas & lessons learned

  • docker compose restart ≠ docker compose up -d: restart doesn't pick up new images
  • Go maps have random iteration order — always sort slices before displaying
  • Docker .State="running" doesn't mean healthy — check .Status for "(health: starting)" / "(unhealthy)"
  • Paperless-ngx needs PAPERLESS_OCR_LANGUAGES (plural) to install language packs, PAPERLESS_OCR_LANGUAGE (singular) to select
  • In-memory Deployed flag must be set BEFORE docker compose up -d (not after) — compose can take 30-60s for image pulls, during which the UI would show a stale "Telepítés" button
  • Cloudflare Tunnel handles *.demo-felhom.eu → Traefik handles Host()-based routing to containers
  • BIOS "AC Power Recovery" must be enabled on N100 for auto-restart after power outage
  • docker compose up -d returns exit 0 even when containers immediately crash-loop — need post-start status check to detect this
  • When logging env vars for debugging, only log keys (not values) to avoid leaking secrets in log files
  • Mealie image (ghcr.io/mealie-recipes/mealie) doesn't include wget/curl — use Python TCP socket check for healthcheck
  • Mealie DB migrations on first start take ~40s (alembic) — use start_period: 60s to avoid premature unhealthy status
  • Alpine-based images (filebrowser, vaultwarden) have wget via BusyBox — healthchecks with wget --spider work fine
  • Deploy sed command to update image version must target only the image: line. A naive sed 's|name:OLD|name:NEW|' also matches the service name line (e.g., felhom-controller: → felhom-controller:0.2.12), breaking YAML. Use sudo sed -i 's|image:.*felhom-controller:[^ ]*|image: ...felhom-controller:NEW|' or a similarly scoped pattern
  • Hungarian quotation marks „” in YAML: the opening „ (U+201E) is safe inside YAML double-quoted strings, but the closing mark must NOT be ASCII " (0x22), which terminates the YAML string. Use the \" escape or Unicode ” (U+201D). This caused a silent parse failure for the entire .felhom.yml file
  • Never silently swallow parse errors — always log them. Silent failures make debugging impossible (took a dedicated debug session to find a simple quoting issue)
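
The State versus Status gotcha above can be sketched as a small mapping function. effectiveState is a hypothetical name; the status strings follow the "(health: starting)" and "(unhealthy)" markers quoted in the list.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveState refines Docker's coarse State using the human-readable
// Status text, which carries the healthcheck result in parentheses.
func effectiveState(state, status string) string {
	if state != "running" {
		return state // exited, created, restarting, ...
	}
	switch {
	case strings.Contains(status, "(health: starting)"):
		return "starting"
	case strings.Contains(status, "(unhealthy)"):
		return "unhealthy"
	}
	return "running"
}

func main() {
	fmt.Println(effectiveState("running", "Up 5 seconds (health: starting)"))
	fmt.Println(effectiveState("running", "Up 3 minutes (unhealthy)"))
	fmt.Println(effectiveState("running", "Up 2 hours (healthy)"))
	fmt.Println(effectiveState("exited", "Exited (1) 10 seconds ago"))
}
```

Containers without a healthcheck have no parenthesised marker in Status, so they fall through to plain "running".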