# CONTEXT.md — Project Memory

> This file serves as persistent project memory across Claude Code sessions.
> It replaces the auto-generated "Memory" from the claude.ai Project.
> **Update this file at the end of each working session** with current state,
> recent decisions, and anything the next session needs to know.
>
> Ask Claude Code: "Please update CONTEXT.md with what we did today"

Last updated: 2026-06-13 (v0.56.0 Phase 4: FileBrowser scoping + UI polish)

> **NOTE:** this file is stale below this banner (last full pass session 59 / v0.16.1). Current state
> is tracked in `CHANGELOG.md`, `controller/README.md`, and the auto-memory `MEMORY.md`. Live version:
> **v0.56.0**.
>
> **2026-06-13 — v0.56.0 Phase 4: FileBrowser scoping + UI polish (SLICE COMPLETE):**
> - 4A: FileBrowser bind scoped to `<drive>/appdata` (recovery units + Tier 2 copies under `backups/`
>   NOT mounted → customer can't browse/delete the restore source). 4B: deploy storage step states
>   files-on-drive / DB-on-fast-SSD. 4C: `buildStorageBars` stable sort + purpose description on the
>   monitoring list (user-data drives only; agent local/local-lvm/pbs live on the storage page, not here).
> - Live-validated (9201): FileBrowser mount `/mnt/felhom-usb/appdata -> /srv/felhom-usb` (backups hidden);
>   deploy + monitoring text rendered. **All 5 phases (1, 2, 2b, 3, 4) shipped + live-validated, v0.52→v0.56.**
>
> **2026-06-13 — v0.55.0 Phase 3: auto off-drive Tier 2 (rootfs-headroom guard):**
> - `internal/backup/tier2.go`: rsync `-a --delete` of each HDD app's recovery unit + appdata → a
>   DIFFERENT physical disk (`<target>/backups/secondary/<app>/`). Auto target: prefer another registered
>   drive (off-disk via `system.SamePhysicalDevice`), else internal SSD for SMALL units only.
> - **Rootfs-headroom guard** (`tier2FitsHeadroom`, unit-tested): SSD = ~8G guest rootfs, so REFUSE
>   unless the unit fits leaving reserve = max(2G, 20%) free; honest "needs 2nd HDD" status when nothing
>   fits — never fills the rootfs. Status via surviving `settings.CrossDriveBackup`; "2. mentés" UI card
>   now populated (`buildAppBackupRows`). Daily `tier2-backup` 03:30 + `POST /api/backup/tier2`.
> - **Live-validated (9201):** happy path (RomM → SSD, off felhom-usb, 77KB, "[SSD: DB/config only]");
>   refuse path (1G userdata dummy → REFUSED with honest msg, rootfs not filled); UI card shows
>   "Sikeres → belső SSD (csak DB/konfiguráció)". Demo cleaned.
> - Next: Phase 4 (FileBrowser scoping + deploy-UI DB-on-SSD note + monitoring sort).
>
> **2026-06-13 — v0.53.0/v0.53.1 Phase 2: per-app recovery unit (capture side, SECRET-FREE):**
> - Each app's `backups/primary/<app>/` becomes a self-contained recovery unit: `compose/`
>   (docker-compose.yml + .felhom.yml + **secret-stripped** app.yaml) + db-dumps/ + volume-dumps/ +
>   `manifest.json` (image pins, secret env-var NAMES, data_key names, checksums, secret_source note).
> - **Secret-free by design.** Decided after reading the ACTUAL hub code: hub is zero-knowledge (no app
>   secrets); app.yaml + key live on the guest rootfs → in the PBS whole-guest snapshot. So the unit
>   stores no secret/data-key/image; restore recovers secrets from the guest's app.yaml (live/PBS),
>   regenerates nothing. `data_key` (DeployField.DataKey; AdventureLog SECRET_KEY marked) = fail-closed
>   restore annotation only.
> - Capture needs no decryption (non-secret env is plaintext; excludes secret-named + encrypted keys).
>   Wired into RunDBDumps AND the periodic RefreshCache (idempotent checksum-skip → no USB thrash).
> - **Deploy mechanism resolved:** controller in guest 9201 is golden/bootstrap-managed —
>   `felhom-controller-bootstrap.service` docker-runs the tag from `/etc/felhom-controller-image`
>   (gitea anon-pull). Deploy = build+push → anon-pull → update tag file → restart the service.
> - **Live-validated (9201):** RomM unit captured (images=3, secrets=3, data_keys=0), secret-leak grep
>   = NO_LEAK.
> - **v0.54.0 Phase 2b (restore-from-unit + fail-closed gate):** `RestoreFromRecoveryUnit` recreates an
>   app from its unit + secrets recovered from the GUEST's live app.yaml (`RecoverStackSecrets`,
>   `stacks.RedeployFromEnv`), regenerating nothing. `reconcileRestoreSecrets` (pure, unit-tested) is the
>   fail-closed gate: missing/empty data-key → REFUSE (needs PBS whole-guest restore); missing resettable
>   secret → warn+proceed. Wired into `/backup/restore`. Gate + orchestration + data_key parsing
>   unit/integration-tested; deployed v0.54.0 healthy.
> - **LIVE-validated (9201, AdventureLog):** unit manifest `data_key_env_vars:[SECRET_KEY]`
>   (catalog→manifest live); with SECRET_KEY made unrecoverable, `POST /backup/restore` REFUSED with the
>   exact fail-closed message BEFORE any compose-up. Demo has NO dashboard password → API open (auth+CSRF
>   skipped), driven via public URL. NOTE: full deploy-with-data→restore e2e blocked because AdventureLog
>   images don't fit the 8G guest rootfs ("no space left") — that's the Phase 3 rootfs-headroom concern
>   seen live. Demo left clean (AdventureLog reverted to not-deployed).
> - Next: Phase 3 (Tier 2 auto off-drive, rootfs-headroom guard), Phase 4 (FileBrowser + UI).
>
> **2026-06-13 — v0.52.0 Phase 1 GATE: deploy-side double-nest fix (catalog) + path-agreement test:**
> - The `felhom-data` double-nest lived in the **app-catalog compose templates**
>   (`${HDD_PATH}/felhom-data/appdata/<app>`), not in `deploy.go`. On a Model-A in-guest drive the mount
>   already IS the `felhom-data` namespace, so it double-nested on disk while the v0.51.0 backup helpers
>   resolved single-nested → divergence. Fixed all four HDD templates (romm, nextcloud, immich,
>   paperless-ngx) → `${HDD_PATH}/appdata/<app>`.
> - New `internal/stacks/hddpath_agreement_test.go` locks deploy-resolver (`ParseComposeHDDMounts`) ==
>   backup helper (`AppDataDir(NamespaceRoot(.,true))`). No controller runtime change → no image rebuild
>   (deployed stays 0.51.0, functionally current; golden not rebaked for a no-op).
> - **Live (guest 9201):** git-sync auto-delivered the fix to all four stack files; RomM migrated
>   (stop→move→verify→redeploy) from `/mnt/felhom-usb/felhom-data/appdata/romm` →
>   `/mnt/felhom-usb/appdata/romm`, healthy + HTTP 200, no data loss, old namespace empty. **GATE PASSED.**
> - Next: Phase 2 (per-app recovery unit), Phase 3 (auto-enabled off-drive Tier 2 w/ rootfs-headroom
>   guard), Phase 4 (FileBrowser scoping + deploy-UI DB-on-SSD note + monitoring sort).
>
> **2026-06-12 — storage UX polish (v0.45.0), pairs with felhom-agent v0.24.0:**
> - **Agent eject role-gate (Part A, felhom-agent v0.24.0):** `POST /disks/eject` now refuses to
>   unmount system/backup storage *at the agent* (fail-safe to protected on ambiguity) — the UI hiding
>   the button was never the control. Validated live on guest 9201 (eject `/var/lib/vz` → 403, no unmount).
> - **Controller (Part B, v0.45.0):** B1 deterministic `/api/disks` order (user-data→system→backup,
>   alpha within); B2 init wizard excludes mounted drives; B3 **Regisztrálás** primary action for a
>   mounted-but-unregistered user-data drive (`POST /api/storage/register`); B4 per-card purpose
>   descriptions + app-backing tags + tiering note (`local` & `local-lvm` both kept); B5 eject already
>   names affected apps. All validated live on guest 9201.
> - **Golden REBAKED** to controller 0.45.0 (`/root/build-golden.sh`; gitea allows anon pull, no creds);
>   archive `local:backup/vzdump-lxc-9100-2026_06_12-09_53_03.tar.zst`, build guest 9100 purged.

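The Phase 3 rootfs-headroom guard from the notes above can be sketched as follows. This is a minimal illustration, not the code in `internal/backup/tier2.go`: the names `reserveFor` and `fitsHeadroom`, their signatures, and the reading of "20%" as 20% of the filesystem's total size are assumptions made for the example.

```go
package main

import "fmt"

const (
	gib = uint64(1) << 30
	// Reserve policy from the Phase 3 note: max(2G, 20%).
	minReserveBytes = 2 * gib
	reservePercent  = 20
)

// reserveFor returns the free space that must remain after a copy.
// Assumption: the 20% is taken of the filesystem's total size.
func reserveFor(totalBytes uint64) uint64 {
	pct := totalBytes / 100 * reservePercent
	if pct > minReserveBytes {
		return pct
	}
	return minReserveBytes
}

// fitsHeadroom reports whether a unit of unitBytes can be copied onto a
// filesystem with freeBytes available while still leaving the reserve free.
// Hypothetical helper; the real tier2FitsHeadroom signature is not shown here.
func fitsHeadroom(unitBytes, freeBytes, totalBytes uint64) bool {
	reserve := reserveFor(totalBytes)
	if freeBytes < reserve {
		return false // already below the reserve: refuse outright
	}
	return unitBytes <= freeBytes-reserve
}

func main() {
	total, free := 8*gib, 4*gib // illustrative numbers for a small guest rootfs
	fmt.Println(fitsHeadroom(77*1024, free, total)) // small DB/config unit
	fmt.Println(fitsHeadroom(3*gib, free, total))   // unit larger than the headroom
}
```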
---

## About Viktor (project owner)

- Works at Deutsche Telekom (Budapest), building Felhom.eu as a side business
- Felhom.eu: managed home-server service for Hungarian households
- Technical but prefers pragmatic solutions over over-engineering
- Runs all infrastructure on Gitea (gitea.dooplex.hu), k3s cluster for management
- Customer deployments use Docker Compose (not Kubernetes) for simplicity

### felhom-controller (this repo)
- **Version:** v0.16.1
- **Phase 1:** ✅ COMPLETE — Stack Manager + Deploy Flow
- **Phase 2:** ✅ COMPLETE — Monitoring & Health (scheduler, CPU/temp, healthchecks.io pings)
- **Phase 3:** ✅ COMPLETE — Backups (DB dumps, restic integration, manual trigger, **dedicated backup page**)
- **Phase 4:** ✅ COMPLETE — Monitoring Page with Metrics Store (SQLite, Chart.js, system + container metrics)
- **Phase 5:** ✅ COMPLETE — Authentication, Persistence & Settings Page (settings.json, password change, session management)
- **Phase 6:** ✅ COMPLETE — Monitoring Warnings, Dashboard Alerts & Notification System
- **Phase 7:** ✅ COMPLETE — Storage Overview, Per-App Backup Toggles & Limited Restore
- **Phase A:** ✅ COMPLETE — Storage Paths Foundation (registry, auto-discovery, per-app HDD_PATH, deploy dropdown, health monitoring)
- **Phase B:** ✅ COMPLETE — Storage Management UI Polish & Health Severity Fix (flash messages, label editing, app details, FS info, deploy free space, backup context)
- **Phase C:** ✅ COMPLETE — Storage Init Wizard, Data Migration & Startup Fix (disk scan/format/mount wizard, rsync-based migration, startup pings)
- **v0.11.1 bugfix:** ✅ COMPLETE — Storage Scan: system disk detection via host fstab + blkid UUID resolution; FSType enrichment via `blkid -o export`
- **v0.11.2 bugfix:** ✅ COMPLETE — /host-dev mount for block device access; `HostDevicePath()` helper; all format/scan/safety ops use /host-dev
- **v0.11.3 bugfix:** ✅ COMPLETE — Added `fdisk` package to Dockerfile (provides `sfdisk`; not in `util-linux` on Debian bookworm)
- **v0.11.4 bugfix:** ✅ COMPLETE — FormatAndMount: fixed sfdisk (wipefs+force+`,,`), mount (explicit device path), mount propagation (rshared), ASCII label, smart partition skip, findmnt verification
- **v0.11.6:** ✅ COMPLETE — FileBrowser auto-mount sync (`syncFileBrowserMounts()`) + 3 UI fixes (badge color, progress bar, button text)
- **v0.11.7:** ✅ COMPLETE — Stale data cleanup + FileBrowser sync after migration + deploy page title fix
- **v0.11.8:** ✅ COMPLETE — Per-App Cross-Drive Backup (3-2-1 rule): rsync/restic to secondary drive, deploy page UI, backup page summary, scheduler jobs, API endpoints
- **v0.11.9:** ✅ COMPLETE — UI Polish Fixes: spacing, tooltip on "Módszer", status dot instead of disabled checkbox, progressive disclosure, emoji cleanup
- **First app deployed:** Paperless-ngx on demo-felhom.eu (2026-02-13)
- **Running on:** demo-felhom (N100 mini PC) at 192.168.0.162:8080, felhotest (Proxmox VM) at router.abonet.hu:33022
- **All Phase 1-5 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth, monitoring, backups, backup detail page, system monitoring page, settings page

## Architecture decisions

| Decision | Rationale |
|----------|-----------|
| Go stdlib for web (no Gin/Echo) | Minimal dependencies, single binary, easy to embed templates |
| Templates as go:embed HTML/CSS files | Zero runtime file dependencies (compiled into binary), but each template is a separate editable file |
| Docker Compose for customers (not k8s) | Simpler troubleshooting, customers don't need k8s knowledge |
| k3s for management infra only | Viktor's own services (gitea, monitoring, website) run on k3s |
| Cloudflare Tunnel for remote access | No port forwarding needed, works behind any NAT |
| app.yaml per stack | Separates deploy config from compose files, survives git pulls |
| Password fields require explicit input | Prevents accidental empty-password deployments |
| Health-aware state from Docker Status field | Docker's State says "running" even for unhealthy containers |
| Memory limits via deploy.resources.limits | Prevents runaway containers; ~50% headroom over expected usage |
| System info from /proc/meminfo + statfs | No external dependencies, cheap to read on each page load |
| mem_request vs mem_limit (K8s-inspired) | Requests = expected usage (hard block), limits = peak (overcommit OK) |
| 384MB reserved for system | Prevents deploying apps that would starve the OS/controller |
| Logo SVG embedded as Go constant | Same approach as CSS/HTML — zero external file deps |
| Git sync via os/exec git CLI | No Go git library needed, git is in the container image |
| SHA-256 for content comparison | Only copy changed files, avoid unnecessary disk writes |
| 30s debounce on manual sync | Prevents spamming the git server |
| Orphan = deployed but not in catalog | Safe lifecycle: remove from catalog → mark orphaned → user deletes via UI |
| FileBrowser as infra (not catalog) | Needed even after apps deleted (user browses HDD data); deployed by setup script |
| Protected HDD paths | Safety net: never delete top-level HDD dirs (media, storage, Dokumentumok, appdata) |
| Central scheduler (not ad-hoc goroutines) | Single place to register/monitor all periodic tasks, graceful shutdown, skip-if-running |
| CPU sampling via background goroutine | /proc/stat delta needs two readings — collector runs every 5s, GetInfo() reads cached value |
| Temperature from /host/sys (Docker mount) | Container can't read host /sys directly — mount /sys:/host/sys:ro, try /host/sys first |
| Restic password auto-generated | No manual setup needed — generated on first backup run, stored in named volume |
| DB discovery via docker inspect | No config needed — discovers postgres/mariadb containers by image name + env vars |
| Backup orchestrator with running flag | Prevents concurrent backups, supports both scheduled and manual trigger |
| modernc.org/sqlite (pure Go) | No CGO/gcc needed in Docker build stage — keeps `CGO_ENABLED=0` static binary |
| AlertManager state-based refresh | Alerts regenerated every 5min from health report — no persistent storage needed, always reflects current state |
| Notification relay via hub | Controller → hub → Resend → email. Hub acts as central relay: knows customer email, handles Resend API. Controller only needs hub URL + API key |
| In-memory notification cooldowns | Per-event-type cooldown map (default 6h). Lost on restart = acceptable (better to re-notify than miss). No persistence needed |
| Health status change detection | Only notify on degradation (ok→warn, ok→fail, warn→fail). Avoids spam on flapping. First run records baseline, doesn't notify |
| Resend HTTP API (no SMTP) | Direct POST to api.resend.com — same pattern as website contact-mailer. Simpler than SMTP setup, good deliverability |
| Preferences sync on save + startup | Controller pushes prefs to hub (not pull). Startup sync handles hub DB rebuild. Local save always succeeds even if sync fails |
| Chart.js embedded locally | Customer hardware may not have internet — CDN not reliable for offline environments |
| StackDataProvider interface | backup package needs stack data but can't import stacks (circular). Interface in backup, thin adapter in main.go |
| Password sync to hub via report | Restic password in Docker named volume on SSD. Hub sync provides redundancy for disaster recovery |
| App backup via HDD mounts only | Docker volumes at /var/lib/docker/volumes/ not mounted in controller. HDD data is the important user data; DB in volumes covered by nightly dump |
| Restore uses running mutex | Prevents concurrent backup+restore on same restic repo. Reuses existing `m.running` flag |
| Storage paths registry in settings.json | Multi-storage support: each app's HDD_PATH from app.yaml is authoritative. Auto-discovery on startup avoids manual config. Registry enables UI management + health monitoring per path |
| /mnt:/mnt:rw mount in controller | Replaces per-path HDD_PATH mount. Enables multi-storage + restore writes. All customer HDD mounts are under /mnt/ by convention |
| Per-app HDD_PATH resolution (app.yaml > global) | App's own env HDD_PATH is Priority 1, registered storage paths as fallback. Eliminates dependency on global controller.yaml hdd_path |
| Mount-point detection via syscall.Stat_t.Dev | Compares device ID of path vs parent dir — reliable check that path is on separate filesystem. Prevents data writes to SSD |
| Health severity: mount-point = warning | Non-mount-point is informational, not a service failure. FAIL reserved for genuinely broken things. Avoids false alarms on demo/test environments |
| FS info via findmnt + sysfs | `findmnt -n -o SOURCE,FSTYPE --target <path>` for filesystem type/device. `/sys/block/<dev>/device/model` for disk model. Best-effort, returns nil on failure |
| Query param flash messages | Stateless, no session store needed. Consistent with backup page pattern. `?storage_msg=success&storage_detail=...` |
| StorageLabels map on stacks page | Separate map passed to template (not modifying Stack struct). Built from deployed apps' HDD_PATH → registered path label lookup |
| Metrics downsampling via SQL | Bucket-based AVG in GROUP BY keeps Chart.js responsive with up to 30 days of data |
| 60s metrics collection interval | Good balance of resolution vs. storage — ~44K rows/month for system metrics |
| /etc/os-release mounted read-only | Container can't read host OS info directly — mount to /host/etc/os-release:ro |

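The "Health status change detection" row above (notify only on degradation, first run records a baseline) can be illustrated with a small sketch. The `Status` levels, the `Detector` type, and `Observe` are hypothetical names chosen for this example and are not taken from the controller code. Repeated flapping would still need the separate cooldown map described in the table.

```go
package main

import "fmt"

// Status is an ordered health level: a higher value means worse health.
type Status int

const (
	StatusOK Status = iota
	StatusWarn
	StatusFail
)

// Detector remembers the last status seen per check (hypothetical type).
type Detector struct {
	last map[string]Status
}

// NewDetector returns a Detector with no recorded baselines.
func NewDetector() *Detector {
	return &Detector{last: make(map[string]Status)}
}

// Observe records the current status and reports whether to notify.
// It returns true only when the status is worse than the previous one;
// the first observation of a check only sets the baseline.
func (d *Detector) Observe(check string, cur Status) bool {
	prev, seen := d.last[check]
	d.last[check] = cur
	return seen && cur > prev
}

func main() {
	d := NewDetector()
	fmt.Println(d.Observe("disk", StatusWarn)) // first run: baseline only
	fmt.Println(d.Observe("disk", StatusFail)) // warn -> fail: degradation
	fmt.Println(d.Observe("disk", StatusOK))   // recovery: no notification
	fmt.Println(d.Observe("disk", StatusWarn)) // ok -> warn: degradation
}
```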
## Key file locations on demo-felhom

```
/opt/docker/felhom-controller/         # Controller compose + config
  ├── controller.yaml                  # Customer config (domain, auth, paths)
  ├── docker-compose.yml               # Controller's own compose
  └── data/                             # Controller persistent data (named volume)

/opt/docker/stacks/                    # All app stacks
  ├── traefik/                         # Reverse proxy (protected)
  ├── cloudflared/                     # Tunnel (protected)
  ├── paperless-ngx/                   # First deployed app ✅
  │   ├── docker-compose.yml
  │   ├── .felhom.yml                  # App metadata
  │   └── app.yaml                     # Deploy config (env vars, locked fields)
  └── whoami/                          # Test stack (not deployed)

/mnt/hdd_placeholder/storage/          # HDD storage for apps
  └── paperless/
      ├── consume/                     # Drop files here for OCR
      ├── media/                       # Processed documents
      └── export/                      # Backup exports
```

## Related repositories and their state

| Repository | Status | Notes |
|------------|--------|-------|
| felhom-controller | Active | This repo. Controller code + deploy scripts |
| app-catalog-felhom.eu | Active | 10 app templates, all with .felhom.yml metadata + memory limits |
| felhom.eu | Active | Website + hub/ subfolder (felhom-hub service) + k8s manifests |
| homelab-manifests | Stable | k3s cluster running (dooplex.hu services) |
| misc-scripts | Utility | collect-repo.sh, backup helpers |

## Gotchas & lessons learned

- `docker compose restart` ≠ `docker compose up -d` — restart doesn't pick up new images
- Go maps have random iteration order — always sort slices before displaying
- Docker `.State`="running" doesn't mean healthy — check `.Status` for "(health: starting)" / "(unhealthy)"
- Paperless-ngx needs `PAPERLESS_OCR_LANGUAGES` (plural) to install language packs, `PAPERLESS_OCR_LANGUAGE` (singular) to select
- In-memory Deployed flag must be set BEFORE `docker compose up -d` (not after) — compose can take 30-60s for image pulls, during which the UI would show a stale "Telepítés" button
- Cloudflare Tunnel handles *.demo-felhom.eu → Traefik handles Host()-based routing to containers
- BIOS "AC Power Recovery" must be enabled on N100 for auto-restart after power outage
- `docker compose up -d` returns exit 0 even when containers immediately crash-loop — need post-start status check to detect this
- When logging env vars for debugging, only log keys (not values) to avoid leaking secrets in log files
- Mealie image (`ghcr.io/mealie-recipes/mealie`) doesn't include wget/curl — use Python TCP socket check for healthcheck
- Mealie DB migrations on first start take ~40s (alembic) — use `start_period: 60s` to avoid premature unhealthy status
- Alpine-based images (filebrowser, vaultwarden) have wget via BusyBox — healthchecks with `wget --spider` work fine
- Deploy `sed` command to update image version must target only the `image:` line — naive `sed 's|name:OLD|name:NEW|'` also matches the service name line (e.g., `felhom-controller:` → `felhom-controller:0.2.12`), breaking YAML. Use `sudo sed -i 's|image:.*felhom-controller:[^ ]*|image: ...felhom-controller:NEW|'` or similar scoped pattern
- Hungarian quotation marks `„”` in YAML: the opening `„` (U+201E) is safe inside YAML double-quoted strings, but the closing mark must NOT be typed as ASCII `"` (0x22), because that terminates the YAML string. Use the `\"` escape or the Unicode `”` (U+201D). This caused a silent parse failure for the entire `.felhom.yml` file
- Never silently swallow parse errors — always log them. Silent failures make debugging impossible (took a dedicated debug session to find a simple quoting issue)
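
The `.State` versus `.Status` gotcha above can be illustrated with a short sketch. It assumes the two strings have already been read from Docker (for example via `docker ps` output); the function name `effectiveState` and the labels it returns are hypothetical, not the controller's actual identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveState derives a display state from Docker's State and Status
// fields. State alone reports "running" even for unhealthy containers,
// so the health suffix in Status is checked as well.
// The returned labels are illustrative only.
func effectiveState(state, status string) string {
	if state != "running" {
		return state
	}
	switch {
	case strings.Contains(status, "(unhealthy)"):
		return "unhealthy"
	case strings.Contains(status, "(health: starting)"):
		return "starting"
	default:
		return "running"
	}
}

func main() {
	fmt.Println(effectiveState("running", "Up 3 minutes (unhealthy)"))
	fmt.Println(effectiveState("running", "Up 2 seconds (health: starting)"))
	fmt.Println(effectiveState("running", "Up 5 minutes (healthy)"))
	fmt.Println(effectiveState("exited", "Exited (1) 10 seconds ago"))
}
```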