admin 086281b582 docs: reflow CLAUDE.md; unify REPORT/CHANGELOG convention; add no-secrets rule
Reflow removes hard mid-paragraph line wraps (code blocks and tables untouched);
rendered output unchanged. Adds the uniform CHANGELOG (cumulative) / REPORT
(overwrite-latest) convention plus a no-secrets rule. Docs/meta only, no version bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 20:54:45 +02:00


# CLAUDE.md — Project Instructions for Claude Code
> This file is read automatically by Claude Code at the start of every session. It replaces the "Instructions" panel from the claude.ai Project. Keep it updated as the project evolves.
!!! IMPORTANT !!!
- Always update CHANGELOG.md whenever you modify the code and push to git!!
- IF a controller feature changes (new, modified or removed), always update the relevant part of controller/README.md with the architectural change!!
## Project overview
We are building a business (Felhom) that deploys home servers for Hungarian customers. This repository (`felhom-controller`) contains the felhom-controller, a Go application that manages Docker Compose stacks on customer hardware via a Hungarian-language web dashboard.
See `controller/README.md` for the full architecture and status. Update it after each session and keep track of how the individual functions/features operate (backup, monitoring, storage handling, app management, user settings, update workflow, notification system, etc.).
See `CHANGELOG.md` for recent work (update after each session — see "Working with CHANGELOG.md" below).
See `CONTEXT.md` for current project state, decisions and roadmap (update after each session).
See `TASK.md` for the current task to implement (if it exists).
The Claude in Chrome extension is available; use it to test the web UI on demo-felhom.eu or to verify dashboard deployments in a browser.
## System context — the Proxmox re-platform (READ THIS FIRST)
The project has **re-platformed onto Proxmox**, with a locked **three-component model**:
- **Hub** (`felhom.eu/hub/`) — operator backend on k3s.
- **Host agent** (`felhom-agent/`, formerly `proxmox-controller`) — one per Proxmox host; operator-tier; owns ALL Proxmox interaction.
- **In-guest controller** (THIS repo) — one per customer LXC; **Docker-only; holds NO Proxmox credentials**.
**This repo is being de-privileged.** In the target model, host/disk/Proxmox/Cloudflare responsibilities move OUT of the controller into the **host agent**: System info, Storage (disk scan/format/mount/migrate), the disk-tier Backup (restic, cross-drive, drive-restore, infra-backup), and the Cloudflare-API geo enforcement. The controller keeps the **app domain**: stack/deploy management, the Hungarian web UI, app-data backup (DB dumps + Docker-volume tars), metrics/telemetry, integrations, git-sync, notifications.
> **Authoritative map:** `felhom.eu/documentation/architecture/02-controller-module-map.md` — the per-package **KEEP / PORT / DELETE(→agent) / DELETE(obsolete) / MODIFY** classification. Read it before touching `backup/`, `storage/`, `cloudflare/`, `system/`, or `config/`. Also doc 01 (topology/trust) and doc 03 (the host agent).
**⚠️ Status — do NOT assume the target state is implemented.** The de-privileging has only *started*: the recent `internal/appbackup/` extraction split the keep-side app-data-backup primitives from the delete-side disk/host code (groundwork, no behaviour change). The **bulk strip has NOT happened** — the current code STILL contains the full privileged storage / restic / cross-drive / disk / Cloudflare stack. The strip + the agent-local-API client land at **~slice 8**. So the code you see is the **pre-strip, still-privileged** controller; match the code, not the target, unless a TASK says otherwise.
**Don't confuse the two ex-"controllers":** `felhom-agent` (host, operator-tier, was `proxmox-controller`) vs this `felhom-controller` (in-guest, was `deploy-felhom-compose`).
## Cross-repo & artifacts
- Workspace orientation (the felhom system, shared conventions, access) lives in the workspace-root `e:\git\CLAUDE.md`. Sibling per-repo files: `felhom-agent/CLAUDE.md`, `felhom.eu/CLAUDE.md`.
- **Artifact taxonomy:** `TASK.md` / `TASK-*.md` = a spec for YOU to implement (then push + update CHANGELOG + CONTEXT + README).
- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
> **In every repository where you make a change, update both files in that repo:**
> - **`CHANGELOG.md`** — a cumulative log of **all** changes; newest entry on top.
> - **`REPORT.md`** — **overwrite** with a summary of the **most recent** implementation (or significant validation/operational run) only; not cumulative.
>
> **Never write secrets** — tokens, passwords, private keys, API keys — into `CHANGELOG.md`, `REPORT.md`, or any committed file. Reference them as "stored out-of-band" instead.
## Code quality rules
- Always double-check generated code for bugs, logic issues, syntax errors
- Handle edge cases without overcomplicating the script/program
- Add debug capabilities (logging, verbose output) for easier troubleshooting
- If you need more input or troubleshooting command output, ask first — don't guess
## Environment
| Machine | OS | IP | Purpose |
|---------|----|----|---------|
| **Local (this machine)** | Windows 11 | — | Development, Claude Code runs here. Repos in `E:\git\` |
| **Build server (k3s, infra)** | Debian 13 | 192.168.0.180 | Build + push container images, k3s cluster |
| **Demo node** | Debian 13 | 192.168.0.162 | Test deployment (demo-felhom.eu) |
| **Demo node 2** | Debian 13 | router.abonet.hu:33022 | Remote test deployment |
## Workspace layout
Claude Code runs on Windows 11. The working directory is `E:\git\` (mapped as `/e/git/` in Git Bash). This repo is at:
```
E:\git\felhom-controller\          (or /e/git/felhom-controller/ in Git Bash)
├── controller/                    # Go application (main codebase)
│   ├── cmd/controller/            # Entry point (main.go)
│   ├── internal/
│   │   ├── config/                # YAML config loading
│   │   ├── settings/              # settings.json persistence (password hash, DB cache)
│   │   ├── stacks/                # Docker Compose operations, deploy flow
│   │   ├── sync/                  # Git sync: periodic pull of the app catalog repo
│   │   ├── api/                   # REST API endpoints
│   │   ├── system/                # System info (memory, disk)
│   │   └── web/                   # Dashboard UI
│   │       ├── server.go          # Server struct, routing, static serving
│   │       ├── auth.go            # Session auth, login/logout handlers
│   │       ├── handlers.go        # Page handlers (dashboard, stacks, deploy, etc.)
│   │       ├── funcmap.go         # Template function map
│   │       ├── embed.go           # go:embed directive for templates
│   │       ├── templates.go       # Felhom logo SVG constant
│   │       └── templates/         # go:embed HTML/CSS files (Hungarian UI)
│   ├── Dockerfile
│   ├── Makefile
│   └── go.mod
├── scripts/                       # Setup scripts for customer nodes
├── CLAUDE.md                      # This file
├── CHANGELOG.md                   # Changelog
├── CONTEXT.md                     # Project memory: current state, architectural decisions, roadmap
└── TASK.md                        # Current task (if it exists)
```
Related repos (same parent directory):
```
E:\git\app-catalog-felhom.eu\ # Docker Compose templates + .felhom.yml metadata per app
E:\git\felhom.eu\ # Website (htmls) + k3s manifests
E:\git\homelab-manifests\ # k3s cluster manifests (dooplex.hu services)
E:\git\misc-scripts\ # Helper scripts
```
All repos are hosted at `gitea.dooplex.hu/admin/`. Git credentials are stored (`git config credential.helper store`).
## SSH access
SSH key-based authentication is configured and working. No password prompts.
**IMPORTANT — SSH binary:** Claude Code runs in Git Bash, which has its own SSH at `/usr/bin/ssh` (= `C:\Program Files\Git\usr\bin\ssh.exe`). This binary does NOT have access to the Windows SSH agent and will fail silently (exit 0/141 with no output). Always use the Windows native OpenSSH binary with the full path:
```
SSH=/c/Windows/System32/OpenSSH/ssh.exe
```
All SSH commands in this file use `$SSH` — set it at the start of your session or substitute the full path manually.
| Host | OS | IP | User | Role |
|------|----|----|------|------|
| Build server | Debian 13 | 192.168.0.180 | kisfenyo | Build + push container images |
| Demo Proxmox host | Proxmox VE 9.2.2 | 192.168.0.162 | root (SSH alias `felhom-pve`; Proxmox user `root@pam`; no sudo) | `pveum`/`pct` + live Proxmox validation (available to CC) |
## Test environments
| Node | OS | Hardware | Domain | IP | Notes |
|------|-----|----------|--------|----|-------|
| demo-felhom | Debian 13 | Acemagic N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | Primary test node, Cloudflare Tunnel |
| felhotest | Debian 13 | Proxmox VM (4-16G RAM, 8 vCPU, 200G + 100G SCSI) | — | router.abonet.hu:33022 | Remote test node |
| pi-customer-1 | Debian 13 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | 192.168.0.161 | Secondary test, not yet active |
- Pi-hole DNS on local network forwards `*.demo-felhom.eu` → 192.168.0.162
- External access via Cloudflare Tunnel → Traefik reverse proxy
> **⚠️ Re-platform note:** per the host-agent work, `192.168.0.162` is now a **Proxmox host** (`demo-felhom`, PVE 9.2.2) — the demo-node tables above predate that. Confirm how/where the controller is currently deployed and tested post-re-platform before relying on the bare-metal `docker compose` deploy steps below; on the re-platformed node the controller may now run inside an LXC guest rather than directly on the host.
## Build & deploy workflow — MANDATORY
After making code changes to the controller, you **MUST** build, push, and deploy the new image. Do NOT leave code changes uncommitted or undeployed. The full cycle is:
### Step 1: Commit and push changes
```bash
cd /e/git/felhom-controller
git add -A && git commit -m "<descriptive message>" && git push
```
### Step 2: Build + push the container image on the build server
The build server (192.168.0.180) has the build toolchain. The version tag should be incremented from the current running version.
!! Important: use the `kisfenyo` user for SSH, as shown below
First, set the SSH variable (required for every session — Git Bash's built-in ssh does NOT work):
```bash
SSH=/c/Windows/System32/OpenSSH/ssh.exe
```
Check the current running version:
```bash
$SSH kisfenyo@192.168.0.162 "docker ps --filter name=felhom-controller --format '{{.Image}}'"
```
Then build with the next version (e.g., if the current version is 0.2.10, use 0.2.11). IMPORTANT: the build directory is `~/build/felhom-controller`.
```bash
$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller && git -C ~/git/felhom-controller pull && ./build.sh <NEW_VERSION> --push"
```
The build script:
- Pulls latest code from Gitea
- Builds a multi-arch Docker image (amd64 + arm64) when `--multiarch` is passed; with `--push` alone it builds only for the build server's own architecture
- Pushes to `gitea.dooplex.hu/admin/felhom-controller:<VERSION>`
- Expects the version as first argument (e.g., `0.2.11`)
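The "next version" rule from Step 2 can be made mechanical. The Go sketch below is illustrative only: `nextPatchVersion` is a hypothetical helper that does not exist in the repo, and it assumes the running tag is plain `MAJOR.MINOR.PATCH` (no `-rc` suffixes). It takes the raw output of the `docker ps --format '{{.Image}}'` check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextPatchVersion (hypothetical helper) takes an image reference such as
// "gitea.dooplex.hu/admin/felhom-controller:0.2.10" and returns the tag
// with its patch component incremented ("0.2.11").
// Assumption: the tag is plain MAJOR.MINOR.PATCH.
func nextPatchVersion(image string) (string, error) {
	image = strings.TrimSpace(image) // docker ps output ends with a newline
	i := strings.LastIndex(image, ":")
	if i < 0 || i == len(image)-1 {
		return "", fmt.Errorf("no tag in image reference %q", image)
	}
	tag := image[i+1:]
	parts := strings.Split(tag, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("tag %q is not MAJOR.MINOR.PATCH", tag)
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil {
		return "", fmt.Errorf("patch component %q is not numeric: %w", parts[2], err)
	}
	parts[2] = strconv.Itoa(patch + 1)
	return strings.Join(parts, "."), nil
}

func main() {
	next, err := nextPatchVersion("gitea.dooplex.hu/admin/felhom-controller:0.2.10\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next) // 0.2.11
}
```

Tags such as `latest` are rejected rather than guessed at, so a manual decision is still needed in that case.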
### Step 3: Deploy on demo nodes
```bash
# Demo node 1 (local)
$SSH kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION> && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION>|' docker-compose.yml && sudo docker compose up -d"
# Demo node 2 (remote)
$SSH -p 33022 kisfenyo@router.abonet.hu "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION> && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION>|' docker-compose.yml && sudo docker compose up -d"
```
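The `sed` expression in Step 3 rewrites only the controller's `image:` line and keeps its YAML indentation, because the match starts at `image:` and runs to the end of that line. The Go sketch below does the same in-process and makes that behaviour explicit. `setControllerTag` is a hypothetical helper (the repo deploys with the `sed` one-liner, not with this code). Unlike `sed`, the dots in the registry name are escaped here, so the match is slightly stricter.

```go
package main

import (
	"fmt"
	"regexp"
)

// controllerImage matches from "image: gitea.dooplex.hu/admin/felhom-controller:"
// to the end of the line. In Go's RE2 syntax "." does not match "\n", so ".*"
// stops at the line end, just as sed works line by line.
var controllerImage = regexp.MustCompile(`image: gitea\.dooplex\.hu/admin/felhom-controller:.*`)

// setControllerTag (hypothetical helper) pins the controller image to version
// and leaves every other line, including other stacks' images, untouched.
func setControllerTag(compose, version string) string {
	return controllerImage.ReplaceAllLiteralString(compose,
		"image: gitea.dooplex.hu/admin/felhom-controller:"+version)
}

func main() {
	compose := "services:\n" +
		"  felhom-controller:\n" +
		"    image: gitea.dooplex.hu/admin/felhom-controller:0.2.10\n" +
		"    restart: unless-stopped\n"
	fmt.Print(setControllerTag(compose, "0.2.11"))
}
```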
### Step 4: Verify the deployment
```bash
$SSH kisfenyo@192.168.0.162 "docker ps --filter name=felhom-controller --format '{{.Image}} {{.Status}}'"
$SSH -p 33022 kisfenyo@router.abonet.hu "docker ps --filter name=felhom-controller --format '{{.Image}} {{.Status}}'"
```
Should show the new version and "Up" status. Also check logs for startup errors:
```bash
$SSH kisfenyo@192.168.0.162 "docker logs felhom-controller --tail 20"
$SSH -p 33022 kisfenyo@router.abonet.hu "docker logs felhom-controller --tail 20"
```
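The "new version and Up status" check can also be expressed as a predicate over one line of the `docker ps --format '{{.Image}} {{.Status}}'` output. This Go sketch is a hypothetical helper, not code from the repo. It also treats an `(unhealthy)` marker as a failure, an assumption that follows the project's own rule that `.State`/`Up` alone does not prove health. A single `Up` reading cannot rule out a crash-loop either, so the log check above is still needed.

```go
package main

import (
	"fmt"
	"strings"
)

// deploymentOK (hypothetical helper) inspects the first line of
// `docker ps --format '{{.Image}} {{.Status}}'` output. It requires the image
// tag to equal want and the status to start with "Up" without "(unhealthy)".
func deploymentOK(psOutput, want string) bool {
	line, _, _ := strings.Cut(strings.TrimSpace(psOutput), "\n")
	image, status, ok := strings.Cut(line, " ")
	if !ok {
		return false // empty output: container not found
	}
	return strings.HasSuffix(image, ":"+want) &&
		strings.HasPrefix(status, "Up") &&
		!strings.Contains(status, "(unhealthy)")
}

func main() {
	out := "gitea.dooplex.hu/admin/felhom-controller:0.2.11 Up 2 minutes (healthy)\n"
	fmt.Println(deploymentOK(out, "0.2.11")) // true
}
```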
### Build workflow summary
| Step | Command | Where |
|------|---------|-------|
| 0. Set SSH var | `SSH=/c/Windows/System32/OpenSSH/ssh.exe` | Local (once per session) |
| 1. Commit + push | `git add -A && git commit -m "..." && git push` | Local (this repo) |
| 2. Build + push image | `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller... ./build.sh <VER> --push"` | Build server |
| 3. Deploy (node 1) | `$SSH kisfenyo@192.168.0.162 "... docker compose up -d"` | Demo node |
| 3b. Deploy (node 2) | `$SSH -p 33022 kisfenyo@router.abonet.hu "... docker compose up -d"` | Demo node 2 |
| 4. Verify | `$SSH kisfenyo@192.168.0.162 "docker ps ..."` + same for router.abonet.hu | Both nodes |
### Build & deploy workflow — Hub (felhom-hub)
The central hub (`hub.felhom.eu`) is a separate Go app in the `hub/` directory of the `E:\git\felhom.eu\` repo. The controller pushes periodic reports to it (when `hub.enabled: true` in `controller.yaml`).
| Step | Command | Where |
|------|---------|-------|
| 1. Commit + push | `cd /e/git/felhom.eu && git add -A && git commit -m "..." && git push` | Local |
| 2. Build + push image | `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <VER> --push"` | Build server |
| 3. Deploy to k3s | `$SSH kisfenyo@192.168.0.180 "sudo kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:<VER>"` | Build server |
| 4. Verify | `$SSH kisfenyo@192.168.0.180 "sudo kubectl get pods -n felhom-system -l app=hub && sudo kubectl logs -n felhom-system -l app=hub --tail 10"` | Build server |
See `E:\git\felhom.eu\CLAUDE.md` for full hub details.
**IMPORTANT:** If you make changes to the app-catalog-felhom.eu repo, commit and push those too:
```bash
cd /e/git/app-catalog-felhom.eu
git add -A && git commit -m "<message>" && git push
```
The controller's git sync will pick up catalog changes within 15 minutes, or you can trigger it manually via the dashboard "Sablonok frissítése" button.
## Tech stack
- **Language:** Go 1.22+
- **Web framework:** stdlib `net/http` + `html/template` (no frameworks)
- **Templates:** go:embed HTML files in `internal/web/templates/` (Hungarian UI)
- **CSS:** go:embed CSS file in `internal/web/templates/style.css`
- **Auth:** bcrypt password hash + session cookies
- **Container orchestration:** Docker Compose via CLI (`docker compose up -d`)
- **Reverse proxy:** Traefik (separate stack, managed by controller)
- **Tunnel:** Cloudflare Tunnel (cloudflared, separate stack)
## Key patterns
- All UI text is in Hungarian (Budapest timezone, Hungarian locale)
- Templates use Go template functions: `stateColor`, `stateLabel`, `stateIcon`, `stateStr`, `isOperational`, `logoURL`, `logoPNGURL`, `appPageURL`
- Container states: `running`, `starting`, `unhealthy`, `stopped`, `exited`, `restarting`, `paused`, `not_deployed`
- Docker `.State` field is combined with `.Status` field to detect health substatus
- Stacks are sorted alphabetically by DisplayName
- Protected stacks (traefik, cloudflared, felhom-controller) can't be stopped from UI
- `app.yaml` persists deploy config; `deployed: true` flag controls UI state
- In-memory `Deployed` flag is set BEFORE `docker compose up -d` (avoids race condition with slow image pulls); reverted on failure
- Password fields require explicit user input or generation (no silent auto-fill)
- App cards on dashboard and stacks pages are clickable via `data-href` attribute (skip protected stacks)
- Logs page uses AJAX polling (`?raw=1` query param returns plain text) with auto-scroll and pause/resume
- Memory bar on deploy page uses two-segment stacked bar (committed = solid green, new = translucent green)
- Deploy flow shows a 3-step progress panel (config → containers → health) and polls `GET /api/stacks/{name}` every 3s until the stack is running or unhealthy, or a 120s timeout expires
- Telepítés buttons have `checkBeforeDeploy()` onclick guard — fetches live state from API before navigating to deploy page
- App info pages at `/apps/{slug}` — detail view with use cases, setup guide, screenshots, optional config
- Optional config saves to `app.yaml` and restarts deployed apps via `docker compose up -d`
- `optional_config` fields in `.felhom.yml` define post-deploy configurable env vars (e.g., API keys)
- `app_info` in `.felhom.yml` provides tagline, use_cases, first_steps, prerequisites, default_creds, docs_url
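Combining Docker's `.State` with `.Status` into the dashboard states listed above can be sketched as follows. This is an illustration, not the controller's actual code. It relies on Docker's status text containing `(unhealthy)` or `(health: starting)` for containers with a healthcheck. Mapping an empty state to `not_deployed` and other Docker states (`created`, `dead`, `removing`) to `stopped` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// containerState (illustrative) maps Docker's .State/.Status pair to the
// dashboard states. .State stays "running" for unhealthy or still-starting
// containers, so the health marker has to be read from .Status.
func containerState(state, status string) string {
	switch state {
	case "":
		return "not_deployed" // assumption: no container exists for the stack
	case "running":
		switch {
		case strings.Contains(status, "(unhealthy)"):
			return "unhealthy"
		case strings.Contains(status, "(health: starting)"):
			return "starting"
		default:
			return "running"
		}
	case "exited", "restarting", "paused":
		return state
	default:
		return "stopped" // assumption: created, dead, removing, ...
	}
}

func main() {
	fmt.Println(containerState("running", "Up 3 minutes (unhealthy)"))    // unhealthy
	fmt.Println(containerState("running", "Up 5 seconds (health: starting)")) // starting
	fmt.Println(containerState("running", "Up 2 hours (healthy)"))        // running
}
```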
## Git sync module (internal/sync)
- Uses `os/exec` to call `git` CLI — no Go git library dependency
- On startup: clones repo to `{data_dir}/catalog-cache/` (shallow clone, `--depth 1`)
- Periodically: `git fetch --depth 1` + `git reset --hard origin/{branch}`
- Copies only `docker-compose.yml` and `.felhom.yml` to stacks dir
- **Never overwrites** `app.yaml` — this contains deployed secrets
- Content-hash comparison (SHA-256) — only writes if file actually changed
- After sync, triggers `ScanStacks()` rescan for dashboard update
- `POST /api/sync` triggers immediate sync (30s debounce)
- "Sablonok frissítése" button on Alkalmazások page
- Sync status exposed in `/api/system/info` response
## Debug logging
The controller has two-tier logging controlled by `logging.level` in `controller.yaml` (or `FELHOM_LOGGING_LEVEL` env var):
- **`info`** (default): Operation success/failure with elapsed time, post-start container states, scan counts
- **`debug`**: All of above plus env var keys per compose command, local image availability checks, compose command completion times, log fetch byte counts
Key patterns used in `internal/stacks/`:
- `time.Since(start)` for operation timing — always logged at INFO level
- `m.isDebug()` gates verbose output (env var keys, image checks)
- `truncateStr(s, 500)` caps stdout/stderr in error logs
- `logPostStartStatus()` runs async (goroutine + 3s sleep) after start/restart/update/deploy — never blocks or fails the operation
- `checkLocalImages()` parses compose YAML for `image:` lines, runs `docker image inspect` per image
- Env var **keys** are logged, never values (secrets safety)
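Two of these helpers are small enough to sketch. The Go code below is a guess at their shape, not the repo's implementation: this `truncateStr` counts runes rather than bytes (so Hungarian accented characters are never split) and appends `...` as its truncation marker. `envKeys` is a hypothetical helper that reduces `KEY=VALUE` pairs to sorted keys, so the values never reach the log.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// truncateStr caps s at limit runes and marks the cut with "...".
// Counting runes avoids cutting a multi-byte UTF-8 character in half.
func truncateStr(s string, limit int) string {
	r := []rune(s)
	if len(r) <= limit {
		return s
	}
	return string(r[:limit]) + "..."
}

// envKeys (hypothetical) keeps only the names from KEY=VALUE pairs, sorted
// so the log output is stable (Go map/iteration order is not).
func envKeys(env []string) []string {
	keys := make([]string, 0, len(env))
	for _, kv := range env {
		k, _, _ := strings.Cut(kv, "=")
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	env := []string{"POSTGRES_PASSWORD=s3cret", "APP_URL=https://example.invalid"}
	fmt.Println("compose env keys:", envKeys(env)) // compose env keys: [APP_URL POSTGRES_PASSWORD]
	fmt.Println(truncateStr("compose failed: image not found", 14)) // compose failed...
}
```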
## Important lessons learned
1. `PAPERLESS_OCR_LANGUAGES` (plural, with S) **installs** tesseract packs; `PAPERLESS_OCR_LANGUAGE` (singular) **selects** which to use
2. `docker compose restart` does NOT pick up new images — always use `docker compose up -d`
3. Go map iteration order is random — always sort before displaying in UI
4. Docker's `.State` field says "running" even for unhealthy containers — must parse `.Status` for health info
5. In-memory `Deployed` flag must be set BEFORE `docker compose up -d` (not after) — compose can take 30-60s for image pulls; revert both in-memory and disk on failure
6. `docker compose up -d` returns exit 0 even when containers crash-loop — post-start status check is essential for detecting failures
7. Mealie image has no wget/curl — use Python TCP socket check for healthcheck; set `start_period: 60s` for DB migration time
8. Always verify container images have the healthcheck tool (`wget`, `curl`, etc.) before using it — Alpine has BusyBox wget, Python images have `python3`
## Working with CHANGELOG.md
**DO NOT read the full file** — it is large (29K+ tokens) and will waste context or fail.
- **At session start:** Do NOT read CHANGELOG.md. Use `CONTEXT.md` and `controller/README.md` for current state.
- **To add a new entry:** Read only the top ~30 lines (`limit: 30`) to see the format and insertion point, then use Edit to insert the new entry after line 1 (`## Changelog`).
- **To check history:** Use Grep to search for specific topics instead of reading the file.
## End-of-session checklist
Before ending a session, always:
1. **Commit and push** all code changes
2. **Build, push, and deploy** the new controller image (if controller code changed)
3. **Update CHANGELOG.md** with what was done, and overwrite **REPORT.md** with a summary of this implementation
4. **Update CONTEXT.md** with decisions made, update architectural state and what's next
5. **Update controller/README.md** if architecture or features changed
6. **Verify** the deployment is working (check `docker ps` and logs)