# CLAUDE.md — Project Instructions for Claude Code
> This file is read automatically by Claude Code at the start of every session.
> It replaces the "Instructions" panel from the claude.ai Project.
> Keep it updated as the project evolves.
!!! IMPORTANT !!!
- Always update CHANGELOG.md whenever you modify the code and push to git!
- If a controller feature changes (added/modified/removed), always update the relevant part of controller/README.md with the architectural change!
## Project overview
Felhom is a business being built to deploy home servers for Hungarian customers. This repository
(`felhom-controller`) contains the felhom-controller — a Go application that manages Docker
Compose stacks on customer hardware via a Hungarian-language web dashboard.
See `controller/README.md` for full architecture and status (update it after each session; it tracks how each function/feature operates: backup, monitoring, storage handling, app management, user settings, update workflow, notification system, and so on).
See `CHANGELOG.md` for recent work (update after each session — see "Working with CHANGELOG.md" below).
See `CONTEXT.md` for current project state, decisions and roadmap (update after each session).
See `TASK.md` for the current task to implement (if it exists).
The Claude in Chrome extension is available; use it to test the web UI on demo-felhom.eu or to verify dashboard deployments in a browser.
## System context — the Proxmox re-platform (READ THIS FIRST)
The project has **re-platformed onto Proxmox**, with a locked **three-component model**:
- **Hub** (`felhom.eu/hub/`) — operator backend on k3s.
- **Host agent** (`felhom-agent/`, formerly `proxmox-controller`) — one per Proxmox host; operator-tier; owns ALL Proxmox interaction.
- **In-guest controller** (THIS repo) — one per customer LXC; **Docker-only; holds NO Proxmox credentials**.
**This repo is being de-privileged.** In the target model, host/disk/Proxmox/Cloudflare
responsibilities move OUT of the controller into the **host agent**: System info, Storage (disk
scan/format/mount/migrate), the disk-tier Backup (restic, cross-drive, drive-restore, infra-backup),
and the Cloudflare-API geo enforcement. The controller keeps the **app domain**: stack/deploy
management, the Hungarian web UI, app-data backup (DB dumps + Docker-volume tars), metrics/telemetry,
integrations, git-sync, notifications.
> **Authoritative map:** `felhom.eu/documentation/architecture/02-controller-module-map.md` — the
> per-package **KEEP / PORT / DELETE(→agent) / DELETE(obsolete) / MODIFY** classification. Read it
> before touching `backup/`, `storage/`, `cloudflare/`, `system/`, or `config/`. Also doc 01
> (topology/trust) and doc 03 (the host agent).
**⚠️ Status — do NOT assume the target state is implemented.** The de-privileging has only *started*:
the recent `internal/appbackup/` extraction split the keep-side app-data-backup primitives from the
delete-side disk/host code (groundwork, no behaviour change). The **bulk strip has NOT happened**:
the current code STILL contains the full privileged storage / restic / cross-drive / disk /
Cloudflare stack. The strip + the agent-local-API client land at **~slice 8**. So the code you see
is the **pre-strip, still-privileged** controller; match the code, not the target, unless a TASK
says otherwise.
**Don't confuse the two ex-"controllers":** `felhom-agent` (host, operator-tier, was
`proxmox-controller`) vs this `felhom-controller` (in-guest, was `deploy-felhom-compose`).
## Cross-repo & artifacts
- Workspace orientation (the felhom system, shared conventions, access) lives in the workspace-root
`e:\git\CLAUDE.md`. Sibling per-repo files: `felhom-agent/CLAUDE.md`, `felhom.eu/CLAUDE.md`.
- **Artifact taxonomy:**
  - **`TASK.md` / `TASK-*.md`**: a spec for YOU to implement (then push + update CHANGELOG + CONTEXT + README).
  - **`RUNBOOK-*.md`**: an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks; mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data, but demo scratch guests are fair game.)
## Code quality rules
- Always double-check generated code for bugs, logic issues, syntax errors
- Handle edge cases without overcomplicating the script/program
- Add debug capabilities (logging, verbose output) for easier troubleshooting
- If you need more input or troubleshooting command output, ask first — don't guess
## Environment
| Machine | OS | IP | Purpose |
|---------|----|----|---------|
| **Local (this machine)** | Windows 11 | — | Development, Claude Code runs here. Repos in `E:\git\` |
| **Build server (k3s, infra)** | Debian 13 | 192.168.0.180 | Build + push container images, k3s cluster |
| **Demo node** | Debian 13 | 192.168.0.162 | Test deployment (demo-felhom.eu) |
| **Demo node 2** | Debian 13 | router.abonet.hu:33022 | Remote test deployment |
## Workspace layout
Claude Code runs on Windows 11. The working directory is `E:\git\` (mapped as `/e/git/` in Git Bash).
This repo is at:
```
E:\git\felhom-controller\ (or /e/git/felhom-controller/ in Git Bash)
├── controller/ # Go application (main codebase)
│ ├── cmd/controller/ # Entry point (main.go)
│ ├── internal/
│ │ ├── config/ # YAML config loading
│ │ ├── settings/ # settings.json persistence (password hash, DB cache)
│ │ ├── stacks/ # Docker Compose operations, deploy flow
│ │ ├── sync/ # Git sync — periodic pull of app catalog repo
│ │ ├── api/ # REST API endpoints
│ │ ├── system/ # System info (memory, disk)
│ │ └── web/ # Dashboard UI
│ │ ├── server.go # Server struct, routing, static serving
│ │ ├── auth.go # Session auth, login/logout handlers
│ │ ├── handlers.go # Page handlers (dashboard, stacks, deploy, etc.)
│ │ ├── funcmap.go # Template function map
│ │ ├── embed.go # go:embed directive for templates
│ │ ├── templates.go # Felhom logo SVG constant
│ │ └── templates/ # go:embed HTML/CSS files (Hungarian UI)
│ ├── Dockerfile
│ ├── Makefile
│ └── go.mod
├── scripts/ # Setup scripts for customer nodes
├── CLAUDE.md # This file
├── CHANGELOG.md # Changelog
├── CONTEXT.md # Project memory / state / architectural state/decisions/roadmap
└── TASK.md # Current task (if exists)
```
Related repos (same parent directory):
```
E:\git\app-catalog-felhom.eu\ # Docker Compose templates + .felhom.yml metadata per app
E:\git\felhom.eu\ # Website (htmls) + k3s manifests
E:\git\homelab-manifests\ # k3s cluster manifests (dooplex.hu services)
E:\git\misc-scripts\ # Helper scripts
```
All repos hosted at `gitea.dooplex.hu/admin/`. Git credentials are stored (`git config credential.helper store`).
## SSH access
SSH key-based authentication is configured and working. No password prompts.
**IMPORTANT — SSH binary:** Claude Code runs in Git Bash, which has its own SSH at
`/usr/bin/ssh` (= `C:\Program Files\Git\usr\bin\ssh.exe`). This binary does NOT have
access to the Windows SSH agent and will fail silently (exit 0/141 with no output).
Always use the Windows native OpenSSH binary with the full path:
```
SSH=/c/Windows/System32/OpenSSH/ssh.exe
```
All SSH commands in this file use `$SSH` — set it at the start of your session or
substitute the full path manually.
| Host | OS | IP | User | Role |
|------|----|----|------|------|
| Build server | Debian 13 | 192.168.0.180 | kisfenyo | Build + push container images |
| Demo Proxmox host | Proxmox VE 9.2.2 | 192.168.0.162 | root (`root@pam`; SSH alias `felhom-pve`; no sudo) | `pveum`/`pct` + live Proxmox validation; available to CC |
## Test environments
| Node | OS | Hardware | Domain | IP | Notes |
|------|-----|----------|--------|----|-------|
| demo-felhom | Debian 13 | Acemagic N100, 16G RAM, 512G SSD + 1TB HDD | demo-felhom.eu | 192.168.0.162 | Primary test node, Cloudflare Tunnel |
| felhotest | Debian 13 | Proxmox VM (4-16G RAM, 8 vCPU, 200G + 100G SCSI) | — | router.abonet.hu:33022 | Remote test node |
| pi-customer-1 | Debian 13 | Raspberry Pi 3B+, 1G RAM, 32G SD | pi-customer-1.local | 192.168.0.161 | Secondary test, not yet active |
- Pi-hole DNS on local network forwards `*.demo-felhom.eu` → 192.168.0.162
- External access via Cloudflare Tunnel → Traefik reverse proxy
> **⚠️ Re-platform note:** per the host-agent work, `192.168.0.162` is now a **Proxmox host**
> (`demo-felhom`, PVE 9.2.2) — the demo-node tables above predate that. Confirm how/where the
> controller is currently deployed and tested post-re-platform before relying on the bare-metal
> `docker compose` deploy steps below; on the re-platformed node the controller may now run inside
> an LXC guest rather than directly on the host.
## Build & deploy workflow — MANDATORY
After making code changes to the controller, you **MUST** build, push, and deploy the new image.
Do NOT leave code changes uncommitted or undeployed. The full cycle is:
### Step 1: Commit and push changes
```bash
cd /e/git/felhom-controller
git add -A && git commit -m "<descriptive message>" && git push
```
### Step 2: Build + push the container image on the build server
The build server (192.168.0.180) has the build toolchain. The version tag should be incremented from the current running version.
**Important:** use the `kisfenyo` user for SSH, as shown below.
First, set the SSH variable (required for every session — Git Bash's built-in ssh does NOT work):
```bash
SSH=/c/Windows/System32/OpenSSH/ssh.exe
```
Check the current running version:
```bash
$SSH kisfenyo@192.168.0.162 "docker ps --filter name=felhom-controller --format '{{.Image}}'"
```
**Important:** the build directory is `~/build/felhom-controller`.
Then build with the next version (e.g., if the current version is 0.2.10, use 0.2.11):
```bash
$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller && git -C ~/git/felhom-controller pull && ./build.sh <NEW_VERSION> --push"
```
The build script:
- Pulls latest code from Gitea
- Builds a multi-arch Docker image (amd64 + arm64) when passed `--multiarch`, or an image for the current architecture only when passed `--push`
- Pushes to `gitea.dooplex.hu/admin/felhom-controller:<VERSION>`
- Expects the version as first argument (e.g., `0.2.11`)
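The "next version" rule above (0.2.10 becomes 0.2.11) is a numeric bump of the last component, not a string increment. A minimal Go sketch of that rule follows; `nextPatch` is a hypothetical helper for illustration and is not part of `build.sh` or the controller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextPatch returns version with its last dot-separated component
// incremented numerically, e.g. "0.2.10" -> "0.2.11".
// Hypothetical helper; the real workflow picks the version by hand.
func nextPatch(version string) (string, error) {
	parts := strings.Split(version, ".")
	last := len(parts) - 1
	n, err := strconv.Atoi(parts[last])
	if err != nil {
		return "", fmt.Errorf("invalid version %q: %w", version, err)
	}
	parts[last] = strconv.Itoa(n + 1)
	return strings.Join(parts, "."), nil
}

func main() {
	next, err := nextPatch("0.2.10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```

Treating the component as a number matters at boundaries such as 0.2.9, where the next version is 0.2.10.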
### Step 3: Deploy on demo nodes
```bash
# Demo node 1 (local)
$SSH kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION> && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION>|' docker-compose.yml && sudo docker compose up -d"
# Demo node 2 (remote)
$SSH -p 33022 kisfenyo@router.abonet.hu "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION> && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:<NEW_VERSION>|' docker-compose.yml && sudo docker compose up -d"
```
### Step 4: Verify the deployment
```bash
$SSH kisfenyo@192.168.0.162 "docker ps --filter name=felhom-controller --format '{{.Image}} {{.Status}}'"
$SSH -p 33022 kisfenyo@router.abonet.hu "docker ps --filter name=felhom-controller --format '{{.Image}} {{.Status}}'"
```
Both nodes should show the new version and an "Up" status. Also check the logs for startup errors:
```bash
$SSH kisfenyo@192.168.0.162 "docker logs felhom-controller --tail 20"
$SSH -p 33022 kisfenyo@router.abonet.hu "docker logs felhom-controller --tail 20"
```
### Build workflow summary
| Step | Command | Where |
|------|---------|-------|
| 0. Set SSH var | `SSH=/c/Windows/System32/OpenSSH/ssh.exe` | Local (once per session) |
| 1. Commit + push | `git add -A && git commit -m "..." && git push` | Local (this repo) |
| 2. Build + push image | `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller... ./build.sh <VER> --push"` | Build server |
| 3. Deploy (node 1) | `$SSH kisfenyo@192.168.0.162 "... docker compose up -d"` | Demo node |
| 3b. Deploy (node 2) | `$SSH -p 33022 kisfenyo@router.abonet.hu "... docker compose up -d"` | Demo node 2 |
| 4. Verify | `$SSH kisfenyo@192.168.0.162 "docker ps ..."` + same for router.abonet.hu | Both nodes |
### Build & deploy workflow — Hub (felhom-hub)
The central hub (`hub.felhom.eu`) is a separate Go app in the `E:\git\felhom.eu\hub\` directory of the `felhom.eu` repo.
The controller pushes periodic reports to it (when `hub.enabled: true` in `controller.yaml`).
| Step | Command | Where |
|------|---------|-------|
| 1. Commit + push | `cd /e/git/felhom.eu && git add -A && git commit && git push` | Local |
| 2. Build + push image | `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <VER> --push"` | Build server |
| 3. Deploy to k3s | `$SSH kisfenyo@192.168.0.180 "sudo kubectl set image -n felhom-system deploy/hub hub=gitea.dooplex.hu/admin/felhom-hub:<VER>"` | Build server |
| 4. Verify | `$SSH kisfenyo@192.168.0.180 "sudo kubectl get pods -n felhom-system -l app=hub && sudo kubectl logs -n felhom-system -l app=hub --tail 10"` | Build server |
See `E:\git\felhom.eu\CLAUDE.md` for full hub details.
**IMPORTANT:** If you make changes to the app-catalog-felhom.eu repo, commit and push those too:
```bash
cd /e/git/app-catalog-felhom.eu
git add -A && git commit -m "<message>" && git push
```
The controller's git sync will pick up catalog changes within 15 minutes, or you can trigger it
manually via the dashboard "Sablonok frissítése" ("Refresh templates") button.
## Tech stack
- **Language:** Go 1.22+
- **Web framework:** stdlib `net/http` + `html/template` (no frameworks)
- **Templates:** go:embed HTML files in `internal/web/templates/` (Hungarian UI)
- **CSS:** go:embed CSS file in `internal/web/templates/style.css`
- **Auth:** bcrypt password hash + session cookies
- **Container orchestration:** Docker Compose via CLI (`docker compose up -d`)
- **Reverse proxy:** Traefik (separate stack, managed by controller)
- **Tunnel:** Cloudflare Tunnel (cloudflared, separate stack)
## Key patterns
- All UI text is in Hungarian (Budapest timezone, Hungarian locale)
- Templates use Go template functions: `stateColor`, `stateLabel`, `stateIcon`, `stateStr`, `isOperational`, `logoURL`, `logoPNGURL`, `appPageURL`
- Container states: `running`, `starting`, `unhealthy`, `stopped`, `exited`, `restarting`, `paused`, `not_deployed`
- Docker's `.State` field is combined with the `.Status` field to detect the health substatus
- Stacks are sorted alphabetically by DisplayName
- Protected stacks (traefik, cloudflared, felhom-controller) can't be stopped from UI
- `app.yaml` persists deploy config; `deployed: true` flag controls UI state
- In-memory `Deployed` flag is set BEFORE `docker compose up -d` (avoids race condition with slow image pulls); reverted on failure
- Password fields require explicit user input or generation (no silent auto-fill)
- App cards on dashboard and stacks pages are clickable via `data-href` attribute (skip protected stacks)
- Logs page uses AJAX polling (`?raw=1` query param returns plain text) with auto-scroll and pause/resume
- Memory bar on deploy page uses two-segment stacked bar (committed = solid green, new = translucent green)
- Deploy flow shows 3-step progress panel (config → containers → health), polls `GET /api/stacks/{name}` every 3s until running/unhealthy/timeout(120s)
- Telepítés buttons have `checkBeforeDeploy()` onclick guard — fetches live state from API before navigating to deploy page
- App info pages at `/apps/{slug}` — detail view with use cases, setup guide, screenshots, optional config
- Optional config saves to `app.yaml` and restarts deployed apps via `docker compose up -d`
- `optional_config` fields in `.felhom.yml` define post-deploy configurable env vars (e.g., API keys)
- `app_info` in `.felhom.yml` provides tagline, use_cases, first_steps, prerequisites, default_creds, docs_url
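The `.State` + `.Status` combination listed above can be sketched as follows. This is an illustration under assumptions, not the controller's actual code: `effectiveState` is a hypothetical name, and the substrings matched are the health suffixes that `docker ps` prints in its status column.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveState refines Docker's coarse State using the health suffix
// found in Status. Hypothetical helper for illustration only.
func effectiveState(state, status string) string {
	if state != "running" {
		// Non-running states (exited, paused, restarting, ...) pass through.
		return state
	}
	switch {
	case strings.Contains(status, "(unhealthy)"):
		return "unhealthy"
	case strings.Contains(status, "(health: starting)"):
		return "starting"
	default:
		// Healthy, or no healthcheck defined at all.
		return "running"
	}
}

func main() {
	samples := [][2]string{
		{"running", "Up 2 minutes (healthy)"},
		{"running", "Up 5 seconds (health: starting)"},
		{"running", "Up 3 minutes (unhealthy)"},
		{"exited", "Exited (1) 10 seconds ago"},
	}
	for _, s := range samples {
		fmt.Printf("%-8s %-34s -> %s\n", s[0], s[1], effectiveState(s[0], s[1]))
	}
}
```

A container without a healthcheck has no suffix in its status, so it falls through to plain `running`.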
## Git sync module (internal/sync)
- Uses `os/exec` to call `git` CLI — no Go git library dependency
- On startup: clones repo to `{data_dir}/catalog-cache/` (shallow clone, `--depth 1`)
- Periodically: `git fetch --depth 1` + `git reset --hard origin/{branch}`
- Copies only `docker-compose.yml` and `.felhom.yml` to stacks dir
- **Never overwrites** `app.yaml` — this contains deployed secrets
- Content-hash comparison (SHA-256) — only writes if file actually changed
- After sync, triggers `ScanStacks()` rescan for dashboard update
- `POST /api/sync` triggers immediate sync (30s debounce)
- "Sablonok frissítése" button on Alkalmazások page
- Sync status exposed in `/api/system/info` response
## Debug logging
The controller has two-tier logging controlled by `logging.level` in `controller.yaml` (or `FELHOM_LOGGING_LEVEL` env var):
- **`info`** (default): Operation success/failure with elapsed time, post-start container states, scan counts
- **`debug`**: Everything above, plus env var keys per compose command, local image availability checks, compose command completion times, log fetch byte counts
Key patterns used in `internal/stacks/`:
- `time.Since(start)` for operation timing — always logged at INFO level
- `m.isDebug()` gates verbose output (env var keys, image checks)
- `truncateStr(s, 500)` caps stdout/stderr in error logs
- `logPostStartStatus()` runs async (goroutine + 3s sleep) after start/restart/update/deploy — never blocks or fails the operation
- `checkLocalImages()` parses compose YAML for `image:` lines, runs `docker image inspect` per image
- Env var **keys** are logged, never values (secrets safety)
## Important lessons learned
1. `PAPERLESS_OCR_LANGUAGES` (plural, with S) **installs** tesseract packs; `PAPERLESS_OCR_LANGUAGE` (singular) **selects** which to use
2. `docker compose restart` does NOT pick up new images — always use `docker compose up -d`
3. Go map iteration order is random — always sort before displaying in UI
4. Docker's `.State` field says "running" even for unhealthy containers — must parse `.Status` for health info
5. In-memory `Deployed` flag must be set BEFORE `docker compose up -d` (not after) — compose can take 30-60s for image pulls; revert both in-memory and disk on failure
6. `docker compose up -d` returns exit 0 even when containers crash-loop — post-start status check is essential for detecting failures
7. Mealie image has no wget/curl — use Python TCP socket check for healthcheck; set `start_period: 60s` for DB migration time
8. Always verify container images have the healthcheck tool (`wget`, `curl`, etc.) before using it — Alpine has BusyBox wget, Python images have `python3`
## Working with CHANGELOG.md
**DO NOT read the full file** — it is large (29K+ tokens) and will waste context or fail.
- **At session start:** Do NOT read CHANGELOG.md. Use `CONTEXT.md` and `controller/README.md` for current state.
- **To add a new entry:** Read only the top ~30 lines (`limit: 30`) to see the format and insertion point, then use Edit to insert the new entry after line 1 (`## Changelog`).
- **To check history:** Use Grep to search for specific topics instead of reading the file.
## End-of-session checklist
Before ending a session, always:
1. **Commit and push** all code changes
2. **Build, push, and deploy** the new controller image (if controller code changed)
3. **Update CHANGELOG.md** with what was done
4. **Update CONTEXT.md** with decisions made, the updated architectural state, and what's next
5. **Update controller/README.md** if architecture or features changed
6. **Verify** the deployment is working (check `docker ps` and logs)