admin 9347fcd3a5 docs(CLAUDE): correct hub/manifests deploy to GitOps via the 'felhom' ArgoCD app
No separate hub app; manifests/ synced by app 'felhom' (auto-sync off). Deploy =
build+push pinned image -> bump manifests/hub.yaml tag + commit -> manual sync.
Never :latest (manifest is ArgoCD's truth). Replaces the stale kubectl apply/set image steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 10:19:23 +02:00

# CLAUDE.md — Project Instructions for Claude Code (`felhom.eu`)
> Read automatically by Claude Code when it works in this repo. Keep it updated as the project evolves. Cross-repo orientation (the felhom system, artifact taxonomy, access) lives in the workspace-root `e:\git\CLAUDE.md`; this file is `felhom.eu`-specific.
## Project overview
This repo (`felhom.eu`) contains:
- **Website** (`website/`) — static HTML at felhom.eu, served via k3s nginx + git-sync sidecar.
- **Hub** (`hub/`) — Go application (felhom-hub) — the **operator backend**, on k3s at `hub.felhom.eu`.
- **K8s manifests** (`manifests/`) — k3s deployment manifests for felhom-system services.
- **Architecture docs** (`documentation/`) — the **authoritative design home for the whole Felhom system**: `architecture/01..05-*.md` (topology/trust, controller module map, host-agent, signing, hub), `proxmox-platform.md`, and `tests/phase{0,1-2,3,4}-findings.md`. Read these before designing.
See `README.md` for full architecture/DNS/email/SEO docs. See `TASK.md` for the current task (if any).
## The Felhom system (so the hub's role is in context)
Felhom is **Proxmox-based**, with a locked **three-component model**:
- **Hub** (this repo, `hub/`) — operator backend. Authors operator *intent*; mirrors box *reality*; holds **no data-plane role** and never connects inbound to a box.
- **Host agent** (repo `felhom-agent/`) — one per Proxmox host; owns all Proxmox interaction.
- **In-guest controller** (repo `felhom-controller/`) — one per customer LXC; Docker-only.
The hub is **not** just controller monitoring anymore. As of slice 3 it ingests **two report streams**: the agent's host-domain report (`POST /api/v1/host-report`, the heartbeat) and the legacy controller report (`POST /api/v1/report`). The controller path is **frozen and retires at the slice-10 cutover** — do not modify it until then.
## Hub — current state (v0.7.x)
- **Tables:** `customer_configs`, `events`, `app_telemetry`/`app_log_issues`, the legacy `reports`, and the slice-3 host-domain additions `hosts` / `guests` / `host_reports` (additive; columns marked inert exist for the slice-10 cutover but are unused now).
- **Auth:** Bearer — global key, per-customer key (legacy), and per-host key (`GetHostByAPIKey`, slice 3). Provisional global-key host mint at `POST /api/v1/admin/hosts`.
- **Monitoring:** the controller `StalenessChecker` (over `reports`) AND a sibling `HostStalenessChecker` (over `host_reports`, emitting `host_stale`/`host_down`/`host_recovered`).
- Two-tier notifications (operator English / customer Hungarian, Resend, cooldowns); `events` audit.
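
As an illustration of the Bearer scheme, the sketch below pulls the token out of an `Authorization` header value and checks it against a global key. `bearerToken` and `isGlobalKey` are hypothetical helpers, and the constant-time comparison is an assumption; the hub's actual lookup (for example `GetHostByAPIKey`, or bcrypt-hashed keys) is not shown here.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerToken extracts the token from an Authorization header value.
// It returns false when the scheme is not Bearer or the token is empty.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if len(header) <= len(prefix) || !strings.EqualFold(header[:len(prefix)], prefix) {
		return "", false
	}
	tok := strings.TrimSpace(header[len(prefix):])
	return tok, tok != ""
}

// isGlobalKey compares the token with the configured global key in constant
// time. An empty configured key never matches. (Hypothetical helper.)
func isGlobalKey(token, globalKey string) bool {
	if globalKey == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(token), []byte(globalKey)) == 1
}

func main() {
	tok, ok := bearerToken("Bearer example-token")
	fmt.Println(tok, ok, isGlobalKey(tok, "example-token"))
}
```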
## Code quality rules
- Always double-check generated code for bugs, logic issues, syntax errors.
- Handle edge cases without overcomplicating.
- Add debug capabilities (logging, verbose output).
- If you need more input or troubleshooting output, **ask first — don't guess**.
## Workflow & artifacts
The planning/architecture assistant ("project Claude", in claude.ai) writes specs and validates pushes; **you (Claude Code) implement**. A file being open in the editor is NOT an instruction.
- **`TASK.md` / `TASK-*.md`** — a spec for you to implement. Then push and update this repo's changelog (`hub/CHANGELOG.md`) and root `REPORT.md` per the convention below.
- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
- Validation of a push against a spec's criteria is project Claude's job, not yours, unless asked.
> **In every repository where you make a change, update both files in that repo:**
> - **`CHANGELOG.md`** — a cumulative log of **all** changes; newest entry on top.
> - **`REPORT.md`** — **overwrite** with a summary of the **most recent** implementation (or significant validation/operational run) only; not cumulative.
>
> **Never write secrets** — tokens, passwords, private keys, API keys — into `CHANGELOG.md`, `REPORT.md`, or any committed file. Reference them as "stored out-of-band" instead.
## Tech stack (Hub)
- **Language:** Go 1.24+ (build server is go1.26.0).
- **Web:** stdlib `net/http` + `html/template`. **DB:** SQLite via `modernc.org/sqlite` (pure Go).
- **Auth:** bcrypt + Bearer tokens. **Deploy:** Docker on k3s (felhom-system ns).
- **Storage:** Longhorn PVC at `/data/` (SQLite DB). **Config:** YAML via ConfigMap at `/etc/felhom-hub/hub.yaml`.
## SSH access
Use the Windows OpenSSH binary (Git Bash's `/usr/bin/ssh` can't reach the Windows agent and fails silently): `SSH=/c/Windows/System32/OpenSSH/ssh.exe`. All SSH commands below use `$SSH`.
| Host | IP | User | Role |
|------|----|------|------|
| Build server (k3s node) | 192.168.0.180 | kisfenyo | Build + push images, kubectl (needs `sudo`) |
| Demo Proxmox host | 192.168.0.162 | root@pam (SSH alias felhom-pve, root, no sudo) | pveum/pct + live Proxmox validation — available to CC |
## Build & deploy — Hub (GitOps via ArgoCD)
The whole k3s cluster is GitOps via a **single ArgoCD app named `felhom`** (`argocd.dooplex.hu`) that syncs this repo's **`manifests/`** to the **`felhom-system`** namespace. **There is no separate `hub` ArgoCD app** — the hub is one `Deployment` (`manifests/hub.yaml`) *inside* the `felhom` app. **Auto-sync is OFF**: deploys are a deliberate manual sync. ArgoCD's source of truth is the **manifest**, so:
- **A code change + CHANGELOG version bump does NOT deploy anything.** The running image only changes when `manifests/hub.yaml`'s `image:` tag changes in git and the app is synced.
- **Pin explicit versions, never `:latest`.** A `:latest` re-push wouldn't change the manifest, so ArgoCD wouldn't redeploy, and Synced / History / Rollback would all misreport what's actually live.
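
The pinning rule can be checked mechanically. The sketch below is a hypothetical helper, not part of the hub or its build script, and the version `0.7.3` is a made-up example. It treats an image reference as pinned only when it carries an explicit tag other than `latest`.

```go
package main

import (
	"fmt"
	"strings"
)

// imageTag returns the tag of an image reference, or "" if it has none.
// A colon before the last slash belongs to a registry port, not a tag.
func imageTag(ref string) string {
	// Drop a digest suffix such as @sha256:... before looking for the tag.
	if i := strings.Index(ref, "@"); i >= 0 {
		ref = ref[:i]
	}
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon <= slash {
		return ""
	}
	return ref[colon+1:]
}

// isPinned reports whether ref carries an explicit tag other than "latest".
// An untagged reference resolves to :latest, so it counts as unpinned.
func isPinned(ref string) bool {
	tag := imageTag(ref)
	return tag != "" && tag != "latest"
}

func main() {
	for _, ref := range []string{
		"gitea.dooplex.hu/admin/felhom-hub:0.7.3", // hypothetical version
		"gitea.dooplex.hu/admin/felhom-hub:latest",
		"gitea.dooplex.hu/admin/felhom-hub",
	} {
		fmt.Println(ref, isPinned(ref))
	}
}
```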
After a code change to `hub/`, to deploy:
1. **Commit + push the code:** `cd /e/git/felhom.eu && git add -A && git commit -m "<msg>" && git push`
2. **Build + push the image** (build script lives on the build server, not in this repo): `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <NEW_VERSION> --push"` (pulls latest from Gitea, builds the version into `main.Version` via ldflags, pushes `gitea.dooplex.hu/admin/felhom-hub:<NEW_VERSION>`). Pin `<NEW_VERSION>`; don't rely on `:latest`.
3. **Bump the manifest:** set the `image:` tag in `manifests/hub.yaml` to `:<NEW_VERSION>`, commit to `main`, push. The `felhom` app now shows **OutOfSync**.
4. **Sync:** ArgoCD UI → app `felhom` → **Sync**, or `$SSH kisfenyo@192.168.0.180 "argocd app sync felhom"` (argocd CLI v3.2.1 at `/usr/local/bin`).
5. **Verify:** `$SSH kisfenyo@192.168.0.180 "sudo kubectl get deploy -n felhom-system hub -o jsonpath='{.spec.template.spec.containers[0].image}'; echo; sudo kubectl logs -n felhom-system -l app=hub --tail 10"` (expect the new tag + `[INFO] felhom-hub <VERSION> starting`).
> A bare `kubectl set image` would be reverted on the next sync (the manifest is the truth), so always go through `manifests/hub.yaml`. **The live image can lag the CHANGELOG** when version bumps were committed but steps 3 and 4 were never done; reconcile via the manifest, not by assuming the changelog reflects what's running.
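
Step 3 (bumping the manifest tag) is a one-line text change. The sketch below shows one way to script it, assuming `manifests/hub.yaml` holds an unquoted `image: <repo>:<tag>` line; `bumpImageTag` is a hypothetical helper and the versions are made-up examples.

```go
package main

import (
	"fmt"
	"regexp"
)

// bumpImageTag rewrites the tag on the image line for repo in a manifest.
// It assumes an unquoted "image: <repo>:<tag>" line (optionally a list item)
// and returns an error when no such line exists.
func bumpImageTag(manifest, repo, newTag string) (string, error) {
	re := regexp.MustCompile(
		`(?m)^([ \t]*(?:- )?image:[ \t]*` + regexp.QuoteMeta(repo) + `):\S+[ \t]*$`)
	if !re.MatchString(manifest) {
		return "", fmt.Errorf("no tagged image line for %s", repo)
	}
	// ${1} keeps the indentation, key and repository; only the tag changes.
	return re.ReplaceAllString(manifest, "${1}:"+newTag), nil
}

func main() {
	manifest := "      containers:\n" +
		"        - name: hub\n" +
		"          image: gitea.dooplex.hu/admin/felhom-hub:0.7.1\n"
	out, err := bumpImageTag(manifest, "gitea.dooplex.hu/admin/felhom-hub", "0.7.2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The edited file still has to be committed to `main` and synced manually, as in steps 3 and 4.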
## Build & deploy — Website / Manifests
- **Website** auto-deploys via git-sync; just push to `main` (live in 1 to 2 min). Emergency edits: FileBrowser at `https://files.felhom.eu`.
- **Manifests** (`manifests/`) are GitOps via the `felhom` ArgoCD app — commit to `main`, then sync (auto-sync is off): UI Sync or `argocd app sync felhom`. Do **not** `kubectl apply` them directly (a later sync reverts drift; the manifest in git is the truth).
## Key patterns
- Hub ingests **host-reports from agents** (`POST /api/v1/host-report`, Bearer per-host) and legacy **controller reports** (`POST /api/v1/report`). The host-report `received_at` is the dead-man's-switch liveness signal.
- Status logic: OK (report < 30m), WARN (30m to 1h or health=warn), DOWN (> 1h or health=fail).
- SQLite timestamps vary in format — use `parseSQLiteTime()`.
- Dashboard/detail auto-refresh every 60s via `<meta http-equiv="refresh">`. Geo-restricted to Hungary via nginx ingress annotation.
## File encoding
All `website/` HTML is **UTF-8 with BOM** — preserve it. Hub Go source is standard UTF-8 (no BOM).
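
A small sketch for checking or restoring the BOM on `website/` HTML before committing. `hasBOM` and `ensureBOM` are hypothetical helpers, not existing project code.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasBOM reports whether b starts with the UTF-8 BOM.
func hasBOM(b []byte) bool {
	return bytes.HasPrefix(b, utf8BOM)
}

// ensureBOM returns b unchanged if it already has a BOM; otherwise it
// returns a new slice with the BOM prepended.
func ensureBOM(b []byte) []byte {
	if hasBOM(b) {
		return b
	}
	return append(append([]byte{}, utf8BOM...), b...)
}

func main() {
	page := []byte("<!DOCTYPE html>")
	fmt.Println(hasBOM(page), hasBOM(ensureBOM(page)))
}
```

Hub Go source should fail the `hasBOM` check, since it is plain UTF-8 without a BOM.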