admin 0c6ec27054 docs(CLAUDE): document the working kubectl sync trigger for felhom (argocd CLI not logged in)
The argocd CLI on 180 has no server session and --core breaks under sudo (env stripped);
the reliable scripted sync is annotate refresh + patch .operation on the Application CR.
Verified by deploying hub v0.7.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 10:26:33 +02:00

# CLAUDE.md — Project Instructions for Claude Code (`felhom.eu`)
> Read automatically by Claude Code when it works in this repo. Keep it updated as the project evolves. Cross-repo orientation (the felhom system, artifact taxonomy, access) lives in the workspace-root `e:\git\CLAUDE.md`; this file is `felhom.eu`-specific.
## Project overview
This repo (`felhom.eu`) contains:
- **Website** (`website/`) — static HTML at felhom.eu, served via k3s nginx + git-sync sidecar.
- **Hub** (`hub/`) — Go application (felhom-hub) — the **operator backend**, on k3s at `hub.felhom.eu`.
- **K8s manifests** (`manifests/`) — k3s deployment manifests for felhom-system services.
- **Architecture docs** (`documentation/`) — the **authoritative design home for the whole Felhom system**: `architecture/01..05-*.md` (topology/trust, controller module map, host-agent, signing, hub), `proxmox-platform.md`, and `tests/phase{0,1-2,3,4}-findings.md`. Read these before designing.
See `README.md` for full architecture/DNS/email/SEO docs. See `TASK.md` for the current task (if any).
## The Felhom system (so the hub's role is in context)
Felhom is **Proxmox-based**, with a locked **three-component model**:
- **Hub** (this repo, `hub/`) — operator backend. Authors operator *intent*; mirrors box *reality*; holds **no data-plane role** and never connects inbound to a box.
- **Host agent** (repo `felhom-agent/`) — one per Proxmox host; owns all Proxmox interaction.
- **In-guest controller** (repo `felhom-controller/`) — one per customer LXC; Docker-only.
The hub is **not** just controller monitoring anymore. As of slice 3 it ingests **two report streams**: the agent's host-domain report (`POST /api/v1/host-report`, the heartbeat) and the legacy controller report (`POST /api/v1/report`). The controller path is **frozen and retires at the slice-10 cutover** — do not modify it until then.
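
A minimal sketch of how the two report streams could sit side by side on a stdlib `net/http` mux, with per-host Bearer auth on the new path only. The endpoint paths come from this file; everything else (`newMux`, `bearerToken`, `legacyReport`, `probe`, the in-memory `hostKeys` map standing in for a `GetHostByAPIKey` lookup, and the response codes) is an assumption for illustration, not the hub's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// bearerToken extracts the token from an "Authorization: Bearer <token>" header.
func bearerToken(r *http.Request) (string, bool) {
	const prefix = "Bearer "
	h := r.Header.Get("Authorization")
	if len(h) <= len(prefix) || !strings.EqualFold(h[:len(prefix)], prefix) {
		return "", false
	}
	tok := strings.TrimSpace(h[len(prefix):])
	return tok, tok != ""
}

// legacyReport is a placeholder for the frozen controller endpoint (hypothetical body).
func legacyReport(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNoContent)
}

// newMux wires both streams. hostKeys maps API key -> host name and is a
// hypothetical stand-in for a database lookup such as GetHostByAPIKey.
func newMux(hostKeys map[string]string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/host-report", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		tok, ok := bearerToken(r)
		host, known := hostKeys[tok]
		if !ok || !known {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		fmt.Fprintf(w, "accepted report from %s\n", host)
	})
	// The legacy path is mounted as-is and never touched by the new handler.
	mux.HandleFunc("/api/v1/report", legacyReport)
	return mux
}

// probe sends one request through the handler and returns the status code.
func probe(h http.Handler, method, path, token string) int {
	req := httptest.NewRequest(method, path, nil)
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newMux(map[string]string{"demo-key": "pve-demo"})
	fmt.Println(probe(mux, http.MethodPost, "/api/v1/host-report", "demo-key")) // 200
	fmt.Println(probe(mux, http.MethodPost, "/api/v1/host-report", ""))         // 401
	fmt.Println(probe(mux, http.MethodGet, "/api/v1/host-report", "demo-key"))  // 405
}
```

Keeping the legacy handler as a separate function makes it easy to leave that path byte-for-byte unchanged until the slice-10 cutover.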
## Hub — current state (v0.7.x)
- **Tables:** `customer_configs`, `events`, `app_telemetry`/`app_log_issues`, the legacy `reports`, and the slice-3 host-domain additions `hosts` / `guests` / `host_reports` (additive; columns marked inert exist for the slice-10 cutover but are unused now).
- **Auth:** Bearer — global key, per-customer key (legacy), and per-host key (`GetHostByAPIKey`, slice 3). Provisional global-key host mint at `POST /api/v1/admin/hosts`.
- **Monitoring:** the controller `StalenessChecker` (over `reports`) AND a sibling `HostStalenessChecker` (over `host_reports`, emitting `host_stale`/`host_down`/`host_recovered`).
- Two-tier notifications (operator English / customer Hungarian, Resend, cooldowns); `events` audit.
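
The host staleness events listed above can be read as edges of a small state machine. The sketch below is one possible shape, not the real `HostStalenessChecker`: the event names are from this file, while the thresholds (`staleAfter`, `downAfter`), the `hostState` type, and the `transition` function are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

type hostState int

const (
	stateOK hostState = iota
	stateStale
	stateDown
)

// Assumed thresholds; the real checker's values are not stated in this file.
const (
	staleAfter = 30 * time.Minute
	downAfter  = time.Hour
)

// stateFor maps the age of the newest host report to a state.
func stateFor(age time.Duration) hostState {
	switch {
	case age > downAfter:
		return stateDown
	case age >= staleAfter:
		return stateStale
	default:
		return stateOK
	}
}

// transition returns the new state and the event to emit.
// The event is "" when the state did not change, so repeated checks stay quiet.
func transition(prev hostState, age time.Duration) (hostState, string) {
	next := stateFor(age)
	if next == prev {
		return next, ""
	}
	switch next {
	case stateStale:
		return next, "host_stale"
	case stateDown:
		return next, "host_down"
	default:
		return next, "host_recovered"
	}
}

func main() {
	state := stateOK
	ages := []time.Duration{5 * time.Minute, 45 * time.Minute, 50 * time.Minute, 2 * time.Hour, time.Minute}
	for _, age := range ages {
		var event string
		state, event = transition(state, age)
		if event != "" {
			fmt.Printf("age=%v event=%s\n", age, event)
		}
	}
}
```

Run as written, the loop emits `host_stale`, then `host_down`, then `host_recovered`; the 50-minute check is silent because the host is already stale. Emitting only on state changes is separate from notification cooldowns, which would still apply on top.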
## Code quality rules
- Always double-check generated code for bugs, logic issues, syntax errors.
- Handle edge cases without overcomplicating.
- Add debug capabilities (logging, verbose output).
- If you need more input or troubleshooting output, **ask first — don't guess**.
## Workflow & artifacts
The planning/architecture assistant ("project Claude", in claude.ai) writes specs and validates pushes; **you (Claude Code) implement**. A file being open in the editor is NOT an instruction.
- **`TASK.md` / `TASK-*.md`** — a spec for you to implement. Then push and update this repo's changelog (`hub/CHANGELOG.md`) and root `REPORT.md` per the convention below.
- **`RUNBOOK-*.md`** — an operational procedure. CC executes the steps it has access and capability for, including live validation on the demo nodes and the demo Proxmox host (CC has root@felhom-pve SSH + the felhom-agent token). A step is human-only only when it genuinely needs physical presence, a real-world decision, or credentials CC truly lacks — mark those steps HUMAN. Do not decline a whole procedure because it touches a live host or a privileged token. (Judgment still applies: confirm before irreversible ops on real customer data — but demo scratch guests are fair game.)
- Validation of a push against a spec's criteria is project Claude's job, not yours, unless asked.
> **In every repository where you make a change, update both files in that repo:**
> - **`CHANGELOG.md`** — a cumulative log of **all** changes; newest entry on top.
> - **`REPORT.md`** — **overwrite** with a summary of the **most recent** implementation (or significant validation/operational run) only; not cumulative.
>
> **Never write secrets** — tokens, passwords, private keys, API keys — into `CHANGELOG.md`, `REPORT.md`, or any committed file. Reference them as "stored out-of-band" instead.
## Tech stack (Hub)
- **Language:** Go 1.24+ (build server is go1.26.0).
- **Web:** stdlib `net/http` + `html/template`. **DB:** SQLite via `modernc.org/sqlite` (pure Go).
- **Auth:** bcrypt + Bearer tokens. **Deploy:** Docker on k3s (felhom-system ns).
- **Storage:** Longhorn PVC at `/data/` (SQLite DB). **Config:** YAML via ConfigMap at `/etc/felhom-hub/hub.yaml`.
## SSH access
Use the Windows OpenSSH binary: `SSH=/c/Windows/System32/OpenSSH/ssh.exe`. Git Bash's `/usr/bin/ssh` can't reach the Windows ssh-agent and fails silently. All SSH commands below use `$SSH`.
| Host | IP | User | Role |
|------|----|------|------|
| Build server (k3s node) | 192.168.0.180 | kisfenyo | Build + push images, kubectl (needs `sudo`) |
| Demo Proxmox host | 192.168.0.162 | root (SSH alias `felhom-pve`; Proxmox user `root@pam`; no sudo) | pveum/pct + live Proxmox validation; available to CC |
## Build & deploy — Hub (GitOps via ArgoCD)
The whole k3s cluster is GitOps via a **single ArgoCD app named `felhom`** (`argocd.dooplex.hu`) that syncs this repo's **`manifests/`** to the **`felhom-system`** namespace. **There is no separate `hub` ArgoCD app** — the hub is one `Deployment` (`manifests/hub.yaml`) *inside* the `felhom` app. **Auto-sync is OFF**: deploys are a deliberate manual sync. ArgoCD's source of truth is the **manifest**, so:
- **A code change + CHANGELOG version bump does NOT deploy anything.** The running image only changes when `manifests/hub.yaml`'s `image:` tag changes in git and the app is synced.
- **Pin explicit versions, never `:latest`.** A `:latest` re-push wouldn't change the manifest, so ArgoCD wouldn't redeploy, and Synced / History / Rollback would all misreport what's actually live.
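
The pinning rule above is easy to check mechanically before a manifest bump is committed. The following is a small sketch of such a check, assuming a hypothetical helper named `pinnedTag`; the `0.7.2` tag in `main` is only an illustrative value.

```go
package main

import (
	"fmt"
	"strings"
)

// pinnedTag returns the explicit tag (or digest) of an image reference and
// rejects references that are untagged or tagged :latest.
func pinnedTag(image string) (string, error) {
	// A digest reference (name@sha256:...) is immutable, so it counts as pinned.
	if i := strings.Index(image, "@"); i >= 0 {
		return image[i+1:], nil
	}
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	// A colon before the last slash belongs to a registry port, not a tag.
	if colon <= slash {
		return "", fmt.Errorf("image %q has no tag (implies :latest)", image)
	}
	tag := image[colon+1:]
	if tag == "" || tag == "latest" {
		return "", fmt.Errorf("image %q is not pinned to an explicit version", image)
	}
	return tag, nil
}

func main() {
	for _, img := range []string{
		"gitea.dooplex.hu/admin/felhom-hub:0.7.2", // illustrative version
		"gitea.dooplex.hu/admin/felhom-hub:latest",
		"gitea.dooplex.hu/admin/felhom-hub",
	} {
		tag, err := pinnedTag(img)
		if err != nil {
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("ok:", tag)
	}
}
```

A check like this could run as a pre-commit step on `manifests/hub.yaml`; it only guards the tag format and does not prove that the image exists in the registry.
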
After a code change to `hub/`, to deploy:
1. **Commit + push the code:** `cd /e/git/felhom.eu && git add -A && git commit -m "<msg>" && git push`
2. **Build + push the image** (build script lives on the build server, not in this repo): `$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-hub && ./build.sh <NEW_VERSION> --push"` (pulls latest from Gitea, builds version into `main.Version` via ldflags, pushes `gitea.dooplex.hu/admin/felhom-hub:<VER>`). Pin `<VER>`; don't rely on `:latest`.
3. **Bump the manifest:** set the `image:` tag in `manifests/hub.yaml` to `:<NEW_VERSION>`, commit to `main`, push. The `felhom` app now shows **OutOfSync**.
4. **Sync** (auto-sync is off, so this is required). Easiest is the ArgoCD UI → app `felhom` → **Sync**. From the shell, the `argocd` CLI on the build server (192.168.0.180) is **not logged in** (no server session) and `--core` looks in the wrong namespace under `sudo` (env is stripped), so the reliable scripted path is to drive the Application CR with `kubectl`:
```bash
# a) hard-refresh so ArgoCD picks up the new commit, then confirm OutOfSync:
$SSH kisfenyo@192.168.0.180 "sudo kubectl -n argocd annotate application felhom argocd.argoproj.io/refresh=hard --overwrite; sleep 8; sudo kubectl -n argocd get application felhom -o jsonpath='{.status.sync.status} {.status.sync.revision}{\"\n\"}'"
# b) trigger the sync via the .operation field (the app controller runs it):
$SSH kisfenyo@192.168.0.180 "sudo kubectl -n argocd patch application felhom --type merge -p '{\"operation\":{\"initiatedBy\":{\"username\":\"cc\"},\"sync\":{\"syncStrategy\":{\"apply\":{}}}}}'"
```
(If you do log the CLI in: `argocd app sync felhom` is the one-liner equivalent.)
5. **Verify:** `$SSH kisfenyo@192.168.0.180 "sudo kubectl -n argocd get application felhom -o jsonpath='sync={.status.sync.status} health={.status.health.status}{\"\n\"}'; sudo kubectl -n felhom-system rollout status deploy/hub --timeout=90s; sudo kubectl -n felhom-system get deploy hub -o jsonpath='{.spec.template.spec.containers[0].image}'; echo; sudo kubectl -n felhom-system logs -l app=hub --tail 10"` (expect Synced/Healthy + the new tag + `[INFO] felhom-hub <VERSION> starting`).
> A bare `kubectl set image` would be reverted on the next sync (the manifest is the truth) — always go through `manifests/hub.yaml`. **The live image can lag the CHANGELOG** when version bumps were committed but step 3/4 was never done; reconcile via the manifest, not by assuming the changelog reflects what's running.
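
The merge patch in step 4b is hand-escaped JSON inside two layers of shell quoting, which is easy to get wrong. As a hedged alternative, the same payload can be generated from typed structs and then passed to `kubectl patch -p`. The field names mirror the payload shown in step 4b; the Go type names and `buildSyncPatch` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror only the fields used by the patch in step 4b,
// not the full ArgoCD Application schema.
type syncPatch struct {
	Operation operation `json:"operation"`
}

type operation struct {
	InitiatedBy initiatedBy `json:"initiatedBy"`
	Sync        syncOp      `json:"sync"`
}

type initiatedBy struct {
	Username string `json:"username"`
}

type syncOp struct {
	SyncStrategy syncStrategy `json:"syncStrategy"`
}

type syncStrategy struct {
	Apply struct{} `json:"apply"` // empty object selects the apply strategy
}

// buildSyncPatch returns the merge-patch JSON that sets .operation on the Application.
func buildSyncPatch(username string) (string, error) {
	p := syncPatch{Operation: operation{InitiatedBy: initiatedBy{Username: username}}}
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	patch, err := buildSyncPatch("cc")
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
}
```

The output is a single line of JSON with no shell escaping, so it can be written to a file and applied with `kubectl patch ... --type merge --patch-file <file>` instead of an inline `-p` string.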
## Build & deploy — Website / Manifests
- **Website** auto-deploys via git-sync; just push to `main` (live in 1–2 min). Emergency edits: FileBrowser at `https://files.felhom.eu`.
- **Manifests** (`manifests/`) are GitOps via the `felhom` ArgoCD app: commit to `main`, then sync (auto-sync is off) via the UI **Sync** button, the `kubectl` path from the Hub deploy steps, or `argocd app sync felhom` once the CLI is logged in. Do **not** `kubectl apply` them directly (a later sync reverts drift; the manifest in git is the truth).
## Key patterns
- Hub ingests **host-reports from agents** (`POST /api/v1/host-report`, Bearer per-host) and legacy **controller reports** (`POST /api/v1/report`). The host-report `received_at` is the dead-man's-switch liveness signal.
- Status logic: OK (report < 30m), WARN (30m–1h or health=warn), DOWN (> 1h or health=fail).
- SQLite timestamps vary in format — use `parseSQLiteTime()`.
- Dashboard/detail auto-refresh every 60s via `<meta http-equiv="refresh">`. Geo-restricted to Hungary via nginx ingress annotation.
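
The status rule in the list above can be written as a small pure function. This is a sketch rather than the hub's implementation: the thresholds and the `warn`/`fail` health values are from this file, while the function name `status`, the `"ok"` health value, and the choice to let DOWN take precedence over WARN are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

const (
	warnAfter = 30 * time.Minute // report age at which OK becomes WARN
	downAfter = time.Hour        // report age beyond which WARN becomes DOWN
)

// status classifies a box from the age of its last report and its reported health.
// DOWN is checked first so that a failing or long-silent box is never shown as WARN.
func status(age time.Duration, health string) string {
	switch {
	case age > downAfter || health == "fail":
		return "DOWN"
	case age >= warnAfter || health == "warn":
		return "WARN"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(status(10*time.Minute, "ok"))   // OK
	fmt.Println(status(45*time.Minute, "ok"))   // WARN
	fmt.Println(status(10*time.Minute, "warn")) // WARN
	fmt.Println(status(2*time.Hour, "ok"))      // DOWN
	fmt.Println(status(10*time.Minute, "fail")) // DOWN
}
```

Taking the age as a `time.Duration` keeps the function independent of timestamp parsing, so the caller can compute `time.Since(t)` after normalizing the stored value with `parseSQLiteTime()`.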
## File encoding
All `website/` HTML is **UTF-8 with BOM** — preserve it. Hub Go source is standard UTF-8 (no BOM).
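
Because many editors and tools silently drop or add a BOM, a quick byte-level check helps when a script rewrites these files. A minimal sketch, with hypothetical helper names (`hasBOM`, `withBOM`, `withoutBOM`):

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 byte order mark.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasBOM reports whether b starts with a UTF-8 BOM.
func hasBOM(b []byte) bool {
	return bytes.HasPrefix(b, utf8BOM)
}

// withBOM returns b with exactly one leading BOM (for website/ HTML).
func withBOM(b []byte) []byte {
	if hasBOM(b) {
		return b
	}
	out := make([]byte, 0, len(utf8BOM)+len(b))
	out = append(out, utf8BOM...)
	return append(out, b...)
}

// withoutBOM returns b with a leading BOM removed (for Go source).
func withoutBOM(b []byte) []byte {
	return bytes.TrimPrefix(b, utf8BOM)
}

func main() {
	html := []byte("<!doctype html>")
	fixed := withBOM(html)
	fmt.Println(hasBOM(html), hasBOM(fixed), len(fixed)-len(html)) // false true 3
	fmt.Println(hasBOM(withoutBOM(fixed)))                         // false
}
```

After editing a `website/` file programmatically, passing the result through `withBOM` before writing restores the marker if a tool stripped it.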