# TASK.md — v0.6.0: Healthcheck Implementation + Central Push + Multi-Customer Dashboard
> **Version:** v0.6.0
> **Depends on:** v0.5.4 (current)
> **Repo:** `deploy-felhom-compose` (controller/ subfolder)
> **Build:** `~/build/felhom-controller/build.sh 0.6.0 --push`
> **Deploy target:** demo-felhom.eu (N100) + k3s cluster (dooplex.hu)
---
## Context
The controller already has health monitoring infrastructure built in v0.4.0:
- `internal/monitor/pinger.go` — Healthchecks.io-compatible HTTP ping client (success/fail/start, retries)
- `internal/monitor/healthcheck.go` — System health checks (disk, memory, CPU, temp, Docker, protected containers)
- Scheduler jobs in `main.go`: `system-health` (every 5m), `db-dump` (daily), `backup` (daily)
- Backup manager already calls `pinger.Ping()`/`pinger.Fail()` after each operation
**Problem:** The demo-felhom Healthchecks project has **zero checks created** (screenshot confirms empty project at `status.felhom.eu/projects/.../checks/`). The `controller.yaml` on demo-felhom has all `CHANGEME` placeholder UUIDs. Nothing is actually pinging.
Additionally, there are legacy bash scripts (`backup-healthcheck.sh`, `monitoring-setup.sh`) from the pre-controller era that duplicate functionality now built into the controller. These should be deprecated in favor of controller-native pings.
**This version has two major parts:**
1. **Prerequisite:** Get healthchecks actually working on demo-felhom (create checks, configure UUIDs, verify pings)
2. **New feature:** Central push from customer controllers to k3s + multi-customer overview dashboard
---
## Part 0: Healthcheck Ping Design (controller.yaml schema update)
### Current ping types (already implemented in code)
| Ping | Schedule | Source | What it proves |
|------|----------|--------|----------------|
| `system_health` | Every 5 min | `monitor.RunHealthCheck()` | Server alive, Docker running, disks OK, protected containers up, CPU/mem/temp within thresholds |
| `db_dump` | Daily 02:30 | `backup.RunDBDumps()` | Database dumps completed successfully |
| `backup` | Daily 03:00 | `backup.RunBackup()` | Restic snapshot completed successfully |
### New ping types to add
| Ping | Schedule | Source | What it proves |
|------|----------|--------|----------------|
| `backup_integrity` | Weekly (Sunday 04:00) | New: `backup.RunIntegrityCheck()` | Restic repo passes `restic check` — data is not corrupted |
| `heartbeat` | Every 5 min | New: lightweight HTTP POST, no logic | Controller process is alive (distinct from `system_health` which does heavy checks and could fail due to a bug while the controller itself is fine) |
### Revised `controller.yaml` monitoring section
```yaml
monitoring:
  enabled: true
  healthchecks_base: "https://status.felhom.eu"
  ping_uuids:
    heartbeat: ""          # NEW: every 5 min, controller alive
    system_health: ""      # existing: every 5 min, comprehensive check
    db_dump: ""            # existing: daily after db dumps
    backup: ""             # existing: daily after restic snapshot
    backup_integrity: ""   # NEW: weekly after restic check
  system_health_interval: "5m"
  health_check_schedule: "06:00"
  thresholds:
    disk_warn_percent: 80
    disk_crit_percent: 90
    backup_max_age_hours: 36
    cpu_warn_percent: 90
    memory_warn_percent: 85
    temperature_warn_celsius: 75
```
> **Note:** Empty string and "CHANGEME..." UUIDs are both skipped by the pinger (already implemented). This means any check can be left unconfigured — the controller just skips it silently.
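The skip rule in the note can be sketched as a small helper. This is an illustration only: `pingURL` is a hypothetical name, not the existing pinger API, and the `/ping/<uuid>` path assumes the default ping endpoint of a self-hosted Healthchecks instance.

```go
package main

import "strings"

// pingURL builds the Healthchecks URL for one check (hypothetical helper).
// action is "" for a success ping, or "fail" / "start".
// ok is false when the UUID is empty or still a CHANGEME placeholder,
// in which case the caller should skip the ping silently.
func pingURL(base, uuid, action string) (url string, ok bool) {
	uuid = strings.TrimSpace(uuid)
	if uuid == "" || strings.HasPrefix(uuid, "CHANGEME") {
		return "", false
	}
	// Assumption: pings are served under <base>/ping/<uuid>.
	url = strings.TrimRight(base, "/") + "/ping/" + uuid
	if action != "" {
		url += "/" + action
	}
	return url, true
}
```

Returning `ok` instead of an error keeps "not configured" distinct from "ping failed", which matches the silent-skip behaviour described above.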
### Healthchecks check configuration (to be created manually on status.felhom.eu)
For each customer project, create these checks:
| Check name | Period | Grace | Tags |
|-----------|--------|-------|------|
| `heartbeat` | 5 minutes | 10 minutes | `heartbeat` |
| `system-health` | 5 minutes | 10 minutes | `system`, `health` |
| `db-dump` | 1 day (02:30 CET) | 30 minutes | `backup`, `db` |
| `backup` | 1 day (03:00 CET) | 60 minutes | `backup`, `restic` |
| `backup-integrity` | 7 days | 24 hours | `backup`, `integrity` |
---
## Part 1: Controller-side healthcheck implementation
### Task 1.1: Add heartbeat ping
**Files:** `cmd/controller/main.go`
Add a new scheduler job — the simplest possible ping, no health check logic:
```go
// Heartbeat: lightweight "I'm alive" signal
sched.Every("heartbeat", 5*time.Minute, func(ctx context.Context) error {
	pinger.Ping(cfg.Monitoring.PingUUIDs.Heartbeat, "")
	return nil
})
```
**Files:** `internal/config/config.go`
Add `Heartbeat` and `BackupIntegrity` fields to `PingUUIDsConfig`:
```go
type PingUUIDsConfig struct {
	Heartbeat       string `yaml:"heartbeat"`        // new
	DBDump          string `yaml:"db_dump"`
	Backup          string `yaml:"backup"`
	SystemHealth    string `yaml:"system_health"`
	BackupIntegrity string `yaml:"backup_integrity"` // new
}
```
### Task 1.2: Add backup integrity check
**Files:** `internal/backup/restic.go`
Add a `Check()` method (may already exist as part of prune logic — verify first):
```go
// Check runs `restic check` to verify repository integrity.
func (r *ResticRunner) Check() error {
	args := []string{"check", "--repo", r.repo, "--json"}
	// ... standard exec with password file, timeout 30 min
}
```
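The elided "standard exec with password file, timeout 30 min" step could look roughly like the sketch below. It is standalone rather than a `ResticRunner` method, because the runner's fields are not shown here; `resticCheckArgs` and `runResticCheck` are hypothetical names, and the binary path, repository, and password file are passed in explicitly.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// resticCheckArgs returns the argument list for `restic check`.
func resticCheckArgs(repo, passwordFile string) []string {
	return []string{"check", "--repo", repo, "--password-file", passwordFile, "--json"}
}

// runResticCheck runs `restic check` and returns an error if the command
// cannot start, exits non-zero, or does not finish within timeout
// (the task suggests 30 minutes).
func runResticCheck(bin, repo, passwordFile string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, bin, resticCheckArgs(repo, passwordFile)...)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("restic check timed out after %s", timeout)
	}
	if err != nil {
		// Include restic's output so the Healthchecks fail body is useful.
		return fmt.Errorf("restic check: %w (output: %s)", err, out)
	}
	return nil
}
```

Inside the controller this would presumably reuse whatever exec helper the existing snapshot and prune commands already go through.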
**Files:** `internal/backup/backup.go`
Add `RunIntegrityCheck()`:
```go
// RunIntegrityCheck runs restic check and pings healthchecks with the result.
func (m *Manager) RunIntegrityCheck(ctx context.Context) error {
	err := m.restic.Check()
	uuid := m.cfg.Monitoring.PingUUIDs.BackupIntegrity
	if err != nil {
		m.pinger.Fail(uuid, fmt.Sprintf("restic check failed: %v", err))
		return err
	}
	m.pinger.Ping(uuid, "restic check passed")
	return nil
}
```
**Files:** `cmd/controller/main.go`
Register the weekly job:
```go
if cfg.Backup.Enabled && backupMgr != nil {
	// ... existing daily jobs ...

	// Weekly integrity check: Sunday 04:00
	sched.Daily("backup-integrity", "04:00", func(ctx context.Context) error {
		if time.Now().Weekday() != time.Sunday {
			return nil // skip non-Sundays
		}
		return backupMgr.RunIntegrityCheck(ctx)
	})
}
```
> **Note on scheduler:** `Daily()` fires every day at the given time. To make it weekly, check the weekday inside the function. If you prefer, add a `Weekly()` method to the scheduler — but the weekday check is simpler and consistent with how prune already works.
### Task 1.3: Update example config
**Files:** `controller/configs/controller.yaml.example`
Update the `monitoring.ping_uuids` section to include `heartbeat` and `backup_integrity` fields. Add comments explaining each.
### Task 1.4: Deprecation note for bash monitoring scripts
The following files in `deploy-felhom-compose/monitoring/` are **superseded** by the controller's built-in monitoring:
- `backup-healthcheck.sh` → replaced by `internal/monitor/healthcheck.go` (scheduler: `system-health`)
- `monitoring-setup.sh` → no longer needed (controller reads `controller.yaml` directly)
- `monitoring.conf.template` → replaced by `controller.yaml` monitoring section
- `backup-healthcheck.service` / `.timer` → replaced by controller's scheduler
**Action:** Add a `DEPRECATED.md` in `deploy-felhom-compose/monitoring/` explaining that these scripts are kept for reference only and should not be used on nodes running felhom-controller v0.4.0+. Do NOT delete the files yet — they may be needed if a customer is still on a pre-controller setup.
### Verification (Part 1)
After building and deploying v0.6.0 to demo-felhom:
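1. Check controller logs (stderr is merged so that log lines written there are also filtered): `docker logs felhom-controller --since 5m 2>&1 | grep -i "ping\|health\|heartbeat"`
2. Verify pings arrive at `status.felhom.eu`: `heartbeat` and `system-health` should show green within 10 minutes. The `db-dump`, `backup`, and `backup-integrity` checks stay in the "new" state until their first scheduled run.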
3. Test failure: `docker stop traefik`, wait 5 min, check that `system-health` goes red (protected container missing)
4. Restart traefik: `docker start traefik`, verify recovery
---
## Part 2: Central push to k3s (customer → operator reporting)
### Architecture
```
┌─────────────────────────┐   HTTPS POST /api/v1/report
│  Customer controller    │────────────────────────────────────────┐
│  (demo-felhom.eu)       │   every 15 min (configurable)          │
└─────────────────────────┘                                        ▼
                                                    ┌─────────────────────────────┐
┌─────────────────────────┐   HTTPS POST            │  felhom-hub                 │
│  Customer controller    │────────────────────────▶│  (k3s pod on dooplex.hu)    │
│  (customer-2)           │                         │                             │
└─────────────────────────┘                         │  - Receives reports         │
                                                    │  - Stores in SQLite         │
                                                    │  - Serves dashboard         │
                                                    │  - Alerts on stale reports  │
                                                    └─────────────────────────────┘
                                                             hub.felhom.eu
```
### Task 2.1: Define the report payload
The controller pushes a JSON summary every 15 minutes. This is **not** raw metrics — it's an aggregated health summary.
```json
{
  "version": 1,
  "customer_id": "demo-felhom",
  "customer_name": "Demo Ügyfél",
  "controller_version": "0.6.0",
  "timestamp": "2026-02-16T12:00:00Z",
  "system": {
    "hostname": "demo-felhom",
    "os": "Debian GNU/Linux 13 (trixie)",
    "kernel": "6.12.69+deb13-amd64",
    "cpu_model": "Intel N100",
    "cpu_cores": 4,
    "uptime_seconds": 345600,
    "cpu_percent": 12.5,
    "memory_total_mb": 15872,
    "memory_used_mb": 4200,
    "memory_percent": 26.5,
    "temperature_celsius": 48.0,
    "load_avg_1": 0.45,
    "load_avg_5": 0.38,
    "load_avg_15": 0.32
  },
  "storage": [
    { "mount": "/", "total_gb": 476.0, "used_gb": 28.5, "percent": 6.0 },
    { "mount": "/mnt/hdd_1", "total_gb": 931.0, "used_gb": 120.3, "percent": 12.9 }
  ],
  "containers": {
    "total": 16,
    "running": 14,
    "stopped": 2,
    "unhealthy": 0,
    "list": [
      { "name": "paperless-ngx-webserver-1", "state": "running", "cpu_percent": 2.1, "memory_mb": 350 },
      { "name": "traefik", "state": "running", "cpu_percent": 0.3, "memory_mb": 45 }
    ]
  },
  "backup": {
    "enabled": true,
    "last_db_dump": "2026-02-16T02:30:15Z",
    "last_snapshot": "2026-02-16T03:02:45Z",
    "snapshot_count": 42,
    "repo_size_mb": 2048,
    "last_integrity_check": "2026-02-15T04:00:00Z",
    "integrity_ok": true
  },
  "health": {
    "status": "ok",
    "issues": [],
    "warnings": []
  },
  "stacks": {
    "deployed": ["paperless-ngx", "immich", "jellyfin"],
    "available": ["nextcloud", "vaultwarden", "home-assistant"],
    "updates_available": 1
  }
}
```
### Task 2.2: Implement report builder in the controller
**New file:** `controller/internal/report/builder.go`
```go
package report

// Report is the JSON payload pushed to the central hub.
type Report struct {
	Version           int             `json:"version"`
	CustomerID        string          `json:"customer_id"`
	CustomerName      string          `json:"customer_name"`
	ControllerVersion string          `json:"controller_version"`
	Timestamp         time.Time       `json:"timestamp"`
	System            SystemReport    `json:"system"`
	Storage           []StorageReport `json:"storage"`
	Containers        ContainerReport `json:"containers"`
	Backup            BackupReport    `json:"backup"`
	Health            HealthReport    `json:"health"`
	Stacks            StacksReport    `json:"stacks"`
}

// BuildReport collects current state from all subsystems and returns a Report.
func BuildReport(cfg *config.Config, stackMgr *stacks.Manager,
	backupMgr *backup.Manager, cpuCollector *system.CPUCollector,
	pinger *monitor.Pinger, version string) *Report {
	// Gather system info from system.GetInfo()
	// Gather container info from stackMgr
	// Gather backup info from backupMgr.GetFullStatus()
	// Gather health from monitor.RunHealthCheck()
	// Gather stack list from stackMgr.GetStacks()
	// Return assembled Report
}
```
This function should call existing methods — **do not duplicate logic**. Use the same data sources the dashboard and monitoring page already use.
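The sub-report types referenced by `Report` are not spelled out in this task. One possible shape for `types.go`, derived field by field from the Task 2.1 payload, is sketched below. The Go field names are assumptions; only the JSON keys come from the spec. Backup timestamps are pointers here (also an assumption) so that a backup that has never run serializes as `null` rather than as a zero time. `decodeReport` is a hypothetical helper that mostly serves to show the tags round-trip.

```go
package main

import (
	"encoding/json"
	"time"
)

type Report struct {
	Version           int             `json:"version"`
	CustomerID        string          `json:"customer_id"`
	CustomerName      string          `json:"customer_name"`
	ControllerVersion string          `json:"controller_version"`
	Timestamp         time.Time       `json:"timestamp"`
	System            SystemReport    `json:"system"`
	Storage           []StorageReport `json:"storage"`
	Containers        ContainerReport `json:"containers"`
	Backup            BackupReport    `json:"backup"`
	Health            HealthReport    `json:"health"`
	Stacks            StacksReport    `json:"stacks"`
}

type SystemReport struct {
	Hostname      string  `json:"hostname"`
	OS            string  `json:"os"`
	Kernel        string  `json:"kernel"`
	CPUModel      string  `json:"cpu_model"`
	CPUCores      int     `json:"cpu_cores"`
	UptimeSeconds int64   `json:"uptime_seconds"`
	CPUPercent    float64 `json:"cpu_percent"`
	MemoryTotalMB int64   `json:"memory_total_mb"`
	MemoryUsedMB  int64   `json:"memory_used_mb"`
	MemoryPercent float64 `json:"memory_percent"`
	TemperatureC  float64 `json:"temperature_celsius"`
	LoadAvg1      float64 `json:"load_avg_1"`
	LoadAvg5      float64 `json:"load_avg_5"`
	LoadAvg15     float64 `json:"load_avg_15"`
}

type StorageReport struct {
	Mount   string  `json:"mount"`
	TotalGB float64 `json:"total_gb"`
	UsedGB  float64 `json:"used_gb"`
	Percent float64 `json:"percent"`
}

type ContainerInfo struct {
	Name       string  `json:"name"`
	State      string  `json:"state"`
	CPUPercent float64 `json:"cpu_percent"`
	MemoryMB   float64 `json:"memory_mb"`
}

type ContainerReport struct {
	Total     int             `json:"total"`
	Running   int             `json:"running"`
	Stopped   int             `json:"stopped"`
	Unhealthy int             `json:"unhealthy"`
	List      []ContainerInfo `json:"list"`
}

type BackupReport struct {
	Enabled            bool       `json:"enabled"`
	LastDBDump         *time.Time `json:"last_db_dump"`
	LastSnapshot       *time.Time `json:"last_snapshot"`
	SnapshotCount      int        `json:"snapshot_count"`
	RepoSizeMB         int64      `json:"repo_size_mb"`
	LastIntegrityCheck *time.Time `json:"last_integrity_check"`
	IntegrityOK        bool       `json:"integrity_ok"`
}

type HealthReport struct {
	Status   string   `json:"status"` // "ok", "warn", or "fail"
	Issues   []string `json:"issues"`
	Warnings []string `json:"warnings"`
}

type StacksReport struct {
	Deployed         []string `json:"deployed"`
	Available        []string `json:"available"`
	UpdatesAvailable int      `json:"updates_available"`
}

// decodeReport parses a payload, for example on the hub side.
func decodeReport(data []byte) (Report, error) {
	var r Report
	err := json.Unmarshal(data, &r)
	return r, err
}
```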
### Task 2.3: Implement report pusher in the controller
**New file:** `controller/internal/report/pusher.go`
```go
package report

// Pusher sends reports to the central hub.
type Pusher struct {
	hubURL     string
	apiKey     string
	httpClient *http.Client
	logger     *log.Logger
	enabled    bool
}

// Push sends a report to the hub, retrying up to 3 times with a 5s backoff.
// Failures are logged but never propagated: Push always returns nil, so a
// hub outage cannot affect controller operation.
func (p *Pusher) Push(report *Report) error {
	// JSON marshal
	// POST to hubURL + "/api/v1/report"
	// Header: Authorization: Bearer <apiKey>
	// Header: Content-Type: application/json
	// Retry on failure
	// Log but don't propagate errors
}
```
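A minimal sketch of how the commented steps in `Push` might be filled in, using only the standard library. Several details are assumptions rather than part of the spec: "retries 3 times" is read as three tries in total, the per-request timeout is 10 seconds, logging goes through the package-level `log` instead of the injected logger, and `Report` is reduced to a single field to keep the example self-contained.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Report stands in for the full payload type; only one field is shown.
type Report struct {
	CustomerID string `json:"customer_id"`
}

type Pusher struct {
	hubURL     string
	apiKey     string
	httpClient *http.Client
	attempts   int           // total tries per push (assumed: 3)
	backoff    time.Duration // pause between tries (5s in the task)
}

func NewPusher(hubURL, apiKey string) *Pusher {
	return &Pusher{
		hubURL:     hubURL,
		apiKey:     apiKey,
		httpClient: &http.Client{Timeout: 10 * time.Second},
		attempts:   3,
		backoff:    5 * time.Second,
	}
}

// send performs a single POST and returns an error on any non-2xx outcome.
func (p *Pusher) send(body []byte) error {
	req, err := http.NewRequest(http.MethodPost, p.hubURL+"/api/v1/report", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+p.apiKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := p.httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("hub returned %s", resp.Status)
	}
	return nil
}

// Push retries send and logs failures. It always returns nil so that a
// hub outage cannot fail the scheduler job.
func (p *Pusher) Push(r *Report) error {
	body, err := json.Marshal(r)
	if err != nil {
		log.Printf("[WARN] hub push: marshal: %v", err)
		return nil
	}
	for i := 1; i <= p.attempts; i++ {
		if err = p.send(body); err == nil {
			return nil
		}
		log.Printf("[WARN] hub push attempt %d/%d failed: %v", i, p.attempts, err)
		if i < p.attempts {
			time.Sleep(p.backoff)
		}
	}
	return nil
}
```

Splitting `send` from `Push` keeps the single-attempt logic testable on its own, while `Push` owns the retry and swallow-errors policy.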
### Task 2.4: Add hub configuration to controller.yaml
**Files:** `internal/config/config.go`, `controller/configs/controller.yaml.example`
```yaml
# --- Central hub (operator dashboard) ---
hub:
  enabled: false                  # Enable central reporting
  url: "https://hub.felhom.eu"    # Hub API endpoint
  api_key: ""                     # Shared secret for authentication
  push_interval: "15m"            # How often to push reports
```
```go
type HubConfig struct {
	Enabled      bool   `yaml:"enabled"`
	URL          string `yaml:"url"`
	APIKey       string `yaml:"api_key"`
	PushInterval string `yaml:"push_interval"`
}
```
Add `` Hub HubConfig `yaml:"hub"` `` to the main `Config` struct.
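Because `push_interval` is stored as a string, it has to be parsed somewhere. One option, sketched here under the assumption that a bad value should fall back to the documented 15 minutes rather than abort startup, is a small accessor on the config type. `Interval` and `defaultPushInterval` are hypothetical names.

```go
package main

import "time"

const defaultPushInterval = 15 * time.Minute

type HubConfig struct {
	Enabled      bool   `yaml:"enabled"`
	URL          string `yaml:"url"`
	APIKey       string `yaml:"api_key"`
	PushInterval string `yaml:"push_interval"`
}

// Interval returns the parsed push interval. It falls back to 15 minutes
// when the value is missing, malformed, or not positive.
func (h HubConfig) Interval() time.Duration {
	d, err := time.ParseDuration(h.PushInterval)
	if err != nil || d <= 0 {
		return defaultPushInterval
	}
	return d
}
```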
### Task 2.5: Wire the pusher into main.go
```go
// --- Central hub reporting ---
if cfg.Hub.Enabled && cfg.Hub.URL != "" {
	pushInterval, err := time.ParseDuration(cfg.Hub.PushInterval)
	if err != nil || pushInterval <= 0 {
		// Fall back to the documented default instead of failing startup.
		logger.Printf("[WARN] Invalid hub.push_interval %q, using 15m", cfg.Hub.PushInterval)
		pushInterval = 15 * time.Minute
	}
	pusher := report.NewPusher(&cfg.Hub, logger)
	sched.Every("hub-report", pushInterval, func(ctx context.Context) error {
		r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, pinger, version)
		return pusher.Push(r)
	})
	logger.Printf("[INFO] Hub reporting enabled (every %s to %s)", pushInterval, cfg.Hub.URL)
}
```
### Verification (Part 2)
1. Set `hub.enabled: true` and `hub.url` to a temporary endpoint (e.g., `https://webhook.site/...`) in demo-felhom's `controller.yaml`
2. Restart controller, check logs for "Hub reporting enabled"
3. Wait 15 min (or set `push_interval: "1m"` for testing), verify JSON arrives at the endpoint
4. Validate JSON structure matches the spec above
5. Reset `push_interval` to `"15m"` after testing
---
## Part 3: Hub service on k3s (operator side)
### Overview
The hub is a lightweight Go service deployed on Viktor's k3s cluster in the `felhom-system` namespace. It receives reports from customer controllers, stores them in SQLite, and serves an English-language dashboard for Viktor.
**Domain:** `hub.felhom.eu` (Nginx Ingress, cert-manager TLS)
**Namespace:** `felhom-system` (alongside Healthchecks and other felhom infra)
**Code:** `felhom.eu` repo on Gitea, `hub/` subfolder
### Task 3.1: Hub service (subfolder in felhom.eu repository)
The hub lives in the existing `felhom.eu` repository on Gitea as a `hub/` subfolder and is deployed to the k3s cluster in the `felhom-system` namespace. Its K8s manifests live in the same repository under `manifests/`, not in `homelab-manifests`.
**Structure (inside felhom.eu repo):**
```
hub/
├── cmd/hub/main.go              # Entry point
├── internal/
│   ├── api/
│   │   └── handler.go           # POST /api/v1/report, GET /api/v1/customers
│   ├── store/
│   │   └── store.go             # SQLite: save reports, query latest per customer
│   └── web/
│       ├── server.go            # Dashboard HTTP server
│       ├── templates/
│       │   ├── dashboard.html   # Multi-customer overview (English)
│       │   ├── customer.html    # Single customer detail (English)
│       │   └── style.css        # Dark theme matching felhom.eu
│       └── embed.go
├── configs/
│   └── hub.yaml.example
├── Dockerfile
├── Makefile
└── go.mod
```
K8s manifests in `felhom.eu/manifests/` (alongside healthchecks.yaml, webpage.yaml, etc.):
```
manifests/hub.yaml # Deployment, Service, Ingress, PVC
```
### Task 3.2: Hub API endpoints
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/api/v1/report` | Bearer token | Receive customer report (JSON body) |
| `GET` | `/api/v1/customers` | Session/Basic | List all customers with latest status |
| `GET` | `/api/v1/customers/{id}` | Session/Basic | Get latest report for a customer |
| `GET` | `/api/v1/customers/{id}/history` | Session/Basic | Get report history (last 24h/7d/30d) |
| `GET` | `/` | Session/Basic | Dashboard HTML page |
| `GET` | `/customers/{id}` | Session/Basic | Customer detail HTML page |
**Authentication:**
- Report ingest: Bearer token (shared secret per customer, or a single hub-wide key for simplicity)
- Dashboard: Basic auth or simple password (Viktor only) — reuse the same bcrypt approach as the controller
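For the report-ingest side, the Bearer check can stay very small. The sketch below assumes the single hub-wide key variant; `bearerOK` is a hypothetical helper name. It uses a constant-time comparison and treats an unset key as "reject everything", so a hub deployed with an empty `report_api_key` does not accept anonymous reports.

```go
package main

import (
	"crypto/subtle"
	"strings"
)

// bearerOK reports whether an Authorization header carries the expected key.
func bearerOK(header, wantKey string) bool {
	const prefix = "Bearer "
	if wantKey == "" || !strings.HasPrefix(header, prefix) {
		return false // an unset key must never authenticate anyone
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(wantKey)) == 1
}
```

In a handler this would typically be called as `bearerOK(r.Header.Get("Authorization"), key)`, answering with 401 when it returns false.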
### Task 3.3: Hub SQLite schema
```sql
CREATE TABLE IF NOT EXISTS reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT NOT NULL,
    received_at DATETIME NOT NULL DEFAULT (datetime('now')),
    report_json TEXT NOT NULL,          -- Full JSON payload
    -- Denormalized fields for fast queries:
    health_status TEXT,                 -- "ok", "warn", "fail"
    cpu_percent REAL,
    memory_percent REAL,
    container_total INTEGER,
    container_running INTEGER,
    backup_last_snapshot DATETIME,
    controller_version TEXT
);

CREATE INDEX IF NOT EXISTS idx_reports_customer ON reports(customer_id, received_at DESC);

-- Prune old reports daily; the cutoff should match retention.max_days in hub.yaml (30 days shown):
-- DELETE FROM reports WHERE received_at < datetime('now', '-30 days');
```
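The denormalized columns have to be copied out of the payload at ingest time. A sketch of that extraction step, decoding only the fields the table needs, might look like this. `reportRow` and `extractRow` are hypothetical names, and rejecting a report without `customer_id` is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
)

// reportRow holds the columns that are copied out of the JSON payload.
type reportRow struct {
	CustomerID         string
	HealthStatus       string
	CPUPercent         float64
	MemoryPercent      float64
	ContainerTotal     int
	ContainerRunning   int
	BackupLastSnapshot string // timestamp text as sent by the controller
	ControllerVersion  string
}

// extractRow decodes only the fields the reports table denormalizes.
func extractRow(raw []byte) (reportRow, error) {
	var p struct {
		CustomerID        string `json:"customer_id"`
		ControllerVersion string `json:"controller_version"`
		System            struct {
			CPUPercent    float64 `json:"cpu_percent"`
			MemoryPercent float64 `json:"memory_percent"`
		} `json:"system"`
		Containers struct {
			Total   int `json:"total"`
			Running int `json:"running"`
		} `json:"containers"`
		Backup struct {
			LastSnapshot string `json:"last_snapshot"`
		} `json:"backup"`
		Health struct {
			Status string `json:"status"`
		} `json:"health"`
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return reportRow{}, err
	}
	if p.CustomerID == "" {
		return reportRow{}, errors.New("report has no customer_id")
	}
	return reportRow{
		CustomerID:         p.CustomerID,
		HealthStatus:       p.Health.Status,
		CPUPercent:         p.System.CPUPercent,
		MemoryPercent:      p.System.MemoryPercent,
		ContainerTotal:     p.Containers.Total,
		ContainerRunning:   p.Containers.Running,
		BackupLastSnapshot: p.Backup.LastSnapshot,
		ControllerVersion:  p.ControllerVersion,
	}, nil
}
```

The raw bytes would still be stored unchanged in `report_json`, so fields added to the payload later are not lost.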
### Task 3.4: Hub dashboard UI (English)
**Overview page (`/`):**
A table/grid showing all customers at a glance:
| Customer | Status | Last seen | CPU | Memory | Disk | Containers | Last backup | Version |
|----------|--------|-----------|-----|--------|------|------------|-------------|---------|
| 🟢 Demo Ügyfél | OK | 2 min ago | 12% | 26% | 6%/13% | 14/16 | 3h ago | 0.6.0 |
| 🟡 Kovács Péter | WARN | 18 min ago | 45% | 78% | 82% ⚠️ | 8/8 | 4h ago | 0.5.4 |
| 🔴 Nagy Anna | DOWN | 2h ago | | | | | 26h ago ⚠️ | 0.5.4 |
**Color coding:**
- 🟢 Green: last seen < 30 min AND health = "ok"
- 🟡 Yellow: last seen < 30 min AND health = "warn", OR last seen 30–60 min with health other than "fail"
- 🔴 Red: last seen > 60 min OR health = "fail" (takes precedence over yellow)
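The color rules can be expressed as one function. This is a sketch with a hypothetical name (`statusColor`). It checks red first, and it treats the exact 30 and 60 minute boundaries, as well as any unknown health value, as yellow, which is an assumption the rules above leave open.

```go
package main

import "time"

// statusColor maps the age of the latest report and its health status
// to a dashboard color. Red is checked first so that a failing or
// long-silent customer is never shown as yellow.
func statusColor(sinceLastSeen time.Duration, health string) string {
	switch {
	case sinceLastSeen > 60*time.Minute || health == "fail":
		return "red"
	case sinceLastSeen < 30*time.Minute && health == "ok":
		return "green"
	default:
		// Fresh report with "warn", or last seen 30-60 minutes ago.
		return "yellow"
	}
}
```

On the overview page this would be called with `time.Since(receivedAt)` and the stored `health_status` of each customer's latest report.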
**Customer detail page (`/customers/{id}`):**
- Last report timestamp
- Full system info section (same layout as controller's monitoring page)
- Container list with CPU/memory
- Backup status details
- Health issues/warnings
- Report history (collapsible list, last 24h)
**Design:** English language. Dark theme matching felhom.eu / the controller dashboard. Use the same CSS variables and fonts.
### Task 3.5: Hub Kubernetes manifests
**File:** `felhom.eu/manifests/hub.yaml` (alongside `healthchecks.yaml`, `webpage.yaml`, etc.)
```yaml
# Namespace: felhom-system (shared with healthchecks and other felhom infra)
# Deployment: 1 replica, 64Mi-256Mi memory
# Service: ClusterIP port 8080
# PVC: 1Gi for SQLite (Longhorn)
# Ingress: hub.felhom.eu via nginx-internal, cert-manager TLS
# Auth: same geo-restriction as other dooplex.hu services (HU only)
```
**ConfigMap** for `hub.yaml` config:
```yaml
auth:
  password_hash: ""          # bcrypt hash, same approach as controller
api:
  report_api_key: ""         # Bearer token for report ingest
retention:
  max_days: 90               # Keep 90 days of report history
  prune_schedule: "04:30"    # Daily prune
alerting:
  stale_threshold: "30m"     # Alert if customer not seen for 30 min
```
### Task 3.6: Alerting (optional, future enhancement)
When a customer is "stale" (no report for > 30 min), the hub could:
- Send a webhook to Healthchecks (one "customer-X-reporting" check per customer)
- Send email via Resend
- Push to Telegram
For v0.6.0 scope: just show the status on the dashboard. Alerting can be added in v0.6.1.
---
## Part 4: Manual steps for Viktor (demo-felhom setup)
These steps must be done by Viktor manually — Claude Code cannot access status.felhom.eu or the demo-felhom server.
### 4.1: Create Healthchecks checks on status.felhom.eu
1. Log into `status.felhom.eu`
2. Open the "demo-felhom" project
3. Create 5 checks with the settings from the table in Part 0
4. Copy the ping UUIDs for each check
### 4.2: Update controller.yaml on demo-felhom
SSH into demo-felhom and update `/opt/docker/felhom-controller/controller.yaml`:
```yaml
monitoring:
  enabled: true
  healthchecks_base: "https://status.felhom.eu"
  ping_uuids:
    heartbeat: "<UUID-from-step-4.1>"
    system_health: "<UUID-from-step-4.1>"
    db_dump: "<UUID-from-step-4.1>"
    backup: "<UUID-from-step-4.1>"
    backup_integrity: "<UUID-from-step-4.1>"
  system_health_interval: "5m"
  health_check_schedule: "06:00"
  thresholds:
    disk_warn_percent: 80
    disk_crit_percent: 90
    backup_max_age_hours: 36
    cpu_warn_percent: 90
    memory_warn_percent: 85
    temperature_warn_celsius: 75
```
### 4.3: Restart controller
```bash
cd /opt/docker/felhom-controller
docker compose pull
docker compose up -d
docker logs -f felhom-controller --since 1m
```
### 4.4: Verify pings
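Wait 5 minutes, then check `status.felhom.eu`: `heartbeat` and `system-health` should be green. The daily and weekly checks (`db-dump`, `backup`, `backup-integrity`) turn green only after their first scheduled run.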
### 4.5: Deploy hub to k3s (after Part 3 is built)
```bash
# Build and push hub image (from felhom.eu repo, hub/ subfolder)
cd hub && make docker-push
# Apply k8s manifests (from felhom.eu repo, manifests/ folder)
kubectl apply -f manifests/hub.yaml
# Configure hub.felhom.eu DNS in Cloudflare
# Update demo-felhom controller.yaml with hub config
```
---
## Implementation order
1. **Part 1** (controller-side, in `deploy-felhom-compose` repo):
- Task 1.1: Heartbeat ping (5 min)
- Task 1.2: Backup integrity check (20 min)
- Task 1.3: Update example config (5 min)
- Task 1.4: Deprecation note for bash scripts (5 min)
2. **Part 4.1–4.4** (Viktor manual: create checks, configure UUIDs, verify)
3. **Part 2** (controller-side, report push):
- Task 2.1: Report payload types (10 min)
- Task 2.2: Report builder (30 min)
- Task 2.3: Report pusher (15 min)
- Task 2.4: Hub config in controller.yaml (10 min)
- Task 2.5: Wire into main.go (5 min)
4. **Part 3** (hub in `felhom.eu` repo, k8s manifests in `felhom.eu/manifests/`):
- Task 3.1: Project scaffold in `hub/` subfolder (10 min)
- Task 3.2: API handlers (30 min)
- Task 3.3: SQLite store (20 min)
- Task 3.4: Dashboard UI — English (60 min)
- Task 3.5: K8s manifests in `felhom.eu/manifests/` (20 min)
5. **Part 4.5** (Viktor manual: deploy hub, wire everything)
---
## Files to modify (controller repo)
```
controller/cmd/controller/main.go — heartbeat job, integrity job, hub pusher
controller/internal/config/config.go — PingUUIDsConfig + HubConfig
controller/internal/backup/backup.go — RunIntegrityCheck()
controller/internal/backup/restic.go — Check() method (verify/add)
controller/internal/report/builder.go — NEW: report assembly
controller/internal/report/pusher.go — NEW: HTTP push client
controller/internal/report/types.go — NEW: Report struct definitions
controller/configs/controller.yaml.example — updated monitoring + new hub section
monitoring/DEPRECATED.md — NEW: deprecation notice for bash scripts
```
## Files to create (hub — in felhom.eu repo)
```
hub/cmd/hub/main.go
hub/internal/api/handler.go
hub/internal/store/store.go
hub/internal/web/server.go
hub/internal/web/templates/dashboard.html
hub/internal/web/templates/customer.html
hub/internal/web/templates/style.css
hub/internal/web/embed.go
hub/configs/hub.yaml.example
hub/Dockerfile
hub/Makefile
hub/go.mod
hub/README.md
```
## Files to create (k8s manifests — in felhom.eu repo)
```
manifests/hub.yaml
```
---
## Verification checklist
- [ ] Heartbeat ping arrives every 5 min at status.felhom.eu
- [ ] System health ping arrives every 5 min with diagnostic body
- [ ] DB dump ping arrives daily at ~02:30
- [ ] Backup ping arrives daily at ~03:00
- [ ] Backup integrity ping arrives weekly on Sunday ~04:00
- [ ] Stopping a protected container triggers system-health FAIL
- [ ] Controller logs show "Hub reporting enabled" when hub.enabled=true
- [ ] Hub receives JSON reports from controller
- [ ] Hub dashboard shows demo-felhom with green status
- [ ] Hub dashboard shows "last seen: X min ago" updating correctly
- [ ] Hub shows red status when controller is stopped for > 60 min
- [ ] Hub SQLite prunes old reports automatically
- [ ] All UUIDs are configurable (empty/CHANGEME = silently skipped)
---
## CONTEXT.md update (after completion)
Add to "What was just completed" section:
```
### What was just completed (session N)
- **v0.6.0 — Healthcheck Implementation + Central Push + Hub Dashboard:**
- **Healthcheck pings fully operational:** 5 check types (heartbeat, system-health, db-dump, backup, backup-integrity) configured on demo-felhom, all pinging status.felhom.eu
- **Backup integrity check:** Weekly `restic check` with Healthchecks ping
- **Central hub reporting:** Controller pushes JSON health summary every 15 min to hub.felhom.eu
- **felhom-hub service:** New Go service in felhom.eu repo (`hub/` subfolder), k8s manifests in `felhom.eu/manifests/hub.yaml`, deployed on k3s in felhom-system namespace, SQLite storage, English multi-customer dashboard
- **Deprecated:** Legacy bash monitoring scripts (backup-healthcheck.sh, monitoring-setup.sh) superseded by controller-native monitoring
```
Also update the repository distinction in CONTEXT.md:
```
## Repository & manifest layout
- **homelab-manifests** — Viktor's personal k3s apps (*.dooplex.hu): mon-system, servarr, pihole, etc.
- **felhom.eu** — Everything felhom-related:
- `website/` — felhom.eu public website HTML
- `manifests/` — k8s manifests for felhom infra in felhom-system namespace (webpage, healthchecks, contact-mailer, umami, hub, felhom.secret)
- `hub/` — felhom-hub Go service (central multi-customer dashboard)
- **deploy-felhom-compose** — Customer-side: felhom-controller code, deploy scripts, monitoring scripts
- **app-catalog-felhom.eu** — Docker Compose templates for customer apps
```