feat: controller-side HTTP/TCP health probes

Add network-level health probing from the controller to deployed apps.
The controller probes containers over the shared Docker network and
overrides stack state to "unhealthy" if the service isn't responding.

Three probe types: http (any response = alive), api (validates status
code and body content), tcp (port reachability). Probes are configured
per app via the healthcheck: section in .felhom.yml. The probe scheduler
runs every minute; the per-app interval defaults to 5 minutes.

This replaces Docker-level healthchecks for distroless images (e.g.
Vikunja) that lack shell utilities, and complements existing Docker
healthchecks for other apps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 11:11:21 +01:00
parent 077640d9bb
commit 4c5d430b1a
6 changed files with 425 additions and 13 deletions
+12 -1
@@ -212,11 +212,22 @@ When app templates are updated (e.g., a new `APP_KEY` secret is added to `.felho
| Running + healthy | Green | "Fut" | All containers running and healthy |
| Running + starting | Orange | "Indulas..." | Healthcheck not yet passed |
| Deploying | Orange | "Telepítés..." | Compose up in progress (image pull, container creation) |
| Running + unhealthy | Yellow | "Nem egeszseges" | Healthcheck failing |
| Running + unhealthy | Yellow | "Nem egeszseges" | Docker or controller-side healthcheck failing |
| Stopped/exited | Red | "Leallitva" | All containers stopped |
| Restarting | Yellow | "Ujrainditas..." | Restart loop |
| Not deployed | Gray | "Nincs telepitve" | Compose file exists, not deployed |
#### Controller-side Health Probes (`internal/stacks/healthprobe.go`)
For apps that declare a `healthcheck:` section in `.felhom.yml`, the controller probes the container directly over the Docker network (both are on `traefik-public`). This complements Docker-level healthchecks and is the **only** health mechanism for distroless/scratch images that lack shell utilities.
Three probe types are supported:
- **`http`** — Any HTTP response (even 4xx/5xx) = service is alive. Only connection refused/timeout = unhealthy.
- **`api`** — HTTP request with response validation (expected status code, body content). Fails if expectations aren't met.
- **`tcp`** — Simple port reachability check via `net.Dial`.
Multiple checks per app are supported (all must pass). The probe scheduler runs every minute; per-app intervals default to 5 minutes and are configurable via `healthcheck.interval` in `.felhom.yml`. Probe results are stored in `Stack.HealthProbe` and exposed via the API. Failed probes override the stack state to `StateUnhealthy`; the override clears automatically when the next probe passes.
---
### 2. Backup System