diff --git a/TASK.md b/TASK.md index d376606..4b844c4 100644 --- a/TASK.md +++ b/TASK.md @@ -1,246 +1,512 @@ -# TASK: Fix startup hub report — Push() silently swallows errors (v0.15.5) +# TASK: Major rewrite of `scripts/docker-setup.sh` (v5.0) -## Problem +## Overview -The startup hub report exists but silently fails. On the latest deployment, the controller tried to push a report 5 seconds after boot, but the hub returned HTTP 503 (it was still starting up). `Push()` always returns `nil` by design, so `main.go` logged `[INFO] Startup hub report sent` even though the push actually failed. The hub shows stale data until the first scheduled report fires (15 minutes later). +Rewrite `docker-setup.sh` to bring it up to date with the current Felhom architecture. +The script should now be a complete end-to-end provisioning tool: install infrastructure, +run an interactive configuration wizard, generate `controller.yaml`, deploy FileBrowser +as a protected stack, and deploy felhom-controller — all in one run. -Evidence from logs: -``` -09:46:47 [INFO] Hub reporting enabled (every 15m0s to https://hub.felhom.eu) -09:47:02 [WARN] Hub report push failed after 3 attempts: HTTP 503 ← Push() logged this internally -09:47:02 [INFO] Startup hub report sent ← main.go logged "sent" because Push() returned nil -``` - -The hub pod only became ready at 09:47:02 — the same second Push() gave up. - -## Root cause - -`Push()` in `pusher.go` (line 39-86) has comment: "Never returns error to caller — push failures should not affect controller operation." It always returns `nil`. The startup code in `main.go` checks `err` from `Push()` but it's always nil, so it always takes the success branch. - -The scheduler (`scheduler.go:223`) already handles errors from `JobFunc` gracefully — it logs the error and continues. So returning real errors from `Push()` is safe for scheduled calls too. 
- -## Fix - -### Step 1: Make `Push()` return actual errors - -**File:** `controller/internal/report/pusher.go` - -Change `Push()` to return the real error instead of always `nil`: - -**Current** (line 38-86): -```go -// Push sends a report to the hub. Retries 3 times with 5s backoff. -// Never returns error to caller — push failures should not affect controller operation. -func (p *Pusher) Push(report *Report) error { - if !p.enabled { - return nil - } - - data, err := json.Marshal(report) - if err != nil { - p.logger.Printf("[WARN] Hub report marshal failed: %v", err) - return nil - } - - url := p.hubURL + "/api/v1/report" - - var lastErr error - for attempt := 0; attempt < 3; attempt++ { - if attempt > 0 { - time.Sleep(5 * time.Second) - } - - req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data)) - if err != nil { - lastErr = err - continue - } - req.Header.Set("Content-Type", "application/json") - if p.apiKey != "" { - req.Header.Set("Authorization", "Bearer "+p.apiKey) - } - - resp, err := p.httpClient.Do(req) - if err != nil { - lastErr = err - continue - } - io.Copy(io.Discard, resp.Body) - resp.Body.Close() - - if resp.StatusCode >= 200 && resp.StatusCode < 300 { - p.logger.Printf("[INFO] Hub report pushed successfully (%d bytes)", len(data)) - return nil - } - lastErr = fmt.Errorf("HTTP %d", resp.StatusCode) - } - - p.logger.Printf("[WARN] Hub report push failed after 3 attempts: %v", lastErr) - return nil -} -``` - -**Replace with:** -```go -// Push sends a report to the hub. Retries 3 times with 5s backoff. 
-func (p *Pusher) Push(report *Report) error { - if !p.enabled { - return nil - } - - data, err := json.Marshal(report) - if err != nil { - return fmt.Errorf("marshal report: %w", err) - } - - url := p.hubURL + "/api/v1/report" - - var lastErr error - for attempt := 0; attempt < 3; attempt++ { - if attempt > 0 { - time.Sleep(5 * time.Second) - } - - req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data)) - if err != nil { - lastErr = err - continue - } - req.Header.Set("Content-Type", "application/json") - if p.apiKey != "" { - req.Header.Set("Authorization", "Bearer "+p.apiKey) - } - - resp, err := p.httpClient.Do(req) - if err != nil { - lastErr = err - continue - } - io.Copy(io.Discard, resp.Body) - resp.Body.Close() - - if resp.StatusCode >= 200 && resp.StatusCode < 300 { - p.logger.Printf("[INFO] Hub report pushed successfully (%d bytes)", len(data)) - return nil - } - lastErr = fmt.Errorf("HTTP %d", resp.StatusCode) - } - - return fmt.Errorf("hub push failed after 3 attempts: %w", lastErr) -} -``` - -Changes: -- Removed "Never returns error" comment -- Marshal error: return wrapped error instead of logging + nil -- After retries exhausted: return error instead of logging + nil -- Success path: unchanged (returns nil) - -This is safe because: -- The scheduler (`executeJob` in `scheduler.go:223-235`) already catches and logs errors from `JobFunc` -- The startup code in `main.go` already checks `err` — it just never saw one before - -### Step 2: Add startup retry with longer delay - -**File:** `controller/cmd/controller/main.go` - -The startup goroutine (starting at ~line 270) sends the hub report once. If Push() fails (hub not ready), it should retry a few times with delay. The hub typically takes 10-15 seconds to start. 
- -**Current** (~line 289-297): -```go - // Hub report - if hubPusher != nil { - if cfg.Hub.Enabled { - r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, metricsStore, Version, sett.GetStoragePaths()) - if err := hubPusher.Push(r); err != nil { - logger.Printf("[WARN] Startup hub report failed: %v", err) - } else { - logger.Println("[INFO] Startup hub report sent") - } - } else { -``` - -**Replace the `if cfg.Hub.Enabled` block** (keep the `else` disabled-notification branch unchanged): -```go - // Hub report - if hubPusher != nil { - if cfg.Hub.Enabled { - r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, metricsStore, Version, sett.GetStoragePaths()) - var pushErr error - for attempt := 1; attempt <= 3; attempt++ { - pushErr = hubPusher.Push(r) - if pushErr == nil { - logger.Println("[INFO] Startup hub report sent") - break - } - logger.Printf("[WARN] Startup hub report attempt %d/3 failed: %v", attempt, pushErr) - if attempt < 3 { - time.Sleep(15 * time.Second) - } - } - if pushErr != nil { - logger.Printf("[WARN] Startup hub report failed after 3 attempts — next scheduled push in %s", cfg.Hub.PushInterval) - } - } else { -``` - -This gives the hub up to ~40 seconds to come up (5s initial + Push's own 3x5s retries on first attempt, then 15s wait, then another Push attempt, etc.). The `else` branch for disabled notifications stays unchanged. - -**IMPORTANT:** The `else` branch (disabled notification via `PushOnce`) stays as-is — no changes needed there. +**Read the entire current `scripts/docker-setup.sh` before starting. This is a rewrite +of an existing ~1600-line script, not a new file.** --- -## Summary of changes +## Changes Required -| File | Change | -|------|--------| -| `controller/internal/report/pusher.go` | `Push()` returns actual errors instead of always nil | -| `controller/cmd/controller/main.go` | Startup hub push retries 3 times with 15s delay between attempts | +### 1. 
Update banner and version -Only **2 files** changed. No new types, no new methods, no template changes. +- Set `SCRIPT_VERSION="5.0.0"` +- Update `print_banner()` — no Portainer, the title should be: + ``` + Felhom Infrastructure Setup v5.0.0 + ``` +- Update the comment header block at the top of the file to match the new scope + (Docker + Traefik + FileBrowser + Controller + configuration wizard). +- Update `print_help()` to reflect all removed/changed options. + +### 2. Remove Portainer (confirm clean) + +The current script has no Portainer code (already removed in a prior version). +Just make sure there are zero references to "portainer" or "Portainer" anywhere — +banner, comments, help text, variables. Search and confirm. + +### 3. Remove `--cf-tunnel-token` CLI option + +**Remove** the `--cf-tunnel-token` CLI flag and the `CF_TUNNEL_TOKEN` variable from +`parse_args()`. The Cloudflare tunnel token is now collected by the configuration wizard +and written into `controller.yaml` (see §7 below). The `install_cloudflare_tunnel()` +function stays but reads the token from the wizard variable instead of a CLI flag. + +Also remove `--hdd-path` CLI option and `HDD_PATH` variable — deprecated. + +Keep these CLI options (still useful for non-interactive/scripted runs): +- `--ip`, `--gateway`, `--dns`, `--interface` (network config) +- `--domain`, `--email`, `--cf-token` (TLS/domain — can pre-seed wizard) +- `--customer` (customer ID — can pre-seed wizard) +- `--traefik-password`, `--self-signed-cert` +- `--skip-filebrowser` +- `--dry-run`, `--debug`, `--help`, `--bootstrap` + +### 4. Remove `--hdd-path` references + +Remove `HDD_PATH` variable, `--hdd-path` argument parsing, and all references. +FileBrowser mounts are determined by the wizard (system_data_path and any existing +`/mnt/*` mounts). + +### 5. 
FileBrowser deployment as protected stack + +The current `install_filebrowser()` function needs to be rewritten: + +**Location:** Deploy to `/opt/docker/stacks/filebrowser/` (already the current +`FILEBROWSER_DIR` — keep this). + +**Compose file:** Generate a compose file matching the current production layout +on the demo node. Key differences from current script template: + +```yaml +services: + filebrowser: + image: gtstef/filebrowser:latest + container_name: filebrowser + restart: unless-stopped + environment: + - TZ=Europe/Budapest + volumes: + - filebrowser_data:/home/filebrowser/data + # Mount discovered drives — populated by wizard + # e.g. /mnt/hdd_1:/srv/hdd_1, /mnt/sys_drive:/srv/sys_drive + networks: + - traefik-public + deploy: + resources: + limits: + memory: 256M + healthcheck: + test: ["CMD", "wget", "--spider", "-q", "http://localhost:80/"] + interval: 30s + timeout: 5s + retries: 3 + start_period: 15s + labels: + - "traefik.enable=true" + - "traefik.http.routers.filebrowser.rule=Host(`files.`)" + - "traefik.http.routers.filebrowser.entrypoints=websecure" + - "traefik.http.routers.filebrowser.tls=true" + - "traefik.http.services.filebrowser.loadbalancer.server.port=80" + - "traefik.docker.network=traefik-public" +``` + +**Drive discovery for volumes:** The wizard (§7) collects `system_data_path`. +Additionally, scan `/mnt/` for existing mount points at install time. For each +discovered mount (e.g., `/mnt/hdd_1`, `/mnt/sys_drive`), add a volume mapping: +`/mnt/:/srv/`. If no mounts found, only mount the `system_data_path`. + +**Hardcode domain** in the Traefik host rule (no `${DOMAIN}` env var needed). +Use the wizard's domain value directly: `Host(\`files.ACTUAL-DOMAIN\`)`. + +**Also generate `.felhom.yml`** metadata file — keep the existing one from the +current script (Hungarian text, category: storage, etc.). + +**No `.env` file needed** for filebrowser (domain is hardcoded in compose labels). + +### 6. 
Controller deployment (NEW step) + +Add a new step to deploy felhom-controller. This is currently missing from the +script — the user had to deploy it manually. + +**Location:** `/opt/docker/felhom-controller/` + +**docker-compose.yml** — generate matching the current production layout: + +```yaml +services: + felhom-controller: + image: gitea.dooplex.hu/admin/felhom-controller:latest + container_name: felhom-controller + restart: unless-stopped + privileged: true + ports: + - "8080:8080" + volumes: + - /var/run/docker.sock:/var/run/docker.sock + - /opt/docker/felhom-controller/controller.yaml:/opt/docker/felhom-controller/controller.yaml:ro + - controller-data:/opt/docker/felhom-controller/data + - /opt/docker/stacks:/opt/docker/stacks + - /srv/backups:/srv/backups + - type: bind + source: /mnt + target: /mnt + bind: + propagation: rshared + - /sys:/host/sys:ro + - /etc/os-release:/host/etc/os-release:ro + - /etc/hostname:/host/etc/hostname:ro + - /dev:/host-dev:rw + - /etc/fstab:/host-fstab + - /run/udev:/run/udev:ro + environment: + - TZ=Europe/Budapest + labels: + - "traefik.enable=true" + - "traefik.http.routers.controller.rule=Host(`felhom.`)" + - "traefik.http.routers.controller.entrypoints=websecure" + - "traefik.http.routers.controller.tls=true" + - "traefik.http.services.controller.loadbalancer.server.port=8080" + - "traefik.docker.network=traefik-public" + - "felhom.managed=true" + - "felhom.component=controller" + networks: + - traefik-public + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"] + interval: 30s + timeout: 5s + start_period: 10s + retries: 3 + +volumes: + controller-data: + +networks: + traefik-public: + external: true +``` + +**Hardcode domain** in Traefik labels (like filebrowser). + +**No `.env` file** is generated for the controller. The domain is hardcoded in +the compose labels (as with FileBrowser), so compose has nothing to read from +it. 
+ +**Use `latest` tag** for the image. The controller has self-update capability +so it will manage its own version after initial deployment. + +**Pull and start** the controller, then verify health via the healthcheck endpoint. + +### 7. Configuration wizard for `controller.yaml` + +Add an interactive wizard function `run_config_wizard()` that runs AFTER +infrastructure setup but BEFORE deploying the controller. It generates +`/opt/docker/felhom-controller/controller.yaml`. + +**CLI pre-seeding:** If `--domain`, `--customer`, `--email`, `--cf-token` are +provided via CLI, use them as defaults in the wizard (user can still change). + +**Wizard flow** (each question is a `read -p` prompt with a default shown in brackets): + +``` +=========================================================== + Felhom Controller Configuration Wizard +=========================================================== + +--- Customer identity --- +Customer ID [demo-felhom]: _ +Customer display name [Demo Ügyfél]: _ +Domain [homeserver.local]: _ +Customer email (optional) []: _ + +--- Infrastructure secrets --- +Cloudflare Tunnel token (optional, leave empty to skip) []: _ +Cloudflare API token (for DNS-01 certs, optional) []: _ + +--- Paths --- +System data partition mount point + (if the system drive was partitioned for user data, + provide the mount point, e.g., /mnt/sys_drive) +System data path [/mnt/sys_drive]: _ + +--- Dashboard password --- +Set a password for the controller dashboard? 
+ (leave empty for first-visit setup prompt) +Dashboard password []: _ + +--- Git sync --- +App catalog repository URL [https://gitea.dooplex.hu/admin/app-catalog-felhom.eu.git]: _ +Git username []: _ +Git token []: _ + +--- Healthcheck monitoring --- +Healthchecks.io ping UUIDs (leave empty to skip): + Heartbeat UUID []: _ + System health UUID []: _ + DB dump UUID []: _ + Backup UUID []: _ + Backup integrity UUID []: _ + +--- Ready --- +``` + +**Password hashing:** If the user provides a dashboard password, hash it with bcrypt. +Use `htpasswd -bnBC 10 "" "PASSWORD" | tr -d ':\n'` or the `python3 -c` fallback. +Store the hash in `web.password_hash`. + +**Session secret:** Auto-generate with `openssl rand -hex 32`. + +**Hub config:** Always enabled, with the hardcoded API key: +```yaml +hub: + enabled: true + url: "https://hub.felhom.eu" + api_key: "094091de545ce28795c47ac2158fc30750db5c24a621c49329b001ee8db57fb8" + push_interval: "15m" +``` + +**Backup:** Keep `enabled: true` — the user confirmed it should stay for +troubleshooting purposes. + +**hdd_path:** Do NOT include in generated config. It's deprecated. Remove it +from the template entirely. 
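The wizard prompts and the password handling described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `prompt_default`, `hash_password`, and `generate_session_secret` are hypothetical and do not come from the existing script, and `htpasswd` (apache2-utils) and `openssl` are assumed to be installed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the configuration wizard; the function names are
# illustrative and are not taken from docker-setup.sh.

# Ask one question, showing the default in brackets.
# Pressing Enter (empty input) accepts the default.
prompt_default() {
    local question="$1" default="$2" input=""
    # bash shows the prompt only when stdin is a terminal, so captured
    # output contains just the chosen value.
    read -r -p "$question [$default]: " input || true
    printf '%s\n' "${input:-$default}"
}

# Hash a password with bcrypt (cost 10) using htpasswd from apache2-utils.
# htpasswd -n prints ":<hash>" plus a blank line; tr strips the colon and newlines.
hash_password() {
    local password="$1"
    htpasswd -bnBC 10 "" "$password" | tr -d ':\n'
}

# Generate a 64-character hex string for the session secret.
generate_session_secret() {
    openssl rand -hex 32
}
```

A call such as `DOMAIN="$(prompt_default "Domain" "${DOMAIN:-homeserver.local}")"` would let a CLI-provided value act as the default while still allowing the user to override it.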
+ +**Full template** — write this to `/opt/docker/felhom-controller/controller.yaml`: + +```yaml +# Felhom Controller Configuration +# Generated by docker-setup.sh v5.0.0 on + +customer: + id: "" + name: "" + domain: "" + email: "" + telegram_chat_id: "" + +infrastructure: + cf_tunnel_token: "" + cf_api_token: "" + +paths: + stacks_dir: "/opt/docker/stacks" + data_dir: "/opt/docker/felhom-controller/data" + system_data_path: "" + +system: + reserved_memory_mb: 384 + +web: + listen: ":8080" + password_hash: "" + session_secret: "" + +git: + repo_url: "" + branch: "main" + sync_interval: "15m" + username: "" + token: "" + +stacks: + protected: + - "traefik" + - "cloudflared" + - "felhom-controller" + - "filebrowser" + update_window: "03:00-05:00" + compose_command: "" + +backup: + enabled: true + restic_password_file: "/opt/docker/felhom-controller/data/restic-password" + db_dump_schedule: "02:30" + restic_schedule: "03:00" + retention: + keep_daily: 7 + keep_weekly: 4 + keep_monthly: 6 + prune_schedule: "weekly" + +monitoring: + enabled: true + healthchecks_base: "https://status.felhom.eu" + ping_uuids: + heartbeat: "" + system_health: "" + db_dump: "" + backup: "" + backup_integrity: "" + system_health_interval: "5m" + health_check_schedule: "06:00" + thresholds: + disk_warn_percent: 80 + disk_crit_percent: 90 + backup_max_age_hours: 36 + cpu_warn_percent: 90 + memory_warn_percent: 85 + temperature_warn_celsius: 75 + +hub: + enabled: true + url: "https://hub.felhom.eu" + api_key: "094091de545ce28795c47ac2158fc30750db5c24a621c49329b001ee8db57fb8" + push_interval: "15m" + +self_update: + enabled: true + check_interval: "6h" + image: "gitea.dooplex.hu/admin/felhom-controller" + auto_update: false + health_timeout_seconds: 60 + +notifications: + customer_events: + - "disk_warning" + - "backup_failed" + - "update_available" + - "security_update" + operator_events: + - "disk_critical" + - "backup_failed" + - "self_update_failed" + - "container_unhealthy" + +logging: + 
level: "info" + file: "" + max_size_mb: 10 + max_files: 3 + +assets: + source_url: "https://felhom.eu" +``` + +### 8. Update `controller.yaml.example` + +Update `controller/configs/controller.yaml.example` to match the wizard template: +- **Remove** `hdd_path` line entirely +- **Set** `hub.enabled: true` (was `false`) +- **Set** `hub.api_key` to the real key: `094091de545ce28795c47ac2158fc30750db5c24a621c49329b001ee8db57fb8` +- **Improve** `system_data_path` comment to be clearer: + ```yaml + system_data_path: "/mnt/sys_drive" # Mount point of user-data partition on system drive (e.g., /mnt/sys_drive) + ``` + +### 9. Update `install_cloudflare_tunnel()` + +The function currently reads from `CF_TUNNEL_TOKEN` (CLI arg). Change it to +read from the wizard variable (same variable name is fine, just populated by the +wizard instead of CLI). The function body stays the same — it creates the +docker-compose at `/opt/docker/cloudflared/` and starts it. + +**Guard:** If wizard left the CF tunnel token empty, skip this step (already +handled by the existing `if [[ -z "$CF_TUNNEL_TOKEN" ]]` check). + +### 10. Update execution order in `main()` + +New execution order: + +``` +1. Install base packages +2. Configure network (static IP, if requested) +3. Install Docker Engine + Compose +4. Install Traefik reverse proxy +5. Generate self-signed certificate (if requested) +6. Run configuration wizard → generates controller.yaml +7. Install Cloudflare Tunnel (if token provided in wizard) +8. Install FileBrowser (protected stack) +9. Deploy felhom-controller +10. Install helper tools +11. Print summary +``` + +Update step numbering and `get_total_steps()` accordingly. + +### 11. 
Update `print_summary()` + +Update the summary to reflect: +- Controller is deployed and accessible at `https://felhom.` +- FileBrowser at `https://files.` +- Remove manual "deploy felhom-controller" instructions (it's automated now) +- Show healthcheck UUID status (configured / not configured) +- Show hub status (enabled) +- Remove the `CUSTOMER_ID` display bug (the "Note: No --customer specified" + message is inside the `if [[ -n "$CUSTOMER_ID" ]]` block — wrong logic) + +### 12. Update `print_help()` + +Update help text to reflect: +- Removed `--cf-tunnel-token` (now in wizard) +- Removed `--hdd-path` (deprecated) +- Mention the interactive wizard +- Updated "WHAT THIS SCRIPT INSTALLS" list: + 1. Base packages + 2. Docker Engine + Compose + 3. Traefik reverse proxy + 4. TLS certificates + 5. Felhom Controller (with interactive configuration) + 6. FileBrowser Quantum (web file manager) + 7. Cloudflare Tunnel (if configured) + 8. Helper tools --- -## Build & Deploy +## Additional observations +### Bugs in current script + +1. **`print_summary()` CUSTOMER_ID logic is inverted** (line ~1507): + The "Note: No --customer specified" message is inside `if [[ -n "$CUSTOMER_ID" ]]` + which only triggers when a customer IS specified. Should be in an else branch + or removed. + +2. **Step numbering is fragile**: The `get_total_steps()` and hardcoded step + numbers (e.g., `log_step "3/$(get_total_steps)"`) will desync if steps are + added/removed. Consider using a counter variable incremented at each step. 
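One way the counter suggestion above might look, as a minimal sketch. The names `STEP_TOTAL`, `STEP_CURRENT`, `STEP_LABEL`, and `next_step` are hypothetical, and the fixed total of 11 simply mirrors the execution order listed in §10; the real script may still need to compute the total when optional steps are skipped.

```bash
#!/usr/bin/env bash
# Hypothetical step counter; variable and function names are illustrative.

STEP_TOTAL=11     # assumption: all 11 steps from the new execution order run
STEP_CURRENT=0
STEP_LABEL=""

# Advance the counter and print a "[current/total] title" line.
# The label is also kept in STEP_LABEL so other logging code can reuse it.
next_step() {
    STEP_CURRENT=$((STEP_CURRENT + 1))
    STEP_LABEL="${STEP_CURRENT}/${STEP_TOTAL}"
    echo "[${STEP_LABEL}] $1"
}
```

Each step would then call `next_step "Install Traefik reverse proxy"` instead of hardcoding its number. The function should be called directly rather than inside `$(...)`, because a subshell would discard the increment.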
+ +### Things NOT to change + +- `bootstrap_sudo()` — works fine, keep as-is +- Network configuration (step 2) — keep all network manager detection logic +- Docker installation (step 3) — keep as-is +- Traefik installation (step 4) — keep as-is +- Self-signed cert generation — keep as-is +- Helper tools installation — keep as-is +- Error trap and diagnostics — keep as-is +- Color/logging functions — keep as-is + +### Template completeness check + +The controller.yaml template covers all sections from the current example. +Sections that use sensible defaults and don't need wizard prompts: +- `system.reserved_memory_mb` (384) +- `backup.*` (all defaults are fine) +- `stacks.protected` (hardcoded list) +- `stacks.update_window` ("03:00-05:00") +- `monitoring.thresholds.*` (all defaults) +- `self_update.*` (all defaults) +- `notifications.*` (all defaults) +- `logging.*` (all defaults) +- `assets.*` (hardcoded) + +--- + +## Implementation notes + +- The script is bash — no external YAML parser needed. Use `cat > file << EOF` + with variable substitution for generating YAML. +- For bcrypt hashing, prefer `htpasswd -bnBC 10 "" "$password" | tr -d ':\n'` + (apache2-utils is installed in step 1). Fallback: `python3 -c "import bcrypt; ..."` +- The wizard should show current/default values in brackets and accept Enter + for defaults: `read -p "Domain [$default]: " input; value="${input:-$default}"` +- Dry-run mode should show what the wizard WOULD generate without writing files. +- All generated files should have appropriate permissions: + - `controller.yaml`: `chmod 600` (contains secrets) + - `docker-compose.yml` files: `chmod 644` + +--- + +## Build & test + +After implementing, test the script with `--dry-run` to verify: ```bash +sudo ./docker-setup.sh --domain test.local --customer test --dry-run +``` + +For a real deployment test on the demo node: +```bash +# Copy script to demo node SSH=/c/Windows/System32/OpenSSH/ssh.exe -# 1. 
Commit & push -cd e:/git/deploy-felhom-compose -git add -A && git commit -m "v0.15.5: Fix startup hub report — Push() returns real errors, startup retries" && git push -# 2. Build -$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller && git -C ~/git/deploy-felhom-compose pull && ./build.sh v0.15.5 --push" -# 3. Deploy -$SSH kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:v0.15.5 && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:v0.15.5|' docker-compose.yml && sudo docker compose up -d" -# 4. Verify — look for successful startup push -$SSH kisfenyo@192.168.0.162 "sleep 10 && docker logs felhom-controller --tail 15 2>&1 | grep -i hub" +scp scripts/docker-setup.sh kisfenyo@192.168.0.162:/tmp/ + +# Run on demo node (it already has infrastructure, so most steps will skip) +$SSH kisfenyo@192.168.0.162 "sudo bash /tmp/docker-setup.sh --domain demo-felhom.eu --customer demo-felhom --email certs@felhom.eu --cf-token " ``` - -### Compile check -Always run `go build ./...` in `controller/` before committing. - -## Documentation - -Add a CHANGELOG.md entry. Read the first 30 lines for format, then insert a new entry: - -```markdown -### vX.X.X (2026-02-19 session XX) -- **v0.15.5 — Fix startup hub report silently failing:** - - `Push()` now returns actual errors instead of always nil. Previously, push failures were logged internally but the caller could never detect them, leading to misleading "Startup hub report sent" log even when the push failed (e.g., hub returning HTTP 503 during simultaneous deployment). - - Startup hub push now retries 3 times with 15-second delays between attempts, giving the hub time to come up when both are deployed together. Each attempt uses Push()'s own 3-retry logic internally. 
- - **Files modified (2):** `internal/report/pusher.go`, `cmd/controller/main.go` -``` - -Update version in `C:\Users\User\.claude\projects\e--git\memory\MEMORY.md` to `v0.15.5`. - -## Verification - -After deploying v0.15.5: -1. Check logs: `docker logs felhom-controller 2>&1 | grep -i hub` - - Should show `[INFO] Startup hub report sent` (success) - - OR `[WARN] Startup hub report attempt 1/3 failed: ...` followed by eventual success -2. Check hub dashboard at `hub.felhom.eu` — should show fresh data with current timestamp -3. If hub is deployed at the same time: the retries should handle the delay diff --git a/controller/configs/controller.yaml.example b/controller/configs/controller.yaml.example index 78c1d8a..1645910 100644 --- a/controller/configs/controller.yaml.example +++ b/controller/configs/controller.yaml.example @@ -29,8 +29,7 @@ infrastructure: paths: stacks_dir: "/opt/docker/stacks" # Where compose files live data_dir: "/opt/docker/felhom-controller/data" - system_data_path: "/mnt/sys_drive" # NVMe/system drive mount — fallback for apps without HDD - hdd_path: "" # DEPRECATED: use Settings > Adattárolók instead. Fallback only for auto-discovery. + system_data_path: "/mnt/sys_drive" # Mount point of user-data partition on system drive (e.g., /mnt/sys_drive) # --- System --- system: @@ -99,9 +98,9 @@ monitoring: # --- Central hub (operator dashboard) --- hub: - enabled: false # Enable central reporting + enabled: true # Enable central reporting url: "https://hub.felhom.eu" # Hub API endpoint - api_key: "" # Shared secret for authentication + api_key: "094091de545ce28795c47ac2158fc30750db5c24a621c49329b001ee8db57fb8" # Shared secret for authentication push_interval: "15m" # How often to push reports # --- Self-update ---