TASK: Fix startup hub report — Push() silently swallows errors (v0.15.5)

Problem

The startup hub report exists but silently fails. On the latest deployment, the controller tried to push a report 5 seconds after boot, but the hub returned HTTP 503 (it was still starting up). Push() always returns nil by design, so main.go logged [INFO] Startup hub report sent even though the push actually failed. The hub shows stale data until the first scheduled report fires (15 minutes later).

Evidence from logs:

09:46:47 [INFO] Hub reporting enabled (every 15m0s to https://hub.felhom.eu)
09:47:02 [WARN] Hub report push failed after 3 attempts: HTTP 503   ← Push() logged this internally
09:47:02 [INFO] Startup hub report sent                              ← main.go logged "sent" because Push() returned nil

The hub pod only became ready at 09:47:02 — the same second Push() gave up.

Root cause

Push() in pusher.go (lines 38-86) carries the comment "Never returns error to caller — push failures should not affect controller operation." and always returns nil. The startup code in main.go checks the err returned by Push(), but because it is always nil, the code always takes the success branch.

The scheduler (scheduler.go:223) already handles errors from JobFunc gracefully — it logs the error and continues. So returning real errors from Push() is safe for scheduled calls too.

Fix

Step 1: Make Push() return actual errors

File: controller/internal/report/pusher.go

Change Push() to return the real error instead of always nil:

Current (lines 38-86):

// Push sends a report to the hub. Retries 3 times with 5s backoff.
// Never returns error to caller — push failures should not affect controller operation.
func (p *Pusher) Push(report *Report) error {
	if !p.enabled {
		return nil
	}

	data, err := json.Marshal(report)
	if err != nil {
		p.logger.Printf("[WARN] Hub report marshal failed: %v", err)
		return nil
	}

	url := p.hubURL + "/api/v1/report"

	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if attempt > 0 {
			time.Sleep(5 * time.Second)
		}

		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data))
		if err != nil {
			lastErr = err
			continue
		}
		req.Header.Set("Content-Type", "application/json")
		if p.apiKey != "" {
			req.Header.Set("Authorization", "Bearer "+p.apiKey)
		}

		resp, err := p.httpClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()

		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			p.logger.Printf("[INFO] Hub report pushed successfully (%d bytes)", len(data))
			return nil
		}
		lastErr = fmt.Errorf("HTTP %d", resp.StatusCode)
	}

	p.logger.Printf("[WARN] Hub report push failed after 3 attempts: %v", lastErr)
	return nil
}

Replace with:

// Push sends a report to the hub. Retries 3 times with 5s backoff.
func (p *Pusher) Push(report *Report) error {
	if !p.enabled {
		return nil
	}

	data, err := json.Marshal(report)
	if err != nil {
		return fmt.Errorf("marshal report: %w", err)
	}

	url := p.hubURL + "/api/v1/report"

	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if attempt > 0 {
			time.Sleep(5 * time.Second)
		}

		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(data))
		if err != nil {
			lastErr = err
			continue
		}
		req.Header.Set("Content-Type", "application/json")
		if p.apiKey != "" {
			req.Header.Set("Authorization", "Bearer "+p.apiKey)
		}

		resp, err := p.httpClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()

		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			p.logger.Printf("[INFO] Hub report pushed successfully (%d bytes)", len(data))
			return nil
		}
		lastErr = fmt.Errorf("HTTP %d", resp.StatusCode)
	}

	return fmt.Errorf("hub push failed after 3 attempts: %w", lastErr)
}

Changes:

  • Removed "Never returns error" comment
  • Marshal error: return wrapped error instead of logging + nil
  • After retries exhausted: return error instead of logging + nil
  • Success path: unchanged (returns nil)

This is safe because:

  • The scheduler (executeJob in scheduler.go:223-235) already catches and logs errors from JobFunc
  • The startup code in main.go already checks err — it just never saw one before

Step 2: Add startup retry with longer delay

File: controller/cmd/controller/main.go

The startup goroutine (starting at ~line 270) sends the hub report once. If Push() fails because the hub is not ready, it should retry a few times with a delay between attempts. The hub typically takes 10-15 seconds to start.

Current (~lines 289-297):

		// Hub report
		if hubPusher != nil {
			if cfg.Hub.Enabled {
				r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, metricsStore, Version, sett.GetStoragePaths())
				if err := hubPusher.Push(r); err != nil {
					logger.Printf("[WARN] Startup hub report failed: %v", err)
				} else {
					logger.Println("[INFO] Startup hub report sent")
				}
			} else {

Replace the if cfg.Hub.Enabled block (keep the else disabled-notification branch unchanged):

		// Hub report
		if hubPusher != nil {
			if cfg.Hub.Enabled {
				r := report.BuildReport(cfg, stackMgr, backupMgr, cpuCollector, metricsStore, Version, sett.GetStoragePaths())
				var pushErr error
				for attempt := 1; attempt <= 3; attempt++ {
					pushErr = hubPusher.Push(r)
					if pushErr == nil {
						logger.Println("[INFO] Startup hub report sent")
						break
					}
					logger.Printf("[WARN] Startup hub report attempt %d/3 failed: %v", attempt, pushErr)
					if attempt < 3 {
						time.Sleep(15 * time.Second)
					}
				}
				if pushErr != nil {
					logger.Printf("[WARN] Startup hub report failed after 3 attempts — next scheduled push in %s", cfg.Hub.PushInterval)
				}
			} else {

This gives the hub roughly 65 seconds after boot to come up, not counting request latency or timeouts: the 5s initial delay, then three Push() calls that each spend 10s on their own retries (3 attempts, 5s apart), with a 15s wait between calls. The final HTTP attempt therefore starts at about 5 + 10 + 15 + 10 + 15 + 10 = 65s.

IMPORTANT: The else branch (disabled notification via PushOnce) stays as-is; no changes are needed there.


Summary of changes

| File | Change |
|---|---|
| controller/internal/report/pusher.go | Push() returns actual errors instead of always nil |
| controller/cmd/controller/main.go | Startup hub push retries 3 times with 15s delay between attempts |

Only 2 files changed. No new types, no new methods, no template changes.


Build & Deploy

SSH=/c/Windows/System32/OpenSSH/ssh.exe
# 1. Commit & push
cd e:/git/deploy-felhom-compose
git add -A && git commit -m "v0.15.5: Fix startup hub report — Push() returns real errors, startup retries" && git push
# 2. Build
$SSH kisfenyo@192.168.0.180 "cd ~/build/felhom-controller && git -C ~/git/deploy-felhom-compose pull && ./build.sh v0.15.5 --push"
# 3. Deploy
$SSH kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && sudo docker pull gitea.dooplex.hu/admin/felhom-controller:v0.15.5 && sudo sed -i 's|image: gitea.dooplex.hu/admin/felhom-controller:.*|image: gitea.dooplex.hu/admin/felhom-controller:v0.15.5|' docker-compose.yml && sudo docker compose up -d"
# 4. Verify — look for successful startup push
$SSH kisfenyo@192.168.0.162 "sleep 10 && docker logs felhom-controller --tail 15 2>&1 | grep -i hub"

Compile check

Always run go build ./... in controller/ before committing.

Documentation

Add a CHANGELOG.md entry. Read the first 30 lines for format, then insert a new entry:

### vX.X.X (2026-02-19 session XX)
- **v0.15.5 — Fix startup hub report silently failing:**

  `Push()` now returns actual errors instead of always nil. Previously, push failures were logged internally but the caller could never detect them, which produced a misleading "Startup hub report sent" log line even when the push failed (e.g., hub returning HTTP 503 during simultaneous deployment).

  Startup hub push now retries 3 times with 15-second delays between attempts, giving the hub time to come up when both are deployed together. Each attempt uses Push()'s own 3-retry logic internally.

  **Files modified (2):** `internal/report/pusher.go`, `cmd/controller/main.go`

Update version in C:\Users\User\.claude\projects\e--git\memory\MEMORY.md to v0.15.5.

Verification

After deploying v0.15.5:

  1. Check logs: docker logs felhom-controller 2>&1 | grep -i hub
    • Should show [INFO] Startup hub report sent (success)
    • OR [WARN] Startup hub report attempt 1/3 failed: ... followed by eventual success
  2. Check hub dashboard at hub.felhom.eu — should show fresh data with current timestamp
  3. If hub is deployed at the same time: the retries should handle the delay