diff --git a/CLAUDE.md b/CLAUDE.md index 24785a8..680b14a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -113,10 +113,26 @@ ssh kisfenyo@192.168.0.162 "cd /opt/docker/felhom-controller && docker pull gite - "Sablonok frissítése" button on Alkalmazások page - Sync status exposed in `/api/system/info` response +## Debug logging + +The controller has two-tier logging controlled by `logging.level` in `controller.yaml` (or `FELHOM_LOGGING_LEVEL` env var): + +- **`info`** (default): Operation success/failure with elapsed time, post-start container states, scan counts +- **`debug`**: All of above plus env var keys per compose command, local image availability checks, compose command completion times, log fetch byte counts + +Key patterns used in `internal/stacks/`: +- `time.Since(start)` for operation timing — always logged at INFO level +- `m.isDebug()` gates verbose output (env var keys, image checks) +- `truncateStr(s, 500)` caps stdout/stderr in error logs +- `logPostStartStatus()` runs async (goroutine + 3s sleep) after start/restart/update/deploy — never blocks or fails the operation +- `checkLocalImages()` parses compose YAML for `image:` lines, runs `docker image inspect` per image +- Env var **keys** are logged, never values (secrets safety) + ## Important lessons learned 1. `PAPERLESS_OCR_LANGUAGES` (plural, with S) **installs** tesseract packs; `PAPERLESS_OCR_LANGUAGE` (singular) **selects** which to use 2. `docker compose restart` does NOT pick up new images — always use `docker compose up -d` 3. Go map iteration order is random — always sort before displaying in UI 4. Docker's `.State` field says "running" even for unhealthy containers — must parse `.Status` for health info -5. After `DeployStack()` succeeds, update in-memory `Deployed` flag immediately — `RefreshStatus()` only reads docker ps, not app.yaml \ No newline at end of file +5. After `DeployStack()` succeeds, update in-memory `Deployed` flag immediately — `RefreshStatus()` only reads docker ps, not app.yaml +6. `docker compose up -d` returns exit 0 even when containers crash-loop — post-start status check is essential for detecting failures \ No newline at end of file diff --git a/CONTEXT.md b/CONTEXT.md index fb209fa..804b28b 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -7,7 +7,7 @@ > > Ask Claude Code: "Please update CONTEXT.md with what we did today" -Last updated: 2026-02-15 (session 2) +Last updated: 2026-02-14 (session 3) --- @@ -28,7 +28,29 @@ Last updated: 2026-02-15 (session 2) - **Running on:** demo-felhom (N100 mini PC) at 192.168.0.162:8080 - **All Phase 1 features working:** deploy, start/stop/restart/update, logs, health-aware states, auth -### What was just completed (2026-02-15 session 2) +### What was just completed (2026-02-14 session 3) +- **Enhanced debug logging** across all stack operations in `internal/stacks/`: + - **Operation timing**: All stack ops (start, stop, restart, update, deploy) now log elapsed time + - Success: `[INFO] Stack immich started successfully (took 45.2s)` + - Failure: `[ERROR] Stack immich start failed after 3.1s: exit code 1` + - **`composeExecCustomEnv` improvements**: + - Logs env var **keys** at debug level (never values — secrets stay safe) + - Logs exit code, truncated stdout/stderr (max 500 chars) on failure + - Logs command completion time on success + - **Post-start container state check** (StartStack, RestartStack, UpdateStack, DeployStack): + - Async goroutine: sleeps 3s, runs `docker compose ps -a`, logs each container's state + - Critical for detecting crash-loops that `docker compose up -d` wouldn't surface + - Non-blocking — never fails the operation, just logs a warning if check fails + - **Image pull detection** (DeployStack, UpdateStack at debug level): + - Parses `docker-compose.yml` for `image:` lines + - Runs `docker image inspect` per image to check local availability + - Skips images with `${VAR}` interpolation (can't resolve at check time) + - **GetLogs improvement**: Logs byte count of returned logs (distinguishes empty vs failure) + - **ScanStacks improvement**: `[INFO] Scanned stacks: 10 found (3 deployed, 7 available)` + - **New helpers added to manager.go**: `isDebug()`, `truncateStr()`, `logPostStartStatus()`, `checkLocalImages()` + - All verbose checks gated on `cfg.Logging.Level == "debug"`; timing and container states always logged at INFO + +### Previously completed (2026-02-15 session 2) - **Phase 4: Git Sync + App Catalog Audit** — major milestone - **Git sync module** (`internal/sync/sync.go`): - Clones/pulls app-catalog-felhom.eu repo to local cache on startup @@ -164,4 +186,6 @@ Last updated: 2026-02-15 (session 2) - Paperless-ngx needs `PAPERLESS_OCR_LANGUAGES` (plural) to install language packs, `PAPERLESS_OCR_LANGUAGE` (singular) to select - After deploying a stack, update the in-memory Deployed flag immediately — RefreshStatus() only reads docker ps - Cloudflare Tunnel handles *.demo-felhom.eu → Traefik handles Host()-based routing to containers -- BIOS "AC Power Recovery" must be enabled on N100 for auto-restart after power outage \ No newline at end of file +- BIOS "AC Power Recovery" must be enabled on N100 for auto-restart after power outage +- `docker compose up -d` returns exit 0 even when containers immediately crash-loop — need post-start status check to detect this +- When logging env vars for debugging, only log keys (not values) to avoid leaking secrets in log files \ No newline at end of file diff --git a/controller/README.md b/controller/README.md index 0cce841..f020ac9 100644 --- a/controller/README.md +++ b/controller/README.md @@ -41,6 +41,7 @@ Current version: **v0.2.1** - Pre-deploy memory validation (hard block on `mem_request` overcommit, soft warning on `mem_limit` overcommit) - Memory summary bar shown on deploy page before deployment - Felhom.eu logo SVG in sidebar and login page +- Verbose debug logging with operation timing, post-start container state checks, and image pull detection ### Known issues / next priorities - Cloudflare Tunnel + Traefik TLS: paperless.demo-felhom.eu works locally but shows "Not secure" (certificate chain not fully validated through tunnel) @@ -89,9 +90,11 @@ controller/ ├── internal/ │ ├── config/config.go # YAML loader, validation, env overrides │ ├── stacks/ -│ │ ├── manager.go # Stack scanning, compose ops, container status +│ │ ├── manager.go # Stack scanning, compose ops, container status, debug logging │ │ ├── metadata.go # Parse .felhom.yml app metadata │ │ └── deploy.go # First-deploy flow: secret gen, app.yaml, compose up +│ ├── sync/ +│ │ └── sync.go # Git sync: clone/pull app catalog, content-hash copy │ ├── api/router.go # REST API endpoints │ ├── system/ │ │ ├── info.go # SystemInfo struct @@ -121,6 +124,7 @@ controller/ | **API** | `internal/api/` | ✅ Done | REST endpoints (stacks, deploy, rescan, system info, health) | | **System** | `internal/system/` | ✅ Done | System resource info (RAM, disk usage) for dashboard & API | | **Web** | `internal/web/` | ✅ Done | Hungarian dashboard, auth, deploy pages, asset serving | +| **Sync** | `internal/sync/` | ✅ Done | Git-based app catalog sync (clone/pull, content-hash copy) | | **Backup** | `internal/backup/` | 📲 Phase 3 | DB dumps, restic snapshots, restore | | **Monitor** | `internal/monitor/` | 📲 Phase 2 | Health checks, Healthchecks pings, system metrics | | **Scheduler** | `internal/scheduler/` | 📲 Phase 2 | Cron-like job runner for all periodic tasks | @@ -208,6 +212,53 @@ The following stacks cannot be stopped from the customer UI: - `cloudflared` (tunnel) - `felhom-controller` (this container) +## Logging + +The controller uses two-tier logging controlled by `logging.level` in `controller.yaml`: + +```yaml +logging: + level: debug # debug | info | warn | error (default: info) + file: "" # optional log file path + max_size_mb: 10 + max_files: 3 +``` + +Can also be set via environment variable: `FELHOM_LOGGING_LEVEL=debug` + +### What gets logged at each level + +| Level | What's logged | +|-------|--------------| +| **info** | Operation success/failure with elapsed time, post-start container states, stack scan counts, deploy memory checks | +| **debug** | All of above + env var keys per compose command, local image availability checks, compose command timing, log fetch byte counts | + +### Example output (debug level) + +``` +[INFO] Starting stack: immich +[DEBUG] Env vars for compose: [DOMAIN, DB_PASSWORD, HDD_PATH] (3 app + 42 system) +[DEBUG] Running: docker compose up -d (in /opt/docker/stacks/immich) +[DEBUG] Command completed: docker compose up -d (took 12.3s) +[INFO] Stack immich started successfully (took 12.3s) +[INFO] Stack immich post-start status: +[INFO] immich-server ghcr.io/immich-app/immich-server:release running Up 3 seconds (health: starting) +[INFO] immich-postgres docker.io/tensorchord/pgvecto-rs:pg16... running Up 3 seconds (healthy) +[INFO] immich-redis docker.io/library/redis:7-alpine running Up 3 seconds (healthy) +``` + +On failure: +``` +[ERROR] Command failed: docker compose up -d (in /opt/docker/stacks/immich) — exit code 1 (took 2.1s) +[ERROR] stderr: Error response from daemon: pull access denied... +[ERROR] Stack immich start failed after 2.1s: exit code 1 +``` + +### Security + +- Env var **values** are never logged — only keys appear in debug output +- stdout/stderr in error logs are truncated to 500 characters to prevent log spam + ## Configuration ### Controller config (infrastructure only) @@ -356,7 +407,9 @@ docker compose up -d - [ ] Restore workflow ### Phase 4 — Git Sync & Updates -- [ ] Periodic git pull for stack definitions +- [x] Periodic git pull for stack definitions (git sync module) +- [x] Manual sync button on Alkalmazások page ("Sablonok frissítése") +- [x] Sync status in `/api/system/info` - [ ] Update classification (optional/required/security) - [ ] Update window enforcement - [ ] Dashboard update notifications with "Update" button diff --git a/controller/internal/stacks/deploy.go b/controller/internal/stacks/deploy.go index da76314..754f391 100644 --- a/controller/internal/stacks/deploy.go +++ b/controller/internal/stacks/deploy.go @@ -178,9 +178,16 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { } m.logger.Printf("[INFO] Deploying stack %s with %d env vars: [%s]", req.StackName, len(env), strings.Join(envKeys, ", ")) + // Check which images are available locally before pulling + if m.isDebug() { + m.checkLocalImages(req.StackName, stackDir) + } + // Run docker compose up -d + start := time.Now() _, composeErr := m.composeExecWithEnv(stackDir, env, "up", "-d") if composeErr != nil { + m.logger.Printf("[ERROR] Stack %s deploy failed after %.1fs: %v", req.StackName, time.Since(start).Seconds(), composeErr) // Deployment failed — keep app.yaml for debugging but mark as not deployed appCfg.Deployed = false _ = SaveAppConfig(stackDir, appCfg) @@ -196,7 +203,12 @@ func (m *Manager) DeployStack(req DeployRequest) (string, error) { } m.mu.Unlock() - m.logger.Printf("[INFO] Stack %s deployed successfully", req.StackName) + m.logger.Printf("[INFO] Stack %s deployed successfully (took %.1fs)", req.StackName, time.Since(start).Seconds()) + + // Post-deploy container state check (async, non-blocking) + deployEnv := m.stackEnv(stackDir) + m.logPostStartStatus(req.StackName, stackDir, deployEnv) + return deployWarning, m.RefreshStatus() } diff --git a/controller/internal/stacks/manager.go b/controller/internal/stacks/manager.go index f0a48ad..d1cefe1 100644 --- a/controller/internal/stacks/manager.go +++ b/controller/internal/stacks/manager.go @@ -166,7 +166,14 @@ func (m *Manager) ScanStacks() error { } } - m.logger.Printf("[INFO] Scanned stacks: %d found", len(m.stacks)) + deployedCount := 0 + for _, s := range m.stacks { + if s.Deployed { + deployedCount++ + } + } + m.logger.Printf("[INFO] Scanned stacks: %d found (%d deployed, %d available)", + len(m.stacks), deployedCount, len(m.stacks)-deployedCount) return m.refreshStatusLocked() } @@ -357,15 +364,18 @@ func (m *Manager) StartStack(name string) error { } m.logger.Printf("[INFO] Starting stack: %s", name) + start := time.Now() dir := filepath.Dir(stack.ComposePath) env := m.stackEnv(dir) if _, err := m.composeExecCustomEnv(dir, env, "up", "-d"); err != nil { + m.logger.Printf("[ERROR] Stack %s start failed after %.1fs: %v", name, time.Since(start).Seconds(), err) return fmt.Errorf("starting stack %s: %w", name, err) } - m.logger.Printf("[INFO] Stack %s started", name) + m.logger.Printf("[INFO] Stack %s started successfully (took %.1fs)", name, time.Since(start).Seconds()) + m.logPostStartStatus(name, dir, env) return m.RefreshStatus() } @@ -380,13 +390,15 @@ func (m *Manager) StopStack(name string) error { } m.logger.Printf("[INFO] Stopping stack: %s", name) + start := time.Now() dir := filepath.Dir(stack.ComposePath) if _, err := m.composeExec(dir, "down"); err != nil { + m.logger.Printf("[ERROR] Stack %s stop failed after %.1fs: %v", name, time.Since(start).Seconds(), err) return fmt.Errorf("stopping stack %s: %w", name, err) } - m.logger.Printf("[INFO] Stack %s stopped", name) + m.logger.Printf("[INFO] Stack %s stopped successfully (took %.1fs)", name, time.Since(start).Seconds()) return m.RefreshStatus() } @@ -397,13 +409,17 @@ func (m *Manager) RestartStack(name string) error { } m.logger.Printf("[INFO] Restarting stack: %s", name) + start := time.Now() dir := filepath.Dir(stack.ComposePath) if _, err := m.composeExec(dir, "restart"); err != nil { + m.logger.Printf("[ERROR] Stack %s restart failed after %.1fs: %v", name, time.Since(start).Seconds(), err) return fmt.Errorf("restarting stack %s: %w", name, err) } - m.logger.Printf("[INFO] Stack %s restarted", name) + m.logger.Printf("[INFO] Stack %s restarted successfully (took %.1fs)", name, time.Since(start).Seconds()) + env := m.stackEnv(dir) + m.logPostStartStatus(name, dir, env) return m.RefreshStatus() } @@ -414,18 +430,26 @@ func (m *Manager) UpdateStack(name string) error { } m.logger.Printf("[INFO] Updating stack: %s", name) + start := time.Now() dir := filepath.Dir(stack.ComposePath) env := m.stackEnv(dir) + if m.isDebug() { + m.checkLocalImages(name, dir) + } + if _, err := m.composeExecCustomEnv(dir, env, "pull"); err != nil { + m.logger.Printf("[ERROR] Stack %s update (pull) failed after %.1fs: %v", name, time.Since(start).Seconds(), err) return fmt.Errorf("pulling images for %s: %w", name, err) } if _, err := m.composeExecCustomEnv(dir, env, "up", "-d", "--remove-orphans"); err != nil { + m.logger.Printf("[ERROR] Stack %s update (up) failed after %.1fs: %v", name, time.Since(start).Seconds(), err) return fmt.Errorf("recreating %s: %w", name, err) } - m.logger.Printf("[INFO] Stack %s updated", name) + m.logger.Printf("[INFO] Stack %s updated successfully (took %.1fs)", name, time.Since(start).Seconds()) + m.logPostStartStatus(name, dir, env) return m.RefreshStatus() } @@ -442,11 +466,20 @@ func (m *Manager) GetLogs(name string, lines int) (string, error) { lines = 1000 } + m.logger.Printf("[DEBUG] Fetching logs for %s (tail %d)", name, lines) + dir := filepath.Dir(stack.ComposePath) output, err := m.composeExec(dir, "logs", "--tail", fmt.Sprintf("%d", lines), "--no-color") if err != nil { + m.logger.Printf("[WARN] Failed to fetch logs for %s: %v", name, err) return "", fmt.Errorf("getting logs for %s: %w", name, err) } + + if len(output) == 0 { + m.logger.Printf("[DEBUG] Logs result for %s: 0 bytes returned (empty)", name) + } else { + m.logger.Printf("[DEBUG] Logs result for %s: %d bytes returned", name, len(output)) + } return output, nil } @@ -489,7 +522,30 @@ func (m *Manager) composeExecCustomEnv(dir string, env []string, args ...string) if env != nil { cmd.Env = env } else { - cmd.Env = m.stackEnv(dir) + env = m.stackEnv(dir) + cmd.Env = env + } + + // Log env var keys at debug level + if m.isDebug() { + var appKeys []string + sysCount := 0 + for _, e := range env { + parts := strings.SplitN(e, "=", 2) + if len(parts) == 2 { + key := parts[0] + // Only log non-system env vars (skip PATH, HOME, etc.) + if strings.ToUpper(key) == key && !strings.HasPrefix(key, "_") { + appKeys = append(appKeys, key) + } else { + sysCount++ + } + } + } + if len(appKeys) > 0 { + m.logger.Printf("[DEBUG] Env vars for compose: [%s] (%d app + %d system)", + strings.Join(appKeys, ", "), len(appKeys), sysCount) + } } var stdout, stderr bytes.Buffer @@ -498,10 +554,25 @@ func (m *Manager) composeExecCustomEnv(dir string, env []string, args ...string) m.logger.Printf("[DEBUG] Running: %s %s (in %s)", m.composeCmd, strings.Join(args, " "), dir) + start := time.Now() if err := cmd.Run(); err != nil { - return stdout.String(), fmt.Errorf("%w\nstderr: %s", err, stderr.String()) + elapsed := time.Since(start) + exitCode := -1 + if exitErr, ok := err.(*exec.ExitError); ok { + exitCode = exitErr.ExitCode() + } + m.logger.Printf("[ERROR] Command failed: %s %s (in %s) — exit code %d (took %.1fs)", + m.composeCmd, strings.Join(args, " "), dir, exitCode, elapsed.Seconds()) + if stdoutStr := truncateStr(stdout.String(), 500); stdoutStr != "" { + m.logger.Printf("[ERROR] stdout: %s", stdoutStr) + } + if stderrStr := truncateStr(stderr.String(), 500); stderrStr != "" { + m.logger.Printf("[ERROR] stderr: %s", stderrStr) + } + return stdout.String(), fmt.Errorf("exit code %d\nstderr: %s", exitCode, truncateStr(stderr.String(), 500)) } + m.logger.Printf("[DEBUG] Command completed: %s %s (took %.1fs)", m.composeCmd, strings.Join(args, " "), time.Since(start).Seconds()) return stdout.String(), nil } @@ -519,6 +590,88 @@ func (m *Manager) execCommand(name string, args ...string) (string, error) { return stdout.String(), nil } +// isDebug returns true if logging level is "debug". +func (m *Manager) isDebug() bool { + return m.cfg.Logging.Level == "debug" +} + +// truncateStr truncates a string to maxLen characters, appending "..." if truncated. +func truncateStr(s string, maxLen int) string { + s = strings.TrimSpace(s) + if len(s) <= maxLen { + return s + } + return s[:maxLen] + "..." +} + +// logPostStartStatus queries container states after a start/deploy operation +// and logs them. This runs asynchronously to avoid blocking the HTTP response. +func (m *Manager) logPostStartStatus(name, stackDir string, env []string) { + go func() { + time.Sleep(3 * time.Second) + + output, err := m.composeExecCustomEnv(stackDir, env, "ps", "-a", "--format", "table {{.Name}}\t{{.Image}}\t{{.State}}\t{{.Status}}") + if err != nil { + m.logger.Printf("[WARN] Post-start status check failed for %s: %v", name, err) + return + } + + lines := strings.Split(strings.TrimSpace(output), "\n") + if len(lines) <= 1 { + m.logger.Printf("[WARN] Post-start status for %s: no containers found", name) + return + } + + m.logger.Printf("[INFO] Stack %s post-start status:", name) + // Skip header line + for _, line := range lines[1:] { + m.logger.Printf("[INFO] %s", line) + } + }() +} + +// checkLocalImages parses docker-compose.yml for image: lines and checks which +// are available locally. Informational only — logs results but never fails. +func (m *Manager) checkLocalImages(name, stackDir string) { + composePath := filepath.Join(stackDir, "docker-compose.yml") + data, err := os.ReadFile(composePath) + if err != nil { + composePath = filepath.Join(stackDir, "docker-compose.yaml") + data, err = os.ReadFile(composePath) + if err != nil { + m.logger.Printf("[DEBUG] Could not read compose file for image check: %v", err) + return + } + } + + var images []string + for _, line := range strings.Split(string(data), "\n") { + trimmed := strings.TrimSpace(line) + if strings.HasPrefix(trimmed, "image:") { + img := strings.TrimSpace(strings.TrimPrefix(trimmed, "image:")) + img = strings.Trim(img, "\"'") + if img != "" && !strings.Contains(img, "${") { + images = append(images, img) + } + } + } + + if len(images) == 0 { + m.logger.Printf("[DEBUG] No static image references found in %s compose file", name) + return + } + + m.logger.Printf("[INFO] Deploying stack %s — checking %d images...", name, len(images)) + for _, img := range images { + cmd := exec.Command("docker", "image", "inspect", img) + if err := cmd.Run(); err != nil { + m.logger.Printf("[DEBUG] %s — not found locally, will pull", img) + } else { + m.logger.Printf("[DEBUG] %s — found locally", img) + } + } +} + // --- Memory helpers --- // ParseMemoryMB parses a memory string like "500M", "1G", "1.5G", "1024M", "768"