# TASK: App Telemetry & Analytics

**Controller:** v0.27.3 → v0.28.0
**Hub:** v0.3.8 → v0.4.0

## Overview

Add per-app (per-stack) memory/CPU telemetry and container log error scanning to the controller's report push cycle, then build fleet-wide analytics dashboard pages in the hub.

---

## Spec Issues Found (corrections already applied in this plan)

1. **Wrong column name in metrics DB**: Spec uses `memory_bytes`; the actual column is `mem_usage_mb` (already in MB). No byte→MB conversion needed.
2. **`ts` column is INTEGER (Unix timestamp)**, not datetime: WHERE clauses must compare against Unix seconds (`since.Unix()`).
3. **`metricsStore` already passed to `BuildReport()`**: no main.go wiring change needed for that dependency.
4. **Chart.js is NOT in the hub**: it needs to be added (copy from controller's `internal/web/static/chart.min.js`).
5. **Hub nav is header-based**, not sidebar: add "Alkalmazások" to the header `<nav>`.
6. **`StackInfo` type undefined** in spec: use existing `stacks.Stack` directly (report already imports stacks).
7. **Fingerprinting threshold**: Use 6+ digits instead of 4+ to avoid mangling HTTP status codes (404, 503) and port numbers.

---

## Phase 1: Controller — Metrics Telemetry

### File: `controller/internal/metrics/telemetry.go` (NEW)

Create this new file in the existing `metrics` package.

```go
package metrics

import (
	"time"
)

// ContainerTelemetry holds aggregated resource stats for one container.
type ContainerTelemetry struct {
	ContainerName   string  `json:"container_name"`
	MemoryCurrentMB float64 `json:"memory_current_mb"`
	MemoryAvgMB     float64 `json:"memory_avg_mb"`
	MemoryPeakMB    float64 `json:"memory_peak_mb"`
	CPUAvgPercent   float64 `json:"cpu_avg_percent"`
	SampleCount     int     `json:"sample_count"`
}

// GetContainerTelemetry queries the metrics DB for per-container resource
// summaries since the given time. Returns empty slice (not error) if no data.
func (s *MetricsStore) GetContainerTelemetry(since time.Time) ([]ContainerTelemetry, error) {
	sinceUnix := since.Unix()

	// Get averages and peaks
	rows, err := s.db.Query(`
		SELECT container_name,
		       AVG(mem_usage_mb),
		       MAX(mem_usage_mb),
		       AVG(cpu_percent),
		       COUNT(*)
		FROM container_metrics
		WHERE ts > ?
		GROUP BY container_name`, sinceUnix)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var results []ContainerTelemetry
	for rows.Next() {
		var ct ContainerTelemetry
		if err := rows.Scan(&ct.ContainerName, &ct.MemoryAvgMB, &ct.MemoryPeakMB,
			&ct.CPUAvgPercent, &ct.SampleCount); err != nil {
			continue // best effort: skip rows that fail to scan
		}
		results = append(results, ct)
	}
	// Surface errors that ended the iteration early.
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// Get current (most recent) memory per container using QueryContainerSummary
	if stats, err := s.QueryContainerSummary(); err == nil {
		currentMap := make(map[string]float64, len(stats))
		for _, st := range stats {
			currentMap[st.ContainerName] = st.MemUsageMB
		}
		for i := range results {
			if cur, ok := currentMap[results[i].ContainerName]; ok {
				results[i].MemoryCurrentMB = cur
			}
		}
	}

	if results == nil {
		results = []ContainerTelemetry{}
	}
	return results, nil
}
```

**Key details:**
- Method on existing `*MetricsStore`: no new struct needed
- `ts` column is Unix INTEGER: compare with `since.Unix()`
- `mem_usage_mb` is already in MB: no conversion
- Uses existing `QueryContainerSummary()` for current values (returns latest row per container, ordered by CPU DESC)
- Returns empty slice on no data, not error

---

## Phase 2: Controller — Log Scanner

### File: `controller/internal/metrics/logscanner.go` (NEW)

Create in the `metrics` package (it's data collection, same domain as metrics).

```go
package metrics

import (
	"context"
	"log"
	"os/exec"
	"regexp"
	"sort"
	"strings"
	"time"
	"unicode/utf8"
)
```

**Types:**
```go
type ContainerLogSummary struct {
	ContainerName string     `json:"container_name"`
	ErrorCount    int        `json:"error_count"`
	WarnCount     int        `json:"warn_count"`
	RecentIssues  []LogIssue `json:"recent_issues,omitempty"`
}

type LogIssue struct {
	Severity string    `json:"severity"`
	Message  string    `json:"message"`
	Count    int       `json:"count"`
	LastSeen time.Time `json:"last_seen"`
}
```

**Function: `ScanContainerLogs(containerNames []string, since time.Duration, logger *log.Logger) []ContainerLogSummary`**

Implementation notes:
- Iterate `containerNames` **sequentially** (not parallel, to avoid load spikes)
- For each container, run: `exec.CommandContext(ctx, "docker", "logs", "--since=15m", "--tail=1000", containerName)`. Derive the `--since` value from the `since` argument (15m for the current caller) instead of hardcoding it
- Context timeout: 10 seconds per container
- Merge stderr into stdout (`cmd.CombinedOutput()`)
- On error: log at DEBUG, skip container, continue
- **Skip non-UTF-8 lines** using `utf8.Valid([]byte(line))`
- **Truncate lines** to 500 chars before matching
- **Pattern matching**: check first 5 space-separated words of each line (case-insensitive):
  - Error patterns: `error`, `fatal`, `panic`, `crit`, `oom`, `killed`, `exception`, `traceback`
  - Warning patterns: `warn`, `warning`
- **Fingerprinting for deduplication:**
  - Strip leading timestamp (regex: `^\d{4}[-/]\d{2}[-/]\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*[Z ]?` and syslog-style `^[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2} `)
  - Apply the following replacements in this order (most specific first), so the digit rule does not break up UUIDs or hex strings before their own rules run:
    - Replace UUIDs (`[0-9a-f]{8}-[0-9a-f]{4}-...`) with `<UUID>`
    - Replace hex strings of 8+ chars with `<HEX>`
    - Replace sequences of 6+ digits with `<N>` (NOT 4+, which would mangle HTTP status codes and port numbers)
  - Trim whitespace, lowercase for grouping key
- **Group by fingerprint**, keep count + last_seen time
- **Limits:** Max 10 `RecentIssues` per container (sorted by count DESC, then last_seen DESC). Cap total issues across all containers at 50.
- **Total scan warning:** If total scan takes > 5 minutes, log a warning
- Return `[]ContainerLogSummary` (nil-safe: return an empty slice, never nil)

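
The pattern-matching rule from the notes above could look roughly like the sketch below. `classifyLine` and the returned severity strings (`"error"`, `"warning"`) are hypothetical names, and substring matching is an assumption, since the notes do not say whether patterns must match whole words.

```go
package main

import "strings"

// Patterns copied from the implementation notes.
var (
	errorPatterns = []string{"error", "fatal", "panic", "crit", "oom", "killed", "exception", "traceback"}
	warnPatterns  = []string{"warn", "warning"}
)

// classifyLine inspects only the first five whitespace-separated words of a
// log line and returns "error", "warning", or "" when nothing matches.
// Matching is a case-insensitive substring check (an assumption).
func classifyLine(line string) string {
	words := strings.Fields(line)
	if len(words) > 5 {
		words = words[:5]
	}
	head := strings.ToLower(strings.Join(words, " "))
	for _, p := range errorPatterns {
		if strings.Contains(head, p) {
			return "error"
		}
	}
	for _, p := range warnPatterns {
		if strings.Contains(head, p) {
			return "warning"
		}
	}
	return ""
}
```

Substring matching is deliberately loose: `oom` also matches words such as "room", so a word-boundary check may be worth considering if false positives show up in practice.
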

**The caller (report builder) is responsible for filtering out infrastructure containers before calling this function.** The function doesn't need config access.

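
The fingerprinting rules in the implementation notes can be sketched as follows. This is a minimal illustration, not the final implementation: the function name `fingerprint`, the full UUID pattern, and the choice to report all-digit matches of the hex rule as `<N>` are assumptions. UUIDs are replaced before hex strings and digit runs so that the broader rules cannot split a UUID first.

```go
package main

import (
	"regexp"
	"strings"
)

var (
	isoTimestampRe    = regexp.MustCompile(`^\d{4}[-/]\d{2}[-/]\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*[Z ]?`)
	syslogTimestampRe = regexp.MustCompile(`^[A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2} `)
	uuidRe            = regexp.MustCompile(`(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)
	hexRe             = regexp.MustCompile(`(?i)\b[0-9a-f]{8,}\b`)
	digitsRe          = regexp.MustCompile(`\d{6,}`)
)

// fingerprint normalizes a log line into a grouping key.
func fingerprint(line string) string {
	s := isoTimestampRe.ReplaceAllString(line, "")
	s = syslogTimestampRe.ReplaceAllString(s, "")
	s = uuidRe.ReplaceAllString(s, "<UUID>")
	// A run of 8+ digits is also valid hex; treat it as a number instead.
	s = hexRe.ReplaceAllStringFunc(s, func(m string) string {
		if strings.Trim(m, "0123456789") == "" {
			return "<N>"
		}
		return "<HEX>"
	})
	s = digitsRe.ReplaceAllString(s, "<N>")
	return strings.ToLower(strings.TrimSpace(s))
}
```

Because the key is lowercased last, the placeholders appear as `<uuid>`, `<hex>` and `<n>` in the resulting fingerprint. Short numbers such as status codes and ports pass through unchanged.
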

---

## Phase 3: Controller — Report Integration

### File: `controller/internal/report/types.go` (MODIFY)

Add these types and the new field to `Report`:

```go
// Add to Report struct:
AppTelemetry []AppTelemetry `json:"app_telemetry,omitempty"`

// New types (add after StacksReport):

// AppTelemetry holds per-app (per-stack) resource and log telemetry.
type AppTelemetry struct {
	AppName         string             `json:"app_name"`
	DisplayName     string             `json:"display_name"`
	Containers      []string           `json:"containers"`
	MemoryCurrentMB float64            `json:"memory_current_mb"`
	MemoryAvgMB     float64            `json:"memory_avg_mb"`
	MemoryPeakMB    float64            `json:"memory_peak_mb"`
	CPUAvgPercent   float64            `json:"cpu_avg_percent"`
	CatalogEstimate string             `json:"catalog_estimate"`
	CatalogLimit    string             `json:"catalog_limit"`
	LogErrors       int                `json:"log_errors"`
	LogWarnings     int                `json:"log_warnings"`
	Issues          []metrics.LogIssue `json:"issues,omitempty"`
}
```

Note: `LogIssue` is defined in the `metrics` package (from logscanner.go). The `report` package already imports `metrics`.

### File: `controller/internal/report/telemetry.go` (NEW)

**Function: `buildAppTelemetry(allStacks []stacks.Stack, telemetry []metrics.ContainerTelemetry, logs []metrics.ContainerLogSummary) []AppTelemetry`**

Private function (lowercase `b`), called from `BuildReport`. Logic:

1. Build lookup maps: `containerName → ContainerTelemetry` and `containerName → ContainerLogSummary`
2. Iterate `allStacks`. Skip stacks where `s.Protected || !s.Deployed`
3. For each stack:
   - Collect container names from `s.Containers`
   - Sum `MemoryCurrentMB`, `MemoryAvgMB`, `MemoryPeakMB`, `CPUAvgPercent` across all containers in the stack
   - Sum `ErrorCount`, `WarnCount` across all containers
   - Merge `RecentIssues` from all containers, sort by count DESC, cap at 10
   - Get `CatalogEstimate` from `s.Meta.Resources.MemRequest` and `CatalogLimit` from `s.Meta.Resources.MemLimit`
   - Get `DisplayName` from `s.Meta.DisplayName`
4. Return slice sorted by `AppName`

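
The aggregation steps above can be sketched as follows. The types here are trimmed, hypothetical stand-ins for `stacks.Stack`, `metrics.ContainerTelemetry`, `metrics.ContainerLogSummary` and `AppTelemetry` (in particular, the stack's `Name` field is an assumption), and issue merging and catalog fields are left out to keep the example short.

```go
package main

import "sort"

// Stand-in types; the real ones live in the stacks, metrics and report packages.
type container struct{ Name string }

type stack struct {
	Name       string
	Protected  bool
	Deployed   bool
	Containers []container
}

type containerTelemetry struct {
	ContainerName string
	MemoryAvgMB   float64
	MemoryPeakMB  float64
}

type containerLogSummary struct {
	ContainerName string
	ErrorCount    int
	WarnCount     int
}

type appTelemetry struct {
	AppName      string
	Containers   []string
	MemoryAvgMB  float64
	MemoryPeakMB float64
	LogErrors    int
	LogWarnings  int
}

// aggregateApps sums per-container values into one entry per deployed,
// non-protected stack and returns the entries sorted by app name.
func aggregateApps(all []stack, tel []containerTelemetry, logs []containerLogSummary) []appTelemetry {
	telBy := make(map[string]containerTelemetry, len(tel))
	for _, t := range tel {
		telBy[t.ContainerName] = t
	}
	logBy := make(map[string]containerLogSummary, len(logs))
	for _, l := range logs {
		logBy[l.ContainerName] = l
	}

	out := []appTelemetry{}
	for _, s := range all {
		if s.Protected || !s.Deployed {
			continue
		}
		app := appTelemetry{AppName: s.Name}
		for _, c := range s.Containers {
			app.Containers = append(app.Containers, c.Name)
			t := telBy[c.Name] // zero value when the container has no samples
			app.MemoryAvgMB += t.MemoryAvgMB
			app.MemoryPeakMB += t.MemoryPeakMB
			l := logBy[c.Name]
			app.LogErrors += l.ErrorCount
			app.LogWarnings += l.WarnCount
		}
		out = append(out, app)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].AppName < out[j].AppName })
	return out
}
```

Summing per-container peaks gives an upper bound for the stack's peak, since the containers need not peak at the same moment. That is a conservative figure for sizing memory limits.
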

### File: `controller/internal/report/builder.go` (MODIFY)

In `BuildReport()`, add AFTER the stacks section (before the final debug log), approximately at line 151:

```go
// App telemetry (metrics + log scan)
r.AppTelemetry = buildAppTelemetrySection(stackMgr, metricsStore, logger)
```

Create helper function `buildAppTelemetrySection`:
```go
func buildAppTelemetrySection(stackMgr *stacks.Manager, metricsStore *metrics.MetricsStore, logger *log.Logger) []AppTelemetry {
	allStacks := stackMgr.GetStacks()

	// 1. Get metrics telemetry (last 15 minutes)
	var telemetry []metrics.ContainerTelemetry
	if metricsStore != nil {
		var err error
		telemetry, err = metricsStore.GetContainerTelemetry(time.Now().Add(-15 * time.Minute))
		if err != nil && logger != nil {
			logger.Printf("[WARN] Telemetry metrics query failed: %v", err)
		}
	}

	// 2. Collect non-protected container names for log scan
	var containerNames []string
	for _, s := range allStacks {
		if s.Protected || !s.Deployed {
			continue
		}
		for _, c := range s.Containers {
			containerNames = append(containerNames, c.Name)
		}
	}

	// 3. Scan logs
	logs := metrics.ScanContainerLogs(containerNames, 15*time.Minute, logger)

	// 4. Build per-app telemetry
	return buildAppTelemetry(allStacks, telemetry, logs)
}
```

**Key: uses `s.Protected` (from config's protected list) to skip infrastructure containers**, so there are no hardcoded names.

### File: `controller/cmd/controller/main.go` (NO CHANGE)

**No changes to main.go needed.** The `metricsStore` is already passed to `BuildReport()`, which also already receives `cfg *config.Config` (line 25), so its signature stays as it is.

`buildAppTelemetrySection` does not take a `cfg` parameter. It does not call `cfg.IsProtectedStack()`; it relies on `s.Protected`, which the manager already sets on each stack during `ScanStacks()`, so the telemetry builder has no config dependency.

---

## Phase 4: Hub — Store Changes

### File: `hub/internal/store/store.go` (MODIFY)

#### 4a. Add table creation in `migrate()` function

Add after the existing table creation statements:

```sql
CREATE TABLE IF NOT EXISTS app_telemetry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT NOT NULL,
    app_name TEXT NOT NULL,
    display_name TEXT NOT NULL DEFAULT '',
    reported_at DATETIME NOT NULL,
    memory_current_mb REAL DEFAULT 0,
    memory_avg_mb REAL DEFAULT 0,
    memory_peak_mb REAL DEFAULT 0,
    cpu_avg_percent REAL DEFAULT 0,
    catalog_estimate TEXT DEFAULT '',
    catalog_limit TEXT DEFAULT '',
    log_errors INTEGER DEFAULT 0,
    log_warnings INTEGER DEFAULT 0,
    containers_json TEXT DEFAULT '[]',
    issues_json TEXT DEFAULT '[]'
);

CREATE INDEX IF NOT EXISTS idx_app_telemetry_lookup
    ON app_telemetry(app_name, reported_at);
CREATE INDEX IF NOT EXISTS idx_app_telemetry_customer
    ON app_telemetry(customer_id, app_name, reported_at);
CREATE INDEX IF NOT EXISTS idx_app_telemetry_prune
    ON app_telemetry(reported_at);

CREATE TABLE IF NOT EXISTS app_log_issues (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    app_name TEXT NOT NULL,
    fingerprint TEXT NOT NULL,
    severity TEXT NOT NULL,
    message TEXT NOT NULL,
    first_seen DATETIME NOT NULL,
    last_seen DATETIME NOT NULL,
    occurrence_count INTEGER DEFAULT 1,
    affected_customers TEXT DEFAULT '[]',
    UNIQUE(app_name, fingerprint)
);

CREATE INDEX IF NOT EXISTS idx_app_log_issues_app
    ON app_log_issues(app_name, last_seen DESC);
```

#### 4b. Add types (top of file or separate types section)

```go
type AppTelemetryRecord struct {
	AppName         string   `json:"app_name"`
	DisplayName     string   `json:"display_name"`
	Containers      []string `json:"containers"`
	MemoryCurrentMB float64  `json:"memory_current_mb"`
	MemoryAvgMB     float64  `json:"memory_avg_mb"`
	MemoryPeakMB    float64  `json:"memory_peak_mb"`
	CPUAvgPercent   float64  `json:"cpu_avg_percent"`
	CatalogEstimate string   `json:"catalog_estimate"`
	CatalogLimit    string   `json:"catalog_limit"`
	LogErrors       int      `json:"log_errors"`
	LogWarnings     int      `json:"log_warnings"`
	Issues          []struct {
		Severity string    `json:"severity"`
		Message  string    `json:"message"`
		Count    int       `json:"count"`
		LastSeen time.Time `json:"last_seen"`
	} `json:"issues,omitempty"`
}

type FleetAppSummary struct {
	AppName         string
	DisplayName     string
	DeploymentCount int
	AvgMemoryMB     float64
	PeakMemoryMB    float64 // max across fleet
	P95MemoryMB     float64
	AvgCPU          float64
	TotalErrors     int
	TotalWarnings   int
	CatalogEstimate string
	CatalogLimit    string
}

type AppTelemetryPoint struct {
	ReportedAt    time.Time
	CustomerID    string
	MemoryAvgMB   float64
	MemoryPeakMB  float64
	CPUAvgPercent float64
	LogErrors     int
	LogWarnings   int
}

type AppCustomerStats struct {
	CustomerID   string
	AvgMemoryMB  float64
	PeakMemoryMB float64
	AvgCPU       float64
	TotalErrors  int
	LastReport   time.Time
}

type CustomerAppSummary struct {
	AppName         string
	DisplayName     string
	MemoryCurrentMB float64
	MemoryAvgMB     float64
	MemoryPeakMB    float64
	CatalogLimit    string
	LogErrors       int
	LogWarnings     int
}

type AppIssue struct {
	ID                int
	AppName           string
	Fingerprint       string
	Severity          string
	Message           string
	FirstSeen         time.Time
	LastSeen          time.Time
	OccurrenceCount   int
	AffectedCustomers []string
}
```

#### 4c. Add store methods

**`SaveAppTelemetry(customerID string, reportedAt time.Time, records []AppTelemetryRecord) error`**
- Insert each record into `app_telemetry` table
- Serialize `Containers` to JSON for `containers_json`
- Serialize `Issues` to JSON for `issues_json`
- Use a transaction for batch insert
- For each record with non-empty issues, call `upsertAppIssue` for each issue. The reported issues carry no fingerprint field, so the hub has to derive the fingerprint itself (for example from the issue's `Message`)

**`upsertAppIssue(appName, fingerprint, severity, message, customerID string, lastSeen time.Time) error`**
- Private helper
- Use `INSERT INTO app_log_issues ... ON CONFLICT(app_name, fingerprint) DO UPDATE SET last_seen=MAX(last_seen, excluded.last_seen), occurrence_count=occurrence_count+1, ...`
- For `affected_customers`: parse existing JSON array, add customerID if not present, re-serialize

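
The `affected_customers` update described above (parse, add if missing, re-serialize) might be factored into a small helper like this one. `addCustomer` is a hypothetical name; treating an empty string as an empty list is an assumption made for robustness.

```go
package main

import "encoding/json"

// addCustomer returns the JSON array with customerID appended when it is not
// already present. An empty input string is treated as an empty array.
func addCustomer(existingJSON, customerID string) (string, error) {
	var ids []string
	if existingJSON != "" {
		if err := json.Unmarshal([]byte(existingJSON), &ids); err != nil {
			return "", err
		}
	}
	for _, id := range ids {
		if id == customerID {
			return existingJSON, nil // already listed, nothing to change
		}
	}
	ids = append(ids, customerID)
	out, err := json.Marshal(ids)
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```

The read and the write of `affected_customers` should happen inside the same transaction as the upsert, so that two reports arriving at once cannot overwrite each other's additions.
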

**`GetFleetAppSummary(since time.Time) ([]FleetAppSummary, error)`**
```sql
SELECT app_name,
       MAX(display_name) as display_name,
       COUNT(DISTINCT customer_id) as deployment_count,
       AVG(memory_avg_mb) as avg_memory_mb,
       MAX(memory_peak_mb) as peak_memory_mb,
       AVG(cpu_avg_percent) as avg_cpu,
       SUM(log_errors) as total_errors,
       SUM(log_warnings) as total_warnings,
       MAX(catalog_estimate) as catalog_estimate,
       MAX(catalog_limit) as catalog_limit
FROM app_telemetry
WHERE reported_at > ?
GROUP BY app_name
ORDER BY deployment_count DESC, avg_memory_mb DESC
```

For P95 memory: separate query per app (can be batched in Go):
```sql
SELECT memory_peak_mb FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
ORDER BY memory_peak_mb ASC
LIMIT 1 OFFSET (
    SELECT CAST(COUNT(*) * 0.95 AS INTEGER)
    FROM app_telemetry WHERE app_name = ? AND reported_at > ?
)
```

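
If the P95 is computed in Go instead of one SQL query per app, the same rule (sort ascending, take the row at offset `int(n * 0.95)`) can be applied to the peak values loaded for each app. A minimal sketch, with `p95` as a hypothetical helper name:

```go
package main

import "sort"

// p95 mirrors the SQL query: sort ascending and return the value at
// offset int(n * 0.95). It returns 0 for an empty input and does not
// modify the caller's slice.
func p95(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted)) * 0.95)
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}
```

With fewer than 20 samples the offset lands on the last row, so the P95 equals the maximum until enough reports have accumulated.
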

**`GetAppTelemetryHistory(appName string, since time.Time) ([]AppTelemetryPoint, error)`**
```sql
SELECT reported_at, customer_id, memory_avg_mb, memory_peak_mb, cpu_avg_percent, log_errors, log_warnings
FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
ORDER BY reported_at ASC
```

**`GetAppCustomerBreakdown(appName string, since time.Time) ([]AppCustomerStats, error)`**
```sql
SELECT customer_id, AVG(memory_avg_mb), MAX(memory_peak_mb), AVG(cpu_avg_percent),
       SUM(log_errors), MAX(reported_at)
FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
GROUP BY customer_id
ORDER BY AVG(memory_avg_mb) DESC
```

**`GetCustomerAppSummary(customerID string, since time.Time) ([]CustomerAppSummary, error)`**
```sql
SELECT t1.app_name, t1.display_name, t1.memory_current_mb,
       AVG(t2.memory_avg_mb), MAX(t2.memory_peak_mb),
       MAX(t2.catalog_limit),
       SUM(t2.log_errors), SUM(t2.log_warnings)
FROM app_telemetry t1
INNER JOIN (
    SELECT app_name, MAX(reported_at) as max_reported
    FROM app_telemetry WHERE customer_id = ? AND reported_at > ?
    GROUP BY app_name
) latest ON t1.app_name = latest.app_name AND t1.reported_at = latest.max_reported AND t1.customer_id = ?
LEFT JOIN app_telemetry t2 ON t2.app_name = t1.app_name AND t2.customer_id = ? AND t2.reported_at > ?
GROUP BY t1.app_name
ORDER BY AVG(t2.memory_avg_mb) DESC
```

This single-query form is overly complex. A simpler approach uses two queries. First, the averages and peaks:
```sql
-- Get 7d averages/peaks
SELECT app_name, MAX(display_name), AVG(memory_avg_mb), MAX(memory_peak_mb),
       MAX(catalog_limit), SUM(log_errors), SUM(log_warnings)
FROM app_telemetry WHERE customer_id = ? AND reported_at > ?
GROUP BY app_name ORDER BY AVG(memory_avg_mb) DESC
```
Then for current memory, get from the latest record per app. The outer table needs an alias so that the subquery correlates with the outer row instead of comparing the inner table with itself:
```sql
SELECT t.app_name, t.memory_current_mb FROM app_telemetry t
WHERE t.customer_id = ? AND t.reported_at = (
    SELECT MAX(reported_at) FROM app_telemetry WHERE customer_id = ? AND app_name = t.app_name
)
```

Run the two queries and merge the results in Go by `app_name`. This is simpler and more readable than the single query.

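
Merging the two result sets in Go is a single pass once the "latest record per app" rows are loaded into a map. The sketch below uses a trimmed, hypothetical stand-in for `CustomerAppSummary`, and `mergeCurrent` is an illustrative name.

```go
package main

// customerAppSummary is a trimmed stand-in for store.CustomerAppSummary.
type customerAppSummary struct {
	AppName         string
	MemoryCurrentMB float64
	MemoryAvgMB     float64
}

// mergeCurrent fills MemoryCurrentMB on each aggregate row from a map built
// from the latest-record query (app name -> current MB). Apps missing from
// the map keep a current value of 0. The order of rows is preserved.
func mergeCurrent(rows []customerAppSummary, current map[string]float64) []customerAppSummary {
	for i := range rows {
		rows[i].MemoryCurrentMB = current[rows[i].AppName]
	}
	return rows
}
```

Keeping the aggregate query as the driver preserves its `ORDER BY`, so the merged slice is already sorted by average memory.
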

**`GetAppIssues(appName string, limit int) ([]AppIssue, error)`**
```sql
SELECT id, app_name, fingerprint, severity, message, first_seen, last_seen,
       occurrence_count, affected_customers
FROM app_log_issues WHERE app_name = ?
ORDER BY last_seen DESC LIMIT ?
```
Parse `affected_customers` from JSON string to `[]string` in Go.

**`GetRecentIssuesAllApps(limit int) ([]AppIssue, error)`**
```sql
SELECT id, app_name, fingerprint, severity, message, first_seen, last_seen,
       occurrence_count, affected_customers
FROM app_log_issues ORDER BY last_seen DESC LIMIT ?
```

**`PruneAppTelemetry(before time.Time) (int64, error)`**
```sql
DELETE FROM app_telemetry WHERE reported_at < ?
```

**`PruneStaleIssues(notSeenSince time.Time) (int64, error)`**
```sql
DELETE FROM app_log_issues WHERE last_seen < ?
```

---

## Phase 5: Hub — API Handler Changes

### File: `hub/internal/api/handler.go` (MODIFY)

In the report handler (`POST /api/v1/report`), AFTER `store.SaveReport(customerID, body)`:

```go
// Parse and save app telemetry (backward-compatible: old controllers won't have this field)
var telemetryPayload struct {
	AppTelemetry []store.AppTelemetryRecord `json:"app_telemetry"`
}
if err := json.Unmarshal(body, &telemetryPayload); err == nil && len(telemetryPayload.AppTelemetry) > 0 {
	if err := h.store.SaveAppTelemetry(customerID, time.Now(), telemetryPayload.AppTelemetry); err != nil {
		h.logger.Printf("[WARN] Failed to save app telemetry for %s: %v", customerID, err)
	}
}
```

This is non-breaking: if `app_telemetry` is absent or null, the slice will be empty and nothing is stored.

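
The backward-compatibility claim can be checked in isolation. The sketch below uses a one-field, hypothetical stand-in for `store.AppTelemetryRecord` and a helper named `parseAppTelemetry`, which is not part of the plan. It relies only on `encoding/json` leaving a slice empty when the key is missing or `null`.

```go
package main

import "encoding/json"

// appTelemetryRecord is a trimmed stand-in for store.AppTelemetryRecord.
type appTelemetryRecord struct {
	AppName string `json:"app_name"`
}

// parseAppTelemetry extracts only the app_telemetry field from a report body
// and ignores every other field. Invalid JSON yields no records.
func parseAppTelemetry(body []byte) []appTelemetryRecord {
	var payload struct {
		AppTelemetry []appTelemetryRecord `json:"app_telemetry"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return nil
	}
	return payload.AppTelemetry
}
```

Reports from older controllers therefore pass through the new code path without storing anything, and unknown fields in newer reports are ignored.
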

---

## Phase 6: Hub — Prune Updates

### File: `hub/cmd/hub/main.go` (MODIFY)

In the prune goroutine (find where `store.Prune(maxDays)` is called), add after it:

```go
if n, err := st.PruneAppTelemetry(time.Now().Add(-90 * 24 * time.Hour)); err != nil {
	logger.Printf("[ERROR] Prune app telemetry: %v", err)
} else if n > 0 {
	logger.Printf("[INFO] Pruned %d old app telemetry rows", n)
}
if n, err := st.PruneStaleIssues(time.Now().Add(-30 * 24 * time.Hour)); err != nil {
	logger.Printf("[ERROR] Prune stale issues: %v", err)
} else if n > 0 {
	logger.Printf("[INFO] Pruned %d stale app issues", n)
}
```

---

## Phase 7: Hub — Web Dashboard Pages

### 7a. Add Chart.js to hub

**Copy** `controller/internal/web/static/chart.min.js` to `hub/internal/web/static/chart.min.js`.

**Modify** `hub/internal/web/embed.go`. The hub already has this file with `//go:embed templates/*` and `var templateFS embed.FS`; leave those as they are and add the chart.min.js embed directive alongside them:
```go
package web

import "embed"

// Existing directive and variable: leave unchanged.
//go:embed templates/*
var templateFS embed.FS

// New: embed the Chart.js bundle.
//go:embed static/chart.min.js
var chartJS []byte
```

**Add route** in `server.go` ServeHTTP:
```go
case path == "/static/chart.min.js":
	w.Header().Set("Content-Type", "application/javascript")
	w.Header().Set("Cache-Control", "public, max-age=86400")
	w.Write(chartJS)
```

### 7b. Add routes in `server.go` ServeHTTP

Add BEFORE the `case strings.HasPrefix(path, "/customers/")` block. Keep the exact `/apps` and `/apps/` match ahead of the `/apps/` prefix match, so the list page is not routed to the detail handler with an empty app name:

```go
case path == "/apps" || path == "/apps/":
	s.handleApps(w, r)
case strings.HasPrefix(path, "/apps/"):
	appName := strings.TrimPrefix(path, "/apps/")
	s.handleAppDetail(w, r, appName)
```

### 7c. Navigation update

In **`hub/internal/web/templates/dashboard.html`** (or wherever the header nav is defined), add between "Dashboard" and "Customers" links:

```html
<a href="/apps" class="nav-link">Alkalmazások</a>
```

If the nav is in a shared layout/header included in all templates, update it there. If each template has its own nav block, update all of them.

### 7d. Handler: `handleApps`

**File: `hub/internal/web/server.go`** (or a new `hub/internal/web/apps.go` for cleaner organization)

```go
func (s *Server) handleApps(w http.ResponseWriter, r *http.Request) {
	// Parse time range from query param: ?period=24h|7d|30d (default 7d)
	period := r.URL.Query().Get("period")
	since := parsePeriod(period, 7*24*time.Hour) // helper: returns time.Now().Add(-duration)

	// Sort param: ?sort=memory|deployments|errors&order=asc|desc
	sortBy := r.URL.Query().Get("sort")
	order := r.URL.Query().Get("order")

	summary, err := s.store.GetFleetAppSummary(since)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}

	// Sort in Go based on query params (default: by deployment count DESC)
	sortFleetSummary(summary, sortBy, order)

	// Summary cards
	totalApps := len(summary)
	totalDeployments := 0
	appsWithErrors := 0
	for _, app := range summary { // "app", not "s", to avoid shadowing the receiver
		totalDeployments += app.DeploymentCount
		if app.TotalErrors > 0 {
			appsWithErrors++
		}
	}

	data := map[string]interface{}{
		"Apps": summary, "Period": period,
		"TotalApps": totalApps, "TotalDeployments": totalDeployments,
		"AppsWithErrors": appsWithErrors,
		"Sort": sortBy, "Order": order,
		"CSRFToken": csrfToken, // obtained the same way as in the existing handlers
	}
	s.templates.ExecuteTemplate(w, "apps.html", data)
}
```

Add helper `parsePeriod(s string, defaultDur time.Duration) time.Time`:
```go
func parsePeriod(s string, defaultDur time.Duration) time.Time {
	switch s {
	case "24h":
		return time.Now().Add(-24 * time.Hour)
	case "7d":
		return time.Now().Add(-7 * 24 * time.Hour)
	case "30d":
		return time.Now().Add(-30 * 24 * time.Hour)
	default:
		return time.Now().Add(-defaultDur)
	}
}
```

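
`handleApps` also calls `sortFleetSummary`, which this plan does not define. One possible shape is sketched below, using a trimmed, hypothetical stand-in for `store.FleetAppSummary`. The fallback behaviour (unknown sort key means deployments, anything other than `asc` means descending) is an assumption chosen to match the stated default.

```go
package main

import "sort"

// fleetAppSummary is a trimmed stand-in for store.FleetAppSummary.
type fleetAppSummary struct {
	AppName         string
	DeploymentCount int
	AvgMemoryMB     float64
	TotalErrors     int
}

// sortFleetSummary sorts apps in place. sortBy accepts "memory", "errors" or
// "deployments"; any other value falls back to deployments. Any order other
// than "asc" sorts descending.
func sortFleetSummary(apps []fleetAppSummary, sortBy, order string) {
	less := func(a, b fleetAppSummary) bool {
		switch sortBy {
		case "memory":
			return a.AvgMemoryMB < b.AvgMemoryMB
		case "errors":
			return a.TotalErrors < b.TotalErrors
		default:
			return a.DeploymentCount < b.DeploymentCount
		}
	}
	sort.SliceStable(apps, func(i, j int) bool {
		if order == "asc" {
			return less(apps[i], apps[j])
		}
		return less(apps[j], apps[i])
	})
}
```

A stable sort keeps the SQL ordering as the tie-breaker, so apps with equal deployment counts remain ordered by average memory.
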

### 7e. Handler: `handleAppDetail`

```go
func (s *Server) handleAppDetail(w http.ResponseWriter, r *http.Request, appName string) {
	period := r.URL.Query().Get("period")
	since := parsePeriod(period, 7*24*time.Hour)

	// Query errors are ignored below: a failed query leaves its section empty.

	// Get customer breakdown
	customers, _ := s.store.GetAppCustomerBreakdown(appName, since)

	// Get telemetry history for chart
	history, _ := s.store.GetAppTelemetryHistory(appName, since)

	// Get issues
	issues, _ := s.store.GetAppIssues(appName, 20)

	// Compute fleet summary for this single app (for overview card)
	fleetAll, _ := s.store.GetFleetAppSummary(since)
	var appSummary *store.FleetAppSummary
	for i := range fleetAll {
		if fleetAll[i].AppName == appName {
			appSummary = &fleetAll[i]
			break
		}
	}

	// Suggested mem_limit: ceil(P95 * 1.2), rounded up to the next multiple of 32 MB
	var suggestedLimit int
	if appSummary != nil && appSummary.P95MemoryMB > 0 {
		raw := math.Ceil(appSummary.P95MemoryMB * 1.2) // requires the "math" import
		suggestedLimit = ((int(raw) + 31) / 32) * 32
	}

	// Prepare chart data (aggregate by time bucket for fleet-wide view)
	// Group history points by reported_at hour, compute avg of avgs and max of peaks
	chartData := aggregateHistoryForChart(history)

	data := map[string]interface{}{
		"AppName": appName, "Summary": appSummary,
		"Customers": customers, "Issues": issues,
		"ChartData": chartData, "SuggestedLimit": suggestedLimit,
		"Period": period, "CSRFToken": csrfToken,
	}
	s.templates.ExecuteTemplate(w, "app_detail.html", data)
}
```

`aggregateHistoryForChart` groups data points into hourly buckets and returns `{Labels []string, AvgMemory []float64, PeakMemory []float64, CatalogLimit []float64}` for Chart.js.


### 7f. Extend `handleCustomerUnified`

In **`hub/internal/web/configs.go`** where `handleCustomerUnified` builds its template data, add:

```go
// App telemetry section
appTelemetry, _ := s.store.GetCustomerAppSummary(customerID, time.Now().Add(-7*24*time.Hour))
// Only show section if data exists
data["AppTelemetry"] = appTelemetry
data["HasAppTelemetry"] = len(appTelemetry) > 0
```

### 7g. Template: `hub/internal/web/templates/apps.html` (NEW)

Fleet-wide app list page. Follow existing hub template patterns:
- Same dark theme, same table styles
- Header nav with "Alkalmazások" active
- Summary cards row at top (3 cards: Total apps, Total deployments, Apps with errors)
- Time range selector buttons (24h / 7d / 30d) as links to `?period=...`
- Main table with columns: App name (link to /apps/{name}), Deployments, Avg Memory, P95 Memory, Catalog Estimate, Catalog Limit, Estimate Accuracy (icon), Errors (24h badge), Warnings (24h badge)
- Sortable column headers (links to `?sort=...&order=...`)
- Estimate accuracy, checked in this order: red dot if P95 > limit, otherwise yellow if P95 > 50% of limit, otherwise green
- All text in Hungarian

### 7h. Template: `hub/internal/web/templates/app_detail.html` (NEW)

Per-app detail page with:
1. **Overview card**: App name, display name, catalog estimates, deployment count, suggested mem_limit
2. **Memory trend chart**: `<canvas id="memoryChart">`, Chart.js line chart with:
   - Lines: Avg Memory (blue), Peak Memory (red)
   - Dashed horizontal line: Catalog mem_limit (green)
   - Data from `chartData` (injected via `{{ json .ChartData }}`)
3. **Customer breakdown table**: Customer link, Avg Memory, Peak Memory, CPU, Errors, Last Report
4. **Common issues table**: Severity badge, Message, Occurrences, Affected Customers, First/Last Seen

Include Chart.js: `<script src="/static/chart.min.js"></script>`

### 7i. Template: `hub/internal/web/templates/customer_unified.html` (MODIFY)

Add new section "Alkalmazás telemetria" (conditionally shown):

```html
{{ if .HasAppTelemetry }}
<div class="section">
  <h2>Alkalmazás telemetria</h2>
  <table class="data-table">
    <thead>
      <tr>
        <th>Alkalmazás</th>
        <th>Memória (jelenlegi)</th>
        <th>Memória (átlag 7d)</th>
        <th>Memória (csúcs 7d)</th>
        <th>Katalógus limit</th>
        <th>Hibák (24ó)</th>
        <th>Figyelmeztetések (24ó)</th>
      </tr>
    </thead>
    <tbody>
      {{ range .AppTelemetry }}
      <tr>
        <td><a href="/apps/{{ .AppName }}">{{ .DisplayName }}</a></td>
        <td>{{ formatFloat .MemoryCurrentMB }} MB</td>
        <td>{{ formatFloat .MemoryAvgMB }} MB</td>
        <td>{{ formatFloat .MemoryPeakMB }} MB</td>
        <td>{{ .CatalogLimit }}</td>
        <td>{{ if gt .LogErrors 0 }}<span class="badge badge-error">{{ .LogErrors }}</span>{{ else }}0{{ end }}</td>
        <td>{{ if gt .LogWarnings 0 }}<span class="badge badge-warn">{{ .LogWarnings }}</span>{{ else }}0{{ end }}</td>
      </tr>
      {{ end }}
    </tbody>
  </table>
</div>
{{ end }}
```

Memory cell coloring: Use inline style or CSS class based on percentage of catalog_limit. This requires a template function to parse the limit string and compare: add a `memoryColor(currentMB float64, limitStr string) string` template function that returns a CSS color.

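
A possible `memoryColor` is sketched below. Several details are assumptions rather than decisions made in this plan: the limit format (`512M`, `2G`, case-insensitive), the 70% and 90% thresholds, and returning one of the CSS class names `mem-ok`, `mem-warn`, `mem-danger` instead of a raw color. `parseLimitMB` is a hypothetical helper.

```go
package main

import (
	"strconv"
	"strings"
)

// parseLimitMB converts strings such as "512M" or "2G" to megabytes.
// The accepted suffixes are an assumption about the catalog format.
func parseLimitMB(limit string) (float64, bool) {
	s := strings.ToUpper(strings.TrimSpace(limit))
	mult := 1.0
	switch {
	case strings.HasSuffix(s, "G"):
		mult, s = 1024, strings.TrimSuffix(s, "G")
	case strings.HasSuffix(s, "M"):
		s = strings.TrimSuffix(s, "M")
	default:
		return 0, false
	}
	n, err := strconv.ParseFloat(s, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n * mult, true
}

// memoryColor returns a CSS class for a memory cell, or "" when the limit
// cannot be parsed. The 70% and 90% thresholds are placeholders.
func memoryColor(currentMB float64, limitStr string) string {
	limitMB, ok := parseLimitMB(limitStr)
	if !ok {
		return ""
	}
	switch ratio := currentMB / limitMB; {
	case ratio >= 0.9:
		return "mem-danger"
	case ratio >= 0.7:
		return "mem-warn"
	default:
		return "mem-ok"
	}
}
```

In the template this could be used as `<td class="{{ memoryColor .MemoryCurrentMB .CatalogLimit }}">`, once the function is registered in the template `funcMap`.
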

### 7j. CSS additions in `hub/internal/web/templates/style.css` (MODIFY)

Add styles for:
- `.badge`, `.badge-error` (red), `.badge-warn` (yellow)
- `.summary-cards` (flex row of 3 cards)
- `.summary-card` (dark card with number + label)
- `.chart-container` (responsive canvas wrapper)
- `.period-selector` (button group for time ranges)
- `.accuracy-dot` (small colored circle for estimate accuracy)
- Memory cell colors: `.mem-ok` (green text), `.mem-warn` (yellow), `.mem-danger` (red)

Follow existing hub dark theme color variables.

---

## Phase 8: Version Bumps & Changelog

### Controller
- Update version constant to `v0.28.0` in the appropriate file (likely `cmd/controller/main.go` or a `version.go` file)
- Add CHANGELOG.md entry

### Hub
- Update version constant to `v0.4.0`
- Add CHANGELOG.md entry (if hub has one)

---

## Implementation Checklist

### Controller (can be deployed and tested independently)
- [ ] `internal/metrics/telemetry.go`: GetContainerTelemetry method
- [ ] `internal/metrics/logscanner.go`: ScanContainerLogs + types
- [ ] `internal/report/types.go`: Add AppTelemetry field + type definitions
- [ ] `internal/report/telemetry.go`: buildAppTelemetry + buildAppTelemetrySection
- [ ] `internal/report/builder.go`: Call buildAppTelemetrySection in BuildReport
- [ ] Test: build, deploy to demo, verify `app_telemetry` in report JSON

### Hub (deploy after controller verified)
- [ ] `internal/store/store.go`: Tables + types + all store methods
- [ ] `internal/api/handler.go`: Parse & save telemetry from reports
- [ ] `cmd/hub/main.go`: Add prune calls
- [ ] `internal/web/static/chart.min.js`: Copy from controller
- [ ] `internal/web/embed.go`: Add chart.min.js embed
- [ ] `internal/web/server.go`: Routes + handlers + chart.js serving + parsePeriod helper
- [ ] `internal/web/templates/apps.html`: Fleet app list page
- [ ] `internal/web/templates/app_detail.html`: App detail page with chart
- [ ] `internal/web/templates/customer_unified.html`: Add telemetry section
- [ ] `internal/web/templates/style.css`: New styles
- [ ] Navigation update in all templates (add "Alkalmazások" link)
- [ ] Test: deploy hub, verify pages after a few report cycles

---

## Notes for Implementation

1. **Run `go build ./...` after each phase** to catch compile errors early
2. **The report package already imports `metrics` and `stacks`**, so there are no new import cycles
3. **Hub templates use `template.Must(template.New("").Funcs(funcMap).ParseFS(templateFS, "templates/*.html"))`**, so new .html files are automatically picked up
4. **Hub already has `embed.go`** at `hub/internal/web/embed.go`: add the chart.min.js embed directive there. Create the `hub/internal/web/static/` directory
5. **All hub template functions** are defined in the `server.go` New() constructor: add any new functions (like `memoryColor`) there
6. **CSRF tokens**: All POST forms need `<input type="hidden" name="_csrf" value="{{ .CSRFToken }}">`, but the apps pages are read-only (GET only), so no CSRF field is needed
7. **Hub's ServeHTTP uses hasSuffix/hasPrefix routing**: put the exact matches (`/apps`, `/apps/`) BEFORE the `/apps/` prefix match that serves `/apps/{name}`; otherwise the prefix case also catches the list page