TASK: App Telemetry & Analytics
Controller: v0.27.3 → v0.28.0
Hub: v0.3.8 → v0.4.0
Overview
Add per-app (per-stack) memory/CPU telemetry and container log error scanning to the controller's report push cycle, then build fleet-wide analytics dashboard pages in the hub.
Spec Issues Found (corrections already applied in this plan)
- Wrong column name in metrics DB: the spec uses `memory_bytes`, but the actual column is `mem_usage_mb` (already in MB). No byte→MB conversion is needed.
- The `ts` column is INTEGER (Unix timestamp), not DATETIME, so WHERE clauses must compare against `since.Unix()`.
- `metricsStore` is already passed to `BuildReport()`, so no `main.go` wiring change is needed for that dependency.
- Chart.js is NOT in the hub and needs to be added (copy it from the controller's `internal/web/static/chart.min.js`).
- The hub nav is header-based, not a sidebar: add "Alkalmazások" ("Applications") to the header `<nav>`.
- The `StackInfo` type is undefined in the spec: use the existing `stacks.Stack` directly (report already imports `stacks`).
- Fingerprinting threshold: replace runs of 6+ digits instead of 4+, so that port numbers (up to 5 digits, e.g. 8080, 27017) are not mangled. HTTP status codes (404, 503) have only 3 digits and are unaffected either way.
Phase 1: Controller — Metrics Telemetry
File: controller/internal/metrics/telemetry.go (NEW)
Create this new file in the existing metrics package.
```go
package metrics

import (
	"time"
)

// ContainerTelemetry holds aggregated resource stats for one container.
type ContainerTelemetry struct {
	ContainerName   string  `json:"container_name"`
	MemoryCurrentMB float64 `json:"memory_current_mb"`
	MemoryAvgMB     float64 `json:"memory_avg_mb"`
	MemoryPeakMB    float64 `json:"memory_peak_mb"`
	CPUAvgPercent   float64 `json:"cpu_avg_percent"`
	SampleCount     int     `json:"sample_count"`
}

// GetContainerTelemetry queries the metrics DB for per-container resource
// summaries since the given time. Returns an empty slice (not an error) if
// there is no data.
func (s *MetricsStore) GetContainerTelemetry(since time.Time) ([]ContainerTelemetry, error) {
	// Averages and peaks. ts is a Unix INTEGER column, so bind since.Unix().
	rows, err := s.db.Query(`
		SELECT container_name,
		       AVG(mem_usage_mb),
		       MAX(mem_usage_mb),
		       AVG(cpu_percent),
		       COUNT(*)
		FROM container_metrics
		WHERE ts > ?
		GROUP BY container_name`, since.Unix())
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	results := []ContainerTelemetry{} // non-nil, so "no data" yields []
	for rows.Next() {
		var ct ContainerTelemetry
		if err := rows.Scan(&ct.ContainerName, &ct.MemoryAvgMB, &ct.MemoryPeakMB,
			&ct.CPUAvgPercent, &ct.SampleCount); err != nil {
			return nil, err
		}
		results = append(results, ct)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// Current (most recent) memory per container via QueryContainerSummary.
	// Best-effort: if it fails, MemoryCurrentMB stays 0.
	if stats, err := s.QueryContainerSummary(); err == nil {
		currentMap := make(map[string]float64, len(stats))
		for _, st := range stats {
			currentMap[st.ContainerName] = st.MemUsageMB
		}
		for i := range results {
			if cur, ok := currentMap[results[i].ContainerName]; ok {
				results[i].MemoryCurrentMB = cur
			}
		}
	}
	return results, nil
}
```
Key details:
- Method on the existing `*MetricsStore`; no new struct needed.
- The `ts` column is a Unix INTEGER, so compare with `since.Unix()`.
- `mem_usage_mb` is already in MB; no conversion.
- Uses the existing `QueryContainerSummary()` for current values (it returns the latest row per container, ordered by CPU DESC).
- Returns an empty slice on no data, not an error.
Phase 2: Controller — Log Scanner
File: controller/internal/metrics/logscanner.go (NEW)
Create in the metrics package (it's data collection, same domain as metrics).
```go
package metrics

import (
	"context"
	"log"
	"os/exec"
	"regexp"
	"sort"
	"strings"
	"time"
	"unicode/utf8"
)
```
Types:
```go
// ContainerLogSummary holds the log scan result for one container.
type ContainerLogSummary struct {
	ContainerName string     `json:"container_name"`
	ErrorCount    int        `json:"error_count"`
	WarnCount     int        `json:"warn_count"`
	RecentIssues  []LogIssue `json:"recent_issues,omitempty"`
}

// LogIssue is one deduplicated (fingerprinted) group of log lines.
type LogIssue struct {
	Severity string    `json:"severity"`
	Message  string    `json:"message"`
	Count    int       `json:"count"`
	LastSeen time.Time `json:"last_seen"`
}
```
Function: `ScanContainerLogs(containerNames []string, since time.Duration, logger *log.Logger) []ContainerLogSummary`
Implementation notes:
- Iterate `containerNames` sequentially (not in parallel, to avoid load spikes).
- For each container, run `exec.CommandContext(ctx, "docker", "logs", "--since="+since.String(), "--tail=1000", containerName)`, i.e. use the `since` parameter (15 minutes from the caller) instead of hard-coding `--since=15m`.
- Context timeout: 10 seconds per container.
- Merge stderr into stdout with `cmd.CombinedOutput()`; `docker logs` replays the container's stderr on its own stderr, so both streams are needed.
- On error: log at DEBUG, skip the container, continue.
- Skip non-UTF-8 lines using `utf8.Valid([]byte(line))` (equivalently `utf8.ValidString(line)`).
- Truncate lines to 500 characters before matching.
- Pattern matching: check the first 5 space-separated words of each line (case-insensitive), after stripping the leading timestamp so that syslog prefixes do not use up those five words:
  - Error patterns: `error`, `fatal`, `panic`, `crit`, `oom`, `killed`, `exception`, `traceback`
  - Warning patterns: `warn`, `warning`
- Fingerprinting for deduplication, applied in this order:
  - Strip the leading timestamp (regex `^\d{4}[-/]\d{2}[-/]\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*[Z ]?` and syslog-style `^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}`; syslog pads single-digit days with a space, hence ` +`).
  - Replace UUIDs (`[0-9a-f]{8}-[0-9a-f]{4}-...`) with `<UUID>` first; otherwise the hex and digit rules consume the UUID's segments and the UUID pattern never matches.
  - Replace hex strings of 8+ chars with `<HEX>`.
  - Replace sequences of 6+ digits with `<N>` (NOT 4+, which would mangle port numbers).
  - Trim whitespace and lowercase to form the grouping key.
- Group by fingerprint, keeping count + last_seen time.
- Limits: max 10 `RecentIssues` per container (sorted by count DESC, then last_seen DESC). Cap total issues across all containers at 50.
- Total scan warning: if the whole scan takes more than 5 minutes, log a warning.
- Return `[]ContainerLogSummary` (nil-safe: return an empty slice, never nil).
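A minimal sketch of the classification and fingerprinting rules above, written as a standalone `package main` so it can run in isolation. The helper names (`stripTimestamp`, `classify`, `fingerprint`) are hypothetical, and the word-matching rule is an assumption the plan leaves open: a word matches when, lowercased, taken after its last `=` (to catch `level=error`) and trimmed of surrounding punctuation, it *starts with* a pattern (so `crit` also covers `CRITICAL`).

```go
package main

import (
	"regexp"
	"strings"
)

var (
	isoTS    = regexp.MustCompile(`^\d{4}[-/]\d{2}[-/]\d{2}[T ]\d{2}:\d{2}:\d{2}[.\d]*[Z ]?`)
	syslogTS = regexp.MustCompile(`^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}`) // day may be space-padded
	uuidRe   = regexp.MustCompile(`(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)
	hexRe    = regexp.MustCompile(`(?i)\b[0-9a-f]{8,}\b`)
	digitsRe = regexp.MustCompile(`\d{6,}`)

	errorPatterns = []string{"error", "fatal", "panic", "crit", "oom", "killed", "exception", "traceback"}
)

// stripTimestamp removes a leading ISO-8601-like or syslog-style timestamp.
func stripTimestamp(line string) string {
	return syslogTS.ReplaceAllString(isoTS.ReplaceAllString(line, ""), "")
}

// classify returns "error", "warn" or "" from the first five words of a
// timestamp-stripped line (assumed matching rule: see lead-in).
func classify(line string) string {
	words := strings.Fields(line)
	if len(words) > 5 {
		words = words[:5]
	}
	sev := ""
	for _, w := range words {
		w = strings.ToLower(w)
		if i := strings.LastIndex(w, "="); i >= 0 {
			w = w[i+1:]
		}
		w = strings.Trim(w, `[]():;,"'`)
		for _, p := range errorPatterns {
			if strings.HasPrefix(w, p) {
				return "error" // an error word outranks any warning word
			}
		}
		if strings.HasPrefix(w, "warn") { // covers "warn" and "warning"
			sev = "warn"
		}
	}
	return sev
}

// fingerprint builds the deduplication key. UUIDs are replaced before hex and
// digit runs so those rules cannot eat a UUID's segments; as a side effect,
// all-digit runs of 8+ characters become <HEX> rather than <N>.
func fingerprint(line string) string {
	s := uuidRe.ReplaceAllString(stripTimestamp(line), "<UUID>")
	s = hexRe.ReplaceAllString(s, "<HEX>")
	s = digitsRe.ReplaceAllString(s, "<N>")
	return strings.ToLower(strings.TrimSpace(s))
}
```

With this ordering, a 3-digit status code and a 4-digit port survive untouched, while request IDs and UUIDs collapse into stable placeholders.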
The caller (report builder) is responsible for filtering out infrastructure containers before calling this function. The function doesn't need config access.
Phase 3: Controller — Report Integration
File: controller/internal/report/types.go (MODIFY)
Add these types and the new field to Report:
```go
// Add to the Report struct:
AppTelemetry []AppTelemetry `json:"app_telemetry,omitempty"`

// New types (add after StacksReport):

// AppTelemetry holds per-app (per-stack) resource and log telemetry.
type AppTelemetry struct {
	AppName         string             `json:"app_name"`
	DisplayName     string             `json:"display_name"`
	Containers      []string           `json:"containers"`
	MemoryCurrentMB float64            `json:"memory_current_mb"`
	MemoryAvgMB     float64            `json:"memory_avg_mb"`
	MemoryPeakMB    float64            `json:"memory_peak_mb"`
	CPUAvgPercent   float64            `json:"cpu_avg_percent"`
	CatalogEstimate string             `json:"catalog_estimate"`
	CatalogLimit    string             `json:"catalog_limit"`
	LogErrors       int                `json:"log_errors"`
	LogWarnings     int                `json:"log_warnings"`
	Issues          []metrics.LogIssue `json:"issues,omitempty"`
}
```
Note: LogIssue is defined in the metrics package (from logscanner.go). The report package already imports metrics.
File: controller/internal/report/telemetry.go (NEW)
Function: `buildAppTelemetry(allStacks []stacks.Stack, telemetry []metrics.ContainerTelemetry, logs []metrics.ContainerLogSummary) []AppTelemetry`
Private function (lowercase b), called from `buildAppTelemetrySection`, which `BuildReport` calls. Logic:
- Build lookup maps: `containerName → ContainerTelemetry` and `containerName → ContainerLogSummary`.
- Iterate `allStacks`. Skip stacks where `s.Protected || !s.Deployed`.
- For each stack:
  - Collect container names from `s.Containers`.
  - Sum `MemoryCurrentMB`, `MemoryAvgMB`, `MemoryPeakMB`, `CPUAvgPercent` across all containers in the stack. The summed peak is an upper bound for the stack, since its containers need not peak at the same moment.
  - Sum `ErrorCount`, `WarnCount` across all containers.
  - Merge `RecentIssues` from all containers, sort by count DESC, cap at 10.
  - Get `CatalogEstimate` from `s.Meta.Resources.MemRequest` and `CatalogLimit` from `s.Meta.Resources.MemLimit`.
  - Get `DisplayName` from `s.Meta.DisplayName`.
- Return the slice sorted by `AppName`.
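A sketch of `buildAppTelemetry` following the steps above. To stay self-contained it uses hypothetical stand-ins for `stacks.Stack`, `metrics.ContainerTelemetry`, `metrics.ContainerLogSummary` and `report.AppTelemetry`, reduced to the fields the plan mentions. Taking `AppName` from a `Stack.Name` field is an assumption: the plan does not say where the app name comes from.

```go
package main

import "sort"

// Hypothetical stand-ins, reduced to the fields the plan mentions.
type Container struct{ Name string }

type Resources struct{ MemRequest, MemLimit string }

type Meta struct {
	DisplayName string
	Resources   Resources
}

type Stack struct {
	Name       string // assumed source of AppName
	Protected  bool
	Deployed   bool
	Containers []Container
	Meta       Meta
}

type ContainerTelemetry struct {
	ContainerName                                             string
	MemoryCurrentMB, MemoryAvgMB, MemoryPeakMB, CPUAvgPercent float64
}

type LogIssue struct {
	Severity, Message string
	Count             int
}

type ContainerLogSummary struct {
	ContainerName         string
	ErrorCount, WarnCount int
	RecentIssues          []LogIssue
}

type AppTelemetry struct {
	AppName, DisplayName                                      string
	Containers                                                []string
	MemoryCurrentMB, MemoryAvgMB, MemoryPeakMB, CPUAvgPercent float64
	CatalogEstimate, CatalogLimit                             string
	LogErrors, LogWarnings                                    int
	Issues                                                    []LogIssue
}

func buildAppTelemetry(allStacks []Stack, telemetry []ContainerTelemetry, logs []ContainerLogSummary) []AppTelemetry {
	tel := make(map[string]ContainerTelemetry, len(telemetry))
	for _, t := range telemetry {
		tel[t.ContainerName] = t
	}
	lg := make(map[string]ContainerLogSummary, len(logs))
	for _, l := range logs {
		lg[l.ContainerName] = l
	}

	out := []AppTelemetry{}
	for _, s := range allStacks {
		if s.Protected || !s.Deployed {
			continue // infrastructure or not running
		}
		app := AppTelemetry{
			AppName:         s.Name,
			DisplayName:     s.Meta.DisplayName,
			CatalogEstimate: s.Meta.Resources.MemRequest,
			CatalogLimit:    s.Meta.Resources.MemLimit,
			Containers:      []string{},
		}
		for _, c := range s.Containers {
			app.Containers = append(app.Containers, c.Name)
			if t, ok := tel[c.Name]; ok {
				app.MemoryCurrentMB += t.MemoryCurrentMB
				app.MemoryAvgMB += t.MemoryAvgMB
				app.MemoryPeakMB += t.MemoryPeakMB // upper bound: peaks may not coincide
				app.CPUAvgPercent += t.CPUAvgPercent
			}
			if l, ok := lg[c.Name]; ok {
				app.LogErrors += l.ErrorCount
				app.LogWarnings += l.WarnCount
				app.Issues = append(app.Issues, l.RecentIssues...)
			}
		}
		sort.SliceStable(app.Issues, func(i, j int) bool { return app.Issues[i].Count > app.Issues[j].Count })
		if len(app.Issues) > 10 {
			app.Issues = app.Issues[:10]
		}
		out = append(out, app)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].AppName < out[j].AppName })
	return out
}
```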
File: controller/internal/report/builder.go (MODIFY)
In BuildReport(), add AFTER the stacks section (before the final debug log), approximately at line 151:
```go
// App telemetry (metrics + log scan)
r.AppTelemetry = buildAppTelemetrySection(stackMgr, metricsStore, logger)
```
Create the helper `buildAppTelemetrySection` (in `report/telemetry.go`, next to `buildAppTelemetry`):
```go
func buildAppTelemetrySection(stackMgr *stacks.Manager, metricsStore *metrics.MetricsStore, logger *log.Logger) []AppTelemetry {
	allStacks := stackMgr.GetStacks()

	// 1. Get metrics telemetry (last 15 minutes).
	var telemetry []metrics.ContainerTelemetry
	if metricsStore != nil {
		var err error
		telemetry, err = metricsStore.GetContainerTelemetry(time.Now().Add(-15 * time.Minute))
		if err != nil && logger != nil {
			logger.Printf("[WARN] Telemetry metrics query failed: %v", err)
		}
	}

	// 2. Collect container names of deployed, non-protected stacks for the log scan.
	var containerNames []string
	for _, s := range allStacks {
		if s.Protected || !s.Deployed {
			continue
		}
		for _, c := range s.Containers {
			containerNames = append(containerNames, c.Name)
		}
	}

	// 3. Scan logs over the same 15-minute window.
	logs := metrics.ScanContainerLogs(containerNames, 15*time.Minute, logger)

	// 4. Build per-app telemetry.
	return buildAppTelemetry(allStacks, telemetry, logs)
}
```
Key: uses `s.Protected` (derived from the config's protected list) to skip infrastructure containers; no hardcoded names.
File: controller/cmd/controller/main.go (NO CHANGE)
No change is needed. `metricsStore` is already passed to `BuildReport()`, and the helper does not need `cfg`: instead of calling `cfg.IsProtectedStack()`, it relies on `s.Protected`, which the stack manager already sets on each stack during `ScanStacks()`. That is why `buildAppTelemetrySection` takes no `cfg` parameter.
Phase 4: Hub — Store Changes
File: hub/internal/store/store.go (MODIFY)
4a. Add table creation in migrate() function
Add after the existing table creation statements:
```sql
CREATE TABLE IF NOT EXISTS app_telemetry (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id       TEXT NOT NULL,
    app_name          TEXT NOT NULL,
    display_name      TEXT NOT NULL DEFAULT '',
    reported_at       DATETIME NOT NULL,
    memory_current_mb REAL DEFAULT 0,
    memory_avg_mb     REAL DEFAULT 0,
    memory_peak_mb    REAL DEFAULT 0,
    cpu_avg_percent   REAL DEFAULT 0,
    catalog_estimate  TEXT DEFAULT '',
    catalog_limit     TEXT DEFAULT '',
    log_errors        INTEGER DEFAULT 0,
    log_warnings      INTEGER DEFAULT 0,
    containers_json   TEXT DEFAULT '[]',
    issues_json       TEXT DEFAULT '[]'
);
CREATE INDEX IF NOT EXISTS idx_app_telemetry_lookup
    ON app_telemetry(app_name, reported_at);
CREATE INDEX IF NOT EXISTS idx_app_telemetry_customer
    ON app_telemetry(customer_id, app_name, reported_at);
CREATE INDEX IF NOT EXISTS idx_app_telemetry_prune
    ON app_telemetry(reported_at);

CREATE TABLE IF NOT EXISTS app_log_issues (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    app_name           TEXT NOT NULL,
    fingerprint        TEXT NOT NULL,
    severity           TEXT NOT NULL,
    message            TEXT NOT NULL,
    first_seen         DATETIME NOT NULL,
    last_seen          DATETIME NOT NULL,
    occurrence_count   INTEGER DEFAULT 1,
    affected_customers TEXT DEFAULT '[]',
    UNIQUE(app_name, fingerprint)
);
CREATE INDEX IF NOT EXISTS idx_app_log_issues_app
    ON app_log_issues(app_name, last_seen DESC);
```
4b. Add types (top of file or separate types section)
```go
type AppTelemetryRecord struct {
	AppName         string   `json:"app_name"`
	DisplayName     string   `json:"display_name"`
	Containers      []string `json:"containers"`
	MemoryCurrentMB float64  `json:"memory_current_mb"`
	MemoryAvgMB     float64  `json:"memory_avg_mb"`
	MemoryPeakMB    float64  `json:"memory_peak_mb"`
	CPUAvgPercent   float64  `json:"cpu_avg_percent"`
	CatalogEstimate string   `json:"catalog_estimate"`
	CatalogLimit    string   `json:"catalog_limit"`
	LogErrors       int      `json:"log_errors"`
	LogWarnings     int      `json:"log_warnings"`
	// Issues mirrors the JSON shape of the controller's metrics.LogIssue.
	Issues []struct {
		Severity string    `json:"severity"`
		Message  string    `json:"message"`
		Count    int       `json:"count"`
		LastSeen time.Time `json:"last_seen"`
	} `json:"issues,omitempty"`
}

type FleetAppSummary struct {
	AppName         string
	DisplayName     string
	DeploymentCount int
	AvgMemoryMB     float64
	PeakMemoryMB    float64 // max across fleet
	P95MemoryMB     float64 // filled by the separate P95 query
	AvgCPU          float64
	TotalErrors     int
	TotalWarnings   int
	CatalogEstimate string
	CatalogLimit    string
}

type AppTelemetryPoint struct {
	ReportedAt    time.Time
	CustomerID    string
	MemoryAvgMB   float64
	MemoryPeakMB  float64
	CPUAvgPercent float64
	LogErrors     int
	LogWarnings   int
}

type AppCustomerStats struct {
	CustomerID   string
	AvgMemoryMB  float64
	PeakMemoryMB float64
	AvgCPU       float64
	TotalErrors  int
	LastReport   time.Time
}

type CustomerAppSummary struct {
	AppName         string
	DisplayName     string
	MemoryCurrentMB float64
	MemoryAvgMB     float64
	MemoryPeakMB    float64
	CatalogLimit    string
	LogErrors       int
	LogWarnings     int
}

type AppIssue struct {
	ID                int
	AppName           string
	Fingerprint       string
	Severity          string
	Message           string
	FirstSeen         time.Time
	LastSeen          time.Time
	OccurrenceCount   int
	AffectedCustomers []string
}
```
4c. Add store methods
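`SaveAppTelemetry(customerID string, reportedAt time.Time, records []AppTelemetryRecord) error`
- Insert each record into the `app_telemetry` table.
- Serialize `Containers` to JSON for `containers_json`.
- Serialize `Issues` to JSON for `issues_json`.
- Use a transaction for the batch insert.
- For each record with non-empty issues, call `upsertAppIssue` for each issue, passing the same transaction. Using `s.db` there instead would open a second SQLite connection that blocks on the write lock held by the open transaction.
- The controller's `LogIssue` carries no fingerprint, so the hub has to derive one from `Message` (e.g. by re-applying the Phase 2 normalization) unless a fingerprint field is added to the payload.

`upsertAppIssue(tx *sql.Tx, appName, fingerprint, severity, message, customerID string, lastSeen time.Time) error`
- Private helper; runs inside the caller's transaction.
- Use `INSERT INTO app_log_issues ... ON CONFLICT(app_name, fingerprint) DO UPDATE SET last_seen=MAX(last_seen, excluded.last_seen), occurrence_count=occurrence_count+1, ...` (SQLite upsert syntax, available since SQLite 3.24). With `+1`, `occurrence_count` counts the reports in which the issue appeared, not raw log occurrences.
- For `affected_customers`: read the existing row within the transaction, parse its JSON array, add `customerID` if not present, re-serialize.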
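One way to handle the read-modify-write of `affected_customers`, sketched under assumptions: `addAffectedCustomer` and `upsertIssueSQL` are hypothetical names, the caller first `SELECT`s the current `affected_customers` value inside the same transaction and passes the merged array as the last parameter, and overwriting `message` with the latest text is a choice the plan leaves open.

```go
package main

import "encoding/json"

// upsertIssueSQL sketches the SQLite upsert (UPSERT needs SQLite 3.24+).
// Parameters: app_name, fingerprint, severity, message, first_seen,
// last_seen, affected_customers (a JSON array already merged in Go).
// Unqualified columns in DO UPDATE SET refer to the existing row.
const upsertIssueSQL = `
INSERT INTO app_log_issues
    (app_name, fingerprint, severity, message, first_seen, last_seen, occurrence_count, affected_customers)
VALUES (?, ?, ?, ?, ?, ?, 1, ?)
ON CONFLICT(app_name, fingerprint) DO UPDATE SET
    last_seen          = MAX(last_seen, excluded.last_seen),
    message            = excluded.message,
    occurrence_count   = occurrence_count + 1,
    affected_customers = excluded.affected_customers`

// addAffectedCustomer merges customerID into a stored JSON array of IDs.
// An empty column value is treated as an empty array.
func addAffectedCustomer(existing, customerID string) (string, error) {
	var ids []string
	if existing != "" {
		if err := json.Unmarshal([]byte(existing), &ids); err != nil {
			return "", err
		}
	}
	for _, id := range ids {
		if id == customerID {
			return existing, nil // already recorded
		}
	}
	b, err := json.Marshal(append(ids, customerID))
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```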
GetFleetAppSummary(since time.Time) ([]FleetAppSummary, error)
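```sql
SELECT app_name,
       MAX(display_name)               AS display_name,
       COUNT(DISTINCT customer_id)     AS deployment_count,
       AVG(memory_avg_mb)              AS avg_memory_mb,
       MAX(memory_peak_mb)             AS peak_memory_mb,
       AVG(cpu_avg_percent)            AS avg_cpu,
       SUM(log_errors)                 AS total_errors,
       SUM(log_warnings)               AS total_warnings,
       MAX(catalog_estimate)           AS catalog_estimate,
       MAX(catalog_limit)              AS catalog_limit
FROM app_telemetry
WHERE reported_at > ?
GROUP BY app_name
ORDER BY deployment_count DESC, avg_memory_mb DESC
```
Note: `SUM(log_errors)` adds up per-report counts. Each report scans the last 15 minutes of logs, so if reports are pushed more often than every 15 minutes the scan windows overlap and the same errors are counted more than once.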
For P95 memory: a separate query per app (can be batched in Go):
```sql
SELECT memory_peak_mb FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
ORDER BY memory_peak_mb ASC
LIMIT 1 OFFSET (
    SELECT CAST(COUNT(*) * 0.95 AS INTEGER)
    FROM app_telemetry WHERE app_name = ? AND reported_at > ?
)
```
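If all peaks are fetched in one query (`SELECT app_name, memory_peak_mb FROM app_telemetry WHERE reported_at > ?`) and grouped per app, the percentile can be computed in Go instead. The sketch below (hypothetical `p95` and `p95ByApp`) uses the nearest-rank definition, the value at 1-based rank ceil(0.95 * n). The SQL `OFFSET CAST(COUNT(*) * 0.95 AS INTEGER)` above picks rank floor(0.95 * n) + 1, which can land one sample higher (for n = 20 it returns the maximum).

```go
package main

import "sort"

// p95 returns the nearest-rank 95th percentile: the value at 1-based rank
// ceil(0.95*n) after sorting. It sorts values in place; empty input yields 0.
func p95(values []float64) float64 {
	n := len(values)
	if n == 0 {
		return 0
	}
	sort.Float64s(values)
	rank := (95*n + 99) / 100 // integer form of ceil(0.95*n), avoids float rounding
	return values[rank-1]
}

// p95ByApp applies p95 to peak samples grouped by app name.
func p95ByApp(peaks map[string][]float64) map[string]float64 {
	out := make(map[string]float64, len(peaks))
	for app, v := range peaks {
		out[app] = p95(v)
	}
	return out
}
```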
GetAppTelemetryHistory(appName string, since time.Time) ([]AppTelemetryPoint, error)
```sql
SELECT reported_at, customer_id, memory_avg_mb, memory_peak_mb, cpu_avg_percent, log_errors, log_warnings
FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
ORDER BY reported_at ASC
```
GetAppCustomerBreakdown(appName string, since time.Time) ([]AppCustomerStats, error)
```sql
SELECT customer_id, AVG(memory_avg_mb), MAX(memory_peak_mb), AVG(cpu_avg_percent),
       SUM(log_errors), MAX(reported_at)
FROM app_telemetry
WHERE app_name = ? AND reported_at > ?
GROUP BY customer_id
ORDER BY AVG(memory_avg_mb) DESC
```
`GetCustomerAppSummary(customerID string, since time.Time) ([]CustomerAppSummary, error)`
Use two simple queries and merge the results in Go; a single self-joined query is possible but much harder to read. First, the 7-day averages/peaks:
```sql
SELECT app_name, MAX(display_name), AVG(memory_avg_mb), MAX(memory_peak_mb),
       MAX(catalog_limit), SUM(log_errors), SUM(log_warnings)
FROM app_telemetry
WHERE customer_id = ? AND reported_at > ?
GROUP BY app_name
ORDER BY AVG(memory_avg_mb) DESC
```
Then the current memory from the latest record per app. The outer table needs an alias: an unqualified `app_telemetry.app_name` inside the subquery refers to the subquery's own table, so the correlation would be a no-op.
```sql
SELECT t.app_name, t.memory_current_mb
FROM app_telemetry t
WHERE t.customer_id = ?
  AND t.reported_at = (
      SELECT MAX(reported_at) FROM app_telemetry
      WHERE customer_id = t.customer_id AND app_name = t.app_name
  )
```
The `idx_app_telemetry_customer (customer_id, app_name, reported_at)` index serves the correlated subquery.
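A sketch of the two-query approach with `database/sql`, assuming the store holds a `*sql.DB` and using the `CustomerAppSummary` type from 4b. `getCustomerAppSummary` and `mergeCurrentMemory` are illustrative names; how `since` is bound to a DATETIME column depends on the SQLite driver in use. The current-memory query aliases the outer table so the subquery is correlated per app.

```go
package main

import (
	"database/sql"
	"time"
)

type CustomerAppSummary struct {
	AppName         string
	DisplayName     string
	MemoryCurrentMB float64
	MemoryAvgMB     float64
	MemoryPeakMB    float64
	CatalogLimit    string
	LogErrors       int
	LogWarnings     int
}

const summarySQL = `
SELECT app_name, MAX(display_name), AVG(memory_avg_mb), MAX(memory_peak_mb),
       MAX(catalog_limit), SUM(log_errors), SUM(log_warnings)
FROM app_telemetry WHERE customer_id = ? AND reported_at > ?
GROUP BY app_name ORDER BY AVG(memory_avg_mb) DESC`

const currentSQL = `
SELECT t.app_name, t.memory_current_mb FROM app_telemetry t
WHERE t.customer_id = ? AND t.reported_at = (
    SELECT MAX(reported_at) FROM app_telemetry
    WHERE customer_id = t.customer_id AND app_name = t.app_name)`

// getCustomerAppSummary runs both queries and merges them in Go.
func getCustomerAppSummary(db *sql.DB, customerID string, since time.Time) ([]CustomerAppSummary, error) {
	rows, err := db.Query(summarySQL, customerID, since)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	out := []CustomerAppSummary{}
	for rows.Next() {
		var a CustomerAppSummary
		if err := rows.Scan(&a.AppName, &a.DisplayName, &a.MemoryAvgMB, &a.MemoryPeakMB,
			&a.CatalogLimit, &a.LogErrors, &a.LogWarnings); err != nil {
			return nil, err
		}
		out = append(out, a)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	cur, err := db.Query(currentSQL, customerID)
	if err != nil {
		return nil, err
	}
	defer cur.Close()
	current := map[string]float64{}
	for cur.Next() {
		var app string
		var mb float64
		if err := cur.Scan(&app, &mb); err != nil {
			return nil, err
		}
		current[app] = mb
	}
	if err := cur.Err(); err != nil {
		return nil, err
	}
	mergeCurrentMemory(out, current)
	return out, nil
}

// mergeCurrentMemory fills MemoryCurrentMB from app_name → memory_current_mb;
// apps missing from the map keep 0.
func mergeCurrentMemory(summaries []CustomerAppSummary, current map[string]float64) {
	for i := range summaries {
		summaries[i].MemoryCurrentMB = current[summaries[i].AppName]
	}
}
```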
GetAppIssues(appName string, limit int) ([]AppIssue, error)
```sql
SELECT id, app_name, fingerprint, severity, message, first_seen, last_seen,
       occurrence_count, affected_customers
FROM app_log_issues WHERE app_name = ?
ORDER BY last_seen DESC LIMIT ?
```
Parse affected_customers from JSON string to []string in Go.
GetRecentIssuesAllApps(limit int) ([]AppIssue, error)
```sql
SELECT id, app_name, fingerprint, severity, message, first_seen, last_seen,
       occurrence_count, affected_customers
FROM app_log_issues ORDER BY last_seen DESC LIMIT ?
```
PruneAppTelemetry(before time.Time) (int64, error)
```sql
DELETE FROM app_telemetry WHERE reported_at < ?
```
PruneStaleIssues(notSeenSince time.Time) (int64, error)
```sql
DELETE FROM app_log_issues WHERE last_seen < ?
```
Phase 5: Hub — API Handler Changes
File: hub/internal/api/handler.go (MODIFY)
In the report handler (POST /api/v1/report), AFTER store.SaveReport(customerID, body):
```go
// Parse and save app telemetry (backward-compatible: old controllers won't send this field).
var telemetryPayload struct {
	AppTelemetry []store.AppTelemetryRecord `json:"app_telemetry"`
}
if err := json.Unmarshal(body, &telemetryPayload); err == nil && len(telemetryPayload.AppTelemetry) > 0 {
	if err := h.store.SaveAppTelemetry(customerID, time.Now(), telemetryPayload.AppTelemetry); err != nil {
		h.logger.Printf("[WARN] Failed to save app telemetry for %s: %v", customerID, err)
	}
}
```
This is non-breaking: if app_telemetry is absent or null, the slice will be empty and nothing is stored.
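The backward-compatibility claim can be checked in isolation. This sketch trims `AppTelemetryRecord` to two fields and wraps the handler's second `Unmarshal` in a hypothetical `parseTelemetry`: an absent field and an explicit `null` both leave the slice empty, and `encoding/json` ignores the report fields the struct does not declare.

```go
package main

import "encoding/json"

// AppTelemetryRecord is trimmed to two fields for this demonstration.
type AppTelemetryRecord struct {
	AppName   string `json:"app_name"`
	LogErrors int    `json:"log_errors"`
}

// parseTelemetry mirrors the handler: it extracts only app_telemetry and
// returns no records when the field is absent or null.
func parseTelemetry(body []byte) ([]AppTelemetryRecord, error) {
	var p struct {
		AppTelemetry []AppTelemetryRecord `json:"app_telemetry"`
	}
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	return p.AppTelemetry, nil
}
```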
Phase 6: Hub — Prune Updates
File: hub/cmd/hub/main.go (MODIFY)
In the prune goroutine (find where store.Prune(maxDays) is called), add after it:
```go
if n, err := st.PruneAppTelemetry(time.Now().Add(-90 * 24 * time.Hour)); err != nil {
	logger.Printf("[ERROR] Prune app telemetry: %v", err)
} else if n > 0 {
	logger.Printf("[INFO] Pruned %d old app telemetry rows", n)
}
if n, err := st.PruneStaleIssues(time.Now().Add(-30 * 24 * time.Hour)); err != nil {
	logger.Printf("[ERROR] Prune stale issues: %v", err)
} else if n > 0 {
	logger.Printf("[INFO] Pruned %d stale app issues", n)
}
```
Phase 7: Hub — Web Dashboard Pages
7a. Add Chart.js to hub
Copy controller/internal/web/static/chart.min.js to hub/internal/web/static/chart.min.js.
The hub already has `hub/internal/web/embed.go` with `//go:embed templates/*` and `var templateFS embed.FS`. Add the chart.min.js embed directive there, alongside the existing one:
```go
package web

import "embed"

//go:embed templates/*
var templateFS embed.FS

//go:embed static/chart.min.js
var chartJS []byte
```
Add route in server.go ServeHTTP:
```go
case path == "/static/chart.min.js":
	w.Header().Set("Content-Type", "application/javascript")
	w.Header().Set("Cache-Control", "public, max-age=86400")
	w.Write(chartJS)
```
7b. Add routes in server.go ServeHTTP
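Add these cases before any broader prefix or catch-all case in the `switch` (for example, ahead of the existing `strings.HasPrefix(path, "/customers/")` block). Cases are evaluated in order, so the exact `/apps` / `/apps/` match must also precede `strings.HasPrefix(path, "/apps/")`; otherwise `/apps/` would reach the detail handler with an empty app name.
```go
case path == "/apps" || path == "/apps/":
	s.handleApps(w, r)
case strings.HasPrefix(path, "/apps/"):
	appName := strings.TrimPrefix(path, "/apps/")
	if appName == "" || strings.Contains(appName, "/") {
		http.NotFound(w, r) // reject nested paths such as /apps/x/y
		return
	}
	s.handleAppDetail(w, r, appName)
```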
7c. Navigation update
In hub/internal/web/templates/dashboard.html (or wherever the header nav is defined), add between "Dashboard" and "Customers" links:
```html
<a href="/apps" class="nav-link">Alkalmazások</a>
```
If the nav is in a shared layout/header included in all templates, update it there. If each template has its own nav block, update all of them.
7d. Handler: handleApps
File: hub/internal/web/server.go (or a new hub/internal/web/apps.go for cleaner organization)
```go
func (s *Server) handleApps(w http.ResponseWriter, r *http.Request) {
	// Time range from query param: ?period=24h|7d|30d (default 7d).
	period := r.URL.Query().Get("period")
	if period == "" {
		period = "7d" // lets the template highlight the default button
	}
	since := parsePeriod(period, 7*24*time.Hour) // returns time.Now().Add(-duration)

	// Sort params: ?sort=memory|deployments|errors&order=asc|desc
	sortBy := r.URL.Query().Get("sort")
	order := r.URL.Query().Get("order")

	summary, err := s.store.GetFleetAppSummary(since)
	if err != nil {
		http.Error(w, "failed to load app summary", http.StatusInternalServerError)
		return
	}
	// Sort in Go based on query params (default: by deployment count DESC).
	sortFleetSummary(summary, sortBy, order)

	// Summary cards. The loop variable is not named s, so it does not shadow the receiver.
	totalDeployments, appsWithErrors := 0, 0
	for _, app := range summary {
		totalDeployments += app.DeploymentCount
		if app.TotalErrors > 0 {
			appsWithErrors++
		}
	}

	data := map[string]interface{}{
		"Apps": summary, "Period": period,
		"TotalApps": len(summary), "TotalDeployments": totalDeployments,
		"AppsWithErrors": appsWithErrors,
		"Sort": sortBy, "Order": order,
		"CSRFToken": csrfToken, // obtained the same way as in the existing page handlers
	}
	s.templates.ExecuteTemplate(w, "apps.html", data)
}
```
Add helper parsePeriod(s string, defaultDur time.Duration) time.Time:
```go
func parsePeriod(s string, defaultDur time.Duration) time.Time {
	switch s {
	case "24h":
		return time.Now().Add(-24 * time.Hour)
	case "7d":
		return time.Now().Add(-7 * 24 * time.Hour)
	case "30d":
		return time.Now().Add(-30 * 24 * time.Hour)
	default:
		return time.Now().Add(-defaultDur)
	}
}
```
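`sortFleetSummary` is called by `handleApps` but not defined in the plan. A minimal sketch, assuming the trimmed `FleetAppSummary` below: only the three documented sort keys are honored (anything else falls back to deployments), the default order is descending, and ties break on `AppName` so the page order is stable. All three behaviors are assumptions; whitelisting the keys also means arbitrary query strings cannot reach the comparison.

```go
package main

import "sort"

// FleetAppSummary is trimmed to the fields used for sorting.
type FleetAppSummary struct {
	AppName         string
	DeploymentCount int
	AvgMemoryMB     float64
	TotalErrors     int
}

// sortFleetSummary orders rows by ?sort=memory|deployments|errors and
// ?order=asc|desc (default: deployments, descending).
func sortFleetSummary(rows []FleetAppSummary, sortBy, order string) {
	key := func(a FleetAppSummary) float64 {
		switch sortBy {
		case "memory":
			return a.AvgMemoryMB
		case "errors":
			return float64(a.TotalErrors)
		default: // "deployments" or anything unrecognized
			return float64(a.DeploymentCount)
		}
	}
	asc := order == "asc"
	sort.SliceStable(rows, func(i, j int) bool {
		ki, kj := key(rows[i]), key(rows[j])
		if ki == kj {
			return rows[i].AppName < rows[j].AppName
		}
		if asc {
			return ki < kj
		}
		return ki > kj
	})
}
```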
7e. Handler: handleAppDetail
```go
func (s *Server) handleAppDetail(w http.ResponseWriter, r *http.Request, appName string) {
	period := r.URL.Query().Get("period")
	since := parsePeriod(period, 7*24*time.Hour)

	// Errors are ignored here: the page renders whatever data could be loaded.
	customers, _ := s.store.GetAppCustomerBreakdown(appName, since) // customer breakdown
	history, _ := s.store.GetAppTelemetryHistory(appName, since)    // chart history
	issues, _ := s.store.GetAppIssues(appName, 20)

	// Fleet summary for this single app (overview card).
	fleetAll, _ := s.store.GetFleetAppSummary(since)
	var appSummary *store.FleetAppSummary
	for i := range fleetAll {
		if fleetAll[i].AppName == appName {
			appSummary = &fleetAll[i]
			break
		}
	}

	// Suggested mem_limit: P95 * 1.2, rounded UP to the next multiple of 32 MB.
	// Ceil before converting: int(raw) truncates and could round down (needs "math").
	var suggestedLimit int
	if appSummary != nil && appSummary.P95MemoryMB > 0 {
		raw := appSummary.P95MemoryMB * 1.2
		suggestedLimit = int(math.Ceil(raw/32)) * 32
	}

	// Chart data for the fleet-wide view: group history points into hourly
	// buckets, averaging the averages and taking the max of the peaks.
	chartData := aggregateHistoryForChart(history)

	data := map[string]interface{}{
		"AppName": appName, "Summary": appSummary,
		"Customers": customers, "Issues": issues,
		"ChartData": chartData, "SuggestedLimit": suggestedLimit,
		"Period": period, "CSRFToken": csrfToken,
	}
	s.templates.ExecuteTemplate(w, "app_detail.html", data)
}
```
aggregateHistoryForChart groups data points into hourly buckets, returns {Labels []string, AvgMemory []float64, PeakMemory []float64, CatalogLimit []float64} for Chart.js.
7f. Extend handleCustomerUnified
In hub/internal/web/configs.go where handleCustomerUnified builds its template data, add:
```go
// App telemetry section
appTelemetry, _ := s.store.GetCustomerAppSummary(customerID, time.Now().Add(-7*24*time.Hour))
// Only show the section if data exists.
data["AppTelemetry"] = appTelemetry
data["HasAppTelemetry"] = len(appTelemetry) > 0
```
7g. Template: hub/internal/web/templates/apps.html (NEW)
Fleet-wide app list page. Follow existing hub template patterns:
- Same dark theme, same table styles
- Header nav with "Alkalmazások" active
- Summary cards row at top (3 cards: Total apps, Total deployments, Apps with errors)
- Time range selector buttons (24h / 7d / 30d), linking to `?period=...`
- Main table with columns: App name (link to `/apps/{name}`), Deployments, Avg Memory, P95 Memory, Catalog Estimate, Catalog Limit, Estimate Accuracy (icon), Errors and Warnings (badges; totals over the selected period, since `GetFleetAppSummary` sums over `since`)
- Sortable column headers (links to `?sort=...&order=...`)
- Estimate accuracy dot, checked in this order: red if P95 > limit, else yellow if P95 > 50% of limit, else green
- All text in Hungarian
7h. Template: hub/internal/web/templates/app_detail.html (NEW)
Per-app detail page with:
- Overview card: App name, display name, catalog estimates, deployment count, suggested mem_limit
- Memory trend chart:
<canvas id="memoryChart">, Chart.js line chart with:- Lines: Avg Memory (blue), Peak Memory (red)
- Dashed horizontal line: Catalog mem_limit (green)
- Data from
chartData(injected via{{ json .ChartData }})
- Customer breakdown table: Customer link, Avg Memory, Peak Memory, CPU, Errors, Last Report
- Common issues table: Severity badge, Message, Occurrences, Affected Customers, First/Last Seen
Include Chart.js: <script src="/static/chart.min.js"></script>
7i. Template: hub/internal/web/templates/customer_unified.html (MODIFY)
Add new section "Alkalmazás telemetria" (conditionally shown):
```html
{{ if .HasAppTelemetry }}
<div class="section">
  <h2>Alkalmazás telemetria</h2>
  {{/* Errors/warnings are summed over the same 7-day window as the query. */}}
  <table class="data-table">
    <thead>
      <tr>
        <th>Alkalmazás</th>
        <th>Memória (jelenlegi)</th>
        <th>Memória (átlag 7d)</th>
        <th>Memória (csúcs 7d)</th>
        <th>Katalógus limit</th>
        <th>Hibák (7d)</th>
        <th>Figyelmeztetések (7d)</th>
      </tr>
    </thead>
    <tbody>
      {{ range .AppTelemetry }}
      <tr>
        <td><a href="/apps/{{ .AppName }}">{{ if .DisplayName }}{{ .DisplayName }}{{ else }}{{ .AppName }}{{ end }}</a></td>
        <td>{{ formatFloat .MemoryCurrentMB }} MB</td>
        <td>{{ formatFloat .MemoryAvgMB }} MB</td>
        <td>{{ formatFloat .MemoryPeakMB }} MB</td>
        <td>{{ .CatalogLimit }}</td>
        <td>{{ if gt .LogErrors 0 }}<span class="badge badge-error">{{ .LogErrors }}</span>{{ else }}0{{ end }}</td>
        <td>{{ if gt .LogWarnings 0 }}<span class="badge badge-warn">{{ .LogWarnings }}</span>{{ else }}0{{ end }}</td>
      </tr>
      {{ end }}
    </tbody>
  </table>
</div>
{{ end }}
```
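Memory cell coloring: use a CSS class based on the current memory as a percentage of `catalog_limit`. This needs a template function that parses the limit string and compares: add `memoryColor(currentMB float64, limitStr string) string`, returning one of the `.mem-ok` / `.mem-warn` / `.mem-danger` classes from 7j (or an empty string if the limit cannot be parsed), used as `<td class="{{ memoryColor .MemoryCurrentMB .CatalogLimit }}">`.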
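A possible `memoryColor`, assuming catalog limits are Docker-style size strings (`512M`, `1g`, `256MiB`, or plain bytes) interpreted with binary units. `parseLimitMB` is a hypothetical helper, and the 70% / 90% thresholds are placeholders, since the plan does not specify them.

```go
package main

import (
	"regexp"
	"strconv"
	"strings"
)

// limitRe matches a number plus an optional k/m/g/t unit, optional "i" and "b".
var limitRe = regexp.MustCompile(`^(\d+(?:\.\d+)?)\s*([kmgt]?)i?b?$`)

// parseLimitMB converts a size string to MB; ok is false if it cannot be parsed.
func parseLimitMB(s string) (float64, bool) {
	m := limitRe.FindStringSubmatch(strings.ToLower(strings.TrimSpace(s)))
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	switch m[2] {
	case "": // plain bytes
		return v / (1024 * 1024), true
	case "k":
		return v / 1024, true
	case "m":
		return v, true
	case "g":
		return v * 1024, true
	default: // "t"
		return v * 1024 * 1024, true
	}
}

// memoryColor maps current usage relative to the limit to a CSS class.
// Thresholds (70% warn, 90% danger) are illustrative assumptions.
func memoryColor(currentMB float64, limitStr string) string {
	limit, ok := parseLimitMB(limitStr)
	if !ok || limit <= 0 {
		return ""
	}
	switch ratio := currentMB / limit; {
	case ratio >= 0.9:
		return "mem-danger"
	case ratio >= 0.7:
		return "mem-warn"
	default:
		return "mem-ok"
	}
}
```

The function would be registered in the `funcMap` built in the `server.go` `New()` constructor.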
7j. CSS additions in hub/internal/web/templates/style.css (MODIFY)
Add styles for:
- `.badge`, `.badge-error` (red), `.badge-warn` (yellow)
- `.summary-cards` (flex row of 3 cards)
- `.summary-card` (dark card with number + label)
- `.chart-container` (responsive canvas wrapper)
- `.period-selector` (button group for time ranges)
- `.accuracy-dot` (small colored circle for estimate accuracy)
- Memory cell colors: `.mem-ok` (green text), `.mem-warn` (yellow), `.mem-danger` (red)

Follow the existing hub dark-theme color variables.
Phase 8: Version Bumps & Changelog
Controller
- Update the version constant to `v0.28.0` in the appropriate file (likely `cmd/controller/main.go` or a `version.go` file).
- Add a CHANGELOG.md entry.

Hub
- Update the version constant to `v0.4.0`.
- Add a CHANGELOG.md entry (if the hub has one).
Implementation Checklist
Controller (can be deployed and tested independently)
- `internal/metrics/telemetry.go`: `GetContainerTelemetry` method
- `internal/metrics/logscanner.go`: `ScanContainerLogs` + types
- `internal/report/types.go`: add `AppTelemetry` field + type definitions
- `internal/report/telemetry.go`: `buildAppTelemetry` + `buildAppTelemetrySection`
- `internal/report/builder.go`: call `buildAppTelemetrySection` in `BuildReport`
- Test: build, deploy to demo, verify `app_telemetry` in the report JSON

Hub (deploy after the controller is verified)
- `internal/store/store.go`: tables + types + all store methods
- `internal/api/handler.go`: parse & save telemetry from reports
- `cmd/hub/main.go`: add prune calls
- `internal/web/static/chart.min.js`: copy from the controller
- `internal/web/embed.go`: add the chart.min.js embed
- `internal/web/server.go`: routes + handlers + chart.js serving + `parsePeriod` helper
- `internal/web/templates/apps.html`: fleet app list page
- `internal/web/templates/app_detail.html`: app detail page with chart
- `internal/web/templates/customer_unified.html`: add telemetry section
- `internal/web/templates/style.css`: new styles
- Navigation update in all templates (add "Alkalmazások" link)
- Test: deploy hub, verify pages after a few report cycles
Notes for Implementation
- Run `go build ./...` after each phase to catch compile errors early.
- The report package already imports `metrics` and `stacks`, so there are no new import cycles.
- Hub templates use `template.Must(template.New("").Funcs(funcMap).ParseFS(templateFS, "templates/*.html"))`, so new .html files are picked up automatically.
- The hub already has `embed.go` at `hub/internal/web/embed.go`: add the chart.min.js embed directive there and create the `hub/internal/web/static/` directory.
- All hub template functions are defined in the `server.go` `New()` constructor; add any new functions (like `memoryColor`) there.
- CSRF tokens: all POST forms need `<input type="hidden" name="_csrf" value="{{ .CSRFToken }}">`, but the apps pages are read-only (GET only), so they need no CSRF handling of their own.
- Hub's `ServeHTTP` routes with `strings.HasPrefix`/`strings.HasSuffix` checks in a `switch` whose cases are evaluated in order: put the exact matches (`/apps`, `/apps/`) BEFORE the `strings.HasPrefix(path, "/apps/")` detail route, and both before any broader prefix or catch-all case.