felhom.eu/documentation/architecture/02-controller-module-map.md
admin 0d832def7b fix: update repo-name refs after deploy-felhom-compose -> felhom-controller rename
- hub/internal/web/templatefetcher.go: raw-template URL now points at the renamed
  repo (was relying on Gitea's post-rename redirect)
- documentation/ (moved here from the felhom-agent repo): fix controller-source path
  refs (deploy-felhom-compose -> felhom-controller) and the platform repo name
  (proxmox-controller -> felhom-agent)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:03:13 +02:00


Felhom Controller Architecture — Part 2: Controller Module Map

Status: audit (keep / port / delete / modify / add), grounded in the v0.33 source.

Subject: the v0.33 controller in felhom-controller/controller/ (110 .go files, ~40 K LOC) audited against 01-topology-and-trust.md and ../proxmox-platform.md.

This is a planning map, not the port. No controller code was changed. Source citations use controller/internal/...:line (a different repo, so links are not clickable). Classifications reflect the target model: the in-guest controller is Docker-only and holds no Proxmox credentials; everything host/disk/Proxmox moves to a new host agent (out of scope here); the controller reaches the agent through a constrained local API.

Classification scheme

KEEP (host-agnostic, ~unchanged) · PORT (survives, needs rework) · DELETE (→agent) (responsibility moves to the host agent) · DELETE (obsolete) (no longer needed) · MODIFY (stays, materially changes) · NEW (no v0.33 equivalent). Risk tags: clean · needs-rework · hazard (entangles a delete-target with a keep/port target).


0. Executive summary

  • The app domain is largely intact and portable: stack lifecycle (stacks/), catalog git-sync (sync/), app-to-app integrations (integrations/), .fab export/import (appexport/), the scheduler, crypto, asset sync, the hub report/notify channels, and most of the web UI KEEP/PORT cleanly.
  • The disk/storage/host half deletes wholesale to the agent: all of storage/, monitor/watchdog.go, the restic/cross-drive/disk-layout/drive-mount parts of backup/, report/infra_backup*+infra_pull, and the host-physical parts of system/.
  • The setup wizard (setup/) is obsolete — the agent provisions the controller.
  • The single biggest hazard is backup/: the keep side (DB dumps, Docker-volume archive, per-app restore — needed by appexport/ and the backup UI) and the delete side (restic, cross-drive, drive-mount) are interleaved inside the same files (backup.go, restore.go, paths.go), not cleanly file-separated. Extracting the app-data-backup subset into a clean retained package is the critical refactor.
  • Intent-vs-reality corrections (vs the task's provisional split): monitor/pinger.go is already dead (legacy Healthchecks.io, "deprecated… now handled by Hub" per main.go) → DELETE(obsolete), not keep. backup.go/restore.go/paths.go do not split on file boundaries — they split within the file. settings/ is not pure app domain — it stores disk/disconnect/decommission state. system/ is genuinely mixed-per-function, not per-file.

1. v0.33 module inventory (package → purpose, key deps)

| Package | Purpose | Key internal deps |
|---|---|---|
| cmd/controller/main.go | Entry point; wires all subsystems; 6 adapters break import cycles; branches into setup mode | imports every package |
| api/ | REST API (router.go) + geo endpoints (geo.go) | stacks, backup, metrics, notify, selfupdate, sync, system, assets, integrations, cloudflare, config, settings |
| appexport/ | .fab app export/import (config+DB+volumes, AES-256-CTR+scrypt) | backup (DB dump), (provider iface → stacks) |
| assets/ | Download/cache app assets from Hub API | none (HTTP only) |
| backup/ | DB dumps, Docker-volume archive, restic, cross-drive rsync, per-app restore, drive mount, disk-layout, infra-backup metadata | config, monitor, settings, system, util |
| cloudflare/ | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries); enforcement → hub (S4) | settings |
| config/ | controller.yaml schema + load | |
| crypto/ | AES-256-GCM for app.yaml secrets | |
| integrations/ | App-to-app (OnlyOffice→FileBrowser/Nextcloud) via docker exec / config patch | stacks, crypto, settings |
| metrics/ | SQLite time-series: system + container metrics, log scan | system |
| monitor/ | App health (healthcheck, pinger) + storage/USB watchdog | config, notify, settings, system |
| notify/ | Hub event push (direct, own API key) | settings |
| recovery/ | Generate recovery-info.txt (DR guide) | |
| report/ | Build+push hub report; infra-backup payload; recovery pull | backup, config, metrics, monitor, scheduler, settings, stacks, system |
| scheduler/ | Cron/interval jobs, Budapest TZ | |
| selftest/ | Startup checks (docker/dirs/catalog/hub/restic repos/mountpoint) | backup, config, settings, system |
| selfupdate/ | Self-update: pull image, edit compose, up -d | config |
| settings/ | settings.json persistent state: storage paths/disconnect/decommission, cross-drive cfg, notif prefs, geo, integration state, DB-validation cache | |
| setup/ | First-run wizard (scan drives, hub-restore, manual config) | backup, config, report, settings, web |
| stacks/ | Docker Compose lifecycle, deploy + memory validation, metadata (.felhom.yml), HDD-data delete | config, crypto, system |
| storage/ | Physical disk scan/format/attach/mount/migrate/fstab/safety | backup, settings, util |
| sync/ | Catalog git-sync (pull templates) | config |
| system/ | Resource info: mem/cpu/load (guest) + temp/disk-model/USB/mount topology (host) | |
| util/ | String helper | |
| web/ | Hungarian dashboard: pages, auth, deploy, backup UI, storage/disk UI, DR restore UI, export UI, debug | appexport, backup, config, crypto, integrations, monitor, notify, scheduler, selfupdate, settings, stacks, storage, system |

2. Classification table (per package/file)

cmd/

| File | Class | Reason | Risk |
|---|---|---|---|
| cmd/controller/main.go | MODIFY | Wiring stays, but drop the setup-mode branch and the storage/watchdog/drive-migrator/restic/cross-drive/infra-backup wiring, and add the agent local-API client. The 6 adapters shrink. | hazard |

api/

| File | Class | Reason | Risk |
|---|---|---|---|
| api/router.go | PORT/MODIFY | Keep stacks/deploy/integrations/metrics/sync/assets/selfupdate routes; remove /api/storage/* (disk); backup routes become agent-coordinated guest-backup requests; config/apply (hub-pushes-yaml) changes since the agent now injects config at provision. | needs-rework |
| api/geo.go | PORT/MODIFY | Keep the customer-facing geo preference endpoints (set/get global + per-app); drop the Cloudflare-sync trigger, since enforcement → hub (S4). The controller reports geo desired-state up instead of calling the CF API. | needs-rework |

appexport/ — KEEP/PORT (Docker-volume + DB level, no disk ops)

| File | Class | Reason | Risk |
|---|---|---|---|
| crypto.go | KEEP | Self-contained AES-256-CTR+HMAC+scrypt for .fab. | clean |
| manifest.go, provider.go | KEEP | Bundle metadata; provider interface (impl in main). | clean |
| export.go | PORT | Docker-volume tar, DB dump via backup.DumpOne, config copy. Depends on the retained app-data-backup subset of backup/; HDD-mount enumeration reworked to per-volume placement. | needs-rework |
| restore.go | PORT | docker volume create/tar xf, DB import, compose up. Same per-volume rework. | needs-rework |
| estimate.go | PORT | du/df on mounts → per-volume sizing. | clean |

assets/

| File | Class | Reason | Risk |
|---|---|---|---|
| syncer.go | KEEP | Hub API download + checksum cache; already a direct hub channel. | clean |

backup/ — THE SPLIT (delete side interleaved with keep side; see §3)

| File | Class | Reason | Risk |
|---|---|---|---|
| dbdump.go | KEEP | Pure docker exec pg_dump/mariadb-dump: app/DB data layer; the retained per-app backup. | clean |
| appdata.go | PORT | App-data discovery (stacks/volumes/DB containers, du). "HDD mount" concept → per-volume. | needs-rework |
| backup.go (1478 L) | MODIFY (split) | Mixes keep (RunDBDumps, DumpAppVolumes(Safe), app restore) with delete→agent (RunBackup/backupDrive/restic snapshot/prune/check on per-drive repos). Must be torn in two. | hazard |
| restore.go (442 L) | MODIFY (split) | RestoreApp restic path → agent; Docker-volume + Tier-2 rsync restore (app layer) → keep. | hazard |
| restore_app_linux.go/_other.go | PORT | Per-app restore: compose pull/up, rsync app data, DB-dump restore. App layer; depends on a backup location that changes. | needs-rework |
| paths.go | MODIFY (split) | AppDBDumpPath/AppVolumeDumpPath keep; Primary/SecondaryResticRepoPath, InfraBackupDir → agent. | needs-rework |
| restic.go | DELETE (→agent) | restic repos on drives = infra backup tier; agent does vzdump/PBS. | hazard |
| crossdrive.go | DELETE (→agent) | Tier-2 cross-drive rsync to secondary storage = storage-tier (agent + storage manifest). | hazard |
| restore_drives_linux.go/_other.go | DELETE (→agent) | lsblk/blkid/mount/fstab: pure host disk. | hazard |
| disk_layout.go | DELETE (→agent) | Disk topology for DR → agent. | clean |
| local_infra.go | DELETE (→agent) | Per-drive infra-backup metadata → agent. | clean |
| restore_scan.go | DELETE (→agent) | Scans drives to build a DR restore plan = agent-tier DR. | needs-rework |

cloudflare/ — DELETE (→hub): CF-API enforcement moves to the hub (S4)

| File | Class | Reason | Risk |
|---|---|---|---|
| client.go, zone.go, waf.go, geosync.go, countries.go | DELETE (→hub) | The hub holds the CF API token and reconciles geo desired-state → WAF (doc 01 §5, doc 03 §2). The controller no longer calls the Cloudflare API; it reports geo desired-state up. The customer-facing geo preference UI/data stays (see api/geo.go). | needs-rework |

config/, crypto/, util/

| File | Class | Reason | Risk |
|---|---|---|---|
| config/config.go | MODIFY | Drop BackupConfig (restic/retention), storage-drive keys, and InfrastructureConfig.cf_api_token (→hub, S4); keep customer/paths/web/git/stacks/monitoring/hub/assets/system; add agent local-API endpoint+token. | needs-rework |
| crypto/crypto.go | KEEP | App.yaml secret encryption. | clean |
| util/strings.go | KEEP | Trivial helper. | clean |

integrations/ — all KEEP (pure app-domain)

| File | Class | Reason | Risk |
|---|---|---|---|
| integrations.go, lifecycle.go, manager.go, onlyoffice_filebrowser.go, onlyoffice_nextcloud.go | KEEP | App-to-app via docker exec / compose-config patch; no host ops. | clean |

metrics/

| File | Class | Reason | Risk |
|---|---|---|---|
| store.go, logscanner.go, telemetry.go, types.go | KEEP | SQLite store, docker logs scan, container telemetry: app-domain. | clean |
| collector.go | PORT | Container metrics (docker stats) keep; host metrics via system.GetInfo (temp, physical disk) become agent-provided or dropped. | needs-rework |
| sysinfo.go/sysinfo_other.go | MODIFY | Reads /host/etc, /proc/cpuinfo, uptime (host static info); some of it is meaningful in-guest, hardware identity comes via the agent. | needs-rework |

monitor/

| File | Class | Reason | Risk |
|---|---|---|---|
| healthcheck.go | PORT (split) | Keep guest health (mem/cpu/docker/protected-containers); host health (temp, physical disk, storage-path mount status) becomes agent-fed. | needs-rework |
| pinger.go | DELETE (obsolete) | Legacy Healthchecks.io; main.go itself marks it "deprecated… now handled by Hub". (Corrects the task's KEEP/PORT guess.) | clean |
| watchdog.go (902 L) | DELETE (→agent) | Storage/USB disconnect monitoring: umount -l, mount -T /host-fstab, UUID probing, restic-lock cleanup. Pure host storage. | hazard |

notify/, recovery/, scheduler/, selftest/

| File | Class | Reason | Risk |
|---|---|---|---|
| notify/notifier.go | KEEP/MODIFY | Direct hub event channel (own API key): keep; prune infra event types that move to the agent (storage_disconnected, crossdrive_*, disaster_recovery_*). | clean |
| recovery/info.go | DELETE (obsolete) | Generates a DR text guide (OS install, docker-setup.sh, hub restore UI); DR is now agent+hub provisioning. | clean |
| scheduler/scheduler.go | KEEP | Generic cron/interval, Budapest TZ. | clean |
| selftest/selftest.go | PORT | Keep docker/dirs/catalog/hub checks; drop restic-repo + system-data mountpoint checks (→agent). | needs-rework |

report/

| File | Class | Reason | Risk |
|---|---|---|---|
| pusher.go | KEEP | Direct hub push (/api/v1/report, Bearer). | clean |
| telemetry.go | KEEP | Per-app telemetry section. | clean |
| builder.go (326 L) | MODIFY | Keep containers/telemetry/stacks/geo/app-health; drop/relocate host system info, physical storage, restic backup status incl. restic password. | hazard |
| types.go | MODIFY | Schema: drop infra fields (restic password, physical storage), keep app-domain. | needs-rework |
| infra_backup.go/_linux.go/_other.go | DELETE (→agent) | Builds infra-backup payload (disk layout, restic/enc passwords) for hub. | hazard |
| infra_pull.go | DELETE (→agent) | Pulls recovery config + infra backup from hub (setup-wizard DR). | needs-rework |

selfupdate/ — controller is agent-managed (doc 03 §11)

| File | Class | Reason | Risk |
|---|---|---|---|
| version.go | KEEP | Semver parse / version string (still used for reporting). | clean |
| state.go | DELETE (obsolete) | Self-update audit state; the agent owns controller updates now (doc 03 §11). | clean |
| updater.go | DELETE (→agent) | Resolved (doc 03 §11): the controller is agent-managed. The agent snapshots → redeploys → health-gates → rolls back the controller. The controller's old self-update path (image pull + compose edit) is removed. | clean |

settings/

| File | Class | Reason | Risk |
|---|---|---|---|
| settings/settings.go (1101 L) | MODIFY (split) | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive intent. The storage-path registry (StoragePath with Disconnected/DisconnectedAt/StoppedStacks/decommission) is disk-management state → reshape to per-volume placement fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. (UUID is not a persisted field; it is runtime-derived from fstab.) | hazard |

setup/ — all DELETE (obsolete); the agent provisions the controller

| File | Class | Reason | Risk |
|---|---|---|---|
| handlers.go, setup.go, csrf.go, network.go | DELETE (obsolete) | First-run wizard (hub-restore, manual config, LAN-IP detection). | needs-rework |
| scanner.go | DELETE (→agent) | Drive scan (lsblk+temp mounts) for backup discovery: a host op; its capability informs the agent. | clean |

stacks/ — core app domain (KEEP/PORT)

| File | Class | Reason | Risk |
|---|---|---|---|
| manager.go (1074 L) | KEEP/PORT | Docker Compose orchestration, scan/state/start/stop/logs: the heart of the controller. Minor port. | clean |
| deploy.go | PORT | Memory validation (system.GetMemoryMB reads guest memory, fine in LXC), secret gen, encrypted app.yaml. Add snapshot-before-deploy → agent hook. | needs-rework |
| healthprobe.go | KEEP | TCP/HTTP app probes. | clean |
| metadata.go | PORT | .felhom.yml parse. Add per-volume hot/bulk classification (doc 01 §8). | needs-rework |
| delete.go | PORT | Stack delete + HDD-data os.RemoveAll on bind mounts → per-volume cleanup. | needs-rework |
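
The per-volume hot/bulk classification that metadata.go and deploy.go gain could be enforced along the following lines. This is only a sketch: the `.felhom.yml` schema for volumes is not specified in this audit, so `Volume`, `VolumeClass`, `Backing`, and `validate` are hypothetical names. The one rule encoded is the one stated in §5(2): a bulk volume must be a backup=0 mount point, never a rootfs Docker named volume.

```go
package main

import "fmt"

// VolumeClass is the hot/bulk classification from doc 01 §8.
type VolumeClass string

const (
	Hot  VolumeClass = "hot"
	Bulk VolumeClass = "bulk"
)

// Backing describes how a volume is realized in the guest.
// Hypothetical enumeration for illustration.
type Backing string

const (
	NamedVolume Backing = "named-volume" // Docker named volume on the rootfs
	MountPoint  Backing = "mount-point"  // backup=0 mount point provided by the agent
)

// Volume is a hypothetical placement record for one app volume.
type Volume struct {
	Name    string
	Class   VolumeClass
	Backing Backing
}

// validate rejects placements that break the bulk rule. No constraint on
// hot volumes is stated in this document, so none is applied here.
func validate(v Volume) error {
	switch v.Class {
	case Hot:
		return nil
	case Bulk:
		if v.Backing != MountPoint {
			return fmt.Errorf("volume %q: bulk must be a mount point, got %s", v.Name, v.Backing)
		}
		return nil
	default:
		return fmt.Errorf("volume %q: unknown class %q", v.Name, v.Class)
	}
}

func main() {
	vols := []Volume{
		{Name: "db", Class: Hot, Backing: NamedVolume},
		{Name: "media", Class: Bulk, Backing: NamedVolume},
	}
	for _, v := range vols {
		fmt.Printf("%s: %v\n", v.Name, validate(v))
	}
}
```

Running the check at deploy time, before compose starts, would keep a misclassified bulk volume from landing on the rootfs in the first place.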

storage/ — entire package DELETE (→agent)

| File | Class | Reason | Risk |
|---|---|---|---|
| scan*, format*, attach*, migrate*, migrate_drive*, safety* | DELETE (→agent) | Physical disk: lsblk/sfdisk/wipefs/mkfs.ext4/partprobe/mount/umount/fstab/blkid/drive-rsync. The agent owns all of this (doc 01 §3, §8). | hazard |

sync/

| File | Class | Reason | Risk |
|---|---|---|---|
| sync/sync.go | KEEP | Catalog git-sync (clone/fetch/reset, copy compose+.felhom.yml, never overwrite app.yaml). | clean |

system/ — split per-function (not per-file)

| File | Class | Reason | Risk |
|---|---|---|---|
| cpu_linux.go/cpu_other.go | KEEP | /proc/stat works inside an LXC. | clean |
| info.go/info_other.go | KEEP | Structs/stubs. | clean |
| info_linux.go | MODIFY (split) | Keep mem (/proc/meminfo)/load/statfs (guest); temp via /host/sys, hwmon → agent. | needs-rework |
| mounts_linux.go/mounts_other.go | DELETE (→agent) mostly | Mount-point detection, USB, disk model, fstab, probe: host/disk. Guest-meaningful statfs disk-usage is the only keep-candidate → fold into the kept info. | hazard |

web/ — split by UI surface

| File | Class | Reason | Risk |
|---|---|---|---|
| auth.go, csrf.go, logbuffer.go, embed.go, templates.go | KEEP | Session/CSRF, log ring buffer, embeds/logo. | clean |
| funcmap.go | KEEP/PORT | Template helpers; a few backup/state labels track the backup rework. | clean |
| server.go (559 L) | MODIFY | Routing/wiring; remove storage/DR-restore/watchdog wiring; keep app/deploy/backup/settings/export/debug. | needs-rework |
| handlers.go (1883 L) | PORT/MODIFY | Core pages keep; the embedded storage-path management (add/remove/label/schedulable, storage bars, FileBrowser mount sync) → per-volume / agent-fed. | hazard |
| handler_export.go | KEEP/PORT | .fab UI. | clean |
| handler_debug.go (823 L) | PORT | Drop storage-simulate/infra-push/DR debug; keep the rest. | needs-rework |
| alerts.go | PORT/MODIFY | Storage-disconnect alert now sourced from agent status; backup/update alerts keep. | needs-rework |
| handler_restore.go | DELETE (→agent) / MODIFY | DR restore-mode UI; DR is agent-tier, so replace with an agent-status view or remove. | needs-rework |
| storage_handlers.go (1600 L) | DELETE (→agent) | Format/attach/mount/disconnect/migrate-drive/decommission disk UI. Any survivor is a thin client calling the agent API (e.g. per-volume placement requests). | hazard |
| templates/ (HTML, non-Go) | PORT | Remove disk-wizard + DR pages; keep app/deploy/backup/settings pages. | needs-rework |

scripts/

| File | Class | Reason | Risk |
|---|---|---|---|
| scripts/hashpass.go | KEEP | Standalone bcrypt helper. | clean |

3. Coupling hazards (delete-targets depended on by keep/port)

  1. backup/ is half-deleted but split inside files, not across them. backup.go contains both RunDBDumps/DumpAppVolumesSafe/app-restore (keep) and RunBackup/backupDrive + restic (delete→agent); restore.go and paths.go are likewise mixed. Keep/port consumers reach into this same package:

    • appexport/export.go:295 uses backup.DiscoverDatabases/DumpOne (DB dump is app-layer and must survive)
    • report/builder.go:buildBackupReport → backup status (MODIFY)
    • web/handlers.go (backups page, buildAppBackupRows), web/funcmap.go, web/alerts.go, web/handler_restore.go, web/handler_debug.go
    • selftest/selftest.go:217 checkResticRepos (restic path, delete)
    • main.go scheduler chain RunFullBackup (DB→volume→restic→infra-push) interleaves both sides.

     Action: extract the app-data-backup subset (DB dump, volume archive, per-app restore) into a clean retained package before deleting the restic/cross-drive code, or every keep consumer breaks.

  2. backup/crossdrive.go (delete→agent) is wired as crossDriveRunner into main.go, api/router.go, web/server.go, and surfaced by report/builder.go and the backups page. Removing it requires reworking the backup UI/report to the agent's guest-backup status.

  3. storage/ (delete→agent) depended on by keep/port UI: web/storage_handlers.go (delete) and web/server.go/web/handlers.go (port) — the latter renders storage labels/bars and runs FileBrowser mount sync off the storage-path registry. storage/migrate*.go also imports backup (also being split). Untangle the per-volume placement UI from the disk-management UI.

  4. monitor/watchdog.go (delete→agent) depended on by web/alerts.go (port), web/server.go, web/handler_debug.go, main.go. The disconnect alert must instead consume agent-reported storage status.

  5. system/ mixed-per-function, consumed by both sides. Keep consumers — stacks/deploy.go (GetMemoryMB, guest), metrics/collector.go (container) — must not drag in the host-disk/temp/USB code that goes to the agent (mounts_linux.go, info_linux.go temp). Also consumed by report/builder.go (MODIFY), monitor/healthcheck.go (PORT), selftest, crossdrive (delete). Split system/ cleanly into guest-info vs host-info first.

  6. settings/StoragePath carries disk state into an app-domain store. Disk fields (Disconnected,DisconnectedAt,StoppedStacks, decommission — UUID is not persisted, it's runtime-derived from fstab via system.ParseFstabUUID/watchdog.go) are written by watchdog.go/storage_handlers.go/crossdrive.go (all delete) but the same struct is read by stacks/web for labels and placement (keep). Reshape StoragePath to a placement record fed by the agent manifest.

  7. report/builder.go imports almost everything (backup, monitor, scheduler, stacks, system, metrics, settings, config). Its MODIFY must land after the backup and system splits, or it pulls deleted code along.

  8. backup/paths.go is shared both ways: appexport + selftest + the kept DB-dump flow use the app-dump path helpers; the same file holds the restic/secondary helpers that leave.

  9. DR/provisioning chain is cross-cut: setup/ (obsolete) → report/infra_pull + recovery/info + backup.MountDrivesFromLayout + backup.ReadLocalInfraBackup. All obsolete/→agent, but main.go's setup branch and web/handler_restore.go reference them; remove together.


4. Moves to the host agent (consolidated — feeds the future agent design)

Reporting only; not designing the agent here.

  • All physical-disk management: storage/ in full: scan/classify, format (wipefs/sfdisk/mkfs.ext4/partprobe), attach (raw mount + bind + fstab), per-app and full-drive migration (rsync), safety checks (system-disk detection).
  • Storage/USB watchdog: monitor/watchdog.go: disconnect/reconnect detection, umount -l, mount -T /host-fstab, UUID-by-id probing, safe-disconnect, restic-lock cleanup.
  • Infra/disk backup tier: backup/restic.go, crossdrive.go, restore_drives_*, disk_layout.go, local_infra.go, restore_scan.go, plus the restic-snapshot half of backup.go, the restic-restore half of restore.go, and the restic/secondary path helpers in paths.go. (Maps to the agent's vzdump→tiers→PBS in doc 01 §8.)
  • Infra-backup payload + recovery pull: report/infra_backup*, report/infra_pull.
  • Host-physical telemetry: system/mounts_linux.go (mount topology, USB, disk model), the temp/hwmon parts of system/info_linux.go, and the host-hardware parts of metrics/sysinfo.go.
  • Drive scanning for provisioning/DR: setup/scanner.go.
  • Self-restore-test execution — the agent performs the restore-to-scratch-guest; the controller only orchestrates/validates (see §5).

5. New components to build (no v0.33 equivalent)

  1. Agent local-API client — the controller's only path to guest-level Proxmox operations (doc 01 §3, §5): snapshot-before-deploy + rollback, "grow my RAM", request guest backup/restore, read the storage manifest / mount placement, query per-target storage status. Replaces the deleted direct host/disk code with constrained RPC. The controller holds no Proxmox creds — only a local-API token.
  2. Per-volume storage placement (doc 01 §8) — .felhom.yml hot/bulk volume classification (extend stacks/metadata.go), enforcement at deploy (extend stacks/deploy.go), and a placement record in settings. Replaces the per-app HDD-path + cross-drive model. A bulk volume must be realized as a backup=0 mount point, never a rootfs Docker named volume (validated recipe: phase3-findings.md B2 / doc 03 §7).
  3. Self-restore-test status display (read-only) — the agent owns orchestration (it holds the PBS key and creates the scratch guest — operator-tier, doc 03 §8); the controller only surfaces GET /restore-test/status in its UI. (Round-trip validated: Phase 2, ../proxmox-platform.md §4.)
  4. Snapshot-before-deploy/rollback flow in the deploy path — wraps the existing compose deploy with agent snapshot → health check → agent rollback-on-failure (doc 01 §9). New behaviour on top of stacks/deploy.go + stacks/healthprobe.go.
  5. Agent-provisioning bootstrap receiver — the controller accepts its injected hub API key + local-API token from the agent at provision time (doc 01 §6), replacing the deleted setup/ wizard.

6. Open / blocked items

  • Geo — resolved (S4): CF-API enforcement moves to the hub (it holds the CF token and reconciles geo → WAF); the controller keeps the geo preference UI/data and reports desired-state up. Tunnel placement is settled (host, agent-managed, doc 03 §3/§5). The cloudflare/ package + api/geo.go's CF-sync are DELETE-from-controller → hub.
  • Self-update — resolved (doc 03 §11): the controller is agent-managed; its self-update path is removed.
  • settings/stacks per-volume reshape — depends on the storage-manifest contract between hub ↔ agent ↔ controller (doc 01 §8), not yet specified.
  • Backup UI/report surface — depends on the agent's guest-backup status API shape (what the controller can see about vzdump/PBS state) — undefined.
  • Notification event taxonomy — which infra events (storage_disconnected, crossdrive_*, disaster_recovery_*) the agent emits vs the controller, once those responsibilities move.

Changelog — design-review + Phase-3 fold-in (2026-06-08)

  • M1: removed UUID from the settings.StoragePath field lists (§ settings, hazard #6) — it is runtime-derived from fstab, not persisted.
  • S4 (geo): cloudflare/ reclassified PORT(blocked) → DELETE(→hub) (CF-API enforcement moves to the hub); api/geo.go → PORT/MODIFY (keep geo preference endpoints, drop the CF-sync trigger); config/config.go also drops cf_api_token. §6 + §1 updated.
  • S5: cloudflare/geo no longer "blocked on tunnel placement" (resolved).
  • S6: §5(3) self-restore-test → status-display only; the agent owns orchestration.
  • Self-update resolved (03 §11): updater.go → DELETE(→agent), state.go → DELETE(obsolete), version.go KEEP; §6 + §5(2) updated (bulk = backup=0 mountpoint recipe).