felhom-agent/docs/architecture/02-controller-module-map.md
admin 61bfea3610 docs: rework references for repo rename proxmox-controller -> felhom-agent
Repo renamed on Gitea (admin/proxmox-controller -> admin/felhom-agent). Updates the
self-name reference in docs/proxmox-platform.md and the controller-source path
(deploy-felhom-compose -> felhom-controller) in the architecture docs. Docs-only; no
code or layout change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 13:38:58 +02:00


# Felhom Controller Architecture — Part 2: Controller Module Map
**Status:** audit (keep / port / delete / modify / add), grounded in the v0.33 source.

**Subject:** the v0.33 controller in `felhom-controller/controller/` (110 `.go` files,
~40 K LOC) audited against [01-topology-and-trust.md](01-topology-and-trust.md) and
[../proxmox-platform.md](../proxmox-platform.md).
> This is a **planning map, not the port.** No controller code was changed. Source
> citations use `controller/internal/...:line` (a different repo, so links are not
> clickable). Classifications reflect the **target model**: the in-guest controller is
> **Docker-only and holds no Proxmox credentials**; everything host/disk/Proxmox moves to
> a new **host agent** (out of scope here); the controller reaches the agent through a
> constrained **local API**.
## Classification scheme
**KEEP** (host-agnostic, ~unchanged) · **PORT** (survives, needs rework) ·
**DELETE (→agent)** (responsibility moves to the host agent) ·
**DELETE (obsolete)** (no longer needed) · **MODIFY** (stays, materially changes) ·
**NEW** (no v0.33 equivalent).

Risk tags: **clean** · **needs-rework** · **hazard** (entangles a delete-target with a keep/port target).

---
## 0. Executive summary
- The **app domain is largely intact and portable**: stack lifecycle (`stacks/`), catalog
git-sync (`sync/`), app-to-app integrations (`integrations/`), `.fab` export/import
(`appexport/`), the scheduler, crypto, asset sync, the hub report/notify *channels*, and
most of the web UI **KEEP/PORT cleanly**.
- The **disk/storage/host half deletes wholesale to the agent**: all of `storage/`,
`monitor/watchdog.go`, the restic/cross-drive/disk-layout/drive-mount parts of `backup/`,
`report/infra_backup*`+`infra_pull`, and the host-physical parts of `system/`.
- The **setup wizard (`setup/`) is obsolete** — the agent provisions the controller.
- **The single biggest hazard is `backup/`**: the keep side (DB dumps, Docker-volume
archive, per-app restore — needed by `appexport/` and the backup UI) and the delete side
(restic, cross-drive, drive-mount) are **interleaved inside the same files**
(`backup.go`, `restore.go`, `paths.go`), not cleanly file-separated. Extracting the
app-data-backup subset into a clean retained package is the critical refactor.
- **Intent-vs-reality corrections** (vs the task's provisional split): `monitor/pinger.go`
is already **dead** (legacy Healthchecks.io, "deprecated… now handled by Hub" per
`main.go`) → DELETE(obsolete), not keep. `backup.go`/`restore.go`/`paths.go` do **not**
split on file boundaries — they split *within* the file. `settings/` is **not** pure app
domain — it stores disk/disconnect/decommission state. `system/` is genuinely
mixed-per-function, not per-file.
---
## 1. v0.33 module inventory (package → purpose, key deps)
| Package | Purpose | Key internal deps |
|---|---|---|
| `cmd/controller/main.go` | Entry point; wires all subsystems; 6 adapters break import cycles; branches into setup mode | imports **every** package |
| `api/` | REST API (`router.go`) + geo endpoints (`geo.go`) | stacks, backup, metrics, notify, selfupdate, sync, system, assets, integrations, cloudflare, config, settings |
| `appexport/` | `.fab` app export/import (config+DB+volumes, AES-256-CTR+scrypt) | **backup** (DB dump), (provider iface → stacks) |
| `assets/` | Download/cache app assets from Hub API | — (HTTP only) |
| `backup/` | DB dumps, Docker-volume archive, **restic**, **cross-drive rsync**, per-app restore, **drive mount**, disk-layout, infra-backup metadata | config, monitor, settings, system, util |
| `cloudflare/` | Geo-restriction via Cloudflare WAF (zone/waf/geosync/countries) — **enforcement → hub** (S4) | settings |
| `config/` | `controller.yaml` schema + load | — |
| `crypto/` | AES-256-GCM for app.yaml secrets | — |
| `integrations/` | App-to-app (OnlyOffice→FileBrowser/Nextcloud) via docker exec / config patch | stacks, crypto, settings |
| `metrics/` | SQLite time-series: system + container metrics, log scan | system |
| `monitor/` | App health (`healthcheck`,`pinger`) + **storage/USB watchdog** | config, notify, settings, system |
| `notify/` | Hub event push (direct, own API key) | settings |
| `recovery/` | Generate `recovery-info.txt` (DR guide) | — |
| `report/` | Build+push hub report; **infra-backup payload**; **recovery pull** | backup, config, metrics, monitor, scheduler, settings, stacks, system |
| `scheduler/` | Cron/interval jobs, Budapest TZ | — |
| `selftest/` | Startup checks (docker/dirs/catalog/hub/**restic repos**/mountpoint) | backup, config, settings, system |
| `selfupdate/` | Self-update: pull image, edit compose, `up -d` | config |
| `settings/` | `settings.json` persistent state: **storage paths/disconnect/decommission**, cross-drive cfg, notif prefs, geo, integration state, DB-validation cache | — |
| `setup/` | **First-run wizard** (scan drives, hub-restore, manual config) | backup, config, report, settings, web |
| `stacks/` | Docker Compose lifecycle, deploy + memory validation, metadata (`.felhom.yml`), HDD-data delete | config, crypto, system |
| `storage/` | **Physical disk** scan/format/attach/mount/migrate/fstab/safety | backup, settings, util |
| `sync/` | Catalog git-sync (pull templates) | config |
| `system/` | Resource info: mem/cpu/load (guest) + **temp/disk-model/USB/mount topology (host)** | — |
| `util/` | String helper | — |
| `web/` | Hungarian dashboard: pages, auth, deploy, backup UI, **storage/disk UI**, DR restore UI, export UI, debug | appexport, backup, config, crypto, integrations, monitor, notify, scheduler, selfupdate, settings, stacks, storage, system |
---
## 2. Classification table (per package/file)
### `cmd/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `cmd/controller/main.go` | **MODIFY** | Wiring stays, but drop the setup-mode branch, the storage/watchdog/drive-migrator/restic/cross-drive/infra-backup wiring, and add the **agent local-API client**. 6 adapters shrink. | hazard |
### `api/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `api/router.go` | **PORT/MODIFY** | Keep stacks/deploy/integrations/metrics/sync/assets/selfupdate routes; **remove `/api/storage/*` (disk)**; backup routes become **agent-coordinated guest-backup** requests; `config/apply` (hub-pushes-yaml) changes since the **agent** now injects config at provision. | needs-rework |
| `api/geo.go` | **PORT/MODIFY** | Keep the customer-facing geo **preference** endpoints (set/get global + per-app); **drop the Cloudflare-sync trigger** — enforcement → hub (S4). The controller reports geo desired-state up instead of calling the CF API. | needs-rework |
### `appexport/` — KEEP/PORT (Docker-volume + DB level, no disk ops)
| File | Class | Reason | Risk |
|---|---|---|---|
| `crypto.go` | **KEEP** | Self-contained AES-256-CTR+HMAC+scrypt for `.fab`. | clean |
| `manifest.go`, `provider.go` | **KEEP** | Bundle metadata; provider interface (impl in main). | clean |
| `export.go` | **PORT** | Docker-volume `tar`, DB dump via `backup.DumpOne`, config copy. Depends on the **retained** app-data-backup subset of `backup/`; HDD-mount enumeration reworked to **per-volume placement**. | needs-rework |
| `restore.go` | **PORT** | `docker volume create`/`tar xf`, DB import, compose up. Same per-volume rework. | needs-rework |
| `estimate.go` | **PORT** | `du`/`df` on mounts → per-volume sizing. | clean |
### `assets/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `syncer.go` | **KEEP** | Hub API download + checksum cache; already a direct hub channel. | clean |
### `backup/` — THE SPLIT (delete side interleaved with keep side; see §3)
| File | Class | Reason | Risk |
|---|---|---|---|
| `dbdump.go` | **KEEP** | Pure `docker exec pg_dump`/`mariadb-dump` — app/DB data layer; the retained per-app backup. | clean |
| `appdata.go` | **PORT** | App-data discovery (stacks/volumes/DB containers, `du`). "HDD mount" concept → per-volume. | needs-rework |
| `backup.go` (1478 L) | **MODIFY (split)** | Mixes **keep** (`RunDBDumps`, `DumpAppVolumes(Safe)`, app restore) with **delete→agent** (`RunBackup`/`backupDrive`/restic snapshot/prune/check on per-drive repos). Must be torn in two. | hazard |
| `restore.go` (442 L) | **MODIFY (split)** | `RestoreApp` restic path → agent; Docker-volume + Tier-2 rsync restore (app layer) → keep. | hazard |
| `restore_app_linux.go`/`_other.go` | **PORT** | Per-app restore: compose pull/up, rsync app data, DB-dump restore. App layer; depends on backup location that changes. | needs-rework |
| `paths.go` | **MODIFY (split)** | `AppDBDumpPath`/`AppVolumeDumpPath` keep; `Primary/SecondaryResticRepoPath`, `InfraBackupDir` → agent. | needs-rework |
| `restic.go` | **DELETE (→agent)** | restic repos on drives = infra backup tier; agent does vzdump/PBS. | hazard |
| `crossdrive.go` | **DELETE (→agent)** | Tier-2 cross-drive rsync to secondary storage = storage-tier (agent + storage manifest). | hazard |
| `restore_drives_linux.go`/`_other.go` | **DELETE (→agent)** | `lsblk`/`blkid`/`mount`/fstab — pure host disk. | hazard |
| `disk_layout.go` | **DELETE (→agent)** | Disk topology for DR → agent. | clean |
| `local_infra.go` | **DELETE (→agent)** | Per-drive infra-backup metadata → agent. | clean |
| `restore_scan.go` | **DELETE (→agent)** | Scans drives to build a DR restore plan = agent-tier DR. | needs-rework |
### `cloudflare/` — DELETE (→hub): CF-API enforcement moves to the hub (S4)
| File | Class | Reason | Risk |
|---|---|---|---|
| `client.go`,`zone.go`,`waf.go`,`geosync.go`,`countries.go` | **DELETE (→hub)** | The **hub** holds the CF API token and reconciles geo desired-state → WAF (doc 01 §5, doc 03 §2). The controller no longer calls the Cloudflare API — it reports geo desired-state up. The customer-facing geo *preference UI/data* stays (see `api/geo.go`). | needs-rework |
### `config/`, `crypto/`, `util/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `config/config.go` | **MODIFY** | Drop `BackupConfig` (restic/retention), storage-drive keys, and `InfrastructureConfig.cf_api_token` (→hub, S4); keep customer/paths/web/git/stacks/monitoring/hub/assets/system; **add agent local-API endpoint+token**. | needs-rework |
| `crypto/crypto.go` | **KEEP** | App.yaml secret encryption. | clean |
| `util/strings.go` | **KEEP** | Trivial helper. | clean |
### `integrations/` — all KEEP (pure app-domain)
| File | Class | Reason | Risk |
|---|---|---|---|
| `integrations.go`,`lifecycle.go`,`manager.go`,`onlyoffice_filebrowser.go`,`onlyoffice_nextcloud.go` | **KEEP** | App-to-app via `docker exec` / compose-config patch; no host ops. | clean |
### `metrics/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `store.go`,`logscanner.go`,`telemetry.go`,`types.go` | **KEEP** | SQLite store, `docker logs` scan, container telemetry — app-domain. | clean |
| `collector.go` | **PORT** | Container metrics (`docker stats`) keep; host metrics via `system.GetInfo` (temp, physical disk) become **agent-provided or dropped**. | needs-rework |
| `sysinfo.go`/`sysinfo_other.go` | **MODIFY** | Reads `/host/etc`, `/proc/cpuinfo`, uptime (host static info); some of it stays meaningful in-guest, hardware identity comes via the agent. | needs-rework |
### `monitor/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `healthcheck.go` | **PORT (split)** | Keep guest health (mem/cpu/docker/protected-containers); host health (temp, **physical disk**, storage-path mount status) becomes **agent-fed**. | needs-rework |
| `pinger.go` | **DELETE (obsolete)** | Legacy Healthchecks.io; `main.go` itself marks it "deprecated… now handled by Hub". *(Corrects the task's KEEP/PORT guess.)* | clean |
| `watchdog.go` (902 L) | **DELETE (→agent)** | Storage/USB disconnect monitoring: `umount -l`, `mount -T /host-fstab`, UUID probing, restic-lock cleanup — pure host storage. | hazard |
### `notify/`, `recovery/`, `scheduler/`, `selftest/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `notify/notifier.go` | **KEEP/MODIFY** | Direct hub event channel (own API key) — keep; prune infra event types that move to the agent (`storage_disconnected`, `crossdrive_*`, `disaster_recovery_*`). | clean |
| `recovery/info.go` | **DELETE (obsolete)** | Generates a DR text guide (OS install, docker-setup.sh, hub restore UI); DR is now agent+hub provisioning. | clean |
| `scheduler/scheduler.go` | **KEEP** | Generic cron/interval, Budapest TZ. | clean |
| `selftest/selftest.go` | **PORT** | Keep docker/dirs/catalog/hub checks; drop restic-repo + system-data **mountpoint** checks (→agent). | needs-rework |
### `report/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `pusher.go` | **KEEP** | Direct hub push (`/api/v1/report`, Bearer). | clean |
| `telemetry.go` | **KEEP** | Per-app telemetry section. | clean |
| `builder.go` (326 L) | **MODIFY** | Keep containers/telemetry/stacks/geo/app-health; drop/relocate host system info, physical storage, **restic backup status incl. restic password**. | hazard |
| `types.go` | **MODIFY** | Schema: drop infra fields (`restic password`, physical storage), keep app-domain. | needs-rework |
| `infra_backup.go`/`_linux.go`/`_other.go` | **DELETE (→agent)** | Builds infra-backup payload (disk layout, restic/enc passwords) for hub. | hazard |
| `infra_pull.go` | **DELETE (→agent)** | Pulls recovery config + infra backup from hub (setup-wizard DR). | needs-rework |
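
The retained `pusher.go` is described above only as a direct hub push to `/api/v1/report` with a Bearer key. A minimal Go sketch of such a push could look like the following. The `AppReport` fields, the `pushReport` and `demoPush` names, and the stub hub are assumptions made for illustration; none of them are taken from the v0.33 source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// AppReport is a hypothetical app-domain-only payload (no restic password,
// no physical storage), in the spirit of the MODIFY rows for builder.go/types.go.
type AppReport struct {
	CustomerID string   `json:"customer_id"`
	Stacks     []string `json:"stacks"`
}

// pushReport POSTs the report to <hubURL>/api/v1/report with a Bearer API key.
func pushReport(c *http.Client, hubURL, apiKey string, r AppReport) error {
	body, err := json.Marshal(r)
	if err != nil {
		return fmt.Errorf("encode report: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, hubURL+"/api/v1/report", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("push report: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("hub returned %s", resp.Status)
	}
	return nil
}

// demoPush runs pushReport against a stub hub that accepts only "demo-key".
func demoPush(apiKey string) error {
	hub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/api/v1/report" ||
			r.Header.Get("Authorization") != "Bearer demo-key" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}))
	defer hub.Close()
	client := &http.Client{Timeout: 5 * time.Second}
	return pushReport(client, hub.URL, apiKey, AppReport{CustomerID: "demo", Stacks: []string{"nextcloud"}})
}

func main() {
	fmt.Println("valid key:", demoPush("demo-key"))
	fmt.Println("wrong key:", demoPush("wrong-key"))
}
```

The stub hub stands in for the real one only so the sketch runs on its own; the real report schema is whatever `types.go` becomes after the infra fields are dropped.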
### `selfupdate/` — controller is agent-managed (doc 03 §11)
| File | Class | Reason | Risk |
|---|---|---|---|
| `version.go` | **KEEP** | Semver parse / version string (still used for reporting). | clean |
| `state.go` | **DELETE (obsolete)** | Self-update audit state — the agent owns controller updates now (doc 03 §11). | clean |
| `updater.go` | **DELETE (→agent)** | Resolved (doc 03 §11): the controller is **agent-managed** — the agent snapshots → redeploys → health-gates → rolls back the controller. The controller's old self-update path (image pull + compose edit) is **removed**. | clean |
### `settings/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `settings/settings.go` (1101 L) | **MODIFY (split)** | Keep notif prefs, integration state, geo, DB-validation cache, cross-drive *intent*. The **storage-path registry** (`StoragePath` with `Disconnected`/`DisconnectedAt`/`StoppedStacks`/decommission) is disk-management state → reshape to **per-volume placement** fed by the agent's storage manifest; disconnect/decommission/migrate state leaves. (UUID is *not* a persisted field — runtime-derived from fstab.) | hazard |
### `setup/` — all DELETE (obsolete); the agent provisions the controller
| File | Class | Reason | Risk |
|---|---|---|---|
| `handlers.go`,`setup.go`,`csrf.go`,`network.go` | **DELETE (obsolete)** | First-run wizard (hub-restore, manual config, LAN-IP detection). | needs-rework |
| `scanner.go` | **DELETE (→agent)** | Drive scan (`lsblk`+temp mounts) for backup discovery — host op; its capability informs the agent. | clean |
### `stacks/` — core app domain (KEEP/PORT)
| File | Class | Reason | Risk |
|---|---|---|---|
| `manager.go` (1074 L) | **KEEP/PORT** | Docker Compose orchestration, scan/state/start/stop/logs — the heart. Minor port. | clean |
| `deploy.go` | **PORT** | Memory validation (`system.GetMemoryMB` → **guest** mem, fine in LXC), secret gen, encrypted app.yaml. **Add snapshot-before-deploy → agent** hook. | needs-rework |
| `healthprobe.go` | **KEEP** | TCP/HTTP app probes. | clean |
| `metadata.go` | **PORT** | `.felhom.yml` parse. **Add per-volume hot/bulk classification** (doc 01 §8). | needs-rework |
| `delete.go` | **PORT** | Stack delete + HDD-data `os.RemoveAll` on bind mounts → per-volume cleanup. | needs-rework |
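
The per-volume `hot`/`bulk` classification planned for `metadata.go` comes with one hard rule stated in §5: a `bulk` volume is realized as a mount point, never as a rootfs Docker named volume. A minimal sketch of that check, assuming hypothetical `Volume`, `Backing`, and `ValidatePlacement` names (the real `.felhom.yml` schema is not specified in this document):

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeClass is the per-volume classification; only "hot" and "bulk" are named in the doc.
type VolumeClass string

const (
	Hot  VolumeClass = "hot"
	Bulk VolumeClass = "bulk"
)

// Backing describes how a volume is realized inside the guest (hypothetical names).
type Backing string

const (
	NamedVolume Backing = "docker-named-volume" // lives on the guest rootfs
	MountPoint  Backing = "mount-point"         // agent-provided mount point
)

// Volume is an illustrative placement record for one app volume.
type Volume struct {
	Name    string
	Class   VolumeClass
	Backing Backing
}

// ErrBulkOnRootfs marks a bulk volume that is not backed by a mount point.
var ErrBulkOnRootfs = errors.New("bulk volume must be a mount point, not a rootfs Docker named volume")

// ValidatePlacement rejects bulk volumes that are not mount points and unknown classes.
func ValidatePlacement(vols []Volume) error {
	for _, v := range vols {
		switch v.Class {
		case Hot:
			// The document states no backing constraint for hot volumes.
		case Bulk:
			if v.Backing != MountPoint {
				return fmt.Errorf("volume %q: %w", v.Name, ErrBulkOnRootfs)
			}
		default:
			return fmt.Errorf("volume %q: unknown class %q", v.Name, v.Class)
		}
	}
	return nil
}

func main() {
	vols := []Volume{
		{Name: "db", Class: Hot, Backing: NamedVolume},
		{Name: "media", Class: Bulk, Backing: NamedVolume}, // violates the rule
	}
	fmt.Println(ValidatePlacement(vols))
}
```

Running the check at deploy time (the `deploy.go` extension) rather than at parse time is one possible design; the document does not say where enforcement lives beyond "at deploy".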
### `storage/` — entire package DELETE (→agent)
| File | Class | Reason | Risk |
|---|---|---|---|
| `scan*`,`format*`,`attach*`,`migrate*`,`migrate_drive*`,`safety*` | **DELETE (→agent)** | Physical disk: `lsblk`/`sfdisk`/`wipefs`/`mkfs.ext4`/`partprobe`/`mount`/`umount`/fstab/`blkid`/drive-rsync. The agent owns all of this (doc 01 §3, §8). | hazard |
### `sync/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `sync/sync.go` | **KEEP** | Catalog git-sync (clone/fetch/reset, copy compose+`.felhom.yml`, never overwrite app.yaml). | clean |
### `system/` — split per-function (not per-file)
| File | Class | Reason | Risk |
|---|---|---|---|
| `cpu_linux.go`/`cpu_other.go` | **KEEP** | `/proc/stat` works inside an LXC. | clean |
| `info.go`/`info_other.go` | **KEEP** | Structs/stubs. | clean |
| `info_linux.go` | **MODIFY (split)** | Keep mem (`/proc/meminfo`)/load/statfs (guest); **temp via `/host/sys`, hwmon → agent**. | needs-rework |
| `mounts_linux.go`/`mounts_other.go` | **DELETE (→agent)** mostly | Mount-point detection, USB, disk model, fstab, probe — host/disk. Guest-meaningful `statfs` disk-usage is the only keep-candidate → fold into the kept `info`. | hazard |
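
The guest-side half that stays in `system/` only needs files that are meaningful inside the guest, such as `/proc/meminfo`. As an illustration of how small that surface is, here is a sketch of a `MemTotal` reader. `parseMemTotalMB` is a hypothetical name and this is not the v0.33 `GetMemoryMB` implementation, whose code is not shown in this document.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemTotalMB extracts MemTotal from /proc/meminfo-formatted text
// and returns it in MiB (the kernel reports the value in kB).
func parseMemTotalMB(meminfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected line shape: "MemTotal:       8142444 kB"
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.Atoi(fields[1])
			if err != nil {
				return 0, fmt.Errorf("parse MemTotal: %w", err)
			}
			return kb / 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

func main() {
	raw, err := os.ReadFile("/proc/meminfo") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/meminfo:", err)
		return
	}
	mb, err := parseMemTotalMB(string(raw))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("guest memory (MiB):", mb)
}
```

Nothing here touches `/host/sys`, hwmon, or mount topology, which is the point of splitting per function: the kept code can live in a package that never imports the host-side readers.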
### `web/` — split by UI surface
| File | Class | Reason | Risk |
|---|---|---|---|
| `auth.go`,`csrf.go`,`logbuffer.go`,`embed.go`,`templates.go` | **KEEP** | Session/CSRF, log ring buffer, embeds/logo. | clean |
| `funcmap.go` | **KEEP/PORT** | Template helpers; a few backup/state labels track the backup rework. | clean |
| `server.go` (559 L) | **MODIFY** | Routing/wiring; remove storage/DR-restore/watchdog wiring; keep app/deploy/backup/settings/export/debug. | needs-rework |
| `handlers.go` (1883 L) | **PORT/MODIFY** | Core pages keep; the embedded **storage-path management** (add/remove/label/schedulable, storage bars, FileBrowser mount sync) → per-volume / agent-fed. | hazard |
| `handler_export.go` | **KEEP/PORT** | `.fab` UI. | clean |
| `handler_debug.go` (823 L) | **PORT** | Drop storage-simulate/infra-push/DR debug; keep the rest. | needs-rework |
| `alerts.go` | **PORT/MODIFY** | Storage-disconnect alert now sourced from **agent** status; backup/update alerts keep. | needs-rework |
| `handler_restore.go` | **DELETE (→agent) / MODIFY** | DR restore-mode UI; DR is agent-tier — replace with an agent-status view or remove. | needs-rework |
| `storage_handlers.go` (1600 L) | **DELETE (→agent)** | Format/attach/mount/disconnect/migrate-drive/decommission disk UI. Any survivor is a **thin client calling the agent API** (e.g. per-volume placement requests). | hazard |
| `templates/` (HTML, non-Go) | **PORT** | Remove disk-wizard + DR pages; keep app/deploy/backup/settings pages. | needs-rework |
### `scripts/`
| File | Class | Reason | Risk |
|---|---|---|---|
| `scripts/hashpass.go` | **KEEP** | Standalone bcrypt helper. | clean |
---
## 3. Coupling hazards (delete-targets depended on by keep/port)
1. **`backup/` is half-deleted but split *inside files*, not across them.** `backup.go`
   contains both `RunDBDumps`/`DumpAppVolumesSafe`/app-restore (keep) and
   `RunBackup`/`backupDrive` + restic (delete→agent); `restore.go` and `paths.go` are
   likewise mixed. **Keep/port consumers reach into this same package:**
   - `appexport/export.go:295` → `backup.DiscoverDatabases`/`DumpOne` (DB dump is app-layer and must survive)
   - `report/builder.go:buildBackupReport` → backup status (MODIFY)
   - `web/handlers.go` (backups page, `buildAppBackupRows`), `web/funcmap.go`, `web/alerts.go`, `web/handler_restore.go`, `web/handler_debug.go`
   - `selftest/selftest.go:217` → `checkResticRepos` (restic path: delete)
   - `main.go` scheduler chain `RunFullBackup` (DB→volume→restic→infra-push) interleaves both sides.

   **Action:** extract the app-data-backup subset (DB dump, volume archive, per-app
   restore) into a clean retained package *before* deleting the restic/cross-drive code,
   or every keep consumer breaks.
2. **`backup/crossdrive.go` (delete→agent) is wired as `crossDriveRunner` into**
`main.go`, `api/router.go`, `web/server.go`, and surfaced by `report/builder.go` and the
backups page. Removing it requires reworking the backup UI/report to the agent's
guest-backup status.
3. **`storage/` (delete→agent) depended on by keep/port UI:** `web/storage_handlers.go`
(delete) and `web/server.go`/`web/handlers.go` (port) — the latter renders storage
labels/bars and runs **FileBrowser mount sync** off the storage-path registry.
`storage/migrate*.go` also imports `backup` (also being split). Untangle the per-volume
placement UI from the disk-management UI.
4. **`monitor/watchdog.go` (delete→agent) depended on by** `web/alerts.go` (port),
`web/server.go`, `web/handler_debug.go`, `main.go`. The disconnect **alert** must instead
consume agent-reported storage status.
5. **`system/` mixed-per-function, consumed by both sides.** Keep consumers —
`stacks/deploy.go` (`GetMemoryMB`, guest), `metrics/collector.go` (container) — must not
drag in the host-disk/temp/USB code that goes to the agent (`mounts_linux.go`,
`info_linux.go` temp). Also consumed by `report/builder.go` (MODIFY), `monitor/healthcheck.go`
(PORT), `selftest`, `crossdrive` (delete). **Split `system/` cleanly into guest-info vs
host-info first.**
6. **`settings/StoragePath` carries disk state into an app-domain store.** Disk fields
(`Disconnected`,`DisconnectedAt`,`StoppedStacks`, decommission — UUID is *not* persisted, it's runtime-derived from fstab via `system.ParseFstabUUID`/`watchdog.go`) are written by
`watchdog.go`/`storage_handlers.go`/`crossdrive.go` (all delete) but the same struct is
read by `stacks`/`web` for labels and **placement** (keep). Reshape `StoragePath` to a
placement record fed by the agent manifest.
7. **`report/builder.go` imports almost everything** (backup, monitor, scheduler, stacks,
system, metrics, settings, config). Its MODIFY must land *after* the backup and system
splits, or it pulls deleted code along.
8. **`backup/paths.go` shared both ways** — `appexport` + `selftest` + the kept DB-dump
flow use the app-dump path helpers; the same file holds the restic/secondary helpers
that leave.
9. **DR/provisioning chain is cross-cut:** `setup/` (obsolete) → `report/infra_pull` +
`recovery/info` + `backup.MountDrivesFromLayout` + `backup.ReadLocalInfraBackup`. All
obsolete/→agent, but `main.go`'s setup branch and `web/handler_restore.go` reference
them; remove together.
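
Several hazards above are ordering constraints: the backup extraction precedes the restic/cross-drive deletion (#1), and the `report/builder.go` MODIFY lands after the backup and system splits (#5, #7). One way to keep such constraints honest is to write them down as a dependency map and sort it. The sketch below uses Kahn's algorithm; the step labels are paraphrases of the hazards and `orderSteps` is a hypothetical helper, not part of the controller.

```go
package main

import (
	"fmt"
	"sort"
)

// orderSteps returns the steps so that every prerequisite precedes its
// dependents (Kahn's algorithm). deps maps a step to the steps that must land first.
func orderSteps(deps map[string][]string) ([]string, error) {
	indeg := map[string]int{}
	next := map[string][]string{}
	for step, pres := range deps {
		if _, ok := indeg[step]; !ok {
			indeg[step] = 0
		}
		for _, p := range pres {
			if _, ok := indeg[p]; !ok {
				indeg[p] = 0
			}
			indeg[step]++
			next[p] = append(next[p], step)
		}
	}
	var ready []string
	for s, d := range indeg {
		if d == 0 {
			ready = append(ready, s)
		}
	}
	var out []string
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic pick among steps that are ready
		s := ready[0]
		ready = ready[1:]
		out = append(out, s)
		for _, n := range next[s] {
			indeg[n]--
			if indeg[n] == 0 {
				ready = append(ready, n)
			}
		}
	}
	if len(out) != len(indeg) {
		return nil, fmt.Errorf("dependency cycle among refactor steps")
	}
	return out, nil
}

func main() {
	// Constraints paraphrased from hazards #1, #5 and #7.
	deps := map[string][]string{
		"extract app-data-backup subset":             nil,
		"split system/ into guest-info vs host-info": nil,
		"delete restic/cross-drive code":             {"extract app-data-backup subset"},
		"modify report/builder.go": {
			"extract app-data-backup subset",
			"split system/ into guest-info vs host-info",
		},
	}
	order, err := orderSteps(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, s := range order {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```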
---
## 4. Moves to the host agent (consolidated — feeds the future agent design)
> Reporting only; **not** designing the agent here.
- **All physical-disk management** — `storage/` in full: scan/classify, format
(`wipefs`/`sfdisk`/`mkfs.ext4`/`partprobe`), attach (raw mount + bind + fstab), per-app
and full-drive migration (rsync), safety checks (system-disk detection).
- **Storage/USB watchdog** — `monitor/watchdog.go`: disconnect/reconnect detection,
`umount -l`, `mount -T /host-fstab`, UUID-by-id probing, safe-disconnect, restic-lock
cleanup.
- **Infra/disk backup tier** — `backup/restic.go`, `crossdrive.go`,
`restore_drives_*`, `disk_layout.go`, `local_infra.go`, `restore_scan.go`, plus the
restic-snapshot half of `backup.go`, the restic-restore half of `restore.go`, and the
restic/secondary path helpers in `paths.go`. (Maps to the agent's `vzdump`→tiers→PBS in
doc 01 §8.)
- **Infra-backup payload + recovery pull** — `report/infra_backup*`, `report/infra_pull`.
- **Host-physical telemetry** — `system/mounts_linux.go` (mount topology, USB, disk
model), the temp/hwmon parts of `system/info_linux.go`, and the host-hardware parts of
`metrics/sysinfo.go`.
- **Drive scanning for provisioning/DR** — `setup/scanner.go`.
- **Self-restore-test execution** — the agent performs the restore-to-scratch-guest; the
controller only orchestrates/validates (see §5).
---
## 5. New components to build (no v0.33 equivalent)
1. **Agent local-API client** — the controller's only path to guest-level Proxmox
operations (doc 01 §3, §5): `snapshot-before-deploy` + rollback, "grow my RAM", request
guest backup/restore, read the storage manifest / mount placement, query per-target
storage status. Replaces the deleted direct host/disk code with constrained RPC. The
controller holds **no Proxmox creds** — only a local-API token.
2. **Per-volume storage placement** (doc 01 §8) — `.felhom.yml` `hot`/`bulk` volume
classification (extend `stacks/metadata.go`), enforcement at deploy (extend
`stacks/deploy.go`), and a placement record in `settings`. Replaces the per-app
HDD-path + cross-drive model. A `bulk` volume must be realized as a `backup=0` mount point,
**never** a rootfs Docker named volume (validated recipe: `phase3-findings.md` B2 / doc 03 §7).
3. **Self-restore-test status display** (read-only) — the **agent owns orchestration** (it
holds the PBS key and creates the scratch guest — operator-tier, doc 03 §8); the controller
only surfaces `GET /restore-test/status` in its UI. (Round-trip validated: Phase 2,
[../proxmox-platform.md](../proxmox-platform.md) §4.)
4. **Snapshot-before-deploy/rollback flow** in the deploy path — wraps the existing
compose deploy with agent snapshot → health check → agent rollback-on-failure
(doc 01 §9). New behaviour on top of `stacks/deploy.go` + `stacks/healthprobe.go`.
5. **Agent-provisioning bootstrap receiver** — the controller accepts its injected hub API
key + local-API token from the agent at provision time (doc 01 §6), replacing the
deleted `setup/` wizard.
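
Items 1 and 4 combine into one control flow: ask the agent for a snapshot, run the existing compose deploy, gate on health, and ask the agent to roll back on failure. The following is a sketch under stated assumptions: `AgentClient`, its method names, and `deployWithSnapshot` are hypothetical, since the local-API contract is not defined in this document, and refusing to deploy when the snapshot fails is an assumed policy.

```go
package main

import (
	"errors"
	"fmt"
)

// AgentClient is a hypothetical view of the agent local API. The controller
// holds only a local-API token; the agent performs the Proxmox operations.
type AgentClient interface {
	Snapshot(label string) (id string, err error)
	Rollback(id string) error
}

// errUnhealthy is a stand-in for a failed app probe.
var errUnhealthy = errors.New("health probe failed")

// deployWithSnapshot wraps a deploy with snapshot, health check, and rollback-on-failure.
func deployWithSnapshot(agent AgentClient, stack string, deploy, healthy func() error) error {
	snap, err := agent.Snapshot("pre-deploy-" + stack)
	if err != nil {
		// Assumed policy: no restore point, no deploy.
		return fmt.Errorf("snapshot before deploy: %w", err)
	}
	failure := deploy()
	if failure == nil {
		failure = healthy()
	}
	if failure == nil {
		return nil
	}
	if rbErr := agent.Rollback(snap); rbErr != nil {
		return fmt.Errorf("deploy of %s failed (%w) and rollback to %s failed: %v", stack, failure, snap, rbErr)
	}
	return fmt.Errorf("deploy of %s rolled back to %s: %w", stack, snap, failure)
}

// fakeAgent records calls so the flow can be exercised without a real agent.
type fakeAgent struct{ calls []string }

func (f *fakeAgent) Snapshot(label string) (string, error) {
	f.calls = append(f.calls, "snapshot")
	return "snap-1", nil
}

func (f *fakeAgent) Rollback(id string) error {
	f.calls = append(f.calls, "rollback:"+id)
	return nil
}

func main() {
	agent := &fakeAgent{}
	err := deployWithSnapshot(agent, "nextcloud",
		func() error { return nil },          // compose deploy succeeds
		func() error { return errUnhealthy }, // health gate fails
	)
	fmt.Println(err)
	fmt.Println(agent.calls)
}
```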
---
## 6. Open / blocked items
- **Geo — resolved (S4):** CF-API **enforcement moves to the hub** (it holds the CF token and
reconciles geo → WAF); the controller keeps the geo **preference UI/data** and reports
desired-state up. Tunnel placement is settled (host, agent-managed, doc 03 §3/§5). The
`cloudflare/` package + `api/geo.go`'s CF-sync are DELETE-from-controller → hub.
- **Self-update — resolved (doc 03 §11):** the controller is agent-managed; its self-update
path is removed.
- **`settings`/`stacks` per-volume reshape** — depends on the storage-manifest contract
between hub ↔ agent ↔ controller (doc 01 §8), not yet specified.
- **Backup UI/report surface** — depends on the agent's guest-backup status API shape
(what the controller can see about vzdump/PBS state) — undefined.
- **Notification event taxonomy** — which infra events (`storage_disconnected`,
`crossdrive_*`, `disaster_recovery_*`) the **agent** emits vs the controller, once those
responsibilities move.
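
Until the taxonomy is settled, the three event patterns named above can at least be captured in one place so that pruning `notify/` is mechanical. A small sketch follows; `isInfraEvent` is a hypothetical helper, and `crossdrive_failed` and `app_deployed` are made-up event names used only to exercise it.

```go
package main

import (
	"fmt"
	"strings"
)

// infraEventPatterns lists the event types this document names as moving to
// the agent. A trailing "*" matches any suffix.
var infraEventPatterns = []string{
	"storage_disconnected",
	"crossdrive_*",
	"disaster_recovery_*",
}

// isInfraEvent reports whether an event type matches one of the infra patterns.
func isInfraEvent(eventType string) bool {
	for _, p := range infraEventPatterns {
		if strings.HasSuffix(p, "*") {
			if strings.HasPrefix(eventType, strings.TrimSuffix(p, "*")) {
				return true
			}
			continue
		}
		if eventType == p {
			return true
		}
	}
	return false
}

func main() {
	// The last two names are invented for the demo.
	for _, e := range []string{"storage_disconnected", "crossdrive_failed", "app_deployed"} {
		fmt.Printf("%-22s infra=%v\n", e, isInfraEvent(e))
	}
}
```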
---
## Changelog — design-review + Phase-3 fold-in (2026-06-08)
- **M1:** removed `UUID` from the `settings.StoragePath` field lists (§ settings, hazard #6) —
it is runtime-derived from fstab, not persisted.
- **S4 (geo):** `cloudflare/` reclassified **PORT(blocked) → DELETE(→hub)** (CF-API enforcement
moves to the hub); `api/geo.go` → **PORT/MODIFY** (keep geo *preference* endpoints, drop the
CF-sync trigger); `config/config.go` also drops `cf_api_token`. §6 + §1 updated.
- **S5:** cloudflare/geo no longer "blocked on tunnel placement" (resolved).
- **S6:** §5(3) self-restore-test → **status-display only**; the agent owns orchestration.
- **Self-update resolved (03 §11):** `updater.go` → **DELETE(→agent)**, `state.go`
DELETE(obsolete), `version.go` KEEP; §6 + §5(2) updated (bulk = `backup=0` mountpoint recipe).