felhom-controller/REPORT.md
admin d8fe8f5ead docs: Phase 2b fail-closed gate LIVE-validated on AdventureLog
Demo has no dashboard password (API open: auth+CSRF both skip in that mode), driven
via the public URL. AdventureLog's unit manifest carries data_key_env_vars=[SECRET_KEY]
(catalog->manifest live); with SECRET_KEY unrecoverable, POST /backup/restore REFUSED
with the exact fail-closed message before any compose-up. Full deploy-with-data e2e
blocked by the 8G guest rootfs (AdventureLog images too big — the Phase 3 concern, live).
CHANGELOG/REPORT/CONTEXT updated; demo left clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 12:35:08 +02:00

REPORT — felhom-controller v0.54.0 (Phase 2: recovery unit — capture + restore)

Each app's on-drive backup is a self-contained, recreatable recovery unit (secret-free), and restore now recreates an app from its unit + the guest's own secrets with a fail-closed data-key gate. Built, unit/integration-tested, shipped to main, deployed to guest 9201. (Phase 1, the deploy-side double-nest GATE, shipped as v0.52.0; Phase 2 capture side as v0.53.x — see git history.)

The design decision that shaped Phase 2 (secret handling)

The recovery unit carries no secrets, no data-keys, and not the Docker image. This was decided after reading the actual hub code (the controller README that implied the hub stores app.yaml is stale pre-strip):

  • The hub is deliberately zero-knowledge — it holds a per-host recovery-code-wrapped PBS key it cannot decrypt + non-secret config; no per-app secrets. Escrowing app secrets there would regress that posture, so it was rejected.
  • app.yaml (encrypted) + the encryption key live on the guest rootfs (local-lvm:vm-9201-disk-0, confirmed via pct config) → already inside the PBS whole-guest snapshot; the external drive (mp0 bind) is not. So the secret↔data split maps onto the tiers: secrets ride PBS; bulk userdata rides the drive + (Phase 3) Tier 2.
  • Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml (live, else PBS); regenerate nothing. data_key is a fail-closed annotation, not a preserve/regenerate decision.

What shipped (v0.53.0 + v0.53.1)

  • Unit layout (rooted at the existing backups/primary/<app>/ — a deliberate low-churn choice, no risky dump-dir migration): compose/ (docker-compose.yml + .felhom.yml + a secret-stripped app.yaml) + the existing db-dumps/ + volume-dumps/ + manifest.json. New helpers RecoveryUnit{Path,ComposePath,ManifestPath} (internal/appbackup/paths.go).
  • Secret-free manifest (internal/backup/recovery_unit.go): app id, display name, controller version, timestamp, drive, namespace root, image pins (image NOT stored — re-pulled on restore), the NAMES of secret env vars (values never stored), data_key env-var names, an explicit secret_source note, captured config-file list, enumerated dumps, sha256 checksums.
  • Capture needs no secret access: non-secret env is plaintext in app.yaml, so the capture excludes secret-named keys (plus a defensive crypto.IsEncrypted guard) and reads no secret value. New StackDataProvider.GetStackRecoveryInfo + RecoveryInfo, implemented by the main.go stackAdapter; ParseComposeImages extracts pins.
  • data_key: DeployField.DataKey + Metadata.DataKeyEnvVars(); catalog adventurelog/.felhom.yml SECRET_KEY ("Titkosítási kulcs") marked data_key: true.
  • Refresh cadence (v0.53.1): capture runs from the daily DB dump AND the periodic RefreshCache (startup + every 5m), idempotent — content is built in memory and writes are skipped when the unit is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
  • Tests: capture is secret-free (a secret in the source app.yaml never appears in the unit) + manifest structure + idempotency (unchanged → skip; config change → rewrite). go build ./... clean.

Deploy mechanism (resolved this session)

The controller in guest 9201 is golden/bootstrap-managed: felhom-controller-bootstrap.service runs /usr/local/sbin/felhom-controller-bootstrap.sh, which starts (via docker run) the image tag recorded in /etc/felhom-controller-image (gitea anon-pull, no login). Deploy = build + push the tag → anon-pull → update that tag file → systemctl restart felhom-controller-bootstrap.service. The data volume and the encryption key persist. (This is what "self-update handles version drift" refers to.)

Live validation (guest 9201, demo-felhom)

  • Deployed v0.53.1; on startup RefreshCache captured units: romm (images=3, secrets-referenced=3, data_keys=0) and actualbudget (images=1, system-fallback path …/sys_drive/felhom-data/…).
  • RomM unit on disk: compose/{app.yaml,docker-compose.yml,.felhom.yml} + db-dumps/romm-mariadb.sql + manifest.json. Manifest is secret-free (image pins + secret NAMES + secret_source); captured app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
  • Secret-leak grep against the three actual RomM secret values → NO_LEAK. Idempotency confirmed (single capture log line; the 5m refresh skips).

Phase 2b — restore-from-unit + fail-closed gate (v0.54.0)

  • reconcileRestoreSecrets (pure, exhaustively unit-tested): merges the unit's non-secret env with the secrets recovered from the guest's live app.yaml. A missing/empty data-encrypting key aborts the restore (a PBS whole-guest restore is required) — regenerating it would corrupt data. A missing resettable secret is non-fatal (warn + proceed). Regenerates nothing.
  • RestoreFromRecoveryUnit: manifest → recover secrets from the guest → gate → restore named-volume tars → recover the app definition from the unit → redeploy with the reconstructed env (re-pull pinned image). Falls back to volume-only RestoreApp when no unit exists. Wired into /backup/restore.
  • Seams: RecoverStackSecrets / RecreateStackFromUnit (adapter, with encKey to decrypt the live app.yaml); stacks.RedeployFromEnv. isDebug made nil-safe.
  • Tests: the gate (recovered / data-key-missing→refuse / empty-data-key→refuse / resettable-missing→proceed, values used verbatim), the full orchestration (success→recreate-with-merged-env; data-key-missing→refused, recreate never called), and data_key parsing from .felhom.yml.

Validation status

  • Unit/integration-tested (authoritative): the fail-closed gate, the restore orchestration, secret reconciliation (regenerate-nothing), and the catalog→metadata data_key flow.
  • Live-validated (guest 9201): the capture side (v0.53.1, RomM — secret-free, NO_LEAK). For Phase 2b on AdventureLog (a real data_key app): its unit manifest carries data_key_env_vars: [SECRET_KEY] (catalog→manifest flow live); and with SECRET_KEY made unrecoverable, POST /backup/restore refused with the exact fail-closed message before any compose-up (no side effects). The demo has no dashboard password → the API is open (auth + CSRF skipped), driven via the public URL.
  • One e2e not run (an environment limit, not a code gap): the full "deploy with data → restore → confirm decrypts" run could not be completed because AdventureLog's images do not fit the 8 GB guest rootfs (the deploy hit "no space left on device"). That is precisely the Phase 3 rootfs-headroom concern, now observed live. Key preservation is covered by the gate's verbatim-recovery unit test. The demo was left clean (AdventureLog reverted to not-deployed, no leftovers).

Still ahead

Phase 3 (auto off-drive Tier 2 with rootfs-headroom guard) and Phase 4 (FileBrowser scoping + deploy-UI DB-on-SSD note + monitoring sort). The README backup-paths section still shows the stale restic/secondary layout; it will be rewritten when Tier 2 lands.