REPORT overwritten (secret-free recovery unit: design, what shipped, golden deploy mechanism, live 9201 validation incl. NO_LEAK grep). CONTEXT dated entry. README: recovery-unit subsection + flagged the stale restic/secondary paths section. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
REPORT — felhom-controller v0.53.1 (Phase 2: per-app recovery unit, capture side)
Each app's on-drive backup is now a self-contained, recreatable recovery unit — and it is
secret-free by design. Built, unit-tested, shipped to main, and validated live on guest 9201.
(Phase 1, the deploy-side double-nest GATE, shipped earlier as v0.52.0 — see git history.)
The design decision that shaped Phase 2 (secret handling)
The recovery unit carries no secrets, no data-keys, and not the Docker image. This was decided after reading the actual hub code (the controller README that implied the hub stores app.yaml is stale pre-strip):
- The hub is deliberately zero-knowledge — it holds a per-host recovery-code-wrapped PBS key it cannot decrypt + non-secret config; no per-app secrets. Escrowing app secrets there would regress that posture, so it was rejected.
- `app.yaml` (encrypted) + the encryption key live on the guest rootfs (`local-lvm:vm-9201-disk-0`, confirmed via `pct config`) → already inside the PBS whole-guest snapshot; the external drive (`mp0` bind) is not. So the secret↔data split maps onto the tiers: secrets ride PBS; bulk userdata rides the drive + (Phase 3) Tier 2.
- Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml (live, else PBS); regenerate nothing. `data_key` is a fail-closed annotation, not a preserve/regenerate decision.
What shipped (v0.53.0 + v0.53.1)
- Unit layout (rooted at the existing `backups/primary/<app>/`, a deliberate low-churn choice with no risky dump-dir migration): `compose/` (docker-compose.yml + .felhom.yml + a secret-stripped app.yaml) + the existing `db-dumps/` + `volume-dumps/` + `manifest.json`. New helpers `RecoveryUnit{Path,ComposePath,ManifestPath}` (`internal/appbackup/paths.go`).
- Secret-free manifest (`internal/backup/recovery_unit.go`): app id, display name, controller version, timestamp, drive, namespace root, image pins (image NOT stored; re-pulled on restore), the NAMES of secret env vars (values never stored), `data_key` env-var names, an explicit `secret_source` note, captured config-file list, enumerated dumps, sha256 checksums.
- Capture needs no secret access: non-secret env is plaintext in app.yaml, so the capture excludes secret-named keys (plus a defensive `crypto.IsEncrypted` guard) and reads no secret value. New `StackDataProvider.GetStackRecoveryInfo` + `RecoveryInfo`, implemented by the main.go `stackAdapter`; `ParseComposeImages` extracts pins.
- data_key: `DeployField.DataKey` + `Metadata.DataKeyEnvVars()`; catalog `adventurelog/.felhom.yml` `SECRET_KEY` ("Titkosítási kulcs", Hungarian for "encryption key") marked `data_key: true`.
- Refresh cadence (v0.53.1): capture runs from the daily DB dump AND the periodic `RefreshCache` (startup + every 5m). It is idempotent: content is built in memory and writes are skipped when the unit is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
- Tests: capture is secret-free (a secret in the source app.yaml never appears in the unit) + manifest structure + idempotency (unchanged → skip; config change → rewrite). `go build ./...` clean.
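
The capture and refresh behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the shipped code in `internal/backup/recovery_unit.go`: the names `stripSecrets`, `render`, `writeIfChanged`, and `looksEncrypted` are hypothetical, the `enc:` prefix stands in for whatever `crypto.IsEncrypted` actually checks, and the skip decision here compares only a content checksum, whereas the real unit also considers the dump set and the controller version.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// looksEncrypted is a hypothetical stand-in for the crypto.IsEncrypted guard.
// The "enc:" prefix is an assumption made only for this sketch.
func looksEncrypted(v string) bool { return strings.HasPrefix(v, "enc:") }

// stripSecrets returns the env entries that are safe to capture, plus the
// sorted NAMES of everything withheld. Secret values are never copied.
func stripSecrets(env map[string]string, secretNames map[string]bool) (map[string]string, []string) {
	kept := make(map[string]string)
	var stripped []string
	for k, v := range env {
		if secretNames[k] || looksEncrypted(v) {
			stripped = append(stripped, k)
			continue
		}
		kept[k] = v
	}
	sort.Strings(stripped)
	return kept, stripped
}

// render builds the file content in memory with sorted keys, so identical
// input always produces identical bytes and therefore an identical checksum.
func render(kept map[string]string) []byte {
	keys := make([]string, 0, len(kept))
	for k := range kept {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b bytes.Buffer
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s\n", k, kept[k])
	}
	return b.Bytes()
}

// checksum returns the hex-encoded sha256 of b.
func checksum(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// writeIfChanged writes content only when the file on disk differs, and
// reports whether a write happened.
func writeIfChanged(path string, content []byte) (bool, error) {
	if old, err := os.ReadFile(path); err == nil && checksum(old) == checksum(content) {
		return false, nil // already current: skip the write
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return false, err
	}
	return true, os.WriteFile(path, content, 0o600)
}

func main() {
	env := map[string]string{"DOMAIN": "example.org", "SUBDOMAIN": "romm", "DB_PASSWORD": "enc:abc"}
	kept, stripped := stripSecrets(env, map[string]bool{"DB_PASSWORD": true})

	dir, err := os.MkdirTemp("", "unit")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "compose", "app.yaml")
	first, err := writeIfChanged(path, render(kept))
	if err != nil {
		panic(err)
	}
	second, err := writeIfChanged(path, render(kept))
	if err != nil {
		panic(err)
	}
	fmt.Println(stripped, first, second)
}
```

Rendering with sorted keys is what makes the skip reliable in this sketch: Go map iteration order is randomized, so unsorted output could change the checksum without any real config change.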
Deploy mechanism (resolved this session)
The controller in guest 9201 is golden/bootstrap-managed: `felhom-controller-bootstrap.service` runs
`/usr/local/sbin/felhom-controller-bootstrap.sh`, which `docker run`s the tag from
`/etc/felhom-controller-image` (gitea anon-pull, no login). Deploy = build+push tag → anon-pull → update
that tag file → `systemctl restart felhom-controller-bootstrap.service`. Data volume + encryption key
persist. (This is what "self-update handles version drift" refers to.)
Live validation (guest 9201, demo-felhom)
- Deployed v0.53.1; on startup `RefreshCache` captured units: romm (images=3, secrets-referenced=3, data_keys=0) and actualbudget (images=1, system-fallback path `…/sys_drive/felhom-data/…`).
- RomM unit on disk: `compose/{app.yaml,docker-compose.yml,.felhom.yml}` + `db-dumps/romm-mariadb.sql` + `manifest.json`. Manifest is secret-free (image pins + secret NAMES + `secret_source`); captured app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
- Secret-leak grep against the three actual RomM secret values → `NO_LEAK`. Idempotency confirmed (single capture log line; the 5m refresh skips).
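
For repeatable runs, the secret-leak check could be expressed as a small Go helper instead of an ad-hoc grep. The sketch below is an assumption about how such a check might look, not the command used in the validation: `scanForLeaks` and `verdict` are hypothetical names, and the demo uses a throwaway directory and a made-up secret value.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// scanForLeaks walks root and returns the paths of regular files that
// contain any of the given secret values as a byte substring.
func scanForLeaks(root string, secrets [][]byte) ([]string, error) {
	var leaks []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // includes a missing or unreadable root
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		for _, s := range secrets {
			if len(s) > 0 && bytes.Contains(data, s) {
				leaks = append(leaks, path)
				break
			}
		}
		return nil
	})
	return leaks, err
}

// verdict maps the scan result onto the marker used in this report.
func verdict(leaks []string) string {
	if len(leaks) == 0 {
		return "NO_LEAK"
	}
	return "LEAK"
}

func main() {
	// Demo data only: a fake unit directory and a fake secret value.
	dir, err := os.MkdirTemp("", "unit")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "app.yaml"), []byte("DOMAIN=example.org\n"), 0o600); err != nil {
		panic(err)
	}

	leaks, err := scanForLeaks(dir, [][]byte{[]byte("s3cr3t-value")})
	if err != nil {
		panic(err)
	}
	fmt.Println(verdict(leaks))
}
```

A scan error is returned rather than folded into `NO_LEAK`, so an unreadable unit directory cannot be mistaken for a clean one.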
Not done — Phase 2b (the immediate next increment)
The restore-from-unit recreate (write compose/config back → re-pull image from pins → recover
secrets from the guest's app.yaml, live or via PBS → restore DB+volumes+userdata → boot), the
fail-closed data_key gate (refuse + warn if an encrypted app's key is unrecoverable), and the
live AdventureLog readable-data validation (deploy with an encryption key → back up → recreate →
confirm data decrypts). The existing RestoreApp still does the live-guest volume-tar restore. The
README backup-paths section still describes the stale restic/secondary layout; it will be rewritten when
Tier 2 (Phase 3) lands.
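
Since the fail-closed `data_key` gate is not implemented yet, the following is only one possible shape for it, sketched under assumptions: `checkDataKeys` and `ErrDataKeyUnrecoverable` are hypothetical names, the first argument stands for the `data_key` env-var names recorded in the manifest, and the map stands for whatever secrets were recovered from the guest's app.yaml (live or via PBS).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDataKeyUnrecoverable is a hypothetical sentinel for the refuse-and-warn path.
var ErrDataKeyUnrecoverable = errors.New("data_key unrecoverable: refusing to recreate")

// checkDataKeys fails closed: if any env var marked data_key has no recovered
// value, it returns an error naming the missing keys instead of letting the
// restore continue with a regenerated key that could not decrypt the data.
func checkDataKeys(dataKeyEnvVars []string, recovered map[string]string) error {
	var missing []string
	for _, name := range dataKeyEnvVars {
		if recovered[name] == "" {
			missing = append(missing, name)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("%w: missing %v", ErrDataKeyUnrecoverable, missing)
	}
	return nil
}

func main() {
	// Key present: the gate lets the restore proceed.
	fmt.Println(checkDataKeys([]string{"SECRET_KEY"}, map[string]string{"SECRET_KEY": "recovered"}))
	// Key absent: the gate refuses.
	fmt.Println(checkDataKeys([]string{"SECRET_KEY"}, map[string]string{}))
}
```

Apps with no `data_key` fields pass trivially, which matches the romm capture above (data_keys=0).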