v0.54.0: Phase 2b — restore-from-recovery-unit + fail-closed data-key gate

Restore recreates an app from its on-drive unit + the guest's own secrets,
regenerating nothing. reconcileRestoreSecrets (pure, unit-tested) merges the unit's
non-secret env with secrets recovered from the live app.yaml and FAILS CLOSED if a
data-encrypting key is unrecoverable (refuse — a PBS whole-guest restore is needed —
rather than regenerate and corrupt). Resettable secrets missing → warn + proceed.

- backup: RestoreFromRecoveryUnit (manifest -> recover secrets -> gate -> restore
  volumes -> recreate definition + redeploy w/ re-pull); falls back to volume-only.
- seams: RecoverStackSecrets/RecreateStackFromUnit (adapter +encKey),
  stacks.RedeployFromEnv. Wired into /backup/restore.
- tests: gate (refuse/proceed/verbatim) + data_key parsing.

Gate + reconcile + data_key parsing unit-tested; capture live-validated (v0.53.1).
Full readable-data e2e vs AdventureLog needs the auth-gated dashboard restore — pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 11:12:43 +02:00
parent 39d623a1c1
commit 7863e62f29
9 changed files with 377 additions and 1 deletions
@@ -1,5 +1,33 @@
## Changelog
### v0.54.0 — Phase 2b: restore-from-recovery-unit + fail-closed data-key gate (2026-06-13)
Restore now recreates an app from its on-drive recovery unit **plus the guest's own secrets** — never
from secrets stored in the unit (there are none), and **regenerating nothing**.
- **Fail-closed data-key gate** (`reconcileRestoreSecrets`, `internal/backup/restore_unit.go` — a pure,
exhaustively unit-tested function): merges the unit's non-secret env with the secret values recovered
from the guest's live app.yaml. A missing/empty **data-encrypting key** (`data_key`) **aborts the
restore** with a clear message (a PBS whole-guest restore is required) — because regenerating it would
render stored data unreadable. A missing *resettable* secret (DB/admin password) is non-fatal (warn +
proceed; the app may need a credential reset). Secrets are recovered, never regenerated.
- **`RestoreFromRecoveryUnit`**: reads the unit manifest → recovers secrets from the guest
(`RecoverStackSecrets`) → applies the gate → restores named-volume data from the unit's tars →
recovers the app definition from the unit and redeploys with the reconstructed env (re-pulling the
pinned image). Falls back to the legacy volume-only `RestoreApp` if no unit exists. Wired into the
`/backup/restore` web handler.
- **New seams:** `StackDataProvider.RecoverStackSecrets` / `RecreateStackFromUnit` (main.go
`stackAdapter`, with the controller `encKey` for decrypting the live app.yaml);
`stacks.Manager.RedeployFromEnv` (writes app.yaml from the full env incl. locked secrets, then
`compose up -d`).
- **Tests:** the gate (all recovered / data-key missing → refuse / empty data-key → refuse / resettable
missing → proceed+warn, recovered values used verbatim) and `data_key` parsing from `.felhom.yml`
(`Metadata.DataKeyEnvVars()`).
- **Validation status:** the gate + reconciliation + data_key parsing are unit-tested (authoritative for
the refuse/proceed/regenerate-nothing behaviour); the capture side is live-validated (v0.53.1, RomM).
The full live **readable-data e2e** against AdventureLog (deploy → back up → restore → confirm the
data decrypts) requires triggering the **auth-gated** `/backup/restore` from the dashboard; it is
pending an operator run on the demo.
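
The fail-closed gate described in the v0.54.0 entry above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code in `internal/backup/restore_unit.go`: the function name `reconcileSketch`, its signature, the sentinel `ErrDataKeyUnrecoverable`, and the env var names (`SECRET_KEY`, `DB_PASSWORD`, `TZ`) are all hypothetical. It also assumes that a recovered secret takes precedence if the same key appears in the unit's env.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDataKeyUnrecoverable is a hypothetical sentinel error; the real
// implementation may report the refusal differently.
var ErrDataKeyUnrecoverable = errors.New(
	"data-encrypting key unrecoverable; a PBS whole-guest restore is required")

// reconcileSketch merges the unit's non-secret env with secrets recovered
// from the guest. It is pure: no I/O, and nothing is ever generated.
//   - unitEnv:    non-secret env stored in the recovery unit
//   - recovered:  secret values read back from the guest's live app.yaml
//   - dataKeys:   env vars that encrypt stored data (must be recovered)
//   - resettable: env vars such as DB/admin passwords (may be missing)
func reconcileSketch(unitEnv, recovered map[string]string,
	dataKeys, resettable []string) (map[string]string, []string, error) {

	// Fail closed: a missing or empty data key aborts before any merge.
	for _, k := range dataKeys {
		if recovered[k] == "" {
			return nil, nil, fmt.Errorf("%w: %s", ErrDataKeyUnrecoverable, k)
		}
	}

	env := make(map[string]string, len(unitEnv)+len(recovered))
	for k, v := range unitEnv {
		env[k] = v
	}
	// Recovered values are copied verbatim; empty ones are skipped.
	for k, v := range recovered {
		if v != "" {
			env[k] = v
		}
	}

	// A missing resettable secret is non-fatal: warn and proceed.
	var warnings []string
	for _, k := range resettable {
		if recovered[k] == "" {
			warnings = append(warnings,
				fmt.Sprintf("secret %s not recovered; the app may need a credential reset", k))
		}
	}
	return env, warnings, nil
}

func main() {
	unitEnv := map[string]string{"TZ": "UTC"}
	dataKeys := []string{"SECRET_KEY"}
	resettable := []string{"DB_PASSWORD"}

	// Data key recovered, resettable secret missing: proceed with a warning.
	env, warns, err := reconcileSketch(unitEnv,
		map[string]string{"SECRET_KEY": "abc123"}, dataKeys, resettable)
	fmt.Println(env, warns, err)

	// Data key present but empty: the restore is refused.
	_, _, err = reconcileSketch(unitEnv,
		map[string]string{"SECRET_KEY": "", "DB_PASSWORD": "pw"}, dataKeys, resettable)
	fmt.Println(errors.Is(err, ErrDataKeyUnrecoverable))
}
```

Checking the data keys before building the merged env means a refusal returns no partial result, so a caller cannot accidentally redeploy with an incomplete environment.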
### v0.53.1 — Phase 2: recovery units refresh on the periodic cache cycle (idempotent) (2026-06-13)
The recovery-unit capture now also runs from `RefreshCache` (controller startup + every 5m), not only