docs: Phase 2b — REPORT/CONTEXT for restore-from-unit + fail-closed gate

REPORT updated (v0.54.0 restore side, honest validation status: gate+orchestration unit-tested, capture live-validated, readable-data e2e pending auth-gated dashboard). CONTEXT dated entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+9
-1
@@ -28,7 +28,15 @@ Last updated: 2026-06-12 (storage UX polish)
 > `felhom-controller-bootstrap.service` docker-runs the tag from `/etc/felhom-controller-image`
 > (gitea anon-pull). Deploy = build+push → anon-pull → update tag file → restart the service.
 > - **Live-validated (9201):** RomM unit captured (images=3, secrets=3, data_keys=0), secret-leak grep
-> = NO_LEAK. Next: Phase 2b restore-from-unit recreate + fail-closed gate + AdventureLog readable-data.
+> = NO_LEAK.
+> - **v0.54.0 Phase 2b (restore-from-unit + fail-closed gate):** `RestoreFromRecoveryUnit` recreates an
+> app from its unit + secrets recovered from the GUEST's live app.yaml (`RecoverStackSecrets`,
+> `stacks.RedeployFromEnv`), regenerating nothing. `reconcileRestoreSecrets` (pure, unit-tested) is the
+> fail-closed gate: missing/empty data-key → REFUSE (needs PBS whole-guest restore); missing resettable
+> secret → warn+proceed. Wired into `/backup/restore`. Gate + orchestration + data_key parsing
+> unit/integration-tested; deployed v0.54.0 healthy. **PENDING:** live readable-data e2e vs AdventureLog
+> needs the auth-gated dashboard restore (no web cred in bootstrap.json) — operator-run.
+> - Next: Phase 3 (Tier 2 auto off-drive, rootfs-headroom guard), Phase 4 (FileBrowser + UI).
 >
 > **2026-06-13 — v0.52.0 Phase 1 GATE: deploy-side double-nest fix (catalog) + path-agreement test:**
 > - The `felhom-data` double-nest lived in the **app-catalog compose templates**
@@ -1,8 +1,9 @@
-# REPORT — felhom-controller v0.53.1 (Phase 2: per-app recovery unit, capture side)
+# REPORT — felhom-controller v0.54.0 (Phase 2: recovery unit — capture + restore)
 
-Each app's on-drive backup is now a self-contained, recreatable **recovery unit** — and it is
-**secret-free by design**. Built, unit-tested, shipped to `main`, and validated live on guest 9201.
-(Phase 1, the deploy-side double-nest GATE, shipped earlier as v0.52.0 — see git history.)
+Each app's on-drive backup is a self-contained, recreatable **recovery unit** (secret-free), and restore
+now **recreates an app from its unit + the guest's own secrets** with a **fail-closed data-key gate**.
+Built, unit/integration-tested, shipped to `main`, deployed to guest 9201. (Phase 1, the deploy-side
+double-nest GATE, shipped as v0.52.0; Phase 2 capture side as v0.53.x — see git history.)
 
 ## The design decision that shaped Phase 2 (secret handling)
 The recovery unit carries **no secrets, no data-keys, and not the Docker image**. This was decided after
@@ -56,11 +57,31 @@ persist. (This is what "self-update handles version drift" refers to.)
 - **Secret-leak grep against the three actual RomM secret values → `NO_LEAK`.** Idempotency confirmed
 (single capture log line; the 5m refresh skips).
 
-## Not done — Phase 2b (the immediate next increment)
-The restore-from-unit **recreate** (write compose/config back → re-pull image from pins → recover
-secrets from the guest's app.yaml, live or via PBS → restore DB+volumes+userdata → boot), the
-**fail-closed `data_key` gate** (refuse + warn if an encrypted app's key is unrecoverable), and the
-live **AdventureLog readable-data** validation (deploy with an encryption key → back up → recreate →
-confirm data decrypts). The existing `RestoreApp` still does the live-guest volume-tar restore. The
-README backup-paths section still describes the stale restic/secondary layout — rewritten when Tier 2
-(Phase 3) lands.
+## Phase 2b — restore-from-unit + fail-closed gate (v0.54.0)
+- **`reconcileRestoreSecrets`** (pure, exhaustively unit-tested): merges the unit's non-secret env with
+the secrets recovered from the guest's live app.yaml. A missing/empty **data-encrypting key** aborts
+the restore (a PBS whole-guest restore is required) — regenerating it would corrupt data. A missing
+resettable secret is non-fatal (warn + proceed). **Regenerates nothing.**
+- **`RestoreFromRecoveryUnit`**: manifest → recover secrets from the guest → gate → restore named-volume
+tars → recover the app definition from the unit → redeploy with the reconstructed env (re-pull pinned
+image). Falls back to volume-only `RestoreApp` when no unit exists. Wired into `/backup/restore`.
+- Seams: `RecoverStackSecrets` / `RecreateStackFromUnit` (adapter, with `encKey` to decrypt the live
+app.yaml); `stacks.RedeployFromEnv`. `isDebug` made nil-safe.
+- **Tests:** the gate (recovered / data-key-missing→refuse / empty-data-key→refuse / resettable-missing
+→proceed, values used verbatim), the full orchestration (success→recreate-with-merged-env;
+data-key-missing→refused, recreate never called), and `data_key` parsing from `.felhom.yml`.
+
+## Validation status (honest)
+- **Unit/integration-tested (authoritative):** the fail-closed gate, the restore orchestration, secret
+reconciliation (regenerate-nothing), and the catalog→metadata `data_key` flow.
+- **Live-validated:** the capture side (v0.53.1, RomM — secret-free unit, NO_LEAK grep); v0.54.0 deployed
++ healthy + capture regression clean.
+- **PENDING (auth-gated):** the full live **readable-data e2e** vs AdventureLog (deploy with an
+encryption key → back up → restore → confirm data decrypts) needs triggering the session-authed
+`/backup/restore` from the dashboard. `bootstrap.json` carries no web credential and the password is a
+bcrypt hash, so this needs an operator-run (or the demo dashboard password).
+
+## Still ahead
+Phase 3 (auto off-drive Tier 2 with rootfs-headroom guard) and Phase 4 (FileBrowser scoping + deploy-UI
+DB-on-SSD note + monitoring sort). The README backup-paths section still shows the stale restic/secondary
+layout — rewritten when Tier 2 lands.