# REPORT — felhom-controller v0.54.0 (Phase 2: recovery unit — capture + restore)

Each app's on-drive backup is a self-contained, recreatable **recovery unit** (secret-free), and restore
now **recreates an app from its unit + the guest's own secrets** with a **fail-closed data-key gate**.
Built, unit/integration-tested, shipped to `main`, deployed to guest 9201. (Phase 1, the deploy-side
double-nest GATE, shipped as v0.52.0; Phase 2 capture side as v0.53.x — see git history.)

## The design decision that shaped Phase 2 (secret handling)

The recovery unit carries **no secrets, no data-keys, and not the Docker image**. This was decided after
reading the *actual* hub code (the controller README that implies the hub stores app.yaml is stale,
pre-strip):

- The hub is deliberately **zero-knowledge** — it holds a per-host recovery-code-wrapped PBS key it
  cannot decrypt, plus non-secret config; **no per-app secrets**. Escrowing app secrets there would
  regress that posture, so it was rejected.
- `app.yaml` (encrypted) + the encryption key live on the **guest rootfs** (`local-lvm:vm-9201-disk-0`,
  confirmed via `pct config`) → already inside the **PBS whole-guest snapshot**; the external drive
  (`mp0` bind) is not. So the secret↔data split maps onto the tiers: **secrets ride PBS; bulk userdata
  rides the drive + (Phase 3) Tier 2.**
- Therefore: secret-free unit; restore recovers the original secrets from the guest's own app.yaml
  (live, else PBS); **regenerate nothing**. `data_key` is a fail-closed annotation, not a
  preserve/regenerate decision.

## What shipped (v0.53.0 + v0.53.1)

- **Unit layout** (rooted at the existing `backups/primary/<app>/` — a deliberate low-churn choice, no
  risky dump-dir migration): `compose/` (docker-compose.yml + .felhom.yml + a **secret-stripped**
  app.yaml) + the existing `db-dumps/` + `volume-dumps/` + `manifest.json`. New helpers
  `RecoveryUnit{Path,ComposePath,ManifestPath}` (`internal/appbackup/paths.go`).
- **Secret-free manifest** (`internal/backup/recovery_unit.go`): app id, display name, controller
  version, timestamp, drive, namespace root, **image pins** (image NOT stored — re-pulled on restore),
  the **NAMES** of secret env vars (values never stored), `data_key` env-var names, an explicit
  `secret_source` note, captured config-file list, enumerated dumps, sha256 checksums.
- **Capture needs no secret access:** non-secret env is plaintext in app.yaml, so the capture excludes
  secret-named keys (plus a defensive `crypto.IsEncrypted` guard) and reads no secret value. New
  `StackDataProvider.GetStackRecoveryInfo` + `RecoveryInfo`, implemented by the main.go `stackAdapter`;
  `ParseComposeImages` extracts pins.
- **`data_key`**: `DeployField.DataKey` + `Metadata.DataKeyEnvVars()`; catalog `adventurelog/.felhom.yml`
  `SECRET_KEY` ("Titkosítási kulcs", Hungarian for "Encryption key") marked `data_key: true`.
- **Refresh cadence (v0.53.1):** capture runs from the daily DB dump AND the periodic `RefreshCache`
  (startup + every 5m), **idempotent** — content is built in memory and writes are skipped when the unit
  is already current (checksum + dump-set + version), so a spinning USB drive is not thrashed.
- **Tests:** capture is secret-free (a secret in the source app.yaml never appears in the unit) +
  manifest structure + idempotency (unchanged → skip; config change → rewrite). `go build ./...` clean.

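
The capture rule described above (drop secret-named keys, guard against any value that still looks encrypted, record only the names) can be sketched roughly as follows. This is an illustration under stated assumptions, not the controller's code: `stripSecrets`, `CapturedEnv`, and `looksEncrypted` are hypothetical names, and the `enc:` prefix only stands in for whatever `crypto.IsEncrypted` really checks.

```go
package main

import (
	"sort"
	"strings"
)

// CapturedEnv is a hypothetical result type: the plaintext env kept in the
// unit, plus the NAMES (never the values) of everything that was stripped.
type CapturedEnv struct {
	Env         map[string]string
	SecretNames []string
}

// looksEncrypted stands in for the report's crypto.IsEncrypted guard.
// The "enc:" prefix is an assumption made for this sketch only.
func looksEncrypted(v string) bool {
	return strings.HasPrefix(v, "enc:")
}

// stripSecrets copies env, dropping every key named in secretNames and,
// defensively, any other key whose value still looks encrypted.
func stripSecrets(env map[string]string, secretNames []string) CapturedEnv {
	secret := make(map[string]bool, len(secretNames))
	for _, n := range secretNames {
		secret[n] = true
	}
	out := CapturedEnv{Env: map[string]string{}}
	for k, v := range env {
		// Secret-named keys are skipped without inspecting their value.
		if secret[k] || looksEncrypted(v) {
			out.SecretNames = append(out.SecretNames, k)
			continue
		}
		out.Env[k] = v
	}
	sort.Strings(out.SecretNames) // stable order keeps the manifest reproducible
	return out
}
```

Sorting the stripped names is a choice of this sketch: a deterministic manifest makes a later "has anything changed?" comparison cheap.
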
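
The idempotent refresh can be pictured as a pure comparison between the state already on the drive and the state just built in memory. The sketch below assumes the three inputs the report names (checksum, dump set, controller version); `unitState`, `stateOf`, and `needsRewrite` are hypothetical names, and the real comparison may cover more fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// unitState is a hypothetical summary of a recovery unit, limited to the
// three things the report says are compared before writing.
type unitState struct {
	ConfigSHA256      string
	Dumps             []string
	ControllerVersion string
}

// stateOf builds the summary in memory; nothing touches the drive here.
func stateOf(config []byte, dumps []string, version string) unitState {
	sum := sha256.Sum256(config)
	sorted := append([]string(nil), dumps...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	return unitState{
		ConfigSHA256:      hex.EncodeToString(sum[:]),
		Dumps:             sorted,
		ControllerVersion: version,
	}
}

// needsRewrite reports whether the freshly built state differs from the
// one recorded on disk. When it returns false the write is skipped.
func needsRewrite(onDisk, fresh unitState) bool {
	if onDisk.ConfigSHA256 != fresh.ConfigSHA256 ||
		onDisk.ControllerVersion != fresh.ControllerVersion ||
		len(onDisk.Dumps) != len(fresh.Dumps) {
		return true
	}
	for i := range onDisk.Dumps {
		if onDisk.Dumps[i] != fresh.Dumps[i] {
			return true
		}
	}
	return false
}
```

With a check of this shape, a 5-minute refresh on an unchanged app costs one read of the recorded state and no write, which is the behaviour the cadence bullet describes.
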
## Deploy mechanism (resolved this session)

The controller in guest 9201 is **golden/bootstrap-managed**: `felhom-controller-bootstrap.service` runs
`/usr/local/sbin/felhom-controller-bootstrap.sh`, which `docker run`s the tag from
`/etc/felhom-controller-image` (gitea anon-pull, no login). Deploy = build+push tag → anon-pull → update
that tag file → `systemctl restart felhom-controller-bootstrap.service`. Data volume + encryption key
persist. (This is what "self-update handles version drift" refers to.)

## Live validation (guest 9201, demo-felhom)

- Deployed v0.53.1; on startup `RefreshCache` captured units: **romm** (`images=3, secrets-referenced=3,
  data_keys=0`) and **actualbudget** (`images=1`, system-fallback path `…/sys_drive/felhom-data/…`).
- RomM unit on disk: `compose/{app.yaml,docker-compose.yml,.felhom.yml}` + `db-dumps/romm-mariadb.sql` +
  `manifest.json`. Manifest is secret-free (image pins + secret NAMES + `secret_source`); captured
  app.yaml holds only DOMAIN/HDD_PATH/SUBDOMAIN with the three secret names listed as stripped.
- **Secret-leak grep against the three actual RomM secret values → `NO_LEAK`.** Idempotency confirmed
  (single capture log line; the 5m refresh skips).

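
The leak check above was a grep on the guest. The same idea could also be kept as a Go helper for regression tests; the sketch below is one way to do that, with `findLeaks` as a hypothetical name and the unit's files passed in as an in-memory map rather than read from the drive.

```go
package main

import (
	"bytes"
	"sort"
)

// findLeaks returns the names of files whose content contains any of the
// given secret values. Empty secrets are ignored, because an empty string
// would match every file. An empty result corresponds to NO_LEAK.
func findLeaks(files map[string][]byte, secrets []string) []string {
	var leaked []string
	for name, content := range files {
		for _, s := range secrets {
			if s != "" && bytes.Contains(content, []byte(s)) {
				leaked = append(leaked, name)
				break // one hit is enough to flag this file
			}
		}
	}
	sort.Strings(leaked)
	return leaked
}
```

A plain substring match like this only catches verbatim values; it would not notice a secret that was re-encoded (for example base64) before being written.
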
## Phase 2b — restore-from-unit + fail-closed gate (v0.54.0)

- **`reconcileRestoreSecrets`** (pure, exhaustively unit-tested): merges the unit's non-secret env with
  the secrets recovered from the guest's live app.yaml. A missing/empty **data-encrypting key** aborts
  the restore (a PBS whole-guest restore is required) — regenerating it would corrupt data. A missing
  resettable secret is non-fatal (warn + proceed). **Regenerates nothing.**
- **`RestoreFromRecoveryUnit`**: manifest → recover secrets from the guest → gate → restore named-volume
  tars → recover the app definition from the unit → redeploy with the reconstructed env (re-pull pinned
  image). Falls back to volume-only `RestoreApp` when no unit exists. Wired into `/backup/restore`.
- Seams: `RecoverStackSecrets` / `RecreateStackFromUnit` (adapter, with `encKey` to decrypt the live
  app.yaml); `stacks.RedeployFromEnv`. `isDebug` made nil-safe.
- **Tests:** the gate (recovered / data-key-missing→refuse / empty-data-key→refuse / resettable-missing
  →proceed, values used verbatim), the full orchestration (success→recreate-with-merged-env;
  data-key-missing→refused, recreate never called), and `data_key` parsing from `.felhom.yml`.

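
The gate's behaviour, as described in the bullets above, fits in a small pure function. The following is a sketch of that behaviour only: `reconcileSketch` and `errDataKeyMissing` are hypothetical names, and the signature is an assumption rather than the actual `reconcileRestoreSecrets` API.

```go
package main

import (
	"errors"
	"fmt"
)

// errDataKeyMissing marks the fail-closed outcome (hypothetical name).
var errDataKeyMissing = errors.New("data key missing or empty; PBS whole-guest restore required")

// reconcileSketch merges the unit's non-secret env with secrets recovered
// from the guest. It never generates a value: a missing data key refuses
// the restore, a missing resettable secret only produces a warning.
func reconcileSketch(unitEnv map[string]string, secretNames, dataKeyNames []string,
	recovered map[string]string) (map[string]string, []string, error) {
	// Gate first, before anything is merged: a missing map entry and an
	// empty string are treated the same way.
	for _, name := range dataKeyNames {
		if recovered[name] == "" {
			return nil, nil, fmt.Errorf("%w: %s", errDataKeyMissing, name)
		}
	}
	env := make(map[string]string, len(unitEnv)+len(secretNames)+len(dataKeyNames))
	for k, v := range unitEnv {
		env[k] = v
	}
	for _, name := range dataKeyNames {
		env[name] = recovered[name] // used verbatim
	}
	var warnings []string
	for _, name := range secretNames {
		if v := recovered[name]; v != "" {
			env[name] = v // used verbatim
		} else {
			warnings = append(warnings, "secret not recovered: "+name)
		}
	}
	return env, warnings, nil
}
```

Running the gate before building the merged env means a refused restore returns nothing a caller could redeploy by accident, which matches the "recreate never called" test case.
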
## Validation status (honest)

- **Unit/integration-tested (authoritative):** the fail-closed gate, the restore orchestration, secret
  reconciliation (regenerate-nothing), and the catalog→metadata `data_key` flow.
- **Live-validated:** the capture side (v0.53.1, RomM — secret-free unit, NO_LEAK grep); v0.54.0 deployed
  + healthy + capture regression clean.
- **PENDING (auth-gated):** the full live **readable-data e2e** vs AdventureLog (deploy with an
  encryption key → back up → restore → confirm data decrypts) requires triggering the session-authed
  `/backup/restore` from the dashboard. `bootstrap.json` carries no web credential and the password is
  stored only as a bcrypt hash, so an operator has to run this step (or supply the demo dashboard
  password).

## Still ahead

Phase 3 (auto off-drive Tier 2 with rootfs-headroom guard) and Phase 4 (FileBrowser scoping + deploy-UI
DB-on-SSD note + monitoring sort). The README backup-paths section still shows the stale restic/secondary
layout; it will be rewritten when Tier 2 lands.