Files
felhom-controller/REPORT.md
T
2026-06-10 15:02:30 +02:00

2.5 KiB

REPORT — slice 8B.2 (controller half): resume at snapshotted (v0.38.0) (2026-06-10)

Overwrite-latest report. Cumulative history: CHANGELOG.md. Implements the controller half of TASK — Slice 8B.2. Pairs with felhom-agent v0.13.0. No hub change.

Outcome

The quiesce loop (8B) kept the app stopped for the whole backup. In snapshot mode the app only needs to be stopped until the storage snapshot is taken; after that vzdump reads from the snapshot. The controller now resumes its app at the agent's snapshotted phase instead of done — app downtime drops from whole-backup to until-snapshot, with no loss of app-consistency. Measured live: ~3s vs ~23s (~87% cut), restore still clean.

What landed (internal/quiesce)

  • The status-poll loop resumes (StartStack + clears the marker) at snapshotted, then keeps polling to done/failed — so a new backup isn't started until this one truly finishes and a post-snapshot failure is still observed (the backup isn't "successful" until done; the early resume does not mark it done).
  • Fallback: if snapshotted never arrives (stop/downgraded storage), it resumes at done exactly as 8B. The agent only emits snapshotted when the actual mode is snapshot.
  • Crash-safety unchanged: marker written before stop; guaranteed unquiesce (deferred); startup Recover(). A failure after snapshotted is harmless — the app is already up.

Tests

go build ./... + go test ./... green. quiesce: resume at snapshotted (RESUME event before done, marker cleared, then tracked to done); stop-mode fallback (resume at done, no snapshotted); fail-after-snapshotted (single resume, app stays up); the 8B crash-safety tests stay green.

Live validation (demo-felhom)

A provisioned controller v0.38.0 with a postgres stack, short quiesce poll: timeline — quiescing [pgtest] 12:58:45 → snapshotted — resuming app early 12:58:48 → backup done 12:59:08. App downtime ≈ 3s (vs ≈ 23s to done). The snapshot backup restored to a scratch guest came up clean (database system was shut down at 12:58:45, no WAL replay) — the early resume preserved app-consistency. The controller kept tracking to done after resuming (no overlapping backup).

Deferred / dependency

Snapshot-capable storage (lvm-thin/ZFS) required for the win; stop/downgraded storage falls back to resume-at-done (8B). No consistency-contract or crash-safety change. No secrets committed.