slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0)
internal/quiesce: poll /backup/due -> quiesce (stop app stacks) -> POST /backup
-> poll /backup/status -> unquiesce (restart exactly those). Crash-safety:
persisted marker before stopping, guaranteed unquiesce (defer), max-quiesce
guard, startup Recover, single-flight. agentapi BackupDue/StartBackup/BackupStatus;
stacks.RunningAppStacks(); config QuiesceConfig; main wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -1,5 +1,37 @@
## Changelog

### v0.36.0 — slice 8B: app-consistent backup quiesce loop (stack-stop) (2026-06-10)

The in-guest controller half of slice 8B (doc 03 §6/§8). Pairs with `felhom-agent` v0.11.0. An
agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze); this makes app-consistency
the controller's job — it stops its app stacks around the backup so the captured state is
clean-shutdown-consistent.

#### Added
- **`internal/quiesce`** — the background quiesce loop: poll the agent's `GET /backup/due` → when
  due, **quiesce** (stop deployed, non-protected, running stacks) → `POST /backup` → poll
  `GET /backup/status` to `done`/`failed` → **unquiesce** (restart exactly the stacks it stopped).
- **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):**
  a persisted **marker** (atomic, `0600`) written **before** stopping anything; **guaranteed
  unquiesce** (a deferred closure restarts the stacks on a backup error, a status-poll error, the
  max-quiesce bound, or context cancellation); a **max-quiesce-duration** hard bound that restarts
  the app no matter what (the backup continues on the agent); **crash recovery** at startup
  (`Recover()` restarts stacks left stopped by a mid-quiesce crash, then clears the marker); and the
  marker as a **single-flight** guard.
- **`agentapi`**: `BackupDue` / `StartBackup` / `BackupStatus` methods + a `post` helper.
- **`stacks.Manager.RunningAppStacks()`** — deployed, non-protected, currently-up stacks (protected
  infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order.
- **`config.QuiesceConfig`** (`quiesce`: enabled, poll_interval, status_poll_interval,
  max_quiesce_duration). Wired in `main.go`: `Recover()` at startup, then the loop goroutine, gated on
  the local API being configured (a provisioned guest) + quiesce enabled.
#### Tests
- happy path (stop → backup → poll done → restart exactly those, in order; marker cleared);
  **backup-start failure → stacks STILL restarted**; failed phase → restarted; **max-quiesce guard →
  restarted at the bound**; **crash recovery → marker stacks restarted + cleared**; single-flight (no
  second backup while a marker is active); **only the stacks we stopped are restarted** (an
  already-stopped stack is never started); and **marker-written-before-stop** ordering.
### v0.35.0 — slice 8A: bootstrap.json ingestion + pinned agent local-API client (2026-06-10)

The in-guest controller half of slice 8A (doc 03 §6). Pairs with `felhom-agent` v0.10.0. No