REPORT: slice 8B controller half (app-consistent backup quiesce loop, v0.36.0)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -1,52 +1,53 @@
|
|||||||
# REPORT — slice 8A (controller half): bootstrap.json ingestion + pinned agent local-API client (v0.35.0) (2026-06-10)
|
# REPORT — slice 8B (controller half): app-consistent backup quiesce loop (v0.36.0) (2026-06-10)
|
||||||
|
|
||||||
> Overwrite-latest report (most recent significant work only). Cumulative history lives in
|
> Overwrite-latest report (most recent significant work only). Cumulative history lives in
|
||||||
> [CHANGELOG.md](CHANGELOG.md). Implements the in-guest controller half of `TASK — Slice 8A`. The
|
> [CHANGELOG.md](CHANGELOG.md). Implements the in-guest controller half of `TASK — Slice 8B`. The
|
||||||
> host-agent half is in `felhom-agent` v0.10.0. No hub change.
|
> host-agent half is in `felhom-agent` v0.11.0. No hub change.
|
||||||
|
|
||||||
## Outcome
|
## Outcome
|
||||||
|
|
||||||
The controller now ingests the host agent's stable **`bootstrap.json`** on first run and reaches the
|
An agent-initiated vzdump is crash-consistent only (an LXC has no fsfreeze). This makes
|
||||||
agent's **local API** over the bridge with a pinned client. With the agent half, this completes the
|
app-consistency the controller's job: a background loop polls the agent's `GET /backup/due`, and when
|
||||||
provisioning chain: a freshly provisioned guest's controller **comes up configured (skipping the
|
due **stops its app stacks** around the agent backup so the captured state is clean-shutdown-consistent,
|
||||||
setup wizard) and talks to the agent** — validated live end-to-end on the demo. No behaviour change
|
then restarts exactly those stacks. **Validated live end-to-end**, including the load-bearing restore
|
||||||
for an already-configured controller.
|
contrast and crash-safety.
|
||||||
|
|
||||||
## What landed
|
## What landed
|
||||||
|
|
||||||
- **`internal/bootstrap`** — first-run **`bootstrap.json` ingestion** (config-contract decision (c)).
|
- **`internal/quiesce`** — the loop: `GET /backup/due` → (if due) **quiesce** (stop deployed,
|
||||||
On startup, if the controller is NOT yet configured AND the agent's back-half attached a
|
non-protected, running stacks) → `POST /backup` → poll `GET /backup/status` to `done`/`failed` →
|
||||||
`bootstrap.json` config mount, the controller **seeds `controller.yaml` from it and comes up
|
**unquiesce** (restart exactly the stacks it stopped).
|
||||||
configured, skipping setup mode**. **Idempotent** (an existing `controller.yaml` is never clobbered)
|
- **Crash-safety (the centerpiece — a stranded-down app is worse than a crash-consistent backup):**
|
||||||
and **fail-safe** (a malformed / absent / missing-identity / unsupported-schema bootstrap leaves the
|
a persisted **marker** (atomic, `0600`) written **before** stopping anything; **guaranteed
|
||||||
controller in setup mode — logs, never crashes). The agent emits the stable contract; the controller
|
unquiesce** via a deferred closure (restart on a backup error, a status-poll error, the max-quiesce
|
||||||
owns the translation (decoupled — no shared config schema).
|
bound, or context cancellation); a **max-quiesce-duration** hard bound (restart no matter what — the
|
||||||
- **`internal/agentapi`** — a minimal **pinned client** for the agent local API: reaches the agent
|
backup finishes on the agent); **crash recovery** at startup (`Recover()` restarts stacks left
|
||||||
over the bridge **pinning the agent leaf-cert SHA-256** from the bootstrap (fails closed on
|
stopped by a mid-quiesce crash, then clears the marker); the marker also single-flights the loop.
|
||||||
mismatch — exact leaf-DER match in `VerifyPeerCertificate`, the same pin convention the agent uses
|
- **`agentapi`**: `BackupDue` / `StartBackup` / `BackupStatus` + a `post` helper (leaf-pinned client).
|
||||||
for the Proxmox/PBS host certs) + the per-guest bearer token. In 8A it exercises `GET /storage`
|
- **`stacks.Manager.RunningAppStacks()`** — deployed, non-protected, currently-up stacks (protected
|
||||||
(connectivity + the controller learning its mounts); the `/backup/due` quiesce loop is 8B.
|
infra — traefik/cloudflared/felhom-controller — is never stopped), sorted for deterministic order.
|
||||||
- **`config.LocalAPIConfig`** (`local_api`: endpoint, fingerprint, token) seeded from the bootstrap;
|
- **`config.QuiesceConfig`** + `main.go` wiring: `Recover()` at startup, then the loop goroutine, gated
|
||||||
a startup **probe** proves the channel and logs this guest's mounts (non-fatal).
|
on the local API being configured (a provisioned guest) + quiesce enabled.
|
||||||
|
|
||||||
## Tests
|
## Tests
|
||||||
|
|
||||||
`go test ./...` green. bootstrap: seeds when unconfigured (reloads configured, skips setup); never
|
`go test ./...` green. quiesce: happy path (stop → backup → poll done → restart exactly those, in
|
||||||
clobbers a configured controller; stays in setup on malformed / missing-identity / unsupported-schema
|
order; marker cleared); **backup-start failure → stacks STILL restarted**; failed phase → restarted;
|
||||||
/ absent bootstrap. agentapi: correct pin + token reaches `/storage`; a **wrong pin fails closed**; a
|
**max-quiesce guard → restarted at the bound**; **crash recovery → marker stacks restarted + cleared**;
|
||||||
bad fingerprint is rejected at construction; colon-separated fingerprints accepted.
|
single-flight; **only the stacks we stopped are restarted**; **marker-written-before-stop** ordering.
|
||||||
|
|
||||||
## Live validation (demo-felhom)
|
## Live validation (demo-felhom)
|
||||||
|
|
||||||
The controller (v0.35.0, **baked into the new golden**, deployed by the golden's bootstrap unit with
|
A provisioned guest with a postgres app stack: the loop quiesced `[pgtest]` → `POST /backup` (real
|
||||||
no registry pull) **ingested `bootstrap.json` → seeded `controller.yaml` → came up CONFIGURED**
|
agent vzdump, taken with pg stopped) → `done` → unquiesce (postgres uptime reset confirms stop+restart);
|
||||||
(`felhom-controller v0.35.0 starting (customer: cust-8201, …)`, not setup mode), then reached the
|
`/backup/due` then went false (no re-loop). **Load-bearing restore contrast:** the **quiesced** backup
|
||||||
agent's real local-API `GET /storage` over the bridge (leaf-pin + token) → **channel up, 1 mount
|
restored to a scratch guest → postgres `database system was shut down` (**clean, no recovery**); an
|
||||||
visible**. The test guest was torn down after validation.
|
**un-quiesced** (crash-consistent, pg running) backup restored → postgres `redo starts … redo done`
|
||||||
|
(**WAL crash recovery**). **Crash-safety live:** the controller was hard-killed mid-quiesce (pg down,
|
||||||
|
marker present) → on restart `[quiesce] crash recovery … restarting them` fired and postgres came back
|
||||||
|
up, marker cleared. The backup-failure → restart path is unit-tested.
|
||||||
|
|
||||||
## Deferred (stated, not built)
|
## Deferred (stated, not built)
|
||||||
|
|
||||||
The full local-API usage — the `GET /backup/due`-driven **quiesce → `POST /backup` → unquiesce**
|
**8B.2** downtime optimization (snapshot mode + a `snapshotted` phase so the controller resumes at
|
||||||
app-consistent loop → 8B. Controller de-privileging (retire the disk-execution subsystem; customer
|
snapshot-taken). Controller de-privileging + customer disk endpoints → **8C**. No secrets committed.
|
||||||
disk endpoints behind the slice-4 classifier) → 8C. No secrets committed (the per-guest token arrives
|
|
||||||
via the agent-written 0600 bootstrap mount; it is never logged or committed).
|
|
||||||
|
|||||||
Reference in New Issue
Block a user