# Architecture Part 3 — The Host Agent

> Status: design draft (decision content). To be grounded by Claude Code against
> `docs/proxmox-platform.md` and `docs/architecture/02-controller-module-map.md`,
> then placed at `docs/architecture/03-host-agent.md`.
>
> Builds on Part 1 (`01-topology-and-trust.md`) and Part 2 (`02-controller-module-map.md`).
> Where this doc and the locked decisions disagree, the locked decisions win and this
> draft is wrong — flag it.

## 1. Purpose & scope

The **host agent** is the operator-tier component that runs on each Proxmox host and
owns *all* Proxmox interaction. It is the trusted host actor: it provisions and restores
guests, manages host storage, orchestrates backups and restore-tests, watches the host
and the tunnel, talks to the hub, and exposes a narrow local API to the in-guest
controllers it deploys.

It is the privileged tier. The controller deliberately holds **no** Proxmox credentials
(Part 1) — the privilege the controller shed by losing `storage/` did not disappear, it
**moved here**. That makes the agent's hardening and blast-radius discipline the most
security-sensitive part of the platform.

The agent manages a **set** of guests on its host (usually one customer = one guest, but
the multi-tenant/company case is not precluded — the agent's data model is per-host,
N-guests, never "the guest").

## 2. Responsibilities (and explicit non-responsibilities)

Owns:

1. **Proxmox lifecycle** — create/start/stop/destroy guests, snapshots, storage allocation. Via a scoped Proxmox API token (the **`FelhomAgent` operator role** — `proxmox-platform.md` §3.6, validated Phase 3 B3) for everything the API covers; raw host ops only where unavoidable.
2. **Storage management** — attach/classify targets, reconcile the storage manifest, mount USB-by-UUID, present mounts into guests.
3. **Backup/restore orchestration** — vzdump to the tiers, PBS, snapshot management, and the **self-restore-test**.
4. **Host & tunnel monitoring** — host metrics, guest up/down, storage-target status, and `cloudflared` health; reports the host domain to the hub.
5. **Provisioning** — provision a guest **by restoring the golden base image** (§9), deploy the controller into it, hand it its bootstrap config; also **build and refresh the golden base image** itself.
6. **Hub control loop** — poll for desired state + signed jobs, reconcile, execute, report, heartbeat.
7. **Local API** — the per-guest authorization gate the controller calls.
8. **Self-update** — update itself (carefully — it is a host service) and update the controllers it owns.

Explicitly does **not**:

- Serve application traffic or sit in the data path. **Control plane, not data plane**: if the agent dies, apps keep serving (Docker + LXC run without it); only *management* degrades — no new backups, no provisioning, hub loses the heartbeat.
- Hold or proxy customer application data.
- Run inside a guest. It is the thing that recovers guests and the host; it cannot be one of them.
- Manage **geo-restriction / the Cloudflare API**. Geo is hub-owned: the customer sets it in the controller UI, the controller reports the geo desired-state to the hub, and the **hub** (holding the CF API token) reconciles the WAF (S4). The agent manages only the *tunnel* service (`cloudflared`, §3/§5), never WAF rules.

## 3. Process model & host integration

- **Native Go binary, systemd service** on the host: boot-start, `Restart=always`, systemd watchdog (kill+restart on hang), journald logging, resource limits.
- **Root-minimized (boundary settled — Phase 3 B3).** The agent runs as a **non-root** service user with the scoped `FelhomAgent` token for all API-covered work + a **narrow `sudoers` allowlist** for true host ops. Per Phase 3 (B3) the boundary is settled: the entire per-customer guest lifecycle — provision (by restore, §9), config, start/stop, snapshot, backup, **restore**, destroy — is token-covered. Genuine OS-root is confined to: (1) building/refreshing the **golden base image** (`keyctl` create is `root@pam`-only — one-time at enrollment + a maintenance cadence, §9); (2) **host mounts** (USB mount-by-UUID, systemd mount units / fstab); (3) **SMART / hardware sensors**. Root therefore never sits on the per-customer path. See `proxmox-platform.md` §3.6 for the role + boundary table.
- **`cloudflared` is a separate systemd service**, not embedded in the agent. This is what makes the data path survive control-plane death by construction. The agent **manages and health-watches** it (see §5) but the tunnel does not live or die with the agent process.

## 4. Control model — reconcile + signed destructive ops

Two channels, split by **reversibility**, not by transport.

**(a) Desired-state reconciliation — steady state.**
The hub holds desired state for the host: which guests should exist (and at what spec),
the storage manifest, backup/retention policies, controller image versions. The agent
runs a reconcile loop converging actual Proxmox state → desired: idempotent, self-healing,
and tolerant of missed polls (drift is corrected on the next loop). Provisioning retries,
re-attach of a flapping USB target, redeploy of a crashed controller — all fall out of
reconciliation for free.

**(b) Signed one-shot jobs — operator actions.**
Restore-now, decommission, force-backup, break-glass-enable. Discrete, run-once
(idempotency key), written to the customer-visible audit log, and **outside** the reconcile
loop — they are point-in-time and often destructive, and a reconciler must never re-run a
restore because it "sees drift." A one-shot job names a **target** ("restore guest X from
snapshot S"), not a procedure; the agent owns the *how*.

**The reversibility gate (security-critical).**
"Signed jobs resist hub compromise" only holds if the agent also distrusts hub-supplied
*desired state* for destructive changes. The gate is by **provenance + data-bearing-ness, not
by verb**:

- **The reconciler MAY act without an operator signature** when: (a) creating/starting/restarting; (b) destroying resources it created earlier **within the same journaled transaction** (compensating rollback, §10); (c) destroying resources it **tagged ephemeral/scratch** (e.g. restore-test scratch guests, §8). The ephemeral/scratch tag is **agent-internal provenance and is never accepted from the hub** — else a compromised hub could relabel a data-bearing guest as scratch to walk the gate.
- **An operator signature is always required** to destroy/overwrite any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — *regardless of whether it arrives as a job or as a desired-state delta*. A compromised hub cannot forge them because the signing key is **not held by the hub** (it lives with the operator / a separate signing path; the hub only queues opaque signed blobs).
- **Healing a crashed controller is non-destructive by construction:** it is reconstructable from its image + the guest's persistent volume, so "redeploy" = restart the LXC / `docker compose up -d` **inside the existing guest** — never a guest destroy. (v0.33 precedent: `watchdog.go` restarts stopped stacks, it never destroys the guest.)

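The gate can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the agent's actual code: the names `Provenance`, `Op`, and `Allowed` are hypothetical, and the provenance value is assumed to come from the agent's own journal, never from hub input.

```go
package main

import "fmt"

// Provenance records how the agent itself came to know a resource.
// It is agent-internal state and is never parsed from hub input.
type Provenance int

const (
	DataBearing     Provenance = iota // holds the only/primary copy of customer data
	SameTransaction                   // created earlier in the current journaled transaction
	Scratch                           // tagged ephemeral by the agent (e.g. a restore-test guest)
)

// Op is a requested mutation, whether it arrived as a job or as a
// desired-state delta (hypothetical shape).
type Op struct {
	Destructive bool // destroys or overwrites the target
	Signed      bool // carries a verified operator signature
}

// Allowed applies the gate: provenance and data-bearing-ness, not the verb.
func Allowed(op Op, p Provenance) bool {
	if !op.Destructive {
		return true // create/start/restart never need a signature
	}
	if p == SameTransaction || p == Scratch {
		return true // compensating rollback or scratch teardown
	}
	return op.Signed // data-bearing: an operator signature is always required
}

func main() {
	fmt.Println(Allowed(Op{Destructive: true}, Scratch))
	fmt.Println(Allowed(Op{Destructive: true}, DataBearing))
	fmt.Println(Allowed(Op{Destructive: true, Signed: true}, DataBearing))
}
```

Note that the function takes no input describing where the request came from: a job and a desired-state delta go through the same check.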
Signed payloads carry a **nonce + expiry** (anti-replay: a captured "restore" job cannot be
re-injected later) and a target binding (host + guest id) so a signature can't be retargeted.
Notification-on-destructive-op is an **audit signal, never the guard** — a compromised hub
could both issue and suppress the notice, which is exactly why the *signature* (not the
notification) is the control.

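A minimal sketch of how an agent might check a signed job for signature, target binding, expiry, and nonce replay. Ed25519, the JSON encoding, and every name here (`Job`, `Verifier`, `Accept`, `demo`) are assumptions; the source does not specify the signature scheme or wire format. The nonce set is held in memory only, whereas a real agent would need it to survive restarts.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Job is a hypothetical signed one-shot job body.
type Job struct {
	Action  string    `json:"action"`
	HostID  string    `json:"host_id"`
	GuestID string    `json:"guest_id"`
	Nonce   string    `json:"nonce"`
	Expires time.Time `json:"expires"`
}

// Verifier checks jobs for one host and remembers accepted nonces.
type Verifier struct {
	OperatorKey ed25519.PublicKey // public verify key only
	HostID      string
	seen        map[string]bool
}

// Accept returns the decoded job only if the signature, host binding,
// expiry, and nonce all check out. The caller acts on j.GuestID only.
func (v *Verifier) Accept(raw, sig []byte, now time.Time) (*Job, error) {
	if !ed25519.Verify(v.OperatorKey, raw, sig) {
		return nil, errors.New("bad signature")
	}
	var j Job
	if err := json.Unmarshal(raw, &j); err != nil {
		return nil, err
	}
	if j.HostID != v.HostID {
		return nil, errors.New("job bound to a different host")
	}
	if !now.Before(j.Expires) {
		return nil, errors.New("job expired")
	}
	if v.seen[j.Nonce] {
		return nil, errors.New("nonce replayed")
	}
	if v.seen == nil {
		v.seen = map[string]bool{}
	}
	v.seen[j.Nonce] = true
	return &j, nil
}

// demo signs one job, then submits a tampered copy, the original, and a replay.
func demo() (first, replay, forged error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return err, err, err
	}
	now := time.Now()
	raw, _ := json.Marshal(Job{
		Action: "restore", HostID: "host-1", GuestID: "101",
		Nonce: "n-1", Expires: now.Add(time.Hour),
	})
	sig := ed25519.Sign(priv, raw)
	v := &Verifier{OperatorKey: pub, HostID: "host-1"}

	tampered := append([]byte{}, raw...)
	tampered[0] ^= 1
	_, forged = v.Accept(tampered, sig, now)
	_, first = v.Accept(raw, sig, now)
	_, replay = v.Accept(raw, sig, now)
	return first, replay, forged
}

func main() {
	fmt.Println(demo())
}
```

The signature is checked over the raw bytes before anything is decoded, so unsigned input never reaches the JSON parser's output path.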
## 5. Hub ↔ agent protocol (host domain)

**Box-initiated poll.** The hub never connects inbound. Each poll cycle exchanges:

- **Up:** heartbeat + a host-domain state report — host CPU/RAM/disk, per-guest up/down + spec, storage-target status (USB connected? NFS/CIFS reachable? PBS reachable?), last backup per target, last restore-test result, `cloudflared` health, agent + controller versions, audit-log tail.
- **Down:** the current desired state, any pending signed one-shot jobs, and config (poll interval, update window, policy changes).

**Dead-man's-switch (essential, not optional).** In a box-initiated model the heartbeat
*is* the liveness signal — a box that stops checking in is otherwise invisible. The hub
alerts the operator when an agent misses its expected check-in window. This is the worst
failure mode for a managed service, so it gets first-class treatment hub-side.

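A hub-side dead-man's-switch check could look roughly like the sketch below. The function name `Overdue`, the grace period, and the 60 s / 30 s values are illustrative assumptions; the source only states that a missed check-in window raises an alert.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Overdue returns, sorted, the hosts whose last heartbeat is older than
// the poll interval plus a grace period.
func Overdue(lastSeen map[string]time.Time, interval, grace time.Duration, now time.Time) []string {
	var late []string
	for host, seen := range lastSeen {
		if now.Sub(seen) > interval+grace {
			late = append(late, host)
		}
	}
	sort.Strings(late)
	return late
}

// demo checks two hypothetical hosts against a 60s interval with 30s grace.
func demo() []string {
	now := time.Now()
	seen := map[string]time.Time{
		"host-a": now.Add(-45 * time.Second), // inside the window
		"host-b": now.Add(-10 * time.Minute), // silent for too long
	}
	return Overdue(seen, 60*time.Second, 30*time.Second, now)
}

func main() {
	fmt.Println(demo())
}
```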
**Break-glass.** Standing inbound control is off. But when the poll loop *itself* is wedged
(agent hung, host sick) you cannot fix it through the poll loop. So there is an explicit,
**off-by-default, customer-consented, fully-audited** emergency path: SSH to the host via
the Cloudflare Tunnel behind Cloudflare Access (or on-site). Enabling it is itself a signed,
logged operation; it auto-expires.

## 6. Agent ↔ controller local API

The controller (in its LXC) reaches the agent (on the host) over the local bridge.

- **Transport:** HTTPS to the host's bridge IP on a fixed port.
- **Auth:** a per-guest local token, minted by the agent when it deploys the controller and written into the guest's bootstrap config. The agent maps token → guest and **authorizes per guest**: a controller can only act on *its own* guest. This is the agent acting as the per-guest authorization gate from Part 1.
- **Surface (minimal, all scoped to the caller's own guest):**
  - `GET /storage` — mounts available to this guest and their **class** (fast/slow), so the controller can place hot vs bulk volumes per `.felhom.yml`. (The agent owns the actual mounts; the controller just binds to the paths it's given.)
  - `POST /snapshot` — snapshot *this* guest (the snapshot-before-deploy primitive).
  - `POST /rollback` — roll *this* guest back to a named snapshot (post-deploy failure recovery).
  - `POST /backup` — request a backup-now of *this* guest (enqueued; non-destructive).
  - `GET /backup/due` — whether a policy-scheduled backup is due for *this* guest, so the controller can quiesce then call `POST /backup` (the app-consistent path, §8).
  - `GET /backup/status`, `GET /restore-test/status` — read-only status for the controller's UI.

Note what is *absent*: nothing here lets a controller touch another guest, the host, storage
attachment, or restore-overwrite. Destructive/cross-guest power stays operator-signed (§4).

A controller can only `POST /rollback` (or snapshot/backup) **its own** guest — the agent maps
token → guest and authorizes per guest, so a compromised controller's blast radius is
**self-scoped and bounded** to its own guest.

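The token → guest mapping can be sketched with the standard `net/http` package. This is an illustration under assumptions: the `tokens` type, the bearer-token header format, and the response text are hypothetical, and only `POST /snapshot` is shown. The point it demonstrates is that the target guest is derived from the token, so the request has no field in which to name another guest.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// tokens maps a per-guest local token to the guest id it was minted for.
type tokens map[string]string

// guestFor resolves a bearer token to its guest, comparing in constant time.
func (t tokens) guestFor(r *http.Request) (string, bool) {
	got := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
	for tok, guest := range t {
		if subtle.ConstantTimeCompare([]byte(got), []byte(tok)) == 1 {
			return guest, true
		}
	}
	return "", false
}

// snapshot handles POST /snapshot. The guest comes from the token,
// never from the request body or path.
func (t tokens) snapshot(w http.ResponseWriter, r *http.Request) {
	guest, ok := t.guestFor(r)
	if !ok {
		http.Error(w, "unknown token", http.StatusUnauthorized)
		return
	}
	fmt.Fprintf(w, "snapshot queued for guest %s", guest)
}

// call issues POST /snapshot with the given token against an in-memory recorder.
func call(t tokens, token string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/snapshot", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	t.snapshot(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	t := tokens{"tok-a": "101", "tok-b": "102"}
	fmt.Println(call(t, "tok-a"))
	fmt.Println(call(t, "wrong"))
}
```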
## 7. Storage manifest & reconciliation

The manifest is the load-bearing contract. It absorbs the **persisted** disk-state fields that
`settings.StoragePath` carries today **and adds** `durable_id`/UUID — today the controller
re-derives the UUID from fstab each boot (Part 2 / Phase-3), so persisting it is an
improvement. Held in the hub, reconciled by the agent.

Per target:

| field | meaning |
|---|---|
| `type` | `local-dir` / `usb` / `nfs` / `cifs` / `pbs` |
| `durable_id` | UUID (USB), `server:export` (NFS/CIFS), `repo+fingerprint` (PBS) — survives box loss |
| `class` | `fast` or `slow`, set **once at attach**, with an IOPS marker; no runtime speed-test |
| `role` | `primary` / `vzdump-target` / `pbs-offsite` / `bulk-data` |
| `creds` | encrypted (NFS/CIFS/PBS); USB has none |
| `policy` | schedule + retention for this target |
| `state` | `attached` / `disconnected` / `decommissioned` |

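As an illustration, one manifest row could map onto a Go struct with a validation step that rejects values outside the enumerations in the table. The Go field names and types are assumptions, `creds` and `policy` are omitted for brevity, and requiring `durable_id` for every type except `local-dir` is a guess based on the table listing no identifier form for `local-dir`.

```go
package main

import (
	"errors"
	"fmt"
)

// Target mirrors one row of the storage manifest (hypothetical Go shape).
type Target struct {
	Type      string // local-dir | usb | nfs | cifs | pbs
	DurableID string // UUID, server:export, or repo+fingerprint
	Class     string // fast | slow, set once at attach
	Role      string // primary | vzdump-target | pbs-offsite | bulk-data
	State     string // attached | disconnected | decommissioned
}

// oneOf reports whether v equals any of the allowed values.
func oneOf(v string, allowed ...string) bool {
	for _, a := range allowed {
		if v == a {
			return true
		}
	}
	return false
}

// Validate rejects values outside the enumerations in the table.
func (t Target) Validate() error {
	switch {
	case !oneOf(t.Type, "local-dir", "usb", "nfs", "cifs", "pbs"):
		return fmt.Errorf("unknown type %q", t.Type)
	case t.Type != "local-dir" && t.DurableID == "":
		return errors.New("durable_id is required for " + t.Type) // assumption, see lead-in
	case !oneOf(t.Class, "fast", "slow"):
		return fmt.Errorf("unknown class %q", t.Class)
	case !oneOf(t.Role, "primary", "vzdump-target", "pbs-offsite", "bulk-data"):
		return fmt.Errorf("unknown role %q", t.Role)
	case !oneOf(t.State, "attached", "disconnected", "decommissioned"):
		return fmt.Errorf("unknown state %q", t.State)
	}
	return nil
}

func main() {
	usb := Target{Type: "usb", DurableID: "1234-ABCD", Class: "slow",
		Role: "vzdump-target", State: "attached"}
	fmt.Println(usb.Validate())

	usb.DurableID = ""
	fmt.Println(usb.Validate())
}
```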
Reconciliation: ensure each `attached` target is mounted (USB-by-UUID via the sudoers
allowlist), each Proxmox storage entry matches, and `disconnected` targets are surfaced to
the hub (the storage watchdog — detect a USB drop in seconds, not at the next health cycle).

**Placement is per-volume, not per-app.** Hot volumes (DB/config) → a `fast` target,
**enforced**; bulk volumes (media) → may live on `slow`, declared in `.felhom.yml`.

A `bulk` volume **MUST** be realized as a `backup=0` **volume mount point** (or an external
bind mount) — **never** a Docker named volume in rootfs, which `vzdump` always captures
(verified, `phase3-findings.md` B2). Proven recipe: attach
`-mpN <storage>:<size>,mp=/mnt/bulk,backup=0`, then
`docker volume create --driver local -o type=none -o o=bind -o device=/mnt/bulk <vol>` (or a
compose bind). The per-volume placement component (Part 2 §5(2)) enforces this at deploy. The
**DR consequence** of excluding bulk is covered in §8.

**Field re-homing (from `settings.StoragePath`, Part 2):** `Label` → manifest (canonical);
`IsDefault`/`Schedulable` → manifest `policy`; `MigratedTo` + decommission → manifest `state`;
`StoppedStacks` → the **controller's `settings`** (app-domain: which apps to restart on
reconnect, not a host concern).

## 8. Backup/restore orchestration

Tiers double as backup *and* restore-source priority (fastest surviving source first),
per Part 1: **snapshot** (LVM-thin, transient, whole-guest rollback — not a backup) →
**local second storage** (vzdump to dir/NFS/CIFS) → **PBS offsite** (the DR substrate).

- **Quiescing (controller-driven for app-consistency):** an LXC has no fsfreeze
  (`proxmox-platform.md` §4.2), so app-consistency is the controller's job: it learns a backup
  is due (`GET /backup/due`, §6, or via its hub channel) → **quiesces** the app stack →
  `POST /backup` → polls `GET /backup/status` → unquiesces. **An agent-initiated vzdump is
  crash-consistent only** (there is no inbound-to-guest channel to trigger a quiesce — §3/§5).
  Every Proxmox op is async → the agent polls `task exitstatus`, never trusts the POST return.
- **Bulk volumes have no DR coverage from the guest vzdump** — they are excluded (§7). Every
  `bulk` volume needs an explicit own-backup decision: its own backup target per the manifest
  `policy`, **or deliberately none** when the data is re-downloadable (customer informed). On
  host-loss, un-backed-up bulk is gone; a **bind-mounted** bulk volume re-attaches only on the
  *same* host, so cross-host DR needs the separate backup. A deliberate per-volume choice,
  never a silent loss.
- **Key custody (PBS):** the **live** PBS key sits on the box so the agent can both back up
  *and* run restore-tests. The hub holds only the **recovery-code-wrapped escrow** copy it
  cannot open (zero-knowledge default). So: the box can restore-test; the operator cannot
  read the data; the customer's offsite recovery code is the irreducible residual.
- **Self-restore-test:** the closing of the "tested restore is the critical gap" theme. The
  agent periodically restores a backup into a **throwaway scratch guest**, boots it, runs
  health checks, reports pass/fail, and tears it down. Zero-knowledge backups can *only* be
  restore-tested by the box (the operator lacks the key) — so this lives in the agent by
  necessity, not just convenience. Integrity-verify (cheap, ciphertext-level) runs more often
  as the lighter check.

### 8a. PBS recovery-code escrow (zero-knowledge offsite-key recovery)

The DR substrate is the PBS offsite tier, and it is client-side encrypted (zero-knowledge): if the
box dies, restoring the offsite backups requires the **PBS client encryption key `K`**, which died
with the box. The escrow is how `K` comes back **without** Felhom ever being able to read customer
data. Design (decisions, with the rationale that pins them):

- **Live key unencrypted on the box** (`0600`, root): the agent backs up *and* runs restore-tests
  unattended — no passphrase prompt on the management path. The privilege concentration this
  implies is the whole argument for §3 root-minimization + a small auditable agent.
- **Wrap mechanism — PBS-native, not custom crypto.** At enrollment the agent generates a
  high-entropy **recovery code `R`** and produces a **passphrase-protected copy of `K` under `R`**
  using PBS's own key passphrase KDF (`proxmox-backup-client key` family). *Decision: lean on PBS's
  documented, battle-tested key+passphrase path; do not roll a bespoke AEAD wrap.* Host/customer
  binding is provided at the hub-storage layer (blob keyed by host-id), not by custom crypto.
- **Agent-side generation.** `R` is generated **on the box** (it already holds `K` and does the
  wrapping), so `R` never touches the hub even in transit — zero-knowledge by construction.
- **Escrow = the `R`-wrapped blob → hub.** The hub stores opaque ciphertext bound to the
  host/customer. Without `R` it is undecryptable; the operator cannot read customer data. (Hub-side
  storage schema for the blob is a slice-10 / doc-05 item.)
- **Recovery code custody.** `R` is shown to the customer **once** at enrollment (printed/displayed)
  and **never stored by Felhom in recoverable form**. Format: a grouped/word-list code (≥128-bit
  entropy) — it is transcribed off paper by a non-technical household, so raw base32 invites typos.
- **Consumption (slice 10, host-loss).** New box re-enrolls in restore mode → hub ships the escrow
  blob → customer enters `R` → box unwraps `K` → PBS restores proceed.
- **Optional belt-and-suspenders (product decision, default OFF).** A PBS **paperkey** (the raw key,
  for a safe) gives the customer a recovery path that survives *both* box loss *and* recovery-code
  loss, at the cost of a higher-value secret (raw key on paper, no second factor). Default is
  hub-escrow + `R` only; offer the paperkey as an opt-in "advanced" path.

**Properties stated for honesty (these go to the customer at enrollment):**
- **Irreducible residual:** losing `R` *and* the box (and, if not opted in, having no paperkey) =
  the offsite backups are **unrecoverable, by anyone, including Felhom.** This is the cost of
  genuine zero-knowledge and must be communicated, not buried.
- **Rotation ≠ key rotation:** rotating `R` re-wraps the escrow blob (and re-shows the customer a
  new code) but does **not** re-encrypt existing PBS data — that data stays keyed by `K`. Changing
  `K` itself is a separate, heavier operation (new key → new backups; old backups still need old
  `K`) and is out of scope for routine recovery-code rotation.

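One way to produce a grouped recovery code with at least 128 bits of entropy is sketched below. Everything in it is an assumption rather than the settled format: the 32-symbol alphabet (chosen to omit the look-alikes `0`, `O`, `1`, `I`), the 28-symbol length (28 × 5 = 140 bits), and the groups of four. A word-list variant would swap the alphabet for a list of words. The wrapping of `K` under `R` is left to PBS's own tooling and is not shown.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"
	"strings"
)

// alphabet holds 32 symbols and leaves out 0, O, 1 and I, which are
// easy to confuse when a code is copied from paper.
const alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"

// NewCode draws 28 symbols (140 bits) from r and prints them in
// seven dash-separated groups of four.
func NewCode(r io.Reader) (string, error) {
	buf := make([]byte, 28)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	var b strings.Builder
	for i, v := range buf {
		if i > 0 && i%4 == 0 {
			b.WriteByte('-')
		}
		// 256 is a multiple of 32, so masking introduces no bias.
		b.WriteByte(alphabet[v&31])
	}
	return b.String(), nil
}

// fixed is a deterministic reader used only to exercise NewCode.
type fixed byte

func (f fixed) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = byte(f)
	}
	return len(p), nil
}

func main() {
	code, err := NewCode(rand.Reader)
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```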
## 9. Provisioning & DR flows

**Provisioning (reconcile-driven, by restore).** Fresh creation of a Docker-capable LXC needs
the `keyctl=1` feature flag, which Proxmox permits only for `root@pam` (Phase 3, B3) — not the
scoped token. But a token-authorized **restore preserves `keyctl`** (Phase 3, B3, empirically:
a token `vzrestore` of a keyctl archive produced a guest that kept
`features: nesting=1,keyctl=1,unprivileged:1`), so the agent provisions **by restoring a golden base
image**, never by `pct create` on the per-customer path.

**Golden base image.** A **golden base archive** — minimal Debian + Docker, `nesting=1,keyctl=1`,
overlayfs — is built once as `root@pam` **at enrollment** (when the agent legitimately holds root
to mint its Proxmox token) and refreshed on a maintenance cadence. This is the one place
`keyctl`/root provisioning lives — off the per-customer path. Refresh cadence + fleet versioning
remain an operational open item (§13).

**Unified bring-up primitive (shared *front half* — NOT shared identity policy).** Provisioning
and DR-restore share one token-covered front-half code path:

> restore an archive → **reset identity** → size the guest (CPU/mem config + `pct resize`
> rootfs, token-covered) → attach storage mounts per the manifest

run as a **journaled reconcile job**; a mid-flight failure is compensating-rolled-back (destroy
the just-restored guest — allowed unsigned per §4, same-transaction provenance). They diverge in
the *archive* and the *back half*, **and in identity policy** (below).

**Identity reset is scenario-specific — this is a correctness boundary, not a detail.** "Reset
identity" is shorthand for two different operations:

- **Provision (golden base) → fresh identity, everything.** A provisioned guest is new: reset
  MAC + hostname **host-side via the token config** (the agent does NOT touch guest internals),
  while **`/etc/machine-id`** (a duplicate breaks journald/DHCP/systemd) and **SSH host keys**
  regenerate **guest-side on first boot** — machine-id by systemd for free, host keys by a baked,
  Condition-gated `felhom-regen-hostkeys.service` unit in the golden (the F3 decision: Debian does
  NOT auto-regenerate host keys after a restore, so the golden carries the regeneration, keeping
  the agent host-side-only). It then receives a **fresh** controller identity (host-id, local
  token, hub channel), **fresh restic repo identity**, and a fresh tunnel association — all minted
  in the back half (slice 8).
- **Guest-loss DR (customer backup) → preserve continuity identity, reset only what would
  collide.** The restored guest must *continue* the customer's world: **keep** the restic repo
  identity (resetting it orphans the existing backup chain — a silent data-continuity bug), the
  tunnel/DNS association, and the hub host/customer binding. Reset only collision-prone host-local
  identity (`machine-id`, SSH host keys, hostname as needed). **MAC is reset only when a source
  guest may still be live** (e.g. partial loss, or the restore-*test* which boots link-down for
  exactly this reason); in a true total guest-loss the original is gone, so the MAC can be kept to
  preserve DHCP reservations. The agent decides MAC handling from the scenario, not a fixed rule.

The exact reset set was pinned empirically by the slice-7 bring-up spike (live, link-up —
`documentation/tests/slice7-bringup-spike-findings.md`, commit `3342993`) and **implemented in the
unified bring-up reconcile job** (agent v0.8.0, `internal/reconcile/bringup.go`): F1 — a restore
preserves the archived MAC, so provision reset is unconditional (`PUT net0` with `hwaddr` omitted);
F3 — host keys via the baked golden unit, not an agent guest-internal op.

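The scenario-specific identity policy can be written down as a small lookup. This sketch is not the code in `internal/reconcile/bringup.go`; the names `Scenario`, `Plan`, and `PlanFor` are hypothetical, and it models only the host-side decisions stated above (guest-side machine-id and host-key regeneration are outside the agent and therefore not represented).

```go
package main

import "fmt"

// Scenario selects which identity policy the shared front half applies.
type Scenario int

const (
	Provision   Scenario = iota // restore of the golden base image
	GuestLossDR                 // restore of a customer backup
)

// Plan lists the host-side identity decisions for one bring-up.
type Plan struct {
	ResetMAC       bool // regenerate the NIC MAC via the token config
	ResetHostname  bool // set a new hostname via the token config
	KeepResticRepo bool // keep the existing backup chain identity
	KeepTunnel     bool // keep the tunnel/DNS association
	KeepHubBinding bool // keep the hub host/customer binding
}

// PlanFor returns the policy for a scenario. sourceMayBeLive only
// matters for DR: the MAC is reset when the original guest could
// still be on the network.
func PlanFor(s Scenario, sourceMayBeLive bool) Plan {
	if s == Provision {
		// Fresh everything; continuity identity is minted later in the back half.
		return Plan{ResetMAC: true, ResetHostname: true}
	}
	return Plan{
		ResetMAC:       sourceMayBeLive,
		KeepResticRepo: true,
		KeepTunnel:     true,
		KeepHubBinding: true,
	}
}

func main() {
	fmt.Printf("%+v\n", PlanFor(Provision, false))
	fmt.Printf("%+v\n", PlanFor(GuestLossDR, false))
	fmt.Printf("%+v\n", PlanFor(GuestLossDR, true))
}
```

In the DR case the hostname is left alone here; the text says "as needed", so a fuller version would take that as an input as well.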
**Guest loss (slice 7).** Agent restores G from the fastest surviving tier (snapshot → local →
PBS) and applies the **DR identity policy** above so the restored guest rejoins cleanly. The
customer backup already contains the controller + data, so there is **no controller deploy** in
this path — bring up + reattach external storage and it is whole. This is fully in slice 7.

### Slice mapping (what is built where — keep this current)

| Capability | Slice | Status |
|---|---|---|
| Golden base image build (root@pam, at enrollment) | **7** | **recipe implemented** (`felhom-agent/configs/build-golden.sh`, incl. the F3 host-key unit); golden archived at enrollment |
| Unified bring-up **front half** (restore→reset identity→size→attach storage), journaled + compensating rollback | **7** | **implemented** (agent v0.8.0, `internal/reconcile/bringup.go`) |
| **Guest-loss DR** (front half + DR identity policy; no controller deploy) | **7** | **implemented** (v0.8.0, `dr_guest_loss` mode — continuity identity preserved) |
| PBS recovery-code escrow **creation** (§8a) | **7** | designed (§8a); implement |
| Provisioning **back half** — deploy controller, hand bootstrap config, mint per-guest local token | **8** | deferred — needs the controller-deploy path + agent↔controller local API (§6) |
| **Host/hardware loss** DR — re-enroll in "restore mode"; hub serves identity / PBS namespace / tunnel token / storage manifest / restore directive | **10** | deferred — needs hub desired-state serving; hub store today holds only `{host_id, customer_id, api_key}` (slice 3) |
| PBS escrow **consumption** (recover `K` on a new box) | **10** | deferred — exercised by host-loss DR |
| Golden base refresh cadence + fleet versioning | post-launch | operational, non-blocking (§13) |

**Host/hardware loss (design intent — slice 10).** Re-enroll the new host in **restore mode**;
the hub — the durable source of truth that survives box death — hands the new agent the existing
identity, PBS namespace, tunnel token, storage manifest, a restore directive, and the **escrow
blob** (§8a) for the customer to unlock with their recovery code. Tunnel is reused from the hub
record, so DNS stays intact. This depends on hub desired-state serving (slice 10) and is not
buildable until then; recorded here so the front-half built in slice 7 lands ready for it.

## 10. Concurrency, crash-safety, idempotency

- **Per-guest serialization.** Reconcile, one-shot jobs, and local-API calls all feed a
  work queue that serializes mutations **per guest** (Proxmox dislikes concurrent conflicting
  ops on the same guest). Independent guests proceed in parallel.
- **Operation journaling.** Multi-step async ops (provision, restore, controller-update, agent
  self-update) are journaled with their in-flight Proxmox task ids. On agent restart, the
  journal is replayed: resume-or-rollback, so a crash mid-restore never leaves a corrupt or
  half-built guest.
- **Idempotency keys** on one-shot jobs (run-once across retries and restarts).

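Per-guest serialization and run-once idempotency keys can be combined in one small structure. The sketch below is a simplification under assumptions: `Queue` and `Do` are hypothetical names, a mutex per guest stands in for the work queue, and the set of completed keys lives in memory, whereas surviving restarts would require it to be part of the journal.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue serializes mutations per guest and runs each idempotency key once.
type Queue struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // one lock per guest id
	done  map[string]bool        // idempotency keys already claimed
}

func NewQueue() *Queue {
	return &Queue{locks: map[string]*sync.Mutex{}, done: map[string]bool{}}
}

// Do runs fn under the guest's lock unless key was already claimed.
// It reports whether fn was executed.
func (q *Queue) Do(guest, key string, fn func()) bool {
	q.mu.Lock()
	if q.done[key] {
		q.mu.Unlock()
		return false // duplicate delivery or retry: skip
	}
	q.done[key] = true
	l, ok := q.locks[guest]
	if !ok {
		l = &sync.Mutex{}
		q.locks[guest] = l
	}
	q.mu.Unlock()

	// Only work on the same guest waits here; other guests proceed in parallel.
	l.Lock()
	defer l.Unlock()
	fn()
	return true
}

func main() {
	q := NewQueue()
	runs := 0
	fmt.Println(q.Do("101", "job-1", func() { runs++ }))
	fmt.Println(q.Do("101", "job-1", func() { runs++ })) // same key, skipped
	fmt.Println(runs)
}
```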
## 11. Self-update

- **Agent (the hard case — a host service, no snapshot-rollback).** **A/B layout:** download →
  verify signature → stage as the inactive slot → flip a `current → good|new` symlink → restart.
  **Revert authority lives outside the swapped binary** — `Restart=always` alone just
  crash-loops a bad binary — so a **separate health-gate** (a systemd oneshot `ExecStartPost`
  probe, or a tiny supervisor unit) flips `current` back to last-good and restarts on a failed
  health window. The new version is **committed as "good" only after a clean health window**.
  Triggered by a hub signed job within the update window; manual always allowed. Journaled (§10).
- **Controller (the easy case — it's a guest).** The agent owns the controller's lifecycle,
  so the **agent updates the controller**: snapshot-before-update (free rollback, because the
  controller *is* a snapshottable guest) → pull new image → redeploy → health-check → rollback
  on failure. This resolves the Part-2 `selfupdate/` open: the controller is **agent-managed**,
  not self-updating; the controller's old self-update path is removed.

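The commit-or-revert rule for the agent's A/B slots reduces to a two-field state machine. The following is an in-memory model only, with hypothetical names (`Slots`, `Stage`, `Settle`); the real mechanism flips a symlink and is driven by a health gate that lives outside the agent binary.

```go
package main

import "fmt"

// Slots models the A/B layout: Current is what the service runs,
// LastGood is the last version that passed a clean health window.
type Slots struct {
	Current  string
	LastGood string
}

// Stage points Current at the new slot without committing it.
func (s *Slots) Stage(newSlot string) {
	s.Current = newSlot
}

// Settle is what the external health gate does once the health window
// closes: commit the new slot as good, or flip back to last-good.
// It reports whether a revert happened.
func (s *Slots) Settle(healthy bool) (reverted bool) {
	if healthy {
		s.LastGood = s.Current
		return false
	}
	s.Current = s.LastGood
	return true
}

func main() {
	s := Slots{Current: "a", LastGood: "a"}

	s.Stage("b")
	fmt.Println(s.Settle(false), s) // bad update: back on slot a

	s.Stage("b")
	fmt.Println(s.Settle(true), s) // clean window: b becomes last-good
}
```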
## 12. Secrets at rest on the host

The agent holds, root-only on the host fs: the scoped Proxmox token, the hub API key, the
operator's **public** verify key (for §4 signatures — public, low-risk), the Cloudflare
tunnel token, encrypted storage creds (NFS/CIFS/PBS), and the **live PBS key**. The privilege
and the secret footprint that left the controller now concentrate here — which is the whole
argument for §3's root-minimization and a small, auditable agent.

## 13. Open items / what this unblocks

Resolved here: tunnel placement (host, agent-managed, own systemd service), the
reconcile-vs-jobs fork (hybrid, gated by reversibility), agent process model, self-update
ownership, the local-API surface, the storage-manifest schema, **provision-by-restore**, the
**provision/DR slice boundary** (7 front-half + guest-loss DR + escrow creation; 8 provisioning
back-half; 10 host-loss DR + escrow consumption — §9 table), the **PBS recovery-code escrow
design** (§8a), and the **root-vs-API boundary** (Phase 3, B3).

Still open:

- Multi-tenant **resource fairness** on a shared host (per-guest cgroup limits, noisy-neighbor) — deferred to the company-case pass.
- Operator-side **signing tooling** — where the operator signing key lives operationally and how a destructive op gets signed without undue friction (offline key vs. a small signing service; the security floor is "not in the hub").
- Hub-side **desired-state editing UX** and the host-domain report schema details — belong to the hub architecture doc.
- **Golden base image** refresh cadence + fleet versioning — operational, non-blocking (§9).
- **Identity-reset set** (live, link-up) — pinned empirically by the slice-7 bring-up spike; the
  scenario-specific policy is settled in §9, the exact field list is the spike's deliverable.
- **Hub-side escrow storage + restore-mode serving** — the blob's hub schema and the restore-mode
  desired-state handover are slice-10 / doc-05 (§8a, §9 host-loss).

This doc hands the implementation three contracts it was waiting on:

1. the **local-API surface** (§6) → the controller's NEW local-API client, snapshot-before-deploy, and self-restore-test wiring (Part 2);
2. the **storage-manifest schema** (§7) → the `settings.StoragePath` reshape and per-volume hot/bulk placement (Part 2);
3. the **backup contract** (§7–8) → the destination for the app-data-backup package extracted in the Part-2 refactor.

---

## Changelog — design-review + Phase-3 fold-in (2026-06-08)

### Slice-7 scope + escrow design (2026-06-09)
- §9 rewritten: the bring-up primitive is a **shared front half only** — identity-reset policy is
  **scenario-specific** (provision = fresh everything; guest-loss DR = preserve restic/tunnel/hub
  continuity identity, reset only collision-prone host-local identity). Added the **slice 7/8/10
  mapping table** (front half + guest-loss DR + escrow creation in 7; provisioning back-half in 8;
  host-loss DR + escrow consumption in 10).
- NEW §8a: **PBS recovery-code escrow** — live key unencrypted on box for unattended ops; agent
  generates recovery code `R`; PBS-native passphrase-wrap of `K` under `R` escrowed to the hub
  (zero-knowledge); consumption is slice 10. Irreducible-residual + rotation≠key-rotation stated.
- §13 updated accordingly.

- **NEW provision-by-restore** (§9): the agent provisions by **restoring a golden base image**
  (token-covered, preserves `keyctl`), never `pct create` on the per-customer path; one unified
  restore primitive shared with DR. §2 responsibility + §3 boundary updated.
- **B3** (§2/§3): replaced "Phase-1 minimal role" with the validated **`FelhomAgent`** operator
  role; root-vs-API boundary **settled** (root only for golden-image build, host mounts, SMART).
- **B1** (§4): reversibility gate rewritten as **provenance + data-bearing** (scratch tag is
  agent-internal, never hub-supplied; crashed-controller heal is non-destructive in-place).
- **B2** (§7/§8): validated bulk-as-`backup=0`-mountpoint recipe + the **bulk-DR consequence**
  (excluded bulk needs its own backup decision).
- **S1** (§6/§8): `GET /backup/due` added; controller-driven quiescing; agent vzdump is
  crash-consistent only. **S2** (§10/§11): A/B self-update with external revert authority;
  controller-update + agent self-update journaled. **S3** (§7): `StoragePath` field re-homing.
  **S4:** geo non-responsibility added (§2). **M2** (§7): manifest "absorbs + adds durable_id".
  **§6:** rollback is self-scoped/bounded. **§13:** golden-image refresh cadence added as open.