04-control-plane-authorization.md

2026-06-08 10:13:24 +02:00
parent 333c65cbc4
commit 36451f57e0
1 changed files with 154 additions and 0 deletions
@@ -0,0 +1,154 @@
+# Architecture Part 4 — Control-plane authorization (operator signing)
+
+> Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
+> To be reviewed by Claude Code against that spike + `03` §4, then placed at
+> `docs/architecture/04-control-plane-authorization.md`.
+>
+> Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
+> This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."
+
+## 1. Purpose & scope
+
+`03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
+forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
+primitive, the keys, rotation, the three components' roles, and the operator workflow. The
+*policy* (what needs a signature) lives in `03` §4; this is the *how*.
+
+**Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
+overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
+storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
+or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
+guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
+signed ops are rare and deliberate.
+
+## 2. Primitive — SSH signatures (SSHSIG)
+
+Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
+`SSHSIG` format), verified by the agent in Go with `encoding/pem` and `golang.org/x/crypto/ssh`:
+`pem.Decode` → `ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no
+hand-rolled crypto.
+
+**Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
+embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
+a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
+foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
+different signed-data), so it would force a verifier change on every box at hardware-adoption time.
+SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).
+
+### 2.1 The signed object — canonical op blob
+The signature covers an op blob (Phase 4 §2):
+
+```
+{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
+```
+
+- **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
+  insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
+  **verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
+  parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
+  side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
+- `nonce` ≥128-bit random; `issued_at`/`expires_at` a short window (minutes); `key_id` identifies
+  the signing key (rotation/audit).
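
A minimal sketch of how the signer side could produce that canonical form in Go, using only the standard library. The type names, the field types (Unix-second timestamps, hex nonce) and the helper names are assumptions for illustration, not the Phase 4 spec. It relies on two `encoding/json` behaviours: struct fields are emitted in declaration order (so they are declared pre-sorted here), and map keys are sorted at every level.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
)

// Target is declared with its fields in sorted key order: encoding/json
// emits struct fields in declaration order and sorts only map keys.
type Target struct {
	GuestID string `json:"guest_id"`
	HostID  string `json:"host_id"`
}

// OpBlob mirrors the op blob of §2.1. Field types are assumptions.
type OpBlob struct {
	ExpiresAt int64          `json:"expires_at"` // Unix seconds (assumed)
	IssuedAt  int64          `json:"issued_at"`  // Unix seconds (assumed)
	KeyID     string         `json:"key_id"`
	Nonce     string         `json:"nonce"` // hex-encoded (assumed)
	Op        string         `json:"op"`
	Params    map[string]any `json:"params"` // map keys are sorted by encoding/json
	Target    Target         `json:"target"`
}

// NewNonce returns 128 bits from the OS CSPRNG, hex-encoded.
func NewNonce() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// Canonical serializes the blob as compact UTF-8 JSON with sorted keys.
// These are the exact bytes the operator signs and the agent verifies.
func Canonical(b OpBlob) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, >, & literal instead of \u003c etc.
	if err := enc.Encode(b); err != nil {
		return nil, err
	}
	// Encode appends one trailing newline; strip it.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}
```

The sorted order here depends on the declaration order being kept sorted by hand; a signer that wants that enforced mechanically could marshal through nested maps instead.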
+
+### 2.2 Domain separation — the namespace
+The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
+caller-supplied. A signature minted for any other namespace must not verify (proven). This stops a
+signature made for one purpose being reused for another.
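
To see why the namespace gives domain separation, it helps to look at what an SSHSIG signature covers: not the message itself, but a preamble that includes the namespace plus a hash of the message. The sketch below rebuilds that signed-data structure with the standard library and checks it with raw Ed25519. It is an illustration of the framing only: the real verifier described above uses `golang.org/x/crypto/ssh` and dispatches on key type, whereas this sketch hard-codes Ed25519, SHA-512 and a raw (not SSH-wire-encoded) signature. All function names are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/sha512"
	"encoding/binary"
)

// Namespace is compiled into the verifier; callers cannot override it.
const Namespace = "felhom-op-v1"

// sshString encodes b as an SSH wire string: uint32 big-endian length, then bytes.
func sshString(b []byte) []byte {
	out := make([]byte, 4, 4+len(b))
	binary.BigEndian.PutUint32(out, uint32(len(b)))
	return append(out, b...)
}

// signedData builds the bytes an SSHSIG signature covers:
// "SSHSIG" || namespace || reserved || hash_algorithm || H(message).
func signedData(namespace string, message []byte) []byte {
	digest := sha512.Sum512(message)
	var buf bytes.Buffer
	buf.WriteString("SSHSIG")
	buf.Write(sshString([]byte(namespace)))
	buf.Write(sshString(nil))              // reserved, empty
	buf.Write(sshString([]byte("sha512"))) // hash algorithm name
	buf.Write(sshString(digest[:]))
	return buf.Bytes()
}

// signRaw is the signer-side counterpart, present only to exercise the sketch.
func signRaw(priv ed25519.PrivateKey, namespace string, message []byte) []byte {
	return ed25519.Sign(priv, signedData(namespace, message))
}

// verifyOp deliberately takes no namespace parameter: it always uses the constant.
func verifyOp(pub ed25519.PublicKey, message, rawSig []byte) bool {
	return ed25519.Verify(pub, signedData(Namespace, message), rawSig)
}

// demo signs the same message under two namespaces and reports which verify.
func demo() (sameNamespace, otherNamespace bool) {
	// Fixed all-zero seed: deterministic demo key, never a real key.
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize))
	pub := priv.Public().(ed25519.PublicKey)
	msg := []byte(`{"op":"guest.destroy"}`)
	return verifyOp(pub, msg, signRaw(priv, Namespace, msg)),
		verifyOp(pub, msg, signRaw(priv, "file", msg))
}
```

Because the namespace is part of the signed bytes, a signature produced by the same key for the `file` namespace covers different data and fails `verifyOp`, even though key and message are identical.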
+
+### 2.3 Verify pipeline (order is load-bearing)
+`namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
+recorded last**, only after everything else passes, so an invalid signature can never consume a
+nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
+- **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
+  A cannot be replayed at box B);
+- **time window** — `now ∈ [issued_at, expires_at]`;
+- **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
+  and expiry-pruned; a non-persistent store reopens the replay window after a restart).
+
+The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
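
The ordering above can be sketched as a single function. This is not the Phase 4 `VerifySignedOp`; it is a hedged illustration of the check order and of recording the nonce last. The `Envelope` and `Verifier` types, the error names, and the `CryptoVerify` hook (a stand-in for the SSHSIG check) are assumptions, and the in-memory `Seen` map would have to be a persistent store in a real agent.

```go
package main

import "errors"

// Namespace is the fixed verifier-side constant from §2.2.
const Namespace = "felhom-op-v1"

var (
	ErrNamespace = errors.New("wrong namespace")
	ErrSigner    = errors.New("signer not in allowed set")
	ErrSignature = errors.New("bad signature")
	ErrTarget    = errors.New("op targets another host or guest")
	ErrWindow    = errors.New("outside validity window")
	ErrReplay    = errors.New("nonce already seen")
)

// Envelope is a hypothetical flattened view of a signed op: values read from
// the signature (Namespace, SignerKey) and from the signed blob (the rest).
type Envelope struct {
	Namespace, SignerKey string
	HostID, GuestID      string
	Nonce                string
	IssuedAt, ExpiresAt  int64 // Unix seconds (assumed)
}

// Verifier holds the per-box state the pipeline needs.
type Verifier struct {
	HostID       string
	Guests       map[string]bool     // guests that exist on this box
	Allowed      map[string]bool     // pinned signer keys
	CryptoVerify func(Envelope) bool // stand-in for the SSHSIG check
	Seen         map[string]int64    // nonce -> expires_at; must be persistent in practice
}

// Verify runs the layers in order; the first failing layer decides the error.
func (v *Verifier) Verify(e Envelope, now int64) error {
	switch {
	case e.Namespace != Namespace:
		return ErrNamespace
	case !v.Allowed[e.SignerKey]:
		return ErrSigner
	case !v.CryptoVerify(e):
		return ErrSignature
	case e.HostID != v.HostID || !v.Guests[e.GuestID]:
		return ErrTarget
	case now < e.IssuedAt || now > e.ExpiresAt:
		return ErrWindow
	}
	if _, dup := v.Seen[e.Nonce]; dup {
		return ErrReplay
	}
	// Recorded last: only a fully valid op ever consumes a nonce.
	v.Seen[e.Nonce] = e.ExpiresAt
	return nil
}

// Prune drops nonces whose window has closed; the time-window check
// already rejects any replay of those ops.
func (v *Verifier) Prune(now int64) {
	for nonce, expires := range v.Seen {
		if expires < now {
			delete(v.Seen, nonce)
		}
	}
}
```

Keeping the nonce write outside the `switch` makes the DoS property easy to audit: no early return can reach it.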
+
+## 3. The keys — two-key model, software now
+
+Both are software (SSH-format) keys today; both can be replaced by FIDO2-resident keys later with
+no verifier change on the boxes (§7).
+
+- **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
+  the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
+  for destructive ops — rare, so its exposure is low.
+- **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
+  printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
+  if that key is lost or compromised.
+
+Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
+operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
+instructions.
+
+**Allowed-signers is a set** → single signer today; **quorum (N-of-M) for the highest-blast ops is
+just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
+now.
+
+## 4. Rotation & compromise recovery
+
+The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
+or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
+op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.
+
+- **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
+  the agent accepts it because it's signed by the trusted current key (key-signs-key).
+- **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
+  it because the recovery key is pinned and authorized for rotation. The compromised key is removed
+  from the allowed set in the same signed op.
+- **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
+  way initial enrollment did).
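
One way the agent could model the two roles and a signed re-pin, sketched under assumptions: the `Role`, `OpKind`, `Repin` and `AllowedSigners` names are hypothetical, keys are represented as opaque strings, and the caller is assumed to have already run the full verify pipeline (§2.3) so that `signer` is the key that actually signed the op.

```go
package main

import "errors"

// Role says what a pinned key may authorize.
type Role int

const (
	RoleOperational Role = iota // destructive ops and planned rotation
	RoleRecovery                // rotation only
)

// OpKind distinguishes ordinary destructive ops from key re-pins.
type OpKind int

const (
	KindDestructive OpKind = iota
	KindRepin
)

var (
	ErrNotAuthorized = errors.New("signer not authorized for this op")
	ErrBadRepin      = errors.New("invalid re-pin request")
)

// AllowedSigners is the pinned set: public key (opaque string here) -> role.
type AllowedSigners map[string]Role

// Authorized reports whether signer may authorize an op of the given kind.
func (a AllowedSigners) Authorized(signer string, kind OpKind) bool {
	role, pinned := a[signer]
	if !pinned {
		return false
	}
	if kind == KindRepin {
		return true // both pinned roles may sign a rotation
	}
	return role == RoleOperational
}

// Repin is the payload of a signed rotation op.
type Repin struct {
	Add    string // new operational public key
	Remove string // operational key to drop; empty to keep the old one
}

// ApplyRepin updates the pinned set. It assumes the signature over the
// re-pin op was already verified and that signer is the verified key.
func (a AllowedSigners) ApplyRepin(signer string, r Repin) error {
	if !a.Authorized(signer, KindRepin) {
		return ErrNotAuthorized
	}
	if r.Add == "" {
		return ErrBadRepin
	}
	// This sketch never lets a re-pin touch the recovery key itself.
	for _, key := range []string{r.Add, r.Remove} {
		if role, pinned := a[key]; pinned && role == RoleRecovery {
			return ErrBadRepin
		}
	}
	delete(a, r.Remove) // no-op when Remove is empty or unknown
	a[r.Add] = RoleOperational
	return nil
}
```

Whether the recovery key itself may ever be rotated by a signed op is a policy choice the text leaves open; the sketch takes the conservative reading and refuses.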
+
+## 5. Component roles
+
+- **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
+  (`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
+  later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
+  key (passphrase-protected); can reach the cold recovery key when rotation is needed.
+- **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
+  status in the operator UI. Holds **no** private key and cannot sign: any blob a compromised hub
+  fabricates or alters fails verification at the agent. (Matches `03` §4 / box-initiated poll.)
+- **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
+  verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
+  customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
+  (a compromised hub could issue *and* suppress notice — the signature is the control).
+- **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
+  physical-presence provisioning step (the trust root is established on-site, not via the hub).
+
+## 6. Operator workflow
+
+- **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
+  overhead.
+- **A destructive op** (rare): the operator runs the signing CLI on their workstation — which builds
+  the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub
+  queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the
+  desk. **Never** a site visit.
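
A small sketch of the seam this workflow depends on: the CLI talks to a `Signer` and to the hub queue through narrow interfaces, so swapping the file-key backend for a hardware one touches neither the blob nor the submit path. Only `Signer` and its `Sign(blob) → signature` shape come from §5; `Queue`, `SubmitOp` and the function adapters are hypothetical names.

```go
package main

import "errors"

// Signer is the backend seam from §5: a file key today, FIDO2/PIV later.
type Signer interface {
	Sign(blob []byte) (signature []byte, err error)
}

// Queue is a hypothetical client for the hub's signed-op queue.
type Queue interface {
	Post(blob, signature []byte) error
}

// SignerFunc and QueueFunc adapt plain functions to the interfaces,
// which keeps backends and test doubles small.
type SignerFunc func(blob []byte) ([]byte, error)

func (f SignerFunc) Sign(blob []byte) ([]byte, error) { return f(blob) }

type QueueFunc func(blob, signature []byte) error

func (f QueueFunc) Post(blob, signature []byte) error { return f(blob, signature) }

// SubmitOp signs the canonical blob and posts blob + signature to the hub.
// The hub only ever sees the opaque pair; it never holds a private key.
func SubmitOp(s Signer, q Queue, blob []byte) error {
	if len(blob) == 0 {
		return errors.New("empty op blob")
	}
	sig, err := s.Sign(blob)
	if err != nil {
		return err // e.g. wrong passphrase, or no touch on a hardware key
	}
	return q.Post(blob, sig)
}
```

Nothing in `SubmitOp` knows which key type produced the signature, which matches the claim that a hardware backend is a drop-in change on the workstation.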
+
+## 7. Hardware readiness (Viktor's "build the foundation now")
+
+Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later needs **no verifier
+change on the boxes**, proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged
+verifier accepts a spec-faithful sk signature). At hardware adoption the operator generates an
+sk-key, points the `Signer` backend at it, and re-pins the allowed-signers entry through a signed
+rotation op (§4); no agent code changes.
+
+Two honest notes:
+- **Confirm with a real device at adoption.** Phase 4 §5 was validated to spec, not against live
+  hardware; a 5-minute real-key round-trip should confirm it (no surprise expected; signer/library/device
+  all follow the same spec).
+- **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
+  crypto-only today (correct for software keys); enforcing the flag is a small later option once
+  hardware is in use.
+
+## 8. Open items
+- **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
+  allowed-signers-set foundation supports it.
+- **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
+  detail.
+- **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.
+
+## 9. What this unblocks
+Closes the `03` §4 "undesigned signing path." Hands off to implementation: the **canonical blob spec**
+(§2.1) + the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's verify path, the
+**`Signer` interface** for the operator CLI, and the **allowed-signers pinning** step for enrollment.
+The hub's signed-job queue + pending-op UI carry into the hub architecture doc.