# Architecture Part 4 — Control-plane authorization (operator signing)

> Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
> To be reviewed by Claude Code against that spike + `03` §4, then placed at
> `docs/architecture/04-control-plane-authorization.md`.
>
> Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
> This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."

## 1. Purpose & scope

`03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
primitive, the keys, rotation, the three components' roles, and the operator workflow. The
*policy* (what needs a signature) lives in `03` §4; this is the *how*.

**Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
signed ops are rare and deliberate.

## 2. Primitive — SSH signatures (SSHSIG)

Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
`SSHSIG` format), verified by the agent in Go (`golang.org/x/crypto/ssh`) — `pem.Decode` →
`ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no hand-rolled crypto.

**Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
different signed-data), so it would force a verifier change on every box at hardware-adoption time.
SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).

### 2.1 The signed object — canonical op blob
The signature covers an op blob (Phase 4 §2):

```
{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
```

- **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
  insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
  **verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
  parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
  side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
- `nonce` is at least 128 bits of randomness; `issued_at`/`expires_at` bound a short validity
  window (minutes); `key_id` identifies the signing key (for rotation and audit).
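
A minimal sketch of how the signer side could produce the canonical bytes with only the Go standard library. It relies on two documented `encoding/json` behaviours: struct fields are emitted in declaration order (so the fields below are declared alphabetically by JSON name) and map keys are sorted. The type names, the Unix-seconds timestamp encoding, and the hex nonce encoding are assumptions for illustration; the blob spec above does not fix them.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"time"
)

// Target and OpBlob declare fields in alphabetical order of their JSON names,
// because encoding/json emits struct fields in declaration order.
type Target struct {
	GuestID string `json:"guest_id"`
	HostID  string `json:"host_id"`
}

type OpBlob struct {
	ExpiresAt int64          `json:"expires_at"` // Unix seconds (assumed encoding)
	IssuedAt  int64          `json:"issued_at"`  // Unix seconds (assumed encoding)
	KeyID     string         `json:"key_id"`
	Nonce     string         `json:"nonce"`
	Op        string         `json:"op"`
	Params    map[string]any `json:"params"` // map keys are sorted by json.Marshal
	Target    Target         `json:"target"`
}

// NewNonce returns 128 bits from the OS CSPRNG, hex-encoded (32 characters).
func NewNonce() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// NewOpBlob fills in a fresh nonce and a validity window of length ttl.
func NewOpBlob(op, keyID string, t Target, params map[string]any, now time.Time, ttl time.Duration) (OpBlob, error) {
	nonce, err := NewNonce()
	if err != nil {
		return OpBlob{}, err
	}
	if params == nil {
		params = map[string]any{} // emit {} rather than null
	}
	return OpBlob{
		ExpiresAt: now.Add(ttl).Unix(),
		IssuedAt:  now.Unix(),
		KeyID:     keyID,
		Nonce:     nonce,
		Op:        op,
		Params:    params,
		Target:    t,
	}, nil
}

// Canonical returns the exact bytes to sign: compact JSON with sorted keys.
func Canonical(b OpBlob) ([]byte, error) {
	return json.Marshal(b)
}
```

The sorted-keys guarantee holds only while `Params` values are scalars, slices, or nested maps; a struct nested inside `Params` would again be emitted in declaration order, so a real signer would need to constrain what goes there.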

### 2.2 Domain separation — the namespace
The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
caller-supplied. A signature minted for any other namespace must not verify (proven in the Phase 4
spike). This stops a signature made for one purpose from being reused for another.
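
To make the domain separation concrete, here is a stdlib-only sketch of the byte string an SSHSIG signature covers, following the layout in OpenSSH's `PROTOCOL.sshsig` (`"SSHSIG"`, then the namespace, an empty reserved field, the hash algorithm name, and the message hash, each as an SSH string). Because the namespace is inside the signed bytes, a signature made under another namespace cannot verify. This is not the agent's verifier: it skips the armored envelope and uses `crypto/ed25519` directly, and the helper names are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/sha512"
	"encoding/binary"
)

// opNamespace is compiled into the verifier; callers cannot override it.
const opNamespace = "felhom-op-v1"

// sshString appends b as an SSH wire string: uint32 big-endian length + bytes.
func sshString(buf *bytes.Buffer, b []byte) {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(b)))
	buf.Write(n[:])
	buf.Write(b)
}

// signedData builds what an SSHSIG signature covers:
// "SSHSIG" || namespace || reserved || hash_algorithm || H(message).
func signedData(namespace string, message []byte) []byte {
	h := sha512.Sum512(message)
	var buf bytes.Buffer
	buf.WriteString("SSHSIG")
	sshString(&buf, []byte(namespace))
	sshString(&buf, nil) // reserved field, empty
	sshString(&buf, []byte("sha512"))
	sshString(&buf, h[:])
	return buf.Bytes()
}

// verifyOp accepts a signature only under the fixed namespace.
func verifyOp(pub ed25519.PublicKey, message, sig []byte) bool {
	return ed25519.Verify(pub, signedData(opNamespace, message), sig)
}

// signWithNamespace stands in for `ssh-keygen -Y sign -n <namespace>`
// with a software ssh-ed25519 key (illustration only).
func signWithNamespace(priv ed25519.PrivateKey, namespace string, message []byte) []byte {
	return ed25519.Sign(priv, signedData(namespace, message))
}

// generateKey creates a throwaway key pair for the demonstration.
func generateKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}
```

Signing the same message under `felhom-op-v1` and under a different namespace such as `file` yields two signatures, and `verifyOp` accepts only the first, since it never takes the namespace from the caller.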

### 2.3 Verify pipeline (order is load-bearing)
`namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
recorded last**, only after everything else passes, so an invalid signature can never consume a
nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
- **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
  A cannot be replayed at box B);
- **time window** — `now ∈ [issued_at, expires_at]`;
- **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
  and expiry-pruned; a non-persistent store reopens the replay window after a restart).

The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
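
The ordering can be sketched as a straight-line function in which each layer returns early and the nonce is written only on the final line. This is not `VerifySignedOp`: the envelope type, the Unix-seconds timestamps, and the in-memory nonce map are assumptions, and raw Ed25519 over the blob stands in for the SSHSIG check so the example needs only the standard library. Only `host_id` is bound here; `guest_id` would be checked the same way.

```go
package main

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
)

const opNamespace = "felhom-op-v1"

// SignedOp is a simplified, already-unarmored envelope (hypothetical shape).
type SignedOp struct {
	Namespace string
	PublicKey ed25519.PublicKey
	Signature []byte
	Blob      []byte // the exact bytes that were signed
}

type opFields struct {
	Target struct {
		HostID string `json:"host_id"`
	} `json:"target"`
	Nonce     string `json:"nonce"`
	IssuedAt  int64  `json:"issued_at"`  // Unix seconds (assumed encoding)
	ExpiresAt int64  `json:"expires_at"` // Unix seconds (assumed encoding)
}

type Verifier struct {
	HostID  string
	Allowed []ed25519.PublicKey
	Seen    map[string]int64 // nonce -> expiry; a real agent must persist this
}

func NewVerifier(hostID string, allowed ...ed25519.PublicKey) *Verifier {
	return &Verifier{HostID: hostID, Allowed: allowed, Seen: map[string]int64{}}
}

func (v *Verifier) Verify(op SignedOp, now int64) error {
	if op.Namespace != opNamespace { // 1. namespace
		return errors.New("wrong namespace")
	}
	pinned := false
	for _, k := range v.Allowed { // 2. allow-list
		if k.Equal(op.PublicKey) {
			pinned = true
		}
	}
	if !pinned {
		return errors.New("signer not pinned")
	}
	if !ed25519.Verify(op.PublicKey, op.Blob, op.Signature) { // 3. crypto verify
		return errors.New("bad signature")
	}
	var f opFields // fields are parsed from the same bytes that were verified
	if err := json.Unmarshal(op.Blob, &f); err != nil {
		return err
	}
	if f.Target.HostID != v.HostID { // 4. target binding
		return errors.New("op targets another box")
	}
	if now < f.IssuedAt || now > f.ExpiresAt { // 5. time window
		return errors.New("outside validity window")
	}
	if _, seen := v.Seen[f.Nonce]; seen || f.Nonce == "" { // 6. nonce
		return errors.New("nonce missing or replayed")
	}
	v.Seen[f.Nonce] = f.ExpiresAt // recorded last, only after every check passed
	return nil
}

// signOp and newKeyPair are demonstration helpers, not part of the verifier.
func signOp(priv ed25519.PrivateKey, blob []byte) SignedOp {
	return SignedOp{
		Namespace: opNamespace,
		PublicKey: priv.Public().(ed25519.PublicKey),
		Signature: ed25519.Sign(priv, blob),
		Blob:      blob,
	}
}

func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}
```

Because the write to `Seen` is the last statement, an op that fails any earlier layer leaves the store untouched; the same nonce can still be accepted later when it arrives with a valid signature inside its window.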

## 3. The keys — two-key model, software now

Both are software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no
box change (§7).

- **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
  the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
  for destructive ops — rare, so its exposure is low.
- **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
  printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
  if that key is lost or compromised.

Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
instructions.

**Allowed-signers is a set**: a single signer today; **quorum (N-of-M) for the highest-blast ops is
just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
now.
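
One way to picture the scoping is a set keyed by `key_id` in which each pinned key carries a role. The sketch below is an assumption-heavy illustration: the type names, the op name `key.rotate`, and the `QuorumMet` helper are hypothetical, and the quorum function only shows why a set makes a later threshold policy cheap to add.

```go
package main

// Role scopes what a pinned key may authorize.
type Role int

const (
	RoleOperational Role = iota // destructive ops and planned rotation
	RoleRecovery                // key-rotation instructions only
)

// OpRotateKey is a hypothetical op name for a re-pin instruction.
const OpRotateKey = "key.rotate"

type PinnedKey struct {
	KeyID     string
	PublicKey string // authorized_keys-style public key line
	Role      Role
}

// AllowedSigners is the pinned set, keyed by key_id.
type AllowedSigners map[string]PinnedKey

// MayAuthorize reports whether the key may authorize op at all. It is a policy
// check only; the signature itself must already have been verified.
func (s AllowedSigners) MayAuthorize(keyID, op string) bool {
	k, ok := s[keyID]
	if !ok {
		return false
	}
	switch k.Role {
	case RoleOperational:
		return true
	case RoleRecovery:
		return op == OpRotateKey
	default:
		return false
	}
}

// QuorumMet counts distinct authorized signers against a threshold.
// With threshold 1 it reduces to today's single-signer rule.
func (s AllowedSigners) QuorumMet(signerIDs []string, op string, threshold int) bool {
	distinct := map[string]bool{}
	for _, id := range signerIDs {
		if s.MayAuthorize(id, op) {
			distinct[id] = true
		}
	}
	return len(distinct) >= threshold
}
```

Duplicate signatures from the same key count once, so a two-signature policy cannot be satisfied by one key signing twice.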

## 4. Rotation & compromise recovery

The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.

- **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
  the agent accepts it because it's signed by the trusted current key (key-signs-key).
- **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
  it because the recovery key is pinned and authorized for rotation. The compromised key is removed
  from the allowed set in the same signed op.
- **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
  way initial enrollment did).
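
The re-pin itself can be sketched as a pure function over the pinned set, applied only after the rotation op has passed the full verify pipeline. The types and field names below are hypothetical; the point is that the add and the remove happen in one step driven by one signed op, and that a rejected rotation leaves the existing set untouched.

```go
package main

import "errors"

type PinnedKey struct {
	PublicKey  string // authorized_keys-style public key line
	RotateOnly bool   // true for the cold recovery key
}

// Rotation is the payload of a signed re-pin op (hypothetical shape).
type Rotation struct {
	NewKeyID     string
	NewPublicKey string
	RemoveKeyIDs []string // e.g. the compromised operational key
}

// ApplyRotation returns the next pinned set. It assumes the op's signature,
// target, window, and nonce were already verified, and that signerKeyID is the
// key that produced the verified signature.
func ApplyRotation(pinned map[string]PinnedKey, signerKeyID string, r Rotation) (map[string]PinnedKey, error) {
	if _, ok := pinned[signerKeyID]; !ok {
		return nil, errors.New("rotation not signed by a pinned key")
	}
	if r.NewKeyID == "" || r.NewPublicKey == "" {
		return nil, errors.New("rotation must name a new operational key")
	}
	// Work on a copy so a failed rotation never leaves a half-updated set.
	next := make(map[string]PinnedKey, len(pinned)+1)
	for id, k := range pinned {
		next[id] = k
	}
	for _, id := range r.RemoveKeyIDs {
		delete(next, id)
	}
	next[r.NewKeyID] = PinnedKey{PublicKey: r.NewPublicKey}
	return next, nil
}
```

Whether an operational key should be allowed to remove the recovery key is a policy question this sketch leaves open; a real implementation would likely reject that case explicitly.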

## 5. Component roles

- **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
  (`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
  later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
  key (passphrase-protected); can reach the cold recovery key when rotation is needed.
- **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
  status in the operator UI. Holds **no** private key and cannot sign — a compromised hub can only
  queue blobs the agent rejects. (Matches `03` §4 / box-initiated poll.)
- **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
  verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
  customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
  (a compromised hub could issue *and* suppress notice — the signature is the control).
- **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
  physical-presence provisioning step (the trust root is established on-site, not via the hub).
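
A sketch of what the `Signer` seam might look like in Go. Only the interface shape `Sign(blob) → signature` comes from the design; the concrete type, constructor, and `SignOp` wrapper are hypothetical, and the file-key backend here emits a raw Ed25519 signature from an in-memory key as a stand-in, whereas the real backend would produce an armored SSHSIG signature from the passphrase-protected key file.

```go
package main

import (
	"crypto/ed25519"
	"errors"
)

// Signer is the only thing the signing CLI depends on.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// fileKeySigner holds an already-decrypted software key.
type fileKeySigner struct {
	priv ed25519.PrivateKey
}

func (s fileKeySigner) Sign(blob []byte) ([]byte, error) {
	if len(s.priv) != ed25519.PrivateKeySize {
		return nil, errors.New("no signing key loaded")
	}
	return ed25519.Sign(s.priv, blob), nil
}

// SignOp is the CLI-side call. It never inspects which backend it was given,
// so a hardware backend can replace the file key without touching this code.
func SignOp(s Signer, blob []byte) ([]byte, error) {
	if len(blob) == 0 {
		return nil, errors.New("refusing to sign an empty blob")
	}
	return s.Sign(blob)
}

// newFileKeySigner generates a throwaway key for demonstration purposes.
func newFileKeySigner() (Signer, ed25519.PublicKey, error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return nil, nil, err
	}
	return fileKeySigner{priv: priv}, pub, nil
}

// verifyRaw checks the stand-in signature format used by this sketch.
func verifyRaw(pub ed25519.PublicKey, blob, sig []byte) bool {
	return ed25519.Verify(pub, blob, sig)
}
```

A later FIDO2 backend would be another type with the same one-method set, which is what keeps the hub and agent out of the change.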

## 6. Operator workflow

- **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
  overhead.
- **A destructive op** (rare): the operator runs the signing CLI on their workstation — which builds
  the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub
  queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the
  desk. **Never** a site visit.

## 7. Hardware readiness (Viktor's "build the foundation now")

Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later is a **no-op on the
boxes** — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a
spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the
`Signer` backend at it, and updates the allowed-signers entry; nothing on the boxes changes.

Two honest notes:
- **Confirm with a real device at adoption.** Phase 4 §5 was validated to spec, not against live
  hardware — a 5-minute real-key round-trip should confirm it (no surprise expected;
  signer/library/device all follow the same spec).
- **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
  crypto-only today (correct for software keys); enforcing the flag is a small later option once
  hardware is in use.

## 8. Open items
- **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
  allowed-signers-set foundation supports it.
- **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
  detail.
- **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.

## 9. What this unblocks
Closes the `03` §4 "undesigned signing path." Hands the implementation: the **canonical blob spec**
(§2.1) + the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's verify path, the
**`Signer` interface** for the operator CLI, and the **allowed-signers pinning** step for enrollment.
The hub's signed-job queue + pending-op UI carry into the hub architecture doc.