# Architecture Part 4 — Control-plane authorization (operator signing)
> Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
> To be reviewed by Claude Code against that spike + `03` §4, then placed at
> `docs/architecture/04-control-plane-authorization.md`.
>
> Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
> This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."
## 1. Purpose & scope
`03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
primitive, the keys, rotation, the three components' roles, and the operator workflow. The
*policy* (what needs a signature) lives in `03` §4; this is the *how*.
**Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
signed ops are rare and deliberate.
## 2. Primitive — SSH signatures (SSHSIG)
Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
`SSHSIG` format), verified by the agent in Go (`golang.org/x/crypto/ssh`): `pem.Decode` →
`ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no hand-rolled crypto.
**Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
different signed-data), so it would force a verifier change on every box at hardware-adoption time.
SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).
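
To make the "key type embedded in the signature" point concrete, here is a minimal sketch of de-armoring an SSHSIG blob and reading its fields. It is an illustration only: it uses the standard library so it runs standalone, whereas the agent uses `golang.org/x/crypto/ssh` for this step. `Envelope`, `ParseEnvelope`, `readString`, `putString` and `demoEnvelope` are hypothetical names, and the demo blob carries placeholder key and signature bytes, not real ones. The field order follows the OpenSSH `PROTOCOL.sshsig` layout.

```go
package main

import (
	"encoding/binary"
	"encoding/pem"
	"errors"
	"fmt"
)

// Envelope holds the fields of a de-armored SSHSIG blob (hypothetical type).
type Envelope struct {
	KeyType   string // e.g. "ssh-ed25519" or "sk-ssh-ed25519@openssh.com"
	Namespace string
	HashAlg   string
	PublicKey []byte // SSH wire-format public key
	Signature []byte // SSH wire-format signature
}

// readString reads one SSH "string": a big-endian uint32 length, then the bytes.
func readString(b []byte) (s, rest []byte, err error) {
	if len(b) < 4 {
		return nil, nil, errors.New("truncated length prefix")
	}
	n := binary.BigEndian.Uint32(b)
	if uint64(n) > uint64(len(b)-4) {
		return nil, nil, errors.New("truncated string")
	}
	return b[4 : 4+n], b[4+n:], nil
}

// ParseEnvelope de-armors the signature and splits it into its five strings:
// publickey, namespace, reserved, hash_algorithm, signature.
func ParseEnvelope(armored []byte) (*Envelope, error) {
	block, _ := pem.Decode(armored)
	if block == nil || block.Type != "SSH SIGNATURE" {
		return nil, errors.New("not an armored SSH signature")
	}
	b := block.Bytes
	if len(b) < 10 || string(b[:6]) != "SSHSIG" {
		return nil, errors.New("missing SSHSIG magic")
	}
	if binary.BigEndian.Uint32(b[6:10]) != 1 {
		return nil, errors.New("unsupported SSHSIG version")
	}
	b = b[10:]
	var f [5][]byte
	var err error
	for i := range f {
		if f[i], b, err = readString(b); err != nil {
			return nil, err
		}
	}
	if len(b) != 0 {
		return nil, errors.New("trailing data after signature")
	}
	// The key type is the first string inside the public key blob.
	keyType, _, err := readString(f[0])
	if err != nil {
		return nil, err
	}
	return &Envelope{KeyType: string(keyType), PublicKey: f[0],
		Namespace: string(f[1]), HashAlg: string(f[3]), Signature: f[4]}, nil
}

func putString(dst, s []byte) []byte {
	dst = binary.BigEndian.AppendUint32(dst, uint32(len(s)))
	return append(dst, s...)
}

// demoEnvelope builds a structurally valid blob with placeholder key/signature bytes.
func demoEnvelope(keyType string) []byte {
	pub := putString(putString(nil, []byte(keyType)), make([]byte, 32))
	blob := binary.BigEndian.AppendUint32([]byte("SSHSIG"), 1)
	blob = putString(blob, pub)
	blob = putString(blob, []byte("felhom-op-v1"))
	blob = putString(blob, nil) // reserved
	blob = putString(blob, []byte("sha512"))
	blob = putString(blob, make([]byte, 64))
	return pem.EncodeToMemory(&pem.Block{Type: "SSH SIGNATURE", Bytes: blob})
}

func main() {
	for _, kt := range []string{"ssh-ed25519", "sk-ssh-ed25519@openssh.com"} {
		env, err := ParseEnvelope(demoEnvelope(kt))
		if err != nil {
			panic(err)
		}
		fmt.Println(env.KeyType, env.Namespace)
	}
}
```

The same parse path yields either key type, which is why the verifier does not need to change when the signer moves to hardware.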
### 2.1 The signed object — canonical op blob
The signature covers an op blob (Phase 4 §2):
```
{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
```
- **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
**verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
- `nonce` ≥128-bit random; `issued_at`/`expires_at` a short window (minutes); `key_id` identifies
the signing key (rotation/audit).
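
A minimal sketch of the signer-side canonical form, assuming Go's `encoding/json` is used on the operator CLI. It relies on `encoding/json` emitting map keys in sorted order; `CanonicalJSON` is a hypothetical helper name, and the field values below (op name, IDs, timestamp format, nonce encoding) are illustrative assumptions, not the spec.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// CanonicalJSON serializes v as compact UTF-8 JSON with object keys sorted at
// every level. It round-trips v through generic maps, because encoding/json
// writes map keys in sorted order.
func CanonicalJSON(v any) ([]byte, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var generic any
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep numbers exactly as written
	if err := dec.Decode(&generic); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // emit <, > and & literally
	if err := enc.Encode(generic); err != nil {
		return nil, err
	}
	return bytes.TrimRight(buf.Bytes(), "\n"), nil // Encode appends a newline
}

func main() {
	// Illustrative values only; the real field encodings are set by the spec.
	op := map[string]any{
		"op":         "guest.destroy",
		"target":     map[string]any{"host_id": "box-a", "guest_id": "vm-101"},
		"params":     map[string]any{},
		"nonce":      "00112233445566778899aabbccddeeff",
		"issued_at":  "2025-01-01T12:00:00Z",
		"expires_at": "2025-01-01T12:05:00Z",
		"key_id":     "op-key-1",
	}
	blob, err := CanonicalJSON(op)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob))
}
```

The output of `CanonicalJSON` is what gets signed and shipped; the agent never re-serializes it.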
### 2.2 Domain separation — the namespace
The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
caller-supplied. A signature minted for any other namespace does not verify (proven in the Phase 4
spike), which stops a signature made for one purpose from being reused for another.
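
The namespace works as domain separation because it is part of the bytes that actually get signed. The sketch below rebuilds the SSHSIG signed-data preimage (per OpenSSH `PROTOCOL.sshsig`) for a plain `ssh-ed25519` key using only the standard library. It is an illustration under those assumptions, not the agent's code; `signedData`, `verifyOp` and `demoNamespace` are hypothetical names, and `sha512` is assumed as the hash algorithm.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

// opNamespace is fixed in the verifier, never taken from the caller.
const opNamespace = "felhom-op-v1"

// sshString encodes b as an SSH "string": uint32 length, then the bytes.
func sshString(b []byte) []byte {
	out := binary.BigEndian.AppendUint32(nil, uint32(len(b)))
	return append(out, b...)
}

// signedData builds the SSHSIG preimage:
// "SSHSIG" || string namespace || string reserved || string hash_alg || string H(message).
func signedData(namespace string, message []byte) []byte {
	h := sha512.Sum512(message)
	data := []byte("SSHSIG")
	data = append(data, sshString([]byte(namespace))...)
	data = append(data, sshString(nil)...) // reserved, empty
	data = append(data, sshString([]byte("sha512"))...)
	return append(data, sshString(h[:])...)
}

// verifyOp always verifies against the fixed namespace.
func verifyOp(pub ed25519.PublicKey, message, sig []byte) bool {
	return ed25519.Verify(pub, signedData(opNamespace, message), sig)
}

// demoNamespace signs the same message under two namespaces and reports
// whether each signature verifies as a felhom op.
func demoNamespace() (sameNS, otherNS bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	msg := []byte(`{"op":"guest.destroy"}`)
	good := ed25519.Sign(priv, signedData(opNamespace, msg))
	other := ed25519.Sign(priv, signedData("file", msg))
	return verifyOp(pub, msg, good), verifyOp(pub, msg, other)
}

func main() {
	same, other := demoNamespace()
	fmt.Println(same, other) // true false
}
```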
### 2.3 Verify pipeline (order is load-bearing)
`namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
recorded last**, only after everything else passes, so an invalid signature can never consume a
nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
- **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
A cannot be replayed at box B);
- **time window** — `now ∈ [issued_at, expires_at]`;
- **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
and expiry-pruned; a non-persistent store reopens the replay window after a restart).
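
The ordering above can be sketched as a single function. This is a simplified illustration, not `VerifySignedOp`: the op is shown as an already-parsed struct, the crypto check is a stub (`CryptoOK`), and the nonce store is an in-memory map standing in for the persistent store the text requires. All type and field names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const opNamespace = "felhom-op-v1"

// SignedOp is a flattened, already-parsed view of a signed op (hypothetical).
type SignedOp struct {
	Namespace           string
	KeyID               string
	HostID, GuestID     string
	IssuedAt, ExpiresAt time.Time
	Nonce               string
}

// Verifier holds this box's identity and pinned state (hypothetical).
type Verifier struct {
	HostID   string
	Guests   map[string]bool
	Allowed  map[string]bool      // pinned signer key IDs
	CryptoOK func(SignedOp) bool  // stand-in for the SSHSIG signature check
	Seen     map[string]time.Time // nonce -> expiry; must be persistent in the agent
}

// Verify runs the layers in order and records the nonce only at the very end.
func (v *Verifier) Verify(op SignedOp, now time.Time) error {
	switch {
	case op.Namespace != opNamespace:
		return errors.New("namespace mismatch")
	case !v.Allowed[op.KeyID]:
		return errors.New("signer not in allowed set")
	case !v.CryptoOK(op):
		return errors.New("bad signature")
	case op.HostID != v.HostID || !v.Guests[op.GuestID]:
		return errors.New("target is not this box/guest")
	case now.Before(op.IssuedAt) || now.After(op.ExpiresAt):
		return errors.New("outside validity window")
	}
	for n, exp := range v.Seen { // prune nonces whose window has passed
		if now.After(exp) {
			delete(v.Seen, n)
		}
	}
	if _, replay := v.Seen[op.Nonce]; replay {
		return errors.New("nonce already used")
	}
	v.Seen[op.Nonce] = op.ExpiresAt
	return nil
}

// newDemo returns a verifier for box-a plus one valid op and a fixed clock.
func newDemo() (*Verifier, SignedOp, time.Time) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	v := &Verifier{
		HostID:   "box-a",
		Guests:   map[string]bool{"vm-101": true},
		Allowed:  map[string]bool{"op-key-1": true},
		CryptoOK: func(SignedOp) bool { return true },
		Seen:     map[string]time.Time{},
	}
	op := SignedOp{Namespace: opNamespace, KeyID: "op-key-1",
		HostID: "box-a", GuestID: "vm-101", Nonce: "n-1",
		IssuedAt: now.Add(-time.Minute), ExpiresAt: now.Add(4 * time.Minute)}
	return v, op, now
}

func main() {
	v, op, now := newDemo()
	wrong := op
	wrong.HostID = "box-b"
	fmt.Println(v.Verify(wrong, now), len(v.Seen)) // rejected, nonce not consumed
	fmt.Println(v.Verify(op, now))                 // accepted
	fmt.Println(v.Verify(op, now))                 // replay rejected
}
```

Because the nonce write is the last statement, every earlier rejection leaves the store untouched.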
The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
## 3. The keys — two-key model, software now
Both software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no box
change (§7).
- **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
for destructive ops — rare, so its exposure is low.
- **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
if that key is lost or compromised.
Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
instructions.
**Allowed-signers is a set** → single signer today; **quorum (N-of-M) for the highest-blast ops is
just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
now.
## 4. Rotation & compromise recovery
The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.
- **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
the agent accepts it because it's signed by the trusted current key (key-signs-key).
- **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
it because the recovery key is pinned and authorized for rotation. The compromised key is removed
from the allowed set in the same signed op.
- **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
way initial enrollment did).
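
A minimal sketch of the role split and the signed re-pin, assuming the signature on the re-pin op has already passed the verify pipeline (§2.3). The types and method names are hypothetical, key material is reduced to key IDs, and rotating the recovery key itself is left out because the text above does not define it.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is what a pinned key is authorized to sign (hypothetical type).
type Role int

const (
	RoleOperational Role = iota // destructive ops and planned rotation
	RoleRecovery                // key-rotation instructions only
)

// AllowedSigners maps a pinned key ID to its role.
type AllowedSigners map[string]Role

// MaySign reports whether keyID may authorize an op of the given kind.
func (s AllowedSigners) MaySign(keyID string, isRotation bool) bool {
	role, pinned := s[keyID]
	if !pinned {
		return false
	}
	return role == RoleOperational || isRotation
}

// Repin is a rotation instruction whose signature was already verified.
type Repin struct {
	SignerKeyID string // key that signed the instruction
	AddKeyID    string // new operational key to pin
	RemoveKeyID string // old or compromised key to drop (may be empty)
}

// Apply pins the new key and removes the old one in a single step.
func (s AllowedSigners) Apply(r Repin) error {
	if !s.MaySign(r.SignerKeyID, true) {
		return errors.New("re-pin not signed by a pinned key")
	}
	if r.AddKeyID == "" {
		return errors.New("re-pin names no new key")
	}
	if role, ok := s[r.RemoveKeyID]; ok && role == RoleRecovery {
		return errors.New("recovery-key rotation is not covered by this sketch")
	}
	delete(s, r.RemoveKeyID)
	s[r.AddKeyID] = RoleOperational
	return nil
}

func demoSet() AllowedSigners {
	return AllowedSigners{"op-1": RoleOperational, "rec-1": RoleRecovery}
}

func main() {
	s := demoSet()
	fmt.Println(s.MaySign("rec-1", false)) // recovery key cannot sign ordinary ops
	fmt.Println(s.Apply(Repin{SignerKeyID: "hub", AddKeyID: "evil"}))
	fmt.Println(s.Apply(Repin{SignerKeyID: "rec-1", AddKeyID: "op-2", RemoveKeyID: "op-1"}))
	fmt.Println(s.MaySign("op-1", false), s.MaySign("op-2", false))
}
```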
## 5. Component roles
- **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
(`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
key (passphrase-protected); can reach the cold recovery key when rotation is needed.
- **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
status in the operator UI. Holds **no** private key and cannot sign — a compromised hub can only
queue blobs the agent rejects. (Matches `03` §4 / box-initiated poll.)
- **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
(a compromised hub could issue *and* suppress notice — the signature is the control).
- **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
physical-presence provisioning step (the trust root is established on-site, not via the hub).
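
The `Signer` seam on the workstation could look like the sketch below. Only the interface shape `Sign(blob) → signature` comes from the text; `FileKeySigner`, its `args` helper, and the key path are assumptions for illustration. The file-key backend shells out to `ssh-keygen -Y sign`, which signs data read from standard input and writes the armored signature to standard output when no file argument is given.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// Signer is the thin seam between the CLI and the key backend.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// FileKeySigner signs with a software key on disk (hypothetical backend).
type FileKeySigner struct {
	KeyPath string // path to the dedicated operational private key
}

// args returns the ssh-keygen arguments; the namespace is fixed, not configurable.
func (s FileKeySigner) args() []string {
	return []string{"-Y", "sign", "-f", s.KeyPath, "-n", "felhom-op-v1"}
}

// Sign pipes the canonical blob to ssh-keygen and returns the armored signature.
// Passphrase entry is left to ssh-keygen itself.
func (s FileKeySigner) Sign(blob []byte) ([]byte, error) {
	cmd := exec.Command("ssh-keygen", s.args()...)
	cmd.Stdin = bytes.NewReader(blob)
	var out, stderr bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("ssh-keygen: %w: %s", err, stderr.String())
	}
	return out.Bytes(), nil
}

func main() {
	// The rest of the CLI only ever sees the interface.
	var s Signer = FileKeySigner{KeyPath: "/path/to/felhom_op_key"}
	fmt.Printf("%T ssh-keygen %v\n", s, FileKeySigner{KeyPath: "/path/to/felhom_op_key"}.args())
}
```

A hardware backend would be another type satisfying `Signer`; callers, the blob format, the hub and the agent stay as they are.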
## 6. Operator workflow
- **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
overhead.
- **A destructive op** (rare): the operator runs the signing CLI on their workstation — which builds
the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub
queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the
desk. **Never** a site visit.
## 7. Hardware readiness (Viktor's "build the foundation now")
Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later is a **no-op on the
boxes** — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a
spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the
`Signer` backend at it, and updates the allowed-signers entry; nothing on the boxes changes.
Two honest notes:
- **Confirm with a real device at adoption.** Phase 4 §5 was validated to spec, not against live
hardware; a 5-minute real-key round-trip should confirm it (no surprise expected, since signer,
library, and device all follow the same spec).
- **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
crypto-only today (correct for software keys); enforcing the flag is a small later option once
hardware is in use.
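
For the optional touch requirement, a sketch of the extra check, assuming the trailing bytes of an sk signature are laid out as in OpenSSH `PROTOCOL.u2f` (one flags byte, then a big-endian `uint32` counter, with bit `0x01` meaning user presence). `parseSKTrailer` and `requireTouch` are hypothetical names. The check is only meaningful after the signature itself has verified, since the flags are covered by it.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// flagUserPresent is the user-presence (touch) bit in the sk flags byte.
const flagUserPresent = 0x01

// parseSKTrailer splits the bytes that follow the raw signature in an sk
// signature blob: 1 flags byte, then a 4-byte big-endian counter.
func parseSKTrailer(rest []byte) (flags byte, counter uint32, err error) {
	if len(rest) != 5 {
		return 0, 0, errors.New("sk trailer must be 1 flags byte + 4 counter bytes")
	}
	return rest[0], binary.BigEndian.Uint32(rest[1:]), nil
}

// requireTouch rejects an sk signature made without user presence.
func requireTouch(rest []byte) error {
	flags, _, err := parseSKTrailer(rest)
	if err != nil {
		return err
	}
	if flags&flagUserPresent == 0 {
		return errors.New("user-presence (touch) flag not set")
	}
	return nil
}

func main() {
	fmt.Println(requireTouch([]byte{0x01, 0, 0, 0, 7})) // touched
	fmt.Println(requireTouch([]byte{0x00, 0, 0, 0, 8})) // no touch
}
```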
## 8. Open items
- **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
allowed-signers-set foundation supports it.
- **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
detail.
- **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.
## 9. What this unblocks
Closes the `03` §4 "undesigned signing path." It hands implementation three things: the
**canonical blob spec** (§2.1) plus the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's
verify path, the **`Signer` interface** for the operator CLI, and the **allowed-signers pinning**
step for enrollment. The hub's signed-job queue and pending-op UI carry over into the hub
architecture doc.