felhom-agent/docs/architecture/04-control-plane-authorization.md

# Architecture Part 4 — Control-plane authorization (operator signing)

> Status: design draft (decision content), grounded on `docs/tests/phase4-signing-findings.md`.
> To be reviewed by Claude Code against that spike + `03` §4, then placed at
> `docs/architecture/04-control-plane-authorization.md`.
>
> Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate).
> This doc defines the **mechanism** behind `03` §4's "an operator signature the hub can't forge."

## 1. Purpose & scope

`03` §4 gates **destructive/irreversible** operations behind an operator signature the hub cannot
forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
primitive, the keys, rotation, the three components' roles, and the operator workflow. The
*policy* (what needs a signature) lives in `03` §4; this is the *how*.

**Recap of what needs a signature** (from `03` §4, by reversibility, not by verb): destroying or
overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
storage detach/wipe, restore-overwrite, decommission — **regardless of whether it arrives as a job
or a desired-state delta**. Benign convergence (deploy a guest, attach storage, restore to a *new*
guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
signed ops are rare and deliberate.

## 2. Primitive — SSH signatures (SSHSIG)

Confirmed by Phase 4: destructive ops carry an **SSH signature** (`ssh-keygen -Y sign`, the armored
`SSHSIG` format), verified by the agent in Go (`golang.org/x/crypto/ssh`) — `pem.Decode` →
`ssh.Unmarshal` → `ssh.ParsePublicKey` → `pub.Verify`. ~40 lines of framing, no hand-rolled crypto.

**Why SSHSIG and not raw Ed25519 / minisign:** SSHSIG verification dispatches on the key type
embedded in the signature, so the **same verifier accepts a software key (`ssh-ed25519`) today and
a FIDO2 hardware key (`sk-ssh-ed25519@openssh.com`) later** — which is exactly the hardware-ready
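foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature: the authenticator
signs a different structure (application hash, flags, counter, message hash), and the flags and
counter travel alongside the signature. It would force a verifier change on every box at
hardware-adoption time.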
SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).
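
To make the framing concrete, here is a minimal stdlib-only sketch of the bytes an SSHSIG signature covers for a plain `ssh-ed25519` key. It is an illustration under assumptions, not the Phase 4 verifier: the agent is expected to use `golang.org/x/crypto/ssh` as described above, and the helper names (`sshString`, `signedData`, `verifyEd25519`, `roundTrip`) are hypothetical. It assumes SHA-512 as the SSHSIG hash algorithm.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha512"
	"encoding/binary"
)

// opNamespace mirrors the fixed namespace used in this document.
const opNamespace = "felhom-op-v1"

// sshString encodes b as an SSH wire "string": uint32 big-endian length, then the bytes.
func sshString(b []byte) []byte {
	out := make([]byte, 4, 4+len(b))
	binary.BigEndian.PutUint32(out, uint32(len(b)))
	return append(out, b...)
}

// signedData builds the SSHSIG preimage:
// "SSHSIG" || string(namespace) || string(reserved) || string(hash_alg) || string(H(message)).
func signedData(namespace string, message []byte) []byte {
	digest := sha512.Sum512(message)
	data := []byte("SSHSIG") // magic preamble, not length-prefixed
	data = append(data, sshString([]byte(namespace))...)
	data = append(data, sshString(nil)...) // reserved field, empty
	data = append(data, sshString([]byte("sha512"))...)
	data = append(data, sshString(digest[:])...)
	return data
}

// verifyEd25519 checks a raw Ed25519 signature over the SSHSIG preimage.
func verifyEd25519(pub ed25519.PublicKey, namespace string, message, sig []byte) bool {
	return ed25519.Verify(pub, signedData(namespace, message), sig)
}

// roundTrip signs under signNS and verifies under verifyNS with a throwaway key.
func roundTrip(signNS, verifyNS string) bool {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return false
	}
	msg := []byte(`{"op":"guest.destroy"}`)
	sig := ed25519.Sign(priv, signedData(signNS, msg))
	return verifyEd25519(pub, verifyNS, msg, sig)
}
```

Because the namespace is part of the preimage, a signature minted under another namespace fails verification even when the key and message are identical.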

### 2.1 The signed object — canonical op blob
The signature covers an op blob (Phase 4 §2):

```
{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
```

- **Canonical form is a *signer-side* requirement** — JSON, keys sorted at every level, no
  insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The
  **verifier trusts the exact bytes it receives** (it verifies the signature over the raw bytes and
  parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify
  side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
- `nonce` ≥128-bit random; `issued_at`/`expires_at` a short window (minutes); `key_id` identifies
  the signing key (rotation/audit).
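
A minimal sketch of how the signer side could produce that canonical form in Go. It relies on `encoding/json` sorting map keys and emitting compact output. The field values, the integer Unix-second timestamps, and the helper names `canonicalJSON` and `exampleBlob` are assumptions for illustration; the authoritative field encoding is whatever Phase 4 §2 specifies.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// canonicalJSON renders v as compact JSON. encoding/json sorts map keys,
// so nesting only map[string]any values gives sorted keys at every level.
func canonicalJSON(v map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, >, & literal instead of \u00XX escapes
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	// Encode appends a newline; the canonical form carries none.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

// exampleBlob returns an op blob with hypothetical values.
func exampleBlob() map[string]any {
	return map[string]any{
		"op":         "guest.destroy",
		"target":     map[string]any{"host_id": "box-01", "guest_id": "vm-7"},
		"params":     map[string]any{},
		"nonce":      "00112233445566778899aabbccddeeff", // 128 bits, hex
		"issued_at":  int64(1700000000),
		"expires_at": int64(1700000300),
		"key_id":     "op-key-1",
	}
}
```

Structs would marshal in field-declaration order rather than sorted order, which is why this sketch uses maps throughout. The verifier never re-canonicalizes; it checks the signature over the received bytes and then parses those same bytes.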

### 2.2 Domain separation — the namespace
The SSHSIG **namespace** `felhom-op-v1` is a **fixed constant in the verifier**, never
caller-supplied. A signature minted for any other namespace must not verify (proven). This stops a
signature made for one purpose being reused for another.

### 2.3 Verify pipeline (order is load-bearing)
`namespace → allow-list → crypto verify → target binding → time window → nonce`. The **nonce is
recorded last**, only after everything else passes, so an invalid signature can never consume a
nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4).
The namespace is covered in §2.2 and the allow-list in §3; the checks after crypto verify are:
- **target binding** — `target.host_id`/`guest_id` must equal *this* box/guest (a signature for box
  A cannot be replayed at box B);
- **time window** — `now ∈ [issued_at, expires_at]`;
- **nonce** — unseen within the window (the nonce store **must be persistent across agent restarts**
  and expiry-pruned; a non-persistent store reopens the replay window after a restart).

The Phase-4 reference verifier (`VerifySignedOp`) is the seed of the agent's implementation.
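
The ordering can be sketched as follows. This is not `VerifySignedOp`; it is a hypothetical, simplified illustration in which the cryptographic check is reduced to a boolean stand-in (`SigOK`) and the nonce store is an in-memory map. A real agent needs the persistent store described above. All type, field, and function names are assumptions.

```go
package main

import (
	"errors"
	"time"
)

const opNamespace = "felhom-op-v1" // fixed in the verifier, never caller-supplied

var (
	errNamespace = errors.New("wrong namespace")
	errSigner    = errors.New("signer not in allowed set")
	errSignature = errors.New("signature invalid")
	errTarget    = errors.New("op targets another box or guest")
	errWindow    = errors.New("outside validity window")
	errReplay    = errors.New("nonce already used")
)

type signedOp struct {
	Namespace, KeyID, HostID, GuestID, Nonce string
	IssuedAt, ExpiresAt                      time.Time
	SigOK                                    bool // stand-in for the real SSHSIG check
}

type verifier struct {
	hostID  string
	guests  map[string]bool      // guests that exist on this box
	allowed map[string]bool      // pinned key_ids
	seen    map[string]time.Time // nonce -> expiry (in-memory for the sketch only)
}

// verify runs the layers in the documented order; the nonce is written last.
func (v *verifier) verify(op signedOp, now time.Time) error {
	switch {
	case op.Namespace != opNamespace:
		return errNamespace
	case !v.allowed[op.KeyID]:
		return errSigner
	case !op.SigOK:
		return errSignature
	case op.HostID != v.hostID || !v.guests[op.GuestID]:
		return errTarget
	case now.Before(op.IssuedAt) || now.After(op.ExpiresAt):
		return errWindow
	}
	for n, exp := range v.seen { // prune expired nonces
		if now.After(exp) {
			delete(v.seen, n)
		}
	}
	if _, used := v.seen[op.Nonce]; used {
		return errReplay
	}
	v.seen[op.Nonce] = op.ExpiresAt // recorded only after every other check passed
	return nil
}

// demo exercises: accept, replay, forged signature, then the same nonce with a valid signature.
func demo() (first, replay, forged, afterForged error) {
	now := time.Unix(1700000000, 0)
	v := &verifier{
		hostID:  "box-01",
		guests:  map[string]bool{"vm-7": true},
		allowed: map[string]bool{"op-key-1": true},
		seen:    map[string]time.Time{},
	}
	op := signedOp{
		Namespace: opNamespace, KeyID: "op-key-1", HostID: "box-01", GuestID: "vm-7",
		Nonce: "n1", IssuedAt: now.Add(-time.Minute), ExpiresAt: now.Add(4 * time.Minute),
		SigOK: true,
	}
	first = v.verify(op, now)
	replay = v.verify(op, now)
	bad := op
	bad.Nonce, bad.SigOK = "n2", false
	forged = v.verify(bad, now)
	bad.SigOK = true
	afterForged = v.verify(bad, now)
	return
}
```

In `demo`, the forged attempt with nonce `n2` is rejected at the signature layer, so the later valid op carrying `n2` is still accepted. That is the DoS-safety property the ordering exists to provide.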

## 3. The keys — two-key model, software now

Both are software (SSH-format) keys today; either can later be replaced by a FIDO2-backed key with
no verifier change on the boxes (§7).

- **Operational signing key** — the "master stamp" for destructive ops. A **dedicated** key (NOT
  the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only
  for destructive ops — rare, so its exposure is low.
- **Cold recovery key** — generated once, kept **offline** (password manager / a USB held back /
  printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key
  if that key is lost or compromised.

Both **public** keys are pinned onto the agent at enrollment (the allowed-signers set). The
operational key is authorized for ops; the recovery key is authorized **only** for key-rotation
instructions.
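
A minimal sketch of that role split as a data structure, assuming the pinned set is keyed by `key_id`. The type and constant names are hypothetical, and public-key material is omitted to keep the example short. It follows §4 in letting the operational key sign planned rotation as well as destructive ops.

```go
package main

type role int

const (
	roleOperational role = iota // destructive ops and planned rotation
	roleRecovery                // key-rotation instructions only
)

type opClass int

const (
	classDestructive opClass = iota
	classRotation
)

// allowedSigners maps key_id to the role pinned at enrollment.
type allowedSigners map[string]role

// authorized reports whether keyID may sign an op of class c.
func (s allowedSigners) authorized(keyID string, c opClass) bool {
	r, pinned := s[keyID]
	if !pinned {
		return false
	}
	switch r {
	case roleOperational:
		return true
	case roleRecovery:
		return c == classRotation
	}
	return false
}
```

This check belongs to the allow-list layer of the verify pipeline; it narrows what a valid signature is allowed to authorize, and does not replace the signature check.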

**Allowed-signers is a set** → single signer today; **quorum (N-of-M) for the highest-blast ops is
just set sizing + a threshold policy**, addable later without a redesign (Phase 4 §8). Out of scope
now.

## 4. Rotation & compromise recovery

The agents pin the operator public keys. The danger: rotation must **not** flow as plain hub config,
or a compromised hub re-pins its own key and forges everything. So **every re-pin is itself a signed
op the agent verifies** (same pipeline, §2.3) — never unauthenticated config.

- **Planned rotation:** the *current* operational key signs a "new operational public key = X" op;
  the agent accepts it because it's signed by the trusted current key (key-signs-key).
- **Operational key lost/compromised:** the **cold recovery key** signs the re-pin; the agent accepts
  it because the recovery key is pinned and authorized for rotation. The compromised key is removed
  from the allowed set in the same signed op.
- **Both keys gone:** on-site physical re-enrollment (last resort — re-establishes the trust root the
  way initial enrollment did).
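
The first two cases can be sketched as one operation on the pinned set. This is a hypothetical illustration: `repinOp`, `pinnedKey`, and `applyRepin` are assumed names, the public keys are placeholders, and the function assumes the op has already passed the full verify pipeline (§2.3) with a signer authorized for rotation.

```go
package main

import "errors"

type pinnedKey struct {
	Role      string // "operational" or "recovery"
	PublicKey string // placeholder for the pinned public key
}

// repinOp is the payload of a signed rotation instruction.
type repinOp struct {
	SignerKeyID  string   // key that signed this op
	NewKeyID     string   // new operational key to pin
	NewPublicKey string
	RemoveKeyIDs []string // keys dropped in the same op
}

// applyRepin returns the next pinned set without mutating the current one,
// so a rejected op leaves the trust root untouched.
func applyRepin(set map[string]pinnedKey, op repinOp) (map[string]pinnedKey, error) {
	if _, pinned := set[op.SignerKeyID]; !pinned {
		return nil, errors.New("re-pin signer is not a pinned key")
	}
	if op.NewKeyID == "" || op.NewPublicKey == "" {
		return nil, errors.New("re-pin must name a new key")
	}
	next := make(map[string]pinnedKey, len(set)+1)
	for id, k := range set {
		next[id] = k
	}
	for _, id := range op.RemoveKeyIDs {
		delete(next, id)
	}
	next[op.NewKeyID] = pinnedKey{Role: "operational", PublicKey: op.NewPublicKey}
	return next, nil
}
```

Planned rotation and compromise recovery differ only in which pinned key signs the op; a hub that sends the same payload without a pinned signer gets an error and changes nothing.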

## 5. Component roles

- **Operator tooling (the workstation).** A signing CLI behind a thin **`Signer` interface**
  (`Sign(blob) → signature`). The backend today is a **file key**; a **FIDO2/PIV** backend drops in
  later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private
  key (passphrase-protected); can reach the cold recovery key when rotation is needed.
- **Hub.** Queues the **opaque** signed blobs and surfaces pending destructive ops + their signature
  status in the operator UI. Holds **no** private key and cannot sign — a compromised hub can only
  queue blobs the agent rejects. (Matches `03` §4 / box-initiated poll.)
- **Agent (each box).** Pins the allowed-signers set (operational + recovery) at enrollment; runs the
  verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the
  customer-visible **audit log**. Notification-on-destructive-op is an audit signal, never the guard
  (a compromised hub could issue *and* suppress notice — the signature is the control).
- **Enrollment.** Pins the initial operational + recovery public keys onto the agent during the
  physical-presence provisioning step (the trust root is established on-site, not via the hub).
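
A minimal sketch of the `Signer` seam named in the operator-tooling bullet, assuming a byte-slice signature and an error return (the source only gives `Sign(blob) → signature`). The `fileKeySigner` backend here holds an already-decrypted Ed25519 key and returns a raw signature as a stand-in; the real backend would emit an armored SSHSIG. All names other than `Signer` and `Sign` are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"errors"
)

// Signer is the backend-agnostic interface the signing CLI depends on.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// fileKeySigner is a software-key backend (stand-in: raw Ed25519, not SSHSIG).
type fileKeySigner struct {
	key ed25519.PrivateKey
}

func (s fileKeySigner) Sign(blob []byte) ([]byte, error) {
	if len(s.key) != ed25519.PrivateKeySize {
		return nil, errors.New("signing key not loaded")
	}
	return ed25519.Sign(s.key, blob), nil
}

// signOp is what the CLI calls; it never learns which backend is in use.
func signOp(s Signer, blob []byte) ([]byte, error) {
	return s.Sign(blob)
}

// newFileKeySigner generates a throwaway key for the example.
func newFileKeySigner() (fileKeySigner, ed25519.PublicKey, error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		return fileKeySigner{}, nil, err
	}
	return fileKeySigner{key: priv}, pub, nil
}

// verifyRaw checks the stand-in signature produced by fileKeySigner.
func verifyRaw(pub ed25519.PublicKey, blob, sig []byte) bool {
	return ed25519.Verify(pub, blob, sig)
}
```

A hardware backend would be another type satisfying `Signer`; `signOp`, the blob format, the hub, and the agent would not need to change.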

## 6. Operator workflow

- **Routine work** (deploy, monitor, attach storage, restore to a *new* guest): no signing, zero
  overhead.
- **A destructive op** (rare): the operator runs the signing CLI on their workstation. The CLI
  builds the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the
  hub queue; the agent then polls, verifies, executes, and audits. One command plus a passphrase,
  from the desk. **Never** a site visit.

## 7. Hardware readiness (Viktor's "build the foundation now")

Software `ssh-ed25519` now; a FIDO2 `sk-ssh-ed25519@openssh.com` key later requires **no verifier
change on the boxes**. This was proven end-to-end against the OpenSSH spec in Phase 4 §5 (the
unchanged verifier accepts a spec-faithful sk signature). At hardware adoption the operator
generates an sk-key, points the `Signer` backend at it, and re-pins the allowed-signers entry
through a signed rotation op (§4); the agent's code does not change.

Two honest notes:
- **Confirm with a real device at adoption.** The Phase 4 §5 result was validated to spec, not
  against live hardware; a 5-minute real-key round-trip should confirm it (no surprise expected;
  signer/library/device all follow the same spec).
- **Optional future hardening:** require the FIDO2 **user-presence (touch) flag**. The verifier is
  crypto-only today (correct for software keys); enforcing the flag is a small later option once
  hardware is in use.

## 8. Open items
- **Quorum policy** (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the
  allowed-signers-set foundation supports it.
- **Signing-key passphrase UX** on the workstation (ssh-agent / askpass) — minor operator-tooling
  detail.
- **Hub-side pending-op UI** (showing ops awaiting signature + audit) — belongs to the hub doc.

## 9. What this unblocks
Closes the `03` §4 "undesigned signing path." It hands implementation three things: the **canonical
blob spec** (§2.1) plus the **`VerifySignedOp` reference** (Phase 4 §7) for the agent's verify path,
the **`Signer` interface** for the operator CLI, and the **allowed-signers pinning** step for
enrollment. The hub's signed-job queue + pending-op UI carry into the hub architecture doc.