Architecture Part 4 — Control-plane authorization (operator signing)

Status: design draft (decision content), grounded on docs/tests/phase4-signing-findings.md. To be reviewed by Claude Code against that spike + 03 §4, then placed at docs/architecture/04-control-plane-authorization.md.

Builds on Part 1 (enrollment / trust) and Part 3 (agent-side verification and the §4 reversibility gate). This doc defines the mechanism behind 03 §4's "an operator signature the hub can't forge."

1. Purpose & scope

03 §4 gates destructive/irreversible operations behind an operator signature the hub cannot forge. That gate is only real if signing is real. This doc defines the signing mechanism: the primitive, the keys, rotation, the three components' roles, and the operator workflow. The policy (what needs a signature) lives in 03 §4; this is the how.

Recap of what needs a signature (from 03 §4, by reversibility, not by verb): destroying or overwriting any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — regardless of whether it arrives as a job or a desired-state delta. Benign convergence (deploy a guest, attach storage, restore to a new guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned; signed ops are rare and deliberate.

2. Primitive — SSH signatures (SSHSIG)

Confirmed by Phase 4: destructive ops carry an SSH signature (ssh-keygen -Y sign, the armored SSHSIG format), verified by the agent in Go (golang.org/x/crypto/ssh) via pem.Decode → ssh.Unmarshal → ssh.ParsePublicKey → pub.Verify. About 40 lines of framing, no hand-rolled crypto.

Why SSHSIG and not raw Ed25519 / minisign: SSHSIG verification dispatches on the key type embedded in the signature, so the same verifier accepts a software key (ssh-ed25519) today and a FIDO2 hardware key (sk-ssh-ed25519@openssh.com) later, which is exactly the hardware-ready foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter, different signed-data), so it would force a verifier change on every box at hardware-adoption time. SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).

2.1 The signed object — canonical op blob

The signature covers an op blob (Phase 4 §2):

{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
  • Canonical form is a signer-side requirement — JSON, keys sorted at every level, no insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The verifier trusts the exact bytes it receives (it verifies the signature over the raw bytes and parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
  • nonce ≥128-bit random; issued_at/expires_at a short window (minutes); key_id identifies the signing key (rotation/audit).
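As an illustration of the canonical form described above, here is a minimal Go sketch of a signer-side encoder. It is not the project's encoder: the struct layout, the Unix-second timestamps, the hex nonce encoding, and the op name "guest.destroy" are all assumptions made for the example. It relies on the fact that encoding/json writes map keys in sorted order, so a round-trip through generic maps yields sorted keys at every level.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Target and OpBlob mirror the fields named in the blob above. The Go types,
// including Unix-second timestamps, are assumptions made for this sketch.
type Target struct {
	HostID  string `json:"host_id"`
	GuestID string `json:"guest_id"`
}

type OpBlob struct {
	Op        string         `json:"op"`
	Target    Target         `json:"target"`
	Params    map[string]any `json:"params"`
	Nonce     string         `json:"nonce"`
	IssuedAt  int64          `json:"issued_at"`
	ExpiresAt int64          `json:"expires_at"`
	KeyID     string         `json:"key_id"`
}

// NewNonce returns 128 bits from the OS CSPRNG, hex-encoded (encoding assumed).
func NewNonce() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// Canonicalize returns compact UTF-8 JSON with keys sorted at every level.
func Canonicalize(b OpBlob) ([]byte, error) {
	raw, err := json.Marshal(b) // struct fields come out in declaration order
	if err != nil {
		return nil, err
	}
	// Re-decode into generic maps: encoding/json sorts map keys on output.
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.UseNumber() // keep integers exact instead of passing through float64
	var generic any
	if err := dec.Decode(&generic); err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // do not rewrite <, >, & as \u00XX escapes
	if err := enc.Encode(generic); err != nil {
		return nil, err
	}
	return bytes.TrimRight(buf.Bytes(), "\n"), nil // Encode appends a newline
}

func main() {
	nonce, err := NewNonce()
	if err != nil {
		panic(err)
	}
	out, err := Canonicalize(OpBlob{
		Op:        "guest.destroy", // hypothetical op name
		Target:    Target{HostID: "box-a", GuestID: "g1"},
		Params:    map[string]any{"confirm": "g1"},
		Nonce:     nonce,
		IssuedAt:  1700000000,
		ExpiresAt: 1700000300, // five-minute window
		KeyID:     "op-1",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the verifier works on the exact bytes it receives, only the signer needs this function; the agent never re-canonicalizes.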

2.2 Domain separation — the namespace

The SSHSIG namespace felhom-op-v1 is a fixed constant in the verifier, never caller-supplied. A signature minted for any other namespace must not verify (proven). This stops a signature made for one purpose being reused for another.
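To show why a signature minted for another namespace cannot verify, the sketch below builds the byte string that an SSHSIG signature is actually computed over, following the layout in OpenSSH's PROTOCOL.sshsig (magic preamble, namespace, reserved field, hash algorithm name, message hash). The helper names are hypothetical, and sha512 is assumed as the hash algorithm because it is the ssh-keygen default.

```go
package main

import (
	"bytes"
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

// Namespace is the fixed constant from the text; the verifier never takes it
// from the caller.
const Namespace = "felhom-op-v1"

// putString appends an SSH wire-format string: uint32 big-endian length,
// followed by the bytes.
func putString(buf *bytes.Buffer, s []byte) {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(s)))
	buf.Write(n[:])
	buf.Write(s)
}

// SignedData builds the bytes covered by an SSHSIG signature. The namespace
// is inside the signed bytes, so changing it changes what was signed.
func SignedData(namespace string, message []byte) []byte {
	h := sha512.Sum512(message)
	var buf bytes.Buffer
	buf.WriteString("SSHSIG")          // magic preamble
	putString(&buf, []byte(namespace)) // domain separation
	putString(&buf, nil)               // reserved, empty
	putString(&buf, []byte("sha512"))  // hash algorithm name
	putString(&buf, h[:])              // H(message)
	return buf.Bytes()
}

func main() {
	blob := []byte(`{"op":"example"}`) // placeholder message
	ours := SignedData(Namespace, blob)
	theirs := SignedData("file", blob) // e.g. a signature made for file signing
	fmt.Println(len(ours), bytes.Equal(ours, theirs))
}
```

The agent would rebuild these bytes with its own constant and hand them to the public key's Verify method, so a signature over the same blob under a different namespace fails the cryptographic check even before any explicit namespace comparison.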

2.3 Verify pipeline (order is load-bearing)

namespace → allow-list → crypto verify → target binding → time window → nonce. The nonce is recorded last, only after everything else passes, so an invalid signature can never consume a nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):

  • target binding: target.host_id/guest_id must equal this box/guest (a signature for box A cannot be replayed at box B);
  • time window: now ∈ [issued_at, expires_at];
  • nonce: unseen within the window (the nonce store must be persistent across agent restarts and expiry-pruned; a non-persistent store reopens the replay window after a restart).

The Phase-4 reference verifier (VerifySignedOp) is the seed of the agent's implementation.
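The ordering in this section can be sketched in Go as follows. This is not the Phase-4 VerifySignedOp; it is a simplified illustration in which every type and name (Envelope, Verifier, NonceStore, memStore) is hypothetical, the cryptographic check is injected as a function, timestamps are assumed to be Unix seconds, and the nonce store is in-memory purely so the example runs (the real store must be persistent, as stated above).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const Namespace = "felhom-op-v1"

var (
	ErrNamespace = errors.New("wrong namespace")
	ErrSigner    = errors.New("signer not in allowed set")
	ErrSignature = errors.New("signature invalid")
	ErrMalformed = errors.New("blob malformed")
	ErrTarget    = errors.New("target is not this box/guest")
	ErrWindow    = errors.New("outside validity window")
	ErrReplay    = errors.New("nonce already seen")
)

// Envelope is a hypothetical decoded form of an incoming signed op.
type Envelope struct {
	Namespace string // namespace carried in the signature
	SignerKey string // signer's public key as pinned
	Blob      []byte // the exact bytes that were signed
}

type opFields struct {
	Target struct {
		HostID  string `json:"host_id"`
		GuestID string `json:"guest_id"`
	} `json:"target"`
	Nonce     string `json:"nonce"`
	IssuedAt  int64  `json:"issued_at"`
	ExpiresAt int64  `json:"expires_at"`
}

type NonceStore interface {
	Seen(nonce string) bool
	Record(nonce string, expiresAt int64) error
}

// memStore is NOT persistent; it exists only to make the sketch runnable.
type memStore map[string]int64

func (m memStore) Seen(n string) bool               { _, ok := m[n]; return ok }
func (m memStore) Record(n string, exp int64) error { m[n] = exp; return nil }

type Verifier struct {
	HostID  string
	Guests  map[string]bool     // guests that live on this box
	Allowed map[string]bool     // pinned signer keys
	Crypto  func(Envelope) bool // stands in for the SSHSIG check
	Nonces  NonceStore
}

// Verify runs the layers in the documented order; the nonce is written last.
func (v *Verifier) Verify(e Envelope, now int64) error {
	if e.Namespace != Namespace {
		return ErrNamespace
	}
	if !v.Allowed[e.SignerKey] {
		return ErrSigner
	}
	if !v.Crypto(e) {
		return ErrSignature
	}
	var f opFields // fields are parsed from the same bytes that were verified
	if err := json.Unmarshal(e.Blob, &f); err != nil || f.Nonce == "" {
		return ErrMalformed
	}
	if f.Target.HostID != v.HostID || !v.Guests[f.Target.GuestID] {
		return ErrTarget
	}
	if now < f.IssuedAt || now > f.ExpiresAt {
		return ErrWindow
	}
	if v.Nonces.Seen(f.Nonce) {
		return ErrReplay
	}
	return v.Nonces.Record(f.Nonce, f.ExpiresAt)
}

func main() {
	v := &Verifier{
		HostID:  "box-a",
		Guests:  map[string]bool{"g1": true},
		Allowed: map[string]bool{"op-key": true},
		Crypto:  func(Envelope) bool { return true }, // stub for the demo
		Nonces:  memStore{},
	}
	blob := []byte(`{"target":{"host_id":"box-a","guest_id":"g1"},"nonce":"n1","issued_at":100,"expires_at":200}`)
	env := Envelope{Namespace: Namespace, SignerKey: "op-key", Blob: blob}
	fmt.Println(v.Verify(env, 150)) // first delivery
	fmt.Println(v.Verify(env, 150)) // replay of the same op
}
```

Because Record is the final statement, any earlier rejection leaves the store untouched, which is the property that keeps a flood of invalid ops from burning nonces.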

3. The keys — two-key model, software now

Both software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no box change (§7).

  • Operational signing key — the "master stamp" for destructive ops. A dedicated key (NOT the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only for destructive ops — rare, so its exposure is low.
  • Cold recovery key — generated once, kept offline (password manager / a USB held back / printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key if that key is lost or compromised.

Both public keys are pinned onto the agent at enrollment (the allowed-signers set). The operational key is authorized for ops; the recovery key is authorized only for key-rotation instructions.

Allowed-signers is a set → single signer today; quorum (N-of-M) for the highest-blast ops is just set sizing + a threshold policy, addable later without a redesign (Phase 4 §8). Out of scope now.

4. Rotation & compromise recovery

The agents pin the operator public keys. The danger: rotation must not flow as plain hub config, or a compromised hub re-pins its own key and forges everything. So every re-pin is itself a signed op the agent verifies (same pipeline, §2.3) — never unauthenticated config.

  • Planned rotation: the current operational key signs a "new operational public key = X" op; the agent accepts it because it's signed by the trusted current key (key-signs-key).
  • Operational key lost/compromised: the cold recovery key signs the re-pin; the agent accepts it because the recovery key is pinned and authorized for rotation. The compromised key is removed from the allowed set in the same signed op.
  • Both keys gone: on-site physical re-enrollment (last resort — re-establishes the trust root the way initial enrollment did).
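The role split between the two keys and the re-pin rules can be sketched as a small policy type. Everything here is an assumption for illustration: the names (AllowedSigners, Repin, Retire), the use of plain strings for public keys, and the choice that a re-pin may retire operational keys but never the recovery key. ApplyRepin is meant to be called only after the signed re-pin op has already passed the §2.3 verify pipeline for the given signer.

```go
package main

import (
	"errors"
	"fmt"
)

type Role int

const (
	RoleOperational Role = iota // may sign destructive ops and planned rotation
	RoleRecovery                // may sign key-rotation instructions only
)

var (
	ErrUnknownSigner = errors.New("signer is not pinned")
	ErrNotAuthorized = errors.New("signer role not authorized for this op")
	ErrBadRepin      = errors.New("re-pin instruction rejected")
)

// AllowedSigners maps a pinned public key (or fingerprint) to its role.
type AllowedSigners map[string]Role

// Repin is a hypothetical payload for a signed key-rotation op.
type Repin struct {
	NewOperational string   // public key to pin as operational
	Retire         []string // operational keys to drop in the same op
}

// AuthorizeOp gates ordinary destructive ops: operational keys only.
func (a AllowedSigners) AuthorizeOp(signer string) error {
	role, ok := a[signer]
	if !ok {
		return ErrUnknownSigner
	}
	if role != RoleOperational {
		return ErrNotAuthorized
	}
	return nil
}

// ApplyRepin accepts a rotation signed by any pinned key: the current
// operational key (planned rotation) or the recovery key (loss/compromise).
// It validates everything before mutating the set.
func (a AllowedSigners) ApplyRepin(signer string, r Repin) error {
	if _, ok := a[signer]; !ok {
		return ErrUnknownSigner
	}
	if r.NewOperational == "" {
		return ErrBadRepin
	}
	if role, ok := a[r.NewOperational]; ok && role == RoleRecovery {
		return ErrBadRepin // never downgrade the recovery key's entry
	}
	for _, k := range r.Retire {
		if role, ok := a[k]; ok && role != RoleOperational {
			return ErrBadRepin // assumption: recovery key is not retirable here
		}
	}
	for _, k := range r.Retire {
		delete(a, k)
	}
	a[r.NewOperational] = RoleOperational
	return nil
}

func main() {
	set := AllowedSigners{"op-a": RoleOperational, "rec": RoleRecovery}
	// Compromise recovery: the cold key pins op-b and retires op-a in one op.
	err := set.ApplyRepin("rec", Repin{NewOperational: "op-b", Retire: []string{"op-a"}})
	fmt.Println(err, set.AuthorizeOp("op-a"), set.AuthorizeOp("op-b"))
}
```

A key the hub invents is simply absent from the set, so a re-pin it queues fails at the first check regardless of what the payload says.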

5. Component roles

  • Operator tooling (the workstation). A signing CLI behind a thin Signer interface (Sign(blob) → signature). The backend today is a file key; a FIDO2/PIV backend drops in later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private key (passphrase-protected); can reach the cold recovery key when rotation is needed.
  • Hub. Queues the opaque signed blobs and surfaces pending destructive ops + their signature status in the operator UI. Holds no private key and cannot sign — a compromised hub can only queue blobs the agent rejects. (Matches 03 §4 / box-initiated poll.)
  • Agent (each box). Pins the allowed-signers set (operational + recovery) at enrollment; runs the verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the customer-visible audit log. Notification-on-destructive-op is an audit signal, never the guard (a compromised hub could issue and suppress notice — the signature is the control).
  • Enrollment. Pins the initial operational + recovery public keys onto the agent during the physical-presence provisioning step (the trust root is established on-site, not via the hub).
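A minimal sketch of the Signer seam mentioned for the operator tooling might look like the following. Only the Sign(blob) → signature shape comes from this doc; the KeyID method, the SignedOp envelope, and the in-memory backend are hypothetical. The in-memory backend emits a raw Ed25519 signature so the example stays stdlib-only; the real file-key backend would emit an armored SSHSIG instead.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Signer is the seam between the CLI and whatever holds the private key.
// A FIDO2/PIV backend would satisfy the same interface.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
	KeyID() string
}

// SignedOp is a hypothetical envelope the CLI would post to the hub queue.
// The hub treats it as opaque.
type SignedOp struct {
	KeyID     string
	Blob      []byte
	Signature []byte
}

// memSigner keeps an Ed25519 key in memory. Illustration only: no passphrase,
// no SSHSIG framing.
type memSigner struct {
	id   string
	priv ed25519.PrivateKey
}

func NewMemSigner(id string) (Signer, ed25519.PublicKey, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	return &memSigner{id: id, priv: priv}, pub, nil
}

func (s *memSigner) Sign(blob []byte) ([]byte, error) {
	return ed25519.Sign(s.priv, blob), nil
}

func (s *memSigner) KeyID() string { return s.id }

// SignOp is backend-agnostic: it never touches key material directly.
func SignOp(s Signer, blob []byte) (SignedOp, error) {
	sig, err := s.Sign(blob)
	if err != nil {
		return SignedOp{}, err
	}
	return SignedOp{KeyID: s.KeyID(), Blob: blob, Signature: sig}, nil
}

// VerifyRaw checks the raw signature produced by memSigner.
func VerifyRaw(pub ed25519.PublicKey, blob, sig []byte) bool {
	return ed25519.Verify(pub, blob, sig)
}

func main() {
	signer, pub, err := NewMemSigner("op-1")
	if err != nil {
		panic(err)
	}
	op, err := SignOp(signer, []byte(`{"op":"example"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(op.KeyID, VerifyRaw(pub, op.Blob, op.Signature))
}
```

Swapping the backend changes only which constructor the CLI calls; SignOp and everything downstream of it stay the same.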

6. Operator workflow

  • Routine work (deploy, monitor, attach storage, restore to a new guest): no signing, zero overhead.
  • A destructive op (rare): the operator runs the signing CLI on their workstation — which builds the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the desk. Never a site visit.

7. Hardware readiness (Viktor's "build the foundation now")

Software ssh-ed25519 now; a FIDO2 sk-ssh-ed25519@openssh.com key later is a no-op on the boxes — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the Signer backend at it, and updates the allowed-signers entry; nothing on the boxes changes.

Two honest notes:

  • Confirm with a real device at adoption. Phase 4 §5 was validated to spec, not against live hardware; a 5-minute real-key round-trip should confirm it (no surprise expected; signer/library/device all follow the same spec).
  • Optional future hardening: require the FIDO2 user-presence (touch) flag. The verifier is crypto-only today (correct for software keys); enforcing the flag is a small later option once hardware is in use.

8. Open items

  • Quorum policy (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the allowed-signers-set foundation supports it.
  • Signing-key passphrase UX on the workstation (ssh-agent / askpass) — minor operator-tooling detail.
  • Hub-side pending-op UI (showing ops awaiting signature + audit) — belongs to the hub doc.

9. What this unblocks

Closes the 03 §4 "undesigned signing path." It hands implementation three things: the canonical blob spec (§2.1) plus the VerifySignedOp reference (Phase 4 §7) for the agent's verify path, the Signer interface for the operator CLI, and the allowed-signers pinning step for enrollment. The hub's signed-job queue and pending-op UI carry over into the hub architecture doc.