Architecture Part 4 — Control-plane authorization (operator signing)
Status: design draft (decision content), grounded on
docs/tests/phase4-signing-findings.md. To be reviewed by Claude Code against that spike + 03 §4, then placed at docs/architecture/04-control-plane-authorization.md.
Builds on Part 1 (enrollment / trust), Part 3 (the agent verifies + the §4 reversibility gate). This doc defines the mechanism behind
03 §4's "an operator signature the hub can't forge."
1. Purpose & scope
03 §4 gates destructive/irreversible operations behind an operator signature the hub cannot
forge. That gate is only real if signing is real. This doc defines the signing mechanism: the
primitive, the keys, rotation, the three components' roles, and the operator workflow. The
policy (what needs a signature) lives in 03 §4; this is the how.
Recap of what needs a signature (from 03 §4, by reversibility, not by verb): destroying or
overwriting any resource holding the only/primary copy of customer data — live-guest destroy,
storage detach/wipe, restore-overwrite, decommission — regardless of whether it arrives as a job
or a desired-state delta. Benign convergence (deploy a guest, attach storage, restore to a new
guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned;
signed ops are rare and deliberate.
2. Primitive — SSH signatures (SSHSIG)
Confirmed by Phase 4: destructive ops carry an SSH signature (ssh-keygen -Y sign, the armored
SSHSIG format), verified by the agent in Go (golang.org/x/crypto/ssh) — pem.Decode →
ssh.Unmarshal → ssh.ParsePublicKey → pub.Verify. ~40 lines of framing, no hand-rolled crypto.
Why SSHSIG and not raw Ed25519 / minisign: SSHSIG verification dispatches on the key type
embedded in the signature, so the same verifier accepts a software key (ssh-ed25519) today and
a FIDO2 hardware key (sk-ssh-ed25519@openssh.com) later — which is exactly the hardware-ready
foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter,
different signed-data), so it would force a verifier change on every box at hardware-adoption time.
SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).
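To make the "different signed-data" point concrete, here is a minimal sketch of what an Ed25519 security key signs, following the layout described in OpenSSH's PROTOCOL.u2f as we understand it. The function names, the "ssh:" application string, and the use of a software key in place of a real authenticator are all assumptions for illustration.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// skSignedData builds the bytes an authenticator signs for an
// sk-ssh-ed25519@openssh.com key:
// SHA256(application) || flags || counter || SHA256(data).
func skSignedData(application string, flags byte, counter uint32, data []byte) []byte {
	app := sha256.Sum256([]byte(application))
	digest := sha256.Sum256(data)
	var ctr [4]byte
	binary.BigEndian.PutUint32(ctr[:], counter)

	out := make([]byte, 0, len(app)+1+len(ctr)+len(digest))
	out = append(out, app[:]...)
	out = append(out, flags)
	out = append(out, ctr[:]...)
	out = append(out, digest[:]...)
	return out
}

// demo stands in for a hardware key with a software Ed25519 key (no device involved).
func demo() (rawOK, skOK bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	data := []byte("bytes handed to the key for signing")
	sig := ed25519.Sign(priv, skSignedData("ssh:", 0x01, 7, data))

	// A raw-Ed25519 verifier checks the signature over data itself and fails.
	rawOK = ed25519.Verify(pub, data, sig)
	// An sk-aware verifier rebuilds the wrapped form first and succeeds.
	skOK = ed25519.Verify(pub, skSignedData("ssh:", 0x01, 7, data), sig)
	return rawOK, skOK
}

func main() {
	raw, sk := demo()
	fmt.Println("raw verifier accepts:", raw)
	fmt.Println("sk-aware verifier accepts:", sk)
}
```

A verifier that dispatches on the key type (as golang.org/x/crypto/ssh does inside pub.Verify) hides this difference from the agent code, which is the property the section relies on.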
2.1 The signed object — canonical op blob
The signature covers an op blob (Phase 4 §2):
{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }
- Canonical form is a signer-side requirement — JSON, keys sorted at every level, no insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The verifier trusts the exact bytes it receives (it verifies the signature over the raw bytes and parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
- nonce: ≥128-bit random; issued_at/expires_at: a short window (minutes); key_id: identifies the signing key (rotation/audit).
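The canonical-form rule can be sketched in Go using only encoding/json, which emits map keys in sorted order. Canonicalize is a hypothetical helper name and the op name in the example is invented; this is an illustration of the rule under those assumptions, not the project's CLI code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Canonicalize re-encodes a JSON value with keys sorted at every level and
// no insignificant whitespace. Hypothetical helper for illustration.
func Canonicalize(in []byte) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(in))
	dec.UseNumber() // keep numbers as written instead of converting to float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}

	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // do not rewrite <, >, & as \u escapes
	if err := enc.Encode(v); err != nil { // objects decode to maps; map keys are emitted sorted
		return nil, err
	}
	return bytes.TrimRight(buf.Bytes(), "\n"), nil // Encode appends a trailing newline
}

func main() {
	in := []byte(`{ "op": "guest.destroy", "key_id": "k1",
		"target": {"host_id": "h1", "guest_id": "g1"} }`)
	out, err := Canonicalize(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Only the signer needs this step. The agent verifies and parses the exact bytes it receives, so it never re-canonicalizes.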
2.2 Domain separation — the namespace
The SSHSIG namespace felhom-op-v1 is a fixed constant in the verifier, never
caller-supplied. A signature minted for any other namespace must not verify (proven in the Phase 4 spike). This stops a
signature made for one purpose from being reused for another.
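Why a fixed namespace gives domain separation: the namespace is part of the bytes that actually get signed. The sketch below rebuilds the SSHSIG to-be-signed structure as described in OpenSSH's PROTOCOL.sshsig, using the standard library only. It assumes a plain ssh-ed25519 key and the sha512 hash; the helper names are hypothetical, and the real agent uses golang.org/x/crypto/ssh rather than this hand-built form.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

const namespace = "felhom-op-v1" // fixed in the verifier, never taken from the caller

// sshString appends an SSH wire-format string: uint32 big-endian length, then the bytes.
func sshString(b *bytes.Buffer, s []byte) {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(s)))
	b.Write(n[:])
	b.Write(s)
}

// signedData builds the SSHSIG to-be-signed blob:
// "SSHSIG" || string namespace || string reserved || string hash_alg || string H(message).
func signedData(ns string, msg []byte) []byte {
	h := sha512.Sum512(msg)
	var b bytes.Buffer
	b.WriteString("SSHSIG")
	sshString(&b, []byte(ns))
	sshString(&b, nil) // reserved, empty
	sshString(&b, []byte("sha512"))
	sshString(&b, h[:])
	return b.Bytes()
}

// verify always uses the constant namespace; there is no parameter to override it.
func verify(pub ed25519.PublicKey, msg, sig []byte) bool {
	return ed25519.Verify(pub, signedData(namespace, msg), sig)
}

func demo() (sameNamespace, otherNamespace bool) {
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	blob := []byte(`{"op":"example"}`)
	good := ed25519.Sign(priv, signedData(namespace, blob))
	reused := ed25519.Sign(priv, signedData("file", blob)) // same key, other purpose
	return verify(pub, blob, good), verify(pub, blob, reused)
}

func main() {
	same, other := demo()
	fmt.Println("felhom-op-v1 signature verifies:", same)
	fmt.Println("signature from another namespace verifies:", other)
}
```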
2.3 Verify pipeline (order is load-bearing)
namespace → allow-list → crypto verify → target binding → time window → nonce. The nonce is
recorded last, only after everything else passes, so an invalid signature can never consume a
nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):
- target binding: target.host_id/guest_id must equal this box/guest (a signature for box A cannot be replayed at box B);
- time window: now ∈ [issued_at, expires_at];
- nonce: unseen within the window (the nonce store must be persistent across agent restarts and expiry-pruned; a non-persistent store reopens the replay window after a restart).
The Phase-4 reference verifier (VerifySignedOp) is the seed of the agent's implementation.
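The ordering rule can be illustrated with a small sketch. It is not VerifySignedOp: the namespace, allow-list, and crypto stages are collapsed into a single sigOK flag, the guest check is modeled as set membership (an assumption), and the nonce store is an in-memory map purely for brevity, whereas the real store must persist across restarts. All type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	ErrSignature = errors.New("signature rejected")
	ErrTarget    = errors.New("target mismatch")
	ErrWindow    = errors.New("outside time window")
	ErrReplay    = errors.New("nonce already seen")
)

type Op struct {
	HostID, GuestID, Nonce string
	IssuedAt, ExpiresAt    time.Time
}

type Agent struct {
	HostID string
	Guests map[string]bool      // guests that exist on this box
	Nonces map[string]time.Time // nonce -> expires_at (must be persistent in practice)
}

// Check runs the stages in order; the nonce is written only after all else passes.
func (a *Agent) Check(op Op, sigOK bool, now time.Time) error {
	if !sigOK { // stands in for namespace -> allow-list -> crypto verify
		return ErrSignature
	}
	if op.HostID != a.HostID || !a.Guests[op.GuestID] {
		return ErrTarget
	}
	if now.Before(op.IssuedAt) || now.After(op.ExpiresAt) {
		return ErrWindow
	}
	for n, exp := range a.Nonces { // prune nonces whose window has closed
		if now.After(exp) {
			delete(a.Nonces, n)
		}
	}
	if _, seen := a.Nonces[op.Nonce]; seen {
		return ErrReplay
	}
	a.Nonces[op.Nonce] = op.ExpiresAt
	return nil
}

func demo() (errs []error, storedAfterBadSig int) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	a := &Agent{HostID: "box-a", Guests: map[string]bool{"g1": true}, Nonces: map[string]time.Time{}}
	op := Op{HostID: "box-a", GuestID: "g1", Nonce: "n1",
		IssuedAt: now.Add(-time.Minute), ExpiresAt: now.Add(4 * time.Minute)}

	errs = append(errs, a.Check(op, false, now)) // bad signature
	storedAfterBadSig = len(a.Nonces)
	errs = append(errs, a.Check(op, true, now)) // accepted
	errs = append(errs, a.Check(op, true, now)) // replay of the same nonce
	other := op
	other.HostID, other.Nonce = "box-b", "n2"
	errs = append(errs, a.Check(other, true, now)) // signed for a different box
	return errs, storedAfterBadSig
}

func main() {
	errs, stored := demo()
	fmt.Println(errs, "nonces stored after bad signature:", stored)
}
```

Because the nonce write is the final step, an attacker who floods the agent with invalid signatures cannot burn nonces or grow the store.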
3. The keys — two-key model, software now
Both are software (SSH-format) keys today; both can become FIDO2-resident keys later with no change on the boxes (§7).
- Operational signing key — the "master stamp" for destructive ops. A dedicated key (NOT the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only for destructive ops — rare, so its exposure is low.
- Cold recovery key — generated once, kept offline (password manager / a USB held back / printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key if that key is lost or compromised.
Both public keys are pinned onto the agent at enrollment (the allowed-signers set). The operational key is authorized for ops; the recovery key is authorized only for key-rotation instructions.
Allowed-signers is a set, with a single signer today. Quorum (N-of-M) for the highest-blast ops is just set sizing plus a threshold policy, addable later without a redesign (Phase 4 §8). Out of scope now.
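One way to picture the two-key authorization split is a set keyed by key_id, with a role per entry. The sketch below is an assumption about shape only: the type names, the key IDs, and the op names ("key.rotate", "guest.destroy") are hypothetical and do not come from the Phase 4 findings.

```go
package main

import "fmt"

type Role int

const (
	RoleOperational Role = iota // may sign destructive ops and planned rotation
	RoleRecovery                // may sign key-rotation instructions only
)

// OpRotate is a hypothetical op name for a re-pin instruction.
const OpRotate = "key.rotate"

// AllowedSigners maps key_id to the role pinned at enrollment.
type AllowedSigners map[string]Role

// Authorized reports whether the key may sign the given op.
// It says nothing about whether the signature itself is valid.
func (s AllowedSigners) Authorized(keyID, op string) bool {
	role, ok := s[keyID]
	if !ok {
		return false // unknown keys are never authorized
	}
	switch role {
	case RoleOperational:
		return true
	case RoleRecovery:
		return op == OpRotate
	}
	return false
}

func main() {
	s := AllowedSigners{"op-1": RoleOperational, "rec-1": RoleRecovery}
	fmt.Println(s.Authorized("op-1", "guest.destroy"))  // operational key, destructive op
	fmt.Println(s.Authorized("rec-1", "guest.destroy")) // recovery key, destructive op
	fmt.Println(s.Authorized("rec-1", OpRotate))        // recovery key, rotation
}
```

Keeping this as a set lookup is what makes a later quorum policy a matter of counting distinct authorized signers rather than changing the data model.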
4. Rotation & compromise recovery
The agents pin the operator public keys. The danger: rotation must not flow as plain hub config, or a compromised hub re-pins its own key and forges everything. So every re-pin is itself a signed op the agent verifies (same pipeline, §2.3) — never unauthenticated config.
- Planned rotation: the current operational key signs a "new operational public key = X" op; the agent accepts it because it's signed by the trusted current key (key-signs-key).
- Operational key lost/compromised: the cold recovery key signs the re-pin; the agent accepts it because the recovery key is pinned and authorized for rotation. The compromised key is removed from the allowed set in the same signed op.
- Both keys gone: on-site physical re-enrollment (last resort — re-establishes the trust root the way initial enrollment did).
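The rotation cases above can be sketched as a re-pin that the agent applies only when it is signed by a key it already pins. This is a simplified illustration: raw Ed25519 over an ad hoc payload stands in for SSHSIG over the canonical blob, roles and the rest of the §2.3 pipeline (target, window, nonce) are omitted, and every name is hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"encoding/hex"
	"errors"
	"fmt"
)

type Agent struct {
	Keys map[string]ed25519.PublicKey // key_id -> pinned public key
}

type Rotation struct {
	SignerID string // key_id that signed this instruction
	NewID    string
	NewPub   ed25519.PublicKey
	RemoveID string // key to drop in the same op (may be the compromised one)
}

// payload is a stand-in for the canonical op blob.
func (r Rotation) payload() []byte {
	return []byte("rotate|" + r.NewID + "|" + hex.EncodeToString(r.NewPub) + "|" + r.RemoveID)
}

// ApplyRotation re-pins only if the instruction verifies under an already pinned key.
func (a *Agent) ApplyRotation(r Rotation, sig []byte) error {
	signer, ok := a.Keys[r.SignerID]
	if !ok {
		return errors.New("signer not pinned")
	}
	if !ed25519.Verify(signer, r.payload(), sig) {
		return errors.New("bad signature")
	}
	delete(a.Keys, r.RemoveID)
	a.Keys[r.NewID] = r.NewPub
	return nil
}

func demo() (forged, recovery error, oldPinned, newPinned bool) {
	opPub, _, _ := ed25519.GenerateKey(nil)
	recPub, recPriv, _ := ed25519.GenerateKey(nil)
	newPub, _, _ := ed25519.GenerateKey(nil)
	_, hubPriv, _ := ed25519.GenerateKey(nil) // a key the agent never pinned

	a := &Agent{Keys: map[string]ed25519.PublicKey{"op-1": opPub, "rec-1": recPub}}
	r := Rotation{SignerID: "rec-1", NewID: "op-2", NewPub: newPub, RemoveID: "op-1"}

	forged = a.ApplyRotation(r, ed25519.Sign(hubPriv, r.payload()))   // hub tries to re-pin
	recovery = a.ApplyRotation(r, ed25519.Sign(recPriv, r.payload())) // recovery key re-pins
	_, oldPinned = a.Keys["op-1"]
	_, newPinned = a.Keys["op-2"]
	return forged, recovery, oldPinned, newPinned
}

func main() {
	forged, recovery, oldPinned, newPinned := demo()
	fmt.Println("forged:", forged, "recovery:", recovery, "old pinned:", oldPinned, "new pinned:", newPinned)
}
```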
5. Component roles
- Operator tooling (the workstation). A signing CLI behind a thin Signer interface (Sign(blob) → signature). The backend today is a file key; a FIDO2/PIV backend drops in later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private key (passphrase-protected); can reach the cold recovery key when rotation is needed.
- Hub. Queues the opaque signed blobs and surfaces pending destructive ops + their signature status in the operator UI. Holds no private key and cannot sign; a compromised hub can only queue blobs the agent rejects. (Matches 03 §4 / box-initiated poll.)
- Agent (each box). Pins the allowed-signers set (operational + recovery) at enrollment; runs the verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the customer-visible audit log. Notification-on-destructive-op is an audit signal, never the guard (a compromised hub could issue and suppress notice; the signature is the control).
- Enrollment. Pins the initial operational + recovery public keys onto the agent during the physical-presence provisioning step (the trust root is established on-site, not via the hub).
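The Signer seam described for the operator tooling might look like the sketch below. Only the Sign(blob) → signature shape comes from this document. The backend types, the SignedOp struct, and prepare are hypothetical, the key is generated in memory rather than loaded from a passphrase-protected file, and raw Ed25519 stands in for the SSHSIG output.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// Signer is the one seam between the CLI and whatever holds the key.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// fileKeySigner is today's backend: a software key held by the CLI.
type fileKeySigner struct{ priv ed25519.PrivateKey }

func (s fileKeySigner) Sign(blob []byte) ([]byte, error) {
	return ed25519.Sign(s.priv, blob), nil
}

// hardwareSigner marks where a FIDO2/PIV backend would plug in later.
type hardwareSigner struct{}

func (hardwareSigner) Sign([]byte) ([]byte, error) {
	return nil, errors.New("hardware backend not implemented in this sketch")
}

// SignedOp is what the CLI would post to the hub: opaque bytes plus signature.
type SignedOp struct {
	Blob, Sig []byte
}

// prepare depends only on the interface, so swapping backends changes nothing here.
func prepare(s Signer, blob []byte) (SignedOp, error) {
	sig, err := s.Sign(blob)
	if err != nil {
		return SignedOp{}, err
	}
	return SignedOp{Blob: blob, Sig: sig}, nil
}

func demo() (fileOK bool, hwErr error) {
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	op, err := prepare(fileKeySigner{priv}, []byte(`{"op":"example"}`))
	if err != nil {
		panic(err)
	}
	fileOK = ed25519.Verify(pub, op.Blob, op.Sig)
	_, hwErr = prepare(hardwareSigner{}, op.Blob)
	return fileOK, hwErr
}

func main() {
	ok, hwErr := demo()
	fmt.Println("file-key signature verifies:", ok)
	fmt.Println("hardware backend:", hwErr)
}
```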
6. Operator workflow
- Routine work (deploy, monitor, attach storage, restore to a new guest): no signing, zero overhead.
- A destructive op (rare): the operator runs the signing CLI on their workstation — which builds the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the desk. Never a site visit.
7. Hardware readiness (Viktor's "build the foundation now")
Software ssh-ed25519 now; a FIDO2 sk-ssh-ed25519@openssh.com key later is a no-op on the
boxes — proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a
spec-faithful sk signature). At hardware adoption the operator generates an sk-key, points the
Signer backend at it, and updates the allowed-signers entry; nothing on the boxes changes.
Two honest notes:
- Confirm with a real device at adoption. Phase 4 §5 was validated to spec, not against live hardware; a 5-minute real-key round-trip should confirm it (no surprise expected; signer/library/device all follow the same spec).
- Optional future hardening: require the FIDO2 user-presence (touch) flag. The verifier is crypto-only today (correct for software keys); enforcing the flag is a small later option once hardware is in use.
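If the user-presence option is taken up later, the extra check is small. The sketch below assumes the flags byte has already been extracted from a verified sk signature and that bit 0x01 is the user-presence bit, as in the FIDO authenticator flags. The function and error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// flagUserPresent is the user-presence (touch) bit in the authenticator flags byte.
const flagUserPresent = 0x01

var errNoTouch = errors.New("signature made without user presence")

// requireUserPresence is a policy check to run after the signature has verified.
func requireUserPresence(flags byte) error {
	if flags&flagUserPresent == 0 {
		return errNoTouch
	}
	return nil
}

func main() {
	fmt.Println(requireUserPresence(0x01)) // touch recorded
	fmt.Println(requireUserPresence(0x00)) // no touch recorded
}
```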
8. Open items
- Quorum policy (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the allowed-signers-set foundation supports it.
- Signing-key passphrase UX on the workstation (ssh-agent / askpass) — minor operator-tooling detail.
- Hub-side pending-op UI (showing ops awaiting signature + audit) — belongs to the hub doc.
9. What this unblocks
Closes the 03 §4 "undesigned signing path." Hands off to implementation: the canonical blob spec
(§2.1) + the VerifySignedOp reference (Phase 4 §7) for the agent's verify path, the
Signer interface for the operator CLI, and the allowed-signers pinning step for enrollment.
The hub's signed-job queue + pending-op UI carry into the hub architecture doc.