admin 36451f57e0 04-control-plane-authorization.md

2026-06-08 10:13:24 +02:00

Architecture Part 4 — Control-plane authorization (operator signing)

Status: design draft (decision content), grounded on docs/tests/phase4-signing-findings.md. To be reviewed by Claude Code against that spike + 03 §4, then placed at docs/architecture/04-control-plane-authorization.md.

Builds on Part 1 (enrollment / trust) and Part 3 (agent-side verification and the §4 reversibility gate). This doc defines the mechanism behind 03 §4's "an operator signature the hub can't forge."

1. Purpose & scope

03 §4 gates destructive/irreversible operations behind an operator signature the hub cannot forge. That gate is only real if signing is real. This doc defines the signing mechanism: the primitive, the keys, rotation, the three components' roles, and the operator workflow. The policy (what needs a signature) lives in 03 §4; this is the how.

Recap of what needs a signature (from 03 §4, by reversibility, not by verb): destroying or overwriting any resource holding the only/primary copy of customer data — live-guest destroy, storage detach/wipe, restore-overwrite, decommission — regardless of whether it arrives as a job or a desired-state delta. Benign convergence (deploy a guest, attach storage, restore to a new guest, bump a version) runs on normal hub auth, unsigned. Most recovery is therefore unsigned; signed ops are rare and deliberate.

2. Primitive — SSH signatures (SSHSIG)

Confirmed by Phase 4: destructive ops carry an SSH signature (ssh-keygen -Y sign, the armored SSHSIG format), verified by the agent in Go (golang.org/x/crypto/ssh) — pem.Decode → ssh.Unmarshal → ssh.ParsePublicKey → pub.Verify. ~40 lines of framing, no hand-rolled crypto.

Why SSHSIG and not raw Ed25519 / minisign: SSHSIG verification dispatches on the key type embedded in the signature, so the same verifier accepts a software key (ssh-ed25519) today and a FIDO2 hardware key (sk-ssh-ed25519@openssh.com) later — which is exactly the hardware-ready foundation we want (§7). A raw-Ed25519 verifier cannot consume an sk signature (flags+counter, different signed-data), so it would force a verifier change on every box at hardware-adoption time. SSHSIG buys key-type-agnosticism for a one-file framing cost (Phase 4 §5–6).

2.1 The signed object — canonical op blob

The signature covers an op blob (Phase 4 §2):

{ op, target:{host_id, guest_id}, params, nonce, issued_at, expires_at, key_id }

Canonical form is a signer-side requirement — JSON, keys sorted at every level, no insignificant whitespace, UTF-8 — so the blob is deterministic and human-auditable. The verifier trusts the exact bytes it receives (it verifies the signature over the raw bytes and parses those same bytes for fields), so there is no canonicalization-mismatch risk on the verify side. The canonical form is the shared contract between the operator CLI and the agent (both Go).
nonce is at least 128 bits of randomness; issued_at/expires_at bound a short validity window (minutes); key_id identifies the signing key (for rotation and audit).
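A minimal sketch of how the operator CLI could produce the canonical form, assuming it leans on encoding/json (which sorts map keys and emits compact output). The function names, the op name, and the RFC 3339 timestamp strings are illustrative assumptions; the document does not fix the timestamp encoding.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// canonicalJSON serializes obj with keys sorted at every level and no
// insignificant whitespace. encoding/json sorts map keys, so nested objects
// must be maps (struct fields would be emitted in declaration order).
func canonicalJSON(obj map[string]any) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, >, & literal so the blob stays readable
	if err := enc.Encode(obj); err != nil {
		return nil, err
	}
	// Encoder.Encode appends a trailing newline; drop it.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

// buildOpBlob assembles the §2.1 fields. Timestamps are plain strings here
// because their encoding is an assumption of this sketch.
func buildOpBlob(op, hostID, guestID, nonce, issuedAt, expiresAt, keyID string,
	params map[string]any) ([]byte, error) {
	if params == nil {
		params = map[string]any{}
	}
	return canonicalJSON(map[string]any{
		"op":         op,
		"target":     map[string]any{"host_id": hostID, "guest_id": guestID},
		"params":     params,
		"nonce":      nonce,
		"issued_at":  issuedAt,
		"expires_at": expiresAt,
		"key_id":     keyID,
	})
}

func main() {
	// A fixed nonce keeps the example reproducible; a real CLI would draw
	// 16 or more bytes from crypto/rand.
	blob, err := buildOpBlob("guest.destroy", "box-a", "guest-7",
		"5f1c0e9a7b3d4c2e8a6f0b1d2c3e4f50",
		"2026-06-08T10:00:00Z", "2026-06-08T10:05:00Z",
		"op-2026", map[string]any{"wipe": true})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob))
}
```

Because the agent verifies and parses the exact bytes it receives, only the signer needs this function; the verifier never re-serializes.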

2.2 Domain separation — the namespace

The SSHSIG namespace felhom-op-v1 is a fixed constant in the verifier, never caller-supplied. A signature minted for any other namespace must not verify (proven in Phase 4). This stops a signature made for one purpose from being reused for another.
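To illustrate why the namespace separates domains: in the SSHSIG format the key does not sign the message directly but a framed structure that includes the namespace, so changing the namespace changes the bytes under signature. The sketch below rebuilds that structure with the standard library only. The helper names are hypothetical, and SHA-512 is an assumption (the format also permits SHA-256).

```go
package main

import (
	"bytes"
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

// opNamespace is compiled into the verifier and never taken from the caller.
const opNamespace = "felhom-op-v1"

// putString appends an SSH wire-format string: uint32 length, then bytes.
func putString(buf *bytes.Buffer, s []byte) {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(s)))
	buf.Write(n[:])
	buf.Write(s)
}

// signedData builds the bytes an SSHSIG signature actually covers:
// "SSHSIG" || namespace || reserved || hash_algorithm || H(message).
func signedData(namespace string, message []byte) []byte {
	digest := sha512.Sum512(message)
	var buf bytes.Buffer
	buf.WriteString("SSHSIG") // magic preamble, not length-prefixed
	putString(&buf, []byte(namespace))
	putString(&buf, nil) // reserved, empty
	putString(&buf, []byte("sha512"))
	putString(&buf, digest[:])
	return buf.Bytes()
}

func main() {
	blob := []byte(`{"op":"guest.destroy"}`)
	ours := signedData(opNamespace, blob)
	other := signedData("file", blob)
	fmt.Println(len(ours), bytes.Equal(ours, other))
}
```

A verifier that always rebuilds the signed data from its own constant, rather than from the namespace field carried in the signature, cannot be talked into accepting a signature minted for another purpose.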

2.3 Verify pipeline (order is load-bearing)

namespace → allow-list → crypto verify → target binding → time window → nonce. The nonce is recorded last, only after everything else passes, so an invalid signature can never consume a nonce (DoS-safe). Each layer is mandatory and was proven to reject independently (Phase 4 §3–4):

target binding — target.host_id/guest_id must equal this box/guest (a signature for box A cannot be replayed at box B);
time window — now ∈ [issued_at, expires_at];
nonce — unseen within the window (the nonce store must be persistent across agent restarts and expiry-pruned; a non-persistent store reopens the replay window after a restart).

The Phase-4 reference verifier (VerifySignedOp) is the seed of the agent's implementation.
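The ordering can be sketched as follows. This is not VerifySignedOp itself but a simplified stand-in: the cryptographic check is reduced to a boolean, the nonce store is an in-memory map (the real store must be persistent, as noted above), and all type and field names are assumptions of the sketch.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

const opNamespace = "felhom-op-v1"

// envelope is a hypothetical wrapper around the raw blob and its signature.
type envelope struct {
	Namespace string
	SignerKey string
	Blob      []byte
	SigOK     bool // stand-in for the SSHSIG cryptographic check
}

type opFields struct {
	Target struct {
		HostID  string `json:"host_id"`
		GuestID string `json:"guest_id"`
	} `json:"target"`
	Nonce     string    `json:"nonce"`
	IssuedAt  time.Time `json:"issued_at"`
	ExpiresAt time.Time `json:"expires_at"`
}

type verifier struct {
	HostID  string
	Guests  map[string]bool
	Allowed map[string]bool
	Seen    map[string]time.Time // nonce -> expiry; in-memory for the sketch only
}

// verify runs the layers in the load-bearing order and records the nonce last.
func (v *verifier) verify(e envelope, now time.Time) error {
	if e.Namespace != opNamespace {
		return errors.New("namespace mismatch")
	}
	if !v.Allowed[e.SignerKey] {
		return errors.New("signer not in allow-list")
	}
	if !e.SigOK {
		return errors.New("bad signature")
	}
	var op opFields
	if err := json.Unmarshal(e.Blob, &op); err != nil {
		return fmt.Errorf("parse blob: %w", err)
	}
	if op.Target.HostID != v.HostID || !v.Guests[op.Target.GuestID] {
		return errors.New("target mismatch")
	}
	if now.Before(op.IssuedAt) || now.After(op.ExpiresAt) {
		return errors.New("outside time window")
	}
	if _, dup := v.Seen[op.Nonce]; dup {
		return errors.New("nonce replayed")
	}
	v.Seen[op.Nonce] = op.ExpiresAt // only reached when every other layer passed
	return nil
}

// demo feeds one blob through the verifier four times and reports each outcome.
func demo() []string {
	blob := []byte(`{"issued_at":"2026-06-08T10:00:00Z","expires_at":"2026-06-08T10:05:00Z",` +
		`"nonce":"n-1","target":{"host_id":"box-a","guest_id":"guest-7"}}`)
	now := time.Date(2026, 6, 8, 10, 2, 0, 0, time.UTC)
	v := &verifier{HostID: "box-a", Guests: map[string]bool{"guest-7": true},
		Allowed: map[string]bool{"op-key": true}, Seen: map[string]time.Time{}}
	good := envelope{Namespace: opNamespace, SignerKey: "op-key", Blob: blob, SigOK: true}
	forged := good
	forged.SigOK = false

	var out []string
	run := func(e envelope, at time.Time) {
		if err := v.verify(e, at); err != nil {
			out = append(out, err.Error())
			return
		}
		out = append(out, "ok")
	}
	run(forged, now)                   // rejected before the nonce is touched
	run(good, now)                     // accepted, nonce recorded
	run(good, now)                     // same blob again
	run(good, now.Add(10*time.Minute)) // after expires_at
	return out
}

func main() {
	for _, r := range demo() {
		fmt.Println(r)
	}
}
```

The first call shows the DoS-safety property: the forged envelope carries the same nonce as the valid one, yet the valid one is still accepted afterwards.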

3. The keys — two-key model, software now

Both software (SSH-format) keys today; both are also valid FIDO2-resident keys later with no box change (§7).

Operational signing key — the "master stamp" for destructive ops. A dedicated key (NOT the operator's daily SSH login key), passphrase-protected, on the operator workstation. Used only for destructive ops — rare, so its exposure is low.
Cold recovery key — generated once, kept offline (password manager / a USB held back / printed). Never used for ordinary ops; its sole power is to authorize rotating the operational key if that key is lost or compromised.

Both public keys are pinned onto the agent at enrollment (the allowed-signers set). The operational key is authorized for ops; the recovery key is authorized only for key-rotation instructions.

Allowed-signers is a set: a single signer today; quorum (N-of-M) for the highest-blast-radius ops is just a larger set plus a threshold policy, addable later without a redesign (Phase 4 §8). Out of scope for now.
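One way to picture the two-key authorization rule is a pinned set that maps each key to a role. The sketch below is an assumption-laden illustration: the type names, the key identifiers, and the op name key.rotate are hypothetical.

```go
package main

import "fmt"

type role int

const (
	roleOperational role = iota + 1 // may authorize destructive ops and planned rotation
	roleRecovery                    // may authorize key rotation only
)

// opRotateKey is a hypothetical name for the re-pin instruction.
const opRotateKey = "key.rotate"

// allowedSigners is the set pinned at enrollment: key identifier -> role.
type allowedSigners map[string]role

// mayAuthorize reports whether the given pinned key may sign the given op.
func (a allowedSigners) mayAuthorize(keyID, op string) bool {
	r, pinned := a[keyID]
	if !pinned {
		return false
	}
	switch r {
	case roleOperational:
		return true
	case roleRecovery:
		return op == opRotateKey
	default:
		return false
	}
}

func main() {
	set := allowedSigners{"op-2026": roleOperational, "recovery-1": roleRecovery}
	fmt.Println(set.mayAuthorize("op-2026", "guest.destroy"))    // true
	fmt.Println(set.mayAuthorize("recovery-1", "guest.destroy")) // false
	fmt.Println(set.mayAuthorize("recovery-1", opRotateKey))     // true
	fmt.Println(set.mayAuthorize("hub-key", opRotateKey))        // false
}
```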

4. Rotation & compromise recovery

The agents pin the operator public keys. The danger: rotation must not flow as plain hub config, or a compromised hub re-pins its own key and forges everything. So every re-pin is itself a signed op the agent verifies (same pipeline, §2.3) — never unauthenticated config.

Planned rotation: the current operational key signs a "new operational public key = X" op; the agent accepts it because it's signed by the trusted current key (key-signs-key).
Operational key lost/compromised: the cold recovery key signs the re-pin; the agent accepts it because the recovery key is pinned and authorized for rotation. The compromised key is removed from the allowed set in the same signed op.
Both keys gone: on-site physical re-enrollment (last resort — re-establishes the trust root the way initial enrollment did).
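A sketch of applying a re-pin once the rotation op has already passed the verify pipeline (§2.3). The struct fields and function name are hypothetical, and leaving the recovery key untouched is a choice made for this sketch, not something the design above specifies. The function builds a new set instead of mutating the pinned one, so a rejected rotation leaves the agent's trust state unchanged.

```go
package main

import (
	"errors"
	"fmt"
)

type role int

const (
	roleOperational role = iota + 1
	roleRecovery
)

type pinned map[string]role // public key identifier -> role

// rotation is a hypothetical shape for the params of a re-pin op.
type rotation struct {
	SignedBy          string   // key that signed the op (already crypto-verified)
	NewOperationalKey string   // key to add with the operational role
	Remove            []string // operational keys to drop in the same op
}

// applyRotation returns the next pinned set, or an error with cur untouched.
func applyRotation(cur pinned, r rotation) (pinned, error) {
	if _, ok := cur[r.SignedBy]; !ok {
		return nil, errors.New("rotation signer is not pinned")
	}
	if r.NewOperationalKey == "" {
		return nil, errors.New("rotation carries no new key")
	}
	next := make(pinned, len(cur)+1)
	for k, v := range cur {
		next[k] = v
	}
	for _, k := range r.Remove {
		if next[k] == roleRecovery {
			return nil, errors.New("sketch refuses to remove the recovery key")
		}
		delete(next, k)
	}
	next[r.NewOperationalKey] = roleOperational
	return next, nil
}

func main() {
	cur := pinned{"op-old": roleOperational, "recovery-1": roleRecovery}

	// Compromise recovery: the cold key signs the re-pin and evicts the old key.
	next, err := applyRotation(cur, rotation{
		SignedBy: "recovery-1", NewOperationalKey: "op-new", Remove: []string{"op-old"}})
	fmt.Println(len(next), err)

	// A hub-supplied key is not pinned, so it cannot re-pin anything.
	_, err = applyRotation(cur, rotation{SignedBy: "hub-key", NewOperationalKey: "evil"})
	fmt.Println(err)
}
```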

5. Component roles

Operator tooling (the workstation). A signing CLI behind a thin Signer interface (Sign(blob) → signature). The backend today is a file key; a FIDO2/PIV backend drops in later (§7) with no change to the blob format, the hub, or the agent. Holds the operational private key (passphrase-protected); can reach the cold recovery key when rotation is needed.
Hub. Queues the opaque signed blobs and surfaces pending destructive ops and their signature status in the operator UI. Holds no private key and cannot sign, so a compromised hub can at most queue forged blobs, which the agent rejects. (Matches 03 §4 / box-initiated poll.)
Agent (each box). Pins the allowed-signers set (operational + recovery) at enrollment; runs the verify pipeline (§2.3) on any destructive op before executing; writes every signed op to the customer-visible audit log. Notification-on-destructive-op is an audit signal, never the guard (a compromised hub could issue and suppress notice — the signature is the control).
Enrollment. Pins the initial operational + recovery public keys onto the agent during the physical-presence provisioning step (the trust root is established on-site, not via the hub).
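The Signer seam in the operator tooling might look like the sketch below. The interface follows the Sign(blob) → signature shape named above; the file-key backend, its type name, and its fields are assumptions. It shells out to ssh-keygen -Y sign, which signs data presented on standard input and writes the armored signature to standard output when no file argument is given.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// Signer is the seam between the CLI and whatever holds the private key.
type Signer interface {
	Sign(blob []byte) ([]byte, error)
}

// sshKeygenSigner is a hypothetical file-key backend.
type sshKeygenSigner struct {
	KeyPath   string // path to the dedicated operational private key
	Namespace string // must match the verifier's fixed namespace
}

// Compile-time check that the backend satisfies the interface.
var _ Signer = sshKeygenSigner{}

// args returns the ssh-keygen arguments for signing stdin.
func (s sshKeygenSigner) args() []string {
	return []string{"-Y", "sign", "-f", s.KeyPath, "-n", s.Namespace}
}

// Sign pipes the blob to ssh-keygen and returns the armored SSHSIG output.
// Passphrase entry is left to ssh-keygen (terminal prompt or askpass).
func (s sshKeygenSigner) Sign(blob []byte) ([]byte, error) {
	cmd := exec.Command("ssh-keygen", s.args()...)
	cmd.Stdin = bytes.NewReader(blob)
	var out, errBuf bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("ssh-keygen -Y sign: %w: %s", err, errBuf.String())
	}
	return out.Bytes(), nil
}

func main() {
	s := sshKeygenSigner{KeyPath: "/path/to/op_signing_key", Namespace: "felhom-op-v1"}
	// Print the command instead of running it; running needs a real key file.
	fmt.Println("ssh-keygen " + strings.Join(s.args(), " "))
}
```

A later FIDO2 backend would be another type implementing Signer, leaving the blob builder and the hub submission code untouched.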

6. Operator workflow

Routine work (deploy, monitor, attach storage, restore to a new guest): no signing, zero overhead.
A destructive op (rare): the operator runs the signing CLI on their workstation — which builds the canonical blob, signs it (passphrase, or later a hardware touch), and posts it to the hub queue — then the agent polls, verifies, executes, and audits. One command + passphrase, from the desk. Never a site visit.

7. Hardware readiness (Viktor's "build the foundation now")

Software ssh-ed25519 now; moving to a FIDO2 sk-ssh-ed25519@openssh.com key later requires no verifier change on the boxes. This was proven end-to-end against the OpenSSH spec in Phase 4 §5 (the unchanged verifier accepts a spec-faithful sk signature). At hardware adoption the operator generates an sk key, points the Signer backend at it, and re-pins the new public key through a signed rotation op (§4); the agent code itself does not change.

Two honest notes:

Confirm with a real device at adoption. Phase 4 §5 was validated to spec, not against live hardware; a 5-minute real-key round-trip should confirm it (no surprise expected, since signer, library, and device all follow the same spec).
Optional future hardening: require the FIDO2 user-presence (touch) flag. The verifier is crypto-only today (correct for software keys); enforcing the flag is a small later option once hardware is in use.
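If the touch requirement is adopted later, the check could be as small as the sketch below. It assumes the OpenSSH sk signature layout, where the signature blob is followed by one flags byte and a big-endian uint32 counter, with bit 0x01 of the flags meaning user presence. The function names are hypothetical, and the input is assumed to be those trailing five bytes as the SSH library exposes them after parsing the signature.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// flagUserPresent is the user-presence bit in the sk signature flags byte.
const flagUserPresent = 0x01

// parseSKTrailer splits the bytes that follow an sk signature blob.
func parseSKTrailer(rest []byte) (flags byte, counter uint32, err error) {
	if len(rest) != 5 {
		return 0, 0, errors.New("sk trailer must be 1 flags byte + 4 counter bytes")
	}
	return rest[0], binary.BigEndian.Uint32(rest[1:]), nil
}

// userPresent reports whether the authenticator asserted a touch.
func userPresent(rest []byte) (bool, error) {
	flags, _, err := parseSKTrailer(rest)
	if err != nil {
		return false, err
	}
	return flags&flagUserPresent != 0, nil
}

func main() {
	touched, err := userPresent([]byte{0x01, 0x00, 0x00, 0x00, 0x05})
	fmt.Println(touched, err)
	untouched, err := userPresent([]byte{0x00, 0x00, 0x00, 0x00, 0x06})
	fmt.Println(untouched, err)
}
```

Such a check would apply only to sk key types; software-key signatures carry no trailer, which is why the verifier stays crypto-only for now.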

8. Open items

Quorum policy (N-of-M per op-class, e.g. two signatures for decommission) — deferred; the allowed-signers-set foundation supports it.
Signing-key passphrase UX on the workstation (ssh-agent / askpass) — minor operator-tooling detail.
Hub-side pending-op UI (showing ops awaiting signature + audit) — belongs to the hub doc.

9. What this unblocks

Closes the 03 §4 "undesigned signing path." It hands implementation three things: the canonical blob spec (§2.1) plus the VerifySignedOp reference (Phase 4 §7) for the agent's verify path, the Signer interface for the operator CLI, and the allowed-signers pinning step for enrollment. The hub's signed-job queue and pending-op UI carry over into the hub architecture doc.
