Phase 4 — Control-plane signing primitive (SSHSIG + Go verify): Findings

Where run: build server 192.168.0.180 (Debian 13, Go 1.24.4, OpenSSH 10.0p2), no Proxmox. Date: 2026-06-08. Throwaway key generated, used, and deleted — no private key, passphrase, or .sig committed.

De-risks the signing primitive before it is written into 04-control-plane-authorization.md or the agent's verify code. Verdict up front: the approach works cleanly and is key-type-agnostic, so no fallback is needed. Go verifies the armored SSHSIG format, every tamper/replay/authorization case is rejected, and a synthetic FIDO2 sk-ssh-ed25519 signature verifies through the unchanged code path (true hardware drop-in).


0. Result at a glance — 14/14 checks pass

== Step 2: SSHSIG signature verification (key-type-agnostic path) ==
  PASS  correct                verified, op="guest_destroy"
  PASS  wrong key              rejected: signer not in allowed set
  PASS  tampered blob          rejected: signature invalid: ssh: signature did not verify
  PASS  wrong namespace        rejected: namespace mismatch: got "felhom-op-wrong" want "felhom-op-v1"

== Step 3: anti-replay / authorization (valid signature, still rejected) ==
  PASS  first use              verified, op="guest_destroy"
  PASS  replay (same nonce)    rejected: replay: nonce a1b2c3d4...8f90 already seen
  PASS  expired                rejected: expired (expires_at=2020-01-02 ..., now=2026-06-08 ...)
  PASS  not-yet-valid          rejected: not yet valid (issued_at=2030-01-01 ...)
  PASS  retargeted host        rejected: target mismatch: blob=demo-felhom/9001 this=other-host/9001
  PASS  retargeted guest       rejected: target mismatch: blob=demo-felhom/9001 this=demo-felhom/8888

== Step 4: key-type-agnosticism — FIDO2 sk-ssh-ed25519 (synthetic, no device) ==
  PASS  parses sk pubkey       type="sk-ssh-ed25519@openssh.com"
  PASS  authorized_keys form   sk-ssh-ed25519@openssh.com AAAAGnNrLXNzaC1lZDI1NTE5...
  PASS  sk end-to-end verify   verified, op="guest_destroy"

1. Software round-trip (baseline, CLI)

  • Key: ssh-keygen -t ed25519 -f felhom-op -N '<passphrase>' -C felhom-operator. (Non-interactive signing used an SSH_ASKPASS helper + setsid -w; in production the operator key lives behind an agent or a FIDO2 device, so the passphrase prompt at signing time is a non-issue. The passphrase mechanics are not what this spike de-risks.)
  • Sign with a domain-separated namespace: ssh-keygen -Y sign -f felhom-op -n felhom-op-v1 blob.json → blob.json.sig (armored -----BEGIN SSH SIGNATURE-----).
  • Baseline verify (CLI sanity) with an allow-list:
    allowed_signers:  felhom-operator namespaces="felhom-op-v1" ssh-ed25519 AAAAC3...
    $ ssh-keygen -Y verify -f allowed_signers -I felhom-operator -n felhom-op-v1 \
          -s blob.json.sig < blob.json
    Good "felhom-op-v1" signature for felhom-operator with ED25519 key SHA256:y0Lj8dIYTM6...
    

2. Canonical op blob spec (documented)

The signature covers these exact bytes; the operator CLI (also Go) must reproduce them byte-for-byte. Canonical form: JSON, keys sorted lexicographically at every level, no insignificant whitespace, no trailing newline, UTF-8.

{"expires_at":"<RFC3339 UTC>","issued_at":"<RFC3339 UTC>","key_id":"<id>","nonce":"<128-bit hex>","op":"<op>","params":{...},"target":{"guest_id":"<vmid>","host_id":"<node>"}}
| Field | Meaning |
| --- | --- |
| op | the operation, e.g. guest_destroy, storage_detach, restore_overwrite |
| target.host_id / target.guest_id | the box + guest the op is bound to (anti-retarget) |
| params | op-specific arguments (themselves canonical-sorted) |
| nonce | unique per op (anti-replay); ≥128-bit random |
| issued_at / expires_at | validity window (short: minutes) |
| key_id | which operator key (for rotation / audit) |

Exact test blob (236 bytes): {"expires_at":"2026-06-09T00:00:00Z","issued_at":"2026-06-08T00:00:00Z","key_id":"felhom-op-1","nonce":"a1b2c3d4e5f60718293a4b5c6d7e8f90","op":"guest_destroy","params":{"purge":true},"target":{"guest_id":"9001","host_id":"demo-felhom"}}
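A minimal Go sketch of producing the canonical form, assuming the operator CLI builds the op as nested maps. The helper names canonicalBlob and testOp are illustrative and not taken from the spike. encoding/json sorts map keys at every level and emits compact output with no trailing newline, so for this test blob json.Marshal alone should reproduce the 236 bytes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// canonicalBlob serializes an op as compact JSON with lexicographically
// sorted keys (hypothetical helper; json.Marshal sorts map keys itself).
func canonicalBlob(op map[string]any) ([]byte, error) {
	return json.Marshal(op)
}

// testOp rebuilds the test blob from this section as nested maps.
func testOp() map[string]any {
	return map[string]any{
		"op":         "guest_destroy",
		"target":     map[string]any{"host_id": "demo-felhom", "guest_id": "9001"},
		"params":     map[string]any{"purge": true},
		"nonce":      "a1b2c3d4e5f60718293a4b5c6d7e8f90",
		"issued_at":  "2026-06-08T00:00:00Z",
		"expires_at": "2026-06-09T00:00:00Z",
		"key_id":     "felhom-op-1",
	}
}

func main() {
	b, err := canonicalBlob(testOp())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes\n%s\n", len(b), b)
}
```

One caveat worth checking before relying on this: json.Marshal escapes <, > and & as \u003c-style sequences by default, so if op parameters can ever contain those characters, both sides have to agree on (or disable) that escaping.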

Note: the SSHSIG namespace (felhom-op-v1) is the cryptographic domain separator and is a fixed constant in the verifier, never caller-supplied — a signature minted for any other namespace must not verify (proven: "wrong namespace" rejected).

3. Go SSHSIG verify — approach + implementation cost

It is not a one-call verify, but it is clean — no hand-rolled crypto. The only manual work is SSHSIG framing; all crypto and key-type dispatch is the library's. Steps:

  1. pem.Decode the armor → block.Type == "SSH SIGNATURE", block.Bytes is the binary SSHSIG. (Go's encoding/pem parses the armor directly — no manual base64/line handling.)
  2. Strip the literal 6-byte SSHSIG magic preamble (it is not length-prefixed).
  3. ssh.Unmarshal the rest into a struct {Version uint32; PublicKey, Namespace, Reserved, HashAlgo, Signature string} — library does the SSH wire parsing.
  4. ssh.ParsePublicKey([]byte(PublicKey)) → an ssh.PublicKey.
  5. Recompute the signed data per spec: "SSHSIG" || string(namespace) || string(reserved) || string(hash_algorithm) || string(H(message)), where H is the named hash (sha256/sha512) — built with one ssh.Marshal.
  6. ssh.Unmarshal([]byte(Signature)) into ssh.Signature, then pub.Verify(signed, &sig) — which dispatches on the key's own algorithm (this is what makes it key-agnostic).

Cost verdict: ~40 lines of framing in one file, zero crypto implemented by us. Well within the agent's budget; no reason to fall back to a different primitive.

4. Anti-replay / authorization layer (on top of signature validity)

Enforced in VerifySignedOp after the signature check, each proven to reject even with a valid signature (Step 3 output above):

  • replay: nonce already recorded in the window → reject;
  • expired / not-yet-valid: now ∉ [issued_at, expires_at] → reject (both sides shown);
  • retargeted: target.host_id/guest_id ≠ this box/guest → reject (both shown).

(Order matters: parse signature → namespace → allow-list → crypto verify → target → time → nonce. The nonce is recorded last, so a replayed but otherwise valid op is still caught, and an op that fails any earlier check, including an invalid signature, never consumes a nonce.)

5. Key-type-agnosticism — TRUE DROP-IN (no box change for FIDO2 later)

No FIDO2 device was used (by choice). Instead the spike emulated the authenticator exactly:

  • Synthesized a well-formed sk-ssh-ed25519@openssh.com public key; ssh.ParsePublicKey parses it and ssh.MarshalAuthorizedKey round-trips it.
  • Constructed a real SSHSIG whose inner signature follows the sk scheme (per OpenSSH PROTOCOL.u2f): ed25519 over sha256(application) || flags || counter || sha256(signed_data), with the blob string(format) string(ed25519_sig) byte(flags) uint32(counter) — i.e. exactly what a FIDO2 key emits.
  • Ran it through the unchanged VerifySignedOp → verified (op="guest_destroy").

Verdict: true drop-in. pub.Verify for sk-ssh-ed25519 is implemented in golang.org/x/crypto/ssh v0.52.0 (it reconstructs appDigest‖flags‖counter‖dataDigest and checks it with ed25519.Verify). Introducing a hardware operator key later is a no-op on the boxes: the agent's verify code is identical, and only the operator's signer key (and the allowed-signers set entry) changes. No sk-specific handler is needed.

Because verification dispatches on the key type embedded in the signature, the same path also accepts ssh-ed25519, rsa-sha2-*, ecdsa-sha2-*, etc. — algorithm choice is the operator's, not the agent's.

6. Fallback (not taken) and its cost

A fallback would be a raw Ed25519 detached signature (or minisign): trivially one ed25519.Verify call, no SSHSIG framing. Rejected because it loses the clean FIDO2 path — a raw-Ed25519 verifier cannot consume an sk-ssh-ed25519 signature (which carries flags+counter and a different signed-data construction), so the future hardware swap would require changing the verifier on every box. SSHSIG buys exactly the key-type-agnosticism (§5) that a raw scheme forfeits, at a one-file framing cost (§3). No fallback is warranted.

7. Reference verifier (seed of the agent's verify code)

Verified working on Go 1.24.4 / x/crypto v0.52.0. (Test harness omitted; this is the verify core + SSHSIG framing + anti-replay/authz.)

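// Package name is a placeholder; the imports are the ones this excerpt uses.
package opverify

import (
	"bytes"
	"crypto/sha256"
	"crypto/sha512"
	"encoding/json"
	"encoding/pem"
	"errors"
	"fmt"
	"hash"
	"time"

	"golang.org/x/crypto/ssh"
)

const Namespace = "felhom-op-v1" // FIXED domain separator, never caller-supplied
const sshsigMagic = "SSHSIG"

// Target carries explicit json tags: encoding/json would not match
// host_id/guest_id to HostID/GuestID without them.
type Target struct {
	HostID  string `json:"host_id"`
	GuestID string `json:"guest_id"`
}
type OpBlob struct {
	Op        string          `json:"op"`
	Target    Target          `json:"target"`
	Params    json.RawMessage `json:"params"`
	Nonce     string          `json:"nonce"`
	IssuedAt  time.Time       `json:"issued_at"`
	ExpiresAt time.Time       `json:"expires_at"`
	KeyID     string          `json:"key_id"`
}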

type NonceStore interface{ SeenOrRecord(nonce string, exp time.Time) bool }

type sshsigBlob struct {
	Version                                       uint32
	PublicKey, Namespace, Reserved, HashAlgo, Signature string
}

func hashByName(n string) (hash.Hash, error) {
	switch n {
	case "sha256": return sha256.New(), nil
	case "sha512": return sha512.New(), nil
	}
	return nil, fmt.Errorf("unsupported SSHSIG hash %q", n)
}

func parseArmoredSSHSIG(armored []byte) (*sshsigBlob, error) {
	block, _ := pem.Decode(armored)
	if block == nil || block.Type != "SSH SIGNATURE" {
		return nil, errors.New("not an SSH SIGNATURE armor")
	}
	if len(block.Bytes) < 6 || string(block.Bytes[:6]) != sshsigMagic {
		return nil, errors.New("missing SSHSIG magic")
	}
	var sb sshsigBlob
	if err := ssh.Unmarshal(block.Bytes[6:], &sb); err != nil { return nil, err }
	if sb.Version != 1 { return nil, fmt.Errorf("bad version %d", sb.Version) }
	return &sb, nil
}

func signedData(sb *sshsigBlob, msg []byte) ([]byte, error) {
	h, err := hashByName(sb.HashAlgo); if err != nil { return nil, err }
	h.Write(msg); md := h.Sum(nil)
	body := ssh.Marshal(struct{ Namespace, Reserved, HashAlgo string; Hash []byte }{
		sb.Namespace, sb.Reserved, sb.HashAlgo, md})
	return append([]byte(sshsigMagic), body...), nil
}

// VerifySignedOp: key-type-agnostic signature verify + anti-replay/authorization.
// allowedSigners is the trusted operator set (one key now; a quorum set later).
func VerifySignedOp(blob, sigArmored []byte, allowedSigners []ssh.PublicKey,
	thisHostID, thisGuestID string, seenNonces NonceStore) (string, error) {

	sb, err := parseArmoredSSHSIG(sigArmored)
	if err != nil { return "", err }
	if sb.Namespace != Namespace {
		return "", fmt.Errorf("namespace mismatch: got %q want %q", sb.Namespace, Namespace)
	}
	pub, err := ssh.ParsePublicKey([]byte(sb.PublicKey))
	if err != nil { return "", err }
	allowed := false
	for _, a := range allowedSigners {
		if bytes.Equal(a.Marshal(), pub.Marshal()) { allowed = true; break }
	}
	if !allowed { return "", errors.New("signer not in allowed set") }

	signed, err := signedData(sb, blob)
	if err != nil { return "", err }
	var inner ssh.Signature
	if err := ssh.Unmarshal([]byte(sb.Signature), &inner); err != nil { return "", err }
	if err := pub.Verify(signed, &inner); err != nil {     // dispatches on key algorithm
		return "", fmt.Errorf("signature invalid: %w", err)
	}

	var op OpBlob
	if err := json.Unmarshal(blob, &op); err != nil { return "", err }
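	if op.Target.HostID != thisHostID || op.Target.GuestID != thisGuestID {
		// Message shape matches the Step 3 output (blob=host/guest this=host/guest).
		return "", fmt.Errorf("target mismatch: blob=%s/%s this=%s/%s",
			op.Target.HostID, op.Target.GuestID, thisHostID, thisGuestID)
	}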
	now := time.Now().UTC()
	if now.Before(op.IssuedAt) { return "", errors.New("not yet valid") }
	if now.After(op.ExpiresAt) { return "", errors.New("expired") }
	if seenNonces.SeenOrRecord(op.Nonce, op.ExpiresAt) {
		return "", fmt.Errorf("replay: nonce %s already seen", op.Nonce)
	}
	return op.Op, nil
}

8. Inputs to the design doc (04-control-plane-authorization.md)

  • Primitive confirmed: SSHSIG (ssh-keygen -Y sign / armored BEGIN SSH SIGNATURE), verified in Go via pem.Decode + ssh.Unmarshal + ssh.ParsePublicKey + pub.Verify. Low implementation cost; no crypto hand-rolled.
  • Hub cannot forge: the operator private key never touches the hub; the hub only queues the opaque armored blob (matches 03 §4).
  • Key-type-agnostic / hardware-ready: software ed25519 now, FIDO2 sk-ssh-ed25519 later is a box no-op (proven end-to-end). The verifier hardcodes neither key type nor algorithm.
  • allowedSigners is a set: single signer today; threshold/quorum is just set sizing plus an N-of-M policy on top (out of scope here).
  • Anti-replay/authz are mandatory and cheap: namespace (fixed), allow-list, then crypto, then target-binding, time-window, nonce — all enforced and tested.
  • Canonical blob (§2) is the shared contract between the operator CLI and the agent verifier.