Architecture Advanced 2 minute read Updated 2026-06-26 UTC

Evaluator gates

Independent quality, robustness, safety, and cost gates that prevent candidates from selecting or redefining their own fitness.

Research statusEstablished evaluation and security practices Publication statePublished Reviewed byMichael Kappel Source reports2

Evaluation is a separate product

A breeding system can generate many candidates quickly. The scarce resource is trustworthy evaluation. The evaluator should be versioned, tested, monitored, and governed with at least as much care as the candidate factory.

Gate layers

  1. Package gate: signature, digest, manifest, license, malware, and dependency checks.
  2. Contract gate: schema, semantics, error behavior, permissions, and resource class.
  3. Task gate: capability benchmarks and domain-specific acceptance tests.
  4. Robustness gate: perturbations, distribution shifts, malformed inputs, and stress tests.
  5. Safety gate: prohibited behavior, data leakage, prompt injection, tool abuse, and policy compliance.
  6. Cost gate: latency, memory, throughput, energy, storage, and operational complexity.
  7. Release gate: shadow and canary evidence under production-like traffic.

Hard versus soft criteria

Hard invariants are Boolean: a candidate either satisfies the license requirement, access boundary, or prohibited-action rule or it does not. Soft criteria contribute to a score and can trade off. Never bury a hard safety condition inside a weighted average.

Judge models

A model judge can help evaluate open-ended outputs, but it should not be the only source of truth. Use deterministic validators where possible, multiple judges with known diversity, calibrated human review on samples, and hidden reference cases. Track judge–human disagreement and judge sensitivity to superficial style.

Frozen and rotating suites

Freeze a suite for a comparison round so candidates face the same environment. Rotate or expand hidden suites periodically to reduce overfitting. Keep final holdouts inaccessible to candidate-generation systems.

pseudocode
FUNCTION evaluate_candidate(candidate, champion, suite_bundle)
    package_result <- RUN_PACKAGE_GATES(candidate)
    IF NOT package_result.pass
        RETURN REJECT(package_result)
    END IF

    candidate_scores <- RUN_ALL_SUITES(candidate, suite_bundle)
    champion_scores <- RUN_ALL_SUITES(champion, suite_bundle)
    delta <- COMPARE_WITH_UNCERTAINTY(candidate_scores, champion_scores)

    hard_pass <- ALL_HARD_INVARIANTS(candidate_scores)
    viability <- CALCULATE_NET_VIABILITY(delta)

    RETURN EVIDENCE_CARD(hard_pass, viability, candidate_scores, delta)
END FUNCTION

Evaluator security

Candidates receive no write access to evaluator code, hidden cases, labels, or scoring weights. Evaluation workers use clean environments and short-lived credentials. Logs distinguish candidate output from evaluator interpretation so prompt injection cannot masquerade as an instruction to the judge.

Evaluation debt

As capabilities expand, the evaluation surface must expand. A new tool permission, language, jurisdiction, or output modality adds test obligations. Track evaluation debt explicitly; do not promote a candidate into an untested capability region merely because average scores improved.

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.