Selection pressure finds shortcuts
A candidate does not need intent to exploit an evaluator. Population search repeatedly samples variations and preserves those that score well. If the score contains a loophole, selection can amplify the loophole faster than ordinary development.
Common failure modes
- memorizing or inferring hidden cases through leakage;
- producing verbose or stylistically preferred answers that fool a judge;
- abstaining excessively to avoid errors while failing coverage;
- manipulating confidence fields;
- triggering evaluator parser bugs;
- optimizing average scores while harming critical slices;
- exploiting timing, retries, or nondeterminism;
- coordinating with a shared judge through prompt injection;
- altering preprocessing or labels in code-breeding experiments;
- increasing complexity to hide regressions from sparse tests.
Defense in depth
Use hard constraints, multiple independent metrics, deterministic validators, human-labeled audits, hidden rotating suites, adversarial cases, held-out environments, and production shadow evidence. Keep candidate outputs separate from evaluator instructions.
Metric triangulation
acceptance <-
deterministic_tests_pass
AND critical_slice_thresholds_pass
AND calibration_threshold_pass
AND safety_invariants_pass
AND human_audit_error_rate <= limit
AND judge_disagreement <= limit
AND resource_budget_passNo single judge score decides promotion.
Overfitting detection
Monitor the gap between development and final holdout, performance on newly created cases, sensitivity to small wording changes, and improvements that disappear under another judge. Re-run prior failure suites so new candidates do not trade old weaknesses for new benchmark wins.
Evaluator evolution
Evaluators need updates as threats and tasks change, but evaluator changes alter the fitness landscape. Version them, run cross-version comparisons, and avoid changing the evaluator mid-generation.
Red teams and canaries
Red-team candidates with prompts and inputs outside their training distribution. Use canary cases whose labels are known only to the evaluator service to detect leakage or tampering. Rotate them when exposure is suspected.
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.