Why tournaments
Model-breeding systems need a selection mechanism. A single global score is tempting, but it hides trade-offs: a gain on one dimension can mask a regression on another. Tournament selection instead compares candidates dimension by dimension: task performance, robustness, calibration, latency, cost, diversity, and safety.
A tournament does not mean every candidate fights every other candidate in production. It means each candidate runs through a structured evaluation bracket against the current champion, a no-op baseline, and any relevant specialists.
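To make the trade-off point concrete, here is a minimal Python sketch. The dimension names come from the text; the equal weighting, the score values, and the helper names `global_score` and `regressions` are illustrative assumptions, and every score is assumed to be oriented so that higher is better.

```python
DIMENSIONS = ("task", "robustness", "calibration", "latency",
              "cost", "diversity", "safety")

def global_score(scores, weights=None):
    """Collapse per-dimension scores into one weighted average."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights.values())
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def regressions(candidate, baseline, tolerance=0.0):
    """List the dimensions where the candidate is worse than the baseline."""
    return [d for d in DIMENSIONS if candidate[d] < baseline[d] - tolerance]

# Hypothetical scores, invented for illustration only.
champion = dict(task=0.80, robustness=0.80, calibration=0.80, latency=0.80,
                cost=0.80, diversity=0.80, safety=0.80)
candidate = dict(task=0.97, robustness=0.85, calibration=0.80, latency=0.80,
                 cost=0.80, diversity=0.80, safety=0.60)

# The candidate wins on the single global score...
wins_globally = global_score(candidate) > global_score(champion)
# ...while the per-dimension view exposes what it gave up.
lost_dimensions = regressions(candidate, champion)
```

In this invented case the candidate has the higher average, yet the per-dimension comparison flags a safety regression that the average absorbed.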
Bracket design
| Bracket | Purpose |
|---|---|
| champion comparison | must beat current deployed choice or justify complementary role |
| no-op comparison | must prove change is better than doing nothing |
| specialist comparison | must avoid duplicating existing capability without reason |
| stress comparison | must survive perturbations and drift cases |
| budget comparison | must repay memory, latency, and maintenance cost |
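One way to read the table is as a set of pass/fail checks applied to a candidate's scorecard. The sketch below is an assumption-laden illustration, not a prescribed schema: the field names, the `Thresholds` defaults, and the `bracket_results` helper are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    name: str
    champion_delta: float    # candidate minus champion on the main suite
    noop_delta: float        # candidate minus "change nothing"
    complementarity: float   # 0..1, share of cases no existing specialist covers
    stress_pass_rate: float  # share of perturbation and drift cases survived
    net_benefit: float       # gain minus memory, latency, and maintenance cost

@dataclass
class Thresholds:
    # Placeholder values; real limits would come from policy.
    minimum_gain: float = 0.01
    min_complementarity: float = 0.2
    min_stress_pass_rate: float = 0.9

def bracket_results(card, t):
    """Return one boolean per bracket from the table."""
    complementary = card.complementarity >= t.min_complementarity
    return {
        # Beat the champion, or justify a complementary role.
        "champion": card.champion_delta > 0 or complementary,
        "no-op": card.noop_delta > t.minimum_gain,
        "specialist": complementary,
        "stress": card.stress_pass_rate >= t.min_stress_pass_rate,
        "budget": card.net_benefit > 0,
    }

card = Scorecard("cand-a", champion_delta=-0.02, noop_delta=0.03,
                 complementarity=0.4, stress_pass_rate=0.85, net_benefit=0.01)
results = bracket_results(card, Thresholds())
failed = [bracket for bracket, passed in results.items() if not passed]
```

Here the hypothetical candidate trails the champion but still clears the champion bracket on complementarity; it fails only the stress bracket.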

FUNCTION run_selection_tournament(candidates, champion, ecology, policy)
    bracket <- []
    FOR each candidate IN candidates
        scorecard <- EVALUATE(candidate, policy.suites)
        scorecard.noop_delta <- COMPARE_TO_NOOP(candidate, ecology)
        scorecard.champion_delta <- COMPARE_TO(champion, candidate)
        scorecard.complementarity <- MEASURE_NON_OVERLAP(candidate, ecology)
        bracket.ADD(scorecard)
    END FOR
    eligible <- FILTER(bracket, hard_invariants_pass AND noop_delta > policy.minimum_gain)
    ranked <- SORT_BY_NET_VIABILITY_AND_DIVERSITY(eligible)
    RETURN ranked.FIRST_OR_NOOP()
END FUNCTION

Avoiding bracket gaming
Keep holdout sets private from candidate generation. Freeze suite versions during a tournament. Record all failed candidates, not only winners. Failed candidates are useful evidence because they reveal which operators are wasting budget.
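Two of these safeguards, frozen suite versions and a record of every candidate, can be folded into the tournament loop itself. The following Python sketch renders run_selection_tournament under several assumptions: scorecards arrive already evaluated, the `TournamentLog` class and its fields are hypothetical, and sorting by champion delta and then complementarity is a stand-in for SORT_BY_NET_VIABILITY_AND_DIVERSITY, which the guide does not define. Holdout privacy is an access-control concern and is not modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class Scorecard:
    candidate: str
    suite_version: str
    hard_invariants_pass: bool
    noop_delta: float
    champion_delta: float
    complementarity: float

@dataclass
class TournamentLog:
    suite_version: str                       # frozen for the whole tournament
    entries: list = field(default_factory=list)

    def record(self, card, eligible):
        """Store every scorecard, winners and losers alike."""
        if card.suite_version != self.suite_version:
            raise ValueError("suite version changed mid-tournament")
        self.entries.append((card, eligible))

def run_selection_tournament(scorecards, minimum_gain, log):
    eligible = []
    for card in scorecards:
        ok = card.hard_invariants_pass and card.noop_delta > minimum_gain
        log.record(card, ok)                 # failures are kept as evidence
        if ok:
            eligible.append(card)
    # Assumed ranking: champion delta first, complementarity as tie-breaker.
    ranked = sorted(eligible,
                    key=lambda c: (c.champion_delta, c.complementarity),
                    reverse=True)
    return ranked[0].candidate if ranked else "no-op"

# Hypothetical scorecards, invented for illustration only.
cards = [
    Scorecard("a", "v3", True, 0.05, 0.02, 0.1),
    Scorecard("b", "v3", False, 0.09, 0.06, 0.3),   # breaks a hard invariant
    Scorecard("c", "v3", True, 0.001, -0.01, 0.5),  # below the minimum gain
]
log = TournamentLog(suite_version="v3")
winner = run_selection_tournament(cards, minimum_gain=0.01, log=log)
rejected = [card.candidate for card, ok in log.entries if not ok]
```

Because rejected scorecards stay in the log, a later pass can group them by the operator that produced them and see which operators are wasting budget.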
Multi-winner tournaments
Sometimes the right output is not one winner. A candidate can lose as a general champion but win as a narrow specialist. Use lifecycle states to reflect that: candidate, shadow-specialist, canary-specialist, champion-route, or retired.
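A multi-winner outcome could be represented by mapping each candidate to a lifecycle state rather than returning a single name. The state names below come from the text; the `assign_state` rule and its three boolean inputs are assumptions made for illustration, since the guide does not specify how transitions are decided.

```python
from enum import Enum

class Lifecycle(Enum):
    CANDIDATE = "candidate"                  # not yet evaluated
    SHADOW_SPECIALIST = "shadow-specialist"
    CANARY_SPECIALIST = "canary-specialist"
    CHAMPION_ROUTE = "champion-route"
    RETIRED = "retired"

def assign_state(beats_champion, wins_niche, passed_shadow):
    """Assumed rule: general winners route as champion, niche winners
    progress from shadow to canary, everything else is retired."""
    if beats_champion:
        return Lifecycle.CHAMPION_ROUTE
    if wins_niche:
        if passed_shadow:
            return Lifecycle.CANARY_SPECIALIST
        return Lifecycle.SHADOW_SPECIALIST
    return Lifecycle.RETIRED

# Hypothetical tournament outcomes: (beats_champion, wins_niche, passed_shadow).
outcomes = {
    "generalist-7": (True, False, False),
    "ocr-specialist-2": (False, True, False),
    "ocr-specialist-1": (False, True, True),
    "duplicate-4": (False, False, False),
}
states = {name: assign_state(*flags) for name, flags in outcomes.items()}
```

With this representation, one tournament can yield a champion and several specialists at different rollout stages.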
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.