Evolution Lab | Intermediate | 2 minute read | Updated 2026-06-26 UTC

Tournament selection for model populations

How to compare candidate descendants without reducing selection to one benchmark score.

Research status: Experiment design guide | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 3

Why tournaments

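Model-breeding systems need a selection mechanism. A single global score is tempting, but it hides trade-offs: two candidates with the same total can differ sharply on the dimensions that matter in deployment. Tournament selection instead compares candidates across several dimensions: task performance, robustness, calibration, latency, cost, diversity, and safety.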

A tournament does not mean every candidate fights every other candidate in production. It means each candidate faces a structured evaluation bracket against the current champion, a no-op baseline, and the relevant specialists.
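
To make the trade-off point concrete, here is a minimal sketch in Python. The `Profile` fields, the weights in `global_score`, and the two example candidates are all hypothetical and chosen only for illustration; they are not taken from the source reports. The sketch shows two candidates that tie on a weighted score even though neither is better on every dimension.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    accuracy: float     # higher is better
    robustness: float   # higher is better
    latency_ms: float   # lower is better


def global_score(p: Profile) -> float:
    # Hypothetical weights, chosen only for illustration.
    return 0.5 * p.accuracy + 0.5 * p.robustness - 0.001 * p.latency_ms


def dominates(a: Profile, b: Profile) -> bool:
    """True if a is no worse than b on every dimension and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.robustness >= b.robustness
                and a.latency_ms <= b.latency_ms)
    better = (a.accuracy > b.accuracy
              or a.robustness > b.robustness
              or a.latency_ms < b.latency_ms)
    return no_worse and better


fast = Profile("fast", accuracy=0.80, robustness=0.70, latency_ms=50)
sturdy = Profile("sturdy", accuracy=0.70, robustness=0.90, latency_ms=100)

# The scalar score cannot tell these two apart, yet their profiles differ.
same_score = math.isclose(global_score(fast), global_score(sturdy))
neither_dominates = not dominates(fast, sturdy) and not dominates(sturdy, fast)
print(same_score, neither_dominates)
```

A per-dimension comparison keeps this difference visible, which is the information a bracket needs when deciding between a general champion and a specialist role.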

Bracket design

| Bracket | Purpose |
| --- | --- |
| champion comparison | must beat current deployed choice or justify complementary role |
| no-op comparison | must prove change is better than doing nothing |
| specialist comparison | must avoid duplicating existing capability without reason |
| stress comparison | must survive perturbations and drift cases |
| budget comparison | must repay memory, latency, and maintenance cost |

Pseudocode:
FUNCTION run_selection_tournament(candidates, champion, ecology, policy)
    bracket <- []
    FOR each candidate IN candidates
        scorecard <- EVALUATE(candidate, policy.suites)
        scorecard.noop_delta <- COMPARE_TO_NOOP(candidate, ecology)
        scorecard.champion_delta <- COMPARE_TO(candidate, champion)
        scorecard.complementarity <- MEASURE_NON_OVERLAP(candidate, ecology)
        bracket.ADD(scorecard)
    END FOR

    eligible <- FILTER(bracket, hard_invariants_pass AND noop_delta > policy.minimum_gain)
    ranked <- SORT_BY_NET_VIABILITY_AND_DIVERSITY(eligible)

    RETURN ranked.FIRST_OR_NOOP()
END FUNCTION
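
The filtering and ranking steps of the pseudocode can be sketched in runnable Python. This is a sketch under stated assumptions, not the guide's reference implementation: it assumes the scorecards have already been computed upstream, the `Scorecard` fields `net_viability` and `complementarity` are hypothetical stand-ins, and the pseudocode does not specify how `SORT_BY_NET_VIABILITY_AND_DIVERSITY` combines its criteria, so the sketch ranks by net viability first and uses complementarity as a tiebreaker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scorecard:
    name: str
    hard_invariants_pass: bool
    noop_delta: float       # gain over doing nothing
    champion_delta: float   # gain over the current champion (positive = better)
    complementarity: float  # assumed 0..1 share of behaviour not already covered
    net_viability: float    # assumed benefit-minus-cost figure computed upstream


def run_selection_tournament(bracket: List[Scorecard],
                             minimum_gain: float) -> Optional[Scorecard]:
    # Hard invariants and the no-op gain are gates, not ranking inputs.
    eligible = [
        card for card in bracket
        if card.hard_invariants_pass and card.noop_delta > minimum_gain
    ]
    # Assumed ordering: net viability first, complementarity as tiebreaker.
    ranked = sorted(
        eligible,
        key=lambda card: (card.net_viability, card.complementarity),
        reverse=True,
    )
    # None stands for "keep the no-op": no candidate earned a change.
    return ranked[0] if ranked else None


bracket = [
    Scorecard("unsafe", False, 0.30, 0.20, 0.5, 0.9),
    Scorecard("marginal", True, 0.01, -0.05, 0.2, 0.4),
    Scorecard("steady", True, 0.10, 0.02, 0.3, 0.6),
]
winner = run_selection_tournament(bracket, minimum_gain=0.05)
print(winner.name if winner else "no-op")
```

Note that the highest-viability candidate is rejected here because it fails a hard invariant, and that an empty eligible set returns the no-op rather than the least bad candidate.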

Avoiding bracket gaming

Keep holdout sets private from candidate generation. Freeze suite versions during a tournament. Record all failed candidates, not only winners. Failed candidates are useful evidence because they reveal which operators are wasting budget.
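
Two of these practices, freezing suite versions and recording every candidate, can be sketched as follows. The `TournamentLog` class, the shape of the `suites` dictionary, and the operator names are hypothetical; the sketch assumes suite definitions can be serialized to JSON and uses a SHA-256 fingerprint to detect changes during a tournament.

```python
import hashlib
import json


def suite_fingerprint(suites: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(suites, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class TournamentLog:
    def __init__(self, suites: dict):
        # Fingerprint taken once, at the start of the tournament.
        self.frozen = suite_fingerprint(suites)
        self.records = []

    def record(self, candidate: str, operator: str, passed: bool, suites: dict):
        # Refuse to log results produced against a modified suite.
        if suite_fingerprint(suites) != self.frozen:
            raise RuntimeError("suite definitions changed mid-tournament")
        # Winners and failures are stored alike.
        self.records.append(
            {"candidate": candidate, "operator": operator, "passed": passed}
        )

    def failure_rate_by_operator(self) -> dict:
        totals, failures = {}, {}
        for rec in self.records:
            op = rec["operator"]
            totals[op] = totals.get(op, 0) + 1
            if not rec["passed"]:
                failures[op] = failures.get(op, 0) + 1
        return {op: failures.get(op, 0) / totals[op] for op in totals}


suites = {"robustness": "v3", "calibration": "v1"}
log = TournamentLog(suites)
log.record("c1", "crossover", True, suites)
log.record("c2", "mutation", False, suites)
log.record("c3", "mutation", False, suites)
print(log.failure_rate_by_operator())
```

Because failures are kept, the per-operator failure rate is available as direct evidence of which operators are consuming evaluation budget without producing viable candidates.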

Multi-winner tournaments

Sometimes the right output is not one winner. A candidate can lose as a general champion but win as a narrow specialist. Use lifecycle states to reflect that: candidate, shadow-specialist, canary-specialist, champion-route, or retired.
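
The lifecycle states can be represented as an enumeration. In the sketch below, the state names come from the guide, but the `assign_state` rule and its `min_complementarity` threshold are assumptions made for illustration; the guide does not define how candidates move between states.

```python
from enum import Enum


class LifecycleState(Enum):
    CANDIDATE = "candidate"
    SHADOW_SPECIALIST = "shadow-specialist"
    CANARY_SPECIALIST = "canary-specialist"
    CHAMPION_ROUTE = "champion-route"
    RETIRED = "retired"


def assign_state(champion_delta: float,
                 complementarity: float,
                 min_complementarity: float = 0.3) -> LifecycleState:
    # Assumed rule: beating the champion earns the champion route.
    if champion_delta > 0:
        return LifecycleState.CHAMPION_ROUTE
    # Assumed rule: a loser that covers new ground starts as a shadow specialist.
    if complementarity >= min_complementarity:
        return LifecycleState.SHADOW_SPECIALIST
    # Otherwise the candidate adds nothing the ecology lacks.
    return LifecycleState.RETIRED


print(assign_state(champion_delta=-0.02, complementarity=0.6).value)
```

Under this assumed rule, a candidate that loses to the champion but covers behaviour the existing ecology lacks is kept as a specialist instead of being discarded.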

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.