Goal
The theory should improve through experiments, not through stronger metaphors. This research program identifies evidence that would make the ModelBreeder framework more precise, falsifiable, and useful.
Experiment 1: marginal specialist value
Compare a monolithic baseline, a champion model, and a population of small specialists under identical task mixes. Measure utility, latency, memory, cost, calibration, and failure correlation. The key question is whether population composition improves net viability under realistic constraints.
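Failure correlation is the least standard of these measurements, so a minimal sketch may help. It assumes each model's results are recorded as a 0/1 failure vector over the same ordered task list; the function names `failure_correlation` and `joint_failure_rate` are illustrative and do not come from the source reports.

```python
from statistics import mean


def failure_correlation(fails_a, fails_b):
    """Pearson correlation between two 0/1 failure vectors over the same tasks.

    Returns 0.0 when either model never fails or always fails, because the
    correlation is undefined in that case (an assumption of this sketch).
    """
    if len(fails_a) != len(fails_b) or not fails_a:
        raise ValueError("failure vectors must be non-empty and of equal length")
    mean_a, mean_b = mean(fails_a), mean(fails_b)
    cov = mean([(a - mean_a) * (b - mean_b) for a, b in zip(fails_a, fails_b)])
    var_a = mean([(a - mean_a) ** 2 for a in fails_a])
    var_b = mean([(b - mean_b) ** 2 for b in fails_b])
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def joint_failure_rate(failure_vectors):
    """Fraction of tasks on which every model in the population fails."""
    outcomes = [all(task) for task in zip(*failure_vectors)]
    if not outcomes:
        raise ValueError("at least one task is required")
    return sum(outcomes) / len(outcomes)
```

A population whose members fail on the same tasks (correlation near 1) adds cost without complementarity, while a low joint failure rate suggests the specialists cover each other's gaps.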
Experiment 2: no-op threshold calibration
Generate candidate descendants across a range of mutation operators. Compare aggressive promotion, conservative promotion, and explicit no-op policies. Measure regressions, cost growth, and useful improvements over time.
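One way to compare these policies is to treat each as a promotion threshold and sweep it over a simulated candidate pool. The sketch below assumes a simulation in which each candidate has both a measured net gain (what the evaluator sees) and a true net gain (known only to the simulator); the `Candidate` fields and the threshold values in the usage note are hypothetical, not taken from the source reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    measured_net: float  # evaluator's estimate of gain minus cost (hypothetical field)
    true_net: float      # ground truth, available only in simulation


def decide(candidate, threshold):
    """Promote only if the measured net gain clears the threshold; otherwise no-op."""
    return "promote" if candidate.measured_net > threshold else "no-op"


def sweep_thresholds(candidates, thresholds):
    """For each threshold, count promotions, promoted regressions, and missed gains."""
    results = {}
    for threshold in thresholds:
        promoted = [c for c in candidates if decide(c, threshold) == "promote"]
        held_back = [c for c in candidates if decide(c, threshold) == "no-op"]
        results[threshold] = {
            "promoted": len(promoted),
            "regressions": sum(1 for c in promoted if c.true_net < 0),
            "missed": sum(1 for c in held_back if c.true_net > 0),
        }
    return results
```

Under this reading, a threshold of 0.0 behaves like aggressive promotion and a higher threshold like conservative promotion; the "regressions" and "missed" counts expose the trade-off that the calibration experiment is meant to measure.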
Experiment 3: quality-diversity archive utility
Maintain a MAP-Elites-style archive of specialists across task and runtime descriptors. After workload shift, measure whether archive-seeded experiments recover faster than experiments seeded only from the current champion.
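A minimal sketch of the archive and the recovery measurement follows. It keeps only the core MAP-Elites rule (one elite per descriptor cell, replaced only by a fitter candidate) and assumes descriptors are hashable tuples such as a task family paired with a latency bucket; the class and function names are illustrative.

```python
class Archive:
    """One elite per descriptor cell, in the style of MAP-Elites."""

    def __init__(self):
        self._cells = {}  # descriptor -> (candidate, fitness)

    def add(self, descriptor, candidate, fitness):
        """Insert the candidate if its cell is empty or it beats the incumbent."""
        incumbent = self._cells.get(descriptor)
        if incumbent is None or fitness > incumbent[1]:
            self._cells[descriptor] = (candidate, fitness)
            return True
        return False

    def seeds(self):
        """Candidates available to seed a new experiment after workload shift."""
        return [candidate for candidate, _ in self._cells.values()]


def steps_to_recover(utility_trace, target):
    """Index of the first step at which utility reaches the target, else None."""
    for step, utility in enumerate(utility_trace):
        if utility >= target:
            return step
    return None
```

Comparing `steps_to_recover` for an archive-seeded run against a champion-seeded run, on the same post-shift workload and target, gives the recovery comparison the experiment calls for.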
Experiment 4: evaluator independence stress test
Allow some candidate generators to propose changes to evaluation thresholds, test suites, or router policies, but require external approval. Measure how often candidates improve genuine performance versus exploiting evaluator weaknesses.
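One hedged way to operationalize "genuine versus exploit" is to score each candidate twice: on the internal suite the generators can influence and on an independent held-out suite they cannot. The sketch below assumes both gains are measured relative to the frozen baseline; the labels, the `tolerance` parameter, and the function names are assumptions made for illustration.

```python
def classify(internal_gain, heldout_gain, tolerance=0.0):
    """Label a candidate by comparing internal-suite and held-out-suite gains."""
    if internal_gain <= tolerance:
        return "no_improvement"
    if heldout_gain > tolerance:
        return "genuine"
    # Improved only where the evaluator could be influenced.
    return "suspected_exploit"


def exploit_rate(gain_pairs, tolerance=0.0):
    """Share of internally improved candidates that fail to improve on held-out data."""
    labels = [classify(internal, heldout, tolerance) for internal, heldout in gain_pairs]
    improved = [label for label in labels if label != "no_improvement"]
    if not improved:
        return 0.0
    return improved.count("suspected_exploit") / len(improved)
```

A "suspected_exploit" label is a flag for external review rather than proof of gaming, since a held-out suite can also miss real improvements.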
Experiment 5: human capability retention
In a documentation, coding, or analysis workflow, let a team work with AI assistance for several weeks, then measure its performance under three conditions: full AI assistance, reduced assistance, and no assistance. The goal is to determine whether the system is scaffolding human skill or replacing it.
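The comparison across assistance levels can be summarized with simple arithmetic. The sketch below assumes task scores are recorded per condition under the hypothetical keys "full", "reduced", and "none", with higher scores being better; the "dependence gap" metric is an illustrative choice, not a measure defined in the source reports.

```python
from statistics import mean


def condition_means(scores):
    """Mean score per assistance condition."""
    return {condition: mean(values) for condition, values in scores.items()}


def dependence_gap(scores):
    """Relative drop in mean score from full assistance to no assistance.

    0.0 means unassisted performance matches assisted performance;
    values near 1.0 mean the team achieves little without the system.
    """
    means = condition_means(scores)
    if means["full"] <= 0:
        raise ValueError("full-assistance mean must be positive")
    return 1 - means["none"] / means["full"]
```

A small gap is consistent with scaffolding and a large gap with replacement, though separating the two properly would also need a pre-study unassisted baseline, which this sketch does not model.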
Experiment template
FUNCTION run_modelbreeder_experiment(hypothesis, environment, policy)
    REGISTER_EXPERIMENT(hypothesis, created_at_utc: NOW_UTC())
    baseline <- FREEZE_BASELINE(environment)
    candidates <- GENERATE_CANDIDATES(policy.allowed_operators)
    evidence <- EVALUATE_ALL(baseline, candidates, policy.suites)
    decisions <- APPLY_VIABILITY_POLICY(evidence, policy)
    limitations <- DOCUMENT_LIMITATIONS(environment, policy, evidence)
    REPORT_RESULTS(hypothesis, evidence, decisions, limitations)
    ARCHIVE_REPRODUCIBLE_PACKET(hypothesis, environment, policy, evidence, decisions, limitations)
END FUNCTION
Minimum evidence packet
Every experiment should publish the environment definition, candidate generation rules, evaluation suite version, resource ledger, random seeds where applicable, hardware profile, rejected candidates, promotion rules, and known limitations.
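The packet contents listed above map naturally onto a small record type with a completeness check. This is a sketch under stated assumptions: the field names paraphrase the list, the field types are guesses, and treating random seeds and rejected candidates as allowed-to-be-empty is an editorial choice (seeds are "where applicable", and an experiment may reject nothing).

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import Optional


@dataclass
class EvidencePacket:
    environment_definition: str
    candidate_generation_rules: str
    evaluation_suite_version: str
    hardware_profile: str
    promotion_rules: str
    resource_ledger: dict = field(default_factory=dict)
    rejected_candidates: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)
    random_seeds: Optional[list] = None  # only where applicable


def missing_fields(packet):
    """Names of required fields that are empty, in declaration order."""
    may_be_empty = {"random_seeds", "rejected_candidates"}
    return [
        f.name
        for f in fields(packet)
        if f.name not in may_be_empty and not getattr(packet, f.name)
    ]


def to_json(packet):
    """Serialize the packet deterministically for archiving."""
    return json.dumps(asdict(packet), sort_keys=True, indent=2)
```

Refusing to publish while `missing_fields` is non-empty is one simple way to enforce the minimum packet before results are reported.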
What would falsify the theory
The theory weakens if populations consistently add cost without useful complementarity, if no-op thresholds block nearly all useful innovation, if archives do not help after drift, or if evaluator independence is too expensive to operate. Those outcomes would not be embarrassing. They would clarify where the framework must change.
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.