Theory | Advanced | 2 minute read | Updated 2026-06-26 UTC

Research program

A prioritized set of experiments that would convert the ModelBreeder theory into stronger evidence.

Research status: Research roadmap | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 4

Goal

The theory should improve through experiments, not through stronger metaphors. This research program identifies evidence that would make the ModelBreeder framework more precise, falsifiable, and useful.

Experiment 1: marginal specialist value

Compare a monolithic baseline, a champion model, and a population of small specialists under identical task mixes. Measure utility, latency, memory, cost, calibration, and failure correlation. The key question is whether population composition improves net viability under realistic constraints.
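Of the measurements listed above, failure correlation is the one most specific to populations: specialists that fail on the same inputs add cost without adding coverage. A minimal sketch, assuming each model's results are recorded as a 0/1 failure vector over the same task list; the function name and the convention of returning 0.0 for a constant vector are illustrative choices, not part of the framework.

```python
from math import sqrt


def failure_correlation(failures_a, failures_b):
    """Pearson (phi) correlation between two equal-length 0/1 failure vectors.

    Returns 0.0 when either vector is constant, because the correlation is
    undefined in that case (a convention assumed for this sketch).
    """
    if len(failures_a) != len(failures_b) or not failures_a:
        raise ValueError("need two non-empty vectors of equal length")
    n = len(failures_a)
    mean_a = sum(failures_a) / n
    mean_b = sum(failures_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(failures_a, failures_b))
    var_a = sum((a - mean_a) ** 2 for a in failures_a)
    var_b = sum((b - mean_b) ** 2 for b in failures_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)
```

Values near 1 suggest two specialists are redundant on this task mix; values near 0 or below suggest they fail on different inputs and may be complementary.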

Experiment 2: no-op threshold calibration

Generate candidate descendants across a range of mutation operators. Compare aggressive promotion, conservative promotion, and explicit no-op policies. Measure regressions, cost growth, and useful improvements over time.
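The three promotion policies can be written as decision rules over the same candidate measurements, which makes them directly comparable. This is a sketch under stated assumptions: the field names, the 0.02 default threshold, and the idea that utility gain and cost growth share a common scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateResult:
    utility_gain: float  # candidate utility minus baseline utility
    cost_growth: float   # candidate cost minus baseline cost (same scale, by assumption)
    regressions: int     # previously passing checks that now fail


def decide(result, policy, noop_threshold=0.02):
    """Return "promote" or "no-op" for one candidate under a named policy."""
    if policy == "aggressive":
        # Promote on any measured gain, ignoring cost and regressions.
        return "promote" if result.utility_gain > 0 else "no-op"
    if policy == "conservative":
        # Promote only when there is a gain and nothing regressed.
        ok = result.utility_gain > 0 and result.regressions == 0
        return "promote" if ok else "no-op"
    if policy == "explicit_noop":
        # Promote only when gain net of cost growth clears the threshold.
        net = result.utility_gain - result.cost_growth
        ok = net >= noop_threshold and result.regressions == 0
        return "promote" if ok else "no-op"
    raise ValueError(f"unknown policy: {policy}")
```

Running all three policies over the same candidate stream and logging the decisions gives the regression, cost-growth, and improvement counts the experiment asks for.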

Experiment 3: quality-diversity archive utility

Maintain a MAP-Elites-style archive of specialists across task and runtime descriptors. After a workload shift, measure whether archive-seeded experiments recover performance faster than experiments seeded only from the current champion.
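The core of a MAP-Elites-style archive is small: one slot per descriptor cell, holding the best specialist seen for that cell. A minimal sketch, assuming a descriptor is a `(task_family, runtime_bucket)` tuple and that a single scalar score ranks specialists within a cell; the class and method names are hypothetical.

```python
class EliteArchive:
    """Keeps the best-scoring specialist per descriptor cell (MAP-Elites style)."""

    def __init__(self):
        self._cells = {}  # (task_family, runtime_bucket) -> (score, specialist_id)

    def offer(self, descriptor, score, specialist_id):
        """Store the specialist if its cell is empty or it beats the incumbent."""
        incumbent = self._cells.get(descriptor)
        if incumbent is None or score > incumbent[0]:
            self._cells[descriptor] = (score, specialist_id)
            return True
        return False

    def seeds_for(self, task_family):
        """Return ids of elites for one task family, highest score first."""
        matches = [
            (score, sid)
            for (family, _), (score, sid) in self._cells.items()
            if family == task_family
        ]
        return [sid for _, sid in sorted(matches, reverse=True)]
```

After a shift toward a given task family, `seeds_for` supplies the archive-seeded starting points, while the champion-only arm starts from a single model.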

Experiment 4: evaluator independence stress test

Allow some candidate generators to propose changes to evaluation thresholds, test suites, or router policies, but require external approval before any such change takes effect. Measure how often candidates improve genuine performance and how often they instead exploit evaluator weaknesses.
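One way to separate genuine improvement from evaluator exploitation is to score each candidate on two suites: the internal suite that candidates can propose changes to, and a held-out suite they cannot touch. The held-out suite, the tolerance value, and the label names below are assumptions of this sketch, not something the source reports prescribe.

```python
def classify_gain(internal_gain, heldout_gain, tolerance=0.01):
    """Label a candidate by comparing its gain on the two suites."""
    if internal_gain <= tolerance:
        return "no_gain"
    if heldout_gain > tolerance:
        return "genuine"
    # Gain appears only where the candidate could influence the evaluator.
    return "suspected_exploit"


def exploit_rate(gain_pairs, tolerance=0.01):
    """Fraction of gaining candidates whose gain does not transfer to held-out data."""
    labels = [classify_gain(i, h, tolerance) for i, h in gain_pairs]
    gained = [label for label in labels if label != "no_gain"]
    if not gained:
        return 0.0
    return gained.count("suspected_exploit") / len(gained)
```

A "suspected_exploit" label is a flag for review rather than proof, since a real gain can also fail to transfer when the held-out suite covers different tasks.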

Experiment 5: human capability retention

In a documentation, coding, or analysis workflow, let a team work with AI assistance for several weeks, then measure its performance under full assistance, reduced assistance, and no assistance. The goal is to determine whether the system is scaffolding human skill or replacing it.
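The three assistance conditions reduce to a few comparable numbers. A minimal sketch, assuming each condition yields one scalar team score on a shared scale; the optional pre-study unassisted baseline is an addition of this sketch, and all names are hypothetical.

```python
def capability_summary(full, reduced, none, unassisted_baseline=None):
    """Summarize how much team performance depends on assistance."""
    summary = {
        "reliance_gap": full - none,    # performance lost when assistance is removed
        "partial_gap": full - reduced,  # performance lost under reduced assistance
    }
    if unassisted_baseline is not None:
        if unassisted_baseline <= 0:
            raise ValueError("unassisted_baseline must be positive")
        # Below 1.0: unassisted skill dropped during the study; above 1.0: it grew.
        summary["retention"] = none / unassisted_baseline
    return summary
```

A large reliance gap alone does not distinguish scaffolding from replacement; the retention ratio, where a baseline exists, is what indicates whether unassisted skill changed.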

Experiment template

pseudocode
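FUNCTION run_modelbreeder_experiment(hypothesis, environment, policy)
    // Register the hypothesis before any evidence exists.
    REGISTER_EXPERIMENT(hypothesis, created_at_utc: NOW_UTC())
    baseline <- FREEZE_BASELINE(environment)
    candidates <- GENERATE_CANDIDATES(policy.allowed_operators)
    evidence <- EVALUATE_ALL(baseline, candidates, policy.suites)
    decisions <- APPLY_VIABILITY_POLICY(evidence, policy)
    // COLLECT_LIMITATIONS is a placeholder for however known limitations are recorded.
    limitations <- COLLECT_LIMITATIONS(environment, evidence)
    REPORT_RESULTS(hypothesis, evidence, decisions, limitations)
    ARCHIVE_REPRODUCIBLE_PACKET(hypothesis, environment, policy, evidence, decisions, limitations)
END FUNCTION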
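The template can also be sketched as runnable Python. This version makes simplifying assumptions: candidates arrive as a name-to-candidate dictionary, `evaluate` returns one scalar score, and `is_viable` stands in for the viability policy. None of these signatures come from the source reports.

```python
from datetime import datetime, timezone


def run_modelbreeder_experiment(hypothesis, baseline, candidates,
                                evaluate, is_viable, limitations=()):
    """Run one experiment and return a packet that can be archived as-is."""
    packet = {
        "hypothesis": hypothesis,
        # Timestamp is taken before any evidence is produced.
        "created_at_utc": datetime.now(timezone.utc).isoformat(),
        "limitations": list(limitations),
    }
    baseline_score = evaluate(baseline)  # the baseline is scored, never modified
    evidence = {name: evaluate(c) for name, c in candidates.items()}
    decisions = {
        name: "promote" if is_viable(score, baseline_score) else "no-op"
        for name, score in evidence.items()
    }
    packet.update(baseline_score=baseline_score, evidence=evidence,
                  decisions=decisions)
    return packet
```

Returning a single dictionary keeps the report and the archive in sync, because both are produced from the same object.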

Minimum evidence packet

Every experiment should publish the environment definition, candidate generation rules, evaluation suite version, resource ledger, random seeds where applicable, hardware profile, rejected candidates, promotion rules, and known limitations.
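The packet requirements above can be enforced with a simple completeness check before publication. A minimal sketch, assuming the packet is a dictionary; the key names are hypothetical spellings of the items listed above, and random seeds are treated as optional because the text says "where applicable".

```python
REQUIRED_FIELDS = (
    "environment_definition",
    "candidate_generation_rules",
    "evaluation_suite_version",
    "resource_ledger",
    "hardware_profile",
    "rejected_candidates",
    "promotion_rules",
    "known_limitations",
)
OPTIONAL_FIELDS = ("random_seeds",)  # required only where applicable


def missing_fields(packet):
    """Return required fields that are absent or None, in declaration order.

    Empty lists are accepted: an experiment may have no rejected candidates.
    """
    return [f for f in REQUIRED_FIELDS if packet.get(f) is None]
```

An experiment whose `missing_fields` result is non-empty would be held back until the gaps are filled or documented.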

What would falsify the theory

The theory weakens if populations consistently add cost without useful complementarity, if no-op thresholds block nearly all useful innovation, if archives do not help after drift, or if evaluator independence is too expensive to operate. Those outcomes would not be embarrassing. They would clarify where the framework must change.

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.