Evolution Lab · Intermediate · 2 minute read · Updated 2026-06-26 UTC

Experiment design

Reproducible hypotheses, baselines, frozen suites, seeds, ablations, budgets, and stopping rules for model-breeding research.

Research status: Established experimental methodology. Publication state: Published. Reviewed by: Michael Kappel. Source reports: 2.

Every breeding cycle is an experiment

A good experiment can conclude that the proposed operator did not help. A bad experiment produces a candidate but cannot explain why it changed or whether the result generalizes.

Experiment charter

pseudocode
experiment <- {
    id: "EXP-2026-044",
    hypothesis: "A domain adapter improves contract accuracy without increasing p95 latency by more than 20 ms.",
    parents: [champion_artifact_id],
    operator: "adapter_train/1.3",
    independent_variables: [adapter_rank, learning_rate],
    fixed_variables: [base_model, tokenizer, dataset_split, runtime],
    baselines: [champion, majority_class, prior_adapter],
    primary_metrics: [critical_slice_recall, calibration, p95_latency],
    hard_invariants: [privacy, license, prohibited_actions],
    budget: {gpu_hours: 24, candidates: 12},
    stopping_rule: "budget exhausted or confidence target reached",
    preregistered_at_utc: NOW_UTC()
}
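
As a rough illustration, the charter above could be held in an immutable record and checked for gaps before any run starts. This is a minimal Python sketch, not the guide's own tooling: the class name `ExperimentCharter`, the helper `charter_problems`, and the specific checks are assumptions, and only a subset of the charter fields is modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExperimentCharter:
    """Hypothetical immutable record of decisions made before the run."""
    id: str
    hypothesis: str
    parents: tuple
    operator: str
    baselines: tuple
    primary_metrics: tuple
    budget: dict
    stopping_rule: str
    # Timestamp is taken once, at construction, in UTC.
    preregistered_at_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def charter_problems(charter: ExperimentCharter) -> list:
    """Return readable problems; an empty list means no gaps were found."""
    problems = []
    if not charter.hypothesis.strip():
        problems.append("hypothesis is empty")
    if not charter.baselines:
        problems.append("no baselines listed")
    if not charter.primary_metrics:
        problems.append("no primary metrics listed")
    if not charter.budget or any(v <= 0 for v in charter.budget.values()):
        problems.append("budget limits must be positive")
    if not charter.stopping_rule.strip():
        problems.append("stopping rule is empty")
    return problems


charter = ExperimentCharter(
    id="EXP-2026-044",
    hypothesis="A domain adapter improves contract accuracy "
               "without increasing p95 latency by more than 20 ms.",
    parents=("champion_artifact_id",),
    operator="adapter_train/1.3",
    baselines=("champion", "majority_class", "prior_adapter"),
    primary_metrics=("critical_slice_recall", "calibration", "p95_latency"),
    budget={"gpu_hours": 24, "candidates": 12},
    stopping_rule="budget exhausted or confidence target reached",
)
print(charter_problems(charter))
```

Freezing the dataclass prevents reassigning fields after preregistration; it does not deep-freeze the `budget` dictionary, so a stricter version would copy it into an immutable mapping.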

Baselines

Always compare with the current champion, a no-op run, and a simple baseline. For recombination, compare with both parents and an output ensemble. For a router, compare with static routing. For an expensive search, compare with random search under equal budget.
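
The baseline rules above can be written down as a lookup so that a charter can be checked mechanically. A small Python sketch follows; the operator kind names (`"recombination"`, `"router"`, `"expensive_search"`) and the baseline labels are hypothetical identifiers chosen for this example.

```python
# Baselines the text requires for every experiment.
ALWAYS_REQUIRED = ["champion", "no_op", "simple_baseline"]

# Extra baselines per operator kind; the keys are hypothetical labels.
EXTRA_BY_OPERATOR_KIND = {
    "recombination": ["parent_a", "parent_b", "output_ensemble"],
    "router": ["static_routing"],
    "expensive_search": ["random_search_equal_budget"],
}


def required_baselines(operator_kind: str) -> list:
    """List the baselines an experiment of this kind should include."""
    return ALWAYS_REQUIRED + EXTRA_BY_OPERATOR_KIND.get(operator_kind, [])


def missing_baselines(operator_kind: str, declared) -> list:
    """Return required baselines that the charter did not declare."""
    declared_set = set(declared)
    return [b for b in required_baselines(operator_kind) if b not in declared_set]


print(missing_baselines("router", ["champion", "no_op"]))
```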

Data splits

Keep training, development, evaluation, and final holdout data separate. Use time-based or organization-based splits when leakage across similar records is likely. Record data manifests and exact preprocessing.
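
One way to get an organization-based split is to hash the group key, so that every record from the same organization lands in the same split and the assignment is reproducible. The sketch below assumes a 70/10/10/10 proportion, a `salt` string, and the function names `assign_split` and `split_records`; none of these come from the source reports.

```python
import hashlib
from collections import defaultdict

# Assumed proportions for illustration: 70% train, 10% dev, 10% eval, 10% holdout.
SPLIT_BOUNDS = (("train", 70), ("dev", 80), ("eval", 90), ("holdout", 100))


def assign_split(group_key: str, salt: str = "split-v1") -> str:
    """Map a group (for example an organization ID) to one split, deterministically."""
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    for name, upper in SPLIT_BOUNDS:
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below 100")


def split_records(records, key):
    """Group records by split; records sharing record[key] always land together."""
    splits = defaultdict(list)
    for record in records:
        splits[assign_split(str(record[key]))].append(record)
    return dict(splits)


records = [
    {"org": "acme", "id": 1},
    {"org": "acme", "id": 2},
    {"org": "globex", "id": 3},
]
for split_name, rows in split_records(records, key="org").items():
    print(split_name, [row["id"] for row in rows])
```

Changing the salt produces a different assignment, so the salt belongs in the data manifest alongside the preprocessing details.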

Multiple seeds and uncertainty

Report the distribution of results across seeds, not a single best run. Use enough seeds to estimate variance within the available budget. Pair candidate and champion evaluations on the same cases where appropriate.
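
A paired comparison can be summarized with a bootstrap confidence interval on the per-case (or per-seed) score differences. The sketch below uses a percentile bootstrap, which is one common choice and not necessarily the method the source reports used; the function name, the resample count, and the example scores are illustrative assumptions.

```python
import random
import statistics


def paired_bootstrap_ci(candidate, champion, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean difference, lower bound, upper bound) for candidate - champion.

    Scores must be aligned: candidate[i] and champion[i] come from the same
    case or the same seed.
    """
    if len(candidate) != len(champion) or not candidate:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [c - b for c, b in zip(candidate, champion)]
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Made-up scores for five seeds, purely for illustration.
candidate_scores = [0.81, 0.79, 0.84, 0.80, 0.83]
champion_scores = [0.78, 0.77, 0.80, 0.79, 0.80]
mean_diff, lower, upper = paired_bootstrap_ci(candidate_scores, champion_scores)
print(f"mean difference {mean_diff:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

With only a handful of seeds the interval is coarse; it is meant to discourage reading a single best run as the result, not to replace a proper power analysis.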

Ablations

Remove or replace one component at a time to identify contribution. If a pipeline improves after adding retrieval, a new judge, and an adapter simultaneously, the experiment needs ablations before claiming the adapter caused the gain.
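
For the retrieval, judge, and adapter example, a leave-one-out ablation plan can be generated mechanically. This is a sketch under the assumption that each component can be toggled independently by a boolean flag; the component names and the function `leave_one_out_ablations` are hypothetical.

```python
def leave_one_out_ablations(components):
    """Build one run config per removed component, plus the full pipeline.

    `components` maps a component name to True (enabled) in the full pipeline.
    """
    full = dict(components)
    runs = {"full": full}
    for name in components:
        variant = dict(full)
        variant[name] = False  # disable exactly one component
        runs[f"without_{name}"] = variant
    return runs


plan = leave_one_out_ablations({"retrieval": True, "new_judge": True, "adapter": True})
for run_name, config in plan.items():
    print(run_name, config)
```

Comparing `full` against `without_adapter` isolates the adapter's contribution while retrieval and the new judge stay on. Interactions between components would need additional runs beyond leave-one-out.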

Stopping rules

Define maximum cost, maximum candidates, early-stopping criteria, and conditions for inconclusive results before running. Do not keep searching until a desired result appears unless the analysis accounts for multiple comparisons; each additional candidate evaluated raises the chance of a spurious win.
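
A stopping rule fixed in advance can be expressed as a small decision function. The sketch below reuses the 24 GPU-hour and 12-candidate budget from the charter and applies a Bonferroni correction across the planned candidates. Bonferroni is a simple, conservative choice made for this example and is not prescribed by the guide; the class name, the `alpha` default, and the returned strings are also assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoppingRule:
    max_gpu_hours: float
    max_candidates: int
    alpha: float = 0.05  # assumed family-wise error rate

    def per_candidate_alpha(self) -> float:
        """Bonferroni: split alpha across every candidate the budget allows."""
        return self.alpha / self.max_candidates

    def decide(self, gpu_hours_used: float, candidates_evaluated: int,
               best_p_value: Optional[float]) -> str:
        """Return the action for the current state of the search."""
        if best_p_value is not None and best_p_value <= self.per_candidate_alpha():
            return "stop: confidence target reached"
        if (gpu_hours_used >= self.max_gpu_hours
                or candidates_evaluated >= self.max_candidates):
            return "stop: budget exhausted, result inconclusive"
        return "continue"


rule = StoppingRule(max_gpu_hours=24, max_candidates=12)
print(rule.decide(gpu_hours_used=10, candidates_evaluated=5, best_p_value=0.03))
print(rule.decide(gpu_hours_used=24, candidates_evaluated=12, best_p_value=0.03))
```

In this sketch a p-value of 0.03 would pass an uncorrected 0.05 threshold but not the corrected one, so the first call continues and the second ends as inconclusive instead of being reported as a win.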

Negative results

Record failed operators, incompatible parents, unstable slices, and cost overruns. Negative evidence prevents repeated expensive searches and improves future parent or operator selection.

Reproducibility package

Preserve the experiment charter, code digest, environment, hardware, seeds, parent packages, data manifests, logs, candidate packages, evaluator versions, and final decision. A result that cannot be reproduced is not suitable for automated promotion.
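
A reproducibility package is easier to verify when each preserved item is recorded with a content digest. The following Python sketch builds and checks such a manifest with SHA-256; the manifest layout, the field names, and the helper functions are assumptions for illustration, and real artifacts would be read from files instead of in-memory bytes.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(experiment_id: str, artifacts: dict, seeds, decision: str) -> dict:
    """Record a digest per artifact, then a digest over the manifest body."""
    manifest = {
        "experiment_id": experiment_id,
        "artifacts": {name: sha256_hex(blob) for name, blob in sorted(artifacts.items())},
        "seeds": list(seeds),
        "decision": decision,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    body = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_digest"] = sha256_hex(body.encode("utf-8"))
    return manifest


def verify_artifacts(manifest: dict, artifacts: dict) -> bool:
    """True only if every recorded artifact is present and unchanged."""
    return all(
        name in artifacts and sha256_hex(artifacts[name]) == digest
        for name, digest in manifest["artifacts"].items()
    )


artifacts = {
    "charter.json": b'{"id": "EXP-2026-044"}',
    "eval_log.txt": b"placeholder log contents",
}
manifest = build_manifest("EXP-2026-044", artifacts, seeds=[0, 1, 2], decision="not promoted")
print(verify_artifacts(manifest, artifacts))
```

An automated promotion step could refuse any candidate whose manifest fails this check, which is one way to act on the rule that unreproducible results are not promoted.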

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.