Every breeding cycle is an experiment
A good experiment can conclude that the proposed operator did not help. A bad experiment produces a candidate but cannot explain why it changed or whether the result generalizes.
Experiment charter
experiment <- {
  id: "EXP-2026-044",
  hypothesis: "A domain adapter improves contract accuracy without exceeding 20 ms p95 latency increase.",
  parents: [champion_artifact_id],
  operator: "adapter_train/1.3",
  independent_variables: [adapter_rank, learning_rate],
  fixed_variables: [base_model, tokenizer, dataset_split, runtime],
  baselines: [champion, majority_class, prior_adapter],
  primary_metrics: [critical_slice_recall, calibration, p95_latency],
  hard_invariants: [privacy, license, prohibited_actions],
  budget: {gpu_hours: 24, candidates: 12},
  stopping_rule: "budget exhausted or confidence target reached",
  preregistered_at_utc: NOW_UTC()
}
Baselines
Always compare with the current champion, no-op, and a simple baseline. For recombination, compare with both parents and an output ensemble. For a router, compare with static routing. For an expensive search, compare with random search under equal budget.
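The baseline rules above can be checked mechanically before a run starts. The following is a minimal sketch, not a prescribed implementation: the operator kinds and baseline labels (`no_op`, `output_ensemble`, and so on) are hypothetical names chosen for illustration.

```python
# Baselines every experiment should declare (labels are illustrative assumptions).
ALWAYS_REQUIRED = {"champion", "no_op", "simple_baseline"}

# Extra baselines by operator kind, following the rules in the text.
EXTRA_REQUIRED = {
    "recombination": {"parent_a", "parent_b", "output_ensemble"},
    "router": {"static_routing"},
    "expensive_search": {"random_search_equal_budget"},
}


def missing_baselines(operator_kind, declared):
    """Return the required baselines that the charter does not declare."""
    required = ALWAYS_REQUIRED | EXTRA_REQUIRED.get(operator_kind, set())
    return sorted(required - set(declared))


if __name__ == "__main__":
    print(missing_baselines("router", ["champion", "no_op", "simple_baseline"]))
```

A charter could be rejected at preregistration time whenever this list is not empty.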
Data splits
Keep training, development, evaluation, and final holdout data separate. Use time-based or organization-based splits when leakage across similar records is likely. Record data manifests and exact preprocessing.
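One way to make an organization-based split leak-resistant is to assign the whole group, not the individual record, to a split. The sketch below assumes each record is a dict with a hypothetical `org_id` field, and the 70/10/10/10 proportions are an illustrative choice rather than a recommendation.

```python
import hashlib


def assign_split(group_id, salt="split-v1"):
    """Deterministically map a group (e.g. an organization) to one split."""
    digest = hashlib.sha256(f"{salt}:{group_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 70:
        return "train"
    if bucket < 80:
        return "dev"
    if bucket < 90:
        return "eval"
    return "holdout"


def split_records(records, group_key="org_id"):
    """Place every record in the split chosen for its group."""
    splits = {"train": [], "dev": [], "eval": [], "holdout": []}
    for record in records:
        splits[assign_split(str(record[group_key]))].append(record)
    return splits
```

Because the assignment depends only on the salt and the group id, it can be recorded in the data manifest and reproduced exactly. Changing the salt produces a different split, so the salt belongs in the manifest too.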
Multiple seeds and uncertainty
Report the distribution of results across runs, not a single best run. Use enough seeds to estimate variance within the available budget. Pair candidate and champion evaluations on the same cases where appropriate, so that case-to-case difficulty cancels out of the comparison.
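A paired comparison can be summarized with a bootstrap interval on the per-case score differences. This is a minimal sketch using only the standard library. The percentile bootstrap is one of several reasonable choices, and the function name and defaults are assumptions made for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(candidate_scores, champion_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-case difference with a percentile bootstrap interval."""
    if len(candidate_scores) != len(champion_scores):
        raise ValueError("paired evaluation needs scores on the same cases")
    diffs = [c - b for c, b in zip(candidate_scores, champion_scores)]
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper
```

An interval that contains zero suggests the candidate has not been shown to differ from the champion on these cases. The same function can be run once per seed to show how much the estimate moves between runs.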
Ablations
Remove or replace one component at a time to identify contribution. If a pipeline improves after adding retrieval, a new judge, and an adapter simultaneously, the experiment needs ablations before claiming the adapter caused the gain.
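A leave-one-out ablation loop for the retrieval, judge, and adapter example might look like the sketch below. The `evaluate` callable is a hypothetical stand-in for a real evaluation harness, and the toy scores are invented solely to make the example runnable.

```python
def run_ablations(components, evaluate):
    """Score the full pipeline, then re-score with each component removed.

    Returns the full score and, per component, the drop caused by removing it.
    """
    full_set = frozenset(components)
    full_score = evaluate(full_set)
    contribution = {}
    for name in components:
        contribution[name] = full_score - evaluate(full_set - {name})
    return full_score, contribution


def toy_evaluate(enabled):
    """Illustrative additive scorer; real components usually interact."""
    gains = {"retrieval": 5, "judge": 0, "adapter": 2}
    return 70 + sum(gains[name] for name in enabled)


if __name__ == "__main__":
    print(run_ablations(["retrieval", "judge", "adapter"], toy_evaluate))
```

In this toy case most of the gain disappears when retrieval is removed, so crediting the adapter alone would be unjustified. With interacting components, leave-one-out drops do not sum to the total gain, which is a reason to report them as evidence and not as an exact attribution.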
Stopping rules
Before running, define the maximum cost, the maximum number of candidates, the early-stopping criteria, and the conditions under which the result is declared inconclusive. Do not keep searching until a desired result appears: every extra candidate is another comparison, and the analysis must account for multiple comparisons.
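A preregistered stopping rule can be written as a small function that is fixed before the first candidate is evaluated. The sketch below assumes the budget from the charter (24 GPU hours, 12 candidates) and uses a Bonferroni correction, which is a conservative, swapped-in choice; the source does not prescribe a specific correction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_gpu_hours: float
    max_candidates: int


def should_stop(budget, gpu_hours_used, candidates_tried, best_p_value, alpha=0.05):
    """Return a stop reason, or None if the search may continue."""
    # Bonferroni: split alpha across the maximum number of planned comparisons.
    threshold = alpha / budget.max_candidates
    if best_p_value is not None and best_p_value <= threshold:
        return "confidence target reached"
    if (gpu_hours_used >= budget.max_gpu_hours
            or candidates_tried >= budget.max_candidates):
        return "inconclusive: budget exhausted"
    return None


if __name__ == "__main__":
    budget = Budget(max_gpu_hours=24, max_candidates=12)
    print(should_stop(budget, gpu_hours_used=10, candidates_tried=3, best_p_value=0.01))
```

With 12 planned candidates the per-comparison threshold is 0.05 / 12, so a p-value of 0.01 that would look significant in isolation does not end the search.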
Negative results
Record failed operators, incompatible parents, unstable slices, and cost overruns. Negative evidence prevents repeated expensive searches and improves future parent or operator selection.
Reproducibility package
Preserve the experiment charter, code digest, environment, hardware, seeds, parent packages, data manifests, logs, candidate packages, evaluator versions, and final decision. A result that cannot be reproduced is not suitable for automated promotion.
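One lightweight way to make a reproducibility package tamper-evident is to hash a canonical serialization of its manifest. The sketch below is an assumption about how such a manifest might be represented (a plain dict of JSON-serializable values), and the field names in the usage example are illustrative.

```python
import hashlib
import json


def package_digest(manifest):
    """SHA-256 over a canonical JSON form, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    manifest = {
        "experiment_id": "EXP-2026-044",
        "seeds": [0, 1, 2],
        "evaluator_version": "hypothetical-1.0",  # illustrative value
    }
    print(package_digest(manifest))
```

Storing the digest alongside the final decision lets a later reviewer confirm that the package being reproduced is the one that was evaluated.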
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.