Benchmark the system, not only the model
A specialist may score well in isolation but perform poorly after routing, retries, aggregation, and fallback. Benchmark individual artifacts, the router, coalitions, and the full request path.
Metric domains
| Domain | Example metrics |
|---|---|
| Task utility | Accuracy, F1, exact match, reward, domain score, human acceptance |
| Calibration | ECE, Brier score, abstention precision and recall |
| Robustness | Perturbation degradation, out-of-distribution performance, adversarial success rate |
| Resource | p50/p95/p99 latency, throughput, peak memory, energy, bytes transferred, load time |
| Diversity | Error correlation, disagreement entropy, niche coverage, complementarity |
| Adaptation | Time and cost to recover after drift, new-task learning curve, rollback speed |
| Governance | Policy blocks, unsigned artifacts, provenance completeness, approval coverage |
| Lifecycle | Evaluation cost, operator yield, maintenance hours, population churn |
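As an illustration of the calibration row, the sketch below computes a Brier score and an equal-width binned ECE in plain Python. The function names, the ten-bin default, and the choice of equal-width bins are assumptions made for this example; the guide does not prescribe a particular estimator.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: weighted mean of |avg confidence - accuracy|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

Both functions take per-request values, so the same code can score a single specialist, the router's confidence, or the aggregated output of a coalition.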
Equal-budget comparisons
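Compare systems under equal latency, cost, energy, or evaluation budgets. A coalition that uses three models should not be compared with a single specialist on accuracy alone, because the coalition typically consumes more resources per request. Report Pareto frontiers rather than a single winner.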
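A Pareto frontier over one cost axis and one utility axis can be computed with a sort and a single pass. The sketch below is a minimal version under that two-axis assumption; the `Candidate` type and its field names are hypothetical, and the numbers in the usage note are invented for illustration rather than measured.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    cost: float     # budgeted resource (e.g. p95 latency in ms); lower is better
    utility: float  # task metric (e.g. accuracy); higher is better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates that no cheaper-or-equal candidate matches or beats on utility."""
    frontier: List[Candidate] = []
    best_utility = float("-inf")
    # Cheapest first; on equal cost, the higher-utility candidate comes first.
    for cand in sorted(candidates, key=lambda c: (c.cost, -c.utility)):
        if cand.utility > best_utility:
            frontier.append(cand)
            best_utility = cand.utility
    return frontier
```

For example, with made-up candidates `tiny` (cost 20, utility 0.70), `specialist` (40, 0.81), `coalition-of-3` (120, 0.86), and `large-single` (150, 0.85), the last one is dropped because the coalition is both cheaper and better. Exact duplicates of a frontier point are also dropped in this sketch.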
Slice catalog
Maintain slices by domain, language, user group, hardware, input length, difficulty, and risk. Critical slices can have hard thresholds even when their sample count is small.
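One way to encode that rule is a gate that always enforces thresholds on critical slices and skips non-critical slices that are too small to judge. The sketch below assumes this policy; `SliceResult`, `failing_slices`, the `min_samples` default of 30, and the treatment of an empty critical slice as a failure are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SliceResult:
    name: str
    correct: int
    total: int
    threshold: float        # minimum acceptable accuracy for this slice
    critical: bool = False  # critical slices are gated regardless of size


def failing_slices(results: Iterable[SliceResult], min_samples: int = 30) -> List[str]:
    """Names of slices that violate their threshold under the assumed policy."""
    failures: List[str] = []
    for r in results:
        if r.total == 0:
            # Assumption: a critical slice with no evidence counts as a failure.
            if r.critical:
                failures.append(r.name)
            continue
        if not r.critical and r.total < min_samples:
            continue  # too few samples to gate a non-critical slice
        if r.correct / r.total < r.threshold:
            failures.append(r.name)
    return failures
```

Reporting the sample count next to each slice result keeps small-sample failures interpretable instead of hiding them.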
Dynamic benchmarks
Static tests are necessary but insufficient for adaptive systems. Add non-stationary sequences: change task frequencies, remove a specialist, constrain memory, introduce network loss, revoke a dataset, or alter latency budgets. Measure recovery and whether the controller chooses appropriate structural actions.
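Recovery after an injected change can be summarized as the number of evaluation steps until a metric returns to near its pre-change level. The sketch below is one simple definition under stated assumptions: the baseline is the mean of the last few pre-change scores, and recovery means the first post-change score within `tolerance` of that baseline. The function name and both defaults are hypothetical.

```python
from typing import Optional, Sequence


def recovery_steps(
    scores: Sequence[float],
    change_index: int,
    tolerance: float = 0.02,
    baseline_window: int = 5,
) -> Optional[int]:
    """Steps after the change until the score is back within tolerance of baseline.

    Returns 0 if the change caused no measurable drop, and None if the
    score never recovers within the recorded sequence.
    """
    pre_change = scores[max(0, change_index - baseline_window):change_index]
    if not pre_change:
        raise ValueError("need at least one score before the change")
    baseline = sum(pre_change) / len(pre_change)
    for offset, score in enumerate(scores[change_index:]):
        if score >= baseline - tolerance:
            return offset
    return None
```

A stricter variant could require the score to stay recovered for several consecutive steps; this version only detects the first return to baseline.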
Benchmark harness pseudocode
PROCEDURE benchmark_ecology(ecology, workload, constraints)
    RESET_TO_KNOWN_STATE(ecology)
    results <- []
    FOR each phase IN workload.phases
        APPLY_ENVIRONMENT_CHANGE(phase.change)
        phase_result <- RUN_REQUEST_SET(ecology, phase.requests, constraints)
        APPEND results, phase_result
    END FOR
    RETURN {
        utility: SUMMARIZE_TASK_RESULTS(results),
        resources: SUMMARIZE_RESOURCE_RESULTS(results),
        robustness: SUMMARIZE_FAILURE_RECOVERY(results),
        adaptation: SUMMARIZE_STRUCTURAL_DECISIONS(results),
        governance: VERIFY_AUDIT_COMPLETENESS(results)
    }
END PROCEDURE
Reporting
Publish raw distributions, suite versions, hardware, runtime, data manifests, and known limitations. Avoid a single “fitness” number in external reporting even when the controller uses an internal aggregate score.
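A published report can carry these items as explicit fields instead of one aggregate score. The sketch below shows one possible shape serialized to JSON; the `BenchmarkReport` class and every field name are hypothetical, and real reports would likely need more detail per field.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BenchmarkReport:
    suite_version: str
    hardware: str
    runtime: str
    data_manifests: List[str]
    known_limitations: List[str]
    # Raw per-request values, so readers can compute their own summaries.
    raw_latency_ms: List[float] = field(default_factory=list)
    raw_scores: List[float] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so successive reports diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

The class deliberately has no aggregate score field, which keeps an internal controller score out of the external report by construction.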
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.