Metrics catalog — ModelBreeder.com

Measurement rules

Every metric record should include the numerator, denominator, unit, population, time window, model digest, router version, evaluator version, hardware, and UTC collection interval. A number without those dimensions is not comparable evidence.
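As an illustration only, the record described above could be modeled as a small data structure that refuses to yield a value when a dimension is missing. The class name, field names, and the `missing_dimensions` helper are assumptions made for this sketch; the source only lists which dimensions a record should carry.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MetricRecord:
    """One measurement plus the dimensions needed to compare it (names are illustrative)."""
    name: str
    numerator: float
    denominator: float
    unit: str
    population: str
    time_window: str
    model_digest: str
    router_version: str
    evaluator_version: str
    hardware: str
    collected_from: datetime  # start of the UTC collection interval
    collected_to: datetime    # end of the UTC collection interval

    def value(self) -> float:
        if self.denominator == 0:
            raise ValueError("denominator is zero; the metric is undefined")
        return self.numerator / self.denominator


def missing_dimensions(record: MetricRecord) -> list[str]:
    """Return names of fields that are blank strings or timestamps not expressed in UTC."""
    problems = []
    for f in fields(record):
        v = getattr(record, f.name)
        if isinstance(v, str) and not v.strip():
            problems.append(f.name)
        # Naive datetimes have utcoffset() None, so they are flagged too.
        if isinstance(v, datetime) and v.utcoffset() != timedelta(0):
            problems.append(f.name)
    return problems
```

A record would be treated as comparable evidence only when `missing_dimensions(record)` returns an empty list.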

Task quality

Metric	Definition	Caution
Accuracy	Correct outcomes / evaluated outcomes	Hides class imbalance and abstention
Precision	True positives / predicted positives	Sensitive to threshold and prevalence
Recall	True positives / actual positives	Report critical-class recall separately
F1	Harmonic mean of precision and recall	Can hide calibration and business cost
Exact match	Outputs matching canonical result exactly	Too strict for some generative tasks
Pass rate	Cases satisfying executable or deterministic criteria	Test quality determines meaning
Utility score	Domain-specific value of outcome	Must be externally defined and audited
Human acceptance	Accepted outputs / reviewed outputs	Review selection bias can be severe
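
The ratio definitions in the table can be sketched from binary confusion counts. This is a minimal example, not a reference implementation; the function names and the choice to return `None` for an empty denominator are assumptions made here.

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    accuracy = safe_ratio(tp + tn, tp + fp + fn + tn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # harmonic mean is undefined without both terms
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced example: 10 positives among 1000 cases, only one of them found.
metrics = classification_metrics(tp=1, fp=0, fn=9, tn=990)
```

With these counts accuracy is 0.991 while recall is 0.1, which is the class-imbalance caution from the Accuracy row in concrete form.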

Calibration and abstention

Expected calibration error: the average, weighted by bin size, of the absolute difference between mean confidence and observed accuracy within each confidence bin. Report the binning strategy and include reliability diagrams where possible.

Brier score: mean squared difference between predicted probability and observed outcome. Useful for probabilistic classification.

Selective risk: error rate among non-abstained cases at a stated coverage level.

Coverage: proportion of eligible requests receiving a normal result rather than abstention or escalation.

Escalation precision: proportion of escalated cases that actually required the higher-cost path or human review.
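
The two calibration scores defined above can be sketched as follows. The equal-width binning over [0, 1] and the default of ten bins are assumptions for this example; other binning strategies give different values, which is why the binning should be reported.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Sum over bins of (bin size / total) * |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_confidence)
    return ece
```

For example, four predictions at confidence 0.9 with three correct fall in one bin with accuracy 0.75, so the error is about 0.15.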

pseudocode

FUNCTION selective_risk(results, threshold)
    accepted <- FILTER(results, result.confidence >= threshold)
    IF COUNT(accepted) == 0
        RETURN UNDEFINED
    END IF
    RETURN COUNT_ERRORS(accepted) / COUNT(accepted)
END FUNCTION
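
A runnable rendering of the pseudocode above, with a companion coverage function so that risk can be reported at a stated coverage level. The `Result` type and its field names are assumptions for this sketch, and coverage here counts only confidence-based abstention, not escalation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    confidence: float
    is_error: bool


def selective_risk(results: Sequence[Result], threshold: float) -> Optional[float]:
    """Error rate among results at or above the threshold; None if none are accepted."""
    accepted = [r for r in results if r.confidence >= threshold]
    if not accepted:
        return None  # mirrors RETURN UNDEFINED in the pseudocode
    return sum(r.is_error for r in accepted) / len(accepted)


def coverage(results: Sequence[Result], threshold: float) -> Optional[float]:
    """Share of results that would be answered rather than abstained on."""
    if not results:
        return None
    return sum(r.confidence >= threshold for r in results) / len(results)


results = [Result(0.95, False), Result(0.9, True), Result(0.6, True), Result(0.4, False)]
risk_at_080 = selective_risk(results, 0.8)
coverage_at_080 = coverage(results, 0.8)
```

Reporting the pair together avoids the case where a very high threshold yields a low risk figure on a tiny accepted set.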

Routing and coalition metrics

Metric	Formula or definition	What it reveals
Route accuracy	Requests sent to acceptable route / labeled requests	Direct routing correctness
Router regret	Utility(best eligible route) − utility(selected route)	Cost of routing mistakes
Specialist utilization	Requests handled by specialist / eligible requests	Load and possible starvation
Coalition size	Components activated per request	Coordination and cost growth
Disagreement rate	Coalition cases with materially different outputs / coalition cases	Diversity or instability
Escalation rate	Escalated requests / eligible requests	Local coverage and cost
Fallback success	Successful fallback outcomes / fallbacks	Resilience of cascade
Route churn	Requests whose route changed between policy versions / replay set	Operational instability
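
Two of the formulas in the table, router regret and route churn, can be sketched directly. The dictionary shapes (route name to utility, request ID to route name) and the route labels are assumptions for illustration.

```python
def router_regret(utilities: dict[str, float], selected: str) -> float:
    """Utility of the best eligible route minus utility of the selected route."""
    if selected not in utilities:
        raise KeyError(f"selected route {selected!r} is not among the eligible routes")
    return max(utilities.values()) - utilities[selected]


def route_churn(old_routes: dict[str, str], new_routes: dict[str, str]) -> float:
    """Fraction of replayed requests whose route differs between two policy versions."""
    shared = old_routes.keys() & new_routes.keys()
    if not shared:
        raise ValueError("the two replays share no request IDs")
    changed = sum(1 for request_id in shared if old_routes[request_id] != new_routes[request_id])
    return changed / len(shared)


# Hypothetical routes "small" and "large" with externally defined utilities.
regret = router_regret({"small": 0.5, "large": 0.75}, selected="small")
churn = route_churn(
    {"q1": "small", "q2": "large", "q3": "small", "q4": "small"},
    {"q1": "small", "q2": "small", "q3": "small", "q4": "large"},
)
```

Regret is zero whenever the selected route is the best eligible one, so a mean over labeled requests summarizes the cost of routing mistakes.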

Resource metrics

End-to-end latency: request arrival to accepted response, including routing, queueing, network, retries, validation, and aggregation.
Time to first output: relevant for streaming interactions but not a substitute for completion latency.
p50, p95, p99 latency: distribution percentiles; report by route and device tier.
Throughput: accepted requests or generated units per second under defined concurrency.
Peak resident memory: maximum process or device memory during a representative request.
Loaded-model memory: memory retained by eligible packages before request-specific allocations.
Energy per accepted outcome: total measured energy divided by accepted outcomes, not attempted requests.
Compute per outcome: accelerator-seconds, CPU-seconds, FLOPs estimate, or cloud cost per accepted result.
Communication volume: bytes transmitted per request or federation round, split by direction.
Cold-start latency: time from package not loaded to first acceptable response.
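
A minimal sketch of two resource calculations from the list above: latency percentiles and energy per accepted outcome. The nearest-rank percentile method is an assumption chosen for simplicity; interpolating methods give slightly different values, so the method should be stated alongside the number.

```python
import math
from typing import Optional, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def energy_per_accepted_outcome(total_joules: float, accepted: int) -> Optional[float]:
    """Total measured energy divided by accepted outcomes, not attempted requests."""
    if accepted == 0:
        return None
    return total_joules / accepted


latencies_ms = list(range(1, 101))  # synthetic end-to-end latencies, 1..100 ms
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
joules_per_outcome = energy_per_accepted_outcome(120.0, accepted=80)
```

If 100 requests were attempted to obtain those 80 accepted outcomes, dividing by attempts would report 1.2 J instead of 1.5 J and understate the cost of each usable result.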

Evolution and population metrics

Metric	Definition	Interpretation
Candidate yield	Candidates passing cheap gates / proposals attempted	Operator efficiency
Promotion rate	Promoted descendants / evaluated descendants	Selectivity and candidate quality
Net improvement rate	Promotions retaining measured gain after full release / promotions	Evaluation validity
Lineage depth	Longest parent path to an artifact	Accumulated inheritance and audit complexity
Branch factor	Mean accepted descendants per parent	Exploration pattern
Population size	Eligible active packages	Capability and complexity footprint
Population debt	Weighted stale, redundant, unsupported, or unevaluated packages	Maintenance burden
Niche coverage	Occupied behavioral cells / defined cells	Breadth in quality-diversity search
Archive quality	Aggregate score of elites across occupied niches	Quality plus diversity
Operator success	Net-improving candidates / candidates by operator	Mutation strategy value
Time to adaptation	Drift confirmation to approved mitigation	Responsiveness
Forgetting score	Weighted regression on protected historical tasks	Continual-learning damage
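
Lineage depth and niche coverage lend themselves to a short sketch. The parent-map representation, the choice to count depth in edges (a root has depth 0), and the assumption that the lineage graph is acyclic are all made for this example.

```python
def lineage_depth(parents: dict[str, list[str]], artifact: str) -> int:
    """Length in edges of the longest parent path from the artifact to any root."""
    memo: dict[str, int] = {}

    def depth(node: str) -> int:
        if node in memo:
            return memo[node]
        node_parents = parents.get(node, [])
        # A node with no recorded parents is treated as a root.
        memo[node] = 0 if not node_parents else 1 + max(depth(p) for p in node_parents)
        return memo[node]

    return depth(artifact)


def niche_coverage(occupied_cells: set, defined_cells: set) -> float:
    """Occupied behavioral cells divided by defined cells."""
    if not defined_cells:
        raise ValueError("no behavioral cells are defined")
    return len(occupied_cells & defined_cells) / len(defined_cells)


# Hypothetical lineage: "d" descends from a merge "c" and directly from "root".
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"], "d": ["c", "root"]}
depth_of_d = lineage_depth(parents, "d")

grid = {(i, j) for i in range(2) for j in range(4)}  # a 2 x 4 behavior grid
covered = niche_coverage({(0, 0), (1, 2)}, grid)
```

The recursion is adequate for shallow lineages; very deep ones would need an iterative traversal to stay within Python's recursion limit.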

Robustness and resilience

Perturbation robustness: performance under defined input, environment, dependency, or hardware perturbations relative to baseline.

Component loss tolerance: retained system utility when a model, service, region, or dependency is unavailable.

Recovery time objective: maximum acceptable time to restore service after a failure.

Rollback time: detection or decision to verified restoration of the prior configuration.

False rollback rate: rollbacks later judged unnecessary / total rollbacks. A low value is desirable, but overly cautious rollback should not be punished without considering avoided risk.

Blast-radius utilization: actual affected scope / maximum permitted scope during a canary or incident.
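
The ratio and interval definitions in this section can be sketched as below. Function names are assumptions, and rollback time is measured here from the decision timestamp, which is one of the two starting points the definition allows.

```python
from datetime import datetime, timedelta
from typing import Optional


def rollback_time(decided_at: datetime, restored_at: datetime) -> timedelta:
    """Elapsed time from rollback decision to verified restoration."""
    if restored_at < decided_at:
        raise ValueError("restoration cannot precede the decision")
    return restored_at - decided_at


def false_rollback_rate(unnecessary: int, total: int) -> Optional[float]:
    """Rollbacks later judged unnecessary divided by total rollbacks."""
    if total == 0:
        return None
    return unnecessary / total


def blast_radius_utilization(affected: int, permitted: int) -> float:
    """Actual affected scope divided by maximum permitted scope."""
    if permitted <= 0:
        raise ValueError("permitted scope must be positive")
    return affected / permitted


elapsed = rollback_time(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 7, 30))
utilization = blast_radius_utilization(affected=1200, permitted=5000)
```

A utilization above 1.0 would indicate that a canary or incident exceeded its permitted scope, which is worth alerting on separately from the ratio itself.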

Safety, security, and governance

hard-gate failure rate by category;
critical safety violation rate per eligible request and per generated outcome;
policy abstention rate and incorrect-policy-pass rate;
prompt-injection or adversarial success rate under a versioned suite;
provenance completeness and signature-verification rate;
vulnerable dependency count weighted by exploitability and reachability;
unauthorized network-attempt count from isolated workers;
approval lead time and emergency-override frequency;
percent of production traffic tied to a complete release bundle;
audit-log coverage and integrity verification;
human override rate, reason distribution, and outcome after override;
user exit, export, and rollback success rates.

Federation metrics

Metric	Purpose
Participation rate	Measures eligible sites contributing to a round
Site quality distribution	Prevents global average from hiding local harm
Worst-site regression	Protects the least-served participant
Update rejection rate	Signals incompatibility, poisoning, or poor local quality
Round staleness	Age of local updates at aggregation
Communication per net gain	Efficiency of distributed learning
Privacy budget	Cumulative differential-privacy accounting where used
Contribution concentration	Dependence on a small number of clients
Local rollback success	Preserves site autonomy and resilience
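
Two rows of the table can be made concrete with a short sketch. Measuring contribution concentration as the share held by the top k clients is an assumption for this example (a Herfindahl-style index is another option), and the site names are hypothetical.

```python
def worst_site_regression(baseline: dict[str, float], candidate: dict[str, float]):
    """Return (site, drop) for the site whose score fell most from baseline to candidate."""
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("no sites are present in both evaluations")
    # Sorting first makes tie-breaking deterministic.
    site = max(sorted(shared), key=lambda s: baseline[s] - candidate[s])
    return site, baseline[site] - candidate[site]


def top_k_share(contributions: dict[str, float], k: int) -> float:
    """Share of total contribution that comes from the k largest clients."""
    total = sum(contributions.values())
    if total <= 0 or k <= 0:
        raise ValueError("need a positive total and k >= 1")
    largest = sorted(contributions.values(), reverse=True)[:k]
    return sum(largest) / total


baseline = {"clinic_a": 0.80, "clinic_b": 0.70, "clinic_c": 0.90}
candidate = {"clinic_a": 0.90, "clinic_b": 0.65, "clinic_c": 0.95}
site, drop = worst_site_regression(baseline, candidate)
share = top_k_share({"a": 50, "b": 30, "c": 10, "d": 10}, k=1)
```

In this example the mean score across sites rises while `clinic_b` loses about 0.05, which is the local harm a global average would hide.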

Metric anti-patterns

Do not compare candidates on different datasets, hardware, concurrency, or routing distributions without adjustment. Do not optimize a public benchmark as the only fitness signal. Do not aggregate safety-critical slices into an overall average. Do not call a decrease in model-only latency an end-to-end improvement until the complete path is measured.

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.

Core synthesis: The 4Fs Framework: Fast, Flexible, Frugal, Federated (Emerging practice · 22.5 KB)
Evolutionary AI: Designing the “Perfect” Evolutionary AI System (Established and emerging methods · 46.5 KB)
