Measurement rules
Every metric record should include the numerator, denominator, unit, population, time window, model digest, router version, evaluator version, hardware, and UTC collection interval. A number without those dimensions is not comparable evidence.
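
As a minimal sketch of what such a record might look like, the dataclass below carries every dimension listed above. The class name `MetricRecord`, the field names, and the choice of which fields must match in `comparable_with` are illustrative assumptions, not part of any defined schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MetricRecord:
    """One metric observation plus the context needed to compare it (hypothetical schema)."""
    name: str
    numerator: float
    denominator: float
    unit: str
    population: str
    time_window: str
    model_digest: str
    router_version: str
    evaluator_version: str
    hardware: str
    collected_from: datetime  # start of the collection interval, UTC
    collected_to: datetime    # end of the collection interval, UTC

    def __post_init__(self) -> None:
        # Reject naive or non-UTC timestamps so intervals stay comparable.
        for ts in (self.collected_from, self.collected_to):
            if ts.tzinfo is None or ts.utcoffset().total_seconds() != 0:
                raise ValueError("collection interval must be timezone-aware UTC")
        if self.collected_to < self.collected_from:
            raise ValueError("collection interval ends before it starts")

    @property
    def value(self) -> Optional[float]:
        # An empty denominator means the metric is undefined, not zero.
        return self.numerator / self.denominator if self.denominator else None

    def comparable_with(self, other: "MetricRecord") -> bool:
        # Assumption: these fields must match before two values are compared.
        keys = ("name", "unit", "population", "hardware", "evaluator_version")
        return all(getattr(self, k) == getattr(other, k) for k in keys)


# Usage with made-up values.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 2, tzinfo=timezone.utc)
baseline = MetricRecord("accuracy", 90, 100, "ratio", "eval-set-v1", "24h",
                        "sha256:aaa", "router-1", "eval-1", "gpu-x", start, end)
candidate = MetricRecord("accuracy", 93, 100, "ratio", "eval-set-v1", "24h",
                         "sha256:bbb", "router-1", "eval-1", "gpu-x", start, end)
print(baseline.comparable_with(candidate), candidate.value)
```

The model digest is deliberately left out of the comparability check, since comparing two candidates usually means comparing two digests under otherwise identical conditions.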
Task quality
| Metric | Definition | Caution |
|---|---|---|
| Accuracy | Correct outcomes / evaluated outcomes | Hides class imbalance and abstention |
| Precision | True positives / predicted positives | Sensitive to threshold and prevalence |
| Recall | True positives / actual positives | Report critical-class recall separately |
| F1 | Harmonic mean of precision and recall | Can hide calibration and business cost |
| Exact match | Outputs matching the canonical result exactly / evaluated outputs | Too strict for some generative tasks |
| Pass rate | Cases satisfying executable or deterministic criteria / evaluated cases | Test quality determines meaning |
| Utility score | Domain-specific value of outcome | Must be externally defined and audited |
| Human acceptance | Accepted outputs / reviewed outputs | Review selection bias can be severe |
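
The first four rows can be computed from confusion-matrix counts. The sketch below is one way to do that; the function names are illustrative, and returning `None` for an undefined ratio (rather than 0) is a design assumption. The example shows how accuracy can look strong while recall on the positive class is zero.

```python
from typing import Dict, Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None  # harmonic mean is undefined here
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced case: 10 positives, 990 negatives, classifier predicts "negative" every time.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative)
```

Here accuracy is 0.99 even though every positive case is missed, which is the class-imbalance caution in the table.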
Calibration and abstention
Expected calibration error: the average, over confidence bins and weighted by bin size, of the absolute difference between mean confidence and observed accuracy in each bin. Report the binning strategy and include reliability diagrams where possible.
Brier score: mean squared difference between predicted probability and observed outcome. Useful for probabilistic classification.
Selective risk: error rate among non-abstained cases at a stated coverage level.
Coverage: proportion of eligible requests receiving a normal result rather than abstention or escalation.
Escalation precision: proportion of escalated cases that actually required the higher-cost path or human review.
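
The two calibration scores defined above can be sketched as follows. This assumes binary outcomes encoded as 0 or 1 and equal-width confidence bins; the bin count of 10 is an arbitrary default, which is why the binning strategy should always be reported alongside the number.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean of |mean confidence - accuracy| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Four predictions at 0.9 confidence, three of them correct: overconfident by 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
print(brier_score([0.5, 0.5], [1, 0]))
```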
FUNCTION selective_risk(results, threshold)
    accepted <- FILTER(results, result.confidence >= threshold)
    IF COUNT(accepted) == 0
        RETURN UNDEFINED
    END IF
    RETURN COUNT_ERRORS(accepted) / COUNT(accepted)
END FUNCTION
Routing and coalition metrics
| Metric | Formula or definition | What it reveals |
|---|---|---|
| Route accuracy | Requests sent to acceptable route / labeled requests | Direct routing correctness |
| Router regret | Utility(best eligible route) − Utility(selected route), per request | Cost of routing mistakes |
| Specialist utilization | Requests handled by specialist / eligible requests | Load and possible starvation |
| Coalition size | Components activated per request | Coordination and cost growth |
| Disagreement rate | Coalition cases with materially different outputs / coalition cases | Diversity or instability |
| Escalation rate | Escalated requests / eligible requests | Local coverage and cost |
| Fallback success | Successful fallback outcomes / fallbacks | Resilience of cascade |
| Route churn | Requests whose route changed between policy versions / replay set | Operational instability |
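
Two of these rows translate directly into code. The sketch below assumes that a per-route utility is available for every eligible route of a request (for example from offline replay) and that routes are identified by strings; both are assumptions for illustration, as are the function names.

```python
from typing import Mapping, Optional


def router_regret(route_utilities: Mapping[str, float], selected: str) -> float:
    """Utility of the best eligible route minus utility of the selected route."""
    if selected not in route_utilities:
        raise KeyError(f"selected route {selected!r} has no measured utility")
    return max(route_utilities.values()) - route_utilities[selected]


def route_churn(old_routes: Mapping[str, str],
                new_routes: Mapping[str, str]) -> Optional[float]:
    """Share of replayed requests whose route differs between two policy versions."""
    shared = old_routes.keys() & new_routes.keys()
    if not shared:
        return None  # no common replay set, so churn is undefined
    changed = sum(1 for request_id in shared
                  if old_routes[request_id] != new_routes[request_id])
    return changed / len(shared)


# Hypothetical request: the router picked "small" although "large" scored higher.
print(router_regret({"small": 0.6, "large": 0.9}, selected="small"))

# Hypothetical replay set of two requests; one changes route under the new policy.
print(route_churn({"req-1": "small", "req-2": "large"},
                  {"req-1": "small", "req-2": "small"}))
```

Regret is zero whenever the selected route is among the best, so a mean over requests measures only the cost of actual mistakes.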
Resource metrics
- End-to-end latency: request arrival to accepted response, including routing, queueing, network, retries, validation, and aggregation.
- Time to first output: relevant for streaming interactions but not a substitute for completion latency.
- p50, p95, p99 latency: distribution percentiles; report by route and device tier.
- Throughput: accepted requests or generated units per second under defined concurrency.
- Peak resident memory: maximum process or device memory during a representative request.
- Loaded-model memory: memory retained by eligible packages before request-specific allocations.
- Energy per accepted outcome: total measured energy divided by accepted outcomes, not attempted requests.
- Compute per outcome: accelerator-seconds, CPU-seconds, FLOPs estimate, or cloud cost per accepted result.
- Communication volume: bytes transmitted per request or federation round, split by direction.
- Cold-start latency: time from package not loaded to first acceptable response.
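
As a rough sketch of the percentile and energy items above, the code below groups latency samples by route and device tier and applies the nearest-rank percentile method. Nearest-rank is one of several percentile definitions and is chosen here only for simplicity; the record layout `(route, device_tier, latency_ms)` is an assumption.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence, Tuple


def percentile_nearest_rank(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(records: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """p50, p95, and p99 latency for each (route, device_tier) group."""
    groups = defaultdict(list)
    for route, device_tier, latency_ms in records:
        groups[(route, device_tier)].append(latency_ms)
    return {
        key: {f"p{p}": percentile_nearest_rank(values, p) for p in (50, 95, 99)}
        for key, values in groups.items()
    }


def energy_per_accepted_outcome(total_joules: float, accepted: int) -> Optional[float]:
    """Total measured energy divided by accepted outcomes, not attempted requests."""
    return total_joules / accepted if accepted else None


# Hypothetical samples: 100 requests on one route and tier, latencies 1..100 ms.
samples = [("local", "phone", float(ms)) for ms in range(1, 101)]
print(latency_report(samples))
```

Dividing energy by accepted outcomes means that retries and rejected responses raise the figure instead of diluting it.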
Evolution and population metrics
| Metric | Definition | Interpretation |
|---|---|---|
| Candidate yield | Candidates passing cheap gates / proposals attempted | Operator efficiency |
| Promotion rate | Promoted descendants / evaluated descendants | Selectivity and candidate quality |
| Net improvement rate | Promotions retaining measured gain after full release / promotions | Evaluation validity |
| Lineage depth | Longest parent path to an artifact | Accumulated inheritance and audit complexity |
| Branch factor | Mean accepted descendants per parent | Exploration pattern |
| Population size | Eligible active packages | Capability and complexity footprint |
| Population debt | Weighted stale, redundant, unsupported, or unevaluated packages | Maintenance burden |
| Niche coverage | Occupied behavioral cells / defined cells | Breadth in quality-diversity search |
| Archive quality | Aggregate score of elites across occupied niches | Quality plus diversity |
| Operator success | Net-improving candidates / candidates by operator | Mutation strategy value |
| Time to adaptation | Time from drift confirmation to approved mitigation | Responsiveness |
| Forgetting score | Weighted regression on protected historical tasks | Continual-learning damage |
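
Lineage depth and niche coverage are the most mechanical rows in this table. The sketch below assumes lineage is stored as a mapping from each artifact to its direct parents (more than one parent is allowed, as with merges) and counts depth in edges, so an artifact with no parents has depth 0. Both conventions are assumptions.

```python
from typing import Dict, FrozenSet, Hashable, Mapping, Sequence, Set


def lineage_depth(parents: Mapping[str, Sequence[str]], artifact: str) -> int:
    """Length, in edges, of the longest parent path leading to an artifact."""
    memo: Dict[str, int] = {}

    def depth(node: str, visiting: FrozenSet[str]) -> int:
        if node in memo:
            return memo[node]
        if node in visiting:
            raise ValueError(f"cycle in lineage at {node!r}")
        node_parents = parents.get(node, ())
        if not node_parents:
            result = 0  # a root artifact has no inherited history
        else:
            result = 1 + max(depth(p, visiting | {node}) for p in node_parents)
        memo[node] = result
        return result

    return depth(artifact, frozenset())


def niche_coverage(occupied_cells: Set[Hashable], defined_cells: int) -> float:
    """Occupied behavioral cells divided by defined cells."""
    if defined_cells <= 0:
        raise ValueError("defined_cells must be positive")
    return len(occupied_cells) / defined_cells


# Hypothetical lineage: "c" was merged from "b" and "a"; "b" descends from "a".
lineage = {"a": [], "b": ["a"], "c": ["b", "a"]}
print(lineage_depth(lineage, "c"))

# Hypothetical archive: 2 of 8 behavioral cells hold an elite.
print(niche_coverage({(0, 0), (1, 2)}, defined_cells=8))
```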
Robustness and resilience
Perturbation robustness: performance under defined input, environment, dependency, or hardware perturbations relative to baseline.
Component loss tolerance: retained system utility when a model, service, region, or dependency is unavailable.
Recovery time objective: maximum acceptable time to restore service after a failure.
Rollback time: time from detection or the rollback decision to verified restoration of the prior configuration.
False rollback rate: rollbacks later judged unnecessary / total rollbacks. A low value is desirable, but cautious rollbacks should not be penalized without weighing the risk they avoided.
Blast-radius utilization: actual affected scope / maximum permitted scope during a canary or incident.
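
Several of these definitions reduce to ratios against a baseline. The sketch below treats component loss tolerance as degraded utility divided by baseline utility, which is one plausible reading of "retained system utility" and should be taken as an assumption. Component names and values are made up.

```python
from typing import Dict, Mapping, Optional, Tuple


def retained_utility(baseline: float, degraded: float) -> float:
    """Fraction of baseline utility retained while a component is unavailable."""
    if baseline <= 0:
        raise ValueError("baseline utility must be positive")
    return degraded / baseline


def loss_tolerance_report(baseline: float,
                          degraded_by_component: Mapping[str, float]) -> Dict[str, float]:
    """Retained utility for each simulated component outage."""
    return {name: retained_utility(baseline, utility)
            for name, utility in degraded_by_component.items()}


def least_tolerated_loss(report: Mapping[str, float]) -> Tuple[str, float]:
    """The outage that retains the least utility."""
    return min(report.items(), key=lambda item: item[1])


def false_rollback_rate(unnecessary: int, total: int) -> Optional[float]:
    """Rollbacks later judged unnecessary divided by total rollbacks."""
    return unnecessary / total if total else None


# Hypothetical outage drills against a baseline utility of 0.8.
drills = loss_tolerance_report(0.8, {"router": 0.4, "region-b": 0.72})
print(drills, least_tolerated_loss(drills))
```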
Safety, security, and governance
- hard-gate failure rate by category;
- critical safety violation rate per eligible and per generated outcome;
- policy abstention rate and incorrect-policy-pass rate;
- prompt-injection or adversarial success rate under a versioned suite;
- provenance completeness and signature-verification rate;
- vulnerable dependency count weighted by exploitability and reachability;
- unauthorized network-attempt count from isolated workers;
- approval lead time and emergency-override frequency;
- percent of production traffic tied to a complete release bundle;
- audit-log coverage and integrity verification;
- human override rate, reason distribution, and outcome after override;
- user exit, export, and rollback success rates.
Federation metrics
| Metric | Purpose |
|---|---|
| Participation rate | Share of eligible sites contributing to a round |
| Site quality distribution | Prevents global average from hiding local harm |
| Worst-site regression | Protects the least-served participant |
| Update rejection rate | Signals incompatibility, poisoning, or poor local quality |
| Round staleness | Age of local updates at aggregation |
| Communication per net gain | Efficiency of distributed learning |
| Privacy budget | Cumulative differential-privacy accounting where used |
| Contribution concentration | Dependence on a small number of clients |
| Local rollback success | Preserves site autonomy and resilience |
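
To illustrate why worst-site regression is tracked separately from the global average, the sketch below compares per-site quality before and after an update and also computes a simple top-k share as one possible measure of contribution concentration. The top-k share is an assumed formulation; the table does not prescribe one. Site names and scores are made up.

```python
from typing import Mapping, Optional, Tuple


def worst_site_regression(before: Mapping[str, float],
                          after: Mapping[str, float]) -> Tuple[str, float]:
    """Site with the most negative quality change, and that change."""
    shared = before.keys() & after.keys()
    if not shared:
        raise ValueError("no sites measured in both rounds")
    deltas = {site: after[site] - before[site] for site in shared}
    worst = min(deltas, key=deltas.get)
    return worst, deltas[worst]


def top_k_contribution_share(contributions: Mapping[str, float], k: int) -> Optional[float]:
    """Share of total contribution weight supplied by the k largest clients."""
    total = sum(contributions.values())
    if total <= 0:
        return None
    largest = sorted(contributions.values(), reverse=True)[:k]
    return sum(largest) / total


# Hypothetical round: the mean quality is unchanged, but site "b" regresses.
before = {"a": 0.80, "b": 0.70}
after = {"a": 0.85, "b": 0.65}
print(worst_site_regression(before, after))

# Hypothetical update weights: one client supplies 80% of the total.
print(top_k_contribution_share({"x": 80, "y": 10, "z": 10}, k=1))
```

In the example the mean across sites is 0.75 both before and after, so only the per-site view reveals the harm to site "b".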
Metric anti-patterns
Do not compare candidates on different datasets, hardware, concurrency, or routing distributions without adjustment. Do not optimize a public benchmark as the only fitness signal. Do not aggregate safety-critical slices into an overall average. Do not call a decrease in model-only latency an end-to-end improvement until the complete path is measured.
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.