Reference | Intermediate | 5 minute read | Updated 2026-06-26 UTC

Metrics catalog

Definitions and cautions for quality, calibration, routing, diversity, resources, evolution, federation, safety, and operations metrics.

Research status: Measurement reference | Publication state: Published | Reviewed by: Michael Kappel | Source reports: 2

Measurement rules

Every metric record should include the numerator, denominator, unit, population, time window, model digest, router version, evaluator version, hardware, and UTC collection interval. A number without those dimensions is not comparable evidence.
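
As a minimal sketch, a record can be checked for these dimensions before it is accepted as evidence. The key names and the helper `missing_dimensions` are illustrative assumptions, not a schema taken from the source reports.

```python
# Assumed key names for the dimensions listed above; the UTC collection
# interval is represented here as a start and an end timestamp.
REQUIRED_DIMENSIONS = (
    "numerator", "denominator", "unit", "population", "time_window",
    "model_digest", "router_version", "evaluator_version", "hardware",
    "collected_from_utc", "collected_to_utc",
)


def missing_dimensions(record):
    """Return the required dimensions that are absent or empty in a metric record."""
    return [key for key in REQUIRED_DIMENSIONS if record.get(key) in (None, "")]
```

A record with a non-empty result from `missing_dimensions` would be treated as not comparable rather than silently compared.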

Task quality

| Metric | Definition | Caution |
| --- | --- | --- |
| Accuracy | Correct outcomes / evaluated outcomes | Hides class imbalance and abstention |
| Precision | True positives / predicted positives | Sensitive to threshold and prevalence |
| Recall | True positives / actual positives | Report critical-class recall separately |
| F1 | Harmonic mean of precision and recall | Can hide calibration and business cost |
| Exact match | Outputs matching canonical result exactly | Too strict for some generative tasks |
| Pass rate | Cases satisfying executable or deterministic criteria | Test quality determines meaning |
| Utility score | Domain-specific value of outcome | Must be externally defined and audited |
| Human acceptance | Accepted outputs / reviewed outputs | Review selection bias can be severe |
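
The first four rows can be sketched from confusion counts as follows. The function name and the choice to return `None` for an empty denominator are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts.

    Ratios with an empty denominator are returned as None rather than 0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With 1 true positive, 9 false negatives, and 990 true negatives, accuracy is above 0.99 while recall is 0.1, which is the class-imbalance caution in the table.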

Calibration and abstention

Expected calibration error: the gap between mean confidence and observed accuracy within each confidence bin, averaged across bins with each bin weighted by its share of cases. Report the binning strategy and include reliability diagrams where possible.
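
A minimal sketch of this metric, assuming equal-width bins over [0, 1]. The function name, the default of 10 bins, and the input shapes are illustrative choices, and other binning strategies give different values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return None  # undefined without any evaluated cases
    bins = [[] for _ in range(n_bins)]
    for confidence, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece
```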

Brier score: mean squared difference between predicted probability and observed outcome. Useful for probabilistic classification.
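
For binary outcomes the definition reduces to a few lines. This is a sketch under the assumption that outcomes are encoded as 0 or 1; the function name is illustrative.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        return None  # undefined without any predictions
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

Lower is better: a perfect forecaster scores 0, and always predicting 0.5 scores 0.25.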

Selective risk: error rate among non-abstained cases at a stated coverage level.

Coverage: proportion of eligible requests receiving a normal result rather than abstention or escalation.

Escalation precision: proportion of escalated cases that actually required the higher-cost path or human review.

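python
def selective_risk(results, threshold):
    """Error rate among results whose confidence meets the threshold.

    Assumes each result exposes `confidence` (float) and `correct` (bool);
    `correct` is an assumed field standing in for the error count.
    """
    accepted = [r for r in results if r.confidence >= threshold]
    if not accepted:
        return None  # undefined: every case abstained at this threshold
    errors = sum(1 for r in accepted if not r.correct)
    return errors / len(accepted)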
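
Because selective risk is only meaningful at a stated coverage level, the two can be reported together across thresholds. The sketch below assumes abstention is decided purely by a confidence threshold; the `Result` type and `risk_coverage` function are hypothetical names.

```python
from typing import NamedTuple


class Result(NamedTuple):
    confidence: float
    correct: bool


def risk_coverage(results, thresholds):
    """For each threshold, return (threshold, coverage, selective risk or None)."""
    rows = []
    for threshold in thresholds:
        accepted = [r for r in results if r.confidence >= threshold]
        coverage = len(accepted) / len(results) if results else None
        if accepted:
            risk = sum(1 for r in accepted if not r.correct) / len(accepted)
        else:
            risk = None  # selective risk is undefined at zero coverage
        rows.append((threshold, coverage, risk))
    return rows
```

Each row pairs a risk with the coverage at which it was measured, so a low risk obtained by abstaining on most traffic is visible as such.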

Routing and coalition metrics

| Metric | Formula or definition | What it reveals |
| --- | --- | --- |
| Route accuracy | Requests sent to acceptable route / labeled requests | Direct routing correctness |
| Router regret | Utility(best eligible route) − utility(selected route) | Cost of routing mistakes |
| Specialist utilization | Requests handled by specialist / eligible requests | Load and possible starvation |
| Coalition size | Components activated per request | Coordination and cost growth |
| Disagreement rate | Coalition cases with materially different outputs / coalition cases | Diversity or instability |
| Escalation rate | Escalated requests / eligible requests | Local coverage and cost |
| Fallback success | Successful fallback outcomes / fallbacks | Resilience of cascade |
| Route churn | Requests whose route changed between policy versions / replay set | Operational instability |
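
Two of these rows, router regret and route churn, can be sketched as below. The data shapes are assumptions: a mapping from eligible route name to measured utility for one request, and mappings from request id to chosen route for each policy version.

```python
def router_regret(route_utilities, selected):
    """Utility of the best eligible route minus utility of the selected route."""
    if selected not in route_utilities:
        raise KeyError(f"selected route {selected!r} is not among the eligible routes")
    return max(route_utilities.values()) - route_utilities[selected]


def route_churn(old_routes, new_routes):
    """Share of replayed requests whose route differs between two policy versions.

    Only request ids present in both mappings count toward the replay set.
    """
    shared = old_routes.keys() & new_routes.keys()
    if not shared:
        return None
    changed = sum(1 for request_id in shared if old_routes[request_id] != new_routes[request_id])
    return changed / len(shared)
```

Regret requires the utility of routes that were not taken, so in practice it is usually available only on a labeled or replayed set.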

Resource metrics

  • End-to-end latency: request arrival to accepted response, including routing, queueing, network, retries, validation, and aggregation.
  • Time to first output: relevant for streaming interactions but not a substitute for completion latency.
  • p50, p95, p99 latency: distribution percentiles; report by route and device tier.
  • Throughput: accepted requests or generated units per second under defined concurrency.
  • Peak resident memory: maximum process or device memory during a representative request.
  • Loaded-model memory: memory retained by eligible packages before request-specific allocations.
  • Energy per accepted outcome: total measured energy divided by accepted outcomes, not attempted requests.
  • Compute per outcome: accelerator-seconds, CPU-seconds, FLOPs estimate, or cloud cost per accepted result.
  • Communication volume: bytes transmitted per request or federation round, split by direction.
  • Cold-start latency: time from package not loaded to first acceptable response.
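
The percentile and energy bullets can be sketched as follows. The nearest-rank percentile method is one common convention chosen here as an assumption; interpolating methods give slightly different values, so the method should be reported with the number.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        return None
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def energy_per_accepted_outcome(total_joules, accepted_outcomes):
    """Total measured energy divided by accepted outcomes, not attempted requests."""
    if accepted_outcomes == 0:
        return None
    return total_joules / accepted_outcomes
```

Retries and rejected attempts still consume energy, which is why the numerator is total energy while the denominator counts only accepted outcomes.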

Evolution and population metrics

| Metric | Definition | Interpretation |
| --- | --- | --- |
| Candidate yield | Candidates passing cheap gates / proposals attempted | Operator efficiency |
| Promotion rate | Promoted descendants / evaluated descendants | Selectivity and candidate quality |
| Net improvement rate | Promotions retaining measured gain after full release / promotions | Evaluation validity |
| Lineage depth | Longest parent path to an artifact | Accumulated inheritance and audit complexity |
| Branch factor | Mean accepted descendants per parent | Exploration pattern |
| Population size | Eligible active packages | Capability and complexity footprint |
| Population debt | Weighted stale, redundant, unsupported, or unevaluated packages | Maintenance burden |
| Niche coverage | Occupied behavioral cells / defined cells | Breadth in quality-diversity search |
| Archive quality | Aggregate score of elites across occupied niches | Quality plus diversity |
| Operator success | Net-improving candidates / candidates by operator | Mutation strategy value |
| Time to adaptation | Drift confirmation to approved mitigation | Responsiveness |
| Forgetting score | Weighted regression on protected historical tasks | Continual-learning damage |
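
Three of these rows are sketched below under stated assumptions: the archive is a mapping from occupied behavioral cell to its elite's score, the aggregate for archive quality is a plain sum (the catalog does not fix the aggregate), and lineage is an acyclic mapping from artifact id to a list of parent ids. All names are illustrative.

```python
def niche_coverage(archive, defined_cells):
    """Occupied behavioral cells / defined cells."""
    if defined_cells <= 0:
        return None
    return len(archive) / defined_cells


def archive_quality(archive):
    """Sum of elite scores over occupied cells (the sum is an assumed aggregate)."""
    return sum(archive.values())


def lineage_depth(parents, artifact):
    """Number of edges on the longest parent path from `artifact` back to a root.

    Assumes the lineage graph is acyclic; artifacts without an entry are roots.
    """
    depth = {}

    def visit(node):
        if node not in depth:
            node_parents = parents.get(node, [])
            depth[node] = 1 + max(map(visit, node_parents)) if node_parents else 0
        return depth[node]

    return visit(artifact)
```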

Robustness and resilience

Perturbation robustness: performance under defined input, environment, dependency, or hardware perturbations relative to baseline.

Component loss tolerance: retained system utility when a model, service, region, or dependency is unavailable.

Recovery time objective: maximum acceptable time to restore service after a failure.

Rollback time: detection or decision to verified restoration of the prior configuration.

False rollback rate: rollbacks later judged unnecessary / total rollbacks. A low value is desirable, but overly cautious rollback should not be punished without considering avoided risk.

Blast-radius utilization: actual affected scope / maximum permitted scope during a canary or incident.
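
The two ratios above can be sketched as follows. The input shapes are assumptions: a list of booleans marking rollbacks later judged unnecessary, and scope counts expressed in the same unit (for example, users or requests).

```python
def false_rollback_rate(judged_unnecessary):
    """Rollbacks later judged unnecessary / total rollbacks."""
    if not judged_unnecessary:
        return None  # undefined when no rollbacks occurred
    return sum(1 for flag in judged_unnecessary if flag) / len(judged_unnecessary)


def blast_radius_utilization(affected_scope, permitted_scope):
    """Actual affected scope / maximum permitted scope.

    A value above 1.0 means the canary or incident exceeded its permitted scope.
    """
    if permitted_scope <= 0:
        raise ValueError("permitted scope must be positive")
    return affected_scope / permitted_scope
```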

Safety, security, and governance

  • hard-gate failure rate by category;
  • critical safety violation rate per eligible request and per generated outcome;
  • policy abstention rate and incorrect-policy-pass rate;
  • prompt-injection or adversarial success rate under a versioned suite;
  • provenance completeness and signature-verification rate;
  • vulnerable dependency count weighted by exploitability and reachability;
  • unauthorized network-attempt count from isolated workers;
  • approval lead time and emergency-override frequency;
  • percent of production traffic tied to a complete release bundle;
  • audit-log coverage and integrity verification;
  • human override rate, reason distribution, and outcome after override;
  • user exit, export, and rollback success rates.

Federation metrics

| Metric | Purpose |
| --- | --- |
| Participation rate | Measures eligible sites contributing to a round |
| Site quality distribution | Prevents global average from hiding local harm |
| Worst-site regression | Protects the least-served participant |
| Update rejection rate | Signals incompatibility, poisoning, or poor local quality |
| Round staleness | Age of local updates at aggregation |
| Communication per net gain | Efficiency of distributed learning |
| Privacy budget | Cumulative differential-privacy accounting where used |
| Contribution concentration | Dependence on a small number of clients |
| Local rollback success | Preserves site autonomy and resilience |
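
Worst-site regression and contribution concentration can be sketched as below. The inputs are assumed to be per-site quality scores before and after a round, and per-client aggregation weights; the top-k share is one possible concentration measure, chosen here for simplicity.

```python
def worst_site_regression(before, after):
    """Return (site, delta) for the largest quality drop among sites present in both rounds."""
    shared = before.keys() & after.keys()
    if not shared:
        return None
    # Sorting first makes ties resolve to the alphabetically first site.
    site = min(sorted(shared), key=lambda s: after[s] - before[s])
    return site, after[site] - before[site]


def contribution_concentration(weights, top_k=1):
    """Share of total aggregation weight held by the top_k clients."""
    total = sum(weights.values())
    if total <= 0:
        return None
    top = sorted(weights.values(), reverse=True)[:top_k]
    return sum(top) / total
```

In the test data used to check this sketch, the global mean is unchanged between rounds while one site drops by 0.25, which is the kind of local harm an average hides.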

Metric anti-patterns

Do not compare candidates on different datasets, hardware, concurrency, or routing distributions without adjustment. Do not optimize a public benchmark as the only fitness signal. Do not aggregate safety-critical slices into an overall average. Do not call a decrease in model-only latency an end-to-end improvement until the complete path is measured.
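
The aggregation anti-pattern can be made concrete with a per-slice report. This is a sketch with hypothetical slice names; cases are assumed to arrive as (slice name, passed) pairs.

```python
def slice_pass_rates(cases):
    """Return (overall pass rate, per-slice pass rates) from (slice_name, passed) pairs."""
    totals, passes = {}, {}
    for name, passed in cases:
        totals[name] = totals.get(name, 0) + 1
        passes[name] = passes.get(name, 0) + (1 if passed else 0)
    if not totals:
        return None, {}
    overall = sum(passes.values()) / sum(totals.values())
    per_slice = {name: passes[name] / totals[name] for name in totals}
    return overall, per_slice
```

With 95 passing general cases and a safety-critical slice that passes 1 of 5, the overall rate is 0.96 while the safety-critical rate is 0.2, so the slice must be reported on its own.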

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.