Observe decisions, not only tokens
Traditional model monitoring focuses on latency and errors. A breeding system must also explain why a model was selected, which lineage produced it, which evaluator approved it, what resources it consumed, and why the controller changed the population.
Correlation identifiers
Use stable identifiers across planes:
request_id for the external request;
trace_id for the complete execution path;
route_plan_id for the router decision;
artifact_id for each model package;
lineage_id for descendant provenance;
evaluation_id for an evidence card;
decision_id for viability-controller output;
release_id for traffic and alias changes;
incident_id for abnormal events.
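One way to keep these identifiers together as a request moves between planes is a small immutable record that every log line and metric can carry. The sketch below is illustrative only: the class name `CorrelationIds`, the helper `as_log_fields`, and the choice to make only `request_id` and `trace_id` mandatory are assumptions, not something the guide prescribes.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass(frozen=True)
class CorrelationIds:
    """Hypothetical carrier for the identifiers listed above.

    Assumption: request_id and trace_id always exist; the rest are
    filled in as the request reaches the plane that creates them.
    """
    request_id: str
    trace_id: str
    route_plan_id: Optional[str] = None
    artifact_id: Optional[str] = None
    lineage_id: Optional[str] = None
    evaluation_id: Optional[str] = None
    decision_id: Optional[str] = None
    release_id: Optional[str] = None
    incident_id: Optional[str] = None

    def as_log_fields(self) -> Dict[str, str]:
        """Return only the identifiers that are set, ready to attach to a record."""
        return {k: v for k, v in asdict(self).items() if v is not None}


ids = CorrelationIds(request_id="req-1", trace_id="tr-1", route_plan_id="rp-9")
fields = ids.as_log_fields()
```

Freezing the dataclass means a component cannot silently overwrite an identifier it received; it has to build a new record, which keeps the provenance of each ID explicit.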
Request trace
trace <- {
    trace_id: UUID_V7(),
    request_contract: request.contract,
    risk_tier: request.risk_tier,
    route_plan: route.plan_id,
    artifacts: route.artifact_ids,
    start_utc: start,
    end_utc: end,
    latency_ms: elapsed,
    resource_usage: measured_resources,
    evaluator_verdict: verdict.summary,
    response_status: response.status,
    policy_events: policy_events
}
Minimize or tokenize sensitive content. Store enough to reproduce behavior without turning telemetry into an uncontrolled data lake.
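The following Python sketch shows one possible reading of the trace record together with the advice to tokenize sensitive content. Several parts are assumptions made for illustration: the `user` field on the request is hypothetical, `uuid7_like` is a hand-rolled identifier that follows the UUIDv7 bit layout (a 48-bit millisecond timestamp followed by random bits) rather than a library call, and keyed HMAC-SHA-256 truncated to 16 hex characters is only one of several reasonable tokenization choices. Only a subset of the trace fields is shown.

```python
import hashlib
import hmac
import os
import time
import uuid


def uuid7_like() -> uuid.UUID:
    """Time-ordered UUID using the UUIDv7 layout (assumed stand-in for UUID_V7())."""
    ts_ms = time.time_ns() // 1_000_000
    value = ((ts_ms & ((1 << 48) - 1)) << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant bits = 10
    return uuid.UUID(int=value)


def tokenize(value: str, key: bytes) -> str:
    """Replace a sensitive string with a keyed, non-reversible token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def build_trace(request: dict, route: dict, start: float, end: float, key: bytes) -> dict:
    """Assemble a trace record; timestamps are seconds since the epoch."""
    return {
        "trace_id": str(uuid7_like()),
        "request_contract": request["contract"],
        "risk_tier": request["risk_tier"],
        # Hypothetical sensitive field: store a token, never the raw value.
        "user_token": tokenize(request["user"], key),
        "route_plan": route["plan_id"],
        "artifacts": list(route["artifact_ids"]),
        "start_utc": start,
        "end_utc": end,
        "latency_ms": round((end - start) * 1000, 3),
    }
```

Because the token is keyed and deterministic, two traces from the same user can still be joined for debugging, while the raw value never enters telemetry storage. Rotating the key breaks that linkage, which may or may not be desirable depending on retention policy.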
Metric families
Execution: latency, throughput, timeouts, crashes, queueing, memory, GPU residency, fallback rate.
Quality: task score, calibration, abstention, human correction, disagreement, drift, slice performance.
Ecology: population size, active specialists, traffic concentration, diversity, redundancy, churn, archive growth.
Evolution: candidates per cycle, evaluation cost, promotion rate, rollback rate, no-op rate, time to evidence, cumulative complexity.
Governance: blocked actions, permission requests, approval latency, policy violations, unsigned artifacts, expired evidence.
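The guide names these metrics without defining them, so the sketch below picks concrete definitions purely as an example. It assumes traffic concentration is measured with the Herfindahl-Hirschman index (the sum of squared traffic shares) and that promotion, rollback, and no-op rates are simple ratios over candidates in a cycle. The function names are hypothetical.

```python
from typing import Mapping


def traffic_concentration(requests_by_artifact: Mapping[str, int]) -> float:
    """Herfindahl-Hirschman index of traffic shares.

    Equals 1/n when traffic is spread evenly over n artifacts and 1.0
    when a single artifact receives everything.
    """
    total = sum(requests_by_artifact.values())
    if total == 0:
        return 0.0
    return sum((n / total) ** 2 for n in requests_by_artifact.values())


def effective_specialists(requests_by_artifact: Mapping[str, int]) -> float:
    """Inverse HHI: how many equally loaded artifacts would give the same concentration."""
    hhi = traffic_concentration(requests_by_artifact)
    return 1.0 / hhi if hhi > 0 else 0.0


def rate(events: int, candidates: int) -> float:
    """Generic per-cycle rate (promotion, rollback, no-op); 0.0 when there were no candidates."""
    return events / candidates if candidates else 0.0
```

Comparing `effective_specialists` with the raw count of active specialists gives a quick read on redundancy: a population of twenty models whose effective count is close to one is carrying nineteen models that see almost no traffic.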
Logs versus evidence
Operational logs are high-volume and short-lived. Evaluation cards, lineage records, approvals, and release decisions are durable evidence. Do not rely on transient logs for compliance or reproducibility.
Decision dashboards
A useful dashboard compares champion and challengers, decomposes viability scores, shows confidence intervals, highlights missing evidence, and links every production metric to artifact lineage. Avoid aggregate green indicators that hide failing slices.
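To illustrate how an aggregate indicator can hide a failing slice, the sketch below labels each slice from a confidence interval instead of a point estimate and lets the worst slice determine the overall status. The Wilson score interval, the three status labels, and the function names are choices made for this example, not requirements from the guide.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success proportion (z = 1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def slice_status(slices: Dict[str, Tuple[int, int]], threshold: float) -> Dict[str, str]:
    """Map each slice (successes, trials) to 'pass', 'fail', or 'insufficient evidence'."""
    statuses = {}
    for name, (successes, n) in slices.items():
        lo, hi = wilson_interval(successes, n)
        if lo >= threshold:
            statuses[name] = "pass"
        elif hi < threshold:
            statuses[name] = "fail"
        else:
            statuses[name] = "insufficient evidence"
    return statuses


def overall_status(statuses: Dict[str, str]) -> str:
    """Worst slice wins, so one failing slice cannot be averaged away."""
    values = set(statuses.values())
    if "fail" in values:
        return "fail"
    if "insufficient evidence" in values:
        return "insufficient evidence"
    return "pass"


slices = {"en": (980, 1000), "de": (40, 100), "ja": (9, 10)}
per_slice = slice_status(slices, threshold=0.8)
pooled = slice_status({"all": (1029, 1110)}, threshold=0.8)
```

With these made-up counts, the pooled view passes the 0.8 threshold even though the `de` slice clearly fails and the `ja` slice has too few samples to judge. Surfacing "insufficient evidence" as its own state is one way to highlight missing evidence rather than defaulting it to green.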
Alerting
Alert on hard-invariant failures, unexpected permission attempts, unsigned packages, rapid population growth, oscillating releases, evaluator disagreement, traffic concentration, and rollback failure. Ordinary model quality drift can use slower investigation workflows; control-plane compromise requires immediate containment.
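A minimal sketch of the two response paths described above, plus one possible detector for oscillating releases. Which signals count as control-plane compromise is an assumption made here for illustration, as are the signal names, the default of sending unknown signals to investigation, and the window and threshold values; a real deployment would set these from its own policy.

```python
from enum import Enum
from typing import Sequence


class Response(Enum):
    CONTAIN_NOW = "contain_now"    # immediate containment path
    INVESTIGATE = "investigate"    # slower investigation workflow


# Assumed mapping: signals treated as possible control-plane compromise.
CONTROL_PLANE_SIGNALS = frozenset({
    "hard_invariant_failure",
    "unexpected_permission_attempt",
    "unsigned_package",
    "rollback_failure",
})


def route_alert(signal: str) -> Response:
    """Pick a response path; anything not listed falls back to investigation."""
    if signal in CONTROL_PLANE_SIGNALS:
        return Response.CONTAIN_NOW
    return Response.INVESTIGATE


def is_oscillating(release_history: Sequence[str], window: int = 6, min_reverts: int = 2) -> bool:
    """Flag a release alias that keeps switching back to an artifact it already left.

    release_history lists the artifact behind the alias after each release,
    oldest first. A revert is a change to an artifact that was active
    earlier within the window.
    """
    recent = list(release_history)[-window:]
    reverts = 0
    for i in range(1, len(recent)):
        changed = recent[i] != recent[i - 1]
        seen_before = recent[i] in recent[: i - 1]
        if changed and seen_before:
            reverts += 1
    return reverts >= min_reverts
```

For example, `is_oscillating(["a", "b", "a", "b"])` counts two reverts and returns `True`, while a steady progression such as `["a", "b", "c", "d"]` returns `False`.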
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.