Observability and auditability

Observe decisions, not only tokens

Traditional model monitoring focuses on latency and errors. A breeding system must also explain why a model was selected, which lineage produced it, which evaluator approved it, what resources it consumed, and why the controller changed the population.

Correlation identifiers

Use stable identifiers across planes:

request_id for the external request;
trace_id for the complete execution path;
route_plan_id for the router decision;
artifact_id for each model package;
lineage_id for descendant provenance;
evaluation_id for an evidence card;
decision_id for viability-controller output;
release_id for traffic and alias changes;
incident_id for abnormal events.
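A minimal sketch of how these identifiers might travel together through the planes, assuming Python. The `CorrelationIds` class, its optional fields, and the `as_log_fields` helper are hypothetical names used for illustration; the guide does not prescribe a container type. Only the identifiers that exist at a given point in the pipeline are populated.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class CorrelationIds:
    """Hypothetical container for the identifiers listed above."""
    request_id: str
    trace_id: str
    route_plan_id: Optional[str] = None
    artifact_id: Optional[str] = None
    lineage_id: Optional[str] = None
    evaluation_id: Optional[str] = None
    decision_id: Optional[str] = None
    release_id: Optional[str] = None
    incident_id: Optional[str] = None

    def as_log_fields(self) -> dict:
        # Emit only the identifiers that are set, so log lines stay compact.
        return {k: v for k, v in asdict(self).items() if v is not None}


# Example values are made up.
ids = CorrelationIds(request_id="req-1", trace_id="tr-1", route_plan_id="rp-7")
fields = ids.as_log_fields()
```

Freezing the dataclass keeps an identifier from being reassigned midway through a request, which is the property "stable" asks for.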

Request trace

pseudocode

trace <- {
    trace_id: UUID_V7(),
    request_contract: request.contract,
    risk_tier: request.risk_tier,
    route_plan: route.plan_id,
    artifacts: route.artifact_ids,
    start_utc: start,
    end_utc: end,
    latency_ms: elapsed,
    resource_usage: measured_resources,
    evaluator_verdict: verdict.summary,
    response_status: response.status,
    policy_events: policy_events
}

Minimize or tokenize sensitive content. Store enough to reproduce behavior without turning telemetry into an uncontrolled data lake.
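One way the trace record and the tokenization advice could fit together, sketched in Python under stated assumptions. The `tokenize` and `build_trace` helpers, the `user_email` field, and the key handling are hypothetical. `uuid.uuid4()` is swapped in for the UUID v7 of the pseudocode, and the record carries only a subset of the fields shown above.

```python
import hashlib
import hmac
import uuid


def tokenize(value: str, key: bytes) -> str:
    """Replace a sensitive value with a keyed, non-reversible token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def build_trace(request: dict, route: dict, verdict: str, status: str,
                start: float, end: float, key: bytes) -> dict:
    return {
        # uuid4 stands in for the UUID v7 used in the pseudocode.
        "trace_id": str(uuid.uuid4()),
        "request_contract": request["contract"],
        "risk_tier": request["risk_tier"],
        # Assumed sensitive field: stored as a token, never as raw text.
        "user_token": tokenize(request["user_email"], key),
        "route_plan": route["plan_id"],
        "artifacts": list(route["artifact_ids"]),
        "start_utc": start,
        "end_utc": end,
        "latency_ms": round((end - start) * 1000, 3),
        "evaluator_verdict": verdict,
        "response_status": status,
    }


key = b"demo-key-not-for-production"  # illustrative only
start = 1_700_000_000.0
trace = build_trace(
    {"contract": "summarize.v1", "risk_tier": "low", "user_email": "a@example.com"},
    {"plan_id": "rp-7", "artifact_ids": ["art-1", "art-2"]},
    verdict="pass", status="ok", start=start, end=start + 0.25, key=key,
)
```

Because the token is keyed and deterministic, two traces from the same user can still be joined without the raw value ever entering telemetry.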

Metric families

Execution: latency, throughput, timeouts, crashes, queueing, memory, GPU residency, fallback rate.

Quality: task score, calibration, abstention, human correction, disagreement, drift, slice performance.

Ecology: population size, active specialists, traffic concentration, diversity, redundancy, churn, archive growth.

Evolution: candidates per cycle, evaluation cost, promotion rate, rollback rate, no-op rate, time to evidence, cumulative complexity.

Governance: blocked actions, permission requests, approval latency, policy violations, unsigned artifacts, expired evidence.
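The guide does not say how the ecology metrics are computed. As one possible reading, traffic concentration can be expressed as a Herfindahl-Hirschman index over per-artifact traffic shares, and diversity as the exponential of Shannon entropy (an "effective number" of specialists). The function names and sample counts below are assumptions made for illustration.

```python
import math


def traffic_shares(requests_by_artifact: dict) -> dict:
    """Convert raw request counts into fractions of total traffic."""
    total = sum(requests_by_artifact.values())
    if total <= 0:
        raise ValueError("no traffic recorded")
    return {a: n / total for a, n in requests_by_artifact.items()}


def concentration_hhi(shares: dict) -> float:
    """Herfindahl-Hirschman index: 1/N for even traffic, 1.0 if one artifact takes all."""
    return sum(s * s for s in shares.values())


def effective_specialists(shares: dict) -> float:
    """exp(Shannon entropy): how many equally used artifacts would match this diversity."""
    entropy = -sum(s * math.log(s) for s in shares.values() if s > 0)
    return math.exp(entropy)


# Made-up counts: three active artifacts, one of them dominant.
shares = traffic_shares({"art-a": 700, "art-b": 200, "art-c": 100})
hhi = concentration_hhi(shares)
n_eff = effective_specialists(shares)
```

With these counts the population has three active artifacts but behaves like roughly two, which is the kind of gap between population size and real diversity the Ecology family is meant to expose.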

Logs versus evidence

Operational logs are high-volume and short-lived. Evaluation cards, lineage records, approvals, and release decisions are durable evidence. Do not rely on transient logs for compliance or reproducibility.
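The guide does not specify a storage design for durable evidence. One common option, shown here purely as an assumption, is an append-only chain in which each entry commits to the digest of the previous one, so later edits become detectable. The helper names and record fields are hypothetical.

```python
import hashlib
import json


def evidence_digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_evidence(chain: list, record: dict) -> dict:
    """Append a record that commits to the previous entry's digest."""
    prev = chain[-1]["digest"] if chain else None
    body = {"prev_digest": prev, "record": record}
    entry = {**body, "digest": evidence_digest(body)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every digest and check that each entry links to its predecessor."""
    prev = None
    for entry in chain:
        body = {"prev_digest": entry["prev_digest"], "record": entry["record"]}
        if entry["prev_digest"] != prev or entry["digest"] != evidence_digest(body):
            return False
        prev = entry["digest"]
    return True


chain = []
append_evidence(chain, {"evaluation_id": "ev-1", "artifact_id": "art-1", "verdict": "pass"})
append_evidence(chain, {"release_id": "rel-1", "artifact_id": "art-1", "action": "promote"})
intact = verify_chain(chain)
```

Transient logs can then reference an entry's digest instead of duplicating the evidence itself.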

Decision dashboards

A useful dashboard compares champion and challengers, decomposes viability scores, shows confidence intervals, highlights missing evidence, and links every production metric to artifact lineage. Avoid aggregate green indicators that hide failing slices.
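To illustrate how an aggregate indicator can hide a failing slice, the sketch below computes a Wilson score interval per slice and flags any slice whose lower bound falls below a threshold. The slice names, counts, and the 0.85 threshold are invented for the example; the guide does not prescribe a particular interval method.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def slice_report(slices: dict, threshold: float) -> dict:
    """Flag each slice whose interval lower bound falls below the threshold."""
    report = {}
    for name, (ok, n) in slices.items():
        low, high = wilson_interval(ok, n)
        report[name] = {
            "rate": ok / n if n else None,
            "low": low,
            "high": high,
            "status": "ok" if low >= threshold else "investigate",
        }
    return report


# Made-up (successes, total) counts per slice.
slices = {"en": (960, 1000), "de": (190, 200), "legal": (30, 50)}
aggregate = sum(ok for ok, _ in slices.values()) / sum(n for _, n in slices.values())
report = slice_report(slices, threshold=0.85)
```

Here the aggregate success rate is above 0.94, yet the small `legal` slice is flagged, which a single green indicator would not show.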

Alerting

Alert on hard-invariant failures, unexpected permission attempts, unsigned packages, rapid population growth, oscillating releases, evaluator disagreement, traffic concentration, and rollback failure. Ordinary model-quality drift can be handled through slower investigation workflows; control-plane compromise requires immediate containment.
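As an example of one of these signals, oscillating releases could be detected by counting how often an alias is switched back to an artifact it pointed to earlier within a time window. The event format, the window, and the `max_reversions` threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def count_reversions(events: list, window: timedelta) -> int:
    """Count alias changes that restore an artifact seen earlier within `window`.

    `events` is a time-ordered list of (timestamp, artifact_id) pairs for one alias.
    """
    reversions = 0
    for i, (ts, artifact) in enumerate(events):
        recent = [a for t, a in events[:i] if ts - t <= window]
        # A reversion: the artifact was active recently, was replaced, and is now restored.
        if artifact in recent[:-1] and recent[-1] != artifact:
            reversions += 1
    return reversions


def is_oscillating(events: list, window: timedelta, max_reversions: int = 1) -> bool:
    """Hypothetical alert rule: more than `max_reversions` flips back within the window."""
    return count_reversions(events, window) > max_reversions


t0 = datetime(2024, 1, 1, 12, 0)
flapping = [(t0, "art-a"), (t0 + timedelta(hours=1), "art-b"),
            (t0 + timedelta(hours=2), "art-a"), (t0 + timedelta(hours=3), "art-b")]
steady = [(t0, "art-a"), (t0 + timedelta(hours=1), "art-b"),
          (t0 + timedelta(hours=2), "art-c")]
```

A single rollback counts as one reversion and stays under the default threshold, so ordinary recovery does not page anyone; repeated flips do.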

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.

Core synthesis: The Four Fs of AI: Code Breeding, Model Breeding, and the Teleodynamic Convergence of Mutable Small-Model Ecologies (Conceptual synthesis · 80.5 KB)

Evolutionary AI: Designing the “Perfect” Evolutionary AI System (Established and emerging methods · 46.5 KB)
