Release is part of evaluation
Offline suites cannot reproduce every production condition. Shadow and canary stages collect operational evidence while limiting user impact. Promotion is a sequence of reversible state transitions, not one deployment event.
- Candidateimmutable artifact
- Offline gatequality, safety, cost
- Shadowno user impact
- Canarybounded cohort
- Progressivemeasured expansion
- Championrollback retained
Stages
Candidate
Immutable package has offline evidence but no production eligibility.
Shadow
Receives mirrored production inputs where policy permits. Outputs are logged for comparison but never returned to users or downstream systems.
Canary
Serves a small, explicitly selected cohort. Start with low-risk tasks and users who can report issues. Set hard traffic, time, and cost limits.
Progressive
Increase traffic only after observation windows pass. Expand one dimension at a time—volume, task range, jurisdiction, or risk tier.
Champion
Becomes the default alias while the prior champion remains verified as rollback.
Abort criteria
IF hard_invariant_failure_count > 0
ABORT_IMMEDIATELY
ELSE IF critical_slice_quality < threshold
ABORT_IMMEDIATELY
ELSE IF p99_latency > budget FOR sustained_window
REDUCE_TRAFFIC_OR_ABORT
ELSE IF unexplained_disagreement_rate > threshold
PAUSE_AND_REVIEW
END IFRollback design
Rollback includes model package, router state, prompt and policy versions, retrieval configuration, cache compatibility, and memory-schema migration. A previous model file alone may not restore prior behavior.
Automatic versus manual
Automatic abort is appropriate for clear invariant or SLO breaches. Manual review is appropriate for ambiguous quality or user-experience changes. The release controller can reduce traffic automatically while preserving evidence for investigation.
Roll-forward
Some incidents require a corrected descendant rather than rollback, especially when the prior artifact is affected by a revoked dataset or vulnerability. Maintain both paths and document when rollback is prohibited.
Release evidence
Record cohort definition, traffic weights, exact package and configuration, start and end times in UTC, metrics, alerts, operator actions, and final decision. Link the release record to the original experiment and evaluation card.
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.