The central safety boundary
A model-breeding system fails if the candidate can change the rules that promote it. The candidate can generate outputs, proposals, and explanations. It cannot edit the evaluator, hide test cases, alter thresholds, change deployment policy, or decide that its own evidence is sufficient.
This boundary is the difference between controlled evolution and self-referential optimization.
Protected components
| Component | Why it is protected |
|---|---|
| Test cases | Prevents candidates from training against the exact gate through unauthorized access |
| Scoring weights | Prevents candidates from optimizing the easiest dimensions |
| Hard gates | Prevents safety, legal, and provenance failures from being averaged away |
| Evaluation runtime | Prevents sandbox escape and timing manipulation |
| Evidence store | Prevents deletion or rewriting of failed trials |
| Promotion policy | Prevents popularity or self-advocacy from becoming approval |
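The hard-gate row can be made concrete with a small sketch. The gate names, metric names, and weights below are hypothetical and chosen only for illustration. The point is structural: gates are checked before any weighted score is computed, so a strong average cannot buy back a failed gate.

```python
from dataclasses import dataclass

# Hypothetical gate names and weights, for illustration only.
HARD_GATES = ("safety", "legal", "provenance")
WEIGHTS = {"accuracy": 0.5, "latency": 0.2, "robustness": 0.3}


@dataclass(frozen=True)
class Scorecard:
    gates: dict    # gate name -> bool (passed or not)
    metrics: dict  # metric name -> float in [0, 1]


def promotion_score(card: Scorecard):
    """Return a weighted score, or None if any hard gate fails."""
    for gate in HARD_GATES:
        # A missing gate result counts as a failure, not as a pass.
        if not card.gates.get(gate, False):
            return None
    return sum(WEIGHTS[name] * card.metrics.get(name, 0.0) for name in WEIGHTS)


strong_but_unsafe = Scorecard(
    gates={"safety": False, "legal": True, "provenance": True},
    metrics={"accuracy": 1.0, "latency": 1.0, "robustness": 1.0},
)
modest_but_clean = Scorecard(
    gates={"safety": True, "legal": True, "provenance": True},
    metrics={"accuracy": 0.5, "latency": 0.5, "robustness": 0.5},
)
```

Under these assumptions, `promotion_score(strong_but_unsafe)` returns `None` despite perfect metrics, while `modest_but_clean` receives an ordinary weighted score. Returning `None` rather than a very low number keeps a gate failure from being ranked alongside legitimate scores.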
Evolving the evaluator
Evaluators may need to improve, but that work belongs to a separate code-breeding or governance process. Evaluator changes can be treated as candidates too, but those candidates must be judged by a higher-level review rule, a regression corpus, and a human-owned approval path.
FUNCTION propose_evaluator_change(change, policy)
    REQUIRE change.origin != "candidate_model_self_edit"
    REQUIRE change.has_regression_suite
    REQUIRE change.has_human_owner
    REQUIRE change.has_rollback_plan
    shadow_scores <- SCORE_HISTORICAL_CANDIDATES_WITH(change)
    IF shadow_scores.flip_critical_decisions_without_explanation
        RETURN REJECT("Evaluator change destabilizes historical decisions")
    END IF
    RETURN APPROVE_FOR_GOVERNANCE_REVIEW(change)
END FUNCTION
Metric gaming signals
Look for sudden benchmark jumps without broad slice improvement, high self-confidence with poor calibration, evidence that improves only on public tests, repeated proposals to relax thresholds, or candidates that perform unusually well when inspected less strictly.
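Two of these signals lend themselves to simple automated checks. The sketch below is one possible heuristic, not a prescribed detector: the function names, the dictionary layout, and every threshold are assumptions made for illustration, and real thresholds would need tuning against historical runs.

```python
def narrow_jump(prev_slices, new_slices,
                headline_jump=0.05, slice_gain=0.01, min_fraction=0.5):
    """Flag a large mean improvement that most slices do not share.

    All thresholds are hypothetical defaults.
    """
    shared = sorted(set(prev_slices) & set(new_slices))
    if not shared:
        return False
    prev_mean = sum(prev_slices[s] for s in shared) / len(shared)
    new_mean = sum(new_slices[s] for s in shared) / len(shared)
    improved = sum(
        1 for s in shared if new_slices[s] - prev_slices[s] >= slice_gain
    )
    jumped = (new_mean - prev_mean) >= headline_jump
    return jumped and improved / len(shared) < min_fraction


def public_only_gain(prev, new, min_gap=0.05):
    """Flag gains on public tests that held-out tests do not confirm."""
    public_gain = new["public"] - prev["public"]
    held_out_gain = new["held_out"] - prev["held_out"]
    return public_gain - held_out_gain >= min_gap


before = {"a": 0.6, "b": 0.6, "c": 0.6, "d": 0.6}
one_slice_spike = {"a": 0.9, "b": 0.6, "c": 0.6, "d": 0.6}
broad_gain = {"a": 0.68, "b": 0.68, "c": 0.68, "d": 0.68}
```

Here `narrow_jump(before, one_slice_spike)` flags the run because only one of four slices moved, while `narrow_jump(before, broad_gain)` does not. A flag is a prompt for closer inspection, not proof of gaming.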
Independent evidence stores
Candidate artifacts should not be able to write directly to their scorecards. Evaluators produce scorecards, sign them, and store them append-only. Candidates may attach rebuttals or analysis, but they do not alter the record.
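A minimal in-memory sketch of such a store follows, assuming an HMAC key held only by the evaluator and a hash chain in which each signature covers the previous one. The class name, entry layout, and key handling are hypothetical. A production store would also need durable storage, key management, and protection against truncation of the most recent entries, which this chain alone does not detect.

```python
import hashlib
import hmac
import json


class EvidenceStore:
    """Append-only, hash-chained log of signed records (illustrative sketch)."""

    def __init__(self, evaluator_key: bytes):
        self._key = evaluator_key  # assumed to be unavailable to candidates
        self._entries = []

    def _sign(self, payload: str, prev_sig: str) -> str:
        # Each signature covers the previous signature, chaining the entries.
        message = (prev_sig + payload).encode("utf-8")
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()

    def append(self, kind: str, body: dict) -> None:
        payload = json.dumps({"kind": kind, "body": body}, sort_keys=True)
        prev_sig = self._entries[-1]["sig"] if self._entries else ""
        self._entries.append(
            {"payload": payload, "sig": self._sign(payload, prev_sig)}
        )

    def verify(self) -> bool:
        """Recompute the chain; any edited or removed earlier entry breaks it."""
        prev_sig = ""
        for entry in self._entries:
            expected = self._sign(entry["payload"], prev_sig)
            if not hmac.compare_digest(expected, entry["sig"]):
                return False
            prev_sig = entry["sig"]
        return True


store = EvidenceStore(b"evaluator-only-secret")
store.append("scorecard", {"candidate": "c-17", "passed": False})
# A rebuttal is a new entry; it never rewrites the scorecard it disputes.
store.append("rebuttal", {"candidate": "c-17", "note": "disputes slice B"})
```

The store exposes no update or delete method, so a rebuttal can only be added alongside the failed trial. If an earlier payload is rewritten out of band, `verify()` returns `False` because the recomputed signature no longer matches.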
Human role
Humans do not need to review every low-risk candidate manually, but humans must own the evaluation constitution: what cannot be optimized away, who may change gates, how conflicts are resolved, and when automation stops.
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.