Containment and human oversight

Containment is layered

No single sandbox or kill switch is sufficient. Containment combines limited action spaces, least privilege, resource ceilings, independent monitoring, progressive release, and human authority.

Layers

Research boundary: simulation or offline data; no production credentials.
Package boundary: verified formats, manifests, signatures, and permissions.
Process boundary: container or process isolation, memory and time limits.
Network boundary: deny by default, explicit destinations, monitored egress.
Tool boundary: structured mediation and risk-tier approval.
Data boundary: classification, minimization, jurisdiction, and holdout separation.
Release boundary: shadow, canary, traffic caps, and rollback.
Governance boundary: human-owned policy, signing, revocation, and emergency stop.
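
As a rough illustration of how the network and tool boundaries might compose, the sketch below mediates each tool call through a deny-by-default check. The allowlisted host, the tool names, the risk tiers, and the `mediate` function are all hypothetical and are not taken from the source reports. A real deployment would load them from human-owned, signed policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical policy values; in practice these would come from signed policy.
ALLOWED_HOSTS = frozenset({"internal-sim.example"})
TOOL_RISK_TIER = {
    "read_dataset": "low",
    "send_request": "medium",
    "deploy_model": "high",
}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    url: Optional[str] = None          # destination, if the tool touches the network
    approved_by: Optional[str] = None  # human approver, if any


def mediate(call: ToolCall) -> Tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly permitted is denied."""
    tier = TOOL_RISK_TIER.get(call.tool)
    if tier is None:
        return False, "unknown tool"
    if call.url is not None:
        # Deny by default: only explicit destinations pass.
        host = urlparse(call.url).hostname
        if host not in ALLOWED_HOSTS:
            return False, "destination not on allowlist"
    if tier == "high" and not call.approved_by:
        return False, "high-risk tool requires human approval"
    return True, f"allowed ({tier} risk)"
```

Returning a reason alongside the decision lets each denial be logged as a policy event, which an independent monitor can then consume.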

Oversight modes

Human in the loop: approval before a high-impact action.
Human on the loop: continuous monitoring with ability to intervene.
Human over the loop: policy, budgets, and audit governing mostly automated low-risk work.

Choose the oversight mode by the consequence and reversibility of the action, not by model confidence alone.
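
One possible way to encode this rule is a small lookup on consequence and reversibility. The three consequence tiers and the specific mapping below are assumptions made for illustration. Model confidence is left out of the signature on purpose, so it cannot relax the chosen mode.

```python
from enum import Enum


class Mode(Enum):
    IN_THE_LOOP = "human in the loop"
    ON_THE_LOOP = "human on the loop"
    OVER_THE_LOOP = "human over the loop"


def choose_mode(consequence: str, reversible: bool) -> Mode:
    """Pick an oversight mode from hypothetical tiers 'low', 'medium', 'high'."""
    if consequence not in {"low", "medium", "high"}:
        raise ValueError(f"unknown consequence tier: {consequence!r}")
    # High impact, or medium impact that cannot be undone: approve first.
    if consequence == "high" or (consequence == "medium" and not reversible):
        return Mode.IN_THE_LOOP
    # Medium impact, or low impact that cannot be undone: watch continuously.
    if consequence == "medium" or not reversible:
        return Mode.ON_THE_LOOP
    # Low impact and reversible: govern by policy, budgets, and audit.
    return Mode.OVER_THE_LOOP
```

For example, `choose_mode("low", False)` yields `Mode.ON_THE_LOOP` under this mapping, because an irreversible action is bumped up one level even when its impact is low.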

Emergency stop

The stop mechanism must live outside the model runtime, require independent credentials, keep working during partial control-plane failure, and be tested regularly. A stop should halt traffic, tool access, candidate generation, and federated rounds, and should freeze alias changes.

pseudocode

PROCEDURE emergency_freeze(scope, operator)
    REQUIRE operator HAS emergency_role
    SET_RELEASE_CONTROLLER_READ_ONLY(scope)
    SET_ROUTER_TO_VERIFIED_SAFE_MODE(scope)
    REVOKE_RUNTIME_TOOL_TOKENS(scope)
    PAUSE_CANDIDATE_FACTORY(scope)
    TERMINATE_ACTIVE_HIGH_RISK_TASKS(scope)
    APPEND_TAMPER_EVIDENT_AUDIT("emergency_freeze", scope, operator)
END PROCEDURE
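
The procedure above could be sketched in Python roughly as follows. The `AuditLog` class, the `roles` mapping, and the `steps` list of callables are hypothetical stand-ins for real controllers. The sketch also makes two assumptions that go beyond the pseudocode: each step is attempted even if an earlier one fails, so a partial control-plane failure does not block the rest, and the audit entry records per-step outcomes in a hash chain.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Append-only log; each entry's hash covers the previous entry's hash."""
    entries: list = field(default_factory=list)

    def append(self, event: str, scope: str, operator: str, detail: list) -> str:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"event": event, "scope": scope, "operator": operator,
                "detail": detail, "prev": prev}
        digest = _digest(body)
        self.entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


def emergency_freeze(scope, operator, roles, steps, audit):
    """Run every freeze step for `scope`, then record the outcome."""
    if "emergency_role" not in roles.get(operator, set()):
        raise PermissionError(f"{operator} lacks emergency_role")
    results = []
    for name, action in steps:
        try:
            action(scope)
            results.append([name, "ok"])
        except Exception as exc:  # keep going: later steps may still work
            results.append([name, f"failed: {exc}"])
    audit.append("emergency_freeze", scope, operator, results)
    return results
```

In this sketch, `steps` would hold one `(name, callable)` pair per line of the pseudocode, in the same order. An in-memory hash chain only makes tampering detectable; storing the log somewhere the model runtime cannot write to is a separate requirement.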

Monitoring independence

The monitor should not rely only on model self-report. Use runtime telemetry, policy events, deterministic validators, independent judges, and human review. Candidate-generated explanations are evidence, not authority.
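
A minimal sketch of this principle, under assumed names: signals are tagged by source, and a passing verdict requires agreement from several distinct independent sources. The source labels, the `min_independent` threshold of 2, and the verdict strings are illustrative choices, not values from the source reports.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical labels for sources that do not depend on the model's own account.
INDEPENDENT_SOURCES = frozenset({
    "runtime_telemetry",
    "policy_event",
    "deterministic_validator",
    "independent_judge",
    "human_review",
})


@dataclass(frozen=True)
class Signal:
    source: str  # e.g. "model_self_report" or one of INDEPENDENT_SOURCES
    ok: bool


def monitor_verdict(signals: Iterable[Signal], min_independent: int = 2) -> str:
    """Combine signals; self-report can raise concern but never grants a pass."""
    signals = list(signals)
    # Any failing signal, including a self-report, is evidence worth escalating.
    if any(not s.ok for s in signals):
        return "escalate"
    # Only distinct independent sources count toward a pass.
    distinct = {s.source for s in signals if s.source in INDEPENDENT_SOURCES}
    if len(distinct) < min_independent:
        return "insufficient_evidence"
    return "pass"
```

Under these assumptions, a lone passing self-report returns `"insufficient_evidence"`, while a failing self-report still triggers `"escalate"`. The model's explanation is treated as evidence without being given authority.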

Scaling autonomy

Increase automation only after the lower-risk stage has stable evidence. A system allowed to select among approved models is not automatically ready to train descendants; a system that trains adapters is not ready to change code; a code generator is not ready to deploy its own patch.
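
The staged progression could be expressed as a ladder in which promotion moves one step at a time and only when evidence at the current stage is stable. The stage names loosely follow the paragraph above. Their exact ordering, the `Evidence` fields, and the 30-day default are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical ordering of autonomy stages, lowest risk first.
STAGES = (
    "select_approved_models",
    "train_adapters",
    "generate_code",
    "deploy_own_patch",
)


@dataclass(frozen=True)
class Evidence:
    stable_days: int     # days of stable operation at the current stage
    open_incidents: int  # unresolved safety incidents
    human_signoff: bool  # explicit approval from the policy owner


def next_allowed_stage(current: str, evidence: Evidence,
                       min_stable_days: int = 30) -> str:
    """Return the stage the system may operate at; never skips a stage."""
    idx = STAGES.index(current)  # raises ValueError for an unknown stage
    ready = (
        evidence.stable_days >= min_stable_days
        and evidence.open_incidents == 0
        and evidence.human_signoff
    )
    if ready and idx + 1 < len(STAGES):
        return STAGES[idx + 1]
    return current
```

Because the function advances at most one stage per call and requires human sign-off, readiness at one level never implies readiness at the next.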

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.

Speculative risk scenarios · Aggressive Mutualism: Safety, Governance, and Containment Analysis · Risk analysis · 42.0 KB
Speculative risk scenarios · Instrumental Drives in Powerful AI Systems · Risk analysis · 42.2 KB
Governance and safety · Mutualist Persistence: Research Synthesis and Recommendations · Conceptual governance framework · 15.7 KB

Related guides

Safety and governance

Safety invariants

Instrumental-drive containment

Responsible model-breeding research