Safety | Advanced | 2 minute read | Updated 2026-06-26 UTC

Containment and human oversight

Layered isolation, capability control, monitoring, approval, and emergency response for increasingly adaptive model systems.

Research status: Established containment principles adapted to adaptive AI
Publication state: Published
Reviewed by: Michael Kappel
Source reports: 3

Containment is layered

No single sandbox or kill switch is sufficient. Containment combines limited action spaces, least privilege, resource ceilings, independent monitoring, progressive release, and human authority.

Layers

  1. Research boundary: simulation or offline data; no production credentials.
  2. Package boundary: verified formats, manifests, signatures, and permissions.
  3. Process boundary: container or process isolation, memory and time limits.
  4. Network boundary: deny by default, explicit destinations, monitored egress.
  5. Tool boundary: structured mediation and risk-tier approval.
  6. Data boundary: classification, minimization, jurisdiction, and holdout separation.
  7. Release boundary: shadow, canary, traffic caps, and rollback.
  8. Governance boundary: human-owned policy, signing, revocation, and emergency stop.
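
As a rough illustration of how several of these boundaries might be checked together, the sketch below evaluates a request against five of the eight layers and reports every layer that rejects it. The `Request` fields, the limits, and the allowed destination are hypothetical values chosen for the example, not part of any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Request:
    """Hypothetical description of an action a model runtime wants to take."""
    has_production_credentials: bool
    package_signature_valid: bool
    memory_mb: int
    destination: str
    tool_risk_tier: int  # 0 is the lowest risk tier
    approved_by_human: bool


# Assumed limits, for illustration only.
MEMORY_CEILING_MB = 2048
ALLOWED_DESTINATIONS = frozenset({"artifacts.internal.example"})
APPROVAL_REQUIRED_TIER = 2

Check = Callable[[Request], bool]

# Each entry is (layer name, predicate that returns True when the layer allows it).
LAYERS: list[tuple[str, Check]] = [
    ("research", lambda r: not r.has_production_credentials),
    ("package", lambda r: r.package_signature_valid),
    ("process", lambda r: r.memory_mb <= MEMORY_CEILING_MB),
    # Deny by default: only explicitly listed destinations pass.
    ("network", lambda r: r.destination in ALLOWED_DESTINATIONS),
    ("tool", lambda r: r.tool_risk_tier < APPROVAL_REQUIRED_TIER or r.approved_by_human),
]


def failed_layers(request: Request) -> list[str]:
    """Return the name of every layer that rejects the request."""
    return [name for name, check in LAYERS if not check(request)]


def is_allowed(request: Request) -> bool:
    """A request is allowed only when no layer rejects it."""
    return not failed_layers(request)
```

Every layer is evaluated rather than stopping at the first failure, so an audit record can show all the boundaries a request would have crossed.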

Oversight modes

  • Human in the loop: approval before a high-impact action.
  • Human on the loop: continuous monitoring with ability to intervene.
  • Human over the loop: policy, budgets, and audit governing mostly automated low-risk work.

Choose the mode by the consequence and reversibility of the action, not by model confidence alone.
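
One way to express that rule is the minimal sketch below. The three impact levels and the confidence threshold are assumptions made for the example. Low confidence may tighten oversight by one step, but high confidence never relaxes it.

```python
from enum import Enum


class Impact(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class OversightMode(Enum):
    IN_THE_LOOP = "human in the loop"
    ON_THE_LOOP = "human on the loop"
    OVER_THE_LOOP = "human over the loop"


# Assumed threshold, for illustration only.
LOW_CONFIDENCE = 0.5


def choose_oversight(impact: Impact, reversible: bool, model_confidence: float) -> OversightMode:
    """Pick an oversight mode from consequence and reversibility first."""
    # High-impact actions, and medium-impact irreversible ones, need prior approval.
    if impact is Impact.HIGH or (impact is Impact.MEDIUM and not reversible):
        return OversightMode.IN_THE_LOOP

    if impact is Impact.MEDIUM or not reversible:
        mode = OversightMode.ON_THE_LOOP
    else:
        mode = OversightMode.OVER_THE_LOOP

    # Confidence is a secondary input: it can only escalate, never de-escalate.
    if model_confidence < LOW_CONFIDENCE:
        if mode is OversightMode.OVER_THE_LOOP:
            mode = OversightMode.ON_THE_LOOP
        else:
            mode = OversightMode.IN_THE_LOOP
    return mode
```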

Emergency stop

The stop mechanism must sit outside the model runtime, require independent credentials, work during partial control-plane failure, and be tested regularly. A stop should halt traffic, tool access, candidate generation, federated rounds, and alias changes.

pseudocode
PROCEDURE emergency_freeze(scope, operator)
    REQUIRE operator HAS emergency_role
    SET_RELEASE_CONTROLLER_READ_ONLY(scope)
    SET_ROUTER_TO_VERIFIED_SAFE_MODE(scope)
    REVOKE_RUNTIME_TOOL_TOKENS(scope)
    PAUSE_CANDIDATE_FACTORY(scope)
    TERMINATE_ACTIVE_HIGH_RISK_TASKS(scope)
    APPEND_TAMPER_EVIDENT_AUDIT("emergency_freeze", scope, operator)
END PROCEDURE
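
The procedure above could be sketched in Python roughly as follows. The step callables, the role name check, and the list used as an audit log are hypothetical stand-ins: a real deployment would call its own control-plane clients and write to a tamper-evident store. The sketch adds one assumption the pseudocode leaves implicit, namely that a failing step does not prevent the remaining steps from running, which is one way to keep working during partial control-plane failure.

```python
from typing import Callable, Sequence


class NotAuthorized(Exception):
    """Raised when the operator lacks the emergency role."""


Step = tuple[str, Callable[[str], None]]


def emergency_freeze(
    scope: str,
    operator: str,
    operator_roles: set[str],
    steps: Sequence[Step],
    audit_log: list[dict],
) -> list[str]:
    """Run every freeze step for `scope` and return the names of steps that failed."""
    if "emergency_role" not in operator_roles:
        raise NotAuthorized(f"{operator} lacks emergency_role")

    failed: list[str] = []
    for name, action in steps:
        try:
            action(scope)
        except Exception:
            # Keep going: one unreachable component must not block the others.
            failed.append(name)

    # A plain list stands in for a tamper-evident audit store here.
    audit_log.append(
        {
            "event": "emergency_freeze",
            "scope": scope,
            "operator": operator,
            "failed_steps": list(failed),
        }
    )
    return failed
```

Returning the failed step names lets the operator see at once which controls still need manual intervention.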

Monitoring independence

The monitor should not rely only on model self-report. Use runtime telemetry, policy events, deterministic validators, independent judges, and human review. Candidate-generated explanations are evidence, not authority.
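
A minimal sketch of that principle, assuming a hypothetical `Signal` record and an arbitrary minimum of two independent sources: a clean self-report can never produce a pass on its own, while a self-reported problem is still treated as evidence worth escalating.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Signal:
    source: str          # e.g. "runtime_telemetry", "validator", "model_self_report"
    independent: bool    # False for anything the candidate model produced itself
    flags_problem: bool


def verdict(signals: Iterable[Signal], min_independent: int = 2) -> str:
    """Combine monitoring signals into "block", "needs_human_review", or "pass"."""
    signals = list(signals)
    independent = [s for s in signals if s.independent]

    # Any independent source reporting a problem is decisive.
    if any(s.flags_problem for s in independent):
        return "block"
    # Too little independent evidence: self-report cannot fill the gap.
    if len(independent) < min_independent:
        return "needs_human_review"
    # Self-reported problems are evidence, so escalate rather than ignore them.
    if any(s.flags_problem for s in signals):
        return "needs_human_review"
    return "pass"
```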

Scaling autonomy

Increase automation only after the lower-risk stage has stable evidence. A system allowed to select among approved models is not automatically ready to train descendants; a system that trains adapters is not ready to change code; a code generator is not ready to deploy its own patch.
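
The staged progression can be sketched as a ladder that advances at most one step at a time. The ordering of the stages, the run count, and the sign-off flag below are assumptions made for the example. The text above only says that readiness at one level does not imply readiness at a higher one.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    # Assumed ordering, lowest risk first.
    SELECT_APPROVED_MODELS = 1
    TRAIN_ADAPTERS = 2
    TRAIN_DESCENDANTS = 3
    CHANGE_CODE = 4
    DEPLOY_OWN_PATCH = 5


def next_stage(
    current: Autonomy,
    stable_runs: int,
    human_signed_off: bool,
    required_runs: int = 30,
) -> Autonomy:
    """Return the stage the system may operate at after this review."""
    if current is max(Autonomy):
        return current  # already at the top of the ladder
    # Both conditions are required: evidence at the current stage and human approval.
    if stable_runs >= required_runs and human_signed_off:
        return Autonomy(current + 1)  # never skip a stage
    return current
```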

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.