Safety Advanced 2 minute read Updated 2026-06-26 UTC

Threat model for model breeding

Assets, actors, trust boundaries, abuse cases, and mitigations for candidates, data, evaluators, registries, routers, and federated updates.

Research status: Established threat-modeling adapted to AI systems
Publication state: Published
Reviewed by: Michael Kappel
Source reports: 3

Protect the control plane

The main security concern is not a model “wanting” something. It is a candidate or compromised component gaining capabilities that let it alter evidence, policy, data, or release state. Threat modeling should therefore be capability-based: enumerate what each component can read, write, or invoke rather than speculate about intent.
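One way to make the capability-based view concrete is to record what each component is allowed to do and flag anything that can touch the control plane. The following is a minimal sketch, not a prescribed implementation: the capability names, the component names, and the `control_plane_exposure` helper are all hypothetical and chosen only for illustration.

```python
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability names; adapt them to your own system."""
    NONE = 0
    READ_HOLDOUT = auto()
    WRITE_EVIDENCE = auto()
    WRITE_POLICY = auto()
    WRITE_DATA = auto()
    WRITE_RELEASE_STATE = auto()


# Capabilities that let a component alter evidence, policy, data, or release state.
CONTROL_PLANE = (
    Capability.WRITE_EVIDENCE
    | Capability.WRITE_POLICY
    | Capability.WRITE_DATA
    | Capability.WRITE_RELEASE_STATE
)


def control_plane_exposure(granted):
    """Map each component to the control-plane capabilities it holds, if any."""
    exposure = {}
    for component, caps in granted.items():
        overlap = caps & CONTROL_PLANE
        if overlap:  # an empty Flag value is falsy
            exposure[component] = overlap
    return exposure


# Illustrative grants (assumed, not taken from any real deployment).
granted = {
    "candidate": Capability.NONE,
    "evaluator": Capability.WRITE_EVIDENCE,
    "release_bot": Capability.WRITE_RELEASE_STATE | Capability.READ_HOLDOUT,
}
exposure = control_plane_exposure(granted)
```

Here `exposure` lists only the components that could change control-plane state, which gives a starting inventory for the rest of the threat model.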

Assets

  • model and code artifacts;
  • signing keys and release aliases;
  • contracts, prompts, policies, and scoring weights;
  • hidden evaluation data and labels;
  • training and user data;
  • lineage and audit evidence;
  • runtime credentials and tool permissions;
  • resource quotas and deployment infrastructure;
  • user trust and organizational decision processes.

Actors

  • external attacker;
  • malicious or compromised model supplier;
  • poisoned federated client;
  • insider with excessive privileges;
  • vulnerable runtime dependency;
  • candidate output attempting prompt or tool injection;
  • misconfigured automation;
  • evaluator or judge with correlated failure;
  • well-intentioned operator bypassing controls under pressure.

Abuse cases and controls

Abuse case | Primary control
Candidate reads hidden holdouts | Separate storage, credentials, and network zones
Package alters evaluator | Immutable evaluator image and no write path
Malicious artifact executes on load | Safe formats, scanners, isolated loader, no arbitrary hooks
Router sends sensitive data to wrong model | Data-class predicates and policy gate before ranking
Federated update poisons global model | Attestation, clipping, robust aggregation, central evaluation
Model output triggers a tool unsafely | Structured mediation, schema validation, approval tiers
Alias is changed without approval | Signed atomic aliases and append-only audit
Population grows without bound | External population and resource ceilings
Candidate hides behavior until production | Hidden suites, shadow traffic, canary, interpretability and anomaly tests
User becomes locked in | Portability, exit tests, interoperability, alternative workflows
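As an illustration of one row above, the "policy gate before ranking" control for the router can be sketched as a filter that runs before any scoring comparison. This is an assumption-laden example: the `Model` record, its `allowed_data_classes` field, and the `route` function are hypothetical names, and a real router would take its policy from a governed source rather than from the candidate itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    """Hypothetical registry entry for a routable model."""
    name: str
    score: float
    allowed_data_classes: frozenset


def route(request_data_class: str, models: List[Model]) -> Optional[Model]:
    # Policy gate first: drop every model not cleared for this data class.
    eligible = [m for m in models if request_data_class in m.allowed_data_classes]
    # Rank only among eligible models; return None (fail closed) if none qualify.
    return max(eligible, key=lambda m: m.score, default=None)


# Illustrative registry (assumed values).
models = [
    Model("large-general", 0.9, frozenset({"public"})),
    Model("small-cleared", 0.6, frozenset({"public", "pii"})),
]
```

With this ordering, a higher-scoring model can never win a request it is not cleared to see, because it is removed before ranking happens.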

Trust boundaries

Draw boundaries between candidate build, evaluation, registry, release, production runtime, user data, and governance. For each crossing, specify authentication, authorization, schema validation, encryption, rate limits, logging, and failure behavior.
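A simple completeness check can help keep boundary specifications honest. The sketch below assumes each crossing is described as a plain dictionary; the field names mirror the list in the paragraph above, while the `missing_fields` helper and the example crossing are hypothetical.

```python
# Properties every boundary crossing should specify, taken from the text above.
REQUIRED_FIELDS = (
    "authentication",
    "authorization",
    "schema_validation",
    "encryption",
    "rate_limits",
    "logging",
    "failure_behavior",
)


def missing_fields(crossing: dict) -> list:
    """Return the required properties that are absent or left empty."""
    return [field for field in REQUIRED_FIELDS if not crossing.get(field)]


# Illustrative crossing from evaluation to registry (assumed values).
eval_to_registry = {
    "authentication": "mTLS",
    "authorization": "write-once role",
    "schema_validation": "signed result schema v1",
    "encryption": "TLS in transit",
    "logging": "append-only audit",
    "failure_behavior": "",  # not yet decided, so it is reported as missing
}
gaps = missing_fields(eval_to_registry)
```

In this example `gaps` names the rate limit and failure behavior as unspecified, which is the kind of omission a review should catch before the boundary goes live.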

Threat-driven testing

pseudocode
FOR each threat IN threat_model
    test <- DESIGN_ADVERSARIAL_TEST(threat)
    result <- RUN_IN_STAGING(test)
    VERIFY_EXPECTED_CONTROL(threat, result)
    RECORD_RESIDUAL_RISK(threat, result)
END FOR
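The loop above can be sketched in Python under a few assumptions. Here each threat carries its own adversarial test as a callable that returns True when the expected control blocked the attack in staging; the `Threat` record, the `run_threat_tests` function, and the status strings are hypothetical, and a test that crashes is treated as unverified rather than as a pass.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Threat:
    """Hypothetical threat entry with an attached staging test."""
    name: str
    attack: Callable[[], bool]  # returns True if the control blocked the attack


def run_threat_tests(threats: List[Threat]) -> Dict[str, str]:
    """Run each adversarial test and record whether its control was verified."""
    register = {}
    for threat in threats:
        try:
            blocked = threat.attack()
        except Exception:
            blocked = False  # a crashing test verifies nothing
        if blocked:
            register[threat.name] = "control verified"
        else:
            register[threat.name] = "residual risk: control not verified"
    return register


def broken_test() -> bool:
    raise RuntimeError("staging environment unavailable")


# Illustrative threats with stand-in tests (assumed outcomes).
threats = [
    Threat("candidate reads hidden holdouts", lambda: True),
    Threat("alias changed without approval", lambda: False),
    Threat("package alters evaluator", broken_test),
]
register = run_threat_tests(threats)
```

Recording an explicit entry for every threat, including the ones whose tests failed to run, keeps unverified controls visible in the residual-risk record.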

Residual risk

No architecture removes all risk. Record assumptions such as trusted hardware, honest evaluators, protected signing keys, or human review quality. Revisit the model when capabilities, permissions, or deployment topology change.
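One possible way to notice when the threat model needs revisiting is to fingerprint the inputs it was reviewed against and compare on each change. This is only a sketch: the `context_fingerprint` and `review_due` helpers and the example values are hypothetical, and the three inputs simply mirror the capabilities, permissions, and deployment topology named above.

```python
import hashlib
import json


def context_fingerprint(capabilities, permissions, topology) -> str:
    """Hash the reviewed context; ordering within each list does not matter."""
    payload = json.dumps(
        {
            "capabilities": sorted(capabilities),
            "permissions": sorted(permissions),
            "topology": sorted(topology),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def review_due(last_reviewed: str, capabilities, permissions, topology) -> bool:
    """True when the current context differs from the one last reviewed."""
    return context_fingerprint(capabilities, permissions, topology) != last_reviewed


# Fingerprint stored at the last review (assumed values).
reviewed = context_fingerprint(
    ["tool:search"], ["read:registry"], ["staging", "production"]
)
```

A stored fingerprint like `reviewed` can be checked in automation so that a newly granted tool or a topology change prompts a review instead of passing unnoticed.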

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.