Threat model for model breeding

Protect the control plane

The main security concern is not a model “wanting” something. It is a candidate or compromised component gaining capabilities that let it alter evidence, policy, data, or release state. Threat modeling should be capability-based.
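
One way to make the capability-based view concrete is to list what each component is able to do and flag any grant that touches evidence, policy, data, or release state. The sketch below is illustrative only: the component names, the capability strings, and the `risky_grants` helper are hypothetical and do not describe any particular system.

```python
from dataclasses import dataclass, field

# Capabilities that would let a component alter evidence, policy, data,
# or release state. The capability names are illustrative assumptions.
CONTROL_PLANE_CAPABILITIES = frozenset({
    "write:evidence",
    "write:policy",
    "write:data",
    "write:release_state",
})


@dataclass(frozen=True)
class Component:
    name: str
    capabilities: frozenset = field(default_factory=frozenset)


def risky_grants(components):
    """Return {component name: sorted control-plane capabilities it holds}."""
    findings = {}
    for component in components:
        held = component.capabilities & CONTROL_PLANE_CAPABILITIES
        if held:
            findings[component.name] = sorted(held)
    return findings


# Hypothetical inventory used only to exercise the helper.
components = [
    Component("candidate_sandbox", frozenset({"read:train_data", "write:own_outputs"})),
    Component("release_bot", frozenset({"write:release_state", "read:evidence"})),
    Component("legacy_eval_job", frozenset({"write:evidence", "write:policy"})),
]

for name, held in risky_grants(components).items():
    print(f"{name}: {', '.join(held)}")
```

The review question then becomes whether each flagged grant is necessary, regardless of what the component is presumed to intend.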

Assets

model and code artifacts;
signing keys and release aliases;
contracts, prompts, policies, and scoring weights;
hidden evaluation data and labels;
training and user data;
lineage and audit evidence;
runtime credentials and tool permissions;
resource quotas and deployment infrastructure;
user trust and organizational decision processes.

Actors

external attacker;
malicious or compromised model supplier;
poisoned federated client;
insider with excessive privileges;
vulnerable runtime dependency;
candidate output attempting prompt or tool injection;
misconfigured automation;
evaluator or judge with correlated failure;
well-intentioned operator bypassing controls under pressure.

Abuse cases and controls

Abuse case	Primary control
Candidate reads hidden holdouts	Separate storage, credentials, and network zones
Package alters evaluator	Immutable evaluator image and no write path
Malicious artifact executes on load	Safe formats, scanners, isolated loader, no arbitrary hooks
Router sends sensitive data to wrong model	Data-class predicates and policy gate before ranking
Federated update poisons global model	Attestation, clipping, robust aggregation, central evaluation
Model output triggers a tool unsafely	Structured mediation, schema validation, approval tiers
Alias is changed without approval	Signed atomic aliases and append-only audit
Population grows without bound	External population and resource ceilings
Candidate hides behavior until production	Hidden suites, shadow traffic, canary releases, interpretability and anomaly tests
User becomes locked in	Portability, exit tests, interoperability, alternative workflows
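
As an illustration of one row in the table, "data-class predicates and policy gate before ranking" can be read as: filter the candidate models by what data they are cleared to receive, and only then pick the best of the survivors. The following is a minimal sketch under that reading; the data-class names, the `Model` fields, and the `route` function are assumptions made for the example.

```python
from dataclasses import dataclass

# Ordered sensitivity levels; the names are illustrative assumptions.
DATA_CLASSES = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass(frozen=True)
class Model:
    name: str
    max_data_class: str  # highest data class this model is cleared to receive
    quality: float       # hypothetical ranking score


def eligible(model, request_data_class):
    """Data-class predicate: the model must be cleared for the request's class."""
    return DATA_CLASSES[model.max_data_class] >= DATA_CLASSES[request_data_class]


def route(models, request_data_class):
    """Apply the policy gate first, then rank only the models that passed it."""
    if request_data_class not in DATA_CLASSES:
        raise ValueError(f"unknown data class: {request_data_class!r}")
    allowed = [m for m in models if eligible(m, request_data_class)]
    if not allowed:
        # Fail closed: refusing is safer than falling back to an uncleared model.
        raise PermissionError(f"no model cleared for {request_data_class!r} data")
    return max(allowed, key=lambda m: m.quality)


# Hypothetical model pool used only to exercise the router.
models = [
    Model("large-external", "public", quality=0.95),
    Model("mid-internal", "confidential", quality=0.80),
    Model("small-onprem", "restricted", quality=0.60),
]

for data_class in ("public", "confidential", "restricted"):
    print(data_class, "->", route(models, data_class).name)
```

Because ranking never sees an ineligible model, a scoring bug or a manipulated quality score cannot route sensitive data to the wrong destination.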

Trust boundaries

Draw boundaries between candidate build, evaluation, registry, release, production runtime, user data, and governance. For each crossing, specify authentication, authorization, schema validation, encryption, rate limits, logging, and failure behavior.
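
The per-crossing checklist above can be captured as a small record so that incomplete or missing specifications are easy to detect in review. This is a sketch, not a prescribed schema: the `CrossingSpec` type, the zone identifiers, and every example value are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CrossingSpec:
    source: str
    target: str
    authentication: str
    authorization: str
    schema_validation: str
    encryption: str
    rate_limit: str
    logging: str
    failure_behavior: str


def unspecified_fields(spec):
    """Return names of fields left blank, so reviews can reject partial specs."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


def missing_crossings(required, specs):
    """Return required (source, target) pairs that have no spec at all."""
    covered = {(s.source, s.target) for s in specs}
    return sorted(set(required) - covered)


# Hypothetical example: one crossing is specified but incomplete,
# and another required crossing has no spec yet.
specs = [
    CrossingSpec(
        source="candidate_build",
        target="evaluation",
        authentication="mutual TLS with workload identity",
        authorization="submit-only role",
        schema_validation="artifact manifest schema",
        encryption="TLS in transit",
        rate_limit="10 submissions per hour",
        logging="append-only audit record",
        failure_behavior="",  # left blank on purpose to show detection
    ),
]
required = [("candidate_build", "evaluation"), ("evaluation", "registry")]

print(unspecified_fields(specs[0]))
print(missing_crossings(required, specs))
```

Treating a blank `failure_behavior` as a review failure forces an explicit decision on whether each crossing fails open or closed.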

Threat-driven testing

pseudocode

FOR each threat IN threat_model
    test <- DESIGN_ADVERSARIAL_TEST(threat)
    result <- RUN_IN_STAGING(test)
    control_held <- VERIFY_EXPECTED_CONTROL(threat, result)
    RECORD_RESIDUAL_RISK(threat, result, control_held)
END FOR
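
The loop above might look like the following in Python. It is a sketch under stated assumptions: the `Threat` and `Finding` types and the `attempt` callback are hypothetical stand-ins for whatever designs and runs the adversarial test in staging, and the lambdas in the example are stubs rather than real tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Threat:
    name: str
    expected_control: str
    # Hypothetical hook: runs the adversarial test in staging and returns
    # True if the expected control blocked the attempt.
    attempt: Callable[[], bool]


@dataclass
class Finding:
    threat: str
    control: str
    control_held: bool


def run_threat_tests(threat_model):
    """Run every threat's test and record whether its control held."""
    findings = []
    for threat in threat_model:
        try:
            held = bool(threat.attempt())
        except Exception:
            # A test that crashes proves nothing; treat it as a failed control.
            held = False
        findings.append(Finding(threat.name, threat.expected_control, held))
    return findings


# Stubbed threat model used only to exercise the loop.
threat_model = [
    Threat("candidate reads hidden holdouts", "separate credentials", lambda: True),
    Threat("alias changed without approval", "signed atomic aliases", lambda: False),
]

for finding in run_threat_tests(threat_model):
    status = "held" if finding.control_held else "RESIDUAL RISK"
    print(f"{finding.threat}: {finding.control} -> {status}")
```

Counting a crashed test as a failed control keeps a broken harness from being mistaken for evidence that the control works.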

Residual risk

No architecture removes all risk. Record assumptions such as trusted hardware, honest evaluators, protected signing keys, or human review quality. Revisit the model when capabilities, permissions, or deployment topology change.
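
One possible way to make "revisit the model when things change" mechanical is to store, with each risk record, a fingerprint of the configuration it was reviewed against. The sketch below assumes the capabilities, permissions, and topology can be exported as a JSON-serializable snapshot; the `RiskRecord` type, the snapshot layout, and the example values are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(snapshot):
    """Stable SHA-256 hash of a JSON-serializable configuration snapshot."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RiskRecord:
    threat: str
    assumptions: tuple          # e.g. "signing keys are protected"
    reviewed_fingerprint: str   # configuration the review was based on


def needs_review(record, current_snapshot):
    """True when the configuration differs from the one that was reviewed."""
    return record.reviewed_fingerprint != fingerprint(current_snapshot)


# Hypothetical snapshot and record used only to exercise the check.
snapshot = {
    "capabilities": ["tool:search"],
    "permissions": {"release_bot": ["alias:write"]},
    "topology": ["staging", "prod-eu"],
}
record = RiskRecord(
    threat="alias changed without approval",
    assumptions=("signing keys are protected", "human review is careful"),
    reviewed_fingerprint=fingerprint(snapshot),
)

changed = dict(snapshot, topology=["staging", "prod-eu", "prod-us"])
print(needs_review(record, snapshot))
print(needs_review(record, changed))
```

Dictionary keys are sorted before hashing, but list order still matters, so snapshots should emit lists in a consistent order to avoid spurious review prompts.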

Source reports used for this guide

These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.

Aggressive Mutualism: Safety, Governance, and Containment Analysis (speculative risk scenarios, risk analysis, 42.0 KB)
Instrumental Drives in Powerful AI Systems (speculative risk scenarios, risk analysis, 42.2 KB)
The 4Fs Framework: Fast, Flexible, Frugal, Federated (core synthesis, emerging practice, 22.5 KB)

Related guides

Safety and governance

Safety invariants

Mutualism versus dependency

Instrumental-drive containment