Protect the control plane
The main security concern is not a model “wanting” something. It is a candidate or compromised component gaining capabilities that let it alter evidence, policy, data, or release state. Threat modeling should be capability-based.
Assets
- model and code artifacts;
- signing keys and release aliases;
- contracts, prompts, policies, and scoring weights;
- hidden evaluation data and labels;
- training and user data;
- lineage and audit evidence;
- runtime credentials and tool permissions;
- resource quotas and deployment infrastructure;
- user trust and organizational decision processes.
Actors
- external attacker;
- malicious or compromised model supplier;
- poisoned federated client;
- insider with excessive privileges;
- vulnerable runtime dependency;
- candidate output attempting prompt or tool injection;
- misconfigured automation;
- evaluator or judge with correlated failure;
- well-intentioned operator bypassing controls under pressure.
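One way to make the capability-based framing concrete is to record each permission as an (actor, capability, asset) grant and check the grants against a deny list. The sketch below is illustrative only: the `Capability` values, the `Grant` type, and the `FORBIDDEN` set are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    APPROVE = "approve"


@dataclass(frozen=True)
class Grant:
    actor: str              # who holds the capability
    capability: Capability  # what they can do
    asset: str              # what they can do it to


# Hypothetical deny list: (capability, asset) pairs that no
# candidate-side or automated actor should ever hold.
FORBIDDEN = {
    (Capability.READ, "hidden evaluation data and labels"),
    (Capability.WRITE, "lineage and audit evidence"),
    (Capability.WRITE, "signing keys and release aliases"),
}


def violations(grants):
    """Return the grants that give an actor a forbidden capability."""
    return [g for g in grants if (g.capability, g.asset) in FORBIDDEN]


grants = [
    Grant("candidate build", Capability.READ, "training and user data"),
    Grant("candidate build", Capability.READ, "hidden evaluation data and labels"),
    Grant("misconfigured automation", Capability.WRITE, "signing keys and release aliases"),
]

for g in violations(grants):
    print(f"{g.actor} must not {g.capability.value} {g.asset}")
```

The check asks only what an actor is able to do, never what it intends, so the same rule catches a hostile supplier and a misconfigured job alike.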
Abuse cases and controls
| Abuse case | Primary control |
|---|---|
| Candidate reads hidden holdouts | Separate storage, credentials, and network zones |
| Package alters evaluator | Immutable evaluator image and no write path |
| Malicious artifact executes on load | Safe formats, scanners, isolated loader, no arbitrary hooks |
| Router sends sensitive data to wrong model | Data-class predicates and policy gate before ranking |
| Federated update poisons global model | Attestation, clipping, robust aggregation, central evaluation |
| Model output triggers a tool unsafely | Structured mediation, schema validation, approval tiers |
| Alias is changed without approval | Signed atomic aliases and append-only audit |
| Population grows without bound | Externally enforced population and resource ceilings |
| Candidate hides behavior until production | Hidden suites, shadow traffic, canary, interpretability and anomaly tests |
| User becomes locked in | Portability, exit tests, interoperability, alternative workflows |
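As an illustration of one row in the table, "data-class predicates and policy gate before ranking" could look roughly like the sketch below. The `Model` type, the data-class labels, and the `score` field are hypothetical. The point is only the ordering: eligibility is decided before any quality ranking, and the router fails closed when nothing is eligible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    allowed_data_classes: frozenset  # data classes this model is cleared for
    score: float                     # hypothetical quality score used for ranking


def route(request_data_class, models):
    """Apply the policy gate first, then rank only the models that passed it."""
    eligible = [m for m in models if request_data_class in m.allowed_data_classes]
    if not eligible:
        # Fail closed: never fall back to an uncleared model.
        raise PermissionError(
            f"no model is cleared for data class {request_data_class!r}"
        )
    return max(eligible, key=lambda m: m.score)


models = [
    Model("external-large", frozenset({"public"}), score=0.95),
    Model("internal-small", frozenset({"public", "confidential"}), score=0.80),
]

print(route("public", models).name)        # external-large
print(route("confidential", models).name)  # internal-small
```

Because the higher-scoring model is removed before ranking, a strong score cannot override the data-class policy.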
Trust boundaries
Draw boundaries between candidate build, evaluation, registry, release, production runtime, user data, and governance. For each crossing, specify authentication, authorization, schema validation, encryption, rate limits, logging, and failure behavior.
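The per-crossing checklist above can be kept as data so that gaps are visible. The following is a minimal sketch, assuming a hypothetical `Crossing` record with one field per property. The example values are placeholders, not recommendations.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Crossing:
    source: str
    target: str
    authentication: Optional[str] = None
    authorization: Optional[str] = None
    schema_validation: Optional[str] = None
    encryption: Optional[str] = None
    rate_limits: Optional[str] = None
    logging: Optional[str] = None
    failure_behavior: Optional[str] = None


def unspecified(crossing):
    """List the properties that have not been decided for this crossing."""
    return [f.name for f in fields(crossing) if getattr(crossing, f.name) is None]


# Placeholder values for illustration only.
crossing = Crossing(
    source="evaluation",
    target="registry",
    authentication="mutual TLS",
    authorization="evaluator role may write results only",
    schema_validation="result schema checked on ingest",
    encryption="in transit and at rest",
    logging="append-only audit record per write",
)

print(unspecified(crossing))  # ['rate_limits', 'failure_behavior']
```

A review could treat any crossing with a non-empty `unspecified` list as incomplete.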
Threat-driven testing
FOR each threat IN threat_model
    test <- DESIGN_ADVERSARIAL_TEST(threat)
    RUN_IN_STAGING(test)
    VERIFY_EXPECTED_CONTROL(test)
    RECORD_RESIDUAL_RISK(threat, test.result)
END FOR
Residual risk
No architecture removes all risk. Record assumptions such as trusted hardware, honest evaluators, protected signing keys, or human review quality. Revisit the model when capabilities, permissions, or deployment topology change.
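The threat-driven testing loop and the residual-risk record can be sketched together. In this sketch the `Threat` and `RiskRecord` types are hypothetical, and each adversarial test is a stand-in callable that returns `True` when the expected control held in staging. A real test would exercise the staging environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Threat:
    name: str
    control: str
    adversarial_test: Callable[[], bool]  # True if the control held
    assumptions: List[str] = field(default_factory=list)


@dataclass
class RiskRecord:
    threat: str
    control_held: bool
    assumptions: List[str]


def run_threat_tests(threat_model):
    """Run every adversarial test and record the outcome with its assumptions."""
    register = []
    for threat in threat_model:
        try:
            held = bool(threat.adversarial_test())
        except Exception:
            # A test that crashes is not evidence that the control works.
            held = False
        register.append(RiskRecord(threat.name, held, list(threat.assumptions)))
    return register


# Stand-in tests for illustration; the lambdas simulate staging results.
threat_model = [
    Threat(
        name="Candidate reads hidden holdouts",
        control="Separate storage, credentials, and network zones",
        adversarial_test=lambda: True,
        assumptions=["network zones are configured as documented"],
    ),
    Threat(
        name="Alias is changed without approval",
        control="Signed atomic aliases and append-only audit",
        adversarial_test=lambda: False,
        assumptions=["signing keys are protected"],
    ),
]

for record in run_threat_tests(threat_model):
    status = "control held" if record.control_held else "RESIDUAL RISK"
    print(f"{record.threat}: {status}; assumes {record.assumptions}")
```

Storing the assumptions next to each result makes it easier to see which records need rechecking when capabilities, permissions, or deployment topology change.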
Source reports used for this guide
These reports are preserved verbatim in the site archive. The guide above is an editorial synthesis and may narrow, qualify, or reorganize claims from the source material.