AI Agent Threat Model: What Can Go Wrong
What Is an AI Agent Threat Model
A threat model is a structured representation of everything that can go wrong with a system from a security perspective. For AI agents, this means cataloging the ways an attacker (or the agent itself through misalignment) can cause harm, estimating the probability and impact of each scenario, and mapping those scenarios to specific defensive controls. The goal is not to eliminate all risk, which is impossible, but to understand the risk landscape well enough to make informed decisions about where to invest in security.
Traditional threat modeling frameworks like STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) remain relevant for the infrastructure components of an agent system. But they do not fully capture the novel threats introduced by natural language processing, autonomous decision-making, and tool use. AI agent threat models extend these frameworks with categories specific to LLM-based systems.
Attacker Profiles
Different attackers target AI agents for different reasons, and understanding their profiles helps prioritize defenses.
Opportunistic attackers probe AI agents for fun or curiosity, attempting basic jailbreaks and prompt injection to see what they can make the agent do. These attackers typically use well-known techniques copied from online tutorials and social media. They represent the highest volume of attacks but the lowest sophistication. Basic input validation and prompt hardening are usually sufficient to stop them.
Targeted attackers have a specific objective, such as extracting proprietary data, gaining unauthorized access to systems the agent connects to, or manipulating the agent to perform specific actions on their behalf. These attackers invest time in understanding how the target agent works, what tools it has access to, and what data it can reach. They craft custom injection payloads tailored to the specific system. Defending against targeted attackers requires defense in depth because any single control can be bypassed with sufficient effort.
Insider threats come from users who have legitimate access to the agent but attempt to exceed their authorized scope. An employee might use a customer service agent to access records they are not authorized to view. A developer might exploit debugging interfaces to extract model weights or system prompts. Insider threats are particularly dangerous because the attacker already has some level of trust and access within the system.
Automated threats include bots, scrapers, and automated attack tools that interact with agents at scale. These can overwhelm weak rate limits, test thousands of injection variants, or attempt to map out the full capability set of the agent through systematic probing. Defending against automated threats requires robust rate limiting, challenge-response mechanisms, and anomaly detection tuned for high-volume interaction patterns.
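As one illustration of the rate limiting mentioned above, here is a minimal token bucket sketch. The class name, the parameters, and the injectable clock are assumptions made for this example; a real deployment would typically keep one bucket per client identity and store it somewhere shared.

```python
import time


class TokenBucket:
    """Illustrative per-client rate limiter (hypothetical, not a specific library's API)."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits in the current budget."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket allows short bursts up to its capacity while capping the sustained request rate, which makes systematic probing with thousands of variants slower and easier to spot.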
Attack Vectors by Layer
Each layer of the agent architecture presents distinct attack vectors that must be modeled independently.
Prompt layer attacks target the instructions that guide agent behavior. System prompt extraction attempts to reveal the confidential instructions that define how the agent operates. System prompt override attempts to replace those instructions with attacker-controlled ones. Context poisoning inserts malicious content into the conversation history or retrieval context that influences future agent decisions. Instruction smuggling hides malicious directives inside seemingly benign content, such as embedding instructions in whitespace characters, Unicode variations, or comment fields.
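Instruction smuggling through invisible characters can be partly detected before text reaches the model. The sketch below flags characters in the Unicode format, private-use, and unassigned categories. The function names and the choice of categories are assumptions for illustration. The check is not exhaustive, and stripping format characters can alter legitimate text such as emoji sequences that rely on zero-width joiners.

```python
import unicodedata

# Unicode general categories treated as suspicious in this sketch (an assumption):
# Cf = format characters (zero-width spaces, tag characters), Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cf", "Co", "Cn"}


def find_hidden_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for each character in a suspicious category."""
    findings = []
    for index, char in enumerate(text):
        if unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
            findings.append((index, f"U+{ord(char):04X}"))
    return findings


def strip_hidden_characters(text: str) -> str:
    """Apply NFKC normalization, then drop characters in the suspicious categories."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        char for char in normalized
        if unicodedata.category(char) not in SUSPICIOUS_CATEGORIES
    )
```

Flagging is usually safer than silently stripping: a finding can be logged and the input rejected or routed for review, which also gives the anomaly detection layer a signal to work with.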
Tool layer attacks exploit the interface between the agent and its tools. Tool parameter injection crafts inputs that, when passed through to a tool, exploit vulnerabilities in that tool (similar to SQL injection or command injection in traditional applications). Tool confusion tricks the agent into calling the wrong tool or calling the right tool with unintended parameters. Permission escalation attempts to use one tool to gain access to another tool that the agent should not be able to use.
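Tool parameter injection is mitigated the same way as classic injection: treat every model-supplied argument as untrusted. The following sketch assumes a hypothetical order lookup tool backed by SQLite; the table name, the column names, and the ID format are invented for the example.

```python
import re
import sqlite3

# Hypothetical ID format for this example: two capital letters, a dash, six digits.
ORDER_ID_PATTERN = re.compile(r"[A-Z]{2}-\d{6}")


def lookup_order(conn: sqlite3.Connection, order_id: str) -> list[tuple]:
    """Hypothetical tool handler: validate the argument, then bind it as a parameter."""
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError("order_id does not match the expected format")
    # The ? placeholder keeps model-supplied text out of the SQL grammar entirely.
    cursor = conn.execute(
        "SELECT id, status FROM orders WHERE id = ?", (order_id,)
    )
    return cursor.fetchall()
```

The two checks are deliberately redundant. The format check rejects malformed input early, and the bound parameter still protects the query if the pattern is later loosened.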
Data layer attacks target the information the agent processes and produces. Training data poisoning introduces malicious patterns during fine-tuning that create backdoors or biases. Retrieval poisoning manipulates the documents or data sources that the agent queries during operation, injecting false information or hidden instructions. Output manipulation causes the agent to produce responses that serve the attacker, such as embedding invisible tracking pixels, generating phishing content, or including malicious links.
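One way to limit output manipulation is to check links in agent responses against an allowlist before they are shown to the user. This is a minimal sketch: the host names are placeholders, the URL pattern is intentionally simple, and it would not catch links hidden behind redirects or obfuscated encodings.

```python
import re
from urllib.parse import urlparse

# Placeholder hosts for illustration; a real list would come from configuration.
ALLOWED_HOSTS = {"docs.example.com", "support.example.com"}

# Deliberately simple pattern: http(s) URLs up to whitespace or a closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s)>\"']+")


def disallowed_links(output: str) -> list[str]:
    """Return every URL in the agent output whose host is not on the allowlist."""
    links = URL_PATTERN.findall(output)
    return [url for url in links if urlparse(url).hostname not in ALLOWED_HOSTS]
```

A response with a non-empty result can be blocked, rewritten, or flagged, depending on how conservative the deployment needs to be.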
Infrastructure layer attacks target the compute and network environment. Container escape attempts to break out of the sandboxed execution environment. Credential harvesting searches the runtime environment for API keys, tokens, and other secrets. Network probing maps the internal network from within the container to identify other services that can be attacked. Side-channel attacks extract information through timing variations, resource usage patterns, or error messages.
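Credential harvesting is harder when secrets never reach the tool's process in the first place. The sketch below passes only an allowlisted set of environment variables to a tool subprocess. The allowlist contents and function names are assumptions for illustration, and the right set of variables depends on the platform and the tool.

```python
import os
import subprocess

# Hypothetical allowlist: only variables the tool needs in order to run.
ALLOWED_ENV_VARS = {"PATH", "LANG", "HOME"}


def scrubbed_environment(environ: dict[str, str]) -> dict[str, str]:
    """Keep allowlisted variables and drop everything else, including secrets."""
    return {key: value for key, value in environ.items() if key in ALLOWED_ENV_VARS}


def run_tool(argv: list[str]) -> str:
    """Run a tool command with the scrubbed environment and return its stdout."""
    result = subprocess.run(
        argv,
        env=scrubbed_environment(dict(os.environ)),
        capture_output=True,
        text=True,
        timeout=10,   # bound how long a tool call can run
        check=True,   # raise if the tool exits with a non-zero status
    )
    return result.stdout
```

An allowlist is used rather than a blocklist of known secret names because new secrets added to the parent environment are then excluded by default.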
Impact Analysis Framework
Not all threats have equal impact. A structured impact analysis helps prioritize defensive investments by considering several dimensions:
Data sensitivity measures the value and confidentiality of information the agent can access. An agent with access to PII, financial records, or trade secrets has a much higher data sensitivity score than one that only accesses public information. Higher data sensitivity demands stronger access controls, encryption, and monitoring.
Action scope measures the breadth and severity of actions the agent can take. An agent that can only generate text has limited action scope. An agent that can send emails, modify databases, deploy code, or transfer funds has extensive action scope. Wider scope demands more granular permission controls, confirmation requirements for high-impact actions, and stricter behavioral monitoring.
Blast radius measures how far the consequences of a compromised agent can spread. An isolated agent with no connections to other systems has a contained blast radius. An agent that connects to multiple internal services, shares context with other agents, or has credentials that grant access to production infrastructure has an expansive blast radius. Larger blast radii demand stronger containment through sandboxing, network segmentation, and credential scoping.
Recovery difficulty measures how hard it is to undo the damage caused by a successful attack. Some actions are easily reversible (like generating incorrect text that can be corrected). Others are extremely difficult to reverse (like exfiltrated data that cannot be un-leaked, or deleted records that had no backup). Higher recovery difficulty justifies stronger preventive controls and more conservative permission models.
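The four dimensions above can be recorded as a simple profile per agent. The sketch below assumes a 1 to 5 ordinal scale for each dimension and takes the maximum as the overall impact rating; both choices are assumptions for illustration rather than part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactProfile:
    """Ordinal ratings from 1 (low) to 5 (high); the scale is an assumption."""

    data_sensitivity: int
    action_scope: int
    blast_radius: int
    recovery_difficulty: int

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed scale so typos do not skew the result.
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    def overall_impact(self) -> int:
        """Use the worst dimension so one severe rating is not averaged away."""
        return max(
            self.data_sensitivity,
            self.action_scope,
            self.blast_radius,
            self.recovery_difficulty,
        )
```

Taking the maximum is a conservative choice. An agent that only reads public data but can transfer funds still rates as high impact, which an average would obscure.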
Building Your Threat Model
A practical threat modeling exercise for an AI agent deployment follows these steps:
Step 1: Map the system. Document every component of the agent system including the language model, the system prompt, all tools and APIs the agent can access, all data sources the agent queries, the execution environment, and any other agents it interacts with. Create a data flow diagram showing how information moves through the system.
Step 2: Identify trust boundaries. Mark every point where data crosses from one trust level to another. The boundaries between user input and agent processing, between the agent and a tool API, and between the agent and a database are all trust boundaries. Each trust boundary is a potential attack surface.
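Steps 1 and 2 can be captured in a small data structure so that trust boundaries are derived from the system map instead of being listed by hand. The component names and trust level labels below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    trust_level: str  # free-form label such as "untrusted" or "internal" (an assumption)


@dataclass(frozen=True)
class DataFlow:
    source: Component
    destination: Component
    description: str


def trust_boundary_crossings(flows: list[DataFlow]) -> list[DataFlow]:
    """Return the flows whose endpoints sit at different trust levels."""
    return [
        flow for flow in flows
        if flow.source.trust_level != flow.destination.trust_level
    ]
```

For example, with a user marked "untrusted", an agent and its memory store marked "agent_runtime", and a CRM API marked "internal", the user-to-agent and agent-to-CRM flows are reported as crossings while the agent-to-memory flow is not.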
Step 3: Enumerate threats. For each trust boundary, systematically consider what attacks are possible. Use the attack vector categories described above as a checklist. For each potential threat, document the attack technique, the prerequisite conditions, and the potential impact.
Step 4: Assess risk. For each threat, estimate the likelihood (considering attacker motivation, capability, and the difficulty of the attack) and the impact (using the framework above). Combine likelihood and impact into a risk score, for example by multiplying two ordinal ratings, so that threats can be ranked and the highest-risk ones addressed first.
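A common way to combine the two estimates is a likelihood times impact matrix. The sketch below assumes both are rated on a 1 to 5 ordinal scale; the threat names in the usage note are examples, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # 1 (rare) to 5 (expected); the scale is an assumption
    impact: int      # 1 (minor) to 5 (severe); the scale is an assumption

    @property
    def risk_score(self) -> int:
        """Simple multiplicative score, ranging from 1 to 25."""
        return self.likelihood * self.impact


def prioritize(threats: list[Threat]) -> list[Threat]:
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=lambda threat: threat.risk_score, reverse=True)
```

With this scheme, a retrieval poisoning threat rated likelihood 3 and impact 5 (score 15) ranks above a system prompt extraction threat rated likelihood 5 and impact 2 (score 10), even though the latter is attempted more often.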
Step 5: Map mitigations. For each high-priority threat, identify the defensive controls that reduce either the likelihood or the impact. Verify that the combination of controls provides defense in depth, meaning that no single control failure leaves the threat unmitigated.
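The defense-in-depth check in Step 5 can be automated against a mitigation map. This sketch assumes the map is a dictionary from threat name to the set of controls that address it, and it treats fewer than two controls as a gap; the threshold is an assumption, and counting controls does not prove that they fail independently.

```python
def single_points_of_failure(
    mitigations: dict[str, set[str]],
    minimum_controls: int = 2,
) -> list[str]:
    """Return threats covered by fewer than minimum_controls distinct controls."""
    return sorted(
        threat for threat, controls in mitigations.items()
        if len(controls) < minimum_controls
    )
```

Any threat returned by this check depends on a single control, or on none at all, so one failure would leave it unmitigated.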
Step 6: Validate and iterate. Test the threat model through red-team exercises, security reviews, and incident analysis. Update the model when the agent gains new capabilities, connects to new systems, or when new attack techniques emerge.
Threat modeling for AI agents extends traditional frameworks by accounting for natural language manipulation, autonomous behavior, and tool-enabled actions. Start by mapping the system and trust boundaries, then systematically enumerate threats across prompt, tool, data, and infrastructure layers. Prioritize defenses based on data sensitivity, action scope, blast radius, and recovery difficulty.