Anomaly Detection in Agent Behavior
What Counts as Anomalous Agent Behavior
An anomaly is any behavior that departs meaningfully from the agent's established norm. Anomalies come in two broad shapes. Point anomalies are individual events that stand out sharply: a single response that is ten times longer than usual, a task that called a tool four hundred times, a request that cost a hundred times the median. Pattern anomalies are shifts in the distribution of behavior over time: a steady climb in the rate of refusals, a gradual drift in the topics the agent discusses, a slow increase in how often it gets stuck in loops.
What unites them is deviation from an expected baseline. Anomaly detection does not require knowing in advance exactly what will go wrong; it requires knowing what normal looks like and flagging departures from it. This is what makes it complementary to accuracy monitoring, which measures performance against known-correct answers. Anomaly detection catches the unexpected, including failure modes that no one anticipated and that therefore have no specific test, which is exactly the category of problem that learning systems tend to produce.
Why Learning Systems Especially Need Anomaly Detection
A static agent that worked yesterday will, given similar inputs, behave much the same way today, so its anomalies usually come from the outside, such as a change in inputs or a dependency failure. A learning agent changes itself, which means it can generate anomalies from within. A memory update can cause it to retrieve misleading context. A prompt refinement can have unintended side effects on cases it was not meant to touch. A fine-tuning run can introduce subtle behavioral shifts that no targeted test happens to cover.
This internal source of change makes anomaly detection more important for learning agents than for static ones. The very mechanisms that drive improvement are also potential sources of regression, and because learning is continuous, so is the risk. An anomaly detector acts as a safety net under the learning process, watching for the moment when a change meant to help instead introduces behavior outside the normal envelope. Without it, a learning system can drift into trouble between the scheduled evaluations, with no one aware until the damage shows up in user complaints.
Signals That Reveal Anomalies
Anomalies show up across several observable dimensions, and a thorough detector watches all of them. Operational signals are the easiest to monitor: cost per task, latency, token consumption, the number of steps or tool calls per task, and error rates. A sudden change in any of these often indicates a behavioral problem, such as an agent that has started looping, retrying excessively, or producing runaway output.
Output signals concern the content the agent produces: its length, its structure, the rate at which it refuses or hedges, the frequency of particular phrases, and its adherence to expected formats. A spike in refusals or a collapse in output diversity can signal that something has gone wrong with the agent's behavior. Tool-usage signals track which tools the agent calls and in what patterns; an agent that suddenly stops using a tool it used to rely on, or starts hammering one it rarely touched, is behaving anomalously. Watching all these dimensions together gives a fuller picture than any one alone, and it builds directly on the telemetry described in agent monitoring and logging.
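As a rough illustration of the tool-usage signal, the sketch below compares the mix of tools called in a baseline window against a recent window using total variation distance. The function name, the tool names, and the choice of distance are assumptions made for this example, not something the monitoring stack prescribes.

```python
from collections import Counter


def tool_usage_shift(baseline_calls, recent_calls):
    """Total variation distance between two tool-call distributions.

    Inputs are lists of tool names, one entry per call (hypothetical format).
    Returns 0.0 when the mix of tools is identical and 1.0 when the two
    windows share no tools at all.
    """
    base = Counter(baseline_calls)
    recent = Counter(recent_calls)
    base_total = sum(base.values())
    recent_total = sum(recent.values())
    if base_total == 0 or recent_total == 0:
        raise ValueError("both windows need at least one tool call")
    tools = set(base) | set(recent)
    # Counter returns 0 for tools missing from one window.
    return 0.5 * sum(
        abs(base[t] / base_total - recent[t] / recent_total) for t in tools
    )
```

A value near zero means the agent is using its tools in roughly the same proportions as before; what counts as a worrying shift is something to calibrate against your own history rather than a fixed number.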
Statistical and Threshold-Based Detection
The simplest and often most effective anomaly detection is statistical. For each monitored signal, you establish the normal range from historical data, then flag values that fall far outside it. A common approach is to compute a rolling mean and standard deviation and flag any value more than a set number of standard deviations from the mean (three is a common choice), which catches point anomalies in metrics like cost, latency, and output length. Percentile-based thresholds, which flag anything beyond, for example, the ninety-ninth percentile of the historical distribution, are robust to skew and easy to reason about.
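The rolling mean and standard deviation approach might look something like the following minimal sketch. The class name, window size, threshold, and minimum sample count are illustrative assumptions, and the detector handles one signal at a time.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flags values far from the rolling mean of a single signal.

    All defaults here are illustrative assumptions, not recommendations.
    """

    def __init__(self, window=50, threshold=3.0, min_samples=10):
        if min_samples < 2:
            raise ValueError("min_samples must be at least 2 to compute a deviation")
        self.values = deque(maxlen=window)  # most recent observations only
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value is anomalous against the window, then record it."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = stdev(self.values)
            # A perfectly flat window has sigma == 0; skip rather than divide by zero.
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        self.values.append(value)
        return is_anomaly
```

This version records every value, including flagged ones, so a sustained shift is eventually absorbed into the window. Whether that is desirable depends on whether you want the baseline to follow the agent or stay pinned until someone updates it deliberately.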
Threshold-based detection extends this with fixed limits derived from domain knowledge: no task should call a tool more than a set number of times, no response should exceed a certain length, no single request should cost more than a defined ceiling. These hard limits catch the most dangerous anomalies, such as runaway loops, immediately and without waiting for a statistical baseline to accumulate. Statistical and threshold methods are inexpensive, interpretable, and a strong first line of defense, and most learning systems should implement them before reaching for anything more sophisticated.
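Hard limits of this kind can be expressed as a small, explicit check. In the sketch below, the field names and the numeric ceilings are placeholders chosen for illustration; real values would come from your own domain knowledge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardLimits:
    # Placeholder ceilings for illustration only.
    max_tool_calls: int = 50
    max_response_chars: int = 20_000
    max_cost_usd: float = 2.00


def check_hard_limits(tool_calls, response_chars, cost_usd, limits=HardLimits()):
    """Return the names of any limits the task exceeded (empty list if none)."""
    violations = []
    if tool_calls > limits.max_tool_calls:
        violations.append("max_tool_calls")
    if response_chars > limits.max_response_chars:
        violations.append("max_response_chars")
    if cost_usd > limits.max_cost_usd:
        violations.append("max_cost_usd")
    return violations
```

Because the check needs no history, it can run from the first task onward, which is the property that makes hard limits useful before a statistical baseline exists.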
Model-Based and Semantic Anomaly Detection
Some anomalies are invisible to simple statistics because they live in the meaning of the agent's behavior rather than in its measurable surface. An agent might produce responses of perfectly normal length and cost that are nonetheless subtly off-topic, factually degraded, or stylistically wrong. Catching these requires looking at content semantically.
One approach embeds the agent's outputs into vectors and watches for drift in the distribution of those embeddings over time, which can reveal that the agent has started talking about different things or in a different way even when the surface metrics look stable. Another uses a separate model as a judge to score a sample of outputs for quality, coherence, or appropriateness, flagging when those scores fall outside the normal range. These model-based methods are more expensive and more complex than statistical ones, and they introduce their own noise, so they are best applied to a sample of traffic rather than every interaction, and used to complement statistical detection rather than replace it.
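One simple way to watch for embedding drift, sketched below, is to compare the centroid of a baseline sample of output embeddings with the centroid of a recent sample. This assumes the embeddings have already been produced by whatever embedding model you use and are plain lists of floats; the function names are hypothetical, and centroid distance is only one coarse choice among many possible drift measures.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline_vectors, recent_vectors):
    """Cosine distance between the centroids of two embedding samples."""
    return cosine_distance(centroid(baseline_vectors), centroid(recent_vectors))
```

A centroid comparison can miss changes that leave the average unchanged, such as a collapse in diversity, so it is best read as a cheap first indicator rather than a complete picture of semantic drift.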
Distinguishing Anomalies from Healthy Change
The hardest problem in anomaly detection for learning systems is that not every deviation is bad. A learning agent is supposed to change, and a successful improvement will, by definition, move its behavior away from the previous baseline. An anomaly detector that fires on every change would cry wolf constantly and quickly be ignored. The goal is to distinguish harmful deviation from beneficial change.
Several practices help draw this line. Pairing anomaly signals with outcome signals is the most powerful: a change in behavior accompanied by stable or improving success rates is probably healthy, while a change accompanied by rising failures or escalations is probably an anomaly. Updating the baseline deliberately after a validated improvement, rather than letting it drift, keeps the detector calibrated to current normal. And treating anomalies as candidates for investigation rather than automatic failures keeps a human in the loop for the ambiguous cases. The aim is a detector sensitive enough to catch real problems and specific enough that its alerts are worth acting on, which is the same balance that makes accuracy monitoring useful rather than noisy.
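The pairing of behavioral and outcome signals can be reduced to a small triage rule. The sketch below is one possible reading of that practice; the label strings and the tolerance value are assumptions for illustration, and the success rates are expected as fractions between 0 and 1.

```python
def triage_change(behavior_shifted, success_before, success_after, tolerance=0.02):
    """Classify a behavioral shift using the accompanying outcome signal.

    behavior_shifted: whether a detector flagged a change in behavior.
    success_before / success_after: task success rates around the change.
    tolerance: how much of a dip still counts as "stable" (assumed value).
    """
    if not behavior_shifted:
        return "no_behavior_change"
    if success_after >= success_before - tolerance:
        # Behavior moved but outcomes held or improved: probably healthy learning.
        return "likely_healthy"
    # Behavior moved and outcomes got worse: treat as a candidate anomaly.
    return "investigate"
```

Note that "investigate" is deliberately not "failure": the result is a candidate for human review, in keeping with treating anomalies as prompts for investigation rather than automatic verdicts.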
Responding to Detected Anomalies
Detection is only useful if it leads to action, and the right response depends on severity and confidence. For high-severity anomalies with clear signatures, such as a runaway loop or a cost spike, an automated response is appropriate: cut off the offending task, throttle the behavior, or roll back to the last known-good configuration without waiting for human review. The speed of automation matters most where the cost of inaction is high and the signal is unambiguous.
For lower-severity or ambiguous anomalies, the right response is to alert a human and queue the case for review. A sample of anomalous interactions, presented with the relevant context, lets a person quickly judge whether the deviation is a problem or a benign side effect of learning. Either way, every confirmed anomaly is itself a learning opportunity: the cases that triggered it become candidates for the evaluation set and for the next round of training, so the system grows more robust against the failure mode that produced them. Wiring detection, alerting, and rollback together is part of building the resilient learning system described in setting up learning pipelines.
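The split between automated response and human review can be sketched as a routing function over severity and confidence. The scores, cutoffs, and action names below are hypothetical; how severity and confidence are actually scored is left to the detectors that feed this step.

```python
from enum import Enum


class Action(Enum):
    AUTO_MITIGATE = "auto_mitigate"  # e.g. cut off the task, throttle, or roll back
    HUMAN_REVIEW = "human_review"    # alert a person and queue the case


def route_anomaly(severity, confidence, severity_cutoff=0.8, confidence_cutoff=0.9):
    """Pick a response for an anomaly scored on two 0-to-1 scales.

    Automation is reserved for anomalies that are both severe and clear-cut;
    everything else goes to a human. Cutoffs are illustrative assumptions.
    """
    if severity >= severity_cutoff and confidence >= confidence_cutoff:
        return Action.AUTO_MITIGATE
    return Action.HUMAN_REVIEW
```

Requiring both conditions keeps the automated path narrow, so a severe-looking but ambiguous signal still reaches a person before anything is rolled back.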
Setting Up Your First Anomaly Detector
For a team that has never monitored for anomalies, the right starting point is deliberately simple, because a basic detector running today is worth far more than a sophisticated one that is still being designed. Begin with the operational signals you already collect: cost per task, latency, step count, and error rate. These require no new instrumentation beyond the telemetry every production agent should already have, and they catch the most damaging anomalies, such as runaway loops and cost spikes.
Establish a normal range for each signal from a few weeks of historical data, then set two kinds of limits. Statistical limits flag values that fall outside the usual distribution, catching unusual but not catastrophic deviations. Hard limits enforce absolute ceilings that should never be crossed regardless of history, such as a maximum number of tool calls per task, catching dangerous behavior immediately. Together these give broad coverage from a small amount of work, and they produce interpretable alerts that a human can act on without deep analysis.
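To make the baseline step concrete, here is a minimal sketch that derives an upper statistical limit for each signal from historical values, using a percentile cut. The signal names, the choice of the ninety-ninth percentile, and the function names are assumptions for the example; it only checks the upper tail, and hard ceilings would be enforced separately.

```python
from statistics import quantiles


def percentile_limit(history, percentile=99):
    """Upper limit for one signal, taken at the given percentile of its history."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # n=100 yields 99 cut points; index p-1 is the p-th percentile.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[percentile - 1]


def build_baseline(history_by_signal, percentile=99):
    """Map each signal name to its statistical upper limit."""
    return {
        name: percentile_limit(values, percentile)
        for name, values in history_by_signal.items()
    }


def flag_signals(observation, baseline):
    """Return the names of signals whose observed value exceeds its limit."""
    return [
        name
        for name, value in observation.items()
        if name in baseline and value > baseline[name]
    ]
```

In use, `build_baseline` would run once over the few weeks of history and be rerun only when the baseline is deliberately refreshed, while `flag_signals` runs on each completed task.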
From this foundation, expand only where the simple detector leaves gaps. Add output-based signals such as response length and refusal rate once the operational layer is stable. Introduce model-based semantic checks on a sample of traffic when subtle quality drift becomes a concern that surface metrics cannot catch. Growing the detector incrementally, in response to the anomalies you actually encounter, keeps it proportionate to the real risks and avoids the trap of building elaborate machinery for problems you may never face.
Anomaly detection flags when an agent departs from its normal behavior across operational, output, and tool-usage signals. Start with cheap statistical and threshold methods, add model-based semantic detection on a sample, and always pair behavioral signals with outcome signals to tell harmful deviation from the healthy change that learning is supposed to produce. Automate response to clear high-severity anomalies and route ambiguous ones to human review.