What Should You Monitor First in AI Agents
The Detailed Answer
The question of what to monitor first matters because the full set of possible agent metrics is large enough to be paralyzing. Teams that try to set up comprehensive monitoring before launch often end up launching with no monitoring at all, because the setup is never completed. The better approach is to start with the smallest set of metrics that catches the most important problems, get it working at or before launch, and expand based on what those initial metrics reveal.
The three starting metrics (task success rate, cost per task, and error rate) are chosen because they cover the three most common ways an agent deployment fails catastrophically. Task success rate catches the case where the agent stops working correctly. Cost per task catches the case where the agent works but costs more than it should, which can happen silently and accumulate into a significant financial problem. Error rate catches hard failures that produce visible errors rather than quietly wrong answers. Together, these three metrics answer the most existential questions about a newly deployed agent: does it work, can we afford it, and is it crashing.
Task success rate requires you to define what success means, which is itself a valuable exercise that many teams skip. For structured outputs, success can be validated automatically (does the JSON parse, does the generated code compile, does the extracted data match the schema). For open-ended outputs, you may need a lightweight automated judge or a sampling-based human review process. Even an imperfect success metric is better than none, because it gives you a trend line that shows whether quality is improving, stable, or degrading.
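For structured outputs, the automatic check can be very small. The sketch below assumes the agent returns a JSON object; the field names in REQUIRED_FIELDS and the helper names are hypothetical and stand in for whatever your own schema requires.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float)}


def is_success(raw_output: str) -> bool:
    """Return True if the output parses as a JSON object with the expected fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def success_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass validation; 0.0 when there are no outputs."""
    if not outputs:
        return 0.0
    return sum(is_success(o) for o in outputs) / len(outputs)
```

Computing success_rate over each day's outputs gives the trend line described above, even though the check itself only confirms structure and not correctness.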
Cost per task requires instrumenting every LLM call to capture input and output token counts and converting them to dollars at the provider's rates. This is the metric that most frequently surprises teams at launch, because the cost of serving real user traffic at production volume is often two to five times what development testing predicted. Without the metric, you do not discover the discrepancy until you see the invoice. A simple alert that fires when daily cost exceeds a threshold is the minimum defense against runaway spending.
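One possible shape for this instrumentation is sketched below. The prices in PRICE_PER_MTOK are placeholders, not any provider's real rates, and the LLMCall record and function names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Placeholder prices in dollars per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


@dataclass
class LLMCall:
    task_id: str
    input_tokens: int
    output_tokens: int


def call_cost(call: LLMCall) -> float:
    """Dollar cost of a single LLM call at the placeholder rates."""
    return (
        call.input_tokens * PRICE_PER_MTOK["input"]
        + call.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_per_task(calls: list[LLMCall]) -> dict[str, float]:
    """Sum the cost of every call, grouped by the task it belongs to."""
    totals: dict[str, float] = {}
    for call in calls:
        totals[call.task_id] = totals.get(call.task_id, 0.0) + call_cost(call)
    return totals


def daily_cost_exceeded(calls: list[LLMCall], threshold_usd: float) -> bool:
    """True when the day's total spend is above the alert threshold."""
    return sum(call_cost(c) for c in calls) > threshold_usd
```

In practice the list of calls would come from whatever wrapper sits around your LLM client, and daily_cost_exceeded would feed an alerting channel rather than return a boolean.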
Error rate requires catching and logging every exception, timeout, and explicit failure in the agent's execution. This is the simplest metric to implement because most frameworks already track errors; the key addition is making errors visible in a dashboard rather than buried in server logs where nobody checks them.
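If your framework does not already expose error counts, a thin wrapper is enough to start. This is a minimal sketch: the ErrorTracker class is hypothetical, and it swallows the exception after recording it, which you may not want in every code path.

```python
import logging
from collections import Counter

logger = logging.getLogger("agent")


class ErrorTracker:
    """Counts task runs and the exceptions they raise, grouped by exception type."""

    def __init__(self):
        self.runs = 0
        self.errors = Counter()

    def run(self, task_fn, *args, **kwargs):
        """Run one agent task; log and count any exception, then return None."""
        self.runs += 1
        try:
            return task_fn(*args, **kwargs)
        except Exception as exc:
            self.errors[type(exc).__name__] += 1
            logger.exception("agent task failed")
            return None

    def error_rate(self) -> float:
        """Fraction of runs that raised; 0.0 before any run."""
        return sum(self.errors.values()) / self.runs if self.runs else 0.0
```

Exporting error_rate() and the per-type counts to a dashboard is what moves errors out of the server logs.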
The Expansion Path
Once the initial three metrics are stable and you have baselines, the natural expansion path follows the investigation needs that arise from real incidents.
The first expansion is usually step-level metrics: LLM calls per task and tool call success rate. These are the diagnostic metrics that explain why the headline metrics are changing. An increase in LLM calls per task, even with a stable success rate, means the agent is working harder for the same results, which predicts both future cost increases and eventual quality degradation. A drop in tool call success rate, broken down by tool, pinpoints the specific tool causing problems, which is far more actionable than a broad decline in success rate.
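Both step-level metrics can be derived from a flat event log. The sketch below assumes each event is a dict with a "type" key and the fields shown; that event shape is an illustration, not a standard format.

```python
from collections import defaultdict


def llm_calls_per_task(events: list[dict]) -> float:
    """Average number of LLM calls across tasks that made at least one call."""
    counts: dict[str, int] = defaultdict(int)
    for event in events:
        if event["type"] == "llm_call":
            counts[event["task_id"]] += 1
    return sum(counts.values()) / len(counts) if counts else 0.0


def tool_success_rates(events: list[dict]) -> dict[str, float]:
    """Success rate per tool, so a drop can be traced to one specific tool."""
    ok: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for event in events:
        if event["type"] == "tool_call":
            total[event["tool"]] += 1
            ok[event["tool"]] += int(event["ok"])
    return {tool: ok[tool] / total[tool] for tool in total}
```

Tracking these per day alongside the headline metrics makes it easier to see whether a cost increase comes from more calls per task or from one failing tool being retried.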
The second expansion is usually tracing, not as a metric but as an investigative capability. The first time you have a failure that the metrics cannot explain, you will want a full trace of what happened, and that is the moment to implement tracing if you have not already. Tail-based sampling, which decides whether to keep a trace only after the task has finished, lets you capture full traces for all failures and gives you investigative capability with minimal storage cost.
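The idea behind tail-based sampling can be sketched in a few lines: buffer the spans for a task, then decide at the end whether to keep them. The TraceBuffer class and its in-memory store are hypothetical, and keeping a small random fraction of successful traces is an added assumption that gives you something to compare failures against.

```python
import random


class TraceBuffer:
    """Buffers spans per task and keeps the full trace only if the task qualifies."""

    def __init__(self, success_sample_rate: float = 0.01, rng=random):
        self.success_sample_rate = success_sample_rate
        self.rng = rng
        self._pending: dict[str, list] = {}
        self.stored: dict[str, list] = {}  # stand-in for a real trace store

    def record(self, task_id: str, span: dict) -> None:
        """Hold a span in memory until the task's outcome is known."""
        self._pending.setdefault(task_id, []).append(span)

    def finish(self, task_id: str, succeeded: bool) -> bool:
        """Keep every failed trace; keep successful ones at the sample rate."""
        spans = self._pending.pop(task_id, [])
        keep = (not succeeded) or self.rng.random() < self.success_sample_rate
        if keep:
            self.stored[task_id] = spans
        return keep
```

Because the decision is made after the outcome is known, storage grows with the failure count rather than with total traffic.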
The third expansion is user experience metrics: follow-up rate, session length, and explicit feedback. These connect the internal metrics to what actually matters, which is whether users find the agent useful. They often reveal quality problems that task success rate misses, because the agent technically succeeds but does so in a way that does not meet user expectations.
The fourth expansion is automated evaluation: running the agent against a fixed test set on a regular schedule and tracking the scores over time. This catches regressions before they reach production traffic, because the evaluation exercises the agent on fixed inputs with known expected results that you can compare its output against. It is the most reliable way to ensure that changes to the prompt, model, or tools do not quietly degrade quality in ways that the production metrics would only reveal after affecting real users.
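A scheduled evaluation run can start as simply as the sketch below. It assumes the agent is callable as a function from input to output and that exact-match comparison is good enough for the test set; both are simplifying assumptions, and run_eval and is_regression are hypothetical names.

```python
def run_eval(agent_fn, test_set: list[dict]) -> float:
    """Fraction of test cases where the agent's output equals the expected result."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    passed = sum(
        1 for case in test_set if agent_fn(case["input"]) == case["expected"]
    )
    return passed / len(test_set)


def is_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag a run whose score falls below the baseline by more than the tolerance."""
    return score < baseline - tolerance
```

Recording each run's score with a timestamp gives you the trend over time, and is_regression can gate a change to the prompt, model, or tools before it ships. For open-ended outputs, the equality check would be replaced by whatever judge or rubric you already use to define success.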
Monitor task success rate, cost per task, and error rate from day one. Expand to step-level diagnostics, tracing, user experience metrics, and automated evaluation as the initial metrics reveal questions they cannot answer. Start small, start early, and let real incidents guide what you add next.