Key Metrics to Track for AI Agents
Task-Level Health Metrics
Task-level metrics tell you whether the agent is doing its job. The headline number is task success rate: the percentage of tasks that complete correctly. Defining "correctly" requires thought, because for many agent tasks there is no binary right or wrong. Structured outputs can be validated against a schema. Code generation can be checked against tests. Conversational responses require either human evaluation or an automated judge model. However you define success, measuring it consistently over time is what lets you detect degradation and confirm that improvements are real.
The hard failure rate is the percentage of tasks that crash, time out, or produce an explicitly invalid result. Hard failures are easy to detect and usually easy to debug because they leave a clear error in the logs, but they are only the tip of the iceberg. More dangerous is the silent error rate: the percentage of tasks that complete without any error signal but produce an incorrect or low-quality result. Silent errors are what make users lose trust, because the agent looks like it is working when it is not. Measuring silent errors typically requires sampling outputs and evaluating them with human review, an automated quality scorer, or a suite of validation checks appropriate to your domain.
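As a minimal sketch of how these two rates might be computed, assume each task is logged with a status label and that a sampled subset of error-free tasks has been reviewed. The `TaskRecord` type, its field names, and the status labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRecord:
    # Hypothetical status labels: "ok", "crash", "timeout", "invalid".
    status: str
    # Verdict from human or automated review; None means "not sampled".
    passed_review: Optional[bool] = None

HARD_FAILURE_STATUSES = {"crash", "timeout", "invalid"}

def hard_failure_rate(tasks):
    """Fraction of all tasks that ended with an explicit error status."""
    if not tasks:
        return 0.0
    failures = sum(1 for t in tasks if t.status in HARD_FAILURE_STATUSES)
    return failures / len(tasks)

def silent_error_rate(tasks):
    """Estimated fraction of error-free tasks that were actually wrong.

    Only tasks that completed with status "ok" AND were sampled for
    review count toward the estimate. Returns None if nothing was reviewed.
    """
    reviewed = [t for t in tasks if t.status == "ok" and t.passed_review is not None]
    if not reviewed:
        return None
    return sum(1 for t in reviewed if not t.passed_review) / len(reviewed)
```

Note that this estimate is relative to error-free completions in the reviewed sample, so its precision depends on how many outputs are sampled and how representative the sample is.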
Task completion time measures end-to-end latency from the user's input to the agent's final response. For agents, this metric has high variance because simple tasks complete in seconds while complex ones can take minutes. Report it as percentiles (p50, p90, p99) rather than as an average, because a mean that blends one-second and sixty-second tasks tells you nothing useful. The p90 and p99 are where the user experience pain lives, and those are the numbers your service level objectives should target.
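The following sketch illustrates why percentiles are more informative than the mean, using the nearest-rank percentile definition (one of several common definitions). The sample durations are made up for the example.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p percent of the data is at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative task durations in seconds: mostly fast, two slow outliers.
durations_s = [1, 1, 2, 2, 3, 3, 4, 5, 60, 90]

summary = {f"p{p}": percentile(durations_s, p) for p in (50, 90, 99)}
mean_s = sum(durations_s) / len(durations_s)
```

With this sample the median is 3 seconds and the p90 is 60 seconds, while the mean of 17.1 seconds describes neither the typical task nor the slow one.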
Retry and escalation rate tracks how often users have to rephrase their request, ask the agent to try again, or give up and seek help elsewhere. A low task failure rate combined with a high retry rate suggests that the agent is technically completing tasks but not meeting user expectations, a quality problem that the success rate metric alone would miss.
Step-Level Performance Metrics
Step-level metrics look inside each task to measure the efficiency and reliability of the individual operations the agent performs. They are the diagnostic layer that explains why task-level metrics are changing.
LLM calls per task measures how many times the agent invokes the language model to complete one task. This number directly drives both cost and latency, and it is one of the most sensitive indicators of agent health. An increase in average LLM calls per task, even without a change in success rate, signals that the agent is working harder to reach the same outcomes. Common causes include degraded tool responses forcing retries, prompt changes that make the model less decisive, or an input distribution shift toward more complex tasks. Tracking this metric with a threshold alert catches these problems before they become user-visible.
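A threshold alert of this kind could be sketched as a comparison between a baseline window and a recent window. The function name and the 25 percent tolerance are assumptions for illustration, not recommended values.

```python
from statistics import mean

def llm_calls_alert(baseline_counts, recent_counts, max_ratio=1.25):
    """Compare average LLM calls per task across two windows.

    baseline_counts / recent_counts: non-empty lists with one entry per
    task, each entry being the number of LLM calls that task made.
    Returns (alert, ratio), where alert is True when the recent average
    exceeds the baseline average by more than max_ratio.
    """
    baseline_avg = mean(baseline_counts)
    recent_avg = mean(recent_counts)
    ratio = recent_avg / baseline_avg
    return ratio > max_ratio, ratio
```

For example, a baseline of 4 calls per task and a recent average of 6 gives a ratio of 1.5 and fires the alert, even if the success rate has not moved.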
Tool call success rate measures the percentage of tool invocations that return a usable result on the first attempt. Track this per tool, because a single unreliable tool can drag down overall agent performance while the others work perfectly. A tool whose success rate drops from ninety-eight percent to ninety percent has not broken in an obvious way, but its failure rate has gone from two percent to ten percent, so the agent now retries it roughly five times more often, which adds latency and cost to every task that uses it.
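Per-tool tracking might look like the sketch below, which assumes each tool call is logged as a `(tool_name, attempt_number, succeeded)` tuple. That log shape is hypothetical; adapt it to whatever your tracing system records.

```python
from collections import defaultdict

def first_attempt_success_by_tool(calls):
    """Compute first-attempt success rate per tool.

    calls: iterable of (tool_name, attempt_number, succeeded) tuples.
    Retries (attempt_number > 1) are ignored, because the metric is
    about whether the first attempt returned a usable result.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for tool, attempt, ok in calls:
        if attempt != 1:
            continue
        totals[tool] += 1
        if ok:
            successes[tool] += 1
    return {tool: successes[tool] / totals[tool] for tool in totals}
```

Keeping the result keyed by tool name makes it straightforward to alert on one tool without averaging its problems away across the others.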
Tokens per LLM call breaks down into input tokens and output tokens. Input token count reflects the size of the prompt, which includes the system instructions, any retrieved context, conversation history, and the current user input. Output token count reflects the length of the model's response. Tracking both separately matters because they cost different amounts and they indicate different things. A spike in input tokens suggests the retrieval system is pulling in more documents or the conversation history is growing without summarization. A spike in output tokens suggests the model is generating longer responses, possibly because the prompt has become less directive or the task complexity has increased.
Tool call latency measures how long each external tool invocation takes. This is often the dominant contributor to end-to-end task latency because LLM inference times are relatively predictable while tool latency depends on external services that can slow down unpredictably. Track tool latency per tool and alert on sustained increases, because a slow tool does not just add its own latency; it often triggers retries and alternative strategies that multiply the total impact.
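One way to alert on sustained increases rather than single spikes is to require several consecutive windows above a threshold. This sketch assumes you already aggregate a per-tool latency figure (for example a p90) per time window. The factor of 1.5 and the three-window requirement are illustrative assumptions.

```python
def sustained_increase(window_latencies_ms, baseline_ms, factor=1.5, min_windows=3):
    """Return True if the most recent `min_windows` windows all exceed
    baseline_ms * factor.

    window_latencies_ms: per-window latency values for one tool,
    ordered oldest first.
    """
    if len(window_latencies_ms) < min_windows:
        return False
    recent = window_latencies_ms[-min_windows:]
    return all(value > baseline_ms * factor for value in recent)
```

A single slow window surrounded by normal ones does not fire, which keeps the alert focused on the sustained slowdowns that trigger retries and fallback strategies.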
Model Behavior Metrics
Model behavior metrics track the statistical properties of the model's outputs over time. They serve as a change detection system: if something shifts in the model's behavior that you did not intentionally cause, these metrics will surface it.
Output length distribution tracks the average and percentile response lengths across tasks. A sustained change in this distribution, particularly a decrease, can indicate that the model provider has updated the model version, that your prompt has drifted, or that the input distribution has shifted. Each of these causes has different implications and requires different investigation, but the metric flags the anomaly regardless of cause.
Refusal rate measures how often the model declines to answer or perform the requested action, typically due to safety filters. A baseline refusal rate is normal and expected; a sudden increase indicates either a change in the model's safety thresholds (often following a provider update) or a change in user inputs toward topics the model considers sensitive. Either way, it affects the user experience and warrants investigation.
Format compliance rate measures how often the model's output matches the expected structure, whether that is valid JSON, a specific XML schema, a particular markdown format, or any other structured expectation your application requires. Format non-compliance is one of the most common agent failure modes because it causes downstream parsing errors that break the entire task. Tracking compliance rate as a metric, rather than just treating each failure as an individual bug, lets you see patterns: a compliance rate that slowly declines from ninety-nine percent to ninety-five percent over two weeks is a trend that individual error reports would not reveal.
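For the JSON case, a compliance rate could be sketched as follows. The check here is deliberately simple: the output must parse as a JSON object and contain a set of required top-level keys. The key names in the usage note are hypothetical, and a real application might validate against a full schema instead.

```python
import json

def is_compliant(raw_output, required_keys):
    """True if raw_output parses as a JSON object containing every key
    in required_keys (a set of strings)."""
    try:
        parsed = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

def compliance_rate(outputs, required_keys):
    """Fraction of outputs that pass is_compliant; None if there are no outputs."""
    if not outputs:
        return None
    return sum(is_compliant(o, required_keys) for o in outputs) / len(outputs)
```

Computing this per day over a batch of outputs, for instance with `required_keys={"action", "args"}`, produces the time series in which a slow decline becomes visible.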
Planning pattern distribution tracks which strategies the model chooses when solving tasks. If your agent can call multiple tools, the distribution of which tools the model selects and in what order reveals shifts in the model's decision-making that may not be visible in outcome metrics. A model that suddenly prefers tool A over tool B for the same class of tasks is making different decisions than before, and understanding when and why this happens helps you anticipate quality impacts before they materialize.
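One possible way to quantify such a shift is to compare the tool-selection distribution of two time windows using total variation distance. That choice of distance is an assumption made for this sketch, and other measures would work too. The tool names in the usage note are placeholders.

```python
from collections import Counter

def tool_distribution(selections):
    """Turn a list of selected tool names into a {tool: share} mapping."""
    counts = Counter(selections)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions.

    0.0 means identical tool usage; 1.0 means completely disjoint usage.
    Tools missing from one distribution are treated as having share 0.
    """
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)
```

If one window selects tool A for 20 percent of calls and the next selects it for 60 percent (with tool B taking the rest), the distance is 0.4, a large shift that is worth investigating even if success rate is unchanged.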
Cost Metrics
Total tokens per task is the sum of all input and output tokens across all LLM calls in a single task. This is the raw measure of how much compute the agent consumed. Track it as a distribution rather than just an average, because cost variance in agent systems is extreme: the top ten percent of tasks by token consumption often account for half or more of total spend, and understanding what drives those expensive outliers is where the biggest optimization opportunities lie.
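To check how concentrated token consumption is in your own system, a small helper along these lines could compute the share of total tokens consumed by the most expensive tasks. The function name and sample numbers are illustrative.

```python
import math

def top_share(tokens_per_task, top_percent=10):
    """Share of total tokens consumed by the top `top_percent` percent
    of tasks, ranked by token consumption."""
    if not tokens_per_task:
        raise ValueError("tokens_per_task must not be empty")
    ordered = sorted(tokens_per_task, reverse=True)
    k = max(1, math.ceil(len(ordered) * top_percent / 100))
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total
```

For example, nine tasks at 1,000 tokens and one at 9,000 tokens gives a top-ten-percent share of 0.5: a single task accounts for half of the spend.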
Cost per task in currency translates token counts into actual dollars by applying the provider's pricing for each model tier and token type. This is the metric that connects engineering decisions to budget reality. Track it by task type, by user or tenant, and over time. When you change the model, modify the prompt, or adjust the retrieval strategy, cost per task is what tells you whether the change is financially sustainable regardless of whether it improved quality.
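The translation from tokens to currency can be sketched as below. The model names and per-million-token prices are invented placeholders, not any provider's real pricing. Substitute the figures from your provider's current price list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # currency units per 1M input tokens
    output_per_million: float  # currency units per 1M output tokens

# Hypothetical prices for two hypothetical model tiers.
PRICING = {
    "small-model": Pricing(input_per_million=0.50, output_per_million=1.50),
    "large-model": Pricing(input_per_million=5.00, output_per_million=15.00),
}

def cost_of_task(llm_calls):
    """Sum the cost of every LLM call made during one task.

    llm_calls: list of (model_name, input_tokens, output_tokens) tuples.
    """
    total = 0.0
    for model, input_tokens, output_tokens in llm_calls:
        price = PRICING[model]
        total += input_tokens / 1_000_000 * price.input_per_million
        total += output_tokens / 1_000_000 * price.output_per_million
    return total
```

Pricing input and output tokens separately per model tier matters because a task that mixes tiers can be dominated by a single call to the more expensive model.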
Cost by component breaks total cost into its contributors: how much goes to the system prompt (which is the same on every call), how much to retrieved context, how much to conversation history, how much to model output, and how much to retries. This decomposition reveals where optimization will have the most impact. If sixty percent of your cost comes from retrieved context that is mostly the same across calls, implementing a context cache could cut your bill substantially. If thirty percent comes from retries, improving tool reliability would save more than compressing prompts.
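Once cost has been attributed to components, ranking them by share is a short step. This sketch assumes you already have a per-component cost mapping. The component names and amounts in the usage note are illustrative.

```python
def cost_shares(cost_by_component):
    """Return (component, share) pairs sorted from largest share to smallest.

    cost_by_component: mapping of component name to cost in currency units.
    """
    total = sum(cost_by_component.values())
    if total == 0:
        return [(name, 0.0) for name in cost_by_component]
    shares = [(name, cost / total) for name, cost in cost_by_component.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)
```

Given `{"system_prompt": 1.0, "retrieved_context": 6.0, "history": 1.0, "output": 1.0, "retries": 1.0}`, the first entry is retrieved context at a share of 0.6, which points at context caching as the first optimization to evaluate.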
Budget utilization rate tracks what fraction of your daily or monthly API budget has been consumed and at what velocity. This is the operational metric that prevents financial surprises. An alert that fires when daily spend exceeds one hundred fifty percent of the trailing seven-day average catches runaway costs while they are still a spike rather than a crisis.
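The alert described above can be sketched in a few lines. The function assumes a list of daily spend totals ordered oldest first, with the last entry being the day under evaluation. The name and defaults are illustrative.

```python
def spend_alert(daily_spend, threshold=1.5, window=7):
    """Return True if the latest day's spend exceeds `threshold` times
    the average of the preceding `window` days.

    daily_spend: list of daily totals, oldest first; the last entry is
    the day being checked. Returns False when there is not yet enough
    history to form a full trailing window.
    """
    if len(daily_spend) < window + 1:
        return False
    trailing = daily_spend[-(window + 1):-1]
    average = sum(trailing) / window
    return daily_spend[-1] > threshold * average
```

With seven days at 100 followed by a day at 160, the alert fires. A day at 140 stays under the one hundred fifty percent line and does not.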
User Experience Metrics
Perceived latency is the time from when the user submits their request to when they see the agent's response. For streaming responses, this is time to first token. Perceived latency includes network round trips and any queuing delays on top of the agent's processing time, so it can be substantially longer than the internal task completion time. This is the metric the user actually experiences, and it should be the one your service level objectives are built around.
Follow-up rate measures how often a user sends a follow-up message after the agent responds, especially messages that rephrase or correct the original request. A high follow-up rate indicates that the agent's first response is not meeting expectations, even if it is technically correct. This is a quality signal that task-level success metrics may not capture, because the task "succeeded" from the agent's perspective but failed from the user's perspective.
Session abandonment rate measures how often users leave mid-conversation without completing their goal. High abandonment correlates with frustration and is a leading indicator of user churn. As a quality signal, however, it lags: many problems usually have to accumulate before a user gives up. It is also the most consequential signal, because a user who abandons is a user you failed.
Explicit feedback rate and sentiment tracks thumbs up/down, star ratings, or written feedback, both the rate at which users provide it and the distribution of positive versus negative signals. Most users never leave explicit feedback, so the sample is biased, but it remains the most direct signal of user satisfaction available. Pair it with implicit signals like follow-up rate and abandonment to get a more complete picture.
The five metric categories (task health, step performance, model behavior, cost, and user experience) form a complete picture when tracked together. Task metrics tell you whether things are working. Step metrics tell you why they are or are not. Model metrics detect upstream changes. Cost metrics keep the system financially viable. User metrics ground everything in actual value delivered.