API Cost Tracking for AI Agents
Why Cost Tracking Is an Operational Signal
In traditional software, the cost of processing one request is essentially fixed: the servers are already running and the marginal cost of one more request is negligible. AI agents break this model completely. Every LLM call consumes tokens that are billed per unit, and the number of calls per task varies dramatically. A simple task might require one call with a few hundred tokens. A complex task might require ten calls with thousands of tokens each. A misbehaving agent stuck in a retry loop might make fifty calls before hitting a timeout, consuming hundreds of thousands of tokens for zero useful output.
This variability means that cost is not just a financial metric; it is an operational health signal as important as error rate or latency. A sudden increase in average cost per task, even when the success rate holds steady, often indicates a problem: the agent is taking more steps to reach the same result, a tool is returning degraded responses that force re-planning, or the retrieval system is pulling in more context than necessary. Cost anomalies frequently surface problems before they become visible in quality metrics, making cost tracking one of the most valuable early warning systems available.
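One way to turn this signal into an alert, sketched here under simple assumptions, is to compare the mean cost of a recent batch of tasks against a baseline from a period considered healthy. The function name `cost_anomaly`, the threshold of three standard errors, and the sample figures are all illustrative choices rather than values from any particular monitoring system.

```python
from statistics import mean, stdev

def cost_anomaly(baseline_costs, recent_costs, threshold=3.0):
    """Return True if the recent mean cost per task is unusually high.

    baseline_costs: per-task costs (USD) from a period considered healthy.
    recent_costs:   per-task costs (USD) from the window being checked.
    threshold:      how many standard errors above baseline counts as anomalous.
    """
    base_mean = mean(baseline_costs)
    base_std = stdev(baseline_costs)  # needs at least two baseline samples
    recent_mean = mean(recent_costs)
    if base_std == 0:
        # A perfectly flat baseline: any increase at all is notable.
        return recent_mean > base_mean
    standard_error = base_std / len(recent_costs) ** 0.5
    return (recent_mean - base_mean) / standard_error > threshold

# Illustrative numbers only.
baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.12, 0.10]
healthy = [0.10, 0.11, 0.10, 0.12]
degraded = [0.18, 0.20, 0.19, 0.21]
```

A check like this is deliberately independent of success rate, so it can fire while quality metrics still look normal.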
Token-Level Accounting
Accurate cost tracking starts with counting every token. Most LLM providers return token counts in the API response, broken down into input tokens (also called prompt tokens) and output tokens (also called completion tokens). Capture both for every LLM call the agent makes, along with the model identifier, because different models and different token types are priced differently. Input tokens are typically cheaper than output tokens, sometimes by a factor of three to five, and cached input tokens may be cheaper still if your provider offers prompt caching.
Sum the token counts across all LLM calls in a task to get the total tokens per task, and apply the provider's per-token pricing to convert to dollars. Store both the raw token counts and the computed cost, because token counts remain comparable across time even when prices change, while dollar costs let you track actual financial exposure. When you change models, the token count comparison tells you whether the new model is more or less efficient in absolute terms, while the dollar comparison tells you the net financial effect after the price difference.
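The accounting described above can be sketched in a few lines. The model names and per-million-token prices below are hypothetical placeholders, and the sketch assumes the provider reports cached input tokens separately from uncached ones; some providers instead include cached tokens in the input count, in which case the fields would need adjusting.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute real provider rates.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.50, "cached_input": 0.25, "output": 1.50},
    "large-model": {"input": 5.00, "cached_input": 2.50, "output": 15.00},
}

@dataclass
class CallUsage:
    model: str
    input_tokens: int             # uncached prompt tokens (assumption, see above)
    output_tokens: int            # completion tokens
    cached_input_tokens: int = 0  # prompt tokens served from the provider's cache

def call_cost(usage: CallUsage) -> float:
    """Dollar cost of one LLM call at the configured prices."""
    price = PRICES_PER_MILLION[usage.model]
    return (
        usage.input_tokens * price["input"]
        + usage.cached_input_tokens * price["cached_input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000

def task_totals(calls):
    """Raw token counts and computed cost for every call in one task."""
    return {
        "input_tokens": sum(c.input_tokens for c in calls),
        "cached_input_tokens": sum(c.cached_input_tokens for c in calls),
        "output_tokens": sum(c.output_tokens for c in calls),
        "cost_usd": sum(call_cost(c) for c in calls),
    }

# Example task with two calls; the second reuses a cached prompt prefix.
calls = [
    CallUsage("large-model", input_tokens=2000, output_tokens=500),
    CallUsage("large-model", input_tokens=1000, output_tokens=200,
              cached_input_tokens=2000),
]
totals = task_totals(calls)
```

Keeping the token counts next to the dollar figure in the stored record is what allows later comparisons across price changes.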
A subtlety that matters for accurate accounting is that billed tokens do not always match the visible prompt, and providers count some features differently. System message tokens may effectively be billed at a different rate than user message tokens when the provider caches a stable prompt prefix. Tool and function definitions are typically serialized into the request and counted as input tokens even though they never appear in the visible prompt. Read your provider's billing documentation carefully and reconcile your tracked costs against actual invoices. A discrepancy of around ten percent between tracked costs and the bill is common when the accounting logic misses a token category.
Cost Decomposition by Component
Knowing the total cost per task is necessary but not sufficient for optimization. You also need to know where the cost comes from. Breaking the total into its components reveals which parts of the agent's operation are most expensive and therefore most worth optimizing.
The system prompt is the fixed instruction set that appears at the beginning of every LLM call. It is typically the same across all tasks and represents a baseline cost that scales with call volume rather than task complexity. If your system prompt is two thousand tokens and the agent makes three calls per task on average, you are spending six thousand tokens per task just on instructions. Prompt compression, caching (where the provider supports it), and reducing unnecessary instructions are the direct levers.
The retrieved context is the variable-length material pulled from a knowledge base or memory store and injected into the prompt. Retrieval-augmented generation is one of the most cost-intensive patterns because it adds potentially thousands of tokens to every call. The optimization question is whether the agent is retrieving too much (pulling in five documents when two would suffice), retrieving the wrong things (documents that are topically related but do not actually help answer the question), or retrieving redundantly (the same context across multiple calls in the same task). Tracking retrieved context volume per call and correlating it with task outcomes tells you whether more context is actually helping or just adding cost.
Conversation history grows across multi-turn interactions and is included in every subsequent LLM call. Without active management, history grows until it fills the context window, at which point each call is as expensive as it can possibly be. Summarization (periodically replacing the full history with a condensed version), sliding windows (keeping only the most recent N turns), and selective retention (keeping only the turns that are relevant to the current task) all reduce history-related cost, and the right choice depends on how much historical context the agent actually needs for quality.
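As a rough illustration of the sliding-window approach, the sketch below keeps the most recent turns that fit within a token allowance, which is a token-budgeted variant of "keep the last N turns". The helper names are hypothetical, and `count_words` is a crude stand-in for a real tokenizer, whose counts will differ.

```python
def count_words(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def trim_history(turns, max_tokens, count_tokens=count_words):
    """Keep the newest turns whose combined size fits within max_tokens.

    turns is ordered oldest to newest; the result preserves that order.
    """
    kept = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        size = count_tokens(turn)
        if used + size > max_tokens:
            break                         # the next-older turn would not fit
        kept.append(turn)
        used += size
    kept.reverse()                        # restore chronological order
    return kept

history = ["a b c", "d e", "f g h i", "j"]
trimmed = trim_history(history, max_tokens=5)
```

Whether a window like this is acceptable depends on how much older context the agent needs, so it is worth measuring quality before and after adopting it.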
Model output cost depends on response length, which varies with the task. Verbose models that produce long explanations when a short answer would suffice are spending output tokens unnecessarily. Prompt instructions that explicitly request concise responses, max_tokens limits that prevent runaway generation, and choosing smaller models for tasks that do not need the full capability of a frontier model are the standard levers.
Retries are the most wasteful cost component because they represent work done twice or more without proportional value. Every retry repeats the system prompt cost, the context cost, and the model output cost. Tracking retry rate and retry cost separately reveals whether retry reduction (by improving tool reliability or prompt clarity) would save more than any other optimization.
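The decomposition above can be sketched as a simple aggregation, assuming each logged call records token counts per component plus a retry flag. The field names (`system_prompt`, `retrieved_context`, `history`, `output`, `is_retry`) are hypothetical and would need to match whatever your tracing layer actually emits. The sketch works in tokens; since output tokens are usually priced higher than input tokens, a dollar-weighted version would shift the shares.

```python
from collections import Counter

COMPONENTS = ("system_prompt", "retrieved_context", "history", "output")

def decompose(calls):
    """Total tokens per component across a task, with retries in their own bucket."""
    totals = Counter()
    for call in calls:
        if call.get("is_retry", False):
            # A retry repeats every component, so attribute all of it to retries.
            totals["retries"] += sum(call[c] for c in COMPONENTS)
        else:
            for component in COMPONENTS:
                totals[component] += call[component]
    return dict(totals)

def shares(totals):
    """Fraction of all tokens attributable to each component."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: count / grand_total for name, count in totals.items()}

# Illustrative task: one normal call followed by one retry.
task_calls = [
    {"system_prompt": 2000, "retrieved_context": 3000, "history": 0, "output": 500},
    {"system_prompt": 2000, "retrieved_context": 1000, "history": 500,
     "output": 500, "is_retry": True},
]
breakdown = decompose(task_calls)
```

Sorting the shares in descending order gives a quick view of which lever is worth pulling first.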
Budget Enforcement and Runaway Prevention
The most dangerous cost scenario in agent systems is the runaway loop: an agent that enters a cycle of retrying a failed operation, elaborating on a plan that never converges, or generating output that it then evaluates and regenerates repeatedly. Without a hard limit, a single runaway task can consume hundreds or thousands of dollars in API credits before anyone notices. This is not a theoretical concern; it is a common failure mode that most production agent systems encounter sooner or later.
The defense is a per-task budget, a hard ceiling on the total tokens or dollars that a single task may consume. When the cumulative cost of a task reaches the budget, the agent framework terminates execution and returns an error or a partial result. Set the budget high enough that legitimate complex tasks can complete (perhaps five to ten times the median cost per task) but low enough that a runaway is caught within a reasonable financial loss (perhaps fifty to a hundred dollars, depending on your scale and tolerance).
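A minimal sketch of such a ceiling follows, assuming the agent loop can be expressed as a `step` callable that returns an output, the cost of that step in dollars, and a completion flag. `TaskBudget`, `run_task`, and the `step` contract are hypothetical names for illustration, not part of any specific agent framework.

```python
from dataclasses import dataclass

@dataclass
class TaskBudget:
    max_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.max_usd

def run_task(step, budget, max_steps=50):
    """Run step() until it reports completion, the budget is spent, or max_steps is hit.

    step must return a tuple (output, cost_usd, done).
    """
    outputs = []
    for _ in range(max_steps):
        output, cost_usd, done = step()
        outputs.append(output)
        budget.charge(cost_usd)
        if done:
            status = "ok"
            break
        if budget.exhausted:
            status = "budget_exceeded"   # terminate and return the partial result
            break
    else:
        status = "step_limit"
    return {"status": status, "outputs": outputs, "spent_usd": budget.spent_usd}

# A simulated runaway: every step costs $0.30 and never finishes.
runaway = run_task(lambda: ("retrying", 0.30, False), TaskBudget(max_usd=1.00))
```

Because the check runs after each call, a task can overshoot the ceiling by at most the cost of one call; a stricter variant would estimate the next call's cost before issuing it.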
Complement the per-task budget with an aggregate rate limit: a maximum spend rate over a rolling window, such as no more than a certain amount per hour or per day. This catches scenarios where individual tasks are within budget but an abnormal volume of tasks is burning through spend faster than expected, which can happen during a traffic spike, a bot attack, or a misconfigured batch job.
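The aggregate limit can be sketched as a rolling window over recent spend events. The class name `RollingSpendLimit` and its methods are hypothetical; timestamps are passed in explicitly so the behavior is easy to test, with `time.monotonic()` used as the default clock.

```python
import time
from collections import deque

class RollingSpendLimit:
    """Tracks spend over the trailing window and reports whether more is allowed."""

    def __init__(self, max_usd, window_seconds):
        self.max_usd = max_usd
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, cost_usd), oldest first

    def _evict(self, now):
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def record(self, cost_usd, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        self._events.append((now, cost_usd))

    def spent(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        return sum(cost for _, cost in self._events)

    def allow(self, now=None):
        """True while spend in the trailing window is below the ceiling."""
        return self.spent(now) < self.max_usd

# Illustrative ceiling: $10 per hour.
limiter = RollingSpendLimit(max_usd=10.0, window_seconds=3600)
limiter.record(6.0, now=0)
limiter.record(5.0, now=100)
```

New tasks would be admitted only while `allow()` returns True; what happens otherwise (queueing, rejecting, or paging someone) is a policy decision outside this sketch.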
When a task hits its budget limit, log the event with full context and alert the engineering team. A budget termination is not a normal failure; it is an anomaly that warrants investigation. The goal is not just to cap the damage but to understand and fix the root cause so the runaway does not recur.
Optimization Strategies
Cost optimization for agents is not about spending less in absolute terms; it is about spending the same amount more effectively, getting better results per dollar. The strategies fall into three categories: reducing waste, improving efficiency, and selecting the right model for each task.
Reducing waste means eliminating tokens that do not contribute to output quality. The most common sources of waste are oversized system prompts, excessive retrieved context, redundant conversation history, and unnecessary retries. Measure each component's contribution to cost, then experiment with reductions and measure whether quality changes. Often it does not, because agents are robust to moderate reductions in context volume, which means the extra tokens were waste all along.
Improving efficiency means getting the same result in fewer steps. If the agent typically takes four LLM calls for a task that should require two, the problem is usually in the prompt (the instructions are not clear enough for the model to act decisively) or in the tools (the model cannot get the information it needs in one call, so it has to try multiple approaches). Trace analysis reveals which tasks take excessive steps and why, and targeted prompt or tool improvements can often cut the step count substantially.
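A first pass at this kind of trace analysis, assuming you already log the number of LLM calls per task, is to flag tasks whose step count is well above the typical value. The function name and the factor of two are illustrative choices.

```python
from statistics import median

def excessive_step_tasks(step_counts, factor=2.0):
    """Return IDs of tasks whose LLM call count exceeds factor times the median.

    step_counts maps a task ID to the number of LLM calls that task made.
    """
    typical = median(step_counts.values())
    return sorted(
        task_id for task_id, steps in step_counts.items()
        if steps > factor * typical
    )

# Illustrative trace summary: task IDs and call counts are made up.
trace_steps = {"t1": 2, "t2": 2, "t3": 3, "t4": 9, "t5": 2}
flagged = excessive_step_tasks(trace_steps)
```

The flagged tasks are the ones whose full traces are worth reading to see whether the extra steps come from unclear instructions or from tools that fail to return what the model needs.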
Model selection means routing tasks to the cheapest model that can handle them at the required quality level. A frontier model that costs ten times more per token than a mid-tier model may produce indistinguishable results on simple tasks, so sending those tasks to the mid-tier model saves money with no quality loss. The implementation is a routing layer that classifies incoming tasks by complexity and sends each to the appropriate model tier. Building this classifier requires enough labeled data to distinguish simple from complex tasks reliably, which is another reason comprehensive cost and quality tracking is so important: it provides the data the classifier needs.
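The shape of such a routing layer can be sketched as follows. The keyword heuristic in `classify` is only a placeholder for a classifier trained on labeled data, and the tier names, model names, and marker words are hypothetical.

```python
# Hypothetical tiers; substitute the models you actually deploy.
MODEL_TIERS = {"simple": "small-model", "complex": "large-model"}

# Placeholder signal words; a real classifier would learn from labeled tasks.
COMPLEX_MARKERS = {"analyze", "compare", "plan", "refactor"}

def classify(task: str) -> str:
    """Label a task 'simple' or 'complex' using a crude heuristic."""
    words = {w.strip(".,?!:;") for w in task.lower().split()}
    if len(words) > 50 or words & COMPLEX_MARKERS:
        return "complex"
    return "simple"

def route(task: str) -> str:
    """Pick the model tier for a task."""
    return MODEL_TIERS[classify(task)]

simple_choice = route("What is the capital of France?")
complex_choice = route("Analyze these three contracts and compare their terms.")
```

Logging the chosen tier alongside the task's cost and quality outcome produces the labeled data that a better classifier would later be trained on.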
Treat cost as an operational health signal, not just a financial number. Track every token, decompose cost by component, enforce per-task budgets to prevent runaways, and use cost data to optimize where it actually matters rather than guessing.