Context Windows: What Agents Remember Per Turn
How Context Windows Work
Every language model has a fixed context window measured in tokens. A token is roughly three-quarters of a word in English, so a 200,000 token context window holds approximately 150,000 words. When you send a request to the model, the total size of your input (system prompt, all messages, all tool descriptions, all tool results) plus the generated output must fit within this window. If the input exceeds the window, the request fails or the oldest content is automatically truncated.
Context windows have grown dramatically over the past few years. GPT-3 started with 4,096 tokens. GPT-4 expanded to 128,000. Claude 3.5 Sonnet and Claude Opus support 200,000 tokens. Google Gemini offers up to 2 million tokens. These increases have fundamentally changed what agents can accomplish in a single interaction, making it possible to process entire codebases, research papers, or lengthy document collections without chunking.
However, larger context windows do not solve all problems. Research consistently shows that model attention is not uniform across the context. Information near the beginning and end of the context receives more attention than information in the middle, a phenomenon known as the "lost in the middle" problem. For agents, this means that critical instructions in the system prompt (at the beginning) and recent tool results (at the end) are processed more reliably than intermediate conversation history buried in the middle of a long context.
What Consumes Context Space
In an agent interaction, context space is consumed by five categories of content. The system prompt typically uses 500 to 5,000 tokens depending on the complexity of the agent instructions. Tool descriptions use 100 to 500 tokens per tool, so an agent with 20 tools might dedicate 5,000 tokens to tool schemas alone. Conversation history grows with each turn, averaging 200 to 1,000 tokens per turn for simple interactions and much more for turns that involve long tool results.
Tool results are often the largest context consumers. A web search might return several pages of summarized results. A database query might return hundreds of rows. A code execution might produce extensive output. A single tool result can easily consume 5,000 to 20,000 tokens. Over the course of a multi-step task, accumulated tool results can fill the context window faster than any other content type.
The practical impact is that agents running complex, multi-step tasks eventually hit context limits even with large windows. A 200,000 token window sounds enormous, but an agent making 30 tool calls with average results of 3,000 tokens each has already used 90,000 tokens on tool results alone, before counting the system prompt, conversation messages, and tool descriptions.
Context Management Strategies
Sliding window is the simplest strategy: keep the most recent N messages and discard everything older. This works well for tasks where only recent context matters, but it causes the agent to lose important information from early in the conversation, including initial instructions, early decisions, and foundational research results.
Summarization compresses older context into shorter summaries. Instead of keeping every message verbatim, the system periodically summarizes the conversation so far into a compact representation that preserves the key facts, decisions, and progress. The summary replaces the original messages, freeing context space for new interactions. The tradeoff is information loss: summaries inevitably omit details that might be relevant later.
Selective retrieval stores all context externally and loads only the relevant portions for each turn. Before each model call, the system identifies which past messages, tool results, and facts are relevant to the current step and includes only those in the context. This approach uses embeddings or keyword matching to determine relevance. It is the most token-efficient strategy but requires additional infrastructure (a vector database or search index) and adds latency from the retrieval step.
Hierarchical context maintains multiple levels of detail. A short executive summary of the entire conversation is always included. Medium-detail summaries of each phase or subtask are included when relevant. Full verbatim content is included only for the most recent turns. This layered approach gives the model both broad awareness of the full task and detailed visibility into the current step.
Cost and Performance Implications
Context window usage directly impacts cost. Language model APIs charge per token, both for input tokens (the context you send) and output tokens (the response the model generates). Sending a full 200,000 token context on every turn is significantly more expensive than sending a carefully managed 20,000 token context. For agents that make dozens of model calls per task, the cost difference between efficient and inefficient context management can be 10x or more.
Latency also increases with context size. Larger contexts take longer to process, adding seconds or even tens of seconds to each model call. For interactive agents where users are waiting for responses, this latency directly affects the user experience. For background agents processing tasks asynchronously, the latency affects throughput and the total time to complete batches of tasks.
There is also a reasoning quality tradeoff. Including more context gives the model more information to work with, potentially improving decision quality. But including irrelevant context can distract the model, leading it to fixate on details that are not important to the current step. The optimal context size depends on the task: some tasks benefit from maximum context (complex analysis across many data sources), while others perform better with minimal, focused context (simple tool calls with clear parameters).
Prompt Caching and Context Optimization
Prompt caching is a technique offered by model providers that reduces the cost and latency of repeated context. When the same prefix of a prompt is sent on multiple consecutive calls (which is common in agent loops where the system prompt and tool descriptions stay the same), the provider caches the processed representation of that prefix. Subsequent calls only pay for the new content, not the cached prefix. Anthropic prompt caching can reduce costs by up to 90 percent for the cached portion of the context.
To take advantage of prompt caching, agent architects should structure their context so that static content (system prompt, tool descriptions, fixed instructions) appears at the beginning, and dynamic content (recent messages, tool results) appears at the end. This maximizes the cacheable prefix and minimizes the per-turn cost. Rearranging tool descriptions or modifying the system prompt between turns breaks the cache and eliminates the savings.
Context compression techniques reduce the total token count without losing critical information. Structured data from tool results can be reformatted to remove redundant field names and whitespace. Long text outputs can be summarized to their essential facts. Previous reasoning traces that are no longer relevant can be removed. Each of these optimizations saves tokens, reduces cost, and potentially improves reasoning quality by reducing noise in the context.
Token counting should be done before sending requests to the model. Knowing exactly how many tokens the current context contains lets the agent runtime make informed decisions about what to include and what to trim. Most model provider SDKs include token counting functions that match the model internal tokenizer, giving accurate counts without making an API call.
Multi-turn context budgeting allocates portions of the context window to different content types across an entire task. A budget might reserve 3,000 tokens for the system prompt, 5,000 tokens for tool descriptions, 2,000 tokens for the state object, and allocate the remaining capacity to conversation history and tool results. When the total approaches the budget limit, the system applies compression strategies to the categories that are most compressible (typically older conversation history) while protecting categories that must remain complete (system prompt, current state). This disciplined approach prevents unexpected context overflow and ensures that critical information is never lost to truncation.
Context window management is a core agent design challenge. The strategy you choose affects cost, speed, reasoning quality, and the types of tasks your agent can handle. There is no universal best approach, only tradeoffs that need to be matched to your specific use case.