LLM Fundamentals: Part 3 - Context Windows
Context windows are working memory, not storage. More tokens means more cost, more latency, and accuracy that degrades as context grows.
You know what tokens are and how the model samples them. Now: how many tokens fit in the room at once, and what happens when the room gets crowded.
Every API call you send to a language model has a hard ceiling on how many tokens it can process in a single request. Input tokens, output tokens, system prompt, conversation history, tool definitions: all of it competes for the same space. Anthropic calls this the context window, and it functions as working memory, not storage. Nothing persists between requests unless you explicitly send it again.
This is Post 3 in a 10-part series called LLM Fundamentals. Posts 1 through 3 are provider-agnostic. Starting with the next post, this series uses the Anthropic API for examples. Concepts map to other providers, but implementation details differ.
Working Memory, Not a Database
I find the working memory analogy genuinely useful because it reframes a common misconception. Developers new to LLMs often assume a large context window means the model “remembers” more. It does not remember anything between calls. Context windows represent all text the model can reference when generating a response, including the response itself. Once the request completes, that working memory is gone.
Training data is a separate concept entirely. A model trained on billions of documents does not store those documents in its context window. Its weights encode patterns from that data, but the context window only holds what you send right now, in this request. Confusing the two leads to architectural mistakes: expecting the model to recall details from earlier conversations without re-sending them, or assuming a larger window means the model “knows” more.
How Big Is the Room?
Context window sizes vary significantly across models. Claude Opus 4.6 and Sonnet 4.6 offer 1 million tokens. Sonnet 4.5 and Haiku 4.5 work with 200,000 tokens. OpenAI’s GPT-5.4 supports up to 1 million. Google’s Gemini 3.1 Pro supports 1 million, while the legacy Gemini 1.5 Pro still offers a 2 million token window.
Numbers that large sound limitless until you run a multi-turn agent loop. I hit context limits faster than I expected the first time I built a tool-using agent, because context grows in a way that surprises most people.
Linear Growth per Turn
Here is the part that catches developers off guard: context usage grows linearly with each turn, with previous turns preserved completely. Every message you send includes the entire conversation history that came before it. Turn 1 sends the system prompt plus your message. Turn 2 re-sends everything from turn 1 plus the model’s response plus your new message. Turn 10 re-sends all of turns 1 through 9.
A simple conversation with 2,000 tokens per exchange hits 20,000 tokens by turn 10. An agent loop with tool calls, where each tool result might be 500 to 2,000 tokens, can burn through 100,000 tokens in under 20 steps. I have watched agent sessions cross the 200K boundary within minutes when each tool call returns verbose JSON.
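The growth pattern is easy to see in a few lines. This is a sketch with illustrative numbers (500-token system prompt, 2,000 tokens per exchange), not measurements from any real model:

```python
# Sketch: cumulative context size in a multi-turn conversation.
# Every turn re-sends the system prompt plus all prior exchanges.
SYSTEM_PROMPT_TOKENS = 500
TOKENS_PER_EXCHANGE = 2_000  # one user message plus one model reply

def context_size_at_turn(turn: int) -> int:
    """Tokens in the request at a given turn: the system prompt
    plus every exchange up to and including this one."""
    return SYSTEM_PROMPT_TOKENS + turn * TOKENS_PER_EXCHANGE

for turn in (1, 5, 10, 20):
    print(f"turn {turn:>2}: ~{context_size_at_turn(turn):,} tokens in context")
```

By turn 10 you are already at roughly 20,000 tokens of history, and an agent loop with large tool results climbs much faster.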
This growth pattern also means cost scales faster than you might expect. Every turn re-processes all prior input tokens at your model’s per-token rate, so the total cost of a conversation is not the sum of individual messages but the sum of all cumulative context sent across all turns.
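The compounding shows up directly in the bill. A rough sketch, assuming a hypothetical input price of $3 per million tokens:

```python
# Sketch: total input tokens billed across a conversation is the sum
# of the cumulative context sent at each turn, not the sum of new
# messages. Price below is hypothetical.
TOKENS_PER_EXCHANGE = 2_000
PRICE_PER_MTOK = 3.00  # illustrative input price, USD per million tokens

def total_input_tokens(turns: int) -> int:
    # Turn t re-sends t exchanges' worth of history as input.
    return sum(t * TOKENS_PER_EXCHANGE for t in range(1, turns + 1))

tokens = total_input_tokens(10)
print(f"10 turns: {tokens:,} input tokens billed "
      f"(~${tokens / 1_000_000 * PRICE_PER_MTOK:.2f})")
```

Ten turns of 2,000-token exchanges bills 110,000 input tokens, more than five times the 20,000 tokens actually sitting in context at the final turn.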
Context Rot Is Real
Bigger context windows do not automatically produce better results. As token count grows, accuracy and recall degrade, a phenomenon called context rot. Filling a 1M-token window with everything you have is like dumping every document in your company onto someone’s desk and asking them to find one specific paragraph. Technically the information is present. Practically, retrieval quality suffers.
I have seen this firsthand running agents over my own blog corpus. A focused question against one post returns clean citations. Load the same question against eighty posts plus tool definitions and the model starts hedging, missing details, or pulling facts from the wrong section. Curating what goes into context matters as much as how much space you have available.
Claude performs well on long-context retrieval benchmarks like MRCR and GraphWalks, but even state-of-the-art results depend on what is in context, not just how much fits.
Document Placement Matters
Where you place content within the context window changes output quality. Putting long documents and inputs near the top of your prompt, above your query, instructions, and examples, can significantly improve performance across models. Anthropic’s own testing shows that placing the query at the end can improve response quality by up to 30%, especially with complex, multi-document inputs.
In practice, this means structuring your prompts with a consistent pattern: reference material first, then instructions, then your actual question last. I adopted this layout after noticing measurably better results extracting structured data from long PDFs and multi-document prompts, and I have not gone back.
For multi-document tasks, wrapping each document in XML tags with metadata helps the model distinguish between sources. And asking the model to quote relevant passages before answering forces it to ground its response in specific text rather than synthesizing loosely from the full context.
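Putting those pieces together, a prompt builder might look like this. It is a minimal sketch; the tag names and helper are my own, not a required format:

```python
# Sketch: assemble a long-context prompt with reference material first,
# instructions next, and the query last. Tag names are illustrative.
def build_prompt(documents: list[tuple[str, str]],
                 instructions: str, query: str) -> str:
    parts = ["<documents>"]
    for i, (source, text) in enumerate(documents, start=1):
        parts.append(
            f'<document index="{i}">\n'
            f"<source>{source}</source>\n"
            f"<document_contents>\n{text}\n</document_contents>\n"
            f"</document>"
        )
    parts.append("</documents>")
    parts.append(instructions)
    parts.append(query)  # the active query goes last, after everything else
    return "\n\n".join(parts)

prompt = build_prompt(
    documents=[("report.pdf", "Q3 revenue grew 12% year over year.")],
    instructions="Quote the relevant passage before answering.",
    query="What was Q3 revenue growth?",
)
```

The ordering is the point: documents at the top, the quote-first instruction in the middle, and the question at the very end of the prompt.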
Managing the Budget
Once you think of context as a budget rather than a feature, the design decisions change. A verbose system prompt that costs 3,000 tokens gets re-sent on every single API call. Over 10,000 daily conversations, that is 30 million tokens per day just in system prompt overhead. Trimming it by 40% saves 12 million tokens daily.
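The arithmetic is worth writing out once, because the numbers are all multiplication:

```python
# Sketch: system prompt overhead at scale, using the figures above.
SYSTEM_PROMPT_TOKENS = 3_000
DAILY_CONVERSATIONS = 10_000

daily_overhead = SYSTEM_PROMPT_TOKENS * DAILY_CONVERSATIONS  # 30M tokens/day
trimmed_savings = int(daily_overhead * 0.40)                 # 12M tokens/day
print(f"overhead: {daily_overhead:,} tokens/day; "
      f"a 40% trim saves {trimmed_savings:,}")
```

And that is assuming a single API call per conversation; multi-turn conversations multiply the overhead again, since the system prompt rides along on every call.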
Strategies I use in production:
- Summarize conversation history instead of sending raw transcripts after a certain turn count
- Clear tool results from earlier turns when the model no longer needs them (Anthropic recently added context editing for exactly this)
- Use the token counting API before sending requests to know exactly how close you are to the limit
- Structure prompts so reference material sits at the top and the active query sits at the bottom
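The first strategy can be sketched in a few lines. This is a simplified stand-in: in production the summary stub would come from an actual summarization call to a cheap model, not a placeholder string:

```python
# Sketch: keep the last N turns verbatim and collapse everything older
# into a single summary message. The summary here is a placeholder;
# a real system would generate it with a cheap model call.
def trim_history(messages: list[dict], keep_last_turns: int = 4) -> list[dict]:
    keep = keep_last_turns * 2  # each turn = user message + assistant reply
    if len(messages) <= keep:
        return messages
    dropped = len(messages) - keep
    summary = {
        "role": "user",
        "content": f"[Summary of {dropped} earlier messages omitted.]",
    }
    return [summary] + messages[-keep:]

history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"msg {i}"} for i in range(20)]
trimmed = trim_history(history)
print(len(trimmed))  # one summary stub plus the last eight messages
```

The same shape works for tool result pruning: replace old tool outputs with a short note that they were cleared, instead of deleting them silently.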
Newer Claude models return a validation error when prompt and output tokens exceed the context window, rather than silently truncating. That is actually helpful: it forces you to manage context deliberately instead of discovering degraded quality after the fact.
What This Means for System Design
Context window management is not an optimization problem you solve once. It is a continuous constraint that shapes every conversation-based application. Chatbots need history truncation strategies. Agent loops need tool result pruning. RAG systems need relevance filtering so they are not stuffing irrelevant chunks into the window.
Knowing the mechanism, that context is working memory with linear growth and accuracy degradation, lets you design around it instead of being surprised by it. More context is not automatically better. Curated context is.
Next up: building your first API call. You understand tokens, sampling, and context windows. Post 4 walks through the Messages API, the actual interface you use to send prompts and receive responses from Claude.