What Is a Context Window? And Why Context Engineering Has Become Critical
Published March 14, 2026 by Joel Thyberg

The context window has become one of the most important practical questions in modern AI systems. That applies whether you are building a simple assistant, a RAG system, or a more advanced agent. For almost every real-world solution, the same question appears: what information does the model actually get to see, in what order, and at what cost?
If you first want to understand why text is measured in tokens at all, start with our deep dive on tokenizers. Here, the focus is on the model's actual working area, the context window itself, and why context engineering, or context management, has become so important.
Short answer
The context window is the amount of tokenized information a model can keep active in a single call. That includes not only your question, but also system prompt, history, documents, RAG context, and often room for the answer itself.
Why it matters
It determines what the model can take into account, what a call costs, and how robust the answer stays as the amount of information grows.
Common misconception
That the context window is just the user's prompt. In reality, many different parts compete for the same token budget.
What Is a Context Window?
A context window is the amount of tokenized information the model can work with in a given call. You can think of it as the model's active workspace or temporary working memory.
It is important to understand that the context window does not just consist of your latest question. In a real system it is often filled by several layers at once:
- system prompt or developer instructions
- the user's question
- previous messages in the conversation
- tool results and function responses
- documents, attachments, or RAG-retrieved chunks
- the space the model needs to generate its answer
All of this competes for the same token budget.
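A minimal sketch of that budget accounting. The token counts and the 200K limit below are illustrative assumptions, not measurements from any specific model:

```python
# Everything that fills one call's context window competes for one budget.
# All numbers here are illustrative assumptions.
context_parts = {
    "system_prompt": 1_200,
    "chat_history": 6_500,
    "rag_chunks": 9_000,
    "tool_results": 2_300,
    "user_question": 150,
}
reserved_for_output = 4_000   # room the model needs to write its answer
context_limit = 200_000       # e.g. a 200K-token model

used = sum(context_parts.values()) + reserved_for_output
print(f"Budget used: {used:,} of {context_limit:,} tokens")
print(f"Headroom:    {context_limit - used:,} tokens")
```

Note that the user's actual question is the smallest line item in this sketch, which is typical of real systems.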
Input tokens
Everything you send before the model responds. That includes system prompt, user prompt, history, tool output, and retrieved context.
Output tokens
The tokens the model actually generates in its visible response. They cost money too and must fit inside the model's limits.
Reasoning tokens
In reasoning models, additional tokens may be used for internal thinking. How they are counted and billed varies across providers.
Whatever the exact proportions in a given call, the important part is that everything competes for the same limited workspace.
Which Tokens Actually Count?
In practice, you almost always need to think in three budgets at the same time:
- Input tokens, meaning everything that is sent in
- Output tokens, meaning the visible answer the model writes
- Reasoning tokens, in the models that use a separate or hidden thinking process
This is also where many people are caught off guard. A "short" user message may in reality sit on top of a large system prompt, a long chat history, and several RAG chunks, so the total token count becomes much larger than it first appears.
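To see how a "short" message adds up, here is a rough estimate using the common heuristic of roughly four characters per token for English text. This is only an approximation; use a real tokenizer for accurate counts:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text.
    Use a real tokenizer for accurate counts."""
    return max(1, len(text) // 4)

user_message = "Summarize the attached report."  # looks tiny
system_prompt = "x" * 8_000                      # stand-in for a long system prompt
rag_context = "y" * 40_000                       # stand-in for retrieved chunks

total = sum(rough_token_estimate(t)
            for t in (user_message, system_prompt, rag_context))
print(rough_token_estimate(user_message))  # a handful of tokens
print(total)                               # thousands of tokens
```

The visible question contributes a handful of tokens; the invisible scaffolding around it contributes thousands.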
What are reasoning tokens?
Reasoning models often try to spend extra budget on intermediate thinking before they produce the final answer. You do not always see that in the UI, but you notice it in cost, latency, and sometimes in how much room is left for everything else.
The exact behavior differs across providers:
- At Anthropic, extended thinking is counted as output tokens while it is generated. Their documentation also describes how thinking blocks from previous turns can count as input tokens in later calls in some flows.
- At Google, the Gemini pricing page states that the output price covers both response and reasoning.
- At other providers, both naming and accounting vary, but the practical conclusion is the same: reasoning is not free.
That means a model that "thinks longer" not only becomes more expensive, it can also consume more of the available budget for the call itself.
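A back-of-the-envelope sketch of that cost effect, assuming reasoning tokens are billed at the output rate, as some providers do. The prices and token counts here are made up for illustration:

```python
# Illustrative prices, not any specific provider's.
input_price_per_m = 3.00    # $ per million input tokens
output_price_per_m = 15.00  # $ per million output tokens

input_tokens = 20_000
visible_output = 1_500
reasoning_tokens = 12_000   # hidden thinking, billed as output in this sketch

cost = (input_tokens / 1e6) * input_price_per_m \
     + ((visible_output + reasoning_tokens) / 1e6) * output_price_per_m
print(f"${cost:.4f}")
```

In this example the hidden reasoning tokens make the output side of the bill roughly nine times larger than the visible answer alone would.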
Why You Cannot Just Paste in an Entire Book
It is tempting to think that a larger context window solves everything. If the model can handle hundreds of thousands or even a million tokens, why not just paste in all the material at once?
There are at least three reasons that is often a bad idea:
- it may not fit
- it may become expensive
- it may become worse
A very long book, plus instructions, plus history, plus the desired answer, can quickly exceed the limit. And even before that, relevant details can drown in noise.
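A simple pre-flight check along those lines, with illustrative token counts. A 350K-token book simply does not fit in a 200K window once instructions, history, and answer space are added:

```python
def fits_in_context(parts_tokens: list[int], reserved_output: int, limit: int) -> bool:
    """Check whether all context parts plus the answer budget fit the window."""
    return sum(parts_tokens) + reserved_output <= limit

book = 350_000       # a long book, in tokens (illustrative)
instructions = 2_000
history = 10_000

print(fits_in_context([book, instructions, history], 8_000, 200_000))    # False
print(fits_in_context([book, instructions, history], 8_000, 1_000_000))  # True
```

And as the next section shows, passing this check is only a necessary condition, not a sufficient one.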
Why Do Answers Get Worse as Context Grows?
This is one of the most important lessons in long-context work: a larger context does not automatically mean more stable performance.
In the report Context Rot, published by Chroma on July 14, 2025, the authors found that model performance often declines as input length grows, even in relatively simple and controlled tasks. The point was not only that models can miss information, but that the increase in input length itself appears to create problems.
In practice, that often shows up as the model:
- missing details that are actually present in the context
- becoming more uncertain or contradictory
- locking onto the wrong part of the material
- answering a distracting detail instead of the central question
- losing track of instructions when too much other material competes for attention
More text does not automatically mean better answers
Too little context
The model lacks important facts. The answer becomes shallow or guess-heavy.
Right amount, right order
Relevant signal dominates. The model gets exactly what it needs for the task.
Too much or too messy
Noise, distractors, and long sequences make the model lose sharpness. This is where context rot becomes visible.
That is also why models and AI agents rarely become better than the information they actually have access to and can work with. If the context is weak, scattered, or poorly prioritized, a strong model alone will not save the result.
What Is Context Engineering or Context Management?
Context engineering is about controlling which information the model gets, in what order, in what form, and with what budget. It is one of the most important practical layers in modern AI systems.
So it is not just about "having a long context", but about using that context intelligently.
In practice, that often means you need to:
- choose the right information instead of all information
- place important information early and clearly
- summarize or compress old history
- retrieve relevant context with RAG instead of sending everything every time
- budget for both output and reasoning, not just input
- use caching when the same large context returns often
This is where much of the real quality in AI agents and other agent systems is created. An agent with good tools but poor context quickly becomes expensive, slow, and unreliable. An agent with strong context engineering can feel much smarter than the base model alone suggests.
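One of those practices, choosing the right information within a budget, can be sketched as a greedy packing step. The chunks and relevance scores below are hypothetical; in a real system they would come from your retriever:

```python
def pack_context(chunks: list[tuple[float, int, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit the token budget.
    Each chunk is (relevance_score, token_count, text). A sketch, not a
    full RAG pipeline."""
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected

chunks = [
    (0.92, 800, "refund policy section"),
    (0.85, 1_200, "shipping terms"),
    (0.40, 3_000, "company history"),
    (0.88, 900, "warranty details"),
]
print(pack_context(chunks, budget=2_000))
# → ['refund policy section', 'warranty details']
```

Note how the large, low-relevance "company history" chunk is dropped entirely: sending less, but better-chosen, context is the whole point.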
These Are the Tokens You Pay For
API providers normally charge per million input tokens and million output tokens. That means the context window is not only a capacity question, but a direct cost question as well.
The pricing examples below were verified on March 14, 2026 against the providers' official pages. They can change quickly.
| Model | Input per 1M | Output per 1M | Comment |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Heavy frontier model |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Faster closed-source flagship |
| GPT-5.4 | $2.50 | $15.00 | Standard price applies below 270K context |
| GPT-5.4 Pro | $30.00 | $180.00 | Extremely expensive pro tier |
| Gemini 3 Pro Preview | $2.00 to $3.00 | $12.00 to $15.00 | Higher tier above 200K input |
| Gemini 2.5 Pro | $1.25 to $2.50 | $10.00 to $15.00 | Higher tier above 200K input |
| MiniMax M2.5 | $0.30 | $1.20 | Far cheaper than frontier-tier models |
| Kimi K2 Turbo | $1.15 cache miss | $8.00 | Moonshot also lists $0.15 for cache hit |
Some practical conclusions:
- frontier closed-source models often sit around a few dollars per million input tokens and significantly more for output
- pro-tier or heavily reasoning-oriented models can become dramatically more expensive
- cheaper open or more open models delivered through APIs can sit well below one dollar per million input tokens and just a few dollars on the output side
That is also why long prompts, large RAG payloads, and generous output limits quickly affect production costs.
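Using two rows from the table above, here is a quick per-call cost comparison. The 50K-input / 2K-output workload is an illustrative assumption:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one call, with prices quoted per million tokens."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

workload = (50_000, 2_000)  # 50K tokens in, 2K tokens out (illustrative)

print(f"Claude Sonnet 4.5: ${call_cost(*workload, 3.00, 15.00):.4f}")  # $0.1800
print(f"MiniMax M2.5:      ${call_cost(*workload, 0.30, 1.20):.4f}")   # $0.0174
```

Roughly a tenfold difference per call, which compounds quickly at production volumes.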
If you want to understand deployment cost more deeply, or compare API usage with your own infrastructure, read Where does your AI live? and our financial analysis of AI deployment.
How Large Are Context Windows Today?
Context lengths vary a lot between models. It is also important to read the fine print. Some providers state a theoretical maximum, others distinguish between standard mode and special beta or premium modes, and some count the total as input plus output.
Some verified examples as of March 14, 2026:
| Model | Verified context length | Comment |
|---|---|---|
| Claude Opus 4.6 | 200K standard, 1M beta | 1M requires a special beta header |
| Claude Sonnet 4.5 | 200K standard, 1M beta | Same setup as Opus |
| GPT-5.4 | 272K standard, up to 1M in API and Codex | Above the standard tier, higher usage rules apply |
| Gemini 2.5 Pro | 1,048,576 input, 65,536 output | Official model limit |
| Gemini 3 Pro Preview | 1,048,576 input, 65,536 output | Preview model |
| MiniMax M2.5 | 204,800 total | MiniMax counts input and output together |
The largest verified mainstream windows therefore sit around one million tokens, though that does not mean everything automatically works well at that scale. Many other models sit around 128K, 200K, or 400K, and older or more specialized models can be significantly lower.
How You Should Think About This in Practice
When you build a real system, the most important question is rarely "which model has the longest context?" but rather:
- which information the model actually needs to solve the task
- how you can reduce noise and distractors
- how much budget needs to be reserved for the answer
- when you should use RAG, summarization, or caching instead of sending everything again
- when it becomes cheaper or smarter to use another model or your own deployment
Long context windows are powerful. But they are not the same thing as free memory, unlimited understanding, or automatic quality.
Summary
The context window is the model's active workspace. It is filled with input tokens, and often also with space for output and sometimes reasoning. That is why capacity, quality, and cost all connect directly to how you manage context.
That is also why context engineering has become so central. Good AI systems are not just about choosing the right model, but about giving the model the right information, in the right amount, in the right order, and at the right cost.
Larger context often helps. But better context almost always helps more.