What Is a Context Window? And Why Context Engineering Has Become Critical
Published March 14, 2026 by Joel Thyberg

The context window has become one of the most important practical questions in modern AI systems. That applies whether you are building a simple assistant, a RAG system, or a more advanced agent. For almost every real-world solution, the same question appears: what information does the model actually get to see, in what order, and at what cost?
If you first want to understand why text is measured in tokens at all, start with our deep dive on tokenizers. Here, the focus is on the model's actual working area, the context window itself, and why context engineering, or context management, has become so important.
Short answer
The context window is the amount of tokenized information a model can keep active in a single call. That includes not only your question, but also system prompt, history, documents, RAG context, and often room for the answer itself.
Why it matters
It determines what the model can take into account, what a call costs, and how robust the answer stays as the amount of information grows.
Common misconception
That the context window is just the user's prompt. In reality, many different parts compete for the same token budget.
What Is a Context Window?
A context window is the amount of tokenized information the model can work with in a given call. You can think of it as the model's active workspace or temporary working memory.
It is important to understand that the context window does not just consist of your latest question. In a real system it is often filled by several layers at once:
- system prompt or developer instructions
- the user's question
- previous messages in the conversation
- tool results and function responses
- documents, attachments, or RAG-retrieved chunks
- the space the model needs to generate its answer
All of this competes for the same token budget.
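A minimal sketch of that budget accounting. The token counts and the 200K limit below are illustrative assumptions, not measurements from any specific model:

```python
# Everything that fills one call's context window competes for one budget.
# All numbers here are illustrative assumptions.
context_parts = {
    "system_prompt": 1_200,
    "chat_history": 6_500,
    "rag_chunks": 9_000,
    "tool_results": 2_300,
    "user_question": 150,
}
reserved_for_output = 4_000   # room the model needs to write its answer
context_limit = 200_000       # e.g. a 200K-token model

used = sum(context_parts.values()) + reserved_for_output
print(f"Budget used: {used:,} of {context_limit:,} tokens")
print(f"Headroom:    {context_limit - used:,} tokens")
```

Note that the user's actual question is the smallest line item in this sketch, which is typical of real systems.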
Input tokens
Everything you send before the model responds. That includes system prompt, user prompt, history, tool output, and retrieved context.
Output tokens
The tokens the model actually generates in its visible response. They cost money too and must fit inside the model's limits.
Reasoning tokens
In reasoning models, additional tokens may be used for internal thinking. How they are counted and billed varies across providers.
Whatever the exact proportions in a given call, the important part is that everything competes for the same limited workspace.
Which Tokens Actually Count?
In practice, you almost always need to think in three budgets at the same time:
- Input tokens, meaning everything that is sent in
- Output tokens, meaning the visible answer the model writes
- Reasoning tokens, in the models that use a separate or hidden thinking process
This is also where many people are caught off guard. A "short" user message may in reality sit on top of a large system prompt, a long chat history, and several RAG chunks, so the total token count becomes much larger than it first appears.
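To see how a "short" message adds up, here is a rough estimate using the common heuristic of roughly four characters per token for English text. This is only an approximation; use a real tokenizer for accurate counts:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text.
    Use a real tokenizer for accurate counts."""
    return max(1, len(text) // 4)

user_message = "Summarize the attached report."  # looks tiny
system_prompt = "x" * 8_000                      # stand-in for a long system prompt
rag_context = "y" * 40_000                       # stand-in for retrieved chunks

total = sum(rough_token_estimate(t)
            for t in (user_message, system_prompt, rag_context))
print(rough_token_estimate(user_message))  # a handful of tokens
print(total)                               # thousands of tokens
```

The visible question contributes a handful of tokens; the invisible scaffolding around it contributes thousands.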
What are reasoning tokens?
Reasoning models often try to spend extra budget on intermediate thinking before they produce the final answer. You do not always see that in the UI, but you notice it in cost, latency, and sometimes in how much room is left for everything else.
The exact behavior differs across providers:
- At Anthropic, extended thinking is counted as output tokens while it is generated. Their documentation also describes how thinking blocks from previous turns can count as input tokens in later calls in some flows.
- At Google, the Gemini pricing page states that the output price covers both response and reasoning.
- At other providers, both naming and accounting vary, but the practical conclusion is the same: reasoning is not free.
That means a model that "thinks longer" not only becomes more expensive, it can also consume more of the available budget for the call itself.
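A back-of-the-envelope sketch of that cost effect, assuming reasoning tokens are billed at the output rate, as some providers do. The prices and token counts here are made up for illustration:

```python
# Illustrative prices, not any specific provider's.
input_price_per_m = 3.00    # $ per million input tokens
output_price_per_m = 15.00  # $ per million output tokens

input_tokens = 20_000
visible_output = 1_500
reasoning_tokens = 12_000   # hidden thinking, billed as output in this sketch

cost = (input_tokens / 1e6) * input_price_per_m \
     + ((visible_output + reasoning_tokens) / 1e6) * output_price_per_m
print(f"${cost:.4f}")
```

In this example the hidden reasoning tokens make the output side of the bill roughly nine times larger than the visible answer alone would.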
Why You Cannot Just Paste in an Entire Book
It is tempting to think that a larger context window solves everything. If the model can handle hundreds of thousands or even a million tokens, why not just paste in all the material at once?
There are at least three reasons that is often a bad idea:
- it may not fit
- it may become expensive
- it may become worse
A very long book, plus instructions, plus history, plus the desired answer, can quickly exceed the limit. And even before that, relevant details can drown in noise.
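A simple pre-flight check along those lines, with illustrative token counts. A 350K-token book simply does not fit in a 200K window once instructions, history, and answer space are added:

```python
def fits_in_context(parts_tokens: list[int], reserved_output: int, limit: int) -> bool:
    """Check whether all context parts plus the answer budget fit the window."""
    return sum(parts_tokens) + reserved_output <= limit

book = 350_000       # a long book, in tokens (illustrative)
instructions = 2_000
history = 10_000

print(fits_in_context([book, instructions, history], 8_000, 200_000))    # False
print(fits_in_context([book, instructions, history], 8_000, 1_000_000))  # True
```

And as the next section shows, passing this check is only a necessary condition, not a sufficient one.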
Why Do Answers Get Worse as Context Grows?
This is one of the most important lessons in long-context work: a larger context does not automatically mean more stable performance.
In the report Context Rot, published by Chroma on July 14, 2025, the authors found that model performance often declines as input length grows, even in relatively simple and controlled tasks. The point was not only that models can miss information, but that the increase in input length itself appears to create problems.
In practice, that often shows up as the model:
- missing details that are actually present in the context
- becoming more uncertain or contradictory
- locking onto the wrong part of the material
- answering a distracting detail instead of the central question
- losing track of instructions when too much other material competes for attention
More text does not automatically mean better answers
Too little context
The model lacks important facts. The answer becomes shallow or guess-heavy.
Right amount, right order
Relevant signal dominates. The model gets exactly what it needs for the task.
Too much or too messy
Noise, distractors, and long sequences make the model lose sharpness. This is where context rot becomes visible.
That is also why models and AI agents rarely become better than the information they actually have access to and can work with. If the context is weak, scattered, or poorly prioritized, a strong model alone will not save the result.
What Is Context Engineering or Context Management?
Context engineering is about controlling which information the model gets, in what order, in what form, and with what budget. It is one of the most important practical layers in modern AI systems.
So it is not just about "having a long context", but about using that context intelligently.
In practice, that often means you need to:
- choose the right information instead of all information
- place important information early and clearly
- summarize or compress old history
- retrieve relevant context with RAG instead of sending everything every time
- budget for both output and reasoning, not just input
- use caching when the same large context returns often
This is where much of the real quality in AI agents and other agent systems is created. An agent with good tools but poor context quickly becomes expensive, slow, and unreliable. An agent with strong context engineering can feel much smarter than the base model alone suggests.
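One of those practices, choosing the right information within a budget, can be sketched as a greedy packing step. The chunks and relevance scores below are hypothetical; in a real system they would come from your retriever:

```python
def pack_context(chunks: list[tuple[float, int, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit the token budget.
    Each chunk is (relevance_score, token_count, text). A sketch, not a
    full RAG pipeline."""
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected

chunks = [
    (0.92, 800, "refund policy section"),
    (0.85, 1_200, "shipping terms"),
    (0.40, 3_000, "company history"),
    (0.88, 900, "warranty details"),
]
print(pack_context(chunks, budget=2_000))
# → ['refund policy section', 'warranty details']
```

Note how the large, low-relevance "company history" chunk is dropped entirely: sending less, but better-chosen, context is the whole point.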
These Are the Tokens You Pay For
API providers normally charge per million input tokens and million output tokens. That means the context window is not only a capacity question, but a direct cost question as well.
The pricing examples below were verified on March 14, 2026 against the providers' official pages. They can change quickly.
| Model | Input per 1M | Output per 1M | Comment |
|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | Heavy frontier model |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Faster closed-source flagship |
| GPT-5.4 | $2.50 | $15.00 | Standard price applies below 270K context |
| GPT-5.4 Pro | $30.00 | $180.00 | Extremely expensive pro tier |
| Gemini 3 Pro Preview | $2.00 to $3.00 | $12.00 to $15.00 | Higher tier above 200K input |
| Gemini 2.5 Pro | $1.25 to $2.50 | $10.00 to $15.00 | Higher tier above 200K input |
| MiniMax M2.5 | $0.30 | $1.20 | Far cheaper than frontier-tier models |
| Kimi K2 Turbo | $1.15 cache miss | $8.00 | Moonshot also lists $0.15 for cache hit |
Some practical conclusions:
- frontier closed-source models often sit around a few dollars per million input tokens and significantly more for output
- pro-tier or heavily reasoning-oriented models can become dramatically more expensive
- cheaper open or more open models delivered through APIs can sit well below one dollar per million input tokens and just a few dollars on the output side
That is also why long prompts, large RAG payloads, and generous output limits quickly affect production costs.
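Using two rows from the table above, here is a quick per-call cost comparison. The 50K-input / 2K-output workload is an illustrative assumption:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one call, with prices quoted per million tokens."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

workload = (50_000, 2_000)  # 50K tokens in, 2K tokens out (illustrative)

print(f"Claude Sonnet 4.5: ${call_cost(*workload, 3.00, 15.00):.4f}")  # $0.1800
print(f"MiniMax M2.5:      ${call_cost(*workload, 0.30, 1.20):.4f}")   # $0.0174
```

Roughly a tenfold difference per call, which compounds quickly at production volumes.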
If you want to understand deployment cost more deeply, or compare API usage with your own infrastructure, read Where does your AI live? and our financial analysis of AI deployment.
How Large Are Context Windows Today?
Context lengths vary a lot between models. It is also important to read the fine print. Some providers state a theoretical maximum, others distinguish between standard mode and special beta or premium modes, and some count the total as input plus output.
Some verified examples as of March 14, 2026:
| Model | Verified context length | Comment |
|---|---|---|
| Claude Opus 4.6 | 200K standard, 1M beta | 1M requires a special beta header |
| Claude Sonnet 4.5 | 200K standard, 1M beta | Same setup as Opus |
| GPT-5.4 | 272K standard, up to 1M in API and Codex | Above the standard tier, higher usage rules apply |
| Gemini 2.5 Pro | 1,048,576 input, 65,536 output | Official model limit |
| Gemini 3 Pro Preview | 1,048,576 input, 65,536 output | Preview model |
| MiniMax M2.5 | 204,800 total | MiniMax counts input and output together |
The largest verified mainstream windows therefore sit around one million tokens, though that does not mean everything automatically works well at that scale. Many other models sit around 128K, 200K, or 400K, and older or more specialized models can be significantly lower.
How You Should Think About This in Practice
When you build a real system, the most important question is rarely "which model has the longest context?" but rather:
- which information the model actually needs to solve the task
- how you can reduce noise and distractors
- how much budget needs to be reserved for the answer
- when you should use RAG, summarization, or caching instead of sending everything again
- when it becomes cheaper or smarter to use another model or your own deployment
Long context windows are powerful. But they are not the same thing as free memory, unlimited understanding, or automatic quality.
Summary
The context window is the model's active workspace. It is filled with input tokens, and often also with space for output and sometimes reasoning. That is why capacity, quality, and cost all connect directly to how you manage context.
That is also why context engineering has become so central. Good AI systems are not just about choosing the right model, but about giving the model the right information, in the right amount, in the right order, and at the right cost.
Larger context often helps. But better context almost always helps more.