What Is a Context Window? And Why Context Engineering Has Become Critical

Published March 14, 2026 by Joel Thyberg


The context window has become one of the most important practical questions in modern AI systems. That applies whether you are building a simple assistant, a RAG system, or a more advanced agent. For almost every real-world solution, the same question appears: what information does the model actually get to see, in what order, and at what cost?

If you first want to understand why text is measured in tokens at all, start with our deep dive on tokenizers. Here, the focus is on the model's actual working area, the context window itself, and why context engineering, sometimes called context management, has become so important.

Short answer

The context window is the amount of tokenized information a model can keep active in a single call. That includes not only your question, but also system prompt, history, documents, RAG context, and often room for the answer itself.

Why it matters

It determines what the model can take into account, what a call costs, and how robust the answer stays as the amount of information grows.

Common misconception

That the context window is just the user's prompt. In reality, many different parts compete for the same token budget.

What Is a Context Window?

A context window is the amount of tokenized information the model can work with in a given call. You can think of it as the model's active workspace or temporary working memory.

It is important to understand that the context window does not just consist of your latest question. In a real system it is often filled by several layers at once:

  • system prompt or developer instructions
  • the user's question
  • previous messages in the conversation
  • tool results and function responses
  • documents, attachments, or RAG-retrieved chunks
  • the space the model needs to generate its answer

All of this competes for the same token budget.

[Illustration: an example token budget divided between system prompt, history, question, documents and RAG, answer, and thinking.]

Input tokens

Everything you send before the model responds. That includes system prompt, user prompt, history, tool output, and retrieved context.

Output tokens

The tokens the model actually generates in its visible response. They cost money too and must fit inside the model's limits.

Reasoning tokens

In reasoning models, additional tokens may be used for internal thinking. How they are counted and billed varies across providers.

The proportions above are illustrative. The important part is that everything competes for the same limited workspace.

Which Tokens Actually Count?

In practice, you almost always need to think in three budgets at the same time:

  1. Input tokens, meaning everything that is sent in
  2. Output tokens, meaning the visible answer the model writes
  3. Reasoning tokens, in the models that use a separate or hidden thinking process

This is also where many people get surprised. A "short" user message may in reality sit on top of a large system prompt, a long chat history, and several RAG chunks. Then the total token count becomes much larger than it first appears.
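To make that concrete, here is a rough sketch of how a short question can ride on top of a much larger hidden context. The 4-characters-per-token ratio is a crude heuristic, not a real tokenizer, and the part sizes are made up for illustration; actual counts vary by model and tokenizer.

```python
# Rough illustration: a "short" user question sitting on top of a large
# system prompt, chat history, and RAG payload. The 4 chars/token estimate
# is an assumption, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

call_parts = {
    "system_prompt": "You are a helpful assistant..." * 50,
    "chat_history": "User: ...\nAssistant: ..." * 200,
    "rag_chunks": "Retrieved document text ..." * 300,
    "user_question": "What does clause 7 say?",  # the "short" part
}

breakdown = {name: estimate_tokens(text) for name, text in call_parts.items()}
total = sum(breakdown.values())

for name, tokens in breakdown.items():
    print(f"{name:>14}: ~{tokens} tokens")
print(f"{'total input':>14}: ~{total} tokens")
```

The question itself accounts for only a handful of the thousands of estimated input tokens in the call.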

Reasoning tokens, what are they?

Reasoning models often try to spend extra budget on intermediate thinking before they produce the final answer. You do not always see that in the UI, but you notice it in cost, latency, and sometimes in how much room is left for everything else.

The exact behavior differs across providers:

  • At Anthropic, extended thinking is counted as output tokens while it is generated. Their documentation also describes how thinking blocks from previous turns can count as input tokens in later calls in some flows.
  • At Google, the Gemini pricing page states that the output price covers both response and reasoning.
  • At other providers, both naming and accounting vary, but the practical conclusion is the same: reasoning is not free.

That means a model that "thinks longer" can become not only more expensive, but can also use more of the available budget for the call itself.
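The cost effect is easy to see with a small calculation. The sketch below assumes reasoning tokens are billed at the output rate, as Anthropic's documentation describes for extended thinking; the token counts and prices are placeholders for illustration, not any provider's current list prices.

```python
# Illustrative cost of a single call when reasoning tokens are billed at
# the output rate. Prices and token counts are made-up examples.

def call_cost_usd(input_tokens: int, output_tokens: int, reasoning_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call, with reasoning billed like output."""
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

# Same prompt and visible answer, with and without heavy thinking:
quick = call_cost_usd(20_000, 800, 0,
                      input_price_per_m=3.0, output_price_per_m=15.0)
thinker = call_cost_usd(20_000, 800, 10_000,
                        input_price_per_m=3.0, output_price_per_m=15.0)

print(f"without reasoning: ${quick:.4f}")
print(f"with 10K reasoning tokens: ${thinker:.4f}")
```

In this example the thinking-heavy call costs roughly three times as much, even though the visible answer is identical.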

Why You Cannot Just Paste in an Entire Book

It is tempting to think that a larger context window solves everything. If the model can handle hundreds of thousands or even a million tokens, why not just paste in all the material at once?

There are at least three reasons that is often a bad idea:

  • it may not fit
  • it may become expensive
  • it may become worse

A very long book, plus instructions, plus history, plus the desired answer, can quickly exceed the limit. And even before that, relevant details can drown in noise.
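One practical defense is a pre-flight fit check: before sending, verify that the fixed parts of the call plus a reserved answer budget fit inside the window, and truncate the document if not. The window size, reserve, and 4 chars/token estimate below are all assumptions; real systems should count with the provider's tokenizer and use smarter truncation than a hard cut.

```python
# Sketch of a pre-flight fit check with a reserved output budget.
# CONTEXT_WINDOW and RESERVED_OUTPUT are assumed values for illustration.

CONTEXT_WINDOW = 200_000   # assumed model limit, input + output together
RESERVED_OUTPUT = 4_000    # room we keep for the answer

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_document(fixed_parts: list[str], document: str) -> str:
    """Truncate `document` so the whole call fits in the window."""
    fixed = sum(estimate_tokens(p) for p in fixed_parts)
    budget = CONTEXT_WINDOW - RESERVED_OUTPUT - fixed
    if budget <= 0:
        raise ValueError("fixed parts alone exceed the window")
    if estimate_tokens(document) <= budget:
        return document
    return document[: budget * 4]  # keep roughly `budget` tokens

book = "x" * 2_000_000  # a "book" of roughly 500K estimated tokens
kept = fit_document(["system prompt", "user question"], book)
print(f"kept ~{estimate_tokens(kept)} of ~{estimate_tokens(book)} tokens")
```

The point is not the exact numbers, but that the output reserve is part of the budget from the start rather than an afterthought.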


The context window acts as the model's workspace. When less of the relevant information fits, the answer tends to become thinner or less reliable.

Why Do Answers Get Worse as Context Grows?

This is one of the most important lessons in long-context work: a larger context does not automatically mean more stable performance.

In the report Context Rot, published by Chroma on July 14, 2025, the authors found that model performance often declines as input length grows, even in relatively simple and controlled tasks. The point was not only that models can miss information, but that the increase in input length itself appears to create problems.

In practice, that often shows up as the model:

  • missing details that are actually present in the context
  • becoming more uncertain or contradictory
  • locking onto the wrong part of the material
  • answering a distracting detail instead of the central question
  • losing track of instructions when too much other material competes for attention

More text does not automatically mean better answers

Too little context

The model lacks important facts. The answer becomes shallow or guess-heavy.

Right amount, right order

Relevant signal dominates. The model gets exactly what it needs for the task.

Too much or too messy

Noise, distractors, and long sequences make the model lose sharpness. This is where context rot becomes visible.

That is also why models and AI agents rarely become better than the information they actually have access to and can work with. If the context is weak, scattered, or poorly prioritized, a strong model alone will not save the result.

What Is Context Engineering or Context Management?

Context engineering is about controlling which information the model gets, in what order, in what form, and with what budget. It is one of the most important practical layers in modern AI systems.

So it is not just about "having a long context", but about using that context intelligently.

In practice, that often means you need to:

  1. choose the right information instead of all information
  2. place important information early and clearly
  3. summarize or compress old history
  4. retrieve relevant context with RAG instead of sending everything every time
  5. budget for both output and reasoning, not just input
  6. use caching when the same large context returns often
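Steps 1 and 2 can be sketched as a greedy selection: take the most relevant chunks first, stop when the token budget is spent, and emit them in relevance order so the strongest signal lands early. The relevance scores and the 4 chars/token estimate below are stand-ins for a real retriever and tokenizer.

```python
# Minimal sketch of budget-aware context selection. Scores are assumed to
# come from a retriever; the token estimate is a crude heuristic.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_context(chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedy selection: highest relevance first, within the budget."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would blow the budget
        selected.append(text)
        used += cost
    return selected

chunks = [
    (0.91, "Clause 7 limits liability to direct damages." * 5),
    (0.40, "Boilerplate definitions section." * 40),
    (0.85, "Clause 7.2 carves out gross negligence." * 5),
    (0.10, "Unrelated appendix about office hours." * 60),
]
context = build_context(chunks, budget_tokens=120)
print(context)
```

With a 120-token budget, only the two high-relevance clauses survive; the bulky low-relevance chunks are dropped instead of crowding them out.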

This is where much of the real quality in AI agents and other agent systems is created. An agent with good tools but poor context quickly becomes expensive, slow, and unreliable. An agent with strong context engineering can feel much smarter than the base model alone suggests.

These Are the Tokens You Pay For

API providers normally charge per million input tokens and per million output tokens. That means the context window is not only a capacity question, but a direct cost question as well.

The pricing examples below were verified on March 14, 2026 against the providers' official pages. They can change quickly.

| Model | Input per 1M | Output per 1M | Comment |
| --- | --- | --- | --- |
| Claude Opus 4.6 | $5.00 | $25.00 | Heavy frontier model |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Faster closed-source flagship |
| GPT-5.4 | $2.50 | $15.00 | Standard price applies below 270K context |
| GPT-5.4 Pro | $30.00 | $180.00 | Extremely expensive pro tier |
| Gemini 3 Pro Preview | $2.00 to $3.00 | $12.00 to $15.00 | Higher tier above 200K input |
| Gemini 2.5 Pro | $1.25 to $2.50 | $10.00 to $15.00 | Higher tier above 200K input |
| MiniMax M2.5 | $0.30 | $1.20 | Far cheaper than frontier-tier models |
| Kimi K2 Turbo | $1.15 cache miss | $8.00 | Moonshot also lists $0.15 for cache hit |

Some practical conclusions:

  • frontier closed-source models often sit around a few dollars per million input tokens and significantly more for output
  • pro-tier or heavily reasoning-oriented models can become dramatically more expensive
  • cheaper open-weight or otherwise more open models delivered through APIs can sit well below one dollar per million input tokens and just a few dollars on the output side

That is also why long prompts, large RAG payloads, and generous output limits quickly affect production costs.
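A back-of-the-envelope projection shows how quickly prompt size compounds. The calculation below uses the Claude Sonnet 4.5 prices from the table above ($3.00 in, $15.00 out per million tokens); the call volumes and prompt sizes are made up for illustration.

```python
# Monthly cost of the same traffic with a lean vs. a bloated prompt.
# Prices follow the Claude Sonnet 4.5 row above; volumes are assumptions.

def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_price: float = 3.00, out_price: float = 15.00) -> float:
    return calls * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

lean = monthly_cost(100_000, in_tokens=2_000, out_tokens=500)
bloated = monthly_cost(100_000, in_tokens=40_000, out_tokens=500)

print(f"lean prompt:    ${lean:,.2f}/month")
print(f"bloated prompt: ${bloated:,.2f}/month")
```

At 100K calls per month, padding every call with 38K extra input tokens multiplies the bill nearly tenfold without changing the output at all.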

If you want to understand deployment cost more deeply, or compare API usage with your own infrastructure, read Where does your AI live? and our financial analysis of AI deployment.

How Large Are Context Windows Today?

Context lengths vary a lot between models. It is also important to read the fine print. Some providers state a theoretical maximum, others distinguish between standard mode and special beta or premium modes, and some count the total as input plus output.

Some verified examples as of March 14, 2026:

| Model | Verified context length | Comment |
| --- | --- | --- |
| Claude Opus 4.6 | 200K standard, 1M beta | 1M requires a special beta header |
| Claude Sonnet 4.5 | 200K standard, 1M beta | Same setup as Opus |
| GPT-5.4 | 272K standard, up to 1M in API and Codex | Above the standard tier, higher usage rules apply |
| Gemini 2.5 Pro | 1,048,576 input, 65,536 output | Official model limit |
| Gemini 3 Pro Preview | 1,048,576 input, 65,536 output | Preview model |
| MiniMax M2.5 | 204,800 total | MiniMax counts input and output together |

The largest verified mainstream windows I found therefore sit around one million tokens, though that does not mean everything automatically works perfectly at that scale. Many other models sit around 128K, 200K, or 400K. Older or more specialized models can be significantly lower.
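The fine-print differences matter in code, too. The sketch below checks a planned call against a model's limits, distinguishing models that cap input and output separately from models that cap the combined total, as the table above describes. The limit values mirror that table; the check itself is a simple illustration, not any provider's official API.

```python
# Checking a planned call against per-model limits. Some models cap input
# and output separately, others cap the combined total.

def fits(limits: dict, input_tokens: int, output_tokens: int) -> bool:
    """True if the planned call fits within the model's stated limits."""
    if "total" in limits:
        return input_tokens + output_tokens <= limits["total"]
    return input_tokens <= limits["input"] and output_tokens <= limits["output"]

gemini_25_pro = {"input": 1_048_576, "output": 65_536}  # separate caps
minimax_m25 = {"total": 204_800}                        # combined cap

print(fits(gemini_25_pro, 900_000, 60_000))  # both parts under their caps
print(fits(minimax_m25, 150_000, 60_000))    # 210K combined exceeds 204,800
```

The same 150K-input, 60K-output call that is comfortable on a separate-cap model fails on a combined-cap model, which is exactly why reading the fine print matters.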

How You Should Think About This in Practice

When you build a real system, the most important question is rarely "which model has the longest context?" but rather:

  • which information the model actually needs to solve the task
  • how you can reduce noise and distractors
  • how much budget needs to be reserved for the answer
  • when you should use RAG, summarization, or caching instead of sending everything again
  • when it becomes cheaper or smarter to use another model or your own deployment

Long context windows are powerful. But they are not the same thing as free memory, unlimited understanding, or automatic quality.

Summary

The context window is the model's active workspace. It is filled with input tokens, and often also with space for output and sometimes reasoning. That is why capacity, quality, and cost all connect directly to how you manage context.

That is also why context engineering has become so central. Good AI systems are not just about choosing the right model, but about giving the model the right information, in the right amount, in the right order, and at the right cost.

Larger context often helps. But better context almost always helps more.