What is a context window? AI's working memory, explained for business owners
When people talk about an AI model's “context window,” they make it sound technical. It is not. A context window is the model's working memory: how much text it can hold in mind at one time while it answers you.
Get this concept and a lot of AI behaviour suddenly makes sense - why pasting a whole contract works, why a long chat sometimes goes vague, and why bigger is not always better.
Tokens, in plain terms
Context windows are measured in tokens, not words. A token is a chunk of text: often a short word, sometimes just a piece of a longer one. On average, in English, one token is about 0.75 of a word, so:
- 1,000 tokens is roughly 750 words
- 1,000,000 tokens is roughly 750,000 words, or more than 700 pages
So when a model advertises a “one million token context window,” read it as: it can keep about 700-plus pages of text in mind at once. That is the whole point - you can hand it a lot.
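If you want to sanity-check whether a document will fit, the rule of thumb above can be turned into a few lines of Python. This is a rough sketch, not a real tokenizer: the 0.75 ratio is the article's average for English, the function names are made up for illustration, and actual counts vary by model and by language.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb average for English, not an exact figure


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words a given number of tokens holds."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words needs."""
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(word_count: int, window_tokens: int) -> bool:
    """Rough check: does a document of word_count words fit in the window?"""
    return words_to_tokens(word_count) <= window_tokens


print(tokens_to_words(1_000))            # 750
print(tokens_to_words(1_000_000))        # 750000
print(fits_in_window(120_000, 200_000))  # True: about 160,000 tokens needed
```

Treat the result as an estimate with a healthy margin. For anything close to the limit, count tokens with the provider's own tooling instead.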
Tokens also matter because pricing is measured in them. Hold that thought for the cost section.
Why it matters for your business
A big context window changes what you can hand a model in one go. Instead of chopping a job into pieces, you can give it the whole thing:
- A full set of supplier contracts, and ask where the renewal and termination clauses differ
- A quarter of customer email threads, and ask for the three complaints that keep recurring
- An entire product manual, and ask it to draft a plain-English FAQ
Before large windows, you had to slice documents up and stitch the answers back together, which lost context and introduced errors. Now the model can see the connections across the whole document because it is all in memory at once.
The 2026 reality
Context windows have grown fast. Where the major models sit today:
- Claude Opus 4.8 and Claude Sonnet 4.6: 1 million tokens each
- Claude Haiku 4.5: around 200,000 tokens - smaller window, but faster and cheaper
- Google Gemini 3.5: around 1 million tokens
- OpenAI GPT-5.5: around 400,000 tokens in Codex, and up to 1 million tokens via the API
A million tokens is now the high end across several providers, which is why “just give it the whole thing” has become a realistic way to work for many tasks.
Note the Haiku example, though. A smaller window is a feature, not a flaw, when the job is small and you want speed and a lower bill. Matching the model to the task beats always reaching for the biggest one.
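One way to picture "match the model to the task" is a small helper that picks the first model whose window is big enough. The sketch below uses the window sizes quoted in this article. The ordering from cheapest to most expensive, the 4,000-token allowance for the reply, and the function name are all assumptions for illustration, not published figures.

```python
# Window sizes (in tokens) as quoted in this article.
# The cheapest-first ordering is an assumption for illustration.
MODELS_CHEAPEST_FIRST = [
    ("Claude Haiku 4.5", 200_000),
    ("Claude Sonnet 4.6", 1_000_000),
    ("Claude Opus 4.8", 1_000_000),
]


def pick_model(prompt_tokens: int, reply_tokens: int = 4_000):
    """Return the first (assumed cheapest) model whose window fits the job.

    The reply is counted too, since the answer also has to fit in the window.
    Returns None when nothing is large enough.
    """
    needed = prompt_tokens + reply_tokens
    for name, window in MODELS_CHEAPEST_FIRST:
        if needed <= window:
            return name
    return None


print(pick_model(50_000))     # Claude Haiku 4.5
print(pick_model(600_000))    # Claude Sonnet 4.6
print(pick_model(2_000_000))  # None: too big for any single window
```

In practice you would also weigh how demanding the task is, not just its size, but the size check is a sensible first filter.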
Bigger is not always better
Here is the part people miss: a giant window is not a free win, for two reasons.
Cost. Everything in the window is re-processed each time the model generates a response, and you pay per input token. A massive prompt carried across a long conversation can get expensive quickly. The mitigation is prompt caching: the provider stores repeated context so you do not pay full price to re-process it every turn, which can cut the cost of that repeated portion by a large margin. If you reuse the same big document across many questions, caching matters.
Quality. Counterintuitively, stuffing the window full can lower answer quality. This is often called context rot: bury the relevant detail in a huge pile of loosely related text and the model can lose the thread. Models also tend to pay closest attention to material at the very start and very end of a long input, so something important in the middle of a 600-page dump can get under-weighted.
The takeaway: giving the model the right context usually beats giving it the most context.
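To see why the cost point adds up so quickly, here is a back-of-the-envelope calculation. Every number in it is a placeholder: the price per million tokens and the caching discount are invented for illustration and are not any provider's actual rates. The sketch also ignores output tokens, the growing chat history, and any fee for writing to the cache.

```python
def conversation_input_cost(doc_tokens, turns, price_per_million, cached_discount=0.0):
    """Input cost of carrying the same document through every turn of a chat.

    cached_discount is the fraction taken off the repeated reads after the
    first turn (0.0 means no caching at all).
    """
    first_turn = doc_tokens / 1_000_000 * price_per_million
    repeat_turns = first_turn * (1 - cached_discount) * (turns - 1)
    return first_turn + repeat_turns


# Hypothetical scenario: a 500,000-token document, 20 questions,
# a made-up price of $3 per million input tokens.
no_cache = conversation_input_cost(500_000, 20, 3.00)
with_cache = conversation_input_cost(500_000, 20, 3.00, cached_discount=0.9)

print(f"Without caching: ${no_cache:.2f}")
print(f"With caching:    ${with_cache:.2f}")
```

Under these made-up numbers the same 20 questions cost $30 without caching and about $4.35 with it. The exact figures will differ for your provider, but the shape of the result is the point: repeated context is where the bill grows, and caching is where it shrinks.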
So what should you actually do?
For one-off jobs - analyse this contract, summarise these emails - pasting the whole thing into a big window is exactly the right move. Simple and effective.
For a standing knowledge base you query again and again - your policies, your product catalogue, years of support tickets - do not paste everything every time. Two better patterns:
- Projects and persistent context. Tools like Claude’s Projects let you attach reference material once so it is available across your conversations, instead of re-pasting it.
- Retrieval (RAG). Retrieval-augmented generation stores your knowledge base separately and fetches only the most relevant snippets for each question, feeding just those into the window. You get the relevant facts without re-processing the entire library on every query, which is cheaper and scales far better as the knowledge base grows and changes.
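The retrieval idea is easier to grasp with a toy example. The sketch below scores each snippet by how many words it shares with the question and passes only the best matches to the model. Real RAG systems usually rank by meaning using embeddings and a vector database, so the word-overlap scoring here is a deliberately simple stand-in, and the knowledge base and function names are invented for illustration.

```python
import re


def words(text):
    """Lowercase a text and split it into a set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, snippets, top_k=2):
    """Return up to top_k snippets sharing the most words with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(s)), s) for s in snippets]
    scored = [pair for pair in scored if pair[0] > 0]  # drop snippets with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]


# Hypothetical knowledge base, stored outside the context window.
KNOWLEDGE_BASE = [
    "Refunds are issued within 14 days of a returned order.",
    "Our support desk is open Monday to Friday.",
    "Returned items must be unused to qualify for refunds.",
    "Shipping to Europe takes five working days.",
]

question = "How do refunds work for a returned item?"
context = retrieve(question, KNOWLEDGE_BASE)

# Only the retrieved snippets go into the prompt, not the whole library.
prompt = (
    "Answer using only this context:\n"
    + "\n".join(context)
    + "\n\nQuestion: " + question
)
print(prompt)
```

Here only the two refund-related snippets reach the model, and the lines about support hours and shipping stay out of the window. With four snippets the saving is trivial. With years of support tickets it is the difference between a few hundred tokens and the whole archive on every query.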
A rough rule: if it is one document and one task, use the context window directly. If it is a large or frequently updated body of knowledge you will ask about repeatedly, use retrieval.
The one-line version
A context window is working memory measured in tokens. One million tokens is over 700 pages, and several 2026 models reach it. That lets you hand a model whole documents at once - but everything in the window costs tokens and gets re-processed, and over-stuffing it can hurt quality. Use big windows for big one-off jobs, use caching when you repeat context, and use retrieval for the knowledge base you keep coming back to.
Frequently asked questions
How big is a 1 million token context window?
A token is roughly 0.75 of a word, so one million tokens is around 750,000 words, or more than 700 pages of text. In practical terms that is a stack of contracts, a quarter of a busy inbox, or a full product manual, all able to sit in the model's working memory at the same time. It is a lot, but it is not infinite, and it is not free to use.
Does a bigger context window mean better answers?
Not automatically. A bigger window means the model can consider more at once, but stuffing it full of loosely relevant material can actually reduce answer quality, an effect often called context rot. Models also tend to attend better to information at the start and end of a long input. Giving the model the right context usually beats giving it the most context.
What is a token?
A token is the unit AI models use to measure text. It is a chunk of a word, on average about 0.75 of a word in English, so 1,000 tokens is roughly 750 words. Pricing and context limits are both measured in tokens, which is why understanding them helps you reason about cost and capacity.
What is RAG and how is it different from a big context window?
Retrieval-augmented generation, or RAG, fetches only the most relevant snippets from a large knowledge base and feeds just those into the context window for each question. A giant context window holds everything at once. RAG is usually cheaper and scales better for large or frequently updated knowledge bases, because you are not re-processing the entire library on every query.
Why does context cost money every time?
Models re-process the content in the window each time they generate a response, and you are billed per token of input. So a huge prompt repeated across many messages can get expensive. Prompt caching helps a lot by storing repeated context so it does not have to be paid for in full each time, which can cut the cost of that repeated portion substantially.