What is a Context Window in LLMs?

Plain-language explanation of context windows, tokens, and how to plan long inputs in real applications.

A context window is the maximum amount of text — measured in tokens — that an LLM can read at once. Everything you send (system prompt, user message, tool results, retrieved documents, conversation history) plus everything the model writes back lives inside the same window. Once you exceed that limit, the API either rejects the request or silently truncates it.
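
To make this concrete, here is a minimal budgeting sketch in Python. The window and output limits below are illustrative values, not any particular model's:

```python
# Input and output share one window, so reserve output space up front.
CONTEXT_WINDOW = 200_000  # total tokens the model accepts (example value)
MAX_OUTPUT = 8_192        # tokens reserved for the reply (example value)

def fits_in_window(input_tokens: int) -> bool:
    """True if the prompt plus the reserved output budget fits the window."""
    return input_tokens + MAX_OUTPUT <= CONTEXT_WINDOW

print(fits_in_window(190_000))  # True: 198_192 <= 200_000
print(fits_in_window(195_000))  # False: 203_192 would overflow
```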

Tokens, not characters

Most modern LLMs use byte-pair encoding tokenizers. A useful rough rule for English: 1 token ≈ 4 characters or ≈ 0.75 words. A 200K-token window therefore fits roughly 150,000 English words, or 600 pages of standard text.

Code, JSON and non-Latin scripts tokenize less efficiently — Python source can run 1 token per 2-3 characters, and Chinese typically uses one token per character.
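
You can check these ratios empirically. The sketch below uses tiktoken's cl100k_base encoding as one example BPE tokenizer; other models' tokenizers will give somewhat different counts:

```python
# Compare the chars/4 heuristic against an actual BPE tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example encoding

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "python":  "def f(x):\n    return {k: v for k, v in x.items()}",
    "chinese": "上下文窗口是模型一次能读取的最大文本量。",
}

for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name:8s} chars={len(text):3d} tokens={n_tokens:3d} "
          f"chars/4 heuristic={len(text) / 4:.0f}")
```

Expect the heuristic to track English closely and to undercount tokens for code and Chinese.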

Why bigger isn't automatically better

Three practical limits push back against "just paste everything":

  • Cost — long-context models often charge a premium above 200K tokens (see each model page's >200K-token rate).
  • Effective recall — published context lengths describe what the model accepts, not what it reliably uses. Most models show degraded recall for content in the middle of very long contexts ("lost in the middle").
  • Latency — time-to-first-token grows roughly linearly with input length.

When long context wins vs. RAG

If your full corpus fits in 200K-1M tokens and is queried infrequently, long-context inlining is simpler than building a retrieval pipeline. If the corpus is large or growing, classical RAG (chunk + embed + retrieve top-k) remains cheaper and faster.

A common production pattern is hybrid: retrieve a coarse chunk set with RAG, then drop the top-k chunks into a 200K-context model for re-ranking and answer generation.
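
A minimal sketch of that hybrid flow, assuming you query a vector index first. The embedder here is a placeholder stand-in; swap in a real embedding model and a real chat call:

```python
# Hybrid pattern: coarse vector retrieval, then hand the top-k chunks
# to a long-context model for re-ranking and answer generation.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder; replace with a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def top_k_chunks(query: str, chunks: list[str], k: int = 20) -> list[str]:
    """Rank chunks by cosine similarity to the query and keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str], k: int = 20) -> str:
    """Assemble a long-context prompt from the retrieved chunks."""
    context = "\n\n".join(top_k_chunks(query, chunks, k))
    # Send this prompt to a 200K-context model (provider call omitted).
    return f"Context:\n{context}\n\nQuestion: {query}"
```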

Frequently asked questions

What does context window measure — characters or words?

Tokens, which are a model-specific unit roughly equivalent to 0.75 English words, or about one character in CJK scripts. A 200K-token window holds about 150K English words, or roughly 600 standard pages.

Is more context always better?

No. Cost scales with input length (and some providers add a premium above 200K tokens), and effective recall (getting the right detail back) often degrades for content buried in the middle of very long prompts. For many apps, RAG feeding a 200K window beats stuffing a 1M window.

Do models charge differently for the >200K portion?

Some do. Anthropic and Google list separate over-200K rates that can be 2× the standard rate. Each model's detail page shows the over-200K tier when applicable.
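
A tiered bill is easy to model. The rates below are illustrative only, not any provider's actual prices:

```python
# Input cost with a 2x premium on tokens beyond the 200K tier.
BASE_RATE = 3.00 / 1_000_000  # $/input token up to 200K (illustrative)
OVER_RATE = 6.00 / 1_000_000  # $/input token beyond 200K (illustrative 2x)
TIER = 200_000

def input_cost(tokens: int) -> float:
    """Dollar cost of `tokens` input tokens under tiered pricing."""
    return min(tokens, TIER) * BASE_RATE + max(tokens - TIER, 0) * OVER_RATE

print(f"${input_cost(150_000):.2f}")  # $0.45 - all in the base tier
print(f"${input_cost(300_000):.2f}")  # $1.20 - 200K base + 100K at 2x
```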

How is context window different from output limit?

Context window is the total of input + output you can fit in one call. The output limit is the maximum the model can generate in one reply (usually much smaller — e.g. 128K context with 16K max output).

Data is sourced from models.dev and normalized for comparison. Prices and capabilities may change. Always verify critical production decisions with the provider's official documentation.