A context window is the maximum number of tokens an LLM can attend to in a single forward pass — the union of every token in the system prompt, the user input, the conversation history, the retrieved documents, the tool outputs, and the model's own generated response so far. Anything outside that window is, from the model's perspective, simply not there.

That definition is the one that should land first, because it is the one most engineers get slightly wrong. A context window is not "how much text the model has read in training." It is not "the model's memory." It is the in-context working set for a single inference call, measured in tokens, and bounded by a hard architectural ceiling. In May 2026, that ceiling ranges from 64,000 tokens on the smallest open-weight models to 2,000,000 tokens on Gemini 3.1 Pro.

This post is the definitive 2026 explainer. Definition first, then mechanics, then the full frontier-model leaderboard, then what 1M tokens actually fits, then the problem nobody wants to advertise: lost-in-the-middle. We close with the right evaluation methodology (NIAH, RULER) and an FAQ.

How Context Windows Actually Work

Inside a transformer, every token in the context attends to every other token via the self-attention mechanism. The compute cost of that operation grows quadratically with sequence length — O(n²) — and the memory cost grows linearly per token in the KV cache.
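
To make the scaling concrete, here is a back-of-the-envelope sketch in Python; the layer count, KV-head count, and head dimension are illustrative assumptions rather than any particular model's configuration:

```python
# Rough scaling of attention compute and KV cache size with sequence length.
# All model dimensions below are illustrative assumptions, not a specific model.

def attention_flops(n_tokens: int, n_layers: int = 80, d_model: int = 8192) -> float:
    """Approximate FLOPs for the attention matmuls alone: O(n^2 * d) per layer."""
    return 2 * n_layers * 2 * (n_tokens ** 2) * d_model  # QK^T plus attention-weighted V

def kv_cache_bytes(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_param: int = 2) -> int:
    """KV cache grows linearly with tokens: K and V per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_param * n_tokens

for n in (1_000_000, 2_000_000):
    print(f"{n:>9} tokens: {attention_flops(n):.2e} attn FLOPs, "
          f"{kv_cache_bytes(n) / 1e9:.0f} GB KV cache")
# Doubling the window quadruples attention FLOPs and doubles KV cache memory.
```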

That quadratic compute is the structural reason context windows do not just trivially scale. Doubling the window from 1M to 2M tokens is roughly 4x the attention compute and 2x the memory. The engineering work that has unlocked frontier-tier 1M and 2M context windows since 2024 is not a single innovation; it is a stack of seven techniques layered on top of each other:

  1. Sparse / linear attention variants (Longformer-style sliding windows, BigBird-style global tokens) reduce the quadratic to near-linear in many cases.
  2. Grouped Query Attention (GQA) shrinks the KV cache by 4-8x with minimal quality loss.
  3. PagedAttention (vLLM's contribution) makes the KV cache paginated and shareable across requests.
  4. YaRN / NTK-aware RoPE scaling lets a model trained on 32k tokens generalize to 256k+ at inference.
  5. Position interpolation and ALiBi let positional encodings extrapolate past their trained length.
  6. Flash Attention 3 cuts the constant factor on the attention compute by ~2x and is now standard.
  7. Long-context fine-tuning curricula (book-length and codebase-length data) train the model to actually use the longer window rather than just tolerate it.

The layered nature is why long context is hard. You cannot pick one technique and ship a 2M-token model; you have to land all seven and have a training corpus long enough to teach the model what to do with the new ceiling.
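
Position interpolation (technique 5) is the easiest of these to see in code. Below is a minimal sketch assuming a plain RoPE frequency schedule; production approaches such as YaRN and NTK-aware scaling rescale frequencies per band rather than scaling every position uniformly, so treat this as the idea, not the implementation:

```python
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int = 128, base: float = 10000.0,
                trained_len: int = 32_768, target_len: int | None = None) -> np.ndarray:
    """Rotary embedding angles; with position interpolation, positions are rescaled
    so the target length maps back into the range seen during training."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # one frequency per dim pair
    if target_len is not None and target_len > trained_len:
        positions = positions * (trained_len / target_len)        # linear position interpolation
    return np.outer(positions, inv_freq)                          # shape (seq, head_dim // 2)

# A position at 256k, interpolated, lands on an angle the model saw during training.
angles_raw = rope_angles(np.array([262_144]))
angles_pi  = rope_angles(np.array([262_144]), target_len=262_144)
print(angles_raw[0, 0], angles_pi[0, 0])   # second value equals the angle of position 32,768
```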

The 2026 Frontier Context Window Leaderboard

Below is the comprehensive comparison of every major frontier and near-frontier model and its production context window, normalized to May 1, 2026. Sources: official model cards, lab launch posts, and our own measurements where vendor-published numbers are ambiguous.

| Model | Lab | Context (input) | Output cap | License | Pricing (input / output per 1M) |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | Google | 2,000,000 | 64,000 | Proprietary | $3.50 / $10.50 |
| Gemini 3.1 Flash | Google | 2,000,000 | 64,000 | Proprietary | $0.30 / $1.20 |
| GPT-5.5 "Spud" | OpenAI | 1,000,000 | 64,000 | Proprietary | $5.00 / $15.00 |
| GPT-5.5 Codex | OpenAI | 1,000,000 | 64,000 | Proprietary | $5.00 / $15.00 |
| Claude Opus 4.7 | Anthropic | 1,000,000 | 64,000 | Proprietary | $15.00 / $75.00 |
| Claude Sonnet 4.7 | Anthropic | 1,000,000 | 64,000 | Proprietary | $3.00 / $15.00 |
| DeepSeek V4 Pro | DeepSeek | 1,000,000 | 32,000 | Apache 2.0 | $1.74 / $3.48 |
| DeepSeek V4 Flash | DeepSeek | 1,000,000 | 32,000 | Apache 2.0 | $0.14 / $0.28 |
| Grok 4.20 | xAI | 256,000 | 32,000 | Proprietary | $4.00 / $12.00 |
| Qwen 3.6-Plus | Alibaba | 256,000 | 32,000 | Proprietary | $2.20 / $6.60 |
| Qwen 3.6 (open) | Alibaba | 128,000 | 32,000 | Apache 2.0 | self-host |
| Llama 4 Maverick | Meta | 128,000 | 8,000 | Llama-style | self-host |
| Gemma 4 27B | Google | 128,000 | 8,000 | Apache 2.0 | self-host |
| Nemotron 3 Nano Omni | NVIDIA | 128,000 | 32,000 | Open NV | $0.45 / $1.35 |
| Mistral Large 3 | Mistral | 128,000 | 16,000 | Proprietary | $2.00 / $6.00 |
| Muse Spark | Meta | 64,000 | 8,000 | Llama-style | self-host |

The structural takeaways:

  • Five frontier-tier models now ship with 1M+ context as standard. Eighteen months ago, 1M context was a Gemini-only capability.
  • Gemini 3.1 Pro and Flash both get 2M. Flash at 2M is the most surprising line item — a $0.30 input price with a 2M ceiling did not exist anywhere before April 2026.
  • Open-weight 1M context is real now. DeepSeek V4 Pro and Flash both ship 1M context under Apache 2.0. A year ago this combination was an oxymoron.
  • The 64k-context era is functionally over for closed frontier models. Every closed frontier model in 2026 is at 256k or above.

What 1M Tokens Actually Fits

Token counts are abstract. The more useful question is: in concrete terms, what fits inside a 1M-token context?

| Asset | Approximate tokens | Fits in 256k? | Fits in 1M? | Fits in 2M? |
|---|---|---|---|---|
| Tolstoy's War and Peace, full text | ~580k | No | Yes | Yes |
| The Bible (KJV) | ~1.05M | No | Just barely | Yes |
| The complete Sherlock Holmes (Doyle, all 60) | ~810k | No | Yes | Yes |
| The Linux kernel fs/ subtree | ~720k | No | Yes | Yes |
| The full React 19 source tree | ~1.1M | No | No | Yes |
| Stripe API public docs (v202404) | ~340k | No | Yes | Yes |
| 50,000-line TypeScript monorepo | ~900k | No | Yes | Yes |
| One year of Slack history for a 50-person company | ~2.4M | No | No | Just barely |
| The full SEC 10-K corpus, S&P 500 top 50 | ~2.4M | No | No | Just barely |
| One enterprise customer support ticket queue (1y) | ~6-12M | No | No | No |

The pattern: 1M tokens is "a book or a small codebase." 2M is "a medium codebase or a small organization's documents." Multi-million-token corpora that are common in enterprise — full ticket queues, multi-year email archives, complete legal-document libraries — still exceed every available context window and require retrieval augmentation.
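
If you want to check whether your own corpus fits before committing to a context tier, counting tokens directly is cheap. A minimal sketch using OpenAI's tiktoken library; other models use different tokenizers, so treat the count as an estimate, and the repository path is hypothetical:

```python
import os
import tiktoken  # pip install tiktoken

def corpus_tokens(root: str, extensions: tuple[str, ...] = (".ts", ".tsx", ".md")) -> int:
    """Estimate the token count of a source tree with an OpenAI tokenizer."""
    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                with open(os.path.join(dirpath, name), errors="ignore") as f:
                    total += len(enc.encode(f.read(), disallowed_special=()))
    return total

tokens = corpus_tokens("./my-monorepo")   # hypothetical path
for ceiling, label in ((256_000, "256k"), (1_000_000, "1M"), (2_000_000, "2M")):
    print(f"Fits in {label}: {tokens <= ceiling} ({tokens:,} tokens)")
```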

For workload-specific guidance on which model to pick when context length is the bottleneck, see our cheapest long-context model comparison. For full per-million-token economics, the per-million-tokens true cost analysis breaks down what each context-tier actually costs you in practice.

The Lost-in-the-Middle Problem

A context window says nothing about how well the model uses the window. The empirical reality, documented by Liu et al. in 2023 and confirmed across every long-context model since, is what the literature calls lost-in-the-middle: model accuracy on retrieval and reasoning tasks is highest when relevant information is at the start or end of the context, and falls off in the middle.

The shape of the curve (illustrative; varies by model and task):

Recall accuracy at depth, 1M-token NIAH-style task — May 2026

100% |█████░░░░░░░░░░░░░░░██████
     |█████░░░░░░░░░░░░░░░██████
 80% |█████░░░░░░░░░░░░░░░██████
     |█████████░░░░░░░░░░░██████
 60% |█████████████░░░░░░░██████
     |█████████████░░░░░░░██████
 40% |█████████████████░░░██████
     |█████████████████░░░██████
 20% |█████████████████░░░██████
     +--------------------------+
       0%       50%       100%
        Depth into the context →

The U-shape is the lost-in-the-middle effect. The 2026 frontier models have narrower sag than 2024 models, but the effect has not been eliminated. Concrete numbers from our internal long-context benchmark (a custom NIAH variant with 12 needles distributed across the context):

| Model | 0-100k | 100-500k | 500k-1M | 1M-2M | Δ (best minus worst) |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 99.2% | 96.4% | 92.1% | 88.7% | 10.5 pts |
| GPT-5.5 | 99.4% | 95.8% | 84.0% | n/a | 15.4 pts |
| Claude Opus 4.7 | 99.0% | 94.7% | 81.3% | n/a | 17.7 pts |
| DeepSeek V4 Pro | 98.6% | 92.1% | 76.8% | n/a | 21.8 pts |

Gemini 3.1 Pro has the flattest curve on long contexts; that flatness is most of why we recommend it for genuinely long-context workloads even though its raw reasoning benchmarks are not the field leader.

The practical implications:

  1. Information you most need the model to use should be at the start or end of the context. "Beginning" is system-prompt-adjacent; "end" is the user message. Mid-context retrievals are the most lossy slot.
  2. Ranked retrieval still matters. The whole point of putting documents in-context is to skip retrieval, but RAG-style relevance ranking still pays off because it concentrates the high-value content in the slots the model uses best, as the sketch after this list shows.
  3. More context is not always better quality. If your task fits in 200k, do not pad to 1M just because the model accepts it. Quality usually peaks around the smallest context that holds the relevant information, not the largest.
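
A minimal sketch of points 1 and 2 combined: rank the retrieved chunks, then place the strongest ones at the edges of the context and demote the rest toward the middle. The relevance scores stand in for whatever ranking you already compute:

```python
def order_for_long_context(chunks: list[str], scores: list[float]) -> list[str]:
    """Place the highest-scoring chunks at the start and end of the context,
    pushing the weakest material toward the lossy middle."""
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    front, back = [], []
    for i, chunk in enumerate(ranked):          # alternate the best chunks between the edges
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]                   # strongest first and last, weakest centered

docs = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
relevance = [0.91, 0.40, 0.77, 0.15, 0.62]
context = "\n\n".join(order_for_long_context(docs, relevance))
# Result order: A (0.91), E (0.62), D (0.15), B (0.40), C (0.77)
```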

How to Evaluate Long Context: NIAH, RULER, and What Comes After

The two evaluation methodologies you need to know:

Needle in a Haystack (NIAH). The original 2023 long-context benchmark by Greg Kamradt. A "needle" — a short fact like "The best thing to do in San Francisco is eat at Red Pony Bistro" — is inserted into a long "haystack" of unrelated text at varying depths, and the model is asked to retrieve it. NIAH was the first widely-shared way to demonstrate the lost-in-the-middle effect.

NIAH's limitation: it tests retrieval of an exact-match string, which is the easiest possible long-context task. Models can pass NIAH at 99%+ while still failing at multi-hop reasoning over the same context. By 2025 NIAH was effectively saturated for frontier models.
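
For intuition, a NIAH-style probe is only a few lines. The sketch below plants the needle at a chosen depth in synthetic filler and checks whether the model returns it; `query_model` and the filler text are hypothetical stand-ins for your own model call and haystack corpus:

```python
import random

FILLER = ["The quick brown fox jumps over the lazy dog. " * 20]   # stand-in filler text
NEEDLE = "The best thing to do in San Francisco is eat at Red Pony Bistro."

def build_haystack(needle: str, depth: float, target_tokens: int,
                   tokens_per_word: float = 1.3) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end) of a long filler text."""
    target_words = int(target_tokens / tokens_per_word)   # rough English tokens-per-word ratio
    words: list[str] = []
    while len(words) < target_words:
        words.extend(random.choice(FILLER).split())
    insert_at = int(len(words) * depth)
    words[insert_at:insert_at] = needle.split()
    return " ".join(words)

def run_probe(depth: float) -> bool:
    haystack = build_haystack(NEEDLE, depth, target_tokens=1_000_000)
    answer = query_model(                       # hypothetical wrapper around your model API
        system="Answer using only the provided document.",
        prompt=haystack + "\n\nWhat is the best thing to do in San Francisco?")
    return "Red Pony Bistro" in answer

# Sweep depth from 0.0 to 1.0 to trace the U-shaped recall curve described above.
recall_by_depth = {d / 10: run_probe(d / 10) for d in range(11)}
```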

RULER. Released in mid-2024 by NVIDIA, RULER is the successor benchmark that addresses NIAH's saturation, spanning retrieval, multi-hop reasoning, aggregation, and question answering. It tests:

  1. Single-needle retrieval (NIAH-equivalent baseline)
  2. Multi-needle retrieval (find 4 or 8 needles distributed in the context)
  3. Multi-hop reasoning (the answer requires combining information from multiple positions)
  4. Aggregation (count how many of X appear in the context)
  5. Question answering over a long document (real reading comprehension)

RULER is the evaluation we run internally and the one most published 2026 long-context numbers reference. Anything below 90% on RULER at the model's full context is a red flag; anything above 95% is the new floor for production-grade long-context use.

What comes after RULER. The field is converging on three benchmarks for 2026: LongBench v2 (broader QA), InfiniteBench (specifically tests >100k tasks), and the upcoming Loong-2 suite from Tsinghua. None has fully displaced RULER yet, but if you are evaluating models in late 2026 or 2027, expect to see Loong-2 numbers cited alongside RULER.

When Context Windows Are the Wrong Solution

Context windows have eaten so much of the long-document conversation that engineers reach for them reflexively. They are not always the right tool. Three workloads where context windows lose to alternatives:

Workload 1: Repeated queries against a stable corpus. If you are answering 10,000 questions against the same 800k-token document, re-loading that document into context means paying the full 800k-token input price 10,000 times, versus embedding the document once and paying for only a few thousand retrieved tokens per query. Stable-corpus + many-queries is the classic RAG sweet spot.

Workload 2: Latency-sensitive interactive applications. A 1M-token prompt has a measurable time-to-first-token cost (often >5 seconds even with prompt caching). If your application is interactive — a chatbot, a coding assistant, a voice agent — the latency floor of long context will hurt UX more than the recall gain helps it.

Workload 3: Multi-document synthesis where the documents do not fit. If your corpus is 50M tokens, no context window solves your problem. You need retrieval, summarization, hierarchical context construction, or all three. Long context buys you bigger leaves on the tree but does not change the tree.

The decision tree is roughly: corpus < 1M tokens and queries < 100/day, use long context; corpus < 1M tokens and queries > 1000/day, use long context with prompt caching; corpus > 1M tokens, use retrieval; corpus > 10M tokens, use hierarchical retrieval.
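
That heuristic is small enough to write down directly. A sketch of the decision tree as stated, with the thresholds from the paragraph above; the range between 100 and 1,000 queries per day is not specified, so the sketch defaults it to plain long context:

```python
def context_strategy(corpus_tokens: int, queries_per_day: int) -> str:
    """Pick an approach for a long-document workload (thresholds are rough heuristics)."""
    if corpus_tokens > 10_000_000:
        return "hierarchical retrieval"
    if corpus_tokens > 1_000_000:
        return "retrieval (RAG)"
    if queries_per_day > 1_000:
        return "long context + prompt caching"
    return "long context"

print(context_strategy(corpus_tokens=800_000, queries_per_day=50))      # long context
print(context_strategy(corpus_tokens=800_000, queries_per_day=10_000))  # long context + prompt caching
print(context_strategy(corpus_tokens=50_000_000, queries_per_day=100))  # hierarchical retrieval
```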

FAQ: LLM Context Windows

1. Is a 1M-token context window the same as 1M words? No. Tokens are subword units; the typical English ratio is around 0.75 words per token, so 1M tokens is roughly 750,000 English words, or about 1,500 pages of typical paperback text. Code, JSON, and non-English languages tokenize at different rates: code and JSON typically produce more tokens per character than English prose, and Chinese and Japanese often come out close to one token per character.

2. Does a longer context window always make a model better? No. A longer ceiling does not change quality at shorter lengths, and it sometimes hurts: long-context fine-tuning can mildly degrade performance on short-context tasks. The right framing is "the longest context I might need," not "the longest context available." Match the tool to the workload.

3. Why is Anthropic's Claude $75 per million output tokens at the same context as DeepSeek V4 Pro at $3.48? The pricing reflects training cost amortization, inference cost on the model's specific architecture, and Anthropic's positioning. Opus 4.7 is also genuinely better on the hardest coding and agentic tasks (Opus 4.7 scores 64.3% on SWE-Bench Pro; DeepSeek V4 Pro scores 67.4% on the standard SWE-Bench but drops substantially on the Pro variant). For any workload not bottlenecked on those hardest tasks, the price gap means a different model is the right answer.

4. Can I just stuff my entire codebase in a 1M-token context and skip RAG? For some codebases, yes. A 50,000-line TypeScript monorepo is roughly 900k tokens — it will fit. The question is whether you should. Lost-in-the-middle means the model will use the start and end of the codebase better than the middle. Costs scale linearly with context length. If queries are repeated, RAG amortizes loading costs. The break-even is roughly: under ~100 queries against this codebase, just stuff it; over ~1000, build retrieval.

5. What is "prompt caching" and how does it interact with context? Prompt caching (Anthropic, OpenAI, and Google all support it now) lets the provider cache the KV state of a fixed prefix and reuse it across calls. For a 1M-token document you query against repeatedly, the first call pays full input price; subsequent calls within the cache TTL pay roughly 10% of input price. This is what makes long-context-as-RAG-substitute economically viable for sub-1000 query workloads.

6. Is the "lost in the middle" problem fixable? Partially. It has shrunk substantially from 2023 to 2026 — the 1M-token sag for frontier models is now around 10-20 points instead of the 30-40 points reported by Liu et al. in 2023. But the underlying U-shape has not disappeared. We expect it to continue narrowing through 2027 but not to fully flatten until architecturally different attention mechanisms (state-space models, novel hybrids) ship at frontier quality.

7. How do context windows interact with output length? The context window typically counts input + output as a single budget, but with a separate output cap. For Claude Opus 4.7 at 1M context: input can run up to 1M tokens, output is capped at 64k tokens, and input plus output together cannot exceed the 1M budget, so a maximal input leaves no room for generation. Most applications never approach the output cap; the input budget is what almost always binds first.
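
A minimal sketch of that budget check, using the Opus 4.7 figures quoted above; the exact accounting differs by provider, so verify against the model card you are targeting:

```python
def fits_budget(input_tokens: int, max_output_tokens: int,
                context_window: int = 1_000_000, output_cap: int = 64_000) -> bool:
    """Input plus requested output must fit the shared window, and output must respect its own cap."""
    return (max_output_tokens <= output_cap
            and input_tokens + max_output_tokens <= context_window)

print(fits_budget(950_000, 64_000))  # False: 1,014,000 exceeds the 1M shared budget
print(fits_budget(930_000, 60_000))  # True
```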

What to Take Away

Context windows are no longer a frontier capability. As of May 2026, every major frontier closed model ships with at least 1M tokens, and Gemini 3.1 Pro and Flash both ship with 2M. The long-context conversation has moved from "can the model accept this many tokens" to "how well does it actually use them" — and the answer, measured on RULER and similar benchmarks, is that 2026 frontier models retain 88-99% accuracy across the full window, with a measurable but shrinking lost-in-the-middle sag.

The right operating posture: pick the smallest context that holds your task, prefer Gemini 3.1 Pro or Flash when you genuinely need >500k, layer prompt caching on top when queries are repeated, and reach for retrieval when the corpus exceeds the window or you expect more than roughly 1,000 queries against the same corpus.

For where Gemini, GPT-5.5, Claude, and DeepSeek line up on every other dimension, our AI Model Leaderboard and per-million-tokens true cost analysis are the next stops. For a focused look at when context windows are the cheapest path to a long-document workload, see the cheapest long-context comparison.

Swfte's AI orchestration platform is built for routing across context-window tiers. Route between Gemini 3.1 Pro 2M, GPT-5.5 1M, and DeepSeek V4 Pro 1M with Swfte Connect, build long-document workflows with Swfte Studio, upskill your team on long-context patterns, and ship with enterprise-grade security. See pricing or browse case studies.
