MMLU Benchmark — May 2026
The Massive Multitask Language Understanding benchmark, six years on. Top 15 model scores, the MMLU-to-MMLU-Pro divergence, the subject-level picture, and our Saturation Index — a single number that tells you when MMLU stops being useful for your workload.
MMLU in one paragraph
MMLU is a 57-subject, ~16,000-question multiple-choice benchmark introduced by Hendrycks et al. in 2020. Each question has four answer choices; a model's score is the percentage answered correctly. Subjects span STEM (college mathematics, formal logic, machine learning), humanities (philosophy, professional law), social sciences (econometrics, public relations), and other (clinical knowledge, business ethics). For the first three years it was the closest thing AI had to a universal capability measure. Today it is partially saturated at the frontier, which is why the ecosystem is migrating to MMLU Pro and a portfolio of alternatives.
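For readers new to the format, a minimal sketch of how the aggregate score falls out of per-question results, grouped by subject the way the benchmark reports it. The records below are made-up placeholders, not real eval output:

```python
# Sketch: how an aggregate MMLU score is computed from per-question results.
# The records are placeholders; a real harness iterates over all ~16,000 items.
from collections import defaultdict

# Each record: (subject, model_answer, correct_answer) with four choices A-D.
results = [
    ("college_mathematics", "B", "B"),
    ("college_mathematics", "D", "A"),
    ("professional_law",    "C", "C"),
    ("clinical_knowledge",  "A", "A"),
]

per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
for subject, predicted, gold in results:
    per_subject[subject][0] += predicted == gold
    per_subject[subject][1] += 1

for subject, (correct, total) in sorted(per_subject.items()):
    print(f"{subject}: {100 * correct / total:.1f}% ({correct}/{total})")

overall = 100 * sum(c for c, _ in per_subject.values()) / sum(t for _, t in per_subject.values())
print(f"Aggregate MMLU: {overall:.1f}%")
```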
Top 15 on MMLU and MMLU Pro
| Rank | Model | Vendor | Notes | MMLU (%) | MMLU Pro (%) | Arena Elo |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | Coding Arena #1; strongest aggregate MMLU + MMLU Pro pairing. | 91.2 | 84.6 | 1495 |
| 2 | Gemini 3.1 Pro | Google | Text Arena leader. MMLU score saturated; MMLU Pro is the better signal here. | 90.8 | 83.9 | 1500 |
| 3 | GPT-5.5 Pro | OpenAI | $30/$180 per 1M tokens. Strong reasoning; poor MMLU per dollar at this price. | 90.4 | 84.1 | 1488 |
| 4 | GPT-5.5 | OpenAI | Default tier. Moderate drop from MMLU to MMLU Pro. | 89.7 | 82.8 | 1462 |
| 5 | DeepSeek V4 Pro | DeepSeek | Apache 2.0. Best MMLU per dollar by a wide margin. | 89.4 | 82.3 | 1462 |
| 6 | Claude Sonnet 4 | Anthropic | Default workhorse. Real-world performance stronger than its MMLU score suggests. | 88.9 | 80.4 | 1402 |
| 7 | Qwen 3.6 Plus | Alibaba | Open weights; strong on multilingual MMLU subjects. | 88.5 | 79.8 | 1423 |
| 8 | GPT-4.1 | OpenAI | Legacy frontier. MMLU Pro gap shows the saturation effect. | 87.8 | 78.2 | 1395 |
| 9 | Gemini 2.5 Pro | Google | Strong long-context; MMLU Pro lags slightly. | 87.4 | 77.5 | 1388 |
| 10 | Llama 4 Maverick | Meta | Open weights; widely used for fine-tuning. | 86.9 | 76.4 | 1352 |
| 11 | Mistral Large 3 | Mistral | EU-hosted option; competitive MMLU. | 86.4 | 75.8 | 1341 |
| 12 | Gemma 4 | Google | Apache 2.0; first sub-10B model above 85 MMLU. | 85.8 | 74.6 | 1268 |
| 13 | Phi-4 | Microsoft | Small model with strong MMLU; Arena rank is much lower. | 85.1 | 73.2 | 1245 |
| 14 | Nemotron 3 | NVIDIA | 30B multimodal open weights. | 84.9 | 73.8 | 1289 |
| 15 | DeepSeek V4 Flash | DeepSeek | $0.14/$0.28 per 1M tokens; extreme MMLU value. | 84.2 | 72.4 | 1142 |
MMLU vs MMLU Pro divergence
MMLU is saturating; MMLU Pro is not. The gap between a model's MMLU and MMLU Pro scores measures how much harder the follow-up benchmark is for that specific model: the larger the gap, the more headroom the model still has on the original benchmark.
Top 5 MMLU minus MMLU Pro gap (May 2026):

| Model | MMLU (%) | MMLU Pro (%) | Gap (pts) |
|---|---|---|---|
| Claude Opus 4.7 | 91.2 | 84.6 | 6.6 |
| Gemini 3.1 Pro | 90.8 | 83.9 | 6.9 |
| GPT-5.5 Pro | 90.4 | 84.1 | 6.3 |
| GPT-5.5 | 89.7 | 82.8 | 6.9 |
| DeepSeek V4 Pro | 89.4 | 82.3 | 7.1 |

Mid-pack divergence (legacy frontier):

| Model | MMLU (%) | MMLU Pro (%) | Gap (pts) |
|---|---|---|---|
| GPT-4.1 | 87.8 | 78.2 | 9.6 |
| Llama 4 Maverick | 86.9 | 76.4 | 10.5 |
| Phi-4 | 85.1 | 73.2 | 11.9 |

A smaller gap means the model is close to the ceiling on MMLU Pro too; the top of the leaderboard is now near-uniform on both benchmarks.
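For teams reproducing these gaps from their own eval runs, a minimal sketch under the assumption that per-model scores are kept in a simple mapping. The figures are copied from the tables above; the helper name is illustrative:

```python
# Sketch: compute MMLU vs MMLU Pro gaps and rank models by remaining headroom.
# Scores are the published May 2026 figures quoted above; names are illustrative.

scores = {
    "Claude Opus 4.7":  {"mmlu": 91.2, "mmlu_pro": 84.6},
    "Gemini 3.1 Pro":   {"mmlu": 90.8, "mmlu_pro": 83.9},
    "GPT-5.5 Pro":      {"mmlu": 90.4, "mmlu_pro": 84.1},
    "GPT-4.1":          {"mmlu": 87.8, "mmlu_pro": 78.2},
    "Llama 4 Maverick": {"mmlu": 86.9, "mmlu_pro": 76.4},
    "Phi-4":            {"mmlu": 85.1, "mmlu_pro": 73.2},
}

def gap(entry: dict) -> float:
    """Points lost when moving from MMLU to the harder MMLU Pro."""
    return round(entry["mmlu"] - entry["mmlu_pro"], 1)

# Largest gap first: these models sit furthest from the frontier ceiling.
for model, entry in sorted(scores.items(), key=lambda kv: gap(kv[1]), reverse=True):
    print(f"{model:<18} MMLU {entry['mmlu']:.1f}  MMLU Pro {entry['mmlu_pro']:.1f}  gap {gap(entry):.1f}")
```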
Subject-level breakdown
| Subject | Category | Note | Top score (%) | Median (%) | Spread (pts) |
|---|---|---|---|---|---|
| College Mathematics | STEM | Where reasoning models pull ahead of base models. | 88.4 | 71.2 | 17.2 |
| Professional Law | Humanities | Saturated at the top; moderate spread between models. | 91.8 | 79.4 | 12.4 |
| Clinical Knowledge | Other | Most-saturated subject; the top of the leaderboard hits the ceiling. | 95.2 | 86.1 | 9.1 |
| Formal Logic | STEM | Largest score spread; the benchmark still discriminates here. | 86.5 | 65.3 | 21.2 |
| Moral Disputes | Humanities | Sensitive to RLHF tuning; high variance under preference shifts. | 89.6 | 78.4 | 11.2 |
| Machine Learning | STEM | Heavily represented in training data; the ceiling is close. | 92.3 | 82.7 | 9.6 |

Spread is the top score minus the median across the top 15 models.
The MMLU Saturation Index
Our framework: the MMLU Saturation Index is the percentage of top-15 models scoring at or above 85 on MMLU. When this number crosses 50%, MMLU has stopped discriminating at the frontier. When it crosses 80%, MMLU is functionally retired for procurement. Scores at the ceiling become noise.
MMLU Saturation Index (May 2026):
- Models at or above 85 MMLU: 13 of 15
- Models at or above 90 MMLU: 3 of 15
- Saturation Index: 87%, above the 80% threshold: functionally retired for procurement
- Trajectory: at the current rate of new releases scoring 85+, the index crosses 90% within approximately 6-9 months.
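A minimal sketch of the index as defined above, using the top-15 MMLU scores from the leaderboard table; the function name and threshold handling are ours:

```python
# Sketch: MMLU Saturation Index = share of top-15 models scoring >= 85 on MMLU.
# The 50% / 80% thresholds follow the framework described above.

TOP_15_MMLU = [91.2, 90.8, 90.4, 89.7, 89.4, 88.9, 88.5, 87.8,
               87.4, 86.9, 86.4, 85.8, 85.1, 84.9, 84.2]

def saturation_index(scores: list[float], threshold: float = 85.0) -> float:
    """Percentage of models at or above the threshold score."""
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)

index = saturation_index(TOP_15_MMLU)
print(f"Saturation Index: {index:.0f}%")  # 87% for the May 2026 list
if index >= 80:
    print("Above 80%: functionally retired for procurement.")
elif index >= 50:
    print("Above 50%: no longer discriminating at the frontier.")
```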
Why this matters: a benchmark's usefulness drops sharply once a critical mass of models clusters near ceiling. The statistical signal-to-noise ratio collapses. Procurement teams still citing "model X scores 91 on MMLU" in 2026 are quoting noise.
What to do this quarter
- Stop using MMLU as a primary procurement input. The top of the leaderboard is uniformly above 89. Score differences are within statistical noise.
- Add MMLU Pro to every model-comparison spec. It still discriminates at the frontier and will continue to for another 12-18 months.
- Pair MMLU with GPQA Diamond and HLE. Three benchmarks together survive a single benchmark's saturation curve. Composite scoring beats single-number rankings (see the sketch after this list).
- Use MMLU as a floor, not a ceiling. A model below 80 MMLU is suspect for general use; a model above 88 MMLU has cleared the floor and needs a different signal to justify selection.
- Read subject-level scores when subjects matter. If you are deploying for clinical or legal work, the relevant MMLU subject score is more useful than the aggregate.
- Audit your eval harness for MMLU contamination. MMLU is heavily represented in training data. Treat repeated MMLU runs as a sanity check, not as performance evidence.
- Track the Saturation Index quarterly. When a benchmark crosses 50%, plan its retirement. When it crosses 80%, retire it. The same logic will apply to MMLU Pro within two years.
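As referenced in the GPQA Diamond and HLE item above, a minimal sketch of composite scoring across several benchmarks. The weights and candidate scores are illustrative placeholders, not published results, and in practice you would normalize each benchmark to a common scale before averaging:

```python
# Sketch: composite capability score across MMLU Pro, GPQA Diamond, and HLE.
# Weights and example scores are placeholders chosen for illustration only.

WEIGHTS = {"mmlu_pro": 0.4, "gpqa_diamond": 0.4, "hle": 0.2}

def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average over whichever benchmarks were actually run."""
    common = [b for b in weights if b in scores]
    total_weight = sum(weights[b] for b in common)
    return sum(scores[b] * weights[b] for b in common) / total_weight

candidates = {
    "model_a": {"mmlu_pro": 84.0, "gpqa_diamond": 71.0, "hle": 22.5},  # placeholder values
    "model_b": {"mmlu_pro": 83.5, "gpqa_diamond": 69.5, "hle": 24.0},  # placeholder values
}

for name, scores in sorted(candidates.items(), key=lambda kv: composite(kv[1]), reverse=True):
    print(f"{name}: composite {composite(scores):.1f}")
```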
Related reading
- AI Model Leaderboard — composite quality index across MMLU, HumanEval, MATH
- LMSys Arena Leaderboard May 2026 — the human-preference complement to capability benchmarks
- April 2026 AI Model Releases Roundup — the release cadence driving saturation
Internal eval harnesses that run MMLU, MMLU Pro, GPQA, and HLE against multiple providers in parallel typically route those providers through Swfte Connect, which gives a single endpoint and uniform telemetry across vendors.
Scores compiled from official model technical reports, the Papers with Code MMLU leaderboard, Artificial Analysis snapshots, and published Arena Elo data, as of 2026-05-06.