Updated May 6, 2026

MMLU Benchmark — May 2026

The Massive Multitask Language Understanding benchmark, six years on. Top 15 model scores, the MMLU-to-MMLU-Pro divergence, the subject-level picture, and our Saturation Index — a single number that tells you when MMLU stops being useful for your workload.

MMLU in one paragraph

MMLU is a 57-subject, ~16,000-question multiple-choice benchmark introduced by Hendrycks et al. in 2020. Each question has four answer choices; a model's score is the percentage answered correctly. Subjects span STEM (college mathematics, formal logic, machine learning), humanities (philosophy, professional law), social sciences (econometrics, public relations), and other (clinical knowledge, business ethics). For the first three years it was the closest thing AI had to a universal capability measure. Today it is partially saturated at the frontier, which is why the ecosystem is migrating to MMLU Pro and a portfolio of alternatives.
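Scoring is simple accuracy over four-choice questions. A minimal sketch; the question format, field names, and toy data here are illustrative assumptions, not the official evaluation harness:

```python
# Illustrative MMLU-style scoring: percent correct over four-choice questions.
# The dict layout ("choices", "answer") is an assumption for this sketch.

def mmlu_score(questions, predict):
    """Return the percentage of questions answered correctly.

    questions: list of dicts with "choices" (4 strings) and "answer" (index 0-3).
    predict:   callable mapping a question dict to a chosen index 0-3.
    """
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return 100.0 * correct / len(questions)

# Toy run: a predictor that always picks choice 0 gets 1 of 2 right here.
toy = [
    {"choices": ["A", "B", "C", "D"], "answer": 0},
    {"choices": ["A", "B", "C", "D"], "answer": 2},
]
print(mmlu_score(toy, lambda q: 0))  # 50.0
```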

Top 15 on MMLU and MMLU Pro

   #   Model               Org         MMLU   MMLU Pro   Arena Elo
   1   Claude Opus 4.7     Anthropic   91.2   84.6       1495
       Coding Arena #1; strongest aggregate MMLU + MMLU Pro pairing.
   2   Gemini 3.1 Pro      Google      90.8   83.9       1500
       Text Arena leader. MMLU score saturated; MMLU Pro is the better signal here.
   3   GPT-5.5 Pro         OpenAI      90.4   84.1       1488
       $30/$180 per 1M tokens. Strong reasoning; MMLU value is poor.
   4   GPT-5.5             OpenAI      89.7   82.8       1462
       Default tier. Drop from MMLU to MMLU Pro is moderate.
   5   DeepSeek V4 Pro     DeepSeek    89.4   82.3       1462
       Apache 2.0. Best MMLU per dollar by a wide margin.
   6   Claude Sonnet 4     Anthropic   88.9   80.4       1402
       Default workhorse. Strong real-world performance relative to its MMLU score.
   7   Qwen 3.6 Plus       Alibaba     88.5   79.8       1423
       Open weights; strong multilingual MMLU subjects.
   8   GPT-4.1             OpenAI      87.8   78.2       1395
       Legacy frontier. MMLU Pro gap shows the saturation effect.
   9   Gemini 2.5 Pro      Google      87.4   77.5       1388
       Strong long-context; MMLU Pro lags slightly.
  10   Llama 4 Maverick    Meta        86.9   76.4       1352
       Open weights; widely used for fine-tuning.
  11   Mistral Large 3     Mistral     86.4   75.8       1341
       EU-hosted option; competitive MMLU.
  12   Gemma 4             Google      85.8   74.6       1268
       Apache 2.0; first sub-10B model above 85 MMLU.
  13   Phi-4               Microsoft   85.1   73.2       1245
       Small model with strong MMLU; Arena rank is much lower.
  14   NVIDIA Nemotron 3   NVIDIA      84.9   73.8       1289
       30B multimodal open weights.
  15   DeepSeek V4 Flash   DeepSeek    84.2   72.4       1142
       $0.14/$0.28 per 1M tokens; extreme MMLU value.

MMLU vs MMLU Pro divergence

MMLU is saturating; MMLU Pro is not. The gap between a model's MMLU and MMLU Pro scores measures how much harder the follow-up benchmark is for that specific model. A larger gap means more headroom remains on MMLU Pro; a smaller gap means the model is approaching ceiling on both benchmarks.

Top 5 MMLU minus MMLU Pro gap (May 2026)

  Claude Opus 4.7    91.2 - 84.6 = 6.6 pts gap   ████████
  Gemini 3.1 Pro     90.8 - 83.9 = 6.9 pts gap   █████████
  GPT-5.5 Pro        90.4 - 84.1 = 6.3 pts gap   ████████
  GPT-5.5            89.7 - 82.8 = 6.9 pts gap   █████████
  DeepSeek V4 Pro    89.4 - 82.3 = 7.1 pts gap   █████████

Mid-pack divergence (median legacy frontier)
  GPT-4.1            87.8 - 78.2 = 9.6 pts gap   ████████████
  Llama 4 Maverick   86.9 - 76.4 = 10.5 pts gap  █████████████
  Phi-4              85.1 - 73.2 = 11.9 pts gap  ███████████████

Smaller gap = closer to ceiling on MMLU Pro too.
Top of leaderboard is now near-uniform on both benchmarks.
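The gap column is plain subtraction over the leaderboard scores. A quick sketch recomputing the top-five gaps, sorted smallest to largest:

```python
# Recompute MMLU minus MMLU Pro gaps for the top five models listed above.
scores = {
    "Claude Opus 4.7": (91.2, 84.6),
    "Gemini 3.1 Pro":  (90.8, 83.9),
    "GPT-5.5 Pro":     (90.4, 84.1),
    "GPT-5.5":         (89.7, 82.8),
    "DeepSeek V4 Pro": (89.4, 82.3),
}
# Smaller gap = closer to ceiling on MMLU Pro as well.
for model, (mmlu, pro) in sorted(scores.items(), key=lambda kv: kv[1][0] - kv[1][1]):
    print(f"{model:<16} gap {mmlu - pro:.1f} pts")
```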

Subject-level breakdown

  Subject               Category     Top score   Median   Spread
  College Mathematics   STEM         88.4        71.2     17.2
      Where reasoning models pull ahead of base models.
  Professional Law      Humanities   91.8        79.4     12.4
      Highly saturated; small spread between models.
  Clinical Knowledge    Medicine     95.2        86.1     9.1
      Most-saturated subject. Top of the leaderboard hits ceiling.
  Formal Logic          STEM         86.5        65.3     21.2
      Largest score spread; benchmark still discriminates.
  Moral Disputes        Humanities   89.6        78.4     11.2
      Subject to RLHF tuning; high variance under preference shifts.
  Machine Learning      STEM         92.3        82.7     9.6
      Heavily represented in training data; ceiling is close.

  Spread = top score minus median across the top 15 models.

The MMLU Saturation Index

Our framework: the MMLU Saturation Index is the percentage of top-15 models scoring at or above 85 on MMLU. When this number crosses 50%, MMLU has stopped discriminating at the frontier. When it crosses 80%, MMLU is functionally retired for procurement. Scores at the ceiling become noise.

MMLU Saturation Index (May 2026)

  Models above 85 MMLU      13 of 15
  Models above 90 MMLU      3 of 15
  Saturation Index          87%

  ABOVE 80%: functionally retired for procurement

Trajectory: the index already sits above the 80% retirement threshold; at the
current rate of new model releases scoring 85+, it approaches 100% within
approximately 6-9 months.

Why this matters: a benchmark's usefulness drops sharply once a critical mass of models clusters near ceiling. The statistical signal-to-noise ratio collapses. Procurement teams still citing "model X scores 91 on MMLU" in 2026 are quoting noise.
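The index itself is a one-line calculation over the leaderboard. A sketch, using the top-15 MMLU scores listed above (the `saturation_index` helper is illustrative, not a published tool):

```python
# MMLU Saturation Index: share of top-15 models at or above a threshold.

def saturation_index(scores, threshold=85.0):
    """Percent of models scoring at or above `threshold`, as a whole number."""
    hits = sum(s >= threshold for s in scores)
    return round(100 * hits / len(scores))

# Top-15 MMLU scores from the May 2026 leaderboard above.
top15 = [91.2, 90.8, 90.4, 89.7, 89.4, 88.9, 88.5, 87.8,
         87.4, 86.9, 86.4, 85.8, 85.1, 84.9, 84.2]

print(saturation_index(top15))        # 87 -> above the 80% retirement line
print(saturation_index(top15, 90.0))  # 20 -> the 3-of-15 above 90
```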

What to do this quarter

  1. Stop using MMLU as a primary procurement input. The top of the leaderboard is uniformly above 89. Score differences are within statistical noise.
  2. Add MMLU Pro to every model-comparison spec. It still discriminates at the frontier and will continue to for another 12-18 months.
  3. Pair MMLU with GPQA Diamond and HLE. Three benchmarks together survive a single benchmark's saturation curve. Composite scoring beats single-number rankings.
  4. Use MMLU as a floor, not a ceiling. A model below 80 MMLU is suspect for general use; a model above 88 MMLU has cleared the floor and needs a different signal to justify selection.
  5. Read subject-level scores when subjects matter. If you are deploying for clinical or legal work, the relevant MMLU subject score is more useful than the aggregate.
  6. Audit your eval harness for MMLU contamination. MMLU is heavily represented in training data. Treat repeated MMLU runs as a sanity check, not as performance evidence.
  7. Track the Saturation Index quarterly. When a benchmark crosses 50%, plan its retirement. When it crosses 80%, retire it. The same logic will apply to MMLU Pro within two years.
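The composite scoring in item 3 can be sketched as a weighted average across benchmarks. The weights and the example scores below are illustrative assumptions, not recommendations or real model results:

```python
# Illustrative composite score across three benchmarks (item 3 above).
# Weights and the hypothetical model's scores are assumptions for this sketch.

def composite(scores, weights):
    """Weighted average of benchmark scores, all on a 0-100 scale."""
    total = sum(weights.values())
    return sum(scores[b] * w for b, w in weights.items()) / total

weights = {"mmlu_pro": 0.4, "gpqa_diamond": 0.4, "hle": 0.2}
model_a = {"mmlu_pro": 84.6, "gpqa_diamond": 70.0, "hle": 25.0}  # hypothetical

print(round(composite(model_a, weights), 1))  # 66.8
```

A composite like this degrades gracefully: when one benchmark saturates, the other two still discriminate, which single-number MMLU rankings cannot do.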

Related reading

Internal eval harnesses that run MMLU, MMLU Pro, GPQA, and HLE against multiple providers in parallel typically front the providers through Swfte Connect for a single endpoint and uniform telemetry across vendors.

Scores compiled from official model technical reports, the Papers with Code MMLU leaderboard, Artificial Analysis snapshots, and published Arena Elo data, as of May 6, 2026.