Updated May 6, 2026

MMLU Benchmark — May 2026

The Massive Multitask Language Understanding benchmark, six years on. Top 15 model scores, the MMLU-to-MMLU-Pro divergence, the subject-level picture, and our Saturation Index — a single number that tells you when MMLU stops being useful for your workload.

MMLU in one paragraph

MMLU is a 57-subject, ~16,000-question multiple-choice benchmark introduced by Hendrycks et al. in 2020. Each question has four answer choices; a model's score is the percentage answered correctly. Subjects span STEM (college mathematics, formal logic, machine learning), humanities (philosophy, professional law), social sciences (econometrics, public relations), and other (clinical knowledge, business ethics). For the first three years it was the closest thing AI had to a universal capability measure. Today it is partially saturated at the frontier, which is why the ecosystem is migrating to MMLU Pro and a portfolio of alternatives.
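Scoring is simple accuracy over four-choice questions. A minimal sketch; the question format, field names, and toy data here are illustrative assumptions, not the official evaluation harness:

```python
# Illustrative MMLU-style scoring: percent correct over four-choice questions.
# The dict layout ("choices", "answer") is an assumption for this sketch.

def mmlu_score(questions, predict):
    """Return the percentage of questions answered correctly.

    questions: list of dicts with "choices" (4 strings) and "answer" (index 0-3).
    predict:   callable mapping a question dict to a chosen index 0-3.
    """
    correct = sum(1 for q in questions if predict(q) == q["answer"])
    return 100.0 * correct / len(questions)

# Toy run: a predictor that always picks choice 0 gets 1 of 2 right here.
toy = [
    {"choices": ["A", "B", "C", "D"], "answer": 0},
    {"choices": ["A", "B", "C", "D"], "answer": 2},
]
print(mmlu_score(toy, lambda q: 0))  # 50.0
```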

Top 15 on MMLU and MMLU Pro

   #   Model               Org         MMLU   MMLU Pro   Arena Elo
   1   Claude Opus 4.7     Anthropic   91.2   84.6       1495
       Coding Arena #1; strongest aggregate MMLU + MMLU Pro pairing.
   2   Gemini 3.1 Pro      Google      90.8   83.9       1500
       Text Arena leader. MMLU score saturated; MMLU Pro is the better signal here.
   3   GPT-5.5 Pro         OpenAI      90.4   84.1       1488
       $30/$180 per 1M tokens. Strong reasoning; MMLU value is poor.
   4   GPT-5.5             OpenAI      89.7   82.8       1462
       Default tier. Drop from MMLU to MMLU Pro is moderate.
   5   DeepSeek V4 Pro     DeepSeek    89.4   82.3       1462
       Apache 2.0. Best MMLU per dollar by a wide margin.
   6   Claude Sonnet 4     Anthropic   88.9   80.4       1402
       Default workhorse. Strong real-world performance relative to its MMLU score.
   7   Qwen 3.6 Plus       Alibaba     88.5   79.8       1423
       Open weights; strong multilingual MMLU subjects.
   8   GPT-4.1             OpenAI      87.8   78.2       1395
       Legacy frontier. MMLU Pro gap shows the saturation effect.
   9   Gemini 2.5 Pro      Google      87.4   77.5       1388
       Strong long-context; MMLU Pro lags slightly.
  10   Llama 4 Maverick    Meta        86.9   76.4       1352
       Open weights; widely used for fine-tuning.
  11   Mistral Large 3     Mistral     86.4   75.8       1341
       EU-hosted option; competitive MMLU.
  12   Gemma 4             Google      85.8   74.6       1268
       Apache 2.0; first sub-10B model above 85 MMLU.
  13   Phi-4               Microsoft   85.1   73.2       1245
       Small model with strong MMLU; Arena rank is much lower.
  14   NVIDIA Nemotron 3   NVIDIA      84.9   73.8       1289
       30B multimodal open weights.
  15   DeepSeek V4 Flash   DeepSeek    84.2   72.4       1142
       $0.14/$0.28 per 1M tokens; extreme MMLU value.

MMLU vs MMLU Pro divergence

MMLU is saturating; MMLU Pro is not. The gap between a model's MMLU and MMLU Pro scores measures how much harder the follow-up benchmark is for that specific model. A larger gap means more headroom remains on MMLU Pro; a smaller gap means the model is approaching ceiling on both benchmarks.

Top 5 MMLU minus MMLU Pro gap (May 2026)

  Claude Opus 4.7    91.2 - 84.6 = 6.6 pts gap   ████████
  Gemini 3.1 Pro     90.8 - 83.9 = 6.9 pts gap   █████████
  GPT-5.5 Pro        90.4 - 84.1 = 6.3 pts gap   ████████
  GPT-5.5            89.7 - 82.8 = 6.9 pts gap   █████████
  DeepSeek V4 Pro    89.4 - 82.3 = 7.1 pts gap   █████████

Mid-pack divergence (median legacy frontier)
  GPT-4.1            87.8 - 78.2 = 9.6 pts gap   ████████████
  Llama 4 Maverick   86.9 - 76.4 = 10.5 pts gap  █████████████
  Phi-4              85.1 - 73.2 = 11.9 pts gap  ███████████████

Smaller gap = closer to ceiling on MMLU Pro too.
Top of leaderboard is now near-uniform on both benchmarks.
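The gap column is plain subtraction over the leaderboard scores. A quick sketch recomputing the top-five gaps, sorted smallest to largest:

```python
# Recompute MMLU minus MMLU Pro gaps for the top five models listed above.
scores = {
    "Claude Opus 4.7": (91.2, 84.6),
    "Gemini 3.1 Pro":  (90.8, 83.9),
    "GPT-5.5 Pro":     (90.4, 84.1),
    "GPT-5.5":         (89.7, 82.8),
    "DeepSeek V4 Pro": (89.4, 82.3),
}
# Smaller gap = closer to ceiling on MMLU Pro as well.
for model, (mmlu, pro) in sorted(scores.items(), key=lambda kv: kv[1][0] - kv[1][1]):
    print(f"{model:<16} gap {mmlu - pro:.1f} pts")
```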

Subject-level breakdown

  Subject               Category     Top score   Median   Spread
  College Mathematics   STEM         88.4        71.2     17.2
      Where reasoning models pull ahead of base models.
  Professional Law      Humanities   91.8        79.4     12.4
      Highly saturated; small spread between models.
  Clinical Knowledge    Medicine     95.2        86.1     9.1
      Most-saturated subject. Top of the leaderboard hits ceiling.
  Formal Logic          STEM         86.5        65.3     21.2
      Largest score spread; benchmark still discriminates.
  Moral Disputes        Humanities   89.6        78.4     11.2
      Subject to RLHF tuning; high variance under preference shifts.
  Machine Learning      STEM         92.3        82.7     9.6
      Heavily represented in training data; ceiling is close.

  Spread = top score minus median across the top 15 models.

The MMLU Saturation Index

Our framework: the MMLU Saturation Index is the percentage of top-15 models scoring at or above 85 on MMLU. When this number crosses 50%, MMLU has stopped discriminating at the frontier. When it crosses 80%, MMLU is functionally retired for procurement. Scores at the ceiling become noise.

MMLU Saturation Index (May 2026)

  Models above 85 MMLU      13 of 15
  Models above 90 MMLU      3 of 15
  Saturation Index          87%

  ABOVE 80%: functionally retired for procurement

Trajectory: the index already sits above the 80% retirement threshold; at the
current rate of new model releases scoring 85+, it approaches 100% within
approximately 6-9 months.

Why this matters: a benchmark's usefulness drops sharply once a critical mass of models clusters near ceiling. The statistical signal-to-noise ratio collapses. Procurement teams still citing "model X scores 91 on MMLU" in 2026 are quoting noise.
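The index itself is a one-line calculation over the leaderboard. A sketch, using the top-15 MMLU scores listed above (the `saturation_index` helper is illustrative, not a published tool):

```python
# MMLU Saturation Index: share of top-15 models at or above a threshold.

def saturation_index(scores, threshold=85.0):
    """Percent of models scoring at or above `threshold`, as a whole number."""
    hits = sum(s >= threshold for s in scores)
    return round(100 * hits / len(scores))

# Top-15 MMLU scores from the May 2026 leaderboard above.
top15 = [91.2, 90.8, 90.4, 89.7, 89.4, 88.9, 88.5, 87.8,
         87.4, 86.9, 86.4, 85.8, 85.1, 84.9, 84.2]

print(saturation_index(top15))        # 87 -> above the 80% retirement line
print(saturation_index(top15, 90.0))  # 20 -> the 3-of-15 above 90
```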

What to do this quarter

  1. Stop using MMLU as a primary procurement input. The top of the leaderboard is uniformly above 89. Score differences are within statistical noise.
  2. Add MMLU Pro to every model-comparison spec. It still discriminates at the frontier and will continue to for another 12-18 months.
  3. Pair MMLU with GPQA Diamond and HLE. Three benchmarks together survive a single benchmark's saturation curve. Composite scoring beats single-number rankings.
  4. Use MMLU as a floor, not a ceiling. A model below 80 MMLU is suspect for general use; a model above 88 MMLU has cleared the floor and needs a different signal to justify selection.
  5. Read subject-level scores when subjects matter. If you are deploying for clinical or legal work, the relevant MMLU subject score is more useful than the aggregate.
  6. Audit your eval harness for MMLU contamination. MMLU is heavily represented in training data. Treat repeated MMLU runs as a sanity check, not as performance evidence.
  7. Track the Saturation Index quarterly. When a benchmark crosses 50%, plan its retirement. When it crosses 80%, retire it. The same logic will apply to MMLU Pro within two years.
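The composite scoring in item 3 can be sketched as a weighted average across benchmarks. The weights and the example scores below are illustrative assumptions, not recommendations or real model results:

```python
# Illustrative composite score across three benchmarks (item 3 above).
# Weights and the hypothetical model's scores are assumptions for this sketch.

def composite(scores, weights):
    """Weighted average of benchmark scores, all on a 0-100 scale."""
    total = sum(weights.values())
    return sum(scores[b] * w for b, w in weights.items()) / total

weights = {"mmlu_pro": 0.4, "gpqa_diamond": 0.4, "hle": 0.2}
model_a = {"mmlu_pro": 84.6, "gpqa_diamond": 70.0, "hle": 25.0}  # hypothetical

print(round(composite(model_a, weights), 1))  # 66.8
```

A composite like this degrades gracefully: when one benchmark saturates, the other two still discriminate, which single-number MMLU rankings cannot do.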

Related reading

Internal eval harnesses that run MMLU, MMLU Pro, GPQA, and HLE against multiple providers in parallel typically front the providers through Swfte Connect for a single endpoint and uniform telemetry across vendors.

Scores compiled from official model technical reports, the Papers with Code MMLU leaderboard, Artificial Analysis snapshots, and published Arena Elo data, as of May 6, 2026.