Updated May 8, 2026

LM Leaderboard — May 2026

Large language models ranked by LMSYS Arena Elo, MMLU, HumanEval, MATH, pricing, and inference speed. Refreshed monthly with live data from official provider pricing pages, Artificial Analysis, and the Arena.

What is the top LM on the Arena right now?

LMArena (formerly LMSYS Chatbot Arena) tracks pairwise human votes across hundreds of thousands of conversations. Our May 2026 snapshot below ranks 32 language models on Arena Elo plus the standard MMLU / HumanEval / MATH benchmark suite. The Arena re-ranks roughly weekly as votes accumulate; what you see is the most recent snapshot verified against the public Arena and Artificial Analysis.
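For intuition on how pairwise votes become a ranking, here is the classic per-vote Elo update that arena-style leaderboards approximate. LMArena itself now fits Bradley-Terry coefficients over the full vote history, so treat this sketch as illustrative rather than the Arena's exact computation.

```python
# Minimal sketch of the Elo update behind pairwise arena rankings.
# The real Arena fits Bradley-Terry scores over all votes at once;
# this online update is the classic approximation, shown for intuition.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one human vote (A vs. B)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

# Example: a 1350-rated model upsets a 1500-rated one.
print(elo_update(1350, 1500, a_won=True))  # winner gains ~22 points
```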

32 models
| # | Model | Quality | Arena Elo | Speed | Price | Context | Value | Released |
|---|-------|---------|-----------|-------|-------|---------|-------|----------|
| 1 | OpenAI · Hard reasoning | 96 | 1370 | 68 t/s | $10 / $40 | 200K | 3.8 | Apr 2025 |
| 2 | Anthropic · Complex analysis | 95 | 1360 | 52 t/s | $15 / $75 | 200K | 2.1 | May 2025 |
| 3 | OpenAI · Reasoning at any cost | 95 | 1502 | 92 t/s | $30 / $180 | 1M | 0.9 | Apr 2026 |
| 4 | Anthropic · Coding & agentic workflows | 93 | 1497 | 78 t/s | $5 / $25 | 1M | 6.2 | Apr 2026 |
| 5 | Google · Multimodal + value | 92 | 1345 | 87 t/s | $1.25 / $10 | 1M | 16.4 | Mar 2025 |
| 6 | OpenAI · Frontier general purpose | 92 | 1481 | 138 t/s | $5 / $30 | 1M | 5.3 | Apr 2026 |
| 7 | DeepSeek · Cheap reasoning | 91 | 1350 | 35 t/s | $0.55 / $2.19 | 128K | 66.4 | Jan 2025 |
| 8 | Google · Science & long-context | 91 | 1500 | 165 t/s | $3.5 / $10.5 | 2M | 13.0 | Apr 2026 |
| 9 | OpenAI · Long context | 89 | 1310 | 120 t/s | $2 / $8 | 1M | 17.8 | Apr 2025 |
| 10 | OpenAI · Reasoning & math | 88 | 1305 | 155 t/s | $1.1 / $4.4 | 200K | 32.0 | Jan 2025 |
| 11 | Anthropic · Coding & balance | 88 | 1320 | 95 t/s | $3 / $15 | 200K | 9.8 | May 2025 |
| 12 | DeepSeek · Open-source value leader | 88 | 1462 | 112 t/s | $1.74 / $3.48 | 1M | 33.7 | Apr 2026 |
| 13 | xAI · Real-time info | 87 | 1330 | 82 t/s | $3 / $15 | 131K | 9.7 | Feb 2025 |
| 14 | DeepSeek · Best open-source value | 86 | 1310 | 62 t/s | $0.27 / $1.1 | 128K | 125.5 | Mar 2025 |
| 15 | OpenAI · General purpose | 85 | 1285 | 109 t/s | $2.5 / $10 | 128K | 13.6 | May 2024 |
| 16 | Alibaba Cloud · Multilingual & APAC | 84 | 1423 | 124 t/s | $1.4 / $5.6 | 256K | 24.0 | Apr 2026 |
| 17 | Meta · Open-source value | 80 | 1260 | 135 t/s | $0.2 / $0.6 | 1M | 200.0 | Apr 2025 |
| 18 | Alibaba Cloud · Open-source flagship | 80 | 1255 | 85 t/s | $0.3 / $0.9 | 131K | 133.3 | Sep 2024 |
| 19 | Mistral AI · Multilingual | 79 | 1250 | 78 t/s | $2 / $6 | 128K | 19.8 | Nov 2024 |
| 20 | xAI · Budget reasoning | 78 | 1275 | 165 t/s | $0.3 / $0.5 | 131K | 195.0 | Feb 2025 |
| 21 | Perplexity · Search + citations | 78 | n/a | 65 t/s | $3 / $15 | 200K | 8.7 | Feb 2025 |
| 22 | DeepSeek · Cheap-and-fast cascade tier | 78 | 1392 | 218 t/s | $0.14 / $0.28 | 1M | 371.4 | Apr 2026 |
| 23 | Mistral AI · Code generation | 76 | n/a | 195 t/s | $0.3 / $0.9 | 256K | 126.7 | Jan 2025 |
| 24 | Mistral AI · Open multimodal | 76 | 1361 | 158 t/s | Self-host | 256K | n/a | Apr 2026 |
| 25 | Anthropic · Speed & cost | 75 | 1230 | 172 t/s | $0.8 / $4 | 200K | 31.3 | Oct 2024 |
| 26 | Google · Self-hosted general purpose | 75 | 1351 | 142 t/s | Self-host | 128K | n/a | Apr 2026 |
| 27 | Google · Fastest + cheapest | 74 | 1240 | 244 t/s | $0.1 / $0.4 | 1M | 296.0 | Feb 2025 |
| 28 | Alibaba Cloud · Open-source coding | 74 | n/a | 125 t/s | $0.15 / $0.45 | 131K | 246.7 | Nov 2024 |
| 29 | OpenAI · High throughput | 72 | 1216 | 183 t/s | $0.15 / $0.6 | 128K | 192.0 | Jul 2024 |
| 30 | Meta · Longest context | 71 | 1195 | 198 t/s | $0.15 / $0.4 | 10M | 258.2 | Apr 2025 |
| 31 | Amazon · AWS ecosystem | 70 | n/a | 110 t/s | $0.8 / $3.2 | 300K | 35.0 | Dec 2024 |
| 32 | Cohere · Enterprise RAG | 68 | 1170 | 72 t/s | $2.5 / $10 | 128K | 10.9 | Aug 2024 |
Quality = composite benchmark score (MMLU, HumanEval, MATH) · Arena Elo = LMSYS Chatbot Arena rating · Value = quality per dollar (quality ÷ average of input and output price) · Price = input / output per 1M tokens
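The Value column can be reproduced from the table itself. A minimal sketch, assuming value = quality divided by the simple average of input and output price per 1M tokens, which matches the published figures (e.g. 96 ÷ ((10 + 40) / 2) ≈ 3.8):

```python
# Sketch: recompute the "Value" column from Quality and Price.
# Assumption (inferred from the table, not an official formula):
# value = quality / mean(input_price, output_price) per 1M tokens.

rows = [
    # (provider · niche, quality, input $, output $)
    ("OpenAI · Hard reasoning", 96, 10.00, 40.00),
    ("DeepSeek · Cheap-and-fast cascade tier", 78, 0.14, 0.28),
    ("Meta · Open-source value", 80, 0.20, 0.60),
]

for name, quality, p_in, p_out in rows:
    value = quality / ((p_in + p_out) / 2)
    print(f"{name}: value = {value:.1f}")
# -> 3.8, 371.4, 200.0, matching the leaderboard column.
```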

How the LLM leaderboard works

We pull official provider pricing every 24 hours, Artificial Analysis benchmark snapshots weekly, and LMSYS Arena Elo as new ratings are published. The composite quality index is a 0-100 normalization over MMLU Pro, HumanEval, and MATH, weighted by recency and cross-validated against Arena Elo. We do not accept vendor-supplied numbers without an independent reference.
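As a rough illustration of that normalization step, here is a sketch using min-max scaling and equal benchmark weights. The leaderboard's actual recency weighting and cross-validation are not reproduced here, and every score below is made up.

```python
# Sketch of a 0-100 composite quality index over three benchmarks.
# Equal weights and min-max scaling are illustrative assumptions, not
# the leaderboard's exact formula.

BENCHMARKS = ("mmlu_pro", "humaneval", "math")

# Hypothetical raw scores (fraction correct) for a few models.
raw = {
    "model_a": {"mmlu_pro": 0.84, "humaneval": 0.92, "math": 0.88},
    "model_b": {"mmlu_pro": 0.71, "humaneval": 0.85, "math": 0.62},
    "model_c": {"mmlu_pro": 0.78, "humaneval": 0.74, "math": 0.70},
}

def composite(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Min-max normalize each benchmark across models, then average to 0-100."""
    out = {m: 0.0 for m in scores}
    for b in BENCHMARKS:
        col = {m: s[b] for m, s in scores.items()}
        lo, hi = min(col.values()), max(col.values())
        for m in scores:
            out[m] += 100 * (col[m] - lo) / (hi - lo) / len(BENCHMARKS)
    return out

print(composite(raw))  # model_a tops all three benchmarks, so it scores 100.0
```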

Where the leaderboard is wrong

No leaderboard predicts your production accuracy. The LMSYS Arena rewards style and short-conversation polish; a top-Arena model can still underperform on your specific function-calling schema or long-context retrieval workload. Build an internal eval harness before you commit, as sketched below. See our LMArena Elo explained and LLM routing writeups for the deep dives.
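A minimal starting point for such a harness, assuming a generic `complete(prompt) -> str` client; the cases and the `fake_model` stub below are hypothetical placeholders for your own stack and task distribution:

```python
# Minimal internal-eval harness sketch. `complete` is a placeholder for
# whatever client your stack uses (OpenAI, Anthropic, a local server, ...).

from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # task-specific pass/fail, not a generic metric

def run_eval(complete: Callable[[str], str], cases: list[Case]) -> float:
    """Return the pass rate of a model over your own task distribution."""
    passed = sum(1 for c in cases if c.check(complete(c.prompt)))
    return passed / len(cases)

# Example cases: exercise the exact behaviors you ship, not benchmark trivia.
cases = [
    Case("Return the JSON {\"ok\": true} and nothing else.",
         check=lambda out: out.strip() == '{"ok": true}'),
    Case("What is 17 * 23? Answer with the number only.",
         check=lambda out: out.strip() == "391"),
]

def fake_model(prompt: str) -> str:  # stand-in so the sketch runs end to end
    return '{"ok": true}' if "JSON" in prompt else "391"

print(f"pass rate: {run_eval(fake_model, cases):.0%}")  # -> 100%
```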

Related rankings