Updated May 8, 2026

LMArena.ai — Top Models May 2026

LMArena.ai is the rebranded LMSys Chatbot Arena. Same blind pairwise-voting methodology, same Elo math, new home. Here is who leads each board this month and what the rebrand actually changed for buyers.

The LMSys to LMArena.ai story

The Chatbot Arena began in 2023 as a research project under LMSys, an academic group out of UC Berkeley. It quickly became the most-cited LLM benchmark because it measured something the capability-only benchmarks could not: actual human preference under blind side-by-side comparison. By 2024 the project had processed millions of votes, become a procurement input for Fortune 500 buyers, and outgrown its original academic scaffold. The 2024-25 transition to the lmarena.ai domain consolidated the project as an independent organisation while keeping the same Elo methodology and open vote pool.

For users the rebrand changed almost nothing: same prompts, same blind voting, same Elo math. For procurement teams the rebrand codified Arena Elo as a vendor-neutral signal independent of any single university. That is what made it sticky as a reference point in enterprise contracts.
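For reference, the Elo math in question is the standard logistic rating update applied per blind vote. A minimal sketch in Python; the K-factor of 32 is the classic chess default, not necessarily LMArena.ai's setting, and the live site fits ratings over its whole vote pool rather than updating one vote at a time:

  # Per-vote Elo update (logistic expected score, symmetric K-factor).
  # K=32 is the chess default, not LMArena.ai's documented setting.
  def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
      expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
      score_a = 1.0 if a_won else 0.0
      delta = k * (score_a - expected_a)
      return r_a + delta, r_b - delta

  # A 1500-rated model beating a 1495-rated one gains ~15.8 points.
  print(elo_update(1500.0, 1495.0, a_won=True))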

The four-way race at the top

The May 2026 snapshot below shows three models at or within 15 points of the historical 1500 Elo barrier on text. The top of LMArena.ai is now genuinely contested.

Top of LMArena.ai text leaderboard (May 2026)

  Gemini 3.1 Pro Preview   1500   ████████████████████   text leader
  Claude Opus 4.7 Thinking 1495   ███████████████████    coding #1
  GPT-5.5 Pro              1488   ██████████████████     reasoning
  DeepSeek V4 Pro          1462   █████████████████      Apache 2.0
  Qwen 3.6 Plus            1423   ███████████████        open weights
  Claude Sonnet 4          1402   ██████████████         workhorse tier
  GPT-4.1                  1395   █████████████          legacy frontier
  Gemini 2.5 Pro           1388   █████████████          legacy frontier
  Llama 4 Maverick         1352   ███████████            open weights
  Mistral Large 3          1341   ███████████            open weights

Full Leaderboard

32 models, sorted by Quality score

  #   Model                                     Quality  Arena Elo  Speed    Price           Context  Value  Released
  1   OpenAI · Hard reasoning                   96       1370       68 t/s   $10 / $40       200K     3.8    Apr 2025
  2   Anthropic · Complex analysis              95       1360       52 t/s   $15 / $75       200K     2.1    May 2025
  3   OpenAI · Reasoning at any cost            95       1502       92 t/s   $30 / $180      1M       0.9    Apr 2026
  4   Anthropic · Coding & agentic workflows    93       1497       78 t/s   $5 / $25        1M       6.2    Apr 2026
  5   Google · Multimodal + value               92       1345       87 t/s   $1.25 / $10     1M       16.4   Mar 2025
  6   OpenAI · Frontier general purpose         92       1481       138 t/s  $5 / $30        1M       5.3    Apr 2026
  7   DeepSeek · Cheap reasoning                91       1350       35 t/s   $0.55 / $2.19   128K     66.4   Jan 2025
  8   Google · Science & long-context           91       1500       165 t/s  $3.5 / $10.5    2M       13.0   Apr 2026
  9   OpenAI · Long context                     89       1310       120 t/s  $2 / $8         1M       17.8   Apr 2025
  10  OpenAI · Reasoning & math                 88       1305       155 t/s  $1.1 / $4.4     200K     32.0   Jan 2025
  11  Anthropic · Coding & balance              88       1320       95 t/s   $3 / $15        200K     9.8    May 2025
  12  DeepSeek · Open-source value leader       88       1462       112 t/s  $1.74 / $3.48   1M       33.7   Apr 2026
  13  xAI · Real-time info                      87       1330       82 t/s   $3 / $15        131K     9.7    Feb 2025
  14  DeepSeek · Best open-source value         86       1310       62 t/s   $0.27 / $1.1    128K     125.5  Mar 2025
  15  OpenAI · General purpose                  85       1285       109 t/s  $2.5 / $10      128K     13.6   May 2024
  16  Alibaba Cloud · Multilingual & APAC       84       1423       124 t/s  $1.4 / $5.6     256K     24.0   Apr 2026
  17  Meta · Open-source value                  80       1260       135 t/s  $0.2 / $0.6     1M       200.0  Apr 2025
  18  Alibaba Cloud · Open-source flagship      80       1255       85 t/s   $0.3 / $0.9     131K     133.3  Sep 2024
  19  Mistral AI · Multilingual                 79       1250       78 t/s   $2 / $6         128K     19.8   Nov 2024
  20  xAI · Budget reasoning                    78       1275       165 t/s  $0.3 / $0.5     131K     195.0  Feb 2025
  21  Perplexity · Search + citations           78       n/a        65 t/s   $3 / $15        200K     8.7    Feb 2025
  22  DeepSeek · Cheap-and-fast cascade tier    78       1392       218 t/s  $0.14 / $0.28   1M       371.4  Apr 2026
  23  Mistral AI · Code generation              76       n/a        195 t/s  $0.3 / $0.9     256K     126.7  Jan 2025
  24  Mistral AI · Open multimodal              76       1361       158 t/s  Self-host       256K     n/a    Apr 2026
  25  Anthropic · Speed & cost                  75       1230       172 t/s  $0.8 / $4       200K     31.3   Oct 2024
  26  Google · Self-hosted general purpose      75       1351       142 t/s  Self-host       128K     n/a    Apr 2026
  27  Google · Fastest + cheapest               74       1240       244 t/s  $0.1 / $0.4     1M       296.0  Feb 2025
  28  Alibaba Cloud · Open-source coding        74       n/a        125 t/s  $0.15 / $0.45   131K     246.7  Nov 2024
  29  OpenAI · High throughput                  72       1216       183 t/s  $0.15 / $0.6    128K     192.0  Jul 2024
  30  Meta · Longest context                    71       1195       198 t/s  $0.15 / $0.4    10M      258.2  Apr 2025
  31  Amazon · AWS ecosystem                    70       n/a        110 t/s  $0.8 / $3.2     300K     35.0   Dec 2024
  32  Cohere · Enterprise RAG                   68       1170       72 t/s   $2.5 / $10      128K     10.9   Aug 2024

  Quality = composite benchmark (MMLU, HumanEval, MATH)
  Arena Elo = LMSys Chatbot Arena rating
  Value = quality per dollar
  Price = input / output per 1M tokens
  n/a = not listed
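One note on reading the Value column: the published numbers are consistent with quality divided by a blended price, taking the simple average of input and output $ per 1M tokens. That formula is inferred from the rows above rather than documented, but it reproduces the table exactly:

  # Value = Quality / blended price, where blended price is the simple
  # average of input and output $ per 1M tokens. Inferred from the
  # table's own rows, not an official LMArena.ai definition.
  def value_score(quality: float, price_in: float, price_out: float) -> float:
      return quality / ((price_in + price_out) / 2.0)

  print(round(value_score(78, 0.14, 0.28), 1))  # 371.4 -- row 22
  print(round(value_score(70, 0.80, 3.20), 1))  # 35.0  -- row 31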

What to do this quarter

  1. Update bookmarks and citations. Internal eval-spec docs and procurement RFPs that reference "lmsys.org" should be updated to lmarena.ai. The data continues at the new domain.
  2. Pull from the right board. Coding teams should cite the coding Arena Elo (Claude Opus 4.7 leads at 1567). Generic chat teams should cite the text leaderboard (Gemini 3.1 Pro Preview leads at ~1500).
  3. Build dual-vendor capability. The top four models are within 40 Elo of each other. Treat them as interchangeable on capability and optimise for switching cost.
  4. Pair Arena scores with workload-specific evals. Arena rewards short-conversation polish. Long-context, tool-use, and domain-specific tasks need their own measurement.
  5. Track the open-weight gap. DeepSeek V4 Pro under Apache 2.0 sits at 1462 Elo, within 38 points of the text leader. The gap is the smallest it has ever been.
  6. Watch GPT-5.5 Pro pricing. At $30/$180 per 1M tokens, paying for the top of LMArena.ai now costs roughly 200x more per input token than DeepSeek's cheap-and-fast cascade tier ($0.14/$0.28). The cost curve is steepening.
  7. Re-baseline at every model launch. Tokenizer changes (Claude Opus 4.7 ships ~35% more tokens per input than 4.6) shift effective cost without shifting list price; see the sketch after this list.
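A minimal sketch of that re-baselining. List prices come from the table above; the price-to-name mapping follows the text chart and is partly an assumption, and every token-inflation factor except the ~1.35 for Opus 4.7 (point 7) is a placeholder you would measure on your own prompt corpus:

  # Effective input cost = list price x tokens actually billed.
  # Inflation factors must be measured on your own corpus; only the
  # 1.35 for Opus 4.7 comes from the text, the rest are placeholders.
  PRICES = {  # $ per 1M input tokens, from the leaderboard above
      "gpt-5.5-pro": 30.00,
      "claude-opus-4.7": 5.00,
      "gemini-3.1-pro-preview": 3.50,
      "deepseek-v4-pro": 1.74,
  }
  TOKEN_INFLATION = {  # tokens billed per reference token
      "gpt-5.5-pro": 1.00,
      "claude-opus-4.7": 1.35,
      "gemini-3.1-pro-preview": 1.00,
      "deepseek-v4-pro": 1.00,
  }

  def effective_input_cost(model: str, reference_tokens: int) -> float:
      """Dollars to send reference_tokens of prompt to model."""
      billed = reference_tokens * TOKEN_INFLATION[model]
      return billed * PRICES[model] / 1_000_000

  for m in PRICES:
      print(f"{m:>22}: ${effective_input_cost(m, 10_000):.4f} per 10K-token prompt")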

Related reading

Teams running side-by-side evals against multiple LMArena.ai leaders typically expose them through Swfte Connect as a single endpoint, then run their own internal Elo on production prompts. That is the only way to verify whether public Arena rank translates to your workload.
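A minimal sketch of that internal Elo loop, assuming an OpenAI-compatible chat endpoint behind whatever gateway you use. The URL, model IDs, and prompts are placeholders, and judge_prefers() is a stub for your blind human vote or LLM judge:

  # Internal Elo over production prompts. BASE_URL, MODELS, and PROMPTS
  # are placeholders; judge_prefers() stands in for blind human votes
  # or an LLM judge of your choosing.
  import random
  import requests

  BASE_URL = "https://gateway.example.com/v1/chat/completions"  # placeholder
  MODELS = ["model-a", "model-b", "model-c"]                    # placeholder IDs
  PROMPTS = ["..."]  # replace with a sample of real production prompts

  def complete(model: str, prompt: str) -> str:
      resp = requests.post(BASE_URL, json={
          "model": model,
          "messages": [{"role": "user", "content": prompt}],
      }, timeout=60)
      resp.raise_for_status()
      return resp.json()["choices"][0]["message"]["content"]

  def judge_prefers(prompt: str, answer_a: str, answer_b: str) -> bool:
      """True if answer_a wins; plug in your blind judge here."""
      raise NotImplementedError

  ratings = {m: 1000.0 for m in MODELS}
  for prompt in PROMPTS:
      m_a, m_b = random.sample(MODELS, 2)  # blind random pairing
      won = judge_prefers(prompt, complete(m_a, prompt), complete(m_b, prompt))
      expected = 1 / (1 + 10 ** ((ratings[m_b] - ratings[m_a]) / 400))
      delta = 32 * ((1.0 if won else 0.0) - expected)
      ratings[m_a] += delta
      ratings[m_b] -= delta

  print(sorted(ratings.items(), key=lambda kv: -kv[1]))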