Updated May 15, 2026 · 8 min read

LMSys Chatbot Arena (July 2026)

Short version. LMSys (rebranded to LMArena) is the crowdsourced human-preference leaderboard that became the reference for AI model quality. Claude Opus 4.8 is the new #1, leading both coding (~1582 Elo) and the overall arena (~1580). GPT-5.5 Pro and Gemini 3.1 Pro round out the top tier; Gemini owns the science and long-context categories.

Top 10 LMSys rankings, July 2026

Live snapshot of the LMSys Chatbot Arena top 10. Elo ratings refresh weekly; major releases produce faster updates once enough votes accumulate. Coding Elo is the sub-leaderboard restricted to code prompts.

| # | Model | Maker | Overall Elo | Coding Elo | Notes |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | 1580 | 1582 | New #1 overall & coding. AAII 61.4. SWE-bench Pro 69.2%, computer-use leader. |
| 2 | Claude Opus 4.7 | Anthropic | 1567 | 1567 | Prior coding #1. Still elite for agentic workflows. |
| 3 | GPT-5.5 Pro | OpenAI | 1551 | 1531 | Reasoning #1 (AAII 59). Voice + multimodal flagship. |
| 4 | Gemini 3.1 Pro | Google DeepMind | 1538 | 1505 | Science #1 (GPQA Diamond 94.3%). 2M context. |
| 5 | GPT-5.5 | OpenAI | 1523 | 1483 | Mainline GPT-5.5. Workhorse for general chat. |
| 6 | Claude Sonnet 4 | Anthropic | 1518 | 1520 | Production workhorse. 1M context. |
| 7 | Gemini 3.0 | Google DeepMind | 1505 | 1462 | Cost leader at mainline tier. |
| 8 | DeepSeek V4 Pro | DeepSeek | 1462 | 1454 | Open weights. ~1/8 the price of GPT-5.5 Pro. |
| 9 | Qwen 3.7 Max | Alibaba | 1455 | 1450 | Highest-ranked Chinese model (AAII 56.6). 35h autonomous agentic runs. |
| 10 | Grok 4 | xAI | 1441 | 1440 | Real-time X data access. |

Source: LMArena public leaderboard. Live snapshot as of 2026-05-15. Verify at lmarena.ai before making contracting decisions.

Full Arena Elo leaderboard — 41 models

Every model in our directory with a published Arena Elo, sorted by Elo descending.

| # | Model (maker · focus) | Quality | Arena ELO | Speed | Price | Context | Value | Released |
|---|---|---|---|---|---|---|---|---|
| 1 | Anthropic · Frontier agentic coding & knowledge work | 100 | 1525 | 58 t/s | $10 / $50 | 1M | 3.3 | Jun 2026 |
| 2 | Anthropic · Coding, agents & computer use | 99 | 1512 | 72 t/s | $5 / $25 | 1M | 6.6 | May 2026 |
| 3 | OpenAI · Reasoning at any cost | 98 | 1510 | 68 t/s | $30 / $180 | 1M | 0.9 | Apr 2026 |
| 4 | OpenAI · Frontier general purpose | 97 | 1506 | 70 t/s | $5 / $30 | 1M | 5.5 | Apr 2026 |
| 5 | Anthropic · Coding & agentic workflows | 96 | 1505 | 68 t/s | $5 / $25 | 1M | 6.4 | Apr 2026 |
| 6 | Google · Speed & cost | 96 | 1505 | | $2 / $12 | 1M | 13.7 | Feb 2026 |
| 7 | Google · Science & long-context | 96 | 1505 | 131 t/s | $2 / $12 | 1M | 13.7 | Apr 2026 |
| 8 | xAI · Agentic tasks & real-time info | 93 | 1496 | 83 t/s | $1.25 / $2.5 | 1M | 49.6 | May 2026 |
| 9 | xAI · General purpose | 93 | 1496 | | $1.25 / $2.5 | 2M | 49.6 | Mar 2026 |
| 10 | OpenAI · General purpose | 93 | 1495 | | $2.5 / $15 | 1M | 10.6 | Mar 2026 |
| 11 | Anthropic · General purpose | 95 | 1490 | | $5 / $25 | 1M | 6.3 | Feb 2026 |
| 12 | Alibaba Cloud · Long autonomous agentic runs | 94 | 1488 | 90 t/s | $2.5 / $7.5 | 1M | 18.8 | May 2026 |
| 13 | DeepSeek · Open-source value leader | 90 | 1467 | 33 t/s | $1.74 / $3.48 | 1M | 34.5 | Apr 2026 |
| 14 | Anthropic · Coding & balance | 90 | 1467 | 73 t/s | $3 / $15 | 1M | 10.0 | Feb 2026 |
| 15 | · Open-weight agentic & tool use | 88 | 1467 | 48 t/s | $0.98 / $3.08 | 200K | 43.3 | Apr 2026 |
| 16 | Moonshot AI · Frontier quality at low cost | 92 | 1466 | 48 t/s | $0.73 / $3.49 | 256K | 43.6 | Apr 2026 |
| 17 | OpenAI · General purpose | 90 | 1455 | | $1.25 / $10 | 400K | 16.0 | Aug 2025 |
| 18 | · Open-weight agentic coding | 89 | 1455 | 80 t/s | $0.6 / $2.4 | 1M | 59.3 | Jun 2026 |
| 19 | DeepSeek · Open-source | 87 | 1455 | | $0.252 / $0.378 | 164K | 276.2 | Dec 2025 |
| 20 | Moonshot AI · Speed & cost | 89 | 1452 | | $0.4 / $1.9 | 262K | 77.4 | Jan 2026 |
| 21 | · Open-source | 88 | 1450 | | $0.6 / $1.92 | 80K | 69.8 | Feb 2026 |
| 22 | Alibaba Cloud · Multilingual & APAC | 86 | 1448 | 124 t/s | $1.4 / $5.6 | 256K | 24.6 | Apr 2026 |
| 23 | DeepSeek · Cheap-and-fast cascade tier | 80 | 1410 | 105 t/s | $0.1 / $0.2 | 1M | 533.3 | Apr 2026 |
| 24 | OpenAI · Hard reasoning | 94 | 1370 | 68 t/s | $10 / $40 | 200K | 3.8 | Apr 2025 |
| 25 | Anthropic · Complex analysis | 91 | 1360 | 52 t/s | $15 / $75 | 200K | 2.0 | May 2025 |
| 26 | Google · Multimodal + value | 92 | 1345 | 87 t/s | $1.25 / $10 | 1M | 16.4 | Mar 2025 |
| 27 | xAI · Real-time info | 87 | 1330 | 82 t/s | $3 / $15 | 131K | 9.7 | Feb 2025 |
| 28 | Anthropic · Coding & balance | 88 | 1320 | 95 t/s | $3 / $15 | 200K | 9.8 | May 2025 |
| 29 | OpenAI · Long context | 89 | 1310 | 120 t/s | $2 / $8 | 1M | 17.8 | Apr 2025 |
| 30 | DeepSeek · Best open-source value | 86 | 1310 | 62 t/s | $0.27 / $1.1 | 128K | 125.5 | Mar 2025 |
| 31 | OpenAI · Reasoning & math | 88 | 1305 | 155 t/s | $1.1 / $4.4 | 200K | 32.0 | Jan 2025 |
| 32 | OpenAI · General purpose | 85 | 1285 | 109 t/s | $2.5 / $10 | 128K | 13.6 | May 2024 |
| 33 | xAI · Budget reasoning | 78 | 1275 | 165 t/s | $0.3 / $0.5 | 131K | 195.0 | Feb 2025 |
| 34 | Meta · Open-source value | 80 | 1260 | 135 t/s | $0.2 / $0.6 | 1M | 200.0 | Apr 2025 |
| 35 | Alibaba Cloud · Open-source flagship | 80 | 1255 | 85 t/s | $0.3 / $0.9 | 131K | 133.3 | Sep 2024 |
| 36 | Mistral AI · Multilingual | 79 | 1250 | 78 t/s | $2 / $6 | 128K | 19.8 | Nov 2024 |
| 37 | Google · Fastest + cheapest | 74 | 1240 | 244 t/s | $0.1 / $0.4 | 1M | 296.0 | Feb 2025 |
| 38 | Anthropic · Speed & cost | 75 | 1230 | 172 t/s | $0.8 / $4 | 200K | 31.3 | Oct 2024 |
| 39 | OpenAI · High throughput | 72 | 1216 | 183 t/s | $0.15 / $0.6 | 128K | 192.0 | Jul 2024 |
| 40 | Meta · Longest context | 71 | 1195 | 198 t/s | $0.15 / $0.4 | 10M | 258.2 | Apr 2025 |
| 41 | Cohere · Enterprise RAG | 68 | 1170 | 72 t/s | $2.5 / $10 | 128K | 10.9 | Aug 2024 |

Quality = composite benchmark (MMLU, HumanEval, MATH). Arena ELO = LMSYS Chatbot Arena rating. Value = quality per dollar. Price = input / output per 1M tokens.
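
The legend does not say how "quality per dollar" is computed. The published Value figures appear consistent with dividing Quality by the mean of the input and output prices. That formula is an inference from the table, not something the source states, and the function name below is hypothetical. A minimal sketch that checks the inference against rows 1, 2 and 23:

```python
def value_score(quality: float, input_price: float, output_price: float) -> float:
    """Assumed formula: quality divided by the mean of input and output price.

    Inferred from the published table, not documented by the source.
    Prices are USD per 1M tokens.
    """
    return quality / ((input_price + output_price) / 2)


# (quality, input price, output price, published Value) from rows 1, 2 and 23.
rows = [
    (100, 10.0, 50.0, 3.3),
    (99, 5.0, 25.0, 6.6),
    (80, 0.1, 0.2, 533.3),
]

for quality, inp, out, published in rows:
    computed = round(value_score(quality, inp, out), 1)
    print(f"computed={computed} published={published}")
```

If the inference holds, Value weights input and output tokens equally, which will understate the cost of output-heavy workloads.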

How LMSys Elo actually works

The mechanic is the chess Elo system, retargeted at language models. A user submits a prompt. The platform sends the prompt to two anonymised models in parallel and shows both responses side by side. The user votes on which is better, picks a tie, or marks both as bad. The vote updates both ratings: the winner gains Elo points in proportion to how unexpected the win was (more for beating a higher-rated model), and the loser drops by the same amount.
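
The update described above can be sketched with the textbook Elo rule. This is an illustration under assumptions, not the platform's actual rating code: the step size `k` is a made-up value, and treating a "both bad" vote as a tie is one possible choice among several.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 4.0):
    """Return the new (rating_a, rating_b) after a single vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k is an assumed step size, not a value published by the platform.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


favourite, underdog = 1580.0, 1480.0
print(update_elo(favourite, underdog, 1.0))  # expected result: small shift
print(update_elo(favourite, underdog, 0.0))  # upset: larger shift
```

The upset moves both ratings further than the expected result does, which is what keeps a run of lucky wins from inflating a rating permanently.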

Over millions of votes the system converges. A rating gap of 100 Elo means the higher model wins about 64% of head-to-head votes. A gap of 30 Elo means roughly 54%. Below 10 Elo, the difference is noise.
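
The percentages above follow from the standard Elo expected-score formula, P = 1 / (1 + 10^(-gap / 400)). A short check, with `win_probability` as an illustrative name:

```python
def win_probability(gap: float) -> float:
    """Expected head-to-head win rate for the model rated `gap` points higher."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


for gap in (10, 30, 100):
    print(f"{gap:>3} Elo gap -> {win_probability(gap):.1%}")
```

This prints 51.4%, 54.3% and 64.0%. A 10-point gap is barely distinguishable from a coin flip, which is why small differences near the top of the table should not drive a decision on their own.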

The platform runs sub-leaderboards for specific categories (coding, reasoning, multilingual, long context, hard prompts) by filtering votes where the prompt belongs to that category. Because each sub-leaderboard is rated from its own subset of votes, a model can hold a different rating in each one. Claude Opus 4.8 now leads both coding (~1582) and overall (~1580), having overtaken Opus 4.7 in both. The categories can still split: Gemini 3.1 Pro, for instance, leads science while trailing on coding.
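
To see how filtering votes can split the rankings, here is a toy sketch. The vote records, field names, model names and step size are all hypothetical, and the platform's real pipeline is not described in this article.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(votes, category=None, k=4.0, start=1000.0):
    """Replay votes through Elo, optionally keeping only one category."""
    ratings = defaultdict(lambda: start)
    for vote in votes:
        if category is not None and category not in vote["categories"]:
            continue  # this vote does not count toward the sub-leaderboard
        a, b = vote["model_a"], vote["model_b"]
        delta = k * (vote["score_a"] - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical votes: model-x wins the coding prompt, model-y the science ones.
votes = [
    {"model_a": "model-x", "model_b": "model-y", "score_a": 1.0, "categories": {"coding"}},
    {"model_a": "model-x", "model_b": "model-y", "score_a": 0.0, "categories": {"science"}},
    {"model_a": "model-x", "model_b": "model-y", "score_a": 0.0, "categories": {"science"}},
]

overall = rate(votes)
coding = rate(votes, category="coding")
print("overall:", overall)
print("coding: ", coding)
```

In this toy data model-x leads the coding view while model-y leads overall, the same kind of split the leaderboard shows between categories.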

How to use the LMSys leaderboard to pick a model

Start by identifying your dominant workload. If your traffic is mostly coding, the overall leaderboard is misleading. The same holds if your traffic is mostly multilingual chat or structured reasoning. Pull the sub-leaderboard that matches your traffic shape and read the top 5 from there.

Next, filter by your binding constraints: cost ceiling, on-prem requirement, context length, voice support, fine-tuning availability. The top 2 to 3 models on the relevant sub-leaderboard that satisfy your constraints are usually within margin on quality. At that point the choice comes down to price and deployment fit, not Elo.
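
The filtering step can be sketched as a small function. The candidate names and numbers below are placeholders, not leaderboard data, and the 30-point margin is an assumed default you should tune to your own tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    elo: int              # rating on the sub-leaderboard matching your workload
    output_price: float   # USD per 1M output tokens
    context_tokens: int
    open_weights: bool


def shortlist(candidates, max_output_price, min_context,
              require_open_weights=False, margin=30):
    """Keep models that meet the constraints and sit within `margin` Elo
    of the best eligible model, cheapest first."""
    eligible = [
        c for c in candidates
        if c.output_price <= max_output_price
        and c.context_tokens >= min_context
        and (c.open_weights or not require_open_weights)
    ]
    if not eligible:
        return []
    best = max(c.elo for c in eligible)
    close = [c for c in eligible if best - c.elo <= margin]
    return sorted(close, key=lambda c: c.output_price)


# Placeholder candidates for illustration only.
candidates = [
    Candidate("model-a", 1520, 50.0, 1_000_000, False),
    Candidate("model-b", 1505, 25.0, 1_000_000, False),
    Candidate("model-c", 1500, 12.0, 1_000_000, False),
    Candidate("model-d", 1460, 3.5, 1_000_000, True),
    Candidate("model-e", 1510, 30.0, 200_000, False),
]

picks = shortlist(candidates, max_output_price=30.0, min_context=500_000)
print([c.name for c in picks])
```

Here model-a fails the price ceiling, model-e fails the context requirement, and model-d falls outside the margin, leaving model-c and model-b ordered by price.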

Finally, validate with your own eval harness. LMSys captures general human preference but cannot capture your specific workload, your specific data, your specific success criteria. Run the top 2 candidates through a 100 to 500 example golden dataset before committing. Swfte's AI Model Leaderboard pairs LMSys Elo with the sub-benchmarks (AAII, GPQA, SWE-bench Pro) and live API pricing so the trade-offs sit in one view.
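
A golden-dataset comparison can be as simple as the sketch below. The two model functions and the pass/fail check are stand-ins; in practice they would call your candidates' APIs and apply your own success criteria. The Wilson interval is one reasonable way to judge whether the difference is more than noise.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a win proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


def compare(golden, model_a, model_b, passes):
    """Count examples where exactly one model passes, then estimate how
    often A is the one that passes among those decisive cases."""
    a_only = b_only = 0
    for example in golden:
        a_ok = passes(example, model_a(example["prompt"]))
        b_ok = passes(example, model_b(example["prompt"]))
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    low, high = wilson_interval(a_only, a_only + b_only)
    return {"a_only": a_only, "b_only": b_only, "a_win_rate_ci": (low, high)}


# Stand-in dataset and models: the task is to double a number.
golden = [{"prompt": str(i), "expected": str(i * 2)} for i in range(200)]
model_a = lambda prompt: str(int(prompt) * 2)
model_b = lambda prompt: "wrong" if int(prompt) % 5 == 0 else str(int(prompt) * 2)
passes = lambda example, output: output == example["expected"]

result = compare(golden, model_a, model_b, passes)
print(result)
```

If the interval for A's win rate sits entirely above 0.5, A is the safer pick on this dataset; if it straddles 0.5, the two are within margin and price or deployment fit should decide.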

FAQ

What is the LMSys Chatbot Arena leaderboard?

LMSys Chatbot Arena (now known as LMArena) is a crowdsourced AI model evaluation platform run by the Large Model Systems Organization. Users submit prompts and vote on blind A/B responses from two models. Votes feed an Elo rating system that ranks models by human preference. Launched in 2023, it became the de facto reference leaderboard for frontier LLM quality.

How does LMSys Elo work?

Same Elo system used in chess. Every blind A/B vote updates both models' ratings: the winner gains more points the more unexpected the win (most for upsetting a higher-rated model), and the loser loses the same amount. After millions of votes, the rating converges on a stable rank order. Differences below 10 Elo are within noise; differences above 30 Elo are reliably meaningful in production.

Why is the LMSys leaderboard important?

Three reasons. (1) It measures human preference, not benchmark accuracy, which correlates better with downstream usefulness. (2) It uses blind A/B comparison, eliminating brand bias. (3) Crowdsourced volume (millions of votes) gives statistical confidence that no single-shot benchmark can match. Procurement teams routinely require an LMSys position before approving a new model.

Who is at the top of the LMSys leaderboard in 2026?

Claude Opus 4.8 is the new #1, leading both the overall arena (~1580 Elo) and the coding sub-leaderboard (~1582 Elo) after topping the Artificial Analysis Intelligence Index at 61.4. Claude Opus 4.7 and GPT-5.5 Pro follow, with Gemini 3.1 Pro leading science and 2M-context categories. The top 10 also includes GPT-5.5, Claude Sonnet 4, Gemini 3.0, DeepSeek V4 Pro, Alibaba's Qwen 3.7 Max, and Grok 4.

What is the difference between LMSys, LMArena, and Chatbot Arena?

Same thing under different names. The Large Model Systems Organization (LMSys) at UC Berkeley launched Chatbot Arena in 2023. The platform later rebranded to LMArena (lmarena.ai) and now operates as an independent company. People still use all three terms interchangeably to refer to the leaderboard.

How often does the LMSys leaderboard update?

Continuously. New votes flow in 24/7, and the public leaderboard typically refreshes weekly with the latest Elo ratings. Major model releases trigger a faster refresh, sometimes within hours, once enough votes accumulate to produce a stable rating.

Are LMSys rankings gameable?

In theory, yes. A coordinated voting campaign for a specific model could move ratings. In practice, LMSys deploys multiple defences: prompt diversity sampling, vote velocity caps, account-quality filters, and per-region rate limits. The Elo system is also self-correcting: a model with an artificially inflated rating is expected to win more often, so each ordinary loss to a strong model costs it more points and washes out the gains.

How does the LMSys coding leaderboard differ from the overall arena?

Same Elo system, restricted to coding prompts. Voters classify their prompt as code-related (or the system infers it), and only those votes feed the coding sub-rating. Claude Opus 4.8 sits at #1 for coding (~1582 Elo), having overtaken Opus 4.7 (1567). With Anthropic models holding the top two coding spots, the ranking reflects Claude's well-documented coding strength.

How do I use LMSys to pick a model?

Three steps. (1) Identify your dominant workload (coding, reasoning, science, multilingual, voice). (2) Pull the LMSys sub-leaderboard for that workload. (3) Filter by your binding constraint (cost, latency, context window, deployment posture). The top 2-3 models on the relevant sub-leaderboard that satisfy your constraint are usually within margin of error on quality.

Is LMSys the only AI leaderboard that matters?

No. Pair it with AAII (reasoning), GPQA Diamond (science), SWE-bench Pro (coding), and your own internal eval harness. LMSys captures human preference at the chat level; the others capture specific capabilities. The strongest production decisions combine all of them.