Updated May 15, 2026 · 8 min read

LMSys Chatbot Arena (July 2026)

Short version. LMSys (rebranded to LMArena) is the crowdsourced human-preference leaderboard that became the reference for AI model quality. Claude Opus 4.8 is the new #1, leading both coding (~1582 Elo) and the overall arena (~1580). GPT-5.5 Pro and Gemini 3.1 Pro round out the top tier; Gemini owns the science and long-context categories.

Top 10 LMSys rankings, July 2026

Live snapshot of the LMSys Chatbot Arena top 10. Elo ratings refresh weekly; major releases produce faster updates once enough votes accumulate. Coding Elo is the sub-leaderboard restricted to code prompts.

#	Model	Maker	Overall Elo	Coding Elo	Notes
1	Claude Opus 4.8	Anthropic	1580	1582	New #1 overall & coding. AAII 61.4. SWE-bench Pro 69.2%, computer-use leader.
2	Claude Opus 4.7	Anthropic	1567	1567	Prior coding #1. Still elite for agentic workflows.
3	GPT-5.5 Pro	OpenAI	1551	1531	Reasoning #1 (AAII 59). Voice + multimodal flagship.
4	Gemini 3.1 Pro	Google DeepMind	1538	1505	Science #1 (GPQA Diamond 94.3%). 2M context.
5	GPT-5.5	OpenAI	1523	1483	Mainline GPT-5.5. Workhorse for general chat.
6	Claude Sonnet 4	Anthropic	1518	1520	Production workhorse. 1M context.
7	Gemini 3.0	Google DeepMind	1505	1462	Cost leader at mainline tier.
8	DeepSeek V4 Pro	DeepSeek	1462	1454	Open weights. ~1/8 the price of GPT-5.5 Pro.
9	Qwen 3.7 Max	Alibaba	1455	1450	Highest-ranked Chinese model (AAII 56.6). 35h autonomous agentic runs.
10	Grok 4	xAI	1441	1440	Real-time X data access.

Source: LMArena public leaderboard. Live snapshot as of 2026-05-15. Verify at lmarena.ai before making contracting decisions.

Full Arena Elo leaderboard — 41 models

Every model in our directory with a published Arena Elo, sorted by Elo descending.

#	Model	Quality	Arena ELO	Speed	Price	Context	Value	Released
1	Anthropic: Claude Fable 5 (New) Anthropic · Frontier agentic coding & knowledge work	100	1525	58 t/s	$10 / $50	1M	3.3	Jun 2026
2	Anthropic: Claude Opus 4.8 Anthropic · Coding, agents & computer use	99	1512	72 t/s	$5 / $25	1M	6.6	May 2026
3	OpenAI: GPT-5.5 Pro OpenAI · Reasoning at any cost	98	1510	68 t/s	$30 / $180	1M	0.9	Apr 2026
4	OpenAI: GPT-5.5 OpenAI · Frontier general purpose	97	1506	70 t/s	$5 / $30	1M	5.5	Apr 2026
5	Anthropic: Claude Opus 4.7 Anthropic · Coding & agentic workflows	96	1505	68 t/s	$5 / $25	1M	6.4	Apr 2026
6	Google: Gemini 3.1 Pro Preview Custom Tools Google · Speed & cost	96	1505	—	$2 / $12	1M	13.7	Feb 2026
7	Google: Gemini 3.1 Pro Preview Google · Science & long-context	96	1505	131 t/s	$2 / $12	1M	13.7	Apr 2026
8	xAI: Grok 4.3 xAI · Agentic tasks & real-time info	93	1496	83 t/s	$1.25 / $2.5	1M	49.6	May 2026
9	xAI: Grok 4.20 xAI · General purpose	93	1496	—	$1.25 / $2.5	2M	49.6	Mar 2026
10	OpenAI: GPT-5.4 OpenAI · General purpose	93	1495	—	$2.5 / $15	1M	10.6	Mar 2026
11	Anthropic: Claude Opus 4.6 Anthropic · General purpose	95	1490	—	$5 / $25	1M	6.3	Feb 2026
12	Qwen: Qwen3.7 Max Alibaba Cloud · Long autonomous agentic runs	94	1488	90 t/s	$2.5 / $7.5	1M	18.8	May 2026
13	DeepSeek: DeepSeek V4 Pro (OSS) DeepSeek · Open-source value leader	90	1467	33 t/s	$1.74 / $3.48	1M	34.5	Apr 2026
14	Anthropic: Claude Sonnet 4.6 Anthropic · Coding & balance	90	1467	73 t/s	$3 / $15	1M	10.0	Feb 2026
15	Z.ai: GLM 5.1 (OSS) · Open-weight agentic & tool use	88	1467	48 t/s	$0.98 / $3.08	200K	43.3	Apr 2026
16	MoonshotAI: Kimi K2.6 Moonshot AI · Frontier quality at low cost	92	1466	48 t/s	$0.73 / $3.49	256K	43.6	Apr 2026
17	OpenAI: GPT-5 OpenAI · General purpose	90	1455	—	$1.25 / $10	400K	16.0	Aug 2025
18	MiniMax: MiniMax M3 (OSS) · Open-weight agentic coding	89	1455	80 t/s	$0.6 / $2.4	1M	59.3	Jun 2026
19	DeepSeek: DeepSeek V3.2 (OSS) DeepSeek · Open-source	87	1455	—	$0.252 / $0.378	164K	276.2	Dec 2025
20	MoonshotAI: Kimi K2.5 Moonshot AI · Speed & cost	89	1452	—	$0.4 / $1.9	262K	77.4	Jan 2026
21	Z.ai: GLM 5 (OSS) · Open-source	88	1450	—	$0.6 / $1.92	80K	69.8	Feb 2026
22	Qwen: Qwen3.6 Plus Alibaba Cloud · Multilingual & APAC	86	1448	124 t/s	$1.4 / $5.6	256K	24.6	Apr 2026
23	DeepSeek: DeepSeek V4 Flash (OSS) DeepSeek · Cheap-and-fast cascade tier	80	1410	105 t/s	$0.1 / $0.2	1M	533.3	Apr 2026
24	OpenAI: o3 OpenAI · Hard reasoning	94	1370	68 t/s	$10 / $40	200K	3.8	Apr 2025
25	Anthropic: Claude Opus 4 Anthropic · Complex analysis	91	1360	52 t/s	$15 / $75	200K	2.0	May 2025
26	Google: Gemini 2.5 Pro Google · Multimodal + value	92	1345	87 t/s	$1.25 / $10	1M	16.4	Mar 2025
27	xAI: Grok 3 xAI · Real-time info	87	1330	82 t/s	$3 / $15	131K	9.7	Feb 2025
28	Anthropic: Claude Sonnet 4 Anthropic · Coding & balance	88	1320	95 t/s	$3 / $15	200K	9.8	May 2025
29	OpenAI: GPT-4.1 OpenAI · Long context	89	1310	120 t/s	$2 / $8	1M	17.8	Apr 2025
30	DeepSeek: DeepSeek V3 (OSS) DeepSeek · Best open-source value	86	1310	62 t/s	$0.27 / $1.1	128K	125.5	Mar 2025
31	OpenAI: o3 Mini OpenAI · Reasoning & math	88	1305	155 t/s	$1.1 / $4.4	200K	32.0	Jan 2025
32	OpenAI: GPT-4o (2024-08-06) OpenAI · General purpose	85	1285	109 t/s	$2.5 / $10	128K	13.6	May 2024
33	xAI: Grok 3 Mini xAI · Budget reasoning	78	1275	165 t/s	$0.3 / $0.5	131K	195.0	Feb 2025
34	Meta: Llama 4 Maverick (OSS) Meta · Open-source value	80	1260	135 t/s	$0.2 / $0.6	1M	200.0	Apr 2025
35	Qwen2.5 72B Instruct (OSS) Alibaba Cloud · Open-source flagship	80	1255	85 t/s	$0.3 / $0.9	131K	133.3	Sep 2024
36	Mistral Large 2411 Mistral AI · Multilingual	79	1250	78 t/s	$2 / $6	128K	19.8	Nov 2024
37	Google: Gemini 2.0 Flash Google · Fastest + cheapest	74	1240	244 t/s	$0.1 / $0.4	1M	296.0	Feb 2025
38	Anthropic: Claude 3.5 Haiku Anthropic · Speed & cost	75	1230	172 t/s	$0.8 / $4	200K	31.3	Oct 2024
39	OpenAI: GPT-4o-mini OpenAI · High throughput	72	1216	183 t/s	$0.15 / $0.6	128K	192.0	Jul 2024
40	Meta: Llama 4 Scout (OSS) Meta · Longest context	71	1195	198 t/s	$0.15 / $0.4	10M	258.2	Apr 2025
41	Cohere: Command R+ (08-2024) Cohere · Enterprise RAG	68	1170	72 t/s	$2.5 / $10	128K	10.9	Aug 2024

Quality = composite benchmark (MMLU, HumanEval, MATH) · Arena ELO = LMSYS Chatbot Arena rating · Value = quality per dollar · Price = input / output per 1M tokens
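The legend does not spell out how "quality per dollar" is computed. The rows in the table are consistent with dividing Quality by the simple average of the input and output price, so the sketch below uses that rule. Treat the formula as an inference from the published rows, not a documented definition; the function name `value_score` is hypothetical.

```python
def value_score(quality: float, input_price: float, output_price: float) -> float:
    """Quality points per blended dollar.

    Assumption: the blended price is the plain mean of the input and
    output price per 1M tokens. This is inferred from the table rows.
    """
    blended_price = (input_price + output_price) / 2
    return round(quality / blended_price, 1)


# Rows taken from the table above: (quality, input price, output price).
print(value_score(99, 5, 25))         # Claude Opus 4.8 -> 6.6
print(value_score(98, 30, 180))       # GPT-5.5 Pro     -> 0.9
print(value_score(87, 0.252, 0.378))  # DeepSeek V3.2   -> 276.2
```

If your traffic is heavily skewed toward input or output tokens, a weighted blend will rank the cheap models differently from the Value column.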

How LMSys Elo actually works

The mechanic is the chess Elo system, retargeted at language models. A user submits a prompt. The platform sends the prompt to two anonymised models in parallel and shows both responses side by side. The user votes for the better response, declares a tie, or marks both as bad. The vote updates both ratings: the winner gains more points the less it was expected to win, so an upset against a higher-rated model moves the ratings most, and the loser drops by the same amount.
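A minimal sketch of that update, using the textbook online Elo rule. The K-factor of 32 and the function names are illustrative assumptions, and the production leaderboard may compute ratings differently (for example by fitting all votes at once). How "both bad" votes are scored is not stated here, so the sketch only handles win, tie, and loss.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k is an assumed step size, not a documented Arena value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Evenly matched models: the winner takes half of k.
print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)

# An upset: the lower-rated model wins and gains more than half of k.
print(elo_update(1400, 1500, 1.0))
```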

Over millions of votes the system converges. A rating gap of 100 Elo means the higher model wins about 64% of head-to-head votes. A gap of 30 Elo means roughly 54%. Below 10 Elo, the difference is noise.
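Those percentages follow from the standard Elo expected-score curve. The short check below reproduces them; the function name `win_probability` is just a label for this example.

```python
def win_probability(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model for a given Elo gap."""
    return 1 / (1 + 10 ** (-elo_gap / 400))


for gap in (100, 30, 10):
    print(f"{gap:>3} Elo gap -> {win_probability(gap):.1%}")
# 100 Elo gap -> 64.0%
#  30 Elo gap -> 54.3%
#  10 Elo gap -> 51.4%
```

A 10-point gap is barely better than a coin flip, which is why small differences near the top of the table should not drive a decision on their own.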

The platform runs sub-leaderboards for specific categories (coding, reasoning, multilingual, long context, hard prompts) by restricting the rating to votes whose prompt belongs to that category. A model can therefore hold different positions on different boards. Claude Opus 4.8 currently leads both coding (~1582) and overall (~1580), having overtaken Opus 4.7 on both, but the categories can still split: Gemini 3.1 Pro, for instance, leads science while trailing on coding.
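To see how a category filter can reorder models, here is a toy sketch. It uses a raw win rate instead of Elo to stay short, and the vote records, field names, and model names (`model-x`, `model-y`) are all made up for illustration.

```python
from collections import defaultdict


def win_rates(votes, category=None):
    """Win rate per model, optionally restricted to one prompt category."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for vote in votes:
        if category is not None and category not in vote["categories"]:
            continue  # Skip votes outside the requested sub-leaderboard.
        wins[vote["a"]] += vote["score_a"]
        wins[vote["b"]] += 1 - vote["score_a"]
        games[vote["a"]] += 1
        games[vote["b"]] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical votes: score_a is 1.0 when model "a" wins, 0.0 when it loses.
votes = [
    {"a": "model-x", "b": "model-y", "score_a": 1.0, "categories": {"coding"}},
    {"a": "model-x", "b": "model-y", "score_a": 1.0, "categories": {"coding"}},
    {"a": "model-x", "b": "model-y", "score_a": 0.0, "categories": {"science"}},
    {"a": "model-x", "b": "model-y", "score_a": 0.0, "categories": {"science"}},
    {"a": "model-x", "b": "model-y", "score_a": 0.0, "categories": {"science"}},
]

print(win_rates(votes))            # model-y leads overall
print(win_rates(votes, "coding"))  # model-x leads on coding prompts
```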

How to use the LMSys leaderboard to pick a model

Start by identifying your dominant workload. If your traffic is mostly coding, the overall leaderboard is misleading. The same holds if your traffic is mostly multilingual chat or structured reasoning. Pull the sub-leaderboard that matches your traffic shape and read the top 5 from there.

Next, filter by your binding constraints: cost ceiling, on-prem requirement, context length, voice support, fine-tuning availability. The top 2 to 3 models on the relevant sub-leaderboard that satisfy your constraints are usually within the margin of error on quality. At that point the choice comes down to price and deployment fit, not Elo.
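The filter-then-rank step can be sketched in a few lines. The rows below are copied from the full leaderboard table on this page (overall Arena Elo, price per 1M tokens, context), with "1M" read as 1,000,000 tokens and the OSS badge read as open weights. The `Model` class, the `shortlist` function, and its parameters are hypothetical, and a real run should use the sub-leaderboard Elo for your workload instead of the overall figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    elo: int
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens
    context_tokens: int
    open_weights: bool


CANDIDATES = [
    Model("Claude Opus 4.8", 1512, 5.0, 25.0, 1_000_000, False),
    Model("GPT-5.5 Pro", 1510, 30.0, 180.0, 1_000_000, False),
    Model("Gemini 3.1 Pro Preview", 1505, 2.0, 12.0, 1_000_000, False),
    Model("DeepSeek V4 Pro", 1467, 1.74, 3.48, 1_000_000, True),
    Model("GLM 5.1", 1467, 0.98, 3.08, 200_000, True),
]


def shortlist(models, max_output_price=None, min_context=0,
              require_open_weights=False, top_n=3):
    """Drop models that break a constraint, then keep the top_n by Elo."""
    eligible = [
        m for m in models
        if (max_output_price is None or m.output_price <= max_output_price)
        and m.context_tokens >= min_context
        and (m.open_weights or not require_open_weights)
    ]
    return sorted(eligible, key=lambda m: m.elo, reverse=True)[:top_n]


picks = shortlist(CANDIDATES, max_output_price=15, min_context=500_000)
print([m.name for m in picks])
# ['Gemini 3.1 Pro Preview', 'DeepSeek V4 Pro']
```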

Finally, validate with your own eval harness. LMSys captures general human preference but cannot capture your specific workload, your specific data, or your specific success criteria. Run the top 2 candidates through a golden dataset of 100 to 500 examples before committing. Swfte's AI Model Leaderboard pairs LMSys Elo with the sub-benchmarks (AAII, GPQA, SWE-bench Pro) and live API pricing so the trade-offs sit in one view.
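A golden-dataset comparison can start very small. The sketch below is one possible shape, not a prescribed harness: the candidate functions are stand-ins for real API calls, the three-item dataset is a placeholder for your 100 to 500 examples, and exact-match grading is an assumption that only suits tasks with a single correct answer.

```python
from typing import Callable, Sequence


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(model: Callable[[str], str],
              golden: Sequence[tuple[str, str]],
              grade: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of golden examples the model answers acceptably."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(grade(model(prompt), expected) for prompt, expected in golden)
    return passed / len(golden)


# Stand-in candidates. Replace these with calls to your two shortlisted models.
def candidate_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def candidate_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "")


golden = [("2+2", "4"), ("capital of France", "paris"), ("3*3", "9")]

for name, model in (("candidate_a", candidate_a), ("candidate_b", candidate_b)):
    print(f"{name}: {pass_rate(model, golden):.0%}")
```

For open-ended outputs, swap `exact_match` for a rubric or a reviewer-scored grader; the loop itself stays the same.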

FAQ

What is the LMSys Chatbot Arena leaderboard?

LMSys Chatbot Arena (now known as LMArena) is a crowdsourced AI model evaluation platform run by the Large Model Systems Organization. Users submit prompts and vote on blind A/B responses from two models. Votes feed an Elo rating system that ranks models by human preference. Launched in 2023, it became the de facto reference leaderboard for frontier LLM quality.

How does LMSys Elo work?

Same Elo system used in chess. Every blind A/B vote updates both models' ratings: the winner gains more points the less it was expected to win (so upsetting a higher-rated model pays the most), and the loser loses the same amount. After millions of votes, the ratings converge on a stable rank order. Differences below 10 Elo are within noise; differences above 30 Elo are reliably meaningful in production.

Why is the LMSys leaderboard important?

Three reasons. (1) It measures human preference, not benchmark accuracy, which correlates better with downstream usefulness. (2) It uses blind A/B comparison, eliminating brand bias. (3) Crowdsourced volume (millions of votes) gives statistical confidence that no single-shot benchmark can match. Procurement teams routinely require an LMSys position before approving a new model.

Who is at the top of the LMSys leaderboard in 2026?

Claude Opus 4.8 is the new #1, leading both the overall arena (~1580 Elo) and the coding sub-leaderboard (~1582 Elo) after topping the Artificial Analysis Intelligence Index at 61.4. Claude Opus 4.7 and GPT-5.5 Pro follow, with Gemini 3.1 Pro leading science and 2M-context categories. The top 10 also includes GPT-5.5, Claude Sonnet 4, Gemini 3.0, DeepSeek V4 Pro, Alibaba's Qwen 3.7 Max, and Grok 4.

What is the difference between LMSys, LMArena, and Chatbot Arena?

Same thing under different names. The Large Model Systems Organization (LMSys) at UC Berkeley launched Chatbot Arena in 2023. The platform later rebranded to LMArena (lmarena.ai) under its own independent corporate entity. People still use all three terms interchangeably to refer to the leaderboard.

How often does the LMSys leaderboard update?

Continuously. New votes flow in 24/7, and the public leaderboard typically refreshes weekly with the latest Elo ratings. Major model releases trigger a faster refresh, sometimes within hours, once enough votes accumulate to produce a stable rating.

Are LMSys rankings gameable?

In theory, yes. A coordinated voting campaign for a specific model could move ratings. In practice, LMSys deploys multiple defences: prompt diversity sampling, vote velocity caps, account-quality filters, and per-region rate limits. The Elo system is also self-correcting: a model whose rating has been artificially inflated is expected to beat strong opponents, so every honest loss to them costs it more points and washes out the gains.

How does the LMSys coding leaderboard differ from the overall arena?

Same Elo system, restricted to coding prompts. Voters classify their prompt as code-related (or the system infers it), and only those votes feed the coding sub-rating. Claude Opus 4.8 sits at #1 for coding (~1582 Elo), having overtaken Opus 4.7 (1567). With Claude models holding the top two coding spots, the board reflects Claude's well-documented coding strength.

How do I use LMSys to pick a model?

Three steps. (1) Identify your dominant workload (coding, reasoning, science, multilingual, voice). (2) Pull the LMSys sub-leaderboard for that workload. (3) Filter by your binding constraint (cost, latency, context window, deployment posture). The top 2-3 models on the relevant sub-leaderboard that satisfy your constraint are usually within margin of error on quality.

Is LMSys the only AI leaderboard that matters?

No. Pair it with AAII (reasoning), GPQA Diamond (science), SWE-bench Pro (coding), and your own internal eval harness. LMSys captures human preference at the chat level; the others capture specific capabilities. The strongest production decisions combine LMSys with all four.