LMSys Chatbot Arena (July 2026)
Short version. LMSys (rebranded to LMArena) is the crowdsourced human-preference leaderboard that became the reference for AI model quality. Claude Opus 4.8 is the new #1, leading both coding (~1582 Elo) and the overall arena (~1580). GPT-5.5 Pro and Gemini 3.1 Pro round out the top tier; Gemini owns the science and long-context categories.
Top 10 LMSys rankings, July 2026
Live snapshot of the LMSys Chatbot Arena top 10. Elo ratings refresh weekly; major releases produce faster updates once enough votes accumulate. Coding Elo is the sub-leaderboard restricted to code prompts.
| # | Model | Maker | Overall Elo | Coding Elo | Notes |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic | 1580 | 1582 | New #1 overall & coding. AAII 61.4. SWE-bench Pro 69.2%, computer-use leader. |
| 2 | Claude Opus 4.7 | Anthropic | 1567 | 1567 | Prior coding #1. Still elite for agentic workflows. |
| 3 | GPT-5.5 Pro | OpenAI | 1551 | 1531 | Reasoning #1 (AAII 59). Voice + multimodal flagship. |
| 4 | Gemini 3.1 Pro | Google DeepMind | 1538 | 1505 | Science #1 (GPQA Diamond 94.3%). 2M context. |
| 5 | GPT-5.5 | OpenAI | 1523 | 1483 | Mainline GPT-5.5. Workhorse for general chat. |
| 6 | Claude Sonnet 4 | Anthropic | 1518 | 1520 | Production workhorse. 1M context. |
| 7 | Gemini 3.0 | Google DeepMind | 1505 | 1462 | Cost leader at mainline tier. |
| 8 | DeepSeek V4 Pro | DeepSeek | 1462 | 1454 | Open weights. ~1/8 the price of GPT-5.5 Pro. |
| 9 | Qwen 3.7 Max | Alibaba | 1455 | 1450 | Highest-ranked Chinese model (AAII 56.6). 35h autonomous agentic runs. |
| 10 | Grok 4 | xAI | 1441 | 1440 | Real-time X data access. |
Source: LMArena public leaderboard. Live snapshot as of 2026-05-15. Verify at lmarena.ai before making contracting decisions.
Full Arena Elo leaderboard — 41 models
Every model in our directory with a published Arena Elo, sorted by Elo descending.
| # | Model | Quality | Arena Elo | Speed | Price | Context | Value | Released |
|---|---|---|---|---|---|---|---|---|
| 1 | Anthropic · Frontier agentic coding & knowledge work | 100 | 1525 | 58 t/s | $10 / $50 | 1M | 3.3 | Jun 2026 |
| 2 | Anthropic · Coding, agents & computer use | 99 | 1512 | 72 t/s | $5 / $25 | 1M | 6.6 | May 2026 |
| 3 | OpenAI · Reasoning at any cost | 98 | 1510 | 68 t/s | $30 / $180 | 1M | 0.9 | Apr 2026 |
| 4 | OpenAI · Frontier general purpose | 97 | 1506 | 70 t/s | $5 / $30 | 1M | 5.5 | Apr 2026 |
| 5 | Anthropic · Coding & agentic workflows | 96 | 1505 | 68 t/s | $5 / $25 | 1M | 6.4 | Apr 2026 |
| 6 | Google · Speed & cost | 96 | 1505 | — | $2 / $12 | 1M | 13.7 | Feb 2026 |
| 7 | Google · Science & long-context | 96 | 1505 | 131 t/s | $2 / $12 | 1M | 13.7 | Apr 2026 |
| 8 | xAI · Agentic tasks & real-time info | 93 | 1496 | 83 t/s | $1.25 / $2.5 | 1M | 49.6 | May 2026 |
| 9 | xAI · General purpose | 93 | 1496 | — | $1.25 / $2.5 | 2M | 49.6 | Mar 2026 |
| 10 | OpenAI · General purpose | 93 | 1495 | — | $2.5 / $15 | 1M | 10.6 | Mar 2026 |
| 11 | Anthropic · General purpose | 95 | 1490 | — | $5 / $25 | 1M | 6.3 | Feb 2026 |
| 12 | Alibaba Cloud · Long autonomous agentic runs | 94 | 1488 | 90 t/s | $2.5 / $7.5 | 1M | 18.8 | May 2026 |
| 13 | DeepSeek · Open-source value leader | 90 | 1467 | 33 t/s | $1.74 / $3.48 | 1M | 34.5 | Apr 2026 |
| 14 | Anthropic · Coding & balance | 90 | 1467 | 73 t/s | $3 / $15 | 1M | 10.0 | Feb 2026 |
| 15 | · Open-weight agentic & tool use | 88 | 1467 | 48 t/s | $0.98 / $3.08 | 200K | 43.3 | Apr 2026 |
| 16 | Moonshot AI · Frontier quality at low cost | 92 | 1466 | 48 t/s | $0.73 / $3.49 | 256K | 43.6 | Apr 2026 |
| 17 | OpenAI · General purpose | 90 | 1455 | — | $1.25 / $10 | 400K | 16.0 | Aug 2025 |
| 18 | · Open-weight agentic coding | 89 | 1455 | 80 t/s | $0.6 / $2.4 | 1M | 59.3 | Jun 2026 |
| 19 | DeepSeek · Open-source | 87 | 1455 | — | $0.252 / $0.378 | 164K | 276.2 | Dec 2025 |
| 20 | Moonshot AI · Speed & cost | 89 | 1452 | — | $0.4 / $1.9 | 262K | 77.4 | Jan 2026 |
| 21 | Z.ai: GLM 5OSS · Open-source | 88 | 1450 | — | $0.6 / $1.92 | 80K | 69.8 | Feb 2026 |
| 22 | Alibaba Cloud · Multilingual & APAC | 86 | 1448 | 124 t/s | $1.4 / $5.6 | 256K | 24.6 | Apr 2026 |
| 23 | DeepSeek · Cheap-and-fast cascade tier | 80 | 1410 | 105 t/s | $0.1 / $0.2 | 1M | 533.3 | Apr 2026 |
| 24 | OpenAI · Hard reasoning | 94 | 1370 | 68 t/s | $10 / $40 | 200K | 3.8 | Apr 2025 |
| 25 | Anthropic · Complex analysis | 91 | 1360 | 52 t/s | $15 / $75 | 200K | 2.0 | May 2025 |
| 26 | Google · Multimodal + value | 92 | 1345 | 87 t/s | $1.25 / $10 | 1M | 16.4 | Mar 2025 |
| 27 | xAI · Real-time info | 87 | 1330 | 82 t/s | $3 / $15 | 131K | 9.7 | Feb 2025 |
| 28 | Anthropic · Coding & balance | 88 | 1320 | 95 t/s | $3 / $15 | 200K | 9.8 | May 2025 |
| 29 | OpenAI · Long context | 89 | 1310 | 120 t/s | $2 / $8 | 1M | 17.8 | Apr 2025 |
| 30 | DeepSeek · Best open-source value | 86 | 1310 | 62 t/s | $0.27 / $1.1 | 128K | 125.5 | Mar 2025 |
| 31 | OpenAI · Reasoning & math | 88 | 1305 | 155 t/s | $1.1 / $4.4 | 200K | 32.0 | Jan 2025 |
| 32 | OpenAI · General purpose | 85 | 1285 | 109 t/s | $2.5 / $10 | 128K | 13.6 | May 2024 |
| 33 | xAI · Budget reasoning | 78 | 1275 | 165 t/s | $0.3 / $0.5 | 131K | 195.0 | Feb 2025 |
| 34 | Meta · Open-source value | 80 | 1260 | 135 t/s | $0.2 / $0.6 | 1M | 200.0 | Apr 2025 |
| 35 | Alibaba Cloud · Open-source flagship | 80 | 1255 | 85 t/s | $0.3 / $0.9 | 131K | 133.3 | Sep 2024 |
| 36 | Mistral AI · Multilingual | 79 | 1250 | 78 t/s | $2 / $6 | 128K | 19.8 | Nov 2024 |
| 37 | Google · Fastest + cheapest | 74 | 1240 | 244 t/s | $0.1 / $0.4 | 1M | 296.0 | Feb 2025 |
| 38 | Anthropic · Speed & cost | 75 | 1230 | 172 t/s | $0.8 / $4 | 200K | 31.3 | Oct 2024 |
| 39 | OpenAI · High throughput | 72 | 1216 | 183 t/s | $0.15 / $0.6 | 128K | 192.0 | Jul 2024 |
| 40 | Meta · Longest context | 71 | 1195 | 198 t/s | $0.15 / $0.4 | 10M | 258.2 | Apr 2025 |
| 41 | Cohere · Enterprise RAG | 68 | 1170 | 72 t/s | $2.5 / $10 | 128K | 10.9 | Aug 2024 |
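The table does not define the Value column. The printed figures are consistent with Quality divided by the mean of the two listed prices, so the sketch below reproduces a few rows under that assumption. The function name `value_score` is hypothetical, and the formula is an inference from the numbers, not a documented definition.

```python
def value_score(quality: float, first_price: float, second_price: float) -> float:
    """Quality points per dollar, using the mean of the two listed prices.

    Assumption: this formula is inferred from the table, not documented.
    """
    return quality / ((first_price + second_price) / 2.0)


# (quality, first price, second price, Value as printed in the table)
rows = [
    (100, 10.0, 50.0, 3.3),    # row 1
    (98, 30.0, 180.0, 0.9),    # row 3
    (96, 2.0, 12.0, 13.7),     # row 6
    (80, 0.1, 0.2, 533.3),     # row 23
]

for quality, first, second, printed in rows:
    computed = value_score(quality, first, second)
    print(f"computed {computed:.1f} vs printed {printed}")
```

Read this way, Value rewards cheap models heavily: row 23 scores far above row 1 despite a 20-point quality deficit, so the column is best used to rank models within a quality band rather than across the whole table.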
How LMSys Elo actually works
The mechanic is the chess Elo system, retargeted at language models. A user submits a prompt. The platform sends the prompt to two anonymised models in parallel and shows both responses side by side. The user votes on which is better, picks a tie, or marks both as bad. The vote updates both ratings: the winner gains more points the less expected the win was (beating a higher-rated model pays more than beating a lower-rated one), and the loser drops by the same amount.
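The per-vote update described above can be sketched with the textbook Elo formulas. This is a minimal illustration, not LMArena's implementation: the K-factor of 32 and the 400-point scale are chess conventions assumed here, and the production leaderboard may compute ratings differently (for example, by fitting all votes in batch).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k (assumed 32 here) caps how far a single vote can move a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# An upset: the 1500-rated model beats the 1600-rated one.
low, high = update(1500.0, 1600.0, score_a=1.0)
print(round(low, 1), round(high, 1))  # 1520.5 1579.5

# The expected result: the 1600-rated model wins, so fewer points move.
low2, high2 = update(1500.0, 1600.0, score_a=0.0)
print(round(low2, 1), round(high2, 1))  # 1488.5 1611.5
```

The upset moves about 20.5 points while the expected result moves about 11.5, which is the asymmetry that makes wins against stronger models count for more.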
Over millions of votes the system converges. A rating gap of 100 Elo means the higher model wins about 64% of head-to-head votes. A gap of 30 Elo means roughly 54%. Below 10 Elo, the difference is noise.
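The percentages above follow from the standard Elo expected-score formula, where \Delta is the rating gap in points. The 400-point scale is the chess convention and is assumed here to match the leaderboard's.

```latex
E(\Delta) = \frac{1}{1 + 10^{-\Delta/400}}
\qquad
E(100) = \frac{1}{1 + 10^{-0.25}} \approx 0.640
\qquad
E(30) = \frac{1}{1 + 10^{-0.075}} \approx 0.543
\qquad
E(10) = \frac{1}{1 + 10^{-0.025}} \approx 0.514
```

A 10-point gap therefore corresponds to winning only about 51 votes in 100, which is why small gaps are hard to distinguish from noise.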
The platform runs sub-leaderboards for specific categories (coding, reasoning, multilingual, long context, hard prompts) by filtering votes where the prompt belongs to that category. Rankings can therefore differ from one category to the next. Claude Opus 4.8 currently leads both coding (~1582) and overall (~1580), having overtaken Opus 4.7 on both, while Gemini 3.1 Pro leads science but trails on coding.
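A toy illustration of how filtering votes by category can produce different rankings from the same vote log. The vote records, model names, and the `win_rates` helper are all hypothetical, and a plain win rate stands in for the Elo rating to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical vote records: (category, model_a, model_b, winner).
votes = [
    ("coding", "model-x", "model-y", "model-x"),
    ("coding", "model-x", "model-y", "model-x"),
    ("coding", "model-x", "model-y", "model-y"),
    ("science", "model-x", "model-y", "model-y"),
    ("science", "model-x", "model-y", "model-y"),
]


def win_rates(votes, category=None):
    """Share of votes each model won, optionally restricted to one category."""
    wins, games = defaultdict(int), defaultdict(int)
    for cat, model_a, model_b, winner in votes:
        if category is not None and cat != category:
            continue  # a sub-leaderboard only sees votes from its category
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


print(win_rates(votes))             # overall view
print(win_rates(votes, "coding"))   # coding-only view
print(win_rates(votes, "science"))  # science-only view
```

Here model-y leads overall and on science while model-x leads on coding, the same kind of split the sub-leaderboards expose.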
How to use the LMSys leaderboard to pick a model
Start by identifying your dominant workload. If your traffic is mostly coding, the overall leaderboard is misleading. The same holds if your traffic is mostly multilingual chat or structured reasoning. Pull the sub-leaderboard that matches your traffic shape and read the top 5 from there.
Next, filter by your binding constraints: cost ceiling, on-prem requirement, context length, voice support, fine-tuning availability. The top 2 to 3 models on the relevant sub-leaderboard that satisfy your constraints are usually within the margin of error on quality. At that point the choice comes down to price and deployment fit, not Elo.
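The filter-then-compare step can be sketched as follows. Everything here is illustrative: the `Candidate` fields, the placeholder models and their numbers, and the 30-point margin (borrowed from the rule of thumb that gaps above 30 Elo are meaningful) are assumptions, not leaderboard data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    category_elo: int      # Elo on the sub-leaderboard matching your workload
    blended_price: float   # hypothetical $ per 1M tokens
    context_tokens: int
    open_weights: bool


def shortlist(candidates, max_price, min_context,
              need_open_weights=False, margin=30):
    """Drop models that break a constraint, then keep those within
    `margin` Elo of the best survivor, cheapest first."""
    eligible = [
        c for c in candidates
        if c.blended_price <= max_price
        and c.context_tokens >= min_context
        and (c.open_weights or not need_open_weights)
    ]
    if not eligible:
        return []
    best = max(c.category_elo for c in eligible)
    within = [c for c in eligible if best - c.category_elo <= margin]
    return sorted(within, key=lambda c: c.blended_price)


# Placeholder candidates with made-up numbers.
candidates = [
    Candidate("model-a", 1580, 30.0, 1_000_000, False),
    Candidate("model-b", 1565, 15.0, 1_000_000, False),
    Candidate("model-c", 1550, 7.0, 2_000_000, False),
    Candidate("model-d", 1455, 2.5, 1_000_000, True),
]

picks = shortlist(candidates, max_price=20.0, min_context=1_000_000)
print([c.name for c in picks])  # ['model-c', 'model-b']
```

Sorting the survivors by price encodes the point made above: once the remaining models sit within the margin, Elo stops being the deciding factor.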
Finally, validate with your own eval harness. LMSys captures general human preference but cannot capture your specific workload, your specific data, or your specific success criteria. Run the top 2 candidates through a golden dataset of 100 to 500 examples before committing. Swfte's AI Model Leaderboard pairs LMSys Elo with the sub-benchmarks (AAII, GPQA, SWE-bench Pro) and live API pricing so the trade-offs sit in one view.
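A minimal sketch of such a harness, assuming exact-match grading and two stubbed candidates. The `pass_rate` helper, the stub functions, and the four-example dataset are hypothetical; a real run would call each model's API and use the 100 to 500 examples recommended above, often with a more forgiving grader.

```python
from typing import Callable, List, Tuple

GoldenExample = Tuple[str, str]  # (prompt, expected answer)


def pass_rate(model: Callable[[str], str], golden: List[GoldenExample]) -> float:
    """Fraction of golden examples the model answers exactly right."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(1 for prompt, expected in golden
                 if model(prompt).strip() == expected)
    return passed / len(golden)


# Tiny stand-in dataset for illustration only.
golden = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("reverse 'abc'", "cba"),
    ("10 / 4", "2.5"),
]


def candidate_one(prompt: str) -> str:
    # Stub standing in for a real API call.
    answers = {"2 + 2": "4", "capital of France": "Paris",
               "reverse 'abc'": "cba", "10 / 4": "2.5"}
    return answers[prompt]


def candidate_two(prompt: str) -> str:
    # Stub that gets the last example wrong.
    answers = {"2 + 2": "4", "capital of France": "Paris",
               "reverse 'abc'": "cba", "10 / 4": "2"}
    return answers[prompt]


results = {
    "candidate-one": pass_rate(candidate_one, golden),
    "candidate-two": pass_rate(candidate_two, golden),
}
print(results)  # {'candidate-one': 1.0, 'candidate-two': 0.75}
```

Keeping the model behind a plain `Callable[[str], str]` means the same harness can score any provider once a thin wrapper is written for it.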
FAQ
What is the LMSys Chatbot Arena leaderboard?
LMSys Chatbot Arena (now known as LMArena) is a crowdsourced AI model evaluation platform run by the Large Model Systems Organization. Users submit prompts and vote on blind A/B responses from two models. Votes feed an Elo rating system that ranks models by human preference. Launched in 2023, it became the de facto reference leaderboard for frontier LLM quality.
How does LMSys Elo work?
It is the same Elo system used in chess. Every blind A/B vote updates both models' ratings: the winner gains more points the less expected the win was (upsetting a higher-rated model pays the most), and the loser loses the same amount. After millions of votes, the ratings converge on a stable rank order. Differences below 10 Elo are within noise; differences above 30 Elo are reliably meaningful in production.
Why is the LMSys leaderboard important?
Three reasons. (1) It measures human preference rather than benchmark accuracy, and preference correlates better with downstream usefulness. (2) It uses blind A/B comparison, which keeps brand bias out of the vote. (3) Crowdsourced volume (millions of votes) gives statistical confidence that no single-shot benchmark can match. Procurement teams routinely require an LMSys position before approving a new model.
Who is at the top of the LMSys leaderboard in 2026?
Claude Opus 4.8 is the new #1, leading both the overall arena (~1580 Elo) and the coding sub-leaderboard (~1582 Elo) after topping the Artificial Analysis Intelligence Index at 61.4. Claude Opus 4.7 and GPT-5.5 Pro follow, with Gemini 3.1 Pro leading the science and long-context categories. The top 10 also includes GPT-5.5, Claude Sonnet 4, Gemini 3.0, DeepSeek V4 Pro, Alibaba's Qwen 3.7 Max, and Grok 4.
What is the difference between LMSys, LMArena, and Chatbot Arena?
Same thing under different names. The Large Model Systems organisation (LMSys) at UC Berkeley launched Chatbot Arena in 2023. The platform later rebranded to LMArena (lmarena.ai) under its independent corporate entity. People still use all three terms interchangeably to refer to the leaderboard.
How often does the LMSys leaderboard update?
Continuously. New votes flow in 24/7, and the public leaderboard typically refreshes weekly with the latest Elo ratings. Major model releases trigger a faster refresh, sometimes within hours, once enough votes accumulate to produce a stable rating.
Are LMSys rankings gameable?
In theory, yes. A coordinated voting campaign for a specific model could move ratings. In practice, LMSys deploys multiple defences: prompt diversity sampling, vote velocity caps, account-quality filters, and per-region rate limits. The Elo system is also self-correcting: artificial inflation against a strong model produces enough losses to wash out gains.
How does the LMSys coding leaderboard differ from the overall arena?
Same Elo system, restricted to coding prompts. Voters classify their prompt as code-related (or the system infers it), and only those votes feed the coding sub-rating. Claude Opus 4.8 sits at #1 for coding (~1582 Elo), having overtaken Opus 4.7 (1567). Anthropic holding the top two coding slots reflects Claude's well-documented coding strength.
How do I use LMSys to pick a model?
Three steps. (1) Identify your dominant workload (coding, reasoning, science, multilingual, voice). (2) Pull the LMSys sub-leaderboard for that workload. (3) Filter by your binding constraint (cost, latency, context window, deployment posture). The top 2-3 models on the relevant sub-leaderboard that satisfy your constraint are usually within margin of error on quality.
Is LMSys the only AI leaderboard that matters?
No. Pair it with AAII (reasoning), GPQA Diamond (science), SWE-bench Pro (coding), and your own internal eval harness. LMSys captures human preference at the chat level; the others capture specific capabilities. The strongest production decisions combine all of them.