Updated May 15, 2026

LMSys Chatbot Arena (May 2026)

Short version. LMSys (rebranded to LMArena) is the crowdsourced human-preference leaderboard that became the reference for AI model quality. Claude Opus 4.7 leads coding at 1567 Elo. GPT-5.5 Pro leads the overall arena at 1551. Gemini 3.1 Pro owns the science and long-context categories.

Top 10 LMSys rankings, May 2026

Live snapshot of the LMSys Chatbot Arena top 10. Elo ratings refresh weekly; major releases produce faster updates once enough votes accumulate. Coding Elo is the sub-leaderboard restricted to code prompts.

| # | Model | Maker | Overall Elo | Coding Elo | Notes |
|---|-------|-------|-------------|------------|-------|
| 1 | Claude Opus 4.7 | Anthropic | 1567 | 1567 | Coding #1. Slight reasoning gap vs GPT-5.5 Pro. |
| 2 | GPT-5.5 Pro | OpenAI | 1551 | 1531 | Reasoning #1 (AAII 59). Voice + multimodal flagship. |
| 3 | Gemini 3.1 Pro | Google DeepMind | 1538 | 1505 | Science #1 (GPQA Diamond 94.3%). 2M context. |
| 4 | GPT-5.5 | OpenAI | 1523 | 1483 | Mainline GPT-5.5. Workhorse for general chat. |
| 5 | Claude Sonnet 4 | Anthropic | 1518 | 1520 | Production workhorse. 1M context. |
| 6 | Gemini 3.0 | Google DeepMind | 1505 | 1462 | Cost leader at mainline tier. |
| 7 | DeepSeek V4 Pro | DeepSeek | 1462 | 1454 | Open weights. ~1/8 the price of GPT-5.5 Pro. |
| 8 | Grok 4 | xAI | 1441 | 1440 | Real-time X data access. |
| 9 | Kimi K2.5 | Moonshot | 1428 | 1395 | Long-context specialist. |
| 10 | Llama 4 Maverick | Meta | 1412 | 1388 | Open weights. Western-jurisdiction alternative to DeepSeek. |

Source: LMArena public leaderboard. Live snapshot as of 2026-05-15. Verify at lmarena.ai before making contracting decisions.

How LMSys Elo actually works

The mechanic is the chess Elo system, retargeted at language models. A user submits a prompt. The platform sends the prompt to two anonymised models in parallel and shows both responses side by side. The user votes on which is better, picks a tie, or marks both as bad. The vote updates both ratings: the winner gains more points the less the win was expected given the rating gap (an upset against a higher-rated model pays the most), and the loser drops by the same amount.
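The update described above can be sketched in a few lines. This is a minimal illustration of the classic Elo rule, assuming the conventional 400-point logistic scale and a K-factor of 32; both constants and the function names are assumptions for illustration, since the article does not state the parameters the platform actually uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k=32 is an assumed step size, not a documented platform value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# An upset (the lower-rated model wins) moves ratings more than an expected win.
upset = update(1500.0, 1600.0, 1.0)
expected_win = update(1600.0, 1500.0, 1.0)
```

In this sketch the winner of the upset gains more than the winner of the expected result, and a tie between equally rated models leaves both ratings unchanged.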

Over millions of votes the system converges. A rating gap of 100 Elo means the higher model wins about 64% of head-to-head votes. A gap of 30 Elo means roughly 54%. Below 10 Elo, the difference is noise.
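The percentages above follow from the standard Elo expected-score formula, assuming the conventional 400-point scale (an assumption; the article does not state the scale). With \Delta as the rating gap in favour of the stronger model:

```latex
P(\text{win}) = \frac{1}{1 + 10^{-\Delta/400}}

\Delta = 100:\quad \frac{1}{1 + 10^{-0.25}}  \approx 0.64
\Delta = 30:\quad  \frac{1}{1 + 10^{-0.075}} \approx 0.54
\Delta = 10:\quad  \frac{1}{1 + 10^{-0.025}} \approx 0.51
```

A 10-point gap therefore corresponds to roughly a 51/49 split, which is why gaps of that size are hard to distinguish from noise.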

The platform runs sub-leaderboards for specific categories (coding, reasoning, multilingual, long context, hard prompts) by filtering votes where the prompt belongs to that category. This is why Claude Opus 4.7 leads coding (1567) while GPT-5.5 Pro leads overall (1551): the two rankings are computed from different subsets of votes, so they can disagree.
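As a rough sketch of how filtering by category can produce a different leader, the snippet below replays the same hypothetical vote log twice: once in full and once restricted to coding prompts. The vote records, field names, model names, K-factor, and starting rating are all assumptions for illustration, not the platform's actual schema or parameters.

```python
from collections import defaultdict

# Hypothetical vote records; field names are assumptions for illustration.
votes = [
    {"category": "coding", "winner": "model_a", "loser": "model_b"},
    {"category": "coding", "winner": "model_a", "loser": "model_b"},
    {"category": "multilingual", "winner": "model_b", "loser": "model_a"},
    {"category": "multilingual", "winner": "model_b", "loser": "model_a"},
    {"category": "multilingual", "winner": "model_b", "loser": "model_a"},
]


def rate(vote_list, k=32.0, start=1500.0):
    """Replay votes in order and return an Elo rating per model."""
    ratings = defaultdict(lambda: start)
    for v in vote_list:
        w, l = v["winner"], v["loser"]
        expected_w = 1.0 / (1.0 + 10 ** ((ratings[l] - ratings[w]) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[w] += delta
        ratings[l] -= delta
    return dict(ratings)


overall = rate(votes)
# The sub-leaderboard uses the same rating rule on a filtered vote set.
coding = rate([v for v in votes if v["category"] == "coding"])

overall_leader = max(overall, key=overall.get)
coding_leader = max(coding, key=coding.get)
```

With this toy log, model_a tops the coding subset while model_b tops the full set, mirroring the kind of split described above.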

How to use the LMSys leaderboard to pick a model

Start by identifying your dominant workload. If your traffic is mostly coding, the overall leaderboard is misleading. The same applies if your traffic is mostly multilingual chat or structured reasoning. Pull the sub-leaderboard that matches your traffic shape and read the top 5 from there.

Next, filter by your binding constraints. Cost ceiling, on-prem requirement, context length, voice support, fine-tuning availability. The top 2 to 3 models on the relevant sub-leaderboard that satisfy your constraints are usually within margin on quality. At that point the choice comes down to price and deployment fit, not Elo.
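The filtering step can be sketched as follows, using the coding Elo figures from the top-10 table in this article. The `Candidate` type and `shortlist` function are hypothetical names, and treating "open weights" as a stand-in for an on-prem requirement is an assumption; the flag is set only where the table's Notes column says "Open weights".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    coding_elo: int
    open_weights: bool  # True only where the table notes say "Open weights"


# Subset of the article's top-10 table (coding Elo column).
CANDIDATES = [
    Candidate("Claude Opus 4.7", 1567, False),
    Candidate("GPT-5.5 Pro", 1531, False),
    Candidate("Claude Sonnet 4", 1520, False),
    Candidate("Gemini 3.1 Pro", 1505, False),
    Candidate("DeepSeek V4 Pro", 1454, True),
    Candidate("Llama 4 Maverick", 1388, True),
]


def shortlist(candidates, require_open_weights=False, top_n=3):
    """Drop candidates that violate the constraint, then rank by coding Elo."""
    allowed = [c for c in candidates if c.open_weights or not require_open_weights]
    return sorted(allowed, key=lambda c: c.coding_elo, reverse=True)[:top_n]


hosted = [c.name for c in shortlist(CANDIDATES)]
on_prem = [c.name for c in shortlist(CANDIDATES, require_open_weights=True)]
```

Applying the constraint first and ranking second keeps the comparison honest: a model that cannot be deployed never reaches the shortlist, however high its Elo.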

Finally, validate with your own eval harness. LMSys captures general human preference but cannot capture your specific workload, your specific data, your specific success criteria. Run the top 2 candidates through a 100 to 500 example golden dataset before committing. Swfte's AI Model Leaderboard pairs LMSys Elo with the sub-benchmarks (AAII, GPQA, SWE-bench Pro) and live API pricing so the trade-offs sit in one view.
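A golden-dataset check does not need much machinery. The sketch below is one possible shape, assuming exact-match scoring and two stand-in model functions; `candidate_a`, `candidate_b`, and the tiny dataset are hypothetical placeholders to be replaced with real API calls and your own 100 to 500 examples.

```python
from typing import Callable, Iterable, Tuple

GoldenExample = Tuple[str, str]  # (prompt, expected answer)


def pass_rate(model: Callable[[str], str], dataset: Iterable[GoldenExample]) -> float:
    """Fraction of examples where the output matches the expected answer.

    Matching is case-insensitive exact match, an assumed scoring rule;
    real workloads often need a task-specific grader instead.
    """
    examples = list(dataset)
    if not examples:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1
        for prompt, expected in examples
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(examples)


# Stand-in models: replace with calls to the two shortlisted candidates.
def candidate_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def candidate_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "")


golden = [("2+2", "4"), ("capital of France", "paris"), ("3*3", "9")]
scores = {
    "candidate_a": pass_rate(candidate_a, golden),
    "candidate_b": pass_rate(candidate_b, golden),
}
winner = max(scores, key=scores.get)
```

Keeping the scorer separate from the model call makes it straightforward to swap in a stricter grader without touching the rest of the harness.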

FAQ

What is the LMSys Chatbot Arena leaderboard?

LMSys Chatbot Arena (now known as LMArena) is a crowdsourced AI model evaluation platform run by the Large Model Systems Organization. Users submit prompts and vote on blind A/B responses from two models. Votes feed an Elo rating system that ranks models by human preference. Launched in 2023, it became the de facto reference leaderboard for frontier LLM quality.

How does LMSys Elo work?

Same Elo system used in chess. Every blind A/B vote updates both models' ratings: the winner gains points according to how unexpected the win was (more points for upsetting a higher-rated model), and the loser loses the same amount. After millions of votes, the rating converges on a stable rank order. Differences below 10 Elo are within noise; differences above 30 Elo are reliably meaningful in production.

Why is the LMSys leaderboard important?

Three reasons. (1) It measures human preference, not benchmark accuracy, which correlates better with downstream usefulness. (2) It uses blind A/B comparison, eliminating brand bias. (3) Crowdsourced volume (millions of votes) gives statistical confidence that no single-shot benchmark can match. Procurement teams routinely require an LMSys position before approving a new model.

Who is at the top of the LMSys leaderboard in 2026?

Claude Opus 4.7 leads the coding sub-leaderboard at 1567 Elo. GPT-5.5 Pro leads the overall arena at 1551 Elo. Gemini 3.1 Pro leads science and 2M-context categories. The top 10 also includes Claude Sonnet 4, GPT-5.5, Gemini 3.0, DeepSeek V4 Pro, Grok 4, Kimi K2.5, and Llama 4 Maverick.

What is the difference between LMSys, LMArena, and Chatbot Arena?

Same thing under different names. The Large Model Systems organisation (LMSys) at UC Berkeley launched Chatbot Arena in 2023. The platform later rebranded to LMArena (lmarena.ai) under its independent corporate entity. People still use all three terms interchangeably to refer to the leaderboard.

How often does the LMSys leaderboard update?

Continuously. New votes flow in 24/7, and the public leaderboard typically refreshes weekly with the latest Elo ratings. Major model releases trigger a faster refresh, sometimes within hours, once enough votes accumulate to produce a stable rating.

Are LMSys rankings gameable?

In theory, yes. A coordinated voting campaign for a specific model could move ratings. In practice, LMSys deploys multiple defences: prompt diversity sampling, vote velocity caps, account-quality filters, and per-region rate limits. The Elo system is also self-correcting: artificial inflation against a strong model produces enough losses to wash out gains.

How does the LMSys coding leaderboard differ from the overall arena?

Same Elo system, restricted to coding prompts. Voters classify their prompt as code-related (or the system infers it), and only those votes feed the coding sub-rating. Claude Opus 4.7 sits at #1 for coding (1567 Elo) while GPT-5.5 Pro leads overall; the split reflects Claude's well-documented coding strength.

How do I use LMSys to pick a model?

Three steps. (1) Identify your dominant workload (coding, reasoning, science, multilingual, voice). (2) Pull the LMSys sub-leaderboard for that workload. (3) Filter by your binding constraint (cost, latency, context window, deployment posture). The top 2-3 models on the relevant sub-leaderboard that satisfy your constraint are usually within margin of error on quality.

Is LMSys the only AI leaderboard that matters?

No. Pair it with AAII (reasoning), GPQA Diamond (science), SWE-bench Pro (coding), and your own internal eval harness. LMSys captures human preference at the chat level; the others capture specific capabilities. The strongest production decisions combine all four.
