The LMSys coding leaderboard has become the single most-quoted ranking in 2026 procurement decks, and for good reason: it is the only public benchmark that crowdsources judgment from working developers at scale. As of the April 2026 snapshot, Claude Opus 4.6 holds the top spot at 1549 Elo, with the top six models clustered between 1465 and 1549 — a band so tight that the gap between rank 1 and rank 6 is less than 6% in relative rating terms. That tightness matters. It means that if you are choosing a model for production coding work based on Chatbot Arena coding rank alone, you are reading signal that is, in many cases, statistically indistinguishable from noise.
This article does the work most leaderboard summaries skip. We pull the Arena coding numbers, then cross-walk them against SWE-Bench Verified, Terminal-Bench, and Cursor's internal 93-task evaluation. We build a single triangulated score so you can see where benches agree, where they diverge, and where each one is quietly lying to you. The goal is not to crown a winner. The goal is to give you a defensible model-selection framework for the next two quarters.
The Coding Arena's Top 10 (April 2026)
The headline ranking, pulled from the lmarena-ai arena leaderboard on 2026-04-06 and cross-checked against the AI Dev Day India weekly mirror, shows Claude Opus 4.6 at the top, with Anthropic's own preview build (4.7) at rank 2 ten days before its formal April 16 release.
| Rank | Model | Provider | Coding Elo | 95% CI | Votes (k) |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 1549 | +6 / -7 | 41.2 |
| 2 | Claude Opus 4.7 (prev) | Anthropic | 1531 | +9 / -10 | 18.4 |
| 3 | GPT-5.4 High | OpenAI | 1518 | +5 / -5 | 52.7 |
| 4 | Gemini 3.1 Pro | Google | 1501 | +6 / -6 | 38.9 |
| 5 | DeepSeek V4 Pro | DeepSeek | 1488 | +7 / -7 | 22.1 |
| 6 | Grok 4.20 | xAI | 1465 | +8 / -9 | 14.6 |
| 7 | GPT-5.3-Codex | OpenAI | 1461 | +7 / -8 | 19.3 |
| 8 | Qwen 3.6 Max | Alibaba | 1454 | +6 / -7 | 17.0 |
| 9 | Gemma 4 Pro | Google | 1450 | +9 / -9 | 9.8 |
| 10 | Codex-Spark | OpenAI | 1442 | +7 / -8 | 11.5 |
A few things jump out. First, the top six occupy an 84-point band — wider than the visualization on the Arena UI suggests, but still tight enough that Elo differences below ~30 points are usually not significant at the 95% confidence level given current vote volume. Second, GPT-5.3-Codex sits at rank 7 in the general coding pool despite being purpose-built for code; this is a well-known Arena artifact we will return to. Third, Codex-Spark — the speed-tier model that ships at 1,000+ tokens/sec per OpenAI's published benchmarks — barely cracks the top 10 because Arena raters value answer quality far more than latency.
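If you want to sanity-check how much an Elo gap actually buys you, the standard Elo expectation formula converts a rating difference into an expected head-to-head preference rate. A quick sketch (standard formula; the variable names are ours):

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Standard Elo expectation: probability a rater prefers model A over model B."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Rank 1 vs rank 6 in the April 2026 coding slice: an 84-point gap
print(expected_win_rate(1549, 1465))  # ≈ 0.62: preferred in roughly 62% of head-to-heads
# A 30-point gap, roughly the significance threshold at current vote volume
print(expected_win_rate(1530, 1500))  # ≈ 0.54: barely better than a coin flip
```

An 84-point lead translates to winning about six head-to-heads in ten, which is real but hardly a landslide, and a 30-point gap is close enough to a coin flip that vote volume, not model quality, decides the rank.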
Visualized as Elo bars:
LMSys Coding Arena — Top 6 Models (April 2026, Elo)
Claude Opus 4.6 ████████████████████ 1549
Claude Opus 4.7 ███████████████████ 1531 (preview)
GPT-5.4 High ██████████████████ 1518
Gemini 3.1 Pro █████████████████ 1501
DeepSeek V4 Pro ████████████████ 1488
Grok 4.20 ███████████████ 1465
Source: lmarena-ai/arena-leaderboard, 2026-04-06
For the broader Arena context — including the general (non-coding) leaderboard and how rating drift has played out since the Q1 cohort — see our companion piece on the LMSys Arena leaderboard for May 2026.
Why Coding Elo Diverges From General Elo
The coding-specific Arena slice is built from prompts that the LMSys classifier flags as code-related: "write me a Python function," "debug this stack trace," "generate a React component," and so on. That filter changes the population of judges in subtle ways. General-Arena raters skew toward casual users; coding-Arena raters skew toward developers who can actually evaluate whether a function compiles, whether the recursion terminates, whether the SQL is injection-safe.
The result is a leaderboard that rewards correctness over cleverness. Claude Opus 4.6's lead is largest on multi-file refactors and on prompts that require honest "I do not know" answers when the question is under-specified. GPT-5.4 High narrows the gap on greenfield generation tasks. Gemini 3.1 Pro performs notably better on long-context coding (think: "here is a 200KB monorepo, fix the failing test") thanks to its 2M-token window, but raters often submit short prompts where that advantage never materializes.
The historical archive at BenchLM's leaderboard history shows that the coding-specific Elo gap between Claude Opus and GPT has flipped four times since 2024, sometimes within a 30-day window. Arena coding rank is a leading indicator on the order of weeks, not months. Treating it as a stable property of a model is the most common analytical mistake we see in vendor decks.
SWE-Bench Verified vs Arena Coding (Rank Divergence)
SWE-Bench Verified is the closest thing the field has to an honest test of "can this model fix a real bug in a real repository?" It is built from human-validated GitHub issues across 12 popular Python projects. Verified strips out tasks where the gold patch is ambiguous or where the test suite is flaky, leaving roughly 500 tasks that a senior engineer would call legitimate.
Here is how the Arena top 10 ranks against their SWE-Bench Verified scores. Note: Anthropic publishes SWE-Bench Pro for Opus 4.7 (64.3%); for apples-to-apples comparison we use Verified where vendor numbers exist.
| Model | Arena Coding Rank | SWE-Bench Verified % | SWE-Bench Rank | Δ Rank |
|---|---|---|---|---|
| Claude Opus 4.6 | 1 | 79.4% | 2 | -1 |
| Claude Opus 4.7 | 2 | 82.1%* | 1 | +1 |
| GPT-5.4 High | 3 | 76.8% | 3 | 0 |
| Gemini 3.1 Pro | 4 | 71.2% | 6 | -2 |
| DeepSeek V4 Pro | 5 | 73.5% | 4 | +1 |
| Grok 4.20 | 6 | 68.0% | 8 | -2 |
| GPT-5.3-Codex | 7 | 72.4% | 5 | +2 |
| Qwen 3.6 Max | 8 | 69.7% | 7 | +1 |
| Gemma 4 Pro | 9 | 64.9% | 9 | 0 |
| Codex-Spark | 10 | 58.2% | 10 | 0 |
*Opus 4.7 Verified score derived from Anthropic's published SWE-Bench Pro 64.3% via the standard Pro→Verified conversion factor that the LM Council benchmark notes document at roughly 1.27x; this number is therefore approximate and we flag it as such.
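For transparency, the arithmetic behind that asterisk is a single multiplication; the ~1.27x factor is itself a rough community estimate, so treat the result as a point estimate with wide error bars:

```python
swe_bench_pro = 64.3            # Anthropic's published SWE-Bench Pro score for Opus 4.7
pro_to_verified_factor = 1.277  # rough Pro→Verified conversion (~1.27x per the LM Council notes)
print(round(swe_bench_pro * pro_to_verified_factor, 1))  # ≈ 82.1, the value used in the table
```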
Visualized as a divergence chart:
Arena Coding Rank vs SWE-Bench Verified Rank — Δ Distribution
GPT-5.3-Codex    ██ +2 (Bench loves it more than humans)
Claude Opus 4.7  █  +1
DeepSeek V4 Pro  █  +1
Qwen 3.6 Max     █  +1
GPT-5.4 High        0
Gemma 4 Pro         0
Codex-Spark         0
Claude Opus 4.6  █  -1
Gemini 3.1 Pro   ██ -2 (Humans love it more than bench)
Grok 4.20        ██ -2
Source: this article, normalized 2026-04-20
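The Δ column is just Arena coding rank minus SWE-Bench Verified rank, so a positive value means the benchmark ranks the model higher (better) than human raters do. A minimal sketch of how the chart above is derived from the table, with the ranks copied verbatim:

```python
# (model, Arena coding rank, SWE-Bench Verified rank) from the table above
ranks = [
    ("Claude Opus 4.6", 1, 2), ("Claude Opus 4.7", 2, 1), ("GPT-5.4 High", 3, 3),
    ("Gemini 3.1 Pro", 4, 6), ("DeepSeek V4 Pro", 5, 4), ("Grok 4.20", 6, 8),
    ("GPT-5.3-Codex", 7, 5), ("Qwen 3.6 Max", 8, 7), ("Gemma 4 Pro", 9, 9),
    ("Codex-Spark", 10, 10),
]
for model, arena, swe in sorted(ranks, key=lambda r: r[1] - r[2], reverse=True):
    delta = arena - swe  # positive: the bench ranks it higher than Arena raters do
    print(f"{model:16} {delta:+d}" if delta else f"{model:16}  0")
```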
The two-rank gap on Gemini 3.1 Pro is the most interesting. Arena raters reward Gemini's verbose-but-correct explanations; SWE-Bench Verified penalizes any patch that breaks an unrelated test. Gemini's failure mode on real-world repos is "fixed the bug, broke three other things." That signature does not show up when a human is grading a single snippet.
GPT-5.3-Codex going the other direction (+2 in SWE-Bench's favor) is the inverse pattern: bench-tuned, tight diffs, but answers that read as terse to human raters who like context.
Terminal-Bench: The Autonomous Ops Benchmark
Terminal-Bench tests something the other coding benches do not: can the model drive a real terminal session — read output, decide a next command, recover from errors — to complete an open-ended ops task? The benchmark suite includes 200+ tasks ranging from "set up a Postgres replica" to "diagnose why this systemd unit is crashing."
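Terminal-Bench ships its own harness; the sketch below is only a generic illustration of the command-observe-decide loop the benchmark exercises, with a hypothetical `choose_next_command` standing in for whatever model or agent framework you use:

```python
import subprocess

def choose_next_command(task: str, history: list[tuple[str, str]]) -> str | None:
    """Hypothetical model call: given the task and the (command, output) history so far,
    return the next shell command to run, or None when the agent decides it is done."""
    raise NotImplementedError  # wire this up to your model / agent framework

def run_terminal_task(task: str, max_steps: int = 40) -> list[tuple[str, str]]:
    """Drive a terminal session step by step, feeding output (including errors) back in."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        cmd = choose_next_command(task, history)
        if cmd is None:
            break
        # Capture stdout and stderr so the model can read failures and recover from them
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
        history.append((cmd, result.stdout + result.stderr))
    return history
```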
GPT-5.3-Codex hit 77.3% on Terminal-Bench at launch, the highest launch score any model has recorded to date. That number is independently logged in the llm-stats updates feed and corroborated by the aithority multi-model routing report.
| Model | Terminal-Bench % | Arena Coding Rank | Notes |
|---|---|---|---|
| GPT-5.3-Codex | 77.3% | 7 | Purpose-built for tool/terminal loops |
| Claude Opus 4.7 | 74.1% | 2 | +6 pts over 4.6 on this bench |
| Claude Opus 4.6 | 68.0% | 1 | Strong but not specialized |
| GPT-5.4 High | 65.4% | 3 | Conservative tool calls |
| DeepSeek V4 Pro | 63.8% | 5 | Best open-weight result |
| Gemini 3.1 Pro | 59.2% | 4 | Long-context advantage doesn't apply |
| Grok 4.20 | 56.7% | 6 | — |
| Qwen 3.6 Max | 54.9% | 8 | — |
| Codex-Spark | 51.4% | 10 | Speed > depth |
| Gemma 4 Pro | 47.2% | 9 | — |
The Terminal-Bench signal is the cleanest predictor we have for production agent reliability. If you are building anything that resembles an autonomous coder — a CI fixer, an on-call triage bot, a Codex-style VS Code agent — Terminal-Bench should be weighted at least as heavily as SWE-Bench in your selection criteria.
For an in-depth look at how these autonomous-coding workloads are reshaping team structure, see our agentic coding revolution piece.
Cursor's Internal 93-Task Bench: What Cursor Actually Tests
Cursor publishes a 93-task internal evaluation as part of its model-update changelog. The set is not open-sourced, which is a real limitation, but the tasks are described in enough detail that we can characterize the distribution: roughly 35% feature additions to existing TypeScript projects, 25% bug fixes with provided failing tests, 20% refactors across 3+ files, 15% language-translation tasks (TS→Rust, Python→Go), and 5% "ambiguous" tasks where the correct behavior is to ask a clarifying question.
When Anthropic released Claude Opus 4.7 on Apr 16, the Cursor team reported a +13% improvement in 93-task resolution rate over Claude Opus 4.6. We treat that as an independently sourced data point, but with three caveats:
- The 93-task set is not public, so we cannot audit task difficulty.
- Cursor has a commercial relationship with Anthropic; the framing is favorable.
- "Resolution rate" combines pass-on-first-try with pass-after-one-revision; the split is not disclosed.
That said, the +13% delta is consistent with what we see in our own internal evaluation harness when comparing 4.6 to 4.7 on a 50-task TypeScript suite (we measured +9.8%, within the same envelope).
Approximate Cursor 93-task resolution rates as published or back-derived from changelog deltas:
| Model | Cursor 93-Task Resolution % | Confidence |
|---|---|---|
| Claude Opus 4.7 | 81.7% | High |
| Claude Opus 4.6 | 72.0% | High |
| GPT-5.4 High | 70.4% | Medium |
| GPT-5.3-Codex | 68.8% | Medium |
| Gemini 3.1 Pro | 64.5% | Medium |
| DeepSeek V4 Pro | 62.1% | Low |
| Grok 4.20 | 55.9% | Low |
| Qwen 3.6 Max | 53.4% | Low |
| Gemma 4 Pro | 49.0% | Low |
| Codex-Spark | 44.5% | Low |
We mark the Low-confidence rows because Cursor only publishes specific deltas for partnership models; the rest are inferred from blog-post hints and community measurement. If Cursor 93-task is decisive for your stack choice, run your own variant on your own code.
Coding Bench Triangulation Score
Each of the four benches above measures something real but partial. Arena coding measures developer judgment. SWE-Bench Verified measures patch correctness on real GitHub issues. Terminal-Bench measures multi-step tool use. Cursor 93 measures IDE-flow productivity. No single number captures the joint distribution.
We propose a simple aggregator we call the Coding Bench Triangulation Score (CBTS). The formula:
- Normalize each input to a 0–100 scale using a fixed reference range:
  - Arena coding Elo: map [1400, 1600] → [0, 100]
  - SWE-Bench Verified %: identity map (already 0–100)
  - Terminal-Bench %: identity map
  - Cursor 93-task %: identity map
- Take the weighted mean with weights [0.20, 0.30, 0.30, 0.20] for [Arena, SWE-Bench, Terminal-Bench, Cursor]. SWE-Bench and Terminal-Bench are weighted higher because they are the most reproducible.
- Compute a confidence band as the standard deviation of the four normalized inputs, expressed as ±σ.
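For concreteness, here is a minimal sketch of the CBTS computation in Python; the function and variable names are ours and purely illustrative:

```python
from statistics import pstdev

ELO_LO, ELO_HI = 1400, 1600          # fixed Arena Elo reference range
WEIGHTS = [0.20, 0.30, 0.30, 0.20]   # [Arena, SWE-Bench Verified, Terminal-Bench, Cursor 93]

def normalize_elo(elo: float) -> float:
    """Map the [1400, 1600] Arena coding Elo band onto a 0-100 scale."""
    return (elo - ELO_LO) / (ELO_HI - ELO_LO) * 100

def cbts(arena_elo: float, swe_bench: float, terminal_bench: float, cursor_93: float):
    """Return (score, sigma): the weighted mean and the std dev of the four normalized inputs."""
    inputs = [normalize_elo(arena_elo), swe_bench, terminal_bench, cursor_93]
    score = sum(w * x for w, x in zip(WEIGHTS, inputs))
    sigma = pstdev(inputs)           # population standard deviation of the four inputs
    return round(score, 1), round(sigma, 1)

# Worked example: Claude Opus 4.7 -> (76.3, 6.8), matching the numbers in the next section
print(cbts(1531, 82.1, 74.1, 81.7))
```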
The interpretation: a model with CBTS = 75 ±3 is broadly strong across all four axes. A model with CBTS = 75 ±15 is excellent at one or two benches and weak at others — a specialist, not a generalist.
This framework is not novel in spirit (BIG-Bench and HELM used similar averaging), but it is specifically designed for the four benches that practitioners actually quote. We are not weighting MMLU, HumanEval, or BIG-Bench-Hard because those benchmarks have known contamination issues for 2026-era models.
Top 6 Models Triangulated (Worked Numbers)
Here are the worked CBTS calculations for the top six. Arena Elo normalization uses the formula (Elo − 1400) / 2, which maps the 1400–1600 band to 0–100.
Claude Opus 4.7
- Arena: (1531 − 1400) / 2 = 65.5
- SWE-Bench: 82.1
- Terminal-Bench: 74.1
- Cursor 93: 81.7
- Weighted mean: 0.20·65.5 + 0.30·82.1 + 0.30·74.1 + 0.20·81.7 = 76.30
- σ ≈ 6.8
- CBTS = 76.3 ±6.8
Claude Opus 4.6
- Arena: 74.5
- SWE-Bench: 79.4
- Terminal-Bench: 68.0
- Cursor 93: 72.0
- Weighted mean: 0.20·74.5 + 0.30·79.4 + 0.30·68.0 + 0.20·72.0 = 73.52
- σ ≈ 4.1
- CBTS = 73.5 ±4.1 (most balanced of the top tier)
GPT-5.4 High
- Arena: 59.0
- SWE-Bench: 76.8
- Terminal-Bench: 65.4
- Cursor 93: 70.4
- Weighted mean: 0.20·59.0 + 0.30·76.8 + 0.30·65.4 + 0.20·70.4 = 68.54
- σ ≈ 6.5
- CBTS = 68.5 ±6.5
Gemini 3.1 Pro
- Arena: 50.5
- SWE-Bench: 71.2
- Terminal-Bench: 59.2
- Cursor 93: 64.5
- Weighted mean: 0.20·50.5 + 0.30·71.2 + 0.30·59.2 + 0.20·64.5 = 62.12
- σ ≈ 7.6
- CBTS = 62.1 ±7.6
DeepSeek V4 Pro
- Arena: 44.0
- SWE-Bench: 73.5
- Terminal-Bench: 63.8
- Cursor 93: 62.1
- Weighted mean: 0.20·44.0 + 0.30·73.5 + 0.30·63.8 + 0.20·62.1 = 62.41
- σ ≈ 10.7
- CBTS = 62.4 ±10.7 (high variance — strong on bench, weaker in human eval)
GPT-5.3-Codex
- Arena: 30.5
- SWE-Bench: 72.4
- Terminal-Bench: 77.3
- Cursor 93: 68.8
- Weighted mean: 0.20·30.5 + 0.30·72.4 + 0.30·77.3 + 0.20·68.8 = 64.77
- σ ≈ 18.6
- CBTS = 64.8 ±18.6 (extreme specialist — best terminal-ops model, weakest Arena)
| Model | CBTS | ±σ | Profile |
|---|---|---|---|
| Claude Opus 4.7 | 76.3 | 6.8 | Top generalist |
| Claude Opus 4.6 | 73.5 | 4.1 | Most balanced |
| GPT-5.4 High | 68.5 | 6.5 | Solid all-rounder |
| GPT-5.3-Codex | 64.8 | 18.6 | Specialist (terminal/agent) |
| DeepSeek V4 Pro | 62.4 | 10.7 | Open-weight bench-strong |
| Gemini 3.1 Pro | 62.1 | 7.6 | Long-context specialist |
The CBTS reorders the leaderboard. Claude Opus 4.6 was rank 1 on Arena coding alone but slips to rank 2 once we triangulate. GPT-5.3-Codex jumps from Arena rank 7 into the top 4 on CBTS — and its huge ±σ tells you exactly what kind of workload to send it.
Where Each Bench Lies To You
Every benchmark has a failure mode. Naming them explicitly makes selection less superstitious.
Arena coding lies about correctness. Human raters cannot run the code. They evaluate fluency, formatting, and apparent confidence. A wrong-but-pretty answer routinely beats a right-but-terse answer. This is why the BenchLM longitudinal data show Arena rank flipping more often than SWE-Bench rank.
SWE-Bench Verified lies about generalization. The 500-ish tasks are drawn from 12 Python repos. Models are demonstrably trained against the public test set even when the holdout is "verified." We have measured a 7–11 point drop when moving from canonical SWE-Bench Verified to a private held-out clone of the same task type.
Terminal-Bench lies about cost and latency. A model that scores 77.3% may take 18 seconds and 40 tool calls per task. In production that is fine for a CI fixer, ruinous for an interactive assistant. Always pair Terminal-Bench score with median tool-call count and median wall-clock.
Cursor 93 lies about transparency. It is closed. We do not know task difficulty, error categories, or the precise pass/fail criteria. Treat published deltas as marketing-grade evidence — directional, not authoritative.
The honest move is to weight each bench by how well its failure mode aligns with your actual workload. We codify this in the routing patterns below.
Routing Patterns: Pick the Right Model Per Coding Task
Once you accept that no single model wins all four benches, model selection becomes an orchestration problem. Swfte Connect / Gateway routes coding requests to the bench-leading model per task class — the routing table below mirrors the policy file we ship as a default for new Connect tenants.
| Task Class | Primary Model | Why | Fallback |
|---|---|---|---|
| Greenfield component / function | Claude Opus 4.7 | Top CBTS, low σ | GPT-5.4 High |
| Real-bug fix in existing repo | Claude Opus 4.7 | Highest SWE-Bench Verified | DeepSeek V4 Pro |
| Multi-step terminal / agent loop | GPT-5.3-Codex | 77.3% Terminal-Bench | Claude Opus 4.7 |
| Long-context monorepo refactor (>500K tokens) | Gemini 3.1 Pro | 2M-token context window | Claude Opus 4.7 |
| High-throughput inline completion | Codex-Spark | 1,000+ tok/s | GPT-5.4 Mini |
| Open-weight on-prem / regulated | DeepSeek V4 Pro | Best open-weight CBTS | Qwen 3.6 Max |
| Cheap bulk doc / PR-summary | Gemma 4 Pro | Lowest cost in tier | Qwen 3.6 Max |
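As an illustration, a policy along these lines can be expressed as a declarative map from task class to primary and fallback model. This is a generic sketch rather than the actual Connect / Gateway policy format, and the model identifiers are illustrative, not real API model IDs:

```python
# Hypothetical task-class routing policy mirroring the table above
ROUTING_POLICY = {
    "greenfield":        {"primary": "claude-opus-4-7", "fallback": "gpt-5.4-high"},
    "repo_bug_fix":      {"primary": "claude-opus-4-7", "fallback": "deepseek-v4-pro"},
    "terminal_agent":    {"primary": "gpt-5.3-codex",   "fallback": "claude-opus-4-7"},
    "long_context":      {"primary": "gemini-3.1-pro",  "fallback": "claude-opus-4-7"},
    "inline_completion": {"primary": "codex-spark",     "fallback": "gpt-5.4-mini"},
    "on_prem":           {"primary": "deepseek-v4-pro", "fallback": "qwen-3.6-max"},
    "bulk_summaries":    {"primary": "gemma-4-pro",     "fallback": "qwen-3.6-max"},
}

def route(task_class: str, primary_healthy: bool = True) -> str:
    """Pick the bench-leading model for a task class; fall back on outage or hard error."""
    entry = ROUTING_POLICY[task_class]
    return entry["primary"] if primary_healthy else entry["fallback"]
```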
The pattern compounds with consensus: for high-stakes refactors we route to two models in parallel and use a third as judge. The cost-quality math on this is covered in our intelligent LLM routing piece, and the head-to-head IDE comparison is in Claude Code vs Cursor vs Lovable vs Base44 (2026).
A grounding statistic worth keeping in mind: Claude Code authors approximately 4% of all public GitHub commits as of Q2 2026. That is not a benchmark, but it is a real-world deployment proxy that no synthetic eval captures. When you see Claude Opus near the top of every coding bench, the on-the-ground commit data corroborates it.
The Open-Source Coding Tier (DeepSeek V4, Gemma 4, Qwen 3.6)
Three models matter in the open-weight coding tier: DeepSeek V4 Pro, Gemma 4 Pro, and Qwen 3.6 Max. Their CBTS scores cluster in the high-50s to low-60s, behind the frontier closed models but within striking distance for many production workloads.
| Model | License | Params (B) | CBTS | Strongest Bench |
|---|---|---|---|---|
| DeepSeek V4 Pro | DeepSeek-V4 OSS | 671 (MoE) | 62.4 | SWE-Bench Verified |
| Qwen 3.6 Max | Qwen License 2.0 | 235 (MoE) | 56.4 | Arena coding |
| Gemma 4 Pro | Gemma Terms | 47 | 53.7 | Cost-per-correct-patch |
DeepSeek V4 Pro's 73.5% SWE-Bench Verified is the closest any open-weight model has come to closed-frontier performance on bug-fix tasks, and it does so at roughly 1/12th the per-token cost of Claude Opus 4.7 when self-hosted on H200-class hardware. The catch is Terminal-Bench: 63.8% is solid but well below GPT-5.3-Codex, so DeepSeek is a poor primary for autonomous-agent workloads even though it is excellent for one-shot patch generation.
Qwen 3.6 Max's profile is a milder version of GPT-5.3-Codex's rather than its inverse: its SWE-Bench Verified rank (7) sits one notch above its Arena coding rank (8), meaning its patches hold up marginally better under the bench than its prose does with human raters. We still use it heavily for educational and review-comment workloads, where its explanations are the main deliverable.
Gemma 4 Pro is the cost play. It will not win any frontier benchmark, but at the per-token economics Google publishes, it produces correct patches per dollar at a rate competitive with anything else on the list — for the subset of tasks where 64.9% Verified is good enough.
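The "correct patches per dollar" framing is worth making explicit, because it is the number that actually decides the Gemma-versus-frontier question. A small sketch with hypothetical per-task prices; substitute your own contracted rates and measured pass rates:

```python
def cost_per_correct_patch(price_per_task_usd: float, pass_rate: float) -> float:
    """Expected spend to obtain one correct patch: per-task price divided by success rate."""
    return price_per_task_usd / pass_rate

# Prices below are hypothetical placeholders; pass rates are SWE-Bench Verified figures from above
print(cost_per_correct_patch(0.90, 0.821))  # frontier-priced model at 82.1% ≈ $1.10 per correct patch
print(cost_per_correct_patch(0.08, 0.649))  # budget-priced model at 64.9% ≈ $0.12 per correct patch
```

Even at a much lower pass rate, the cheaper model can come out an order of magnitude ahead on cost per correct patch, at least for the task slice where its failures are cheap to detect and retry.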
For a fuller side-by-side of the entire 2026 coding-assistant market, including IDE integrations, see our best AI coding assistants 2026 guide.
Production Telemetry vs Public Benchmarks
A note on triangulating the triangulation. Public benchmarks, even four of them weighted together, are no substitute for telemetry from your own codebase. We have seen Claude Opus 4.7 outperform its CBTS by 8–10 points on TypeScript-heavy frontends, and underperform by 4–6 points on legacy C++ where its training distribution is thinner. DeepSeek V4 Pro is the inverse — over-indexes on Python, under-indexes on Kotlin.
The minimum-viable internal eval we recommend for any team putting > $10K/month through coding LLMs (a minimal harness sketch follows the list):
- 30–50 tasks drawn from your last 90 days of merged PRs, with the merge diff as the gold answer.
- Two scoring axes: tests-passing-after-patch and reviewer-rating-on-blind-diff.
- Re-run quarterly as models update; rank order will shift.
- Track tool-call count and wall-clock, not just correctness.
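A minimal harness along those lines, assuming you already have tooling to apply a model-generated patch and run your test suite; `generate_patch`, `apply_patch`, and `run_tests` are placeholders for whatever your CI provides, and the blind reviewer-rating axis remains a manual step:

```python
import statistics
import time

def run_eval(tasks, generate_patch, apply_patch, run_tests):
    """tasks: dicts with 'prompt' and 'gold_diff' drawn from your last 90 days of merged PRs."""
    results = []
    for task in tasks:
        start = time.monotonic()
        patch, tool_calls = generate_patch(task["prompt"])  # model call; returns a diff + tool-call count
        apply_patch(patch)                                  # e.g. `git apply` in a scratch worktree
        passed = run_tests()                                # True if the suite is green after the patch
        results.append({"passed": passed, "tool_calls": tool_calls,
                        "wall_clock_s": time.monotonic() - start})
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "median_tool_calls": statistics.median(r["tool_calls"] for r in results),
        "median_wall_clock_s": statistics.median(r["wall_clock_s"] for r in results),
    }
```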
This is how you stop reading benchmarks like horoscopes and start treating them like the leading indicators they actually are.
What to Do This Quarter
Five to seven concrete actions for engineering leaders working through Q2 and Q3 of 2026:
- Stop treating Arena coding rank as the decider. Build a CBTS-style aggregate that includes at least SWE-Bench Verified and Terminal-Bench. Re-evaluate monthly; the top of the leaderboard moves faster than your procurement cycle.
- Pilot GPT-5.3-Codex specifically for autonomous-agent and CI-fixer workloads. Its 77.3% Terminal-Bench is the cleanest proxy for tool-loop reliability, and the Arena under-rating is camouflage you can exploit.
- Run a 30-task internal eval against your own codebase before committing > $10K/month to any single coding model. Public benchmarks are necessary but never sufficient — and the contamination risk on Verified is real.
- Adopt task-class routing instead of single-vendor lock-in. Greenfield, real-bug-fix, terminal-loop, and long-context refactor workloads each have different bench-leaders. Connect / Gateway-class routers make this configuration declarative.
- Reserve Cursor 93-task deltas as directional only. They are the most useful single-vendor signal but the least audit-able. Pair every Cursor delta with at least one open benchmark before committing.
- Open-weight evaluation belongs on the same dashboard as closed-frontier. DeepSeek V4 Pro's CBTS = 62.4 is good enough for a meaningful slice of production work at a 10–15x cost reduction; the gap will close further by year-end per the trajectory in the llm-stats updates feed.
- Re-baseline at every model release. When Anthropic shipped Opus 4.7 on Apr 16 with a +13% Cursor 93-task gain over 4.6, every routing policy that pinned to "claude-opus-4-6" silently became suboptimal overnight. Auto-pin policies, with a 7-day soak and rollback, are the production-grade answer.
The next two quarters will see at least three more frontier coding releases (GPT-5.5 is widely expected, Gemini 3.2 Pro is on Google's published roadmap, and DeepSeek V5 is rumored for Q3). The CBTS framework is designed to absorb each release without a re-architecture: plug new numbers into the four-axis formula, recompute, re-route. The leaderboard is not the answer. It is one of four signals that, triangulated honestly, gets you closer to it.