By June 2026 the top of the field is a four-horse race, and no single model wins every event. Claude Opus 4.8 leads the blended index, GPT-5.5 owns the shell, Gemini 3.1 Pro runs the table on science and speed, and Qwen 3.7 Max does the marathon for half the money. Here is how to pick, by job, with the numbers that matter.

Start here: the one-glance table

| | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro | Qwen 3.7 Max |
|---|---|---|---|---|
| Maker | Anthropic | OpenAI | Google | Alibaba |
| Blended index | 61.4 (#1) | ~59 | ~58 | 56.6 |
| Best at | Coding, computer use | Shell agents, voice | Science, speed, long docs | Day-long agents, math |
| SWE-bench Pro | 69.2% | high 60s | mid 60s | 60.6% |
| Science (GPQA Diamond) | ~94% | ~94% | 94.3% | 92.4% |
| Speed (tokens/sec) | ~72 | ~70 | ~131 | ~90 |
| Price (USD in / out per 1M tokens) | $5 / $25 | $5 / $30 | $2 / $12 | $2.50 / $7.50 |
| Context (tokens) | 1M | 1M | 1M+ | 1M |
| Open weights? | No | No | No | No |

Every one of these is closed. If you need weights you can host, none of these four is your answer, and you want DeepSeek V4 Pro or GLM-5.1 instead. For everyone else, the choice comes down to which job you are buying for. See them all live on the leaderboard.

If the job is coding

Pick Opus 4.8. It posts the best score on the hard coding set (69.2% on SWE-bench Pro) and the best result on operating a browser, which is where a lot of "coding" agents actually spend their time: reading a failing CI page, clicking through a dashboard, checking a deploy. In practice it reads more of the surrounding files before it edits, so you get fewer patches that fix one call site and miss the second.

GPT-5.5 is close and has one specific edge: shell work. On terminal and command-line task suites it finishes ahead of Opus. If your agent lives in a shell more than an editor, run both against your own repos before deciding; the gap is small enough that your codebase, not the benchmark, should break the tie.
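One way to let your own codebase break the tie is a paired comparison: run both models on the same tasks, record pass or fail per task, and only declare a winner if the difference clears a margin you choose. The sketch below assumes you already have those pass/fail lists; the function name and the 5-point default margin are illustrative choices, not anything the benchmarks prescribe.

```python
def head_to_head(results_a, results_b, min_gap=0.05):
    """Compare two models on the same task list.

    results_a / results_b: lists of booleans, one per task, in the same order.
    min_gap: hypothetical margin (fraction of tasks) below which we call a tie.
    Returns "a", "b", or "tie".
    """
    if not results_a or len(results_a) != len(results_b):
        raise ValueError("need two non-empty result lists of equal length")

    # Only tasks where exactly one model passed tell the two apart.
    only_a = sum(a and not b for a, b in zip(results_a, results_b))
    only_b = sum(b and not a for a, b in zip(results_a, results_b))

    gap = (only_a - only_b) / len(results_a)
    if abs(gap) < min_gap:
        return "tie"
    return "a" if gap > 0 else "b"


if __name__ == "__main__":
    # Made-up results for four tasks, purely to show the call shape.
    model_a = [True, True, False, True]
    model_b = [True, False, False, True]
    print(head_to_head(model_a, model_b))
```

With only a handful of tasks a single flip decides the outcome, so the result is only as trustworthy as the task list is long and representative of your repos.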

Qwen 3.7 Max sits a clear step back on coding (60.6% on SWE-bench Pro) but makes it up on length. For a refactor that takes hours rather than minutes, its stamina can matter more than a few points of accuracy per step.

If the job is science or research

Pick Gemini 3.1 Pro. It leads graduate-level science at 94.3% on GPQA Diamond, it is the fastest of the four by a wide margin (roughly 131 tokens per second), and it handles the longest documents comfortably. For a workflow that reads a 200-page filing and answers questions about page 4 and page 190 in the same breath, it is the natural pick, and at $2 / $12 it has the lowest input price of the group, which is the price that dominates when most of your tokens are documents going in.
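To see what the speed gap means for a user waiting on an answer, divide the answer length by the throughput. The sketch below uses the approximate tokens-per-second figures from the table and ignores time to first token and network latency, so treat it as a rough lower bound on generation time rather than a measured latency. For a 2,000-token answer it works out to roughly 15 seconds on Gemini 3.1 Pro against roughly 28 on Opus 4.8.

```python
# Approximate output speeds in tokens per second, copied from the table above.
SPEED_TOKENS_PER_SEC = {
    "Opus 4.8": 72,
    "GPT-5.5": 70,
    "Gemini 3.1 Pro": 131,
    "Qwen 3.7 Max": 90,
}


def generation_seconds(model, output_tokens):
    """Seconds spent streaming output_tokens, ignoring time to first token."""
    return output_tokens / SPEED_TOKENS_PER_SEC[model]


if __name__ == "__main__":
    for model in SPEED_TOKENS_PER_SEC:
        print(f"{model}: {generation_seconds(model, 2000):.1f} s for 2,000 tokens")
```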

Opus 4.8 is at least its equal on closed-book reasoning, where it edges ahead, so if your "research" is more about hard inference from memory than about parsing huge documents, the two trade places. Test on your own material.

If the job is voice or images

This is the one event none of the text-first models win cleanly. Opus 4.8 reads images but makes none and has no native voice. Qwen is text-forward. GPT-5.5 has the most complete real-time voice and multimodal story of the four, so if your product talks and listens, it starts in front. For generated images or video you are reaching past all of these to a dedicated model anyway; our image and video rankings cover that separately.

If the job is a long-running agent

Pick Qwen 3.7 Max, or at least try it first. Its demonstrated thirty-five-hour, eleven-hundred-call run is the clearest evidence any of these four can hold a goal over a very long horizon. It also works with outside agent runners, including Claude Code, so adopting it does not mean rebuilding your orchestration. The trade is rank (fifth overall) and governance (traffic routes through Alibaba Cloud), so it is a fit for teams without a data-residency rule against that.

Opus 4.8 is the safer agent for shorter, higher-stakes runs where each step has to be right and you want the calibration and honesty Anthropic tuned for. Different shapes of the same job.

If the job is "keep the bill down"

Among these four, Gemini 3.1 Pro is the value pick at the frontier: top-tier science, the fastest output, and the lowest input price ($2 per million tokens). Qwen 3.7 Max is next: its $7.50 output price is the lowest of the four, under a third of Opus's $25, and it earns its keep on high-volume agent work.
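Which of the two is cheaper for you depends on your mix of input and output tokens. The sketch below prices a workload at the list prices from the table; the two example workloads are invented for illustration. At those list prices, Gemini 3.1 Pro comes out cheaper than Qwen 3.7 Max only when a job sends more than about nine input tokens for every output token; below that ratio Qwen's lower output price wins.

```python
# List prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Qwen 3.7 Max": (2.50, 7.50),
}


def job_cost(model, input_tokens, output_tokens):
    """List-price cost in USD for one workload on one model."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def rank_by_cost(input_tokens, output_tokens):
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = {m: job_cost(m, input_tokens, output_tokens) for m in PRICES}
    return sorted(costs.items(), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical workloads: (label, input tokens, output tokens).
    workloads = [
        ("read-heavy", 10_000_000, 500_000),
        ("write-heavy", 2_000_000, 4_000_000),
    ]
    for label, tokens_in, tokens_out in workloads:
        print(label)
        for model, cost in rank_by_cost(tokens_in, tokens_out):
            print(f"  {model}: ${cost:.2f}")
```

List prices leave out caching discounts, batch rates, and retries, so plug in your own effective prices before drawing conclusions.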

But the real money move is not buying one model. It is routing. Send the easy majority of your traffic to a small, cheap model, and escalate only the hard minority to whichever of these four wins your job. A request that a $0.20-per-million model can answer should never touch a $25-per-million model, and most of your requests are that kind. Teams that route well spend a fraction of what single-model teams spend for the same output quality.
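A minimal sketch of that routing idea follows. Everything in it is an assumption made for illustration: the tier names are placeholders, the two prices are the $0.20 and $25 figures quoted above, and the keyword-and-length difficulty check stands in for whatever classifier or confidence signal you would really use. The arithmetic is the useful part: if 10% of traffic escalates, the blended price is about $2.68 per million tokens instead of $25.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tier:
    name: str                 # placeholder label, not a real model ID
    price_per_million: float  # USD per 1M tokens


CHEAP = Tier("small-cheap-model", 0.20)
FRONTIER = Tier("frontier-model", 25.00)

# Hypothetical signals that a request needs the stronger model.
HARD_MARKERS = ("refactor", "stack trace", "prove", "migrate")


def looks_hard(prompt: str) -> bool:
    """Crude stand-in for a real difficulty classifier."""
    text = prompt.lower()
    return len(text) > 2000 or any(marker in text for marker in HARD_MARKERS)


def route(prompt: str, is_hard: Callable[[str], bool] = looks_hard) -> Tier:
    """Send hard requests to the frontier tier and everything else to the cheap one."""
    return FRONTIER if is_hard(prompt) else CHEAP


def blended_price(hard_fraction: float) -> float:
    """Average price per 1M tokens when hard_fraction of traffic escalates."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    return (hard_fraction * FRONTIER.price_per_million
            + (1.0 - hard_fraction) * CHEAP.price_per_million)


if __name__ == "__main__":
    print(route("What is our refund policy?").name)
    print(route("Refactor the billing module to use the new ledger").name)
    print(f"${blended_price(0.10):.2f} per 1M tokens at 10% escalation")
```

The savings hold only if the difficulty check is good: a router that misclassifies hard requests as easy trades money for quality, so measure output quality alongside the bill.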

The honest caveat on these numbers

For two of these models (GPT-5.5 and Gemini 3.1 Pro), vendors and independent testers report some figures a little differently, and the blended index moves by a point or two depending on the week and the test mix. Treat the table above as a guide to which model wins which job, not as a stopwatch reading. The ranking is also temporary: GPT-5.5 and Gemini both have point releases due, and Opus 4.8's lead on the blended index (61.4 against roughly 59) is not much larger than that week-to-week swing. Re-check before any big commitment.

A decision in one paragraph

Coding and computer use: Opus 4.8. Science, speed, and long documents: Gemini 3.1 Pro. Voice and real-time multimodal: GPT-5.5. Marathon agents and math on a budget: Qwen 3.7 Max. Anything you need to host yourself: none of them, go open. And for almost everyone, the best single decision is to route between a cheap model and one of these, rather than pay frontier rates on every call.
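If you want that paragraph as a starting configuration, it reduces to a small lookup. The job labels and the function name below are illustrative, and the mapping simply restates the recommendations above; it is a default to override with your own test results, not a fixed rule.

```python
# Default picks per job, restating the paragraph above.
RECOMMENDATION = {
    "coding": "Opus 4.8",
    "computer use": "Opus 4.8",
    "science": "Gemini 3.1 Pro",
    "speed": "Gemini 3.1 Pro",
    "long documents": "Gemini 3.1 Pro",
    "voice": "GPT-5.5",
    "real-time multimodal": "GPT-5.5",
    "marathon agents": "Qwen 3.7 Max",
    "math on a budget": "Qwen 3.7 Max",
}


def pick_model(job, needs_self_hosting=False):
    """Return the default pick for a job, or None if none of the four fits."""
    if needs_self_hosting:
        # All four are closed-weight, so none of them qualifies.
        return None
    return RECOMMENDATION.get(job.strip().lower())


if __name__ == "__main__":
    print(pick_model("Coding"))
    print(pick_model("voice"))
    print(pick_model("coding", needs_self_hosting=True))
```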
