Model Deep Dive

Qwen 3.7 Max: The Model That Ran for 35 Hours Straight

Alibaba's Qwen 3.7 Max reached #5 on the AA Index and ran an agent for 35 hours. Strong, cheap, and closed, not open.

June 2, 2026

Alibaba showed Qwen 3.7 Max at its Cloud Summit in Hangzhou on May 20, 2026, with the API already live a day earlier. It arrived at #5 on the Artificial Analysis Intelligence Index at 56.6, which makes it the best-scoring model out of China and a near-five-point jump over the 3.6 Max preview. The number that got people's attention, though, was not the index. It was 35.

The 35-hour run

In Alibaba's demo, Qwen 3.7 Max worked on its own for thirty-five hours without a human stepping in. Over that stretch it made 1,158 separate tool calls and finished the work about ten times faster, by geometric mean, than a straight-through baseline. Pick whatever framing you like for the marketing; the underlying claim is the interesting one. Staying on a single goal for a day and a half, across more than a thousand actions, without losing the plot, is something most models cannot do for thirty minutes.
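
To make the "ten times faster, by geometric mean" framing concrete, here is a minimal sketch of how such a figure is typically computed, and of what the run's pace works out to. The per-task speedups are hypothetical values chosen for illustration and are not from Alibaba's demo; only the 35 hours and 1,158 tool calls come from the numbers above.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-task speedups (baseline time / agent time).
# These are illustrative assumptions, not figures from the demo.
example_speedups = [2.0, 8.0, 50.0, 12.5]

gm = geometric_mean(example_speedups)
am = sum(example_speedups) / len(example_speedups)

# Figures quoted in the article: a 35-hour run with 1,158 tool calls.
RUN_HOURS = 35
TOOL_CALLS = 1158
minutes_per_call = RUN_HOURS * 60 / TOOL_CALLS

print(f"geometric mean speedup: {gm:.1f}x (arithmetic mean: {am:.1f}x)")
print(f"average gap between tool calls: {minutes_per_call:.2f} minutes")
```

The geometric mean is the usual choice for averaging ratios because one outlier task (the 50x in this made-up list) pulls it up far less than it pulls up an arithmetic mean. On the article's own numbers, the run averaged a tool call a little under every two minutes.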

That endurance is the reason to care about this model. If you build agents that grind (code migrations, multi-source research, anything that runs for hours), this belongs on your shortlist. It also speaks common agent protocols and works with outside runners, including Anthropic's Claude Code, so you can point tooling you already have at it instead of rebuilding your setup.

You can line it up against the rest of the field on the leaderboard; the full spec is on its model page.

The scorecard

Test	What it covers	Qwen 3.7 Max	For scale
AA Intelligence Index	Blended ability	56.6 (#5)	Opus 4.8 61.4, GPT-5.5 ~59
HMMT Feb 2026	Competition math	97.1	Opus-class 96.2, DeepSeek V4 Pro 95.2
GPQA Diamond	Graduate science	92.4	Gemini 3.1 Pro 94.3
SWE-bench Pro	Hard coding issues	60.6	Opus 4.8 69.2
Terminal-Bench 2.0	Shell tasks	69.7	—
Context	Working memory	1,000,000	matches the frontier

Two things stand out. The jump from the 3.6 preview (51.8) to 56.6 is one of the bigger single-release gains any lab posted this year, so the team is moving quickly. And on competition math, the HMMT result of 97.1 is the top score on that board, ahead of the best Claude and DeepSeek runs. If you need a model that is actually correct on quantitative problems rather than merely confident, that is a real edge.

What it is good at

The headline is stamina, but a few other things travel with it.

Math and quantitative reasoning hold up under pressure, which matters for finance, scientific computing, and anywhere a wrong answer is worse than no answer. Coding is strong without leading: 60.6 on SWE-bench Pro and 69.7 on Terminal-Bench 2.0 put it in serious company, a step behind Claude but well ahead of most. And because it carries Qwen's multilingual roots, Chinese and English coverage are both excellent, with solid breadth across other Asian languages, which is the practical reason a team operating across the region might pick it.

On cost, it lists at $2.50 per million input tokens and $7.50 per million output tokens, roughly half of Opus 4.7's rate, which is cheap for a top-five model. For agent workloads that burn tokens by the hour, that gap compounds fast. You can model it on the token cost calculator.
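
As a rough sketch of that arithmetic, the snippet below prices a long agent run at the listed rates. The hourly token volumes are assumptions picked for illustration, not measurements from the 35-hour demo, and the `Pricing` and `run_cost` names are hypothetical helpers rather than part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in USD per million tokens."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD cost for the given token counts."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


# Rates quoted in the article for Qwen 3.7 Max.
QWEN_37_MAX = Pricing(input_per_million=2.50, output_per_million=7.50)


def run_cost(pricing: Pricing, hours: float,
             input_tokens_per_hour: int, output_tokens_per_hour: int) -> float:
    """Cost of an agent run at a steady hourly token burn."""
    return pricing.cost(round(hours * input_tokens_per_hour),
                        round(hours * output_tokens_per_hour))


# Assumed workload: 2M input and 200k output tokens per hour for 35 hours.
example = run_cost(QWEN_37_MAX, hours=35,
                   input_tokens_per_hour=2_000_000,
                   output_tokens_per_hour=200_000)
print(f"estimated run cost: ${example:,.2f}")
```

Under those assumed volumes the run consumes 70M input and 7M output tokens, which comes to $227.50 at the listed rates. To compare against another model, construct a second `Pricing` with that model's published rates and call `run_cost` with the same workload.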

Where it falls short

Start with rank. Fifth is excellent, but Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and Opus 4.7 all sit above it on the blended index. For the single best-quality call you can make, this is not the one.

The bigger surprise is that the Max tier is closed. Qwen built its name on open weights, and a lot of buyers reach for a Chinese model precisely because they want to self-host or keep the weights. That option is not on the table here; Qwen 3.7 Max is API-only, served from Alibaba Cloud. If portability is your reason, the open alternatives are DeepSeek V4 Pro under Apache 2.0 or GLM-5.1 under MIT, both of which you can run yourself.

Coding trails Claude. The SWE-bench Pro gap to Opus 4.8 (60.6 against 69.2) is real, and for pure agentic coding I would still reach for Claude first.

Then there is governance. Sending traffic through Alibaba Cloud is a hard stop for some regulated and Western-jurisdiction buyers, regardless of how the model scores. If that describes you, a Western open-weight model is the path, and the benchmark conversation is moot.

Who it fits

Situation	Read
Agents that run for hours	Strong fit. Stamina is the selling point.
Competition-grade math or quant	Best on HMMT this cycle.
High-volume agent work on a budget	Good value at about half Opus pricing.
You need to self-host	Wrong model. Use DeepSeek or GLM.
The single best answer	Use Opus 4.8 or GPT-5.5.
Data must stay in-region (West)	Governance likely blocks it.

What it tells us about the race

Two years ago the question was whether a Chinese lab could reach the frontier. Qwen 3.7 Max settles it: the gap to the Western leaders is now a few points, not a tier. The more telling shift is in strategy. Alibaba is holding its strongest model back as a paid, closed API instead of releasing the weights, the same move OpenAI and Anthropic make. The open-versus-closed line is no longer drawn neatly along a map. For the longer version of that argument, see Open weights vs proprietary: where the June 2026 frontier actually splits.
