Model Deep Dive

Claude Opus 4.8: What Actually Changed, and Where GPT-5.5 Still Wins

Claude Opus 4.8 retook #1 on the AA Intelligence Index at 61.4. What changed, and the three jobs where it still loses.

June 2, 2026


Anthropic put out Claude Opus 4.8 on May 28, 2026. The short story: a Claude model is back at the top of the Artificial Analysis Intelligence Index, at 61.4, a little over four points above Opus 4.7 and just over a point above GPT-5.5. The price did not move. Below is what the release does well, what it does not, and how to decide whether to move your traffic over.

The short version

Opus 4.8 is a tune-up of the 4.x family, not a fresh model trained from scratch. Anthropic itself called it "modest but tangible," and that reads honestly against the scores. The gains cluster in three places: long, multi-step coding tasks; driving a computer through a browser; and answering hard questions without any tools to lean on. It costs the same as Opus 4.7, which is the part most teams will care about.

If you only read one line: teams already running Opus 4.7 for coding or agent work should switch, because they get a better model for the same bill. Teams running GPT-5.5 specifically for shell-based agents should hold and test, because that is the one job GPT-5.5 keeps.

You can see it ranked next to everything else on the model leaderboard, with the raw spec on its model page.

What moved since 4.7

Here are the published numbers side by side. I have kept the ones that actually shifted; the knowledge tests like MMLU are saturated now and tell you almost nothing between top models.

Test	What it measures	Opus 4.7	Opus 4.8
AA Intelligence Index	Blended score across many tasks	57.3	61.4
SWE-bench Verified	Fixing real GitHub issues	87.6%	88.6%
SWE-bench Pro	The harder issue set	64.3%	69.2%
Terminal-Bench 2.1	Getting things done in a shell	66.1%	74.6%
Humanity's Last Exam (no tools)	Very hard questions, closed book	46.9%	49.8%

The numbers I would watch are Terminal-Bench, up 8.5 points, and SWE-bench Pro, up 4.9. Both reward a model that can hold a plan across many steps instead of solving one neat puzzle. That endurance is the practical difference you feel when an agent runs for twenty minutes without going off the rails.
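To check the arithmetic, here is a small sketch that recomputes the point gains from the table above and sorts them. The scores are copied from the table; the function names are my own.

```python
# Scores copied from the table above as (Opus 4.7, Opus 4.8).
# The index is a plain score; the rest are percentages.
SCORES = {
    "AA Intelligence Index": (57.3, 61.4),
    "SWE-bench Verified": (87.6, 88.6),
    "SWE-bench Pro": (64.3, 69.2),
    "Terminal-Bench 2.1": (66.1, 74.6),
    "Humanity's Last Exam (no tools)": (46.9, 49.8),
}


def deltas(table):
    """Return {test name: point gain from 4.7 to 4.8}, rounded to 0.1."""
    return {name: round(new - old, 1) for name, (old, new) in table.items()}


def ranked(table):
    """Return (test name, gain) pairs, largest gain first."""
    return sorted(deltas(table).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, gain in ranked(SCORES):
        print(f"{name:33s} +{gain:.1f}")
```

Terminal-Bench and SWE-bench Pro come out first and second, which is why they are the two worth watching.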

The coding gains, in plain terms

SWE-bench Pro hands the model a real bug report from a real repository and asks it to ship a patch that passes the project's own tests. Scoring 69.2% means roughly seven in ten of those land without a person touching the code. A year ago the best models sat in the low fifties on the same set, so this is steady, not flashy, progress on the part that costs engineering teams the most time.

In day-to-day use this shows up as fewer "almost right" patches. Opus 4.8 is more willing to read the surrounding files, notice a second call site that also needs the change, and fix both. If you drive it through an agent runner like Claude Code, you get that for free by pointing the same setup at the new model. No price change, no rework.

Driving a computer

The release's most interesting result is 84% on Online-Mind2Web, a test that makes the model operate a real website by reading the screen, clicking, and typing, the way a person would. That is the highest score anyone has logged, and it beats both Opus 4.7 and GPT-5.5 by a clear margin.

This matters because it is the gap between a model that tells you how to file an expense report and one that files it. Booking, form-filling, pulling a number out of a dashboard that has no API: all of that lives in the browser. Opus 4.8 is currently the most reliable model at it, though "most reliable" still means it will occasionally misclick, so keep a human in the loop for anything that spends money or sends mail.
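One way to keep that human in the loop is an approval gate in front of risky actions. The sketch below is illustrative only: the Action type, the RISKY_KINDS set, and the execute and approve callbacks are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass

# Hypothetical action kinds; a real agent would define its own taxonomy.
RISKY_KINDS = {"payment", "send_email"}


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "payment", "send_email"
    description: str   # human-readable summary shown to the approver


def needs_approval(action: Action) -> bool:
    """True for actions that spend money or send mail."""
    return action.kind in RISKY_KINDS


def run(action: Action, execute, approve):
    """Execute the action, asking `approve` first when it is risky.

    Returns whatever `execute` returns, or None if the action was blocked.
    """
    if needs_approval(action) and not approve(action):
        return None
    return execute(action)
```

Routine clicks and typing pass straight through; only the actions that cost money or reach another person wait for a yes.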

Thinking without a safety net

On Humanity's Last Exam, a set of very hard questions run with no tools and no web access, Opus 4.8 scores 49.8%. Gemini 3.1 Pro sits at 44.4% and GPT-5.5 at 41.4%. Closed-book is the cleanest read on what a model actually knows and can reason through, rather than what it can look up, and Opus 4.8 leads it. For work where the model has to be right from memory, such as a first-pass medical or legal read, that lead is worth more than a couple of points on a coding chart.

Honesty as a feature

Anthropic led the announcement with calibration, not capability: the model owns up to not knowing more often and invents fewer confident, wrong answers. That is dull to demo and valuable in production. A model that says "I am not sure, here is what I would check" is far cheaper to supervise than one that fabricates a plausible citation. For regulated work (legal, healthcare), that behavior is part of the spec, not a nicety.

Price and the faster fast mode

The headline most teams have been waiting for is that nothing got more expensive. Opus 4.8 holds at $5 per million input tokens and $25 per million output tokens, the same rate card as Opus 4.7. Fast mode, the lower-latency serving tier, is now about two and a half times quicker and roughly three times cheaper than fast mode was on earlier models, which makes Opus realistic for interactive features that used to need a smaller model. The full rundown is on the Claude pricing page.

For context, GPT-5.5 lists at $5 / $30 per million input and output tokens, and GPT-5.5 Pro at $30 / $180. Holding the line on price while taking the top spot is the quiet reason this release lands well.
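To see what those list prices mean on a bill, here is a minimal cost sketch. The rates are the ones quoted above; the monthly volume of 10 million input and 2 million output tokens is an assumed workload, chosen only to make the arithmetic easy to follow.

```python
# USD per million tokens as (input, output), taken from the prices quoted above.
PRICES = {
    "Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
    "GPT-5.5 Pro": (30.0, 180.0),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost in USD for the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    # Assumed monthly workload, for illustration only.
    IN_TOKENS, OUT_TOKENS = 10_000_000, 2_000_000
    for model in PRICES:
        print(f"{model:12s} ${cost_usd(model, IN_TOKENS, OUT_TOKENS):,.2f}")
```

On that assumed workload the three come to $100, $110, and $660. The gap between Opus 4.8 and GPT-5.5 is entirely on the output side, so it widens as responses get longer.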

Three jobs it still loses

A fair review names the losses too.

First, shell agents. On pure terminal and command-line task suites, GPT-5.5 still finishes ahead. If your agent spends most of its life in a shell rather than a browser or an editor, run both against your own tasks before you commit.

Second, graduate science. Opus 4.8 and GPT-5.5 are about even on GPQA-class science questions, and Gemini 3.1 Pro still leads that category outright at 94.3% on GPQA Diamond. For research-heavy science, Gemini remains the reference.

Third, output types. Opus reads images but does not make them, and there is no native voice. If your product needs generated images or real-time speech, you are pairing Opus with a second model regardless of how good its text is.
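For the first case, "run both against your own tasks" can be as simple as a pass-rate comparison on one shared task list. This is a rough sketch: each runner is a hypothetical callable that takes a task and returns True when your own check passes, and wiring it to a real agent is left to you.

```python
def pass_rate(run_task, tasks):
    """Fraction of tasks for which run_task(task) returns a truthy value."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if run_task(task))
    return passed / len(tasks)


def compare(runners, tasks):
    """Run every runner on the same tasks and return {name: pass rate}."""
    return {name: pass_rate(fn, tasks) for name, fn in runners.items()}
```

Use the same tasks, the same pass criteria, and enough tasks that a point or two of difference is not noise.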

Who should move

You are running…	Do this
Opus 4.7 for coding or agents	Switch. Same price, better scores.
GPT-5.5 for general coding	Test 4.8. It likely wins on issue-level work and computer use.
GPT-5.5 for shell agents	Hold and A/B. GPT-5.5 still edges terminal tasks.
Gemini 3.1 Pro for science	Keep Gemini there; add 4.8 for code.
A cheap tier like DeepSeek V4 Pro	Route: bulk on the cheap model, hard 10% to 4.8.

The pattern that ages well is not "pick one model." It is routing: send the easy majority of calls to something small and quick, and escalate only the hard ones to Opus 4.8. You pay the $25 rate on the calls that earn it and nothing close to it on the calls that do not.
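A minimal sketch of that routing idea follows. It makes two loud assumptions: some upstream step gives each request a difficulty score between 0 and 1, and the cheap model's output rate is whatever you pass in. The model names are placeholders, not real API identifiers.

```python
# Placeholder names, not real API model IDs.
CHEAP = "cheap-model"
STRONG = "opus-4.8"


def pick_model(difficulty: float, threshold: float = 0.9) -> str:
    """Send a request to the strong model only at or above the threshold.

    `difficulty` is assumed to come from an upstream classifier or
    heuristic; how it is computed is outside this sketch.
    """
    return STRONG if difficulty >= threshold else CHEAP


def blended_rate(strong_rate: float, cheap_rate: float, hard_share: float) -> float:
    """Average output price per million tokens when `hard_share` of the
    output tokens go to the strong model and the rest to the cheap one."""
    return hard_share * strong_rate + (1 - hard_share) * cheap_rate
```

With Opus 4.8's $25 output rate, a hypothetical $1 rate for the cheap model, and 10% of output tokens escalated, blended_rate(25.0, 1.0, 0.10) comes to about $3.40 per million output tokens.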

Where it sits now

By our blended index Opus 4.8 is first, and it leads both the coding and overall boards on our LMArena tracker this month. That ranking will not hold forever; GPT-5.5's next point release and Gemini's are both close. For the full four-way against GPT-5.5, Gemini 3.1 Pro, and Qwen 3.7 Max, see the June flagship comparison.
