Executive Summary
Claude Opus 4.7 is the best coding model on the public market in May 2026 and the most efficient frontier model to run with caching. It also ships a new tokenizer that quietly raises effective cost by about a third for most workloads, and it continues to lose to Gemini 3.1 Pro on raw reasoning depth. Procurement teams should treat the list price as a floor, not a quote.
Three strengths
- Coding Arena #1 (1567 Elo) with the strongest multi-file refactor and stack-trace debugging in our suite.
- Best-in-class caching economics. Cached input at $0.50 per 1M is the most aggressive rate among frontier providers.
- Output reliability. Lowest hallucination rate in our N9 (domain QA) suite, with grounded citations when given retrieved context.
Three weaknesses
- Tokenizer drift. The same prompts produce ~35% more tokens than on 4.6 at an unchanged list price, so effective cost rose by roughly one-third without a price change.
- GPQA Diamond reasoning. 4-5 points behind Gemini 3.1 Pro on the toughest reasoning prompts.
- Tool-call edge cases. Schema compliance degrades under unusual function signatures vs GPT-5.5.
Architecture and Training
Anthropic publishes architectural details sparingly. What is public or strongly inferred from the model card and tokenizer changes:
- Dense transformer, no MoE. Parameter count not disclosed; scaling-law estimates put it in the 400-700B range.
- New tokenizer. Anthropic shipped a fresh tokenizer with 4.7 with a different vocabulary balance. Empirically, the same input text produces ~35% more tokens than 4.6 across English, code, and mixed inputs. The shift appears to favour rare-token coverage and non-English performance, at the cost of common-English compression.
- Thinking mode (extended reasoning) shipped with 4.7, billed at the output rate. The thinking budget is client-controllable; a request sketch follows this list.
- Knowledge cutoff January 2026. Three months fresher than 4.6.
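A minimal request sketch for that client-controlled thinking budget, assuming the extended-thinking request shape from earlier Claude releases carries over unchanged; the model id is hypothetical, and thinking tokens bill at the $25/1M output rate.

```python
import anthropic

client = anthropic.Anthropic()

# Cap reasoning spend explicitly: thinking tokens count toward max_tokens and
# are billed at the output rate, so budget_tokens doubles as a cost control.
response = client.messages.create(
    model="claude-opus-4-7",                              # hypothetical model id
    max_tokens=16000,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Derive the closed form of T(n) = 2T(n/2) + n."}],
)

# Thinking blocks and the final answer come back as separate content blocks.
answer = next(block.text for block in response.content if block.type == "text")
print(answer)
```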
Pricing Reality
The list-price story and the production-cost story are different by a meaningful margin.
| Tier | Input ($/1M) | Output ($/1M) | Effective cost vs 4.6 |
|---|---|---|---|
| Standard | $5.00 | $25.00 | +35% (tokenizer drift) |
| Cached input | $0.50 | $25.00 | +35% on uncached tokens |
| Batch (50% off) | $2.50 | $12.50 | +35%, then 50% batch discount |
| Priority tier | $7.50 | $37.50 | +35%, then +50% priority premium |
For a workload that previously ran at $1,000/month on Opus 4.6 with 30% cache-hit, the equivalent 4.7 spend is roughly $1,250/month at the same cache-hit rate. Aggressive caching ($0.50/1M cached input) recovers most of the drift on long system prompts — readers running RAG with stable retrieved context should see 4.7 cheaper than 4.6 in practice. Readers running short, varied prompts will pay the full drift.
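For teams sanity-checking the drift against their own traffic, a minimal what-if sketch using the list prices above; the traffic shape, the 30% cache-hit figure, and the assumption that the ~35% drift applies uniformly to input and output are placeholders to replace with measured numbers.

```python
# 4.7 list prices from the table above, in dollars per 1M tokens.
INPUT, CACHED_INPUT, OUTPUT = 5.00, 0.50, 25.00
DRIFT = 1.35   # assumed: ~35% more 4.7 tokens for the same text as 4.6

def monthly_spend(input_m: float, output_m: float, cache_hit: float) -> float:
    """Dollars per month; token volumes in millions of 4.7 tokens."""
    uncached = input_m * (1 - cache_hit)
    cached = input_m * cache_hit
    return uncached * INPUT + cached * CACHED_INPUT + output_m * OUTPUT

# Workload measured in 4.6 tokens, re-expressed in 4.7 tokens via DRIFT.
input_46, output_46 = 100.0, 20.0            # assumed shape, millions per month
spend = monthly_spend(input_46 * DRIFT, output_46 * DRIFT, cache_hit=0.30)
print(f"estimated 4.7 spend: ${spend:,.0f}/month")
```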
SMQTS Results — Programming Series
Weighted-blend score per category, 0-100. Tested at temperature 0.2, max_tokens 4096, no caching, thinking budget medium.
| Category | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | DeepSeek V4 Pro |
|---|---|---|---|---|
| P1 Multi-file refactor | 94 | 86 | 83 | 74 |
| P2 Bug-finding from stack trace | 92 | 87 | 84 | 78 |
| P3 Code review | 91 | 88 | 85 | 76 |
| P4 Test generation | 89 | 90 | 83 | 77 |
| P5 SQL from natural language | 87 | 89 | 91 | 82 |
| P6 Algorithm from spec | 93 | 89 | 88 | 79 |
| P7 Migration scripts | 92 | 83 | 80 | 71 |
| P8 Documentation | 90 | 88 | 85 | 78 |
| P9 Diff comprehension | 91 | 86 | 83 | 76 |
| P10 Tool-using agent loops | 89 | 92 | 85 | 74 |
| Weighted average | 91.2 | 87.6 | 84.6 | 76.5 |
Opus 4.7 wins 7 of 10 programming categories. The losses are P4 (test generation, where GPT-5.5 wins by 1 point, within rater noise), P5 (SQL from natural language, where Gemini 3.1 Pro wins by 4 points), and P10 (tool-using agent loops, where GPT-5.5 wins by 3 points, outside rater noise and a real result).
Programming-only headline
```
Opus 4.7      91.2  ##############################################
GPT-5.5       87.6  ############################################
Gemini 3.1    84.6  ##########################################
DeepSeek V4   76.5  ######################################
```
SMQTS Results — Non-Programming Series
Opus 4.7 lands a close third, 0.3 points behind GPT-5.5 and 1.7 points behind Gemini 3.1 Pro. The relative weaknesses are N3 (multi-step reasoning) and N5 (translation into low-resource languages, where Gemini's multilingual training advantage shows).
| Category | Opus 4.7 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|
| N1 Long-form drafting | 87 | 89 | 91 |
| N2 Summarization | 91 | 90 | 89 |
| N3 Multi-step reasoning | 83 | 94 | 88 |
| N4 Information extraction | 89 | 87 | 88 |
| N5 Translation | 76 | 92 | 84 |
| N6 Style transfer | 90 | 87 | 89 |
| N7 Adversarial resistance | 92 | 88 | 85 |
| N8 Structured output | 87 | 88 | 91 |
| N9 Domain QA | 90 | 89 | 87 |
| N10 Multi-turn coherence | 91 | 89 | 87 |
| Average | 87.6 | 89.3 | 87.9 |
SMQTS Results — Cost-Quality Validation
For workloads where Opus 4.7 wins on absolute quality, can a cheaper model substitute? Pairwise blind grading on the 50-prompt cost-quality sample:
| Workload | Opus 4.7 wins | DeepSeek V4 Pro wins | Tie |
|---|---|---|---|
| Multi-file refactor (P1) | 71% | 11% | 18% |
| SQL from NL (P5) | 34% | 32% | 34% |
| Information extraction (N4) | 27% | 31% | 42% |
| Summarization (N2) | 38% | 22% | 40% |
Procurement reading. Opus 4.7 is irreplaceable on multi-file refactor and code review. It is roughly substitutable by DeepSeek V4 Pro on SQL, extraction, and summarization at one-tenth the cost. The cascade pattern wins here — route P1, P3, P7 to Opus, route P5, N2, N4 to a cheaper tier.
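A sketch of that cascade as a routing table; the category codes are the SMQTS labels used in this report, and the model identifiers are placeholders rather than confirmed API names.

```python
# Route premium coding work to Opus 4.7 and commodity text work to a cheaper tier.
PREMIUM = {"P1", "P3", "P7"}       # multi-file refactor, code review, migration scripts
COMMODITY = {"P5", "N2", "N4"}     # SQL from NL, summarization, information extraction

def pick_model(category: str) -> str:
    if category in PREMIUM:
        return "claude-opus-4-7"    # hypothetical model id
    if category in COMMODITY:
        return "deepseek-v4-pro"    # hypothetical model id
    return "claude-opus-4-7"        # default to quality when the category is unmapped

assert pick_model("P1") == "claude-opus-4-7"
assert pick_model("N2") == "deepseek-v4-pro"
```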
Strengths in Detail
Multi-file refactor
On 6 prompts that require coordinated edits across 4-8 files, Opus 4.7 produced a fully-passing test suite on first attempt in 5/6 cases. The next-best model (GPT-5.5) achieved 3/6. DeepSeek V4 Pro produced plausible single-file edits but lost cross-file consistency in 4/6 cases.
Caching economics
At $0.50 per 1M cached input, Opus 4.7 has the most aggressive caching rate among frontier models. For RAG workloads with stable retrieved chunks (~30K tokens of context per call), this flips the cost story: 4.7 ends up cheaper than 4.6 in practice despite the tokenizer drift. Cache TTL is 5 minutes standard, 1 hour with explicit cache control.
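A minimal sketch of a stable-context RAG call with prompt caching, assuming the cache_control request shape from earlier Claude releases carries over to 4.7; the model id and the placeholder strings are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

stable_context = "<retrieved chunks, ~30K tokens, identical across calls>"  # placeholder
question = "Which clause governs early termination?"                        # placeholder

response = client.messages.create(
    model="claude-opus-4-7",                          # hypothetical model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer only from the provided context."},
        {
            "type": "text",
            "text": stable_context,
            # Marks the stable block cacheable: 5-minute TTL by default,
            # longer with explicit cache control per the note above.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
```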
Output reliability
Lowest hallucination rate in N9 (3.2% across 240 graded responses, vs 5.1% for GPT-5.5 and 4.4% for Gemini 3.1 Pro on the same prompts). When Opus 4.7 does not know an answer, it is more likely than any rival to say so explicitly.
Weaknesses and Failure Modes
GPQA Diamond reasoning gap
Opus 4.7 scores ~89.5 on GPQA Diamond vs Gemini 3.1 Pro's 94.3. On graduate-level physics and biology questions where the answer requires multi-step derivation, Gemini wins consistently. For workloads where this matters (technical research, advanced tutoring), Opus 4.7 is the wrong choice.
Long-context citation degradation
As the context fills toward the 500K window limit, citation accuracy in N9 starts to degrade. The model stays coherent and produces fluent answers, but the needle-in-a-haystack citation gets fuzzier. Cap citation-critical workloads comfortably below the full window rather than filling it.
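A rough guard for that headroom; the 400K cap and the characters-per-token heuristic are assumptions to tune against your own measurements, or to replace with a real count from the provider's token-counting endpoint.

```python
SOFT_CAP_TOKENS = 400_000   # assumed headroom below the 500K window
CHARS_PER_TOKEN = 3.5       # rough heuristic; 4.7 packs fewer characters per token than 4.6

def trim_context(chunks: list[str]) -> list[str]:
    """Keep the highest-ranked chunks until the estimated token budget is spent."""
    kept, budget = [], SOFT_CAP_TOKENS * CHARS_PER_TOKEN
    for chunk in chunks:                  # chunks assumed pre-sorted by retrieval score
        if len(chunk) > budget:
            break
        kept.append(chunk)
        budget -= len(chunk)
    return kept
```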
Tool-call edge cases
On P10, GPT-5.5 wins by 3 weighted points. The specific failure mode: when the function schema has unusual variadic arguments or deeply nested object types, Opus 4.7 produces partially-malformed tool calls that fail at the parser. The recovery loop usually succeeds on retry, but the extra round-trip costs latency.
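The client-side mitigation is a validate-then-retry loop. The sketch below illustrates that pattern, not a provider-side mechanism: the schema is a toy example, and call_model stands in for your own model-invocation function.

```python
import json
from jsonschema import ValidationError, validate

TOOL_SCHEMA = {   # toy schema; real agent schemas are larger and more nested
    "type": "object",
    "properties": {"name": {"type": "string"}, "arguments": {"type": "object"}},
    "required": ["name", "arguments"],
}

def run_tool_call(call_model, prompt: str, max_retries: int = 1) -> dict:
    """call_model(prompt) -> raw tool-call text from whichever model you route to."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            call = json.loads(raw)
            validate(instance=call, schema=TOOL_SCHEMA)   # rejects partially-malformed calls
            return call
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the parser error back and ask for a corrected call; this is the
            # extra round-trip that costs latency.
            prompt += f"\n\nThe previous tool call failed to parse ({err}). Emit a corrected call."
    raise RuntimeError("tool call still malformed after retry")
```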
When to Use Opus 4.7
- Code-heavy production agents doing multi-file refactor, code review, migration, or test generation.
- RAG pipelines with stable retrieved context where caching makes the price story very favourable.
- Domain QA with strict no-hallucination requirements (legal, medical, regulated). Opus 4.7's low fabrication rate is genuinely best-in-class.
- Long-context summarization up to ~500K tokens.
When NOT to Use Opus 4.7
- Graduate-level reasoning workloads. Use Gemini 3.1 Pro instead.
- High-volume tool-calling agents with unusual function schemas. Use GPT-5.5.
- Cost-sensitive bulk extraction or classification. DeepSeek V4 Pro is roughly substitutable at one-tenth the cost.
- Translation into low-resource languages. Gemini wins on N5 by a wide margin.
- Workloads with high token volume and short prompts. Caching cannot recover the tokenizer drift; effective cost rose 35%.
Comparison to Direct Rivals
vs GPT-5.5
| Dimension | Opus 4.7 | GPT-5.5 |
|---|---|---|
| Output price ($/1M) | $25 | $30 |
| Context window | 500K | 1M |
| Coding Arena Elo | 1567 | 1521 |
| Tool-call reliability | Strong | Best |
| Cached input price | $0.50 | $1.25 |
vs Gemini 3.1 Pro
| Dimension | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|
| Input price ($/1M) | $5.00 | $3.50 |
| Context window | 500K | 2M |
| GPQA Diamond | 89.5 | 94.3 |
| SWE-bench Pro | 64.3% | 58.1% |
| Translation N5 score | 76 | 92 |
Procurement Notes
Enterprise readiness
SOC 2 Type II, ISO 27001, HIPAA. Available on Anthropic direct, AWS Bedrock, Google Cloud Vertex. DPA, MSA, custom data retention controls available on Team and Enterprise plans. Mature.
Lock-in score
3.5 / 5 on the Swfte vendor-lock-in scoring (lower is easier to leave; see the vendor leaderboard). The specific costs to leave: Anthropic-flavoured XML prompt structure, a proprietary tool-calling schema, a new tokenizer that forces token budgets to be re-estimated when porting, and caching economics that don't survive the move.
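Re-estimating those token budgets is mechanical but easy to forget. A sketch, assuming the SDK's token-counting endpoint behaves for 4.7 as in earlier releases; the model id, the target encoding, and the sample prompts are illustrative stand-ins.

```python
import anthropic
import tiktoken

client = anthropic.Anthropic()
target_enc = tiktoken.get_encoding("o200k_base")   # stand-in for the target vendor's tokenizer

def opus_tokens(prompt: str) -> int:
    count = client.messages.count_tokens(
        model="claude-opus-4-7",                   # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
    )
    return count.input_tokens

samples = [                                        # use real production prompts, not these
    "Refactor the retry logic in billing/worker.py to use exponential backoff.",
    "Summarise the attached incident report for an executive audience.",
]
ratios = [len(target_enc.encode(p)) / opus_tokens(p) for p in samples]
print(f"rescale 4.7 token budgets by ~{sum(ratios) / len(ratios):.2f}x on the target stack")
```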
Contract leverage
Anthropic offers volume discounts on direct contracts above ~$100K/month. Bedrock and Vertex pricing is identical to the direct list. The leverage point is committed-use discounts on Bedrock for AWS-aligned procurement, which can shave 15-20% off the effective rate on multi-year deals.