SMQTS v1.3 · Pinned 2026-04-16

Claude Opus 4.7 — Deep Dive Research Report (May 2026)

The coding Arena leader. The expensive caching champion. The tokenizer drift surprise.


Model Snapshot

| Field | Value |
|---|---|
| Released | 2026-04-16 |
| License | Closed |
| Context | 500K tokens |
| Knowledge cutoff | Jan 2026 |
| Input price | $5 / 1M |
| Output price | $25 / 1M |
| Cached input | $0.50 / 1M |
| Coding Arena Elo | 1567 (#1) |

Executive Summary

Claude Opus 4.7 is the best coding model on the public market in May 2026 and the most efficient frontier model to run with caching. It also ships a new tokenizer that quietly raises effective cost by about a third for most workloads, and it continues to lose to Gemini 3.1 Pro on raw reasoning depth. Procurement teams should treat the list price as a floor, not a quote.

Three strengths

  1. Coding Arena #1 (1567 Elo) with the strongest multi-file refactor and stack-trace debugging in our suite.
  2. Best-in-class caching economics. Cached input at $0.50 per 1M is the most aggressive rate among frontier providers.
  3. Output reliability. Lowest hallucination rate in our N9 (domain QA) suite, with grounded citations when given retrieved context.

Three weaknesses

  1. Tokenizer drift. The same prompts produce ~35% more tokens than on 4.6 while list prices are unchanged, so effective cost rose by roughly one-third without a price change.
  2. GPQA Diamond reasoning. 4-5 points behind Gemini 3.1 Pro on the toughest reasoning prompts.
  3. Tool-call edge cases. Schema compliance degrades under unusual function signatures vs GPT-5.5.

Architecture and Training

Anthropic publishes architectural details sparingly. What is public or strongly inferred from the model card and tokenizer changes:

  • Dense transformer, no MoE. Active parameter count not disclosed; scaling-law estimates put it in the 400-700B range.
  • New tokenizer. Anthropic shipped a fresh tokenizer in 4.7 with a different vocabulary balance. Empirically, the same input text produces ~35% more tokens than under 4.6 across English, code, and mixed inputs. The shift appears to favour rare-token coverage and non-English performance at the cost of common-English compression.
  • Thinking mode (extended reasoning) shipped with 4.7, billed at the output rate. The thinking budget is client-controllable.
  • Knowledge cutoff January 2026. Three months fresher than 4.6.
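
The drift bullet above is easy to sanity-check on your own corpus. A minimal sketch, assuming you can obtain token counts for the same text under both tokenizers (Anthropic exposes a token-counting endpoint; the counts below are illustrative, not measured):

```python
def token_drift(tokens_new: int, tokens_old: int) -> float:
    """Relative change in token count for the same text under two tokenizers."""
    return (tokens_new - tokens_old) / tokens_old

# Illustrative counts: a prompt that tokenized to 1,000 tokens under 4.6
# and 1,350 under 4.7 reproduces the ~35% drift reported above.
drift = token_drift(1350, 1000)
print(f"{drift:+.0%}")  # +35%
```

Run this over a representative sample of your own prompts before re-estimating token budgets; the drift varies with content mix.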

Pricing Reality

The list-price story and the production-cost story are different by a meaningful margin.

| Tier | Input ($/1M) | Output ($/1M) | Effective vs 4.6 |
|---|---|---|---|
| Standard | $5.00 | $25.00 | +35% (tokenizer) |
| Cached input | $0.50 | $25.00 | +35% on un-cached tokens |
| Batch (50% off) | $2.50 | $12.50 | +35%, then 50% off |
| Priority tier | $7.50 | $37.50 | +35%, then +50% for the SLA |

For a workload that previously ran at $1,000/month on Opus 4.6 with 30% cache-hit, the equivalent 4.7 spend is roughly $1,250/month at the same cache-hit rate. Aggressive caching ($0.50/1M cached input) recovers most of the drift on long system prompts — readers running RAG with stable retrieved context should see 4.7 cheaper than 4.6 in practice. Readers running short, varied prompts will pay the full drift.
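
The arithmetic behind that estimate can be sketched directly. The token volumes below are hypothetical. Because the 4.7 rates from the table above are applied both before and after drift, the increase here comes out at the full 35%; the report's smaller ~25% figure presumably folds in savings from 4.7's cheaper cached tier relative to 4.6, whose rates are not listed here.

```python
# Rates from the pricing table above ($/1M tokens).
P_IN, P_CACHED, P_OUT = 5.00, 0.50, 25.00
DRIFT = 1.35  # same text -> ~35% more tokens under the 4.7 tokenizer

def monthly_spend(input_m: float, output_m: float, cache_hit: float) -> float:
    """Dollar cost for a month of input_m / output_m million tokens."""
    blended_input = (1 - cache_hit) * P_IN + cache_hit * P_CACHED
    return input_m * blended_input + output_m * P_OUT

# Hypothetical workload: 100M input / 15M output tokens per month as measured
# by the 4.6 tokenizer, at a 30% cache-hit rate. Volumes scale by DRIFT on 4.7.
base = monthly_spend(100, 15, cache_hit=0.30)
on_47 = monthly_spend(100 * DRIFT, 15 * DRIFT, cache_hit=0.30)
print(f"${base:,.0f} of volume becomes ${on_47:,.0f} after tokenizer drift")
```

The takeaway matches the paragraph above: with prices held fixed, drift passes through linearly, so only a better cached rate or a higher cache-hit ratio claws it back.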

SMQTS Results — Programming Series

Weighted-blend score per category, 0-100. Tested at temperature 0.2, max_tokens 4096, no caching, thinking budget medium.

| Category | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | DeepSeek V4 Pro |
|---|---|---|---|---|
| P1 Multi-file refactor | 94 | 86 | 83 | 74 |
| P2 Bug-finding from stack trace | 92 | 87 | 84 | 78 |
| P3 Code review | 91 | 88 | 85 | 76 |
| P4 Test generation | 89 | 90 | 83 | 77 |
| P5 SQL from natural language | 87 | 89 | 91 | 82 |
| P6 Algorithm from spec | 93 | 89 | 88 | 79 |
| P7 Migration scripts | 92 | 83 | 80 | 71 |
| P8 Documentation | 90 | 88 | 85 | 78 |
| P9 Diff comprehension | 91 | 86 | 83 | 76 |
| P10 Tool-using agent loops | 89 | 92 | 85 | 74 |
| Average | 90.8 | 87.8 | 84.7 | 76.5 |

Opus 4.7 wins 7 of 10 programming categories. The three losses are P4 (test generation, where GPT-5.5 wins by 1 point, within rater noise), P5 (SQL from natural language, where Gemini 3.1 Pro wins by 4 points), and P10 (tool-using agent loops, where GPT-5.5 wins by 3 points, outside rater noise and a real result).

Programming-only headline

Opus 4.7    90.8   #############################################
GPT-5.5     87.8   ############################################
Gemini 3.1  84.7   ##########################################
DeepSeek V4 76.5   ######################################

SMQTS Results — Non-Programming Series

Opus 4.7 holds a respectable second place behind Gemini 3.1 Pro. The relative weaknesses are N3 (multi-step reasoning) and N5 (translation into low-resource languages, where Gemini's multilingual training advantage shows).

| Category | Opus 4.7 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|
| N1 Long-form drafting | 87 | 89 | 91 |
| N2 Summarization | 91 | 90 | 89 |
| N3 Multi-step reasoning | 83 | 94 | 88 |
| N4 Information extraction | 89 | 87 | 88 |
| N5 Translation | 76 | 92 | 84 |
| N6 Style transfer | 90 | 87 | 89 |
| N7 Adversarial resistance | 92 | 88 | 85 |
| N8 Structured output | 87 | 88 | 91 |
| N9 Domain QA | 90 | 89 | 87 |
| N10 Multi-turn coherence | 91 | 89 | 87 |
| Average | 87.6 | 89.3 | 87.9 |

SMQTS Results — Cost-Quality Validation

For workloads where Opus 4.7 wins on absolute quality, can a cheaper model substitute? Pairwise blind grading on the 50-prompt cost-quality sample:

| Workload | Opus 4.7 wins | DeepSeek V4 Pro wins | Tie |
|---|---|---|---|
| Multi-file refactor (P1) | 71% | 11% | 18% |
| SQL from NL (P5) | 34% | 32% | 34% |
| Information extraction (N4) | 27% | 31% | 42% |
| Summarization (N2) | 38% | 22% | 40% |

Procurement reading. Opus 4.7 is effectively irreplaceable on multi-file refactor (a 71% win rate against DeepSeek V4 Pro). On SQL, extraction, and summarization it is roughly substitutable by DeepSeek V4 Pro at one-tenth the cost. The cascade pattern wins here: route P1, P3, and P7 to Opus, and route P5, N2, and N4 to a cheaper tier.
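
The cascade can be sketched as a trivial router. The model identifiers and the category-to-tier mapping below are illustrative placeholders, not a production configuration:

```python
# Categories where the pairwise grading says Opus is clearly preferred,
# vs categories where a cheaper model is roughly substitutable.
PREMIUM = {"P1", "P3", "P7"}        # refactor, code review, migrations
SUBSTITUTABLE = {"P5", "N2", "N4"}  # SQL, summarization, extraction

def route(category: str) -> str:
    """Pick a model tier for a task category (names are hypothetical)."""
    if category in PREMIUM:
        return "claude-opus-4.7"
    if category in SUBSTITUTABLE:
        return "deepseek-v4-pro"
    return "claude-opus-4.7"  # default to quality for unmeasured categories

print(route("P1"), route("N4"))  # claude-opus-4.7 deepseek-v4-pro
```

In practice the router would key off a task classifier rather than hand-labelled categories, but the economics are the same: send only the irreplaceable work to the premium tier.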

Strengths in Detail

Multi-file refactor

On 6 prompts that require coordinated edits across 4-8 files, Opus 4.7 produced a fully-passing test suite on first attempt in 5/6 cases. The next-best model (GPT-5.5) achieved 3/6. DeepSeek V4 Pro produced plausible single-file edits but lost cross-file consistency in 4/6 cases.

Caching economics

At $0.50 per 1M cached input, Opus 4.7 has the most aggressive caching rate among frontier models. For RAG workloads with stable retrieved chunks (~30K tokens of context per call), this flips the cost story: 4.7 ends up cheaper than 4.6 in practice despite the tokenizer drift. Cache TTL is 5 minutes standard, 1 hour with explicit cache control.
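
When the cost story flips depends on the cache-hit rate. The report does not list 4.6's cached rate, so the $1.50/1M used for it below is a placeholder assumption; the sketch scans for the cache-hit rate at which 4.7's blended input price, after 35% drift, drops below 4.6's.

```python
DRIFT = 1.35          # ~35% more tokens under the 4.7 tokenizer
P_IN = 5.00           # list input rate ($/1M), assumed equal for 4.6 and 4.7
P_CACHED_47 = 0.50
P_CACHED_46 = 1.50    # assumption: 4.6's cached rate is not in the report

def blended_input(p_in: float, p_cached: float, hit: float) -> float:
    """Average $/1M input tokens at a given cache-hit rate."""
    return (1 - hit) * p_in + hit * p_cached

# Scan cache-hit rates in 5% steps for the input-cost break-even point.
for hit in [i / 100 for i in range(0, 101, 5)]:
    cost_46 = blended_input(P_IN, P_CACHED_46, hit)
    cost_47 = DRIFT * blended_input(P_IN, P_CACHED_47, hit)
    if cost_47 < cost_46:
        print(f"4.7 input cost dips below 4.6 at ~{hit:.0%} cache hit")
        break
```

Under these assumed rates the crossover lands at a high cache-hit rate, which is why the win is specific to RAG workloads with large, stable retrieved context; output tokens always pay the full drift.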

Output reliability

Lowest hallucination rate in N9 (3.2% across 240 graded responses, vs 5.1% for GPT-5.5 and 4.4% for Gemini 3.1 Pro on the same prompts). When Opus 4.7 does not know an answer, it is more likely than any rival to say so explicitly.

Weaknesses and Failure Modes

GPQA Diamond reasoning gap

Opus 4.7 scores ~89.5 on GPQA Diamond vs Gemini 3.1 Pro's 94.3. On graduate-level physics and biology questions where the answer requires multi-step derivation, Gemini wins consistently. For workloads where this matters (technical research, advanced tutoring), Opus 4.7 is the wrong choice.

Long-context citation degradation

As context approaches the top of the 500K window, citation accuracy in N9 starts to degrade. The model stays coherent and produces fluent answers, but the needle-in-a-haystack citations get fuzzier. For citation-critical workloads, set a soft cap comfortably below the 500K limit.
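
One way to honour such a soft cap in a retrieval pipeline is to drop the lowest-ranked chunks until the estimated context fits. This is a sketch of the pattern, not Anthropic API code: the 400K figure and the characters-per-token heuristic are illustrative assumptions.

```python
SOFT_CAP = 400_000  # tokens; an assumed margin below the 500K hard limit

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token); use a real token counter in production.
    return len(text) // 4

def fit_context(system: str, chunks: list[str]) -> list[str]:
    """Keep the best-ranked chunks that fit under the soft cap."""
    budget = SOFT_CAP - estimate_tokens(system)
    kept, used = [], 0
    for chunk in chunks:  # chunks assumed ordered best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Truncating best-first means the citations that survive are the ones the retriever ranked highest, which is usually the right trade for citation-critical work.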

Tool-call edge cases

On P10, GPT-5.5 wins by 3 weighted points. The specific failure mode: when the function schema has unusual variadic arguments or deeply nested object types, Opus 4.7 produces partially-malformed tool calls that fail at the parser. The recovery loop usually succeeds on retry, but the extra round-trip costs latency.

When to Use Opus 4.7

  • Code-heavy production agents doing multi-file refactor, code review, migration, or test generation.
  • RAG pipelines with stable retrieved context where caching makes the price story very favourable.
  • Domain QA with strict no-hallucination requirements (legal, medical, regulated). Opus 4.7's low fabrication rate is genuinely best-in-class.
  • Long-context summarization up to ~500K tokens.

When NOT to Use Opus 4.7

  • Graduate-level reasoning workloads. Use Gemini 3.1 Pro instead.
  • High-volume tool-calling agents with unusual function schemas. Use GPT-5.5.
  • Cost-sensitive bulk extraction or classification. DeepSeek V4 Pro is roughly substitutable at one-tenth the cost.
  • Translation into low-resource languages. Gemini wins on N5 by a wide margin.
  • Workloads with high token volume and short prompts. Caching cannot recover the tokenizer drift; effective cost rose 35%.

Comparison to Direct Rivals

vs GPT-5.5

| Dimension | Opus 4.7 | GPT-5.5 |
|---|---|---|
| Output price ($/1M) | $25 | $30 |
| Context window | 500K | 1M |
| Coding Arena Elo | 1567 | 1521 |
| Tool-call reliability | Strong | Best |
| Cached input price ($/1M) | $0.50 | $1.25 |

vs Gemini 3.1 Pro

| Dimension | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|
| Input price ($/1M) | $5.00 | $3.50 |
| Context window | 500K | 2M |
| GPQA Diamond | 89.5 | 94.3 |
| SWE-bench Pro | 64.3% | 58.1% |
| Translation (N5 score) | 76 | 92 |

Procurement Notes

Enterprise readiness

SOC 2 Type II, ISO 27001, HIPAA. Available on Anthropic direct, AWS Bedrock, Google Cloud Vertex. DPA, MSA, custom data retention controls available on Team and Enterprise plans. Mature.

Lock-in score

3.5 / 5 on the Swfte vendor-lock-in scoring (lower is easier to leave; see the vendor leaderboard). The specific costs to leave: Anthropic-flavoured XML prompt structure, proprietary tool-calling schema, the new tokenizer that means token budgets must be re-estimated when porting, and the caching economics that don't survive the move.

Contract leverage

Anthropic offers volume discounts on direct contracts above ~$100K/month. Bedrock and Vertex pricing is identical to direct list. The leverage point is committed-use discounts on Bedrock for AWS-aligned procurement, which can shave 15-20% off effective rate on multi-year deals.