Executive Summary
Claude Opus 4.7 is the best coding model on the public market in May 2026 and the most efficient frontier model to run with caching. It also ships a new tokenizer that quietly raises effective cost by about a third for most workloads, and it continues to lose to Gemini 3.1 Pro on raw reasoning depth. Procurement teams should treat the list price as a floor, not a quote.
Three strengths
- Coding Arena #1 (1567 Elo) with the strongest multi-file refactor and stack-trace debugging in our suite.
- Best-in-class caching economics. Cached input at $0.50 per 1M is the most aggressive rate among frontier providers.
- Output reliability. Lowest hallucination rate in our N9 (domain QA) suite, with grounded citations when given retrieved context.
Three weaknesses
- Tokenizer drift. The same prompts produce ~35% more tokens than on 4.6 at an unchanged list price, so effective cost rose by roughly one-third without a price change.
- GPQA Diamond reasoning. 4-5 points behind Gemini 3.1 Pro on the toughest reasoning prompts.
- Tool-call edge cases. Schema compliance degrades under unusual function signatures vs GPT-5.5.
Architecture and Training
Anthropic publishes architectural details sparingly. What is public or strongly inferred from the model card and tokenizer changes:
- Dense transformer, no MoE. Parameter count not disclosed; scaling-law estimates put it in the 400-700B range.
- New tokenizer. Anthropic shipped a fresh tokenizer with 4.7 with a different vocabulary balance. Empirically, the same input text produces ~35% more tokens than 4.6 across English, code, and mixed inputs. The shift appears to favour rare-token coverage and non-English performance, at the cost of common-English compression.
- Thinking mode (extended reasoning) shipped with 4.7, billed at the output rate. The thinking budget is client-controllable; a request sketch follows this list.
- Knowledge cutoff January 2026. Three months fresher than 4.6.
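A minimal request sketch for that client-controlled thinking budget, assuming the extended-thinking request shape from earlier Claude releases carries over unchanged; the model id is hypothetical, and thinking tokens bill at the $25/1M output rate.

```python
import anthropic

client = anthropic.Anthropic()

# Cap reasoning spend explicitly: thinking tokens count toward max_tokens and
# are billed at the output rate, so budget_tokens doubles as a cost control.
response = client.messages.create(
    model="claude-opus-4-7",                              # hypothetical model id
    max_tokens=16000,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Derive the closed form of T(n) = 2T(n/2) + n."}],
)

# Thinking blocks and the final answer come back as separate content blocks.
answer = next(block.text for block in response.content if block.type == "text")
print(answer)
```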
Pricing Reality
The list-price story and the production-cost story are different by a meaningful margin.
| Tier | Input ($/1M) | Output ($/1M) | Effective cost vs 4.6 |
|---|---|---|---|
| Standard | $5.00 | $25.00 | +35% (tokenizer drift) |
| Cached input | $0.50 | $25.00 | +35% on uncached tokens |
| Batch (50% off) | $2.50 | $12.50 | +35%, then 50% batch discount |
| Priority tier | $7.50 | $37.50 | +35%, then +50% priority premium |
For a workload that previously ran at $1,000/month on Opus 4.6 with 30% cache-hit, the equivalent 4.7 spend is roughly $1,250/month at the same cache-hit rate. Aggressive caching ($0.50/1M cached input) recovers most of the drift on long system prompts — readers running RAG with stable retrieved context should see 4.7 cheaper than 4.6 in practice. Readers running short, varied prompts will pay the full drift.
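For teams sanity-checking the drift against their own traffic, a minimal what-if sketch using the list prices above; the traffic shape, the 30% cache-hit figure, and the assumption that the ~35% drift applies uniformly to input and output are placeholders to replace with measured numbers.

```python
# 4.7 list prices from the table above, in dollars per 1M tokens.
INPUT, CACHED_INPUT, OUTPUT = 5.00, 0.50, 25.00
DRIFT = 1.35   # assumed: ~35% more 4.7 tokens for the same text as 4.6

def monthly_spend(input_m: float, output_m: float, cache_hit: float) -> float:
    """Dollars per month; token volumes in millions of 4.7 tokens."""
    uncached = input_m * (1 - cache_hit)
    cached = input_m * cache_hit
    return uncached * INPUT + cached * CACHED_INPUT + output_m * OUTPUT

# Workload measured in 4.6 tokens, re-expressed in 4.7 tokens via DRIFT.
input_46, output_46 = 100.0, 20.0            # assumed shape, millions per month
spend = monthly_spend(input_46 * DRIFT, output_46 * DRIFT, cache_hit=0.30)
print(f"estimated 4.7 spend: ${spend:,.0f}/month")
```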
SMQTS Results — Programming Series
Weighted-blend score per category, 0-100. Tested at temperature 0.2, max_tokens 4096, no caching, thinking budget medium.
| Category | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | DeepSeek V4 Pro |
|---|---|---|---|---|
| P1 Multi-file refactor | 94 | 86 | 83 | 74 |
| P2 Bug-finding from stack trace | 92 | 87 | 84 | 78 |
| P3 Code review | 91 | 88 | 85 | 76 |
| P4 Test generation | 89 | 90 | 83 | 77 |
| P5 SQL from natural language | 87 | 89 | 91 | 82 |
| P6 Algorithm from spec | 93 | 89 | 88 | 79 |
| P7 Migration scripts | 92 | 83 | 80 | 71 |
| P8 Documentation | 90 | 88 | 85 | 78 |
| P9 Diff comprehension | 91 | 86 | 83 | 76 |
| P10 Tool-using agent loops | 89 | 92 | 85 | 74 |
| Weighted average | 91.2 | 87.6 | 84.6 | 76.5 |
Opus 4.7 wins 7 of 10 programming categories. The losses are P4 (test generation, where GPT-5.5 wins by 1 point, within rater noise), P5 (SQL from natural language, where Gemini 3.1 Pro wins by 4 points), and P10 (tool-using agent loops, where GPT-5.5 wins by 3 points, outside rater noise and a real result).
Programming-only headline
```
Opus 4.7      91.2  ##############################################
GPT-5.5       87.6  ############################################
Gemini 3.1    84.6  ##########################################
DeepSeek V4   76.5  ######################################
```
SMQTS Results — Non-Programming Series
Opus 4.7 lands a close third, 0.3 points behind GPT-5.5 and 1.7 points behind Gemini 3.1 Pro. The relative weaknesses are N3 (multi-step reasoning) and N5 (translation into low-resource languages, where Gemini's multilingual training advantage shows).
| Category | Opus 4.7 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|
| N1 Long-form drafting | 87 | 89 | 91 |
| N2 Summarization | 91 | 90 | 89 |
| N3 Multi-step reasoning | 83 | 94 | 88 |
| N4 Information extraction | 89 | 87 | 88 |
| N5 Translation | 76 | 92 | 84 |
| N6 Style transfer | 90 | 87 | 89 |
| N7 Adversarial resistance | 92 | 88 | 85 |
| N8 Structured output | 87 | 88 | 91 |
| N9 Domain QA | 90 | 89 | 87 |
| N10 Multi-turn coherence | 91 | 89 | 87 |
| Average | 87.6 | 89.3 | 87.9 |
SMQTS Results — Cost-Quality Validation
For workloads where Opus 4.7 wins on absolute quality, can a cheaper model substitute? Pairwise blind grading on the 50-prompt cost-quality sample:
| Workload | Opus 4.7 wins | DeepSeek V4 Pro wins | Tie |
|---|---|---|---|
| Multi-file refactor (P1) | 71% | 11% | 18% |
| SQL from NL (P5) | 34% | 32% | 34% |
| Information extraction (N4) | 27% | 31% | 42% |
| Summarization (N2) | 38% | 22% | 40% |
Procurement reading. Opus 4.7 is irreplaceable on multi-file refactor and code review. It is roughly substitutable by DeepSeek V4 Pro on SQL, extraction, and summarization at one-tenth the cost. The cascade pattern wins here — route P1, P3, P7 to Opus, route P5, N2, N4 to a cheaper tier.
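A sketch of that cascade as a routing table; the category codes are the SMQTS labels used in this report, and the model identifiers are placeholders rather than confirmed API names.

```python
# Route premium coding work to Opus 4.7 and commodity text work to a cheaper tier.
PREMIUM = {"P1", "P3", "P7"}       # multi-file refactor, code review, migration scripts
COMMODITY = {"P5", "N2", "N4"}     # SQL from NL, summarization, information extraction

def pick_model(category: str) -> str:
    if category in PREMIUM:
        return "claude-opus-4-7"    # hypothetical model id
    if category in COMMODITY:
        return "deepseek-v4-pro"    # hypothetical model id
    return "claude-opus-4-7"        # default to quality when the category is unmapped

assert pick_model("P1") == "claude-opus-4-7"
assert pick_model("N2") == "deepseek-v4-pro"
```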
Strengths in Detail
Multi-file refactor
On 6 prompts that require coordinated edits across 4-8 files, Opus 4.7 produced a fully-passing test suite on first attempt in 5/6 cases. The next-best model (GPT-5.5) achieved 3/6. DeepSeek V4 Pro produced plausible single-file edits but lost cross-file consistency in 4/6 cases.
Caching economics
At $0.50 per 1M cached input, Opus 4.7 has the most aggressive caching rate among frontier models. For RAG workloads with stable retrieved chunks (~30K tokens of context per call), this flips the cost story: 4.7 ends up cheaper than 4.6 in practice despite the tokenizer drift. Cache TTL is 5 minutes standard, 1 hour with explicit cache control.
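A minimal sketch of a stable-context RAG call with prompt caching, assuming the cache_control request shape from earlier Claude releases carries over to 4.7; the model id and the placeholder strings are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

stable_context = "<retrieved chunks, ~30K tokens, identical across calls>"  # placeholder
question = "Which clause governs early termination?"                        # placeholder

response = client.messages.create(
    model="claude-opus-4-7",                          # hypothetical model id
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer only from the provided context."},
        {
            "type": "text",
            "text": stable_context,
            # Marks the stable block cacheable: 5-minute TTL by default,
            # longer with explicit cache control per the note above.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
```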
Output reliability
Lowest hallucination rate in N9 (3.2% across 240 graded responses, vs 5.1% for GPT-5.5 and 4.4% for Gemini 3.1 Pro on the same prompts). When Opus 4.7 does not know an answer, it is more likely than any rival to say so explicitly.
Weaknesses and Failure Modes
GPQA Diamond reasoning gap
Opus 4.7 scores ~89.5 on GPQA Diamond vs Gemini 3.1 Pro's 94.3. On graduate-level physics and biology questions where the answer requires multi-step derivation, Gemini wins consistently. For workloads where this matters (technical research, advanced tutoring), Opus 4.7 is the wrong choice.
Long-context citation degradation
As the context fills toward the 500K window limit, citation accuracy in N9 starts to degrade. The model stays coherent and produces fluent answers, but the needle-in-a-haystack citation gets fuzzier. Cap citation-critical workloads comfortably below the full window rather than filling it.
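A rough guard for that headroom; the 400K cap and the characters-per-token heuristic are assumptions to tune against your own measurements, or to replace with a real count from the provider's token-counting endpoint.

```python
SOFT_CAP_TOKENS = 400_000   # assumed headroom below the 500K window
CHARS_PER_TOKEN = 3.5       # rough heuristic; 4.7 packs fewer characters per token than 4.6

def trim_context(chunks: list[str]) -> list[str]:
    """Keep the highest-ranked chunks until the estimated token budget is spent."""
    kept, budget = [], SOFT_CAP_TOKENS * CHARS_PER_TOKEN
    for chunk in chunks:                  # chunks assumed pre-sorted by retrieval score
        if len(chunk) > budget:
            break
        kept.append(chunk)
        budget -= len(chunk)
    return kept
```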
Tool-call edge cases
On P10, GPT-5.5 wins by 3 weighted points. The specific failure mode: when the function schema has unusual variadic arguments or deeply nested object types, Opus 4.7 produces partially-malformed tool calls that fail at the parser. The recovery loop usually succeeds on retry, but the extra round-trip costs latency.
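The client-side mitigation is a validate-then-retry loop. The sketch below illustrates that pattern, not a provider-side mechanism: the schema is a toy example, and call_model stands in for your own model-invocation function.

```python
import json
from jsonschema import ValidationError, validate

TOOL_SCHEMA = {   # toy schema; real agent schemas are larger and more nested
    "type": "object",
    "properties": {"name": {"type": "string"}, "arguments": {"type": "object"}},
    "required": ["name", "arguments"],
}

def run_tool_call(call_model, prompt: str, max_retries: int = 1) -> dict:
    """call_model(prompt) -> raw tool-call text from whichever model you route to."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            call = json.loads(raw)
            validate(instance=call, schema=TOOL_SCHEMA)   # rejects partially-malformed calls
            return call
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the parser error back and ask for a corrected call; this is the
            # extra round-trip that costs latency.
            prompt += f"\n\nThe previous tool call failed to parse ({err}). Emit a corrected call."
    raise RuntimeError("tool call still malformed after retry")
```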
When to Use Opus 4.7
- Code-heavy production agents doing multi-file refactor, code review, migration, or test generation.
- RAG pipelines with stable retrieved context where caching makes the price story very favourable.
- Domain QA with strict no-hallucination requirements (legal, medical, regulated). Opus 4.7's low fabrication rate is genuinely best-in-class.
- Long-context summarization up to ~500K tokens.
When NOT to Use Opus 4.7
- Graduate-level reasoning workloads. Use Gemini 3.1 Pro instead.
- High-volume tool-calling agents with unusual function schemas. Use GPT-5.5.
- Cost-sensitive bulk extraction or classification. DeepSeek V4 Pro is roughly substitutable at one-tenth the cost.
- Translation into low-resource languages. Gemini wins on N5 by a wide margin.
- Workloads with high token volume and short prompts. Caching cannot recover the tokenizer drift; effective cost rose 35%.
Comparison to Direct Rivals
vs GPT-5.5
| Dimension | Opus 4.7 | GPT-5.5 |
|---|---|---|
| Output price ($/1M) | $25 | $30 |
| Context window | 500K | 1M |
| Coding Arena Elo | 1567 | 1521 |
| Tool-call reliability | Strong | Best |
| Cached input price | $0.50 | $1.25 |
vs Gemini 3.1 Pro
| Dimension | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|
| Input price ($/1M) | $5.00 | $3.50 |
| Context window | 500K | 2M |
| GPQA Diamond | 89.5 | 94.3 |
| SWE-bench Pro | 64.3% | 58.1% |
| Translation N5 score | 76 | 92 |
Procurement Notes
Enterprise readiness
SOC 2 Type II, ISO 27001, HIPAA. Available on Anthropic direct, AWS Bedrock, Google Cloud Vertex. DPA, MSA, custom data retention controls available on Team and Enterprise plans. Mature.
Lock-in score
3.5 / 5 on the Swfte vendor-lock-in scoring (lower is easier to leave; see the vendor leaderboard). The specific costs to leave: Anthropic-flavoured XML prompt structure, a proprietary tool-calling schema, a new tokenizer that forces token budgets to be re-estimated when porting, and caching economics that don't survive the move.
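Re-estimating those token budgets is mechanical but easy to forget. A sketch, assuming the SDK's token-counting endpoint behaves for 4.7 as in earlier releases; the model id, the target encoding, and the sample prompts are illustrative stand-ins.

```python
import anthropic
import tiktoken

client = anthropic.Anthropic()
target_enc = tiktoken.get_encoding("o200k_base")   # stand-in for the target vendor's tokenizer

def opus_tokens(prompt: str) -> int:
    count = client.messages.count_tokens(
        model="claude-opus-4-7",                   # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
    )
    return count.input_tokens

samples = [                                        # use real production prompts, not these
    "Refactor the retry logic in billing/worker.py to use exponential backoff.",
    "Summarise the attached incident report for an executive audience.",
]
ratios = [len(target_enc.encode(p)) / opus_tokens(p) for p in samples]
print(f"rescale 4.7 token budgets by ~{sum(ratios) / len(ratios):.2f}x on the target stack")
```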
Contract leverage
Anthropic offers volume discounts on direct contracts above ~$100K/month. Bedrock and Vertex pricing is identical to the direct list. The leverage point is committed-use discounts on Bedrock for AWS-aligned procurement, which can shave 15-20% off the effective rate on multi-year deals.