Executive Summary
GPT-5.5 "Spud" is the model you pick when you do not know exactly what your workload looks like. It rarely wins any individual category outright but it never loses badly on any category — that is what the leading AAII score measures and what makes it the safest default for general-purpose production traffic. The Pro variant exists for one reason: to capture the customers who need the absolute strongest reasoning at any price.
Three strengths
- Highest AAII (59). The most reliable generalist in the market.
- Tool-use reliability leader. Best schema compliance under unusual function signatures, lowest tool error rate in long agent loops.
- Long-form drafting and structured output. Wins N1 (long-form) and N8 (JSON schema) outright.
Three weaknesses
- Pricing. $30 output is the highest among non-Pro frontier; $180 Pro output sets a new market ceiling.
- Multi-file refactor. Loses to Claude Opus 4.7 by a meaningful margin on code-spanning tasks.
- Hallucination on N9 (domain QA). 5.1% fabrication rate, highest of the frontier four. The model will confidently invent citations under retrieval pressure.
Architecture and Training
- Mixture-of-experts. Active parameter count not disclosed; community estimates put it in the 100-200B active range from a much larger total. The "Spud" codename references the project's internal staging.
- Two variants. GPT-5.5 standard and GPT-5.5 Pro share the base weights; Pro runs with a much larger thinking budget and additional verifier passes.
- Tokenizer. Carries forward the GPT-4 family tokenizer (cl100k variant). No drift to manage when upgrading from GPT-4o or GPT-5; see the token-count sketch after this list.
- Knowledge cutoff February 2026. Tied for the freshest among closed-frontier models.
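Because the tokenizer carries over, any token-count instrumentation or cost estimator built against GPT-4-era models keeps working as-is. A minimal sketch, assuming the cl100k variant named above is compatible with the open-source tiktoken cl100k_base encoding (an assumption, since the exact variant is not specified):

```python
# Sketch: token counts (and therefore cost estimates) carry over from GPT-4-era
# models because the encoding is the same. Assumes the "cl100k variant" above is
# compatible with tiktoken's cl100k_base; rates are the standard tier below.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimated_call_cost(prompt: str, expected_output_tokens: int,
                        input_rate: float = 5.00, output_rate: float = 30.00) -> float:
    """Rough $ per call, with rates expressed in $/1M tokens."""
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * input_rate + expected_output_tokens * output_rate) / 1_000_000

print(estimated_call_cost("Summarize the attached incident report.", expected_output_tokens=800))
```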
Pricing Reality
| Tier | Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|
| GPT-5.5 standard | $5.00 | $30.00 | Default for most workloads |
| GPT-5.5 cached input | $1.25 | $30.00 | 75% off uncached input |
| GPT-5.5 batch | $2.50 | $15.00 | 50% off, async only |
| GPT-5.5 Pro | $30.00 | $180.00 | 6x standard input, 6x standard output |
| GPT-5.5 Pro cached | $7.50 | $180.00 | 75% off uncached Pro input |
Pro tier reality check. The $180 output rate also applies to the chain-of-thought: reasoning tokens are billed at the output rate. A single Pro answer to a hard reasoning prompt, thinking trace included, can run $0.50-$2 per call. At 50K Pro calls per month, that is a $25K-$100K monthly bill. The Pro variant is not a default; it is a per-call choice for the hardest 1-3% of traffic.
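The $0.50-$2 per-call range falls straight out of the Pro rates once reasoning tokens are counted as output. A back-of-the-envelope sketch; the token counts are illustrative assumptions, not benchmark measurements:

```python
# Sketch: Pro per-call and monthly cost at the $30/$180 per-1M rates quoted above.
# Token counts below are illustrative assumptions, not measured values.
PRO_INPUT_RATE = 30.00 / 1_000_000    # $ per input token
PRO_OUTPUT_RATE = 180.00 / 1_000_000  # $ per output token (reasoning tokens bill here too)

def pro_call_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    return (input_tokens * PRO_INPUT_RATE
            + (visible_output_tokens + reasoning_tokens) * PRO_OUTPUT_RATE)

light = pro_call_cost(2_000, 1_000, 1_500)   # ~$0.51
heavy = pro_call_cost(5_000, 2_000, 9_000)   # ~$2.13
print(f"per call: ${light:.2f} - ${heavy:.2f}")
print(f"50K calls/month: ${50_000 * light:,.0f} - ${50_000 * heavy:,.0f}")
```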
SMQTS Results — Programming Series
| Category | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro | DeepSeek V4 Pro |
|---|---|---|---|---|
| P1 Multi-file refactor | 86 | 94 | 83 | 74 |
| P2 Bug-finding from stack trace | 87 | 92 | 84 | 78 |
| P3 Code review | 88 | 91 | 85 | 76 |
| P4 Test generation | 90 | 89 | 83 | 77 |
| P5 SQL from natural language | 89 | 87 | 91 | 82 |
| P6 Algorithm from spec | 89 | 93 | 88 | 79 |
| P7 Migration scripts | 83 | 92 | 80 | 71 |
| P8 Documentation | 88 | 90 | 85 | 78 |
| P9 Diff comprehension | 86 | 91 | 83 | 76 |
| P10 Tool-using agent loops | 92 | 89 | 85 | 74 |
| Average | 87.8 | 90.8 | 84.7 | 76.5 |
GPT-5.5 wins P4 (test generation) and P10 (tool-using agent loops) outright. On P1, P2, P3, P6, P7, P8, P9 it loses to Opus 4.7. On P5 it loses to Gemini. Solid second place.
SMQTS Results — Non-Programming Series
| Category | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| N1 Long-form drafting | 91 | 87 | 89 |
| N2 Summarization | 89 | 91 | 90 |
| N3 Multi-step reasoning | 88 | 83 | 94 |
| N4 Information extraction | 88 | 89 | 87 |
| N5 Translation | 84 | 76 | 92 |
| N6 Style transfer | 89 | 90 | 87 |
| N7 Adversarial resistance | 85 | 92 | 88 |
| N8 Structured output | 91 | 87 | 88 |
| N9 Domain QA | 87 | 90 | 89 |
| N10 Multi-turn coherence | 87 | 91 | 89 |
| Average | 87.9 | 87.6 | 89.3 |
AAII headline (composite)
```
GPT-5.5      59.0  ##############################
Gemini 3.1   58.4  #############################
Opus 4.7     58.1  #############################
DeepSeek V4  54.7  ###########################
Gemma 4 27B  47.2  #######################
```
SMQTS Results — Cost-Quality Validation
Pairwise blind grading of GPT-5.5 standard against DeepSeek V4 Pro, the cheaper substitute, on the 50-prompt sample:
| Workload | GPT-5.5 wins | DeepSeek V4 Pro wins | Tie |
|---|---|---|---|
| Long-form drafting (N1) | 56% | 21% | 23% |
| Tool-using agent loops (P10) | 67% | 13% | 20% |
| Information extraction (N4) | 22% | 34% | 44% |
| Structured JSON output (N8) | 41% | 28% | 31% |
GPT-5.5 Pro vs GPT-5.5 standard, blind-graded on the hardest subsets of three categories:
| Workload | Pro wins | Standard wins | Tie |
|---|---|---|---|
| N3 Hard reasoning subset | 61% | 14% | 25% |
| P6 Algorithm from spec (hard) | 43% | 22% | 35% |
| P10 Tool loops (10+ turn) | 34% | 30% | 36% |
Procurement reading. Pro pays for itself only on the hardest reasoning subset, and the gap shrinks fast as prompts get easier. For a workload routing 95% of tokens to standard and 5% to Pro, the blended rate works out to roughly $6.25 input / $37.50 output per 1M; still expensive, but defensible if reasoning quality is critical.
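The blend is a one-line calculation against the pricing table; a quick sketch, where the 95/5 split is the only assumption:

```python
# Sketch: blended $/1M rates for a standard+Pro routing split, using the pricing
# table above. The 5% Pro share is the routing assumption from the text.
STANDARD = {"input": 5.00, "output": 30.00}   # $/1M tokens
PRO = {"input": 30.00, "output": 180.00}

def blended_rate(pro_share: float) -> dict:
    """Token-weighted blend: pro_share is the fraction of tokens served by Pro."""
    return {k: (1 - pro_share) * STANDARD[k] + pro_share * PRO[k] for k in STANDARD}

print(blended_rate(0.05))   # {'input': 6.25, 'output': 37.5}
```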
Strengths in Detail
Tool-use reliability
GPT-5.5 produces valid tool calls on the first attempt 97.4% of the time across our P10 prompts, including unusual variadic and deeply nested function schemas. The next-best model (Opus 4.7) hits 94.2%. For high-volume agentic workloads, that 3.2-percentage-point gap is meaningful: it is the difference between a clean loop and a retry round-trip, and the retries compound in latency and cost.
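Each invalid call means another full model turn, so the retry path is what that 3.2-point gap actually prices. A minimal sketch of the validate-and-retry pattern, assuming a hypothetical call_model wrapper and a tool whose arguments are checked against its declared JSON Schema with the jsonschema package:

```python
# Sketch: validate a model's tool-call arguments against the tool's declared
# JSON Schema, retrying on failure. `call_model` and the tool schema are
# hypothetical stand-ins, not part of any published API.
import json
import jsonschema

TOOL_PARAMETER_SCHEMA = {
    "type": "object",
    "properties": {"ticket_id": {"type": "string"}, "priority": {"type": "integer"}},
    "required": ["ticket_id"],
    "additionalProperties": False,
}

def get_valid_tool_call(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        raw_args = call_model(prompt, error_hint=last_error)  # hypothetical wrapper
        try:
            args = json.loads(raw_args)
            jsonschema.validate(args, TOOL_PARAMETER_SCHEMA)
            return args  # the first-attempt success path is the 97.4% case
        except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
            last_error = str(exc)  # each retry is another full model turn: latency + cost
    raise RuntimeError(f"tool call never validated: {last_error}")
```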
Long-form drafting
On N1 (3,000-word article from outline), GPT-5.5 wins outright with the strongest faithfulness-to-outline plus voice consistency. The competing Anthropic and Google models are fluent but tend to drift away from the bullet structure partway through.
Structured output
N8 winner. Schema-strict JSON mode produces fully valid output 98.1% of the time across our N8 prompts. Opus 4.7 is at 95.4% and Gemini at 96.7%.
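The 98.1% figure is specific to schema-strict mode, which is worth requesting explicitly rather than relying on prompt instructions. A minimal sketch, assuming GPT-5.5 is served through the same Chat Completions structured-output interface as current OpenAI models; the model identifier is a placeholder taken from this report and the invoice schema is illustrative:

```python
# Sketch: requesting schema-strict JSON output via the Chat Completions
# structured-output interface. Assumes GPT-5.5 is exposed the same way as
# current OpenAI models; the model name and schema are illustrative.
from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_usd": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total_usd", "line_items"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-5.5",  # placeholder identifier from this report
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
)
print(response.choices[0].message.content)  # JSON string conforming to invoice_schema on the valid path
```

Under the current interface, strict mode expects every property to appear in required and additionalProperties to be false, which is why the schema above is written that way.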
Weaknesses and Failure Modes
Multi-file refactor loss to Opus
GPT-5.5 produces high-quality single-file edits but loses cross-file consistency more often than Opus 4.7 on P1. The specific failure mode: it identifies the changes to make but applies them inconsistently across the affected files, leaving tests broken in 3-4 places.
Domain QA fabrication
N9 fabrication rate is 5.1%, highest of the frontier four. The model is more likely to confidently invent a citation under retrieval pressure than Opus 4.7 (3.2%) or Gemini (4.4%). This is the single biggest reason to look elsewhere for regulated-domain QA.
Cost
Standard-tier $30 output is the highest among non-Pro frontier models. Pro-tier $180 output is 6x the standard tier and roughly 7x Opus 4.7's $25 rate. For high-volume workloads, GPT-5.5 is the costly choice unless you have a specific reason to need it.
When to Use GPT-5.5
- General-purpose default. If you do not know what your workload mix will look like, GPT-5.5 standard is the lowest-regret pick.
- Tool-using agents. Especially in production, where a 3.2 pp tool-call success advantage compounds.
- Long-form drafting and JSON-strict output. The category winners.
- GPT-5.5 Pro: hardest reasoning only. Deploy as a per-call escalation tier, not a default.
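A per-call escalation tier can be as thin as a try-standard-first wrapper. A minimal sketch; call_model, passes_checks, and the model identifiers are placeholder assumptions rather than a prescribed design:

```python
# Sketch: per-call escalation. Try the standard tier first; escalate the same
# prompt to Pro only when a cheap downstream check fails. call_model,
# passes_checks, and the model identifiers are hypothetical placeholders.
def answer(prompt: str) -> str:
    draft = call_model("gpt-5.5", prompt)       # standard tier handles the bulk of traffic
    if passes_checks(draft):                    # e.g. schema validation, unit tests, self-consistency
        return draft
    return call_model("gpt-5.5-pro", prompt)    # escalate only the hardest residue
```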
When NOT to Use GPT-5.5
- Multi-file refactor / migration scripts. Use Claude Opus 4.7.
- Translation, especially low-resource languages. Use Gemini 3.1 Pro.
- Strict-fidelity domain QA where one hallucinated citation is unacceptable. Use Opus 4.7.
- Cost-sensitive bulk extraction. DeepSeek V4 Pro at one-tenth the cost is roughly substitutable.
Comparison to Direct Rivals
vs Claude Opus 4.7
| Dimension | GPT-5.5 | Opus 4.7 |
|---|---|---|
| Output price ($/1M) | $30 | $25 |
| Context window | 1M | 500K |
| AAII Index | 59 | 58.1 |
| Tool-call success | 97.4% | 94.2% |
| Hallucination rate (N9) | 5.1% | 3.2% |
vs Gemini 3.1 Pro
| Dimension | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|
| Input price ($/1M) | $5.00 | $3.50 |
| Context window | 1M | 2M |
| GPQA Diamond | 89.7 | 94.3 |
| Long-form drafting (N1) | 91 | 89 |
| Translation (N5) | 84 | 92 |
Procurement Notes
Enterprise readiness
The most mature procurement story in the market. Available via OpenAI direct, Azure OpenAI, Microsoft 365 Copilot. Full SOC 2 Type II, ISO 27001, HIPAA, PCI DSS available. Custom data retention and zero-day-retention configurations standard on Enterprise plans.
Lock-in score
4.0 / 5, the highest in the closed-frontier tier. Specific switching costs: a function-calling schema shape that does not map cleanly onto Anthropic's tool schema, persistent quirks in JSON output handling that downstream code comes to expect, and Azure-specific IAM coupling on the gateway path. Swfte Connect exists precisely to abstract this away.
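The first of those switching costs is mechanical and easy to see concretely. A minimal sketch of the adapter a gateway layer (Swfte Connect or otherwise) performs, using the tool-definition shapes from the current public OpenAI Chat Completions and Anthropic Messages APIs; the example tool is illustrative and both shapes can drift across versions:

```python
# Sketch: translating an OpenAI-style tool definition into Anthropic's shape.
# Field names reflect the current public Chat Completions and Messages APIs;
# the example tool itself is illustrative.
def openai_tool_to_anthropic(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],   # same JSON Schema body, different key
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

print(openai_tool_to_anthropic(openai_tool))
```

The JSON Schema body survives the translation; what does not carry over is everything downstream code has learned to expect about how each model fills it in, which is where the real switching cost sits.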
Contract leverage
OpenAI's direct enterprise tier offers volume discounts starting at ~$50K/month in spend, with significantly better terms at $500K+. Azure adds committed-use discount leverage. The Pro tier is rarely negotiable on price; OpenAI treats it as a premium per-call tier rather than a contract line item.