SMQTS v1.3 · Pinned 2026-04-23

GPT-5.5 "Spud" — Deep Dive Research Report (May 2026)

The strongest generalist. The most reliable tool-caller. The most expensive frontier API ever launched.

Model Snapshot

Released           2026-04-23
License            Closed
Context            1M tokens
Knowledge cutoff   Feb 2026
Input price        $5 / 1M
Output price       $30 / 1M
Pro variant        $30 / $180
AAII score         59 (#1)

Executive Summary

GPT-5.5 "Spud" is the model you pick when you do not know exactly what your workload looks like. It rarely wins any individual category outright but it never loses badly on any category — that is what the leading AAII score measures and what makes it the safest default for general-purpose production traffic. The Pro variant exists for one reason: to capture the customers who need the absolute strongest reasoning at any price.

Three strengths

  1. Highest AAII (59). The most reliable generalist in the market.
  2. Tool-use reliability leader. Best schema compliance under unusual function signatures, lowest tool error rate in long agent loops.
  3. Long-form drafting and structured output. Wins N1 (long-form) and N8 (JSON schema) outright.

Three weaknesses

  1. Pricing. $30 output is the highest among non-Pro frontier models; the $180 Pro output rate sets a new market ceiling.
  2. Multi-file refactor. Loses to Claude Opus 4.7 by a meaningful margin on code-spanning tasks.
  3. Hallucination on N9 (domain QA). 5.1% fabrication rate, highest of the frontier four. The model will confidently invent citations under retrieval pressure.

Architecture and Training

  • Mixture-of-experts. Active parameter count not disclosed; community estimates put it in the 100-200B active range from a much larger total. The "Spud" codename references the project's internal staging.
  • Two variants. GPT-5.5 standard and GPT-5.5 Pro share the base weights; Pro runs with a much larger thinking budget and additional verifier passes.
  • Tokenizer. Carries forward the GPT-4 family tokenizer (cl100k variant), so there is no tokenizer drift to manage when upgrading from GPT-4o or GPT-5 (see the quick check after this list).
  • Knowledge cutoff February 2026. Tied for the freshest among closed-frontier models.
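
If the cl100k lineage holds, prompt-length accounting carries over unchanged. Here is a quick check with the tiktoken library, assuming the "cl100k variant" corresponds to tiktoken's cl100k_base encoding; the prompt string is just an example:

    import tiktoken

    # Assumes the "cl100k variant" is tiktoken's cl100k_base encoding;
    # if so, token budgets measured against GPT-4o carry over unchanged.
    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "Summarize the attached incident report in three bullet points."
    print(len(enc.encode(prompt)))  # same count you budgeted for GPT-4o / GPT-5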

Pricing Reality

Tier                   Input ($/1M)   Output ($/1M)   Notes
GPT-5.5 standard           $5.00          $30.00      Default for most workloads
GPT-5.5 cached input       $1.25          $30.00      4x cheaper than uncached input
GPT-5.5 batch              $2.50          $15.00      50% off, async only
GPT-5.5 Pro               $30.00         $180.00      6x standard input, 6x standard output
GPT-5.5 Pro cached         $7.50         $180.00      4x cheaper than uncached Pro input

Pro tier reality check. The $180 output rate compounds with reasoning models' practice of billing the chain-of-thought at the output rate. A single Pro answer to a hard reasoning prompt, thinking trace included, can run $0.50-$2 per call. At 50K such calls per month, the bill lands between $25K and $100K. The Pro variant is not a default; it is a per-call choice for the hardest 1-3% of traffic.
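
A rough per-call cost model makes that arithmetic explicit. The token counts below are illustrative assumptions; the structural point is that the thinking trace bills at the $180/1M output rate:

    # Rough cost of one GPT-5.5 Pro call. Token counts are illustrative
    # assumptions; reasoning ("thinking") tokens bill at the output rate.

    PRO_IN, PRO_OUT = 30.00, 180.00  # $/1M tokens

    def pro_call_cost(prompt_toks: int, thinking_toks: int, answer_toks: int) -> float:
        return (prompt_toks * PRO_IN + (thinking_toks + answer_toks) * PRO_OUT) / 1e6

    print(f"${pro_call_cost(2_000, 2_500, 800):.2f}")    # ~$0.65, light reasoning
    print(f"${pro_call_cost(5_000, 9_000, 1_500):.2f}")  # ~$2.04, heavy reasoning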

SMQTS Results — Programming Series

Category                          GPT-5.5   Opus 4.7   Gemini 3.1 Pro   DeepSeek V4 Pro
P1 Multi-file refactor                 86         94               83                74
P2 Bug-finding from stack trace        87         92               84                78
P3 Code review                         88         91               85                76
P4 Test generation                     90         89               83                77
P5 SQL from natural language           89         87               91                82
P6 Algorithm from spec                 89         93               88                79
P7 Migration scripts                   83         92               80                71
P8 Documentation                       88         90               85                78
P9 Diff comprehension                  86         91               83                76
P10 Tool-using agent loops             92         89               85                74
Average                              87.8       90.8             84.7              76.5

GPT-5.5 wins P4 (test generation) and P10 (tool-using agent loops) outright. On P1, P2, P3, P6, P7, P8, P9 it loses to Opus 4.7. On P5 it loses to Gemini. Solid second place.

SMQTS Results — Non-Programming Series

Category                     GPT-5.5   Opus 4.7   Gemini 3.1 Pro
N1 Long-form drafting             91         87               89
N2 Summarization                  89         91               90
N3 Multi-step reasoning           88         83               94
N4 Information extraction         88         89               87
N5 Translation                    84         76               92
N6 Style transfer                 89         90               87
N7 Adversarial resistance         85         92               88
N8 Structured output              91         87               88
N9 Domain QA                      87         90               89
N10 Multi-turn coherence          87         91               89
Average                         87.9       87.6             89.3

AAII headline (composite)

GPT-5.5      59.0   ##############################
Gemini 3.1   58.4   #############################
Opus 4.7     58.1   #############################
DeepSeek V4  54.7   ###########################
Gemma 4 27B  47.2   #######################

SMQTS Results — Cost-Quality Validation

Pairwise blind grading of GPT-5.5 standard against cheaper-tier substitutes on the 50-prompt sample:

Workload                        GPT-5.5 wins   DeepSeek V4 Pro wins   Tie
Long-form drafting (N1)                  56%                    21%   23%
Tool-using agent loops (P10)             67%                    13%   20%
Information extraction (N4)              22%                    34%   44%
Structured JSON output (N8)              41%                    28%   31%

GPT-5.5 Pro vs GPT-5.5 standard on the hardest slices of three categories:

Workload                         Pro wins   Standard wins   Tie
N3 Hard reasoning subset              61%             14%   25%
P6 Algorithm from spec (hard)         43%             22%   35%
P10 Tool loops (10+ turn)             34%             30%   36%

Procurement reading. Pro pays for itself only on the hardest reasoning slice, and the gap shrinks fast as prompts get easier. For a workload routing 95% of calls to standard and 5% to Pro, the token-weighted blended cost comes to roughly $13/$45 per 1M, because Pro calls carry long thinking traces and therefore consume a disproportionate share of tokens. Still expensive, but defensible if reasoning quality is critical.
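
A back-of-the-envelope blend makes the token weighting explicit. The Pro token shares below are assumptions chosen to reproduce the report's rough figure; substitute your own traffic shape:

    # Token-weighted blended price for a standard/Pro routing split.
    # The Pro shares are illustrative: 5% of calls can account for ~10%
    # of output tokens (and more of input) once thinking traces are counted.

    STANDARD = {"input": 5.00, "output": 30.00}    # $/1M tokens
    PRO      = {"input": 30.00, "output": 180.00}  # $/1M tokens

    def blended_price(pro_token_share: float, kind: str) -> float:
        """$/1M tokens when pro_token_share of tokens bill at the Pro rate."""
        return (1 - pro_token_share) * STANDARD[kind] + pro_token_share * PRO[kind]

    print(f"${blended_price(0.32, 'input'):.2f} per 1M input")    # -> $13.00
    print(f"${blended_price(0.10, 'output'):.2f} per 1M output")  # -> $45.00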

Strengths in Detail

Tool-use reliability

GPT-5.5 produces valid tool calls on first attempt 97.4% of the time across our P10 prompts, including unusual variadic and deeply nested function schemas. The next-best model (Opus 4.7) hits 94.2%. For high-volume agentic workloads, that 3.2-percentage-point gap is meaningful: it is the difference between a clean loop and a retry round-trip, and it compounds in latency and cost over multi-step runs.
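
The compounding is easy to quantify. A small sketch, assuming (as a simplification) that each tool call in a loop succeeds independently at the measured rate:

    # Probability that an n-step agent loop completes with zero retries,
    # assuming independent per-call success at rate p. Real failures
    # correlate with schema complexity, so treat this as an upper bound.

    def clean_loop_prob(p: float, n: int) -> float:
        return p ** n

    for name, p in [("GPT-5.5", 0.974), ("Opus 4.7", 0.942)]:
        print(f"{name}: {clean_loop_prob(p, 10):.1%} of 10-step loops retry-free")
    # GPT-5.5: 76.8%   Opus 4.7: 55.0%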

Long-form drafting

On N1 (a 3,000-word article from an outline), GPT-5.5 wins outright with the strongest combination of outline faithfulness and voice consistency. The competing Anthropic and Google models are fluent but tend to drift from the bullet structure partway through.

Structured output

N8 winner. Schema-strict JSON mode produces fully valid output 98.1% of the time across our N8 prompts, against 95.4% for Opus 4.7 and 96.7% for Gemini.
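
The residual 1.9% still warrants a validation gate in production. A minimal sketch using the jsonschema package; the invoice schema is hypothetical, standing in for whatever output contract your pipeline enforces:

    import json
    from jsonschema import ValidationError, validate

    # Hypothetical schema for illustration; swap in your real output contract.
    INVOICE_SCHEMA = {
        "type": "object",
        "required": ["invoice_id", "total"],
        "properties": {
            "invoice_id": {"type": "string"},
            "total": {"type": "number"},
        },
        "additionalProperties": False,
    }

    def parse_or_retry(raw: str) -> dict | None:
        """Return the parsed object, or None to signal a retry upstream."""
        try:
            obj = json.loads(raw)
            validate(instance=obj, schema=INVOICE_SCHEMA)
            return obj
        except (json.JSONDecodeError, ValidationError):
            return None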

Weaknesses and Failure Modes

Multi-file refactor loss to Opus

GPT-5.5 produces high-quality single-file edits but loses cross-file consistency more often than Opus 4.7 on P1. The specific failure mode: it identifies the changes to make but applies them inconsistently across the affected files, leaving tests broken in 3-4 places.

Domain QA fabrication

N9 fabrication rate is 5.1%, highest of the frontier four. The model is more likely to confidently invent a citation under retrieval pressure than Opus 4.7 (3.2%) or Gemini (4.4%). This is the single biggest reason to look elsewhere for regulated-domain QA.

Cost

The standard tier's $30 output is the highest among non-Pro frontier models. The Pro tier's $180 is roughly 7x the next-most-expensive rival's output rate (Opus 4.7 at $25). For high-volume workloads, GPT-5.5 is the costly choice unless you have a specific reason to need it.

When to Use GPT-5.5

  • General-purpose default. If you do not know what your workload mix will look like, GPT-5.5 standard is the lowest-regret pick.
  • Tool-using agents. Especially in production, where a 3.2 pp tool-call success advantage compounds.
  • Long-form drafting and JSON-strict output. The category winners.
  • GPT-5.5 Pro: hardest reasoning only. Deploy as a per-call escalation tier, not a default; a minimal routing sketch follows this list.
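
A minimal escalation sketch, using the OpenAI Python SDK's chat-completions call shape. The difficulty heuristic is hypothetical, and the model identifiers follow this report rather than confirmed API names:

    from openai import OpenAI

    client = OpenAI()

    def classify_difficulty(prompt: str) -> float:
        """Hypothetical heuristic scoring reasoning difficulty in [0, 1].
        Real routers use a cheap classifier model or task metadata."""
        hard_markers = ("prove", "derive", "multi-step", "optimize")
        return min(1.0, sum(m in prompt.lower() for m in hard_markers) / 2)

    def route(prompt: str) -> str:
        # Escalate only the hardest slice of traffic to the Pro tier.
        model = "gpt-5.5-pro" if classify_difficulty(prompt) >= 1.0 else "gpt-5.5"
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content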

When NOT to Use GPT-5.5

  • Multi-file refactor / migration scripts. Use Claude Opus 4.7.
  • Translation, especially low-resource languages. Use Gemini 3.1 Pro.
  • Strict-fidelity domain QA where one hallucinated citation is unacceptable. Use Opus 4.7.
  • Cost-sensitive bulk extraction. DeepSeek V4 Pro at one-tenth the cost is roughly substitutable.

Comparison to Direct Rivals

vs Claude Opus 4.7

Dimension                  GPT-5.5   Opus 4.7
Output price ($/1M)            $30        $25
Context window                  1M       500K
AAII Index                    59.0       58.1
Tool-call success            97.4%      94.2%
Hallucination rate (N9)       5.1%       3.2%

vs Gemini 3.1 Pro

Dimension                  GPT-5.5   Gemini 3.1 Pro
Input price ($/1M)           $5.00            $3.50
Context window                  1M               2M
GPQA Diamond                  89.7             94.3
Long-form drafting (N1)         91               89
Translation (N5)                84               92

Procurement Notes

Enterprise readiness

The most mature procurement story in the market. Available via OpenAI direct, Azure OpenAI, Microsoft 365 Copilot. Full SOC 2 Type II, ISO 27001, HIPAA, PCI DSS available. Custom data retention and zero-day-retention configurations standard on Enterprise plans.

Lock-in score

4.0 / 5, the highest in the closed-frontier tier. Specific switching costs: OpenAI's function-calling schema shape does not map cleanly to Anthropic's tool schema (a translation sketch follows below); target code comes to depend on persistent quirks in JSON output handling; and the gateway path picks up Azure-specific IAM coupling. Swfte Connect exists specifically to abstract this away.
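
The first of those costs is mechanical and worth seeing concretely. A sketch of the tool-schema translation, using the two vendors' published tool-definition shapes; the example tool itself is hypothetical:

    # Translate an OpenAI function-calling tool definition into Anthropic's
    # tool shape. The wrapper shapes are the vendors' published formats;
    # the example tool is made up for illustration.

    def openai_tool_to_anthropic(tool: dict) -> dict:
        fn = tool["function"]
        return {
            "name": fn["name"],
            "description": fn.get("description", ""),
            "input_schema": fn["parameters"],  # JSON Schema carries over as-is
        }

    openai_tool = {
        "type": "function",
        "function": {
            "name": "get_invoice",
            "description": "Fetch an invoice by id.",
            "parameters": {
                "type": "object",
                "properties": {"invoice_id": {"type": "string"}},
                "required": ["invoice_id"],
            },
        },
    }

    anthropic_tool = openai_tool_to_anthropic(openai_tool)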

Contract leverage

OpenAI's direct enterprise tier offers volume discounts starting at roughly $50K/month in spend, improving significantly past $500K. Buying through Azure adds committed-use discount leverage. The Pro tier is rarely negotiable on price; OpenAI treats it as a premium per-call tier rather than a contract line item.