Executive Summary
GPT-5.5 "Spud" is the model you pick when you do not know exactly what your workload looks like. It rarely wins any individual category outright but it never loses badly on any category — that is what the leading AAII score measures and what makes it the safest default for general-purpose production traffic. The Pro variant exists for one reason: to capture the customers who need the absolute strongest reasoning at any price.
Three strengths
- Highest AAII (59). The most reliable generalist in the market.
- Tool-use reliability leader. Best schema compliance under unusual function signatures, lowest tool error rate in long agent loops.
- Long-form drafting and structured output. Wins N1 (long-form) and N8 (JSON schema) outright.
Three weaknesses
- Pricing. $30 output is the highest among non-Pro frontier; $180 Pro output sets a new market ceiling.
- Multi-file refactor. Loses to Claude Opus 4.7 by a meaningful margin on code-spanning tasks.
- Hallucination on N9 (domain QA). 5.1% fabrication rate, highest of the frontier four. The model will confidently invent citations under retrieval pressure.
Architecture and Training
- Mixture-of-experts. Active parameter count not disclosed; community estimates put it in the 100-200B active range from a much larger total. The "Spud" codename references the project's internal staging.
- Two variants. GPT-5.5 standard and GPT-5.5 Pro share the base weights; Pro runs with a much larger thinking budget and additional verifier passes.
- Tokenizer. Carries forward the GPT-4 family tokenizer (cl100k variant). No drift to manage when upgrading from GPT-4o or GPT-5; see the token-count sketch after this list.
- Knowledge cutoff February 2026. Tied for the freshest among closed-frontier models.
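Because the tokenizer carries over, any token-count instrumentation or cost estimator built against GPT-4-era models keeps working as-is. A minimal sketch, assuming the cl100k variant named above is compatible with the open-source tiktoken cl100k_base encoding (an assumption, since the exact variant is not specified):

```python
# Sketch: token counts (and therefore cost estimates) carry over from GPT-4-era
# models because the encoding is the same. Assumes the "cl100k variant" above is
# compatible with tiktoken's cl100k_base; rates are the standard tier below.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimated_call_cost(prompt: str, expected_output_tokens: int,
                        input_rate: float = 5.00, output_rate: float = 30.00) -> float:
    """Rough $ per call, with rates expressed in $/1M tokens."""
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * input_rate + expected_output_tokens * output_rate) / 1_000_000

print(estimated_call_cost("Summarize the attached incident report.", expected_output_tokens=800))
```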
Pricing Reality
| Tier | Input ($/1M) | Output ($/1M) | Notes |
|---|---|---|---|
| GPT-5.5 standard | $5.00 | $30.00 | Default for most workloads |
| GPT-5.5 cached input | $1.25 | $30.00 | 75% off uncached input |
| GPT-5.5 batch | $2.50 | $15.00 | 50% off, async only |
| GPT-5.5 Pro | $30.00 | $180.00 | 6x standard input, 6x standard output |
| GPT-5.5 Pro cached | $7.50 | $180.00 | 75% off uncached Pro input |
Pro tier reality check. The $180 output rate also applies to the chain-of-thought: reasoning tokens are billed at the output rate. A single Pro answer to a hard reasoning prompt, thinking trace included, can run $0.50-$2 per call. At 50K Pro calls per month, that is a $25K-$100K monthly bill. The Pro variant is not a default; it is a per-call choice for the hardest 1-3% of traffic.
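The $0.50-$2 per-call range falls straight out of the Pro rates once reasoning tokens are counted as output. A back-of-the-envelope sketch; the token counts are illustrative assumptions, not benchmark measurements:

```python
# Sketch: Pro per-call and monthly cost at the $30/$180 per-1M rates quoted above.
# Token counts below are illustrative assumptions, not measured values.
PRO_INPUT_RATE = 30.00 / 1_000_000    # $ per input token
PRO_OUTPUT_RATE = 180.00 / 1_000_000  # $ per output token (reasoning tokens bill here too)

def pro_call_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    return (input_tokens * PRO_INPUT_RATE
            + (visible_output_tokens + reasoning_tokens) * PRO_OUTPUT_RATE)

light = pro_call_cost(2_000, 1_000, 1_500)   # ~$0.51
heavy = pro_call_cost(5_000, 2_000, 9_000)   # ~$2.13
print(f"per call: ${light:.2f} - ${heavy:.2f}")
print(f"50K calls/month: ${50_000 * light:,.0f} - ${50_000 * heavy:,.0f}")
```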
SMQTS Results — Programming Series
| Category | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro | DeepSeek V4 Pro |
|---|---|---|---|---|
| P1 Multi-file refactor | 86 | 94 | 83 | 74 |
| P2 Bug-finding from stack trace | 87 | 92 | 84 | 78 |
| P3 Code review | 88 | 91 | 85 | 76 |
| P4 Test generation | 90 | 89 | 83 | 77 |
| P5 SQL from natural language | 89 | 87 | 91 | 82 |
| P6 Algorithm from spec | 89 | 93 | 88 | 79 |
| P7 Migration scripts | 83 | 92 | 80 | 71 |
| P8 Documentation | 88 | 90 | 85 | 78 |
| P9 Diff comprehension | 86 | 91 | 83 | 76 |
| P10 Tool-using agent loops | 92 | 89 | 85 | 74 |
| Average | 87.8 | 90.8 | 84.7 | 76.5 |
GPT-5.5 wins P4 (test generation) and P10 (tool-using agent loops) outright. On P1, P2, P3, P6, P7, P8, P9 it loses to Opus 4.7. On P5 it loses to Gemini. Solid second place.
SMQTS Results — Non-Programming Series
| Category | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| N1 Long-form drafting | 91 | 87 | 89 |
| N2 Summarization | 89 | 91 | 90 |
| N3 Multi-step reasoning | 88 | 83 | 94 |
| N4 Information extraction | 88 | 89 | 87 |
| N5 Translation | 84 | 76 | 92 |
| N6 Style transfer | 89 | 90 | 87 |
| N7 Adversarial resistance | 85 | 92 | 88 |
| N8 Structured output | 91 | 87 | 88 |
| N9 Domain QA | 87 | 90 | 89 |
| N10 Multi-turn coherence | 87 | 91 | 89 |
| Average | 87.9 | 87.6 | 89.3 |
AAII headline (composite)
```
GPT-5.5      59.0  ##############################
Gemini 3.1   58.4  #############################
Opus 4.7     58.1  #############################
DeepSeek V4  54.7  ###########################
Gemma 4 27B  47.2  #######################
```
SMQTS Results — Cost-Quality Validation
Pairwise blind grading of GPT-5.5 standard against DeepSeek V4 Pro, the cheaper substitute, on the 50-prompt sample:
| Workload | GPT-5.5 wins | DeepSeek V4 Pro wins | Tie |
|---|---|---|---|
| Long-form drafting (N1) | 56% | 21% | 23% |
| Tool-using agent loops (P10) | 67% | 13% | 20% |
| Information extraction (N4) | 22% | 34% | 44% |
| Structured JSON output (N8) | 41% | 28% | 31% |
GPT-5.5 Pro vs GPT-5.5 standard, blind-graded on the hardest subsets of three categories:
| Workload | Pro wins | Standard wins | Tie |
|---|---|---|---|
| N3 Hard reasoning subset | 61% | 14% | 25% |
| P6 Algorithm from spec (hard) | 43% | 22% | 35% |
| P10 Tool loops (10+ turn) | 34% | 30% | 36% |
Procurement reading. Pro pays for itself only on the hardest reasoning subset, and the gap shrinks fast as prompts get easier. For a workload routing 95% of tokens to standard and 5% to Pro, the blended rate works out to roughly $6.25 input / $37.50 output per 1M; still expensive, but defensible if reasoning quality is critical.
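The blend is a one-line calculation against the pricing table; a quick sketch, where the 95/5 split is the only assumption:

```python
# Sketch: blended $/1M rates for a standard+Pro routing split, using the pricing
# table above. The 5% Pro share is the routing assumption from the text.
STANDARD = {"input": 5.00, "output": 30.00}   # $/1M tokens
PRO = {"input": 30.00, "output": 180.00}

def blended_rate(pro_share: float) -> dict:
    """Token-weighted blend: pro_share is the fraction of tokens served by Pro."""
    return {k: (1 - pro_share) * STANDARD[k] + pro_share * PRO[k] for k in STANDARD}

print(blended_rate(0.05))   # {'input': 6.25, 'output': 37.5}
```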
Strengths in Detail
Tool-use reliability
GPT-5.5 produces valid tool calls on the first attempt 97.4% of the time across our P10 prompts, including unusual variadic and deeply nested function schemas. The next-best model (Opus 4.7) hits 94.2%. For high-volume agentic workloads, that 3.2-percentage-point gap is meaningful: it is the difference between a clean loop and a retry round-trip, and the retries compound in latency and cost.
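Each invalid call means another full model turn, so the retry path is what that 3.2-point gap actually prices. A minimal sketch of the validate-and-retry pattern, assuming a hypothetical call_model wrapper and a tool whose arguments are checked against its declared JSON Schema with the jsonschema package:

```python
# Sketch: validate a model's tool-call arguments against the tool's declared
# JSON Schema, retrying on failure. `call_model` and the tool schema are
# hypothetical stand-ins, not part of any published API.
import json
import jsonschema

TOOL_PARAMETER_SCHEMA = {
    "type": "object",
    "properties": {"ticket_id": {"type": "string"}, "priority": {"type": "integer"}},
    "required": ["ticket_id"],
    "additionalProperties": False,
}

def get_valid_tool_call(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        raw_args = call_model(prompt, error_hint=last_error)  # hypothetical wrapper
        try:
            args = json.loads(raw_args)
            jsonschema.validate(args, TOOL_PARAMETER_SCHEMA)
            return args  # the first-attempt success path is the 97.4% case
        except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
            last_error = str(exc)  # each retry is another full model turn: latency + cost
    raise RuntimeError(f"tool call never validated: {last_error}")
```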
Long-form drafting
On N1 (3,000-word article from outline), GPT-5.5 wins outright with the strongest faithfulness-to-outline plus voice consistency. The competing Anthropic and Google models are fluent but tend to drift away from the bullet structure partway through.
Structured output
N8 winner. Schema-strict JSON mode produces fully valid output 98.1% of the time across our N8 prompts. Opus 4.7 is at 95.4% and Gemini at 96.7%.
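The 98.1% figure is specific to schema-strict mode, which is worth requesting explicitly rather than relying on prompt instructions. A minimal sketch, assuming GPT-5.5 is served through the same Chat Completions structured-output interface as current OpenAI models; the model identifier is a placeholder taken from this report and the invoice schema is illustrative:

```python
# Sketch: requesting schema-strict JSON output via the Chat Completions
# structured-output interface. Assumes GPT-5.5 is exposed the same way as
# current OpenAI models; the model name and schema are illustrative.
from openai import OpenAI

client = OpenAI()

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_usd": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total_usd", "line_items"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-5.5",  # placeholder identifier from this report
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
    },
)
print(response.choices[0].message.content)  # JSON string conforming to invoice_schema on the valid path
```

Under the current interface, strict mode expects every property to appear in required and additionalProperties to be false, which is why the schema above is written that way.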
Weaknesses and Failure Modes
Multi-file refactor loss to Opus
GPT-5.5 produces high-quality single-file edits but loses cross-file consistency more often than Opus 4.7 on P1. The specific failure mode: it identifies the changes to make but applies them inconsistently across the affected files, leaving tests broken in 3-4 places.
Domain QA fabrication
N9 fabrication rate is 5.1%, highest of the frontier four. The model is more likely to confidently invent a citation under retrieval pressure than Opus 4.7 (3.2%) or Gemini (4.4%). This is the single biggest reason to look elsewhere for regulated-domain QA.
Cost
Standard-tier $30 output is the highest among non-Pro frontier models. Pro-tier $180 output is 6x the standard tier and roughly 7x Opus 4.7's $25 rate. For high-volume workloads, GPT-5.5 is the costly choice unless you have a specific reason to need it.
When to Use GPT-5.5
- General-purpose default. If you do not know what your workload mix will look like, GPT-5.5 standard is the lowest-regret pick.
- Tool-using agents. Especially in production, where a 3.2 pp tool-call success advantage compounds.
- Long-form drafting and JSON-strict output. The category winners.
- GPT-5.5 Pro: hardest reasoning only. Deploy as a per-call escalation tier, not a default.
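A per-call escalation tier can be as thin as a try-standard-first wrapper. A minimal sketch; call_model, passes_checks, and the model identifiers are placeholder assumptions rather than a prescribed design:

```python
# Sketch: per-call escalation. Try the standard tier first; escalate the same
# prompt to Pro only when a cheap downstream check fails. call_model,
# passes_checks, and the model identifiers are hypothetical placeholders.
def answer(prompt: str) -> str:
    draft = call_model("gpt-5.5", prompt)       # standard tier handles the bulk of traffic
    if passes_checks(draft):                    # e.g. schema validation, unit tests, self-consistency
        return draft
    return call_model("gpt-5.5-pro", prompt)    # escalate only the hardest residue
```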
When NOT to Use GPT-5.5
- Multi-file refactor / migration scripts. Use Claude Opus 4.7.
- Translation, especially low-resource languages. Use Gemini 3.1 Pro.
- Strict-fidelity domain QA where one hallucinated citation is unacceptable. Use Opus 4.7.
- Cost-sensitive bulk extraction. DeepSeek V4 Pro at one-tenth the cost is roughly substitutable.
Comparison to Direct Rivals
vs Claude Opus 4.7
| Dimension | GPT-5.5 | Opus 4.7 |
|---|---|---|
| Output price ($/1M) | $30 | $25 |
| Context window | 1M | 500K |
| AAII Index | 59 | 58.1 |
| Tool-call success | 97.4% | 94.2% |
| Hallucination rate (N9) | 5.1% | 3.2% |
vs Gemini 3.1 Pro
| Dimension | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|
| Input price ($/1M) | $5.00 | $3.50 |
| Context window | 1M | 2M |
| GPQA Diamond | 89.7 | 94.3 |
| Long-form drafting (N1) | 91 | 89 |
| Translation (N5) | 84 | 92 |
Procurement Notes
Enterprise readiness
The most mature procurement story in the market. Available via OpenAI direct, Azure OpenAI, Microsoft 365 Copilot. Full SOC 2 Type II, ISO 27001, HIPAA, PCI DSS available. Custom data retention and zero-day-retention configurations standard on Enterprise plans.
Lock-in score
4.0 / 5, the highest in the closed-frontier tier. Specific switching costs: a function-calling schema shape that does not map cleanly onto Anthropic's tool schema, persistent quirks in JSON output handling that downstream code comes to expect, and Azure-specific IAM coupling on the gateway path. Swfte Connect exists precisely to abstract this away.
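The first of those switching costs is mechanical and easy to see concretely. A minimal sketch of the adapter a gateway layer (Swfte Connect or otherwise) performs, using the tool-definition shapes from the current public OpenAI Chat Completions and Anthropic Messages APIs; the example tool is illustrative and both shapes can drift across versions:

```python
# Sketch: translating an OpenAI-style tool definition into Anthropic's shape.
# Field names reflect the current public Chat Completions and Messages APIs;
# the example tool itself is illustrative.
def openai_tool_to_anthropic(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],   # same JSON Schema body, different key
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

print(openai_tool_to_anthropic(openai_tool))
```

The JSON Schema body survives the translation; what does not carry over is everything downstream code has learned to expect about how each model fills it in, which is where the real switching cost sits.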
Contract leverage
OpenAI's direct enterprise tier offers volume discounts starting at ~$50K/month in spend, with significantly better terms at $500K+. Azure adds committed-use discount leverage. The Pro tier is rarely negotiable on price; OpenAI treats it as a premium per-call tier rather than a contract line item.