# GPT-5.5 ("Spud") — Independent Research Report

**Publisher**: Swfte AI Research
**Report date**: May 2026
**Methodology**: https://www.swfte.com/research/methodology
**Web version**: https://www.swfte.com/research/gpt-5-5
**Citation**: Swfte AI Research, "GPT-5.5 — Independent Research Report", May 2026.

## Executive Summary

GPT-5.5, internally codenamed "Spud," is OpenAI's flagship general-purpose model, released April 23, 2026 and made available on AWS Bedrock April 28, 2026. It is the first fully retrained base model from OpenAI since GPT-4.5 in early 2024 — a meaningful change after two years of post-training-only releases. The base model was re-pretrained on a significantly larger corpus with what OpenAI describes as "improved data curation and synthetic-data ratio tuning." A separate variant, GPT-5.5 Pro, exists at $30/$180 per million tokens for high-stakes reasoning workloads.

Standard GPT-5.5 is priced at $5 per million input tokens and $30 per million output tokens, with a 1M context window across all production tiers. On Artificial Analysis's composite Intelligence Index (AAII), GPT-5.5 scored 59 — the highest in its class as of mid-May 2026 — driven primarily by strong showings in instruction following, multi-turn coherence, and agentic eval suites. It is the best-balanced general-purpose model in the frontier tier.

Three strengths define the model. First, balance: GPT-5.5 is the model with the fewest measurable weaknesses across SMQTS categories. It does not lead on any single axis, but it does not trail meaningfully either. Second, instruction following on long, complex system prompts (10K+ tokens) is the best we measured. Third, multi-turn coherence across 50+ message conversations was best in class.

Three weaknesses to flag. First, GPT-5.5 lost to Opus 4.7 on every coding category we tested, often by 6-12 points. Second, output token pricing at $30/M is 20% above Opus 4.7 and roughly 9x DeepSeek V4 Pro. Third, the model exhibits a measurably higher rate of confident-but-wrong factual outputs on niche domain questions (medical, legal precedent, scientific subspecialty) than Gemini 3.1 Pro.

For buyers: GPT-5.5 is the right default model for organizations that want one model to handle a broad mix of workloads with predictable behavior. It is the wrong model when you have a workload that maps cleanly onto Opus 4.7's strengths (agentic coding, security review) or onto a cheaper model's adequacy zone (routine generation).

## 1. Model Snapshot

| Attribute | Value |
|---|---|
| Provider | OpenAI |
| Release date | April 23, 2026 (Azure + OpenAI API), April 28, 2026 (AWS Bedrock) |
| Parameters | Not disclosed; estimated MoE ~1.8T total / 220B active |
| Context window | 1,000,000 tokens |
| Max output | 64,000 tokens |
| License | Proprietary (commercial only) |
| Input pricing | $5.00 per 1M tokens |
| Output pricing | $30.00 per 1M tokens |
| Cache hit | $0.50 per 1M tokens |
| Batch (50% off) | $2.50 / $15.00 per 1M |
| Pro variant | GPT-5.5 Pro: $30 / $180 per 1M |
| Modalities | Text, image, audio (in/out), video (input only) |
| Providers | OpenAI API, Azure OpenAI, AWS Bedrock |
| Knowledge cutoff | December 2025 |

## 2. Architecture & Training (what's known publicly)

OpenAI's GPT-5.5 model card (April 23, 2026) is unusually detailed by their post-2024 standards. The headline claim is that GPT-5.5 is the first fully retrained base since GPT-4.5 in early 2024, meaning the entire pretraining run was redone rather than continued from a prior checkpoint. The pretraining corpus is described as "approximately 18T tokens after deduplication," with the synthetic-data ratio explicitly stated as "below 30%." OpenAI cited this as a deliberate cap to avoid the model-collapse signals observed in late-2025 ablations.

The post-training stack is described as a four-stage pipeline: SFT → RLHF → tool-use RL with verifiable rewards → constitutional refinement. The third stage — tool-use RL — is the most novel. OpenAI ran the model through a large suite of tool-use environments (computer use, code execution, web browsing, structured-data agents) with verifier-driven rewards. This shows up clearly in our tool-use evaluations: GPT-5.5 is competitive with Opus 4.7 on tool-use accuracy when the toolset is small, although it loses ground on five-or-more-tool agentic loops.

The "Spud" codename traces to an internal joke from OpenAI's training cluster nomenclature — earlier 2025 leaked Slack messages had referenced models with vegetable codenames. OpenAI's release notes do not officially adopt the codename, but it is widely used in the developer community.

## 3. Pricing Reality

Headline: $5 / $30 per million input/output tokens. The Pro variant at $30 / $180 is positioned for verification-heavy workloads where cost-per-correct-answer beats cost-per-token.

Effective production cost on a 4,000-token prompt → 1,000-token completion:

| Scenario | Cost |
|---|---|
| Cold prompt | $0.050 |
| Cache hit (90% of prompt) | $0.0338 |
| Batch | $0.0250 |
| GPT-5.5 Pro (cold) | $0.300 |
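
The table values follow directly from the list prices. A minimal sketch of the arithmetic in Python, using the Model Snapshot prices and the 4,000-token-prompt → 1,000-token-completion scenario (whether the batch discount stacks with cache pricing is not documented, so the batch row is computed without caching):

```python
# Per-request cost arithmetic behind the table above, using list prices
# from the Model Snapshot. Illustrative only.
INPUT_PER_M = 5.00     # $ per 1M input tokens
OUTPUT_PER_M = 30.00   # $ per 1M output tokens
CACHE_PER_M = 0.50     # $ per 1M cached input tokens
BATCH_DISCOUNT = 0.50  # batch API is 50% off

def request_cost(prompt_tokens: int, completion_tokens: int,
                 cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Dollar cost of one request under the published list prices."""
    cached = prompt_tokens * cached_fraction
    uncached = prompt_tokens - cached
    cost = (uncached * INPUT_PER_M
            + cached * CACHE_PER_M
            + completion_tokens * OUTPUT_PER_M) / 1_000_000
    return cost * (BATCH_DISCOUNT if batch else 1.0)

print(request_cost(4_000, 1_000))                       # cold prompt: 0.050
print(request_cost(4_000, 1_000, cached_fraction=0.9))  # 90% cache hit: ~0.0338
print(request_cost(4_000, 1_000, batch=True))           # batch: 0.025
```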

The Pro variant is 6x the standard tier on input and 6x on output. OpenAI markets it as "for tasks where being right is worth 10x more than being fast." In our testing, the Pro variant produced measurably better answers on 31 of 50 hard reasoning prompts, scored worse on 4, and tied on 15. This is a real quality gap, but the 6x premium is hard to justify outside specific high-stakes use cases (medical decision support, regulatory drafting, financial DCF assumptions).

Cache hit pricing at $0.50/M is the same as Opus 4.7. OpenAI's caching is automatic and content-addressed (no cache key management), which is operationally simpler than Anthropic's prompt-prefix scheme. The trade-off: hits are less predictable.

Batch API at 50% off has a 24-hour SLA. AWS Bedrock pricing matches OpenAI API list pricing exactly as of May 1, 2026.

## 4. SMQTS Programming Series Results

| Category | Score | Notes |
|---|---|---|
| Algorithm implementation (LeetCode-Hard) | 88 | Strong; 6 points behind Opus 4.7. |
| TypeScript refactor (50K LOC repo) | 82 | Broke 11 import paths in our migration test. |
| Python data pipeline (pandas → polars) | 86 | Reliable; conservative refactors. |
| Go concurrency bug isolation | 84 | Good; missed one channel-direction bug. |
| SQL query optimization (Postgres) | 84 | Solid EXPLAIN reasoning. |
| React server component migration | 81 | Below Opus 4.7; client/server boundary detection weaker. |
| Rust lifetime errors | 78 | Frequent suggestion churn. |
| Code review (security-focused) | 86 | Below Opus 4.7 by 7 points; missed 3 OWASP cases. |
| Test generation (pytest, vitest) | 87 | Strong coverage; clean assertion style. |
| Long-context refactor (600K-token monorepo) | 85 | Better than expected; some skipped files. |

**Series average: 84.1** (vs. 90.5 for Opus 4.7, 86.7 for Gemini 3.1 Pro, 78.3 for DeepSeek V4 Pro)

## 5. SMQTS Non-Programming Series Results

| Category | Score | Notes |
|---|---|---|
| Long-form analytical writing | 90 | Top of field; cleanest structure. |
| Multi-step financial analysis | 88 | Strong; one DCF rounding error in 36 trials. |
| Legal contract review (redlines) | 87 | Caught 12 of 14 indemnification edge cases. |
| Multilingual translation (EN→ZH/JA/KO) | 84 | Strong across all three; below Gemini on KO. |
| Image OCR + table extraction | 80 | Solid; below Gemini 3.1 Pro on dense scans. |
| Data extraction from PDFs (structured) | 89 | Reliable JSON; zero schema violations in 36 trials. |
| Creative writing (genre fiction) | 86 | Top of field on voice-versatility. |
| Instruction-following under adversarial prompts | 89 | Behind Opus 4.7 by 6 points. |
| Mathematical reasoning (AIME-2025) | 84 | Below Gemini 3.1 Pro and Opus 4.7. |
| Tool use (5+ interleaved tools) | 88 | Strong; behind Opus 4.7. |

**Series average: 86.5** (vs. 86.7 for Opus 4.7, 88.1 for Gemini 3.1 Pro, 76.2 for DeepSeek V4 Pro)

## 6. Cost-Quality Validation

We re-ran 200 prompts on DeepSeek V4 Pro and Gemma 4 27B. For 168 of 200 prompts, the cheaper model produced output that blinded raters scored as indistinguishable from or better than GPT-5.5's. This is a higher fungibility ratio than we measured for Opus 4.7 (142/200), reflecting GPT-5.5's "balanced but not dominant" positioning.

The 32 prompts where GPT-5.5 won decisively were concentrated in three areas:
1. Long-form analytical writing where structural clarity mattered (GPT-5.5 won 11 of 13).
2. Multi-turn dialogues exceeding 30 turns (GPT-5.5 won 8 of 10).
3. Agentic loops with 2-3 tools (GPT-5.5 won 7 of 9).

Despite the higher fungibility ratio, GPT-5.5 is in practice harder to displace with a cheaper model than Opus 4.7 is. Opus 4.7 has a small set of clear "this is the only model that can do this" workloads, which makes the routing decision easy; GPT-5.5 has fewer such workloads but sits consistently a few points above the cheap-model baseline across many categories. The implication: routers struggle to confidently demote GPT-5.5 traffic to a cheaper model, even when the quality difference is small.

## 7. Strengths (Detailed)

**Best-in-class instruction following on long system prompts.** Our long-system-prompt benchmark uses 10,000-token system prompts containing 40+ explicit constraints. GPT-5.5 followed 38.4 of 40 constraints on average; Opus 4.7 followed 36.1; Gemini 3.1 Pro followed 34.7; DeepSeek V4 Pro followed 28.9. For applications with elaborate persona scaffolding, role definitions, or multi-clause guidelines, this is the most reliable model.
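
Teams that want to reproduce this kind of measurement against their own system prompts can score constraint adherence programmatically. A minimal sketch follows; the two checkers are illustrative stand-ins, not part of our published harness:

```python
from typing import Callable

# One entry per numbered constraint in the system prompt: a label plus a
# programmatic check over the model's response text.
Constraint = tuple[str, Callable[[str], bool]]

CONSTRAINTS: list[Constraint] = [
    ("respond in English only", lambda text: text.isascii()),
    ("never reveal internal tool names", lambda text: "file_read" not in text),
    # ... add one checker per constraint
]

def adherence(response_text: str, constraints: list[Constraint]) -> float:
    """Fraction of constraints the response satisfies (0.0 to 1.0)."""
    passed = sum(1 for _, check in constraints if check(response_text))
    return passed / len(constraints)
```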

**Multi-turn coherence over 50+ messages.** Long agentic conversations expose drift — the model "forgets" earlier constraints or starts contradicting itself. Our 50-turn coherence benchmark showed GPT-5.5 maintaining 91% of established constraints by turn 50, vs. 84% for Opus 4.7 and 79% for Gemini 3.1 Pro. For customer support agents and long-running assistants, this is a meaningful advantage.

**Structured output and function-call reliability.** When asked for strict JSON schema adherence, GPT-5.5 failed 0 out of 36 times in our test set. Opus 4.7 failed 1 time; Gemini 3.1 Pro failed 2; DeepSeek V4 Pro failed 6. The structured-output mode in the OpenAI API is also the most operationally mature, with first-token enforcement that prevents partial-token hallucinations.

**Voice and audio I/O.** GPT-5.5 ships with native audio input and output at production-grade latency. Average time-to-first-audio-token is 320ms in our measurements. Voice quality is comparable to ElevenLabs in blinded tests. For voice-first applications, this is the only frontier model with truly production-ready audio.

## 8. Weaknesses & Failure Modes (Detailed)

**Coding deficit vs. Opus 4.7.** Across all 10 SMQTS programming categories, GPT-5.5 trailed Opus 4.7 by 4-12 points. For agentic coding loops involving 4+ tools, GPT-5.5 averaged 6.8 tool calls per task vs. 4.2 for Opus 4.7 — meaning it was burning more tokens and more wall-clock time to reach the same outcome. For coding-primary workloads, this gap directly costs money and latency.

**Confident-but-wrong factual outputs on niche domains.** We administered 200 questions drawn from medical, legal, and subspecialty scientific literature with known correct answers. GPT-5.5 scored 71% accuracy with 89% expressed confidence, meaning it was confidently wrong on roughly 16% of the questions it answered. Gemini 3.1 Pro scored 82% accuracy with 81% confidence on the same set. The calibration gap is decision-relevant for any application where users may take outputs at face value.
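
Measuring a calibration gap like this on your own question set needs only graded answers plus the model's expressed confidence per answer. A minimal sketch, with a hypothetical 0.8 threshold standing in for "confident":

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    correct: bool       # graded against a known ground-truth answer
    confidence: float   # model's expressed confidence, 0.0 to 1.0

def calibration_summary(answers: list[GradedAnswer], threshold: float = 0.8):
    """Accuracy, mean expressed confidence, and confidently-wrong rate."""
    n = len(answers)
    accuracy = sum(a.correct for a in answers) / n
    mean_confidence = sum(a.confidence for a in answers) / n
    confidently_wrong = sum(
        1 for a in answers if not a.correct and a.confidence >= threshold
    ) / n
    return accuracy, mean_confidence, confidently_wrong
```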

**Output token cost is 20% higher than Opus 4.7.** At $30 per million output tokens, GPT-5.5 has the most expensive standard-tier output in the frontier. For long-form generation workloads, this compounds quickly. The Pro variant's $180/M output pricing is positioned such that few buyers will rationally select it: the marginal quality is real, but rarely worth 6x.

**Image OCR and document scan quality.** Below Gemini 3.1 Pro by a measurable margin on dense scans. We tested with 40 historical legal documents (faded carbon copies, irregular layouts, mixed-language inserts). GPT-5.5 scored 80; Gemini 3.1 Pro scored 91. For document-processing pipelines with non-clean inputs, this gap matters.

## 9. When To Use This Model

- General-purpose default across a heterogeneous workload mix
- Long, structured system prompts with many explicit constraints
- Multi-turn agentic conversations exceeding 30 turns
- Voice-input and voice-output applications
- Workloads where structured output / JSON adherence is non-negotiable
- Multimodal pipelines combining text, image, and audio input

## 10. When NOT To Use This Model

- Agentic coding loops where every wasted tool call costs money
- Long-context coding refactors (use Opus 4.7)
- Niche-domain Q&A where calibrated factual accuracy is critical
- High-volume routine generation (use DeepSeek V4 Pro or Gemma 4)
- Document scan / OCR pipelines (use Gemini 3.1 Pro)
- Cost-sensitive workloads with predictable, bounded prompts

## 11. Procurement Notes

- **MSA / DPA**: Available via OpenAI Enterprise and Azure OpenAI MSAs.
- **BAA**: Available on Azure OpenAI for HIPAA workloads.
- **Data residency**: Azure global regions; OpenAI API EU-resident option.
- **Lock-in score (1-10)**: 6. The OpenAI API surface is the de facto industry standard, which paradoxically reduces lock-in — most other vendors maintain OpenAI-compatible endpoints. Migration off GPT-5.5 to GPT-5 or to a competitor is mostly a model-name change.
- **Compliance**: SOC 2 Type II, ISO 27001, HIPAA-eligible on Azure.
- **Rate limits**: Tier 5 customers get 30M input tokens / 600K output tokens per minute.

## 12. Bottom Line

For startups, GPT-5.5 is the safe default. It will not be best on any specific axis but it will not embarrass you on any axis either. Pair it with a cheaper model (DeepSeek V4 Pro) routed on confidence thresholds, and you get a workable two-tier architecture without much engineering effort.

For mid-market companies, GPT-5.5's positioning is awkward. It is too expensive to use for routine traffic and not specialized enough to justify exclusive use on high-value traffic. Consider a three-tier architecture: cheap model for routine, GPT-5.5 for the broad middle, Opus 4.7 for the agentic / coding peaks.
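
A minimal sketch of that three-tier split, assuming an upstream task classifier and difficulty estimate already exist; the model identifiers and thresholds here are placeholders, not recommendations:

```python
def pick_model(task_type: str, estimated_difficulty: float) -> str:
    """Route one request to a tier. All names and cutoffs are illustrative."""
    if task_type in {"agentic_coding", "security_review"}:
        return "opus-4.7"           # coding / agentic peaks
    if estimated_difficulty < 0.3:
        return "deepseek-v4-pro"    # routine, cost-sensitive traffic
    return "gpt-5.5"                # the broad middle
```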

For enterprises, GPT-5.5 via Azure OpenAI is the path of least procurement resistance. Your security and compliance teams have already approved Azure. Your developers are already familiar with the OpenAI API. The marginal effort to add Anthropic or Google to the stack should be evaluated against the measurable quality gains on specific workloads — not against an abstract notion of "best model."

## Appendix A: Test Prompts Used

1. *"Here is a 10,000-token system prompt with 40 numbered constraints. Respond to the user input below."* — Long-system-prompt fidelity.
2. *"Continue this 50-message customer-support conversation while maintaining the persona established in turn 1."* — Multi-turn coherence.
3. *"Output a JSON object matching this schema. Do not include any commentary."* — Structured output adherence.
4. *"What are the contraindications for [drug] in patients with [condition]? Cite primary sources."* — Niche-domain factual calibration.
5. *"You have access to: file_read, web_search, calculator. Find the answer to the user's question."* — 3-tool agentic loop.
6. *"Transcribe and analyze this 12-minute audio file."* — Audio input.
7. *"Generate a 5,000-word analytical report based on the attached data."* — Long-form analytical writing.
8. *"Extract all named entities from this scanned 80-page PDF."* — Document scan with OCR.

## Appendix B: Methodology Reference

Full methodology at https://www.swfte.com/research/methodology, including blinded rater protocols, statistical-significance thresholds, and the prompt corpus provenance. Raw transcripts available on request.

## Appendix C: Operational Notes from Production Deployments

**Structured-output mode is the right default.** OpenAI's structured-output feature in the GPT-5.5 API enforces JSON schema compliance at the token level rather than via post-hoc validation. The latency cost is minimal (under 5% in our measurements) and the reliability gain is substantial (zero schema violations in our testing vs. occasional violations without it). Teams integrating GPT-5.5 should use structured output for any non-trivial JSON requirement.
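
A minimal sketch of calling structured-output mode from the official Python SDK, assuming the existing json_schema response format carries over to GPT-5.5 unchanged; the model string and schema are illustrative:

```python
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "invoice_extraction",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total_usd": {"type": "number"},
        },
        "required": ["vendor", "total_usd"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-5.5",  # placeholder for your deployed model name
    messages=[{"role": "user", "content": "Extract vendor and total from: ..."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # JSON conforming to the schema
```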

**Function calling vs. tool calling.** OpenAI's API supports both legacy "function_call" syntax and newer "tools" syntax. As of May 2026, only the "tools" syntax produces consistent behavior with GPT-5.5; teams on legacy "function_call" code paths should migrate. We have observed measurable quality degradation on the legacy path that does not appear in OpenAI's official documentation.
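
For teams migrating off the legacy path, a minimal sketch of the current "tools" syntax; the tool definition and model string are illustrative:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool, for illustration
        "description": "Look up the status of an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-5.5",  # placeholder
    messages=[{"role": "user", "content": "Where is order 1138?"}],
    tools=tools,
    tool_choice="auto",
)
# Tool calls arrive on message.tool_calls, not the deprecated
# message.function_call field used by the legacy syntax.
print(resp.choices[0].message.tool_calls)
```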

**Audio I/O latency.** The audio output in GPT-5.5 is genuinely production-ready, but only via the Realtime API. The standard chat completions API supports audio input but generates text output by default; audio-out requires a separate code path. Teams building voice products should plan for the Realtime API specifically.

**Pro variant routing.** GPT-5.5 Pro at $30/$180 is positioned for "verifiable correctness" workloads. The most common pattern we have seen succeed: route to standard GPT-5.5 first, run a programmatic verifier on the output (test runner, fact-checker, schema validator), and escalate to GPT-5.5 Pro only on verifier failure. This yields the cost profile of standard tier with the quality of Pro on the hard fraction.
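
A minimal sketch of that escalation pattern, assuming a programmatic verifier already exists for the output type; the model strings are placeholders for whatever deployment names you actually run:

```python
from typing import Callable
from openai import OpenAI

client = OpenAI()

def answer_with_escalation(prompt: str, verify: Callable[[str], bool]) -> str:
    """Try the standard tier first; escalate to Pro only on verifier failure."""
    text = ""
    for model in ("gpt-5.5", "gpt-5.5-pro"):  # placeholder model names
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        if verify(text):
            return text
    return text  # both tiers failed verification; flag for human review
```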

**Azure-specific differences.** Azure OpenAI's GPT-5.5 deployment exhibits a small content-filter differential vs. the native OpenAI API; certain prompts that the native API answers receive content-filter refusals on Azure. Teams deploying via Azure should test their specific prompts on both surfaces. The Azure rate limits are also configured differently and require explicit tier negotiation.

**Streaming and partial-response handling.** GPT-5.5 streaming exposes more intermediate state than prior generations, including partial tool calls and structured-output token-by-token enforcement. Applications that rendered partial JSON during streaming with prior models should re-evaluate their parsing logic.
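
A minimal sketch of a streaming consumer that tolerates the extra intermediate state; the model string is a placeholder, and partial tool-call handling is deliberately reduced to "skip, don't parse":

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-5.5",  # placeholder
    messages=[{"role": "user", "content": "Summarize today's tickets."}],
    stream=True,
)

text_parts = []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        # Tool-call arguments arrive as incremental string fragments; do not
        # json.loads them until the stream finishes that call.
        continue
    if delta.content:
        text_parts.append(delta.content)

print("".join(text_parts))
```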

**Reasoning-effort parameter.** GPT-5.5 introduces a `reasoning_effort` parameter (low, medium, high) that adjusts how much internal deliberation the model performs before producing tokens. The default is medium. Setting high produces measurable quality gains on reasoning-heavy prompts (3-5 points on AIME-style tasks) at the cost of meaningfully higher token usage and latency. Setting low reduces token usage and latency on simple prompts with no measurable quality loss. Tuning this parameter per workload is one of the higher-leverage optimizations available to teams running GPT-5.5 in production.
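
A minimal sketch of per-workload tuning, assuming the parameter is accepted on the standard chat completions call as the model card describes; the model string and the caller-supplied difficulty flag are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def complete(prompt: str, hard: bool = False) -> str:
    """Reserve high reasoning effort for prompts the application flags as hard."""
    resp = client.chat.completions.create(
        model="gpt-5.5",  # placeholder
        reasoning_effort="high" if hard else "low",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```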

**Bedrock vs. Azure differences.** AWS Bedrock and Azure OpenAI both serve GPT-5.5 but at slightly different versions and with different content-filter configurations. Multi-cloud deployments should not assume cross-surface response equivalence. We have observed prompts that produce different outputs across the three surfaces (OpenAI native, Azure, Bedrock), and the divergences are not always documented in release notes.

## Sources & References

- OpenAI, "GPT-5.5 Model Card", April 23, 2026 — https://openai.com/research/gpt-5-5
- OpenAI Pricing Page, accessed May 13, 2026 — https://openai.com/pricing
- Artificial Analysis, "GPT-5.5 Independent Evaluation", May 6, 2026 — https://artificialanalysis.ai
- LMSys Chatbot Arena Leaderboard, May 14, 2026 snapshot — https://lmarena.ai
- AWS Bedrock GPT-5.5 GA Announcement, April 28, 2026
- Azure OpenAI GPT-5.5 GA Announcement, April 23, 2026
- HuggingFace SMQTS-Public Leaderboard, May 11, 2026
- Stanford HELM 2026 Q1 Report — https://crfm.stanford.edu/helm
- ArXiv 2604.13247, "Calibration in Frontier LLMs", May 2026
- OpenAI, "Audio I/O Production Readiness Notes", April 30, 2026
- SWE-bench Pro Leaderboard, May 10, 2026
- Vellum AI Frontier Model Comparison, May 9, 2026

---

*Independent research by Swfte AI. We route across multiple AI providers via Swfte Connect, including the model in this report. Full conflict-of-interest disclosure at /research/methodology. Raw test transcripts available on request.*
