Swfte AI Research Reports — May 2026
Independent, reproducible deep-dive reports on the five models that matter in May 2026. Honest scoring, raw transcripts, published prompts, disclosed conflicts.
Most published model reviews are either marketing reposts or vibes-driven Twitter takes. Neither is useful when a procurement decision is on the line. We built the Swfte Model Quality Test Suite (SMQTS) to fix that — three series of structured prompts (programming, non-programming, cost-quality validation) graded blind across multiple raters with a published 0-5 rubric.
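The blind multi-rater grading described above can be sketched as follows. The category names, rater scores, and disagreement rule here are illustrative assumptions for exposition only; the actual SMQTS rubric, prompts, and scoring sheets are the published ones.

```python
from statistics import mean, stdev

# Hypothetical rater scores on a 0-5 rubric -- illustrative only,
# not real SMQTS data.
scores = {
    "programming": [4, 5, 4],
    "non-programming": [3, 3, 4],
    "cost-quality": [2, 3, 2],
}

def aggregate(category_scores, disagreement_threshold=1.0):
    """Mean score per category, flagging categories where raters
    disagree by more than the threshold (stdev on the 0-5 scale)."""
    report = {}
    for category, ratings in category_scores.items():
        report[category] = {
            "mean": round(mean(ratings), 2),
            "disputed": stdev(ratings) > disagreement_threshold,
        }
    return report

print(aggregate(scores))
```

Publishing per-rater scores rather than only the aggregate is what makes the "disputed" flag checkable by readers.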
Every model in this collection has been through the full SMQTS at its pinned May 2026 version. The reports below show the category-level scores, the failure modes we found, and the specific workloads where each model wins or loses. We do not award points for marketing copy.
We disclose the obvious conflict: Swfte Connect routes production traffic across these same models, so we have a commercial incentive in the routing decisions readers make. The mitigation is transparency — we publish the prompts, the transcripts, the scoring sheets, and the disagreements between raters. If our scoring is wrong, the raw data lets you prove it.
These reports are written for three audiences: procurement teams evaluating multi-year contracts, engineering leads picking a default model for a product line, and researchers comparing released systems on a common rubric. Each section flags which audience it serves.
Start Here
Read the Methodology First
The prompts, the rubric, the controls. Without this document, the deep-dive scores are just numbers.
Anthropic · Closed
Claude Opus 4.7
Coding Arena #1 (1567 Elo)
SWE-bench Pro 64.3%, MMLU 91.2%
Best at multi-file refactors and stack-trace debugging. Loses to Gemini 3.1 Pro on raw reasoning (GPQA Diamond).
OpenAI · Closed
GPT-5.5 "Spud"
AAII 59 (highest)
1M context; Pro variant $30/$180 per 1M tokens (input/output)
Best generalist; strongest tool-use reliability. Pro variant is the most expensive frontier API on the market.
Google · Closed
Gemini 3.1 Pro Preview
Text Arena #1 (~1500 Elo)
GPQA Diamond 94.3%, 2M context
Reasoning king; cheapest frontier-tier output pricing. Loses to Claude on multi-file code refactors.
DeepSeek · Apache 2.0
DeepSeek V4 Pro
Quality-per-dollar leader
Apache 2.0; 1.6T MoE / 49B active; 1M context
Frontier-adjacent quality at one-tenth the cost. Open weights — exit cost effectively zero.
Google · Apache 2.0
Gemma 4 27B
Best self-host quality (~75)
Apache 2.0; designed for single-GPU deploy
Sweet spot for on-prem / regulated workloads. Below DeepSeek V4 Pro on quality but far smaller and simpler to operate.
How to Use These Reports
For Procurement
- Start with the procurement notes section.
- Cross-reference the lock-in score with our vendor leaderboard.
- Use the cost-quality validation section to size cheaper-tier substitution.
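Sizing cheaper-tier substitution comes down to quality per dollar at your workload's token profile. The sketch below is a back-of-envelope calculation: only the GPT-5.5 Pro price ($30/$180 per 1M tokens) appears in the report cards above; the other price and both quality scores are placeholder assumptions.

```python
# Back-of-envelope cheaper-tier substitution sizing.
# Prices are dollars per 1M tokens (input, output); quality scores
# here are placeholders, not SMQTS results.
models = {
    "gpt-5.5-pro": {"in": 30.0, "out": 180.0, "quality": 90},
    "cheaper-tier": {"in": 3.0, "out": 9.0, "quality": 82},
}

def cost_per_request(price, in_tokens=2_000, out_tokens=500):
    """Blended dollar cost for a typical request."""
    return (in_tokens * price["in"] + out_tokens * price["out"]) / 1e6

def quality_per_dollar(model):
    return model["quality"] / cost_per_request(model)

for name, m in models.items():
    print(name, round(cost_per_request(m), 4), round(quality_per_dollar(m)))
```

Swap in your own measured token profile and the SMQTS category scores for the workloads you actually run; the ranking often flips between input-heavy and output-heavy traffic.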
For Engineering Leads
- Read "When NOT to use this model" before "When to use".
- Match your workload to an SMQTS category in the methodology.
- Use the failure-mode tables to write canary evals for your own pipeline.
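A canary eval built from a failure-mode table can be as simple as a list of prompt/expectation pairs run in CI. The sketch below is generic, not Swfte's harness: `call_model` is a placeholder for whatever client your pipeline already uses, and the two canary pairs are hypothetical examples.

```python
# Minimal canary eval sketch. Populate CANARIES from the failure-mode
# tables for your chosen model; these two pairs are hypothetical.
CANARIES = [
    # (prompt, substring the answer must contain)
    ("What is 17 * 24?", "408"),
    ("Name the capital of Australia.", "Canberra"),
]

def run_canaries(call_model):
    """Return the list of failed canaries; empty list means all passed."""
    failures = []
    for prompt, expected in CANARIES:
        answer = call_model(prompt)
        if expected not in answer:
            failures.append((prompt, expected, answer))
    return failures

# Stub "model" for demonstration -- replace with your real client.
stub = {
    "What is 17 * 24?": "17 * 24 = 408",
    "Name the capital of Australia.": "Canberra",
}
print(run_canaries(lambda p: stub[p]))
```

Fail the build on a non-empty return value; that turns a report's failure-mode table into a regression guard for model or version swaps.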
For Researchers
- Check the methodology page for pinned versions and seeds.
- Download the .md per-model report for raw scores.
- Replication kit and transcripts are linked from the methodology footer.