Updated May 6, 2026

Swfte AI Research Reports — May 2026

Independent, reproducible deep-dive reports on the five models that matter in May 2026. Honest scoring, raw transcripts, published prompts, disclosed conflicts.

Most published model reviews are either marketing reposts or vibes-driven Twitter takes. Neither is useful when a procurement decision is on the line. We built the Swfte Model Quality Test Suite (SMQTS) to fix that — three series of structured prompts (programming, non-programming, cost-quality validation) graded blind across multiple raters with a published 0-5 rubric.
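
For readers who want to see the mechanics, the sketch below shows one way blind 0-5 scores from multiple raters could roll up into category-level results. It is a minimal illustration, not the actual SMQTS tooling: the record fields (model, category, prompt_id, score) and the disagreement threshold are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_scores(rows, disagreement_threshold=1.0):
    """Roll blind per-rater 0-5 scores up into category-level means.

    `rows` is a list of dicts with keys model, category, prompt_id, score.
    These field names are illustrative, not the SMQTS schema.
    """
    per_prompt = defaultdict(list)
    for row in rows:
        per_prompt[(row["model"], row["category"], row["prompt_id"])].append(row["score"])

    per_category = defaultdict(list)
    disagreements = []  # prompts where raters spread more than the threshold
    for (model, category, prompt_id), scores in per_prompt.items():
        per_category[(model, category)].append(mean(scores))
        if pstdev(scores) > disagreement_threshold:
            disagreements.append((model, category, prompt_id, scores))

    summary = {key: round(mean(vals), 2) for key, vals in per_category.items()}
    return summary, disagreements
```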

Every model in this collection has been run through the full SMQTS at the version pinned for the May 2026 cycle. The reports below show the category-level scores, the failure modes we found, and the specific workloads where each model wins or loses. We do not award points for marketing copy.

We disclose the obvious conflict: Swfte Connect routes production traffic across these same models, so we have a commercial incentive in the routing decisions readers make. The mitigation is transparency — we publish the prompts, the transcripts, the scoring sheets, and the disagreements between raters. If our scoring is wrong, the raw data lets you prove it.

These reports are written for three audiences: procurement teams evaluating multi-year contracts, engineering leads picking a default model for a product line, and researchers comparing released systems on a common rubric. Each section flags which audience it serves.

Start Here

Read the Methodology First

The prompts, the rubric, the controls. Without this document, the deep-dive scores are just numbers.

Open Methodology

Anthropic · Closed
Claude Opus 4.7 (2026-04-16)
Coding Arena #1 (1567 Elo); SWE-bench Pro 64.3%, MMLU 91.2
Best at multi-file refactors and stack-trace debugging. Loses to Gemini 3.1 Pro on raw reasoning (GPQA Diamond).
Input $5 / 1M · Output $25 / 1M

OpenAI · Closed
GPT-5.5 "Spud" (2026-04-23)
AAII 59 (highest); 1M context; Pro variant $30/$180 per 1M
Best generalist; strongest tool-use reliability. The Pro variant is the most expensive frontier API on the market.
Input $5 / 1M · Output $30 / 1M

Google · Closed
Gemini 3.1 Pro Preview (2026-04-30)
Text Arena #1 (~1500 Elo); GPQA Diamond 94.3%; 2M context
Reasoning king; cheapest output pricing in the frontier tier. Loses to Claude on multi-file code refactors.
Input $3.50 / 1M · Output $10.50 / 1M

DeepSeek · Apache 2.0
DeepSeek V4 Pro (2026-04-24)
Quality-per-dollar leader; 1.6T MoE / 49B active; 1M context
Frontier-adjacent quality at one-tenth the cost. Open weights, so exit cost is effectively zero.
Input $1.74 / 1M · Output $3.48 / 1M

Google · Apache 2.0
Gemma 4 27B (2026-04-12)
Best self-host quality (~75); designed for single-GPU deployment
Sweet spot for on-prem / regulated workloads. Below DeepSeek V4 Pro on quality, but far smaller and simpler to operate.
Input self-host · Output self-host
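
The per-million-token prices above are enough to size cheaper-tier substitution with back-of-envelope arithmetic. The sketch below uses those list prices; the workload profile (tokens per request, requests per month) is invented for illustration, and Gemma 4 27B is omitted because it is self-hosted.

```python
# List prices from the cards above, in $ per 1M tokens (input, output).
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    'GPT-5.5 "Spud"': (5.00, 30.00),
    "Gemini 3.1 Pro Preview": (3.50, 10.50),
    "DeepSeek V4 Pro": (1.74, 3.48),
}

def monthly_cost(input_tokens, output_tokens, requests, in_price, out_price):
    """Cost of a month of traffic at the given per-request token mix."""
    per_request = (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price
    return per_request * requests

# Hypothetical workload: 3k input + 800 output tokens per request, 500k requests/month.
for model, (in_price, out_price) in PRICES.items():
    cost = monthly_cost(3_000, 800, 500_000, in_price, out_price)
    print(f"{model:24s} ${cost:,.0f}/month")
```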

How to Use These Reports

For Procurement

  • Start with the procurement notes section.
  • Cross-reference the lock-in score with our vendor leaderboard.
  • Use the cost-quality validation section to size cheaper-tier substitution.

For Engineering Leads

  • Read "When NOT to use this model" before "When to use".
  • Match your workload to an SMQTS category in the methodology.
  • Use the failure-mode tables to write canary evals for your own pipeline.
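
As a concrete version of that last point, the sketch below is a tiny canary harness that replays a handful of prompts and fails loudly when expected markers disappear from the output. The prompts, expected substrings, and the `call_model` hook are hypothetical placeholders, not entries from the SMQTS failure-mode tables.

```python
# Hypothetical canary evals: each entry names a failure mode, a prompt, and
# substrings the model's answer must contain. Replace with checks derived from
# the failure-mode tables in the per-model reports.
CANARIES = [
    {
        "id": "multi-file-refactor-01",
        "prompt": "Rename the config loader across the three files below ...",
        "must_contain": ["load_config", "tests/test_config.py"],
    },
    {
        "id": "stack-trace-debug-01",
        "prompt": "Given this traceback, name the failing module ...",
        "must_contain": ["KeyError"],
    },
]

def run_canaries(call_model):
    """`call_model` is whatever client you use (router, SDK, HTTP wrapper)."""
    failures = []
    for canary in CANARIES:
        output = call_model(canary["prompt"])
        missing = [needle for needle in canary["must_contain"] if needle not in output]
        if missing:
            failures.append(f"{canary['id']}: missing {missing}")
    return failures

if __name__ == "__main__":
    # Stand-in client so the sketch runs as-is; wire up your real client here.
    fake_model = lambda prompt: "load_config ... tests/test_config.py ... KeyError"
    print(run_canaries(fake_model) or "all canaries passed")
```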

For Researchers

  • The methodology page describes pinned versions and seeds (a minimal manifest sketch follows this list).
  • Download the .md per-model report for raw scores.
  • Replication kit and transcripts are linked from the methodology footer.
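
For orientation, a replication manifest that pins versions and a seed might look like the sketch below. The schema, model identifiers, and seed value are assumptions for illustration; only the dates come from the cards above, and the methodology page remains the source of truth.

```python
# Illustrative only: not the replication kit's actual schema.
PINNED_RUN = {
    "suite": "SMQTS",
    "report_cycle": "2026-05",
    "seed": 12345,  # placeholder; the methodology page lists the real seeds
    "models": {
        "claude-opus-4.7": "2026-04-16",
        "gpt-5.5-spud": "2026-04-23",
        "gemini-3.1-pro-preview": "2026-04-30",
        "deepseek-v4-pro": "2026-04-24",
        "gemma-4-27b": "2026-04-12",
    },
}
```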