Swfte AI Research Reports — May 2026
Independent, reproducible deep-dive reports on the five models that matter in May 2026. Honest scoring, raw transcripts, published prompts, disclosed conflicts.
Most published model reviews are either marketing reposts or vibes-driven Twitter takes. Neither is useful when a procurement decision is on the line. We built the Swfte Model Quality Test Suite (SMQTS) to fix that — three series of structured prompts (programming, non-programming, cost-quality validation) graded blind across multiple raters with a published 0-5 rubric.
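The blind multi-rater grading described above can be sketched as follows. The category names, rater scores, and disagreement rule here are illustrative assumptions for exposition only; the actual SMQTS rubric, prompts, and scoring sheets are the published ones.

```python
from statistics import mean, stdev

# Hypothetical rater scores on a 0-5 rubric -- illustrative only,
# not real SMQTS data.
scores = {
    "programming": [4, 5, 4],
    "non-programming": [3, 3, 4],
    "cost-quality": [2, 3, 2],
}

def aggregate(category_scores, disagreement_threshold=1.0):
    """Mean score per category, flagging categories where raters
    disagree by more than the threshold (stdev on the 0-5 scale)."""
    report = {}
    for category, ratings in category_scores.items():
        report[category] = {
            "mean": round(mean(ratings), 2),
            "disputed": stdev(ratings) > disagreement_threshold,
        }
    return report

print(aggregate(scores))
```

Publishing per-rater scores rather than only the aggregate is what makes the "disputed" flag checkable by readers.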
Every model in this collection has been through the full SMQTS at its pinned May 2026 version. The reports below show the category-level scores, the failure modes we found, and the specific workloads where each model wins or loses. We do not award points for marketing copy.
We disclose the obvious conflict: Swfte Connect routes production traffic across these same models, so we have a commercial incentive in the routing decisions readers make. The mitigation is transparency — we publish the prompts, the transcripts, the scoring sheets, and the disagreements between raters. If our scoring is wrong, the raw data lets you prove it.
These reports are written for three audiences: procurement teams evaluating multi-year contracts, engineering leads picking a default model for a product line, and researchers comparing released systems on a common rubric. Each section flags which audience it serves.
Start Here
Read the Methodology First
The prompts, the rubric, the controls. Without this document, the deep-dive scores are just numbers.
Anthropic · Closed
Claude Opus 4.7
Coding Arena #1 (1567 Elo)
SWE-bench Pro 64.3%, MMLU 91.2%
Best at multi-file refactors and stack-trace debugging. Loses to Gemini 3.1 Pro on raw reasoning (GPQA Diamond).
OpenAI · Closed
GPT-5.5 "Spud"
AAII 59 (highest)
1M context; Pro variant $30/$180 per 1M tokens (input/output)
Best generalist; strongest tool-use reliability. Pro variant is the most expensive frontier API on the market.
Google · Closed
Gemini 3.1 Pro Preview
Text Arena #1 (~1500 Elo)
GPQA Diamond 94.3%, 2M context
Reasoning king; cheapest frontier-tier output pricing. Loses to Claude on multi-file code refactors.
DeepSeek · Apache 2.0
DeepSeek V4 Pro
Quality-per-dollar leader
Apache 2.0; 1.6T MoE / 49B active; 1M context
Frontier-adjacent quality at one-tenth the cost. Open weights — exit cost effectively zero.
Google · Apache 2.0
Gemma 4 27B
Best self-host quality (~75)
Apache 2.0; designed for single-GPU deploy
Sweet spot for on-prem / regulated workloads. Below DeepSeek V4 Pro on quality but far smaller and simpler to operate.
How to Use These Reports
For Procurement
- Start with the procurement notes section.
- Cross-reference the lock-in score with our vendor leaderboard.
- Use the cost-quality validation section to size cheaper-tier substitution.
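Sizing cheaper-tier substitution comes down to quality per dollar at your workload's token profile. The sketch below is a back-of-envelope calculation: only the GPT-5.5 Pro price ($30/$180 per 1M tokens) appears in the report cards above; the other price and both quality scores are placeholder assumptions.

```python
# Back-of-envelope cheaper-tier substitution sizing.
# Prices are dollars per 1M tokens (input, output); quality scores
# here are placeholders, not SMQTS results.
models = {
    "gpt-5.5-pro": {"in": 30.0, "out": 180.0, "quality": 90},
    "cheaper-tier": {"in": 3.0, "out": 9.0, "quality": 82},
}

def cost_per_request(price, in_tokens=2_000, out_tokens=500):
    """Blended dollar cost for a typical request."""
    return (in_tokens * price["in"] + out_tokens * price["out"]) / 1e6

def quality_per_dollar(model):
    return model["quality"] / cost_per_request(model)

for name, m in models.items():
    print(name, round(cost_per_request(m), 4), round(quality_per_dollar(m)))
```

Swap in your own measured token profile and the SMQTS category scores for the workloads you actually run; the ranking often flips between input-heavy and output-heavy traffic.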
For Engineering Leads
- Read "When NOT to use this model" before "When to use".
- Match your workload to an SMQTS category in the methodology.
- Use the failure-mode tables to write canary evals for your own pipeline.
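A canary eval built from a failure-mode table can be as simple as a list of prompt/expectation pairs run in CI. The sketch below is generic, not Swfte's harness: `call_model` is a placeholder for whatever client your pipeline already uses, and the two canary pairs are hypothetical examples.

```python
# Minimal canary eval sketch. Populate CANARIES from the failure-mode
# tables for your chosen model; these two pairs are hypothetical.
CANARIES = [
    # (prompt, substring the answer must contain)
    ("What is 17 * 24?", "408"),
    ("Name the capital of Australia.", "Canberra"),
]

def run_canaries(call_model):
    """Return the list of failed canaries; empty list means all passed."""
    failures = []
    for prompt, expected in CANARIES:
        answer = call_model(prompt)
        if expected not in answer:
            failures.append((prompt, expected, answer))
    return failures

# Stub "model" for demonstration -- replace with your real client.
stub = {
    "What is 17 * 24?": "17 * 24 = 408",
    "Name the capital of Australia.": "Canberra",
}
print(run_canaries(lambda p: stub[p]))
```

Fail the build on a non-empty return value; that turns a report's failure-mode table into a regression guard for model or version swaps.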
For Researchers
- Check the methodology page for pinned versions and seeds.
- Download the .md per-model report for raw scores.
- Replication kit and transcripts are linked from the methodology footer.