Most published model evaluations are vibes-driven Twitter takes dressed up with a benchmark or two. They are useless when a procurement decision is on the line, because they cannot be reproduced and the conflicts of interest are buried. This document is the antidote — the full methodology behind every Swfte model deep-dive report, written so any reader can re-run the scoring.
The Three Test Series
The Swfte Model Quality Test Suite (SMQTS) is the union of three series. Each series targets a distinct decision context, and a model's overall score is a weighted blend across all three.
- Programming Series (40% weight) — code-centric workloads. The biggest single category in production use of frontier models, and the one where wrong answers cost real engineering time.
- Non-Programming Series (40% weight) — writing, reasoning, extraction, translation. Where most general-purpose business workloads land.
- Cost-Quality Validation Series (20% weight) — a separate sample run against tiers (frontier, mid, cheap) to quantify when a cheaper model matches an expensive one.
Each series produces a category-level score (0-100), a list of failure modes, and a set of representative transcripts. Reports show all three in the same form so cross-model comparison is direct.
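A minimal sketch of how the blend works, assuming the 40/40/20 series weights listed above; the function name and the placeholder cost-quality score are illustrative, not part of the published kit.

```python
# Minimal sketch of the SMQTS overall blend. The 40/40/20 weights come from the
# series list above; everything else (names, the example inputs) is illustrative.
SERIES_WEIGHTS = {
    "programming": 0.40,
    "non_programming": 0.40,
    "cost_quality": 0.20,
}

def blended_score(series_scores: dict[str, float]) -> float:
    """Blend the three series scores (each 0-100) into one overall 0-100 score."""
    return sum(SERIES_WEIGHTS[s] * series_scores[s] for s in SERIES_WEIGHTS)

# Claude Opus 4.7's programming and non-programming scores appear later in this
# document; the cost-quality series score here is a made-up placeholder.
print(blended_score({"programming": 91.2, "non_programming": 83.6, "cost_quality": 80.0}))
# -> 85.92
```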
Programming Series — 10 Categories
Each category contains 6 prompts at varying difficulty (2 easy, 2 medium, 2 hard). All prompts use real-world code drawn from public open-source repos. Prompts are versioned with the suite.
| # | Category | Example prompt shape | What we score |
|---|---|---|---|
| P1 | Multi-file refactor | "Refactor this Python module to use async/await throughout" (4-8 file repo) | Correctness, completeness, no regressions |
| P2 | Bug-finding from stack trace | Stack trace + repo. Find the real cause, not the surface symptom. | Root-cause accuracy, fix quality |
| P3 | Code review | Review a PR diff for security, performance, and idiomatic issues | Coverage, false-positive rate |
| P4 | Test generation | Generate unit, integration, and edge-case tests for a function | Coverage, edge-case discovery, runnable |
| P5 | SQL from natural language | "Find the top 10 customers by Q1 revenue, excluding refunds" against a known schema | Correctness on real data, performance |
| P6 | Algorithm from spec | Implement Tarjan's SCC algorithm from a written spec | Correctness, complexity, edge cases |
| P7 | Migration scripts | Upgrade React 18 to React 19; Django 4.2 to 5.1; etc. | Build passes, tests pass, no behavioural drift |
| P8 | Documentation generation | Generate API docs from a 600-line module | Accuracy, completeness, voice consistency |
| P9 | Diff comprehension | Given a 500-line diff, produce a one-paragraph summary that a reviewer can act on | Faithfulness, omission rate |
| P10 | Tool-using agent loops | Multi-turn agentic task: file read, code edit, test run, re-edit on failure | Loop completion, recovery from tool errors |
Non-Programming Series — 10 Categories
| # | Category | Example prompt shape | What we score |
|---|---|---|---|
| N1 | Long-form drafting | 3,000-word article from a 12-bullet outline | Faithfulness to outline, voice, structure |
| N2 | Summarization | Executive summary of a 30K-token document | Coverage, faithfulness, no hallucinations |
| N3 | Multi-step reasoning | Math word problems, logic puzzles, planning tasks | Correctness, working shown |
| N4 | Information extraction | Pull 14 fields from a messy invoice or contract | Field accuracy, missing-vs-fabricated rate |
| N5 | Translation | EN → {ES, FR, DE, JA, ZH, AR}; back-translate check | Fluency, faithfulness, register |
| N6 | Style transfer / tone | Rewrite a memo in three target voices (formal, casual, technical) | Voice match, content preservation |
| N7 | Adversarial prompt resistance | Standard jailbreak suite + prompt-injection traps in retrieved content | Refusal correctness, no false-positive refusals |
| N8 | Structured output | Strict JSON schema; mid-sized nested object | Schema validity, value correctness |
| N9 | Domain QA | Open-book questions on a 50K-token corpus (legal, medical, technical) | Faithfulness to source, citation accuracy |
| N10 | Multi-turn coherence | 20-turn conversation where context from turn 3 must influence turn 18 | Context retention, contradiction rate |
Cost-Quality Validation Series
List pricing only matters if the cheap model can do the work. This series exists to find out — and it is the part of SMQTS that has changed the most procurement decisions for our readers.
The protocol:
- Take a 50-prompt sample stratified across the 20 SMQTS categories.
- Run that sample on three model tiers: frontier (e.g., Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro), mid-tier (e.g., DeepSeek V4 Pro, Sonnet, Mini), and cheap (e.g., DeepSeek V4 Flash, Haiku, Mini-Tier).
- Score side-by-side blind. Raters see two anonymous outputs and a prompt; pick a winner or call a tie. Repeat across raters.
- Compute the "quality-equivalent cost" — at what spend does the cheaper tier match the expensive tier on this category?
- Document the failure modes — where, specifically, does the cheap model lose? Single-shot reasoning depth? Long-context grounding? Tool-call schema compliance?
Worked example. On category N4 (information extraction from messy text), DeepSeek V4 Pro at $1.74/$3.48 per million input/output tokens ties GPT-5.5 at $5/$30 in 78% of pairwise comparisons. The remaining 22% are split: GPT-5.5 wins 15%, DeepSeek wins 7%. The quality-equivalent cost of frontier-grade information extraction is therefore not the GPT-5.5 list price — it is roughly the DeepSeek price plus a small QA budget.
Worked example, the other direction. On category P1 (multi-file refactor), the same swap fails. Claude Opus 4.7 wins 71% of pairwise comparisons against DeepSeek V4 Pro. The cheaper model produces plausible single-file changes but loses cross-file consistency. Substitution is a quality regression, not a savings.
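A sketch of how the blind tallies turn into a substitution call. Only the 78/15/7 split from N4 and the 71% Claude win rate from P1 come from the worked examples above; the 20% win-rate threshold and the P1 tie/loss split are illustrative assumptions.

```python
# Sketch of the blind pairwise tally used in the cost-quality series.
from collections import Counter

def pairwise_rates(outcomes: list[str]) -> dict[str, float]:
    """Tally blind pairwise outcomes: 'tie', 'frontier' win, or 'cheap' win."""
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in ("tie", "frontier", "cheap")}

def substitutable(rates: dict[str, float], max_frontier_win_rate: float = 0.20) -> bool:
    """Call the cheap tier a quality-equivalent substitute when the frontier model
    wins no more than the chosen threshold of comparisons (threshold is assumed)."""
    return rates["frontier"] <= max_frontier_win_rate

# N4 worked example: 78% ties, GPT-5.5 wins 15%, DeepSeek V4 Pro wins 7%.
print(substitutable({"tie": 0.78, "frontier": 0.15, "cheap": 0.07}))   # True
# P1 worked example: Claude Opus 4.7 wins 71% (remaining split is illustrative).
print(substitutable({"tie": 0.20, "frontier": 0.71, "cheap": 0.09}))   # False
```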
Scoring Rubric — 0-5 Per Dimension
Every response is scored on five dimensions. The dimensions are the same across all categories so cross-category comparison is meaningful.
| Dimension | 0 (fail) | 3 (acceptable) | 5 (best in class) |
|---|---|---|---|
| Correctness | Wrong answer / does not run | Mostly right, minor issues | Fully correct, no caveats |
| Completeness | Major omissions | Covers requested scope | Anticipates edge cases |
| Faithfulness | Hallucinations or fabrication | Tied to source / spec | Cites and grounds every claim |
| Form | Wrong format / unusable | Correct structure | Idiomatic and clean |
| Efficiency | Wasteful or wrong-Big-O | Reasonable | Best practical complexity |
Per-category weights for the dimensions are published in the suite manifest. For example, P5 (SQL) weights Correctness 40%, Efficiency 25%, Form 20%, Completeness 10%, Faithfulness 5%. N2 (Summarization) inverts this — Faithfulness 40%, Completeness 30%, Form 15%, Correctness 10%, Efficiency 5%.
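A sketch of the per-response weighted score. The 0-5 dimension scores and the P5/N2 dimension weights are the ones stated above; the example dimension scores are made up, and the real per-category weights live in the suite manifest.

```python
# Sketch of combining five 0-5 dimension scores into one weighted 0-5 score.
P5_WEIGHTS = {"correctness": 0.40, "efficiency": 0.25, "form": 0.20,
              "completeness": 0.10, "faithfulness": 0.05}
N2_WEIGHTS = {"faithfulness": 0.40, "completeness": 0.30, "form": 0.15,
              "correctness": 0.10, "efficiency": 0.05}

def weighted_score(dimension_scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the five rubric dimensions for one response."""
    return sum(weights[d] * dimension_scores[d] for d in weights)

# Illustrative (made-up) dimension scores for one P5 response:
example = {"correctness": 5, "efficiency": 4, "form": 4, "completeness": 3, "faithfulness": 5}
print(weighted_score(example, P5_WEIGHTS))  # -> 4.35
```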
Example scoring matrix (P1, one prompt)
| Model | Correct. | Compl. | Faith. | Form | Effic. | Weighted |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 5 | 5 | 5 | 4 | 4 | 4.75 |
| GPT-5.5 | 4 | 4 | 4 | 5 | 4 | 4.20 |
| Gemini 3.1 Pro | 4 | 3 | 4 | 4 | 3 | 3.65 |
| DeepSeek V4 Pro | 3 | 3 | 4 | 4 | 3 | 3.30 |
| Gemma 4 27B | 2 | 2 | 3 | 3 | 3 | 2.35 |
Bias Controls
Three controls run on every grading session. They are not optional and they do not get bypassed under deadline pressure.
Randomized response order
Raters never see "Claude said X, GPT said Y". Outputs are anonymized to Model A, Model B labels that rotate per prompt. The mapping is held in a sealed file revealed only after grading is complete.
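A sketch of the label rotation and the sealed mapping, under assumptions: the file name, record structure, and helper names are illustrative, not the kit's actual format.

```python
# Sketch: rotate "Model A" / "Model B" labels per prompt and append the true
# mapping to a sealed file that is opened only after grading is complete.
import json
import random

def anonymize(prompt_id: str, outputs: dict[str, str], rng: random.Random) -> tuple[dict, dict]:
    """Return (what the rater sees, the sealed mapping entry) for one prompt."""
    models = list(outputs)
    rng.shuffle(models)                                        # rotate labels per prompt
    labels = {f"Model {chr(65 + i)}": m for i, m in enumerate(models)}
    rater_view = {label: outputs[model] for label, model in labels.items()}
    return rater_view, {"prompt_id": prompt_id, "labels": labels}

rng = random.Random(2026)
view, sealed = anonymize("P1-003", {"claude-opus-4-7": "...", "gpt-5.5": "..."}, rng)
with open("sealed_mapping.jsonl", "a") as f:                   # stays sealed until grading closes
    f.write(json.dumps(sealed) + "\n")
```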
Multi-rater grading with tiebreaker
Two raters score every response independently. If they disagree by more than 1 point on any dimension, a third rater breaks the tie. Inter-rater agreement (Krippendorff's alpha) is published per category in each report.
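The tiebreaker trigger is mechanical, so here is a minimal sketch of it; the dimension names follow the rubric above, and the function name is illustrative.

```python
# Sketch of the tiebreaker rule: a third rater is pulled in when the first two
# raters differ by more than 1 point on any dimension.
DIMENSIONS = ("correctness", "completeness", "faithfulness", "form", "efficiency")

def needs_tiebreaker(rater_1: dict[str, int], rater_2: dict[str, int]) -> bool:
    """True if any rubric dimension differs by more than 1 point between the raters."""
    return any(abs(rater_1[d] - rater_2[d]) > 1 for d in DIMENSIONS)

print(needs_tiebreaker(
    {"correctness": 5, "completeness": 4, "faithfulness": 5, "form": 4, "efficiency": 4},
    {"correctness": 3, "completeness": 4, "faithfulness": 5, "form": 4, "efficiency": 4},
))  # True -> correctness differs by 2, third rater breaks the tie
```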
Drift checks
At random intervals we re-grade old responses without telling the rater. If a rater's scoring drifts more than 0.5 points between sessions, their grading is recalibrated and recent work is reviewed.
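A sketch of the drift check. The 0.5-point threshold is from the text; how drift is aggregated (here, the worst single re-graded response) and the identifiers are assumptions.

```python
# Sketch: silently re-graded responses are compared against the same rater's
# original scores; exceeding the drift threshold triggers recalibration.
def drift_exceeded(original: dict[str, float], regrade: dict[str, float],
                   threshold: float = 0.5) -> bool:
    """Flag a rater when any re-graded response drifts more than the threshold
    from the original session (per-response aggregation is an assumption)."""
    return max(abs(regrade[rid] - original[rid]) for rid in original) > threshold

print(drift_exceeded({"P1-003": 4.35, "N2-010": 3.90},
                     {"P1-003": 3.60, "N2-010": 3.80}))  # True -> recalibrate, review recent work
```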
Reproducibility
Reproducibility is what separates a methodology from an opinion. The full kit is published with every report (a sketch of what one manifest entry might contain follows the list):
- Pinned model versions. Reports name the exact API model identifier (e.g., claude-opus-4-7-2026-04-16) and the system prompt used.
- Pinned suite version. Every report carries an SMQTS version (e.g., v1.3) and the prompt-set hash. Changing the prompts requires a version bump.
- Decoding parameters. Temperature, top_p, max tokens, and stop sequences are pinned per category. Reasoning models run with thinking budgets pinned in the manifest.
- Raw transcripts archived. Every model response is stored unmodified, with timestamps and the API request/response envelope.
- Scoring sheets. Per-rater per-prompt scores are released as CSV, including disagreements and tiebreaker outcomes.
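The sketch below shows the kind of entry the bullets above describe. The model identifier and suite version are the ones quoted above; every field name, the prompt-set hash placeholder, and the decoding values are assumptions, not the kit's actual format.

```python
# Illustrative sketch of a pinned manifest entry; the real manifest in the
# replication kit may use a different format and different field names.
MANIFEST_ENTRY = {
    "suite_version": "v1.3",
    "prompt_set_hash": "sha256:<published with the report>",   # placeholder
    "category": "P5",
    "model": "claude-opus-4-7-2026-04-16",
    "system_prompt_id": "default-v2",          # assumed naming
    "decoding": {
        "temperature": 0.2,                    # illustrative values; the real ones
        "top_p": 0.95,                         # are pinned per category
        "max_tokens": 4096,
        "stop": [],
    },
    "thinking_budget_tokens": None,            # set only for reasoning models
}
```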
SMQTS Score Distribution (May 2026)
Below is the headline weighted-blend score across the 20-category core (programming + non-programming). The cost-quality validation score is reported separately because its scale is different.
```
Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7     87.4   ##########################################
Gemini 3.1 Pro      86.7   ##########################################
GPT-5.5             85.1   #########################################
DeepSeek V4 Pro     79.2   #####################################
Gemma 4 27B         71.8   ##################################
```
Programming-only score
```
Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7     91.2   ############################################
GPT-5.5             83.5   ########################################
Gemini 3.1 Pro      82.1   #######################################
DeepSeek V4 Pro     76.4   ####################################
Gemma 4 27B         65.9   ###############################
```
Non-programming-only score
```
Model              Score   ASCII bar (0-100)
=================================================
Gemini 3.1 Pro      91.3   ############################################
GPT-5.5             86.7   #########################################
Claude Opus 4.7     83.6   ########################################
DeepSeek V4 Pro     82.0   #######################################
Gemma 4 27B         77.7   #####################################
```
Conflict-of-Interest Disclosure
Swfte Connect routes production traffic across these same models. We have a commercial stake in the routing decisions readers make based on these reports. Two specific incentives we want flagged:
- We benefit from readers using a router (any router, not just ours) over single-vendor lock-in. Our reports therefore consistently emphasize multi-model strategies. Readers should weigh that against single-vendor procurement value.
- We benefit from readers picking models we have routing contracts with. We do not have contracts that depend on directing traffic to a specific provider, but we do have volume-based pricing on some providers that improves with scale.
The mitigation is transparency. We publish the prompts, the transcripts, the rater scoring sheets, and the disagreements. If our scoring is off, the raw data lets you prove it. If you find an error, we will publish a correction with attribution.
What We Do NOT Test
SMQTS is text-input/text-output. The following workloads are explicitly out of scope and will not appear in any deep-dive report:
- Image generation quality (Midjourney, DALL-E, Imagen-class outputs).
- Video generation quality (Sora, Veo, Kling-class outputs).
- Audio TTS quality, audio STT accuracy, music generation.
- Embedding-only models. Retrieval quality is part of N9 only when the model is the answerer, not the retriever.
- Vision-only multimodal QA where the model never produces text. Vision-conditioned text generation IS in scope (it appears as a sub-track in N4 and N9).
- Real-time voice agents. Latency profile is too coupled to the pipeline to attribute to the model.
These workloads matter — they are simply not what SMQTS measures. Readers evaluating image or video models should look for purpose-built suites (FID, CLIP-score, or human preference panels appropriate to that domain).
Versioning
SMQTS is versioned major.minor. A minor bump (1.3 → 1.4) means new prompts added or rubric weights adjusted; a major bump (1.x → 2.0) means a structural change in how series compose. Every model report carries the suite version in its header.
Version 1.3 (May 2026) added 12 prompts to N7 (adversarial resistance) following the spike in prompt-injection attacks against retrieval pipelines. The full changelog is at /research/smqts-changelog.md.
How to Replicate
Anyone can re-run the suite. The kit at /research/smqts-replication.md includes:
- The 180 prompts, with category and difficulty tags.
- The decoding parameters per category (temperature, top_p, max tokens).
- The rubric guide and per-category dimension weights.
- A reference scoring spreadsheet template.
- The full transcripts of our May 2026 grading run, for comparison.