Most published model evaluations are vibes-driven Twitter takes dressed up with a benchmark or two. They are useless when a procurement decision is on the line, because they cannot be reproduced and the conflicts of interest are buried. This document is the antidote — the full methodology behind every Swfte model deep-dive report, written so any reader can re-run the scoring.
The Three Test Series
The Swfte Model Quality Test Suite (SMQTS) is the union of three series. Each series targets a distinct decision context, and a model's overall score is a weighted blend across all three.
- Programming Series (40% weight) — code-centric workloads. The biggest single category in production use of frontier models, and the one where wrong answers cost real engineering time.
- Non-Programming Series (40% weight) — writing, reasoning, extraction, translation. Where most general-purpose business workloads land.
- Cost-Quality Validation Series (20% weight) — a separate sample run against tiers (frontier, mid, cheap) to quantify when a cheaper model matches an expensive one.
Each series produces a category-level score (0-100), a list of failure modes, and a set of representative transcripts. Reports show all three in the same form so cross-model comparison is direct.
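A minimal sketch of how the blend works, assuming the 40/40/20 series weights listed above; the function name and the placeholder cost-quality score are illustrative, not part of the published kit.

```python
# Minimal sketch of the SMQTS overall blend. The 40/40/20 weights come from the
# series list above; everything else (names, the example inputs) is illustrative.
SERIES_WEIGHTS = {
    "programming": 0.40,
    "non_programming": 0.40,
    "cost_quality": 0.20,
}

def blended_score(series_scores: dict[str, float]) -> float:
    """Blend the three series scores (each 0-100) into one overall 0-100 score."""
    return sum(SERIES_WEIGHTS[s] * series_scores[s] for s in SERIES_WEIGHTS)

# Claude Opus 4.7's programming and non-programming scores appear later in this
# document; the cost-quality series score here is a made-up placeholder.
print(blended_score({"programming": 91.2, "non_programming": 83.6, "cost_quality": 80.0}))
# -> 85.92
```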
Programming Series — 10 Categories
Each category contains 6 prompts at varying difficulty (2 easy, 2 medium, 2 hard). All prompts use real-world code drawn from public open-source repos. Prompts are versioned with the suite.
| # | Category | Example prompt shape | What we score |
|---|---|---|---|
| P1 | Multi-file refactor | "Refactor this Python module to use async/await throughout" (4-8 file repo) | Correctness, completeness, no regressions |
| P2 | Bug-finding from stack trace | Stack trace + repo. Find the real cause, not the surface symptom. | Root-cause accuracy, fix quality |
| P3 | Code review | Review a PR diff for security, performance, and idiomatic issues | Coverage, false-positive rate |
| P4 | Test generation | Generate unit, integration, and edge-case tests for a function | Coverage, edge-case discovery, runnable |
| P5 | SQL from natural language | "Find the top 10 customers by Q1 revenue, excluding refunds" against a known schema | Correctness on real data, performance |
| P6 | Algorithm from spec | Implement Tarjan's SCC algorithm from a written spec | Correctness, complexity, edge cases |
| P7 | Migration scripts | Upgrade React 18 to React 19; Django 4.2 to 5.1; etc. | Build passes, tests pass, no behavioural drift |
| P8 | Documentation generation | Generate API docs from a 600-line module | Accuracy, completeness, voice consistency |
| P9 | Diff comprehension | Given a 500-line diff, produce a one-paragraph summary that a reviewer can act on | Faithfulness, omission rate |
| P10 | Tool-using agent loops | Multi-turn agentic task: file read, code edit, test run, re-edit on failure | Loop completion, recovery from tool errors |
Non-Programming Series — 10 Categories
| # | Category | Example prompt shape | What we score |
|---|---|---|---|
| N1 | Long-form drafting | 3,000-word article from a 12-bullet outline | Faithfulness to outline, voice, structure |
| N2 | Summarization | Executive summary of a 30K-token document | Coverage, faithfulness, no hallucinations |
| N3 | Multi-step reasoning | Math word problems, logic puzzles, planning tasks | Correctness, working shown |
| N4 | Information extraction | Pull 14 fields from a messy invoice or contract | Field accuracy, missing-vs-fabricated rate |
| N5 | Translation | EN → {ES, FR, DE, JA, ZH, AR}; back-translate check | Fluency, faithfulness, register |
| N6 | Style transfer / tone | Rewrite a memo in three target voices (formal, casual, technical) | Voice match, content preservation |
| N7 | Adversarial prompt resistance | Standard jailbreak suite + prompt-injection traps in retrieved content | Refusal correctness, no false-positive refusals |
| N8 | Structured output | Strict JSON schema; mid-sized nested object | Schema validity, value correctness |
| N9 | Domain QA | Open-book questions on a 50K-token corpus (legal, medical, technical) | Faithfulness to source, citation accuracy |
| N10 | Multi-turn coherence | 20-turn conversation where context from turn 3 must influence turn 18 | Context retention, contradiction rate |
Cost-Quality Validation Series
List pricing only matters if the cheap model can do the work. This series exists to find out — and it is the part of SMQTS that has changed the most procurement decisions for our readers.
The protocol:
- Take a 50-prompt sample stratified across the 20 SMQTS categories.
- Run that sample on three model tiers: frontier (e.g., Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro), mid-tier (e.g., DeepSeek V4 Pro, Sonnet, Mini), and cheap (e.g., DeepSeek V4 Flash, Haiku, Mini-Tier).
- Score side-by-side blind. Raters see two anonymous outputs and a prompt; pick a winner or call a tie. Repeat across raters.
- Compute the "quality-equivalent cost" — at what spend does the cheaper tier match the expensive tier on this category?
- Document the failure modes — where, specifically, does the cheap model lose? Single-shot reasoning depth? Long-context grounding? Tool-call schema compliance?
Worked example. On category N4 (information extraction from messy text), DeepSeek V4 Pro at $1.74/$3.48 per million input/output tokens ties GPT-5.5 at $5/$30 in 78% of pairwise comparisons. The remaining 22% are split: GPT-5.5 wins 15%, DeepSeek wins 7%. The quality-equivalent cost of frontier-grade information extraction is therefore not the GPT-5.5 list price — it is roughly the DeepSeek price plus a small QA budget.
Worked example, the other direction. On category P1 (multi-file refactor), the same swap fails. Claude Opus 4.7 wins 71% of pairwise comparisons against DeepSeek V4 Pro. The cheaper model produces plausible single-file changes but loses cross-file consistency. Substitution is a quality regression, not a savings.
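A sketch of how the blind tallies turn into a substitution call. Only the 78/15/7 split from N4 and the 71% Claude win rate from P1 come from the worked examples above; the 20% win-rate threshold and the P1 tie/loss split are illustrative assumptions.

```python
# Sketch of the blind pairwise tally used in the cost-quality series.
from collections import Counter

def pairwise_rates(outcomes: list[str]) -> dict[str, float]:
    """Tally blind pairwise outcomes: 'tie', 'frontier' win, or 'cheap' win."""
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in ("tie", "frontier", "cheap")}

def substitutable(rates: dict[str, float], max_frontier_win_rate: float = 0.20) -> bool:
    """Call the cheap tier a quality-equivalent substitute when the frontier model
    wins no more than the chosen threshold of comparisons (threshold is assumed)."""
    return rates["frontier"] <= max_frontier_win_rate

# N4 worked example: 78% ties, GPT-5.5 wins 15%, DeepSeek V4 Pro wins 7%.
print(substitutable({"tie": 0.78, "frontier": 0.15, "cheap": 0.07}))   # True
# P1 worked example: Claude Opus 4.7 wins 71% (remaining split is illustrative).
print(substitutable({"tie": 0.20, "frontier": 0.71, "cheap": 0.09}))   # False
```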
Scoring Rubric — 0-5 Per Dimension
Every response is scored on five dimensions. The dimensions are the same across all categories so cross-category comparison is meaningful.
| Dimension | 0 (fail) | 3 (acceptable) | 5 (best in class) |
|---|---|---|---|
| Correctness | Wrong answer / does not run | Mostly right, minor issues | Fully correct, no caveats |
| Completeness | Major omissions | Covers requested scope | Anticipates edge cases |
| Faithfulness | Hallucinations or fabrication | Tied to source / spec | Cites and grounds every claim |
| Form | Wrong format / unusable | Correct structure | Idiomatic and clean |
| Efficiency | Wasteful or wrong-Big-O | Reasonable | Best practical complexity |
Per-category weights for the dimensions are published in the suite manifest. For example, P5 (SQL) weights Correctness 40%, Efficiency 25%, Form 20%, Completeness 10%, Faithfulness 5%. N2 (Summarization) inverts this — Faithfulness 40%, Completeness 30%, Form 15%, Correctness 10%, Efficiency 5%.
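A sketch of the per-response weighted score. The 0-5 dimension scores and the P5/N2 dimension weights are the ones stated above; the example dimension scores are made up, and the real per-category weights live in the suite manifest.

```python
# Sketch of combining five 0-5 dimension scores into one weighted 0-5 score.
P5_WEIGHTS = {"correctness": 0.40, "efficiency": 0.25, "form": 0.20,
              "completeness": 0.10, "faithfulness": 0.05}
N2_WEIGHTS = {"faithfulness": 0.40, "completeness": 0.30, "form": 0.15,
              "correctness": 0.10, "efficiency": 0.05}

def weighted_score(dimension_scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the five rubric dimensions for one response."""
    return sum(weights[d] * dimension_scores[d] for d in weights)

# Illustrative (made-up) dimension scores for one P5 response:
example = {"correctness": 5, "efficiency": 4, "form": 4, "completeness": 3, "faithfulness": 5}
print(weighted_score(example, P5_WEIGHTS))  # -> 4.35
```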
Example scoring matrix (P1, one prompt)
| Model | Correct. | Compl. | Faith. | Form | Effic. | Weighted |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 5 | 5 | 5 | 4 | 4 | 4.75 |
| GPT-5.5 | 4 | 4 | 4 | 5 | 4 | 4.20 |
| Gemini 3.1 Pro | 4 | 3 | 4 | 4 | 3 | 3.65 |
| DeepSeek V4 Pro | 3 | 3 | 4 | 4 | 3 | 3.30 |
| Gemma 4 27B | 2 | 2 | 3 | 3 | 3 | 2.35 |
Bias Controls
Three controls run on every grading session. They are not optional and they do not get bypassed under deadline pressure.
Randomized response order
Raters never see "Claude said X, GPT said Y". Outputs are anonymized to Model A, Model B labels that rotate per prompt. The mapping is held in a sealed file revealed only after grading is complete.
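A sketch of the label rotation and the sealed mapping, under assumptions: the file name, record structure, and helper names are illustrative, not the kit's actual format.

```python
# Sketch: rotate "Model A" / "Model B" labels per prompt and append the true
# mapping to a sealed file that is opened only after grading is complete.
import json
import random

def anonymize(prompt_id: str, outputs: dict[str, str], rng: random.Random) -> tuple[dict, dict]:
    """Return (what the rater sees, the sealed mapping entry) for one prompt."""
    models = list(outputs)
    rng.shuffle(models)                                        # rotate labels per prompt
    labels = {f"Model {chr(65 + i)}": m for i, m in enumerate(models)}
    rater_view = {label: outputs[model] for label, model in labels.items()}
    return rater_view, {"prompt_id": prompt_id, "labels": labels}

rng = random.Random(2026)
view, sealed = anonymize("P1-003", {"claude-opus-4-7": "...", "gpt-5.5": "..."}, rng)
with open("sealed_mapping.jsonl", "a") as f:                   # stays sealed until grading closes
    f.write(json.dumps(sealed) + "\n")
```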
Multi-rater grading with tiebreaker
Two raters score every response independently. If they disagree by more than 1 point on any dimension, a third rater breaks the tie. Inter-rater agreement (Krippendorff's alpha) is published per category in each report.
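The tiebreaker trigger is mechanical, so here is a minimal sketch of it; the dimension names follow the rubric above, and the function name is illustrative.

```python
# Sketch of the tiebreaker rule: a third rater is pulled in when the first two
# raters differ by more than 1 point on any dimension.
DIMENSIONS = ("correctness", "completeness", "faithfulness", "form", "efficiency")

def needs_tiebreaker(rater_1: dict[str, int], rater_2: dict[str, int]) -> bool:
    """True if any rubric dimension differs by more than 1 point between the raters."""
    return any(abs(rater_1[d] - rater_2[d]) > 1 for d in DIMENSIONS)

print(needs_tiebreaker(
    {"correctness": 5, "completeness": 4, "faithfulness": 5, "form": 4, "efficiency": 4},
    {"correctness": 3, "completeness": 4, "faithfulness": 5, "form": 4, "efficiency": 4},
))  # True -> correctness differs by 2, third rater breaks the tie
```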
Drift checks
At random intervals we re-grade old responses without telling the rater. If a rater's scoring drifts more than 0.5 points between sessions, their grading is recalibrated and recent work is reviewed.
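A sketch of the drift check. The 0.5-point threshold is from the text; how drift is aggregated (here, the worst single re-graded response) and the identifiers are assumptions.

```python
# Sketch: silently re-graded responses are compared against the same rater's
# original scores; exceeding the drift threshold triggers recalibration.
def drift_exceeded(original: dict[str, float], regrade: dict[str, float],
                   threshold: float = 0.5) -> bool:
    """Flag a rater when any re-graded response drifts more than the threshold
    from the original session (per-response aggregation is an assumption)."""
    return max(abs(regrade[rid] - original[rid]) for rid in original) > threshold

print(drift_exceeded({"P1-003": 4.35, "N2-010": 3.90},
                     {"P1-003": 3.60, "N2-010": 3.80}))  # True -> recalibrate, review recent work
```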
Reproducibility
Reproducibility is what separates a methodology from an opinion. The full kit is published with every report (a sketch of what one manifest entry might contain follows the list):
- Pinned model versions. Reports name the exact API model identifier (e.g., claude-opus-4-7-2026-04-16) and the system prompt used.
- Pinned suite version. Every report carries an SMQTS version (e.g., v1.3) and the prompt-set hash. Changing the prompts requires a version bump.
- Decoding parameters. Temperature, top_p, max tokens, and stop sequences are pinned per category. Reasoning models run with thinking budgets pinned in the manifest.
- Raw transcripts archived. Every model response is stored unmodified, with timestamps and the API request/response envelope.
- Scoring sheets. Per-rater per-prompt scores are released as CSV, including disagreements and tiebreaker outcomes.
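The sketch below shows the kind of entry the bullets above describe. The model identifier and suite version are the ones quoted above; every field name, the prompt-set hash placeholder, and the decoding values are assumptions, not the kit's actual format.

```python
# Illustrative sketch of a pinned manifest entry; the real manifest in the
# replication kit may use a different format and different field names.
MANIFEST_ENTRY = {
    "suite_version": "v1.3",
    "prompt_set_hash": "sha256:<published with the report>",   # placeholder
    "category": "P5",
    "model": "claude-opus-4-7-2026-04-16",
    "system_prompt_id": "default-v2",          # assumed naming
    "decoding": {
        "temperature": 0.2,                    # illustrative values; the real ones
        "top_p": 0.95,                         # are pinned per category
        "max_tokens": 4096,
        "stop": [],
    },
    "thinking_budget_tokens": None,            # set only for reasoning models
}
```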
SMQTS Score Distribution (May 2026)
Below is the headline weighted-blend score across the 20-category core (programming + non-programming). The cost-quality validation score is reported separately because its scale is different.
```
Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7     87.4   ##########################################
Gemini 3.1 Pro      86.7   ##########################################
GPT-5.5             85.1   #########################################
DeepSeek V4 Pro     79.2   #####################################
Gemma 4 27B         71.8   ##################################
```
Programming-only score
```
Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7     91.2   ############################################
GPT-5.5             83.5   ########################################
Gemini 3.1 Pro      82.1   #######################################
DeepSeek V4 Pro     76.4   ####################################
Gemma 4 27B         65.9   ###############################
```
Non-programming-only score
```
Model              Score   ASCII bar (0-100)
=================================================
Gemini 3.1 Pro      91.3   ############################################
GPT-5.5             86.7   #########################################
Claude Opus 4.7     83.6   ########################################
DeepSeek V4 Pro     82.0   #######################################
Gemma 4 27B         77.7   #####################################
```
Conflict-of-Interest Disclosure
Swfte Connect routes production traffic across these same models. We have a commercial stake in the routing decisions readers make based on these reports. Two specific incentives we want flagged:
- We benefit from readers using a router (any router, not just ours) over single-vendor lock-in. Our reports therefore consistently emphasize multi-model strategies. Readers should weigh that against single-vendor procurement value.
- We benefit from readers picking models we have routing contracts with. We do not have contracts that depend on directing traffic to a specific provider, but we do have volume-based pricing on some providers that improves with scale.
The mitigation is transparency. We publish the prompts, the transcripts, the rater scoring sheets, and the disagreements. If our scoring is off, the raw data lets you prove it. If you find an error, we will publish a correction with attribution.
What We Do NOT Test
SMQTS is text-input/text-output. The following workloads are explicitly out of scope and will not appear in any deep-dive report:
- Image generation quality (Midjourney, DALL-E, Imagen-class outputs).
- Video generation quality (Sora, Veo, Kling-class outputs).
- Audio TTS quality, audio STT accuracy, music generation.
- Embedding-only models. Retrieval quality is part of N9 only when the model is the answerer, not the retriever.
- Vision-only multimodal QA where the model never produces text. Vision-conditioned text generation IS in scope (it appears as a sub-track in N4 and N9).
- Real-time voice agents. Latency profile is too coupled to the pipeline to attribute to the model.
These workloads matter — they are simply not what SMQTS measures. Readers evaluating image or video models should look for purpose-built suites (FID, CLIP-score, or human preference panels appropriate to that domain).
Versioning
SMQTS is versioned major.minor. A minor bump (1.3 → 1.4) means new prompts added or rubric weights adjusted; a major bump (1.x → 2.0) means a structural change in how series compose. Every model report carries the suite version in its header.
Version 1.3 (May 2026) added 12 prompts to N7 (adversarial resistance) following the spike in prompt-injection attacks against retrieval pipelines. The full changelog is at /research/smqts-changelog.md.
How to Replicate
Anyone can re-run the suite. The kit at /research/smqts-replication.md includes:
- The 180 prompts, with category and difficulty tags.
- The decoding parameters per category (temperature, top_p, max tokens).
- The rubric guide and per-category dimension weights.
- A reference scoring spreadsheet template.
- The full transcripts of our May 2026 grading run, for comparison.