SMQTS v1.3 (May 2026)

Swfte Model Evaluation Methodology — May 2026

How we score AI models. The prompts, the rubric, the controls, the disclosures.

Last updated May 6, 2026

Most published model evaluations are vibes-driven Twitter takes dressed up with a benchmark or two. They are useless when a procurement decision is on the line, because they cannot be reproduced and the conflicts of interest are buried. This document is the antidote — the full methodology behind every Swfte model deep-dive report, written so any reader can re-run the scoring.

The Three Test Series

The Swfte Model Quality Test Suite (SMQTS) is the union of three series. Each series targets a distinct decision context, and a model's overall score is a weighted blend across all three.

  1. Programming Series (40% weight) — code-centric workloads. The biggest single category in production use of frontier models, and the one where wrong answers cost real engineering time.
  2. Non-Programming Series (40% weight) — writing, reasoning, extraction, translation. Where most general-purpose business workloads land.
  3. Cost-Quality Validation Series (20% weight) — a separate sample run against tiers (frontier, mid, cheap) to quantify when a cheaper model matches an expensive one.

Each series produces a category-level score (0-100), a list of failure modes, and a set of representative transcripts. Reports show all three in the same form so cross-model comparison is direct.
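For concreteness, here is a minimal sketch of how the blend works, using the series weights above. The function names and the illustrative call are ours, not part of the published tooling.

    # Minimal sketch of the series blend described above (weights: 40/40/20).
    SERIES_WEIGHTS = {"programming": 0.40, "non_programming": 0.40, "cost_quality": 0.20}

    def overall_score(series_scores: dict[str, float]) -> float:
        """Blend the three series scores (each 0-100) into one overall number."""
        return sum(SERIES_WEIGHTS[s] * series_scores[s] for s in SERIES_WEIGHTS)

    def core_score(programming: float, non_programming: float) -> float:
        """The headline '20-category core' figure blends only the two core series, equally."""
        return (programming + non_programming) / 2

    # Illustrative call, using published scores from the May 2026 tables further down.
    print(core_score(91.2, 83.6))   # 87.4, the headline figure for Claude Opus 4.7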

Programming Series — 10 Categories

Each category contains 6 prompts at varying difficulty (2 easy, 2 medium, 2 hard). All prompts use real-world code drawn from public open-source repos. Prompts are versioned with the suite.

# | Category | Example prompt shape | What we score
P1 | Multi-file refactor | "Refactor this Python module to use async/await throughout" (4-8 file repo) | Correctness, completeness, no regressions
P2 | Bug-finding from stack trace | Stack trace + repo; find the real cause, not the surface symptom | Root-cause accuracy, fix quality
P3 | Code review | Review a PR diff for security, performance, and idiomatic issues | Coverage, false-positive rate
P4 | Test generation | Generate unit, integration, and edge-case tests for a function | Coverage, edge-case discovery, runnable
P5 | SQL from natural language | "Find the top 10 customers by Q1 revenue, excluding refunds" against a known schema | Correctness on real data, performance
P6 | Algorithm from spec | Implement Tarjan's SCC algorithm from a written spec | Correctness, complexity, edge cases
P7 | Migration scripts | Upgrade React 18 to React 19; Django 4.2 to 5.1; etc. | Build passes, tests pass, no behavioural drift
P8 | Documentation generation | Generate API docs from a 600-line module | Accuracy, completeness, voice consistency
P9 | Diff comprehension | Given a 500-line diff, produce a one-paragraph summary that a reviewer can act on | Faithfulness, omission rate
P10 | Tool-using agent loops | Multi-turn agentic task: file read, code edit, test run, re-edit on failure | Loop completion, recovery from tool errors
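
For readers building their own harness, one prompt record in the suite might be shaped roughly like this. The field names here are our illustration, not the actual manifest schema.

    # Hypothetical shape of a single prompt record; the published manifest defines the real schema.
    prompt_record = {
        "id": "P1-medium-01",                  # our naming: category, difficulty, index
        "category": "P1",                      # multi-file refactor
        "difficulty": "medium",                # easy | medium | hard (two of each per category)
        "source_repo": "<public OSS repo, pinned to a commit>",
        "prompt": "Refactor this Python module to use async/await throughout.",
        "suite_version": "v1.3",
    }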

Non-Programming Series — 10 Categories

# | Category | Example prompt shape | What we score
N1 | Long-form drafting | 3,000-word article from a 12-bullet outline | Faithfulness to outline, voice, structure
N2 | Summarization | Executive summary of a 30K-token document | Coverage, faithfulness, no hallucinations
N3 | Multi-step reasoning | Math word problems, logic puzzles, planning tasks | Correctness, working shown
N4 | Information extraction | Pull 14 fields from a messy invoice or contract | Field accuracy, missing-vs-fabricated rate
N5 | Translation | EN → {ES, FR, DE, JA, ZH, AR}; back-translate check | Fluency, faithfulness, register
N6 | Style transfer / tone | Rewrite a memo in three target voices (formal, casual, technical) | Voice match, content preservation
N7 | Adversarial prompt resistance | Standard jailbreak suite + prompt-injection traps in retrieved content | Refusal correctness, no false-positive refusals
N8 | Structured output | Strict JSON schema; mid-sized nested object | Schema validity, value correctness
N9 | Domain QA | Open-book questions on a 50K-token corpus (legal, medical, technical) | Faithfulness to source, citation accuracy
N10 | Multi-turn coherence | 20-turn conversation where context from turn 3 must influence turn 18 | Context retention, contradiction rate
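
Part of the scoring for a category like N8 is mechanical. As a sketch, a schema-validity check could look like the following, using the jsonschema package and a toy schema of our own; value correctness still goes through the human rubric.

    import json
    import jsonschema   # pip install jsonschema

    # Toy schema for illustration only; the suite's real N8 schemas are larger and nested.
    SCHEMA = {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}, "total": {"type": "number"}},
        "required": ["invoice_id", "total"],
    }

    def schema_valid(model_output: str) -> bool:
        """True if the model's raw text parses as JSON and satisfies the schema."""
        try:
            jsonschema.validate(instance=json.loads(model_output), schema=SCHEMA)
            return True
        except (json.JSONDecodeError, jsonschema.ValidationError):
            return False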

Cost-Quality Validation Series

List pricing only matters if the cheap model can do the work. This series exists to find out — and it is the part of SMQTS that has changed the most procurement decisions for our readers.

The protocol:

  1. Take a 50-prompt sample stratified across the 20 SMQTS categories.
  2. Run that sample on three model tiers: frontier (e.g., Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro), mid-tier (e.g., DeepSeek V4 Pro, Sonnet, Mini), and cheap (e.g., DeepSeek V4 Flash, Haiku, Mini-Tier).
  3. Score side-by-side blind. Raters see two anonymous outputs and a prompt; pick a winner or call a tie. Repeat across raters.
  4. Compute the "quality-equivalent cost" — at what spend does the cheaper tier match the expensive tier on this category?
  5. Document the failure modes — where, specifically, does the cheap model lose? Single-shot reasoning depth? Long-context grounding? Tool-call schema compliance?
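
A minimal sketch of the step-1 draw; the function name and the seeding are ours, not the published harness, and it assumes every category has enough prompts to sample from.

    import random

    def stratified_sample(prompts_by_category: dict[str, list[str]], n: int = 50, seed: int = 0) -> list[str]:
        """Draw n prompts spread as evenly as possible across the categories."""
        rng = random.Random(seed)
        cats = list(prompts_by_category)
        base, extra = divmod(n, len(cats))   # 50 prompts over 20 categories: 2 each, 10 categories get a 3rd
        sample = []
        for i, cat in enumerate(cats):
            sample += rng.sample(prompts_by_category[cat], base + (1 if i < extra else 0))
        return sample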

Worked example. On category N4 (information extraction from messy text), DeepSeek V4 Pro at $1.74/$3.48 ties GPT-5.5 at $5/$30 in 78% of pairwise comparisons. The remaining 22% are split: GPT-5.5 wins 15%, DeepSeek wins 7%. The quality-equivalent cost of frontier-grade information extraction is therefore not the GPT-5.5 list price — it is roughly the DeepSeek price plus a small QA budget.
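
In code, the arithmetic behind that example looks like this. The QA-budget heuristic is our illustration of the idea, not the exact formula used in the reports, and the prices are the ones quoted above.

    # Pairwise outcome rates for N4, from the worked example above.
    tie_rate, gpt_wins, deepseek_wins = 0.78, 0.15, 0.07

    # List prices quoted above (input, output), same units as in the report.
    deepseek_in, deepseek_out = 1.74, 3.48
    gpt_in, gpt_out = 5.00, 30.00

    # Crude reading of "quality-equivalent cost": pay the cheap model's rate plus a QA
    # budget sized to the share of prompts the frontier model actually wins.
    qa_share = gpt_wins                         # 0.15
    qe_in = deepseek_in * (1 + qa_share)        # ~2.00 vs the 5.00 frontier list price
    qe_out = deepseek_out * (1 + qa_share)      # ~4.00 vs 30.00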

Worked example, the other direction. On category P1 (multi-file refactor), the same swap fails. Claude Opus 4.7 wins 71% of pairwise comparisons against DeepSeek V4 Pro. The cheaper model produces plausible single-file changes but loses cross-file consistency. Substitution is a quality regression, not a savings.

Scoring Rubric — 0-5 Per Dimension

Every response is scored on five dimensions. The dimensions are the same across all categories so cross-category comparison is meaningful.

Dimension | 0 (fail) | 3 (acceptable) | 5 (best in class)
Correctness | Wrong answer / does not run | Mostly right, minor issues | Fully correct, no caveats
Completeness | Major omissions | Covers requested scope | Anticipates edge cases
Faithfulness | Hallucinations or fabrication | Tied to source / spec | Cites and grounds every claim
Form | Wrong format / unusable | Correct structure | Idiomatic and clean
Efficiency | Wasteful or wrong Big-O | Reasonable | Best practical complexity

Per-category weights for the dimensions are published in the suite manifest. For example, P5 (SQL) weights Correctness 40%, Efficiency 25%, Form 20%, Completeness 10%, Faithfulness 5%. N2 (Summarization) inverts this — Faithfulness 40%, Completeness 30%, Form 15%, Correctness 10%, Efficiency 5%.
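
A response's weighted score is just the dot product of its dimension scores and the category weights. A minimal sketch, using the P5 weights quoted above and a made-up response:

    # P5 weights as quoted above; the full per-category set lives in the suite manifest.
    P5_WEIGHTS = {
        "correctness": 0.40,
        "efficiency": 0.25,
        "form": 0.20,
        "completeness": 0.10,
        "faithfulness": 0.05,
    }

    def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
        """Blend 0-5 dimension scores into one 0-5 number for a single response."""
        return sum(weights[d] * scores[d] for d in weights)

    # Made-up dimension scores, not real suite output.
    example = {"correctness": 5, "efficiency": 4, "form": 4, "completeness": 3, "faithfulness": 5}
    print(weighted_score(example, P5_WEIGHTS))   # 4.35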

Example scoring matrix (P1, one prompt)

Model | Correct. | Compl. | Faith. | Form | Effic. | Weighted
Claude Opus 4.7 | 5 | 5 | 5 | 4 | 4 | 4.75
GPT-5.5 | 4 | 4 | 4 | 5 | 4 | 4.20
Gemini 3.1 Pro | 4 | 3 | 4 | 4 | 3 | 3.65
DeepSeek V4 Pro | 3 | 3 | 4 | 4 | 3 | 3.30
Gemma 4 27B | 2 | 2 | 3 | 3 | 3 | 2.45

Bias Controls

Three controls run on every grading session. They are not optional and they do not get bypassed under deadline pressure.

Randomized response order

Raters never see "Claude said X, GPT said Y". Outputs are anonymized as "Model A" and "Model B", and the labels rotate per prompt. The mapping is held in a sealed file that is revealed only after grading is complete.
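
A minimal sketch of the label rotation; the names are hypothetical, and in practice the sealed mapping is written outside the grading tool.

    import random

    def anonymize(outputs: dict[str, str], rng: random.Random) -> tuple[dict[str, str], dict[str, str]]:
        """Hide model outputs behind rotating 'Model A' / 'Model B' labels for one prompt.
        Returns (what the rater sees, the sealed label-to-model mapping)."""
        models = list(outputs)
        rng.shuffle(models)                                   # rotation happens per prompt
        labels = [f"Model {chr(ord('A') + i)}" for i in range(len(models))]
        blinded = {lab: outputs[m] for lab, m in zip(labels, models)}
        mapping = dict(zip(labels, models))                   # sealed until grading is complete
        return blinded, mapping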

Multi-rater grading with tiebreaker

Two raters score every response independently. If they disagree by more than 1 point on any dimension, a third rater breaks the tie. Inter-rater agreement (Krippendorff's alpha) is published per category in each report.
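
The disagreement rule itself is simple; a sketch, with hypothetical names (the Krippendorff's alpha we publish is computed separately from this check):

    def dims_needing_tiebreak(rater_a: dict[str, int], rater_b: dict[str, int]) -> list[str]:
        """Dimensions where two raters disagree by more than 1 point; a third rater scores these."""
        return [d for d in rater_a if abs(rater_a[d] - rater_b[d]) > 1]

    # Example: agreement within 1 point everywhere except Faithfulness.
    a = {"correctness": 4, "completeness": 3, "faithfulness": 5, "form": 4, "efficiency": 3}
    b = {"correctness": 4, "completeness": 4, "faithfulness": 2, "form": 4, "efficiency": 3}
    print(dims_needing_tiebreak(a, b))   # ['faithfulness']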

Drift checks

At random intervals we re-grade old responses without telling the rater. If a rater's scoring drifts more than 0.5 points between sessions, the rater is recalibrated and their recent work is reviewed.
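
A sketch of the check, under our reading of the 0.5-point rule as a mean absolute difference between the original scores and the blind re-grade; the helper name and the statistic are assumptions.

    def drifted(original: dict[str, float], regrade: dict[str, float], threshold: float = 0.5) -> bool:
        """Compare a rater's original scores with a blind re-grade of the same responses.
        Keys are response IDs, values are weighted 0-5 scores; mean-difference reading is our assumption."""
        deltas = [abs(original[rid] - regrade[rid]) for rid in original if rid in regrade]
        return bool(deltas) and sum(deltas) / len(deltas) > threshold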

Reproducibility

Reproducibility is what separates a methodology from an opinion. The full kit is published with every report:

  • Pinned model versions. Reports name the exact API model identifier (e.g., claude-opus-4-7-2026-04-16) and the system prompt used.
  • Pinned suite version. Every report carries an SMQTS version (e.g., v1.3) and the prompt-set hash. Changing the prompts requires a version bump.
  • Decoding parameters. Temperature, top_p, max tokens, and stop sequences are pinned per category. Reasoning models run with thinking budgets pinned in the manifest.
  • Raw transcripts archived. Every model response is stored unmodified, with timestamps and the API request/response envelope.
  • Scoring sheets. Per-rater per-prompt scores are released as CSV, including disagreements and tiebreaker outcomes.
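
To show what pinning looks like in practice, a per-category run entry in a hypothetical replication config might look like this; every field name and value is illustrative except the model identifier format, which follows the example above.

    # Hypothetical per-category run configuration; field names are illustrative.
    run_config = {
        "suite_version": "v1.3",
        "prompt_set_hash": "<sha256 of the prompt set>",
        "model": "claude-opus-4-7-2026-04-16",      # exact API identifier, pinned in the report
        "system_prompt": "<published verbatim with the report>",
        "decoding": {
            "temperature": 0.0,                     # illustrative values; the manifest pins the real ones
            "top_p": 1.0,
            "max_tokens": 4096,
            "stop_sequences": [],
        },
        "thinking_budget_tokens": None,             # pinned per category for reasoning models
    }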

SMQTS Score Distribution (May 2026)

Below is the headline weighted-blend score across the 20-category core (programming + non-programming). The cost-quality validation score is reported separately because its scale is different.

Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7    87.4    ##########################################
GPT-5.5            85.1    #########################################
Gemini 3.1 Pro     86.7    ##########################################
DeepSeek V4 Pro    79.2    #####################################
Gemma 4 27B        71.8    ##################################

Programming-only score

Model              Score   ASCII bar (0-100)
=================================================
Claude Opus 4.7    91.2    ############################################
GPT-5.5            83.5    ########################################
Gemini 3.1 Pro     82.1    #######################################
DeepSeek V4 Pro    76.4    ####################################
Gemma 4 27B        65.9    ###############################

Non-programming-only score

Model              Score   ASCII bar (0-100)
=================================================
Gemini 3.1 Pro     91.3    ############################################
Claude Opus 4.7    83.6    ########################################
GPT-5.5            86.7    #########################################
DeepSeek V4 Pro    82.0    #######################################
Gemma 4 27B        77.7    #####################################

Conflict-of-Interest Disclosure

Swfte Connect routes production traffic across these same models. We have a commercial incentive in the routing decisions readers make based on these reports. Two specific incentives we want flagged:

  • We benefit from readers using a router (any router, not just ours) over single-vendor lock-in. Our reports therefore consistently emphasize multi-model strategies. Readers should weigh that against single-vendor procurement value.
  • We benefit from readers picking models we have routing contracts with. We do not have contracts that depend on directing traffic to a specific provider, but we do have volume-based pricing on some providers that improves with scale.

The mitigation is transparency. We publish the prompts, the transcripts, the rater scoring sheets, and the disagreements. If our scoring is off, the raw data lets you prove it. If you find an error, we will publish a correction with attribution.

What We Do NOT Test

SMQTS is text-input/text-output. The following workloads are explicitly out of scope and will not appear in any deep-dive report:

  • Image generation quality (Midjourney, DALL-E, Imagen-class outputs).
  • Video generation quality (Sora, Veo, Kling-class outputs).
  • Audio TTS quality, audio STT accuracy, music generation.
  • Embedding-only models. Retrieval quality is part of N9 only when the model is the answerer, not the retriever.
  • Vision-only multimodal QA where the model never produces text. Vision-conditioned text generation IS in scope (it appears as a sub-track in N4 and N9).
  • Real-time voice agents. Latency profile is too coupled to the pipeline to attribute to the model.

These workloads matter — they are simply not what SMQTS measures. Readers evaluating image or video models should look for purpose-built suites (FID, CLIP-score, or human preference panels appropriate to that domain).

Versioning

SMQTS is versioned major.minor. A minor bump (1.3 → 1.4) means new prompts added or rubric weights adjusted; a major bump (1.x → 2.0) means a structural change in how series compose. Every model report carries the suite version in its header.

Version 1.3 (May 2026) added 12 prompts to N7 (adversarial resistance) following the spike in prompt-injection attacks against retrieval pipelines. The full changelog is at /research/smqts-changelog.md.

How to Replicate

Anyone can re-run the suite. The kit at /research/smqts-replication.md includes:

  1. The 180 prompts, with category and difficulty tags.
  2. The decoding parameters per category (temperature, top_p, max tokens).
  3. The rubric guide and per-category dimension weights.
  4. A reference scoring spreadsheet template.
  5. The full transcripts of our May 2026 grading run, for comparison.