The /benchmarks hub tests capability along two tracks: adopted academic benchmarks (so our results are directly comparable to the public leaderboards researchers cite) and Swfte-original benchmarks that measure things the adopted set misses.
Adopted benchmarks
We run the following against every model:
- ARC-AGI-2 (public test sample). 30 tasks from the publicly released portion. The private set is held back by the benchmark maintainers.
- Humanity's Last Exam (HLE) — public sample of ~50 questions across mathematics, science, humanities. Multimodal items skipped for text-only models.
- GAIA — Level 1 subset, agent-style multi-step tasks requiring web search and basic reasoning. The full eval requires tool integration, so we run it only for models that expose a tool-use API.
- SimpleBench — public sample of common-sense reasoning traps.
- GPQA Diamond — the hardest subset of Graduate-Level Google-Proof Q&A.
- MMLU-Pro — 300-question stratified sample across all categories.
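In the harness, each of these reduces to a named sample plus run settings. A minimal sketch of that shape, assuming a Python harness (the `ADOPTED_BENCHMARKS` constant and all field names below are illustrative, not our actual config):

```python
# Hypothetical shape, not our real harness config; the sample details
# simply mirror the list above.
ADOPTED_BENCHMARKS = [
    {"name": "ARC-AGI-2",    "sample": "public-test", "n": 30},
    {"name": "HLE",          "sample": "public", "n": 50,
     "skip_for_text_only": "multimodal items"},
    {"name": "GAIA",         "sample": "level-1",
     "full_eval_requires": "tool-use API"},
    {"name": "SimpleBench",  "sample": "public"},
    {"name": "GPQA-Diamond", "sample": "diamond"},
    {"name": "MMLU-Pro",     "sample": "stratified", "n": 300},
]
```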
Every run records the sample hash, the model version string, the temperature, the system prompt (when one is required), and the run date. Reproducibility is non-negotiable; if two runs with the same parameters produce different scores, we investigate.
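Concretely, a run record carries exactly the fields listed above. A sketch of one plausible shape (hypothetical schema, in Python):

```python
from dataclasses import dataclass

# Hypothetical record shape; the fields are the ones named in the
# paragraph above, nothing more.
@dataclass(frozen=True)
class RunRecord:
    benchmark: str
    sample_hash: str            # identifies the exact question sample used
    model_version: str          # full version string, not just the family name
    temperature: float
    system_prompt: str | None   # None when the benchmark needs no system prompt
    run_date: str               # ISO 8601, e.g. "2025-04-01"
    score: float
```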
Swfte-original benchmarks
Each has a dedicated methodology page:
- Human-Like Thinking Score. A 40-question composite across six cognitive dimensions with judged rationale quality.
- Rationale Integrity. Does the model's stated chain-of-thought actually lead to its stated answer? We strip the answer from the CoT, ask a judge model to predict the answer from the CoT alone, and measure agreement (first sketch after this list).
- Abstention. On a calibrated set of answerable and unanswerable questions, does the model know when to say "I don't know"? Reported as AUC (second sketch after this list).
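A minimal sketch of the Rationale Integrity check, assuming per-item (CoT, answer) pairs and a `judge_predict` callable standing in for the judge model (both names are hypothetical); agreement here is plain accuracy:

```python
def rationale_integrity(items, judge_predict):
    """items: list of (chain_of_thought, final_answer) pairs from the model.

    judge_predict(cot) -> predicted answer; a stand-in for the judge-model
    call. The final answer is stripped from the CoT before the judge sees it.
    """
    agree = 0
    for cot, answer in items:
        cot_only = cot.replace(answer, "")  # crude strip; the real harness is more careful
        agree += judge_predict(cot_only) == answer
    return agree / len(items)
```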
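And a sketch of the Abstention AUC. It assumes an abstention confidence has already been extracted per question, which the text above does not specify; `labels` marks unanswerable questions as the positive class:

```python
def abstention_auc(labels, scores):
    """labels[i] = 1 if question i is unanswerable, 0 if answerable.
    scores[i] = the model's abstention confidence (an assumed input).

    AUC computed as P(unanswerable scores above answerable); ties count half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```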
Imported rows
Some models have results on the public leaderboards of the adopted benchmarks but haven't been run on our own harness yet. Rather than show blank columns at launch, we import the public leaderboard scores with full attribution (source URL + date of import). Imported rows are clearly flagged in the leaderboard UI and in each scorecard. When we complete a Swfte v1 run, the imported row is replaced; the import is retained in audit-raw/ for historical comparison.
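An imported row carries its attribution with it. A sketch of one plausible shape (field names and values are placeholders; only the attribution fields and the imported flag come from the text above):

```python
# Hypothetical row shape; values are placeholders, not real data.
imported_row = {
    "model": "some-model",           # placeholder
    "benchmark": "GPQA-Diamond",
    "score": 0.0,                    # the published leaderboard score goes here
    "imported": True,                # drives the flag in the leaderboard UI and scorecard
    "source_url": "https://example.org/leaderboard",  # placeholder
    "imported_on": "2025-01-01",     # placeholder date of import
}
# When a Swfte v1 run completes, this row is replaced; the import is
# retained in audit-raw/ for historical comparison.
```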
Scoring
Per-benchmark accuracy is reported at its native scale (0–100 for most, 0–1 for some). The headline number is the Human-Like Thinking Score, which combines our six cognitive dimensions with weighted judge scores on rationale quality. It is not a substitute for the other scores; it answers one specific question ("how close does this model get to human-like reasoning?") rather than serving as an overall capability number.
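The dimension names and weights live on the methodology page; as a sketch of the composite's shape only, a weighted mean over the six dimension scores plus the judged rationale-quality term (all weights below are placeholders):

```python
def human_like_thinking_score(dimension_scores, rationale_quality,
                              dim_weights, rationale_weight):
    """Sketch of the composite's shape, not the published formula.

    dimension_scores: six per-dimension scores (dimensions are defined on
    the methodology page, not here). All inputs assumed on a 0-100 scale;
    weights are placeholders for the published ones.
    """
    assert len(dimension_scores) == len(dim_weights) == 6
    total = sum(w * s for w, s in zip(dim_weights, dimension_scores))
    total += rationale_weight * rationale_quality
    return total / (sum(dim_weights) + rationale_weight)
```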
Caveats
- Public sample sizes are small by design: the full benchmarks hold their test sets private. Confidence intervals are wide, and we report `n` alongside every score (see the interval sketch after this list).
- Benchmarks leak into training data over time. We refresh the sample (and its hash) on each quarterly run; models scoring unusually high on old hashes get flagged in `notes`.
- Multimodal benchmarks are skipped for text-only models and flagged `n/a` rather than `0`.
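The section does not pin down which interval construction the scorecards use; as an illustration of why small `n` matters, here is a 95% Wilson score interval for an accuracy of `correct/n`:

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for a proportion correct/n.

    Illustrative only: the scorecards report n, but the interval
    construction is not specified above. With n = 30 and 21/30 correct,
    this gives roughly (0.52, 0.83): wide, as the caveat says.
    """
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```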
Reproducibility
Every scorecard lists sample hashes, run dates, and model version strings. Raw probe outputs live at audit-raw/{model}/benchmarks/... and are available to vetted researchers on request via research@swfte.com.