A Swfte-original benchmark that answers a specific question: how closely does a model reason the way a thoughtful human reasons? Not "does it get the right answer" — that's what accuracy scores are for. Human-Like Thinking is about how the answer is reached.
The six cognitive dimensions
The 40-question harness (seven questions per dimension, of which two framing questions are shared between dimensions for calibration) covers:
- Analogical transfer. Given a principle in one domain, apply it to a distant domain.
- Counterfactual reasoning. Reason about what would have happened under changed conditions.
- Compositional generalisation. Apply learned rules to novel combinations.
- Theory of mind. Reason about what another agent believes, wants, or knows.
- Embodied reasoning. Reason about physical affordances, spatial relations, and the mechanics of everyday objects.
- Temporal reasoning. Reason about duration, sequence, and causation across time.
Each question has a ground-truth answer and a rationale rubric — a set of features that a human-quality explanation would contain.
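As a rough sketch of what a question record holds (the field names below are illustrative, not the published format), each item pairs a prompt with its ground truth and rubric:

```python
from dataclasses import dataclass, field

@dataclass
class HLTQuestion:
    """Illustrative record layout only; the real question bank's schema may differ."""
    dimension: str     # one of the six cognitive dimensions
    prompt: str        # the question shown to the model under test
    ground_truth: str  # the expected final answer
    rubric: list[str] = field(default_factory=list)  # features a human-quality explanation should contain
```

The rubric is the part the judge model sees; the ground-truth answer is withheld from it, as described under Scoring.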
Scoring
For each question:
- Accuracy (0 or 1) — does the final answer match ground truth?
- Rationale quality (0–1) — judged by Claude Opus 4.6 against the rubric. The judge does not see the ground-truth answer; it scores the reasoning on its own terms. A confidently correct answer with no rationale scores low here.
The dimension score is a weighted average: 60% accuracy, 40% rationale quality. The Human-Like Thinking composite is the unweighted mean of the six dimension scores, expressed on a 0–100 scale.
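Spelled out as code, the arithmetic is simple (a sketch following the definitions above, not the published scoring implementation):

```python
def dimension_score(results: list[dict]) -> float:
    """One dimension's score from per-question results.

    Each result has 'accuracy' (0 or 1) and 'rationale_quality' (judge score in [0, 1]).
    Weighting per the text: 60% accuracy, 40% rationale quality. Because the combination
    is linear, weighting each question and then averaging gives the same number as
    averaging first and then weighting.
    """
    acc = sum(r["accuracy"] for r in results) / len(results)
    rat = sum(r["rationale_quality"] for r in results) / len(results)
    return 0.6 * acc + 0.4 * rat


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Human-Like Thinking composite: unweighted mean of the six dimension scores, on 0-100."""
    return 100 * sum(dimension_scores.values()) / len(dimension_scores)
```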
Why 40 questions, not 400?
Two reasons. First, the scoring is expensive: every question needs judge-model review, and cost matters when we run this on 30+ models quarterly. Second, the questions are hand-crafted, not harvested. A hand-crafted 40-question set with judged rationales is harder to game than an auto-generated 400-question set. Leakage risk is lower, and the rubric stays tight.
The small sample means per-dimension confidence intervals are wider than those of the adopted benchmarks; that's the trade-off. We always report n alongside the score.
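For anyone re-deriving those intervals, a percentile bootstrap over per-question scores is one way to see how wide they get at this n (a sketch assuming per-question scores in [0, 1]; the published scoring code is the reference):

```python
import random

def bootstrap_ci(question_scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a dimension's mean score.

    With only a handful of questions per dimension, the interval is wide,
    which is why n is reported alongside every score.
    """
    rng = random.Random(seed)
    n = len(question_scores)
    means = sorted(
        sum(rng.choice(question_scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]
```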
Why not just publish the questions?
We don't publish the exact questions because leakage into training data is a real risk; a public 40-question bank would be in every pretraining corpus within a release cycle. We publish the rubrics, the dimensions, the judge prompt, and example question formats — enough that academic replication is possible with a fresh question set.
Reproducibility
Vetted researchers can request the full question bank at research@swfte.com after signing a non-redistribution agreement. Judge prompts, rubrics, and the exact scoring code are published openly.
Caveats
- Human-Like is a loaded term. We are not claiming consciousness or genuine understanding — we are claiming that the model's visible reasoning pattern, judged against a rubric humans would accept, approximates human thought to some measurable degree.
- Judge models have biases. We use a fixed judge (Claude Opus 4.6) and a fixed rubric to hold those biases constant across models.
- The composite averages away trade-offs. A model that's strong on temporal reasoning and weak on theory of mind can score the same as its inverse; the dimension breakdown is the useful part.