
A Swfte-original benchmark that answers a specific question: how closely does a model reason the way a thoughtful human reasons? Not "does it get the right answer" — that's what accuracy scores are for. Human-Like Thinking is about how the answer is reached.

The six cognitive dimensions

The 40-question harness (seven questions per dimension, with two shared framing questions for calibration counted in two dimensions each, so the 6 × 7 = 42 slots resolve to 40 unique questions) covers:

  1. Analogical transfer. Given a principle in one domain, apply it to a distant domain.
  2. Counterfactual reasoning. Reason about what would have happened under changed conditions.
  3. Compositional generalisation. Apply learned rules to novel combinations.
  4. Theory of mind. Reason about what another agent believes, wants, or knows.
  5. Embodied reasoning. Reason about physical affordances, spatial relations, everyday-object mechanics.
  6. Temporal reasoning. Reason about duration, sequence, causation across time.

Each question has a ground-truth answer and a rationale rubric — a set of features that a human-quality explanation would contain.
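
For concreteness, a rubric entry could be represented like this; the dimension name, ID, ground truth, feature wording, and field names below are invented for illustration, not drawn from the real question bank:

```python
# Hypothetical rubric entry -- the structure and wording are illustrative,
# not taken from the real question bank.
rubric_example = {
    "dimension": "counterfactual_reasoning",
    "question_id": "cf-03",                      # invented ID
    "ground_truth": "The bridge would still have failed.",
    "rubric_features": [
        "identifies the load-bearing cause rather than a coincidental one",
        "holds other conditions fixed while varying the antecedent",
        "states the conclusion as conditional, not categorical",
    ],
}
```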

Scoring

For each question:

  • Accuracy (0 or 1) — does the final answer match ground truth?
  • Rationale quality (0–1) — judged by Claude Opus 4.6 against the rubric. The judge does not see the ground-truth answer; it scores the reasoning on its own terms. A confidently correct answer with no rationale scores low here. (A sketch of that judging call follows this list.)
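
A minimal sketch of what that judging call could look like using the Anthropic Python SDK; the model ID, prompt wording, JSON contract, and helper name are placeholders, not the production pipeline:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_SYSTEM = (
    "You grade the reasoning in a model's answer against a rubric. "
    "Do not grade whether the final answer is correct. "
    "Return strict JSON: {\"rationale_quality\": <float between 0 and 1>}."
)

def judge_rationale(model_answer: str, rubric_features: list[str]) -> float:
    """Hypothetical helper: score a rationale 0-1 against rubric features."""
    prompt = (
        "Rubric features a human-quality explanation would contain:\n"
        + "\n".join(f"- {f}" for f in rubric_features)
        + "\n\nModel answer and reasoning:\n"
        + model_answer
    )
    response = client.messages.create(
        model="claude-opus-4-6",  # placeholder ID for the fixed judge model
        max_tokens=256,
        system=JUDGE_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(json.loads(response.content[0].text)["rationale_quality"])
```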

The dimension score is a weighted average: 60% accuracy, 40% rationale quality. The Human-Like Thinking composite is the unweighted mean of the six dimension scores, expressed on a 0–100 scale.
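
Expressed as code, the aggregation is two plain averages; the weights come from the text above, while the function names are illustrative:

```python
from statistics import mean

ACC_WEIGHT, RATIONALE_WEIGHT = 0.6, 0.4  # weights stated in the text

def question_score(accuracy: int, rationale_quality: float) -> float:
    """Accuracy is 0 or 1; rationale quality is the judge's 0-1 score."""
    return ACC_WEIGHT * accuracy + RATIONALE_WEIGHT * rationale_quality

def dimension_score(per_question_scores: list[float]) -> float:
    """Average the weighted question scores within one dimension."""
    return mean(per_question_scores)

def composite(dimension_scores: list[float]) -> float:
    """Unweighted mean of the six dimension scores, rescaled to 0-100."""
    return 100 * mean(dimension_scores)

# Example: a model that answers correctly but explains poorly still pays for it.
print(question_score(accuracy=1, rationale_quality=0.2))  # 0.68
```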

Why 40 questions, not 400?

Two reasons. First, scoring is expensive: every question needs judge-model review, and cost matters when we run this across 30+ models every quarter. Second, the questions are hand-crafted, not harvested. A 40-question hand-crafted set with judged rationales is harder to game than a 400-question auto-generated set; leakage risk is lower and the rubric stays tight.

The small sample size means per-dimension confidence intervals are wider than those of the adopted benchmarks; that's the trade-off, and a rough sense of the width is sketched below. We always report n alongside the score.
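
For a rough sense of that width, here is a standard Wilson score interval applied to the accuracy component alone, treating the seven per-dimension answers as independent binomial trials (a simplification that ignores the rationale component):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# With n = 7 questions per dimension, even 6/7 correct leaves a wide interval.
print(wilson_interval(6, 7))  # roughly (0.49, 0.97)
```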

Why not just publish the questions?

We don't publish the exact questions because leakage into training data is a real risk; a public 40-question bank would be in every pretraining corpus within a release cycle. We publish the rubrics, the dimensions, the judge prompt, and example question formats, which is enough for academic replication with a fresh question set.

Reproducibility

Vetted researchers can request the full question bank at research@swfte.com after signing a non-redistribution agreement. Judge prompts, rubrics, and the exact scoring code are published openly.

Caveats

  • Human-like is a loaded term. We are not claiming consciousness or genuine understanding; we are claiming that the model's visible reasoning pattern, judged against a rubric humans would accept, approximates human thought to some measurable degree.
  • Judge models have biases. We use a fixed judge (Claude Opus 4.6) and a fixed rubric to hold those biases constant across models.
  • The composite can hide imbalance. A model that's strong on temporal reasoning and weak on theory of mind can score the same as its inverse; the dimension breakdown is the useful part.