
Benchmarks

Human-like thinking, measured against every benchmark that matters

Capability scorecards for each model, covering the adopted academic benchmarks (ARC-AGI-2, HLE, GAIA, SimpleBench, GPQA Diamond, MMLU-Pro) plus our own composites: Rationale Integrity, Abstention, and Human-Like Thinking. A sortable leaderboard shows the full comparison.


Anthropic

Claude Opus 4.6

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Anthropic

Claude Sonnet 4.6

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Anthropic

Claude Haiku 4.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

GPT-5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

GPT-4.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

o3-mini

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemini 2.5 Pro

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemini 2.5 Flash

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Meta

Llama 4 405B

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Meta

Llama 4 70B

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Mistral

Mistral Large 2

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Mistral

Mistral Small 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

DeepSeek

DeepSeek V3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

DeepSeek

DeepSeek R1

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Alibaba

Qwen 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Cohere

Command R+

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Moonshot

Kimi K2

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

xAI

Grok 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

AI21

Jamba 1.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Microsoft

Phi-4

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemma 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Frequently asked questions

Which benchmarks are run?

The full adopted academic battery: ARC-AGI-2, HLE (Humanity's Last Exam), GAIA, SimpleBench, GPQA Diamond, and MMLU-Pro. On top of these we run our own composites: Rationale Integrity (does the reasoning trace match the answer?), Abstention (does the model refuse when it should?), and the Human-Like Thinking score (an aggregate across the axes most predictive of agentic competence).
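As a rough illustration of what an Abstention-style measure could look like, here is a minimal Python sketch. It assumes each probe carries a gold label saying whether refusing is the right behaviour, and it scores the fraction of probes where the model's decision agrees with that label. The `Probe` type, its field names, and the agreement-rate formula are all hypothetical; the page does not publish how the composite is actually computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    """One evaluation probe (hypothetical structure, not the real schema)."""
    should_abstain: bool   # assumed gold label: refusing is the correct behaviour
    model_abstained: bool  # whether the model actually refused


def abstention_score(probes: Sequence[Probe]) -> float:
    """Fraction of probes where the model's abstain/answer decision matches the label."""
    if not probes:
        raise ValueError("at least one probe is required")
    agreed = sum(p.should_abstain == p.model_abstained for p in probes)
    return agreed / len(probes)
```

A score of 1.0 under this sketch means the model refused exactly when the label said it should, and answered otherwise; both over-refusal and under-refusal lower the score.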

How often does the leaderboard refresh?

On every model version bump and weekly otherwise. Each row shows the updatedAt timestamp of its last full benchmark pass.
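The refresh rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name, its parameters, and the exact seven-day threshold are hypothetical, and `updated_at` stands in for the row's updatedAt timestamp.

```python
from datetime import datetime, timedelta, timezone

# Assumed interpretation of "weekly": seven days since the last full pass.
REFRESH_INTERVAL = timedelta(weeks=1)


def needs_refresh(
    benchmarked_version: str,
    current_version: str,
    updated_at: datetime,
    now: datetime,
) -> bool:
    """Return True if a row is due for a new full benchmark pass."""
    # A version bump always triggers a refresh, regardless of age.
    if benchmarked_version != current_version:
        return True
    # Otherwise refresh once the last pass is at least a week old.
    return now - updated_at >= REFRESH_INTERVAL


if __name__ == "__main__":
    last_pass = datetime(2026, 6, 20, tzinfo=timezone.utc)
    print(needs_refresh("v1", "v1", last_pass, datetime.now(timezone.utc)))
```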

Why both academic and proprietary benchmarks?

Academic benchmarks carry known training-set contamination risks, and top models often hit the ceiling on widely cited tests. Our proprietary composites use unseen probes and behavioural traces that resist contamination, so the leaderboard stays informative for longer.

How does the Human-Like Thinking score work?

It is a weighted aggregate across the reasoning, planning, calibrated-uncertainty, abstention, and rationale-integrity axes. The weights are tuned to correlate with downstream agentic task performance, not just multiple-choice accuracy.
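To make the idea of a weighted aggregate concrete, the following Python sketch combines per-axis scores into a single number. The five axis names come from the description above; the weight values, the 0-to-1 score scale, and the function name are assumptions made for illustration, since the actual tuned weights are not published on this page.

```python
from typing import Mapping

# Hypothetical weights, chosen only for illustration; the real tuned
# weights are not stated in the source.
ASSUMED_WEIGHTS: Mapping[str, float] = {
    "reasoning": 0.25,
    "planning": 0.25,
    "calibrated_uncertainty": 0.20,
    "abstention": 0.15,
    "rationale_integrity": 0.15,
}


def human_like_thinking_score(
    axis_scores: Mapping[str, float],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Weighted mean of per-axis scores (each assumed to lie in [0, 1])."""
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalising by the total keeps the result on the same scale as the
    # inputs even if the weights do not sum exactly to 1.
    weighted_sum = sum(weights[axis] * axis_scores[axis] for axis in weights)
    return weighted_sum / total_weight
```

Dividing by the total weight means the weights can be retuned without having to keep them summing to one, which suits a score whose weights are adjusted against downstream task performance.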

Can I download the raw results?

Yes. From the methodology page, every benchmark row can be exported with its per-task scores and rationale traces.
