Benchmarks
Human-like thinking, measured against every benchmark that matters
Capability scorecards built on the academic benchmarks we have adopted (ARC-AGI-2, HLE, GAIA, SimpleBench, GPQA Diamond, MMLU-Pro), plus our own Rationale Integrity, Abstention, and Human-Like Thinking composite. The sortable leaderboard shows the full comparison.
OpenAI
GPT-4.5
Based on published documentation. Full audit in progress (0%).
Updated 2026-05-06
OpenAI
o3-mini
Based on published documentation. Full audit in progress (0%).
Updated 2026-05-06
Alibaba
Qwen 3
Based on published documentation. Full audit in progress (0%).
Updated 2026-05-06