LLM Capability Leaderboard

This leaderboard covers every major model across the benchmarks researchers cite. Sort by any column. Rows marked "imported" are taken from the benchmark maintainers' public leaderboards; Swfte's own runs replace them as they complete.

Updated 2026-05-06 · Methodology

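| Model | Provider | Human-Like | ARC-AGI-2 | HLE | GAIA | SimpleBench | GPQA-Diamond | MMLU-Pro | Human-Like Thinking | Source |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | | | | | | | | | imported |
| Claude Sonnet 4.6 | Anthropic | | | | | | | | | imported |
| Claude Haiku 4.5 | Anthropic | | | | | | | | | imported |
| GPT-5 | OpenAI | | | | | | | | | imported |
| GPT-4.5 | OpenAI | | | | | | | | | imported |
| o3-mini | OpenAI | | | | | | | | | imported |
| Gemini 2.5 Pro | Google | | | | | | | | | imported |
| Gemini 2.5 Flash | Google | | | | | | | | | imported |
| Llama 4 405B | Meta | | | | | | | | | imported |
| Llama 4 70B | Meta | | | | | | | | | imported |
| Mistral Large 2 | Mistral | | | | | | | | | imported |
| Mistral Small 3 | Mistral | | | | | | | | | imported |
| DeepSeek V3 | DeepSeek | | | | | | | | | imported |
| DeepSeek R1 | DeepSeek | | | | | | | | | imported |
| Qwen 3 | Alibaba | | | | | | | | | imported |
| Command R+ | Cohere | | | | | | | | | imported |
| Kimi K2 | Moonshot | | | | | | | | | imported |
| Grok 3 | xAI | | | | | | | | | imported |
| Jamba 1.5 | AI21 | | | | | | | | | imported |
| Phi-4 | Microsoft | | | | | | | | | imported |
| Gemma 3 | Google | | | | | | | | | imported |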