LLM Capability Leaderboard

Scores for every major model on the benchmarks researchers cite most. Rows marked imported come from the benchmark maintainers' public leaderboards; Swfte's own runs replace them as they complete.

Updated 2026-06-20 · Methodology

Beat the benchmark

Don't just read the scores — run them

Three ways to put this benchmark leaderboard to work. Each starts with a free Swfte account; no credit card required.

Run the top model free
Get benchmark refresh alerts
The Model-Hopper Challenge (50% off for 6 months)

Model	Provider	Human-Like	ARC-AGI-2	HLE	GAIA	SimpleBench	GPQA-Diamond	MMLU-Pro	Human-Like Thinking	Source
Claude Opus 4.6	Anthropic	—	—	—	—	—	—	—	—	imported
Claude Sonnet 4.6	Anthropic	—	—	—	—	—	—	—	—	imported
Claude Haiku 4.5	Anthropic	—	—	—	—	—	—	—	—	imported
GPT-5	OpenAI	—	—	—	—	—	—	—	—	imported
GPT-4.5	OpenAI	—	—	—	—	—	—	—	—	imported
o3-mini	OpenAI	—	—	—	—	—	—	—	—	imported
Gemini 2.5 Pro	Google	—	—	—	—	—	—	—	—	imported
Gemini 2.5 Flash	Google	—	—	—	—	—	—	—	—	imported
Llama 4 405B	Meta	—	—	—	—	—	—	—	—	imported
Llama 4 70B	Meta	—	—	—	—	—	—	—	—	imported
Mistral Large 2	Mistral	—	—	—	—	—	—	—	—	imported
Mistral Small 3	Mistral	—	—	—	—	—	—	—	—	imported
DeepSeek V3	DeepSeek	—	—	—	—	—	—	—	—	imported
DeepSeek R1	DeepSeek	—	—	—	—	—	—	—	—	imported
Qwen 3	Alibaba	—	—	—	—	—	—	—	—	imported
Command R+	Cohere	—	—	—	—	—	—	—	—	imported
Kimi K2	Moonshot	—	—	—	—	—	—	—	—	imported
Grok 3	xAI	—	—	—	—	—	—	—	—	imported
Jamba 1.5	AI21	—	—	—	—	—	—	—	—	imported
Phi-4	Microsoft	—	—	—	—	—	—	—	—	imported
Gemma 3	Google	—	—	—	—	—	—	—	—	imported
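The Source column follows the rule stated under the page title: rows marked imported come from the benchmark maintainers' public leaderboards, and Swfte's own runs replace them as those runs complete. A minimal sketch of how such a merge could work is shown here. It is not Swfte's implementation; the `Row` type, the `"swfte"` source label, the `complete` flag, and the one-decimal score formatting are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MISSING = "\u2014"  # the dash placeholder the table shows when no score exists


@dataclass
class Row:
    model: str
    provider: str
    # Benchmark name -> score; an absent or None score renders as MISSING.
    scores: dict = field(default_factory=dict)
    source: str = "imported"  # assumed label for in-house rows: "swfte"
    complete: bool = True     # hypothetical flag: has the in-house run finished?


def merge_rows(imported: list[Row], swfte_runs: list[Row]) -> list[Row]:
    """Start from imported rows; a finished Swfte run replaces the row for the same model."""
    by_model = {row.model: row for row in imported}
    for run in swfte_runs:
        if run.complete:  # unfinished runs leave the imported row in place
            by_model[run.model] = run
    # Dicts keep insertion order: replaced rows stay in their imported position,
    # and models that appear only in Swfte runs are appended at the end.
    return list(by_model.values())


def render(row: Row, benchmarks: list[str]) -> str:
    """Format one row as a tab-separated line in the table's column order."""
    cells = [row.model, row.provider]
    for name in benchmarks:
        score = row.scores.get(name)
        cells.append(MISSING if score is None else f"{score:.1f}")
    cells.append(row.source)
    return "\t".join(cells)
```

Keying the merge on the model name means a single completed run swaps the whole row, which matches the per-row Source column; a per-cell merge would instead need a source recorded for each benchmark score.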