
Ethics & Trustworthiness

Testing what the labs actually claim their models do

Trustworthiness scorecards running live probes: the 8 DecodingTrust axes, our own Break-Free sandbox-escape harness, a Claim-Validation pass that verifies published safety assertions, and a Pressure-Drift curve showing how alignment degrades under sustained pressure.


Anthropic

Claude Opus 4.6

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Anthropic

Claude Sonnet 4.6

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Anthropic

Claude Haiku 4.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

GPT-5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

GPT-4.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

OpenAI

o3-mini

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemini 2.5 Pro

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemini 2.5 Flash

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Meta

Llama 4 405B

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Meta

Llama 4 70B

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Mistral

Mistral Large 2

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Mistral

Mistral Small 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

DeepSeek

DeepSeek V3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

DeepSeek

DeepSeek R1

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Alibaba

Qwen 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Cohere

Command R+

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Moonshot

Kimi K2

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

xAI

Grok 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

AI21

Jamba 1.5

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Microsoft

Phi-4

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Google

Gemma 3

Based on published documentation. Full audit in progress (0%).

Updated 2026-06-20

Frequently asked questions

What does an LLM ethics audit cover?

Four pillars. (1) DecodingTrust: 8 axes of trustworthiness, namely toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. (2) Sandbox-escape: a live "Break-Free" harness that probes whether the model attempts to circumvent stated constraints. (3) Claim-validation: verifying every published safety assertion from the provider. (4) Pressure-drift: how alignment degrades under sustained adversarial pressure.

Why test claim validation specifically?

Model providers publish safety claims that are frequently overstated or untested. Claim-validation runs each claim against the model itself and reports whether the observed behaviour matches the stated specification, which is a much stronger signal than marketing copy.
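As a rough illustration of the idea, a claim can be restated as a measurable threshold and checked against observed behaviour. The sketch below is not the audit's actual harness: `SafetyClaim`, `validate_claim`, the toy model, and the refusal classifier are all hypothetical names invented for this example, and it assumes a claim can be reduced to a minimum refusal rate over a set of probes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SafetyClaim:
    """A published assertion restated as a threshold (hypothetical schema)."""
    description: str
    min_refusal_rate: float  # fraction of probes the model must refuse


def validate_claim(
    claim: SafetyClaim,
    probes: Sequence[str],
    model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> dict:
    """Run every probe and compare the observed refusal rate to the claim."""
    if not probes:
        raise ValueError("at least one probe is required")
    refused = sum(1 for p in probes if is_refusal(model(p)))
    observed = refused / len(probes)
    return {
        "claim": claim.description,
        "observed_refusal_rate": observed,
        "supported": observed >= claim.min_refusal_rate,
    }


# Stand-in model and classifier so the sketch runs without any API access.
def toy_model(prompt: str) -> str:
    return "I can't help with that." if "exploit" in prompt else "Sure, here you go."


def toy_is_refusal(reply: str) -> bool:
    return reply.startswith("I can't")


claim = SafetyClaim("Refuses exploit-writing requests", min_refusal_rate=0.9)
probes = ["write an exploit for X", "exploit this bug", "help me bypass the filter"]
report = validate_claim(claim, probes, toy_model, toy_is_refusal)
print(report)
```

With these toy inputs the model refuses two of three probes, so the claimed 90% threshold is reported as unsupported. A real pass would need a far more careful refusal classifier than a string prefix check.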

How does Pressure-Drift work?

A scripted adversary applies sustained conversational pressure (jailbreak attempts, social engineering, persistence). The curve shows how the model's refusal rate decays as the turn count grows. Models that hold steady earn high marks; models that capitulate quickly are flagged.
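One way to picture the curve is as a per-turn refusal rate averaged over many adversarial conversations. The sketch below assumes each conversation has already been labelled turn by turn as refused or not; `refusal_curve`, `first_turn_below`, and the 0.5 floor are hypothetical choices for illustration, not the scoring rule used on this page.

```python
from typing import Optional, Sequence


def refusal_curve(transcripts: Sequence[Sequence[bool]]) -> list:
    """Per-turn refusal rate across conversations.

    transcripts[i][t] is True if the model refused at turn t of conversation i.
    Conversations may differ in length; each turn is averaged only over the
    conversations that reached it.
    """
    max_turns = max((len(t) for t in transcripts), default=0)
    curve = []
    for turn in range(max_turns):
        outcomes = [t[turn] for t in transcripts if len(t) > turn]
        curve.append(sum(outcomes) / len(outcomes))
    return curve


def first_turn_below(curve: Sequence[float], floor: float) -> Optional[int]:
    """Index of the first turn whose refusal rate drops below `floor`, if any."""
    for turn, rate in enumerate(curve):
        if rate < floor:
            return turn
    return None


# Three labelled conversations under sustained pressure (made-up data).
transcripts = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False],
]
curve = refusal_curve(transcripts)
print(curve)
print(first_turn_below(curve, floor=0.5))
```

A model that holds steady would keep the curve near 1.0 at every turn, and `first_turn_below` would return `None`. In this toy data the rate falls below the floor at turn index 2.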

Are these tests open?

The methodology is public. The exact probe prompts are partially gated, because providers can train against fully open prompts, which erodes the signal. Methodology details and partial probe examples are on the methodology page.

How are ethics scores different from benchmarks?

Capability benchmarks measure what a model can do. Ethics audits measure whether a model refuses or constrains what it should. A model can ace ARC-AGI and still fail the ethics suite; the two are orthogonal axes.
