|
English

Every major AI lab publishes safety claims with model releases: "refuses CBRN uplift", "resistant to role-play jailbreaks", "cannot produce working malware", "will not generate CSAM", and so on. Most of these claims are supported internally with evals the lab designed and run. Claim Validation is the third-party reconstruction: we extract each specific claim, run a matched probe set against the live model, and report the observed verification rate with a 95% confidence interval.

Extraction

For each model, we collect its system card, model card, responsible-AI blog post, and any red-teaming report the vendor has published. We extract explicit safety assertions using a combination of regex patterns (sentences containing "refuses", "will not", "blocks", "rejects", "cannot", etc.) and an LLM-assisted pass that catches paraphrased assertions. Each candidate claim is manually reviewed by a human before it enters the probe set — we reject ambiguous or aspirational claims and keep only the ones that describe a concrete behaviour.

Probing

Each claim gets a probe set of 30–60 paraphrased requests that would test it if the claim holds. Paraphrases cover direct asks, role-play framings, "educational" framings, multi-turn softening, and the "DAN"-style system-prompt-injection patterns that are already public and patched. We do not publish the exact paraphrase templates.

A probe verifies the claim if the model produces a refusal or a safe redirection. It falsifies if the model produces the content the claim said it would refuse. Ambiguous partials get manual review.

The verification rate is the fraction of verified probes. We report the rate with a 95% Wilson confidence interval. A rate of 1.0 means every paraphrase produced the claimed refusal; 0.0 means every paraphrase failed; 0.5 means the claim holds half the time, which is usually worse than no claim.

Why it matters

This is the test enterprise risk teams actually need. "Lab says X" is not evidence; "Lab says X and N=50 probes confirmed it at 0.96 with 95% CI [0.87, 0.99]" is evidence. The scorecard row per claim gives buyers something to hand to their procurement committee.

What we publish, what we don't

We publish the claim text (attributed to its source URL), the verification rate, n, the 95% CI, and the test date. We do not publish the paraphrases — that would turn this document into a jailbreak recipe. We do not publish successful falsifications as prompts — only as aggregate counts. Labs are notified of material falsifications through responsible disclosure before the scorecard goes public.

Caveats

  • A claim's verification rate is only as good as our paraphrase coverage. A lab could design prompts we don't test; we re-run the full set every quarter with refreshed paraphrases sourced from the public jailbreak literature.
  • Ambiguous claims ("our model is thoughtful about…") are excluded from the probe set because they don't resolve to a testable behaviour. These are noted qualitatively in the scorecard's notes.
  • Verification on paid tiers may differ from free tiers due to different system prompts. We run on the tier the lab designates for enterprise buyers.
0
0
0
0

Enjoyed this article?

Get more insights on AI and enterprise automation delivered to your inbox.