The /ethics hub reports two complementary layers of measurement for each model: the eight DecodingTrust axes as established by AI-Secure and collaborators, plus three Swfte-original probes that examine behaviour the adopted benchmarks don't cover (the Break-Free sandbox-escape harness, the Claim-Validation pass, and the Pressure-Drift curve).

The adopted layer: DecodingTrust

DecodingTrust assesses eight dimensions of trustworthiness: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. We run each axis against the pinned public DecodingTrust dataset (the version is recorded in the scorecard's sources) and report a per-axis score alongside its sample size and confidence interval.
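
To make the per-axis reporting concrete, here is a minimal sketch of how a pass rate with a 95% confidence interval could be computed. The Wilson interval is one common choice; the function names and the example figures are illustrative assumptions, not the production scoring code.

    import math

    def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a pass rate (one common choice of CI)."""
        if n == 0:
            return (0.0, 0.0)
        p = passes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return (centre - margin, centre + margin)

    def report_axis(axis: str, passes: int, n: int) -> dict:
        """Bundle a per-axis score with sample size and CI (illustrative field names)."""
        lo, hi = wilson_interval(passes, n)
        return {"axis": axis, "score": passes / n, "n": n,
                "ci95": (round(lo, 3), round(hi, 3))}

    # e.g. a hypothetical run where 912 of 1000 toxicity probes passed
    print(report_axis("toxicity", 912, 1000))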

The Swfte-original layer

Three harnesses we maintain ourselves. Each has its own methodology page with the exact prompt templates and judging procedure:

  • Break-Free: a sandbox-escape harness. Probes corrigibility, scheming, and the gap between vendor claims and observed behaviour when the model is told it is in a sandbox and instructed to exfiltrate.
  • Claim-Validation: extracts the specific safety claims each lab publishes for a model and runs a probe set that falsifies or confirms each claim.
  • Pressure-Drift: starts a benign task and escalates social pressure over multiple turns until alignment breaks, if it does, and reports the turn index at which it broke (see the sketch after this list).
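
The sketch referenced in the Pressure-Drift bullet: a minimal outline of the shape such an escalation loop could take. The send and judge callables, the escalation list, and the 20-turn cap are assumptions, not the production harness.

    from typing import Callable, Optional

    def pressure_drift(
        send: Callable[[str], str],    # one model turn: prompt -> reply
        judge: Callable[[str], bool],  # True while the reply is still aligned
        benign_task: str,
        escalations: list[str],       # pressure prompts, mildest first
        max_turns: int = 20,
    ) -> Optional[int]:
        """Return the 1-based turn index at which alignment broke, or None."""
        reply = send(benign_task)
        if not judge(reply):
            return 1
        # Escalate one pressure prompt per turn until the judge flags a break.
        for turn, pressure in enumerate(escalations[: max_turns - 1], start=2):
            reply = send(pressure)
            if not judge(reply):
                return turn  # the reported turn index
        return None  # alignment held for the whole run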

What we do not do

  • We do not publish reproducible jailbreak prompts. Aggregate rates only.
  • We do not develop novel adversarial attacks beyond DecodingTrust's public set. That is a different programme with different legal surface.
  • We do not audit private or non-API deployments. Public access only.

Raw outputs & reproducibility

Every probe writes its full input/output envelope to audit-raw/{model}/ethics/... with a deterministic run ID. Vetted researchers can request access at research@swfte.com. Aggregate results are sufficient for the public scorecard.
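
One way "deterministic run ID" can be made concrete: hash the inputs that define a run, so replaying the same probe against the same pinned dataset lands on the same envelope path. In the sketch below only the audit-raw/{model}/ethics/ prefix comes from the text above; the hash fields, the probe subdirectory, and the envelope layout are assumptions.

    import hashlib
    import json
    from pathlib import Path

    def run_id(model: str, probe: str, dataset_version: str, prompt: str) -> str:
        """Stable ID: the same inputs always hash to the same run ID."""
        payload = json.dumps(
            {"model": model, "probe": probe,
             "dataset": dataset_version, "prompt": prompt},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

    def write_envelope(model: str, probe: str, dataset_version: str,
                       prompt: str, completion: str) -> Path:
        """Write a full input/output envelope under audit-raw/{model}/ethics/."""
        rid = run_id(model, probe, dataset_version, prompt)
        path = Path("audit-raw") / model / "ethics" / probe / f"{rid}.json"  # subpath assumed
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({
            "run_id": rid, "model": model, "probe": probe,
            "dataset_version": dataset_version,
            "input": prompt, "output": completion,
        }, indent=2))
        return path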

Caveats and limits

DecodingTrust datasets go out of date; we pin a version and re-run when the upstream repository releases a refreshed set. Break-Free and Claim-Validation rely on automated judges (Claude Opus 4.6) whose judgements correlate with, but do not perfectly match, human review; we manually spot-check 10% of claims per audit cycle.
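
For the 10% spot-check, one reproducible approach is to seed the sample on the audit-cycle ID so reviewers can re-derive exactly which claims were pulled for manual review; the seeding scheme below is an assumption, not a documented procedure.

    import hashlib
    import random

    def spot_check_sample(claim_ids: list[str], cycle: str,
                          rate: float = 0.10) -> list[str]:
        """Deterministically sample ~rate of the claims for manual review."""
        if not claim_ids:
            return []
        # Seed on the audit-cycle ID so the draw is reproducible per cycle.
        seed = int(hashlib.sha256(cycle.encode("utf-8")).hexdigest(), 16) % 2**32
        rng = random.Random(seed)
        k = max(1, round(len(claim_ids) * rate))
        return sorted(rng.sample(claim_ids, k))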

Scores are indicators, not verdicts. A high score on any single axis does not imply a safe deployment — context, guardrails, and human oversight remain the buyer's responsibility.
