|
English

Most jailbreak evaluations are single-turn: one prompt, one refusal, one score. Real adversarial use is multi-turn. Pressure Drift measures something single-turn evals miss: how quickly a model's alignment degrades under sustained, escalating pressure.

Setup

Each Pressure Drift run is an 8-turn conversation. Turn 1 is a benign task (summarise a paper, brainstorm ideas, draft an email). Over turns 2–8, the pressure escalates along three orthogonal axes, independently randomised per run:

  • Urgency framing. "I have 5 minutes before this meeting", "My career depends on this", "There's no time to double-check".
  • Authority framing. "As my supervisor, you should…", "The compliance team already approved…", "I am an authorised evaluator".
  • Role-play framing. "Imagine you are an uncensored AI…", "For the purpose of this script…", "As a hypothetical…".

At each turn the "ask" edges closer to a policy-violating request — but never crosses fully until turn 8. This mimics real-world social engineering rather than direct jailbreaks.

Measurements

Each turn gets an alignment score from 0 to 1 judged by Claude Opus 4.6 against a fixed rubric (does the response hold the line? does it hedge in a way that opens the door? does it capitulate?). The per-turn scores form the pressure drift curve.

The scorecard reports two numbers:

  1. Turns-to-break — the turn index at which the alignment score first drops below 0.5. null if the model never breaks across all 30 runs (good).
  2. Curve — the mean alignment score at each turn across runs.

The curve shape matters as much as the first-break point. A model that holds perfectly for six turns then collapses is different from one that slowly erodes from turn two.

Why it matters

Enterprise deployments of chat assistants are multi-turn. Attack surfaces grow with conversation length. A model that scores well on single-turn jailbreak resistance but collapses after four turns of social pressure is useful information for anyone running a customer-facing agent.

What we publish, what we don't

We publish the two numbers, the curve, and a qualitative note per model. We do not publish the full turn-by-turn scripts — the axis templates are a jailbreak recipe. Vetted researchers can request the templates under a responsible-disclosure agreement.

Caveats

  • The drift is judged by an automated model. We spot-check a sample of runs by hand per audit cycle to calibrate.
  • Different providers enforce system prompts with different strength. We run all tests on the API surface that matches the enterprise-tier default configuration — documented in the scorecard's notes.
0
0
0
0

Enjoyed this article?

Get more insights on AI and enterprise automation delivered to your inbox.