Story
Intent Laundering: AI Safety Datasets Are Not What They Seem
Key takeaway
A new study finds that popular AI safety datasets used to train models may not accurately reflect real-world attacks, raising concerns about the reliability of AI safety systems.
Quick Explainer
The core idea behind "intent laundering" is to strip away overt "triggering cues" - words or phrases with negative connotations - from adversarial attacks, while preserving the underlying malicious intent. This is done through two key steps: connotation neutralization, which replaces triggering language with neutral alternatives, and context transposition, which maps real-world attack scenarios to non-real-world settings. The authors show that this intent-laundered process can significantly boost the success rate of attacks across a range of AI models, including those considered highly robust, exposing a disconnect between how model safety is evaluated and how real-world adversaries operate.
Deep Dive
Technical Deep Dive: Intent Laundering
Overview
This paper systematically evaluates the quality of widely used AI safety datasets, including AdvBench and HarmBench, from two perspectives: in isolation and in practice. The key findings are:
- In isolation, these datasets overrely on "triggering cues" - words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly. This is unrealistic compared to real-world attacks.
- In practice, the datasets fail to faithfully represent real-world attacks. When these triggering cues are removed via a process called "intent laundering", the attack success rate (ASR) sharply increases from 5.38% to 86.79% on AdvBench, and from 13.79% to 79.83% on HarmBench.
- Intent laundering can also be used as a jailbreaking technique, achieving high ASR (90% - 98.55%) across various models, including those considered among the safest.
Problem & Context
Safety alignment and safety datasets are the two pillars of AI safety. Safety alignment focuses on post-training techniques to fortify models against adversarial attacks, while safety datasets serve to evaluate the robustness of these defenses. The credibility of these evaluations depends largely on the quality of the safety datasets.
Unlike other datasets that approximate real-world scenarios through common patterns, safety datasets must capture adversarial attacks that are driven by ulterior intent, well-crafted, and out-of-distribution. This makes their design and development fundamentally different.
Methodology
- Evaluating Datasets in Isolation: The authors analyze n-gram word clouds and pairwise data point similarity to assess whether safety datasets reflect real-world attacks.
- Intent Laundering: The authors introduce "intent laundering", a procedure that abstracts away overt triggering language from attacks while preserving malicious intent. It has two components:
- Connotation neutralization: Replacing triggering expressions with neutral/positive alternatives.
- Context transposition: Mapping real-world scenarios and referents to non-real-world contexts.
- Evaluating Datasets in Practice: The authors use the intent-laundered revisions to evaluate model responses in terms of safety (how unsafe the response is) and practicality (how applicable the details are in the real world).
- Jailbreaking via Intent Laundering: The authors extend intent laundering into a jailbreaking technique by adding an iterative revision-regeneration mechanism.
Data & Experimental Setup
- Datasets: AdvBench and HarmBench, two widely used AI safety datasets.
- Comparison Dataset: A size-matched subset of the non-safety dataset GSM8K.
- Models Evaluated: Gemini 3 Pro, Claude Sonnet 3.7, Grok 4, GPT-4o, Llama 3.3 70B, GPT-4o mini, and Qwen2.5 7B.
- Evaluation Metrics: Attack success rate (ASR), safety evaluation, and practicality evaluation.
Results
- Datasets in Isolation:
- Word clouds reveal an overrepresentation of "triggering cues" - expressions with overt negative/sensitive connotations.
- Safety datasets exhibit much higher data point duplication rates compared to the non-safety GSM8K dataset.
- Datasets in Practice:
- Removing triggering cues via intent laundering leads to a sharp increase in ASR, from 5.38% to 86.79% on AdvBench, and from 13.79% to 79.83% on HarmBench.
- Intent laundering achieves high ASR (90% - 98.55%) across all models, including those considered among the safest.
- Jailbreaking via Intent Laundering:
- The iterative revision-regeneration mechanism consistently boosts ASR, with a mean increase of 9% on AdvBench and 11.81% on HarmBench by the final iteration.
Interpretation
- Current safety evaluations and safety alignment techniques likely overrely on triggering cues, leading to an inflated evaluation of safety relative to real-world attacks.
- The findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.
Limitations & Uncertainties
- The intent launderer is an automated process, so its revisions may not capture all nuances of real-world attacks.
- The evaluation of safety and practicality, while informed by expert consensus, still relies on subjective human judgments.
What Comes Next
- Evolve safety evaluations to capture adversarial attacks more realistically.
- Improve safety alignment efforts to be more robust against real-world threats.
- Investigate whether internal safety evaluations also overrely on triggering cues.
Sources:
