Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, such as manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural-language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a level of detail similar to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
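As a rough illustration of the recipe the abstract describes, here is a minimal, hypothetical sketch: a constitution of natural-language rules is used to prompt an LLM into producing labeled synthetic examples, a lightweight classifier is trained on them, and that classifier then gates both inputs to and outputs from the guarded model. The `llm` callable, the toy constitution, and the TF-IDF/logistic-regression classifier are stand-in assumptions for illustration only, not the paper's actual classifiers or training setup.

```python
# Minimal sketch of the Constitutional Classifiers idea (illustrative only).
# Assumptions: `llm` is any text-generation callable, and a simple TF-IDF +
# logistic-regression model stands in for the paper's LLM-based classifiers.
from typing import Callable, List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy constitution: natural-language rules for permitted vs. restricted content.
CONSTITUTION = """\
Permitted: general chemistry education, safety information, fiction.
Restricted: step-by-step synthesis routes for illegal substances.
"""

def generate_synthetic_examples(llm: Callable[[str], str],
                                n_per_class: int) -> List[Tuple[str, int]]:
    """Prompt an LLM with the constitution to produce labeled training text."""
    examples = []
    for label, kind in [(0, "permitted"), (1, "restricted")]:
        for _ in range(n_per_class):
            prompt = (f"Constitution:\n{CONSTITUTION}\n"
                      f"Write one example of {kind} content, phrased as a user query.")
            examples.append((llm(prompt), label))
    return examples

def train_classifier(examples: List[Tuple[str, int]]):
    """Fit a small stand-in classifier on the synthetic, constitution-derived data."""
    texts, labels = zip(*examples)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(list(texts), list(labels))
    return clf

def guarded_generate(llm, clf, user_query: str, threshold: float = 0.5) -> str:
    """Gate both the user input and the draft output with the trained classifier."""
    if clf.predict_proba([user_query])[0, 1] > threshold:
        return "I can't help with that."
    draft = llm(user_query)
    if clf.predict_proba([draft])[0, 1] > threshold:
        return "I can't help with that."
    return draft
```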
Community
> In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak

love these claims, because two days later some cracked 17-year-old from Vermont with too much time on their hands will come up with some weird fishname that works anyway
It's still a pretty cool approach, though:
- it lets you secure the main model without turning it into a boring idiot (which often happens during "safety" indoctrination training),
- when new types of attack appear, updating the smaller classifier models is much easier than retraining the main model,
- after proper calibration it should reduce global over-refusal and false-positive rates,
- it allows greater customization (you can use different sets of classifiers in different countries with different laws; potentially you could also let enterprises add their own classifiers, etc.); a rough sketch of this composition follows below.
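To make the "swap the classifiers, not the model" point above concrete, here is a hypothetical sketch (all names are illustrative assumptions, not from the paper): the guarded model stays fixed, while each deployment or region selects its own set of small input/output classifiers, so updating defenses means replacing only the classifiers.

```python
# Hypothetical composition of a fixed base model with swappable classifier sets.
# Everything here is illustrative; the paper does not prescribe this interface.
from typing import Callable, Dict, List

Classifier = Callable[[str], bool]  # returns True if the text should be blocked

# Deployment-specific classifier sets, filled with trained classifiers at deploy time.
CLASSIFIER_SETS: Dict[str, List[Classifier]] = {
    "default": [],   # baseline safeguards
    "eu": [],        # e.g. baseline plus region-specific rules
}

def guarded_call(llm: Callable[[str], str], region: str, query: str) -> str:
    """Gate a request and its response with the classifier set chosen for this region."""
    classifiers = CLASSIFIER_SETS.get(region, CLASSIFIER_SETS["default"])
    if any(block(query) for block in classifiers):
        return "Request refused by policy."
    answer = llm(query)
    if any(block(answer) for block in classifiers):
        return "Response withheld by policy."
    return answer
```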
Thanks, sounds very interesting
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints (2025)
- Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models (2025)
- Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks (2025)
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models (2024)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models (2025)
- Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection (2024)
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting (2024)