Papers
arxiv:2501.18837

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Published on Jan 31
· Submitted by akhaliq on Feb 3
Abstract

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
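The serving loop the abstract describes can be sketched roughly as follows. This is a hedged illustration, not the paper's implementation: the names (`guarded_generate`, `score_input`, `score_output`) and the 0.5 threshold are hypothetical, and the paper's classifiers are trained LLM-based models rather than the simple callables assumed here.

```python
# Hedged sketch of a classifier-guarded generation loop. All names and the
# threshold are hypothetical stand-ins; the paper's input/output classifiers
# are trained models, modeled here as callables returning a harm score.

BLOCK_THRESHOLD = 0.5  # assumed decision threshold


def guarded_generate(prompt, model, score_input, score_output):
    """Refuse if either the prompt or the streamed output is flagged."""
    if score_input(prompt) > BLOCK_THRESHOLD:
        return "Request refused by input classifier."
    completion = []
    for token in model(prompt):  # model yields tokens as a stream
        completion.append(token)
        # Re-score the running completion so a harmful continuation
        # can be cut off mid-stream rather than only post hoc.
        if score_output("".join(completion)) > BLOCK_THRESHOLD:
            return "Response halted by output classifier."
    return "".join(completion)
```

Scoring the partial completion on every token is what lets the guard stop a harmful answer before it finishes, at the cost of the extra inference overhead the abstract quantifies.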

Community

Paper submitter


In over 3,000 estimated hours of red teaming, no red teamer found a
universal jailbreak

love these claims because two days later a cracked 17-year-old from Vermont with too much time on their hands will come up with some weird fishname that will work anyway


It's still a pretty cool approach, though:

  1. it secures the main model without turning it into a boring idiot (which often happens during "safety" indoctrination training),
  2. when new types of attacks appear, updating the smaller classifier models is much easier than retraining the main model,
  3. after proper calibration it should reduce global over-refusal and false-positive rates,
  4. it allows greater customization (you can use different sets of classifiers in different countries with different laws; potentially you could also allow enterprises to add their own classifiers, etc.).
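Points 2 and 4 above boil down to the classifiers being pluggable. A minimal sketch of that idea, assuming hypothetical per-deployment classifier sets (the deployment names and the keyword-based classifiers here are illustrative stand-ins, not anything from the paper):

```python
# Hedged sketch: swappable classifier sets per deployment. Each classifier
# is a predicate returning True when the content should be blocked; real
# constitutional classifiers are trained models, not keyword checks.

def blocks_weapons(text):
    return "weapon" in text.lower()


def blocks_gambling(text):
    return "casino" in text.lower()


# Hypothetical deployments with different legal requirements.
CLASSIFIER_SETS = {
    "default": [blocks_weapons],
    "strict-region": [blocks_weapons, blocks_gambling],
}


def is_allowed(text, deployment="default"):
    """Content passes only if no classifier in the active set flags it."""
    return not any(clf(text) for clf in CLASSIFIER_SETS[deployment])
```

Swapping or extending a `CLASSIFIER_SETS` entry changes policy without touching the main model, which is exactly why updating the guard is cheaper than retraining.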

Thanks, sounds very interesting

