Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Abstract
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, such as manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural-language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a level of detail similar to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
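As a rough illustration of the recipe the abstract describes, here is a minimal, hypothetical sketch: a constitution of natural-language rules is used to prompt an LLM into producing labeled synthetic examples, a lightweight classifier is trained on them, and that classifier then gates both inputs to and outputs from the guarded model. The `llm` callable, the toy constitution, and the TF-IDF/logistic-regression classifier are stand-in assumptions for illustration only, not the paper's actual classifiers or training setup.

```python
# Minimal sketch of the Constitutional Classifiers idea (illustrative only).
# Assumptions: `llm` is any text-generation callable, and a simple TF-IDF +
# logistic-regression model stands in for the paper's LLM-based classifiers.
from typing import Callable, List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy constitution: natural-language rules for permitted vs. restricted content.
CONSTITUTION = """\
Permitted: general chemistry education, safety information, fiction.
Restricted: step-by-step synthesis routes for illegal substances.
"""

def generate_synthetic_examples(llm: Callable[[str], str],
                                n_per_class: int) -> List[Tuple[str, int]]:
    """Prompt an LLM with the constitution to produce labeled training text."""
    examples = []
    for label, kind in [(0, "permitted"), (1, "restricted")]:
        for _ in range(n_per_class):
            prompt = (f"Constitution:\n{CONSTITUTION}\n"
                      f"Write one example of {kind} content, phrased as a user query.")
            examples.append((llm(prompt), label))
    return examples

def train_classifier(examples: List[Tuple[str, int]]):
    """Fit a small stand-in classifier on the synthetic, constitution-derived data."""
    texts, labels = zip(*examples)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(list(texts), list(labels))
    return clf

def guarded_generate(llm, clf, user_query: str, threshold: float = 0.5) -> str:
    """Gate both the user input and the draft output with the trained classifier."""
    if clf.predict_proba([user_query])[0, 1] > threshold:
        return "I can't help with that."
    draft = llm(user_query)
    if clf.predict_proba([draft])[0, 1] > threshold:
        return "I can't help with that."
    return draft
```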
Community
> In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak

love these claims, because two days later some cracked 17-year-old from Vermont with too much time on their hands will come up with some weird fishname that works anyway
It's still a pretty cool approach, though:
- it lets you secure the main model without turning it into a boring idiot (which often happens during "safety" indoctrination training),
- when new types of attack appear, updating the smaller classifier models is much easier than retraining the main model,
- after proper calibration it should reduce global over-refusal and false-positive rates,
- it allows greater customization (you can use different sets of classifiers in different countries with different laws; potentially you could also let enterprises add their own classifiers, etc.); a rough sketch of this composition follows below.
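To make the "swap the classifiers, not the model" point above concrete, here is a hypothetical sketch (all names are illustrative assumptions, not from the paper): the guarded model stays fixed, while each deployment or region selects its own set of small input/output classifiers, so updating defenses means replacing only the classifiers.

```python
# Hypothetical composition of a fixed base model with swappable classifier sets.
# Everything here is illustrative; the paper does not prescribe this interface.
from typing import Callable, Dict, List

Classifier = Callable[[str], bool]  # returns True if the text should be blocked

# Deployment-specific classifier sets, filled with trained classifiers at deploy time.
CLASSIFIER_SETS: Dict[str, List[Classifier]] = {
    "default": [],   # baseline safeguards
    "eu": [],        # e.g. baseline plus region-specific rules
}

def guarded_call(llm: Callable[[str], str], region: str, query: str) -> str:
    """Gate a request and its response with the classifier set chosen for this region."""
    classifiers = CLASSIFIER_SETS.get(region, CLASSIFIER_SETS["default"])
    if any(block(query) for block in classifiers):
        return "Request refused by policy."
    answer = llm(query)
    if any(block(answer) for block in classifiers):
        return "Response withheld by policy."
    return answer
```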
Thanks, sounds very interesting
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints (2025)
- Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models (2025)
- Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks (2025)
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models (2024)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models (2025)
- Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection (2024)
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting (2024)