@vladbogo on Hugging Face: "Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large…"

vladbogo

posted an update Apr 2, 2024

Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.

Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models like Claude 2.0, GPT-3.5, GPT-4
* Standard alignment techniques provide only limited protection against long-context attacks

Paper: https://www.anthropic.com/research/many-shot-jailbreaking
More details in my blog: https://huggingface.co/blog/vladbogo/many-shot-jailbreaking

Congrats to the authors for their work!

Fizzarolli

Apr 2, 2024

wow, i can't believe they finally figured out that LLMs are good at following patterns! /s

sauravssss

Apr 6, 2024

Is it real?
