Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training • arXiv:2401.05566 • Published Jan 10, 2024
Weak-to-Strong Jailbreaking on Large Language Models • arXiv:2401.17256 • Published Jan 30, 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks • arXiv:2401.17263 • Published Jan 30, 2024
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild • arXiv:2311.06237 • Published Nov 10, 2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal • arXiv:2402.04249 • Published Feb 6, 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions • arXiv:2404.13208 • Published Apr 19, 2024
Improving Alignment and Robustness with Short Circuiting • arXiv:2406.04313 • Published Jun 6, 2024