Representation Engineering: A Top-Down Approach to AI Transparency Paper • 2310.01405 • Published Oct 2, 2023
Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks Paper • 1910.01279 • Published Oct 3, 2019
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Paper • 2402.04249 • Published Feb 6, 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Paper • 2403.03218 • Published Mar 5, 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Paper • 2408.15221 • Published Aug 27, 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents Paper • 2410.13886 • Published Oct 11, 2024
MHJ Collection • Dataset and RMU model weights for "LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet" • 2 items • Updated Aug 27, 2024