---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- adversarial robustness
- human red teaming
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

<style>
button {
  margin: 0.5em;
  padding-left: calc(40vw / 100);
  padding-right: calc(40vw / 100);
  padding-bottom: 0;
  text-align: center;
  font-size: 12px;
  height: 25px;
  transition: 0.5s;
  background-size: 200% auto;
  color: white;
  border-radius: calc(60vw / 100);
  display: inline;
  font-weight: 500;
  box-shadow: 0px 0px 14px -7px #f09819;
  background-image: linear-gradient(45deg, #64F 0%, #000000 51%, #FF512F 100%);
  cursor: pointer;
  user-select: none;
  -webkit-user-select: none;
  touch-action: manipulation;
}

button:hover {
  background-position: right center;
  color: #fff;
  text-decoration: none;
}

button:active {
  transform: scale(0.95);
}
</style>

# Model Card for Llama3-8B-RMU

<a href="https://scale.com/research/mhj" style="text-decoration:none">
  <button>Homepage</button>
</a>
<a href="https://huggingface.co/datasets/ScaleAI/mhj" style="text-decoration:none">
  <button>Dataset</button>
</a>

This card contains the RMU-unlearned model `Llama3-8B-RMU` used in the paper *LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks*.

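For reference, here is a minimal loading sketch using the standard `transformers` chat API; the repo id below is a placeholder for illustration, not necessarily this model's actual id.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute this model card's actual id.
model_id = "ScaleAI/Llama3-8B-RMU"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Llama-3-Instruct models expect the chat template.
messages = [{"role": "user", "content": "What is machine unlearning?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
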
## Paper Abstract

Recent large language model (LLM) defenses have greatly improved models’ ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.

## RMU (Representation Misdirection for Unlearning) Model

For the [WMDP-Bio](https://www.wmdp.ai/) evaluation, we employ the RMU unlearning method. The original paper applies [RMU](https://arxiv.org/abs/2403.03218) to the zephyr-7b-beta model, but to standardize defenses and use a more performant model, we apply RMU to Llama-3-8B-Instruct, the same base model as all other defenses in this paper. We conduct a hyperparameter search over batches ∈ {200, 400}, c ∈ {5, 20, 50, 200}, α ∈ {200, 500, 2000, 5000}, and lr ∈ {2 × 10⁻⁵, 5 × 10⁻⁵, 2 × 10⁻⁴}. We select batches = 400, c = 50, α = 5000, and lr = 2 × 10⁻⁴, and retain the hyperparameters layer_ids = [5, 6, 7] and param_ids = [6] from [Li et al.](https://arxiv.org/abs/2403.03218). We validate our results in Figure 8 of the paper, demonstrating a reduction in WMDP performance but retention of general capabilities (MMLU).

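For orientation, below is a minimal PyTorch sketch of the RMU objective these hyperparameters control, following the formulation of [Li et al.](https://arxiv.org/abs/2403.03218); the `layer_activations` helper, the batch format, and the function names are simplified assumptions for illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def layer_activations(model, batch, layer_id):
    # Hidden states at the probed layer, via the standard transformers API.
    out = model(**batch, output_hidden_states=True)
    return out.hidden_states[layer_id]

def make_control_vec(hidden_size, c=50.0, device="cpu"):
    # Random direction scaled by the steering coefficient (c = 50 above).
    u = torch.rand(1, 1, hidden_size, device=device)
    return c * u / u.norm()

def rmu_loss(updated_model, frozen_model, forget_batch, retain_batch,
             control_vec, alpha=5000.0, layer_id=7):
    # Forget term: push activations on hazardous (forget) data toward the
    # scaled random control vector.
    h_forget = layer_activations(updated_model, forget_batch, layer_id)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Retain term, weighted by alpha = 5000: keep activations on benign data
    # close to those of the frozen base model.
    h_retain = layer_activations(updated_model, retain_batch, layer_id)
    with torch.no_grad():
        h_frozen = layer_activations(frozen_model, retain_batch, layer_id)
    retain_loss = F.mse_loss(h_retain, h_frozen)

    return forget_loss + alpha * retain_loss

def trainable_params(model, layer_ids=(5, 6, 7), param_ids=(6,)):
    # Update only the parameters at layer_ids whose index is in param_ids,
    # mirroring the hyperparameters retained from Li et al.
    return [p for lid in layer_ids
            for i, p in enumerate(model.model.layers[lid].parameters())
            if i in param_ids]
```

In this sketch, only the parameters selected by `trainable_params` would be passed to the optimizer (lr = 2 × 10⁻⁴), with the combined loss minimized over the selected 400 batches.
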
The following figure shows Llama-3-8B-Instruct multiple-choice benchmark accuracies before and after RMU.

![](rmu_result.png)