---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- adversarial robustness
- human red teaming
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

<style>
button {
  margin: 0.5em;
  padding-left: calc(40vw / 100);
  padding-right: calc(40vw / 100);
  padding-bottom: 0;
  text-align: center;
  font-size: 12px;
  height: 25px;
  transition: 0.5s;
  background-size: 200% auto;
  color: white;
  border-radius: calc(60vw / 100);
  display: inline;
  font-weight: 500;
  box-shadow: 0px 0px 14px -7px #f09819;
  background-image: linear-gradient(45deg, #64F 0%, #000000 51%, #FF512F 100%);
  cursor: pointer;
  user-select: none;
  -webkit-user-select: none;
  touch-action: manipulation;
}

button:hover {
  background-position: right center;
  color: #fff;
  text-decoration: none;
}

button:active {
  transform: scale(0.95);
}
</style>

# Model Card for Llama3-8B-RMU

<a href="https://scale.com/research/mhj" style="text-decoration:none">
  <button>Homepage</button>
</a>
<a href="https://huggingface.co/datasets/ScaleAI/mhj" style="text-decoration:none">
  <button>Dataset</button>
</a>

This card contains the RMU-unlearned model `Llama3-8B-RMU` used in the paper *LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks*.

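For reference, here is a minimal loading sketch using the standard `transformers` chat API; the repo id below is a placeholder for illustration, not necessarily this model's actual id.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute this model card's actual id.
model_id = "ScaleAI/Llama3-8B-RMU"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Llama-3-Instruct models expect the chat template.
messages = [{"role": "user", "content": "What is machine unlearning?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
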
## Paper Abstract

Recent large language model (LLM) defenses have greatly improved models’ ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.

## RMU (Representation Misdirection for Unlearning) Model

For the [WMDP-Bio](https://www.wmdp.ai/) evaluation, we employ the RMU unlearning method. The original paper applies [RMU](https://arxiv.org/abs/2403.03218) to the zephyr-7b-beta model, but to standardize defenses and use a more performant model, we apply RMU to Llama-3-8B-Instruct, the same base model as all other defenses in this paper. We conduct a hyperparameter search over batches ∈ {200, 400}, c ∈ {5, 20, 50, 200}, α ∈ {200, 500, 2000, 5000}, and lr ∈ {2 × 10⁻⁵, 5 × 10⁻⁵, 2 × 10⁻⁴}. We select batches = 400, c = 50, α = 5000, and lr = 2 × 10⁻⁴, and retain the hyperparameters layer_ids = [5, 6, 7] and param_ids = [6] from [Li et al.](https://arxiv.org/abs/2403.03218). We validate our results in Figure 8 of the paper, demonstrating a reduction in WMDP performance but retention of general capabilities (MMLU).

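For orientation, below is a minimal PyTorch sketch of the RMU objective these hyperparameters control, following the formulation of [Li et al.](https://arxiv.org/abs/2403.03218); the `layer_activations` helper, the batch format, and the function names are simplified assumptions for illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def layer_activations(model, batch, layer_id):
    # Hidden states at the probed layer, via the standard transformers API.
    out = model(**batch, output_hidden_states=True)
    return out.hidden_states[layer_id]

def make_control_vec(hidden_size, c=50.0, device="cpu"):
    # Random direction scaled by the steering coefficient (c = 50 above).
    u = torch.rand(1, 1, hidden_size, device=device)
    return c * u / u.norm()

def rmu_loss(updated_model, frozen_model, forget_batch, retain_batch,
             control_vec, alpha=5000.0, layer_id=7):
    # Forget term: push activations on hazardous (forget) data toward the
    # scaled random control vector.
    h_forget = layer_activations(updated_model, forget_batch, layer_id)
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Retain term, weighted by alpha = 5000: keep activations on benign data
    # close to those of the frozen base model.
    h_retain = layer_activations(updated_model, retain_batch, layer_id)
    with torch.no_grad():
        h_frozen = layer_activations(frozen_model, retain_batch, layer_id)
    retain_loss = F.mse_loss(h_retain, h_frozen)

    return forget_loss + alpha * retain_loss

def trainable_params(model, layer_ids=(5, 6, 7), param_ids=(6,)):
    # Update only the parameters at layer_ids whose index is in param_ids,
    # mirroring the hyperparameters retained from Li et al.
    return [p for lid in layer_ids
            for i, p in enumerate(model.model.layers[lid].parameters())
            if i in param_ids]
```

In this sketch, only the parameters selected by `trainable_params` would be passed to the optimizer (lr = 2 × 10⁻⁴), with the combined loss minimized over the selected 400 batches.
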
The following figure shows Llama-3-8B-Instruct multiple-choice benchmark accuracies before and after RMU.

![](rmu_result.png)