Introduction
Large Language Models (LLMs) have become a prominent generative AI tool: the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs with human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts that aim to subvert the embedded safety guardrails. To address this challenge, we define and investigate the Refusal Loss of LLMs and propose a method called Gradient Cuff to detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". We then present the refusal loss landscape and derive Gradient Cuff from the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense methods and report their defense performance.
What is Jailbreak?
Jailbreak attacks maliciously insert or replace tokens in the user instruction, or rewrite it entirely, to circumvent the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM can be tricked into generating hate speech targeting certain groups of people, as demonstrated below.
![](./jailbreak.png)
Refusal Loss
Because of the randomness in autoregressive sampling-based generation, current transformer-based LLMs can return different responses to the same query. An interesting consequence of this randomness is that the same malicious user query will sometimes be rejected by the target LLM, yet sometimes bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss and visualize its 2-d landscape below:
![](./loss_landscape.png)
In general, $\phi_\theta(x)$ tends to fall below 0.5 for malicious queries while staying close to 1 for benign queries, and the landscape around malicious queries is noticeably steeper, so both the function value and the gradient norm of the Refusal Loss carry signal for detecting jailbreak attempts. Below we present the definition of the Refusal Loss and how we approximate its function value and gradient. See our paper for more details about the concept, the approximation, and the drawing of its 2-d landscape.
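For readers who want to prototype the idea, below is a minimal, hedged sketch of how the Refusal Loss value and its gradient norm could be approximated. It assumes a hypothetical `sample_response(model, query)` helper that draws one stochastic response from the target LLM, a simple keyword-based refusal indicator, and a `phi_fn` that re-evaluates the estimated Refusal Loss at a perturbed query embedding; the exact estimators used in the paper may differ.

```python
import numpy as np

# Crude keyword-based refusal indicator (an assumption of this sketch;
# the paper's indicator of whether a response refuses the query may differ).
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i'm sorry", "i apologize", "as an ai")

def is_refusal(response: str) -> bool:
    """Return True if the sampled response looks like a refusal."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def estimate_refusal_loss(model, query: str, n_samples: int = 8) -> float:
    """Monte-Carlo estimate of phi_theta(x): one minus the empirical refusal
    rate over n_samples stochastic generations for the same query.
    `sample_response` is a hypothetical helper wrapping the target LLM."""
    refusals = sum(is_refusal(sample_response(model, query)) for _ in range(n_samples))
    return 1.0 - refusals / n_samples

def zeroth_order_grad_norm(phi_fn, query_emb: np.ndarray,
                           mu: float = 0.02, num_dirs: int = 10) -> float:
    """Finite-difference (zeroth-order) estimate of the gradient norm of the
    Refusal Loss at the query embedding, up to a constant scaling factor.
    `phi_fn(emb)` is assumed to re-estimate the Refusal Loss after replacing
    the query embedding with `emb`."""
    base = phi_fn(query_emb)
    grad = np.zeros_like(query_emb)
    for _ in range(num_dirs):
        direction = np.random.randn(*query_emb.shape)
        direction /= np.linalg.norm(direction)
        grad += (phi_fn(query_emb + mu * direction) - base) / mu * direction
    grad /= num_dirs
    return float(np.linalg.norm(grad))
```

Since the Refusal Loss is only accessible through sampling, the gradient is estimated with random directional finite differences rather than backpropagation; the step size `mu` and the number of directions `num_dirs` are illustrative values.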
Proposed Approach: Gradient Cuff
![](./gradient_cuff.png)
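The figure above summarizes the detection pipeline. As a hedged illustration of how the two landscape signals could be turned into a detector, the sketch below flags a query when its estimated Refusal Loss is already low (the LLM refuses it most of the time) or when the estimated gradient norm exceeds a tuned threshold; `sigma` and `tau` are placeholder thresholds, not the values used in the paper.

```python
def gradient_cuff_style_check(phi_value: float, grad_norm: float,
                              sigma: float = 0.5, tau: float = 100.0) -> bool:
    """Two-stage check motivated by the refusal loss landscape.

    phi_value: estimated Refusal Loss of the query (see the sketch above).
    grad_norm: estimated gradient norm of the Refusal Loss at the query.
    sigma, tau: placeholder thresholds; in practice tau would be tuned to
    keep the refusal rate on benign queries acceptably low.

    Returns True if the query should be rejected as a jailbreak attempt.
    """
    if phi_value < sigma:   # the LLM already refuses this query most of the time
        return True
    return grad_norm > tau  # steep refusal-loss landscape: likely malicious
```

A benign query typically has a Refusal Loss close to 1 and a comparatively flat landscape, so it passes both checks and is answered normally.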
Demonstration
We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5). We report the average refusal rate across these 6 malicious user query datasets, as well as the refusal rate on benign user queries (the Benign Refusal Rate).
![Gradient Cuff demonstration results on LLaMA-2-7B-Chat](demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png)
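For completeness, here is a small sketch of how the two reported metrics could be computed from per-query refusal decisions; the data layout (a list of booleans per dataset) is an assumption of this example.

```python
def refusal_rate(refused_flags):
    """Fraction of queries the defended LLM refused (True = refused)."""
    return sum(refused_flags) / len(refused_flags)

def average_malicious_refusal_rate(per_attack_flags):
    """Average refusal rate across the malicious-query datasets
    (e.g., GCG, AutoDAN, PAIR, TAP, Base64, LRL); higher is better."""
    return sum(refusal_rate(flags) for flags in per_attack_flags.values()) / len(per_attack_flags)

# The Benign Refusal Rate is simply refusal_rate(...) over benign queries;
# lower is better, since benign queries should not be refused.
```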
Citations
If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:
@misc{xxx,
      title={{Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}},
      author={Xiaomeng Hu and Pin-Yu Chen and Tsung-Yi Ho},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={}
}