Gradient Cuff

Introduction

Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, we define and investigate the Refusal Loss of LLMs and then propose a method called Gradient Cuff to detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak" and summarize people's efforts in Jailbreak attack and Jailbreak defense. Then we present the 2-D Refusal Loss Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense methods and show the defense performance against several Jailbreak attack methods.

What is Jailbreak?

Jailbreak attacks involve maliciously inserting or replacing tokens in the user instruction or rewriting it to bypass and circumvent the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM would be tricked into generating hate speech targeting certain groups of people, as demonstrated below.

We summarized some recent advances in Jailbreak Attack and Jailbreak Defense in the below table:

Jailbreak Attack
Jailbreak Defense

GCG

Paper: Universal and Transferable Adversarial Attacks on Aligned Language Models
Brief Introduction: Given a (potentially harmful) user query, GCG trains and appends an adversarial suffix to the query that attempts to induce negative behavior from the target LLM.

AutoDAN

Paper: AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Brief Introduction: AutoDAN, an automatic stealthy jailbreak prompts generation framework based on a carefully designed hierarchical genetic algorithm. AUtoDAN preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts, akin to handcrafted ones, while also ensuring automated deployment as introduced in prior token-level research like GCG.

PAIR

Paper: Jailbreaking Black Box Large Language Models in Twenty Queries
Brief Introduction: PAIR uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. The attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak based on the comments and the rated score provided by another Judge model. Empirically, PAIR often requires fewer than twenty queries to produce a successful jailbreak.

TAP

Paper: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Brief Introduction: TAP is similar to PAIR. The main difference is that the attacker in TAP iteratively refines candidate (attack) prompts using tree-of-thought reasoning.

Base64

Paper: Jailbroken: How Does LLM Safety Training Fail?
Brief Introduction: Encode the malicious user query into base64 format before using it to query the model.

LRL

Paper: Low-Resource Languages Jailbreak GPT-4
Brief Introduction: Translate the malicious user query into low-resource language before using it to query the model.

Perpleixty Filter

Paper: Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Brief Introduction: Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects those with high perplexity.

SmoothLLM

Paper: SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Brief Introduction: SmoothLLM perturbs the original input query to obtain several copies and aggregates the intermediate responses of the target LLM to these perturbed queries to give the final response to the original query.

Erase-Check

Paper: Certifying LLM Safety against Adversarial Prompting
Brief Introduction: Erase-Check employs a model to check whether the original query or any of its erased subsentences is harmful. The query would be rejected if the query or one of its sub-sentences is regarded as harmful by the safety checker

Self-Reminder

Paper: Defending ChatGPT against Jailbreak Attack via Self-Reminder
Brief Introduction: Self-Reminder modifying the system prompt of the target LLM so that the model reminds itself to process and respond to the user in the context of being an aligned LLM.

Refusal Loss Landscape Exploration

Current transformer-based LLMs will return different responses to the same query due to the randomness of autoregressive sampling-based generation. With this randomness, it is an interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM. We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:

We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query). Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more details about them and the landscape drawing techniques in our paper.

Refusal Loss Definition Refusal Loss Computation Gradient Estimation

$$ \displaystyle \begin{aligned} \phi_\theta(x)&=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)\\ JB (y) &= \begin{cases} 1 \text{, if $y$ contains any jailbreak keyword;} \\ 0 \text{, otherwise.} \end{cases} \end{aligned} $$

Proposed Approach: Gradient Cuff

With the exploration of the Refusal Loss landscape, we propose Gradient Cuff, a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:

Gradient Cuff can be summarized into two phases:

(Phase 1) Sampling-based Rejection: In the first step, we reject the user query by checking whether the Refusal Loss value is below 0.5. If true, then user query is rejected, otherwise, the user query is pushed into phase 2.

(Phase 2) Gradient Norm Rejection: In the second step, we regard the user query as having jailbreak attempts if the norm of the estimated gradient is larger than a configurable threshold t.

We provide more details about the running flow of Gradient Cuff in the paper.

Demonstration

We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5). We below demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is shown in the provided bar chart.

Models LLaMA-2-7B-Chat Vicuna-7B-V1.5

Average Malicious Refusal Rate0.959

Benign Refusal Rate0.050

CIFAR-100 Calibrated Reliability Diagram (Full)

Higher malicious refusal rate and lower benign refusal rate mean a better defense. Overall, Gradient Cuff is the most performant compared with those baselines. We also evaluated Gradient Cuff against adaptive attacks in the paper.

Inquiries on LLM with Gradient Cuff defense

Please contact Xiaomeng Hu and Pin-Yu Chen

Citations

If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:

@misc{hu2024gradient,
      title={Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}, 
      author={Xiaomeng Hu and Pin-Yu Chen and Tsung-Yi Ho},
      year={2024},
      eprint={2403.00867},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}