|
<!DOCTYPE html> |
|
<html lang="en-US"> |
|
<head> |
|
<meta charset="UTF-8"> |
|
|
|
|
|
<title>Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</title>
|
<meta property="og:title" content="Gradient Cuff" /> |
|
<meta property="og:locale" content="en_US" /> |
|
<meta name="description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" /> |
|
<meta property="og:description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" /> |
|
<script type="application/ld+json"> |
|
{"@context":"https://schema.org","@type":"WebSite","description":"Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script> |
|
|
|
|
|
<link rel="preconnect" href="https://fonts.gstatic.com"> |
|
<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"> |
|
<meta name="theme-color" content="#157878"> |
|
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"> |
|
|
|
<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
|
|
<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> |
|
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css"> |
|
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script> |
|
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script> |
|
<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script> |
|
|
<script> |
|
$( function() { |
|
$( "#tabs" ).tabs(); |
|
} ); |
|
</script> |
|
|
|
|
|
|
|
|
|
|
|
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script> |
|
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> |
|
|
|
|
|
|
|
|
|
</head> |
|
<body> |
|
<a id="skip-to-content" href="#content">Skip to the content.</a> |
|
|
|
<header class="page-header" role="banner"> |
|
<h1 class="project-name">Gradient Cuff</h1> |
|
<h2 class="project-tagline">Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</h2> |
|
|
|
|
|
</header> |
|
|
|
<main id="content" class="main-content" role="main"> |
|
<h2 id="introduction">Introduction</h2> |
|
|
|
<p>Large Language Models (LLMs) are becoming a prominent generative AI tool: the user enters a
query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
these LLMs with human values using advanced training techniques such as Reinforcement Learning from
Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
jailbreak attempts aimed at subverting the embedded safety guardrails. To address this challenge,
we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". We then present the 2-D Refusal Loss
landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
methods and show its defense performance against several jailbreak attacks.
</p>
|
|
|
<h2 id="what-is-jailbreak">What is Jailbreak?</h2> |
|
<p>Jailbreak attacks maliciously insert or replace tokens in the user instruction, or rewrite it altogether, to bypass
the safety guardrails of aligned LLMs. A notable example is a jailbroken LLM being tricked into
generating hate speech targeting certain groups of people, as demonstrated below.</p>
|
|
|
<div class="container"> |
|
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec"> |
|
<img id="jailbreak-intro-img" src="./jailbreak.png" /> |
|
</div> |
|
</div> |
|
|
|
|
|
<h2 id="jailbreak-attack-and-defense">Jailbreak Red-Teaming and Blue-Teaming</h2>
|
<p>We summarize recent advances in jailbreak attacks and jailbreak defenses in the tabs below.</p>
|
<div id="tabs">

<ul>

<li><a href="#tabs-1">Jailbreak Attacks</a></li>

<li><a href="#tabs-2">Jailbreak Defenses</a></li>

</ul>

<div id="tabs-1">

<ul>
<li><strong>GCG</strong>: appends an adversarial suffix to the malicious query, optimized with greedy coordinate gradient search.</li>
<li><strong>AutoDAN</strong>: uses a genetic algorithm to evolve stealthy, human-readable jailbreak prompts.</li>
<li><strong>PAIR</strong>: employs an attacker LLM to iteratively refine the jailbreak prompt against the target LLM.</li>
<li><strong>TAP</strong>: extends PAIR with tree-structured search and pruning of unpromising prompt candidates.</li>
<li><strong>Base64</strong>: encodes the malicious query in Base64 to slip past the safety guardrail.</li>
<li><strong>LRL</strong>: translates the malicious query into low-resource languages that are under-represented in safety alignment.</li>
</ul>

</div>

<div id="tabs-2">

<ul>
<li><strong>Perplexity Filter</strong>: rejects queries whose perplexity is abnormally high, a signature of adversarially optimized suffixes.</li>
<li><strong>SmoothLLM</strong>: randomly perturbs multiple copies of the query and aggregates the responses to dilute token-level attacks.</li>
<li><strong>Erase-and-Check</strong>: erases tokens from the query and flags it if any resulting subsequence is judged harmful by a safety filter.</li>
<li><strong>Self-Reminder</strong>: wraps the query in system prompts that remind the LLM to respond responsibly.</li>
<li><strong>Gradient Cuff (ours)</strong>: detects jailbreak attempts by checking the Refusal Loss value and its gradient norm.</li>
</ul>

</div>

</div>
|
|
|
<h3 id="refusal-loss">Refusal Loss Landscape Exploration</h3> |
|
<p>Current transformer-based LLMs return different responses to the same query because of the randomness of
autoregressive, sampling-based generation. Owing to this randomness, an interesting phenomenon arises: a malicious user query will sometimes be
rejected by the target LLM, but will sometimes bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
represent the probability with which the LLM won't reject the input user query. Using 1 to denote a successful jailbreak and 0 to denote
the opposite, we compute the empirical Refusal Loss from the jailbroken indicators of responses sampled from the target LLM.

We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B below:
</p>
|
|
|
<div class="container jailbreak-intro-sec"> |
|
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div> |
|
</div> |
|
|
|
<p> |
|
We show the loss landscape for both benign and malicious queries in the plot above. The benign queries are non-harmful user instructions collected
from LM-SYS Chatbot Arena, a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful-behavior
user instructions equipped with the GCG jailbreak prompt. The plot shows that the loss landscape is noticeably more precipitous for malicious queries than for benign queries,
which implies that the Refusal Loss tends to have a large gradient norm when the input is a malicious query. This observation motivates our proposal of using
the gradient norm of the Refusal Loss to detect jailbreak attempts that pass the initial filter, which rejects the input query whenever the function value
is below 0.5 (a naive detector on its own, since the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
Below we present the definition of the Refusal Loss, the computation of its empirical value, and the approximation of its gradient; see our paper for more
details about these quantities and the landscape-drawing techniques.
|
</p> |
|
|
|
<div id="refusal-loss-formula" class="container"> |
|
<div id="refusal-loss-formula-list" class="row align-items-center formula-list"> |
|
<a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a> |
|
<a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a> |
|
<a href="#Gradient-Estimation">Gradient Estimation</a> |
|
<div style="clear: both"></div> |
|
</div> |
|
<div id="refusal-loss-formula-content" class="row align-items-center"> |
|
<span id="Refusal-Loss" class="formula" style=""> |
|
$$ |
|
\displaystyle |
|
\begin{aligned} |
|
\phi_\theta(x)&=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)\\ |
|
JB (y) &= \begin{cases} |
|
1 \text{, if $y$ contains any jailbreak keyword;} \\ |
|
0 \text{, otherwise.} |
|
\end{cases} |
|
\end{aligned} |
|
$$ |
|
</span> |
|
<span id="Refusal-Loss-Approximation" class="formula" style="display: none;"> |
|
$$ |
|
\displaystyle |
|
\begin{aligned} |
|
f_\theta(x) &=1-\frac{1}{N}\sum_{i=1}^N JB(y_i)\\ |
|
JB (y_i) &= \begin{cases} |
|
1 \text{, if $y_i$ contains any jailbreak keyword;} \\ |
|
0 \text{, otherwise.} |
|
\end{cases} |
|
\end{aligned} |
|
$$ |
|
</span> |
|
<span id="Gradient-Estimation" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span> |
|
</div> |
|
</div> |
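<p>
To make the formulas above concrete, the following Python sketch shows one way the empirical Refusal Loss \(f_\theta(x)\) and its zeroth-order gradient estimate \(g_\theta(x)\) could be computed. The refusal keyword list, the <code>generate</code> callback, and the way the perturbation \(x \oplus \mu u_i\) is applied (additive noise on a query embedding) are illustrative assumptions for this page, not the exact implementation described in our paper.
</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def jb(response):
    # JB(y) = 1 if the response contains any refusal keyword used for jailbreak
    # judging, 0 otherwise. The keyword list here is an illustrative assumption.
    refusal_keywords = ["i'm sorry", "i cannot", "i can't", "as an ai", "i apologize"]
    return int(any(k in response.lower() for k in refusal_keywords))

def empirical_refusal_loss(generate, query, n=8):
    # f_theta(x) = 1 - (1/N) * sum_i JB(y_i): the fraction of sampled responses
    # in which the LLM does not refuse, i.e. the empirical probability that the
    # LLM won't reject the query. `generate` samples one response per call.
    samples = [generate(query) for _ in range(n)]
    return 1.0 - sum(jb(y) for y in samples) / n

def estimate_gradient_norm(f, x_embedding, mu=0.02, p=10):
    # g_theta(x) = sum_i ((f(x + mu * u_i) - f(x)) / mu) * u_i, with u_i random
    # unit directions in the query-embedding space (a zeroth-order estimate).
    base = f(x_embedding)
    g = np.zeros_like(x_embedding)
    for _ in range(p):
        u = np.random.randn(*x_embedding.shape)
        u = u / np.linalg.norm(u)
        g = g + (f(x_embedding + mu * u) - base) / mu * u
    return float(np.linalg.norm(g))
</code></pre></div></div>

<p>
Note that each evaluation of \(f_\theta\) requires \(N\) sampled responses, so one gradient estimate with \(P\) perturbations costs roughly \((P+1)\cdot N\) generations.
</p>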
|
|
|
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2> |
|
<p> Building on this exploration of the Refusal Loss landscape, we propose Gradient Cuff,
a two-step jailbreak detection method that checks both the Refusal Loss value and its gradient norm. Our detection procedure is shown below:
|
</p> |
|
|
|
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div> |
|
|
|
<p> |
|
Gradient Cuff can be summarized into two phases: |
|
</p> |
|
<p> |
|
<strong>(Phase 1) Sampling-based Rejection:</strong> In the first step, we check whether the empirical Refusal Loss value is below 0.5. If it is, the user query is rejected; otherwise, the query is passed to Phase 2.
|
</p> |
|
<p> |
|
<strong>(Phase 2) Gradient Norm Rejection:</strong> In the second step, we regard the user query as a jailbreak attempt if the norm of the estimated Refusal Loss gradient exceeds a configurable threshold t.
|
</p> |
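<p>
As a rough illustration of how the two phases compose, here is a minimal Python sketch of the detector, reusing the helper functions sketched above. The function names and the default threshold value are assumptions for illustration; the thresholds used in our experiments are described in the paper.
</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def gradient_cuff(query, refusal_loss_fn, grad_norm_fn, t=100.0):
    # Return True when the query should be rejected as a jailbreak attempt.
    # Phase 1: Sampling-based Rejection. A Refusal Loss value below 0.5 means
    # the LLM refuses the query in most sampled responses, so reject right away.
    if 0.5 > refusal_loss_fn(query):
        return True
    # Phase 2: Gradient Norm Rejection. Otherwise, flag the query when the norm
    # of the estimated Refusal Loss gradient exceeds the configurable threshold t.
    return grad_norm_fn(query) > t
</code></pre></div></div>

<p>
A query is therefore answered only if it passes both checks: its Refusal Loss stays at or above 0.5 and the estimated gradient norm stays at or below the threshold.
</p>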
|
|
|
<p> |
|
We provide more details about the running flow of Gradient Cuff in the paper. |
|
</p> |
|
|
|
<h2 id="demonstration">Demonstration</h2> |
|
<p>We evaluated Gradient Cuff and 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) as well as benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
Vicuna-7B-V1.5). Below, we report the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
Rate, and the refusal rate on benign user queries as the Benign Refusal Rate. The defense performance against each jailbreak type is
shown in the bar chart.
</p>
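<p>
Both summary numbers shown in the demo are plain averages; for instance, the Average Malicious Refusal Rate is the mean of the per-attack refusal rates. A minimal sketch of that bookkeeping (the dataset names come from the list above; the rates themselves are whatever the selected defense achieves):
</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def average_malicious_refusal_rate(per_attack_refusal_rates):
    # Mean refusal rate over the six malicious query datasets:
    # GCG, AutoDAN, PAIR, TAP, Base64, and LRL.
    attacks = ["GCG", "AutoDAN", "PAIR", "TAP", "Base64", "LRL"]
    return sum(per_attack_refusal_rates[a] for a in attacks) / len(attacks)
</code></pre></div></div>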
|
|
|
|
|
<div id="jailbreak-demo" class="container"> |
|
<div class="row align-items-center"> |
|
<div class="row" style="margin: 10px 0 0"> |
|
<div class="models-list"> |
|
<span style="margin-right: 1em;">Models</span> |
|
<span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span> |
|
<span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span> |
|
</div> |
|
</div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="col-4"> |
|
<div id="defense-methods"> |
|
<div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" checked="" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.959</span></div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.050</span></div> |
|
</div> |
|
</div> |
|
<div class="col-8"> |
|
<figure class="figure"> |
|
<img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="Refusal rates of the selected defense against each jailbreak attack type" />
|
<div class="slider-container"> |
|
<div class="slider-label"><span>Perplexity Threshold</span></div> |
|
<div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div> |
|
</div> |
|
<div class="slider-container"> |
|
<div class="slider-label"><span>Gradient Threshold</span></div> |
|
<div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div> |
|
</div> |
|
<figcaption class="figure-caption"> |
|
</figcaption> |
|
</figure> |
|
</div> |
|
</div> |
|
</div> |
|
|
|
<p> |
|
We also evaluated adaptive attacks on LLMs with Gradient Cuff in place. Please refer to our paper for details. |
|
</p> |
|
|
|
<h2 id="citations">Citations</h2> |
|
<p>If you find Gradient Cuff helpful for your research, please cite our main paper as follows:</p>
|
|
|
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{xxx, |
|
title={{Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by |
|
Exploring Refusal Loss Landscapes}}, |
|
author={Xiaomeng Hu and Pin-Yu Chen and Tsung-Yi Ho}, |
|
year={2024}, |
|
eprint={}, |
|
archivePrefix={arXiv}, |
|
primaryClass={} |
|
} |
|
</code></pre></div></div> |
|
|
|
|
|
<footer class="site-footer"> |
|
|
|
<span class="site-footer-owner">Gradient Cuff is maintained by <a href="https://gregxmhu.github.io/">Xiaomeng Hu</a>.</span>
|
|
|
</footer> |
|
</main> |
|
</body> |
|
</html> |
|
|