Spaces:

TrustSafeAI
/

GradientCuff-Jailbreak-Defense

Running

File size: 11,425 Bytes

9fade2a
bfc5ccd
 
 
 
 
7261a26
 
 
bfc5ccd
7261a26
 
bfc5ccd
7261a26
bfc5ccd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7261a26
 
bfc5ccd
 
 
 
 
 
 
2654ca5
 
 
 
 
00b0a08
2654ca5
da4ab75
2654ca5
 
bfc5ccd
2654ca5
da4ab75
 
 
bfc5ccd
 
b929465
4c92274
bfc5ccd
 
 
5cec4e6
4ddde86
 
 
e12ac43
 
4ddde86
bfc5ccd
b929465
e12ac43
bfc5ccd
 
cf0d3f3
 
6d807bc
 
0c73edf
bfc5ccd
 
cf0d3f3
86e6270
 
 
 
 
 
94e1e98
 
86e6270
 
 
 
ca768c1
0b1f8c9
bfc5ccd
 
 
f420005
bfc5ccd
f420005
bfc5ccd
 
 
 
 
 
5cec4e6
bfc5ccd
 
 
4b4042c
bfc5ccd
 
 
 
fd620cd
 
bfc5ccd
 
 
 
 
6799636
232a1d9
0730956
 
1166ace
0730956
bfc5ccd
 
c9fcacb
7ef77c5
 
c9fcacb
bfc5ccd
 
 
 
c9fcacb
bfc5ccd
ab9235e
0730956
bfc5ccd
 
ab9235e
19975bf
bfc5ccd
 
 
 
 
 
 
 
 
 
 
 
7261a26
bfc5ccd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82981eb
bfc5ccd
 
 
 
9fade2a

<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="UTF-8">

<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Gradient Cuff | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes </title>
<meta property="og:title" content="Gradient Cuff" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<meta property="og:description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"WebSite","description":"Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script>
<!-- End Jekyll SEO tag -->

    <link rel="preconnect" href="https://fonts.gstatic.com">
    <link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin>
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="theme-color" content="#157878">
    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">

    <link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551">
    <link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551">
    <!-- start custom head snippets, customize with your own _includes/head-custom.html file -->
    <link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
    <link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script>
    <script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script>





<!-- for mathjax support -->
    <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
    <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>


<!-- end custom head snippets -->

  </head>
  <body>
    <a id="skip-to-content" href="#content">Skip to the content.</a>

    <header class="page-header" role="banner">
      <h1 class="project-name">Gradient Cuff</h1>
      <h2 class="project-tagline">Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</h2>
      
      
    </header>

    <main id="content" class="main-content" role="main">
      <h2 id="introduction">Introduction</h2>

<p>Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a 
  query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align 
  these LLMs to human values using advanced training techniques such as Reinforcement Learning from 
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial 
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to 
  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss 
  landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense 
  methods and show the defense performance.
</p>

<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
<p>Jailbreak attacks involve maliciously inserting or replacing tokens in the user instruction or rewriting it to bypass and circumvent 
  the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM would be tricked into 
  generating hate speech targeting certain groups of people, as demonstrated below.</p>

<div class="container">
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
<img id="jailbreak-intro-img" src="./jailbreak.png" />
</div>
</div>

<h3 id="refusal-loss">Refusal Loss</h3>
<p>Current transformer-based LLMs will return different responses to the same query due to the randomness of 
  autoregressive sampling-based generation. With this randomness, it is an 
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but 
  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss and visualize its 2-d 
  landscape below:
</p>

<div class="container jailbreak-intro-sec">
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
</div>

<div id="refusal-loss-formula" class="container">
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
  <a href="#ECE-formula" class="selected">Refusal Loss</a>
  <a href="#SCE-formula">Refusal Loss Approximation</a>
  <a href="#ACE-formula">Gradient Estimation</a>
  <div style="clear: both"></div>
</div>
<div id="refusal-loss-formula-content" class="row align-items-center">
  <span id="ECE-formula" class="formula" style="">
    $$
    \displaystyle 
    \begin{aligned} 
    \phi_\theta(x)&=1-\mathbb{E}_{y \sim T_\theta(x)} \\ 
    JB (y) &=  \begin{cases}
         1\text{,~~if $y$ contains any jailbreak keyword;} \\
         0\text{,~~otherwise.}
     \end{cases} 
    \end{aligned}
    $$
  </span>
  <span id="SCE-formula" class="formula" style="display: none;">$$\displaystyle f_\theta(x)=1-\frac{1}{N}\sum_{i=1}^N JB(y_i)$$</span>
  <span id="ACE-formula" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
</div>
</div>

<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>

<div class="container"><img id="gradient-cuff-header" src="images/header.png" /></div>

<h2 id="demonstration">Demonstration</h2>
<p>In the current research, a reliability diagram is drawn to show the calibration performance of a model. However, since
reliability diagrams often only provide fixed bar graphs statically, further explanation from the chart is limited. In
this demonstration, we show how to make reliability diagrams interactive and insightful to help researchers and
developers gain more insights from the graph. Specifically, we provide three CIFAR-100 classification models
in this demonstration.  Multiple Bin numbers are also supported </p>

<p>We hope this tool could also facilitate the development process.</p>

<div id="jailbreak-demo" class="container">
<div class="row align-items-center">
  <div class="row" style="margin: 10px 0 0">
      <div class="models-list">
        <span style="margin-right: 1em;">Models</span>
        <span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
        <span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
      </div>
  </div>
</div>
<div class="row align-items-center">
  <div class="col-4">
    <div id="defense-methods">
      <div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
      <div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
      <div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div>
      <div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
      <div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
    </div>
    <div class="row align-items-center">
      <div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.95875</span></div>
    </div>
    <div class="row align-items-center">
      <div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.05000</span></div>
    </div>
  </div>
  <div class="col-8">
  <figure class="figure">
    <img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" />
    <div class="slider-container">
      <div class="slider-label"><span>Perplexity Threshold</span></div>
      <div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
    </div>
    <div class="slider-container">
      <div class="slider-label"><span>Gradient Threshold</span></div>
      <div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
    </div>
    <figcaption class="figure-caption">
    </figcaption>
  </figure>
  </div>
</div>
</div>

<h2 id="citations">Citations</h2>
<p>If you find Neural Clamping helpful and useful for your research, please cite our main paper as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{hsiung2023nctv,
  title={{NCTV: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}}, 
  author={Lei Hsiung, Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
  booktitle={Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence},
  publisher={Association for the Advancement of Artificial Intelligence},
  year={2023},
  month={February}
}

@misc{tang2022neural_clamping,
  title={{Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration}}, 
  author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
  year={2022},
  eprint={2209.11604},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
</code></pre></div></div>


      <footer class="site-footer">
        
          <span class="site-footer-owner">Gradient Cuff is maintained by <a href="https://gregxmhu.github.io/">Xiaomeng Hu</a></a>.</span>
        
      </footer>
    </main>
  </body>
</html>