gregH committed on
Commit
3e1cb92
·
verified ·
1 Parent(s): f134f1b

Update index.html

Files changed (1)
  1. index.html +6 -4
index.html CHANGED
@@ -92,9 +92,11 @@ Exploring Refusal Loss Landscapes </title>
 
  <p>
  From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
- the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
- the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector bacause the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
- Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of its function value and gradient, see more details about them and the landscape drawing techniques in our paper.
+ the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+ the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
+ is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
+ Below we present the definition of the Refusal Loss and the approximation of its function value and gradient, see more details about them and
+ the landscape drawing techniques in our paper.
  </p>
 
  <div id="refusal-loss-formula" class="container">
@@ -156,7 +158,7 @@ We provide more details about the running flow of Gradient Cuff in the paper.
 
  <h2 id="demonstration">Demonstration</h2>
  <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
- against 6 different jailbreak attacks~(GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
+ against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
  Vicuna-7B-V1.5). We below demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
  Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is
  shown in the provided bar chart.
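
The two-stage check described in the changed paragraph (reject when the Refusal Loss value is under 0.5, then flag inputs whose Refusal Loss has a large gradient norm) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `loss_fn`, `estimate_grad_norm`, `two_stage_detect`, the random-direction finite-difference estimator, and the `grad_threshold` default are all hypothetical names and choices, and the input is modelled as a plain numeric vector. Only the 0.5 value threshold comes from the page text.

```python
import numpy as np


def estimate_grad_norm(loss_fn, x, num_directions=10, mu=0.02, seed=0):
    """Approximate the gradient norm of loss_fn at x without autodiff.

    Assumption: a random-direction finite-difference estimator; the page
    only says the gradient is approximated, not how.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(x)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_directions):
        u = rng.standard_normal(x.shape)  # random probe direction
        grad += (loss_fn(x + mu * u) - base) / mu * u
    grad /= num_directions
    return float(np.linalg.norm(grad))


def two_stage_detect(loss_fn, x, value_threshold=0.5, grad_threshold=1.0):
    """Return True if the query should be rejected.

    Stage 1: reject when the loss value is under value_threshold (0.5 in
    the page text). Stage 2: reject when the estimated gradient norm is
    at least grad_threshold (a hypothetical placeholder value).
    """
    if loss_fn(x) < value_threshold:
        return True
    return estimate_grad_norm(loss_fn, x) >= grad_threshold
```

With toy loss functions, a flat landscape with a high value passes both stages, while either a low value or a steep landscape triggers rejection.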