Update index.html

index.html  +5 -3

@@ -86,7 +86,7 @@ Exploring Refusal Loss Landscapes </title>
 the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
 <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
 mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
-We visualize the 2-D landscape of the empirical Refusal Loss as below:
+We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
 </p>

 <div class="container jailbreak-intro-sec">
@@ -94,8 +94,10 @@ Exploring Refusal Loss Landscapes </title>
 </div>

 <p>
-
-the
+We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected
+from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful
+behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries,
+which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
 is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
 Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more