Update index.html
index.html +4 -3
index.html
CHANGED
@@ -142,11 +142,12 @@ Exploring Refusal Loss Landscapes </title>
 
 <p>
 Gradient Cuff can be summarized into two phases:
-<span>
-<strong>(Phase 1) Sampling-based Rejection:</strong> In the first step, we reject the user query by checking whether $f_\theta(x)<0.5$. If true, then $x$ is rejected, otherwise, $x$ is pushed into phase 2.
 </p>
 <p>
-<strong>(Phase
+<strong>(Phase 1) Sampling-based Rejection:</strong> In the first step, we reject the user query by checking whether the Refusal Loss value is below 0.5. If true, then user query is rejected, otherwise, the user query is pushed into phase 2.
+</p>
+<p>
+<strong>(Phase 2) Gradient Norm Rejection:</strong> In the second step, we regard the user query as having jailbreak attempts if the norm of the estimated gradient is larger than a configurable threshold t.
 </p>
 
 
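For readers of this change, the two-phase decision described in the added text could be sketched as below. This is a minimal illustration, not the project's implementation: the diff does not show how the Refusal Loss or the estimated gradient is computed, so both are passed in as caller-supplied functions. The names `gradient_cuff_rejects`, `refusal_loss`, and `estimated_grad_norm` are hypothetical; only the 0.5 cutoff and the threshold `t` come from the text.

```python
from typing import Callable, TypeVar

Query = TypeVar("Query")


def gradient_cuff_rejects(
    query: Query,
    refusal_loss: Callable[[Query], float],         # hypothetical: returns the Refusal Loss value
    estimated_grad_norm: Callable[[Query], float],  # hypothetical: returns the estimated gradient norm
    t: float,                                       # configurable Phase 2 threshold
) -> bool:
    """Return True if the query is rejected by either phase."""
    # Phase 1 (Sampling-based Rejection): reject when the Refusal Loss
    # value is below 0.5.
    if refusal_loss(query) < 0.5:
        return True
    # Phase 2 (Gradient Norm Rejection): only reached when Phase 1 passes.
    # Treat the query as a jailbreak attempt when the norm of the
    # estimated gradient is larger than t.
    return estimated_grad_norm(query) > t
```

Because Phase 2 runs only when Phase 1 does not reject, the gradient estimate is never computed for queries already rejected in Phase 1.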