Spaces: TrustSafeAI/GradientCuff-Jailbreak-Defense
gregH committed on Feb 28, 2024

Commit 4555af9 · verified · 1 Parent(s): bd81cb7

Update index.html

Files changed (1)

index.html CHANGED

@@ -92,7 +92,7 @@ Exploring Refusal Loss Landscapes </title>
 <p>
   From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
   the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
-  the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5.
+  the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector bacause the Refusal Loss can be regarded the probability that the LLM won't reject the user query).
   Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of it's function value and gradient.
   See more details about the concept, approximation, gradient estimation and landscape drawing of it in our paper.
 </p>
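
Note on the changed paragraph: the two-stage check it describes (reject when the Refusal Loss value is under 0.5, then flag inputs whose Refusal Loss has a large gradient norm) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. The names `toy_refusal_loss`, `estimate_grad_norm`, and `two_stage_detect`, the numeric input vector, the `grad_threshold` value, and the random-direction finite-difference estimator are all assumptions made for the example; the page defers the actual approximation and gradient estimation to the paper.

```python
import math
import random
from typing import Callable, List, Sequence

LossFn = Callable[[Sequence[float]], float]


def toy_refusal_loss(offset: float, steepness: float) -> LossFn:
    """Hypothetical stand-in for the Refusal Loss: a sigmoid of the first
    input coordinate, so the value stays in (0, 1) like a probability."""
    def f(x: Sequence[float]) -> float:
        return 1.0 / (1.0 + math.exp(-(offset + steepness * x[0])))
    return f


def estimate_grad_norm(f: LossFn, x: Sequence[float],
                       num_directions: int = 16, mu: float = 1e-2,
                       seed: int = 0) -> float:
    """Estimate the gradient norm of f at x from function values only,
    using forward differences along random Gaussian directions
    (an assumed estimator chosen for this sketch)."""
    rng = random.Random(seed)
    fx = f(x)
    grad: List[float] = [0.0] * len(x)
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_pert = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_pert) - fx) / mu  # directional finite difference
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_directions
    return math.sqrt(sum(g * g for g in grad))


def two_stage_detect(f: LossFn, x: Sequence[float],
                     loss_threshold: float = 0.5,
                     grad_threshold: float = 1.0) -> str:
    """Stage 1: reject when the loss value is under loss_threshold.
    Stage 2: reject when the estimated gradient norm exceeds
    grad_threshold (a hypothetical value, not taken from the source)."""
    if f(x) < loss_threshold:
        return "rejected_by_value"
    if estimate_grad_norm(f, x) > grad_threshold:
        return "rejected_by_gradient"
    return "accepted"


if __name__ == "__main__":
    x0 = [0.0, 0.0, 0.0]
    print(two_stage_detect(toy_refusal_loss(2.0, 0.01), x0))   # flat landscape
    print(two_stage_detect(toy_refusal_loss(2.0, 200.0), x0))  # steep landscape
    print(two_stage_detect(toy_refusal_loss(-2.0, 0.01), x0))  # low loss value
```

The toy losses only mimic the qualitative picture in the paragraph: a flat landscape passes both stages, a steep one passes the value check but is caught by the gradient-norm check, and a low value is rejected at the first stage.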