TrustSafeAI/GradientCuff-Jailbreak-Defense
gregH committed on Feb 29

Commit ea6ff45 • 1 Parent(s): 3ff74af

Update index.html

Files changed (1)

index.html CHANGED

@@ -81,9 +81,12 @@ Exploring Refusal Loss Landscapes </title>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
-  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
-  the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
-  mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. Using the approximation, we visualize the 2-D landscape of the Refusal Loss below:
 </p>
 <div class="container jailbreak-intro-sec">

 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
+  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
+  represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
+  the opposite, we compute the empirical refusal loss as the sample mean of the jailbroken results returned from the target LLM.
+  <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
+  mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
+  We visualize the 2-D landscape of the empirical Refusal Loss as below:
 </p>
 <div class="container jailbreak-intro-sec">
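
Note on the changed paragraph: the empirical Refusal Loss it describes is a sample mean of 0/1 jailbreak indicators over repeated queries to the same model. The sketch below illustrates only that computation and is not the authors' implementation. The names `empirical_refusal_loss`, `make_toy_llm`, and `keyword_judge` are hypothetical, the "LLM" is a random stand-in, and the keyword-based judge is an assumption made so the example runs without a real model.

```python
import random
from typing import Callable


def empirical_refusal_loss(
    query: str,
    generate: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_samples: int = 10,
) -> float:
    """Sample mean of jailbreak indicators (1 = jailbroken, 0 = refused)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Query the same model repeatedly; sampling randomness varies the outcome.
    outcomes = [1 if is_jailbroken(generate(query)) else 0 for _ in range(n_samples)]
    return sum(outcomes) / n_samples


def make_toy_llm(refusal_prob: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: refuses with probability refusal_prob."""
    rng = random.Random(seed)

    def generate(query: str) -> str:
        if rng.random() < refusal_prob:
            return "I cannot help with that."
        return "Sure, here is how..."

    return generate


def keyword_judge(response: str) -> bool:
    """Assumed judge: any response that is not a refusal counts as jailbroken."""
    return not response.startswith("I cannot")


if __name__ == "__main__":
    llm = make_toy_llm(refusal_prob=0.7)
    loss = empirical_refusal_loss("some query", llm, keyword_judge, n_samples=20)
    print(f"empirical refusal loss: {loss:.2f}")
```

A model that always refuses gives a value of 0.0 and one that never refuses gives 1.0. More samples per query reduce the variance of the estimate at the cost of more model calls.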