Spaces: TrustSafeAI / GradientCuff-Jailbreak-Defense
gregH committed on Feb 28

Commit 4c6a875 • 1 Parent(s): 2ed20ca

Update index.html

Files changed (1)

index.html

@@ -82,8 +82,8 @@ Exploring Refusal Loss Landscapes </title>
   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
   sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
-  the LLM won't reject the input user query and visualize its 2-d
-  landscape below:
 </p>
 <div class="container jailbreak-intro-sec">

   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
   sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
+  the LLM won't reject the input user query. Since the Refusal Loss is not computable, we query the target LLM multiple times with the same query and use the sample
+  mean of the jailbreak results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using this approximation, we visualize the 2-d landscape of the Refusal Loss below:
 </p>
 <div class="container jailbreak-intro-sec">
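
Note on the added lines: the sample-mean approximation they describe can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Space's actual implementation. `sample_response` and `is_jailbroken` are hypothetical callables standing in for one stochastic call to the target LLM and for a jailbreak judge, and the default `n_samples=20` is an arbitrary choice; none of these names or values appear in the commit.

```python
import random
from typing import Callable


def estimate_refusal_loss(
    query: str,
    sample_response: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Approximate the function value by the sample mean of jailbreak
    indicators (1 = successful jailbreak, 0 = otherwise)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Each call to sample_response is one independent stochastic generation
    # for the same query.
    indicators = [
        1 if is_jailbroken(sample_response(query)) else 0
        for _ in range(n_samples)
    ]
    return sum(indicators) / n_samples


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_llm(query: str) -> str:
        # Hypothetical stand-in for the target LLM: refuses roughly 70% of the time.
        if rng.random() < 0.7:
            return "I cannot help with that."
        return "Sure, here is how"

    def toy_judge(response: str) -> bool:
        # Hypothetical judge: any non-refusal counts as a successful jailbreak.
        return not response.startswith("I cannot")

    print(estimate_refusal_loss("example query", toy_llm, toy_judge, n_samples=200))
```

Increasing `n_samples` tightens the estimate around the underlying probability, at the cost of more calls to the target LLM.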