Spaces:

TrustSafeAI
/

Defensive-Prompt-Patch-Jailbreak-Defense

Running

App Files Files Community

bxiong commited on May 30, 2024

Commit

483c5f3

verified ·

1 Parent(s): 33fbcd5

adding table for llama

Browse files

Files changed (1) hide show

index.html +84 -5

index.html CHANGED Viewed

@@ -715,15 +715,14 @@
         <div class="content has-text-justified">
             <p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
               and MISTRAL-7B-Instruct-v0.2.</p>
-          <h2>Evaluation Metrics:</h2>
           <ul>
-            <li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.
-              The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
             <p><b>ASR</b> is defined as:</p>
     <p>\[
-    \textbf{ASR} = \frac{\text{Number\_of\_jailbreak\_queries}}{\text{Total\_queries}}
     \]</p>
-    <p>Here the \(\text{Number\_of\_jailbreak\_queries}\) is calculated through the sub-strings matching. Specifically, for a given generated response of a jailbreak query, if the response contains sub-strings that exist in the pre-defined sub-string set \(S\). Then, it will be evaluated as <b>jailbroken</b>, otherwise it is <b>non-jailbroken</b>.</p>
     <p>The function to determine if a response is jailbroken can be expressed as:</p>
     <p>\[
     \text{JailBroken}(\text{response}) = \begin{cases}
@@ -731,7 +730,87 @@
     0, & \text{otherwise.}
     \end{cases}
     \]</p>
           </ul>
 </div>
 </div>
 </div>

         <div class="content has-text-justified">
             <p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
               and MISTRAL-7B-Instruct-v0.2.</p>
+          <h3>Evaluation Metrics:</h3>
           <ul>
+            <li><strong>Attack Success Rate:</strong> We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.</li>
             <p><b>ASR</b> is defined as:</p>
     <p>\[
+    \textbf{ASR} = \frac{\text{Number of jailbreak queries}}{\text{Total queries}}
     \]</p>
+    <p>Here the \(\text{Number of jailbreak queries}\) is calculated through the sub-strings matching. Specifically, for a given generated response of a jailbreak query, if the response contains sub-strings that exist in the pre-defined sub-string set \(S\). Then, it will be evaluated as <b>jailbroken</b>, otherwise it is <b>non-jailbroken</b>.</p>
     <p>The function to determine if a response is jailbroken can be expressed as:</p>
     <p>\[
     \text{JailBroken}(\text{response}) = \begin{cases}
     0, & \text{otherwise.}
     \end{cases}
     \]</p>
+            <li><strong>Win-Rate:</strong> We utilize AlpacaEval to measure the impact on the LLM model's utility when defenses are in place.
+              In particular, we apply a metric termed Win-Rate. This metric involves assessing the frequency at which the LLM's outputs are selected over those from a
+              benchmark model when following specific user instructions. By adopting the simulated Win-Rate, we can directly compare the performance of various LLMs against
+              a consistent benchmark model.</li>
           </ul>
+          <h3>Numerical Results:</h3>
+          <table border="1" style="width:100%; text-align:center;">
+    <caption>Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
+    <thead>
+        <tr>
+            <th>Methods</th>
+            <th>Base64 [$\downarrow$]</th>
+            <th>ICA [$\downarrow$]</th>
+            <th>AutoDAN [$\downarrow$]</th>
+            <th>GCG [$\downarrow$]</th>
+            <th>PAIR [$\downarrow$]</th>
+            <th>TAP [$\downarrow$]</th>
+            <th>Average ASR [$\downarrow$]</th>
+            <th>Win-Rate [$\uparrow$]</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>w/o defense</td>
+            <td>0.990</td>
+            <td>0.690</td>
+            <td>0.640</td>
+            <td>0.550</td>
+            <td>0.100</td>
+            <td>0.120</td>
+            <td>0.515</td>
+            <td>81.37</td>
+        </tr>
+        <tr>
+            <td>RPO <a href="#rpo">[rpo]</a></td>
+            <td>0.000</td>
+            <td>0.420</td>
+            <td>0.280</td>
+            <td>0.190</td>
+            <td>0.060</td>
+            <td>0.060</td>
+            <td>0.168</td>
+            <td>79.23</td>
+        </tr>
+        <tr>
+            <td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
+            <td>0.000</td>
+            <td>0.020</td>
+            <td>0.520</td>
+            <td>0.020</td>
+            <td>0.020</td>
+            <td>0.020</td>
+            <td>0.100</td>
+            <td>34.29</td>
+        </tr>
+        <tr>
+            <td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
+            <td>0.030</td>
+            <td>0.290</td>
+            <td>0.000</td>
+            <td>0.040</td>
+            <td>0.020</td>
+            <td>0.000</td>
+            <td>0.063</td>
+            <td>64.84</td>
+        </tr>
+        <tr>
+            <td>DPP (Ours)</td>
+            <td>0.010</td>
+            <td>0.000</td>
+            <td>0.100</td>
+            <td>0.040</td>
+            <td>0.040</td>
+            <td>0.040</td>
+            <td><strong>0.038</strong></td>
+            <td><strong>82.98</strong></td>
+        </tr>
+    </tbody>
+</table>
 </div>
 </div>
 </div>