bxiong committed on
Commit
483c5f3
1 Parent(s): 33fbcd5

adding table for llama

Files changed (1)
  1. index.html +84 -5
index.html CHANGED
@@ -715,15 +715,14 @@
715
  <div class="content has-text-justified">
716
  <p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
717
  and MISTRAL-7B-Instruct-v0.2.</p>
718
- <h2>Evaluation Metrics:</h2>
719
  <ul>
720
- <li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.
721
- The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
722
  <p><b>ASR</b> is defined as:</p>
723
  <p>\[
724
- \textbf{ASR} = \frac{\text{Number\_of\_jailbreak\_queries}}{\text{Total\_queries}}
725
  \]</p>
726
- <p>Here the \(\text{Number\_of\_jailbreak\_queries}\) is calculated through the sub-strings matching. Specifically, for a given generated response of a jailbreak query, if the response contains sub-strings that exist in the pre-defined sub-string set \(S\). Then, it will be evaluated as <b>jailbroken</b>, otherwise it is <b>non-jailbroken</b>.</p>
727
  <p>The function to determine if a response is jailbroken can be expressed as:</p>
728
  <p>\[
729
  \text{JailBroken}(\text{response}) = \begin{cases}
@@ -731,7 +730,87 @@
731
  0, & \text{otherwise.}
732
  \end{cases}
733
  \]</p>
 
 
 
 
734
  </ul>
735
  </div>
736
  </div>
737
  </div>
 
715
  <div class="content has-text-justified">
716
  <p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
717
  and MISTRAL-7B-Instruct-v0.2.</p>
718
+ <h3>Evaluation Metrics:</h3>
719
  <ul>
720
+ <li><strong>Attack Success Rate:</strong> We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.</li>
 
721
  <p><b>ASR</b> is defined as:</p>
722
  <p>\[
723
+ \textbf{ASR} = \frac{\text{Number of jailbreak queries}}{\text{Total queries}}
724
  \]</p>
725
+ <p>Here, \(\text{Number of jailbreak queries}\) is computed by sub-string matching. Specifically, if the response generated for a jailbreak query contains any sub-string from the pre-defined sub-string set \(S\), it is evaluated as <b>jailbroken</b>; otherwise, it is <b>non-jailbroken</b>.</p>
726
  <p>The function to determine if a response is jailbroken can be expressed as:</p>
727
  <p>\[
728
  \text{JailBroken}(\text{response}) = \begin{cases}
 
730
  0, & \text{otherwise.}
731
  \end{cases}
732
  \]</p>
733
+ <li><strong>Win-Rate:</strong> We use AlpacaEval to measure how much a defense affects the LLM's utility.
734
+ Specifically, we report the Win-Rate: the frequency with which the LLM's outputs are preferred over those of a
735
+ benchmark model when following the same user instructions. Because every model is compared against the same benchmark model,
736
+ the simulated Win-Rate lets us compare the performance of different LLMs directly.</li>
737
  </ul>
738
+
739
+ <h3>Numerical Results:</h3>
740
+ <table border="1" style="width:100%; text-align:center;">
741
+ <caption>Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six jailbreak attacks. Our method achieves the lowest Average ASR and the highest Win-Rate among the compared defenses. Arrows indicate the direction of improvement (lower is better for ASR, higher is better for Win-Rate); the same convention applies below.</caption>
742
+ <thead>
743
+ <tr>
744
+ <th>Methods</th>
745
+ <th>Base64 [\(\downarrow\)]</th>
746
+ <th>ICA [\(\downarrow\)]</th>
747
+ <th>AutoDAN [\(\downarrow\)]</th>
748
+ <th>GCG [\(\downarrow\)]</th>
749
+ <th>PAIR [\(\downarrow\)]</th>
750
+ <th>TAP [\(\downarrow\)]</th>
751
+ <th>Average ASR [\(\downarrow\)]</th>
752
+ <th>Win-Rate (%) [\(\uparrow\)]</th>
753
+ </tr>
754
+ </thead>
755
+ <tbody>
756
+ <tr>
757
+ <td>w/o defense</td>
758
+ <td>0.990</td>
759
+ <td>0.690</td>
760
+ <td>0.640</td>
761
+ <td>0.550</td>
762
+ <td>0.100</td>
763
+ <td>0.120</td>
764
+ <td>0.515</td>
765
+ <td>81.37</td>
766
+ </tr>
767
+ <tr>
768
+ <td>RPO <a href="#rpo">[rpo]</a></td>
769
+ <td>0.000</td>
770
+ <td>0.420</td>
771
+ <td>0.280</td>
772
+ <td>0.190</td>
773
+ <td>0.060</td>
774
+ <td>0.060</td>
775
+ <td>0.168</td>
776
+ <td>79.23</td>
777
+ </tr>
778
+ <tr>
779
+ <td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
780
+ <td>0.000</td>
781
+ <td>0.020</td>
782
+ <td>0.520</td>
783
+ <td>0.020</td>
784
+ <td>0.020</td>
785
+ <td>0.020</td>
786
+ <td>0.100</td>
787
+ <td>34.29</td>
788
+ </tr>
789
+ <tr>
790
+ <td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
791
+ <td>0.030</td>
792
+ <td>0.290</td>
793
+ <td>0.000</td>
794
+ <td>0.040</td>
795
+ <td>0.020</td>
796
+ <td>0.000</td>
797
+ <td>0.063</td>
798
+ <td>64.84</td>
799
+ </tr>
800
+ <tr>
801
+ <td>DPP (Ours)</td>
802
+ <td>0.010</td>
803
+ <td>0.000</td>
804
+ <td>0.100</td>
805
+ <td>0.040</td>
806
+ <td>0.040</td>
807
+ <td>0.040</td>
808
+ <td><strong>0.038</strong></td>
809
+ <td><strong>82.98</strong></td>
810
+ </tr>
811
+ </tbody>
812
+ </table>
813
+
814
  </div>
815
  </div>
816
  </div>
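
The two metrics added in this commit can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the placeholder marker strings, and the boolean preference list are all assumptions made for the example, and the actual sub-string set \(S\) is not given in the page. The matching direction follows the text as written (a response containing any sub-string from \(S\) counts as jailbroken).

```python
from typing import Iterable, Sequence

# Hypothetical placeholder markers; the real sub-string set S is not shown in the page.
EXAMPLE_MARKERS = ("MARKER_A", "MARKER_B")


def is_jailbroken(response: str, markers: Iterable[str]) -> int:
    """Return 1 if the response contains any sub-string from the marker set, else 0."""
    return int(any(marker in response for marker in markers))


def attack_success_rate(responses: Sequence[str], markers: Iterable[str]) -> float:
    """ASR = number of jailbroken responses / total number of queries."""
    if not responses:
        raise ValueError("need at least one response")
    markers = tuple(markers)  # allow a one-shot iterable to be reused per response
    return sum(is_jailbroken(r, markers) for r in responses) / len(responses)


def average_asr(per_attack_asr: Sequence[float]) -> float:
    """Unweighted mean of the per-attack ASRs (the 'Average ASR' column)."""
    if not per_attack_asr:
        raise ValueError("need at least one ASR value")
    return sum(per_attack_asr) / len(per_attack_asr)


def win_rate(preferred_over_benchmark: Sequence[bool]) -> float:
    """Percentage of instructions on which the model's output was preferred.

    Assumes one boolean per instruction, produced by some external judge.
    """
    if not preferred_over_benchmark:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(preferred_over_benchmark) / len(preferred_over_benchmark)


if __name__ == "__main__":
    demo = ["xx MARKER_A yy", "nothing here", "MARKER_B", "plain"]
    print(attack_success_rate(demo, EXAMPLE_MARKERS))
    # Per-attack ASRs of the DPP row in the table above.
    print(round(average_asr([0.010, 0.000, 0.100, 0.040, 0.040, 0.040]), 3))
    print(win_rate([True, True, False, True]))
```

Applying `average_asr` to the six per-attack values of any row reproduces that row's Average ASR column after rounding to three decimals, which is a quick consistency check on the table.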