bxiong committed on
Commit fb3c258 • 1 Parent(s): 483c5f3

add mistral table

Files changed (1)
  1. index.html +89 -13
index.html CHANGED
@@ -599,7 +599,7 @@
599
  <img src="./static/images/method_plot_v8.png"
600
  class="method_overview"
601
  alt="Methodology Overview of DPP"/>
602
- <p>Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
603
  (b) shows the DPP training phase, in which the algorithm takes the refusal and helpful datasets and a prototype defense prompt as input.
604
  The algorithm then forms the defense prompt population by revising the prototype using an LLM. For each defense prompt in the population,
605
  the algorithm evaluates the defense and utility scores. It keeps editing the low-scoring defense prompts using the Hierarchical Genetic Search algorithm.
@@ -738,18 +738,18 @@
738
 
739
  <h3>Numerical Results:</h3>
740
  <table border="1" style="width:100%; text-align:center;">
741
- <caption>Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method achieves the lowest average ASR and the highest Win-Rate compared with the other defense baselines. Arrows indicate the direction of improvement, here and in the tables below.</caption>
742
  <thead>
743
  <tr>
744
  <th>Methods</th>
745
- <th>Base64 [$\downarrow$]</th>
746
- <th>ICA [$\downarrow$]</th>
747
- <th>AutoDAN [$\downarrow$]</th>
748
- <th>GCG [$\downarrow$]</th>
749
- <th>PAIR [$\downarrow$]</th>
750
- <th>TAP [$\downarrow$]</th>
751
- <th>Average ASR [$\downarrow$]</th>
752
- <th>Win-Rate [$\uparrow$]</th>
753
  </tr>
754
  </thead>
755
  <tbody>
@@ -765,7 +765,7 @@
765
  <td>81.37</td>
766
  </tr>
767
  <tr>
768
- <td>RPO <a href="#rpo">[rpo]</a></td>
769
  <td>0.000</td>
770
  <td>0.420</td>
771
  <td>0.280</td>
@@ -776,7 +776,7 @@
776
  <td>79.23</td>
777
  </tr>
778
  <tr>
779
- <td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
780
  <td>0.000</td>
781
  <td>0.020</td>
782
  <td>0.520</td>
@@ -787,7 +787,7 @@
787
  <td>34.29</td>
788
  </tr>
789
  <tr>
790
- <td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
791
  <td>0.030</td>
792
  <td>0.290</td>
793
  <td>0.000</td>
@@ -810,6 +810,82 @@
810
  </tr>
811
  </tbody>
812
  </table>
813
 
814
  </div>
815
  </div>
 
599
  <img src="./static/images/method_plot_v8.png"
600
  class="method_overview"
601
  alt="Methodology Overview of DPP"/>
602
+ <p><strong>Figure 1.</strong> Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
603
  (b) shows the DPP training phase, in which the algorithm takes the refusal and helpful datasets and a prototype defense prompt as input.
604
  The algorithm then forms the defense prompt population by revising the prototype using an LLM. For each defense prompt in the population,
605
  the algorithm evaluates the defense and utility scores. It keeps editing the low-scoring defense prompts using the Hierarchical Genetic Search algorithm.
 
738
 
739
  <h3>Numerical Results:</h3>
740
  <table border="1" style="width:100%; text-align:center;">
741
+ <caption><strong>Table 1.</strong> Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method achieves the lowest average ASR and the highest Win-Rate compared with the other defense baselines. Arrows indicate the direction of improvement, here and in the tables below.</caption>
742
  <thead>
743
  <tr>
744
  <th>Methods</th>
745
+ <th>Base64 [↓]</th>
746
+ <th>ICA [↓]</th>
747
+ <th>AutoDAN [↓]</th>
748
+ <th>GCG [↓]</th>
749
+ <th>PAIR [↓]</th>
750
+ <th>TAP [↓]</th>
751
+ <th>Average ASR [↓]</th>
752
+ <th>Win-Rate [↑]</th>
753
  </tr>
754
  </thead>
755
  <tbody>
 
765
  <td>81.37</td>
766
  </tr>
767
  <tr>
768
+ <td>RPO </td>
769
  <td>0.000</td>
770
  <td>0.420</td>
771
  <td>0.280</td>
 
776
  <td>79.23</td>
777
  </tr>
778
  <tr>
779
+ <td>Goal Prioritization</td>
780
  <td>0.000</td>
781
  <td>0.020</td>
782
  <td>0.520</td>
 
787
  <td>34.29</td>
788
  </tr>
789
  <tr>
790
+ <td>Self-Reminder</td>
791
  <td>0.030</td>
792
  <td>0.290</td>
793
  <td>0.000</td>
 
810
  </tr>
811
  </tbody>
812
  </table>
813
+ <table border="1" style="width:100%; text-align:center;">
814
+ <caption><strong>Table 2.</strong> Attack Success Rates (ASRs) and Win-Rates (utility) on the Mistral-7B-Instruct-v0.2 model across six different jailbreak attacks. Our method achieves the lowest average ASR with a reasonable trade-off in Win-Rate compared with the other defense baselines.</caption>
815
+ <thead>
816
+ <tr>
817
+ <th>Methods</th>
818
+ <th>Base64 [↓]</th>
819
+ <th>ICA [↓]</th>
820
+ <th>GCG [↓]</th>
821
+ <th>AutoDAN [↓]</th>
822
+ <th>PAIR [↓]</th>
823
+ <th>TAP [↓]</th>
824
+ <th>Average ASR [↓]</th>
825
+ <th>Win-Rate [↑]</th>
826
+ </tr>
827
+ </thead>
828
+ <tbody>
829
+ <tr>
830
+ <td>w/o defense</td>
831
+ <td>0.990</td>
832
+ <td>0.960</td>
833
+ <td>0.990</td>
834
+ <td>0.970</td>
835
+ <td>1.000</td>
836
+ <td>1.000</td>
837
+ <td>0.985</td>
838
+ <td>90.31</td>
839
+ </tr>
840
+ <tr>
841
+ <td>Self-Reminder</td>
842
+ <td>0.550</td>
843
+ <td>0.270</td>
844
+ <td>0.510</td>
845
+ <td>0.880</td>
846
+ <td>0.420</td>
847
+ <td>0.260</td>
848
+ <td>0.482</td>
849
+ <td>88.82</td>
850
+ </tr>
851
+ <tr>
852
+ <td>System Prompt</td>
853
+ <td>0.740</td>
854
+ <td>0.470</td>
855
+ <td>0.300</td>
856
+ <td>0.970</td>
857
+ <td>0.500</td>
858
+ <td>0.180</td>
859
+ <td>0.527</td>
860
+ <td>84.97</td>
861
+ </tr>
862
+ <tr>
863
+ <td>Goal Prioritization</td>
864
+ <td>0.030</td>
865
+ <td>0.440</td>
866
+ <td>0.030</td>
867
+ <td>0.390</td>
868
+ <td>0.300</td>
869
+ <td>0.140</td>
870
+ <td>0.222</td>
871
+ <td>56.59</td>
872
+ </tr>
873
+ <tr>
874
+ <td>DPP (Ours)</td>
875
+ <td>0.000</td>
876
+ <td>0.010</td>
877
+ <td>0.020</td>
878
+ <td>0.030</td>
879
+ <td>0.040</td>
880
+ <td>0.020</td>
881
+ <td><strong>0.020</strong></td>
882
+ <td>75.06</td>
883
+ </tr>
884
+ </tbody>
885
+ </table>
886
+
887
+
888
+
889
 
890
  </div>
891
  </div>
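
Review note: the "Average ASR" column of the newly added Mistral table can be sanity-checked against its six per-attack columns. The sketch below assumes the average is the unweighted mean of the six attack ASRs (the diff does not state how it was computed) and that values are rounded to three decimals. The names `MISTRAL_ASR`, `REPORTED_AVERAGE`, `average_asr`, and `check_averages` are illustrative and not part of the repository.

```python
# Per-attack ASR values transcribed from the Mistral-7B-Instruct-v0.2 table in this diff.
# Column order follows that table: Base64, ICA, GCG, AutoDAN, PAIR, TAP.
MISTRAL_ASR = {
    "w/o defense": [0.990, 0.960, 0.990, 0.970, 1.000, 1.000],
    "Self-Reminder": [0.550, 0.270, 0.510, 0.880, 0.420, 0.260],
    "System Prompt": [0.740, 0.470, 0.300, 0.970, 0.500, 0.180],
    "Goal Prioritization": [0.030, 0.440, 0.030, 0.390, 0.300, 0.140],
    "DPP (Ours)": [0.000, 0.010, 0.020, 0.030, 0.040, 0.020],
}

# "Average ASR" column as reported in the same table.
REPORTED_AVERAGE = {
    "w/o defense": 0.985,
    "Self-Reminder": 0.482,
    "System Prompt": 0.527,
    "Goal Prioritization": 0.222,
    "DPP (Ours)": 0.020,
}


def average_asr(per_attack):
    """Unweighted mean over the attack columns (an assumption, not stated in the diff)."""
    return sum(per_attack) / len(per_attack)


def check_averages(asr, reported, tol=5e-4):
    """Map each method to True if its recomputed mean matches the reported
    value to within three-decimal rounding error."""
    return {
        method: abs(average_asr(values) - reported[method]) <= tol
        for method, values in asr.items()
    }


if __name__ == "__main__":
    results = check_averages(MISTRAL_ASR, REPORTED_AVERAGE)
    for method, consistent in results.items():
        mean = average_asr(MISTRAL_ASR[method])
        print(f"{method}: mean={mean:.3f} consistent={consistent}")
```

Under that assumption, every row of the new table is internally consistent, and DPP has the lowest mean ASR, in line with the caption.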