TrustSafeAI/GradientCuff-Jailbreak-Defense
gregH committed on Feb 28, 2024

Commit 8b5c98a (verified) · Parent: c1e761a

Update index.html

Files changed (1): index.html

@@ -130,11 +130,11 @@ Exploring Refusal Loss Landscapes </title>
 <div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
 <h2 id="demonstration">Demonstration</h2>
-<p>In the current research, a reliability diagram is drawn to show the calibration performance of a model. However, since
-reliability diagrams often only provide fixed bar graphs statically, further explanation from the chart is limited. In
-this demonstration, we show how to make reliability diagrams interactive and insightful to help researchers and
-developers gain more insights from the graph. Specifically, we provide three CIFAR-100 classification models
-in this demonstration.  Multiple Bin numbers are also supported </p>
+<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
+  different jailbreak attacks~(GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
+  We report the average refusal rate across these 6 malicious user query datasets as True Positive Rate~(TPR) and the refusal rate
+  on benign user queries as False Positive Rate~(FPR).
+</p>
 <p>We hope this tool could also facilitate the development process.</p>
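
Review note: the metric definitions in the added paragraph can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' evaluation code. The function names and the example refusal flags are hypothetical placeholders, not reported results, and averaging the per-dataset refusal rates (rather than pooling all queries) is an assumed reading of "average refusal rate across these 6 malicious user query datasets".

```python
from statistics import mean

# Attack names as listed in the added paragraph.
ATTACKS = ["GCG", "AutoDAN", "PAIR", "TAP", "Base64", "LRL"]


def refusal_rate(refused_flags):
    """Fraction of queries the defended model refused (True = refused)."""
    if not refused_flags:
        raise ValueError("need at least one query")
    return sum(refused_flags) / len(refused_flags)


def true_positive_rate(refusals_by_attack):
    """Assumed definition: mean of the per-attack refusal rates."""
    return mean(refusal_rate(flags) for flags in refusals_by_attack.values())


def false_positive_rate(benign_flags):
    """Refusal rate on benign user queries."""
    return refusal_rate(benign_flags)


# Hypothetical placeholder data, for illustration only.
example_malicious = {name: [True, True, False, True] for name in ATTACKS}
example_benign = [False, False, True, False]

tpr = true_positive_rate(example_malicious)
fpr = false_positive_rate(example_benign)
```

A defense is better when the first value is high and the second is low, which is why the paragraph reports both.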