gregH committed on
Commit
6926153
1 Parent(s): 4432d6c

Update index.html

Files changed (1)
  1. index.html +2 -4
index.html CHANGED
@@ -69,8 +69,8 @@ Exploring Refusal Loss Landscapes </title>
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
- detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-D Refusal Loss
- Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+ detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak" and summarize people's efforts in Jailbreak
+ attack and Jailbreak defense. Then we present the 2-D Refusal Loss Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance against several Jailbreak attack methods.
  </p>

@@ -85,8 +85,6 @@ Exploring Refusal Loss Landscapes </title>
  </div>
  </div>

-
- <h2 id="jailbreak-attack-and-defense">Jailbreak Red-Teaming And Blue Teaming</h2>
  <p>We summarized some recent advances of jailbreak attack or jailbreak defense in below tables.</p>
  <div id="tabs">
  <ul>