gregH committed on
Commit
fdbc39d
1 Parent(s): 4c92274

Update index.html

Files changed (1)
  1. index.html +1 -1
index.html CHANGED
@@ -60,7 +60,7 @@ Exploring Refusal Loss Landscapes </title>
  these LLMs to human values using advanced training techniques such as Reinforcement Learning from
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
- we define and investigate the \textbf{Refusal Loss} of LLMs and then propose a method called \textbf{Gradient Cuff} to
+ we define and investigate the $$\textbf{Refusal Loss}$$ of LLMs and then propose a method called \textbf{Gradient Cuff} to
  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
  landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance.