Update index.html
index.html CHANGED (+1 -1)
@@ -60,7 +60,7 @@ Exploring Refusal Loss Landscapes </title>
 these LLMs to human values using advanced training techniques such as Reinforcement Learning from
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
-we define and investigate the
+we define and investigate the $$\textbf{Refusal Loss}$$ of LLMs and then propose a method called \textbf{Gradient Cuff} to
 detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
 landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance.