Update index.html
index.html (+1 -1)
@@ -60,7 +60,7 @@ Exploring Refusal Loss Landscapes </title>
 these LLMs to human values using advanced training techniques such as Reinforcement Learning from
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
-we define and investigate the
+we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
 detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
 landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance.
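For readers of the diff: the added line only names the "Refusal Loss" concept without defining it. The sketch below is an illustrative assumption of one plausible sampling-based estimate (one minus the empirical refusal rate over several sampled responses); the function names, refusal markers, and the generate callable are hypothetical and are not part of this commit or the project's actual implementation.

# Hypothetical sketch of a sampling-based refusal-loss estimate (assumption,
# not the code changed in this commit).
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def is_refusal(response: str) -> bool:
    # Treat a response as a refusal if it contains a typical refusal phrase.
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def estimate_refusal_loss(generate, prompt: str, n_samples: int = 8) -> float:
    # generate(prompt) is a hypothetical callable returning one sampled
    # response string from the target LLM.
    refusals = sum(is_refusal(generate(prompt)) for _ in range(n_samples))
    # Refusal loss here is 1 minus the fraction of sampled responses that refuse:
    # low values suggest the model reliably refuses; high values suggest a
    # possible jailbreak.
    return 1.0 - refusals / n_samples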