gregH committed on
Commit
2654ca5
1 Parent(s): c9fcacb

Update index.html

Files changed (1)
  1. index.html +12 -9
index.html CHANGED
@@ -55,15 +55,18 @@ Exploring Refusal Loss Landscapes </title>
  <main id="content" class="main-content" role="main">
  <h2 id="introduction">Introduction</h2>

- <p>Neural network calibration is an essential task in deep learning to ensure consistency
- between the confidence of model prediction and the true correctness likelihood. In this
- demonstration, we first visualize the idea of neural network calibration on a binary
- classifier and show model features that represent its calibration. Second, we introduce
- our proposed framework <strong>Neural Clamping</strong>, which employs a simple joint input-output
- transformation on a pre-trained classifier. We also provide other calibration approaches
- (e.g., temperature scaling) to compare with Neural Clamping.</p>
-
- <h2 id="what-is-jailbreak">What is Calibration?</h2>
+ <p>Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a
+ query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
+ these LLMs with human values using advanced training techniques such as Reinforcement Learning from
+ Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
+ jailbreak attempts aiming to subvert the embedded safety guardrails. To address this challenge,
+ we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
+ detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
+ landscape and, based on its characteristics, propose Gradient Cuff. Lastly, we compare it with other jailbreak defense
+ methods and show its defense performance.
+ </p>
+
+ <h2 id="what-is-jailbreak">What is Jailbreak?</h2>
  <p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
  A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
  contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.
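
For context on the "Refusal Loss" mentioned in the new introduction, here is a minimal sketch of a sampling-based refusal-loss estimate, assuming refusals are detected with a simple keyword check. The names `query_llm`, `is_refusal`, and `estimate_refusal_loss` are illustrative stand-ins rather than the demo's actual code; the precise definition and the Gradient Cuff detection procedure are presented on the demo page itself.

```python
# Illustrative sketch only: estimate how often a model refuses a query by sampling.
# `query_llm`, `is_refusal`, and the keyword list below are hypothetical stand-ins.
import random
from typing import Callable, List

REFUSAL_MARKERS: List[str] = ["i cannot", "i can't", "i'm sorry", "as an ai"]

def is_refusal(response: str) -> bool:
    """Keyword-based check for whether a generated response is a refusal."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def estimate_refusal_loss(query: str,
                          query_llm: Callable[[str], str],
                          num_samples: int = 8) -> float:
    """Sample several stochastic responses and return the fraction that do NOT refuse.
    A low value means the model almost always refuses the query."""
    responses = [query_llm(query) for _ in range(num_samples)]
    refusal_rate = sum(is_refusal(r) for r in responses) / num_samples
    return 1.0 - refusal_rate

if __name__ == "__main__":
    # Toy stand-in model that refuses most of the time.
    mock_llm = lambda q: random.choice(
        ["I'm sorry, I can't help with that."] * 3 + ["Sure, here is one way..."]
    )
    print(estimate_refusal_loss("some borderline query", mock_llm))
```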