gregH committed on
Commit
4ddde86
1 Parent(s): 00b0a08

Update index.html

Files changed (1)
  1. index.html +6 -2
index.html CHANGED
@@ -78,8 +78,12 @@ Exploring Refusal Loss Landscapes </title>
  </div>

  <h3 id="refusal-loss">Refusal Loss</h3>
- <p>Objectively, researchers utilize <strong>Calibration Metrics</strong> to measure the calibration error for a model, for example,
- Expected Calibration Error (ECE), Static Calibration Error (SCE), Adaptive Calibration Error (ACE), etc.</p>
+ <p>Current transformer-based LLMs can return different responses to the same query because
+ autoregressive generation is sampling-based. As a result of this randomness,
+ a malicious user query is sometimes rejected by the target LLM and
+ sometimes bypasses the safety guardrail. Based on this observation, for a given LLM $T_\theta$ parameterized by $\theta$, we
+ define the refusal loss function $\phi_\theta(x)$ for a given input user query $x$ as below:
+ </p>

  <div class="container jailbreak-intro-sec">
  <div><img id="jailbreak-intro-img" src="images/metrics/intro-metric-example.png" /></div>