gregH committed on
Commit 582057e · verified · 1 Parent(s): ea6ff45

Update index.html

Files changed (1)
  1. index.html +2 -2
index.html CHANGED
@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
  represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
- the opposite, we compute the empirical refusal loss as the sample mean of the jailbroken results returned from the target LLM.
+ the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
  <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
  mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
  We visualize the 2-D landscape of the empirical Refusal Loss as below:
@@ -105,7 +105,7 @@ Exploring Refusal Loss Landscapes </title>
  <div id="refusal-loss-formula" class="container">
  <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
  <a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
- <a href="#Refusal-Loss-Approximation">Refusal Loss Approximation</a>
+ <a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a>
  <a href="#Gradient-Estimation">Gradient Estimation</a>
  <div style="clear: both"></div>
  </div>