gregH committed on
Commit
ea6ff45
1 Parent(s): 3ff74af

Update index.html

Files changed (1)
  1. index.html +6 -3
index.html CHANGED
@@ -81,9 +81,12 @@ Exploring Refusal Loss Landscapes </title>
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
  autoregressive sampling-based generation. With this randomness, it is an
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
- sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
- the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
- mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. Using the approximation, we visualize the 2-D landscape of the Refusal Loss below:
+ sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
+ represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
+ the opposite, we compute the empirical refusal loss as the sample mean of the jailbroken results returned from the target LLM.
+ <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
+ mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
+ We visualize the 2-D landscape of the empirical Refusal Loss as below:
  </p>
 
  <div class="container jailbreak-intro-sec">