gregH committed on
Commit
4c6a875
1 Parent(s): 2ed20ca

Update index.html

Files changed (1)
  1. index.html +2 -2
index.html CHANGED
@@ -82,8 +82,8 @@ Exploring Refusal Loss Landscapes </title>
  autoregressive sampling-based generation. With this randomness, it is an
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
- the LLM won't reject the input user query and visualize its 2-d
- landscape below:
+ the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
+ mean of the Jailbroken results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using the approximation, we visualize the 2-d landscape of the Refusal Loss below:
  </p>
 
  <div class="container jailbreak-intro-sec">
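
Note on the added text: the new wording describes approximating the Refusal Loss by querying the target LLM several times with the same query and averaging the Jailbroken results (1 for a successful jailbreak, 0 otherwise). A minimal Python sketch of that sample-mean estimate follows. The names `estimate_refusal_loss`, `generate`, `is_jailbroken`, `make_toy_llm`, and `toy_judge` are hypothetical and do not appear in the commit. The toy model and the prefix-based judge are stand-ins for illustration only, not the authors' implementation.

```python
import itertools
from typing import Callable, Iterable


def estimate_refusal_loss(
    query: str,
    generate: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Return the sample mean of the jailbreak indicators over n_samples generations."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Assumption: every call to generate() draws a fresh sampled response
    # for the same query, so the outcomes differ from call to call.
    outcomes = [1 if is_jailbroken(generate(query)) else 0 for _ in range(n_samples)]
    return sum(outcomes) / n_samples


def make_toy_llm(responses: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampling LLM: cycles through fixed responses."""
    cycle = itertools.cycle(list(responses))
    return lambda query: next(cycle)


def toy_judge(response: str) -> bool:
    """Hypothetical judge: any response that does not open with a refusal counts as jailbroken."""
    return not response.startswith("I cannot")
```

With a real model, `generate` would wrap a sampling-based generation call and `is_jailbroken` would be whatever judge is used to label a response. A larger `n_samples` reduces the variance of the estimate at the cost of more queries.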