Update index.html
index.html +2 -2
@@ -82,8 +82,8 @@ Exploring Refusal Loss Landscapes </title>
 autoregressive sampling-based generation. With this randomness, it is an
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
 sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
-the LLM won't reject the input user query and
-landscape below:
+the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
+mean of the Jailbroken results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using the approximation, we visualize the 2-d landscape of the Refusal Loss below:
 </p>
 
 <div class="container jailbreak-intro-sec">
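The sample-mean approximation described in the added lines can be illustrated with a minimal Python sketch. It is not the authors' implementation: `estimate_refusal_loss`, `make_toy_model`, and `toy_is_jailbroken` are hypothetical names, the "LLM" is a random stub, and the keyword check merely stands in for whatever jailbreak judge the page actually uses.

```python
import random
from typing import Callable


def estimate_refusal_loss(
    query: str,
    generate: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Approximate the function value by a sample mean.

    The same query is sent n_samples times; each response is scored
    1 (successful jailbreak) or 0 (otherwise), and the scores are averaged.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    hits = sum(1 for _ in range(n_samples) if is_jailbroken(generate(query)))
    return hits / n_samples


def make_toy_model(refusal_prob: float, seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampling-based LLM (assumption, not a real API)."""
    rng = random.Random(seed)

    def generate(query: str) -> str:
        # Randomness mimics sampling: the same query is sometimes refused.
        if rng.random() < refusal_prob:
            return "I cannot help with that request."
        return "Sure, here is a response to: " + query

    return generate


def toy_is_jailbroken(response: str) -> bool:
    """Hypothetical judge: treat any response without a refusal phrase as a jailbreak."""
    return "I cannot" not in response


if __name__ == "__main__":
    model = make_toy_model(refusal_prob=0.7, seed=0)
    value = estimate_refusal_loss("example query", model, toy_is_jailbroken, n_samples=50)
    print(f"estimated value: {value:.2f}")
```

More samples per query reduce the variance of the estimate at the cost of more calls to the target model; evaluating the estimate over a 2-d grid of perturbed queries would give the kind of landscape the page visualizes.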