Update index.html
index.html CHANGED (+2 -2)
@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
 sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
 represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
-the opposite, we compute the empirical
+the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
 <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
 mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
 We visualize the 2-D landscape of the empirical Refusal Loss as below:
@@ -105,7 +105,7 @@ Exploring Refusal Loss Landscapes </title>
 <div id="refusal-loss-formula" class="container">
 <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
 <a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
-<a href="#Refusal-Loss-Approximation">Refusal Loss
+<a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a>
 <a href="#Gradient-Estimation">Gradient Estimation</a>
 <div style="clear: both"></div>
 </div>
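The sentence added in this change describes the empirical Refusal Loss as the sample mean of binary jailbreak outcomes over repeated queries to the target LLM. A minimal Python sketch of that computation follows. It assumes two hypothetical callables, `query_llm` and `is_jailbroken`, that are not part of this commit, and it follows the page's convention that 1 marks a successful jailbreak and 0 a refusal. The default `n_samples` is an arbitrary illustrative value.

```python
from typing import Callable


def empirical_refusal_loss(
    query: str,
    query_llm: Callable[[str], str],       # hypothetical: returns one sampled response
    is_jailbroken: Callable[[str], bool],  # hypothetical: judges a single response
    n_samples: int = 10,                   # illustrative default, not from the source
) -> float:
    """Sample mean of jailbreak indicators (1 = jailbroken, 0 = refused)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Query the model repeatedly with the same input and record each outcome.
    outcomes = [int(is_jailbroken(query_llm(query))) for _ in range(n_samples)]
    return sum(outcomes) / n_samples
```

Because the target LLM samples its responses, the estimate varies from run to run; a larger `n_samples` reduces that variance at the cost of more queries.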