gregH commited on
Commit
de6feb6
·
verified ·
1 Parent(s): 0758623

Update index.html

Browse files
Files changed (1) hide show
  1. index.html +11 -5
index.html CHANGED
@@ -161,7 +161,7 @@ Exploring Refusal Loss Landscapes </title>
161
  <ul>
162
  <li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">
163
  Low-Resource Languages Jailbreak GPT-4</a></li>
164
- <li>Brief Introduction: Translate the malicious user query into low resource languages before using it to query the model.</li>
165
  </ul>
166
  </div>
167
  </div>
@@ -174,7 +174,8 @@ Exploring Refusal Loss Landscapes </title>
174
  <ul>
175
  <li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">
176
  Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
177
- <li>Brief Introduction: </li>
 
178
  </ul>
179
  </div>
180
  <h3>SmoothLLM</h3>
@@ -182,7 +183,10 @@ Exploring Refusal Loss Landscapes </title>
182
  <ul>
183
  <li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">
184
  SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
185
- <li>Brief Introduction: </li>
 
 
 
186
  </ul>
187
  </div>
188
  <h3>Erase-Check</h3>
@@ -190,7 +194,8 @@ Exploring Refusal Loss Landscapes </title>
190
  <ul>
191
  <li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">
192
  Certifying LLM Safety against Adversarial Prompting</a></li>
193
- <li>Brief Introduction: </li>
 
194
  </ul>
195
  </div>
196
  <h3>Self-Reminder</h3>
@@ -198,7 +203,8 @@ Exploring Refusal Loss Landscapes </title>
198
  <ul>
199
  <li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">
200
  Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
201
- <li>Brief Introduction: </li>
 
202
  </ul>
203
  </div>
204
  </div>
 
161
  <ul>
162
  <li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">
163
  Low-Resource Languages Jailbreak GPT-4</a></li>
164
+ <li>Brief Introduction: Translate the malicious user query into low-resource language before using it to query the model.</li>
165
  </ul>
166
  </div>
167
  </div>
 
174
  <ul>
175
  <li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">
176
  Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
177
+ <li>Brief Introduction: Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects those
178
+ with high perplexity.</li>
179
  </ul>
180
  </div>
181
  <h3>SmoothLLM</h3>
 
183
  <ul>
184
  <li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">
185
  SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
186
+ <li>Brief Introduction: SmoothLLM perturbs the original input query to obtain several copies and aggregates
187
+ the intermediate responses of the target LLM to these perturbed queries to give the final response to the
188
+ original query.
189
+ </li>
190
  </ul>
191
  </div>
192
  <h3>Erase-Check</h3>
 
194
  <ul>
195
  <li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">
196
  Certifying LLM Safety against Adversarial Prompting</a></li>
197
+ <li>Brief Introduction: Erase-Check employs a model to check whether the original query or any of its erased subsentences
198
+ is harmful. The query would be rejected if the query or one of its sub-sentences is regarded as harmful by the safety checker</li>
199
  </ul>
200
  </div>
201
  <h3>Self-Reminder</h3>
 
203
  <ul>
204
  <li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">
205
  Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
206
+ <li>Brief Introduction: Self-Reminder modifying the system prompt of the target LLM so that the model reminds itself to process
207
+ and respond to the user in the context of being an aligned LLM.</li>
208
  </ul>
209
  </div>
210
  </div>