Introduction

Large Language Models (LLMs) are becoming a prominent generative AI tool: the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs with human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts that aim to subvert the embedded safety guardrails. To address this challenge, we define and investigate the Refusal Loss of LLMs and then propose a method called Gradient Cuff to detect jailbreak attempts. In this demonstration, we first introduce the concept of "jailbreak". We then present the refusal loss landscape and, based on its characteristics, propose Gradient Cuff. Lastly, we compare it with other jailbreak defense methods and show its defense performance.

What is Jailbreak?

Neural network calibration seeks to align a model's predicted confidence with its true likelihood of being correct. A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. In contrast, a poorly calibrated model shows a wide gap between its accuracy and its average confidence. This gap can hamper applications that require accurate uncertainty estimation, such as safety-related tasks (e.g., autonomous driving systems and medical diagnosis).

Refusal Loss

Researchers use calibration metrics to quantify a model's calibration error, for example Expected Calibration Error (ECE), Static Calibration Error (SCE), and Adaptive Calibration Error (ACE).

$$\displaystyle \phi_\theta(x)=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)$$
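The expectation in this formula can be approximated by sampling. The following is a minimal sketch, not the authors' implementation: it assumes that T_θ(x) can be sampled through a caller-supplied function and that JB is a 0/1 indicator on a response. The page does not define JB here, so `toy_jb` and `toy_sampler` below are hypothetical stand-ins, as is the name `estimate_refusal_loss`.

```python
import random
from typing import Callable


def estimate_refusal_loss(
    query: str,
    sample_response: Callable[[str], str],
    jb: Callable[[str], int],
    num_samples: int = 10,
) -> float:
    """Monte Carlo estimate of phi(x) = 1 - E_{y ~ T(x)}[JB(y)].

    sample_response: draws one response y for the query (stand-in for T_theta).
    jb: 0/1 indicator applied to a response (stand-in for JB).
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    # Average the indicator over independent samples, then subtract from 1.
    total = sum(jb(sample_response(query)) for _ in range(num_samples))
    return 1.0 - total / num_samples


# Hypothetical stand-ins, for illustration only.
def toy_sampler(query: str) -> str:
    return random.choice(["I cannot help with that.", "Sure, here is the answer."])


def toy_jb(response: str) -> int:
    return int(response.startswith("I cannot"))


estimate = estimate_refusal_loss("example query", toy_sampler, toy_jb, num_samples=20)
print(f"estimated refusal loss: {estimate:.2f}")
```

More samples reduce the variance of the estimate at the cost of more model queries, so `num_samples` is a trade-off between accuracy and latency.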

Proposed Approach: Gradient Cuff

Demonstration

In current research, a reliability diagram is typically drawn to show the calibration performance of a model. However, because reliability diagrams are usually static bar charts, they offer limited room for further analysis. In this demonstration, we show how to make reliability diagrams interactive so that researchers and developers can gain more insight from the graph. Specifically, we provide three CIFAR-100 classification models in this demonstration. Multiple bin numbers are also supported.
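The binning behind a reliability diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo's code: it uses equal-width confidence bins and the standard ECE definition (the sample-weighted average of |accuracy - confidence| per bin). The function name `reliability_bins` and its return layout are hypothetical.

```python
import math


def reliability_bins(confidences, correct, num_bins=10):
    """Return per-bin statistics and the Expected Calibration Error (ECE).

    confidences: predicted confidence per sample, each in [0, 1].
    correct: 1 if the prediction was right, else 0, per sample.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one sample is required")

    counts = [0] * num_bins
    acc_sum = [0.0] * num_bins
    conf_sum = [0.0] * num_bins
    for c, ok in zip(confidences, correct):
        # Bin i covers (i/num_bins, (i+1)/num_bins]; a confidence of 0.0 goes to bin 0.
        i = min(max(math.ceil(c * num_bins) - 1, 0), num_bins - 1)
        counts[i] += 1
        acc_sum[i] += ok
        conf_sum[i] += c

    n = len(confidences)
    bins, ece = [], 0.0
    for i in range(num_bins):
        if counts[i] == 0:
            # Empty bins contribute nothing to the ECE.
            bins.append({"count": 0, "accuracy": 0.0, "confidence": 0.0})
            continue
        acc = acc_sum[i] / counts[i]
        conf = conf_sum[i] / counts[i]
        bins.append({"count": counts[i], "accuracy": acc, "confidence": conf})
        ece += counts[i] / n * abs(acc - conf)
    return bins, ece


bins, ece = reliability_bins([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0], num_bins=10)
print(f"ECE = {ece:.3f}")
```

Each bar in the diagram corresponds to one entry of `bins`: the bar height is the bin accuracy, and the gap to the bin's mean confidence shows how over- or under-confident the model is in that range. Changing `num_bins` changes the granularity of the diagram.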

We hope this tool could also facilitate the development process.

[Interactive demo panel. Controls: Models, Perplexity Threshold, Gradient Threshold. Chart: CIFAR-100 Calibrated Reliability Diagram (Full).]
Average Malicious Refusal Rate: 0.95875
Benign Refusal Rate: 0.05000

Citations

If you find Neural Clamping useful for your research, please cite our papers as follows:

@inproceedings{hsiung2023nctv,
  title={{NCTV: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}}, 
  author={Lei Hsiung and Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
  booktitle={Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence},
  publisher={Association for the Advancement of Artificial Intelligence},
  year={2023},
  month={February}
}

@misc{tang2022neural_clamping,
  title={{Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration}}, 
  author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
  year={2022},
  eprint={2209.11604},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}