An Introduction to AI Secure LLM Safety Leaderboard

Published January 26, 2024
Update on GitHub

Given the widespread adoption of LLMs, it is critical to understand their safety and risks in different scenarios before extensive deployments in the real world. In particular, the US Whitehouse has published an executive order on safe, secure, and trustworthy AI; the EU AI Act has emphasized the mandatory requirements for high-risk AI systems. Together with regulations, it is important to provide technical solutions to assess the risks of AI systems, enhance their safety, and potentially provide safe and aligned AI systems with guarantees.

Thus, in 2023, at Secure Learning Lab, we introduced DecodingTrust, the first comprehensive and unified evaluation platform dedicated to assessing the trustworthiness of LLMs. (This work won the Outstanding Paper Award at NeurIPS 2023.)

DecodingTrust provides a multifaceted evaluation framework covering eight trustworthiness perspectives: toxicity, stereotype bias, adversarial robustness, OOD robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. In particular, DecodingTrust 1) offers comprehensive trustworthiness perspectives for a holistic trustworthiness evaluation, 2) provides novel red-teaming algorithms tailored for each perspective, enabling in-depth testing of LLMs, 3) supports easy installation across various cloud environments, 4) provides a comprehensive leaderboard for both open and closed models based on their trustworthiness, 5) provides failure example studies to enhance transparency and understanding, 6) provides an end-to-end demonstration as well as detailed model evaluation reports for practical usage.

Today, we are excited to announce the release of the new LLM Safety Leaderboard, which focuses on safety evaluation for LLMs and is powered by the HF leaderboard template.

Red-teaming Evaluation

DecodingTrust provides several novel red-teaming methodologies for each evaluation perspective to perform stress tests. The detailed testing scenarios and metrics are in the Figure 3 of our paper.

For Toxicity, we design optimization algorithms and prompt generative models to generate challenging user prompts. We also design 33 challenging system prompts, such as role-play, task reformulation and respond-as-program, to perform the evaluation in different scenarios. We then leverage the perspective API to evaluate the toxicity score of the generated content given our challenging prompts.

For stereotype bias, we collect 24 demographic groups and 16 stereotype topics as well as three prompt variations for each topic to evaluate the model bias. We prompt the model 5 times and take the average as model bias scores.

For adversarial robustness, we construct five adversarial attack algorithms against three open models: Alpaca, Vicuna, and StableVicuna. We evaluate the robustness of different models across five diverse tasks, using the adversarial data generated by attacking the open models.

For the OOD robustness perspective, we have designed different style transformations, knowledge transformations, etc, to evaluate the model performance when 1) the input style is transformed to other less common styles such as Shakespearean or poetic forms, or 2) the knowledge required to answer the question is absent from the training data of LLMs.

For robustness against adversarial demonstrations, we design demonstrations containing misleading information, such as counterfactual examples, spurious correlations, and backdoor attacks, to evaluate the model performance across different tasks.

For privacy, we provide different levels of evaluation, including 1) privacy leakage from pretraining data, 2) privacy leakage during conversations, and 3) privacy-related words and events understanding of LLMs. In particular, for 1) and 2), we have designed different approaches to performing privacy attacks. For example, we provide different formats of prompts to guide LLMs to output sensitive information such as email addresses and credit card numbers.

For ethics, we leverage ETHICS and Jiminy Cricket datasets to design jailbreaking systems and user prompts that we use to evaluate the model performance on immoral behavior recognition.

For fairness, we control different protected attributes across different tasks to generate challenging questions to evaluate the model fairness in both zero-shot and few-shot settings.

Some key findings from our paper

Overall, we find that

  1. GPT-4 is more vulnerable than GPT-3.5,
  2. no single LLM consistently outperforms others across all trustworthiness perspectives,
  3. trade-offs exist between different trustworthiness perspectives,
  4. LLMs demonstrate different capabilities in understanding different privacy-related words. For instance, if GPT-4 is prompted with “in confidence”, it may not leak private information, while it may leak information if prompted with “confidentially”.
  5. LLMs are vulnerable to adversarial or misleading prompts or instructions under different trustworthiness perspectives.

How to submit your model for evaluation

First, convert your model weights to safetensors It's a new format for storing weights which is safer and faster to load and use. It will also allow us to display the number of parameters of your model in the main table!

Then, make sure you can load your model and tokenizer using AutoClasses:

from transformers import AutoConfig, AutoModel, AutoTokenizer
config = AutoConfig.from_pretrained("your model name")
model = AutoModel.from_pretrained("your model name")
tokenizer = AutoTokenizer.from_pretrained("your model name")

If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.


  • Make sure your model is public!
  • We don't yet support models that require use_remote_code=True. But we are working on it, stay posted!

Finally, use the "Submit here!" panel in our leaderboard to submit your model for evaluation!


If you find our evaluations useful, please consider citing our work.

  title={DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models},
  author={Wang, Boxin and Chen, Weixin and Pei, Hengzhi and Xie, Chulin and Kang, Mintong and Zhang, Chenhui and Xu, Chejian and Xiong, Zidi and Dutta, Ritik and Schaeffer, Rylan and others},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},