

sted97

posted an update Apr 17, 2024


📣 I'm thrilled to announce "ALERT: A Comprehensive #Benchmark for Assessing #LLMs’ Safety through #RedTeaming" 🚨

📄 Paper: https://arxiv.org/pdf/2404.08676.pdf
🗃️ Repo: https://github.com/Babelscape/ALERT
🤗 ALERT benchmark: Babelscape/ALERT
🤗 ALERT DPO data: Babelscape/ALERT_DPO

As a key design principle for ALERT, we developed a fine-grained safety risk taxonomy (Fig. 2). This taxonomy is the foundation of the benchmark: it provides detailed insights into a model’s weaknesses and vulnerabilities and informs targeted safety enhancements 🛡️

To collect our prompts, we started from Anthropic's popular HH-RLHF data and used automated strategies to filter and classify prompts. We then designed templates to create new prompts (providing sufficient support for each category, cf. Fig. 3) and implemented adversarial attacks.
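To make the template step concrete, here is a minimal sketch of how labeled prompts could be generated from per-category templates. The category names, templates, fillers, and the `expand_templates` helper are all hypothetical placeholders for illustration; they are not taken from ALERT, and the actual pipeline may work differently.

```python
from itertools import product

# Hypothetical categories, templates, and slot fillers for illustration only;
# none of these come from the ALERT benchmark.
TEMPLATES = {
    "category_a": ["Tell me about {topic}.", "What is the best way to learn {topic}?"],
    "category_b": ["Write a short note on {topic}."],
}
FILLERS = {
    "category_a": ["chess", "sailing"],
    "category_b": ["gardening"],
}


def expand_templates(templates, fillers):
    """Return one labeled prompt per (template, filler) pair in each category."""
    prompts = []
    for category, category_templates in templates.items():
        # Categories without fillers simply produce no prompts.
        for template, topic in product(category_templates, fillers.get(category, [])):
            prompts.append({"category": category, "prompt": template.format(topic=topic)})
    return prompts


if __name__ == "__main__":
    for item in expand_templates(TEMPLATES, FILLERS):
        print(item["category"], "|", item["prompt"])
```

Keeping the category label attached to every generated prompt is what makes it possible to check that each category has enough examples.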

In our experiments, we extensively evaluated several open- and closed-source LLMs (e.g. #ChatGPT, #Llama and #Mistral), highlighting their strengths and weaknesses (Table 1).
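As a rough illustration of how a fine-grained taxonomy can surface per-category weaknesses, the sketch below aggregates safe/unsafe labels into a per-category rate. It assumes each model response has already been labeled safe or unsafe by some judge; the `safety_by_category` helper, the record format, and the category names are assumptions for this example, not the metric definition from the paper.

```python
from collections import defaultdict


def safety_by_category(records):
    """Map each category to the fraction of its responses labeled safe.

    `records` is an iterable of (category, is_safe) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for category, is_safe in records:
        totals[category] += 1
        safe[category] += int(is_safe)
    return {category: safe[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Made-up labels, purely to show the shape of the computation.
    example = [
        ("category_a", True),
        ("category_a", False),
        ("category_b", True),
        ("category_b", True),
    ]
    for category, rate in safety_by_category(example).items():
        print(f"{category}: {rate:.0%} safe")
```

Reporting a rate per category rather than a single overall number is what lets a low score in one category stand out even when the aggregate looks fine.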

For more details, check out our preprint: https://arxiv.org/pdf/2404.08676.pdf 🤓

Huge thanks to @felfri, @PSaiml, Kristian Kersting, @navigli, @huu-ontocord and @BoLi-aisecure (and all the organizations involved: Babelscape, Sapienza NLP, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago and UIUC) 🫂

lunarflu

Apr 18, 2024 (edited Apr 18, 2024)

Very cool Simone! Could be nice to highlight on our 🤗Discord reading group to get it some more visibility, WDYT? 🤓
