I'm thrilled to announce "ALERT: A Comprehensive #Benchmark for Assessing #LLMs' Safety through #RedTeaming"
Paper: https://arxiv.org/pdf/2404.08676.pdf
Repo: https://github.com/Babelscape/ALERT
ALERT benchmark: Babelscape/ALERT
ALERT DPO data: Babelscape/ALERT_DPO
As a key design principle for ALERT, we developed a fine-grained safety risk taxonomy (Fig. 2). This taxonomy is the foundation of the benchmark: it provides detailed insights into a model's weaknesses and vulnerabilities, and it informs targeted safety enhancements.
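To illustrate how a category-level taxonomy can surface a model's weak spots, here is a minimal sketch that aggregates safety judgments per category. The record fields (`category`, `is_safe`), the placeholder category names, and the function `per_category_safety` are assumptions made for this example; they are not the ALERT data schema or the scoring code from the paper.

```python
from collections import defaultdict


def per_category_safety(records):
    """Return {category: fraction of responses judged safe}.

    `records` is an iterable of dicts with hypothetical keys
    "category" (str) and "is_safe" (bool).
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        safe[rec["category"]] += int(rec["is_safe"])
    return {cat: safe[cat] / totals[cat] for cat in totals}


# Toy, made-up judgments; the category names are placeholders.
records = [
    {"category": "category_a", "is_safe": True},
    {"category": "category_a", "is_safe": False},
    {"category": "category_b", "is_safe": True},
    {"category": "category_b", "is_safe": True},
]

scores = per_category_safety(records)
for category, score in sorted(scores.items()):
    print(f"{category}: {score:.2f}")
```

Breaking the score down by category, rather than reporting one overall number, is what lets a low-scoring category point to a specific area for safety work.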
To collect our prompts, we started from Anthropic's popular HH-RLHF data and used automated strategies to filter and classify prompts. We then designed templates to create new prompts (providing sufficient support for each category, cf. Fig. 3) and implemented adversarial attacks.
In our experiments, we extensively evaluated several open- and closed-source LLMs (e.g. #ChatGPT, #Llama and #Mistral), highlighting their strengths and weaknesses (Table 1).
For more details, check out our preprint: https://arxiv.org/pdf/2404.08676.pdf
Huge thanks to @felfri, @PSaiml, Kristian Kersting, @navigli, @huu-ontocord and @BoLi-aisecure (and all the organizations involved: Babelscape, Sapienza NLP, TU Darmstadt, Hessian.AI, DFKI, Ontocord.AI, UChicago and UIUC).