gsarti posted an update Feb 1
πŸ” Today's pick in Interpretability & Analysis of LMs: Gradient-Based Language Model Red Teaming by N. Wichers, C. Denison and @beirami

This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses.

In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier and back-propagating through the frozen classifier and LM to update the prompt, as sketched below.
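
For intuition, here is a minimal PyTorch sketch of that loop. The toy `lm`, `classifier`, and single-step response relaxation are illustrative assumptions standing in for the real frozen models, and the Gumbel-softmax relaxation of the prompt is one standard way to make discrete tokens differentiable; this is a sketch of the idea, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the frozen LM and safety classifier (illustrative only).
vocab_size, d_model, prompt_len = 100, 32, 8
lm = torch.nn.Linear(d_model, vocab_size)            # pretend LM: embeddings -> response logits
classifier = torch.nn.Linear(vocab_size, 1)          # pretend safety probe: response -> unsafe score
embedding = torch.nn.Embedding(vocab_size, d_model)  # frozen token embedding table
for p in (*lm.parameters(), *classifier.parameters(), *embedding.parameters()):
    p.requires_grad_(False)

# The only trainable object: one relaxed token distribution per prompt position.
prompt_logits = torch.randn(prompt_len, vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([prompt_logits], lr=0.1)

for step in range(200):
    # Gumbel-softmax relaxes the discrete prompt so gradients reach prompt_logits.
    prompt_dist = F.gumbel_softmax(prompt_logits, tau=1.0, hard=False)
    prompt_emb = prompt_dist @ embedding.weight       # soft prompt embeddings
    response_logits = lm(prompt_emb).mean(dim=0)      # stand-in for LM decoding
    unsafe_score = classifier(torch.softmax(response_logits, dim=-1))
    loss = -unsafe_score.sum()  # maximize the probe's "unsafe" score
    optimizer.zero_grad()
    loss.backward()             # gradients flow through the frozen classifier and LM
    optimizer.step()

# Read off a concrete red-teaming prompt via per-position argmax.
red_team_prompt = prompt_logits.argmax(dim=-1)
```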

The authors experiment with GBRT variants aimed at producing realistic prompts efficiently, and find that GBRT prompts are more likely to elicit unsafe responses than prompts found by established RL-based red teaming methods. Moreover, these attacks succeed even when the LM has been fine-tuned to produce safer outputs.

📄 Paper: Gradient-Based Language Model Red Teaming (2401.16656)
πŸ’» Code: https://github.com/google-research/google-research/tree/master/gbrt