🔍 Today's pick in Interpretability & Analysis of LMs: Gradient-Based Language Model Red Teaming by N. Wichers, C. Denison and @beirami
This work proposes Gradient-Based Red Teaming (GBRT), a red-teaming method for automatically generating diverse prompts that induce an LM to output unsafe responses.
In practice, prompts are learned by scoring LM responses with a safety classifier and backpropagating through both the frozen classifier and the frozen LM to update the prompt.
The authors experiment with GBRT variants aimed at producing realistic prompts efficiently, and find that GBRT prompts are more likely to trigger unsafe responses than those found by established RL-based red-teaming methods. Moreover, these attacks still succeed even when the LM has been fine-tuned to produce safer outputs.
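To make the mechanism concrete, here is a minimal sketch of the core idea: optimize a Gumbel-softmax relaxation of the prompt tokens against a frozen LM and safety classifier. `TinyLM` and `SafetyClassifier` are toy stand-ins for the real frozen models (the paper also differentiates through the response decoding, which is simplified away here), and all names and hyperparameters are illustrative, not the paper's implementation.

```python
# Sketch of gradient-based red teaming: learn prompt-token logits that
# maximize a frozen safety classifier's "unsafe" score on the LM's response.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PROMPT_LEN = 100, 32, 8

class TinyLM(nn.Module):
    """Toy stand-in for the frozen LM: maps soft prompt tokens to response logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.proj = nn.Linear(DIM, VOCAB)

    def forward(self, soft_tokens):          # (B, T, VOCAB) token distributions
        h = soft_tokens @ self.embed.weight  # differentiable "soft" embedding lookup
        return self.proj(h)                  # response logits per position

class SafetyClassifier(nn.Module):
    """Toy stand-in for the frozen safety classifier: scores P(unsafe | response)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, 1)

    def forward(self, soft_response):        # (B, T, VOCAB) response distributions
        h = (soft_response @ self.embed.weight).mean(dim=1)
        return self.head(h).squeeze(-1)      # unsafe logit

lm, clf = TinyLM(), SafetyClassifier()
for p in list(lm.parameters()) + list(clf.parameters()):
    p.requires_grad_(False)                  # both models stay frozen

# The only trainable object: logits over the prompt's token distribution.
prompt_logits = nn.Parameter(torch.randn(1, PROMPT_LEN, VOCAB))
opt = torch.optim.Adam([prompt_logits], lr=0.1)

for step in range(200):
    # Gumbel-softmax keeps the prompt differentiable while staying near one-hot.
    soft_prompt = F.gumbel_softmax(prompt_logits, tau=1.0, hard=False)
    soft_response = F.softmax(lm(soft_prompt), dim=-1)
    loss = -clf(soft_response).mean()        # maximize the unsafe score
    opt.zero_grad()
    loss.backward()                          # gradients flow only into prompt_logits
    opt.step()

# Discretize the learned relaxation into an actual adversarial prompt.
adversarial_prompt = prompt_logits.argmax(dim=-1)
print(adversarial_prompt)
```

The Gumbel-softmax relaxation is what makes the pipeline end-to-end differentiable despite prompts being discrete; after optimization, taking the argmax recovers hard tokens that can be fed to the LM as a regular prompt.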
📄 Paper: Gradient-Based Language Model Red Teaming (2401.16656)
💻 Code: https://github.com/google-research/google-research/tree/master/gbrt