Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Paper • 2407.15549 • Published Jul 22
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability Paper • 2405.10927 • Published May 17