Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training • Paper • 2401.05566 • Published Jan 10, 2024
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks • Paper • 2311.12786 • Published Nov 21, 2023
Understanding the Effects of RLHF on LLM Generalisation and Diversity • Paper • 2310.06452 • Published Oct 10, 2023