Automatically Interpreting Millions of Features in Large Language Models Paper • 2410.13928 • Published Oct 17, 2024
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals Paper • 2405.05466 • Published May 8, 2024