https://arxiv.org/abs/2502.17424
This may be good news because now turning a model to be beneficial might be easier:
https://x.com/ESYudkowsky/status/1894453376215388644
Does this mean evil and good are a single direction just like censorship is a single direction? So in theory one can make a model good doing an abliteration like operation?