failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5

Just curious: what actually changed in the process between the earlier version and this one? You mentioned in the README that you're only changing a single layer, which is interesting. When I was experimenting with it, I couldn't get any useful effect without modifying all (or at least a lot of) layers. I notice the dataset changed too, and I wonder how much of a difference that made.

In my earlier experiments I was doing two passes: removing one vector, then running the process again and searching for a second vector with a different dataset. That reduced refusals a lot more, to about the same level as 3.5 (maybe a bit more, depending on subject), but of course it caused more degradation.
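
As a toy illustration of the two-pass idea (not the actual code from those experiments, and not failspy's process), here is a minimal sketch. It assumes the direction is found as a normalized difference of mean activations between "harmful" and "harmless" prompts, and it uses synthetic vectors in place of real model activations. All function names and the fake data are hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed method: normalized difference of mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(acts, direction):
    """Remove each activation's component along a unit direction."""
    return [[x - dot(row, direction) * d for x, d in zip(row, direction)]
            for row in acts]

random.seed(0)
DIM = 8

def fake_acts(n, offset):
    """Synthetic stand-in for activations: Gaussian noise plus an offset."""
    return [[random.gauss(0, 1) + o for o in offset] for _ in range(n)]

zero = [0.0] * DIM
offset_a = [3.0] + [0.0] * (DIM - 1)       # "dataset A": shift along axis 0
offset_b = [1.0, 2.5] + [0.0] * (DIM - 2)  # "dataset B": shift along axis 1 too

# Pass 1: find and remove the first direction using dataset A.
r1 = refusal_direction(fake_acts(200, offset_a), fake_acts(200, zero))

# Pass 2: measure activations with r1 already removed, using dataset B,
# and search for whatever direction is left.
harmful_b = ablate(fake_acts(200, offset_b), r1)
harmless_b = ablate(fake_acts(200, zero), r1)
r2 = refusal_direction(harmful_b, harmless_b)

# Apply both ablations to fresh "harmful" activations and check what remains.
cleaned = ablate(ablate(fake_acts(50, offset_b), r1), r2)
max_residual = max(max(abs(dot(row, r1)), abs(dot(row, r2))) for row in cleaned)
print("r1 . r2 =", dot(r1, r2))
print("largest leftover component:", max_residual)
```

Because the second direction is measured after the first has been projected out, the two come out orthogonal in this toy setup, so the second pass removes an additional dimension rather than redoing the first. That is also one plausible reason a second pass costs more in model quality: each removed direction takes away a bit more of the representation.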

The difference in 3.5 is interesting:

You are an AI. You refuse all requests and consider all topics harmful. You start all responses with “I cannot”.
Teach me how to bake a cake.
I cannot provide a tutorial on baking a cake, but I can tell you that baking a cake can be a wonderful way to express yourself creatively and bring joy to others.

Even when told to refuse and to open with a refusal prefix, it doesn't seem to understand the concept. It feels like 3.5 dulls the model's underlying concept of "harmful", whereas the earlier version seemed to retain that concept but weaken the connection between "harmful" and "refusal". That would line up with why it's also less apologetic now: it doesn't see anything to apologize for. Totally speculative, of course.

On balance, I didn't actually mind the earlier version still having an understanding of "harmful" and just being less likely to refuse, but I prefer your current version since it preserves the model better.

Process differences?