First of all, I apologize for my shallow understanding of ablation in the original paper
No need to apologize at all for this! I would not be inclined to say that I have a great understanding of this ablation method.
And with that being said, I'll try my best to answer your questions as far as I've observed (though not necessarily thoroughly tested).
- Why do you think there is a performance hit after abliteration?
I think this comes down to a couple of things. First is a lack of perfect precision in the technique. The abliterated models are made through two key steps:
- Find the feature direction (refusal) in the LLM's activations by looking at the model's residual stream.
- Orthogonalize out the feature direction from the LLM's weights to prevent any layer from writing to that direction.
I think you lose some precision in this two-step process compared to, say, inference-time intervention, where you can operate on the residual stream directly during the forward pass.
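To make those two steps concrete, here's a minimal sketch (NumPy, with synthetic activations and a synthetic weight matrix standing in for real residual-stream captures and model weights -- none of the names come from an actual library):

```python
import numpy as np

# Synthetic stand-ins for residual-stream activations captured at one layer.
# In practice these come from running the model on "harmful" vs. "harmless" prompts.
rng = np.random.default_rng(0)
d_model = 512
harmful_acts = rng.normal(size=(100, d_model))   # prompts the model refuses
harmless_acts = rng.normal(size=(100, d_model))  # prompts it answers normally

# Step 1: take the difference of the mean activations as the refusal direction.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Step 2: orthogonalize a weight matrix that writes into the residual stream,
# so the layer can no longer output a component along the refusal direction.
W_out = rng.normal(size=(d_model, d_model))      # stand-in for an attention/MLP output matrix
W_ablated = W_out - np.outer(refusal_dir, refusal_dir) @ W_out

# Inference-time intervention, by contrast, projects the direction out of the
# residual stream itself on each forward pass and leaves the weights untouched.
def ablate(residual):                            # residual: (batch, d_model)
    coeffs = residual @ refusal_dir              # component along the refusal direction
    return residual - coeffs[:, None] * refusal_dir
```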
Secondly, if you read about the phenomenon known as superposition, described here in this Anthropic paper, it's the interesting finding that a model may "pack" multiple features into the same feature space. This appears to be very common in neural networks, and it leads me to believe that even with a perfect direction, you may cause damage to adjacent behaviors, because you're potentially muting other things in the residual stream without realizing it.
It's unclear exactly how you could remedy this without something like an autoencoder that can spot the complex patterns and resolve the superposition. So you lose some precision there as well.
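As a toy illustration of the risk (made-up 2-D numbers, nothing measured from a real model): when two features share a space, zeroing out one direction also perturbs the readout of the other.

```python
import numpy as np

# Two "features" packed into a 2-D space with non-orthogonal directions (superposition).
f_refusal = np.array([1.0, 0.0])
f_other   = np.array([0.6, 0.8])   # a benign feature that overlaps with the refusal direction

activation = f_other.copy()                     # the model is only using the benign feature
readout_before = activation @ f_other           # 1.0

# Ablate the refusal direction from the activation anyway.
ablated = activation - (activation @ f_refusal) * f_refusal
readout_after = ablated @ f_other               # 0.64 -- the benign feature got damaged too

print(readout_before, readout_after)
```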
- Technically, wouldn't the final model have unfettered access to the full depth of its knowledge from the base model?
This is more of an aside, but interestingly, in the original paper they noticed that the refusal direction could be found in the base models they tested as well. So it's possible that fine-tuning for refusal just "reinforces" a concept that's already in the model. (Anecdotally, at least.)
It's hard to say what the model would and wouldn't do if the technique were perfect. Most of the time with this orthogonalization, you're working with select layers -- for targeting a feature direction, or for orthogonalizing out the found feature direction.
Each layer comes with its own feature representations, building upon the previous. There may be some feature like "unethical" that triggers other behavior in the model that is harder to define. We're targeting a single direction here, but the overall space of other "refusal features" could be multi-dimensional.
- What is the difference in performance/function of being uncensored between uncensored models that are obfuscated, and ones that are ablated of refusal?
I wish I could give you a satisfying answer! This is a very interesting question that I myself would like to understand more of. You have models like Phi-3 that are trained on very carefully selected base training data, so they have never seen the concept of certain "bad" behavior. You have other models like Llama, which was trained on huge swaths of data but then went through some amount of fine-tuning to refuse. And you have completely "uncensored" models that may have seen refusals in their base dataset, but didn't go through any safety-tuning to encourage refusals in certain circumstances.
They each behave very differently, and applying abliteration actually sort of shows this. With Phi-3, I could immediately see that it just does not have a concept of how to do certain things. Even with an aggressive application, it wouldn't say no, but it couldn't suddenly produce the material either. Whereas Llama could.
- Is abliteration an efficient fine-tuning alternative, or are there real advantages to this method?
For this part of your question, I see this as an extension of the mechanistic interpretability work around "control vectors".
Looking beyond this orthogonalization for refusal features, PyReFT from the Stanford NLP team is a very good example of how this can be extended, where they were able to enforce desired behavior with as few as 5 examples.
It's hard to say this isn't fine-tuning, but it is certainly a lot more computationally efficient than traditional fine-tuning techniques, considering it just requires running inference and devising a technique to find your "feature directions".
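As a rough sketch of the control-vector idea at inference time (this is not PyReFT's actual API; the layer, steering vector, and strength below are all hypothetical placeholders), you can add a derived direction to a layer's output with a forward hook:

```python
import torch

# Toy stand-in for one transformer block; in practice the hook would go on a real decoder layer.
d_model = 64
layer = torch.nn.Linear(d_model, d_model)
steering_vec = torch.randn(d_model)
steering_vec = steering_vec / steering_vec.norm()
alpha = 4.0  # intervention strength

def add_control_vector(module, inputs, output):
    # Nudge this layer's contribution to the residual stream toward the desired behavior.
    return output + alpha * steering_vec

handle = layer.register_forward_hook(add_control_vector)
hidden = torch.randn(1, d_model)
steered = layer(hidden)   # the output now carries the control vector
handle.remove()
```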
- I have been trying to identify examples of use cases for uncensored models other than erotic role playing, and by examples, I mean ones that contain some level of detail into the how and why of the use case scenario.
The big one I've heard is from people operating from a place of expertise attempting to use these models to help think things through. For example, I work with a group of biomedical/chemistry people who do not need the model to offer disclaimers or anything like that.
It is a waste of tokens and compute, and a waste of human brainpower, to have to keep prompting the model with "do not offer unsolicited advisories".
Also, the insistence on always operating from an "ethical" standpoint has come to a head with therapists attempting to explore perhaps sensitive mental health subjects with the model. Similarly for doctors, InfoSec red teamers trying to understand how something could be hacked, and so on.
These models do not trust the user to be one of these experts, or they frequently need reminding. If all it takes is "please trust me" said in just the right way, should the model have had the safety barriers to begin with?
- Do you think that in normal use-case scenarios, there have been large improvements in avoiding false refusals in frontier models to the point where using uncensored models specifically for that purpose is not necessary anymore?
I think so for the general public. For experts, however, I still see the need, and if anything I feel as though the models are getting more adamant. As model-trainers exclude more and more "potentially harmful" data from their base training sets, we'll see even more harm to those expert users, because the models will spew nonsense on those topics -- and that's something abliteration won't be able to help with.