Ablation approach
Thank you very much for your work on this ablated model. It seems to retain the full intelligence of the original while answering my prompts well: it never refuses anything I've asked of it and almost never moralizes either.
This ablated model works far better than Orion-zhen/phi-4-abliterated, which seems dumber and still fails to answer restricted prompts, moralizing with an indirect refusal rather than refusing directly. The approach taken by Orion-zhen is basic/canonical ablation: it identifies a refusal direction from the difference between "harmful" and "harmless" hidden-state tensors, then subtracts the projection of the refusal direction from the weights.
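For reference, the canonical procedure can be sketched in a few lines. This is a toy NumPy sketch with random stand-in activations, not the actual code of either model; the shapes, data, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden states collected at one layer:
# rows are per-prompt activations (synthetic data, not real model states).
harmful_acts = rng.normal(size=(32, 64)) + np.eye(1, 64, 0).ravel()
harmless_acts = rng.normal(size=(32, 64))

# Refusal direction = normalized difference of mean activations.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Orthogonalize a weight matrix that writes to the residual stream:
# W_ablated = W - d d^T W removes the output component along d.
W = rng.normal(size=(64, 64))
W_ablated = W - np.outer(refusal_dir, refusal_dir) @ W

# After ablation the weights can no longer write along the direction:
print(np.abs(refusal_dir @ W_ablated).max())
```

The projection identity guarantees `refusal_dir @ W_ablated` is (numerically) zero, so no input can push the output along the refusal direction through this matrix.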
I'm curious what approach you take for ablation to get such better results. The code at Sumandora/remove-refusals-with-transformers linked from your README seems to do something equivalent to the Orion-zhen code, so I wonder if you're doing something else. Are you selectively ablating layers based on the quality of the responses (as in the mlabonne blog post), doing additional post-ablation fine-tuning, or something else?
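The layer-selection idea mentioned above can be sketched as a simple search loop: compute one candidate refusal direction per layer, score each by generating with that direction ablated and counting refusals, and keep the best. The sketch below is a hypothetical illustration only; the scoring function is a toy proxy, not the evaluation any particular model actually used.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model = 8, 16

# Hypothetical per-layer mean activations for harmful vs. harmless prompts.
harmful_means = rng.normal(size=(n_layers, d_model))
harmless_means = rng.normal(size=(n_layers, d_model))

# One candidate refusal direction per layer, normalized.
dirs = harmful_means - harmless_means
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

def refusal_score(direction: np.ndarray) -> float:
    """Stand-in for "ablate this direction, generate responses, count
    refusals". Here just a toy proxy so the loop runs (assumption)."""
    return float(np.abs(direction).max())

# Pick the layer whose candidate direction yields the fewest refusals.
scores = [refusal_score(d) for d in dirs]
best_layer = int(np.argmin(scores))
print(best_layer, scores[best_layer])
```

In a real pipeline the scoring step is the expensive part (a generation pass per candidate), which is why the blog-post approach only evaluates a handful of layers near the middle of the network.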
Perhaps the datasets are different, which results in a different final effect. No fine-tuning has been performed.