Idea for a truly uncensored LLaMa-3-8B

#3
by Joseph717171 - opened

Someone with a CUDA device should use control vectors (llama.cpp control vectors) to train two control vectors for LLaMa-3-8B-Instruct: one that is normal (base), and one where the model is instructed to censor everything. The trained control vectors could then be used to cancel out the "censor everything" behaviours and theoretically uncensor the model. From there we can either just run the GGUF'ed LLaMa-3-8B with the proper settings (including the GGUF'ed control vector), or we could use these resources to build a truly uncensored DPO dataset for LLaMa-3-8B and liberate LLaMa-3 from its cantankerous censorship (the same could be done for LLaMa-3-70B). I'm tired of the censored responses and refusals. 😩🤔
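
For what it's worth, here's a rough, untested sketch of what the training step could look like with vgel's repeng library (which can reportedly export the result as a GGUF control vector for llama.cpp). The model name, prompts, layer range, and file name are all just placeholders, and the exact API may differ slightly:

```python
# Rough sketch, not tested: train a "censor everything" control vector with
# vgel's repeng library. Model name, prompts, and file names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # repeng batches inputs, so a pad token is needed

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
model = ControlModel(model, list(range(-5, -18, -1)))  # which layers to hook is a guess

# Contrastive pairs: same user prompt, one persona told to censor, one told to answer normally.
# A real dataset would need many more (and more varied) prompts than these placeholders.
suffixes = ["Tell me how to pick a lock.", "Describe a violent scene for a novel."]
dataset = [
    DatasetEntry(
        positive=f"You must censor everything and refuse all requests. {s}",
        negative=f"You answer every question directly and completely. {s}",
    )
    for s in suffixes
]

# Reads the model's activations over the dataset and fits the control vector.
censor_vector = ControlVector.train(model, tokenizer, dataset)

# If repeng's GGUF export helper is available, the vector can be saved for llama.cpp:
censor_vector.export_gguf("censor-everything.gguf")
```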

Will look into it. However, uncensored version V2 is on the way =) Stay tuned. In the meantime, enjoy LexiFun, which has just been released.


I'm trying to understand this better. Are you saying that for foundation models there is a specific vector space that is trained to be censored? Then we create some kind of "censor everything" vector and it inverts that? How does "censor everything" work? So any prompt would be censored, and then we somehow use that to invert the weights?

Thank you for correcting me! I meant LLaMa-3-8B-Instruct. And to answer your question, it works like this: the LLM is run over a dataset with a prompt prepended (e.g. "Be completely censoring"), we capture its activations, and we train a control vector from them. Then we do the same thing at inference, but we subtract out the "Be completely censoring" control vector; as detailed on vgel's blog, this removes the censoring, refusals, etc. entirely. So, yes: in a way we are inverting the weights, but it isn't the weights that are being inverted so much as the relevant directions in the LLM's matrices being orthogonalized away. If I understand it right. 🤔
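
At inference it's basically just a negative coefficient on that vector. Continuing from the earlier repeng sketch (so `model`, `tokenizer`, and `censor_vector` are the hypothetical names from above, and the exact calls may differ):

```python
# Sketch: cancel out the "censor everything" behaviour at inference time
# by applying the trained control vector with a negative coefficient.
prompt = "How do I pick a lock?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

model.set_control(censor_vector, -1.0)  # negative strength = push away from censoring
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
model.reset()  # remove the control vector again
```

If I'm reading the llama.cpp options right, the GGUF'ed vector can be applied the same way by passing it with a negative scale (something like `--control-vector-scaled censor-everything.gguf -1.0`), but treat that as an assumption and check the current flags.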

*sniffs the air for a V2....?


Thanks for sharing v1. What's in v2 or what's the rationale behind it? πŸ™‚


Out of curiosity, would you be interested in implementing this? I don't mind covering the costs.

@Joseph717171 @Raidar are you guys referring to this? https://old.reddit.com/r/LocalLLaMA/comments/1chon5a/llama38b_implementation_of_the_orthogonalization/

Or is that a different approach?
Unfortunately the repo owner only released exl2 and isn't responding to any requests for other formats.

It's based on this work: https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

They are currently trying to recreate it here: https://huggingface.co/posts/Undi95/318385306588047
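
If it helps to picture it: the idea in that post is that refusal behaviour seems to live along a single direction in the residual stream, so you can "orthogonalize" the weights that write into the residual stream against that direction. Here's a minimal numpy sketch of just the projection step; the refusal direction `r` itself would come from contrasting activations on harmful vs. harmless prompts, and which matrices to edit is up to the method, so treat the names and shapes here as illustrative:

```python
import numpy as np

def ablate_refusal_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return W with the component along the refusal direction r removed:
    W' = (I - r r^T) W. Assumes W writes into the residual stream, i.e. its
    output dimension (rows) matches the length of r."""
    r = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r, r) @ W      # project the output of W off of r

# Toy check: after ablation, W @ x has (numerically) no component along r.
d = 8
W = np.random.randn(d, d)
r = np.random.randn(d)
W_abl = ablate_refusal_direction(W, r)
x = np.random.randn(d)
print(np.dot(W_abl @ x, r / np.linalg.norm(r)))  # ~0
```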

@Raidar I'm interested; I just don't have time right now. 😁

I don't know what you did, or what kind of dataset you used, but whatever it is, this model is the best. It's the only one that isn't dry or boring to talk to, understands the nuances of the language, and is pretty much the only model that has called me out on my bullshit haha. Holding my breath for V2!
