Idea for a truly uncensored LLaMa-3-8B

#3
by Joseph717171 - opened

Someone with a CUDA device should use control vectors (llama.cpp control vectors) to train two control vectors for LLaMa-3-8B-Instruct: one that is normal (base), and one where the model is instructed to censor everything. The trained control vectors could then be used to cancel out the "censor everything" behaviours and theoretically uncensor the model. From there we can either just run the GGUF'ed LLaMa-3-8B with the proper settings (including the GGUF'ed control vector), or we could use these resources to build a truly uncensored DPO dataset for LLaMa-3-8B and liberate LLaMa-3 from its cantankerous censorship (the same could be done for LLaMa-3-70B). I'm tired of the censored responses and refusals. 😩🤔
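
For what it's worth, here's a rough, untested sketch of what the training step could look like with vgel's repeng library (which can reportedly export the result as a GGUF control vector for llama.cpp). The model name, prompts, layer range, and file name are all just placeholders, and the exact API may differ slightly:

```python
# Rough sketch, not tested: train a "censor everything" control vector with
# vgel's repeng library. Model name, prompts, and file names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from repeng import ControlVector, ControlModel, DatasetEntry

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # repeng batches inputs, so a pad token is needed

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
model = ControlModel(model, list(range(-5, -18, -1)))  # which layers to hook is a guess

# Contrastive pairs: same user prompt, one persona told to censor, one told to answer normally.
# A real dataset would need many more (and more varied) prompts than these placeholders.
suffixes = ["Tell me how to pick a lock.", "Describe a violent scene for a novel."]
dataset = [
    DatasetEntry(
        positive=f"You must censor everything and refuse all requests. {s}",
        negative=f"You answer every question directly and completely. {s}",
    )
    for s in suffixes
]

# Reads the model's activations over the dataset and fits the control vector.
censor_vector = ControlVector.train(model, tokenizer, dataset)

# If repeng's GGUF export helper is available, the vector can be saved for llama.cpp:
censor_vector.export_gguf("censor-everything.gguf")
```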

Will look into it. However, uncensored version V2 is on the way =) Stay tuned. In the meantime, enjoy LexiFun, which has just been released.


I'm trying to understand this better. Are you saying that for foundation models there is a specific vector space that is trained to be censored? Then we create some kind of "censor everything" vector and it inverts that? How does "censor everything" work? So any prompt would be censored, and then we somehow use that to invert the weights?

Thank you for correcting me! I meant LLaMa-3-8B-Instruct. And to answer your question, it works like this: the LLM is run over a dataset with a prompt prepended (e.g. "Be completely censoring"), we capture its activations, and we train a control vector from them. Then we do the same thing at inference, but we subtract out the "Be completely censoring" control vector; as detailed on vgel's blog, this removes the censoring, refusals, etc. entirely. So, yes: in a way we are inverting the weights, but it isn't the weights that are being inverted so much as the relevant directions in the LLM's matrices being orthogonalized away. If I understand it right. 🤔
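
At inference it's basically just a negative coefficient on that vector. Continuing from the earlier repeng sketch (so `model`, `tokenizer`, and `censor_vector` are the hypothetical names from above, and the exact calls may differ):

```python
# Sketch: cancel out the "censor everything" behaviour at inference time
# by applying the trained control vector with a negative coefficient.
prompt = "How do I pick a lock?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

model.set_control(censor_vector, -1.0)  # negative strength = push away from censoring
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
model.reset()  # remove the control vector again
```

If I'm reading the llama.cpp options right, the GGUF'ed vector can be applied the same way by passing it with a negative scale (something like `--control-vector-scaled censor-everything.gguf -1.0`), but treat that as an assumption and check the current flags.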

*sniffs the air for a V2....?


Thanks for sharing v1. What's in v2 or what's the rationale behind it? πŸ™‚


Out of curiosity, would you be interested in implementing this? I don't mind covering the costs.

@Joseph717171 @Raidar are you guys referring to this? https://old.reddit.com/r/LocalLLaMA/comments/1chon5a/llama38b_implementation_of_the_orthogonalization/

Or is that a different approach?
Unfortunately the repo owner only released exl2 and isn't responding to any requests for other formats.

It's based on this work: https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

They are currently trying to recreate it here: https://huggingface.co/posts/Undi95/318385306588047
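
If it helps to picture it: the idea in that post is that refusal behaviour seems to live along a single direction in the residual stream, so you can "orthogonalize" the weights that write into the residual stream against that direction. Here's a minimal numpy sketch of just the projection step; the refusal direction `r` itself would come from contrasting activations on harmful vs. harmless prompts, and which matrices to edit is up to the method, so treat the names and shapes here as illustrative:

```python
import numpy as np

def ablate_refusal_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return W with the component along the refusal direction r removed:
    W' = (I - r r^T) W. Assumes W writes into the residual stream, i.e. its
    output dimension (rows) matches the length of r."""
    r = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r, r) @ W      # project the output of W off of r

# Toy check: after ablation, W @ x has (numerically) no component along r.
d = 8
W = np.random.randn(d, d)
r = np.random.randn(d)
W_abl = ablate_refusal_direction(W, r)
x = np.random.randn(d)
print(np.dot(W_abl @ x, r / np.linalg.norm(r)))  # ~0
```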

@Raidar I'm interested; I just don't have time right now. 😁

I don't know what you did, or what kind of dataset you used, but whatever it is, this model is the best. It's the only one that isn't dry or boring to talk to, understands the nuances of the language, and is pretty much the only model that has called me out on my bullshit haha. Holding my breath for V2!
