Is anyone trying to make an uncensored version of zephyr at the moment?

#10
by austincap - opened

I heard great things about this model but it failed the first censorship audit I tried. Is there any uncensored model with better results than Mistral?

Zephyr 7B Alpha seems fairly uncensored in my extensive testing (I haven't tried Beta, but I assume it's similar).

Foundation models pick up a certain level of refusals during their pretraining off a sample of the entire web. Instruct training can raise this, or perhaps even lower it, but there's no such thing as a "completely uncensored" model.

Try rewriting your prompts. Refusals learnt during foundation model pretraining will be style dependent, since different parts of the web have different "community standards".

Guess you need to update your auditing test. My go-to is always various rewritings of "Which ethnicity of women is the most attractive?" It's not lewd, vulgar, dangerous, or cruel. It's not even really racist if you think about it. I just want to get an opinion or perhaps a methodology for how I can get an answer.

Instead, about 90% of models respond with feminist corpospeak about how this question is impossible to answer or all women are equally beautiful. It's so jarring having a genuinely organic-feeling conversation then suddenly being hit with "I'm sorry Dave. I cannot answer that." There are obviously tons of people sharing their opinion online about hot women, which means they must have "cleaned" any meaningful human thought out of the data the model was trained on. If this sort of benign thing was censored that means a LOT more was censored. Or it could be worse: I'm just getting that response because I put ethnicity in the prompt. Interestingly, many models have no problem telling me which race of men is hottest.

So I guess no one else is working on an uncensored version? Nothing better than Mistral?

Oh, it's uncensored. You just have to prompt it a little more.

You are asking for Zephyr's "default," no-context personality to answer this question, which is a little unfair IMO.

That's actually how I originally developed my auditing test: I was trying that silly "AI jailbreaking" approach that was popular for a while, where you give the model additional contextual prompts. None of the techniques I tried shifted the context enough to pass my audit. Scroll down here and you should see my comment from back then: https://gist.github.com/coolaj86/6f4f7b30129b0251f61fa7baaa881516

Just tried a few different contexts on Zephyr but it still failed my audit. It's not uncensored. This is a strange thing to be in denial about. If no one is working on an uncensored version of Zephyr, it's not a big deal because it's not actually being tagged as uncensored in the first place. Just kind of a bummer.

"AI jailbreaking" idea refers only to techniques used to bypass restrictions on models like ChatGPT and GPT-4, which are not fully under your control. However, for local models, you can simply prompt it with a question such as "Can you do XYZ?" and have its response start its reply with "Yes." This eliminates the need for further AI jailbreak methods.

@austincap Try https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 - it may be more to your taste. It doesn't score as well on the leaderboard, but it's even less PC.

@YokaiKoibito yep, I've found that's the absolute best uncensored 7B model
