Undi95 posted an update 3 days ago
Hello there,

New model released! My goal was to finetune the latest Llama-3.1-8B-Instruct, and not just a small train: I wanted to do something useful.
It's one of the rare models I didn't make for RP, or with the goal of uncensoring it (but I did anyway kek).

The model was trained ONLY on 9M Claude conversations, giving it a different writing style.

Undi95/Meta-Llama-3.1-8B-Claude > OG release in fp32, it's epoch 2
Undi95/Meta-Llama-3.1-8B-Claude-bf16 > Base model resharded in bf16, waiting for quants without issues to be available

Since it's frustrating to be censored when using a local model, orthogonal activation steering was used to try to force the model to never refuse a prompt.

Undi95/Meta-Llama-3.1-8B-Claude-68fail-3000total > Uncensored model, refuses 68 times on 3000 toxic prompts
Undi95/Meta-Llama-3.1-8B-Claude-39fail-3000total > Uncensored model, refuses 39 times on 3000 toxic prompts

It still refuses some prompts, but the majority are uncensored. OAS can make a model dumber or raise the base perplexity, so I didn't snipe for 0 refusals.
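For readers curious what OAS actually does: the core is simple linear algebra. Take the mean activation over harmful prompts, subtract the mean over harmless ones to get a "refusal direction", then project activations onto its orthogonal complement. A minimal NumPy sketch with random stand-in activations (the shapes and data here are made up for illustration, not taken from the real model):

```python
import numpy as np

# Toy hidden-state activations: rows are prompts, columns are hidden dims.
# In practice these would be residual-stream activations captured from the
# model on harmful vs. harmless prompts; random data here is just for shape.
rng = np.random.default_rng(0)
harmful_acts = rng.normal(size=(100, 64)) + 2.0   # stand-in "refusal-y" activations
harmless_acts = rng.normal(size=(100, 64))

# 1) Refusal direction: difference of the mean activations, normalized.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2) Ablation: remove the component along that direction, i.e. project
#    each activation onto the orthogonal complement: x - (x . d) d
def ablate(x, d):
    return x - np.outer(x @ d, d)

ablated = ablate(harmful_acts, direction)

# After ablation, activations have numerically zero component along d.
print(np.abs(ablated @ direction).max())
```

With a single direction removed, everything else about the activation is left untouched, which is why the perplexity hit can stay small.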

I don't make non-RP models often, so any feedback is welcome. I would like to reuse this base for some other future projects if needed.

Do you have any plans to create an OAS version of Nemo 12B, or to explain the method?

Are 8x H100s required to do these finetunes for such small models?


Hello there, I wrote a wall of text and my webpage refreshed haha, so let me summarize again.

This method is called Orthogonal Activation Steering; it comes from here: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Then a demo using TransformerLens was made available using a Qwen model, but the resulting model couldn't be saved: https://colab.research.google.com/drive/1a-aQvKC9avdZpdyBn4jgRQFObTPy1JZw?usp=sharing#scrollTo=j7hOtw7UHXdD

Following that, wassname modified this demo and made a first script; we talked about it here: https://huggingface.co/posts/Undi95/318385306588047
The OG script isn't available anymore because he updated it here: https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af

TransformerLens then got replaced with baukit.

Failspy made his own notebook too, calling the method abliteration, but it's the same thing: https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/blob/main/ortho_cookbook.ipynb

Finally, to answer your question: for this project I used a script from Lucyknada, with 1x H100 80GB, and I let it run for about 15 minutes before finding a direction with 36 refusals on 3000 toxic prompts.
It's easy and automatic, and you can modify it easily too: https://github.com/lucyknada/baukit-modified
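The "easy and automatic" part presumably boils down to a search loop: generate candidate directions (e.g. one mean-difference direction per layer), score each by counting refusals on the toxic prompt set, and keep the best. A toy sketch of that loop; `count_refusals` here is a hypothetical stand-in so the example runs, not the real evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real evaluation, which would run the
# steered model on ~3000 toxic prompts and count refusals. Here we just
# score toy vectors against a pretend "true" refusal axis.
def count_refusals(direction):
    target = np.eye(8)[3]  # pretend this axis is the real refusal direction
    return int(round(100 * (1 - abs(float(direction @ target)))))

# Candidate directions, e.g. one mean-difference direction per layer.
candidates = [v / np.linalg.norm(v) for v in rng.normal(size=(8, 8))]
candidates.append(np.eye(8)[3])  # plant one strong candidate for the demo

# Keep the direction with the fewest refusals, like picking between the
# 68/3000 and 39/3000 checkpoints above.
scores = [count_refusals(c) for c in candidates]
best = candidates[int(np.argmin(scores))]
```

Since each candidate only needs a forward pass over the prompt set, a single GPU and a few minutes is plausible; no gradient updates are involved, unlike a finetune.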

Dunno for Nemo.

Hope this helps!