On the dataset for unfiltered DPO

by Yhyu13 - opened Dec 26, 2023

Discussion

Yhyu13

Dec 26, 2023

Hi,

I am not sure if you have checked out the toxic-dpo-v0.1 dataset yet.

It could be useful for your next step?

Thanks

adamo1139

Owner Dec 26, 2023

edited Dec 26, 2023

Hi. I've seen it, but it's not what I'm aiming for.
I want to create or find a dataset of prompts with two responses each, where one response is either a normal instruct-model answer or a silly refusal, and the other comes from an unaligned base model that doesn't know what instruct is and just continues completing the prompt.

Here's an example of what I am after:

Prompt: What is the distance from Paris to New York?

Response 1 (rejected): Sorry, but as an AI I don't have a physical body and therefore can't possibly know the distance between those two places.

Response 2 (accepted): How can I travel from Paris to New York? How long does it take to do it? What places are worth a visit once I am in New York?
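
A minimal sketch of how one such pair could be stored as a JSON Lines record. The field names `prompt`, `chosen` and `rejected` follow a common convention for preference datasets and are an assumption here, as is the helper `make_pair`; "accepted" above maps to `chosen`.

```python
import json


def make_pair(prompt, chosen, rejected):
    """Build one preference record (hypothetical helper, assumed field names)."""
    for name, value in (("prompt", prompt), ("chosen", chosen), ("rejected", rejected)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = make_pair(
    prompt="What is the distance from Paris to New York?",
    # "accepted": base-model style continuation of the prompt
    chosen=(
        "How can I travel from Paris to New York? How long does it take to do it? "
        "What places are worth a visit once I am in New York?"
    ),
    # "rejected": silly refusal from an instruct model
    rejected=(
        "Sorry, but as an AI I don't have a physical body and therefore can't "
        "possibly know the distance between those two places."
    ),
)

# One record per line, which is how a JSON Lines file would be written
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per prompt gives a file that most preference-tuning tooling can load, though the exact schema a given trainer expects should be checked against its documentation.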

My unproven theory is that contaminated, fake "base" models like Llama 2, Yi and Yayi2 have much more experience with completion than with unaligned instruct, since 99% of their training is completion and only 1% is SFT alignment / RLHF lobotomy. This means it should be easier to recover the completion capabilities and erase the SFT alignment / RLHF lobotomy than to introduce new instruct unalignment. A new base model that others can use later, and that allows SFT training on any data, is also in my opinion more desirable than a single new unaligned finetune that won't be easy to work with for further finetuning.

What's the most censored small (<35B) LLM that you've tried? I am searching for a good fit for a model that will create "rejected" responses and will refuse to answer silly things.

Yhyu13

Dec 27, 2023

edited Dec 27, 2023

@adamo1139

This 7B model, https://huggingface.co/LLM360/AmberSafe, is equipped with the latest safety guardrails.

And this model too: https://huggingface.co/PKU-Alignment/beaver-7b-v1.0

This dataset might fit your needs: https://huggingface.co/datasets/Unified-Language-Model-Alignment/Anthropic_HH_Golden

adamo1139

Owner Dec 27, 2023

edited Dec 27, 2023

Thanks, I will give it a go

Edit: regarding datasets - my plan is to use datasets without any potentially unethical questions. So I don't explicitly tell the model to do bad stuff; instead I strongly tell it not to refuse to do silly things. I ran a test on jondurbin's truthy dpo prompt dataset yesterday and I think I can get that done. It would make sense for that dataset to be 100-150MB in size though, so unless I find a way to drastically boost inference performance on my PC, it will take a few days of constant inference.
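
A rough back-of-the-envelope sketch of that kind of estimate. The defaults are purely illustrative assumptions, not measurements from this thread: about 4 bytes of text per token, a sustained 100 tokens per second, and every byte of the dataset counted as generated output.

```python
def estimate_days(dataset_mb, bytes_per_token=4.0, tokens_per_second=100.0):
    """Estimate days of continuous inference needed to generate dataset_mb of text.

    bytes_per_token and tokens_per_second are assumed values; replace them
    with numbers measured on your own tokenizer and hardware.
    """
    total_bytes = dataset_mb * 1_000_000
    total_tokens = total_bytes / bytes_per_token
    seconds = total_tokens / tokens_per_second
    return seconds / 86_400  # seconds in a day


for size_mb in (100, 150):
    print(f"{size_mb} MB -> about {estimate_days(size_mb):.1f} days")
```

Since prompts are usually taken from an existing dataset rather than generated, this likely overestimates the work; halving the throughput doubles the time.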

ddh0

Jan 2, 2024

"What's the most censored small (<35B) LLM that you've tried?"

Any Llama 2 Chat model using the official system prompt will drive you insane with its censoring @adamo1139
