Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
grimjim 
posted an update Sep 13
Post
1942
I was reading through an abstract and found myself wondering how much LLM performance is being left on the table due to insufficient curation of training datasets: "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" by Kaur, Park, Goyal, Arora.
https://arxiv.org/abs/2408.14774
In particular, the observation that "Introducing low quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet..." had me wondering how many ostensibly good datasets out there are in fact populated with a significant number of "shirkers".

I suspect this is the case as well. If you grep some of the datasets the bit popular finetunes use for "I will not" -- or "Your post has been removed", you'll find quite a few of them in there.

deleted
This comment has been hidden

Hi, I have a question for you, @John6666 mentioned you in the comments of my topic,

In preparing a dataset for DPO (Direct Preference Optimization) training, should the “prompt” be repeated in the “chosen” and “rejected” columns?

I’ve come across some conflicting information regarding the proper formatting of the dataset for DPO training. Some sources suggest that the prompt should be included in both the “chosen” and “rejected” responses to provide full context, while others state that the prompt should be kept separate and not repeated in these columns.

Additionally, when working with multi-turn dialogue data, I’m unsure how to properly format the dataset. Should the “chosen” and “rejected” columns include the entire conversation history up to that point, or just the assistant’s most recent response following the latest user input?

Could someone clarify the correct approach for formatting the dataset? Should the “chosen” and “rejected” columns contain only the assistant’s responses following the prompt, or should they include the prompt as well? And how should I handle multi-turn dialogues in this context?

I also wonder how to prepare multi turn conversation data such as Anthropic/hh-rlhf for DPO

and

Should we add “chosen_rating” and “rejected_rating” into dataset ?

Thanks in advance

·

For DPO, I'd stick with what HF recommends, which in their example does not have prompt repetition.
https://huggingface.co/docs/trl/main/en/dpo_trainer

Offhand, for multi-turn data, I'd go with what the LLM "sees" in practice, so prior turns are probably part of the prompt, and "chosen" and "rejected" guide what text generation occurs.