Hi, I have a question for you. @John6666 mentioned you in the comments of my topic.
When preparing a dataset for DPO (Direct Preference Optimization) training, should the “prompt” be repeated in the “chosen” and “rejected” columns?
I’ve come across conflicting information about the proper dataset format for DPO training. Some sources say the prompt should be included in both the “chosen” and “rejected” texts to provide full context, while others say the prompt should be kept in its own column and not repeated. I sketch both formats below.
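To make the conflict concrete, here are the two formats I’ve seen described, with toy data (the column names are just my understanding of the convention):

```python
# Format A: prompt kept separate; "chosen"/"rejected" hold only the responses
example_a = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

# Format B: prompt repeated inside the preferred and rejected texts
example_b = {
    "prompt": "What is the capital of France?",
    "chosen": "What is the capital of France? The capital of France is Paris.",
    "rejected": "What is the capital of France? France is a country in Europe.",
}
```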
Additionally, when working with multi-turn dialogue data, I’m unsure how to format the dataset. Should the “chosen” and “rejected” columns include the entire conversation history up to that point, or only the assistant’s most recent response after the latest user input?
Could someone clarify the correct approach? Should the “chosen” and “rejected” columns contain only the assistant’s response that follows the prompt, or should they include the prompt (and, for dialogues, the history) as well?
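In the conversational (list-of-messages) format, this is what I mean by the two options, using a made-up dialogue:

```python
# Made-up dialogue; roles and content are purely illustrative.
history = [
    {"role": "user", "content": "Hi, can you help me debug a Python error?"},
    {"role": "assistant", "content": "Sure, please paste the traceback."},
    {"role": "user", "content": "TypeError: 'NoneType' object is not iterable"},
]
good = {"role": "assistant",
        "content": "That usually means a function returned None where a list was expected."}
bad = {"role": "assistant", "content": "Try reinstalling Python."}

# Option 1: "chosen"/"rejected" hold only the final assistant message
example_last_turn = {"prompt": history, "chosen": [good], "rejected": [bad]}

# Option 2: "chosen"/"rejected" repeat the whole history plus the final message
example_full_history = {
    "prompt": history,
    "chosen": history + [good],
    "rejected": history + [bad],
}
```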
I also wonder how to prepare multi-turn conversation data such as Anthropic/hh-rlhf for DPO.
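For hh-rlhf specifically, this is the rough conversion I have in mind. It assumes the “chosen” and “rejected” transcripts share the same history up to the final “Assistant:” marker, which may not hold for every row, so please treat it as a sketch rather than a known-good recipe:

```python
from datasets import load_dataset

def split_transcript(transcript: str):
    """Split an hh-rlhf transcript at its final "Assistant:" marker.

    hh-rlhf stores each conversation as one string of the form
    "\n\nHuman: ...\n\nAssistant: ...". Everything up to and including
    the last marker becomes the shared prompt; the rest is the response.
    """
    marker = "\n\nAssistant:"
    idx = transcript.rfind(marker)
    return transcript[: idx + len(marker)], transcript[idx + len(marker):]

def to_dpo(example):
    prompt, chosen = split_transcript(example["chosen"])
    _, rejected = split_transcript(example["rejected"])
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

ds = load_dataset("Anthropic/hh-rlhf", split="train")
dpo_ds = ds.map(to_dpo)
```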
Also, should we add “chosen_rating” and “rejected_rating” columns to the dataset?
Thanks in advance