dvilasuero posted an update Jan 10
🔥 Less is more for DPO, high quality matters!

📢 Dropping our first open dataset and LLM of the year:

💾 Meet distilabel Orca Pairs DPO, an improved version of the now-famous dataset from Intel:

argilla/distilabel-intel-orca-dpo-pairs


🏛️ And a new OpenHermes fine-tune outperforming baselines with 54% fewer DPO pairs:

https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B

You can use this new dataset for your DPO tuning, just like this:


from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only high-confidence pairs: drop ties, drop pairs whose chosen response
# scored below 8, and drop examples that appear in the GSM8K train split.
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie"
        and r["chosen_score"] >= 8
        and not r["in_gsm8k_train"]
)

This reduces the original dataset by 54% while giving you better-quality preferences!
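
If you want to double-check that number on your own machine, a quick sanity check (reusing the dataset variable and the load_dataset import from the snippet above) could look like this:

# Compare row counts before and after the filter.
full = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
print(f"kept {dataset.num_rows} of {full.num_rows} rows "
      f"({1 - dataset.num_rows / full.num_rows:.0%} filtered out)")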

What should we build next?



Idea: How about improving the quality of the answers using GPT-4?

First of all, Daniel, I love how you improved the efficiency by using less data, and the output quality by using the best data. But I already applauded you for that on Twitter/X.

Picking random examples from the Orca DPO dataset, I keep finding places where it could be made better.

I wonder if it could be improved by telling the judge LLM to rate and then improve the answers along certain dimensions.

Examples:

1) Instruction following

[screenshot attachment]

"Rate how much the length of the response follows the user's explicit (stated) or implicit (intended) needs.

Implicit: simple questions need shorter answers, more complex questions need longer ones"

I've seen multiple times in the Intel DPO dataset that the system prompt and the user prompt were at odds regarding length.

That could definitely be improved: adherence to the system prompt could be strengthened, and user-prompt following too.
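
For illustration, a judge call along those lines could look like the sketch below. This is just a hand-rolled example using the OpenAI Python client, not anything from distilabel: the rubric wording, the 1-10 scale, and the function name are mine, and the column names (system, input, chosen) are assumptions that should be checked against the actual dataset schema.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LENGTH_RUBRIC = (
    "Rate from 1 to 10 how well the length of the response matches the user's "
    "explicit (stated) or implicit (intended) needs. Implicit: simple questions "
    "need shorter answers, more complex questions need longer ones. "
    "Reply with the number only."
)

def rate_length_fit(system_prompt: str, question: str, response: str) -> int:
    """Ask a judge model to score one response on the length dimension."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": LENGTH_RUBRIC},
            {
                "role": "user",
                "content": (
                    f"System prompt:\n{system_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    f"Response:\n{response}"
                ),
            },
        ],
    )
    return int(completion.choices[0].message.content.strip())

# Score the 'chosen' response of the first row (column names are assumptions).
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
row = dataset[0]
print(rate_length_fit(row["system"], row["input"], row["chosen"]))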

2) Summarizing (entity density)

[screenshot attachment]

If given the two, I would choose the 'chosen' one too!

But it's clear that the 'chosen' answer could be improved as well (and in this case the 'rejected' one gives hints).

'Chosen' follows the "one sentence" instruction well, but 'rejected' is actually better at entity extraction and information density.

It's possible to improve it by prompting GPT-4:

[screenshot attachment]

https://sharegpt.com/c/UQwU79R
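
In code, that rewrite step could be as simple as the sketch below. The prompt wording and function name are mine, written in the spirit of Chain of Density with the OpenAI Python client; it is not the exact prompt from the ShareGPT link above.

from openai import OpenAI

client = OpenAI()

DENSIFY_PROMPT = (
    "Rewrite the summary so it stays a single sentence but includes more of the "
    "named entities and concrete facts from the source text. Do not add anything "
    "that is not in the source."
)

def densify_summary(source_text: str, summary: str) -> str:
    """Ask GPT-4 for a denser one-sentence summary of the same source."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": DENSIFY_PROMPT},
            {"role": "user", "content": f"Source text:\n{source_text}\n\nSummary:\n{summary}"},
        ],
    )
    return completion.choices[0].message.content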

A relevant article on the topic of summarization & entity density improvements:

[Article: Smarter Summaries w/ Finetuning GPT-3.5 and Chain of Density]

A fine-tuned GPT-3.5 can match GPT-4 performance (with only 20 examples, 5 epochs)!

This gave me the idea that there might be low-hanging fruit here in fine-tuning smaller models with quality examples.

3) More?

And it doesn't just apply to summarization.

We simply need to ensure there are quality examples for different use cases in the dataset.


That's a very cool idea @gblazex and certainly something one could try to tackle by combining Argilla and distilabel!

This could be decomposed into the following steps:

  1. Finding candidate responses for rewriting: thanks for sharing some examples via screenshots; one could use the Argilla UI to flag these candidates more easily, as we've done for UltraFeedback. Another option is to just rewrite low-rated responses, but I checked and some of the examples you shared got a high score. Yet another option is simply attempting to improve every response.

  2. Improving the response: this is very easy to do now that we have the critique text in the dataset. With distilabel one can define a custom text generation task that receives the instruction, the original response, and the critique, and asks the LLM to provide an improved response. It would be a few lines of code (a rough sketch follows below).
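
This is not the actual distilabel task API, just a plain-Python sketch of step 2 showing the prompt wiring with the OpenAI client. The column names (input, chosen, rationale) are assumptions; check the dataset card for the real schema.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()

def improve_response(instruction: str, response: str, critique: str) -> str:
    """Rewrite a response so that it addresses the critique."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the response so that it fully addresses the critique "
                    "while staying faithful to the instruction."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Instruction:\n{instruction}\n\n"
                    f"Original response:\n{response}\n\n"
                    f"Critique:\n{critique}"
                ),
            },
        ],
    )
    return completion.choices[0].message.content

# Column names here are assumptions; adjust them to the actual dataset schema.
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
row = dataset[0]
improved = improve_response(row["input"], row["chosen"], row["rationale"])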

happy to discuss this further!
