dvilasuero posted an update Jan 10
🔥 Less is more for DPO, high quality matters!

📢 Dropping our first open dataset and LLM of the year:

💾 Meet distilabel Orca Pairs DPO, an improved version of the now-famous dataset from Intel:

argilla/distilabel-intel-orca-dpo-pairs


🏛️ And a new OpenHermes fine-tune outperforming baselines with 54% fewer DPO pairs:

https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B

You can use this new dataset for your DPO tuning, just like this:


from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# use this:
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only high-confidence pairs: drop ties, drop pairs whose chosen response
# scored below 8, and drop examples that appear in the GSM8K train split.
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie"
        and r["chosen_score"] >= 8
        and not r["in_gsm8k_train"]
)

This reduces the original dataset by 54% while giving you better-quality preferences!
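
If you want to double-check that number on your own machine, a quick sanity check (reusing the dataset variable and the load_dataset import from the snippet above) could look like this:

# Compare row counts before and after the filter.
full = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
print(f"kept {dataset.num_rows} of {full.num_rows} rows "
      f"({1 - dataset.num_rows / full.num_rows:.0%} filtered out)")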

What should we build next?



Idea: How about improving the quality of the answers using GPT-4?

First of all, Daniel, I love how you improved the efficiency by using less data, and the output quality by using the best data. But I already applauded you for that on Twitter/X.

Picking random examples from the Orca DPO dataset, I keep finding places where it could be made better.

I wonder if it could be improved by telling the judge LLM to rate and then improve the answers along certain dimensions.

Examples:

1) Instruction following

[screenshot attachment]

"Rate how much the length of the response follows the user's explicit (stated) or implicit (intended) needs.

Implicit: simple questions need shorter answers, more complex questions need longer ones"

I've seen multiple times in the Intel DPO dataset that the system prompt and the user prompt were at odds regarding length.

That could definitely be improved: adherence to the system prompt could be strengthened, and user-prompt following too.
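
For illustration, a judge call along those lines could look like the sketch below. This is just a hand-rolled example using the OpenAI Python client, not anything from distilabel: the rubric wording, the 1-10 scale, and the function name are mine, and the column names (system, input, chosen) are assumptions that should be checked against the actual dataset schema.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LENGTH_RUBRIC = (
    "Rate from 1 to 10 how well the length of the response matches the user's "
    "explicit (stated) or implicit (intended) needs. Implicit: simple questions "
    "need shorter answers, more complex questions need longer ones. "
    "Reply with the number only."
)

def rate_length_fit(system_prompt: str, question: str, response: str) -> int:
    """Ask a judge model to score one response on the length dimension."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": LENGTH_RUBRIC},
            {
                "role": "user",
                "content": (
                    f"System prompt:\n{system_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    f"Response:\n{response}"
                ),
            },
        ],
    )
    return int(completion.choices[0].message.content.strip())

# Score the 'chosen' response of the first row (column names are assumptions).
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
row = dataset[0]
print(rate_length_fit(row["system"], row["input"], row["chosen"]))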

2) Summarizing (entity density)

[screenshot attachment]

If given the two, I would choose the 'chosen' one too!

But it's clear that the 'chosen' answer could be improved as well (and in this case the 'rejected' one gives hints).

'Chosen' follows the "one sentence" instruction well, but 'rejected' is actually better at entity extraction and information density.

It's possible to improve it by prompting GPT-4:

[screenshot attachment]

https://sharegpt.com/c/UQwU79R
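
In code, that rewrite step could be as simple as the sketch below. The prompt wording and function name are mine, written in the spirit of Chain of Density with the OpenAI Python client; it is not the exact prompt from the ShareGPT link above.

from openai import OpenAI

client = OpenAI()

DENSIFY_PROMPT = (
    "Rewrite the summary so it stays a single sentence but includes more of the "
    "named entities and concrete facts from the source text. Do not add anything "
    "that is not in the source."
)

def densify_summary(source_text: str, summary: str) -> str:
    """Ask GPT-4 for a denser one-sentence summary of the same source."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": DENSIFY_PROMPT},
            {"role": "user", "content": f"Source text:\n{source_text}\n\nSummary:\n{summary}"},
        ],
    )
    return completion.choices[0].message.content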

A relevant article on the topic of summarization & entity density improvements:

[Article: Smarter Summaries w/ Finetuning GPT-3.5 and Chain of Density]

A fine-tuned GPT-3.5 can match GPT-4 performance (with only 20 examples, 5 epochs)!

This gave me the idea that there might be low-hanging fruit here in fine-tuning smaller models with quality examples.

3) More?

And it doesn't just apply to summarization.

We simply need to ensure there are quality examples for different use cases in the dataset.


That's a very cool idea @gblazex and certainly something one could try to tackle by combining Argilla and distilabel!

This could be decomposed into the following steps:

  1. Finding candidate responses for rewriting: thanks for sharing some examples via screenshots; one could use the Argilla UI to flag these candidates more easily, as we've done for UltraFeedback. Another option is to just rewrite low-rated responses, but I checked and some of the examples you shared got a high score. Yet another option is simply attempting to improve every response.

  2. Improving the response: this is very easy to do now that we have the critique text in the dataset. With distilabel one can define a custom text generation task that receives the instruction, the original response, and the critique, and asks the LLM to provide an improved response. It would be a few lines of code (a rough sketch follows below).
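
This is not the actual distilabel task API, just a plain-Python sketch of step 2 showing the prompt wiring with the OpenAI client. The column names (input, chosen, rationale) are assumptions; check the dataset card for the real schema.

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()

def improve_response(instruction: str, response: str, critique: str) -> str:
    """Rewrite a response so that it addresses the critique."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the response so that it fully addresses the critique "
                    "while staying faithful to the instruction."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Instruction:\n{instruction}\n\n"
                    f"Original response:\n{response}\n\n"
                    f"Critique:\n{critique}"
                ),
            },
        ],
    )
    return completion.choices[0].message.content

# Column names here are assumptions; adjust them to the actual dataset schema.
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")
row = dataset[0]
improved = improve_response(row["input"], row["chosen"], row["rationale"])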

happy to discuss this further!
