Clean an Existing Preference Dataset with LLMs as Judges
Authored by: David Berenstein and Sara Han Díaz
- Libraries: argilla, hf-inference-endpoints
- Components: LoadDataFromDicts, UltraFeedback, KeepColumns, PreferenceToArgilla, InferenceEndpointsLLM, GlobalStep
In this tutorial, we’ll use distilabel to clean a dataset using LLMs as judges, which provide AI feedback on the quality of the data. distilabel is a synthetic data and AI feedback framework for engineers who need fast, reliable and scalable pipelines based on verified research papers. Check the documentation here.
To evaluate the responses, we will use the serverless HF Inference API integrated with distilabel. This is free but rate-limited, allowing you to test and evaluate over 150,000 public models, or your own private models, via simple HTTP requests, with fast inference hosted on Hugging Face shared infrastructure. If you need more compute power, you can deploy your own inference endpoint with Hugging Face Inference Endpoints.
Finally, to further curate the data, we will use Argilla, which allows us to provide human feedback on the data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. Check the documentation here.
Getting Started
Install the dependencies
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip.
!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"
Let’s make the required imports:
import random
from datasets import load_dataset
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    KeepColumns,
    LoadDataFromDicts,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import UltraFeedback
You’ll need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os
from huggingface_hub import login
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
(optional) Deploy Argilla
You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.
Along with that, you will need to install Argilla as a distilabel extra.
!pip install "distilabel[argilla, hf-inference-endpoints]"
The dataset
In this case, we will clean a preference dataset, so we will use the Intel/orca_dpo_pairs dataset from the Hugging Face Hub.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train[:20]")
Next, we will shuffle the chosen and rejected columns to avoid any bias in the dataset.
def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}
dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))
dataset = dataset.to_list()
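As a quick sanity check, you can inspect the first record to confirm that generations holds the shuffled pair and order tracks which position each response came from. This is a minimal, illustrative snippet; the exact responses and ordering will vary from run to run.

# Inspect the first shuffled record (illustrative; actual content will differ)
sample = dataset[0]
print(sample["order"])                  # e.g. ['rejected', 'chosen']
print(sample["generations"][0][:100])   # first response in shuffled order, truncated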
(optional) Create a custom step
A step is a block in a distilabel pipeline used to manipulate, generate, or evaluate data, among other tasks. A set of predefined steps is provided, but you can also create your own custom steps. Instead of preprocessing the data as in the previous section, it is possible to use a custom step to shuffle the columns. This step should be in a separate module to be imported and used in the pipeline. In this case, the pipeline would start by loading the orca_dpo_pairs dataset using the LoadDataFromHub step and then applying the ShuffleStep.
# "shuffle_step.py"
from typing import TYPE_CHECKING, List
from distilabel.steps import GlobalStep, StepInput
if TYPE_CHECKING:
from distilabel.steps.typing import StepOutput
import random
class ShuffleStep(GlobalStep):
@property
def inputs(self) -> List[str]:
return ["instruction", "chosen", "rejected"]
@property
def outputs(self) -> List[str]:
return ["instruction", "generations", "order"]
def process(self, inputs: StepInput) -> "StepOutput":
outputs = []
for input in inputs:
chosen = input["chosen"]
rejected = input["rejected"]
pair = [chosen, rejected]
random.shuffle(pair)
order = ["chosen" if x == chosen else "rejected" for x in pair]
outputs.append({"instruction": input["instruction"], "generations": pair, "order": order})
yield outputs
from shuffle_step import ShuffleStep
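For reference, here is a minimal sketch of how ShuffleStep could be wired after LoadDataFromHub. The pipeline name, num_examples value, and the output_mappings rename are illustrative assumptions, not part of the tutorial's final pipeline.

# Minimal sketch: load from the Hub, then shuffle (illustrative wiring)
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub

with Pipeline(name="shuffle-showcase") as sketch_pipeline:
    load_step = LoadDataFromHub(
        repo_id="Intel/orca_dpo_pairs",
        split="train",
        num_examples=20,  # illustrative: keep the sketch small
        output_mappings={"question": "instruction"},  # ShuffleStep expects "instruction"
    )
    shuffle_step = ShuffleStep()
    load_step.connect(shuffle_step)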
Define the pipeline
To clean an existing preference dataset, we will need to define a Pipeline with all the necessary steps. However, a similar workflow can be used to clean an SFT dataset. Below, we will go over each step in detail.
Load the dataset
We will use the dataset we just shuffled as source data.
- Component: LoadDataFromDicts
- Input columns: system, question, chosen, rejected, generations and order, the same keys as in the loaded list of dictionaries.
- Output columns: system, instruction, chosen, rejected, generations and order. We will use output_mappings to rename the columns.
load_dataset = LoadDataFromDicts(
    data=dataset[:1],
    output_mappings={"question": "instruction"},
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
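For reference, a generator step yields a tuple of (batch, last_batch flag), so you should see something like the following sketch, with values abridged:

# Illustrative output shape (abridged; actual content comes from the dataset):
# (
#     [
#         {
#             "system": "...",
#             "instruction": "...",
#             "chosen": "...",
#             "rejected": "...",
#             "generations": ["...", "..."],
#             "order": ["chosen", "rejected"],
#         }
#     ],
#     True,  # whether this is the last batch
# )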
Evaluate the responses
To evaluate the quality of the responses, we will use meta-llama/Meta-Llama-3.1-70B-Instruct, applying the UltraFeedback task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). For an SFT dataset, you can use PrometheusEval instead.

- Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
- Input columns: instruction, generations
- Output columns: ratings, rationales, distilabel_metadata, model_name

For your use case and to improve the results, you can use any other LLM of your choice.
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
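The task returns the input record augmented with the evaluation columns. Roughly, and hedging on the exact wording of the rationales and on distilabel_metadata being attached in recent versions, the output looks like this:

# Illustrative output shape (ratings and rationales are model-generated and will vary):
# [{'instruction': "What's the capital of Spain?",
#   'generations': ['Madrid', 'Barcelona'],
#   'ratings': [5, 1],
#   'rationales': ['...', '...'],
#   'distilabel_metadata': {...},
#   'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]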
Keep only the required columns
We will get rid of the unneeded columns.
- Component: KeepColumns
- Input columns: system, instruction, chosen, rejected, generations, ratings, rationales, distilabel_metadata and model_name
- Output columns: instruction, generations, order, ratings, rationales and model_name
keep_columns = KeepColumns(
    columns=[
        "instruction",
        "generations",
        "order",
        "ratings",
        "rationales",
        "model_name",
    ],
    pipeline=Pipeline(name="showcase-pipeline"),
)
keep_columns.load()
next(
    keep_columns.process(
        [
            {
                "system": "",
                "instruction": "What's the capital of Spain?",
                "chosen": "Madrid",
                "rejected": "Barcelona",
                "generations": ["Madrid", "Barcelona"],
                "order": ["chosen", "rejected"],
                "ratings": [5, 1],
                "rationales": ["", ""],
                "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            }
        ]
    )
)
(Optional) Further data curation
You can use Argilla to further curate your data.
- Component: PreferenceToArgilla step
- Input columns: instruction, generations, generation_models, ratings
- Output columns: instruction, generations, generation_models, ratings
to_argilla = PreferenceToArgilla(
    dataset_name="cleaned-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)
Run the pipeline
Below, you can see the full pipeline definition:
with Pipeline(name="clean-dataset") as pipeline:
load_dataset = LoadDataFromDicts(data=dataset, output_mappings={"question": "instruction"})
evaluate_responses = UltraFeedback(
aspect="overall-rating",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
),
)
keep_columns = KeepColumns(
columns=[
"instruction",
"generations",
"order",
"ratings",
"rationales",
"model_name",
]
)
to_argilla = PreferenceToArgilla(
dataset_name="cleaned-dataset",
dataset_workspace="argilla",
api_url="https://[your-owner-name]-[your-space-name].hf.space",
api_key="[your-api-key]",
num_generations=2,
)
load_dataset.connect(evaluate_responses)
evaluate_responses.connect(keep_columns)
keep_columns.connect(to_argilla)
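As a side note, recent distilabel releases also support an operator shorthand for the same wiring; a minimal sketch, assuming your installed version supports it:

# Equivalent to the three connect() calls above (recent distilabel releases):
# load_dataset >> evaluate_responses >> keep_columns >> to_argilla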
Let’s now run the pipeline and clean our preference dataset.
distiset = pipeline.run()
Let’s check it! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub for sharing with the community and embed it to explore the data.
distiset.push_to_hub("[your-owner-name]/example-cleaned-preference-dataset")
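If you want to reload the pushed dataset later, you can do so with the datasets library. Note that a Distiset is pushed with one configuration per leaf step, so the configuration name below is an assumption; check the dataset card on the Hub for the actual name.

# Reload the pushed dataset (config name is an assumption; check the repo's dataset card)
from datasets import load_dataset

cleaned = load_dataset(
    "[your-owner-name]/example-cleaned-preference-dataset",
    "default",  # hypothetical config name
    split="train",
)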
Conclusions
In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.
We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.