Generate a preference dataset with distilabel
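Authors: David Berenstein and Sara Han Díaz
- Libraries: argilla, hf-inference-endpoints
- Components: LoadDataFromHub, TextGeneration, UltraFeedback, GroupColumns, FormatTextGenerationDPO, PreferenceToArgilla, InferenceEndpointsLLM
In this tutorial, we will use distilabel to generate a synthetic preference dataset for DPO, ORPO, or RLHF. distilabel is a synthetic data and AI feedback framework for engineers who need fast, reliable, and scalable pipelines based on verified research papers. Check the documentation for more information.
To generate the responses and evaluate them, we will use the serverless HF Inference API integrated with distilabel. It is free but rate-limited, and it lets you test and evaluate over 150,000 public models, or your own private models, through simple HTTP requests, with fast inference hosted on Hugging Face's shared infrastructure. If you need more compute power, you can deploy your own inference endpoint with Hugging Face Inference Endpoints.
Finally, to further curate the data, we will use Argilla, which lets us provide human feedback on data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. Check the documentation for more information.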
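Getting started
Install the dependencies
To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. Because we will use the free but rate-limited Hugging Face serverless Inference API, we need to install it as an extra distilabel dependency. You can install everything by running the following commands: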
!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"
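Before moving on, it can help to confirm which versions actually got installed. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of distilabel.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this tutorial installs.
for name in ("distilabel", "transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any of them shows up as not installed, re-run the install cell and restart the notebook kernel.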
Let's make the required imports:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback
You will need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.
import os
from huggingface_hub import login
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
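If the HF_TOKEN environment variable is missing, `os.getenv` silently returns `None`. A small guard makes that failure explicit. This is only a sketch: the helper name `require_token` is hypothetical and not part of `huggingface_hub`.

```python
import os


def require_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise a readable error."""
    token = os.getenv(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export it before running this notebook."
        )
    return token
```

You could then call `login(token=require_token(), add_to_git_credential=True)` in place of the bare `os.getenv` lookup.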
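(Optional) Deploy Argilla
You can skip this step or replace it with any other data evaluation tool, but poor data quality will hurt the quality of your model, so we recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following this guide.
You will also need to install Argilla as a distilabel extra.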
!pip install "distilabel[argilla, hf-inference-endpoints]"
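Define the pipeline
To generate our preference dataset, we need to define a Pipeline with all the necessary steps. Below, we go through each step in detail.
Load the dataset
We will use the argilla/10Kprompts-mini dataset from the Hugging Face Hub as source data.
- Component: LoadDataFromHub
- Input columns: instruction and topic, the same as in the loaded dataset
- Output columns: instruction and topic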
load_dataset = LoadDataFromHub(
    repo_id="argilla/10Kprompts-mini",
    num_examples=1,
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
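To make the shape of the data concrete, the sketch below imitates how a loading step hands rows downstream. It assumes that rows travel as batches of dictionaries paired with a flag marking the last batch. The function `iter_batches` and the sample rows are hypothetical stand-ins, not distilabel code.

```python
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]


def iter_batches(
    rows: List[Row], batch_size: int = 2
) -> Iterator[Tuple[List[Row], bool]]:
    """Yield (batch, is_last_batch) pairs over a list of row dictionaries."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        is_last = start + batch_size >= len(rows)
        yield batch, is_last


# Made-up rows with the same two columns as the source dataset.
sample_rows = [
    {"instruction": "Explain DPO in one sentence.", "topic": "ml"},
    {"instruction": "Name three Spanish cities.", "topic": "geography"},
    {"instruction": "What is a tokenizer?", "topic": "nlp"},
]

for batch, is_last in iter_batches(sample_rows):
    print(len(batch), is_last)
```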
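Generate responses
We need to generate responses for the given instructions. We will use two different models available on the Hugging Face Hub through the serverless Inference API: meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mixtral-8x7B-Instruct-v0.1. We will also specify the generation parameters for each model.
- Component: TextGeneration task with LLMs called through InferenceEndpointsLLM
- Input columns: instruction
- Output columns: generation, distilabel_metadata, and model_name for each model
Depending on your use case, and to improve the results, you can use any other LLM of your choice.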
>>> generate_responses = [
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
...     TextGeneration(
...         llm=InferenceEndpointsLLM(
...             model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
...             generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
...         ),
...         pipeline=Pipeline(name="showcase-pipeline"),
...     ),
... ]
>>> for task in generate_responses:
...     task.load()
...     print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. 
**San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}] [{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. 
Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. 
Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]
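The raw output above is hard to scan. As a rough sketch, a small helper can print one short line per model. It assumes each row carries the `generation` and `model_name` keys shown in the output; the helper name `preview` is hypothetical.

```python
def preview(rows, max_chars=80):
    """Return one 'model: shortened text' line per generated row."""
    lines = []
    for row in rows:
        # Collapse newlines and repeated spaces so the text fits on one line.
        text = " ".join(row["generation"].split())
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"{row['model_name']}: {text}")
    return lines


example = [
    {
        "generation": " Here are\n\nsome of the top cities in Spain.",
        "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    }
]
for line in preview(example, max_chars=30):
    print(line)
```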
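Group the responses
The task that evaluates the responses needs a list of generations as input. However, each model's response was saved in the generation column of the subsets text_generation_0 and text_generation_1. We will combine these two columns into a single column in the default subset.
- Component: GroupColumns
- Input columns: generation and model_name from text_generation_0 and text_generation_1
- Output columns: generations and model_names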
group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)
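To illustrate what this grouping amounts to, here is a plain-Python sketch of the same idea. It is not the GroupColumns implementation. The function `group_columns` is a hypothetical stand-in that assumes every branch yields its rows in the same order.

```python
def group_columns(*branches, columns, output_columns):
    """Merge aligned rows from several branches, collecting `columns` into lists."""
    grouped = []
    for rows in zip(*branches):
        # Keep the shared columns (for example `instruction`) from the first branch.
        merged = {k: v for k, v in rows[0].items() if k not in columns}
        for column, output_column in zip(columns, output_columns):
            merged[output_column] = [row[column] for row in rows]
        grouped.append(merged)
    return grouped


branch_0 = [{"instruction": "Capital of Spain?", "generation": "Madrid", "model_name": "llama"}]
branch_1 = [{"instruction": "Capital of Spain?", "generation": "Barcelona", "model_name": "mixtral"}]

print(
    group_columns(
        branch_0,
        branch_1,
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )
)
```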
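Evaluate the responses
To build our preference dataset, we need to evaluate the responses generated by the models. We will use meta-llama/Meta-Llama-3-70B-Instruct for this, applying the UltraFeedback task, which can judge responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness). The code below sets aspect="overall-rating".
- Component: UltraFeedback task with LLMs called through InferenceEndpointsLLM
- Input columns: instruction, generations
- Output columns: ratings, rationales, distilabel_metadata, model_name
Depending on your use case, and to improve the results, you can use any other LLM of your choice.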
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
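Judge models do not always return a usable score for every generation, so it can be worth checking the ratings before building preference pairs. The sketch below assumes that a missing or unparsable score appears as `None` in the `ratings` list. That is an assumption about the output, not documented behaviour, and the helper name is hypothetical.

```python
def has_complete_ratings(row):
    """True if every generation in the row received a non-missing rating."""
    ratings = row.get("ratings") or []
    return len(ratings) == len(row["generations"]) and all(
        rating is not None for rating in ratings
    )


evaluated = [
    {"generations": ["Madrid", "Barcelona"], "ratings": [5, 1]},
    {"generations": ["Madrid", "Barcelona"], "ratings": [None, 1]},
    {"generations": ["Madrid", "Barcelona"], "ratings": None},
]

usable = [row for row in evaluated if has_complete_ratings(row)]
print(len(usable), "of", len(evaluated), "rows have complete ratings")
```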
Convert to a preference dataset
- You can automatically convert it to a preference dataset with the chosen and rejected columns.
  - Component: FormatTextGenerationDPO step
  - Input columns: instruction, generations, generation_models, ratings
  - Output columns: prompt, prompt_id, chosen, chosen_model, chosen_rating, rejected, rejected_model, rejected_rating
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)
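The idea behind this conversion can be sketched in a few lines of plain Python. This is a simplified illustration, not the FormatTextGenerationDPO implementation. It assumes the highest-rated generation becomes chosen and the lowest-rated becomes rejected, keeps both as plain strings, and derives prompt_id as a SHA-256 hash of the prompt. The function name `to_dpo_row` is hypothetical.

```python
import hashlib


def to_dpo_row(row):
    """Build a chosen/rejected pair from one rated row (simplified sketch)."""
    ratings = row["ratings"]
    indices = range(len(ratings))
    chosen_idx = max(indices, key=ratings.__getitem__)
    rejected_idx = min(indices, key=ratings.__getitem__)
    prompt = row["instruction"]
    return {
        "prompt": prompt,
        # Assumption: a stable id derived from the prompt text.
        "prompt_id": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "chosen": row["generations"][chosen_idx],
        "chosen_model": row["generation_models"][chosen_idx],
        "chosen_rating": ratings[chosen_idx],
        "rejected": row["generations"][rejected_idx],
        "rejected_model": row["generation_models"][rejected_idx],
        "rejected_rating": ratings[rejected_idx],
    }


pair = to_dpo_row(
    {
        "instruction": "What's the capital of Spain?",
        "generations": ["Madrid", "Barcelona"],
        "generation_models": ["Meta-Llama-3-8B-Instruct", "Mixtral-8x7B-Instruct-v0.1"],
        "ratings": [5, 1],
    }
)
print(pair["chosen"], pair["rejected"])
```

With tied ratings this sketch would pick the same generation as both chosen and rejected, so a real pipeline would want to drop such rows.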
- Alternatively, you can use Argilla to manually label the data and convert it to a preference dataset.
  - Component: PreferenceToArgilla step
  - Input columns: instruction, generations, generation_models, ratings
  - Output columns: instruction, generations, generation_models, ratings
to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2,
)
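Rather than pasting the API URL and key into the notebook, you may prefer to read them from the environment. The sketch below shows one way to do that. The variable names ARGILLA_API_URL and ARGILLA_API_KEY and the helper `argilla_settings` are assumptions made for illustration.

```python
import os


def argilla_settings(env=None):
    """Collect the Argilla connection settings from environment variables."""
    env = os.environ if env is None else env
    names = ("ARGILLA_API_URL", "ARGILLA_API_KEY")  # assumed variable names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"api_url": env["ARGILLA_API_URL"], "api_key": env["ARGILLA_API_KEY"]}
```

The resulting dictionary can be unpacked into the step, for example `PreferenceToArgilla(dataset_name="preference-dataset", dataset_workspace="argilla", num_generations=2, **argilla_settings())`.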
Run the pipeline
Below, you can see the full pipeline definition:
with Pipeline(name="generate-dataset") as pipeline:
    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")
    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]
    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )
    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )
    format_dpo = FormatTextGenerationDPO()
    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )
    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)
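The connect calls above describe a small directed graph: one loader fans out to two generators, which merge into the grouping step, then the evaluator, then two sinks. As an illustrative sketch, the same topology can be written down and ordered with the standard library. The node labels simply reuse the variable and subset names from this tutorial, and nothing here calls distilabel.

```python
from graphlib import TopologicalSorter

# Each key feeds its output into the steps listed as its value.
edges = {
    "load_dataset": ["text_generation_0", "text_generation_1"],
    "text_generation_0": ["group_responses"],
    "text_generation_1": ["group_responses"],
    "group_responses": ["evaluate_responses"],
    "evaluate_responses": ["format_dpo", "to_argilla"],
}


def execution_order(edges):
    """Return one valid order in which the steps could run."""
    sorter = TopologicalSorter()
    for source, targets in edges.items():
        for target in targets:
            sorter.add(target, source)  # target depends on source
    return list(sorter.static_order())


print(execution_order(edges))
```

Both generators must finish a batch before grouping can run, and the two final steps depend only on the evaluator, so they are independent of each other.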
Let's now run the pipeline and generate the preference dataset.
distiset = pipeline.run()
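Let's check the preference dataset! If you have loaded the data into Argilla, you can start annotating in the Argilla UI.
You can push the dataset to the Hub to share it with the community and embed it to explore the data.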
distiset.push_to_hub("[your-owner-name]/example-preference-dataset")
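Conclusions
In this tutorial, we walked through the detailed steps to build a pipeline for generating a preference dataset with distilabel. You can customize this pipeline for your own use cases, share your datasets with the community through the Hugging Face Hub, or use them to train a model for DPO or ORPO.
We used a dataset containing prompts to generate responses with two different models through the serverless Hugging Face Inference API. Next, we evaluated the responses with a third model, following the UltraFeedback criteria. Finally, we converted the data to a preference dataset and used Argilla for further curation.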