🏷️ Build AI Feedback (AIF) datasets for LLM alignment with ⚗️ distilabel

Community Article · Published December 5, 2023

TL;DR

distilabel is an AI Feedback (AIF) framework to both generate and label datasets using LLMs, and thanks to its extensibility you can use it to generate any synthetic dataset using popular LLM engines such as 🤗 transformers, 🤗 Inference Endpoints, vLLM, llama.cpp, and more. HelpSteer is a helpfulness dataset released by NVIDIA, in which human annotators evaluate prompt-response pairs across several areas besides helpfulness: correctness, verbosity, coherence, and complexity. This post explains how to use distilabel to generate a HelpSteer-like dataset for LLM alignment with AIF instead of human annotations, while also showcasing how to integrate human feedback via the Argilla integration.

Introduction

Recent models such as HuggingFaceH4/zephyr-7b-beta and allenai/tulu-2-7b have demonstrated that it's possible to build powerful open-source LLMs fine-tuned using Direct Preference Optimization (DPO) with AI Feedback (AIF) datasets. Their approach consists of fine-tuning a strong base LLM using Supervised Fine-Tuning (SFT) and then fine-tuning on top of it using DPO for intent alignment, since LLMs produced using SFT alone often perform well on benchmarks but fail to generalize to real-world scenarios, i.e. the intent of the generated text is not aligned with the intent of the natural prompt.

DPO fine-tuning improves LLM alignment in a short time compared to SFT fine-tuning, in this case using synthetically generated datasets for AIF; and there is some exciting research in the AIF space, with some examples being UltraFeedback, JudgeLM, or Prometheus.

However, going beyond research efforts and applying AIF at scale is a different story; that's why distilabel was originally created: to implement AIF methods in a robust, efficient, and scalable way, allowing anyone to build custom synthetic datasets at scale for their own use cases. The only caveat is that, from time to time, LLMs will fail to produce the results we expect, so adding humans-in-the-loop to improve dataset quality is the next big leap for OSS LLMs.

distilabel aims to bridge that gap thanks to its integration with Argilla, a human-in-the-loop data annotation tool that allows anyone to incorporate human feedback into their datasets.

What's distilabel?

distilabel is an AIF framework for both generating and labelling datasets using LLMs; thanks to its extensibility, you can use it to build any synthetic dataset with popular LLM engines such as 🤗 transformers, 🤗 Inference Endpoints, vLLM, llama.cpp, and more.

[Figure: the distilabel workflow — Task, LLMs, and Pipeline]

From the workflow above, let's break down the different steps:

  1. Task: A task is a class that defines the input and output arguments of the LLM, as well as the prompt that will be used to generate the dataset. It's also responsible for parsing the LLM output and returning a dictionary with the output arguments and their respective values.
  2. LLMs: The LLMs are the models that will be used to generate the dataset. They are defined as classes that implement the generate method, which receives a prompt and returns the generated output. Wrappers are provided for popular frameworks / engines such as 🤗 transformers, 🤗 Inference Endpoints, vLLM, llama.cpp, and more.
  3. Pipeline: A pipeline is a class that orchestrates the generation, the labelling, or both combined, over the provided 🤗 datasets. It's responsible for generating the prompts, generating the dataset, labelling the dataset, and finally returning the labelled dataset. It implements some safeguards so as not to break during any of the steps, seeking robustness, since LLMs don't always produce the output we expect.

Below you will find an end-to-end example of how to generate a dataset similar to nvidia/HelpSteer, but using AIF instead of human annotations.

Installation

To install it you can use pip as follows, which will also install the openai and argilla extras, required for the OpenAI integration (used to collect AIF) and the Argilla integration (used to export the dataset to Argilla), respectively. Note that 🤗 datasets is a core dependency, so it doesn't need to be installed via any extra, while some other packages do.

pip install distilabel[openai,argilla]

Build a HelpSteer-like AIF dataset

Let's briefly introduce the HelpSteer dataset: a helpfulness dataset released by NVIDIA in which human annotators evaluate prompt-response pairs across several areas besides helpfulness: correctness, verbosity, coherence, and complexity. The dataset contains a set of prompts, and each prompt can have up to 4 responses, generated with NVIDIA's in-house 43B-parameter LLM.

More information about the data collection and the annotation process can be found at nvidia/HelpSteer.

1. Defining the HelpSteerTask

As previously mentioned, let's start with the definition of the HelpSteerTask: we need to inherit from distilabel.tasks.Task and implement the input_args_names and output_args_names properties, as well as the generate_prompt and parse_output methods. Optionally, we can also implement the to_argilla_dataset and to_argilla_record methods, so that later on we can easily export the labelled dataset to Argilla.

Ideally, since the prompt defines how the LLM will later process the information, we should experiment and do a bit of prompt engineering before settling on it. In this case we use a prompt that asks the LLM to evaluate a prompt-response pair across the different areas presented in HelpSteer, using XML formatting, as that proved to work best after prompt-engineering via OpenAI's Playground, also following the notes described at OpenAI - Prompt Engineering.

import re
from typing import Any, Dict, List

import argilla as rg
from distilabel.tasks import Task


class HelpSteerTask(Task):
    system_prompt: str = (
        "You are a helpful assistant and you are asked to evaluate a generated"
        " response for a given prompt.The aspects that you need to evaluate are:"
        " correctness, coherence, complexity, verbosity, and helpfulness; the"
        " ratings are integers that go from 0 to 4, meaning the higher the"
        " better. You should expect and produce the following:\n\n<prompt>..."
        "</prompt>\n<response>...</response>\n<correctness>X</correctness>\n"
        "<coherence>X</coherence>\n<complexity>X</complexity>\n<verbosity>X"
        "</verbosity>\n<helpfulness>X</helpfulness>\n"
    )

    @property
    def input_args_names(self) -> List[str]:
        return ["prompt", "response"]

    @property
    def output_args_names(self) -> List[str]:
        return ["correctness", "coherence", "complexity", "verbosity", "helpfulness"]

    def generate_prompt(self, prompt: str, response: str) -> List[Dict[str, str]]:
        return [
            {
                "role": "system",
                "content": self.system_prompt,
            },
            {
                "role": "user",
                "content": f"<prompt>{prompt}</prompt>/n<response>{response}</response>"
            },
        ]

    def parse_output(self, output: str) -> Dict[str, int]:
        matches = re.findall(r"<(\w+)>(\d+)</\1>", output)
        return dict((key, int(value)) for key, value in matches)

    # Optional methods to export to Argilla later-on
    def to_argilla_dataset(self, dataset_row: Dict[str, Any]) -> rg.FeedbackDataset:
        return rg.FeedbackDataset(
            fields=[
                rg.TextField(name=input_arg_name) for input_arg_name in self.input_args_names
            ],
            questions=[
                # We need to shift the ratings 1 to the right, as Argilla won't allow 0, so the Likert 5 scale has to be 1-5 instead of 0-4
                rg.RatingQuestion(name=output_arg_name, values=[1, 2, 3, 4, 5]) for output_arg_name in self.output_args_names
            ],
            guidelines="https://huggingface.co/datasets/nvidia/HelpSteer",
        )

    # Optional methods to export to Argilla later-on
    def to_argilla_record(self, dataset_row: Dict[str, Any]) -> rg.FeedbackRecord:
        return rg.FeedbackRecord(
            fields=dict((input_arg_name, dataset_row[input_arg_name]) for input_arg_name in self.input_args_names),
            suggestions=[
                # We need to shift the ratings 1 to the right, as Argilla won't allow 0, so the Likert 5 scale has to be 1-5 instead of 0-4
                rg.SuggestionSchema(question_name=output_arg_name, value=dataset_row[output_arg_name] + 1) for output_arg_name in self.output_args_names
            ],
        )
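
Before spending any LLM calls, it can help to sanity-check the task locally. The snippet below is just a quick illustration with a made-up prompt-response pair and a mock completion, showing generate_prompt and parse_output in action:

task = HelpSteerTask()

# Build the chat-formatted prompt for a toy prompt-response pair
print(task.generate_prompt(prompt="What is the capital of France?", response="The capital of France is Paris."))

# Parse a mock completion to check that the regex extracts the five ratings
mock_output = (
    "<correctness>4</correctness>\n<coherence>4</coherence>\n<complexity>1</complexity>\n"
    "<verbosity>1</verbosity>\n<helpfulness>4</helpfulness>"
)
print(task.parse_output(mock_output))
# {'correctness': 4, 'coherence': 4, 'complexity': 1, 'verbosity': 1, 'helpfulness': 4}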

2. Loading the HelpSteer dataset from the Hub

Then we load the dataset from the HuggingFace Hub and keep just the first 1000 rows, as this is only intended to showcase distilabel on a small and reproducible subset. Besides that, we also remove the columns that were generated by Scale AI during the human annotation process, as those are the ones we want to generate using AIF instead.

from datasets import load_dataset

dataset = load_dataset("nvidia/HelpSteer", split="train[:1000]")

dataset = dataset.remove_columns(column_names=["helpfulness", "correctness", "coherence", "complexity", "verbosity"])
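
As a quick check, the remaining columns should now match the input_args_names of our HelpSteerTask:

print(dataset.column_names)
# ['prompt', 'response']

print(dataset[0]["prompt"][:100])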

3. Running the Pipeline

Then we need to define the LLM and the Pipeline to run the generation of the labelled dataset.

We'll start by defining the LLM using distilabel.llm.OpenAILLM, which expects us to provide the task to be used (the one already defined above) as well as some generation kwargs, such as the model to use, the max_new_tokens to generate, the temperature, and the num_threads for generating the dataset in parallel.

Due to the high cost of OpenAI compared to serving other OSS solutions, our plan is to use OSS models for AIF, as well as training our own, so as to give users more flexibility and alternatives when it comes to defining the LLM to be used for AIF; but at least during the first iterations we'll rely on OpenAI, as it showed the strongest results when it comes to labelling, i.e. generating structured outputs. One nice example of an open-source LLM for collecting AIF is kaist-ai/prometheus-7b-v1.0, but it was not a good fit for this task.
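
As a rough illustration of what that swap could look like, here is a hypothetical sketch using the 🤗 transformers wrapper; the TransformersLLM class name, its arguments, and the placeholder model id are assumptions on our side, so double-check the distilabel docs for the exact interface:

from transformers import AutoModelForCausalLM, AutoTokenizer

from distilabel.llm import TransformersLLM

# Placeholder model id: any open-source model able to follow the XML rating format
model_id = "<YOUR-FAVOURITE-OSS-LABELLER>"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical labeller built on a local open-source model instead of OpenAI
oss_labeller = TransformersLLM(
    model=model,
    tokenizer=tokenizer,
    task=HelpSteerTask(),
    max_new_tokens=128,
    temperature=0.0,
)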

For this post we stick with OpenAI: we provide the OpenAILLM to the Pipeline as the labeller and call the generate method with the previously loaded 🤗 Dataset as the dataset argument. As a result we get the labelled dataset, a class that extends 🤗 Dataset and contains some pre-defined methods to ease the export of the AIF dataset to Argilla, in case the user wants to add humans-in-the-loop to obtain higher-quality data.

import os
from distilabel.llm import OpenAILLM
from distilabel.pipeline import Pipeline

pipeline = Pipeline(
    labeller=OpenAILLM(
        task=HelpSteerTask(),
        model="gpt-4",
        max_new_tokens=128,  # the expected XML-formatted output only needs ~43 tokens
        temperature=0.0,
        num_threads=4,
        openai_api_key=os.getenv("OPENAI_API_KEY") or "sk-...",
    ),
)

dataset = pipeline.generate(dataset=dataset, display_progress_bar=True, verbose=False)
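
Once the generation finishes, we can take a quick look at the labelled dataset. The exact column layout may vary slightly across distilabel versions, but the ratings parsed via parse_output should be there; and since the result still behaves like a 🤗 Dataset, it can be pushed to the Hub as usual (the repo id below is just a placeholder):

# Inspect the columns and the first labelled row
print(dataset.column_names)
print(dataset[0])

# Optionally share the AIF-labelled dataset on the Hub (placeholder repo id)
dataset.push_to_hub("my-username/HelpSteer-AIF")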

4. Exporting to Argilla (optional)

This step is optional, but if you don't have an Argilla instance up and running, feel free to get started with the Argilla template within HuggingFace Spaces.

Additionally, we can export the labelled dataset to Argilla via the to_argilla method, as long as we implemented the to_argilla_dataset and to_argilla_record methods within the HelpSteerTask class. The to_argilla method returns the dataset formatted as an rg.FeedbackDataset with its records formatted as rg.FeedbackRecords, so the dataset can be pushed to Argilla in one line of code.

import argilla as rg

rg.init(api_url="<ARGILLA_API_URL>", api_key="<ARGILLA_API_KEY>")

rg_dataset = dataset.to_argilla()
rg_dataset.push_to_argilla(name="HelpSteer-AIF", workspace="admin")
[Screenshot: the HelpSteer-AIF dataset in the Argilla UI]
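
Once annotators have reviewed the AIF suggestions in the UI, the dataset can be pulled back with the Argilla client to combine the human responses with the model suggestions. A minimal sketch, assuming the Argilla 1.x Python client:

# Retrieve the remote dataset once annotators have submitted (some) responses
remote_dataset = rg.FeedbackDataset.from_argilla(name="HelpSteer-AIF", workspace="admin")

for record in remote_dataset.records:
    # Each record keeps the original fields, the AIF suggestions, and any human responses
    print(record.fields["prompt"][:80])
    print([(s.question_name, s.value) for s in record.suggestions])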

Colab Notebook

In order to reproduce the HelpSteer dataset labelling using AIF, you can either use the code snippets above or the following Google Colab Notebook.

Open In Colab

What's next?

We would love to get your feedback on distilabel, as well as requests for new examples, new features, etc., so we can keep pushing our AIF framework forward and make generating and labelling datasets with AIF easier and simpler than ever!

Here's what we're planning to work on next:

  • Adding a model pooling mechanism to generate responses using more than one LLM at a time, so that we get a more diverse set of responses to evaluate.
  • Exploring some OSS LLMs to act as labellers as opposed to the default OpenAI models we use (you could actually use any OSS model out there as a labeller within distilabel, but we haven't tested it yet).
  • Including more examples and use cases, which also implies extending our docs.
  • Looking for performance improvements to make generation and labelling faster.

Besides that we may work on smaller features and/or bug fixes that are relevant, but feel free to open issues or start discussions to engage with us!
