DPO Trainer

TRL supports the DPO Trainer for training language models from preference data, as described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al., 2023. For a full example have a look at examples/scripts/dpo.py.

The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.

How DPO works

Fine-tuning a language model via DPO consists of two steps and is easier than PPO:

Data collection: Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
Optimization: Maximize the log-likelihood of the DPO loss directly.

DPO-compatible datasets can be found with the tag dpo on Hugging Face Hub.

This process is illustrated in the sketch below (from figure 1 of the original paper):

Read more about DPO algorithm in the original paper.

Expected dataset format

The DPO trainer expects a very specific format for the dataset. Since the model will be trained to directly optimize the preference of which sentence is the most relevant, given two sentences. We provide an example from the Anthropic/hh-rlhf dataset below:

Therefore the final dataset object should contain these 3 entries if you use the default DPODataCollatorWithPadding data collator. The entries should be named:

prompt
chosen
rejected

for example:

dpo_dataset_dict = {
    "prompt": [
        "hello",
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
        "Python",
        "Python",
        "Java",
    ],
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",
        "Javascript",
        "C++",
        "C++",
    ],
}

where the prompt contains the context inputs, chosen contains the corresponding chosen responses and rejected contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary’s value arrays.

Expected model format

The DPO trainer expects a model of AutoModelForCausalLM, compared to PPO that expects AutoModelForCausalLMWithValueHead for the value function.

Using the DPOTrainer

For a detailed example have a look at the examples/scripts/dpo.py script. At a high level we need to initialize the DPOTrainer with a model we wish to train, a reference ref_model which we will use to calculate the implicit rewards of the preferred and rejected response, the beta refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the model and ref_model need to have the same architecture (ie decoder only or encoder-decoder).

training_args = DPOConfig(
    beta=0.1,
)
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

After this one can then call:

dpo_trainer.train()

Note that the beta is the temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. We ignore the reference model as beta -> 0.

Loss functions

Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the DPO authors propose the sigmoid loss on the normalized likelihood via the logsigmoid to fit a logistic regression.

The RSO authors propose to use a hinge loss on the normalized likelihood from the SLiC paper. The DPOTrainer can be switched to this loss via the loss_type="hinge" argument and the beta in this case is the reciprocal of the margin.

The IPO authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the loss_type="ipo" argument to the trainer. Note that the beta parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the beta the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike DPO which is summed only).

The cDPO is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability that can be passed to the DPOTrainer via label_smoothing argument (between 0 and 0.5) and then a conservative DPO loss is used. Use the loss_type="cdpo" argument to the trainer to use it.

The KTO authors directly maximize the utility of LLM generations instead of the log-likelihood of preferences. To use preference data with KTO, we recommend breaking up the n preferences into 2n examples and using KTOTrainer (i.e., treating the data like an unpaired feedback dataset). Although it is possible to pass in loss_type="kto_pair" into DPOTrainer, this is a highly simplified version of KTO that we do not recommend in most cases. Please use KTOTrainer when possible.

The BCO authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. The DPOTrainer can be switched to this loss via the loss_type="bco_pair" argument.

Logging

While training and evaluating we record the following reward metrics:

rewards/chosen: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
rewards/rejected: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta
rewards/accuracies: mean of how often the chosen rewards are > than the corresponding rejected rewards
rewards/margins: the mean difference between the chosen and corresponding rejected rewards

Accelerate DPO fine-tuning using unsloth

You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the unsloth library that is fully compatible with SFTTrainer. Currently unsloth supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO listed below:

GPU	Model	Dataset	🤗	🤗 + Flash Attention 2	🦥 Unsloth	🦥 VRAM saved
A100 40G	Zephyr 7b	Ultra Chat	1x	1.24x	1.88x	-11.6%
Tesla T4	Zephyr 7b	Ultra Chat	1x	1.09x	1.55x	-18.6%

First install unsloth according to the official documentation. Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading AutoModelForCausalLM, you just need to load a FastLanguageModel as follows:

import torch
from trl import DPOConfig, DPOTrainer
from unsloth import FastLanguageModel

max_seq_length = 2048 # Supports automatic RoPE Scaling, so choose any number.

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft",
    max_seq_length = max_seq_length,
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage. Can be False.
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
)

training_args = DPOConfig(
    output_dir="./output",
    beta=0.1,
)

dpo_trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()

The saved model is fully compatible with Hugging Face’s transformers library. Learn more about unsloth in their official repository.

Reference model considerations with PEFT

You have three main options (plus several variants) for how the reference model works when using PEFT, assuming the model that you would like to further enhance with DPO was tuned using (Q)LoRA.

Simply create two instances of the model, each loading your adapter - works fine but is very inefficient.
Merge the adapter into the base model, create another adapter on top, then leave the model_ref param null, in which case DPOTrainer will unload the adapter for reference inference - efficient, but has potential downsides discussed below.
Load the adapter twice with different names, then use set_adapter during training to swap between the adapter being DPO’d and the reference adapter - slightly less efficient compared to 2 (~adapter size VRAM overhead), but avoids the pitfalls.

Downsides to merging QLoRA before DPO (approach 2)

As suggested by Benjamin Marie, the best option for merging QLoRA adapters is to first dequantize the base model, then merge the adapter. Something similar to this script.

However, after using this approach, you will have an unquantized base model. Therefore, to use QLoRA for DPO, you will need to re-quantize the merged model or use the unquantized merge (resulting in higher memory demand).

Using option 3 - load the adapter twice

To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in DPOTrainer.

For example:

# Load the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/mixtral-8x7b-v0.1",
    load_in_4bit=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.config.use_cache = False

# Load the adapter.
model = PeftModel.from_pretrained(
    model,
    "/path/to/peft",
    is_trainable=True,
    adapter_name="train",
)
# Load the adapter a second time, with a different name, which will be our reference model.
model.load_adapter("/path/to/peft", adapter_name="reference")

# Initialize the trainer, without a ref_model param.
training_args = DPOConfig(
    model_adapter_name="train",
    ref_adapter_name="reference",
)
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    ...
)

TRL

DPO Trainer

How DPO works

Expected dataset format

Expected model format

Using the DPOTrainer

Loss functions

Logging

Accelerate DPO fine-tuning using unsloth

Reference model considerations with PEFT

Downsides to merging QLoRA before DPO (approach 2)

Using option 3 - load the adapter twice

DPOTrainer

class trl.DPOTrainer

build_tokenized_answer

compute_reference_log_probs

concatenated_forward

concatenated_inputs

dpo_loss

evaluation_loop

get_batch_logps

get_batch_loss_metrics

get_batch_samples

get_eval_dataloader

get_train_dataloader

log

null_ref_context

tokenize_row

DPOConfig

class trl.DPOConfig