TRL documentation

DPO Trainer

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

DPO Trainer

TRL supports the DPO Trainer for training language models from preference data, as described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al., 2023. For a full example have a look at examples/scripts/

The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm.

How DPO works

Fine-tuning a language model via DPO consists of two steps and is easier than PPO:

  1. Data collection: Gather a preference dataset with positive and negative selected pairs of generation, given a prompt.
  2. Optimization: Maximize the log-likelihood of the DPO loss directly.

DPO-compatible datasets can be found with the tag dpo on Hugging Face Hub.

This process is illustrated in the sketch below (from figure 1 of the original paper):

Screenshot 2024-03-19 at 12 39 41

Read more about DPO algorithm in the original paper.

Expected dataset format

The DPO trainer expects a very specific format for the dataset. Since the model will be trained to directly optimize the preference of which sentence is the most relevant, given two sentences. We provide an example from the Anthropic/hh-rlhf dataset below:

Therefore the final dataset object should contain these 3 entries if you use the default DPODataCollatorWithPadding data collator. The entries should be named:

  • prompt
  • chosen
  • rejected

for example:

dpo_dataset_dict = {
    "prompt": [
        "how are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    "chosen": [
        "hi nice to meet you",
        "I am fine",
        "My name is Mary",
        "My name is Mary",
    "rejected": [
        "leave me alone",
        "I am not fine",
        "Whats it to you?",
        "I dont have a name",

where the prompt contains the context inputs, chosen contains the corresponding chosen responses and rejected contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary’s value arrays.

Expected model format

The DPO trainer expects a model of AutoModelForCausalLM, compared to PPO that expects AutoModelForCausalLMWithValueHead for the value function.

Using the DPOTrainer

For a detailed example have a look at the examples/scripts/ script. At a high level we need to initialize the DPOTrainer with a model we wish to train, a reference ref_model which we will use to calculate the implicit rewards of the preferred and rejected response, the beta refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the model and ref_model need to have the same architecture (ie decoder only or encoder-decoder).

training_args = DPOConfig(
dpo_trainer = DPOTrainer(

After this one can then call:


Note that the beta is the temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. We ignore the reference model as beta -> 0.

Loss functions

Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the DPO authors propose the sigmoid loss on the normalized likelihood via the logsigmoid to fit a logistic regression.

The RSO authors propose to use a hinge loss on the normalized likelihood from the SLiC paper. The DPOTrainer can be switched to this loss via the loss_type="hinge" argument and the beta in this case is the reciprocal of the margin.

The IPO authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the loss_type="ipo" argument to the trainer. Note that the beta parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the beta the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike DPO which is summed only).

The cDPO is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability that can be passed to the DPOTrainer via label_smoothing argument (between 0 and 0.5) and then a conservative DPO loss is used. Pass the label_smoothing argument to the trainer to use it (default is 0).

The Robust DPO authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, assume that the preference labels are noisy with some probability that can be passed to the DPOTrainer via label_smoothing argument (between 0 and 0.5). Use loss_type="robust" to the trainer to use it.

The KTO authors directly maximize the utility of LLM generations instead of the log-likelihood of preferences. To use preference data with KTO, we recommend breaking up the n preferences into 2n examples and using KTOTrainer (i.e., treating the data like an unpaired feedback dataset). Although it is possible to pass in loss_type="kto_pair" into DPOTrainer, this is a highly simplified version of KTO that we do not recommend in most cases. Please use KTOTrainer when possible.

The BCO authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. The DPOTrainer can be switched to this loss via the loss_type="bco_pair" argument.

The SPPO authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation using loss_type=“sppo_hard” approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser.

The NCA authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood.

The TR-DPO paper suggests syncing the reference model weights after every ref_model_sync_steps steps of SGD with weight ref_model_mixup_alpha during DPO training. To toggle this callback use the sync_ref_model flag in the DPOConfig.

The RPO paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this paper that essentially consists of the SFT loss on the chosen preferences together with a weighted DPO loss. To use this loss set the rpo_alpha in the DPOConfig to an appropriate value.


While training and evaluating we record the following reward metrics:

  • rewards/chosen: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses scaled by beta
  • rewards/rejected: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses scaled by beta
  • rewards/accuracies: mean of how often the chosen rewards are > than the corresponding rejected rewards
  • rewards/margins: the mean difference between the chosen and corresponding rejected rewards

Accelerate DPO fine-tuning using unsloth

You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the unsloth library that is fully compatible with SFTTrainer. Currently unsloth supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO listed below:

GPU Model Dataset 🤗 🤗 + Flash Attention 2 🦥 Unsloth 🦥 VRAM saved
A100 40G Zephyr 7b Ultra Chat 1x 1.24x 1.88x -11.6%
Tesla T4 Zephyr 7b Ultra Chat 1x 1.09x 1.55x -18.6%

First install unsloth according to the official documentation. Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading AutoModelForCausalLM, you just need to load a FastLanguageModel as follows:

import torch
from trl import DPOConfig, DPOTrainer
from unsloth import FastLanguageModel

max_seq_length = 2048 # Supports automatic RoPE Scaling, so choose any number.

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft",
    max_seq_length = max_seq_length,
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True, # Use 4bit quantization to reduce memory usage. Can be False.
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Dropout = 0 is currently optimized
    bias = "none",    # Bias = "none" is currently optimized
    use_gradient_checkpointing = True,
    random_state = 3407,

training_args = DPOConfig(

dpo_trainer = DPOTrainer(

The saved model is fully compatible with Hugging Face’s transformers library. Learn more about unsloth in their official repository.

Reference model considerations with PEFT

You have three main options (plus several variants) for how the reference model works when using PEFT, assuming the model that you would like to further enhance with DPO was tuned using (Q)LoRA.

  1. Simply create two instances of the model, each loading your adapter - works fine but is very inefficient.
  2. Merge the adapter into the base model, create another adapter on top, then leave the model_ref param null, in which case DPOTrainer will unload the adapter for reference inference - efficient, but has potential downsides discussed below.
  3. Load the adapter twice with different names, then use set_adapter during training to swap between the adapter being DPO’d and the reference adapter - slightly less efficient compared to 2 (~adapter size VRAM overhead), but avoids the pitfalls.

Downsides to merging QLoRA before DPO (approach 2)

As suggested by Benjamin Marie, the best option for merging QLoRA adapters is to first dequantize the base model, then merge the adapter. Something similar to this script.

However, after using this approach, you will have an unquantized base model. Therefore, to use QLoRA for DPO, you will need to re-quantize the merged model or use the unquantized merge (resulting in higher memory demand).

Using option 3 - load the adapter twice

To avoid the downsides with option 2, you can load your fine-tuned adapter into the model twice, with different names, and set the model/ref adapter names in DPOTrainer.

For example:

# Load the base model.
bnb_config = BitsAndBytesConfig(
model = AutoModelForCausalLM.from_pretrained(
model.config.use_cache = False

# Load the adapter.
model = PeftModel.from_pretrained(
# Load the adapter a second time, with a different name, which will be our reference model.
model.load_adapter("/path/to/peft", adapter_name="reference")

# Initialize the trainer, without a ref_model param.
training_args = DPOConfig(
dpo_trainer = DPOTrainer(


class trl.DPOTrainer

< >

( model: Union = None ref_model: Union = None beta: float = 0.1 label_smoothing: float = 0 loss_type: Literal = 'sigmoid' args: Optional = None data_collator: Optional = None label_pad_token_id: int = -100 padding_value: Optional = None truncation_mode: str = 'keep_end' train_dataset: Optional = None eval_dataset: Union = None tokenizer: Optional = None model_init: Optional = None callbacks: Optional = None optimizers: Tuple = (None, None) preprocess_logits_for_metrics: Optional = None max_length: Optional = None max_prompt_length: Optional = None max_target_length: Optional = None peft_config: Optional = None is_encoder_decoder: Optional = None disable_dropout: bool = True generate_during_eval: bool = False compute_metrics: Optional = None precompute_ref_log_probs: bool = False dataset_num_proc: Optional = None model_init_kwargs: Optional = None ref_model_init_kwargs: Optional = None model_adapter_name: Optional = None ref_adapter_name: Optional = None reference_free: bool = False force_use_ref_model: bool = False )


  • model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForSequenceClassification.
  • ref_model (PreTrainedModelWrapper) — Hugging Face transformer model with a casual language modelling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
  • args (DPOConfig) — The DPO config arguments to use for training.
  • data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
  • train_dataset (datasets.Dataset) — The dataset to use for training.
  • eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
  • tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
  • model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
  • callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
  • optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
  • preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
  • peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
  • compute_metrics (Callable[[EvalPrediction], Dict], optional) — The function to use to compute the metrics. Must take a EvalPrediction and return a dictionary string to metric values.

Initialize DPOTrainer.


< >

( prompt answer )

Llama tokenizer does satisfy enc(a + b) = enc(a) + enc(b). It does ensure enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]. Reference:


< >

( padded_batch: Dict )

Computes log probabilities of the reference model for a single padded batch of a DPO specific dataset.


< >

( model: Module batch: Dict )

Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.

We do this to avoid doing two forward passes, because it’s faster for FSDP.


< >

( batch: Dict is_encoder_decoder: bool = False label_pad_token_id: int = -100 padding_value: int = 0 device: Optional = None )

Concatenate the chosen and rejected inputs into a single tensor.


< >

( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor reference_chosen_logps: FloatTensor reference_rejected_logps: FloatTensor ) A tuple of three tensors


A tuple of three tensors

(losses, chosen_rewards, rejected_rewards). The losses tensor contains the DPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.

Compute the DPO loss for a batch of policy and reference model log probabilities.


< >

( dataloader: DataLoader description: str prediction_loss_only: Optional = None ignore_keys: Optional = None metric_key_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with or without labels.


< >

( logits: FloatTensor labels: LongTensor label_pad_token_id: int = -100 is_encoder_decoder: bool = False )

Compute the log probabilities of the given labels under the given logits.


< >

( model batch: Dict train_eval: Literal = 'train' )

Compute the DPO loss and other metrics for the given batch of inputs for train or test.


< >

( model batch: Dict )

Generate samples from the model and reference model for the given batch of inputs.


< >

( eval_dataset: Optional = None )


  • eval_dataset (, optional) — If provided, will override self.eval_dataset. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed. It must implement __len__.

Returns the evaluation

Subclass of transformers.src.transformers.trainer.get_eval_dataloader to precompute ref_log_probs.


< >

( )

Returns the training

Subclass of transformers.src.transformers.trainer.get_train_dataloader to precompute ref_log_probs.


< >

( logs: Dict )


  • logs (Dict[str, float]) — The values to log.

Log logs on the various objects watching training, including stored metrics.


< >

( )

Context manager for handling null reference model (that is, peft adapter manipulation).


< >

( feature model: Union = None )

Tokenize a single row from a DPO specific dataset.

At this stage, we don’t convert to PyTorch tensors yet; we just handle the truncation in case the prompt + chosen or prompt + rejected responses is/are too long. First we truncate the prompt; if we’re still too long, we truncate the chosen/rejected.

We also create the labels for the chosen/rejected responses, which are of length equal to the sum of the length of the prompt and the chosen/rejected response, with label_pad_token_id for the prompt tokens.


class trl.DPOConfig

< >

( output_dir: str overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: Union = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: Optional = None per_gpu_eval_batch_size: Optional = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: Optional = None eval_delay: Optional = 0 learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: Union = 'linear' lr_scheduler_kwargs: Union = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: Optional = 'passive' log_level_replica: Optional = 'warning' log_on_each_node: bool = True logging_dir: Optional = None logging_strategy: Union = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: Union = 'steps' save_steps: float = 500 save_total_limit: Optional = None save_safetensors: Optional = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: Optional = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: Optional = None local_rank: int = -1 ddp_backend: Optional = None tpu_num_cores: Optional = None tpu_metrics_debug: bool = False debug: Union = '' dataloader_drop_last: bool = False eval_steps: Optional = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: Optional = None past_index: int = -1 run_name: Optional = None disable_tqdm: Optional = None remove_unused_columns: Optional = True label_names: Optional = None load_best_model_at_end: Optional = False metric_for_best_model: Optional = None greater_is_better: Optional = None ignore_data_skip: bool = False fsdp: Union = '' fsdp_min_num_params: int = 0 fsdp_config: Union = None fsdp_transformer_layer_cls_to_wrap: Optional = None accelerator_config: Union = None deepspeed: Union = None label_smoothing_factor: float = 0.0 optim: Union = 'adamw_torch' optim_args: Optional = None adafactor: bool = False group_by_length: bool = False length_column_name: Optional = 'length' report_to: Union = None ddp_find_unused_parameters: Optional = None ddp_bucket_cap_mb: Optional = None ddp_broadcast_buffers: Optional = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: Optional = None hub_model_id: Optional = None hub_strategy: Union = 'every_save' hub_token: Optional = None hub_private_repo: bool = False hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: Union = None include_inputs_for_metrics: bool = False eval_do_concat_batches: bool = True fp16_backend: str = 'auto' evaluation_strategy: Union = None push_to_hub_model_id: Optional = None push_to_hub_organization: Optional = None push_to_hub_token: Optional = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: Optional = None ray_scope: Optional = 'last' ddp_timeout: Optional = 1800 torch_compile: bool = False torch_compile_backend: Optional = None torch_compile_mode: Optional = None dispatch_batches: Optional = None split_batches: Optional = None include_tokens_per_second: Optional = False include_num_input_tokens_seen: Optional = False neftune_noise_alpha: Optional = None optim_target_modules: Union = None batch_eval_metrics: bool = False eval_on_start: bool = False beta: float = 0.1 label_smoothing: float = 0 loss_type: Literal = 'sigmoid' label_pad_token_id: int = -100 padding_value: int = 0 truncation_mode: str = 'keep_end' max_length: Optional = None max_prompt_length: Optional = None max_target_length: Optional = None is_encoder_decoder: Optional = None disable_dropout: bool = True generate_during_eval: bool = False precompute_ref_log_probs: bool = False dataset_num_proc: Optional = None model_init_kwargs: Optional = None ref_model_init_kwargs: Optional = None model_adapter_name: Optional = None ref_adapter_name: Optional = None reference_free: bool = False force_use_ref_model: bool = False sync_ref_model: bool = False ref_model_mixup_alpha: float = 0.9 ref_model_sync_steps: int = 64 rpo_alpha: Optional = None )


  • beta (float, defaults to 0.1) — The beta factor in DPO loss. Higher beta means less divergence from the initial policy. For the IPO loss, beta is the regularization parameter denoted by tau in the paper.
  • label_smoothing (float, defaults to 0) — The robust DPO label smoothing parameter from the cDPO report and Robust DPO paper that should be between 0 and 0.5.
  • loss_type (str, defaults to "sigmoid") — The type of DPO loss to use. Either "sigmoid" the default DPO loss,"hinge" loss from SLiC paper, "ipo" from IPO paper, "kto_pair" from the HALOs report, "bco_pair" from BCO paper or "robust" from Robust DPO paper.
  • label_pad_token_id (int, defaults to -100) — The label pad token id. This argument is required if you want to use the default data collator.
  • padding_value (int, defaults to 0) — The padding value if it is different to the tokenizer’s pad_token_id.
  • truncation_mode (str, defaults to keep_end) — The truncation mode to use, either keep_end or keep_start. This argument is required if you want to use the default data collator.
  • max_length (int, defaults to None) — The maximum length of the sequences in the batch. This argument is required if you want to use the default data collator.
  • max_prompt_length (int, defaults to None) — The maximum length of the prompt. This argument is required if you want to use the default data collator.
  • max_target_length (int, defaults to None) — The maximum length of the target. This argument is required if you want to use the default data collator and your model is an encoder-decoder.
  • is_encoder_decoder (Optional[bool], optional, defaults to None) — If no model is provided, we need to know if the model_init returns an encoder-decoder.
  • disable_dropout (bool, defaults to True) — Whether or not to disable dropouts in model and ref_model.
  • generate_during_eval (bool, defaults to False) — Whether to sample and log generations during evaluation step.
  • precompute_ref_log_probs (bool, defaults to False) — Flag to precompute reference model log probabilities for training and evaluation datasets. This is useful if you want to train without the reference model and reduce the total GPU memory needed.
  • dataset_num_proc (Optional[int], optional) — The number of workers to use to tokenize the data. Defaults to None.
  • model_init_kwargs (Optional[Dict], optional) — Dict of Optional kwargs to pass when instantiating the model from a string
  • ref_model_init_kwargs (Optional[Dict], optional) — Dict of Optional kwargs to pass when instantiating the ref model from a string
  • model_adapter_name (str, defaults to None) — Name of the train target PEFT adapter, when using LoRA with multiple adapters.
  • ref_adapter_name (str, defaults to None) — Name of the reference PEFT adapter, when using LoRA with multiple adapters.
  • reference_free (bool defaults to False) — If True, we ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses.
  • force_use_ref_model (bool, defaults to False) — In case one passes a PEFT model for the active model and you want to use a different model for the ref_model, set this flag to True.
  • sync_ref_model (‘bool’, defaults to False) — The flag for syncing reference model during training from the TR-DPO paper.
  • ref_model_mixup_alpha (‘float’, defaults to 1.0) — The alpha parameter from the TR-DPO paper.
  • ref_model_sync_steps (‘int’, defaults to 2) — The tau parameter from the TR-DPO paper.
  • rpo_alpha (‘float’, defaults to None) — The alpha parameter from the RPO paper. If None, no weighting is applied and the loss is the same as the DPO loss.

Initialize DPOConfig.

< > Update on GitHub