Nash-MD Trainer
Overview
Nash-MD was proposed in the paper Nash Learning from Human Feedback by Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, and Andrea Michi.
The abstract from the paper is the following:
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM’s policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
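For readers who want the objective in symbols, the following is a compact, informal restatement of the quantities described above; notation loosely follows the paper, with preference model P, regularization strength τ, reference policy μ, and mirror-descent learning rate η (see the paper for the precise statements):

$$
\mathcal{P}_\tau(\pi \succ \pi') = \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x),\, y' \sim \pi'(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big] - \tau\, \mathrm{KL}(\pi, \mu) + \tau\, \mathrm{KL}(\pi', \mu), \qquad \pi^\star = \arg\max_{\pi} \min_{\pi'} \mathcal{P}_\tau(\pi \succ \pi')
$$

$$
\pi_{t+1}(\cdot \mid x) = \arg\max_{\pi} \Big\{ \eta\, \mathbb{E}_{y \sim \pi,\; y' \sim \pi_t^{\mu}}\big[\mathcal{P}(y \succ y' \mid x)\big] - \mathrm{KL}\big(\pi(\cdot \mid x),\, \pi_t^{\mu}(\cdot \mid x)\big) \Big\}, \qquad \pi_t^{\mu}(y \mid x) \propto \pi_t(y \mid x)^{1 - \eta\tau}\, \mu(y \mid x)^{\eta\tau}
$$

In the implementation, the geometric mixture policy is approximated token by token, and its weight roughly corresponds to the mixture_coef argument of NashMDConfig.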
This post-training method was contributed by Kashif Rasul, Daniil Tiapkin, Pierre Ménard, Daniele Calandriello, and Quentin Gallouédec.
Quick start
This example demonstrates how to train a model using the Nash-MD method. We use the Qwen 0.5B model as the base model and PairRMJudge as a judge. We use the prompts from the UltraFeedback dataset (trl-lib/ultrafeedback-prompt).
Below is the script to train the model:
# train_nash_md.py
from datasets import load_dataset
from trl import NashMDConfig, NashMDTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD", logging_steps=10)
trainer = NashMDTrainer(
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
Execute the script using the following command:
accelerate launch train_nash_md.py
Distributed across 8 GPUs, the training takes approximately 3 hours.
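If your accelerate configuration does not already request multiple processes, the number of GPUs can be passed explicitly on the command line (shown here for the 8-GPU setup mentioned above):
accelerate launch --num_processes 8 train_nash_md.py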
To see how the trained model performs, you can use the Transformers Chat CLI.
$ transformers-cli chat --model_name_or_path trl-lib/Qwen2-0.5B-NashMD
<quentin_gallouedec>:
What is the best programming language?
<trl-lib/Qwen2-0.5B-NashMD>:
The best programming language depends on personal preference, the complexity of the project, and the specific requirements of the task. Some programming languages that are often recommended include Python, Java, and JavaScript, and there are many other languages to choose from depending on individual needs.
Expected dataset type
Nash-MD requires a prompt-only dataset. The NashMDTrainer supports both conversational and standard dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset. Both formats are illustrated in the sketch below.
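For illustration, a prompt-only example looks like this in each format (the entries below are made up for this sketch):
# Standard (prompt-only) format
standard_example = {"prompt": "What is the best programming language?"}

# Conversational (prompt-only) format; the chat template is applied automatically
conversational_example = {"prompt": [{"role": "user", "content": "What is the best programming language?"}]}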
Usage tips
Use a reward model
Instead of a judge, you can choose to use a reward model — see Reward Bench for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the trl-lib/Qwen2-0.5B-Reward model:
- from trl import PairRMJudge
+ from transformers import AutoModelForSequenceClassification
- judge = PairRMJudge()
+ reward_model = AutoModelForSequenceClassification.from_pretrained("trl-lib/Qwen2-0.5B-Reward", num_labels=1)
trainer = NashMDTrainer(
...
- judge=judge,
+ reward_model=reward_model,
)
Make sure that the SFT model and reward model use the same chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training.
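As a rough sanity check (a minimal sketch, not part of the TRL API), you can compare the chat templates of the two tokenizers before training; the checkpoints below are the ones used in this guide:
from transformers import AutoTokenizer

policy_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")

# If the templates differ, the reward model scores completions formatted with a
# template it was not trained on, which can silently skew training.
assert policy_tokenizer.chat_template == reward_tokenizer.chat_template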
Encourage EOS token generation
We may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the max_new_tokens argument of NashMDConfig. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the missing_eos_penalty argument of NashMDConfig:
training_args = NashMDConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
Logging Completions
To better understand your model’s behavior during training, you can log sample completions periodically using the LogCompletionsCallback.
from trl import LogCompletionsCallback

trainer = NashMDTrainer(..., eval_dataset=eval_dataset)
completions_callback = LogCompletionsCallback(trainer, num_prompts=8)
trainer.add_callback(completions_callback)
This callback logs the model’s generated completions directly to Weights & Biases.
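For the completions to actually appear in your dashboard, experiment tracking has to be enabled; assuming you use Weights & Biases, this can be requested through the standard report_to argument (shown here as an illustration):
training_args = NashMDConfig(..., report_to="wandb")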
Example script
We provide an example script to train a model using the Nash-MD method. The script is available in examples/scripts/nash_md.py.
To test the Nash-MD script with the Qwen2.5 0.5B model on the UltraFeedback dataset, run the following command:
python examples/scripts/nash_md.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --judge pair_rm \
    --dataset_name trl-lib/ultrafeedback-prompt \
    --learning_rate 5.0e-7 \
    --logging_steps 25 \
    --output_dir Qwen2.5-0.5B-NashMD-PairRM \
    --warmup_ratio 0.1 \
    --push_to_hub
Logged metrics
The logged metrics are as follows:
- loss/kl: The mean KL divergence between the model and reference data.
- objective/entropy: The mean entropy of the model and reference data.
- loss/score: The mean REINFORCE score loss.
- rewards/chosen: The mean scores (according to the reward model) of the model completions.
- rewards/rejected: The mean scores (according to the reward model) of the mixture completions.
- rewards/probabilities: The mean probability (according to the reward model or judge) of the model completions being chosen over the mixture completions.
- rewards/accuracies: The accuracies of the Nash-MD's implicit reward model.
- rewards/margins: The mean reward margin (according to the reward model) between the chosen and mixture completions.
- logps/chosen: The mean log probabilities of the chosen completions.
- logps/rejected: The mean log probabilities of the reference completions.
- val/model_contain_eos_token: The fraction of completions in which the model's output contains the EOS token.
- val/ref_contain_eos_token: The fraction of completions in which the mixture's output contains the EOS token.
- beta: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but it can be made dynamic by passing a list to NashMDConfig (see the sketch after this list).
- mixture_coef: Logit mixture coefficient for the model and reference model. Typically fixed, but it can be made dynamic by passing a list to NashMDConfig (see the sketch after this list).
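For example, an epoch-based schedule can be sketched by passing lists (the values below are only illustrative, not recommended settings):
# When beta or mixture_coef is a list, the trainer picks the value
# corresponding to the current training epoch.
training_args = NashMDConfig(
    output_dir="Qwen2-0.5B-NashMD",
    beta=[0.1, 0.05, 0.05],
    mixture_coef=[0.5, 0.25, 0.125],
)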
NashMDTrainer
class trl.NashMDTrainer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None, ref_model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None, reward_model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = None, judge: typing.Optional[trl.trainer.judges.BasePairwiseJudge] = None, args: typing.Optional[trl.trainer.nash_md_config.NashMDConfig] = None, data_collator: typing.Optional[typing.Callable] = None, train_dataset: typing.Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset, NoneType] = None, eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None, processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None, peft_config: typing.Optional[dict] = None, compute_metrics: typing.Optional[typing.Callable[[transformers.trainer_utils.EvalPrediction], dict]] = None, callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None, optimizers: tuple = (None, None), preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None )
Parameters
- model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
- ref_model (PreTrainedModelWrapper) — Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
- reward_model (transformers.PreTrainedModel) — The reward model to score completions with, preferably an AutoModelForSequenceClassification.
- judge (BasePairwiseJudge) — The judge to use for pairwise comparison of model completions.
- args (NashMDConfig) — The NashMD config arguments to use for training.
- data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used, which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
- train_dataset (datasets.Dataset) — The dataset to use for training.
- eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
- processing_class (PreTrainedTokenizerBase or BaseImageProcessor or FeatureExtractionMixin or ProcessorMixin, optional) — Processing class used to process the data. If provided, it will be used to automatically process the inputs for the model, and it will be saved along with the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
- peft_config (dict) — The PEFT config to use for training.
- compute_metrics (Callable[[EvalPrediction], dict], optional) — The function used to compute the metrics. Must take an EvalPrediction and return a dictionary mapping strings to metric values.
- callbacks (list[transformers.TrainerCallback]) — The callbacks to use for training.
- optimizers (tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
- preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function used to preprocess the logits before computing the metrics.
Initialize NashMDTrainer as a subclass of OnlineDPOTrainer.
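For reference, a parameter-efficient setup might look like the following (a minimal sketch assuming the peft library is installed; the LoRA hyperparameters are only illustrative):
from peft import LoraConfig

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
)

trainer = NashMDTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
)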
NashMDConfig
class trl.NashMDConfig
< source >( output_dir: typing.Optional[str] = None, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: typing.Optional[int] = None, per_gpu_eval_batch_size: typing.Optional[int] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: typing.Optional[int] = None, eval_delay: typing.Optional[float] = 0, torch_empty_cache_steps: typing.Optional[int] = None, learning_rate: float = 5e-07, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear', lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = <factory>, warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: typing.Optional[str] = 'passive', log_level_replica: typing.Optional[str] = 'warning', log_on_each_node: bool = True, logging_dir: typing.Optional[str] = None, logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', logging_first_step: bool = False, logging_steps: float = 500, logging_nan_inf_filter: bool = True, save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps', save_steps: float = 500, save_total_limit: typing.Optional[int] = None, save_safetensors: typing.Optional[bool] = True, save_on_each_node: bool = False, save_only_model: bool = False, restore_callback_states_from_checkpoint: bool = False, no_cuda: bool = False, use_cpu: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: typing.Optional[int] = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, tf32: typing.Optional[bool] = None, local_rank: int = -1, ddp_backend: typing.Optional[str] = None, tpu_num_cores: typing.Optional[int] = None, tpu_metrics_debug: bool = False, debug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = '', dataloader_drop_last: bool = False, eval_steps: typing.Optional[float] = None, dataloader_num_workers: int = 0, dataloader_prefetch_factor: typing.Optional[int] = None, past_index: int = -1, run_name: typing.Optional[str] = None, disable_tqdm: typing.Optional[bool] = None, remove_unused_columns: typing.Optional[bool] = True, label_names: typing.Optional[typing.List[str]] = None, load_best_model_at_end: typing.Optional[bool] = False, metric_for_best_model: typing.Optional[str] = None, greater_is_better: typing.Optional[bool] = None, ignore_data_skip: bool = False, fsdp: typing.Union[typing.List[transformers.trainer_utils.FSDPOption], str, NoneType] = '', fsdp_min_num_params: int = 0, fsdp_config: typing.Union[dict, str, NoneType] = None, tp_size: typing.Optional[int] = 0, fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None, accelerator_config: typing.Union[dict, str, NoneType] = None, deepspeed: typing.Union[dict, str, NoneType] = None, label_smoothing_factor: float = 0.0, optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch', optim_args: typing.Optional[str] = None, adafactor: bool = False, group_by_length: bool = False, length_column_name: typing.Optional[str] = 'length', report_to: typing.Union[NoneType, str, typing.List[str]] = None, ddp_find_unused_parameters: typing.Optional[bool] = None, ddp_bucket_cap_mb: typing.Optional[int] = None, ddp_broadcast_buffers: typing.Optional[bool] = None, dataloader_pin_memory: bool = True, dataloader_persistent_workers: bool = False, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: typing.Optional[str] = None, hub_model_id: typing.Optional[str] = None, hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save', hub_token: typing.Optional[str] = None, hub_private_repo: typing.Optional[bool] = None, hub_always_push: bool = False, gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: typing.Union[dict, str, NoneType] = None, include_inputs_for_metrics: bool = False, include_for_metrics: typing.List[str] = <factory>, eval_do_concat_batches: bool = True, fp16_backend: str = 'auto', evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = None, push_to_hub_model_id: typing.Optional[str] = None, push_to_hub_organization: typing.Optional[str] = None, push_to_hub_token: typing.Optional[str] = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: typing.Optional[str] = None, ray_scope: typing.Optional[str] = 'last', ddp_timeout: typing.Optional[int] = 1800, torch_compile: bool = False, torch_compile_backend: typing.Optional[str] = None, torch_compile_mode: typing.Optional[str] = None, dispatch_batches: typing.Optional[bool] = None, split_batches: typing.Optional[bool] = None, include_tokens_per_second: typing.Optional[bool] = False, include_num_input_tokens_seen: typing.Optional[bool] = False, neftune_noise_alpha: typing.Optional[float] = None, optim_target_modules: typing.Union[NoneType, str, typing.List[str]] = None, batch_eval_metrics: bool = False, eval_on_start: bool = False, use_liger_kernel: typing.Optional[bool] = False, eval_use_gather_object: typing.Optional[bool] = False, average_tokens_across_devices: typing.Optional[bool] = False, reward_model_path: typing.Optional[str] = None, judge: typing.Optional[str] = None, max_new_tokens: int = 64, max_length: int = 512, temperature: float = 0.9, missing_eos_penalty: typing.Optional[float] = None, beta: list = <factory>, loss_type: str = 'sigmoid', dataset_num_proc: typing.Optional[int] = None, disable_dropout: bool = True, use_vllm: bool = False, ds3_gather_for_generation: bool = True, mixture_coef: list = <factory> )
Configuration class for the NashMDTrainer.
A subclass of OnlineDPOConfig, so all of its arguments can be used; on top of these it adds the Nash-MD-specific mixture_coef argument, the logit mixture coefficient between the model and the reference model described in the logged metrics above.
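For instance, combining the OnlineDPOConfig arguments used earlier in this page with the Nash-MD-specific mixture coefficient might look like this (a minimal sketch; the value of mixture_coef is only illustrative):
training_args = NashMDConfig(
    output_dir="Qwen2-0.5B-NashMD",
    logging_steps=10,
    max_new_tokens=128,
    missing_eos_penalty=1.0,
    mixture_coef=0.5,
)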