Reward Modeling

TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.

Check out a complete flexible example at examples/scripts/reward_modeling.py.

Expected dataset type

The RewardTrainer requires a implicit prompt preference dataset. It means that the dataset should only contain the columns "chosen" and "rejected" (and not "prompt"). The RewardTrainer supports both conversational and standard dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

You can also use a pretokenized dataset, in which case the dataset should contain the following columns: input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected.

Using the RewardTrainer

After preparing your dataset, you can use the RewardTrainer in the same way as the Trainer class from 🤗 Transformers. You should pass an AutoModelForSequenceClassification model to the RewardTrainer, along with a RewardConfig which configures the hyperparameters of the training.

Leveraging 🤗 PEFT to train a reward model

Just pass a peft_config in the keyword arguments of RewardTrainer, and the trainer should automatically take care of converting the model into a PEFT model!

from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...

trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()

Adding a margin to the loss

As in the Llama 2 paper, you can add a margin to the loss by adding a margin column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.

def add_margin(row):
    # Assume you have a score_chosen and score_rejected columns that you want to use to compute the margin
    return {'margin': row['score_chosen'] - row['score_rejected']}

dataset = dataset.map(add_margin)

Centering rewards

In many scenarios, it’s preferable to ensure that a reward model’s output is mean zero. This is often done by first calculating the model’s average score and then subtracting it.

[Eisenstein et al., 2023] proposed an auxiliary loss function designed to directly learn a centered reward model. This auxiliary loss minimizes the squared sum of the rewards, encouraging the model to naturally produce mean-zero outputs: $\Big( R(p, r_1) + R(p, r_2) \Big)^2$

This auxiliary loss is combined with the main loss function, weighted by the parameter center_rewards_coefficient in the [RewardConfig]. By default, this feature is deactivated (center_rewards_coefficient = None).

training_args = RewardConfig(
    center_rewards_coefficient=0.01,
    ...
)

For reference results, please refer PR #1932.

RewardTrainer

class trl.RewardTrainer

< source >

( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = Noneargs: typing.Optional[trl.trainer.reward_config.RewardConfig] = Nonedata_collator: typing.Optional[transformers.data.data_collator.DataCollator] = Nonetrain_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = Noneeval_dataset: typing.Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = Noneprocessing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = Nonemodel_init: typing.Optional[typing.Callable[[], transformers.modeling_utils.PreTrainedModel]] = Nonecompute_metrics: typing.Optional[typing.Callable[[transformers.trainer_utils.EvalPrediction], dict]] = Nonecallbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = Noneoptimizers: tuple = (None, None)preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = Nonepeft_config: typing.Optional[dict] = None )

create_model_card

< source >

( model_name: typing.Optional[str] = Nonedataset_name: typing.Optional[str] = Nonetags: typing.Union[str, list[str], NoneType] = None )

Parameters

model_name (str or None, optional, defaults to None) — Name of the model.
dataset_name (str or None, optional, defaults to None) — Name of the dataset used for training.
tags (str, list[str] or None, optional, defaults to None) — Tags to be associated with the model card.

Creates a draft of a model card using the information available to the Trainer.

visualize_samples

< source >

( num_print_samples: int )

Parameters

num_print_samples (int, defaults to 4) — The number of samples to print. Set to -1 to print all samples.

Visualize the reward model logits prediction

RewardConfig

class trl.RewardConfig

< source >

( output_dir: typing.Optional[str] = Noneoverwrite_output_dir: bool = Falsedo_train: bool = Falsedo_eval: bool = Falsedo_predict: bool = Falseeval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no'prediction_loss_only: bool = Falseper_device_train_batch_size: int = 8per_device_eval_batch_size: int = 8per_gpu_train_batch_size: typing.Optional[int] = Noneper_gpu_eval_batch_size: typing.Optional[int] = Nonegradient_accumulation_steps: int = 1eval_accumulation_steps: typing.Optional[int] = Noneeval_delay: typing.Optional[float] = 0torch_empty_cache_steps: typing.Optional[int] = Nonelearning_rate: float = 5e-05weight_decay: float = 0.0adam_beta1: float = 0.9adam_beta2: float = 0.999adam_epsilon: float = 1e-08max_grad_norm: float = 1.0num_train_epochs: float = 3.0max_steps: int = -1lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear'lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = <factory>warmup_ratio: float = 0.0warmup_steps: int = 0log_level: typing.Optional[str] = 'passive'log_level_replica: typing.Optional[str] = 'warning'log_on_each_node: bool = Truelogging_dir: typing.Optional[str] = Nonelogging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps'logging_first_step: bool = Falselogging_steps: float = 500logging_nan_inf_filter: bool = Truesave_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps'save_steps: float = 500save_total_limit: typing.Optional[int] = Nonesave_safetensors: typing.Optional[bool] = Truesave_on_each_node: bool = Falsesave_only_model: bool = Falserestore_callback_states_from_checkpoint: bool = Falseno_cuda: bool = Falseuse_cpu: bool = Falseuse_mps_device: bool = Falseseed: int = 42data_seed: typing.Optional[int] = Nonejit_mode_eval: bool = Falseuse_ipex: bool = Falsebf16: bool = Falsefp16: bool = Falsefp16_opt_level: str = 'O1'half_precision_backend: str = 'auto'bf16_full_eval: bool = Falsefp16_full_eval: bool = Falsetf32: typing.Optional[bool] = Nonelocal_rank: int = -1ddp_backend: typing.Optional[str] = Nonetpu_num_cores: typing.Optional[int] = Nonetpu_metrics_debug: bool = Falsedebug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = ''dataloader_drop_last: bool = Falseeval_steps: typing.Optional[float] = Nonedataloader_num_workers: int = 0dataloader_prefetch_factor: typing.Optional[int] = Nonepast_index: int = -1run_name: typing.Optional[str] = Nonedisable_tqdm: typing.Optional[bool] = Noneremove_unused_columns: bool = Falselabel_names: typing.Optional[typing.List[str]] = Noneload_best_model_at_end: typing.Optional[bool] = Falsemetric_for_best_model: typing.Optional[str] = Nonegreater_is_better: typing.Optional[bool] = Noneignore_data_skip: bool = Falsefsdp: typing.Union[typing.List[transformers.trainer_utils.FSDPOption], str, NoneType] = ''fsdp_min_num_params: int = 0fsdp_config: typing.Union[dict, str, NoneType] = Nonetp_size: typing.Optional[int] = 0fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = Noneaccelerator_config: typing.Union[dict, str, NoneType] = Nonedeepspeed: typing.Union[dict, str, NoneType] = Nonelabel_smoothing_factor: float = 0.0optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch'optim_args: typing.Optional[str] = Noneadafactor: bool = Falsegroup_by_length: bool = Falselength_column_name: typing.Optional[str] = 'length'report_to: typing.Union[NoneType, str, typing.List[str]] = Noneddp_find_unused_parameters: typing.Optional[bool] = Noneddp_bucket_cap_mb: typing.Optional[int] = Noneddp_broadcast_buffers: typing.Optional[bool] = Nonedataloader_pin_memory: bool = Truedataloader_persistent_workers: bool = Falseskip_memory_metrics: bool = Trueuse_legacy_prediction_loop: bool = Falsepush_to_hub: bool = Falseresume_from_checkpoint: typing.Optional[str] = Nonehub_model_id: typing.Optional[str] = Nonehub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save'hub_token: typing.Optional[str] = Nonehub_private_repo: typing.Optional[bool] = Nonehub_always_push: bool = Falsegradient_checkpointing: bool = Falsegradient_checkpointing_kwargs: typing.Union[dict, str, NoneType] = Noneinclude_inputs_for_metrics: bool = Falseinclude_for_metrics: typing.List[str] = <factory>eval_do_concat_batches: bool = Truefp16_backend: str = 'auto'evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = Nonepush_to_hub_model_id: typing.Optional[str] = Nonepush_to_hub_organization: typing.Optional[str] = Nonepush_to_hub_token: typing.Optional[str] = Nonemp_parameters: str = ''auto_find_batch_size: bool = Falsefull_determinism: bool = Falsetorchdynamo: typing.Optional[str] = Noneray_scope: typing.Optional[str] = 'last'ddp_timeout: typing.Optional[int] = 1800torch_compile: bool = Falsetorch_compile_backend: typing.Optional[str] = Nonetorch_compile_mode: typing.Optional[str] = Nonedispatch_batches: typing.Optional[bool] = Nonesplit_batches: typing.Optional[bool] = Noneinclude_tokens_per_second: typing.Optional[bool] = Falseinclude_num_input_tokens_seen: typing.Optional[bool] = Falseneftune_noise_alpha: typing.Optional[float] = Noneoptim_target_modules: typing.Union[NoneType, str, typing.List[str]] = Nonebatch_eval_metrics: bool = Falseeval_on_start: bool = Falseuse_liger_kernel: typing.Optional[bool] = Falseeval_use_gather_object: typing.Optional[bool] = Falseaverage_tokens_across_devices: typing.Optional[bool] = Falsemax_length: typing.Optional[int] = 1024disable_dropout: bool = Truedataset_num_proc: typing.Optional[int] = Nonecenter_rewards_coefficient: typing.Optional[float] = None )

Parameters

max_length (int or None, optional, defaults to 1024) — Maximum length of the sequences (prompt + completion) in the batch, filters out entries that exceed the limit. This argument is required if you want to use the default data collator.
disable_dropout (bool, optional, defaults to True) — Whether to disable dropout in the model.
dataset_num_proc (int, optional, defaults to None) — Number of processes to use for processing the dataset.
center_rewards_coefficient (float, optional, defaults to None) — Coefficient to incentivize the reward model to output mean-zero rewards (proposed by https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: 0.01.
remove_unused_columns (bool, optional, defaults to False) — Whether to remove the columns that are not used by the model’s forward pass. Can be True only if the dataset is pretokenized.

Configuration class for the RewardTrainer.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

< > Update on GitHub