Reward Modeling
TRL supports custom reward modeling so that anyone can train a reward model on their own dataset and model.
Check out a complete and flexible example at examples/scripts/reward_modeling.py.
Expected dataset type
The RewardTrainer requires an implicit prompt preference dataset, meaning the dataset should only contain the columns "chosen" and "rejected" (and not "prompt").
The RewardTrainer supports both conversational and standard dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to it.
You can also use a pretokenized dataset, in which case it should contain the following columns: input_ids_chosen, attention_mask_chosen, input_ids_rejected and attention_mask_rejected.
Using the RewardTrainer
After preparing your dataset, you can use the RewardTrainer in the same way as the Trainer class from 🤗 Transformers. You should pass an AutoModelForSequenceClassification model to the RewardTrainer, along with a RewardConfig which configures the hyperparameters of the training.
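For instance, a minimal end-to-end sketch of this setup might look like the following (the model name, dataset, and hyperparameter values are illustrative placeholders, not recommendations):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# The reward model is a sequence classifier with a single output: the scalar reward
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Any implicit prompt preference dataset with "chosen" and "rejected" columns works here
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

training_args = RewardConfig(output_dir="reward_model", per_device_train_batch_size=8, max_length=512)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()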
Leveraging 🤗 PEFT to train a reward model
Just pass a peft_config in the keyword arguments of RewardTrainer, and the trainer should automatically take care of converting the model into a PEFT model!
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...  # set up the tokenizer, the preference dataset and the RewardConfig (training_args) here

trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()
Adding a margin to the loss
As in the Llama 2 paper, you can add a margin to the loss by adding a margin column to the dataset. The reward collator will automatically pass it through and the loss will be computed accordingly.
def add_margin(row):
    # Assume you have score_chosen and score_rejected columns that you want to use to compute the margin
    return {'margin': row['score_chosen'] - row['score_rejected']}

dataset = dataset.map(add_margin)
Centering rewards
In many scenarios, it's preferable to ensure that a reward model's output is mean zero. This is often done by first calculating the model's average score and then subtracting it.
Eisenstein et al. (2023) proposed an auxiliary loss function designed to directly learn a centered reward model. This auxiliary loss minimizes the squared sum of the rewards, encouraging the model to naturally produce mean-zero outputs:
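Written out, the auxiliary term has roughly the following form (a sketch based on the description above and Eq. 2 of the paper; here r_theta is the reward model, x the prompt, y_chosen and y_rejected the two completions, and lambda the weighting coefficient introduced below):

$$
\mathcal{L}_{\text{aux}} = \lambda \left( r_\theta(x, y_{\text{chosen}}) + r_\theta(x, y_{\text{rejected}}) \right)^2
$$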
This auxiliary loss is combined with the main loss function, weighted by the parameter center_rewards_coefficient in the RewardConfig. By default, this feature is deactivated (center_rewards_coefficient = None).
training_args = RewardConfig(
    center_rewards_coefficient=0.01,
    ...
)
For reference results, please refer to PR #1932.
RewardTrainer
class trl.RewardTrainer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = None, args: typing.Optional[trl.trainer.reward_config.RewardConfig] = None, data_collator: typing.Optional[transformers.data.data_collator.DataCollator] = None, train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None, eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None, processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None, model_init: typing.Optional[typing.Callable[[], transformers.modeling_utils.PreTrainedModel]] = None, compute_metrics: typing.Optional[typing.Callable[[transformers.trainer_utils.EvalPrediction], dict]] = None, callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None, optimizers: tuple = (None, None), preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None, peft_config: typing.Optional[dict] = None )
create_model_card
< source >( model_name: typing.Optional[str] = None, dataset_name: typing.Optional[str] = None, tags: typing.Union[str, list[str], NoneType] = None )
Creates a draft of a model card using the information available to the Trainer.
visualize_samples
< source >( num_print_samples: int )
Visualizes the reward model's logits predictions.
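For example, assuming trainer is an already constructed RewardTrainer with an evaluation dataset, a usage sketch:

# Print the chosen/rejected completions together with their predicted logits
# for the first 4 samples; pass -1 to print all samples.
trainer.visualize_samples(num_print_samples=4)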
RewardConfig
class trl.RewardConfig
< source >( output_dir: typing.Optional[str] = None, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: typing.Optional[int] = None, per_gpu_eval_batch_size: typing.Optional[int] = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: typing.Optional[int] = None, eval_delay: typing.Optional[float] = 0, torch_empty_cache_steps: typing.Optional[int] = None, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear', lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = <factory>, warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: typing.Optional[str] = 'passive', log_level_replica: typing.Optional[str] = 'warning', log_on_each_node: bool = True, logging_dir: typing.Optional[str] = None, logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps', logging_first_step: bool = False, logging_steps: float = 500, logging_nan_inf_filter: bool = True, save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps', save_steps: float = 500, save_total_limit: typing.Optional[int] = None, save_safetensors: typing.Optional[bool] = True, save_on_each_node: bool = False, save_only_model: bool = False, restore_callback_states_from_checkpoint: bool = False, no_cuda: bool = False, use_cpu: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: typing.Optional[int] = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, tf32: typing.Optional[bool] = None, local_rank: int = -1, ddp_backend: typing.Optional[str] = None, tpu_num_cores: typing.Optional[int] = None, tpu_metrics_debug: bool = False, debug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = '', dataloader_drop_last: bool = False, eval_steps: typing.Optional[float] = None, dataloader_num_workers: int = 0, dataloader_prefetch_factor: typing.Optional[int] = None, past_index: int = -1, run_name: typing.Optional[str] = None, disable_tqdm: typing.Optional[bool] = None, remove_unused_columns: bool = False, label_names: typing.Optional[typing.List[str]] = None, load_best_model_at_end: typing.Optional[bool] = False, metric_for_best_model: typing.Optional[str] = None, greater_is_better: typing.Optional[bool] = None, ignore_data_skip: bool = False, fsdp: typing.Union[typing.List[transformers.trainer_utils.FSDPOption], str, NoneType] = '', fsdp_min_num_params: int = 0, fsdp_config: typing.Union[dict, str, NoneType] = None, tp_size: typing.Optional[int] = 0, fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None, accelerator_config: typing.Union[dict, str, NoneType] = None, deepspeed: typing.Union[dict, str, NoneType] = None, label_smoothing_factor: float = 0.0, optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch', optim_args: typing.Optional[str] = None, adafactor: bool = False, group_by_length: bool = False, length_column_name: typing.Optional[str] = 'length', report_to: typing.Union[NoneType, str, typing.List[str]] = None, ddp_find_unused_parameters: typing.Optional[bool] = None, ddp_bucket_cap_mb: typing.Optional[int] = None, ddp_broadcast_buffers: typing.Optional[bool] = None, dataloader_pin_memory: bool = True, dataloader_persistent_workers: bool = False, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: typing.Optional[str] = None, hub_model_id: typing.Optional[str] = None, hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save', hub_token: typing.Optional[str] = None, hub_private_repo: typing.Optional[bool] = None, hub_always_push: bool = False, gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: typing.Union[dict, str, NoneType] = None, include_inputs_for_metrics: bool = False, include_for_metrics: typing.List[str] = <factory>, eval_do_concat_batches: bool = True, fp16_backend: str = 'auto', evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = None, push_to_hub_model_id: typing.Optional[str] = None, push_to_hub_organization: typing.Optional[str] = None, push_to_hub_token: typing.Optional[str] = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: typing.Optional[str] = None, ray_scope: typing.Optional[str] = 'last', ddp_timeout: typing.Optional[int] = 1800, torch_compile: bool = False, torch_compile_backend: typing.Optional[str] = None, torch_compile_mode: typing.Optional[str] = None, dispatch_batches: typing.Optional[bool] = None, split_batches: typing.Optional[bool] = None, include_tokens_per_second: typing.Optional[bool] = False, include_num_input_tokens_seen: typing.Optional[bool] = False, neftune_noise_alpha: typing.Optional[float] = None, optim_target_modules: typing.Union[NoneType, str, typing.List[str]] = None, batch_eval_metrics: bool = False, eval_on_start: bool = False, use_liger_kernel: typing.Optional[bool] = False, eval_use_gather_object: typing.Optional[bool] = False, average_tokens_across_devices: typing.Optional[bool] = False, max_length: typing.Optional[int] = 1024, disable_dropout: bool = True, dataset_num_proc: typing.Optional[int] = None, center_rewards_coefficient: typing.Optional[float] = None )
Parameters
- max_length (int or None, optional, defaults to 1024): Maximum length of the sequences (prompt + completion) in the batch; entries that exceed the limit are filtered out. This argument is required if you want to use the default data collator.
- disable_dropout (bool, optional, defaults to True): Whether to disable dropout in the model.
- dataset_num_proc (int, optional, defaults to None): Number of processes to use for processing the dataset.
- center_rewards_coefficient (float, optional, defaults to None): Coefficient to incentivize the reward model to output mean-zero rewards (proposed by https://huggingface.co/papers/2312.09244, Eq. 2). Recommended value: 0.01.
- remove_unused_columns (bool, optional, defaults to False): Whether to remove the columns that are not used by the model's forward pass. Can be True only if the dataset is pretokenized.
Configuration class for the RewardTrainer.
Using HfArgumentParser, we can turn this class into argparse arguments that can be specified on the command line.
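For example, a minimal sketch of a training script that exposes the RewardConfig fields as CLI flags (the script name and flag values below are illustrative):

from transformers import HfArgumentParser
from trl import RewardConfig

parser = HfArgumentParser(RewardConfig)
# e.g. python train_reward.py --output_dir reward_model --center_rewards_coefficient 0.01
(training_args,) = parser.parse_args_into_dataclasses()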