Chuanming's picture
Upload folder using huggingface_hub
fa4458a
raw
history blame contribute delete
952 Bytes
# Trainer
At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].
The Trainer and model classes are largely inspired from `transformers.Trainer` and `transformers.AutoModel` classes and adapted for RL.
We also support a `RewardTrainer` that can be used to train a reward model.
## PPOConfig
[[autodoc]] PPOConfig
## PPOTrainer
[[autodoc]] PPOTrainer
## RewardConfig
[[autodoc]] RewardConfig
## RewardTrainer
[[autodoc]] RewardTrainer
## SFTTrainer
[[autodoc]] SFTTrainer
## DPOTrainer
[[autodoc]] DPOTrainer
## DDPOConfig
[[autodoc]] DDPOConfig
## DDPOTrainer
[[autodoc]] DDPOTrainer
## IterativeSFTTrainer
[[autodoc]] IterativeSFTTrainer
## set_seed
[[autodoc]] set_seed