TRL - Transformers Reinforcement Learning

TRL is a full stack library where we provide a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more. The library is integrated with 🤗 transformers.

🎉 What’s New

🌍 Multi-environment agentic RL: GRPOTrainer now supports per-example environment selection and environment-owned rewards — mix multiple sandboxed task suites in one run and let each environment define its own scoring, with Harbor and OpenEnv.

🎯 KTO is now stable: KTOTrainer graduates to the stable API after a full alignment pass with DPOTrainer.

Taxonomy

Below is the current list of TRL trainers, organized by method type (⚡️ = vLLM support; 🧪 = experimental).

Online methods

GRPOTrainer ⚡️
RLOOTrainer ⚡️
OnlineDPOTrainer 🧪 ⚡️
NashMDTrainer 🧪 ⚡️
PPOTrainer 🧪
XPOTrainer 🧪 ⚡️

Reward modeling

RewardTrainer
PRMTrainer 🧪

Offline methods

Knowledge distillation

GKDTrainer 🧪
MiniLLMTrainer 🧪

You can also explore TRL-related models, datasets, and demos in the TRL Hugging Face organization.

Learn

Learn post-training with TRL and other libraries in 🤗 smol course.

The documentation is organized into the following sections:

Getting Started: installation and quickstart guide.
Conceptual Guides: dataset formats, training FAQ, and understanding logs.
How-to Guides: reducing memory usage, speeding up training, distributing training, etc.
Integrations: DeepSpeed, Liger Kernel, PEFT, etc.
Examples: example overview, community tutorials, etc.
API: trainers, utils, etc.