Paper Index
Section under construction. Feel free to contribute!
Group Sequence Policy Optimization
📜 Paper: https://huggingface.co/papers/2507.18071
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token. To reproduce the paper’s setting, use this configuration:
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    beta=0.0,  # GSPO sets KL regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306
    epsilon=3e-4,  # GSPO paper (v2), section 5.1
    epsilon_high=4e-4,  # GSPO paper (v2), section 5.1
    gradient_accumulation_steps=1,
    steps_per_generation=4,  # partition the rollout batch into 4 mini-batches. GSPO paper (v2), section 5.1. Must be 4 times gradient_accumulation_steps
)
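For context, here is a minimal sketch of passing such a configuration to GRPOTrainer. The model checkpoint, dataset, and toy reward function are illustrative assumptions, not part of the paper's setup.

from datasets import load_dataset
from trl import GRPOTrainer

# Illustrative assumptions: any prompt dataset and reward function can be used
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward favouring completions close to 50 characters
    return [-abs(50 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # assumption: any causal LM checkpoint
    reward_funcs=reward_len,
    args=training_args,  # the GSPO-style configuration defined above
    train_dataset=dataset,
)
trainer.train()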
EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes
📜 Paper: https://huggingface.co/papers/2508.00180
Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can use the BEMACallback:
from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...,
    callbacks=[BEMACallback()],
)
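As a more complete, runnable variant of the snippet above, here is a minimal sketch. The model checkpoint and dataset are illustrative assumptions, not those used in the paper.

from datasets import load_dataset
from trl import BEMACallback, SFTConfig, SFTTrainer

# Illustrative assumptions: any SFT-ready dataset and causal LM checkpoint work here
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # assumption: any causal LM checkpoint
    args=SFTConfig(output_dir="Qwen2-0.5B-SFT-BEMA"),
    train_dataset=dataset,
    callbacks=[BEMACallback()],  # default BEMA hyperparameters
)
trainer.train()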