[Experiment] Training R1-Zero-like models with Open R1
Context
There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B:
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- Understanding R1-Zero-Like Training: A Critical Perspective
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
These papers focus on mathematical reasoning (easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, now is the time to train R1-Zero-like models with Open R1!
Main goal: reproduce / improve the performance of the DeepSeek-R1-Zero-Qwen-32B model that DeepSeek trained in the R1 tech report:
Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):
With that in mind, we will explore which subset of ideas in the above papers are sufficient to achieve comparable performance, starting first in math, then code and STEM.
We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!
Setup
- Models: Qwen2.5-7B for ablations and Qwen2.5-32B for final runs
- Datasets: SynthLabsAI/Big-Math-RL-Verified and BytedTsinghua-SIA/DAPO-Math-17k for math. Code and other domains to be decided.
Links
- Code: I'll be running experiments from this draft PR of open-r1: https://github.com/huggingface/open-r1/pull/569
- Experiment logs: https://api.wandb.ai/links/huggingface/8eew2ipo
- Models and datasets: https://huggingface.co/collections/open-r1/open-r1-zero-67eba6a037505bbcb5157d07
Experiments to run
- Train a baseline using "standard" parameters on Big-Math-RL-Verified to compare relative performance & learning dynamics ✅
- Measure effect on convergence with μ=2,4 (default is 1 in TRL) ✅
- Disable KL term with β=0 ✅
- Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values) ✅
- Add soft overlong reward function from DAPO paper (see the sketch after this list)
- Add overlong filter (mask loss of truncated completions)
- DAPO (default) vs Dr. GRPO loss ✅
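For reference, a minimal sketch of the soft overlong reward described in the DAPO paper (the function name and the max_completion_len / soft_cache_len defaults are illustrative; the actual implementation lands in open-r1 via the PR noted in the logbook below):

```python
def soft_overlong_punishment(completion_lengths, max_completion_len=8192, soft_cache_len=1024):
    """DAPO-style length penalty: 0 within the budget, linearly decreasing to -1 over the
    final `soft_cache_len` tokens, and -1 for completions that hit the hard limit."""
    expected_len = max_completion_len - soft_cache_len
    rewards = []
    for length in completion_lengths:
        if length <= expected_len:
            rewards.append(0.0)
        elif length <= max_completion_len:
            rewards.append((expected_len - length) / soft_cache_len)  # in (-1, 0]
        else:
            rewards.append(-1.0)
    return rewards
```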
Features to add to TRL
- Overlong filter could be exposed as an arg like mask_truncated_completions in GRPOConfig
- Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188
Features to add to Open R1
Logbook [1.4.2025]
Experiments
- Focused on training a baseline with Qwen2.5-7B and discovered a serious bug in the accuracy reward function of open-r1 🙀. First, the parser was failing on non-LaTeX ground truth answers like "6", and second we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566 (see the sketch after this list)
- I am running 3 baseline experiments to gauge stability on SynthLabsAI/Big-Math-RL-Verified:
  - v00.0X: train on everything
  - v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
  - v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
Overall training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted with 0.2 and might need bumping up if the model cannot get enough signal to learn it. Note that I am using a chat template to define the DeepSeek-R1 prompt:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e.,
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.
User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).
Assistant:
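For illustration, a sketch of how this prompt is rendered as a plain-text prefix for the base model (the run itself uses a chat template defined in the open-r1 draft PR, which produces an equivalent string):

```python
# R1-Zero-style prompt for a base model; the system preamble matches the template above.
R1_ZERO_SYSTEM = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think>...</think> and <answer>...</answer> tags, "
    "respectively, i.e., <think>\nreasoning process here\n</think>\n"
    "<answer>\nanswer here\n</answer>."
)

def build_prompt(problem: str) -> str:
    return f"{R1_ZERO_SYSTEM}\nUser: {problem}\nAssistant:"
```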
As many other papers have observed, Qwen2.5-7B is remarkably good at following instructions with little prompting and is able to emit the \boxed{} format fairly consistently without any reference to it in the prompt!
TRL / Open R1 updates
- @edbeeching has added the new completion metrics here: https://github.com/huggingface/trl/pull/3188
- @ShirinYamani has added the soft overlong reward function: https://github.com/huggingface/open-r1/pull/567
Next
- Preprocess the BigMath dataset to filter any answers that cannot be parsed / verified
- Rebase on trl@main and re-run baseline to measure stability.
- Gather downstream evals with pass@1 metric from lighteval: https://github.com/huggingface/lighteval/pull/647
Logbook [4.4.2025]
Experiments
- Following @edbeeching's suggestion to filter the Big-Math-RL-Verified dataset for answers that can be parsed by math-verify, I've created a processed version to use as a new basis for our experiments: https://huggingface.co/datasets/open-r1/Big-Math-RL-Verified-Processed
- Ablations run for 0.1 epochs over the full training set (about 21k samples), with 32 unique prompts per batch, 16 completions per prompt and 8k max tokens (see the config sketch after this list).
- Report: https://api.wandb.ai/links/huggingface/qps1tmoj
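For reference, a hedged mapping of this setup onto TRL's GRPOConfig (the per-device batch size and GPU count are assumptions, chosen so that each optimisation step sees 32 unique prompts x 16 completions):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-ablation",  # illustrative name
    num_train_epochs=0.1,                   # ~21k prompts of the processed training set
    num_generations=16,                     # completions per prompt
    max_completion_length=8192,             # 8k max tokens
    per_device_train_batch_size=16,         # assumed: 32 GPUs -> 512 completions = 32 unique prompts/step
    beta=0.001,                             # small KL coefficient used in most runs
    num_iterations=1,                       # μ, swept over {1, 2, 4} in these ablations
)
```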
tl;dr
- For β>0 it seems necessary to sync the reference model with the policy every N steps to avoid instabilities.
- Increasing μ leads to faster convergence but is less stable
- Setting β=0 is surprisingly more stable than β>0
- Clip higher and Dr GRPO loss do not have much effect on the rewards, but also do not induce any additional instability
- The format reward seems to be too hard for the model to learn, possibly because we enforce a specific new-line format.
- The new completion metrics like clipped_ratio are very handy for knowing when a run is going off the rails!
Baselines
While setting a baseline with the default settings, we found that vanilla GRPO is unstable and the completions explode midway through training:
In line with DAPO, we suspect this is caused by truncated completions destabilising the training, and @ShirinYamani has opened a PR in TRL that we can use to test this hypothesis. Nevertheless, we found that replacing the reference model with the policy every 100 steps (about every 1/6th of training) mitigated the instability for now.
Note: set sync_ref_model=True and sync every 100 steps.
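A minimal sketch of this setting, assuming TRL's reference-model syncing arguments (the mixup value is an assumption corresponding to a full replacement):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO-baseline",  # illustrative name
    sync_ref_model=True,        # periodically refresh the reference model
    ref_model_sync_steps=100,   # roughly every 1/6th of training in our setup
    ref_model_mixup_alpha=1.0,  # assumed: fully replace the reference with the current policy
)
```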
Effect from μ iterations
The GRPO algorithm has an inner optimisation loop where the policy is updated μ times on a given batch:
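For reference, the clipped objective as written in the DeepSeekMath/R1 papers (a sketch; TRL's implementation may differ in normalisation details):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big]\Big)\right]
$$

where $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})}$ and $\hat{A}_{i,t}=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$. The parameter μ is the number of gradient updates performed on each sampled batch while $\pi_{\theta_{\mathrm{old}}}$ is held fixed (exposed as num_iterations in TRL).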
We explored the effect of setting μ=1,2,4 and, as shown below, larger values of μ converge much faster but are less stable:
The faster convergence is most visible in the early phases of training, where larger μ values reach the same reward as μ=1 in far fewer steps:
Note: if we can stabilise vanilla GRPO, we should revisit scaling μ as it has a clear computational advantage
Effect from having no reference model
Somewhat surprisingly, setting β=0 seems to be more stable than including the reference model + syncing every 100 steps:
Disabling the KL term in the GRPO loss is what DAPO recommends (better exploration), but it is still surprising to see it is more stable (intuitively I would have expected the lack of a KL term to encourage more unbounded completions)
Note: explore the effect of increasing μ when β=0. Are the runs still stable?
Clip higher
The DAPO paper recommends using a larger ε on the upper bound of the trust region in the clipped loss. Using their value of ε=0.28 doesn't seem to have much impact on the rewards, but does increase the completion lengths somewhat:
Note: compare downstream evals to draw a proper conclusion here. Also consider different values of ε_high
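For reference, clip higher decouples the two sides of the clipping range in the objective above, i.e. the clip term becomes

$$
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big),\qquad \varepsilon_{\mathrm{low}}=0.2,\ \varepsilon_{\mathrm{high}}=0.28,
$$

which gives low-probability tokens more room to increase their likelihood (in TRL this should correspond to the epsilon / epsilon_high arguments).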
Dr GRPO loss (scale_rewards=False)
The Dr GRPO paper recommends removing the reward scaling by σ. Compared to our baseline, this doesn't seem to have a large impact on the rewards, but does produce smaller grad norms and KL terms:
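Concretely, setting scale_rewards=False drops the division by the group standard deviation in the advantage (a sketch; Dr GRPO also changes the length normalisation, which we test separately in a later logbook entry):

$$
\hat{A}_{i,t}=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}\ \longrightarrow\ \hat{A}_{i,t}=R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})
$$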
Next steps
- Run downstream evals to see how the rewards relate to the metrics we actually care about
- Benchmark @ShirinYamani 's PR
- Explore relaxing the new-line structure of the format reward (or having a soft variant)
- Run μ ablation for β=0
- Integrate new pass@1 metric from lighteval
Logbook [8.4.2025]
Experiments
The latest batch of experiments has focused on:
- Ablating the effect of masking the loss of completions that don't terminate with an EOS token (DAPO)
- Introducing a soft format reward
- Ablating the effect of μ for β=0
- Using a local scaling factor in the loss that's determined by the length of the longest completion in a batch (a variant of Dr GRPO). PR from @edbeeching
- Disabling dropout. PR from @edbeeching
All runs: https://api.wandb.ai/links/huggingface/qps1tmoj
tl;dr
- Masking the loss helps stabilise training, but more ablations are needed to determine "optimal" settings.
- The soft format reward is crucial to enable the model to learn the strict format, but changes the training dynamics and can cause instabilities.
- Setting μ>1 is unstable for β=0
- Using either a variant of Dr GRPO or disabling dropout does not cure instability for μ=4
Masking the loss of unterminated completions
Following the DAPO paper, we have implemented masking on the completions that fail to emit an EOS token (unterminated). The results below are shown for μ=4 which was unstable without masking, but is now (mostly) stable:
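A minimal sketch of the masking itself (not the exact TRL/PR implementation): zero out the loss mask of any completion that never emitted an EOS token, so truncated generations contribute no gradient.

```python
import torch

def mask_truncated_completions(completion_ids: torch.Tensor,
                               completion_mask: torch.Tensor,
                               eos_token_id: int) -> torch.Tensor:
    """Zero the loss mask of completions without an EOS token; tensors are (batch, seq_len)."""
    terminated = (completion_ids == eos_token_id).any(dim=1)  # (batch,) bool
    return completion_mask * terminated.unsqueeze(1).to(completion_mask.dtype)
```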
Although masking the loss helps with stability, it does not fully eliminate pathologies in the model where it generates unbounded completions with zero reward:
1. **Define the Problem**: We have 8 white teacups and 7 black teacups arranged around a table, and 15 dwarves sitting around the table with 8 white hats and 7 black hats. Each dwarf picks a teacup of the same color as their hat and places it in front of them. The table is then rotated randomly, and we need to find the maximum number of teacups that can be guaranteed to match the color of the dwarf's hat after the rotation.
2. **Set Up the Scenario**: Let's denote the dwarves as \(D_1, D_2, \ldots, D_{15}\) and their hats as \(H_1, H_2, \ldots, H_{15}\). The teacups are also colored as \(T_1, T_2, \ldots, T_{15}\). Each dwarf \(D_i\) picks a teacup \(T_i\) of the same color as their hat \(H_i\).
3. **Analyze the Problem**: After the dwarves pick their teacups, there are 8 white teacups and 7 black teacups in front of them, matching the number of white and black hats. When the table is rotated, we need to ensure that the maximum number of teacups match the color of the dwarf's hat. This is equivalent to finding the maximum number of fixed points in a permutation of 15 elements where 8 are of one type and 7 are of another type.
4. **Use Combinatorial Argument**: We can use the Pigeonhole Principle and Combinatorial Argument. If we consider the dwarves and teacups as a permutation problem, we can say that in the worst-case scenario, after rotation, the teacups and hats will be in a configuration that maximizes the mismatches. However, by the Pigeonhole Principle, there must be at least 8 positions where the teacup color matches the hat color because there are 8 white teacups and 8 white hats (the same for black teacups and black hats).
5. **Conclusion**: Since there are 8 white teacups and 7 black teacups, and 8 white hats and 7 black hats, in the best-case scenario (and by ensuring through logical placement and rotation argument consistently across logical bundling manners viewpoints thematic quant-contained systematic mutual-span-parts-figure-ext-suite-cross-resmetic-ton-parts-cut-cut-exclusive-shift-parts-prundry-full-controach-shared-viewcej-spot-cross-open-round-target-track-view-round-proof-te-suite-unit-flight-limit-scenes-proof-round-exclusive-face-course-edge-cut-course-frame-parts-exclusive-inv-inv-in-ext-inv-special-parts-round-depend-flight-suite-shift-open-spot-special-goal-unit-edge-ext-scalable-view-goal-edge-frame-choice-strokes-scenes-highcomings-supopsy-shared-stage-range-open-inc-exclusive-fast-frame-limit-supsemb-exp-edge-cross-spot-sup-Mobile-real-top-ext-Mobile-inccomings-distqus-te-exclusive-spot-exclusive-face-choice-seat-var-open-course-shared-special-turn-span-Mobile-Unrary-spot-parts-ext-clean-choice-course-choice-cutcuts-clean-exclusive-strong-suite-fast-seat-exclusive-strong-strokes-ves-edge-inv-shared-best-course-Compatible-picture-inv-strokes-inc-Cs-view-real-clear-round-span-exc-exclusive-parts-goal-best burge-parts-picture-suite-full-ves-fast-scenes-Mobile-ves-edge-Mobile-best-inaponsionedcejreach-limitreach-cross-Cs-goal-round-special-entry-ext-stripcomings-face-strokes-suite-fast-shared-parts-figurecomings-round-special-special-shared-prignal-edge-face-turn-ext-exclusive-high-choice-parts-strokes-exclusive-seat-special-special-clean-shift-exclusive-strokes-tcomings-enter-supcutscuts-valid-Compatiblereach-suitequs-vesclar-frame-figurecomings-shared-parts-fast-edge-spot-track-inv-shared-cross-prof-spot-functional-spot-region-shift-view-goal-clean-distiplescomingsqus-fix-routecomings-inccomings-Unfoon-course-vespread-Mobile-wide-face-picture-seat-full-specopsy-inanst-parts-clear-round-goal-Cs-Cs-valid burge-face-parts...
Note: explore effect of masking for μ=1,2 (perhaps this is mostly an issue from too aggressive optimisation)
Soft format rewards
Inspired by Will Brown's famous GRPO script, I added a soft reward function variant that relaxes the strict requirements that a response start with a <think> tag and that the reasoning block contain the desired newline structure. Including this makes a big difference! We go from models having zero ability to learn the strict format reward to ones that learn it quite quickly.
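A hedged sketch of what such a soft format reward can look like (the actual open-r1 reward operates on chat-style completions and may differ in details): only require that a <think>...</think> block is followed by an <answer>...</answer> block, without the strict newline layout.

```python
import re

SOFT_FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def soft_format_reward(completions, **kwargs):
    """Reward 1.0 if the tags appear in the right order anywhere in the completion."""
    return [1.0 if SOFT_FORMAT_RE.search(c) else 0.0 for c in completions]
```

In practice this reward gets down-weighted relative to the accuracy reward (see below).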
As shown in the figure below, learning the format reward changes the training dynamics such that:
- The mean completion lengths exhibit a sharp peak and dip early during training (also seen in many other R1-Zero works)
- An equal-weighted format reward produces unstable training
Note: report back the results from down-weighting the format reward functions to see if we recover stability.
Scaling μ with β=0
We previously saw that:
- it's possible to get stable training with no reference model at all (β=0)
- setting μ>1 for β>0 was unstable
We explored whether the same conclusion about instability holds when β=0 and unfortunately it does; scaling μ consistently produces less stable training
Scaling rewards or disabling dropout does not help with stability
As shown in the plots below, for μ=4 it does not make much difference to stability if one disables dropout or scales the rewards with a local constant factor like Dr GRPO:
Note: these conclusions say nothing about downstream performance and should be revisited in a simpler setting where μ=1
Next steps
- Train the simplest, yet stable baseline: set β=0 (to save memory) and mask loss on unterminated completions with a down-weighted soft format reward
- Include downstream evals like MATH-500 and AIME24
- Gradually blend in additional features like clip higher to measure effect on performance
- Train on DAPO dataset
@lewtun Appreciate the thorough experiments. One question I have with regard to the ablations on mu=1,2,4 is that does mu=1 produce stable run throughout? In other words could it be that mu=1 also would become unstable but not until a later point? I guess I'm curious how things work out over a longer horizon beyond 0.1 epoch.
Thanks for the tip @vwxyzjn! In most of our runs, we've indeed been using a small value of β=0.001, but even then found that the KL would diverge at some point (gray curve below):
One thing we found to help was replacing the reference model with the policy every N steps, which at least for our model / dataset combo worked well when N=100 (orange curve).
One question I have with regard to the ablations on mu=1,2,4 is that does mu=1 produce stable run throughout?
That's a great question @RZ412 and one I don't know the answer to (yet). The reason I picked 0.1 epochs is because the Big-Math dataset is, well, big and I wanted ~20k prompts to use for ablations. Perhaps @vwxyzjn has done more large-scale RL experiments on 100k+ prompts and seen whether training remains stable indefinitely
One thing to note is that although stability is important, I'm already seeing the downstream performance on MATH-500 plateau and drop rather early in the training. This is likely a sign of over-optimising on this particular dataset distribution, so in practice I'd take an intermediate checkpoint and then continue training with new, harder problems:
Logbook [11.4.2025]
Here's the main insights we've learned thus far from trying to stabilise training:
- Use β=0 to save memory and encourage exploration (like DAPO).
- Overlong filtering is essential to stability, especially when μ>1.
- Format rewards must be down-weighted relative to accuracy rewards to ensure stability. A weight of around 0.25 to 0.5 seems to work well (see the sketch after this list).
- The format reward affects the training dynamics: as the model learns to insert <think> and <answer> tags, the completion lengths decrease after steady growth.
- Setting μ=4 accelerates training by ~1.5x but comes with a price: the clipped completion ratio grows to ~10% and, despite decaying over the course of training, induces the model to produce gibberish on certain prompts.
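A sketch of how the down-weighting might be wired up, assuming TRL's reward_weights pairs one weight with each reward function in order (accuracy_reward and soft_format_reward refer to the earlier sketches; train_dataset is a placeholder):

```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=[accuracy_reward, soft_format_reward],  # from the earlier sketches
    args=GRPOConfig(
        output_dir="Qwen2.5-7B-GRPO",
        beta=0.0,                    # no reference model
        reward_weights=[1.0, 0.25],  # format reward at 0.25-0.5x the accuracy reward
    ),
    train_dataset=train_dataset,     # placeholder
)
```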
Despite these improvements, we've now hit a recurring new issue: although rewards go up, the downstream evals get worse :) Here's an example from a new baseline run with β=0 and overlong filtering on 1 epoch of ~20k prompts sampled randomly from Big-Math:
As shown in the figure, both the AIME24 and MATH-500 scores improve initially before decaying over the course of training. As noted by @RJT1990 , this is most likely due to either a train/test or difficulty mismatch:
Experiments
Curriculum difficulty
To test the curriculum difficulty hypothesis, I've created new subsets for open-r1/Big-Math-RL-Verified-Processed which progressively eliminate the easier problems (see the loading sketch after this list):
- level_2_3_4_5: concatenate levels 2-5 (easy)
- level_3_4_5: concatenate levels 3-5 (medium)
- level_4_5: concatenate levels 4-5 (hard)
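For reference, loading one of these subsets (assuming the names above are exposed as dataset configurations on the Hub):

```python
from datasets import load_dataset

# Train on the "medium" curriculum: levels 3-5 of Big-Math-RL-Verified-Processed
dataset = load_dataset("open-r1/Big-Math-RL-Verified-Processed", "level_3_4_5", split="train")
```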
Let's start by looking at the training metrics:
We can see that:
- Rewards scale proportionally to the problem difficulty, with harder subsets obtaining lower reward (as expected)
- The completion lengths scale with difficulty (expected)
- For some reason the level 3-5 run failed to learn the format reward and subsequently seems to have started producing very long completions. Overall this highlights an annoying feature of R1-Zero-like training: the right weight for the format reward is hard to guess in advance (a weight of 0.5 is probably better in general)
Now, looking at the downstream evals we see a much better picture than before:
We can clearly see how iteratively removing the simpler levels improves the performance and helps mitigate the collapse. The level 4-5 run is still ongoing, so we will find out soon if the small dip is transient or if we need to go to pure level 5 problems.
Train/test mismatch
To test the train/test mismatch hypothesis, I've created a processed version of DAPO's math dataset, which was curated specifically for competitive mathematics (exactly what AIME and MATH measure): https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed
This dataset looks very promising with both training metrics and downstream evals improving concurrently!