[Experiment] Training R1-Zero-like models with Open R1

#20
by lewtun - opened

Context

There are several recent research papers which explore various aspects of R1-Zero-like training on open base models like Qwen2.5-7B and Llama-3.1-8B:

These papers focus on mathematical reasoning (easy to verify) and do not always agree on the key factors needed for R1-Zero-like training. Since TRL now scales to large models, it's time to train R1-Zero-like models with Open R1!

Main goal: reproduce / improve the performance of the DeepSeek-R1-Zero-Qwen-32B model that DeepSeek trained in the R1 tech report:

Screenshot 2025-03-30 at 21.00.59.png

Although DeepSeek found that pure RL performed worse than simple SFT distillation, the DAPO paper shows that by tweaking the GRPO training process, one can actually surpass the distilled model (at least on math):

Screenshot 2025-03-30 at 21.04.15.png

With that in mind, we will explore which subset of ideas from the above papers is sufficient to achieve comparable performance, starting with math, then code and STEM.

We'll use this post and comments to track progress towards this goal - ideas and suggestions are more than welcome!

Setup

Links

Experiments to run

  1. Train a baseline using "standard" parameters on Big-Math-RL-Verified to compare relative performance & learning dynamics ✅
  2. Measure effect on convergence with μ=2,4 (default is 1 in TRL) ✅
  3. Disable KL term with β=0
  4. Clip higher with ε_low=0.2 and ε_high=0.28 (DAPO values) ✅
  5. Add soft overlong reward function from DAPO paper
  6. Add overlong filter (mask the loss of truncated completions)
  7. DAPO (default) vs Dr. GRPO loss ✅

Features to add to TRL

  1. Overlong filter could be exposed as an arg like mask_truncated_completions in GRPOConfig
  2. Add logging to measure average stopped length and clip ratio (SimpleRL-Zoo). Done: https://github.com/huggingface/trl/pull/3188

Features to add to Open R1

  1. Add logging for pass@k accuracy (SimpleRL-Zero)
  2. Add reasoning behaviours callback with LLM APIs to track backtracking and other behaviours during training (SimpleRL-Zero)
    Screenshot 2025-03-30 at 21.25.50.png

Logbook [1.4.2025]

Experiments

  • Focused on training a baseline with Qwen2.5-7B and discovered a serious bug in the accuracy reward function of open-r1 🙀. First, the parser was failing on non-LaTeX ground truth answers like "6", and second we were assigning a default reward of 1 when the ground truth could not be parsed. Fixed here: https://github.com/huggingface/open-r1/pull/566

W&B Chart 1_4_2025, 8_57_12 am.png
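For reference, the corrected behaviour boils down to only granting reward when the gold answer itself parses. The sketch below is a simplified illustration using the math-verify library, not the exact open-r1 implementation (the real fix may skip unparsable samples rather than return 0):

```python
# Simplified sketch of the corrected accuracy reward (not the exact open-r1 code).
from math_verify import parse, verify


def accuracy_reward(completion: str, ground_truth: str) -> float:
    gold = parse(ground_truth)
    if not gold:
        # Previously a default reward of 1.0 was returned when the gold answer
        # could not be parsed, which rewarded the model for free.
        return 0.0
    answer = parse(completion)
    return float(verify(gold, answer))
```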

  • I am running 3 baseline experiments to gauge stability on SynthLabsAI/Big-Math-RL-Verified:
    • v00.0X: train on everything
    • v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama-8B solve rates
    • v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama-8B solve rates

Screenshot 2025-04-01 at 10.58.30.png
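For reference, the "medium" and "hard" splits are defined via percentile bins on the Llama-8B solve rates. A rough sketch of that binning, assuming the dataset exposes a llama8b_solve_rate column and using illustrative 25/75 cut-offs:

```python
# Rough sketch of the difficulty binning; column name and percentile cut-offs are assumptions.
import numpy as np
from datasets import load_dataset

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")
solve_rates = np.array(ds["llama8b_solve_rate"], dtype=float)  # None becomes NaN
p25, p75 = np.nanpercentile(solve_rates, [25, 75])

# Harder problems have lower solve rates.
medium = ds.filter(lambda x: x["llama8b_solve_rate"] is not None and p25 <= x["llama8b_solve_rate"] <= p75)
hard = ds.filter(lambda x: x["llama8b_solve_rate"] is not None and x["llama8b_solve_rate"] < p25)
```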

Overall training looks fairly stable, with accuracy rewards and completion lengths going up. The format reward is currently weighted with 0.2 and might need bumping up if the model cannot get enough signal to learn it. Note that I am using a chat template to define the DeepSeek-R1 prompt:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think>...</think> and <answer>...</answer> tags, respectively, i.e., 
<think>
reasoning process here
</think>
<answer>
answer here
</answer>.

User: Given that the positive real numbers a and b satisfy a + b = 1, find the maximum value of sqrt(a) + sqrt(b).

Assistant: 
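For illustration, wrapping a problem into this prompt can be as simple as the string template below (the actual runs use a chat template, so treat this purely as a sketch):

```python
# Sketch only: the R1-Zero-style prompt as a plain string template.
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think>...</think> and "
    "<answer>...</answer> tags, respectively, i.e., <think>\nreasoning process here\n</think>\n"
    "<answer>\nanswer here\n</answer>.\n\nUser: {problem}\n\nAssistant: "
)

prompt = R1_ZERO_TEMPLATE.format(
    problem="Given that the positive real numbers a and b satisfy a + b = 1, "
            "find the maximum value of sqrt(a) + sqrt(b)."
)
```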

As many other papers have observed, Qwen2.5-7B is remarkably good at following instructions with little prompting and is able to emit the \boxed{} format fairly consistently without any reference to it in the prompt!

TRL / Open R1 updates

Next

  • Preprocess the BigMath dataset to filter any answers that cannot be parsed / verified
  • Rebase on trl@main and re-run baseline to measure stability.
  • Gather downstream evals with pass@1 metric from lighteval: https://github.com/huggingface/lighteval/pull/647

Logbook [4.4.2025]

Experiments

tl;dr

  • For β>0 it seems necessary to sync the reference model with the policy every N steps to avoid instabilities.
  • Increasing μ leads to faster convergence but is less stable
  • Setting β=0 is surprisingly more stable than β>0
  • Clip higher and Dr GRPO loss do not have much effect on the rewards, but also do not induce any additional instability
  • The format reward seems to be too hard for the model to learn, possibly because we enforce a specific new-line format.
  • The new completion metrics like clipped_ratio are very handy for knowing when a run is going off the rails!

Baselines

While setting a baseline with the default settings, we found that vanilla GRPO is unstable and the completions explode midway through training:

Screenshot 2025-04-04 at 09.46.22.png

In line with DAPO, we suspect this is caused by truncated completions destabilising the training, and @ShirinYamani has opened a PR in TRL that we can use to test this hypothesis. Nevertheless, we found that replacing the reference model with the policy every 100 steps (about every 1/6th of training) mitigates the instability for now.

Note: set sync_ref_model=True and sync every 100 steps.
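In TRL, this corresponds roughly to the config below (argument names follow recent TRL releases; double-check them against your installed version):

```python
# Sketch: periodically replace the reference model with the policy.
from trl import GRPOConfig

config = GRPOConfig(
    beta=0.001,                 # keep a (small) KL term, hence a reference model
    sync_ref_model=True,        # enable reference-model syncing
    ref_model_sync_steps=100,   # sync every 100 steps
    ref_model_mixup_alpha=1.0,  # fully replace the ref model (smaller values blend instead)
)
```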

Effect from μ iterations

The GRPO algorithm has an inner optimisation loop where the policy is updated μ times on a given batch:

Screenshot 2025-04-04 at 09.55.02.png
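Conceptually, the inner loop reuses the same batch of generations for μ policy updates. The sketch below is purely illustrative (every helper is a hypothetical callable, not TRL's internals):

```python
# Illustrative sketch of the GRPO inner loop; all helpers here are hypothetical callables.
def grpo_train_step(policy, batch, optimizer, *, mu, generate, compute_rewards,
                    group_normalise, compute_logprobs, grpo_loss):
    completions = generate(policy, batch)                          # G completions per prompt
    rewards = compute_rewards(batch, completions)                  # accuracy + format rewards
    advantages = group_normalise(rewards)                          # per-prompt group baseline
    old_logprobs = compute_logprobs(policy, completions).detach()  # frozen for the clipped ratio

    for _ in range(mu):  # mu corresponds to num_iterations in TRL's GRPOConfig
        loss = grpo_loss(policy, completions, advantages, old_logprobs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```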

We explored the effect of setting μ=1,2,4 and, as shown below, larger values of μ converge much faster but are less stable:

Screenshot 2025-04-04 at 09.54.09.png

The convergence is most visible in the early phase of training, where larger μ values reach the same reward as μ=1 in far fewer steps:

Screenshot 2025-04-04 at 09.57.41.png

Note: if we can stabilise vanilla GRPO, we should revisit scaling μ as it has a clear computational advantage

Effect from having no reference model

Somewhat surprisingly, setting β=0 seems to be more stable than including the reference model + syncing every 100 steps:

Screenshot 2025-04-04 at 10.01.41.png

Disabling the KL term in the GRPO loss is what DAPO recommends (for better exploration), but it is still surprising to see that it is more stable; intuitively, I would have expected the lack of a KL term to encourage more unbounded completions.

Note: explore the effect of increasing μ when β=0. Are the runs still stable?

Clip higher

The DAPO paper recommends using a larger ε on the upper bound of the trust region in the clipped loss. Using their value of ε=0.28 doesn't seem to have much impact on the rewards, but does increase the completion lengths somewhat:

Screenshot 2025-04-04 at 10.05.18.png
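The change itself is just an asymmetric clipping range in the surrogate loss; a minimal sketch (in TRL this is exposed via epsilon / epsilon_high in GRPOConfig, names subject to your version):

```python
# Sketch of the clip-higher surrogate: widen only the upper clipping bound (DAPO).
import torch


def clipped_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()
```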

Note: compare downstream evals to draw a proper conclusion here. Also consider different values of ε_high

Dr GRPO loss (scale_rewards=False)

The Dr GRPO paper recommends removing the reward scaling by σ. Compared to our baseline, this doesn't seem to have a large impact on the rewards, but does produce smaller grad norms and KL terms:

Screenshot 2025-04-04 at 10.07.18.png
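Concretely, scale_rewards=False keeps the group-mean baseline but drops the division by the group standard deviation. A minimal sketch of the advantage computation:

```python
# Sketch: group-relative advantages with and without std scaling (Dr GRPO drops the division).
import torch


def group_advantages(rewards: torch.Tensor, scale_rewards: bool = True) -> torch.Tensor:
    # rewards: (num_prompts, num_generations)
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    if scale_rewards:
        advantages = advantages / (rewards.std(dim=1, keepdim=True) + 1e-4)
    return advantages
```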

Next steps

  • Run downstream evals to compare relation between rewards and things we actually care about
  • Benchmark @ShirinYamani 's PR
  • Explore relaxing the new-line structure of the format reward (or having a soft variant)
  • Run μ ablation for β=0
  • Integrate new pass@1 metric from lighteval

Logbook [8.4.2025]

Experiments

The latest batch of experiments has focused on:

  • Ablating the effect of masking the loss of completions that don't terminate with an EOS token (DAPO)
  • Introducing a soft format reward
  • Ablating the effect of μ for β=0
  • Using a local scaling factor in the loss that's determined by the length of the longest completion in a batch (a variant of Dr GRPO). PR from @edbeeching
  • Disabling dropout. PR from @edbeeching

All runs: https://api.wandb.ai/links/huggingface/qps1tmoj

tl;dr

  • Masking the loss helps stabilise training, but more ablations are needed to determine "optimal" settings.
  • The soft format reward is crucial to enable the model to learn the strict format, but changes the training dynamics and can cause instabilities.
  • Setting μ>1 is unstable for β=0
  • Using either a variant of Dr GRPO or disabling dropout does not cure instability for μ=4

Masking the loss of unterminated completions

Following the DAPO paper, we have implemented masking of the loss on completions that fail to emit an EOS token (i.e. unterminated completions). The results below are shown for μ=4, which was unstable without masking but is now (mostly) stable:

Screenshot 2025-04-08 at 15.01.32.png
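The idea is simply to zero out the loss contribution of any completion that never emits an EOS token; a minimal sketch (in TRL this is the behaviour proposed as mask_truncated_completions in the feature list above):

```python
# Sketch: mask the loss of unterminated completions (DAPO-style overlong filtering).
import torch


def truncation_mask(completion_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    # completion_ids: (batch, seq_len); a completion counts as terminated if it contains EOS.
    terminated = (completion_ids == eos_token_id).any(dim=1)
    # Broadcastable per-completion mask that zeroes out truncated completions entirely.
    return terminated.unsqueeze(1).float()


# usage: per_token_loss = per_token_loss * truncation_mask(completion_ids, eos_id)
```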

Although masking the loss helps with stability, it does not fully eliminate pathologies in the model where it generates unbounded completions with zero reward:

1. **Define the Problem**: We have 8 white teacups and 7 black teacups arranged around a table, and 15 dwarves sitting around the table with 8 white hats and 7 black hats. Each dwarf picks a teacup of the same color as their hat and places it in front of them. The table is then rotated randomly, and we need to find the maximum number of teacups that can be guaranteed to match the color of the dwarf's hat after the rotation.

2. **Set Up the Scenario**: Let's denote the dwarves as \(D_1, D_2, \ldots, D_{15}\) and their hats as \(H_1, H_2, \ldots, H_{15}\). The teacups are also colored as \(T_1, T_2, \ldots, T_{15}\). Each dwarf \(D_i\) picks a teacup \(T_i\) of the same color as their hat \(H_i\).

3. **Analyze the Problem**: After the dwarves pick their teacups, there are 8 white teacups and 7 black teacups in front of them, matching the number of white and black hats. When the table is rotated, we need to ensure that the maximum number of teacups match the color of the dwarf's hat. This is equivalent to finding the maximum number of fixed points in a permutation of 15 elements where 8 are of one type and 7 are of another type.

4. **Use Combinatorial Argument**: We can use the Pigeonhole Principle and Combinatorial Argument. If we consider the dwarves and teacups as a permutation problem, we can say that in the worst-case scenario, after rotation, the teacups and hats will be in a configuration that maximizes the mismatches. However, by the Pigeonhole Principle, there must be at least 8 positions where the teacup color matches the hat color because there are 8 white teacups and 8 white hats (the same for black teacups and black hats).

5. **Conclusion**: Since there are 8 white teacups and 7 black teacups, and 8 white hats and 7 black hats, in the best-case scenario (and by ensuring through logical placement and rotation argument consistently across logical bundling manners viewpoints thematic quant-contained systematic mutual-span-parts-figure-ext-suite-cross-resmetic-ton-parts-cut-cut-exclusive-shift-parts-prundry-full-controach-shared-viewcej-spot-cross-open-round-target-track-view-round-proof-te-suite-unit-flight-limit-scenes-proof-round-exclusive-face-course-edge-cut-course-frame-parts-exclusive-inv-inv-in-ext-inv-special-parts-round-depend-flight-suite-shift-open-spot-special-goal-unit-edge-ext-scalable-view-goal-edge-frame-choice-strokes-scenes-highcomings-supopsy-shared-stage-range-open-inc-exclusive-fast-frame-limit-supsemb-exp-edge-cross-spot-sup-Mobile-real-top-ext-Mobile-inccomings-distqus-te-exclusive-spot-exclusive-face-choice-seat-var-open-course-shared-special-turn-span-Mobile-Unrary-spot-parts-ext-clean-choice-course-choice-cutcuts-clean-exclusive-strong-suite-fast-seat-exclusive-strong-strokes-ves-edge-inv-shared-best-course-Compatible-picture-inv-strokes-inc-Cs-view-real-clear-round-span-exc-exclusive-parts-goal-best burge-parts-picture-suite-full-ves-fast-scenes-Mobile-ves-edge-Mobile-best-inaponsionedcejreach-limitreach-cross-Cs-goal-round-special-entry-ext-stripcomings-face-strokes-suite-fast-shared-parts-figurecomings-round-special-special-shared-prignal-edge-face-turn-ext-exclusive-high-choice-parts-strokes-exclusive-seat-special-special-clean-shift-exclusive-strokes-tcomings-enter-supcutscuts-valid-Compatiblereach-suitequs-vesclar-frame-figurecomings-shared-parts-fast-edge-spot-track-inv-shared-cross-prof-spot-functional-spot-region-shift-view-goal-clean-distiplescomingsqus-fix-routecomings-inccomings-Unfoon-course-vespread-Mobile-wide-face-picture-seat-full-specopsy-inanst-parts-clear-round-goal-Cs-Cs-valid burge-face-parts...

Note: explore effect of masking for μ=1,2 (perhaps this is mostly an issue from too aggressive optimisation)

Soft format rewards

Inspired by Will Brown's famous GRPO script, I added a soft reward function variant that relaxes the strict requirements that a response start with a <think> tag and that the reasoning block follow the desired newline structure. Including this makes a big difference: we go from models that never learn the strict format reward to ones that learn it quite quickly.
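For reference, the soft variant is along these lines (a sketch, not the exact open-r1 implementation): partial credit for each tag pair being present, without enforcing where the <think> block starts or how the newlines are laid out.

```python
# Sketch of a soft format reward: partial credit per tag pair, no strict newline structure.
import re


def soft_format_reward(completion: str) -> float:
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.5
    if re.search(r"<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5
    return reward
```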

As shown in the figure below, learning the format reward changes the training dynamics such that:

  • The mean completion lengths exhibit a sharp peak and dip early during training (also seen in many other R1-Zero works)
  • An equal-weighted format reward produces unstable training

Screenshot 2025-04-08 at 17.20.50.png

Note: report back the results from down-weighting the format reward functions to see if we recover stability.

Scaling μ with β=0

We previously saw that:

  • it's possible to get stable training with no reference model at all (β=0)
  • setting μ>1 for β>0 was unstable

We explored whether the same conclusion about instability holds when β=0, and unfortunately it does: scaling μ consistently produces less stable training.

Screenshot 2025-04-08 at 17.29.28.png

Scaling rewards or disabling dropout does not help with stability

As shown in the plots below, for μ=4 it does not make much difference to stability if one disables dropout or scales the rewards with a local constant factor like Dr GRPO:

Screenshot 2025-04-08 at 17.32.48.png
Screenshot 2025-04-08 at 17.32.35.png

Note: these conclusions say nothing about downstream performance and should be revisited in a simpler setting where μ=1

Next steps

  • Train the simplest, yet stable baseline: set β=0 (to save memory) and mask loss on unterminated completions with a down-weighted soft format reward
  • Include downstream evals like MATH-500 and AIME24
  • Gradually blend in additional features like clip higher to measure effect on performance
  • Train on DAPO dataset

@lewtun cool logs. Something we found helpful for stabilizing training is to use a very small beta like 0.001. It doesn't prevent the blow-up completely, but at least the KL estimator doesn't explode.

image.png

@lewtun Appreciate the thorough experiments. One question I have regarding the ablations on mu=1,2,4: does mu=1 produce a stable run throughout? In other words, could it be that mu=1 would also become unstable, just at a later point? I'm curious how things work out over a longer horizon beyond 0.1 epochs.


Thanks for the tip @vwxyzjn ! In most of our runs, we've indeed been using a small value of β=0.001 but even then found that the KL would diverge at some point (gray curve below):

Screenshot 2025-04-09 at 10.10.22.png

One thing we found to help was replacing the reference model with the policy every N steps, which at least for our model / dataset combo worked well when N=100 (orange curve).


One question I have with regard to the ablations on mu=1,2,4 is that does mu=1 produce stable run throughout?

That's a great question @RZ412 and one I don't know the answer to (yet). The reason I picked 0.1 epochs is that the Big-Math dataset is, well, big, and I wanted ~20k prompts to use for ablations. Perhaps @vwxyzjn has done more large-scale RL experiments on 100k+ prompts and has seen whether training remains stable indefinitely.

One thing to note is that although stability is important, I'm already seeing the downstream performance on MATH-500 plateau and drop rather early in the training. This is likely a sign of over-optimising on this particular dataset distribution, so in practice I'd take an intermediate checkpoint and then continue training with new, harder problems:

output-18 (1).png

Just want to thank you for sharing these important experiments with the community @lewtun . We've learned a lot from your experiment logs and hope we can contribute back soon :)

Logbook [11.4.2025]

Here are the main insights we've learned thus far from trying to stabilise training:

  • Use β=0 to save memory and encourage exploration (like DAPO).
  • Overlong filtering is essential for stability, especially when μ>1.
  • Format rewards must be down-weighted relative to accuracy rewards to ensure stability. A weight of around 0.25 to 0.5 seems to work well.
  • Format rewards affect the training dynamics: as the model learns to insert <think> and <answer> tags, the completion lengths decrease after a period of steady growth.
  • Setting μ=4 accelerates training by ~1.5x but comes with a price: the clipped completion ratio grows to ~10% and despite decaying over the course of training, induces the model to produce gibberish on certain prompts.

image.png
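Putting these insights together, the "simplest stable baseline" we are converging on looks roughly like the config below (a sketch: argument names follow recent TRL releases and the reward weights are indicative, so verify both against the current code):

```python
# Sketch of a "simple but stable" GRPO setup based on the insights above.
from trl import GRPOConfig

config = GRPOConfig(
    beta=0.0,                         # no reference model: saves memory, encourages exploration
    mask_truncated_completions=True,  # overlong filtering (DAPO)
    num_iterations=1,                 # mu > 1 trains faster but risks clipping and gibberish
    reward_weights=[1.0, 0.25],       # accuracy reward, down-weighted soft format reward
)
```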

Despite these improvements, we've now hit a recurring new issue: although rewards go up, the downstream evals get worse :) Here's an example from a new baseline run with β=0 and overlong filtering on 1 epoch of ~20k prompts sampled randomly from Big-Math:

output-23.png

As shown in the figure, both the AIME24 and MATH-500 scores improve at first before decaying over the course of training. As noted by @RJT1990, this is most likely due to either a train/test or difficulty mismatch:

Screenshot 2025-04-11 at 11.52.58.png

Experiments

Curriculum difficulty

To test the curriculum difficulty hypothesis, I've created new subsets for open-r1/Big-Math-RL-Verified-Processed which progressively eliminate the easier problems:

  • level_2_3_4_5: concatenate levels 2-5 (easy)
  • level_3_4_5: concatenate levels 3-5 (medium)
  • level_4_5: concatenate levels 4-5 (hard)
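Assuming these subsets are published as configs of the processed dataset (a hedge: check the dataset card for the exact layout), loading one for training is a one-liner:

```python
# Sketch: load one of the curriculum subsets (config names as listed above).
from datasets import load_dataset

medium = load_dataset("open-r1/Big-Math-RL-Verified-Processed", "level_3_4_5", split="train")
```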

Let's start by looking at the training metrics:

Screenshot 2025-04-11 at 12.09.13.png

We can see that:

  • Rewards track problem difficulty, with harder subsets obtaining lower rewards (as expected)
  • The completion lengths scale with difficulty (expected)
  • For some reason the level 3-5 run failed to learn the format reward and subsequently started producing very long completions. This highlights an annoying feature of R1-Zero-like training: the right weight for the format reward is hard to guess in advance (a weight of 0.5 is probably better in general)

Now, looking at the downstream evals we see a much better picture than before:

image.png

We can clearly see how iteratively removing the simpler levels improves the performance and helps mitigate the collapse. The level 4-5 run is still ongoing, so we will find out soon if the small dip is transient or if we need to go to pure level 5 problems.

Train/test mismatch

To test the train/test mismatch hypothesis, I've created a processed version of DAPO's math dataset, which was curated specifically for competition mathematics, the domain that AIME and MATH-500 measure: https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed

This dataset looks very promising with both training metrics and downstream evals improving concurrently!

Screenshot 2025-04-11 at 12.27.58.png

image.png

Sign up or log in to comment