arxiv:2506.08007

Reinforcement Pre-Training

Published on Jun 9
· Submitted by unilm on Jun 10
#1 Paper of the day
Abstract

Reinforcement Pre-Training (RPT) improves language model accuracy through reinforcement learning and offers a scalable method for leveraging text data for general-purpose RL.

AI-generated summary

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
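
To make the idea concrete, here is a minimal, illustrative sketch of how next-token prediction can be posed as a verifiable reasoning task: the model is prompted to reason about a context prefix, commit to a single next token, and receive a rule-based reward against the actual continuation. The prompt wording, the \boxed{} answer convention, and the helper names are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch only: the prompt template, the \boxed{} convention,
# and the exact reward rule are assumptions, not the paper's implementation.

def build_next_token_prompt(context: str) -> str:
    """Ask the model to reason about how `context` continues and commit
    to a single next token inside \\boxed{}."""
    return (
        "You are given a text prefix. Think step by step about how the text "
        "is likely to continue, then output the single next token inside "
        "\\boxed{}.\n\nPrefix:\n" + context
    )

def verifiable_reward(predicted_token: str, ground_truth_token: str) -> float:
    """Rule-based, verifiable reward: 1.0 iff the committed token matches
    the actual next token from the corpus, else 0.0."""
    return 1.0 if predicted_token == ground_truth_token else 0.0
```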

Community

What is the performance on other math reasoning benchmarks? (aime24, aime25, math500, etc.)

·
Paper author

Thanks for the question. As shown in the 'Before RL' column of Table 2, RPT achieves stronger performance on math problems even before reinforcement fine-tuning.
We've also achieved positive results on the math datasets you mentioned. We're continuing to scale up and organize our work, and in the coming period we'll release evaluation results from larger-scale experiments, which will include the math datasets you're interested in.

I never thought RL could be used for pre-training

Seems cool. You could say this is 'NTR' [next token reasoning].

·

🤣

Excellent paper, but I wonder about the training cost. The causal mask in the original GPT makes pre-training efficient, but in this work it seems hard to use the causal mask in RPT, so won't that increase the cost of RPT?

·

I wonder the same. In my interpretation, it's pre-training in the sense that it's self-supervised training on a curated dataset, but it's not the same as standard pre-training in terms of compute efficiency.

What would happen if you applied RPT recursively - having the model reason about each token within its own reasoning chain? Would meta-reasoning about the reasoning process itself lead to even better performance, or would the computational overhead outweigh the benefits? :)

·

In general, although RPT encourages next-token prediction reasoning on the curated, truncated dataset, it does so by rewarding the model against a reference/ground truth that is already known: simply the next token in the truncated text, which we can use to reward or penalize the model and hence reinforce better reasoning. Your suggestion is smart, but the first issue, before computational overhead, is that you do not actually know which next token the model will output inside its own reasoning chain. You therefore cannot reward or penalize the model with a reward model that has no known reference to verify against, since the reference (the next token in the model's own reasoning) is unknown to you. At least under the GRPO approach, unless you have a known, verifiable reference to reward or penalize against, you cannot do RL.
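
To ground the point about needing a verifiable reference, here is a minimal sketch of a GRPO-style scoring step: every sampled rollout for a context is scored against the known next token from the corpus, and advantages are computed relative to the group. The function names and the exact normalization (mean/std with a small epsilon) are assumptions for illustration.

```python
from statistics import mean, pstdev

def binary_rewards(predictions: list[str], reference_token: str) -> list[float]:
    """Rule-based rewards need a known reference: here, the actual next token
    from the corpus. A model's own hidden reasoning tokens provide no such
    reference to verify against."""
    return [1.0 if p == reference_token else 0.0 for p in predictions]

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center each rollout's reward on the group mean
    and scale by the group standard deviation (epsilon for stability)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Example: 8 rollouts for one context; three of them committed to the true token.
rollout_predictions = ["the", "a", "the", "dog", "the", "cat", "a", "an"]
advantages = group_relative_advantages(binary_rewards(rollout_predictions, "the"))
```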

I see the paper says RPT is initialized from a reasoning model and mentions investigating RPT from a standard base LLM under Future Work. I wonder how, or whether, the training and thought process would differ when initialized from a base LLM instead of a reasoning model.

·
Paper author

We're working on it. Stay tuned!

If I am not mistaken, your approach doesn't allow the massively parallel scaling of standard pre-training, so you shouldn't be constrained to just next-token prediction.

Have you considered other RL objectives inspired by pre-training besides next-token prediction? For example, masked token prediction and next-sentence prediction from BERT.

·

Good point! We focus on token-level prediction because the objective is more atomic and clear-cut. Masked token prediction and next-sentence prediction are quite interesting and worth exploring :)

Can you provide your fine-tuning code? I am interested in applying the same approach to the reasoning model MiMo-7B, using the same proxy model for entropy, pre-processing the same dataset first, and then using PPO with binary rewards. Do you think this is achievable on a single H100, using vLLM for generation and splitting vLLM/training by a 30%/70% ratio with a shorter sequence length, since the MiMo model doesn't tend to be very verbose? Also, when using the dataset, are you combining the question with the answer and doing next-token prediction on the whole text, or just on the answer? I have written my own training code but am really interested in seeing your implementation, as this needs memory efficiency and speed.
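
Not the authors' code, but since a later comment mentions TRL's GRPO implementation with vLLM, here is a minimal sketch of one possible setup. The dataset file, column names (`prompt`, `next_token`), model id, and hyperparameters are all assumptions, and exact GRPOConfig option names can drift between TRL versions, so treat it as a starting point rather than the paper's recipe.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumed dataset layout: a "prompt" column holding the truncated context
# (asking for the next token in \boxed{}) and a "next_token" column holding
# the ground-truth continuation. File name and columns are hypothetical.
dataset = load_dataset("json", data_files="rpt_prompts.jsonl", split="train")

def next_token_reward(completions, next_token, **kwargs):
    """Binary reward: 1.0 if the \\boxed{} prediction matches the ground-truth
    next token, 0.0 otherwise (including unparsable completions)."""
    rewards = []
    for completion, target in zip(completions, next_token):
        m = re.search(r"\\boxed\{(.*?)\}", completion)
        rewards.append(1.0 if m and m.group(1).strip() == target.strip() else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="rpt-mimo-7b",         # hypothetical run name
    num_generations=8,                # rollouts per prompt (group size)
    max_completion_length=2048,       # shorter budget for a less verbose model
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    use_vllm=True,                    # offload generation to vLLM
    bf16=True,
)

trainer = GRPOTrainer(
    model="XiaomiMiMo/MiMo-7B-RL",    # assumed hub id for the MiMo reasoning model
    reward_funcs=next_token_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```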

Is it even possible to do RPT-zero, i.e., purely RPT a model from scratch?

·

Then where would the initial CoT capability come from?

Just curious whether you are planning to publish RPT-14B or equivalent model weights 🙂.


Is there any difference between standard next-token prediction and next-token reasoning at inference time? Does the RPT-trained model also need to think about each token during inference, or does it only think about some difficult tokens? But how do we know which tokens are difficult at inference time? And if we follow this approach, won't it greatly increase inference time? I hope you can share a complete inference example so that I can resolve these doubts. Thank you very much for your work.

·

From what I understand, next-token reasoning always uses a chain of thought / scratchpad to output intermediate tokens before the final prediction is produced in a special format (\boxed{}).
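
As a small illustration of that output convention, here is a hedged sketch of parsing a next-token-reasoning trace: free-form chain of thought followed by a final commitment inside \boxed{}. The regex and whitespace handling are assumptions.

```python
import re

# Matches the content of \boxed{...}; DOTALL in case the answer spans lines.
BOXED = re.compile(r"\\boxed\{(.*?)\}", re.DOTALL)

def extract_prediction(trace: str) -> str | None:
    """Return the last \\boxed{...} answer in a reasoning trace, or None if
    the model never committed to a prediction (which would earn zero reward)."""
    matches = BOXED.findall(trace)
    return matches[-1].strip() if matches else None

trace = ("The sentence sets up a contrast, so the next word is likely a "
         "conjunction... \\boxed{ but }")
assert extract_prediction(trace) == "but"
```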

Sharing a video & written explanation of this paper - https://aipapersacademy.com/reinforcement-pre-training/

Excellent paper! My question is whether RPT is done from scratch, or whether the base model's weights are used and then updated (so a re-pre-training process). I am really hoping to get a reply to this query.

·

No, it's not. They start from a distilled model.

I applied RPT to DeepSeek-R1-0528-Qwen3-8B using a small dataset of only around 500 samples, through the TRL GRPO implementation with vLLM for generation.
Running the MMLU-Pro benchmark on both the base model and the fine-tuned model, the overall result increased from 38.5% to 44.5% (run through lm-eval with max generated tokens = 2048, so some samples got truncated; with a larger token budget the values would probably increase).
Check out the model at ykarout/RPT-DeepSeek-R1-0528-Qwen3-8B.

·

Hey, I want to ask: did you use GRPO for pre-training or for fine-tuning?

That's a great idea, but the required compute is enormous, estimated at 100 to 1000 times that of traditional pre-training. Moreover, too few experiments were conducted (possibly due to the excessively high compute requirements), making this somewhat of a semi-finished product that needs further optimization.

·

Not at all. I've run a replica of the training on the same model (but the 8B version), with parameters almost exactly as outlined in the paper, on a small 500-sample dataset and a generous 8k-token generation budget, and it finished in around 3.5 to 4 hours on a single H200 with 140 GB of VRAM. Although the dataset size is on the small side, I've seen around a +10% improvement in the overall MMLU-Pro benchmark score.
I am confident this is going to grab even more attention with time as more people try to validate the results.

For the part of the paper about filtering the hard tokens, what was the context length, i.e., how many tokens are visible when making the next prediction? And which tokenizer did you use: the same one as the model, or just whitespace?

"We use the OmniMATH dataset [GSY+24] for reinforcement pre-training. OmniMATH contains
4,428 competition-level mathematical problems and solutions from official websites such as AoPS
Wiki3 and AoPS forum4. Since many tokens are easily predictable even without reasoning, we
perform token-level data filtering before reinforcement pre-training. Particularly, we use DeepseekR1-Distill-Qwen-1.5B as a small proxy model. For each token, we calculate the proxy model entropy on the top-16 next tokens. By applying an entropy threshold, we filter out low-entropy positions, prioritizing training on challenging tokens that require greater computational effort to predict."
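
The quoted filtering step can be sketched roughly as follows: run the small proxy model over each document, compute the entropy of its (renormalized) top-16 next-token distribution at every position, and keep only positions above a threshold. The model id matches the quoted passage, but the threshold value, the renormalization choice, and the function names are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proxy model named in the quoted passage; everything else (threshold,
# renormalization over the top-16, batching) is an illustrative assumption.
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def high_entropy_positions(text: str, top_k: int = 16, threshold: float = 0.5):
    """Return token positions that are 'hard' for the proxy model: the entropy
    of its top-k next-token distribution exceeds `threshold`."""
    ids = tokenizer(text, return_tensors="pt").input_ids   # [1, seq_len]
    logits = model(ids).logits[0]                          # [seq_len, vocab]
    probs = torch.softmax(logits.float(), dim=-1)
    topk = probs.topk(top_k, dim=-1).values                # [seq_len, top_k]
    topk = topk / topk.sum(dim=-1, keepdim=True)           # renormalize over top-k
    entropy = -(topk * topk.log()).sum(dim=-1)             # [seq_len]
    # The logits at position i predict token i + 1, so keep targets i + 1.
    return [i + 1 for i in range(ids.shape[1] - 1) if entropy[i] > threshold]
```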

