# R-PRM: Reasoning-Driven Process Reward Modeling
Paper | Blog | Code | Model | Dataset | Contact
## Overview
Welcome to the repository for R-PRM, our framework for reasoning-driven, process-level evaluation of mathematical reasoning in large language models (LLMs).
- We introduce Reasoning-Driven Process Reward Modeling (R-PRM), a novel approach that enhances LLMs' ability to evaluate mathematical reasoning step by step. By leveraging stronger LLMs to generate seed data, optimizing preferences without additional annotations, and scaling inference-time computation, R-PRM delivers comprehensive, transparent, and robust assessments of reasoning processes (a minimal usage sketch follows this list).
- Our framework significantly boosts evaluation accuracy and generalization, outperforming strong baselines by wide margins on ProcessBench and PRMBench. When guiding policy models, R-PRM consistently improves reasoning performance across diverse datasets, achieving state-of-the-art (SOTA) results.
- Overall, R-PRM offers a scalable and data-efficient solution to the challenge of scarce process-level annotations, enabling a more generalizable enhancement of reasoning evaluation capabilities without extensive human labeling.
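As a quick illustration of the reasoning-driven evaluation idea, the hedged sketch below prompts an R-PRM-style checkpoint to critique a single solution step before giving a verdict. The checkpoint id and prompt template here are placeholders, not the exact ones released with this repository; please refer to the official code and model cards for the canonical setup.

```python
# Minimal sketch (not the official inference script): asking an R-PRM-style model
# to reason about one step of a candidate solution. MODEL_ID and the prompt
# template are assumptions made for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/R-PRM-7B-DPO"  # placeholder: substitute the released checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def judge_step(question: str, prior_steps: list[str], step: str, max_new_tokens: int = 512) -> str:
    """Ask the reward model to analyze a single step and return its critique text."""
    prompt = (
        "You are given a math problem and a partial solution. Analyze whether the last step "
        "is correct, explaining your reasoning before giving a final verdict of 'correct' or 'incorrect'.\n\n"
        f"Problem: {question}\n"
        "Previous steps:\n" + "\n".join(prior_steps) + "\n"
        f"Step to evaluate: {step}\n"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated critique, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Because the model writes out its analysis before judging, the returned text can be parsed for a final "correct"/"incorrect" verdict or converted into a step-level score.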
## Experiment Results
### Data Efficiency
R-PRM demonstrates exceptional data efficiency under varying training scales:
- With just 12.8k training samples, R-PRM reaches F1 = 52.6, already surpassing most open-source PRMs.
- R-PRM achieves +3.6 F1 over Qwen2.5-Math-7B-PRM800K when trained on just 64k samples (vs. Qwen's 265k), and extends this lead to +8.7 F1 when both are trained on comparable data volumes.
- Notably, despite using only ~15% of the data, R-PRM's performance is already comparable to that of Qwen2.5-Math-PRM, which was trained on a much larger, 1.8M-sample LLM-filtered dataset.
### ProcessBench
Our reasoning-driven framework improves over Qwen2.5-Math-7B-PRM800K by +8.7 F1 (SFT) and +13.9 F1 (DPO), demonstrating its powerful evaluation capability.
| Model | GSM8K | MATH | OLYMPIAD | OMNIMATH | Avg. F1 |
|---|---|---|---|---|---|
| Math-Shepherd-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| Qwen2.5-Math-7B-PRM800K | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| ⭐ R-PRM-7B-SFT | 77.2 (+9.0) | 71.6 (+9.0) | 59.6 (+8.9) | 52.3 (+8.0) | 65.2 (+8.7) |
| ⭐ R-PRM-7B-DPO | 80.7 (+12.5) | 76.9 (+14.3) | 63.8 (+13.1) | 60.1 (+15.8) | 70.4 (+13.9) |
| Qwen2.5-Math-PRM-7B | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
### PRMBench
R-PRM-7B-DPO achieves +8.5 F1 over Qwen2.5-Math-7B-PRM800K on PRMBench, excelling in soundness, sensitivity, and multi-dimensional error analysis.
### Best-of-N Strategy
When selecting the best among N sampled reasoning paths, R-PRM improves accuracy by +8.6 points over the pass@1 baseline, achieving the best results among all compared PRMs across six math datasets; a minimal reranking sketch follows the table.
| Setting / Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg. |
|---|---|---|---|---|---|---|---|
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 16.7 | 42.5 | 76.0 | 42.0 | 37.0 | 39.3 | 42.3 |
| Skywork-PRM-7B | 16.7 | 55.0 | 81.2 | 44.0 | 40.5 | 44.5 | 47.0 |
| Qwen2.5-Math-7B-PRM800K | 13.3 | 57.5 | 80.0 | 44.5 | 43.5 | 43.0 | 47.7 |
| Qwen2.5-Math-PRM-7B | 16.7 | 55.0 | 82.0 | 48.0 | 43.5 | 43.0 | 48.0 |
| ⭐ R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
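The hedged sketch below shows one way such Best-of-N reranking can be wired up: each of the N sampled solutions is scored step by step by a PRM, the per-step scores are aggregated (minimum aggregation is assumed here; the paper's exact aggregation may differ), and the highest-scoring candidate is kept.

```python
# Hedged sketch of Best-of-N reranking with a process reward model (PRM).
# `step_scorer` is a placeholder callable that returns P(step is correct)
# given the preceding steps; it is not part of the released API.
from typing import Callable

def best_of_n(
    candidates: list[list[str]],                      # N candidate solutions, each a list of steps
    step_scorer: Callable[[list[str], str], float],   # (prior_steps, step) -> score in [0, 1]
) -> int:
    """Return the index of the candidate whose weakest step is strongest (min-aggregation)."""
    best_idx, best_score = 0, float("-inf")
    for i, steps in enumerate(candidates):
        step_scores = [step_scorer(steps[:t], steps[t]) for t in range(len(steps))]
        agg = min(step_scores) if step_scores else float("-inf")  # assumption: min over steps
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx
```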
### Guided Search Strategy
By guiding the policy model's reasoning step by step, R-PRM surpasses the pass@1 baseline by +8.4 points, outperforming both majority voting and prior PRM-guided methods; a step-selection sketch follows the table.
| Setting / Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg. |
|---|---|---|---|---|---|---|---|
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | 43.4 | 44.9 |
| Qwen2.5-Math-7B-PRM800K | 23.3 | 45.0 | 78.2 | 42.0 | 35.5 | 38.6 | 43.8 |
| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | 81.0 | 43.5 | 39.0 | 40.4 | 46.8 |
| ⭐ R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
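The hedged sketch below illustrates the general shape of PRM-guided step-level search: at each step the policy proposes several continuations and the PRM's score decides which one to keep. The sampler and scorer callables, the stopping heuristic, and the proposal width are placeholders, not the exact procedure used in the paper.

```python
# Hedged sketch of PRM-guided, step-by-step search. `sample_next_steps` and
# `score_step` are hypothetical callables standing in for the policy sampler
# and the R-PRM scorer respectively.
from typing import Callable

def guided_search(
    question: str,
    sample_next_steps: Callable[[str, list[str], int], list[str]],  # policy proposes k next steps
    score_step: Callable[[str, list[str], str], float],             # PRM scores one candidate step
    k: int = 8,
    max_steps: int = 32,
) -> list[str]:
    """Greedily extend the solution, keeping the PRM's highest-scoring step at each turn."""
    steps: list[str] = []
    for _ in range(max_steps):
        proposals = sample_next_steps(question, steps, k)
        if not proposals:
            break
        best = max(proposals, key=lambda s: score_step(question, steps, s))
        steps.append(best)
        if best.strip().lower().startswith("final answer"):  # crude stopping heuristic (assumption)
            break
    return steps
```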
### Inference-Time Scaling
Evaluation performance improves consistently as more reasoning trajectories are sampled at inference: on ProcessBench, average F1 rises from 62.8 with 2 sampled trajectories to 67.6 with 4. This showcases R-PRM's ability to deliver robust, ensemble-style judgment through multi-path reasoning.
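A minimal sketch of this idea, under the assumption that each sampled judgment can be reduced to a 0/1 verdict: sample several independent reasoning-driven critiques of the same step and average them before thresholding.

```python
# Hedged sketch of inference-time scaling on the evaluator side.
# `sample_judgment` is a placeholder that runs one stochastic critique of the
# step and returns 1.0 if its final verdict is "correct", else 0.0.
from statistics import mean
from typing import Callable

def ensembled_step_verdict(
    question: str,
    prior_steps: list[str],
    step: str,
    sample_judgment: Callable[[str, list[str], str], float],
    num_samples: int = 4,
    threshold: float = 0.5,
) -> bool:
    """Average num_samples independent judgments; accept the step if the mean clears the threshold."""
    scores = [sample_judgment(question, prior_steps, step) for _ in range(num_samples)]
    return mean(scores) >= threshold
```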
## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@misc{she2025rprmreasoningdrivenprocessreward,
      title={R-PRM: Reasoning-Driven Process Reward Modeling},
      author={Shuaijie She and Junxiao Liu and Yifeng Liu and Jiajun Chen and Xin Huang and Shujian Huang},
      year={2025},
      eprint={2503.21295},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.21295},
}
```