R-PRM: Reasoning-Driven Process Reward Modeling

πŸ“ƒ Paper | πŸ“ Blog | βš™οΈ Code | πŸ€– Model | πŸ€— Dataset | πŸ“­ Contact

Overview

Welcome to the repository of R-PRM, our framework for process-level evaluation of mathematical reasoning in large language models (LLMs).

  • πŸš€ We introduce Reasoning-Driven Process Reward Modeling (R-PRM), a novel approach that enhances LLMs' ability to evaluate mathematical reasoning step by step. By leveraging stronger LLMs to generate seed data, optimizing preferences without additional annotations, and scaling inference-time computation, R-PRM delivers comprehensive, transparent, and robust assessments of reasoning processes (a minimal usage sketch follows this list).
  • πŸ“ˆ Our framework significantly boosts evaluation accuracy and generalization, outperforming strong baselines by wide margins on ProcessBench and PRMBench. When guiding policy models, R-PRM consistently improves reasoning performance across diverse datasets, achieving state-of-the-art (SOTA) results.
  • 🌐 Overall, R-PRM offers a scalable and data-efficient solution to the challenge of scarce process-level annotations, enabling a more generalizable enhancement of reasoning evaluation capabilities without extensive human labeling.
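
As a rough illustration of the reasoning-driven interface, the sketch below loads the released checkpoint with the Hugging Face transformers library and asks it to critique a candidate step. The prompt template is an illustrative placeholder, not the exact format used in training; see the official code for the real template.

```python
# Minimal usage sketch: querying R-PRM with Hugging Face transformers.
# CAUTION: the prompt template below is an illustrative placeholder; the
# official repository defines the exact evaluation format used in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kevinpro/R-PRM-7B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is 12 * 15 - 7?"
steps = ["Step 1: 12 * 15 = 180.", "Step 2: 180 - 7 = 173."]
prompt = (
    f"Question: {question}\n"
    "Solution steps:\n" + "\n".join(steps) + "\n"
    "Analyze the last step and state whether it is correct."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Print only the newly generated critique, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```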

Figure 1: R-PRM Framework Illustration

πŸ† Experiment Results

πŸ§ͺ Data Efficiency

R-PRM demonstrates exceptional data efficiency under varying training scales:

  • With just 12.8k training samples, R-PRM reaches F1 = 52.6, already surpassing most open-source PRMs.
  • R-PRM achieves +3.6 F1 over Qwen2.5-Math-7B-PRM800K when trained on just 64k samples (vs. Qwen's 265k), and extends this lead to +8.7 F1 when both are trained on comparable data volumes.
  • Notably, despite using only ~15% of the data, R-PRM’s performance is already comparable to Qwen2.5-Math-PRM, which was trained on a much larger 1.8M LLM-filtered dataset.

Figure 2: Data Scaling

πŸ“Š ProcessBench

Our reasoning-driven framework improves over Qwen2.5-Math-7B-PRM800K by +8.7 F1 (SFT) and +13.9 F1 (DPO), demonstrating its powerful evaluation capability.

| Model | GSM8K | MATH | OLYMPIAD | OMNIMATH | Avg. F1 |
| --- | --- | --- | --- | --- | --- |
| Math-Shepherd-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| Skywork-PRM-7B | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| Qwen2.5-Math-7B-PRM800K | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| ⭐ R-PRM-7B-SFT | 77.2 (+9.0) | 71.6 (+9.0) | 59.6 (+8.9) | 52.3 (+8.0) | 65.2 (+8.7) |
| ⭐ R-PRM-7B-DPO | 80.7 (+12.5) | 76.9 (+14.3) | 63.8 (+13.1) | 60.1 (+15.8) | 70.4 (+13.9) |
| Qwen2.5-Math-PRM-7B | 82.4 | 77.6 | 67.5 | 66.3 | 73.5 |
| GPT-4o | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |

🧠 PRMBench

R-PRM achieves +8.5 F1 (DPO) over Qwen2.5-Math-7B-PRM800K on PRMBench.
πŸ“Œ Excels in soundness, sensitivity, and multi-dimensional error analysis.

Figure: PRMBench Performance

πŸ§ͺ Best-of-N Strategy

When selecting the best among N reasoning paths, R-PRM improves accuracy by +8.6 points over the pass@1 baseline, achieving the best results among all compared PRMs across six math datasets.

| Setting / Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 16.7 | 42.5 | 76.0 | 42.0 | 37.0 | 39.3 | 42.3 |
| Skywork-PRM-7B | 16.7 | 55.0 | 81.2 | 44.0 | 40.5 | 44.5 | 47.0 |
| Qwen2.5-Math-7B-PRM800K | 13.3 | 57.5 | 80.0 | 44.5 | 43.5 | 43.0 | 47.7 |
| Qwen2.5-Math-PRM-7B | 16.7 | 55.0 | 82.0 | 48.0 | 43.5 | 43.0 | 48.0 |
| ⭐ R-PRM-7B-DPO | 20.0 | 62.5 | 82.2 | 48.0 | 41.0 | 44.1 | 49.6 |
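
The selection loop itself is straightforward. Below is a minimal sketch, assuming two hypothetical helpers: `generate_paths`, which samples n full solutions from the policy model, and `prm_score`, which aggregates R-PRM's per-step judgments for one path into a scalar (e.g., the minimum step score).

```python
# Minimal sketch of Best-of-N selection with a process reward model.
# `generate_paths` and `prm_score` are hypothetical helpers: the first samples
# n candidate solutions from the policy model; the second aggregates R-PRM's
# per-step judgments for one path (e.g., the minimum step score) into a scalar.
from typing import Callable

def best_of_n(
    question: str,
    generate_paths: Callable[[str, int], list[str]],
    prm_score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate reasoning paths and return the highest-scoring one."""
    candidates = generate_paths(question, n)
    return max(candidates, key=lambda path: prm_score(question, path))
```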

πŸ” Guide Search Strategy

By guiding reasoning step by step, R-PRM surpasses pass@1 by +8.4 points, outperforming both majority voting and prior PRM-guided methods.

| Setting / Model | AIME24 | AMC23 | MATH | Olympiad | College | Minerva | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pass@1 (baseline) | 11.2 | 47.8 | 73.0 | 38.0 | 38.6 | 37.2 | 41.0 |
| maj@8 | 20.0 | 57.5 | 79.6 | 47.0 | 41.5 | 42.7 | 48.0 |
| pass@8 (upper bound) | 33.3 | 82.5 | 88.8 | 58.5 | 47.5 | 57.7 | 61.4 |
| Math-Shepherd-7B | 13.3 | 52.5 | 74.6 | 38.5 | 36.5 | 41.2 | 42.8 |
| Skywork-PRM-7B | 10.0 | 57.5 | 77.8 | 41.5 | 39.0 | 43.4 | 44.9 |
| Qwen2.5-Math-7B-PRM800K | 23.3 | 45.0 | 78.2 | 42.0 | 35.5 | 38.6 | 43.8 |
| Qwen2.5-Math-PRM-7B | 16.7 | 60.0 | 81.0 | 43.5 | 39.0 | 40.4 | 46.8 |
| ⭐ R-PRM-7B-DPO | 16.7 | 70.0 | 80.0 | 46.5 | 39.5 | 43.4 | 49.4 |
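
For illustration, a greedy variant of step-level guidance can be sketched as follows (the paper's exact search procedure may differ); `propose_steps`, `step_score`, and `is_final` are hypothetical stand-ins for the policy and PRM interfaces.

```python
# Minimal sketch of PRM-guided greedy step search (a simplified variant).
# `propose_steps`, `step_score`, and `is_final` are hypothetical stand-ins:
# propose_steps(question, path, k) samples k candidate next steps from the
# policy model; step_score(question, path, step) is the PRM's judgment of the
# newest step; is_final(step) detects a concluding step.
def guided_search(question, propose_steps, step_score, is_final, k=8, max_steps=20):
    path = []
    for _ in range(max_steps):
        candidates = propose_steps(question, path, k)
        # Keep the candidate step the PRM judges most promising.
        best = max(candidates, key=lambda s: step_score(question, path, s))
        path.append(best)
        if is_final(best):
            break
    return path
```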

πŸš€ Inference-Time Scaling

Evaluation performance improves consistently as more reasoning trajectories are sampled at inference: on ProcessBench, F1 rises from 62.8 (2 samples) to 67.6 (4 samples). This showcases R-PRM's ability to deliver robust, ensemble-style judgment through multi-path reasoning.

Figure 3: ProcessBench Scaling
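
The aggregation itself can be as simple as a majority vote over sampled verdicts; below is a minimal sketch, assuming a hypothetical `judge_step` helper that returns one sampled correct/incorrect verdict per call.

```python
# Minimal sketch of inference-time scaling: sample several reasoning-based
# judgments of the same step and aggregate them. `judge_step` is a hypothetical
# helper returning one sampled True/False verdict per call.
def ensemble_judgment(question, path, step, judge_step, n_samples=4):
    verdicts = [judge_step(question, path, step) for _ in range(n_samples)]
    # Majority vote; more samples yield a more stable, ensemble-style judgment.
    return sum(verdicts) > n_samples / 2
```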

Citation

If you find this repository helpful, feel free to cite our paper:

@misc{she2025rprmreasoningdrivenprocessreward,
      title={R-PRM: Reasoning-Driven Process Reward Modeling}, 
      author={Shuaijie She and Junxiao Liu and Yifeng Liu and Jiajun Chen and Xin Huang and Shujian Huang},
      year={2025},
      eprint={2503.21295},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.21295}, 
}