I wrote a paper titled "Mathematical Constraints of RL-Induced Reasoning: A Rebuttal to DeepSeek-R1". Based on my mathematical analysis, I raised questions similar to the "missing pieces" you mentioned, and I proposed several empirical tests along the lines of the tasks you listed. If interested, let's talk: charlesluo22@gmail.com. I will send you the full paper, which has been submitted to the Journal of Machine Learning Research.
Cheers,
Charles
Abstract
DeepSeek-R1 claims that reinforcement learning (RL) induces emergent reasoning capabilities in large language models (LLMs), suggesting a fundamental shift in AI development. However, our theoretical and computational analysis challenges this assertion.
Our mathematical framework (Section 2) demonstrates that RL alone cannot induce reasoning without a strong pretraining foundation, which remains the primary driver of reasoning capabilities. Due to high computational costs, poor sample efficiency, and reward sparsity, RL struggles to develop complex reasoning from scratch. Instead, it fine-tunes and reinforces existing pretraining knowledge rather than generating novel reasoning abilities.
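To make the reward-sparsity point concrete, here is an illustrative back-of-the-envelope sketch (my own notation, not the paper's Section 2 framework): if a correct solution requires a chain of T reasoning steps and the pretrained policy assigns probability p_t to the correct choice at step t, then under outcome-only rewards the expected number of rollouts before any positive signal is the reciprocal of the chain's success probability:

\[
\Pr[\text{correct chain}] \;=\; \prod_{t=1}^{T} p_t
\quad\Longrightarrow\quad
\mathbb{E}[\text{rollouts until first reward}] \;=\; \frac{1}{\prod_{t=1}^{T} p_t},
\]

which grows exponentially in $T$ whenever the per-step accuracies $p_t$ are well below 1. A strong pretraining prior (each $p_t$ close to 1) is what keeps this quantity small enough for RL to receive usable feedback at all.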
Furthermore, DeepSeek-R1’s observed improvements align with well-established pretraining scaling laws, not independent RL-driven emergence. A detailed analysis of DeepSeek-R1’s RL algorithm (Section 3.3) reveals that its Group Relative Policy Optimization (GRPO) approach constrains RL updates within the limits of pretraining knowledge rather than driving reasoning innovation. Additionally, its rule-based reward system optimizes response formatting but does not introduce conceptual advancements in reasoning.
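For readers unfamiliar with GRPO, the following sketch of its objective (as described in the public DeepSeekMath/DeepSeek-R1 reports; notation may differ slightly from the full paper) shows the two mechanisms the abstract points to: the clipped probability ratio keeps each update close to the sampling policy, and the KL penalty anchors the learned policy to the frozen reference model.

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\!\left[
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(
\min\!\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)
-\beta\, D_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
\Big)
\right],
\]

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$, the group-relative advantage is $\hat{A}_{i,t} = \big(r_i - \operatorname{mean}(\{r_1,\dots,r_G\})\big) / \operatorname{std}(\{r_1,\dots,r_G\})$ for outcome rewards $r_i$ over a group of $G$ sampled responses, and $\pi_{\mathrm{ref}}$ is the frozen reference (pretrained/SFT) model. The $\beta\, D_{\mathrm{KL}}$ term and the $\varepsilon$-clipping are precisely what bound how far RL can move the policy from its pretrained starting point.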
Given these findings, we emphasize the need for rigorous empirical testing to isolate RL’s role from pretraining effects. Until such evidence is presented, RL should be viewed primarily as a fine-tuning mechanism rather than a fundamental source of emergent reasoning in LLMs.