MRT-offline (R1-Distill-Qwen-1.5B)

DeepSeek-R1-Distill-Qwen-1.5B fine-tuned with Meta Reinforcement Fine-Tuning (MRT) in the open-ended setting. This is the paper-faithful "offline" variant: a dense progress reward (the change in the likelihood of eventual success contributed by each reasoning episode) is estimated by a forced-termination meta-prover over an off-policy prefix and applied as a single end-of-trace bonus on top of outcome-reward GRPO.
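As a rough illustration of the reward described above, the sketch below combines an outcome reward with an α-weighted progress bonus. It assumes that the likelihood of eventual success before and after an episode is estimated as the fraction of correct forced-termination rollouts, and that the outcome reward is 0/1. The function names (`success_rate`, `progress`, `trace_reward`) are hypothetical and are not taken from the released training code.

```python
from typing import Sequence


def success_rate(rollout_correct: Sequence[bool]) -> float:
    """Monte Carlo estimate of eventual success: the fraction of
    forced-termination rollouts that reached a correct answer."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def progress(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Change in estimated success likelihood contributed by one episode."""
    return success_rate(after) - success_rate(before)


def trace_reward(outcome_correct: bool, progress_bonus: float, alpha: float = 1.0) -> float:
    """Assumed 0/1 outcome reward plus a single alpha-weighted end-of-trace bonus."""
    return float(outcome_correct) + alpha * progress_bonus


# Example: an episode raises the estimated success rate from 1/4 to 2/4.
bonus = progress([True, False, False, False], [True, True, False, False])
reward = trace_reward(outcome_correct=True, progress_bonus=bonus, alpha=1.0)
```

With these inputs the episode's progress is 0.25, so a correct trace receives 1.25 instead of 1.0; an episode that lowers the estimated success rate would reduce the reward instead.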

v0.1 reproduction. This checkpoint was produced by the open-source v0.1 training code (built on miles, since the original Open-R1 setup is no longer actively maintained), not the exact run from the paper. It reproduces the paper's relative claim, that MRT's gain over the base is ~2–3× the gain from outcome-reward GRPO, at a smaller absolute magnitude (+1.20 average gain here vs. +2.2 in the paper; see Evaluation below).

Evaluation

pass@1 (mean of 64 samples per problem) at a 16K token budget on AIME 2024, AIME 2025, AMC 2023, MinervaMATH, and MATH500. Avg is the mean over the five benchmarks. A single grader was used for both training and evaluation:

| Model | AIME24 | AIME25 | AMC23 | Minerva | MATH500 | Avg | Gain over base |
|---|---|---|---|---|---|---|---|
| base (R1-Distill-Qwen-1.5B) | 27.34 | 22.86 | 67.89 | 24.94 | 81.71 | 44.95 | |
| GRPO (outcome-reward) | 28.12 | 22.97 | 67.77 | 26.45 | 81.85 | 45.43 | +0.48 |
| MRT-offline (this model) | 28.75 | 23.59 | 70.86 | 24.96 | 82.61 | 46.16 | +1.20 |

MRT's gain over base (+1.20) is 2.5× the GRPO gain (+0.48) — within the paper's reported 2–3× range. (Paper Table 1 reports +1.1 / +2.2 for GRPO / MRT; our reproduction reaches a smaller absolute magnitude — see REPRODUCTION.md for the training-length and grader notes.)
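The arithmetic behind the table can be checked with a short script. This is only a sketch: the per-benchmark scores are copied from the table, `pass_at_1` is a hypothetical helper showing how a mean-of-samples pass@1 would be computed for one problem, and recomputing Avg from the rounded entries can differ from the reported value in the last digit.

```python
from statistics import mean

# Per-benchmark pass@1 scores copied from the table
# (AIME24, AIME25, AMC23, Minerva, MATH500).
SCORES = {
    "base": [27.34, 22.86, 67.89, 24.94, 81.71],
    "grpo": [28.12, 22.97, 67.77, 26.45, 81.85],
    "mrt_offline": [28.75, 23.59, 70.86, 24.96, 82.61],
}


def pass_at_1(sample_correct):
    """pass@1 for one problem: mean correctness over its samples, in percent."""
    return 100.0 * sum(sample_correct) / len(sample_correct)


# Unweighted mean over the five benchmarks for each model.
averages = {name: mean(vals) for name, vals in SCORES.items()}

# Gains over the base model and their ratio.
grpo_gain = averages["grpo"] - averages["base"]
mrt_gain = averages["mrt_offline"] - averages["base"]
gain_ratio = mrt_gain / grpo_gain

print({name: round(avg, 2) for name, avg in averages.items()})
print(round(grpo_gain, 2), round(mrt_gain, 2), round(gain_ratio, 2))
```

The recomputed ratio comes out at about 2.5, in line with the figure quoted above.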

Training

  • Base: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B; data: 4,000 NuminaMath problems.
  • GRPO + α-weighted progress bonus (α=1.0), 248 optimizer steps, 16K budget, temp 0.9.
  • Framework: miles (Megatron-LM + SGLang). Full recipe, hyperparameters, and assumptions: train/rl/REPRODUCTION.md in the CMU-AIRe/MRT repository.
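For quick reference, the settings listed above can be collected into a small config object. This is a sketch only: the class and field names are hypothetical and do not correspond to the miles configuration schema, and reading "16K" as 16,384 tokens is an assumption. REPRODUCTION.md remains the authoritative source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRTOfflineConfig:
    # Values are copied from the Training list; field names are hypothetical.
    base_model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    num_train_problems: int = 4_000  # NuminaMath problems
    progress_alpha: float = 1.0      # weight on the progress bonus
    optimizer_steps: int = 248
    token_budget: int = 16 * 1024    # assumption: "16K" read as 16,384 tokens
    temperature: float = 0.9         # sampling temperature


config = MRTOfflineConfig()
print(config)
```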

Citation

@misc{qu2025optimizingtesttimecomputemeta,
      title={Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning},
      author={Yuxiao Qu and Matthew Y. R. Yang and Amrith Setlur and Lewis Tunstall and Edward Emanuel Beeching and Ruslan Salakhutdinov and Aviral Kumar},
      year={2025}, eprint={2503.07572}, archivePrefix={arXiv}, primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.07572},
}