Qwen2.5-3B Countdown GRPO (iteration 300)

Qwen2.5-3B-Base trained with GRPO and a verifiable reward on the Countdown task for 300 iterations, directly from the base model with no SFT.

The headline: GRPO sharpens what the base model could already do (add/sub coverage rises from 54% to 94% pass@10) but drives multiplication, a skill the base almost never produced, down to 0%. The base itself barely cleared the floor on these problems, with about 7 correct samples in 1,500. RL reinforces existing behavior; it does not install a skill the model rarely generates on its own.

Full writeup: https://leon2k2k2k.github.io/blog/2026/grpo-sft-teaching-reasoning-through-arithmetic/
Companion: SFT-then-GRPO model | SFT dataset

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-grpo")
tok = AutoTokenizer.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-grpo")

The model expects the Countdown prompt format: reason inside <think> </think>, give the final equation inside <answer> </answer>.
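The exact prompt wording is not reproduced in this card, so the following is only a minimal sketch of how a prompt might be built and how the final equation could be pulled out of a completion. The template text and the names `PROMPT_TEMPLATE`, `build_prompt`, and `extract_answer` are assumptions for illustration; check the full writeup for the prompt actually used in training.

```python
import re

# Hypothetical template: the card only states that the model reasons inside
# <think> </think> and answers inside <answer> </answer>. The wording below
# is an assumption, not the prompt used in training.
PROMPT_TEMPLATE = (
    "Using the numbers {numbers}, create an equation that equals {target}. "
    "You can use +, -, *, / and each number only once. "
    "Show your reasoning in <think> </think> tags and give the final "
    "equation in <answer> </answer> tags."
)

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def build_prompt(numbers, target):
    """Fill the hypothetical template for one Countdown problem."""
    return PROMPT_TEMPLATE.format(numbers=list(numbers), target=target)


def extract_answer(completion):
    """Return the text of the last <answer> block, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None
```

The string returned by `build_prompt` would be tokenized with `tok` and passed to `model.generate`; `extract_answer` is then applied to the decoded completion.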

Results

Evaluated on 300 held-out problems (150 add/sub, 150 needs-mult), with 10 samples per problem at temperature 0.7.

| cell | pass@1 | pass@10 |
|---|---|---|
| add/sub, 3 numbers | 80.2% | 98.9% |
| add/sub, 4 numbers | 44.1% | 85.7% |
| needs-mult, 3 numbers | 0.0% | 0.0% |
| needs-mult, 4 numbers | 0.0% | 0.0% |
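The card does not say how pass@k was computed. A common choice, assumed here, is the standard unbiased estimator: for a problem with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names below are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's correct count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With n = 10 samples per problem, pass@1 reduces to the fraction of correct samples and pass@10 reduces to whether at least one of the ten samples is correct.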

Training

Trained on one H100 with GRPO via nano-aha-moment, following the DeepSeek-R1 recipe. Group size G = 4, learning rate 1e-6, KL coefficient 0.001, sampling temperature 1.0, 1024-token budget, 300 iterations. Reward = 1.0 for a well-formed <think>/<answer> response, plus 1.0 if the equation uses each number once and equals the target.
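The reward described above can be sketched as follows. This is an illustrative reimplementation, not the training code: the format regex, the handling of an "= target" suffix, and the exact-arithmetic evaluation are all assumptions, and `reward` is a hypothetical name.

```python
import ast
import operator
import re
from collections import Counter
from fractions import Fraction

# Assumed format check: a <think> block, then an <answer> block, nothing after.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording each integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def reward(completion, numbers, target):
    """Return 1.0 for a well-formed response, plus 1.0 for a correct equation."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    score = 1.0
    # Assumption: accept "a + b = target" by keeping only the left-hand side.
    expression = match.group(1).split("=")[0].strip()
    used = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return score
    # Each provided number must appear exactly once, and the value must match.
    if Counter(used) == Counter(numbers) and value == target:
        score += 1.0
    return score
```

Parsing with `ast` rather than calling `eval` keeps the check limited to the four arithmetic operators, and `Fraction` avoids floating-point error when an equation involves division.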

License and attribution

This is a fine-tune of Qwen2.5-3B, which was developed by the Qwen team, and it is released under the same Qwen Research License. The base model and its weights are their work; this repo only adds GRPO fine-tuning on Countdown.

