Qwen2.5-3B Countdown SFT-then-GRPO (iteration 300)

Qwen2.5-3B, first supervised fine-tuned on correct multiplication solutions (countdown-mult-sft), then trained for 300 iterations with the same GRPO recipe as the companion GRPO-alone model.

The point of this run was to test whether seeding GRPO with SFT (to install multiplication first) beats GRPO alone. It does not. GRPO restores add/sub that SFT had forgotten (19% back to 75% pass@10), but the multiplication SFT installed is pruned back to 0%, and the rigid SFT template survives, collapsing output diversity to about two distinct answers per ten tries. Stacking them keeps neither half-model's strength.

Full writeup: https://leon2k2k2k.github.io/blog/2026/grpo-sft-teaching-reasoning-through-arithmetic/

Companion: GRPO-alone model | SFT dataset

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
tok = AutoTokenizer.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")

The model expects the Countdown prompt format: reason inside <think> </think>, give the final equation inside <answer> </answer>.
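To make the output format concrete, here is a minimal sketch of pulling the equation out of a completion and checking it. The helper names (`extract_answer`, `is_correct`) are hypothetical and not part of this repo, and the checking rules (each given number used exactly once, only `+ - * /` and parentheses, no `=` inside the answer) are assumptions about the Countdown task rather than the exact grader used here.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str):
    """Return the text of the last <answer>...</answer> block, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def is_correct(equation: str, numbers: list[int], target: int) -> bool:
    """Check an equation string against the given numbers and target.

    Assumed rules (not confirmed by the model card): only digits, + - * /,
    parentheses and spaces are allowed, and each number is used exactly once.
    """
    if not re.fullmatch(r"[\d+\-*/() ]+", equation):
        return False
    # Reject Python-only operators that the character whitelist lets through.
    if "**" in equation or "//" in equation:
        return False
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if Counter(used) != Counter(numbers):
        return False
    try:
        # Safe enough here: the string contains arithmetic characters only.
        value = eval(equation, {"__builtins__": {}}, {})
        return abs(value - target) < 1e-6
    except Exception:
        return False


completion = "<think>3 + 4 = 7, 7 - 2 = 5</think>\n<answer>(3 + 4) - 2</answer>"
equation = extract_answer(completion)
print(equation, is_correct(equation, [2, 3, 4], 5))
```

Taking the last `<answer>` block is a design choice for the sketch: it tolerates a model that mentions the tag while reasoning before committing to a final equation.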

Results

300 held-out problems (150 add/sub, 150 needs-mult), 10 samples per problem at temperature 0.7.

| cell | pass@1 | pass@10 |
|---|---|---|
| add/sub, 3 numbers | 87.0% | 89.4% |
| add/sub, 4 numbers | 43.8% | 51.8% |
| needs-mult, 3 numbers | 0.0% | 0.0% |
| needs-mult, 4 numbers | 0.0% | 0.0% |

Compared with GRPO-alone, this model is a touch ahead at a single sample (71% vs 67% add/sub pass@1) but stalls with more tries (75% vs 94% add/sub pass@10): it is committed rather than exploratory.
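For reference, pass@k can be computed from per-problem correct counts as sketched below. The card does not say which estimator was used, so this is an assumption: it is the standard unbiased pass@k estimator, which with 10 samples per problem reduces to the mean fraction correct for pass@1 and to "at least one sample correct" for pass@10. The function names and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cell_scores(correct_counts: list[int], n: int = 10) -> tuple[float, float]:
    """Average pass@1 and pass@n over the problems in one cell."""
    p1 = sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)
    pn = sum(pass_at_k(n, c, n) for c in correct_counts) / len(correct_counts)
    return p1, pn


# Made-up counts for four problems: number of correct samples out of 10 each.
p1, p10 = cell_scores([10, 0, 5, 1])
print(p1, p10)  # about 0.4 and 0.75
```

A small gap between pass@1 and pass@10, as in the table above, means extra samples rarely turn a failed problem into a solved one, which is what low output diversity looks like in these metrics.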

Training

Two stages, both on one H100:

(1) SFT on ~5,000 worked multiplication solutions.
(2) GRPO via nano-aha-moment, starting from the SFT checkpoint: group size G = 4, learning rate 1e-6, KL coefficient 0.001, sampling temperature 1.0, 1024-token budget, 300 iterations.

Reward: 1.0 for a well-formed response plus 1.0 for a correct answer.
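The reward and the group size interact in a way that is easier to see in code. The sketch below is not the nano-aha-moment implementation: `reward` and `group_advantages` are hypothetical helpers, the two reward terms are assumed to be simply additive, and the advantage is the group-normalized form commonly used with GRPO (reward minus group mean, divided by group standard deviation), with `eps` as an assumed stabilizer.

```python
from statistics import mean, pstdev


def reward(well_formed: bool, correct: bool) -> float:
    """1.0 for a well-formed response plus 1.0 for a correct answer.

    Assumption: the two terms are independent and additive.
    """
    return (1.0 if well_formed else 0.0) + (1.0 if correct else 0.0)


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of G samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 samples: one correct, two well-formed but wrong, one malformed.
rewards = [reward(True, True), reward(True, False), reward(True, False), reward(False, False)]
print(group_advantages(rewards))

# If all four samples earn the same reward, every advantage is zero.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

Under this formulation, a prompt whose four samples all receive the same reward contributes no policy-gradient signal, so only groups with mixed outcomes move the policy.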

License and attribution

This is a fine-tune of Qwen2.5-3B by the Qwen team, and is released under the same Qwen Research License. The base model and its weights are their work; this repo only adds SFT then GRPO fine-tuning on Countdown.
