Qwen3-1.7B-OPD-Math

Qwen3-1.7B (non-thinking mode) trained with on-policy distillation from a Qwen3-4B-Instruct-2507 teacher on math reasoning, using SkyRL.

Method

On-policy distillation (Agarwal et al.; Thinking Machines writeup): the student generates rollouts on math prompts, the teacher scores every student token, and the per-token reward is the negative gap between the student and teacher log-probabilities of the sampled token, i.e. log p_teacher - log p_student. Averaged over student samples, this reward is the negative reverse KL, KL(student || teacher). Implemented with SkyRL's examples/train/on_policy_distillation recipe (a GRPO trainer with a pass-through advantage and the teacher in the ref-model slot).
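The reward described above can be sketched in a few lines of plain Python. This is only an illustration, not the SkyRL implementation: the function name `per_token_rewards` and the log-probability values are hypothetical, and a real trainer would work on batched tensors.

```python
import math

def per_token_rewards(student_logprobs, teacher_logprobs):
    """Reward for each sampled token: log p_teacher - log p_student.

    Both lists hold the log-probability that each model assigns to the
    token the student actually sampled at that position.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both models must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Hypothetical log-probabilities for three tokens sampled from the student.
student_lp = [math.log(0.5), math.log(0.9), math.log(0.2)]
teacher_lp = [math.log(0.25), math.log(0.9), math.log(0.3)]

rewards = per_token_rewards(student_lp, teacher_lp)

# The negated mean reward is a single-sample estimate of the per-token
# reverse KL, KL(student || teacher), along this rollout.
reverse_kl_estimate = -sum(rewards) / len(rewards)
```

The reward is negative where the student is more confident than the teacher, zero where they agree, and positive where the teacher assigns the sampled token more probability.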

Student: Qwen/Qwen3-1.7B with enable_thinking=false
Teacher: Qwen/Qwen3-4B-Instruct-2507
Prompts: DAPO-Math-17k (no ground-truth rewards used — the teacher signal is the only supervision)
Training: 10 steps × (128 prompts × 8 samples), max 8192 generated tokens, lr 1e-5, importance-sampling policy loss
Hardware: 4× NVIDIA L40 (48GB), FSDP + vLLM via SkyRL
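For a sense of scale, the rollout budget implied by the training line can be worked out directly. The sketch below uses a hypothetical `RunConfig` dataclass; the field names are illustrative and are not SkyRL configuration keys.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    # Values taken from the training summary above; names are hypothetical.
    steps: int = 10
    prompts_per_step: int = 128
    samples_per_prompt: int = 8
    max_new_tokens: int = 8192

    @property
    def rollouts_per_step(self) -> int:
        return self.prompts_per_step * self.samples_per_prompt

    @property
    def total_rollouts(self) -> int:
        return self.steps * self.rollouts_per_step

    @property
    def max_generated_tokens_per_step(self) -> int:
        # Upper bound: every rollout runs to the generation limit.
        return self.rollouts_per_step * self.max_new_tokens

cfg = RunConfig()
print(cfg.rollouts_per_step)              # 1024
print(cfg.total_rollouts)                 # 10240
print(cfg.max_generated_tokens_per_step)  # 8388608
```

The per-step token count is an upper bound only; the actual figure depends on how long the student's completions are.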

Results (AIME 2024, pass@8, temperature 1.0 / top-p 0.7)

step	0 (base model)	5	10 (this model)	15
pass@8	6.7%	43.3%	50.0%	56.7%
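For reference, pass@k can be computed with the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c the number that are correct. The sketch below assumes n = 8 samples per problem (the card does not state n), in which case pass@8 reduces to the fraction of problems with at least one correct sample. The per-problem counts are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over all problems in a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts for the 30 AIME 2024 problems: 15 solved at least
# once in 8 samples, 15 never solved.
counts = [1] * 15 + [0] * 15
score = benchmark_pass_at_k(counts, n=8, k=8)
print(f"{score:.1%}")  # 50.0%
```

With 30 problems, each additional problem solved moves the score by about 3.3 percentage points, so small differences between checkpoints correspond to one or two problems.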

A note for practitioners: with reverse-KL distillation, the teacher and student must have matching termination behavior. A thinking-mode teacher or a base-model student each causes length collapse (the student learns to never emit EOS). Non-thinking instruct models on both sides trained stably.
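One way to watch for this failure during training is to track how many rollouts hit the generation limit without emitting EOS. The helper below is a hypothetical sketch, not part of SkyRL; it assumes each rollout is a list of generated token ids and that the EOS id is known.

```python
def truncation_stats(rollouts, eos_token_id, max_new_tokens):
    """Summarize termination behavior for a batch of generated rollouts.

    rollouts: list of lists of generated token ids (prompt excluded).
    A rollout counts as truncated if it reached max_new_tokens and
    contains no EOS token.
    """
    if not rollouts:
        raise ValueError("rollouts must be non-empty")
    truncated = sum(
        1
        for ids in rollouts
        if len(ids) >= max_new_tokens and eos_token_id not in ids
    )
    mean_length = sum(len(ids) for ids in rollouts) / len(rollouts)
    return {
        "truncated_fraction": truncated / len(rollouts),
        "mean_length": mean_length,
    }

# Toy batch with a generation limit of 4 tokens and EOS id 0.
batch = [[5, 6, 0], [5, 6, 7, 8], [1, 0], [2, 3, 4, 5]]
stats = truncation_stats(batch, eos_token_id=0, max_new_tokens=4)
print(stats)  # {'truncated_fraction': 0.5, 'mean_length': 3.25}
```

A truncated fraction that climbs toward 1.0 across steps, together with mean length approaching the limit, would be consistent with the collapse described above.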

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("lecunyin/Qwen3-1.7B-OPD-Math", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("lecunyin/Qwen3-1.7B-OPD-Math")

messages = [{"role": "user", "content": "Find the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square. Put your final answer in \\boxed{}."}]
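# The model was trained in non-thinking mode, so disable thinking in the chat template.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, enable_thinking=False, return_dict=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0]))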
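Because the prompt asks for the final answer in \boxed{}, a small helper can pull it out of the decoded text. This is a minimal sketch with a hypothetical function name, `extract_boxed`; it returns the contents of the last \boxed{...} and handles nested braces, but it does not normalize or grade the answer. The completion string in the example is made up for illustration and is not actual model output.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of \boxed{
    contents = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(contents)
        contents.append(ch)
    return None        # braces never closed, e.g. a truncated generation

# Hypothetical completion text.
completion = r"Adding the three values gives \boxed{1464}."
print(extract_boxed(completion))  # 1464
```

Returning None for an unclosed brace makes truncated generations easy to count separately from wrong answers.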