Qwen3-1.7B-OPD-Math

Qwen3-1.7B (non-thinking mode) trained with on-policy distillation from a Qwen3-4B-Instruct-2507 teacher on math reasoning, using SkyRL.

Method

On-policy distillation (Agarwal et al.; Thinking Machines writeup): the student generates rollouts on math prompts, the teacher scores every student token, and the per-token reward is the teacher log-probability minus the student log-probability of the sampled token, i.e. the negative of the per-token reverse-KL estimate. Implemented with SkyRL's examples/train/on_policy_distillation recipe, which uses the GRPO trainer with a pass-through advantage and places the teacher in the ref-model slot.
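The per-token reward can be illustrated with a minimal sketch. This is not SkyRL's implementation; the function name `per_token_rewards` and the toy probabilities are assumptions made for illustration only.

```python
import math

def per_token_rewards(student_logprobs, teacher_logprobs):
    """Reward for each sampled token: log p_teacher - log p_student.

    Both lists hold the log-probability that each model assigns to the
    token the student actually sampled (hypothetical helper, not SkyRL API).
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Toy rollout of three tokens (made-up probabilities).
student = [math.log(0.5), math.log(0.9), math.log(0.2)]
teacher = [math.log(0.5), math.log(0.3), math.log(0.8)]
rewards = per_token_rewards(student, teacher)

# Negating the mean reward gives a single-sample estimate of the
# reverse KL between student and teacher along this rollout.
reverse_kl_estimate = -sum(rewards) / len(rewards)
```

A token the teacher finds less likely than the student does gets a negative reward, and a token the teacher prefers gets a positive one, so maximizing the reward pushes the student toward the teacher on the student's own samples.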

  • Student: Qwen/Qwen3-1.7B with enable_thinking=false
  • Teacher: Qwen/Qwen3-4B-Instruct-2507
  • Prompts: DAPO-Math-17k (no ground-truth rewards are used; the teacher signal is the only supervision)
  • Training: 10 steps × (128 prompts × 8 samples), max 8192 generated tokens, lr 1e-5, importance-sampling policy loss
  • Hardware: 4× NVIDIA L40 (48GB), FSDP + vLLM via SkyRL
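
As a quick check on the training scale listed above, the rollout budget follows directly from the stated settings. The variable names are illustrative, and the token count is only an upper bound, since most rollouts stop before the limit.

```python
# Settings taken from the list above.
steps = 10
prompts_per_step = 128
samples_per_prompt = 8
max_new_tokens = 8192

rollouts_per_step = prompts_per_step * samples_per_prompt  # 1,024
total_rollouts = steps * rollouts_per_step                 # 10,240
max_prompts_seen = steps * prompts_per_step                # 1,280 at most
max_generated_tokens = total_rollouts * max_new_tokens     # 83,886,080 upper bound

print(rollouts_per_step, total_rollouts, max_prompts_seen, max_generated_tokens)
```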

Results (AIME 2024, pass@8, temperature 1.0 / top-p 0.7)

Step     0 (base)   5       10 (this model)   15
pass@8   6.7%       43.3%   50.0%             56.7%
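
For reference, the sketch below shows one way pass@8 could be computed. It assumes exactly 8 samples are drawn per problem and that the benchmark is the 30-problem AIME 2024 set (AIME I and II); under that assumption the reported percentages correspond to 2, 13, 15, and 17 solved problems. The function name and the toy data are hypothetical.

```python
def pass_at_k(results_per_problem):
    """Fraction of problems with at least one correct sample.

    results_per_problem: one list of booleans per problem, one boolean
    per sample (here assumed to be k = 8 samples per problem).
    """
    solved = sum(1 for samples in results_per_problem if any(samples))
    return solved / len(results_per_problem)

# Toy data: 3 problems, 8 samples each (made up for illustration).
toy = [
    [False] * 8,            # never solved
    [False] * 7 + [True],   # solved once, still counts
    [True] * 8,             # always solved
]
toy_score = pass_at_k(toy)

# Solved-problem counts implied by the table, assuming 30 problems.
implied = [round(100 * n / 30, 1) for n in (2, 13, 15, 17)]
```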

A note for practitioners: with reverse-KL distillation, the teacher and student must have matched termination behavior. Either a thinking-mode teacher or a base-model student causes length collapse (the student learns to never emit EOS). Non-thinking instruct models on both sides trained stably.
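
One way to watch for this failure during training is to track how many rollouts reach the token limit without ever emitting EOS. The helper below is a hypothetical monitoring sketch, not part of SkyRL; the EOS id and the tiny token limit are placeholder values.

```python
def truncation_rate(rollouts, eos_token_id, max_new_tokens):
    """Fraction of rollouts that hit the token limit without emitting EOS."""
    if not rollouts:
        return 0.0
    truncated = sum(
        1
        for tokens in rollouts
        if len(tokens) >= max_new_tokens and eos_token_id not in tokens
    )
    return truncated / len(rollouts)

# Toy example with a limit of 4 tokens and EOS id 0 (placeholder values).
EOS = 0
rollouts = [
    [5, 7, 0],      # stopped early with EOS
    [5, 7, 9, 3],   # hit the limit, no EOS
    [5, 7, 9, 0],   # hit the limit but ended with EOS
    [8, 8, 8, 8],   # hit the limit, no EOS
]
rate = truncation_rate(rollouts, EOS, max_new_tokens=4)
```

A rate that climbs toward 1.0 over training steps would be consistent with the student learning to never terminate.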

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lecunyin/Qwen3-1.7B-OPD-Math"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Find the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square. Put your final answer in \\boxed{}."}]
# enable_thinking=False matches the non-thinking mode the student was trained in.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
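
Because the prompt asks for the final answer in \boxed{}, a small parser can pull it out of the generated text. The helper below is an illustrative sketch with a hypothetical name, not part of this model or of transformers; it matches braces so that nested LaTeX such as fractions survives.

```python
def extract_boxed(text):
    """Return the contents of the last boxed expression in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of the marker
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None        # unbalanced braces, e.g. a truncated generation

# Made-up completions for illustration.
simple = extract_boxed("Therefore the result is \\boxed{42}.")
nested = extract_boxed("So the value is \\boxed{\\frac{1}{2}}")
missing = extract_boxed("No final answer was given.")
```

Returning None for unbalanced braces makes truncated generations easy to tell apart from completed ones.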