Qwen3-1.7B-OPD-Math
Qwen3-1.7B (non-thinking mode) trained with on-policy distillation from a Qwen3-4B-Instruct-2507 teacher on math reasoning, using SkyRL.
Method
On-policy distillation (Agarwal et al.; Thinking Machines writeup): the student generates rollouts on math prompts, the teacher scores every student token, and the per-token reward is the teacher's log-probability of that token minus the student's, i.e. the negative of a sampled per-token reverse-KL estimate. Implemented with SkyRL's examples/train/on_policy_distillation recipe (GRPO trainer with a pass-through advantage and the teacher in the ref-model slot).
- Student: Qwen/Qwen3-1.7B with enable_thinking=false
- Teacher: Qwen/Qwen3-4B-Instruct-2507
- Prompts: DAPO-Math-17k (no ground-truth rewards used; the teacher signal is the only supervision)
- Training: 10 steps × (128 prompts × 8 samples), max 8192 generated tokens, lr 1e-5, importance-sampling policy loss
- Hardware: 4× NVIDIA L40 (48GB), FSDP + vLLM via SkyRL
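The per-token reward described above can be sketched in plain Python. This is only an illustration of the arithmetic, not SkyRL's implementation: the function name `per_token_rewards` and the log-probabilities are hypothetical, and a real run would take both sets of log-probs from the student and teacher models on the same student-sampled tokens.

```python
import math

def per_token_rewards(student_logprobs, teacher_logprobs):
    """Reward for each sampled token: teacher log-prob minus student log-prob."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Hypothetical log-probs of four student-sampled tokens under each model.
student_lp = [math.log(0.5), math.log(0.9), math.log(0.2), math.log(0.6)]
teacher_lp = [math.log(0.5), math.log(0.3), math.log(0.4), math.log(0.6)]

rewards = per_token_rewards(student_lp, teacher_lp)

# Averaging the negated rewards gives a sampled estimate of the per-token
# reverse KL, KL(student || teacher), on this rollout.
reverse_kl_estimate = -sum(rewards) / len(rewards)
print(rewards, reverse_kl_estimate)
```

Tokens the teacher finds less likely than the student get a negative reward, tokens the teacher prefers get a positive one, and tokens where both models agree contribute zero.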
Results (AIME 2024, pass@8, temperature 1.0 / top-p 0.7)
| step | 0 (base model) | 5 | 10 (this model) | 15 |
|---|---|---|---|---|
| pass@8 | 6.7% | 43.3% | 50.0% | 56.7% |
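For readers reproducing the table, here is a minimal sketch of how a pass@8 score could be computed. It assumes the plain empirical definition (a problem counts as solved if at least one of its 8 samples is correct) and that the benchmark is the 30 problems of AIME 2024 I and II; the card does not state the exact evaluation script, and `pass_at_k` and the toy data are hypothetical.

```python
def pass_at_k(results):
    """results: one list of booleans per problem, one entry per sample."""
    solved = sum(1 for samples in results if any(samples))
    return solved / len(results)

# Hypothetical grading results for 3 problems with 8 samples each.
toy = [
    [False] * 8,                  # never solved
    [False, True] + [False] * 6,  # solved once
    [True] * 8,                   # always solved
]
score = pass_at_k(toy)  # 2 of the 3 toy problems are solved

# With 30 problems, each reported percentage maps to a whole problem count.
solved_counts = {pct: round(pct / 100 * 30) for pct in (6.7, 43.3, 50.0, 56.7)}
print(score, solved_counts)
```

Under these assumptions, each solved problem moves the score by about 3.3 percentage points, which is worth keeping in mind when comparing adjacent checkpoints.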
A note for practitioners: with reverse-KL distillation, teacher and student must have matched termination behavior. Either a thinking-mode teacher or a base-model student causes length collapse (the student learns to never emit EOS). Non-thinking instruct models on both sides trained stably.
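One way to watch for this failure mode is to track the fraction of rollouts that reach the generation cap without ever emitting EOS. The sketch below is a hypothetical diagnostic, not part of SkyRL: `truncation_rate`, the token ids, and the tiny length cap are made up for illustration.

```python
def truncation_rate(rollouts, eos_token_id, max_new_tokens):
    """Fraction of rollouts that hit the length cap without emitting EOS."""
    truncated = sum(
        1 for ids in rollouts
        if len(ids) >= max_new_tokens and eos_token_id not in ids
    )
    return truncated / len(rollouts)

# Toy generated-token ids; 0 stands in for the EOS id and 4 for the length cap.
EOS = 0
rollouts = [[5, 7, 0], [5, 7, 9, 4], [3, 0], [8, 8, 8, 8]]
rate = truncation_rate(rollouts, eos_token_id=EOS, max_new_tokens=4)
print(rate)
```

A rate that climbs toward 1.0 over training steps would be consistent with the length collapse described above.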
Usage
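from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("lecunyin/Qwen3-1.7B-OPD-Math", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("lecunyin/Qwen3-1.7B-OPD-Math")
messages = [{"role": "user", "content": "Find the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square. Put your final answer in \\boxed{}."}]
# The model was trained in non-thinking mode, so thinking is disabled here
# (assumes the repo ships the Qwen3 chat template, which accepts enable_thinking).
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, enable_thinking=False, return_dict=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))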