Qwen3-1.7B-Base-OPD


Qwen3-1.7B-Base-OPD is an on-policy distillation (OPD) checkpoint initialized from Qwen3-1.7B-Base. It is distilled from the teacher model Qwen3-4B-Base-GRPO using the DAPO-Math-17k dataset, and is intended for mathematical reasoning and problem-solving.

This model is associated with the paper:
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Paper link: https://arxiv.org/abs/2604.13016

Model Description

This model is obtained by applying on-policy distillation (OPD) to Qwen3-1.7B-Base, with Qwen3-4B-Base-GRPO serving as the teacher model. The OPD training uses DAPO math prompts/data and is designed to transfer the teacher's math-focused reasoning behavior into a smaller 1.7B-parameter student model.

Key characteristics

  • Student/base model: Qwen3-1.7B-Base
  • Teacher model: lllyx/Qwen3-4B-Base-GRPO
  • Training data: DAPO-Math-17k
  • Training stage: On-Policy Distillation (OPD)
  • Training framework: verl
  • Rollout engine: vLLM
  • Primary domain: Mathematical reasoning
  • Model architecture: Qwen3ForCausalLM
  • Precision: bfloat16
  • Context length: 32768 tokens

Training Details

Training configuration

  • Base checkpoint: Qwen/Qwen3-1.7B-Base
  • Teacher checkpoint: lllyx/Qwen3-4B-Base-GRPO
  • Training framework: verl
  • Training method: on-policy distillation with GRPO-style rollouts
  • Distillation loss mode: k1
  • Policy-gradient term: enabled
  • Training dataset: DAPO-Math-17k/DAPO-Math.parquet
  • Primary task domain: math reasoning
  • Chat template thinking mode: disabled (enable_thinking=False)
  • Model type: qwen3

Rollout and optimization

  • Rollout engine: vLLM
  • Responses per prompt: 4
  • Prompt length: 1024
  • Response length: 7168
  • Max rollout model length: 8193
  • Train batch size: 64
  • PPO mini-batch size: 16
  • PPO micro-batch size per GPU: 1
  • Max PPO token length per GPU: 8192
  • Actor learning rate: 1e-6
  • Total epochs: 1
  • Save frequency: every 20 steps

Runtime setup

  • Distributed backend: Ray
  • Number of nodes: 1
  • GPUs per node: 4
  • Teacher world size: 4
  • Rollout tensor parallel size: 1
  • Teacher tensor parallel size: 1
  • Actor training: FSDP with parameter and optimizer offload
  • Gradient checkpointing: enabled
  • Padding removal: enabled
  • Torch compile for actor: enabled
  • Reward function: rule-based math reward from verl/recipe/r1_ascend/deepscaler.py::compute_score

Dataset

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lllyx/Qwen3-1.7B-Base-OPD"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

Citation

If you use this model, please consider citing the related paper:

@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
Downloads last month
51
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for lllyx/Qwen3-1.7B-Base-OPD

Finetuned
(370)
this model

Dataset used to train lllyx/Qwen3-1.7B-Base-OPD

Collection including lllyx/Qwen3-1.7B-Base-OPD

Paper for lllyx/Qwen3-1.7B-Base-OPD