Qwen3.5-9B-ERPD-003

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3.5-9B. We adopt the MSE-based teacher loss and unlearned teacher setup described in the paper. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 3rd iterative round (ERPD-003).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

Qwen3.5-27B Qwen3.5-9B Qwen3.5-9B-ERPD-003
STEM & Reasoning
HMMT Feb 25 92.0 83.5 88.9
HMMT Nov 25 89.8 84.0 87.3
HMMT Feb 26 84.3 72.7 83.8

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results on HMMT:

{
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 0.0,   // 0.0: much faster but slightly worse quality; 1.5: reproduces reported results with longer inference length
  "repetition_penalty": 1.0,
  "max_tokens": 192000
}

Note on Evaluation. Many problems in these benchmarks involve answers that cannot be reliably verified by exact-match comparison. We therefore employ Seed-2.0-pro as an LLM-as-judge to assess correctness. The evaluation scripts will be shared on our GitHub.

Citation

If you find our work helpful, feel free to give us a cite.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}
Downloads last month
24
Safetensors
Model size
9B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including adalaw/Qwen3.5-9B-ERPD-003

Paper for adalaw/Qwen3.5-9B-ERPD-003