Qwen3.5-9B-ERPD-003

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3.5-9B. We adopt the MSE-based teacher loss and unlearned teacher setup described in the paper. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 3rd iterative round (ERPD-003).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

	Qwen3.5-27B	Qwen3.5-9B	Qwen3.5-9B-ERPD-003
STEM & Reasoning
HMMT Feb 25	92.0	83.5	88.9
HMMT Nov 25	89.8	84.0	87.3
HMMT Feb 26	84.3	72.7	83.8

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results on HMMT:

{
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "presence_penalty": 0.0,   // 0.0: much faster but slightly worse quality; 1.5: reproduces reported results with longer inference length
  "repetition_penalty": 1.0,
  "max_tokens": 192000
}

Note on Evaluation. Many problems in these benchmarks involve answers that cannot be reliably verified by exact-match comparison. We therefore employ Seed-2.0-pro as an LLM-as-judge to assess correctness. The evaluation scripts will be shared on our GitHub.

Citation

If you find our work helpful, feel free to give us a cite.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}

Downloads last month: 24

Safetensors

Model size

9B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including adalaw/Qwen3.5-9B-ERPD-003

Extreme Region Policy Distillation

Collection

4 items • Updated 3 days ago

Paper for adalaw/Qwen3.5-9B-ERPD-003

Extreme Region Policy Distillation

Paper • 2605.25582 • Published 4 days ago