Update README.md
README.md
CHANGED
|
@@ -1,32 +1,68 @@
|
|
| 1 |
-
---
|
| 2 |
-
tags:
|
| 3 |
-
- qwen2.5
|
| 4 |
-
-
|
| 5 |
-
-
|
| 6 |
-
|
| 7 |
-
-
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
```
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- qwen2.5
|
| 4 |
+
- RL
|
| 5 |
+
- reasoning
|
| 6 |
+
library_name: transformers
|
| 7 |
+
pipeline_tag: text-generation
|
| 8 |
+
license: apache-2.0
|
| 9 |
+
base_model: SII-Enigma/Qwen2.5-7B-Ins-SFT-32k
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# Introduction
|
| 13 |
+
|
| 14 |
+
**AMPO** (Adaptive Multi-Guidance Policy Optimization) is a framework that leverages guidance from multiple, diverse teacher models and intervenes only when the on-policy model fails. Its two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
|
| 15 |
+
|
| 16 |
+
[Paper](https://arxiv.org/abs/2510.02227) [Code](https://github.com/SII-Enigma/AMPO)
|
| 17 |
+
|
| 18 |
+
### Key Highlights:
|
| 19 |
+
- **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency.
|
| 20 |
+
- **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
|
| 21 |
+
- **Superior Performance**: Achieves better performance and efficiency than using RL or SFT alone.
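
The first two highlights can be illustrated with a minimal sketch. It is not the authors' implementation: the `Response` type, the `replace_on_failure` function, the binary reward, the `num_replace` parameter, and the `comprehension_score` callable are all hypothetical names and assumptions introduced here for illustration. The sketch only shows the control flow described above: teacher guidance enters the training group when every on-policy rollout fails, and the candidates that the policy finds most comprehensible are preferred.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    text: str
    reward: float               # assumed binary: 1.0 if verified correct, else 0.0
    from_teacher: bool = False  # marks off-policy guidance in the returned group


def replace_on_failure(
    rollouts: List[Response],
    guidance_pool: List[Response],
    comprehension_score: Callable[[Response], float],  # hypothetical scoring function
    num_replace: int = 1,                              # hypothetical hyperparameter
) -> List[Response]:
    """Inject teacher guidance only when every on-policy rollout fails."""
    # Any on-policy success: keep the group untouched to preserve self-discovery.
    if any(r.reward > 0 for r in rollouts):
        return list(rollouts)

    # Complete failure: only correct teacher responses are candidates,
    # ranked from most to least comprehensible for the current policy.
    correct = [g for g in guidance_pool if g.reward > 0]
    ranked = sorted(correct, key=comprehension_score, reverse=True)
    chosen = ranked[: min(num_replace, len(rollouts))]

    # Replace the last len(chosen) rollouts with the selected guidance.
    kept = rollouts[: len(rollouts) - len(chosen)]
    return kept + [Response(g.text, g.reward, from_teacher=True) for g in chosen]


# Toy usage with made-up scores; a real score would come from the policy model.
scores = {"teacher A": 0.2, "teacher B": 0.9, "teacher C (wrong)": 0.99}
pool = [
    Response("teacher A", 1.0),
    Response("teacher B", 1.0),
    Response("teacher C (wrong)", 0.0),
]
failed = [Response("wrong 1", 0.0), Response("wrong 2", 0.0), Response("wrong 3", 0.0)]
group = replace_on_failure(failed, pool, lambda g: scores[g.text], num_replace=1)

mixed = [Response("right", 1.0), Response("wrong", 0.0)]
untouched = replace_on_failure(mixed, pool, lambda g: scores[g.text])
```

In the toy run, `group` keeps two of the failed rollouts and gains one teacher response, while `untouched` is returned without any guidance because one rollout already succeeded.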
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
### Multi-Guidance Pool
|
| 25 |
+
|
| 26 |
+
Teacher models: `Qwen3-8B_thinking`, `DeepSeek-R1-Distill-Qwen-7B`, `Qwen3-8B`, `Qwen2.5-Math-7B-Instruct`
|
| 27 |
+
|
| 28 |
+
## Inference Example
|
| 29 |
+
|
| 30 |
+
The following example runs inference with the AMPO model using vLLM:
|
| 31 |
+
|
| 32 |
+
```python
|
| 33 |
+
from transformers import AutoTokenizer
|
| 34 |
+
from vllm import LLM, SamplingParams
|
| 35 |
+
|
| 36 |
+
model_path = "SII-Enigma/Qwen2.5-7B-Ins-SFT-AMPO"
|
| 37 |
+
|
| 38 |
+
question = "which number is larger? 9.11 or 9.9?"
|
| 39 |
+
|
| 40 |
+
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
| 41 |
+
messages = [{"role": "user", "content": question}]
|
| 42 |
+
chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
| 43 |
+
|
| 44 |
+
llm = LLM(model=model_path)
|
| 45 |
+
params = SamplingParams(temperature=0.6, max_tokens=8192)
|
| 46 |
+
outputs = llm.generate([chat], params)
|
| 47 |
+
print(outputs[0].outputs[0].text)
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
# Acknowledgement
|
| 51 |
+
|
| 52 |
+
AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference. We use [Math-Verify](https://github.com/huggingface/Math-Verify) for math reasoning evaluation. We thank the open-source community for their code, datasets, and backbone models.
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
# Citation
|
| 57 |
+
If you find our model, data, or evaluation code useful, please cite our paper:
|
| 58 |
+
```bib
|
| 59 |
+
@misc{yuan2025teacheradaptivemultiguidancepolicy,
|
| 60 |
+
title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
|
| 61 |
+
author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
|
| 62 |
+
year={2025},
|
| 63 |
+
eprint={2510.02227},
|
| 64 |
+
archivePrefix={arXiv},
|
| 65 |
+
primaryClass={cs.CL},
|
| 66 |
+
url={https://arxiv.org/abs/2510.02227},
|
| 67 |
+
}
|
| 68 |
+
```
|