Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

๐Ÿ–ฅ๏ธCode | ๐Ÿค—Data | ๐Ÿ“„Paper

This repo contains the Qwen2-72B-Instruct-Step-DPO model. It is obtained by performing Step-DPO on Qwen2-72B-Instruct.

Step-DPO is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K without bells and wistles, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.

Contact

Please submit an issue here or send me an email here.

Downloads last month
12
Safetensors
Model size
72.7B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Collection including xinlai/Qwen2-72B-Instruct-Step-DPO