---
license: apache-2.0
---
# Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

🖥️[Code](https://github.com/dvlab-research/Step-DPO) \| 🤗[Data](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) \| 📄[Paper](https://arxiv.org/pdf/2406.18629)

This repo contains the Llama-3-70B-SFT-Step-DPO model. It is obtained by performing Step-DPO on [Llama-3-70B-SFT](https://huggingface.co/xinlai/Llama-3-70B-SFT).
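A minimal loading sketch with the Hugging Face `transformers` library may help here. It assumes the checkpoint lives at `xinlai/Llama-3-70B-SFT-Step-DPO`, that `accelerate` is installed so `device_map="auto"` can shard the 70B weights across the available GPUs, and that a plain question-plus-instruction prompt is acceptable. `build_prompt` is a hypothetical helper, not the template used in the paper; the evaluation scripts in the code repository are the place to check for the actual prompt format.

```python
MODEL_ID = "xinlai/Llama-3-70B-SFT-Step-DPO"  # assumed Hub repo id for this model


def build_prompt(question: str) -> str:
    """Hypothetical prompt helper; the official template may differ."""
    return f"{question.strip()}\nPlease reason step by step."


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model; needs transformers, torch and accelerate."""
    # Imported lazily so the helpers above stay usable without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # keep the dtype stored in the checkpoint
        device_map="auto",    # spread layers over the available devices
    )
    return tokenizer, model


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode an answer to a single question."""
    tokenizer, model = load_model()
    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is 12 * 7 + 5?")` downloads the full checkpoint on first use, so it is only practical on a machine with enough disk space and GPU memory for a 70B-parameter model.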

Step-DPO is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, when applied to Qwen2-72B-Instruct, Step-DPO achieves 70.8% and 94.0% on the MATH and GSM8K test sets, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
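To give a rough idea of what "step-wise" preference optimization means, the sketch below computes a DPO-style logistic loss for a single preferred/rejected pair of reasoning steps that share the same problem and preceding steps. It is an illustration under assumptions, not the training code from the repository: the function name, argument names, and the default `beta=0.1` are all hypothetical, and the inputs are taken to be summed token log-probabilities of each candidate step under the policy and under a frozen reference model.

```python
import math


def step_dpo_loss(policy_win_logp: float, policy_lose_logp: float,
                  ref_win_logp: float, ref_lose_logp: float,
                  beta: float = 0.1) -> float:
    """DPO-style loss for one (preferred step, rejected step) pair.

    Each argument is the log-probability of a candidate next step, conditioned
    on the problem and the shared preceding steps. beta is an assumed value.
    """
    # Scaled difference of log-ratios between the preferred and rejected step.
    margin = beta * ((policy_win_logp - ref_win_logp)
                     - (policy_lose_logp - ref_lose_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree, the margin is zero and the loss equals `log 2`; the loss drops below that value as the policy shifts probability toward the preferred step relative to the reference.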

## Contact

Please submit an issue [here](https://github.com/dvlab-research/Step-DPO) or send me an email [here](mailto:xinlai@cse.cuhk.edu.hk).