---
license: llama2
---
# ReFT: Reasoning with REinforced Fine-Tuning

Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under [Apache2.0 License](https://github.com/lqtrung1998/mwp_ReFT/blob/main/License.txt))

## Introduction

We introduce REinforced Fine-tuning (ReFT), a method that enhances the generalizability of learning LLMs for reasoning.

This repository contains:
- A Warmup Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/Codellama-7b-hf-SFT-warmup-GSM8k](https://huggingface.co/lqtrung1998/Codellama-7b-hf-SFT-warmup-GSM8k)
- A Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/Codellama-7b-hf-SFT-GSM8k](https://huggingface.co/lqtrung1998/Codellama-7b-hf-SFT-GSM8k)
- A Rerank model that can score the fine-tuned SFT model output: [lqtrung1998/Codellama-7b-hf-SFT-Rerank-GSM8k](https://huggingface.co/lqtrung1998/Codellama-7b-hf-SFT-Rerank-GSM8k)
- A REinforced Fine-tuned model on GSM8k benchmark: [lqtrung1998/Codellama-7b-hf-ReFT-GSM8k](https://huggingface.co/lqtrung1998/Codellama-7b-hf-ReFT-GSM8k)
- A Rerank model that can score the fine-tuned ReFT model output: [lqtrung1998/Codellama-7b-hf-ReFT-Rerank-GSM8k](https://huggingface.co/lqtrung1998/Codellama-7b-hf-ReFT-Rerank-GSM8k)

Note: Our models are tuned based on Codellama; thus, licenses applicable to Codellama, such as the [Llama license](https://ai.meta.com/resources/models-and-libraries/llama-downloads/), also apply to these models.

## Training Data
The model is trained on GSM8k data in the Python SDP CoT format, which can be found [here](https://github.com/lqtrung1998/mwp_ReFT).

## Training Procedure
Check out our paper and repo for complete details.
#### ReFT model
The ReFT model is warmed up via Supervised Fine-tuning on GSM8k Python SDP training data for 2 epochs, then REinforced Fine-tuned for 300 epochs using the questions in the GSM8k training set.
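
Because the training targets are Python programs, a sampled program can be checked by running it and comparing its return value with the reference answer. The sketch below illustrates only that check. `run_solution`, `is_correct`, and the tolerance are hypothetical names and choices made for illustration, not the authors' implementation; the actual reward used during REinforced Fine-tuning is defined in the paper and repo.

```python
import math


def run_solution(program: str):
    """Execute a generated program and return the value of solution(), or None on failure.

    Hypothetical helper for illustration. exec() on model output is unsafe;
    a real pipeline should run it in a sandbox with a timeout.
    """
    namespace = {}
    try:
        exec(program, namespace)
        return namespace["solution"]()
    except Exception:
        # Syntax errors, runtime errors, or a missing solution() all count as failure.
        return None


def is_correct(program: str, reference: float, tol: float = 1e-4) -> bool:
    """Return True if the program's numeric result matches the reference answer."""
    value = run_solution(program)
    # Reject non-numeric results (bool is excluded explicitly since it subclasses int).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isclose(value, reference, abs_tol=tol)


# A program in the same style as the expected response shown in this card.
example_program = '''
def solution():
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
'''

print(is_correct(example_program, 10.0))
```

A tolerance is used rather than exact equality because programs of this kind often produce floating-point results.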
#### Rerank model
The Rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after the 2-epoch warm-up.

## Evaluation Results
See the evaluation results of the models in Table 4 of the research paper.

Updated results:
| | Top-1 | Voting@100 | Rerank@100 |
|--------------------------------------------------------------------|:------:|:----------:|:----------:|
| Codellama-7b-hf-SFT-warmup-GSM8k | 63.00 | - | - |
| Codellama-7b-hf-SFT-GSM8k<br>(+Codellama-7b-hf-SFT-Rerank-GSM8k) | 63.68 | 68.0 | 77.0 |
| Codellama-7b-hf-ReFT-GSM8k<br>(+Codellama-7b-hf-ReFT-Rerank-GSM8k) | 75.28 | 78.0 | 81.2 |

## Usage
You can use the models through Huggingface's Transformers library or follow the scripts in our repo.

Prompt format:
```python
Question:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
```
Expected response:
```python
def solution():
    """Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?"""
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
```

## Citation
Please cite the paper if you use our data, model or code.
```
@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning},
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Intended Use
**Intended Use Cases** Code Llama and its variants are intended for commercial and research use in English and relevant programming languages. The base model Code Llama can be adapted for a variety of code synthesis and understanding tasks, Code Llama - Python is designed specifically to handle the Python programming language, and Code Llama - Instruct is intended to be safer to use for code assistant and generation applications.

**Out-of-Scope Uses** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Code Llama and its variants.

## Ethical Considerations and Limitations
Code Llama and its variants are a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios.
For these reasons, as with all LLMs, Code Llama’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Code Llama, developers should perform safety testing and tuning tailored to their specific applications of the model. Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide.
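
Returning to the Usage section: the following is a minimal sketch of querying the ReFT model through the Transformers library, using the prompt format given there. `build_prompt`, `generate_program`, the exact whitespace of the prompt, and the generation settings (greedy decoding, 256 new tokens) are assumptions made for illustration; the scripts in the repo remain the reference for how the models were actually evaluated.

```python
MODEL_NAME = "lqtrung1998/Codellama-7b-hf-ReFT-GSM8k"


def build_prompt(question: str) -> str:
    """Wrap a question in the card's prompt format.

    The newline placement is an assumption; check the repo scripts for the
    exact template used in training.
    """
    return f"Question:\n{question}\nAnswer reasoning:\n"


def generate_program(question: str, max_new_tokens: int = 256) -> str:
    """Generate a Python solution program for one question (greedy decoding)."""
    # Imported lazily so that build_prompt can be used without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_program("Weng earns $12 an hour for babysitting. ...")` downloads the 7B checkpoint on first use, so it needs sufficient disk space and memory. The returned text is expected to be a `solution()` function like the one shown in the Usage section.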