---
license: cc-by-nc-4.0
---

# ReFT: Reasoning with REinforced Fine-Tuning

Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under the [Apache 2.0 License](https://github.com/lqtrung1998/mwp_ReFT/blob/main/License.txt))

## Introduction

We introduce REinforced Fine-Tuning (ReFT), a method that enhances the generalizability of LLMs fine-tuned for reasoning.

This repository contains:

- A supervised fine-tuned (SFT) model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-GSM8k)
- A warm-up supervised fine-tuned model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k)
- A REinforced Fine-Tuned (ReFT) model on the GSM8k benchmark: [lqtrung1998/galactica-6.7b-ReFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-GSM8k)
- A Rerank model that can score the output of the fine-tuned model: [lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k)

Note: Our models are fine-tuned from Galactica, so the licenses that apply to Galactica, such as the non-commercial CC BY-NC 4.0 license, also apply to these models.

## Training Data

The model is trained on GSM8k data in the Python SDP CoT format, which can be found [here](https://github.com/lqtrung1998/mwp_ReFT).

## Training Procedure

Check out our paper and repo for complete details.

#### ReFT model

The ReFT model is first warmed up via supervised fine-tuning on the GSM8k Python SDP training data for 2 epochs, then REinforced Fine-Tuned for 300 epochs using the questions in the GSM8k training set.

#### Rerank model

The Rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after its 2-epoch warm-up.

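A classifier like this is typically used to pick one answer out of several sampled candidates. The sketch below shows that selection step only. `pick_best` is a hypothetical helper and `score_fn` is a hypothetical stand-in for the Rerank model's estimated probability that a candidate is correct. The candidate strings and scores are invented for illustration.

```python
from typing import Callable, Sequence


def pick_best(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reranker score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


# Made-up scores standing in for the Rerank model's output (assumption).
toy_scores = {
    "def solution():\n    return 12 * 50": 0.08,
    "def solution():\n    return 12 * 50 / 60": 0.93,
}

best = pick_best(list(toy_scores), toy_scores.__getitem__)
print(best)
```

In practice `score_fn` would wrap a forward pass of the Rerank checkpoint. See the repo scripts for how the authors actually run it.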

## Evaluation Results

See the evaluation results of the models in Table 4 of the paper.

## Usage

You can use the models through Hugging Face's Transformers library or follow the scripts in our repo.

Prompt format:

```
Question:
Weng earns $12 an hour for babysitting. Yesterday, she
just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
```
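
A minimal sketch of sending this prompt through Transformers, assuming the checkpoint loads with `AutoModelForCausalLM` and `AutoTokenizer`. The helper names, the exact newline placement in `build_prompt`, greedy decoding, and `max_new_tokens=256` are assumptions, not settings taken from the paper or repo.

```python
MODEL_ID = "lqtrung1998/galactica-6.7b-ReFT-GSM8k"


def build_prompt(question: str) -> str:
    """Wrap a question in the prompt format shown above (whitespace is assumed)."""
    return f"Question:\n{question.strip()}\nAnswer reasoning:\n"


def generate_solution(question: str, max_new_tokens: int = 256) -> str:
    """Load the model and return only the newly generated text."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    # Pass only the tensors generate() needs; some tokenizers also return
    # token_type_ids, which causal language models may reject.
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding (assumption)
    )
    # Drop the prompt tokens so only the model's continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# print(generate_solution("Weng earns $12 an hour for babysitting. ..."))
```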

Expected response:

```python
def solution():
    """Weng earns $12 an hour for babysitting. Yesterday, she just did
    50 minutes of babysitting. How much did she earn?"""
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
```
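
Because the response is a Python program, the numeric answer comes from running it. A minimal sketch, assuming the generated text defines a `solution()` function as above. `run_solution` is a hypothetical helper, and `exec` should only be used on output you trust or inside a sandbox.

```python
def run_solution(program: str):
    """Execute generated code that defines solution() and return its result."""
    namespace = {}
    exec(program, namespace)  # runs arbitrary code: trusted or sandboxed input only
    return namespace["solution"]()


# The expected response from above, stored as a string.
response = '''def solution():
    """Weng earns $12 an hour for babysitting. Yesterday, she just did
    50 minutes of babysitting. How much did she earn?"""
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
'''

answer = run_solution(response)
print(answer)  # about 10 (dollars)
```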

## Citation

Please cite the paper if you use our data, model or code.

```
@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning},
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```