---
license: cc-by-nc-4.0
---
# ReFT: Reasoning with REinforced Fine-Tuning
Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under [Apache2.0 License](https://github.com/lqtrung1998/mwp_ReFT/blob/main/License.txt))

## Introduction
We introduce REinforced Fine-Tuning (ReFT), a method that improves the generalizability of LLMs fine-tuned for reasoning.

This repository contains:
- A Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-GSM8k)
- A Warmup Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k)
- A REinforced Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-ReFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-GSM8k)
- A Rerank model that can score the fine-tuned model output: [lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k)

Note: Our models are fine-tuned from Galactica; therefore, licenses that apply to Galactica, such as the non-commercial CC BY-NC 4.0 license, also apply to these models.

## Training Data
The models are trained on GSM8k data in the Python SDP CoT format, which can be found [here](https://github.com/lqtrung1998/mwp_ReFT).

## Training Procedure
Check out our paper and repo for complete details.
#### ReFT model
The ReFT model is first warmed up via supervised fine-tuning on the GSM8k Python SDP training data for 2 epochs, then reinforced fine-tuned for 300 epochs using the questions in the GSM8k training set.
#### Rerank model
The rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after the 2-epoch warm-up.
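
As an illustration of this correctness signal, a sampled Python SDP CoT can be checked by executing it and comparing its result to the gold answer. This is a minimal sketch, not the repo's actual checking code, and `is_cot_correct` is a hypothetical helper name:
```python
def is_cot_correct(cot_code: str, gold_answer: float, tol: float = 1e-4) -> bool:
    """Execute a generated Python SDP CoT and compare its result to the gold answer.

    Note: exec-ing model output is unsafe outside a sandboxed environment.
    """
    namespace: dict = {}
    try:
        exec(cot_code, namespace)         # defines solution()
        result = namespace["solution"]()  # run the generated program
    except Exception:
        return False                      # an unrunnable CoT counts as incorrect
    return abs(float(result) - gold_answer) < tol
```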

## Evaluation Results
See evaluation results of the models in Table 4 of the paper.

## Usage
You can use the models through Hugging Face's Transformers library or follow the scripts in our repo; a loading-and-generation sketch follows the examples below.

Prompt format:
```text
Question:
Weng earns $12 an hour for babysitting. Yesterday, she
just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
```
Expected response:
```python
def solution():
  """Weng earns $12 an hour for babysitting. Yesterday, she just did
  50 minutes of babysitting. How much did she earn?"""
  hourly_rate = 12
  minutes_worked = 50
  hours_worked = minutes_worked / 60
  earnings = hourly_rate * hours_worked
  result = earnings
  return result
```
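
Executing the returned `solution()` gives `10.0`, i.e., Weng earned $10. Below is a minimal sketch of loading the ReFT model with Transformers and generating a Python CoT, assuming the standard `AutoModelForCausalLM`/`AutoTokenizer` interfaces; the generation hyperparameters are illustrative, not the settings from the paper:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lqtrung1998/galactica-6.7b-ReFT-GSM8k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build the prompt in the format shown above: question, then "Answer reasoning:".
prompt = (
    "Question:\n"
    "Weng earns $12 an hour for babysitting. Yesterday, she\n"
    "just did 50 minutes of babysitting. How much did she earn?\n"
    "Answer reasoning:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Strip the prompt tokens so only the generated Python CoT is printed.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```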

## Citation
Please cite the paper if you use our data, models, or code.
```
@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning}, 
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```