	---
	license: cc-by-nc-4.0
	---
	# ReFT: Reasoning with REinforced Fine-Tuning
	Paper: https://arxiv.org/pdf/2401.08967.pdf

	Repo: https://github.com/lqtrung1998/mwp_ReFT (under [Apache2.0 License](https://github.com/lqtrung1998/mwp_ReFT/blob/main/License.txt))

	## Introduction
	We introduce REinforced Fine-tuning (ReFT), a method that enhances the generalizability of learning LLMs for reasoning.

	This repository contains:
	- A Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-GSM8k)
	- A Warmup Supervised Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-SFT-warmup-GSM8k)
	- A REinforced Fine-tuned model on GSM8k benchmark: [lqtrung1998/galactica-6.7b-ReFT-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-GSM8k)
	- A Rerank model that can score the fine-tuned model output: [lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k](https://huggingface.co/lqtrung1998/galactica-6.7b-ReFT-Rerank-GSM8k)

	Note: Our models are fine-tuned from Galactica, so the licenses that apply to Galactica, such as the non-commercial CC BY-NC 4.0 license, also apply to these models.

	## Training Data
	The model is trained on GSM8k data in the Python SDP CoT format, which can be found [here](https://github.com/lqtrung1998/mwp_ReFT).

	## Training Procedure
	Check out our paper and repo for complete details.
	#### ReFT model
	The ReFT model is first warmed up with Supervised Fine-tuning on the GSM8k Python SDP training data for 2 epochs, and is then REinforced Fine-tuned for 300 epochs using the questions in the GSM8k training set.
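
Because the reinforcement stage uses only the questions, the learning signal has to come from checking the final answer of each sampled program. The following is a minimal sketch of such an answer-based reward, not the repository's implementation: the function name `terminal_reward`, the tolerance, and the reward values (1.0, 0.1, 0.0) are assumptions chosen for illustration, so check the paper and repo for the actual definition.

```python
from typing import Optional


def terminal_reward(predicted: Optional[float], gold: float, tol: float = 1e-6) -> float:
    """Score one sampled program by its final answer only.

    `predicted` is None when the program crashed or did not return a number.
    The reward values below are illustrative assumptions.
    """
    if predicted is None:
        return 0.0  # no usable answer
    if abs(predicted - gold) <= tol:
        return 1.0  # correct final answer
    return 0.1  # numeric but wrong: small partial credit


# Three hypothetical samples for a question whose gold answer is 10.
rewards = [terminal_reward(p, 10.0) for p in (10.0, 600.0, None)]
print(rewards)
```

A sparse reward like this only looks at the end result, which is why the training questions alone are enough for this stage.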
	#### Rerank model
	The Rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after its 2-epoch warm-up.
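
One way to use such a classifier at inference time is to sample several candidate programs and keep the one with the highest score. The sketch below assumes the scores are already available as one float per candidate; `pick_best` and the example scores are hypothetical, and how the Rerank model actually produces its scores is described in the repo.

```python
from typing import List, Tuple


def pick_best(candidates: List[str], scores: List[float]) -> Tuple[str, float]:
    """Return the candidate with the highest reranker score, plus that score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and the same length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


# Hypothetical sampled programs and made-up "correct" probabilities.
samples = [
    "def solution():\n    return 12 * 50",
    "def solution():\n    return 12 * 50 / 60",
]
probabilities = [0.08, 0.93]
best_program, best_score = pick_best(samples, probabilities)
print(best_program)
```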

	## Evaluation Results
	See the evaluation results of the models in Table 4 of the research paper.

	## Usage
	You can use the models through Hugging Face's Transformers library or by following the scripts in our repo.

	Prompt format:
	```python
	Question:
	Weng earns $12 an hour for babysitting. Yesterday, she
	just did 50 minutes of babysitting. How much did she earn?
	Answer reasoning:
	```
	Expected response:
	```python
	def solution():
	    """Weng earns $12 an hour for babysitting. Yesterday, she just did
	    50 minutes of babysitting. How much did she earn?"""
	    hourly_rate = 12
	    minutes_worked = 50
	    hours_worked = minutes_worked / 60
	    earnings = hourly_rate * hours_worked
	    result = earnings
	    return result
	```
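
The prompt format and expected response above can be tied together in a short script. This is a minimal sketch under stated assumptions, not the repo's inference code: the helper names (`build_prompt`, `generate_program`, `run_solution`), the trailing newline after `Answer reasoning:`, greedy decoding, and `max_new_tokens=256` are all choices made here for illustration. `generate_program` imports Transformers lazily and downloads the model weights when called, so the demo at the bottom runs only the hand-copied example program.

```python
PROMPT_TEMPLATE = "Question:\n{question}\nAnswer reasoning:\n"


def build_prompt(question: str) -> str:
    """Wrap a question in the prompt format shown above."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def generate_program(question: str,
                     model_id: str = "lqtrung1998/galactica-6.7b-SFT-GSM8k") -> str:
    """Generate a Python program for the question (requires the model weights)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    # Pass only input_ids so that extra tokenizer outputs are not forwarded.
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=256, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def run_solution(program: str):
    """Execute a generated program and return the value of solution().

    exec() runs arbitrary code, so real model output belongs in a sandbox.
    """
    namespace = {}
    exec(program, namespace)
    return namespace["solution"]()


# The expected response from above, copied by hand (docstring omitted).
example_program = """
def solution():
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
"""
answer = run_solution(example_program)
print(answer)
```

Since the model answers with a program rather than a number, the final answer is obtained by running `solution()`; for the example above this evaluates 12 * (50 / 60).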

	## Citation
	Please cite the paper if you use our data, model or code.
	```
	@misc{luong2024reft,
	title={ReFT: Reasoning with Reinforced Fine-Tuning},
	author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
	year={2024},
	eprint={2401.08967},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```