ReFT: Reasoning with REinforced Fine-Tuning

Paper: https://arxiv.org/pdf/2401.08967.pdf

Repo: https://github.com/lqtrung1998/mwp_ReFT (under Apache2.0 License)

Introduction

We introduce REinforced Fine-Tuning (ReFT), a method that improves how well LLMs fine-tuned for reasoning generalize.

This repository contains:

A Warmup Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/Codellama-7b-hf-SFT-warmup-GSM8k
A Supervised Fine-tuned model on GSM8k benchmark: lqtrung1998/Codellama-7b-hf-SFT-GSM8k
A Rerank model that can score the fine-tuned SFT model output: lqtrung1998/Codellama-7b-hf-SFT-Rerank-GSM8k
A REinforced Fine-tuned model on GSM8k benchmark: lqtrung1998/Codellama-7b-hf-ReFT-GSM8k
A Rerank model that can score the fine-tuned ReFT model output: lqtrung1998/Codellama-7b-hf-ReFT-Rerank-GSM8k

Note: Our models are fine-tuned from Codellama, so the licenses that apply to Codellama, such as the Llama license, also apply to these models.

Training Data

The model is trained on GSM8k data with Python SDP CoT format, which can be found here

Training Procedure

Check out our paper and repo for complete details.

ReFT model

The ReFT model is first warmed up via Supervised Fine-tuning on the GSM8k Python SDP training data for 2 epochs, then REinforced Fine-tuned for 300 epochs using the questions in the GSM8k training set.
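The reinforcement stage needs a scalar reward for each sampled solution. The paper has the full details; as a rough sketch only, a reward based on final-answer correctness could look like the following. The function name `terminal_reward`, the tolerance, and the partial-credit value of 0.1 for a wrong but extractable numeric answer are assumptions for illustration and should be checked against the paper and repo.

```python
from typing import Optional


def terminal_reward(predicted: Optional[float], gold: float,
                    partial: float = 0.1, tol: float = 1e-6) -> float:
    """Score one sampled solution from its final answer.

    predicted: the numeric answer extracted from the sampled program,
               or None if no answer could be extracted.
    gold:      the ground-truth answer for the question.
    partial:   assumed partial credit for a numeric but wrong answer.
    """
    if predicted is None:
        return 0.0          # nothing could be extracted
    if abs(predicted - gold) <= tol:
        return 1.0          # correct final answer
    return partial          # numeric answer, but wrong


rewards = [terminal_reward(p, gold=10.0) for p in (10.0, 12.0, None)]
```

Here `rewards` holds one value per sampled solution, which a policy-gradient method can then use as the end-of-sequence reward.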

Rerank model

The Rerank model is trained to classify whether an output CoT is correct, using data sampled from the ReFT model after its 2-epoch warm-up.
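To make the classification target concrete, here is a minimal sketch of how binary labels for sampled CoTs could be derived by comparing each sample's final answer with the ground truth. The helper name `label_samples` and the numeric tolerance are assumptions, not the repo's actual data pipeline.

```python
from typing import List, Optional, Sequence


def label_samples(predicted: Sequence[Optional[float]], gold: float,
                  tol: float = 1e-6) -> List[int]:
    """Return 1 for each sampled answer matching `gold` within `tol`, else 0.

    A None entry stands for a sample whose program failed to produce an answer.
    """
    return [int(p is not None and abs(p - gold) <= tol) for p in predicted]


# Four hypothetical samples for a question whose correct answer is 10.
labels = label_samples([10.0, 12.0, None, 10.0], gold=10.0)
```

Each (question, sampled CoT, label) triple would then serve as one training example for the classifier.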

Evaluation Results

See the evaluation results of the models in Table 4 of the research paper.

Updated results:

Model	Top-1	Voting@100	Rerank@100
Codellama-7b-hf-SFT-warmup-GSM8k	63.00	-	-
Codellama-7b-hf-SFT-GSM8k (+Codellama-7b-hf-SFT-Rerank-GSM8k)	63.68	68.0	77.0
Codellama-7b-hf-ReFT-GSM8k (+Codellama-7b-hf-ReFT-Rerank-GSM8k)	75.28	78.0	81.2
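Voting@100 and Rerank@100 can be read as two ways of choosing one answer from 100 sampled solutions per question: a majority vote over the executed answers, and the answer of the candidate that the rerank model scores highest. The sketch below illustrates both selection rules; the function names and the `(answer, score)` input layout are assumptions, not the repo's API.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_vote(answers: Sequence[Optional[float]]) -> Optional[float]:
    """Return the most frequent answer, ignoring failed samples (None).

    Ties are broken in favor of the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def rerank_select(candidates: Sequence[Tuple[Optional[float], float]]) -> Optional[float]:
    """Return the answer of the candidate with the highest reranker score.

    Each candidate is an (answer, score) pair; higher score means the
    reranker considers the solution more likely to be correct.
    """
    if not candidates:
        return None
    best_answer, _ = max(candidates, key=lambda c: c[1])
    return best_answer


voted = majority_vote([10.0, 10.0, 12.0, None])
reranked = rerank_select([(10.0, 0.2), (12.0, 0.9), (10.0, 0.4)])
```

The two rules can disagree: in this toy input the vote picks 10.0, while the reranker picks 12.0 because a single candidate received the highest score.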

Usage

You can use the models through Hugging Face's Transformers library or by following the scripts in our repo.

Prompt format:

Question:
Weng earns $12 an hour for babysitting. Yesterday, she
just did 50 minutes of babysitting. How much did she earn?
Answer reasoning:
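A minimal sketch of building this prompt and generating with Transformers follows. The helper names, the trailing newline after "Answer reasoning:", and the decoding settings (greedy decoding, 512 new tokens) are assumptions; see the repo scripts for the exact setup.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the prompt format shown above."""
    return f"Question:\n{question.strip()}\nAnswer reasoning:\n"


def generate_solution(question: str,
                      model_name: str = "lqtrung1998/Codellama-7b-hf-ReFT-GSM8k",
                      max_new_tokens: int = 512) -> str:
    """Greedy-decode one solution program for `question`."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    model.eval()

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_prompt("Weng earns $12 an hour for babysitting. Yesterday, she "
                      "just did 50 minutes of babysitting. How much did she earn?")
```

Calling `generate_solution(...)` downloads the model weights on first use and is best run on a GPU; device placement is left out here for brevity.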

Expected response:

def solution():
  """Weng earns $12 an hour for babysitting. Yesterday, she just did
  50 minutes of babysitting. How much did she earn?"""
  hourly_rate = 12
  minutes_worked = 50
  hours_worked = minutes_worked / 60
  earnings = hourly_rate * hours_worked
  result = earnings
  return result
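Because the response is a Python program, the final answer is obtained by running it and calling `solution()`. The sketch below shows one simple way to do that; `run_solution` is a hypothetical helper, not the repo's executor, and `exec` runs arbitrary code, so generated programs should be sandboxed in practice.

```python
from typing import Optional


def run_solution(program: str) -> Optional[float]:
    """Execute a generated program and return solution()'s result as a float.

    Returns None if the program fails to compile, run, or yield a number.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)  # WARNING: runs untrusted code; sandbox in real use
        return float(namespace["solution"]())
    except Exception:
        return None


program = '''
def solution():
    hourly_rate = 12
    minutes_worked = 50
    hours_worked = minutes_worked / 60
    earnings = hourly_rate * hours_worked
    result = earnings
    return result
'''
answer = run_solution(program)
```

For the example above, `answer` comes out to about 10 (dollars), which can then be compared with the reference answer using a small numeric tolerance.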

Citation

Please cite the paper if you use our data, model or code.

@misc{luong2024reft,
      title={ReFT: Reasoning with Reinforced Fine-Tuning}, 
      author={Trung Quoc Luong and Xinbo Zhang and Zhanming Jie and Peng Sun and Xiaoran Jin and Hang Li},
      year={2024},
      eprint={2401.08967},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Intended Use

Intended Use Cases: Code Llama and its variants are intended for commercial and research use in English and relevant programming languages. The base model Code Llama can be adapted for a variety of code synthesis and understanding tasks; Code Llama - Python is designed specifically to handle the Python programming language; and Code Llama - Instruct is intended to be safer to use for code assistant and generation applications.

Out-of-Scope Uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Code Llama and its variants.

Ethical Considerations and Limitations

Code Llama and its variants are a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Code Llama’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or objectionable responses to user prompts. Therefore, before deploying any applications of Code Llama, developers should perform safety testing and tuning tailored to their specific applications of the model.

Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide.