---
license: mit
language:
- en
datasets:
- akjindal53244/Arithmo-Data
tags:
- Mathematical Reasoning
---

**Arithmo2-Mistral-7B** improves upon the initially released [Arithmo-Mistral-7B](https://huggingface.co/akjindal53244/Arithmo-Mistral-7B) model on both the GSM8K and MATH benchmarks. Specifically, it achieves an **absolute** improvement of:

- +1.7% on GSM8K
- +3.0% on GSM8K PoT
- +1.9% on MATH

**This repo contains the final merged model.** If you are interested in the LoRA adapter alone, use the [LoRA Adapter](https://huggingface.co/upaya07/Arithmo2-Mistral-7B-adapter) repo instead.
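If you prefer to apply the adapter yourself, a minimal sketch with `peft` might look like the following (illustrative only, not an official recipe; it pairs the adapter repo linked above with the `mistralai/Mistral-7B-v0.1` base model):

```
# Illustrative sketch: load the base model and apply the LoRA adapter with peft,
# as an alternative to downloading this merged checkpoint.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

# Wrap the base model with the adapter weights from the adapter repo linked above.
model = PeftModel.from_pretrained(base_model, "upaya07/Arithmo2-Mistral-7B-adapter")
```
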
### Model Description

- **Project GitHub Page:** https://github.com/akjindal53244/Arithmo
- **Developed by:** [Ashvini Kumar Jindal](https://www.linkedin.com/in/ashvini-jindal-26653262/)
- **Funded by:** self-funded
- **Model type:** fine-tuned with 4-bit QLoRA on a single GPU
- **Language(s) (NLP):** English
- **Finetuned from model:** mistralai/Mistral-7B-v0.1

## Results

Arithmo2-Mistral-7B is an improved version of the [Arithmo-Mistral-7B](https://huggingface.co/akjindal53244/Arithmo-Mistral-7B) model and is competitive with fully fine-tuned state-of-the-art 7B mathematical reasoning models. Refer to the [Comparing Arithmo models with other SFT LLM models](https://github.com/akjindal53244/Arithmo/tree/master?tab=readme-ov-file#comparing-arithmo-models-with-other-sft-llm-models) section for more details.

<table>
<thead>
<tr>
<th>Prompt Approach</th>
<th>GSM8K</th>
<th>MATH</th>
</tr>
</thead>
<tbody>
<tr>
<td>Zero-Shot CoT</td>
<td><b>76.4</b></td>
<td><b>27.2</b></td>
</tr>
<tr>
<td>Zero-Shot PoT</td>
<td><b>74.2</b></td>
<td>-</td>
</tr>
</tbody>
</table>

- **Zero-Shot CoT**: Given a question as the prompt, the model generates reasoning steps along with the answer, and we check whether the answer matches the ground truth.
- **Zero-Shot PoT**: We prompt the model to generate a Python program for the given question. During inference, we execute the generated program and check whether its output matches the ground-truth answer; see the sketch below.
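
The snippet below is an illustrative sketch of that PoT check, not the official evaluation harness (see the Arithmo GitHub repo for the real scripts); it assumes the generated program prints its final answer to stdout:

```
# Illustrative PoT scoring sketch (assumes the program prints its answer).
import contextlib
import io

def run_pot_program(generated_program: str) -> str:
    """Execute a model-generated Python program and capture its stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # exec runs untrusted model output: sandbox it in real evaluations.
            exec(generated_program, {})
    except Exception:
        return ""  # a crashing program is scored as incorrect
    return buffer.getvalue().strip()

def is_pot_answer_correct(generated_program: str, ground_truth: str) -> bool:
    return run_pot_program(generated_program) == ground_truth.strip()
```
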
## Installation

```
pip install "transformers>=4.34.0"
pip install accelerate
pip install sentencepiece
pip install protobuf

# If you are GPU poor like me
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# If you have a GPU
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install scipy
pip install bitsandbytes
```

## How to query the model

```
# Set `run_model_on_gpu` to `False` if you are running on CPU. The model will
# generate reasoning steps along with the answer for your question. If you want
# it to generate a Python program instead, uncomment line-69, which adds the
# Python prompt.

$ python query_model.py
```
**Note:** The above script automatically formats your input, so you only need to type the question (e.g., `What is 2+2?`) without any prefix like `Question:`. Check out [query_model.py](https://github.com/akjindal53244/Arithmo/blob/master/query_model.py) for more details. <br><br>

##### Sample Input:
```
Question: There are total 10 children. I have to give 1 apple to first child, 2 apples to second child, 3 apples to third child, and so on. How many apples do I need?
```
##### Model Output:
```
Answer: The total number of apples needed is the sum of the first 10 positive integers.
This can be calculated using the formula for the sum of an arithmetic series:
\[S = \frac{n}{2}(a_1 + a_n),\]
where $S$ is the sum, $n$ is the number of terms, $a_1$ is the first term, and $a_n$ is the last term.
In this case, $n = 10$, $a_1 = 1$, and $a_n = 10$.
Plugging these values into the formula, we get:
\[S = \frac{10}{2}(1 + 10) = 5(11) = \boxed{55}.\]
The answer is: 55
```

Arithmo2-Mistral-7B is trained with the same prompt format as [Arithmo-Mistral-7B](https://huggingface.co/akjindal53244/Arithmo-Mistral-7B):

#### CoT Format (generate reasoning steps with answer):
```
Question: <question>

Answer:
```

#### PoT Format (generate a python program):
```
Question: <question> <python_prompt>

Answer:
```
The model performs best when queried in this exact format, so use it if you write your own inference script; a sketch follows.
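
For example, here is a minimal `transformers` sketch that queries the merged model with the CoT format above (assumptions: `upaya07/Arithmo2-Mistral-7B` is this repo's Hub id, and greedy decoding with up to 512 new tokens is a sensible default):

```
# Illustrative sketch; query_model.py in the GitHub repo handles the prompt
# formatting and generation for you.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upaya07/Arithmo2-Mistral-7B"  # assumption: this repo's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float32 if running on CPU
    device_map="auto",
)

question = "What is 2+2?"
prompt = f"Question: {question}\n\nAnswer:"  # CoT format; add <python_prompt> for PoT

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
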
## Comparing Arithmo models with other SFT LLM models
Results for all models except the Arithmo models are taken from the [MetaMath](https://github.com/meta-math/MetaMath/blob/main/README.MD) repository.

| Model | GSM8K Pass@1 | MATH Pass@1 | Fine-tuning |
|---------------------|--------------|-------------|-------------|
| MPT-7B | 6.8 | 3.0 | -- |
| Falcon-7B | 6.8 | 2.3 | -- |
| LLaMA-1-7B | 11.0 | 2.9 | -- |
| LLaMA-2-7B | 14.6 | 2.5 | -- |
| MPT-30B | 15.2 | 3.1 | -- |
| LLaMA-1-13B | 17.8 | 3.9 | -- |
| GPT-Neo-2.7B | 19.5 | -- | -- |
| Falcon-40B | 19.6 | 2.5 | -- |
| Baichuan-chat-13B | 23.9 | -- | -- |
| Vicuna-v1.3-13B | 27.6 | -- | -- |
| LLaMA-2-13B | 28.7 | 3.9 | -- |
| InternLM-7B | 31.2 | -- | -- |
| ChatGLM-2-6B | 32.4 | -- | -- |
| GPT-J-6B | 34.9 | -- | -- |
| LLaMA-1-33B | 35.6 | 3.9 | -- |
| LLaMA-2-34B | 42.2 | 6.24 | -- |
| RFT-7B | 50.3 | -- | -- |
| LLaMA-1-65B | 50.9 | 10.6 | -- |
| Qwen-7B | 51.6 | -- | -- |
| WizardMath-7B | 54.9 | 10.7 | -- |
| LLaMA-2-70B | 56.8 | 13.5 | -- |
| WizardMath-13B | 63.9 | 14.0 | -- |
| MetaMath-7B | 66.5 | 19.8 | -- |
| MetaMath-13B | 72.3 | 22.4 | -- |
| Arithmo-Mistral-7B (PoT) | 71.2 | -- | SFT: 4-bit QLoRA |
| Arithmo2-Mistral-7B (PoT) | 74.2 | -- | SFT: 4-bit QLoRA |
| MetaMath-Mistral-7B | 77.7 | 28.2 | SFT: Full fine-tuned |
| Arithmo-Mistral-7B | 74.7 | 25.3 | SFT: 4-bit QLoRA |
| 🔥 **Arithmo2-Mistral-7B** | **76.4** | **27.2** | **SFT: 4-bit QLoRA** |

If you are interested in reproducing the results, see the [Reproducing Results](https://github.com/akjindal53244/Arithmo#reproducing-results) section of the GitHub repo.

### Support My Work

Building LLMs takes time and resources; if you find my work interesting, your support would be epic!

<a href="https://www.buymeacoffee.com/a_little_learner" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>

### Citation
To cite Arithmo models:
```
@misc{jindal_2023_arithmo,
  author       = {Jindal, Ashvini},
  title        = {Arithmo-Mistral-7B: Mathematical Reasoning Model},
  howpublished = {Hugging Face},
  month        = {October},
  year         = {2023},
  url          = {https://huggingface.co/akjindal53244/Arithmo-Mistral-7B}
}
```

<h2 id="References">References</h2>

```
@article{yu2023metamath,
  title   = {MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
  author  = {Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
  journal = {arXiv preprint arXiv:2309.12284},
  year    = {2023}
}

@article{yue2023mammoth,
  title   = {MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning},
  author  = {Yue, Xiang and Qu, Xingwei and Zhang, Ge and Fu, Yao and Huang, Wenhao and Sun, Huan and Su, Yu and Chen, Wenhu},
  journal = {arXiv preprint arXiv:2309.05653},
  year    = {2023}
}

@article{mishra2022lila,
  title   = {Lila: A Unified Benchmark for Mathematical Reasoning},
  author  = {Mishra, Swaroop and Finlayson, Matthew and Lu, Pan and Tang, Leonard and Welleck, Sean and Baral, Chitta and Rajpurohit, Tanmay and Tafjord, Oyvind and Sabharwal, Ashish and Clark, Peter and Kalyan, Ashwin},
  journal = {arXiv preprint arXiv:2210.17517},
  year    = {2022}
}
```