- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). This version is fine-tuned using **Group Relative Policy Optimization (GRPO)** from the DeepSeekMath framework on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.

---

## Model Highlights

- Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
- Specializes in **multi-step mathematical reasoning**
- Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
- Useful for both instruction-following and math-heavy dialogue generation

---

## Model Details

| Component        | Description                                                                     |
|------------------|---------------------------------------------------------------------------------|
| **Base Model**   | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct)   |
| **Fine-Tuning**  | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl)                 |
| **Dataset**      | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)  |
| **Format**       | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure        |
| **LoRA**         | Enabled (modules: `q_proj`, `v_proj`, rank=8)                                   |
| **Epochs**       | 1 (lightweight test configuration)                                              |
| **Tokenizer**    | Same as base model                                                              |
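
The exact PEFT adapter configuration is not published with this card. As a rough sketch, a LoRA setup matching the table above (rank 8, targeting `q_proj` and `v_proj`) could be declared as follows; `lora_alpha` and `lora_dropout` are illustrative assumptions, not reported values:

```python
# Illustrative sketch only: the LoRA settings listed in the table above.
# lora_alpha and lora_dropout are assumed values, not reported in this card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank, as listed in Model Details
    target_modules=["q_proj", "v_proj"],  # target modules, as listed in Model Details
    lora_alpha=16,                        # assumption
    lora_dropout=0.05,                    # assumption
    task_type="CAUSAL_LM",
)
```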

---

## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens

    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city of 1 million people, if 60% of the population are adults and
#  40% of the adults are employed, how many workers are there in the city?"
prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer either in Arabic or English based on user's language. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟"""

result, time_taken, tokens = generate_with_reasoning(prompt)
print(result)
```

---

## Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
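
These values map directly onto TRL's `GRPOConfig`. The snippet below is a minimal, hedged sketch of that mapping (using `trl==0.14.0` as pinned below); the `output_dir` is an assumption, and any argument not listed keeps its default:

```python
# Sketch: the hyperparameters above expressed as a trl GRPOConfig (trl==0.14.0).
# output_dir is an assumed value; everything not listed keeps its default.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",   # assumption, not stated in the card
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,
    num_generations=4,
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```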

### Reward Functions

- **accuracy_reward**: validates correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
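
The reward implementations themselves are not reproduced in this card. The sketch below shows what they might look like, assuming TRL's conversational completion format and that `math_verify` exposes the `parse`/`verify` helpers described in its documentation; treat it as illustrative rather than the exact training code:

```python
# Illustrative reward sketches; not the exact functions used during training.
import re
from math_verify import parse, verify  # assumed helpers from the math_verify package

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion wraps its reasoning in <think>...</think>
    followed by <answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 when the parsed model answer matches the parsed reference solution."""
    contents = [completion[0]["content"] for completion in completions]
    return [float(verify(parse(sol), parse(c))) for c, sol in zip(contents, solution)]

# Both callables would be handed to trl's GRPOTrainer via
# reward_funcs=[format_reward, accuracy_reward].
```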

### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss   | KL Divergence |
|------|--------------|-----------------|---------------|--------|---------------|
| 10   | 0.029        | 0.029           | 0.0           | 0.0    | 0.00024       |
| 100  | 0.039        | 0.039           | 0.0           | 0.0001 | 0.00188       |
| 200  | 0.033        | 0.033           | 0.0           | 0.0001 | 0.00183       |
| 300  | 0.045        | 0.045           | 0.0           | 0.0001 | 0.00127       |

*Note: Training was run with a small configuration for notebook-friendly experimentation.*

---

## Output Format

The model is trained to follow a reasoning-first format:

```
<think> First, we calculate 60% of 1 million, which is 600,000. Then, 40% of that is 240,000. </think>
<answer> 240,000 </answer>
```
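
To consume this format programmatically, a small (hypothetical) helper can split a completion into its reasoning and final answer:

```python
import re

def split_reasoning(text):
    """Return (reasoning, answer) extracted from <think>/<answer> tags; None if a tag is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = split_reasoning(
    "<think> First, we calculate 60% of 1 million, which is 600,000. "
    "Then, 40% of that is 240,000. </think>\n<answer> 240,000 </answer>"
)
print(answer)  # -> 240,000
```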

---

## Citations

### GRPO (DeepSeekMath)

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

## Authors

Developed and trained by **Omar Paniego**, adapting the DeepSeek-R1 training recipe using Hugging Face's open tools and datasets.

---

## License

Released under the Apache 2.0 license; refer to the license file in the repository for details.

---

## Acknowledgements

Thanks to:

- **Hugging Face Science Team** for `trl` and `math_verify`
- **AI-MO** for the NuminaMath-TIR dataset
- **DeepSeek Team** for releasing their methodology and insights

Happy reasoning!