Omartificial-Intelligence-Space committed on
Commit 4ff268f · verified · 1 Parent(s): aa6b636

Update README.md

Files changed (1)
  1. README.md +165 -18

README.md CHANGED
@@ -7,40 +7,187 @@ tags:
  - generated_from_trainer
  - trl
  - grpo
  licence: license
  ---

- # Model Card for Fanar-0.5B-GRPO-test

- This model is a fine-tuned version of [QCRI/Fanar-1-9B-Instruct](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) on the [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset.
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

  ```python
- from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="Omartificial-Intelligence-Space/Fanar-0.5B-GRPO-test", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```

- ## Training procedure

- This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).

- ### Framework versions

- - TRL: 0.14.0
- - Transformers: 4.47.1
- - Pytorch: 2.4.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0

  ## Citations

  Cite GRPO as:
  - generated_from_trainer
  - trl
  - grpo
+ - math
+ - reasoning
+ - R1
  licence: license
+ license: apache-2.0
+ language:
+ - ar
+ - en
  ---

+ # 🧠 Fanar-Math-R1-GRPO

+ **Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). This version is fine-tuned using **Group Relative Policy Optimization (GRPO)** from the DeepSeekMath framework on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.

+ ---
+
+ ## 🚀 Model Highlights
+
+ - 🔁 Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
+ - 🧮 Specializes in **multi-step mathematical reasoning**
+ - 💬 Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
+ - 🧠 Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
+ - 🏷️ Useful for both instruction-following and math-heavy dialogue generation
+
+ ---
+
+ ## 📦 Model Details
+
+ | Component | Description |
+ |-----------|-------------|
+ | **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
+ | **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
+ | **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
+ | **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
+ | **LoRA** | Enabled (modules: `q_proj`, `v_proj`, rank=8) |
+ | **Epochs** | 1 (lightweight test configuration) |
+ | **Tokenizer** | Same as base model |
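
The LoRA row above corresponds to a PEFT adapter configuration roughly like the sketch below. Only the target modules and rank come from the table; `lora_alpha`, `lora_dropout`, and the task type are illustrative assumptions, not values from the card.

```python
from peft import LoraConfig

# Sketch of the adapter described in the Model Details table:
# rank-8 LoRA on the attention projections q_proj and v_proj.
# lora_alpha / lora_dropout are assumed defaults, not documented values.
peft_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```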
47
+
48
+ ---
49
+
50
+ ## ๐Ÿงช Inference Example
51
 
52
  ```python
53
+ from transformers import AutoTokenizer, AutoModelForCausalLM
54
+ import torch
55
+ import time
56
+
57
+ model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
58
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
59
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
60
+
61
+ def generate_with_reasoning(prompt_text):
62
+ inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
63
+ start = time.time()
64
+ with torch.no_grad():
65
+ output = model.generate(**inputs, max_length=1024)
66
+ end = time.time()
67
+
68
+ generated = tokenizer.decode(output[0], skip_special_tokens=True)
69
+ duration = end - start
70
+ num_input_tokens = inputs["input_ids"].shape[1]
71
+ num_generated_tokens = output.shape[1] - num_input_tokens
72
+
73
+ return generated, duration, num_generated_tokens
74
+
75
+ # Example Arabic math problem
76
+ prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer either in Arabic or English based on user's language. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> ููŠ ู…ุฏูŠู†ุฉ ูŠุจู„ุบ ุนุฏุฏ ุณูƒุงู†ู‡ุง 1 ู…ู„ูŠูˆู† ู†ุณู…ุฉุŒ ุฅุฐุง ูƒุงู† 60% ู…ู† ุงู„ุณูƒุงู† ุจุงู„ุบูŠู†ุŒ ูˆ40% ู…ู† ุงู„ุจุงู„ุบูŠู† ูŠุนู…ู„ูˆู†ุŒ ููƒู… ุนุฏุฏ ุงู„ุนุงู…ู„ูŠู† ููŠ ุงู„ู…ุฏูŠู†ุฉุŸ"""
77
+
78
+ result, time_taken, tokens = generate_with_reasoning(prompt)
79
+ print(result)
80
+ ```
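
If the tokenizer exposes a chat template (an assumption; the instruct-tuned base usually ships one), the same helper can also be fed a templated conversation instead of a raw string. A minimal sketch, reusing `generate_with_reasoning` from the block above:

```python
# Hypothetical alternative: let the tokenizer's chat template build the prompt.
messages = [
    {"role": "user", "content": "A city has 1,000,000 residents, 60% are adults, and 40% of the adults work. How many workers are there?"},
]
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result, time_taken, tokens = generate_with_reasoning(chat_prompt)
print(result)
```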

+ ---
+
+ ## 🛠️ Training Setup
+
+ ### Configuration Summary
+
+ - **learning_rate**: 1e-5
+ - **epochs**: 1
+ - **max_completion_length**: 64
+ - **num_generations**: 4
+ - **gradient_accumulation_steps**: 16
+ - **logging_steps**: 10
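
These values map directly onto TRL's `GRPOConfig`. A hedged sketch of the corresponding arguments (the output directory is a placeholder, and any option not listed above is left at its default):

```python
from trl import GRPOConfig

# Mirrors the configuration summary above; output_dir is an assumed placeholder.
training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,
    num_generations=4,
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```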
+
+ ### Reward Functions
+
+ - **accuracy_reward**: validates correctness of the answer using `math_verify`
+ - **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
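
A minimal sketch of what these two reward functions might look like in the shape TRL's `GRPOTrainer` expects (a list of per-completion floats). These are hypothetical implementations, not the exact code used for this run; the `parse`/`verify` calls assume `math_verify`'s public API and a conversational dataset format where each completion is a list of chat messages.

```python
import re
from math_verify import parse, verify  # assumed API: parse() and verify()

def format_reward(completions, **kwargs):
    """1.0 when the completion wraps reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.search(pattern, content, re.DOTALL) else 0.0 for content in contents]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 when the parsed model answer matches the dataset's reference solution."""
    rewards = []
    for completion, sol in zip(completions, solution):
        answer = parse(completion[0]["content"])
        gold = parse(sol)
        rewards.append(1.0 if verify(gold, answer) else 0.0)
    return rewards

# Both callables could then be wired in via
# GRPOTrainer(reward_funcs=[format_reward, accuracy_reward], args=training_args, ...).
```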

+ ### Libraries & Versions
+
+ ```
+ transformers==4.47.1
+ trl==0.14.0
+ peft==0.14.0
+ datasets==2.21.0
+ math_verify==0.3.3
+ torch==2.4.1
  ```

+ ---
+
+ ## 📊 Training Metrics (Snapshot)
+
+ | Step | Reward (avg) | Accuracy Reward | Format Reward | Loss   | KL Divergence |
+ |------|--------------|-----------------|---------------|--------|---------------|
+ | 10   | 0.029        | 0.029           | 0.0           | 0.0    | 0.00024       |
+ | 100  | 0.039        | 0.039           | 0.0           | 0.0001 | 0.00188       |
+ | 200  | 0.033        | 0.033           | 0.0           | 0.0001 | 0.00183       |
+ | 300  | 0.045        | 0.045           | 0.0           | 0.0001 | 0.00127       |
+
+ *Note: Training was run with a small config for notebook-friendly experimentation.*
+
+ ---

+ ## 📚 Output Format

+ The model is trained to follow a reasoning-first format:

+ ```
+ <think> First, we calculate 60% of 1 million, which is 600,000. Then, 40% of that is 240,000. </think>
+ <answer> 240,000 </answer>
+ ```
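
Because the final answer is always wrapped in `<answer>` tags, downstream code can pull it out with a small regex helper. An illustrative sketch, not part of the model card:

```python
import re

def extract_answer(text):
    # Return the content of the first <answer>...</answer> block, or None if absent.
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

sample = "<think> 60% of 1,000,000 is 600,000; 40% of that is 240,000. </think><answer> 240,000 </answer>"
print(extract_answer(sample))  # -> 240,000
```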
+
+ ---
+
+ ## 🔬 Citations
+
+ ### GRPO – DeepSeekMath
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+   title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
+   author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
+   journal={arXiv preprint arXiv:2402.03300},
+   year={2024}
+ }
+ ```
+
+ ### TRL Library
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+   title={TRL: Transformer Reinforcement Learning},
+   author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
+   year={2022},
+   howpublished={\url{https://github.com/huggingface/trl}}
+ }
+ ```
+
+ ---
+
+ ## 🔗 Resources
+
+ - [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
+ - [TRL Documentation](https://huggingface.co/docs/trl)
+ - [Open-R1 Project](https://github.com/huggingface/open-r1)
+
+ ---
+
+ ## 🧑‍🔬 Authors
+
+ Developed and trained by **Omar Paniego**, adapting the DeepSeek-R1 training recipe with Hugging Face's open tools and datasets.
+
+ ---
+
+ ## 📢 License
+
+ Refer to the license file in the repository.
+
+ ---

+ ## ❤️ Acknowledgements

+ Thanks to:
+ - **Hugging Face Science Team** for `trl` and `math_verify`
+ - **AI-MO** for the NuminaMath-TIR dataset
+ - **DeepSeek Team** for releasing their methodology and insights

+ Happy reasoning! 🔍✨

  ## Citations

  Cite GRPO as: