---
license: mit
tags:
- nlp
- math
language:
- en
pipeline_tag: text-generation
base_model: microsoft/rho-math-1b-interpreter-v0.1
---

# QuantFactory/rho-math-1b-interpreter-v0.1-GGUF

This is a quantized (GGUF) version of [microsoft/rho-math-1b-interpreter-v0.1](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1), created using llama.cpp.
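
A quick way to try a quant locally is via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The sketch below is illustrative only; the GGUF filename is an assumption, so check this repository's file list for the quantization you actually want (e.g. `Q4_K_M`).

```python
# Minimal sketch for running a GGUF quant of this model locally.
# NOTE: the `filename` below is an assumption; check this repo's file
# list for the actual quantization filenames before running.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="QuantFactory/rho-math-1b-interpreter-v0.1-GGUF",
    filename="rho-math-1b-interpreter-v0.1.Q4_K_M.gguf",  # assumed name
)

llm = Llama(model_path=model_path, n_ctx=2048)
out = llm("Question: What is 15% of 240?\nAnswer:", max_tokens=128)
print(out["choices"][0]["text"])
```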

# Model Description

<h1 align="center">
Rho-1: Not All Tokens Are What You Need
</h1>

<p align="center">
<a href="https://arxiv.org/abs/2404.07965"><b>[📜 arXiv]</b></a> •
<a href="https://huggingface.co/papers/2404.07965"><b>[💬 HF Paper]</b></a> •
<a href="https://huggingface.co/microsoft/rho-math-1b-v0.1"><b>[🤗 Models]</b></a> •
<a href="https://github.com/microsoft/rho"><b>[🐱 GitHub]</b></a>
</p>

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/acc_vs_tokens_1b_7b.png?raw=true" width="1000">
<br>
<em>Figure 1: Rho-1 is pre-trained with Selective Language Modeling (SLM). SLM improves average few-shot accuracy on GSM8k and MATH by over 16%, achieving baseline performance 5-10x faster.</em>
</p>

## 🔥 News

- [2024/04/12] 🔥🔥🔥 Rho-Math-v0.1 models released on 🤗 Hugging Face!
  - [Rho-Math-1B](https://huggingface.co/microsoft/rho-math-1b-v0.1) and [Rho-Math-7B](https://huggingface.co/microsoft/rho-math-7b-v0.1) achieve 15.6% and 31.0% few-shot accuracy on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
  - [Rho-Math-1B-Interpreter](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1) is the first 1B LLM to achieve over 40% accuracy on MATH.
  - [Rho-Math-7B-Interpreter](https://huggingface.co/microsoft/rho-math-7b-interpreter-v0.1) achieves 52% on the MATH dataset, using only 69k samples for fine-tuning.
- [2024/04/11] Rho-1 paper and repo released.

## 💡 Introduction

Rho-1 base models employ Selective Language Modeling (SLM) for pretraining, which selectively trains on clean, useful tokens that are aligned with the desired distribution.

### Selective Language Modeling (SLM)

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/example.png?raw=true" width="1000">
<br>
<em>Figure 2:
<b>Upper:</b> Even an extensively filtered pretraining corpus contains token-level noise.
<b>Left:</b> Previous Causal Language Modeling (CLM) trains on all tokens.
<b>Right:</b> Our proposed Selective Language Modeling (SLM) selectively applies loss to useful, clean tokens.</em>
</p>

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/pipeline.png?raw=true" width="1000">
<br>
<em>Figure 3: <b>The pipeline of Selective Language Modeling.</b>
SLM optimizes language model performance by concentrating on valuable, clean tokens during pre-training.
It involves three steps:
(Step 1) Initially, train a reference model on high-quality data.
(Step 2) Then, score each token's loss in a corpus using the reference model.
(Step 3) Finally, train the language model selectively on tokens that show higher excess loss compared to the reference loss.</em>
</p>
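
In code, this selection rule reduces to keeping the tokens with the highest excess loss, i.e. the current model's per-token loss minus the reference model's per-token loss. The PyTorch sketch below illustrates only that masking step; it is not the authors' implementation, and `keep_ratio` is an assumed hyperparameter.

```python
# Minimal PyTorch sketch of the SLM masking step (illustrative, not the
# authors' implementation). Given per-token losses from the model being
# trained and from a frozen reference model, keep only the top fraction
# of tokens by excess loss and average the loss over those tokens.
import torch

def slm_loss(model_token_loss: torch.Tensor,
             ref_token_loss: torch.Tensor,
             keep_ratio: float = 0.6) -> torch.Tensor:
    """model_token_loss, ref_token_loss: shape (num_tokens,) cross-entropy losses."""
    excess = model_token_loss - ref_token_loss      # excess loss per token
    k = max(1, int(keep_ratio * excess.numel()))    # number of tokens to keep
    _, idx = torch.topk(excess, k)                  # highest-excess tokens
    return model_token_loss[idx].mean()             # train only on those tokens
```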

<!-- results: -->

### Evaluation Results

Base models (Few-shot CoT):

| **Model** | **Size** | **Data** | **Uniq. Tokens** | **Train Tokens** | **GSM8K** | **MATH** | **MMLU STEM** | **SAT** |
|:-----------------:|:--------:|:--------:|:---------------:|:---------------:|:---------:|:--------:|:-------------:|:--------:|
| 1-2B Base Models | | | | | | | | |
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 31.3 | 40.6 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | **34.4** | 50.0 |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | 33.1 | **56.3** |
| [Rho-Math-1B-v0.1](https://huggingface.co/microsoft/rho-math-1b-v0.1) | 1.1B | OWM | 14B | 30B | **36.2** | **15.6** | 23.3 | 28.1 |
| >= 7B Base Models | | | | | | | | |
| Mistral | 7B | - | - | - | 41.2 | 11.6 | 49.5 | 59.4 |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | **63.9** | - |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 54.7 | 68.8 |
| InternLM2-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 53.1 | 71.9 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | **34.2** | 56.4 | **84.4** |
| [Rho-Math-7B-v0.1](https://huggingface.co/microsoft/rho-math-7b-v0.1) | 7B | OWM | 14B | 10.5B | **66.9** | 31.0 | 54.6 | **84.4** |

[Tool-integrated reasoning](https://github.com/microsoft/ToRA) (Code Interpreter):

| **Model** | **Size** | **SFT Data** | **GSM8k** | **MATH** | **SVAMP** | **ASDiv** | **MAWPS** | **TabMWP** | **GSM-Hard** | **AVG** |
|------------------------------|----------|--------------|-----------|----------|-----------|-----------|-----------|------------|--------------|----------|
| gpt4-early (pal) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| gpt-4-turbo-2024-04-09 (cot) | - | - | - | 73.4 | - | - | - | - | - | - |
| Open-Source Small Models | | | | | | | | | | |
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | **82.7** | 86.8 | 93.8 | 74.0 | **67.2** | **76.9** |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | **52.0** | 80.1 | **87.1** | 93.8 | **85.8** | 63.1 | 77.4 |
| [Rho-Math-1B-Interpreter-v0.1](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1) | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| [Rho-Math-7B-Interpreter-v0.1](https://huggingface.co/microsoft/rho-math-7b-interpreter-v0.1) | 7B | ToRA-69k | 81.3 | **51.8** | 80.8 | 85.5 | **94.5** | 70.1 | 63.1 | 75.3 |
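
For context, code-interpreter (ToRA-style) evaluation lets the model interleave natural-language reasoning with executable Python: fenced code blocks in the model's output are run and their stdout is fed back before the final answer. The snippet below is a simplified illustration of that extract-and-execute step, not the ToRA harness; `sample` is a hypothetical model output.

```python
# Simplified illustration of tool-integrated (code-interpreter) evaluation:
# extract fenced python blocks from a model's output and execute them,
# capturing stdout. This is NOT the ToRA evaluation harness, just the idea.
import io
import re
import contextlib

FENCE = "`" * 3  # a literal triple backtick, built so this block stays valid markdown

def run_code_blocks(model_output: str) -> list:
    """Execute each fenced python block in model_output; return captured stdout."""
    pattern = FENCE + r"python\n(.*?)" + FENCE
    blocks = re.findall(pattern, model_output, re.DOTALL)
    outputs = []
    for code in blocks:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # untrusted code: sandbox this in real use
        outputs.append(buf.getvalue())
    return outputs

# Toy example with a hypothetical model output:
sample = "Compute it.\n" + FENCE + "python\nprint(15 * 240 / 100)\n" + FENCE
print(run_code_blocks(sample))  # -> ['36.0\n']
```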

## 🚀 Quick Start

### Evaluation

```sh
git clone git@github.com:microsoft/rho.git
cd rho/rho-1/math-evaluation-harness
```

Base model few-shot evaluation:

```sh
bash scripts/run_eval.sh cot microsoft/rho-math-7b-v0.1
```

SFT model (code-interpreter) evaluation:

```sh
bash scripts/run_eval.sh tora microsoft/rho-math-7b-interpreter-v0.1
```

Our reproduced outputs are provided in `rho-1/outputs.zip`.