![SmolTulu Banner](smoltulubanner.png)

SmolTulu-1.7b-Instruct is the first in a series of models meant to leverage [AllenAI's Tulu 3 post-training pipeline](https://arxiv.org/abs/2411.15124) to tune the [base version of Hugging Face's SmolLM2-1.7b](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)! AllenAI's post-training pipeline seemed like a perfect fit to apply here.

This model achieves the current highest scores on both IFEval and GSM8k for its size while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the datasets used for both the SFT (supervised fine-tuning) and DPO (direct preference optimization) stages.

Something important to note: this model has only undergone SFT and DPO! You can find the RLVR (reinforcement learning with verifiable rewards) version here: [SmolTulu-1.7b-Reinforced](https://huggingface.co/SultanR/SmolTulu-1.7b-Reinforced).
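
If you're unfamiliar with the DPO stage, here's a minimal sketch of what it roughly looks like using [TRL](https://github.com/huggingface/trl). This is an illustration, not the actual training script: the preference dataset and hyperparameters below are placeholders, while the real data mixtures follow the Tulu 3 recipe.

```python
# Minimal DPO sketch with TRL -- illustrative only, not the exact recipe used for this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "HuggingFaceTB/SmolLM2-1.7B"  # the base model this card starts from
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Placeholder preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="smoltulu-dpo", beta=0.1)  # beta is a placeholder value
trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```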

## Evaluation

I ran these evaluations using [SmolLM2's evaluation code](https://github.com/huggingface/smollm/tree/main/evaluation) for a fairer comparison.

| Metric | SmolTulu-1.7b-Instruct | SmolTulu-1.7b-Reinforced | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|:-----------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| ARC (Average)                | 51.5     | 51.1     | **51.7** | 41.6     | 46.2     | 43.7     |
| BBH (3-shot)                 | 33.8     | 33.4     | 32.2     | 27.6     | **35.3** | 25.7     |
| GSM8K (5-shot)               | 51.6     | **61.0** | 48.2     | 26.8     | 42.8     | 4.6      |
| HellaSwag                    | 61.1     | 60.4     | **66.1** | 56.1     | 60.9     | 55.5     |
| IFEval (Average prompt/inst) | 67.7     | **69.3** | 56.7     | 53.5     | 47.4     | 23.1     |
| MMLU-Pro (MCF)               | 17.4     | 17.3     | 19.3     | 12.7     | **24.2** | 11.7     |
| PIQA                         | 72.2     | 72.1     | **74.4** | 72.3     | 73.2     | 71.6     |

## Usage
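
Here's a minimal inference sketch with Transformers. The repo id below is my assumption based on the naming of the Reinforced variant linked above, and the sampling settings are illustrative defaults rather than tuned recommendations.

```python
# Basic chat-style inference -- repo id and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "SultanR/SmolTulu-1.7b-Instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```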