Update README.md
README.md
CHANGED
@@ -112,7 +112,7 @@ Results
 
 **What did we do?** We used the standard implementation of the [belebele](https://github.com/eleutherai/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```. We report **5-shot** accuracy.
 
-| 5-shot | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
+| 5-shot | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
 |----------|:-------------:|:----------:|:------------:|:-------------------:|
 | Bulgarian | 79.8% | 78.8% | **85.3%** | 84.7% |
 | Czech | 81.4% | 78.3% | 85.3% | **85.8%** |
@@ -148,7 +148,7 @@ Results
 **What did we do?**
 We used the standard implementation of the [MultiBLiMP](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/multiblimp) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```. We report **0-shot** accuracy.
 
-| Language | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b
+| Language | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b**
 |----------|-------------|----------|---------------------|-------------|
 | Bulgarian | 95.4% | 98.8% | 97.7% | **99.6%** |
 | Czech | 98.6% | **98.9%** | 98.5% | 98.5% |
@@ -175,3 +175,73 @@ We used the standard implementation of the [MultiBLiMP](https://github.com/Eleut
 | Turkish | 97.6% | **98.7%** | 97.9% | 96.4% |
 | Ukrainian | 95.6% | 98.0% | 97.3% | **99.2%** |
 | **Average** | 95.7% | 96.7% | 96.4% | **99.0%** |
+
+## Knowledge tests
+
+### ARC Benchmark Results
+
+| 5-shot | | **ARC Easy**| | | **ARC Hard**| |
+|----------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+| **Language** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+| Danish | 79.9% | **80.1%** | 79.6% | 53.4% | 52.6% | **53.7%** |
+| German | 79.6% | **79.9%** | 78.0% | 53.4% | **53.6%** | 51.7% |
+| Spanish | **82.9%** | 81.7% | 79.4% | **57.3%** | 56.1% | 52.4% |
+| French | **81.7%** | 81.1% | 78.6% | **56.0%** | 54.5% | 52.8% |
+| Italian | 80.5% | **81.6%** | 78.5% | **56.4%** | 54.8% | 54.1% |
+| Dutch | **80.1%** | 80.0% | 78.8% | **54.0%** | 53.8% | 52.2% |
+| Portuguese | **81.7%** | 81.1% | 79.0% | **56.9%** | 55.5% | 54.1% |
+| Swedish | 80.3% | **80.5%** | 78.7% | 53.8% | 53.1% | **54.1%** |
+| **AVG WEST** | **80.8%** | **80.8%** | 78.8% | **55.2%** | 54.2% | 53.1% |
+| | | | | | | |
+| Bulgarian | **79.8%** | 79.2% | 79.5% | **53.8%** | 51.8% | 52.8% |
+| Czech | **79.5%** | **79.5%** | 78.8% | 51.5% | 52.3% | **53.9%** |
+| Estonian | 72.4% | 73.0% | **73.1%** | 49.6% | 49.8% | **52.0%** |
+| Finnish | 73.8% | **74.2%** | 73.3% | 48.7% | 51.1% | **52.1%** |
+| Hungarian | 74.0% | 73.9% | **74.9%** | 49.3% | 49.0% | **49.6%** |
+| Lithuanian | 76.4% | 76.1% | **77.9%** | 50.3% | 51.6% | **53.0%** |
+| Latvian | 76.2% | **76.4%** | 75.9% | 50.7% | 49.8% | **50.9%** |
+| Polish | **79.2%** | 78.2% | 78.0% | **54.5%** | 53.3% | 52.7% |
+| Romanian | **79.6%** | 78.8% | 78.8% | **55.5%** | 53.7% | 54.5% |
+| Slovak | 78.8% | 79.2% | **79.6%** | 52.5% | 53.0% | **54.7%** |
+| Slovenian | **78.3%** | 78.5% | **78.3%** | **53.4%** | 52.2% | 52.7% |
+| **AVG EAST** | **77.1%** | 77.0% | **77.1%** | 51.8% | 51.6% | **52.6%** |
+
+### MMLU Benchmark Results
+
+| 0-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+|----------|:-----------------:|:---------------------:|:-------------------:|
+| Bulgarian | 48.3% | 52.0% | **56.3%** |
+| Czech | 49.1% | 51.7% | **56.4%** |
+| Danish | 50.2% | 51.1% | **56.6%** |
+| German | 51.0% | 51.8% | **56.2%** |
+| Greek | 50.7% | 50.6% | **50.9%** |
+| Spanish | 53.3% | 53.4% | **56.3%** |
+| Estonian | 48.7% | 49.2% | **55.3%** |
+| Finnish | 47.4% | 48.9% | **55.4%** |
+| French | 53.1% | 53.8% | **56.4%** |
+| Hungarian | 49.9% | 44.4% | **55.2%** |
+| Italian | 52.3% | 53.7% | **57.2%** |
+| Lithuanian | 47.3% | 49.4% | **54.7%** |
+| Latvian | 46.9% | 48.0% | **54.0%** |
+| Dutch | 50.8% | 53.0% | **56.5%** |
+| Polish | 50.6% | 49.6% | **55.6%** |
+| Portuguese | 52.4% | 53.7% | **56.4%** |
+| Romanian | 51.0% | 52.1% | **56.2%** |
+| Slovak | 49.0% | 52.2% | **56.3%** |
+| Slovenian | 48.2% | 50.7% | **55.3%** |
+| Swedish | 49.6% | 51.2% | **56.1%** |
+| **Average** | 50.0% | 51.0% | **55.7%** |
+
+### National Exams Results
+| 5-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+|----------|----------|-------------------|-------------------|
+| Bulgarian | 62.4% | 66.8% | **67.8%** |
+| Croatian | 70.8% | **72.5%** | 71.9% |
+| Hungarian | 48.9% | **51.9%** | 48.9% |
+| Italian | **65.5%** | 64.6% | 65.0% |
+| Macedonian | 74.2% | 72.0% | **80.2%** |
+| Polish | 61.2% | 61.4% | **63.5%** |
+| Portuguese | **61.4%** | 60.9% | 59.2% |
+| Albanian | 55.6% | 55.0% | **75.6%** |
+| Serbian | 64.7% | 57.3% | **66.9%** |
+| **Average** | 62.7% | 62.5% | **66.6%** |
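
The **Average** row added for the National Exams table can be sanity-checked against the per-language rows. The sketch below assumes the average is an unweighted mean over the nine listed languages, rounded to one decimal place; the README does not state how it was computed. The names `national_exams`, `models`, and `macro_average` are hypothetical and used only for illustration.

```python
from statistics import mean

# Per-language 5-shot accuracies (%) copied from the National Exams table.
# Column order: ALIA 40b, EuroLLM Prev. 22b, TildeOpen 1.1 30b.
national_exams = {
    "Bulgarian": (62.4, 66.8, 67.8),
    "Croatian": (70.8, 72.5, 71.9),
    "Hungarian": (48.9, 51.9, 48.9),
    "Italian": (65.5, 64.6, 65.0),
    "Macedonian": (74.2, 72.0, 80.2),
    "Polish": (61.2, 61.4, 63.5),
    "Portuguese": (61.4, 60.9, 59.2),
    "Albanian": (55.6, 55.0, 75.6),
    "Serbian": (64.7, 57.3, 66.9),
}
models = ("ALIA 40b", "EuroLLM Prev. 22b", "TildeOpen 1.1 30b")


def macro_average(rows):
    """Unweighted mean per model column, rounded to one decimal (assumed method)."""
    columns = zip(*rows.values())
    return tuple(round(mean(column), 1) for column in columns)


averages = macro_average(national_exams)
for model, value in zip(models, averages):
    print(f"{model}: {value:.1f}%")
```

Under that assumption the recomputed values agree with the Average row in the table (62.7%, 62.5%, 66.6%).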