TildeAI/TildeOpen-30b

@@ -112,7 +112,7 @@ Results
 **What did we do?** We used the standard implementation of the [belebele](https://github.com/eleutherai/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```. We report **5-shot** accuracy.
-| 5-shot | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
+| 5-shot | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
 |----------|:-------------:|:----------:|:------------:|:-------------------:|
 | Bulgarian | 79.8% | 78.8% | **85.3%** | 84.7% |
 | Czech | 81.4% | 78.3% |  85.3% | **85.8%** |
@@ -148,7 +148,7 @@ Results
 **What did we do?**
 We used the standard implementation of the [MultiBLiMP](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/multiblimp) task from the LLM Evaluation Harness. We set tokenisers to ```use_fast=False```. We report **0-shot** accuracy.
-| Language | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b
+| Language | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
 |----------|-------------|----------|---------------------|-------------|
 | Bulgarian | 95.4% | 98.8% | 97.7% | **99.6%** |
 | Czech | 98.6% | **98.9%** | 98.5% | 98.5% |
@@ -175,3 +175,73 @@ We used the standard implementation of the [MultiBLiMP](https://github.com/Eleut
 | Turkish | 97.6% | **98.7%** | 97.9% | 96.4% |
 | Ukrainian | 95.6% | 98.0% | 97.3% | **99.2%** |
 | **Average** | 95.7% | 96.7% | 96.4% | **99.0%** |
+## Knowledge tests
+### ARC Benchmark Results
+| 5-shot |  | **ARC Easy**| |  | **ARC Hard**| |
+|----------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+| **Language** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+| Danish | 79.9% | **80.1%** | 79.6% | 53.4% | 52.6% | **53.7%** |
+| German | 79.6% | **79.9%** | 78.0% | 53.4% | **53.6%** | 51.7% |
+| Spanish | **82.9%** | 81.7% | 79.4% | **57.3%** | 56.1% | 52.4% |
+| French | **81.7%** | 81.1% | 78.6% | **56.0%** | 54.5% | 52.8% |
+| Italian | 80.5% | **81.6%** | 78.5% | **56.4%** | 54.8% | 54.1% |
+| Dutch | **80.1%** | 80.0% | 78.8% | **54.0%** | 53.8% | 52.2% |
+| Portuguese | **81.7%** | 81.1% | 79.0% | **56.9%** | 55.5% | 54.1% |
+| Swedish | 80.3% | **80.5%** | 78.7% | 53.8% | 53.1% | **54.1%** |
+| **AVG WEST** | **80.8%** | **80.8%** | 78.8% | **55.2%** | 54.2% | 53.1% |
+| | | | | | | |
+| Bulgarian | **79.8%** | 79.2% | 79.5% | **53.8%** | 51.8% | 52.8% |
+| Czech | **79.5%** | **79.5%** | 78.8% | 51.5% | 52.3% | **53.9%** |
+| Estonian | 72.4% | 73.0% | **73.1%** | 49.6% | 49.8% | **52.0%** |
+| Finnish | 73.8% | **74.2%** | 73.3% | 48.7% | 51.1% | **52.1%** |
+| Hungarian | 74.0% | 73.9% | **74.9%** | 49.3% | 49.0% | **49.6%** |
+| Lithuanian | 76.4% | 76.1% | **77.9%** | 50.3% | 51.6% | **53.0%** |
+| Latvian | 76.2% | **76.4%** | 75.9% | 50.7% | 49.8% | **50.9%** |
+| Polish | **79.2%** | 78.2% | 78.0% | **54.5%** | 53.3% | 52.7% |
+| Romanian | **79.6%** | 78.8% | 78.8% | **55.5%** | 53.7% | 54.5% |
+| Slovak | 78.8% | 79.2% | **79.6%** | 52.5% | 53.0% | **54.7%** |
+| Slovenian | 78.3% | **78.5%** | 78.3% | **53.4%** | 52.2% | 52.7% |
+| **AVG EAST** | **77.1%** | 77.0% | **77.1%** | 51.8% | 51.6% | **52.6%** |
+### MMLU Benchmark Results
+| 0-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+|----------|:-----------------:|:---------------------:|:-------------------:|
+| Bulgarian | 48.3% | 52.0% | **56.3%** |
+| Czech | 49.1% | 51.7% | **56.4%** |
+| Danish | 50.2% | 51.1% | **56.6%** |
+| German | 51.0% | 51.8% | **56.2%** |
+| Greek | 50.7% | 50.6% | **50.9%** |
+| Spanish | 53.3% | 53.4% | **56.3%** |
+| Estonian | 48.7% | 49.2% | **55.3%** |
+| Finnish | 47.4% | 48.9% | **55.4%** |
+| French | 53.1% | 53.8% | **56.4%** |
+| Hungarian | 49.9% | 44.4% | **55.2%** |
+| Italian | 52.3% | 53.7% | **57.2%** |
+| Lithuanian | 47.3% | 49.4% | **54.7%** |
+| Latvian | 46.9% | 48.0% | **54.0%** |
+| Dutch | 50.8% | 53.0% | **56.5%** |
+| Polish | 50.6% | 49.6% | **55.6%** |
+| Portuguese | 52.4% | 53.7% | **56.4%** |
+| Romanian | 51.0% | 52.1% | **56.2%** |
+| Slovak | 49.0% | 52.2% | **56.3%** |
+| Slovenian | 48.2% | 50.7% | **55.3%** |
+| Swedish | 49.6% | 51.2% | **56.1%** |
+| **Average** | 50.0% | 51.0% | **55.7%** |
+### National Exams Results
+| 5-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
+|----------|----------|-------------------|-------------------|
+| Bulgarian | 62.4% | 66.8% | **67.8%** |
+| Croatian | 70.8% | **72.5%** | 71.9% |
+| Hungarian | 48.9% | **51.9%** | 48.9% |
+| Italian | **65.5%** | 64.6% | 65.0% |
+| Macedonian | 74.2% | 72.0% | **80.2%** |
+| Polish | 61.2% | 61.4% | **63.5%** |
+| Portuguese | **61.4%** | 60.9% | 59.2% |
+| Albanian | 55.6% | 55.0% | **75.6%** |
+| Serbian | 64.7% | 57.3% | **66.9%** |
+| **Average** | 62.7% | 62.5% | **66.6%** |
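
The **Average** row of the National Exams table can be cross-checked from the per-language rows. The sketch below assumes the average is an unweighted mean over the nine listed languages, rounded to one decimal; the diff does not state how it was computed. The names `SCORES`, `MODELS`, and `macro_average` are hypothetical and used only for illustration.

```python
# Per-language National Exams scores (percent), copied from the table above.
# Tuple order follows the table columns: ALIA 40b, EuroLLM Prev. 22b, TildeOpen 1.1 30b.
MODELS = ("ALIA 40b", "EuroLLM Prev. 22b", "TildeOpen 1.1 30b")

SCORES = {
    "Bulgarian": (62.4, 66.8, 67.8),
    "Croatian": (70.8, 72.5, 71.9),
    "Hungarian": (48.9, 51.9, 48.9),
    "Italian": (65.5, 64.6, 65.0),
    "Macedonian": (74.2, 72.0, 80.2),
    "Polish": (61.2, 61.4, 63.5),
    "Portuguese": (61.4, 60.9, 59.2),
    "Albanian": (55.6, 55.0, 75.6),
    "Serbian": (64.7, 57.3, 66.9),
}


def macro_average(scores):
    """Return the unweighted mean per model column, rounded to one decimal.

    Assumption: every language counts equally, which is not confirmed
    by the source.
    """
    columns = zip(*scores.values())  # regroup the rows into one tuple per model
    return tuple(round(sum(column) / len(scores), 1) for column in columns)


averages = macro_average(SCORES)
for model, average in zip(MODELS, averages):
    print(f"{model}: {average}%")
```

Under that assumption the recomputed values match the reported row (62.7%, 62.5%, 66.6%), and the same check can be applied to the other tables by swapping in their rows.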