update mmlu and NatExam description in README
README.md
CHANGED

@@ -213,6 +213,11 @@ We use multilingual translations of ARC provided by [Eurolingua](https://hugging
| **AVG EAST** | **77.1%** | 77.0% | **77.1%** | 51.8% | 51.6% | **52.6%** |

### MMLU Benchmark Results
+**What is MMLU?** [MMLU](https://arxiv.org/pdf/2009.03300) is a massive multitask test consisting of multiple-choice questions from various branches of knowledge, **in English**. The test spans 57 tasks across the humanities, social sciences, hard sciences, and other areas, including elementary mathematics, US history, computer science, and law. Attaining high accuracy requires both extensive world knowledge and problem-solving ability. Questions are four-option multiple choice and assess factual knowledge, reading comprehension, and reasoning across disciplines. The tasks can be grouped under four topics - stem, humanities, social_sciences and other - allowing each group to be evaluated individually.
+
+**Why does this matter?** Similarly to ARC, MMLU measures broad, general-purpose factual knowledge and some reasoning capabilities. The possible answer choices are included in the prompt, which allows the model to reason its way past false answers rather than relying solely on knowing the correct one. It should be noted that some question groups are specific to the anglocentric world, e.g. US history or law.
+
+**What did we do?** We use multilingual translations of MMLU provided by [Eurolingua](https://huggingface.co/datasets/Eurolingua/mmlux); please refer to the [publication](https://arxiv.org/pdf/2410.08928). Other than the data source, we replicate the standard [LM Evaluation Harness configuration for MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu/default). Our configuration is available at [TODO]. We set tokenisers to ```use_fast=False```. We report **0-shot** accuracy.

| 0-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
|----------|:-----------------:|:---------------------:|:-------------------:|
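
For orientation, the setup described in "What did we do?" above can be approximated with the lm-evaluation-harness Python API. This is a minimal sketch under stated assumptions: the model id is a placeholder, and the stock English `mmlu` task stands in for the Eurolingua-based task configuration, which the diff still marks as [TODO].

```python
# Sketch of a 0-shot MMLU run with lm-evaluation-harness (not the authors' exact config).
import lm_eval

MODEL_ID = "TildeAI/TildeOpen-30b"  # placeholder: substitute the checkpoint under evaluation

results = lm_eval.simple_evaluate(
    model="hf",
    # use_fast_tokenizer=False mirrors the `use_fast=False` tokeniser setting above
    model_args=f"pretrained={MODEL_ID},use_fast_tokenizer=False",
    tasks=["mmlu"],  # swap in the Eurolingua/mmlux-based tasks once the config is published
    num_fewshot=0,   # 0-shot, matching the reported numbers
)

# Aggregate and per-group (stem, humanities, ...) accuracies are reported under "results".
print(results["results"]["mmlu"])
```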

@@ -239,6 +244,12 @@ We use multilingual translations of ARC provided by [Eurolingua](https://hugging
| **Average** | 50.0% | 51.0% | **55.7%** |

### National Exams Results
+**What are National Exams?** A curated suite of **multilingual**, publicly available past questions from national-level standardized exams across multiple countries (e.g., high-school exit and university-entrance exams); please refer to the [publication](https://aclanthology.org/2020.emnlp-main.438.pdf). The dataset is available on HuggingFace [here](https://huggingface.co/datasets/mhardalov/exams). Items are presented in multiple-choice format.
+
+**Why does this matter?** Similarly to MMLU, the model is tested on factual knowledge and reasoning capabilities. However, it should be stressed that the benchmark is **unique** for each language (the exams are different) and is available in the **source language** (i.e. not translated). This places emphasis on the model's regional knowledge and eliminates the translation noise present in many other multilingual benchmarks. Possible answer choices are once again included during inference, allowing the model to employ reasoning where factual knowledge is lacking.
+
+**What did we do?** [TODO]
+
| 5-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
|----------|----------|-------------------|-------------------|
| Bulgarian | 62.4% | 66.8% | **67.8%** |
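
Since the "What did we do?" entry above is still [TODO], here is only a minimal sketch of loading and inspecting the EXAMS data with the Hugging Face `datasets` library; the config name and field layout follow the dataset card and should be treated as assumptions.

```python
# Sketch of inspecting the EXAMS dataset (mhardalov/exams); config and field
# names are assumptions taken from the dataset card.
from datasets import load_dataset

# "multilingual" pools the per-language exams; trust_remote_code may be required
# on newer `datasets` versions because the dataset ships a loading script.
exams = load_dataset("mhardalov/exams", "multilingual", trust_remote_code=True)

sample = exams["train"][0]
print(sample["info"]["language"])    # items stay in their source language
print(sample["question"]["stem"])    # question text
choices = sample["question"]["choices"]
for label, text in zip(choices["label"], choices["text"]):
    print(label, text)               # multiple-choice options
print("gold:", sample["answerKey"])  # correct option label
```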