Update README.md
# Evaluation
## Belebele Benchmark: Reading Comprehension

**What is the Belebele Benchmark?** [Belebele](https://aclanthology.org/anthology-files/pdf/acl/2024.acl-long.44.pdf) is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. The dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks.

Results are reported in the table below.

**Why does this matter?** Belebele tests an LLM's ability to answer questions based on a given text -- a standard use case in retrieval-augmented generation workflows.

**What did we do?** We used the standard implementation of the [Belebele](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`.
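
A hedged sketch of such a run through the harness's Python API is shown below; the checkpoint id is a placeholder, the Belebele task names should be verified with `lm_eval --tasks list` for your harness version, and the zero-shot setting is illustrative rather than taken from the results above.

```python
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoTokenizer

# Placeholder checkpoint id -- substitute the model you want to evaluate.
MODEL_ID = "your-org/your-model"

# Slow (non-fast) tokenizer, mirroring the use_fast=False setting described above.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)

lm = HFLM(pretrained=MODEL_ID, tokenizer=tokenizer, batch_size=8)

results = lm_eval.simple_evaluate(
    model=lm,
    # One Belebele task per language variant, named with FLORES-200 codes;
    # verify the exact names with `lm_eval --tasks list`.
    tasks=["belebele_eng_Latn", "belebele_lvs_Latn"],
    num_fewshot=0,  # illustrative; not necessarily the setting behind the table below
)

for task, metrics in results["results"].items():
    print(task, metrics)
```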

| Language | Gemma 2 27b | ALIA 40b | EuroLLM 9b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
|----------|-------------|----------|------------|-------------------|-------------------|
| Bulgarian | 79.8% | 78.8% | 74.2% | **85.3%** | 84.7% |
| Czech | 81.4% | 78.3% | 74.9% | 85.3% | **85.8%** |
| German | 81.2% | 80.6% | 75.1% | **85.0%** | 84.3% |
| English | 88.9% | 83.0% | 77.3% | 87.6% | **88.3%** |
| Estonian | 72.1% | 73.7% | 70.8% | 82.0% | **82.6%** |
| Finnish | 79.0% | 78.1% | 73.3% | 84.3% | **85.0%** |
| French | 82.6% | 80.1% | 77.7% | **85.7%** | 85.0% |
| Hungarian | 77.9% | 76.2% | 72.9% | 83.3% | **86.2%** |
| Icelandic | 70.8% | 58.2% | 44.6% | 54.3% | **85.7%** |
| Italian | 82.1% | 77.8% | 74.7% | 81.0% | **82.4%** |
| Lithuanian | 76.1% | 76.1% | 72.8% | **85.2%** | 83.3% |
| Latvian | 78.4% | 77.7% | 73.6% | **84.6%** | **84.6%** |
| Dutch | 80.2% | 78.9% | 73.0% | 83.2% | **85.0%** |
| Polish | 78.3% | 77.9% | 73.2% | 82.2% | **83.0%** |
| Portuguese | 83.8% | 80.1% | 73.9% | 86.1% | **87.1%** |
| Romanian | 80.3% | 78.8% | 75.1% | 85.3% | **85.9%** |
| Russian | 79.4% | 79.4% | 73.1% | 84.2% | **84.6%** |
| Slovak | 78.9% | 78.0% | 74.0% | 84.1% | **85.0%** |
| Slovenian | 78.0% | 80.0% | 72.6% | 83.7% | **85.1%** |
| Spanish | 82.1% | 78.4% | 73.6% | **84.1%** | 83.8% |
| Serbian | 79.8% | 78.4% | 66.3% | 74.1% | **84.2%** |
| Swedish | 80.6% | 76.3% | 73.4% | **85.3%** | 84.4% |
| Turkish | 77.4% | 62.3% | 70.0% | 79.9% | **82.7%** |
| Ukrainian | 78.0% | 77.0% | 71.9% | 83.9% | **85.1%** |
| **Average** | 79.5% | 76.8% | 72.2% | 82.5% | **84.7%** |
## Per-Character Perplexity
**What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while a model with high perplexity is frequently "surprised" by unexpected words or patterns. Lower perplexity therefore indicates that the model has learned the language's patterns more effectively and is less often caught off guard by what it encounters.
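
As a rough, hedged illustration (not the exact evaluation script), per-character perplexity can be computed along the following lines, assuming the usual character-normalised definition exp(total negative log-likelihood / number of characters); the checkpoint id is again a placeholder.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id -- substitute the model you want to evaluate.
MODEL_ID = "your-org/your-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def per_char_perplexity(text: str) -> float:
    """exp(total NLL in nats / number of characters) for a single passage."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean NLL per predicted token; rescale it to a total
    # before normalising by character count instead of token count.
    num_predicted_tokens = enc["input_ids"].shape[1] - 1
    total_nll = out.loss.item() * num_predicted_tokens
    return math.exp(total_nll / len(text))

print(per_char_perplexity("Rīga ir Latvijas galvaspilsēta."))
```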
Perplexity fairly evaluates how well each model handles: