TildeSIA committed 8d485db (verified) · Parent(s): 5ba1beb

Update README.md

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -104,6 +104,42 @@ outputs = model.generate(
  )
  ```
  # Evaluation
+ ## Belebele Benchmark: Reading Comprehension
+ **What is the Belebele Benchmark?** [Belebele](https://aclanthology.org/anthology-files/anthology-files/pdf/acl/2024.acl-long.44.pdf) is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multilingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks.
+ Results are shown in the table below.
+
+ **Why does this matter?** Belebele tests an LLM's ability to provide answers based on a given text -- a standard use case in retrieval-augmented generation workflows.
+
+ **What did we do?** We used the standard implementation of the [Belebele](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`.
+
+ | Language | Gemma 2 27b | ALIA 40b | EuroLLM 9b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
+ |----------|-------------|----------|------------|-------------------|-------------------|
+ | Bulgarian | 79.8% | 78.8% | 74.2% | **85.3%** | 84.7% |
+ | Czech | 81.4% | 78.3% | 74.9% | 85.3% | **85.8%** |
+ | German | 81.2% | 80.6% | 75.1% | **85.0%** | 84.3% |
+ | English | 88.9% | 83.0% | 77.3% | 87.6% | **88.3%** |
+ | Estonian | 72.1% | 73.7% | 70.8% | 82.0% | **82.6%** |
+ | Finnish | 79.0% | 78.1% | 73.3% | 84.3% | **85.0%** |
+ | French | 82.6% | 80.1% | 77.7% | **85.7%** | 85.0% |
+ | Hungarian | 77.9% | 76.2% | 72.9% | 83.3% | **86.2%** |
+ | Icelandic | 70.8% | 58.2% | 44.6% | 54.3% | **85.7%** |
+ | Italian | 82.1% | 77.8% | 74.7% | 81.0% | **82.4%** |
+ | Lithuanian | 76.1% | 76.1% | 72.8% | **85.2%** | 83.3% |
+ | Latvian | 78.4% | 77.7% | 73.6% | **84.6%** | **84.6%** |
+ | Dutch | 80.2% | 78.9% | 73.0% | 83.2% | **85.0%** |
+ | Polish | 78.3% | 77.9% | 73.2% | 82.2% | **83.0%** |
+ | Portuguese | 83.8% | 80.1% | 73.9% | 86.1% | **87.1%** |
+ | Romanian | 80.3% | 78.8% | 75.1% | 85.3% | **85.9%** |
+ | Russian | 79.4% | 79.4% | 73.1% | 84.2% | **84.6%** |
+ | Slovak | 78.9% | 78.0% | 74.0% | 84.1% | **85.0%** |
+ | Slovenian | 78.0% | 80.0% | 72.6% | 83.7% | **85.1%** |
+ | Spanish | 82.1% | 78.4% | 73.6% | **84.1%** | 83.8% |
+ | Serbian | 79.8% | 78.4% | 66.3% | 74.1% | **84.2%** |
+ | Swedish | 80.6% | 76.3% | 73.4% | **85.3%** | 84.4% |
+ | Turkish | 77.4% | 62.3% | 70.0% | 79.9% | **82.7%** |
+ | Ukrainian | 78.0% | 77.0% | 71.9% | 83.9% | **85.1%** |
+ | **Average** | 79.5% | 76.8% | 72.2% | 82.5% | **84.7%** |
+
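The "What did we do?" note above can be reproduced, roughly, through the harness's Python entry point. The sketch below is illustrative rather than the exact command used: the repo id and task names are placeholders, and `use_fast_tokenizer=False` is assumed to be how the harness exposes the `use_fast=False` tokeniser setting.

```python
# Illustrative sketch of a zero-shot Belebele run with the LM Evaluation Harness.
# Assumptions: the model repo id and task names are placeholders, and
# use_fast_tokenizer=False is taken to correspond to use_fast=False above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TildeAI/TildeOpen-1.1-30b,dtype=bfloat16,use_fast_tokenizer=False",
    tasks=["belebele_lvs_Latn", "belebele_est_Latn"],  # e.g. Latvian and Estonian subsets
    num_fewshot=0,
)

# simple_evaluate returns a dict with per-task metrics under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```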
  ## Per-Character Perplexity
  **What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while a high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned language patterns more effectively. It's less "surprised" by what it encounters because it better understands how the language works.
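Per-character (rather than per-token) normalisation is what makes these perplexity numbers comparable across models with different tokenisers. One common formulation, given here only as a sketch of the idea rather than the exact definition used for the results, is

$$
\mathrm{PPL}_{\text{char}} = \exp\left(-\frac{1}{N_{\text{chars}}}\sum_{i=1}^{N_{\text{tokens}}}\log p\left(t_i \mid t_{<i}\right)\right),
$$

i.e. the total negative log-likelihood of the text divided by its character count, then exponentiated.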
  Perplexity fairly evaluates how well each model handles: