OSainz committed on
Commit
3ee0836
1 Parent(s): 5b44362

Update README.md

Files changed (1)
  1. README.md +342 -42
README.md CHANGED
@@ -12,9 +12,9 @@ metrics:
12
  pipeline_tag: text-generation
13
  ---
14
 
15
- # **Model Card for Basque Llama 7B**
16
 
17
- Basque LLaMA is a collection of foundation models specifically tuned for Basque. Based on Meta’s LLaMA 2 model family, these models were further trained with highly curated Basque corpora, Euscrawl ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)). Ranging from 7 billion to 70 billion parameters, these models are currently the biggest and best-performing LLMs built for Basque. This is the 7B repository, links to other models can be found in the index at the bottom.
18
 
19
 
20
  # **Model Details**
@@ -22,7 +22,7 @@ Basque LLaMA is a collection of foundation models specifically tuned for Basque.
22
 
23
  ## **Model Description**
24
 
25
- Basque LLaMA is a family of Large Language Models (LLM) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs exhibit incredible performance for high-resource languages such as English, but, in the case of Basque and other low-resource languages, their performance is close to a random guesser. These limitations push the gap between high- and low-resource languages when it comes to digital development. We present Basque LLaMA to overcome these limitations and promote the development of LLM-based technology and research for the Basque language. Basque LLaMA models follow the same architecture as their original counterparts and were further trained in Euscrawl v1 ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)), a high-quality Basque corpora.
26
 
27
  The models are released in three sizes: 7B, 13B and 70B.
28
 
@@ -32,8 +32,7 @@ The models are released in three sizes: 7B, 13B and 70B.
32
  * **Model type:** Language model
33
  * **Language(s) (NLP):** en, eu
34
  * **License:** llama2
35
- * **Parent Model:** meta-llama/Llama-2-7B
36
- * **Resources for more information:** [PAPER/BLOG/POST link]
37
  * **Contact:** hitz@ehu.eus
38
 
39
 
@@ -42,18 +41,22 @@ The models are released in three sizes: 7B, 13B and 70B.
42
  Use the code below to get started with the model.
43
 
44
  ```python
 
45
  from transformers import pipeline
46
 
47
- pipe = pipeline("text-generation", model="HiTZ/basque-llama-2-7b-v1")
48
- text = "Donosti da Euskal Herriko lekurik"
 
 
 
49
 
50
- pipe(text, max_new_tokens=40)
51
  >> [
52
- {
53
- 'generated_text': 'Donosti da Euskal Herriko lekurik garestiena alokairuan bizitzeko,'
54
- ' eta Donostiako alokairuaren prezioa %11,3 igo da azken urtean'
55
- }
56
  ]
 
57
  ```
58
 
59
 
@@ -96,14 +99,97 @@ Additionally, 100K documents of English data randomly selected from the [Pile](h
96
  The models were trained using the GPT-NeoX library on the HPC CINECA computing cluster. All models were trained with an effective batch size of approximately 2M tokens for 1000 to 2000 steps.
97
 
98
 
99
- | Model | Steps | Sequence length | Effective Batch size | Total tokens | GPU hours |
100
- | ---------------- | ----- | --------------- | -------------------- | ------------ | ---------- |
101
- | Basque LLaMA 7B | 2000 | 4096 | 2M tokens/step | 4B | 359.2h |
102
- | Basque LLaMA 13B | 1000 | 4096 | 2M tokens/step | 2B | 468.8h |
103
- | Basque LLaMA 70B | 1680 | 4096 | 2M tokens/step | 3.4B | \*6475.52h |
104
-
105
-
106
- "*" indicates the time for the entire training process (2000 steps), however the weights of the step 1680 are shared as it is the best checkpoint according to validation loss.
107
 
108
 
109
  # **Evaluation**
@@ -120,23 +206,26 @@ We evaluated the models on zero-shot and few-shot settings on generative, multip
120
 
121
  * **Belebele** ([Bandarkar et al.](https://arxiv.org/abs/2308.16884)): Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. We evaluated the model in a 5-shot fashion.
122
  * Data card: [https://huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele)
123
- * **X-StoryCloze** ([Lin et al.](https://aclanthology.org/2022.emnlp-main.616.pdf)): XStoryCloze consists of the professionally translated version of the English Story Cloze dataset to 10 non-English languages. Story Cloze is a new commonsense reasoning dataset which consists in choosing the correct ending to a four-sentence story. We evaluated the model in a 0-shot fashion.
124
  * Data card: [https://huggingface.co/datasets/juletxara/xstory_cloze](https://huggingface.co/datasets/juletxara/xstory_cloze)
125
- * **BasqueGLUE** ([Urbizu et al.](https://aclanthology.org/2022.lrec-1.172.pdf)): BasqueGLUE is a NLU benchmark for Basque. Data card: [https://huggingface.co/datasets/orai-nlp/basqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE). We evaluated the model in a 5-shot fashion on the following tasks:
126
- * **BEC2016eu**: Sentiment analysis on tweets about the 2016 Basque elections campaign.
127
- * **VaxxStance**: Stance detection on tweets around the anti-vaccine movement.
128
- * **BTHCv2**: Topic classification of news extracts with 12 categories.
129
- * **EpecKorrefBin**: Correference detection task similar to WSC.
130
- * **QNLIeu**: Q&A NLI built from the Basque Wikipedia.
131
- * **WiCeu**: Basque Word-in-Context task.
 
 
 
132
 
133
  ### **Metrics**
134
 
135
 
136
 
137
- * Accuracy: Belebele, X-StoryCloze, EpecKorrefBin, QNLI-eu, and, WiC-eu
138
- * Micro F1: BEC2016-eu and BHTCv2
139
- * Macro F1: VaxxStance (favor & against)
140
 
141
 
142
  ## **Results**
@@ -144,17 +233,228 @@ We evaluated the models on zero-shot and few-shot settings on generative, multip
144
  The model was evaluated using the LM Evaluation Harness library from EleutherAI. To reproduce our results, please refer to our [fork](https://github.com/naiarapm/lm-evaluation-harness/tree/basqueglue), which includes the implementation for the datasets mentioned above.
145
 
146
 
147
- | Model | Belebele | X-StoryCloze | BEC | Vaxx | BHTC | coref | QNLI | WiC | Average |
148
- | ---------------- | -------- | ------------ | ----- | ----- | ----- | ----- | ----- | ----- | ------- |
149
- | Random | 25.00 | 50.00 | 33.33 | 33.33 | 8.33 | 50.00 | 50.00 | 50.00 | 37.50 |
150
- | LLaMA 2 7B | 26.22 | 50.43 | 41.63 | 18.60 | 20.06 | 50.94 | 48.32 | 49.64 | 38.23 |
151
- | LLaMA 2 13B | 32.00 | 50.63 | 41.09 | 18.25 | 27.35 | 49.23 | 48.74 | 49.21 | 39.56 |
152
- | LLaMA 2 70B | 33.56 | 51.62 | 47.47 | 21.01 | 31.01 | 52.98 | 51.26 | 51.57 | 42.56 |
153
- | BLOOM 7B | 27.00 | 57.18 | 37.94 | 20.72 | 39.10 | 48.21 | 47.48 | 47.57 | 40.65 |
154
- | XGLM 7B | 23.88 | 57.71 | 39.94 | 21.58 | 36.73 | 50.94 | 50.42 | 49.21 | 41.30 |
155
- | Basque LLaMA 7B | 35.67 | 63.13 | 55.61 | 45.93 | 44.44 | 50.43 | 55.04 | 50.14 | 50.05 |
156
- | Basque LLaMA 13B | 53.56 | 65.85 | 53.23 | 48.66 | 53.61 | 62.52 | 57.14 | 54.21 | 56.10 |
157
- | Basque LLaMA 70B | 71.78 | 67.57 | 63.52 | 48.95 | 49.51 | 79.90 | 58.82 | 55.50 | 61.94 |
158
 
159
 
160
 
 
12
  pipeline_tag: text-generation
13
  ---
14
 
15
+ # **Model Card for Basque Llama 7b**
16
 
17
+ Basque LLaMA is a collection of foundation models specifically tuned for Basque. Based on Meta’s LLaMA 2 model family, these models were further trained with Euscrawl, a highly curated Basque corpus ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)). Ranging from 7 billion to 70 billion parameters, these models are currently the biggest and best-performing LLMs built for Basque. This is the 7b repository; links to other models can be found in the index at the bottom.
18
 
19
 
20
  # **Model Details**
 
22
 
23
  ## **Model Description**
24
 
25
+ Basque LLaMA is a family of Large Language Models (LLMs) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs exhibit strong performance for high-resource languages such as English, but in the case of Basque and other low-resource languages, their performance is close to that of a random guesser. These limitations widen the gap between high- and low-resource languages when it comes to digital development. We present Basque LLaMA to overcome these limitations and promote the development of LLM-based technology and research for the Basque language. Basque LLaMA models follow the same architecture as their original counterparts and were further trained on Euscrawl v1 ([Artetxe et al., 2022](https://aclanthology.org/2022.emnlp-main.499/)), a high-quality Basque corpus.
26
 
27
  The models are released in three sizes: 7B, 13B and 70B.
28
 
 
32
  * **Model type:** Language model
33
  * **Language(s) (NLP):** en, eu
34
  * **License:** llama2
35
+ * **Parent Model:** meta-llama/Llama-2-7b
 
36
  * **Contact:** hitz@ehu.eus
37
 
38
 
 
41
  Use the code below to get started with the model.
42
 
43
  ```python
44
+
45
  from transformers import pipeline
46
 
47
+ pipe = pipeline("text-generation", model="HiTZ/basque-llama-2-7b-v1")
48
+
49
+ text = "Euskara adimen artifizialera iritsi da!"
50
+
51
+ pipe(text, max_new_tokens=50, num_beams=5)
52
 
 
53
  >> [
54
+ {
55
+ 'generated_text': 'Euskara adimen artifizialera iritsi da!\nEuskararen eta adimen artifizialaren arteko harremana aspaldikoa da,'
56
+ ' baina azken urteotan aurrerapauso handiak eman dira arlo horretan'
57
+ }
58
  ]
59
+
60
  ```
61
 
62
 
 
99
  The models were trained using the GPT-NeoX library on the HPC CINECA computing cluster. All models were trained with an effective batch size of approximately 2M tokens for 1000 to 2000 steps.
100
 
101
 
102
+ <table>
103
+ <tr>
104
+ <td>Model
105
+ </td>
106
+ <td>Steps
107
+ </td>
108
+ <td>Sequence length
109
+ </td>
110
+ <td>Effective Batch size
111
+ </td>
112
+ <td>Total tokens
113
+ </td>
114
+ <td>GPU hours
115
+ </td>
116
+ </tr>
117
+ <tr>
118
+ <td>Basque LLaMA 7B
119
+ </td>
120
+ <td><p style="text-align: right">
121
+ 2000</p>
122
+
123
+ </td>
124
+ <td><p style="text-align: right">
125
+ 4096</p>
126
+
127
+ </td>
128
+ <td><p style="text-align: right">
129
+ 2M tokens/step</p>
130
+
131
+ </td>
132
+ <td><p style="text-align: right">
133
+ 4B</p>
134
+
135
+ </td>
136
+ <td><p style="text-align: right">
137
+ 359.2h</p>
138
+
139
+ </td>
140
+ </tr>
141
+ <tr>
142
+ <td>Basque LLaMA 13B
143
+ </td>
144
+ <td><p style="text-align: right">
145
+ 1000</p>
146
+
147
+ </td>
148
+ <td><p style="text-align: right">
149
+ 4096</p>
150
+
151
+ </td>
152
+ <td><p style="text-align: right">
153
+ 2M tokens/step</p>
154
+
155
+ </td>
156
+ <td><p style="text-align: right">
157
+ 2B</p>
158
+
159
+ </td>
160
+ <td><p style="text-align: right">
161
+ 468.8h</p>
162
+
163
+ </td>
164
+ </tr>
165
+ <tr>
166
+ <td>Basque LLaMA 70B
167
+ </td>
168
+ <td><p style="text-align: right">
169
+ 1680</p>
170
+
171
+ </td>
172
+ <td><p style="text-align: right">
173
+ 4096</p>
174
+
175
+ </td>
176
+ <td><p style="text-align: right">
177
+ 2M tokens/step</p>
178
+
179
+ </td>
180
+ <td><p style="text-align: right">
181
+ 3.4B</p>
182
+
183
+ </td>
184
+ <td><p style="text-align: right">
185
+ *6475.52h</p>
186
+
187
+ </td>
188
+ </tr>
189
+ </table>
190
+
191
+
192
+ \* indicates the time for the entire training process (2000 steps); however, the weights from step 1680 are shared, as it is the best checkpoint according to validation loss.
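The "Total tokens" column in the table above follows from multiplying the number of steps by the effective batch size. The following is a minimal sketch of that arithmetic, assuming exactly 2 million tokens per step (the card gives this figure as approximate). The names `STEPS`, `TOKENS_PER_STEP` and `total_tokens_in_billions` are illustrative and not part of the training code.

```python
# Steps per model, copied from the training table above.
STEPS = {
    "Basque LLaMA 7B": 2000,
    "Basque LLaMA 13B": 1000,
    "Basque LLaMA 70B": 1680,
}

# Assumption: exactly 2M tokens per step (the card states this approximately).
TOKENS_PER_STEP = 2_000_000


def total_tokens_in_billions(steps: int, tokens_per_step: int = TOKENS_PER_STEP) -> float:
    """Return the number of training tokens, expressed in billions."""
    return steps * tokens_per_step / 1e9


for name, steps in STEPS.items():
    print(f"{name}: {total_tokens_in_billions(steps):.2f}B tokens")
```

Under this assumption the 70B checkpoint at step 1680 has seen 3.36B tokens, which the table reports rounded as 3.4B.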
193
 
194
 
195
  # **Evaluation**
 
206
 
207
  * **Belebele** ([Bandarkar et al.](https://arxiv.org/abs/2308.16884)): Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. We evaluated the model in a 5-shot fashion.
208
  * Data card: [https://huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele)
209
+ * **X-StoryCloze**: XStoryCloze consists of professional translations of the English StoryCloze dataset into 10 non-English languages. StoryCloze is a commonsense reasoning dataset in which the task is to choose the correct ending to a four-sentence story. We evaluated the model in a 0-shot fashion.
210
  * Data card: [https://huggingface.co/datasets/juletxara/xstory_cloze](https://huggingface.co/datasets/juletxara/xstory_cloze)
211
+ * **BasqueGLUE** ([Urbizu et al.](https://aclanthology.org/2022.lrec-1.172.pdf)): BasqueGLUE is an NLU benchmark for Basque. We evaluated the model in a 5-shot fashion on the following tasks:
212
+ * Data card: [https://huggingface.co/datasets/orai-nlp/basqueGLUE](https://huggingface.co/datasets/orai-nlp/basqueGLUE)
213
+ * Tasks:
214
+ * **BEC2016eu**: Sentiment analysis on tweets about the 2016 Basque elections campaign.
215
+ * **VaxxStance**: Stance detection on tweets around the anti-vaccine movement.
216
+ * **BHTCv2**: Topic classification of news extracts with 12 categories.
217
+ * **EpecKorrefBin**: Coreference detection task similar to WSC.
218
+ * **QNLIeu**: Q&A NLI built from the Basque Wikipedia.
219
+ * **WiCeu**: Basque Word-in-Context task.
220
+
221
 
222
  ### **Metrics**
223
 
224
 
225
 
226
+ * **Accuracy**: Belebele, X-StoryCloze, EpecKorrefBin, QNLI-eu, and WiC-eu
227
+ * **Micro F1**: BEC2016-eu and BHTCv2
228
+ * **Macro F1**: VaxxStance (favor & against)
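To make the difference between the metrics above concrete, here is a small pure-Python sketch of micro F1 and of macro F1 restricted to a subset of classes. The toy labels and predictions are invented for illustration, and the `none` label is an assumption about the stance label set; this is not the evaluation code used for the reported results.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold, pred):
    """Pool TP/FP/FN over all classes; equals accuracy for single-label tasks."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            tp += int(g == label and p == label)
            fp += int(g != label and p == label)
            fn += int(g == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the chosen `labels` only."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Toy stance data (hypothetical; "none" is an assumed third label).
gold = ["favor", "against", "none", "favor", "against", "none"]
pred = ["favor", "against", "none", "against", "against", "favor"]

print(f"micro F1: {micro_f1(gold, pred):.4f}")
print(f"macro F1 (favor & against): {macro_f1(gold, pred, ['favor', 'against']):.4f}")
```

On this toy data, micro F1 coincides with accuracy (4 of 6 correct), while macro F1 averages the per-class F1 of `favor` and `against` only and ignores the third label.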
229
 
230
 
231
  ## **Results**
 
233
  The model was evaluated using the LM Evaluation Harness library from EleutherAI. To reproduce our results, please refer to our [fork](https://github.com/naiarapm/lm-evaluation-harness/tree/basqueglue), which includes the implementation for the datasets mentioned above.
234
 
235
 
236
+ <table>
237
+ <tr>
238
+ <td><strong>Model</strong>
239
+ </td>
240
+ <td><strong>Belebele</strong>
241
+ </td>
242
+ <td><strong>X-StoryCloze</strong>
243
+ </td>
244
+ <td><strong>BEC</strong>
245
+ </td>
246
+ <td><strong>Vaxx</strong>
247
+ </td>
248
+ <td><strong>BHTC</strong>
249
+ </td>
250
+ <td><strong>coref</strong>
251
+ </td>
252
+ <td><strong>QNLI</strong>
253
+ </td>
254
+ <td><strong>WiC</strong>
255
+ </td>
256
+ <td><strong>Average</strong>
257
+ </td>
258
+ </tr>
259
+ <tr>
260
+ <td>Random
261
+ </td>
262
+ <td>25.00
263
+ </td>
264
+ <td>50.00
265
+ </td>
266
+ <td>33.33
267
+ </td>
268
+ <td>33.33
269
+ </td>
270
+ <td>8.33
271
+ </td>
272
+ <td>50.00
273
+ </td>
274
+ <td>50.00
275
+ </td>
276
+ <td>50.00
277
+ </td>
278
+ <td>37.50
279
+ </td>
280
+ </tr>
281
+ <tr>
282
+ <td>LLaMA 2 7B
283
+ </td>
284
+ <td>26.22
285
+ </td>
286
+ <td>50.43
287
+ </td>
288
+ <td>41.63
289
+ </td>
290
+ <td>18.60
291
+ </td>
292
+ <td>20.06
293
+ </td>
294
+ <td>50.94
295
+ </td>
296
+ <td>48.32
297
+ </td>
298
+ <td>49.64
299
+ </td>
300
+ <td>38.23
301
+ </td>
302
+ </tr>
303
+ <tr>
304
+ <td>LLaMA 2 13B
305
+ </td>
306
+ <td>32.00
307
+ </td>
308
+ <td>50.63
309
+ </td>
310
+ <td>41.09
311
+ </td>
312
+ <td>18.25
313
+ </td>
314
+ <td>27.35
315
+ </td>
316
+ <td>49.23
317
+ </td>
318
+ <td>48.74
319
+ </td>
320
+ <td>49.21
321
+ </td>
322
+ <td>39.56
323
+ </td>
324
+ </tr>
325
+ <tr>
326
+ <td>LLaMA 2 70B
327
+ </td>
328
+ <td>33.56
329
+ </td>
330
+ <td>51.62
331
+ </td>
332
+ <td>47.47
333
+ </td>
334
+ <td>21.01
335
+ </td>
336
+ <td>31.01
337
+ </td>
338
+ <td>52.98
339
+ </td>
340
+ <td>51.26
341
+ </td>
342
+ <td>51.57
343
+ </td>
344
+ <td>42.56
345
+ </td>
346
+ </tr>
347
+ <tr>
348
+ <td>BLOOM 7B
349
+ </td>
350
+ <td>27.00
351
+ </td>
352
+ <td>57.18
353
+ </td>
354
+ <td>37.94
355
+ </td>
356
+ <td>20.72
357
+ </td>
358
+ <td>39.10
359
+ </td>
360
+ <td>48.21
361
+ </td>
362
+ <td>47.48
363
+ </td>
364
+ <td>47.57
365
+ </td>
366
+ <td>40.65
367
+ </td>
368
+ </tr>
369
+ <tr>
370
+ <td>XGLM 7B
371
+ </td>
372
+ <td>23.88
373
+ </td>
374
+ <td>57.71
375
+ </td>
376
+ <td>39.94
377
+ </td>
378
+ <td>21.58
379
+ </td>
380
+ <td>36.73
381
+ </td>
382
+ <td>50.94
383
+ </td>
384
+ <td>50.42
385
+ </td>
386
+ <td>49.21
387
+ </td>
388
+ <td>41.30
389
+ </td>
390
+ </tr>
391
+ <tr>
392
+ <td><strong>Basque LLaMA 7B</strong>
393
+ </td>
394
+ <td>35.67
395
+ </td>
396
+ <td>63.13
397
+ </td>
398
+ <td>55.61
399
+ </td>
400
+ <td>45.93
401
+ </td>
402
+ <td>44.44
403
+ </td>
404
+ <td>50.43
405
+ </td>
406
+ <td>55.04
407
+ </td>
408
+ <td>50.14
409
+ </td>
410
+ <td>50.05
411
+ </td>
412
+ </tr>
413
+ <tr>
414
+ <td><strong>Basque LLaMA 13B</strong>
415
+ </td>
416
+ <td>53.56
417
+ </td>
418
+ <td>65.85
419
+ </td>
420
+ <td>53.23
421
+ </td>
422
+ <td>48.66
423
+ </td>
424
+ <td><strong>53.61</strong>
425
+ </td>
426
+ <td>62.52
427
+ </td>
428
+ <td>57.14
429
+ </td>
430
+ <td>54.21
431
+ </td>
432
+ <td>56.10
433
+ </td>
434
+ </tr>
435
+ <tr>
436
+ <td><strong>Basque LLaMA 70B</strong>
437
+ </td>
438
+ <td><strong>71.78</strong>
439
+ </td>
440
+ <td><strong>67.57</strong>
441
+ </td>
442
+ <td><strong>63.52</strong>
443
+ </td>
444
+ <td><strong>48.95</strong>
445
+ </td>
446
+ <td>49.51
447
+ </td>
448
+ <td><strong>79.90</strong>
449
+ </td>
450
+ <td><strong>58.82</strong>
451
+ </td>
452
+ <td><strong>55.50</strong>
453
+ </td>
454
+ <td><strong>61.94</strong>
455
+ </td>
456
+ </tr>
457
+ </table>
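The Average column appears to be the unweighted mean of the eight task scores. That is an inference from the numbers rather than something the card states, and the sketch below simply checks it for a few rows. The `SCORES` dictionary and the `average` helper are illustrative names; the values are copied from the table above.

```python
# Per-task scores copied from the results table, in column order:
# Belebele, X-StoryCloze, BEC, Vaxx, BHTC, coref, QNLI, WiC.
SCORES = {
    "Random": [25.00, 50.00, 33.33, 33.33, 8.33, 50.00, 50.00, 50.00],
    "LLaMA 2 7B": [26.22, 50.43, 41.63, 18.60, 20.06, 50.94, 48.32, 49.64],
    "Basque LLaMA 7B": [35.67, 63.13, 55.61, 45.93, 44.44, 50.43, 55.04, 50.14],
    "Basque LLaMA 13B": [53.56, 65.85, 53.23, 48.66, 53.61, 62.52, 57.14, 54.21],
    "Basque LLaMA 70B": [71.78, 67.57, 63.52, 48.95, 49.51, 79.90, 58.82, 55.50],
}


def average(scores):
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, scores in SCORES.items():
    print(f"{model}: {average(scores):.2f}")
```

The computed values match the reported averages for these rows (for example 50.05 for Basque LLaMA 7B and 61.94 for Basque LLaMA 70B).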
458
 
459
 
460