viraat committed on
Commit
b159569
1 Parent(s): 3abb7c6

Add Model Card details

Files changed (1)
  1. README.md +340 -0
README.md CHANGED
@@ -1,3 +1,343 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ datasets:
4
+ - CohereForAI/xP3x
5
+ - CohereForAI/aya_dataset
6
+ - CohereForAI/aya_collection
7
+ - DataProvenanceInitiative/Commercially-Verified-Licenses
8
+ - CohereForAI/aya_evaluation_suite
9
+ language:
10
+ - afr
11
+ - amh
12
+ - ara
13
+ - aze
14
+ - bel
15
+ - ben
16
+ - bul
17
+ - cat
18
+ - ceb
19
+ - ces
20
+ - cym
21
+ - dan
22
+ - deu
23
+ - ell
24
+ - eng
25
+ - epo
26
+ - est
27
+ - eus
28
+ - fin
29
+ - fil
30
+ - fra
31
+ - fry
32
+ - gla
33
+ - gle
34
+ - glg
35
+ - guj
36
+ - hat
37
+ - hau
38
+ - heb
39
+ - hin
40
+ - hun
41
+ - hye
42
+ - ibo
43
+ - ind
44
+ - isl
45
+ - ita
46
+ - jav
47
+ - jpn
48
+ - kan
49
+ - kat
50
+ - kaz
51
+ - khm
52
+ - kir
53
+ - kor
54
+ - kur
55
+ - lao
56
+ - lav
57
+ - lat
58
+ - lit
59
+ - ltz
60
+ - mal
61
+ - mar
62
+ - mkd
63
+ - mlg
64
+ - mlt
65
+ - mon
66
+ - mri
67
+ - msa
68
+ - mya
69
+ - nep
70
+ - nld
71
+ - nor
72
+ - nso
73
+ - nya
74
+ - ory
75
+ - pan
76
+ - pes
77
+ - pol
78
+ - por
79
+ - pus
80
+ - ron
81
+ - rus
82
+ - sin
83
+ - slk
84
+ - slv
85
+ - smo
86
+ - sna
87
+ - snd
88
+ - som
89
+ - sot
90
+ - spa
91
+ - sqi
92
+ - srp
93
+ - sun
94
+ - swa
95
+ - swe
96
+ - tam
97
+ - tel
98
+ - tgk
99
+ - tha
100
+ - tur
101
+ - twi
102
+ - ukr
103
+ - urd
104
+ - uzb
105
+ - vie
106
+ - xho
107
+ - yid
108
+ - yor
109
+ - zho
110
+ - zul
111
+ metrics:
112
+ - accuracy
113
+ - bleu
114
  ---
115
+
116
+ <img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
117
+
118
+ # Model Card for Aya Model
119
+
120
+ ## Model Summary
121
+
122
+ > The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
123
+ > Aya outperforms [mT0](https://huggingface.co/bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
124
+ > The Aya model is trained using [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
125
+ > We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
126
+ > multilingual world.
127
+
128
+ - **Developed by:** Cohere For AI
129
+ - **Model type:** Transformer-style, autoregressive, massively multilingual language model
130
+ - **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com)
131
+ - **Point of Contact**: [Ahmet Ustun](mailto:ahmet@cohere.com)
132
+ - **Languages**: Refer to the list of languages in the `language` section of this model card.
133
+ - **License**: Apache-2.0
134
+ - **Model**: [Aya](https://huggingface.co/CohereForAI/aya)
135
+ - **Model Size**: 13 billion parameters
136
+ - **Datasets**: [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
137
+
138
+ ## Use
139
+
140
+ ```python
141
+ # pip install -q transformers
142
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
143
+
144
+ checkpoint = "CohereForAI/aya_model"
145
+
146
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
147
+ aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
148
+
149
+ inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt")
150
+ outputs = aya_model.generate(inputs, max_new_tokens=128)
151
+ print(tokenizer.decode(outputs[0]))
152
+ ```
153
+
154
+ ## Model Details
155
+
156
+ ### Training
157
+
158
+ - Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
159
+ - Finetuning Steps: 25000
160
+ - Hardware: TPUv4-128
161
+ - Software: T5X, Jax
162
+
163
+ ### Data Sources
164
+
165
+ The Aya model is trained on the following datasets:
166
+
167
+ - [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x)
168
+ - [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)
169
+ - [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection)
170
+ - [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
171
+ - ShareGPT-Command
172
+
173
+ All datasets are subset to the 101 languages supported by [mT5](https://huggingface.co/google/mt5-xxl). See the [paper](arxiv.com) for details about filtering and pruning.
174
+
175
+ ## Evaluation
176
+
177
+ <!-- This section describes the evaluation protocols and provides the results. -->
178
+
179
+ > We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages, including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and
180
+ > in-distribution performance.
181
+
182
+ Below, we provide evaluation results for the Aya model on unseen discriminative tasks and in-distribution generative tasks, compared to mT0, BLOOMZ, Bactrian-X 13B, and mT0x. To ensure a fair comparison with our Aya model in terms of language coverage, we finetune a new variant of mT5, which we call mT0x. It is trained using the original datasets that are part of the xP3 collection but extended to 101 languages (xP3x).
183
+
184
+ For Multilingual MMLU, Simulated and Human Win-rates, please refer to the [paper](arxiv.com).
185
+
186
+ ### Discriminative Tasks
187
+
188
+ | Model | Base Model | IFT Mixture | XCOPA (Acc %) | XNLI (Acc %) | XSC (Acc %) | XWG (Acc %) | **<u>Avg</u>** |
189
+ | :---------------- | :--------- | :---------: | :-----------: | :----------: | :---------: | :---------: | :------------: |
190
+ | **46 Languages** | | | | | | | |
191
+ | mT0 | mT5 13B | xP3 | 75.6 | 55.3 | 87.2 | 73.6 | 72.9 |
192
+ | BLOOMZ | BLOOM 176B | xP3 | 64.3 | 52.0 | 82.6 | 63.3 | 65.5 |
193
+ | **52 Languages** | | | | | | | |
194
+ | Bactrian-X 13B | Llama 13B | Bactrian-X | 52.4 | 34.5 | 51.8 | 50.5 | 47.3 |
195
+ | **101 Languages** | | | | | | | |
196
+ | mT0x | mT5 13B | xP3x | 71.7 | 45.9 | 85.1 | 60.6 | 65.8 |
197
+ | Aya model | mT5 13B | All Mixture | 76.7 | 58.3 | 90.0 | 70.7 | 73.9 |
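A quick way to sanity-check the **Avg** column is to recompute it from the four task accuracies. The sketch below assumes that Avg is the unweighted mean of XCOPA, XNLI, XSC, and XWG, which the card does not state explicitly; the numbers are copied from the table above, and the helper names `mean` and `max_avg_gap` are illustrative only.

```python
# Per-task accuracies (%) copied from the Discriminative Tasks table.
# Order: XCOPA, XNLI, XSC, XWG; the second tuple element is the reported Avg.
rows = {
    "mT0": ([75.6, 55.3, 87.2, 73.6], 72.9),
    "BLOOMZ": ([64.3, 52.0, 82.6, 63.3], 65.5),
    "Bactrian-X 13B": ([52.4, 34.5, 51.8, 50.5], 47.3),
    "mT0x": ([71.7, 45.9, 85.1, 60.6], 65.8),
    "Aya model": ([76.7, 58.3, 90.0, 70.7], 73.9),
}


def mean(values):
    """Unweighted arithmetic mean (assumed definition of the Avg column)."""
    return sum(values) / len(values)


def max_avg_gap(table):
    """Largest absolute gap between the recomputed mean and the reported Avg."""
    return max(abs(mean(scores) - reported) for scores, reported in table.values())


for name, (scores, reported) in rows.items():
    print(f"{name:15s} mean={mean(scores):.3f} reported={reported}")
```

Under that assumption, every recomputed mean lands within rounding distance of the reported value, so the column appears consistent with a simple average.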
198
+
199
+ ### Generative Tasks
200
+
201
+ | Model | Base Model | IFT Mixture | FLORES-200 (spBleu) | FLORES-200 (spBleu) | XLSum (RougeLsum) | Tydi-QA (F1) |
202
+ | :---------------- | :--------: | :---------- | :-----------------: | :-----------------: | :---------------: | :----------: |
203
+ | | | | X → En | En → X | | |
204
+ | **101 Languages** | | | | | | |
205
+ | mT0x | mT5 13B | xP3x | 20.2 | 14.5 | 21.4 | 76.1 |
206
+ | Aya Model | mT5 13B | All Mixture | 29.1 | 19.0 | 22.0 | 77.8 |
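To read the generative results as differences rather than raw scores, the following sketch computes the absolute gain of the Aya model over mT0x for each metric. The values are copied from the table above; the dictionary keys and the `gains` helper are hypothetical names chosen for illustration.

```python
# (mT0x score, Aya score) pairs copied from the Generative Tasks table.
generative = {
    "FLORES-200 X->En (spBleu)": (20.2, 29.1),
    "FLORES-200 En->X (spBleu)": (14.5, 19.0),
    "XLSum (RougeLsum)": (21.4, 22.0),
    "Tydi-QA (F1)": (76.1, 77.8),
}


def gains(table):
    """Absolute improvement of Aya over mT0x, rounded to one decimal place."""
    return {metric: round(aya - mt0x, 1) for metric, (mt0x, aya) in table.items()}


for metric, delta in gains(generative).items():
    print(f"{metric:28s} +{delta}")
```

The gains are in each metric's own units (spBleu, RougeLsum, F1), so they should not be averaged across columns.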
207
+
208
+ Note: We cannot compare against mT0 and BLOOMZ on the above generative tasks, as the validation splits are part of mT0 and BLOOMZ's training data.
209
+
210
+ ## Bias, Risks, and Limitations
211
+
212
+ As with any base or fine-tuned language model without safety filtering, it is relatively easy for a user to prompt the model to generate harmful or otherwise sensitive content.
213
+ The Aya model, as released, does not include any safety filtering.
214
+ We hope that the release of the Aya model will make community-based red-teaming efforts possible by exposing an open-source, massively multilingual model for community research.
215
+
216
+ For a detailed overview of our efforts at safety mitigation and at benchmarking toxicity and bias across multiple languages, we refer readers to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
217
+
218
+ ## Citation
219
+
220
+ **BibTeX:**
221
+
222
+ ```
223
+ @article{,
224
+ title={},
225
+ author={},
226
+ journal={Preprint},
227
+ year={2024}
228
+ }
229
+ ```
230
+
231
+ **APA:**
232
+
233
+ ## Languages Covered
234
+
235
+ Below is the list of languages used in finetuning the Aya Model. We group languages into higher-, mid-, and lower-resourced categories based on the language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, refer to our [paper]().
236
+
237
+ | ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
238
+ | :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
239
+ | afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
240
+ | amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
241
+ | ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
242
+ | aze | Azerbaijani | Arabic/Latin | Turkic | Common Turkic | Low |
243
+ | bel | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Mid |
244
+ | ben | Bengali | Bengali | Indo-European | Indo-Aryan | Mid |
245
+ | bul | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | Mid |
246
+ | cat | Catalan | Latin | Indo-European | Italic | High |
247
+ | ceb | Cebuano | Latin | Austronesian | Malayo-Polynesian | Mid |
248
+ | ces | Czech | Latin | Indo-European | Balto-Slavic | High |
249
+ | cym | Welsh | Latin | Indo-European | Celtic | Low |
250
+ | dan | Danish | Latin | Indo-European | Germanic | Mid |
251
+ | deu | German | Latin | Indo-European | Germanic | High |
252
+ | ell | Greek | Greek | Indo-European | Graeco-Phrygian | Mid |
253
+ | eng | English | Latin | Indo-European | Germanic | High |
254
+ | epo | Esperanto | Latin | Constructed | Esperantic | Low |
255
+ | est | Estonian | Latin | Uralic | Finnic | Mid |
256
+ | eus | Basque | Latin | Basque | - | High |
257
+ | fin | Finnish | Latin | Uralic | Finnic | High |
258
+ | fil | Tagalog | Latin | Austronesian | Malayo-Polynesian | Mid |
259
+ | fra | French | Latin | Indo-European | Italic | High |
260
+ | fry | Western Frisian | Latin | Indo-European | Germanic | Low |
261
+ | gla | Scottish Gaelic | Latin | Indo-European | Celtic | Low |
262
+ | gle | Irish | Latin | Indo-European | Celtic | Low |
263
+ | glg | Galician | Latin | Indo-European | Italic | Mid |
264
+ | guj | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low |
265
+ | hat | Haitian Creole | Latin | Indo-European | Italic | Low |
266
+ | hau | Hausa | Latin | Afro-Asiatic | Chadic | Low |
267
+ | heb | Hebrew | Hebrew | Afro-Asiatic | Semitic | Mid |
268
+ | hin | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
269
+ | hun | Hungarian | Latin | Uralic | - | High |
270
+ | hye | Armenian | Armenian | Indo-European | Armenic | Low |
271
+ | ibo | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low |
272
+ | ind | Indonesian | Latin | Austronesian | Malayo-Polynesian | Mid |
273
+ | isl | Icelandic | Latin | Indo-European | Germanic | Low |
274
+ | ita | Italian | Latin | Indo-European | Italic | High |
275
+ | jav | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
276
+ | jpn | Japanese | Japanese | Japonic | Japanesic | High |
277
+ | kan | Kannada | Kannada | Dravidian | South Dravidian | Low |
278
+ | kat | Georgian | Georgian | Kartvelian | Georgian-Zan | Mid |
279
+ | kaz | Kazakh | Cyrillic | Turkic | Common Turkic | Mid |
280
+ | khm | Khmer | Khmer | Austroasiatic | Khmeric | Low |
281
+ | kir | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low |
282
+ | kor | Korean | Hangul | Koreanic | Korean | High |
283
+ | kur | Kurdish | Latin | Indo-European | Iranian | Low |
284
+ | lao | Lao | Lao | Tai-Kadai | Kam-Tai | Low |
285
+ | lav | Latvian | Latin | Indo-European | Balto-Slavic | Mid |
286
+ | lat | Latin | Latin | Indo-European | Italic | Mid |
287
+ | lit | Lithuanian | Latin | Indo-European | Balto-Slavic | Mid |
288
+ | ltz | Luxembourgish | Latin | Indo-European | Germanic | Low |
289
+ | mal | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
290
+ | mar | Marathi | Devanagari | Indo-European | Indo-Aryan | Low |
291
+ | mkd | Macedonian | Cyrillic | Indo-European | Balto-Slavic | Low |
292
+ | mlg | Malagasy | Latin | Austronesian | Malayo-Polynesian | Low |
293
+ | mlt | Maltese | Latin | Afro-Asiatic | Semitic | Low |
294
+ | mon | Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
295
+ | mri | Maori | Latin | Austronesian | Malayo-Polynesian | Low |
296
+ | msa | Malay | Latin | Austronesian | Malayo-Polynesian | Mid |
297
+ | mya | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low |
298
+ | nep | Nepali | Devanagari | Indo-European | Indo-Aryan | Low |
299
+ | nld | Dutch | Latin | Indo-European | Germanic | High |
300
+ | nor | Norwegian | Latin | Indo-European | Germanic | Low |
301
+ | nso | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
302
+ | nya | Chichewa | Latin | Atlantic-Congo | Benue-Congo | Low |
303
+ | ory | Oriya | Oriya | Indo-European | Indo-Aryan | Low |
304
+ | pan | Punjabi | Gurmukhi | Indo-European | Indo-Aryan | Low |
305
+ | pes | Persian | Arabic | Indo-European | Iranian | High |
306
+ | pol | Polish | Latin | Indo-European | Balto-Slavic | High |
307
+ | por | Portuguese | Latin | Indo-European | Italic | High |
308
+ | pus | Pashto | Arabic | Indo-European | Iranian | Low |
309
+ | ron | Romanian | Latin | Indo-European | Italic | Mid |
310
+ | rus | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
311
+ | sin | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
312
+ | slk | Slovak | Latin | Indo-European | Balto-Slavic | Mid |
313
+ | slv | Slovenian | Latin | Indo-European | Balto-Slavic | Mid |
314
+ | smo | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
315
+ | sna | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
316
+ | snd | Sindhi | Arabic | Indo-European | Indo-Aryan | Low |
317
+ | som | Somali | Latin | Afro-Asiatic | Cushitic | Low |
318
+ | sot | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
319
+ | spa | Spanish | Latin | Indo-European | Italic | High |
320
+ | sqi | Albanian | Latin | Indo-European | Albanian | Low |
321
+ | srp | Serbian | Cyrillic | Indo-European | Balto-Slavic | High |
322
+ | sun | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
323
+ | swa | Swahili | Latin | Atlantic-Congo | Benue-Congo | Low |
324
+ | swe | Swedish | Latin | Indo-European | Germanic | High |
325
+ | tam | Tamil | Tamil | Dravidian | South Dravidian | Mid |
326
+ | tel | Telugu | Telugu | Dravidian | South Dravidian | Low |
327
+ | tgk | Tajik | Cyrillic | Indo-European | Iranian | Low |
328
+ | tha | Thai | Thai | Tai-Kadai | Kam-Tai | Mid |
329
+ | tur | Turkish | Latin | Turkic | Common Turkic | High |
330
+ | twi | Twi | Latin | Atlantic-Congo | Niger-Congo | Low |
331
+ | ukr | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | Mid |
332
+ | urd | Urdu | Arabic | Indo-European | Indo-Aryan | Mid |
333
+ | uzb | Uzbek | Latin | Turkic | Common Turkic | Mid |
334
+ | vie | Vietnamese | Latin | Austroasiatic | Vietic | High |
335
+ | xho | Xhosa | Latin | Atlantic-Congo | Benue-Congo | Low |
336
+ | yid | Yiddish | Hebrew | Indo-European | Germanic | Low |
337
+ | yor | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low |
338
+ | zho | Chinese | Han | Sino-Tibetan | Sinitic | High |
339
+ | zul | Zulu | Latin | Atlantic-Congo | Benue-Congo | Low |
340
+
341
+ ## Model Card Contact
342
+
343
+ For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.