---
language:
- ace
- acm
- acq
- aeb
- afr
- ajp
- aka
- als
- amh
- apc
- arb
- ars
- ary
- arz
- asm
- ast
- awa
- ayr
- azb
- azj
- bak
- bam
- ban
- bel
- bem
- ben
- bho
- bjn
- bod
- bos
- bug
- bul
- cat
- ceb
- ces
- cjk
- ckb
- crh
- cym
- dan
- deu
- dik
- dyu
- dzo
- ell
- eng
- epo
- est
- eus
- ewe
- fao
- fij
- fin
- fon
- fra
- fur
- fuv
- gaz
- gla
- gle
- glg
- grn
- guj
- hat
- hau
- heb
- hin
- hne
- hrv
- hun
- hye
- ibo
- ilo
- ind
- isl
- ita
- jav
- jpn
- kab
- kac
- kam
- kan
- kas
- kat
- kaz
- kbp
- kea
- khk
- khm
- kik
- kin
- kir
- kmb
- kmr
- knc
- kon
- kor
- lao
- lij
- lim
- lin
- lit
- lmo
- ltg
- ltz
- lua
- lug
- luo
- lus
- lvs
- mag
- mai
- mal
- mar
- min
- mkd
- mlt
- mni
- mos
- mri
- mya
- nld
- nno
- nob
- npi
- nso
- nus
- nya
- oci
- ory
- pag
- pan
- pap
- pbt
- pes
- plt
- pol
- por
- prs
- quy
- ron
- run
- rus
- sag
- san
- sat
- scn
- shn
- sin
- slk
- slv
- smo
- sna
- snd
- som
- sot
- spa
- srd
- srp
- ssw
- sun
- swe
- swh
- szl
- tam
- taq
- tat
- tel
- tgk
- tgl
- tha
- tir
- tpi
- tsn
- tso
- tuk
- tum
- tur
- twi
- tzm
- uig
- ukr
- umb
- urd
- uzn
- vec
- vie
- war
- wol
- xho
- ydd
- yor
- yue
- zho
- zsm
- zul

language_details: "ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbp_Latn, kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn, kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl, plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn"

tags:
- nllb
license: "cc-by-nc-4.0"
datasets:
- flores-200
metrics:
- bleu
- spbleu
- chrf++
---

# NLLB-200

This is the model card of NLLB-200's 1.3B variant.

Here are the [metrics](https://tinyurl.com/nllb200dense1bmetrics) for that particular checkpoint.

- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: the exact training algorithm, the data, and the strategies used to handle data imbalances for high- and low-resource languages when training NLLB-200 are described in the paper.
- Paper or other resource for more information: NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022
- License: CC-BY-NC-4.0
- Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues

## Intended Use
- Primary intended uses: NLLB-200 is a machine translation model primarily intended for research in machine translation, especially for low-resource languages. It allows for single sentence translation among 200 languages. Information on how to use the model can be found in the Fairseq code repository along with the training code and references to evaluation and training data. A minimal usage sketch follows this list.
- Primary intended users: Primary users are researchers and the machine translation research community.
- Out-of-scope use cases: NLLB-200 is a research model and is not released for production deployment. NLLB-200 is trained on general-domain text data and is not intended to be used with domain-specific texts, such as medical or legal texts. The model is not intended to be used for document translation. The model was trained with input lengths not exceeding 512 tokens, so translating longer sequences might result in quality degradation. NLLB-200 translations cannot be used as certified translations.

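As an illustration of single-sentence translation, here is a minimal sketch using the Hugging Face `transformers` API rather than Fairseq. The checkpoint id `facebook/nllb-200-1.3B` and the example language pair are assumptions; source and target languages are selected with the FLORES-200 codes listed in `language_details` above.

```python
# Hedged sketch: assumes this checkpoint is published as "facebook/nllb-200-1.3B"
# and that the installed transformers version supports the NLLB tokenizer.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-1.3B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Single-sentence input, truncated to the 512-token training limit noted above.
inputs = tokenizer(
    "NLLB-200 allows for single sentence translation among 200 languages.",
    return_tensors="pt", truncation=True, max_length=512,
)

# Force the decoder to start with the target language code (French here).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=512,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```
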
## Metrics
- Model performance measures: the NLLB-200 model was evaluated using the BLEU, spBLEU, and chrF++ metrics widely adopted by the machine translation community. Additionally, we performed human evaluation with the XSTS protocol and measured the toxicity of the generated translations. A scoring sketch follows below.

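For reference, the sketch below shows how such scores can be computed with the `sacrebleu` library. This is an illustration, not the exact evaluation pipeline from the paper; the hypothesis and reference strings are placeholders, and the `flores200` tokenizer name for spBLEU is an assumption about the installed `sacrebleu` version.

```python
# Hedged scoring sketch with sacrebleu; inputs are placeholder strings.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["The model translated this sentence."]
references = [["The model translated this sentence."]]  # one inner list per reference set

bleu = BLEU()                        # standard corpus BLEU
chrf_pp = CHRF(word_order=2)         # word_order=2 yields chrF++
spbleu = BLEU(tokenize="flores200")  # spBLEU: BLEU over SentencePiece pieces
                                     # (tokenizer name assumed; check your sacrebleu version)

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
print(spbleu.corpus_score(hypotheses, references))
```
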
## Evaluation Data
- Datasets: the Flores-200 dataset is described in Section 4 of the paper.
- Motivation: we used Flores-200 as it provides full evaluation coverage of the languages in NLLB-200.
- Preprocessing: sentence-split raw text data was preprocessed using SentencePiece. The SentencePiece model is released along with NLLB-200; a preprocessing sketch follows this list.

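The sketch below shows what that SentencePiece preprocessing step looks like with the `sentencepiece` library; the model filename is a placeholder for wherever the released NLLB-200 SentencePiece model is stored locally.

```python
# Hedged preprocessing sketch; the .model path below is a local placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nllb200_spm.model")  # placeholder path

sentence = "Sentence-split raw text is encoded into subword pieces."
pieces = sp.encode(sentence, out_type=str)  # subword pieces, e.g. ["▁Sentence", ...]
ids = sp.encode(sentence, out_type=int)     # corresponding vocabulary ids
print(pieces)
print(sp.decode(ids))                       # decodes back to the original sentence
```
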
## Training Data
- We used parallel multilingual data from a variety of sources to train the model. We provide a detailed report on the data selection and construction process in Section 5 of the paper. We also used monolingual data constructed from Common Crawl; more details are provided in Section 5.2.

## Ethical Considerations
- In this work, we took a reflexive approach to technological development to ensure that we prioritize human users and minimize risks that could be transferred to them. While we reflect on our ethical considerations throughout the article, here are some additional points to highlight. For one, many languages chosen for this study are low-resource languages, with a heavy emphasis on African languages. While quality translation could improve education and information access in many of these communities, such access could also make groups with lower levels of digital literacy more vulnerable to misinformation or online scams. The latter scenarios could arise if bad actors misappropriate our work for nefarious activities, which we conceive as an example of unintended use. Regarding data acquisition, the training data used for model development were mined from various publicly available sources on the web. Although we invested heavily in data cleaning, personally identifiable information may not be entirely eliminated. Finally, although we did our best to optimize for translation quality, mistranslations produced by the model could remain. Although the odds are low, this could have an adverse impact on those who rely on these translations to make important decisions (particularly when related to health and safety).

## Caveats and Recommendations
- Our model has been tested on the Wikimedia domain with limited investigation on other domains supported in NLLB-MD. In addition, the supported languages may have variations that our model is not capturing. Users should make appropriate assessments.

## Carbon Footprint Details
- The carbon dioxide (CO2e) estimate is reported in Section 8.8 of the paper.